Simeon Simeonov created SPARK-11770:
---------------------------------------

             Summary: Spark SQL field resolution error in GROUP BY HAVING clause
                 Key: SPARK-11770
                 URL: https://issues.apache.org/jira/browse/SPARK-11770
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.1
            Reporter: Simeon Simeonov


A query fails to resolve columns from the source data when an alias is added to 
the SELECT clause. I have not been able to isolate a reproducible standalone 
test. Below is a series of {{spark-shell}} operations that shows the problem 
step-by-step. Spark SQL execution happens via {{HiveContext}}.

I believe the root cause is that when (and only when) there are aliased 
expression columns in the SELECT clause, Spark SQL resolves names in the HAVING 
clause against the SELECT list. According to the SQL standard that should not 
happen: HAVING is evaluated against the grouped source relation, before 
SELECT-list aliases come into scope.
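For reference, the standard resolution rule can be observed in other engines: 
even when the alias {{views}} shadows the source column, the aggregate argument 
in HAVING still resolves to the source column. A minimal sketch using Python's 
built-in {{sqlite3}} module against a hypothetical in-memory table mirroring 
the relevant columns (SQLite is not Spark, so this only illustrates the 
expected behavior, not a fix):

```python
import sqlite3

# Hypothetical in-memory stand-in for the hevents_test table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hevents_test (creative_id INTEGER, views INTEGER)")
conn.executemany(
    "INSERT INTO hevents_test VALUES (?, ?)",
    [(1, 400), (1, 200), (2, 100)],
)

# The alias `views` shadows the source column `views`, yet sum(views) in
# HAVING still resolves to the source column, so the query succeeds.
rows = conn.execute("""
    SELECT 1.0 * creative_id AS cid, SUM(views) AS views
    FROM hevents_test
    GROUP BY creative_id
    HAVING SUM(views) > 500
""").fetchall()
print(rows)  # [(1.0, 600)]
```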

The table in question is simple:

{code}
scala> ctx.table("hevents_test").printSchema
15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
15/11/16 22:19:19 INFO audit: ugi=sim   ip=unknown-ip-addr      cmd=get_table : 
db=default tbl=hevents_test
root
 |-- vertical: string (nullable = true)
 |-- did: string (nullable = true)
 |-- surl: string (nullable = true)
 |-- creative_id: long (nullable = true)
 |-- keyword_text: string (nullable = true)
 |-- errors: integer (nullable = true)
 |-- views: integer (nullable = true)
 |-- clicks: integer (nullable = true)
 |-- actions: long (nullable = true)
{code}

A basic aggregation with a SELECT expression works without a problem:

{code}
scala> ctx.sql("""
     |   select 1.0*creative_id, sum(views) as views
     |   from hevents_test
     |   group by creative_id
     |   having sum(views) > 500
     | """)
15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, 
sum(views) as views
  from hevents_test
  group by creative_id
  having sum(views) > 500
15/11/16 22:25:53 INFO ParseDriver: Parse Completed
15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
15/11/16 22:25:53 INFO audit: ugi=sim   ip=unknown-ip-addr      cmd=get_table : 
db=default tbl=hevents_test
res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
{code}

However, if the expression is aliased, the analyzer gets confused about 
{{views}}: the {{sum(views)}} in the HAVING clause is resolved to the SELECT 
alias ({{views#72L}} below) rather than to the source column.

{code}
scala> ctx.sql("""
     | select 1.0*creative_id as cid, sum(views) as views
     | from hevents_test
     | group by creative_id
     | having sum(views) > 500
     | """)
15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id as 
cid, sum(views) as views
from hevents_test
group by creative_id
having sum(views) > 500
15/11/16 22:26:59 INFO ParseDriver: Parse Completed
15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
15/11/16 22:26:59 INFO audit: ugi=sim   ip=unknown-ip-addr      cmd=get_table : 
db=default tbl=hevents_test
org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L missing 
from 
vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5
 in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as 
bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as 
double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
        at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
        at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
        at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
        at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
        at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
        at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
        at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
        at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
        at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
        at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
        at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:71)
        at $iwC$$iwC$$iwC.<init>(<console>:73)
        at $iwC$$iwC.<init>(<console>:75)
        at $iwC.<init>(<console>:77)
        at <init>(<console>:79)
        at .<init>(<console>:83)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
        at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
        at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
        at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
        at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
        at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
        at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

If the {{views}} column in the HAVING clause is explicitly disambiguated, the 
problem goes away:

{code}
scala> ctx.sql("""
     | select 1.0*creative_id as cid, sum(views) as views
     | from hevents_test
     | group by creative_id
     | having sum(hevents_test.views) > 500
     | """)
15/11/16 22:29:17 INFO ParseDriver: Parsing command: select 1.0*creative_id as 
cid, sum(views) as views
from hevents_test
group by creative_id
having sum(hevents_test.views) > 500
15/11/16 22:29:17 INFO ParseDriver: Parse Completed
15/11/16 22:29:17 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
15/11/16 22:29:17 INFO audit: ugi=sim   ip=unknown-ip-addr      cmd=get_table : 
db=default tbl=hevents_test
res23: org.apache.spark.sql.DataFrame = [cid: double, views: bigint]
{code}

That disambiguation should not be necessary.
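The qualified spelling is at least a portable stopgap, since it is plain 
standard SQL. A minimal sketch with Python's built-in {{sqlite3}} module and a 
hypothetical in-memory table mirroring the relevant columns:

```python
import sqlite3

# Hypothetical in-memory stand-in for the hevents_test table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hevents_test (creative_id INTEGER, views INTEGER)")
conn.executemany(
    "INSERT INTO hevents_test VALUES (?, ?)",
    [(1, 400), (1, 200), (2, 100)],
)

# Qualifying the column as hevents_test.views pins resolution to the source
# relation, which is what the Spark workaround relies on.
rows = conn.execute("""
    SELECT 1.0 * creative_id AS cid, SUM(views) AS views
    FROM hevents_test
    GROUP BY creative_id
    HAVING SUM(hevents_test.views) > 500
""").fetchall()
print(rows)  # [(1.0, 600)]
```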



