[
https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-11770:
---------------------------------
Labels: SQL bulk-closed (was: SQL)
> Spark SQL field resolution error in GROUP BY HAVING clause
> ----------------------------------------------------------
>
> Key: SPARK-11770
> URL: https://issues.apache.org/jira/browse/SPARK-11770
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Simeon Simeonov
> Priority: Major
> Labels: SQL, bulk-closed
>
> A query fails to resolve columns from the source data when an alias is added
> to the SELECT clause. I have not been able to isolate a reproducible
> standalone test. Below are a series of {{spark-shell}} operations that show
> the problem step-by-step. Spark SQL execution happens via {{HiveContext}}.
> I believe the root cause is that, when (and only when) the SELECT clause
> contains aliased expression columns, Spark SQL resolves names in the HAVING
> clause against the SELECT-list aliases, so the alias {{views}} shadows the
> source column of the same name. According to the SQL standard, SELECT-list
> aliases should not be visible in the HAVING clause.
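As a point of comparison, the standard behavior can be sketched outside Spark. The snippet below uses SQLite via Python's {{sqlite3}} as a stand-in engine, with hypothetical miniature data (not the original table): the bare {{views}} inside {{SUM(...)}} in the HAVING clause resolves to the source column even though a SELECT alias with the same name exists.

```python
import sqlite3

# Hypothetical miniature of the hevents_test table: creative 1 has
# 600 total views, creative 2 has 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hevents_test (creative_id INTEGER, views INTEGER)")
conn.executemany("INSERT INTO hevents_test VALUES (?, ?)",
                 [(1, 300), (1, 300), (2, 100)])

# Same shape as the failing Spark query: the SELECT alias "views"
# collides with the source column "views" referenced in HAVING.
rows = conn.execute("""
    SELECT 1.0 * creative_id AS cid, SUM(views) AS views
    FROM hevents_test
    GROUP BY creative_id
    HAVING SUM(views) > 500
""").fetchall()

# The column name wins over the alias, so the query resolves and
# returns only the group exceeding 500 views.
print(rows)
```

If the alias took precedence here, {{SUM(views)}} would become a nested aggregate and fail, so the successful result shows the column-first resolution the report expects.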
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> root
> |-- vertical: string (nullable = true)
> |-- did: string (nullable = true)
> |-- surl: string (nullable = true)
> |-- creative_id: long (nullable = true)
> |-- keyword_text: string (nullable = true)
> |-- errors: integer (nullable = true)
> |-- views: integer (nullable = true)
> |-- clicks: integer (nullable = true)
> |-- actions: long (nullable = true)
> {code}
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
> | select 1.0*creative_id, sum(views) as views
> | from hevents_test
> | group by creative_id
> | having sum(views) > 500
> | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer gets confused about
> {{views}}.
> {code}
> scala> ctx.sql("""
> | select 1.0*creative_id as cid, sum(views) as views
> | from hevents_test
> | group by creative_id
> | having sum(views) > 500
> | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L missing from vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5 in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:71)
> at $iwC$$iwC$$iwC.<init>(<console>:73)
> at $iwC$$iwC.<init>(<console>:75)
> at $iwC.<init>(<console>:77)
> at <init>(<console>:79)
> at .<init>(<console>:83)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> If the {{views}} column in the HAVING clause is explicitly disambiguated, the
> problem goes away:
> {code}
> scala> ctx.sql("""
> | select 1.0*creative_id as cid, sum(views) as views
> | from hevents_test
> | group by creative_id
> | having sum(hevents_test.views) > 500
> | """)
> 15/11/16 22:29:17 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(hevents_test.views) > 500
> 15/11/16 22:29:17 INFO ParseDriver: Parse Completed
> 15/11/16 22:29:17 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:29:17 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> res23: org.apache.spark.sql.DataFrame = [cid: double, views: bigint]
> {code}
> That disambiguation should not be necessary.
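Besides qualifying the column, the collision can also be sidestepped by not reusing the source column name as the alias. Sketched again with SQLite standing in for a SQL engine, using a hypothetical {{total_views}} alias that does not appear in the original report:

```python
import sqlite3

# Same hypothetical miniature table as before.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hevents_test (creative_id INTEGER, views INTEGER)")
conn.executemany("INSERT INTO hevents_test VALUES (?, ?)",
                 [(1, 300), (1, 300), (2, 100)])

# Renaming the aggregate alias so it no longer shadows the "views"
# column leaves nothing ambiguous for the HAVING clause to resolve.
rows = conn.execute("""
    SELECT 1.0 * creative_id AS cid, SUM(views) AS total_views
    FROM hevents_test
    GROUP BY creative_id
    HAVING SUM(views) > 500
""").fetchall()

print(rows)
```

This is only a workaround, of course; per the report, the aliased form should resolve without any renaming or qualification.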
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)