[
https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-11770:
---------------------------------
Labels: SQL bulk-closed (was: SQL)
> Spark SQL field resolution error in GROUP BY HAVING clause
> ----------------------------------------------------------
>
> Key: SPARK-11770
> URL: https://issues.apache.org/jira/browse/SPARK-11770
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Simeon Simeonov
> Priority: Major
> Labels: SQL, bulk-closed
>
> A query fails to resolve columns from the source data when an alias is added
> to the SELECT clause. I have not been able to isolate a reproducible
> standalone test. Below are a series of {{spark-shell}} operations that show
> the problem step-by-step. Spark SQL execution happens via {{HiveContext}}.
> I believe the root cause is that, when (and only when) the SELECT clause
> contains aliased expression columns, Spark SQL resolves names in the HAVING
> clause against the SELECT-list aliases, so the alias {{views}} shadows the
> source column of the same name. According to the SQL standard, SELECT-list
> aliases should not be visible in the HAVING clause.
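As a point of comparison, the standard behavior can be sketched outside Spark. The snippet below uses SQLite via Python's {{sqlite3}} as a stand-in engine, with hypothetical miniature data (not the original table): the bare {{views}} inside {{SUM(...)}} in the HAVING clause resolves to the source column even though a SELECT alias with the same name exists.

```python
import sqlite3

# Hypothetical miniature of the hevents_test table: creative 1 has
# 600 total views, creative 2 has 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hevents_test (creative_id INTEGER, views INTEGER)")
conn.executemany("INSERT INTO hevents_test VALUES (?, ?)",
                 [(1, 300), (1, 300), (2, 100)])

# Same shape as the failing Spark query: the SELECT alias "views"
# collides with the source column "views" referenced in HAVING.
rows = conn.execute("""
    SELECT 1.0 * creative_id AS cid, SUM(views) AS views
    FROM hevents_test
    GROUP BY creative_id
    HAVING SUM(views) > 500
""").fetchall()

# The column name wins over the alias, so the query resolves and
# returns only the group exceeding 500 views.
print(rows)
```

If the alias took precedence here, {{SUM(views)}} would become a nested aggregate and fail, so the successful result shows the column-first resolution the report expects.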
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> root
> |-- vertical: string (nullable = true)
> |-- did: string (nullable = true)
> |-- surl: string (nullable = true)
> |-- creative_id: long (nullable = true)
> |-- keyword_text: string (nullable = true)
> |-- errors: integer (nullable = true)
> |-- views: integer (nullable = true)
> |-- clicks: integer (nullable = true)
> |-- actions: long (nullable = true)
> {code}
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
> | select 1.0*creative_id, sum(views) as views
> | from hevents_test
> | group by creative_id
> | having sum(views) > 500
> | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer gets confused about
> {{views}}.
> {code}
> scala> ctx.sql("""
> | select 1.0*creative_id as cid, sum(views) as views
> | from hevents_test
> | group by creative_id
> | having sum(views) > 500
> | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L missing from vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5 in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:71)
> at $iwC$$iwC$$iwC.<init>(<console>:73)
> at $iwC$$iwC.<init>(<console>:75)
> at $iwC.<init>(<console>:77)
> at <init>(<console>:79)
> at .<init>(<console>:83)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> If the {{views}} column in the HAVING clause is explicitly disambiguated, the
> problem goes away:
> {code}
> scala> ctx.sql("""
> | select 1.0*creative_id as cid, sum(views) as views
> | from hevents_test
> | group by creative_id
> | having sum(hevents_test.views) > 500
> | """)
> 15/11/16 22:29:17 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(hevents_test.views) > 500
> 15/11/16 22:29:17 INFO ParseDriver: Parse Completed
> 15/11/16 22:29:17 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:29:17 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> res23: org.apache.spark.sql.DataFrame = [cid: double, views: bigint]
> {code}
> That disambiguation should not be necessary.
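Besides qualifying the column, the collision can also be sidestepped by not reusing the source column name as the alias. Sketched again with SQLite standing in for a SQL engine, using a hypothetical {{total_views}} alias that does not appear in the original report:

```python
import sqlite3

# Same hypothetical miniature table as before.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hevents_test (creative_id INTEGER, views INTEGER)")
conn.executemany("INSERT INTO hevents_test VALUES (?, ?)",
                 [(1, 300), (1, 300), (2, 100)])

# Renaming the aggregate alias so it no longer shadows the "views"
# column leaves nothing ambiguous for the HAVING clause to resolve.
rows = conn.execute("""
    SELECT 1.0 * creative_id AS cid, SUM(views) AS total_views
    FROM hevents_test
    GROUP BY creative_id
    HAVING SUM(views) > 500
""").fetchall()

print(rows)
```

This is only a workaround, of course; per the report, the aliased form should resolve without any renaming or qualification.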
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)