[ https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010157#comment-15010157 ]

Xiao Li edited comment on SPARK-11770 at 11/18/15 3:47 AM:
-----------------------------------------------------------

Hi, [~simeons]

I am unable to reproduce your issue. Could you try the script below and see whether it fails for you?

{code}
import org.apache.spark.SparkContext

case class Individual2(F1: Integer, F2: Integer, F3: Integer, F4: Integer)

def main(args: Array[String]) {
  val sc = new SparkContext("local", "join DFs")
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val rdd1 = sc.parallelize(Seq(Individual2(1, 2, 3, 4), Individual2(5, 6, 7, 8)))
  val df1 = hiveContext.createDataFrame(rdd1)
  df1.registerTempTable("foo")

  // Keep the SQL on one line so the string literal is not broken by line wrapping.
  val df2 = hiveContext.sql(
    "SELECT F1 + 1 AS c, SUM(F2) AS F2 FROM foo GROUP BY F1 HAVING SUM(F2) > 5")
  df2.explain(true)
  df2.show()
}
{code}
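
If it is easier to test interactively, the same check can be pasted into a 1.5.x spark-shell. This is just a sketch of the script above; it assumes the shell's {{sqlContext}} is a HiveContext, which is the default in Hive-enabled builds:

{code}
case class Individual2(F1: Integer, F2: Integer, F3: Integer, F4: Integer)

val df1 = sqlContext.createDataFrame(
  sc.parallelize(Seq(Individual2(1, 2, 3, 4), Individual2(5, 6, 7, 8))))
df1.registerTempTable("foo")

// Same aliased-aggregate HAVING query as in the script above.
val df2 = sqlContext.sql(
  "SELECT F1 + 1 AS c, SUM(F2) AS F2 FROM foo GROUP BY F1 HAVING SUM(F2) > 5")
df2.explain(true)
df2.show()
{code}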

Thanks, 

Xiao



> Spark SQL field resolution error in GROUP BY HAVING clause
> ----------------------------------------------------------
>
>                 Key: SPARK-11770
>                 URL: https://issues.apache.org/jira/browse/SPARK-11770
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Simeon Simeonov
>              Labels: SQL
>
> A query fails to resolve columns from the source data when an alias is added 
> to the SELECT clause. I have not been able to isolate a reproducible 
> standalone test. Below are a series of {{spark-shell}} operations that show 
> the problem step-by-step. Spark SQL execution happens via {{HiveContext}}.
> I believe the root cause of the problem is that when (and only when) there 
> are aliased expression columns in the SELECT clause, Spark SQL "sees" columns 
> from the SELECT clause while evaluating a HAVING clause. According to the SQL 
> standard that should not happen.
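> As a contrived illustration of that scoping rule ({{t(k, v)}} is a hypothetical table, not from my dataset):
> {code}
> // Per the SQL standard, "v" inside HAVING must resolve against the source
> // table t, not against the SELECT alias "v" defined in the same query.
> ctx.sql("SELECT k, SUM(v) AS v FROM t GROUP BY k HAVING SUM(v) > 0")
> {code}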
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim ip=unknown-ip-addr      cmd=get_table : db=default tbl=hevents_test
> root
>  |-- vertical: string (nullable = true)
>  |-- did: string (nullable = true)
>  |-- surl: string (nullable = true)
>  |-- creative_id: long (nullable = true)
>  |-- keyword_text: string (nullable = true)
>  |-- errors: integer (nullable = true)
>  |-- views: integer (nullable = true)
>  |-- clicks: integer (nullable = true)
>  |-- actions: long (nullable = true)
> {code}
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
>      |   select 1.0*creative_id, sum(views) as views
>      |   from hevents_test
>      |   group by creative_id
>      |   having sum(views) > 500
>      | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, sum(views) as views
>   from hevents_test
>   group by creative_id
>   having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim ip=unknown-ip-addr      cmd=get_table : db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer gets confused about 
> {{views}}.
> {code}
> scala> ctx.sql("""
>      | select 1.0*creative_id as cid, sum(views) as views
>      | from hevents_test
>      | group by creative_id
>      | having sum(views) > 500
>      | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim ip=unknown-ip-addr      cmd=get_table : db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L missing from vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5 in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
>       at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>       at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>       at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>       at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>       at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>       at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>       at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>       at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>       at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
>       at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
>       at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
>       at $iwC$$iwC$$iwC$$iwC.<init>(<console>:71)
>       at $iwC$$iwC$$iwC.<init>(<console>:73)
>       at $iwC$$iwC.<init>(<console>:75)
>       at $iwC.<init>(<console>:77)
>       at <init>(<console>:79)
>       at .<init>(<console>:83)
>       at .<clinit>(<console>)
>       at .<init>(<console>:7)
>       at .<clinit>(<console>)
>       at $print(<console>)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>       at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>       at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>       at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>       at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>       at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>       at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>       at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>       at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>       at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>       at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>       at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>       at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>       at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>       at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>       at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>       at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>       at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>       at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>       at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>       at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>       at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>       at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>       at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>       at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
>       at org.apache.spark.repl.Main$.main(Main.scala:31)
>       at org.apache.spark.repl.Main.main(Main.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>       at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> If the {{views}} column in the HAVING clause is explicitly disambiguated, the 
> problem goes away:
> {code}
> scala> ctx.sql("""
>      | select 1.0*creative_id as cid, sum(views) as views
>      | from hevents_test
>      | group by creative_id
>      | having sum(hevents_test.views) > 500
>      | """)
> 15/11/16 22:29:17 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(hevents_test.views) > 500
> 15/11/16 22:29:17 INFO ParseDriver: Parse Completed
> 15/11/16 22:29:17 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:29:17 INFO audit: ugi=sim ip=unknown-ip-addr      cmd=get_table : db=default tbl=hevents_test
> res23: org.apache.spark.sql.DataFrame = [cid: double, views: bigint]
> {code}
> That disambiguation should not be necessary.
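> Until this is fixed, a possible workaround that sidesteps HAVING resolution entirely is to express the predicate through the DataFrame API. This is only a sketch, not verified on every 1.5.x build:
> {code}
> import org.apache.spark.sql.functions._
> import ctx.implicits._  // ctx is the HiveContext used above
>
> val result = ctx.table("hevents_test")
>   .groupBy($"creative_id")
>   .agg(sum($"views").as("views"))                   // aggregate first...
>   .filter($"views" > 500)                           // ...then apply the HAVING predicate
>   .selectExpr("1.0 * creative_id as cid", "views")  // alias the expression last
> {code}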


