[
https://issues.apache.org/jira/browse/SPARK-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643984#comment-14643984
]
James Aley commented on SPARK-9338:
-----------------------------------
It seems to work in Postgres and Redshift for me. I also just opened up a
sqlite3 shell for comparison and the following seems to work:
{code}
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create table foo(bar int, baz int);
sqlite> insert into foo values (1, 1), (1, 2), (1, 3), (2, 1), (2, 4);
sqlite> select * from foo;
1|1
1|2
1|3
2|1
2|4
sqlite> select bar as b, sum(baz) from foo group by b order by b;
1|6
2|5
sqlite>
{code}
I think that's the same idea, right?
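For reference, here's a rough Spark equivalent of that sqlite session (run in a
1.4.x spark-shell, where sc and sqlContext are provided; the names are just
illustrative). It fails where SQLite succeeds:
{code}
// Same data and query as the sqlite session above.
case class FooRow(bar: Int, baz: Int)
val foo = sqlContext.createDataFrame(Seq(
  FooRow(1, 1), FooRow(1, 2), FooRow(1, 3), FooRow(2, 1), FooRow(2, 4)))
foo.registerTempTable("foo")

// SQLite accepts the aliased GROUP BY; Spark SQL 1.4 throws
// org.apache.spark.sql.AnalysisException: cannot resolve 'b' ...
sqlContext.sql("SELECT bar AS b, SUM(baz) FROM foo GROUP BY b ORDER BY b").show()
{code}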
> Aliases from SELECT not available in GROUP BY
> ---------------------------------------------
>
> Key: SPARK-9338
> URL: https://issues.apache.org/jira/browse/SPARK-9338
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1
> Environment: Observed on Mac OS X and Ubuntu 14.04
> Reporter: James Aley
>
> It feels like this should really be a known issue, but I've not been able to
> find any mailing list threads or JIRA tickets for it. There are a few
> closed/resolved tickets about specific exceptions, but I couldn't find this
> exact problem, so apologies if this is a dupe!
> Spark SQL doesn't appear to support referencing aliases from a SELECT in the
> GROUP BY part of the query. This is confusing our analysts, as it works in
> most other tools they use. Here's an example to reproduce:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> val schema =
>   StructType(
>     StructField("x", IntegerType, nullable=false) ::
>     StructField("y",
>       StructType(StructField("a", DoubleType, nullable=false) :: Nil),
>       nullable=false) :: Nil)
>
> val rdd = sc.parallelize(
>   Row(1, Row(1.0)) :: Row(2, Row(1.34)) ::
>   Row(3, Row(2.3)) :: Row(4, Row(2.5)) :: Nil)
>
> val df = sqlContext.createDataFrame(rdd, schema)
>
> // DataFrame content looks like this:
> // x  y
> // 1  {a: 1.0}
> // 2  {a: 1.34}
> // 3  {a: 2.3}
> // 4  {a: 2.5}
>
> df.registerTempTable("test_data")
> sqlContext.udf.register("roundToInt", (x: Double) => x.toInt)
>
> sqlContext.sql(
>   "SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException:
> //    cannot resolve 'grp' given input columns x, y
>
> sqlContext.sql(
>   "SELECT y.a as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException:
> //    cannot resolve 'grp' given input columns x, y
>
> sqlContext.sql(
>   "SELECT roundToInt(y.a) as grp, SUM(y.a) as s FROM test_data GROUP BY roundToInt(y.a)").show()
> // =>
> // +---+----+
> // |grp|   s|
> // +---+----+
> // |  1|2.34|
> // |  2| 4.8|
> // +---+----+
> {code}
> As you can see, it's particularly inconvenient when using UDFs on nested
> fields, as it means repeating some potentially complex expressions. It's very
> common for us to want to make a date type conversion (from epoch milliseconds
> or something) on some nested field, then reference it in multiple places in
> the query. This limitation makes for quite verbose queries.
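> A subquery works around it, since the alias is then a real column of the
> inner relation, but it's clumsy (sketch below, only tried on our 1.4.1
> setup):
> {code}
> // Workaround sketch: compute the alias in an inner query, then group on it.
> sqlContext.sql("""
>   SELECT grp, SUM(x) as s
>   FROM (SELECT roundToInt(y.a) as grp, x FROM test_data) tmp
>   GROUP BY grp""").show()
> {code}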
> Might it also mean that we're mapping these functions over the data twice? I
> can't quite tell from the explain output whether that's been optimised out or
> not, but here it is for somebody who understands, along with a crude way to
> check empirically (sketched after the plan) :-)
> {code}
> sqlContext.sql(
>   "SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY roundToInt(y.a)").explain()
> // == Physical Plan ==
> // Aggregate false, [PartialGroup#126], [PartialGroup#126 AS grp#116,CombineSum(PartialSum#125L) AS s#117L]
> //  Exchange (HashPartitioning 200)
> //   Aggregate true, [scalaUDF(y#7.a)], [scalaUDF(y#7.a) AS PartialGroup#126,SUM(CAST(x#6, LongType)) AS PartialSum#125L]
> //    PhysicalRDD [x#6,y#7], MapPartitionsRDD[10] at createDataFrame at <console>:31
> {code}
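> One crude way to check whether the UDF runs twice per row is to count
> invocations with an accumulator (sketch below; countedRound is just a
> throwaway name, and accumulator counts can overshoot on task retries):
> {code}
> // 4 input rows: expect 4 calls if the grouping expression is shared with
> // the SELECT list, 8 if it's evaluated twice per row.
> val udfCalls = sc.accumulator(0L)
> sqlContext.udf.register("countedRound", (x: Double) => { udfCalls += 1L; x.toInt })
> sqlContext.sql("SELECT countedRound(y.a) as grp, SUM(x) as s FROM test_data GROUP BY countedRound(y.a)").show()
> println(s"UDF invocations: ${udfCalls.value}")
> {code}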