[
https://issues.apache.org/jira/browse/SPARK-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643984#comment-14643984
]
James Aley commented on SPARK-9338:
-----------------------------------
It seems to work in Postgres and Redshift for me. I also just opened up a
sqlite3 shell for comparison and the following seems to work:
{code}
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create table foo(bar int, baz int);
sqlite> insert into foo values (1, 1), (1, 2), (1, 3), (2, 1), (2, 4);
sqlite> select * from foo;
1|1
1|2
1|3
2|1
2|4
sqlite> select bar as b, sum(baz) from foo group by b order by b;
1|6
2|5
sqlite>
{code}
I think that's the same idea, right?
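For reference, here's a rough Spark equivalent of that sqlite session (run in a
1.4.x spark-shell, where sc and sqlContext are provided; the names are just
illustrative). It fails where SQLite succeeds:
{code}
// Same data and query as the sqlite session above.
case class FooRow(bar: Int, baz: Int)
val foo = sqlContext.createDataFrame(Seq(
  FooRow(1, 1), FooRow(1, 2), FooRow(1, 3), FooRow(2, 1), FooRow(2, 4)))
foo.registerTempTable("foo")

// SQLite accepts the aliased GROUP BY; Spark SQL 1.4 throws
// org.apache.spark.sql.AnalysisException: cannot resolve 'b' ...
sqlContext.sql("SELECT bar AS b, SUM(baz) FROM foo GROUP BY b ORDER BY b").show()
{code}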
> Aliases from SELECT not available in GROUP BY
> ---------------------------------------------
>
> Key: SPARK-9338
> URL: https://issues.apache.org/jira/browse/SPARK-9338
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1
> Environment: Observed on Mac OS X and Ubuntu 14.04
> Reporter: James Aley
>
> It feels like this should really be a known issue, but I've not been able to
> find any mailing list threads or JIRA tickets for it. There are a few
> closed/resolved tickets about specific exceptions, but I couldn't find this
> exact problem, so apologies if this is a dupe!
> Spark SQL doesn't appear to support referencing aliases from a SELECT in the
> GROUP BY part of the query. This is confusing our analysts, as it works in
> most other tools they use. Here's an example to reproduce:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> val schema =
>   StructType(
>     StructField("x", IntegerType, nullable=false) ::
>     StructField("y",
>       StructType(StructField("a", DoubleType, nullable=false) :: Nil),
>       nullable=false) :: Nil)
>
> val rdd = sc.parallelize(
>   Row(1, Row(1.0)) :: Row(2, Row(1.34)) ::
>   Row(3, Row(2.3)) :: Row(4, Row(2.5)) :: Nil)
>
> val df = sqlContext.createDataFrame(rdd, schema)
>
> // DataFrame content looks like this:
> // x  y
> // 1  {a: 1.0}
> // 2  {a: 1.34}
> // 3  {a: 2.3}
> // 4  {a: 2.5}
>
> df.registerTempTable("test_data")
> sqlContext.udf.register("roundToInt", (x: Double) => x.toInt)
>
> sqlContext.sql(
>   "SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException:
> //    cannot resolve 'grp' given input columns x, y
>
> sqlContext.sql(
>   "SELECT y.a as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException:
> //    cannot resolve 'grp' given input columns x, y
>
> sqlContext.sql(
>   "SELECT roundToInt(y.a) as grp, SUM(y.a) as s FROM test_data GROUP BY roundToInt(y.a)").show()
> // =>
> // +---+----+
> // |grp|   s|
> // +---+----+
> // |  1|2.34|
> // |  2| 4.8|
> // +---+----+
> {code}
> As you can see, it's particularly inconvenient when using UDFs on nested
> fields, as it means repeating some potentially complex expressions. It's very
> common for us to want to make a date type conversion (from epoch milliseconds
> or something) on some nested field, then reference it in multiple places in
> the query. This limitation makes for quite verbose queries.
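> A subquery works around it, since the alias is then a real column of the
> inner relation, but it's clumsy (sketch below, only tried on our 1.4.1
> setup):
> {code}
> // Workaround sketch: compute the alias in an inner query, then group on it.
> sqlContext.sql("""
>   SELECT grp, SUM(x) as s
>   FROM (SELECT roundToInt(y.a) as grp, x FROM test_data) tmp
>   GROUP BY grp""").show()
> {code}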
> Might it also mean that we're mapping these functions over the data twice? I
> can't quite tell from the explain output whether that's been optimised out or
> not, but here it is for somebody who understands, along with a crude way to
> check empirically (sketched after the plan) :-)
> {code}
> sqlContext.sql(
>   "SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY roundToInt(y.a)").explain()
> // == Physical Plan ==
> // Aggregate false, [PartialGroup#126], [PartialGroup#126 AS grp#116,CombineSum(PartialSum#125L) AS s#117L]
> //  Exchange (HashPartitioning 200)
> //   Aggregate true, [scalaUDF(y#7.a)], [scalaUDF(y#7.a) AS PartialGroup#126,SUM(CAST(x#6, LongType)) AS PartialSum#125L]
> //    PhysicalRDD [x#6,y#7], MapPartitionsRDD[10] at createDataFrame at <console>:31
> {code}
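> One crude way to check whether the UDF runs twice per row is to count
> invocations with an accumulator (sketch below; countedRound is just a
> throwaway name, and accumulator counts can overshoot on task retries):
> {code}
> // 4 input rows: expect 4 calls if the grouping expression is shared with
> // the SELECT list, 8 if it's evaluated twice per row.
> val udfCalls = sc.accumulator(0L)
> sqlContext.udf.register("countedRound", (x: Double) => { udfCalls += 1L; x.toInt })
> sqlContext.sql("SELECT countedRound(y.a) as grp, SUM(x) as s FROM test_data GROUP BY countedRound(y.a)").show()
> println(s"UDF invocations: ${udfCalls.value}")
> {code}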