Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/22944#discussion_r232571836
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
---
@@ -1556,6 +1556,20 @@ class DatasetSuite extends QueryTest with
SharedSQLContext {
df.where($"city".contains(new java.lang.Character('A'))),
Seq(Row("Amsterdam")))
}
+
+ test("SPARK-25942: typed aggregation on primitive type") {
+ val ds = Seq(1, 2, 3).toDS()
+
+ val agg = ds.groupByKey(_ >= 2)
+ .agg(sum("value").as[Long], sum($"value" + 1).as[Long])
--- End diff ---
I think the typed `groupByKey` is somewhat different from the untyped `groupBy` API.
For the untyped `groupBy` API, the grouping attributes come from the original relation
and are visible to users before grouping, so it is reasonable to access grouping
attributes in aggregate expressions for untyped `groupBy`.
The key attributes of the typed `groupByKey` API are appended by the user in
addition to the original data attributes. The key attributes come from a typed,
user-provided function, and the key can be a primitive type or a complex type such
as a case class.
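To make the contrast concrete, here is a plain-Scala analogy using standard collections (not Spark APIs; the data and names are made up for illustration): in the untyped case the grouping key is an existing attribute of the data, while in the typed case the key is produced by a user function and does not exist as a column of the data.

```scala
object GroupingAnalogy {
  case class Row(city: String, value: Int)

  def main(args: Array[String]): Unit = {
    val data = Seq(Row("Amsterdam", 1), Row("Berlin", 2), Row("Berlin", 3))

    // Untyped-groupBy analogy: the key is an existing attribute (`city`),
    // visible in the data before grouping.
    val byCity = data.groupBy(_.city)
      .map { case (k, rows) => k -> rows.map(_.value).sum }

    // Typed-groupByKey analogy: the key (`value >= 2`) is computed by a
    // user-provided function and does not exist as an attribute of the data.
    val byPredicate = data.groupBy(_.value >= 2)
      .map { case (k, rows) => k -> rows.map(_.value).sum }

    println(byCity)
    println(byPredicate)
  }
}
```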
This makes it difficult to offer a general and reasonable way to access key
attributes in aggregate expressions for the typed `groupByKey` API. For example, for
a case-class key, the key attributes are flattened alongside the original data
attributes. Should aggregate expressions access the single typed key, or the
flattened attributes? In other words, if the key object is `(Int, String)`, how
should aggregate expressions use it?
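To illustrate the ambiguity with a plain-Scala sketch (standard collections, not actual Spark behavior; `Record` and the data are hypothetical): for a complex key like `(Int, String)`, an aggregate could in principle see either the key as one typed object or its flattened fields, and the two views lead to different expressions.

```scala
object KeyFlattening {
  case class Record(id: Int, city: String, value: Int)

  def main(args: Array[String]): Unit = {
    val data = Seq(Record(1, "Berlin", 10), Record(1, "Berlin", 20), Record(2, "Oslo", 5))

    // The user-provided key function returns a complex object: (Int, String).
    val grouped = data.groupBy(r => (r.id, r.city))

    // View 1: the aggregate sees the key as one typed object.
    val typedView = grouped.map { case (key, rows) => key -> rows.map(_.value).sum }

    // View 2: the aggregate sees the key flattened into separate attributes
    // (`id`, `city`), as they would appear alongside the data columns.
    val flattenedView = grouped.map { case ((id, city), rows) => (id, city, rows.map(_.value).sum) }

    println(typedView)
    println(flattenedView)
  }
}
```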
The implementation of `aggUntyped` also shows that key attributes should not be
accessed in aggregate expressions in the typed `groupByKey` API: in `aggUntyped`,
we call `withInputType` to bind the value encoders to the data attributes, so
typed aggregate expressions cannot access key attributes either.
---