Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/22944#discussion_r232571836
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
---
@@ -1556,6 +1556,20 @@ class DatasetSuite extends QueryTest with
SharedSQLContext {
df.where($"city".contains(new java.lang.Character('A'))),
Seq(Row("Amsterdam")))
}
+
+ test("SPARK-25942: typed aggregation on primitive type") {
+ val ds = Seq(1, 2, 3).toDS()
+
+ val agg = ds.groupByKey(_ >= 2)
+ .agg(sum("value").as[Long], sum($"value" + 1).as[Long])
--- End diff ---
I think the typed `groupByKey` is somewhat different from the untyped `groupBy` API.
For the untyped `groupBy` API, the grouping attributes come from the original relation
and are visible to users before grouping, so it is reasonable to access grouping
attributes in aggregate expressions for untyped `groupBy`.
The key attributes of the typed `groupByKey` API are appended by the user in
addition to the original data attributes. The key attributes come from a typed,
user-provided function, and the key can be a primitive type or a complex type such
as a case class.
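To make the contrast concrete, here is a plain-Scala analogy using standard collections (not Spark APIs; the data and names are made up for illustration): in the untyped case the grouping key is an existing attribute of the data, while in the typed case the key is produced by a user function and does not exist as a column of the data.

```scala
object GroupingAnalogy {
  case class Row(city: String, value: Int)

  def main(args: Array[String]): Unit = {
    val data = Seq(Row("Amsterdam", 1), Row("Berlin", 2), Row("Berlin", 3))

    // Untyped-groupBy analogy: the key is an existing attribute (`city`),
    // visible in the data before grouping.
    val byCity = data.groupBy(_.city)
      .map { case (k, rows) => k -> rows.map(_.value).sum }

    // Typed-groupByKey analogy: the key (`value >= 2`) is computed by a
    // user-provided function and does not exist as an attribute of the data.
    val byPredicate = data.groupBy(_.value >= 2)
      .map { case (k, rows) => k -> rows.map(_.value).sum }

    println(byCity)
    println(byPredicate)
  }
}
```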
This makes it difficult to offer a general and reasonable way to access key
attributes in aggregate expressions for the typed `groupByKey` API. For example, for
a case-class key, the key attributes are flattened alongside the original data
attributes. Should aggregate expressions access the single typed key, or the
flattened attributes? In other words, if the key object is `(Int, String)`, how
should aggregate expressions use it?
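To illustrate the ambiguity with a plain-Scala sketch (standard collections, not actual Spark behavior; `Record` and the data are hypothetical): for a complex key like `(Int, String)`, an aggregate could in principle see either the key as one typed object or its flattened fields, and the two views lead to different expressions.

```scala
object KeyFlattening {
  case class Record(id: Int, city: String, value: Int)

  def main(args: Array[String]): Unit = {
    val data = Seq(Record(1, "Berlin", 10), Record(1, "Berlin", 20), Record(2, "Oslo", 5))

    // The user-provided key function returns a complex object: (Int, String).
    val grouped = data.groupBy(r => (r.id, r.city))

    // View 1: the aggregate sees the key as one typed object.
    val typedView = grouped.map { case (key, rows) => key -> rows.map(_.value).sum }

    // View 2: the aggregate sees the key flattened into separate attributes
    // (`id`, `city`), as they would appear alongside the data columns.
    val flattenedView = grouped.map { case ((id, city), rows) => (id, city, rows.map(_.value).sum) }

    println(typedView)
    println(flattenedView)
  }
}
```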
The implementation of `aggUntyped` also shows that key attributes should not be
accessed in aggregate expressions in the typed `groupByKey` API: in `aggUntyped`,
we call `withInputType` to bind the value encoders to the data attributes, so
typed aggregate expressions cannot access key attributes either.
---