[GitHub] [spark] EnricoMi opened a new pull request, #38296: [SPARK-40830][SQL] Provide groupByKey shortcuts for groupBy.as

GitBox Tue, 18 Oct 2022 01:10:33 -0700


EnricoMi opened a new pull request, #38296:
URL: https://github.com/apache/spark/pull/38296


   ### What changes were proposed in this pull request?
   This documents that `groupBy(...).as[...]` should be preferred over 
`groupByKey(...)`.
   
   It also adds column-based `groupByKey` overloads as shortcuts for 
`groupBy(...).as[...]`.
   
   ### Why are the changes needed?
   
   Calling `Dataset.groupBy(...).as[K, T]` should be preferred over calling 
`Dataset.groupByKey(...)` whenever possible. The former allows Catalyst to 
exploit existing partitioning and ordering of the Dataset, while the latter 
hides from Catalyst which columns are used to create the keys.
   
   _When the dataset is already partitioned and ordered by the grouping 
columns, `Dataset.groupByKey(...)` will repartition and order the entire 
dataset again._
   
   Example:
   
   Calling `ds.groupByKey(_.id)` hides from Catalyst that column `id` is the 
grouping key, while `ds.groupBy($"id").as[Int, V]` tells Catalyst that `ds` is 
to be grouped by (partitioned and ordered by) column `id`.
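   
   As a minimal sketch of the two call styles, using only the existing API: the `Event` case class, the sample rows, and the local `SparkSession` settings are assumptions made for illustration. Both calls yield a `KeyValueGroupedDataset[Int, Event]` with the same groups; they differ only in what Catalyst can see about the key.

```scala
import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Event(id: Int, value: String)

object GroupingSketch {
  def main(args: Array[String]): Unit = {
    // Local session settings are assumptions for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouping-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Event] =
      Seq(Event(1, "a"), Event(1, "b"), Event(2, "c")).toDS()

    // The key comes from an opaque Scala function, so Catalyst
    // cannot tell that it is simply column `id`.
    val byFunction: KeyValueGroupedDataset[Int, Event] =
      ds.groupByKey(_.id)

    // The key comes from a column expression that Catalyst can inspect.
    val byColumn: KeyValueGroupedDataset[Int, Event] =
      ds.groupBy($"id").as[Int, Event]

    // Count the rows per key with each variant and compare the results.
    val countsByFunction =
      byFunction.mapGroups((id, events) => (id, events.size)).collect().toMap
    val countsByColumn =
      byColumn.mapGroups((id, events) => (id, events.size)).collect().toMap
    assert(countsByFunction == countsByColumn)

    spark.stop()
  }
}
```

   Comparing the physical plans of the two variants (for example with `explain()` on the mapped datasets) is one way to check whether an extra shuffle or sort appears for data that is already partitioned and ordered by `id`.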
   
   The new column-based `groupByKey` methods make grouping by column 
expressions easier to discover. A user browsing the `Dataset` API finds 
`groupByKey` accepting `Column` arguments directly. The existing route offers 
the same semantics but is hard to find: `groupBy` returns a 
`RelationalGroupedDataset`, which in turn provides the `as[K, V]` method.
   
   The new column-based `groupByKey` methods further do not require the user to 
specify the type `V` of the original `Dataset[V]`, as `groupByKey` has access 
to the type / encoder:
   
       ds.groupBy($"id").as[Int, V]
   
   vs.
   
       ds.groupByKey[Int]($"id")
   
   ### Does this PR introduce _any_ user-facing change?
   
   **Note: This breaks compilation when a function name is passed to the 
existing `groupByKey`:**
   
   Was:
   
       groupByKey(identity)
   
   Now:
   
       groupByKey(identity(_))
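   
   The overload situation can be mimicked in plain Scala without Spark. In the sketch below, `Col` and `MiniDataset` are hypothetical stand-ins, not Spark classes, and the column-based overload is simplified to return column names. The placeholder form `identity(_)` is an explicit function literal, so only the function-based overload matches its shape. The bare `identity` form is left out of the code because, according to the note above, it no longer compiles against `Dataset` once the column-based overload exists.

```scala
// Hypothetical stand-ins for Spark's Column and Dataset; not the real classes.
final case class Col(name: String)

final class MiniDataset[T](rows: Seq[T]) {
  // Function-based overload, analogous to the existing groupByKey.
  def groupByKey[K](func: T => K): Map[K, Seq[T]] = rows.groupBy(func)

  // Column-based overload, loosely analogous to the one this PR adds.
  // Simplified for illustration: it only returns the column names.
  def groupByKey(cols: Col*): Seq[String] = cols.map(_.name)
}

object OverloadSketch {
  val ds = new MiniDataset(Seq(1, 2, 2))

  // `identity(_)` is a function literal, so the function-based
  // overload is selected and K is inferred as Int.
  val grouped: Map[Int, Seq[Int]] = ds.groupByKey(identity(_))

  // A Col argument can only match the column-based overload.
  val names: Seq[String] = ds.groupByKey(Col("id"))
}
```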
   
   It adds these methods:
   
   Scala:
   - `Dataset.groupByKey[K](Column*)`
   - `Dataset.groupByKey[K](String, String*)`
   
   Java:
   - `Dataset.groupByKey[K](Encoder[K], Column...)`
   - `Dataset.groupByKey[K](Encoder[K], String, String...)`
   
   
   ### How was this patch tested?
   Adds tests to `DatasetSuite` and `JavaDatasetSuite`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

