EnricoMi opened a new pull request, #39640:
URL: https://github.com/apache/spark/pull/39640

   ### What changes were proposed in this pull request?
   This adds sorted versions of `Dataset.groupByKey(…).flatMapGroups(…)` and 
`Dataset.groupByKey(…).cogroup(…)`.
   
   ### Why are the changes needed?
   The existing methods `KeyValueGroupedDataset.flatMapGroups` and 
`KeyValueGroupedDataset.cogroup` provide iterators of rows for each group key.
   
   Sorting entire groups inside `flatMapGroups` / `cogroup` requires 
materialising all rows, which defeats the purpose of an iterator. The methods 
`flatMapGroups` and `cogroup` have the great advantage that they work with 
groups that are _too large to fit into the memory of one executor_. Sorting 
the rows in the user function breaks this property.
   
   
[org.apache.spark.sql.KeyValueGroupedDataset](https://github.com/apache/spark/blob/47485a3c2df3201c838b939e82d5b26332e2d858/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L134-L137):
   > Internally, the implementation will spill to disk if any given group is too large to fit into
   > memory.  However, users must take care to avoid materializing the whole iterator for a group
   > (for example, by calling `toList`) unless they are sure that this is possible given the memory
   > constraints of their cluster.
   
   The implementations of `KeyValueGroupedDataset.flatMapGroups` and 
`KeyValueGroupedDataset.cogroup` already sort each partition by the group key. 
By additionally sorting by some data columns, the iterator of each group is 
guaranteed to yield its rows in that order, without materialising the group.
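
   The idea in the paragraph above can be sketched in plain Scala, without 
Spark. The names `Event`, `groupsOf` and `orderedGroups` are hypothetical and 
exist only for this illustration, and the in-memory `sortBy` stands in for 
the partition sort that Spark already performs. The sketch is not the code of 
this PR. It only shows that once rows are sorted by the group key and then by 
a data column, each group can be handed out as a lazy, ordered iterator.

```scala
// Hypothetical row type for this illustration only.
final case class Event(user: String, seq: Int)

object SortedGroupsSketch {
  // Splits an iterator that is already sorted by key into (key, group iterator) pairs.
  // Groups are served lazily from the underlying iterator; nothing is buffered per group.
  def groupsOf[K, V](sorted: Iterator[V])(key: V => K): Iterator[(K, Iterator[V])] = {
    val rows = sorted.buffered
    new Iterator[(K, Iterator[V])] {
      private var current: Iterator[V] = Iterator.empty

      def hasNext: Boolean = {
        current.foreach(_ => ()) // skip rows of the previous group the caller left unread
        rows.hasNext
      }

      def next(): (K, Iterator[V]) = {
        if (!hasNext) throw new NoSuchElementException("no more groups")
        val k = key(rows.head)
        current = new Iterator[V] {
          def hasNext: Boolean = rows.hasNext && key(rows.head) == k
          def next(): V =
            if (hasNext) rows.next() else throw new NoSuchElementException("end of group")
        }
        (k, current)
      }
    }
  }

  // Sorts by (group key, data column), then reads every group through its iterator.
  def orderedGroups(partition: Seq[Event]): List[(String, List[Int])] = {
    // Stand-in for the partition sort; sorting by the key alone would only
    // make groups contiguous, the second component orders rows within a group.
    val sorted = partition.sortBy(e => (e.user, e.seq)).iterator
    groupsOf(sorted)(_.user).map { case (user, events) => user -> events.map(_.seq).toList }.toList
  }

  def main(args: Array[String]): Unit = {
    val partition = Seq(Event("b", 2), Event("a", 3), Event("b", 1), Event("a", 1), Event("a", 2))
    println(orderedGroups(partition))
  }
}
```

   The user function only ever sees one row at a time, so the order comes 
from the sort that precedes grouping and not from anything done per group.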
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. This adds `KeyValueGroupedDataset.flatMapSortedGroups` and 
`KeyValueGroupedDataset.cogroupSorted`, which guarantee the order of rows in 
the group iterators.
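
   A minimal usage sketch follows, contrasting the existing workaround with 
the new method. The description above does not spell out the signature, so 
the call shape `flatMapSortedGroups(sortExprs: Column*)(f)` is an assumption, 
as are the `Reading` type and the sample data. `cogroupSorted` is not shown.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for this illustration only.
final case class Reading(sensor: String, ts: Long, value: Double)

object SortedGroupsUsage {
  // Returns the timestamps seen per sensor, in the order the group iterator produced them.
  def timestampsPerSensor(spark: SparkSession, sortInUserFunction: Boolean): Map[String, Seq[Long]] = {
    import spark.implicits._

    val readings = Seq(
      Reading("s1", 3L, 0.3), Reading("s2", 1L, 1.0),
      Reading("s1", 1L, 0.1), Reading("s1", 2L, 0.2)
    ).toDS()

    val grouped = readings.groupByKey(_.sensor)

    val result =
      if (sortInUserFunction) {
        // Existing API: ordering requires materialising the whole group (toSeq).
        grouped.flatMapGroups { (sensor, rows) =>
          rows.toSeq.sortBy(_.ts).map(r => (sensor, r.ts))
        }
      } else {
        // Assumed shape of the new API: sort expressions first, then the user function.
        // The iterator is consumed lazily and is expected to arrive ordered by `ts`.
        grouped.flatMapSortedGroups($"ts") { (sensor, rows) =>
          rows.map(r => (sensor, r.ts))
        }
      }

    // Rows of one group are emitted contiguously, so grouping the collected
    // output keeps the per-group order intact.
    result.collect().toSeq.groupBy(_._1).map { case (sensor, rows) => sensor -> rows.map(_._2) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sorted-groups-sketch").getOrCreate()
    try {
      println(timestampsPerSensor(spark, sortInUserFunction = true))
      println(timestampsPerSensor(spark, sortInUserFunction = false))
    } finally spark.stop()
  }
}
```

   Both variants should return the same per-sensor order. The difference is 
that only the first one holds an entire group in memory at once.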
   
   ### How was this patch tested?
   Tests have been added to `DatasetSuite`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

