Enrico Minack created SPARK-38591:
-------------------------------------

             Summary:  Add flatMapSortedGroups to KeyValueGroupedDataset
                 Key: SPARK-38591
                 URL: https://issues.apache.org/jira/browse/SPARK-38591
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Enrico Minack


The existing method {{KeyValueGroupedDataset.flatMapGroups}} provides an 
iterator of rows for each group key. If user code would requires those rows in 
a particular order, that iterator would have to be sorted first, which is 
against the idea of an iterator in the first place. For groups that do not fit 
into memory of one executor, this approach does not work.

[org.apache.spark.sql.KeyValueGroupedDataset|https://github.com/apache/spark/blob/47485a3c2df3201c838b939e82d5b26332e2d858/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L134-L137]:
{noformat}
Internally, the implementation will spill to disk if any given group is too 
large to fit into
memory. However, users must take care to avoid materializing the whole iterator 
for a group
(for example, by calling `toList`) unless they are sure that this is possible 
given the memory
constraints of their cluster.
{noformat}


The implementation of {{KeyValueGroupedDataset.flatMapGroups}} already sorts 
each partition according to the group key. By additionally sorting by some data 
columns, the iterator can be guaranteed to provide some order.

A new method {{KeyValueGroupedDataset.flatMapSortedGroups}} could allow to 
define order within the groups.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to