[GitHub] [spark] HeartSaVioR commented on pull request #37551: [SPARK-38591][SQL] Add sortWithinGroups to KeyValueGroupedDataset

GitBox Tue, 17 Jan 2023 22:34:28 -0800


HeartSaVioR commented on PR #37551:
URL: https://github.com/apache/spark/pull/37551#issuecomment-1386559000


   I like the idea of having a secondary sort key on top of the sort by 
grouping keys, but the direction does not seem right.
   
   As you mentioned in the PR description, calling this method only takes 
effect for a subsequent flatMapGroups or cogroup. That is not odd for those 
two, but `KeyValueGroupedDataset` has other operations, and users may expect 
the same sort order to apply to them as well. For example, `agg` and 
`reduceGroups` can be sensitive to ordering, e.g. with first() and last().
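
   To illustrate the ordering concern, here is a minimal sketch in plain Scala 
collections. It is an analogy only, not Spark code: `Event`, 
`GroupOrderSketch`, and both helper methods are hypothetical names made up for 
this example. It shows that a "first per group" result changes depending on 
whether the group was sorted by a secondary key beforehand.

```scala
// Plain Scala collections, NOT the Spark API: an analogy for why
// order within a group matters to first()/last()-style aggregations.
final case class Event(user: String, ts: Long, action: String)

object GroupOrderSketch {
  // Rows arrive in arbitrary order; "alice" has three events.
  val events: Seq[Event] = Seq(
    Event("alice", 3L, "logout"),
    Event("bob",   1L, "login"),
    Event("alice", 1L, "login"),
    Event("alice", 2L, "click")
  )

  // "First" action per user, taken in whatever order the rows arrived.
  def firstUnsorted(rows: Seq[Event]): Map[String, String] =
    rows.groupBy(_.user).map { case (user, group) => user -> group.head.action }

  // Same aggregation, but each group is sorted by the secondary key (ts) first.
  def firstSorted(rows: Seq[Event]): Map[String, String] =
    rows.groupBy(_.user).map { case (user, group) =>
      user -> group.sortBy(_.ts).head.action
    }
}
```

   With this input, `firstUnsorted` picks "logout" for alice while 
`firstSorted` picks "login", which is the kind of difference a user could hit 
if the sort applied to some grouped operations but not others.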
   
   If the intention is to address (flat)MapGroups and cogroup specifically, 
addressing these APIs directly sounds like the more straightforward way to go. 
I guess you'd want to disallow sorting for streaming, but either 1) you can 
disallow it in the logical planning phase or 2) we can document that sorting 
is applied per microbatch in a streaming query.
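
   As a rough picture of what option 2 would mean, the sketch below models 
micro-batches as plain Scala sequences. This is not Structured Streaming code: 
`Reading`, `MicroBatchSortSketch`, and `sortedPerBatch` are hypothetical names, 
and treating each inner `Seq` as one micro-batch is an assumption made for 
illustration.

```scala
// Plain Scala collections, NOT Spark Structured Streaming: each inner Seq
// stands in for one micro-batch (a hypothetical model for illustration).
final case class Reading(sensor: String, ts: Long)

object MicroBatchSortSketch {
  val batches: Seq[Seq[Reading]] = Seq(
    Seq(Reading("s1", 5L), Reading("s1", 4L)), // micro-batch 0
    Seq(Reading("s1", 2L), Reading("s1", 9L))  // micro-batch 1, includes an older row
  )

  // Group and sort inside each batch independently, the way a
  // per-micro-batch sort would see the data.
  def sortedPerBatch(bs: Seq[Seq[Reading]]): Seq[Map[String, Seq[Long]]] =
    bs.map(_.groupBy(_.sensor).map { case (k, rows) => k -> rows.map(_.ts).sorted })
}
```

   Each batch comes out sorted on its own (4, 5 and then 2, 9), but the 
timestamps are not ordered across batches, which is the behavior the 
documentation would need to spell out.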
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

