HeartSaVioR commented on PR #37551: URL: https://github.com/apache/spark/pull/37551#issuecomment-1386559000
I like the idea of having a secondary sort key while we are sorting by the grouping keys, but the direction does not seem right to me.

As you mentioned in the PR description, calling this method only takes effect with a subsequent `flatMapGroups` or `cogroup`. That doesn't seem odd for those two, but there are other operations on `KeyValueGroupedDataset`, and users may expect the "same" sort order to apply to them as well. For example, `agg` and `reduceGroups` can be sensitive to ordering, e.g. with `first()` and `last()`.

If the intention is to address (flat)MapGroups and cogroup specifically, addressing these APIs directly sounds to me like the more straightforward way to go. I guess you'd want to disallow sorting for streaming, but either 1) you can disallow it in the logical planning phase, or 2) we can document that sorting is applied per microbatch in a streaming query.
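To make the ordering concern concrete, here is a minimal plain-Python model of it. This is not Spark code: `group_values` and `first_per_group` are hypothetical helpers that only mimic "group by key, then take the first value", and the sample rows are made up for illustration.

```python
from collections import defaultdict


def group_values(rows, sort_key=None):
    """Group (key, value) rows by key, keeping arrival order within each group.

    If sort_key is given, each group's values are sorted with it, which
    models a secondary sort applied inside every group.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    if sort_key is not None:
        for values in groups.values():
            values.sort(key=sort_key)
    return dict(groups)


def first_per_group(groups):
    """Order-sensitive aggregate: the first value seen in each group."""
    return {key: values[0] for key, values in groups.items()}


# The same rows arriving in two different orders (e.g. after a shuffle).
rows_a = [("a", 3), ("a", 1), ("b", 2), ("a", 2)]
rows_b = [("a", 1), ("b", 2), ("a", 2), ("a", 3)]

# Without a secondary sort, the result depends on arrival order.
unsorted_a = first_per_group(group_values(rows_a))
unsorted_b = first_per_group(group_values(rows_b))

# With a secondary sort inside each group, the result is stable.
sorted_a = first_per_group(group_values(rows_a, sort_key=lambda v: v))
sorted_b = first_per_group(group_values(rows_b, sort_key=lambda v: v))
```

In this model the unsorted results differ between the two arrival orders while the sorted ones agree, which is why a user who asked for sorted groups might reasonably expect order-sensitive aggregates to honor that order too.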
