karuppayya opened a new pull request #28804: URL: https://github.com/apache/spark/pull/28804
### What changes were proposed in this pull request?

In HashAggregation, a partial aggregation (update) is followed by a final aggregation (merge). During partial aggregation we sort and spill to disk every time the fast map (when enabled) and `UnsafeFixedWidthAggregationMap` are exhausted.

**When the cardinality of the grouping column is close to the total number of records being processed, sorting the data and spilling it to disk is not required: it is effectively a no-op, and the rows can be used directly in the final aggregation.**

A user who knows the nature of the data currently has no way to disable this sort and spill operation.

This is similar to the following issues in Hive:
https://issues.apache.org/jira/browse/HIVE-223
https://issues.apache.org/jira/browse/HIVE-291

This PR adds the ability to disable sort/spill during partial aggregation.

### Why are the changes needed?

Skipping the sort and spill can improve the performance of queries whose grouping columns have a cardinality close to the number of input records.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This patch was tested manually.
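
To illustrate the trade-off described in this PR, here is a minimal Python sketch. It is not Spark code and not the implementation in this patch: `partial_aggregate`, `final_aggregate`, `map_capacity`, and `skip_when_full` are hypothetical names, the in-memory `dict` stands in for the aggregation hash map, and the "spill" is simulated by sorting the buffered entries and handing them to the final step instead of writing them to disk and merge-sorting them. It only shows that, when almost every key is distinct, passing rows through unaggregated gives the same final result without any sort or spill.

```python
from collections import defaultdict


def partial_aggregate(rows, map_capacity, skip_when_full):
    """Simulated partial (update) step for SUM(value) GROUP BY key.

    Returns (partials, spills): the (key, partial_sum) pairs handed to the
    final aggregation, and how many simulated sort-and-spill events occurred.
    """
    buffer = {}          # stands in for the in-memory aggregation map
    partials = []        # output of the partial step
    spills = 0
    passthrough = False  # once True, rows bypass the map entirely

    for key, value in rows:
        if passthrough:
            # Emit the row as a one-row partial aggregate.
            partials.append((key, value))
        elif key in buffer:
            buffer[key] += value
        elif len(buffer) < map_capacity:
            buffer[key] = value
        elif skip_when_full:
            # Map exhausted: flush it unsorted and stop aggregating here.
            partials.extend(buffer.items())
            buffer.clear()
            passthrough = True
            partials.append((key, value))
        else:
            # Map exhausted: sort the buffered entries and "spill" them.
            partials.extend(sorted(buffer.items()))
            buffer.clear()
            spills += 1
            buffer[key] = value

    # Flush whatever is still buffered at the end of the input.
    partials.extend(buffer.items())
    return partials, spills


def final_aggregate(partials):
    """Simulated final (merge) step: combine partial sums per key."""
    totals = defaultdict(int)
    for key, partial_sum in partials:
        totals[key] += partial_sum
    return dict(totals)


if __name__ == "__main__":
    # Every key is distinct, so partial aggregation cannot reduce the data.
    rows = [(i, 1) for i in range(1000)]
    spilled, spills = partial_aggregate(rows, map_capacity=100, skip_when_full=False)
    skipped, _ = partial_aggregate(rows, map_capacity=100, skip_when_full=True)
    print("spills with sort/spill enabled:", spills)
    print("same final result:", final_aggregate(spilled) == final_aggregate(skipped))
```

In this sketch both modes emit one partial row per input row for all-distinct keys, so the sorting done in the spill path buys nothing. With low-cardinality keys the map never fills and the two modes behave identically.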
