karuppayya opened a new pull request #28804:
URL: https://github.com/apache/spark/pull/28804


   ### What changes were proposed in this pull request?
   With hash aggregation, a partial aggregation (update) is performed first, 
followed by a final aggregation (merge).
   
   During partial aggregation, we sort and spill to disk every time the fast 
map (when enabled) and the UnsafeFixedWidthAggregationMap are exhausted.
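
As a rough illustration of the behavior described above, here is a toy Python model of a partial sum-aggregation backed by a bounded in-memory map. It is only a sketch: `partial_aggregate` and `max_keys` are hypothetical names, a plain list stands in for sorted runs written to disk, and none of this is Spark's actual implementation.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def partial_aggregate(rows, max_keys):
    """Toy partial sum-aggregation with a bounded in-memory map.

    rows: iterable of (key, value) pairs.
    max_keys: hypothetical capacity of the in-memory map.
    Returns (sorted list of (key, partial_sum), number_of_spills).
    """
    table = {}
    spilled_runs = []  # stands in for sorted runs spilled to disk
    for key, value in rows:
        if key not in table and len(table) >= max_keys:
            # Map exhausted: sort the current contents, spill, start afresh.
            spilled_runs.append(sorted(table.items()))
            table = {}
        table[key] = table.get(key, 0) + value

    if not spilled_runs:
        # Everything fit in memory; no sort-based merge is needed.
        return sorted(table.items()), 0

    # Merge all sorted runs and combine entries that share a key.
    runs = spilled_runs + [sorted(table.items())]
    merged = heapq.merge(*runs)
    result = [
        (key, sum(value for _, value in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]
    return result, len(spilled_runs)
```

With a capacity of two keys, the input `[("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]` triggers one spill when `"c"` arrives, and the merge step then recombines the two partial sums for `"b"`.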
   
   **When the cardinality of the grouping column is close to the total number 
of records being processed, sorting the data and spilling it to disk is not 
required: partial aggregation barely reduces the row count, so the step is 
effectively a no-op and the rows can be used directly in the final aggregation.**
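
To make the cardinality argument concrete, the following minimal sketch computes how much a partial aggregation could shrink its input. The helper name `partial_reduction_ratio` is hypothetical and is not part of this PR; the ratio is simply distinct keys divided by total rows.

```python
def partial_reduction_ratio(keys):
    """Return distinct_keys / total_rows for a sequence of grouping keys.

    A value near 0 means partial aggregation collapses many rows per key.
    A value near 1 means almost every row has its own key, so partial
    aggregation emits nearly as many rows as it reads.
    """
    keys = list(keys)
    if not keys:
        return 1.0  # nothing to reduce
    return len(set(keys)) / len(keys)


# Low cardinality: 10 distinct keys over 1000 rows.
low = partial_reduction_ratio(i % 10 for i in range(1000))

# High cardinality: every row has a unique key.
high = partial_reduction_ratio(range(1000))
```

Here `low` is 0.01, so partial aggregation pays off, while `high` is 1.0, which is the case where the sort and spill only add cost.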
   
   Even when users know the nature of their data, they currently have no way 
to disable this sort-and-spill operation.
   
   This is similar to following issues in Hive:
   https://issues.apache.org/jira/browse/HIVE-223
   https://issues.apache.org/jira/browse/HIVE-291
   
   In this PR, the ability to disable sort/spill during partial aggregation is 
added.
    
   ### Why are the changes needed?
   Skipping the unnecessary sort and spill can improve the performance of 
queries whose grouping columns have high cardinality.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   This patch was tested manually.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



