maropu commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-673460228
btw, could this optimization be implemented on the adaptive execution framework (`AdaptiveSparkPlanExec`)? In the initial discussion (https://github.com/apache/spark/pull/28804/files#r447158417), it was pointed out that accurate statistics could not be collected, but I think we might be able to collect those stats via that framework. Yea, as we know, we'd need to look for a light-weight way to compute the cardinality of the shuffle output; if we find one, I think we can simply drop the partial aggregate in high-cardinality cases. Have you already considered this approach? What I'm worried about now is that the current implementation makes the code complicated, and it is limited to hash aggregates w/ codegen only.
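For illustration only (this is not Spark code): the kind of cardinality-based rule hinted at above might look like the sketch below, assuming the runtime could expose a per-stage row count and some light-weight distinct-group-key estimate. All function names and the threshold value are hypothetical.

```python
def partial_agg_reduction(row_count: int, distinct_group_keys: int) -> float:
    """Fraction of rows a map-side partial aggregate could eliminate:
    at best, each distinct group key survives as a single partial row."""
    if row_count == 0:
        return 0.0
    return 1.0 - distinct_group_keys / row_count


def should_skip_partial_agg(row_count: int,
                            distinct_group_keys: int,
                            min_reduction: float = 0.2) -> bool:
    """Skip the partial aggregate when it would shrink the shuffle input
    by less than `min_reduction`, i.e. when group keys are nearly unique
    and the map-side pass mostly burns CPU without reducing data."""
    return partial_agg_reduction(row_count, distinct_group_keys) < min_reduction
```

For example, with 1M input rows and ~950K distinct keys the best-case reduction is only 5%, so the partial aggregate would be skipped; with 100 distinct keys it reduces the shuffle input by ~99.99% and is clearly worth keeping. The hard part the comment points at is obtaining `distinct_group_keys` cheaply at runtime.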
