maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-673460228


   BTW, could this optimization be implemented on top of the adaptive execution 
framework (`AdaptiveSparkPlanExec`)? In the initial discussion 
(https://github.com/apache/spark/pull/28804/files#r447158417), it was pointed 
out that accurate statistics could not be collected, but I think we might be 
able to collect them with that framework. As we know, we'd need a light-weight 
way to compute the cardinality of shuffle output; if we find one, I think we 
can simply drop the partial aggregate in high-cardinality cases.
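   To make the idea concrete, here is a rough sketch of the kind of 
light-weight probe I mean (all names here are illustrative, not actual Spark 
internals). It uses the stream-lib `HyperLogLog` that Spark already depends on 
(e.g., for `countApproxDistinct`): each map task could feed grouping keys into 
a small sketch, and an AQE rule could then drop the partial aggregate for 
stages where the keys turn out to be nearly unique:
   
```scala
import com.clearspring.analytics.stream.cardinality.HyperLogLog

// Hypothetical per-task probe: feed grouping keys into an HLL sketch while
// the map task runs, then compare the estimated distinct-key count against
// the number of input rows seen.
class PartialAggProbe(log2m: Int = 12) {
  private val hll = new HyperLogLog(log2m) // 2^12 registers, a few KB per task
  private var rows = 0L

  def update(groupingKey: AnyRef): Unit = {
    hll.offer(groupingKey)
    rows += 1
  }

  // True when grouping keys are nearly unique, i.e. the partial aggregate
  // would emit almost as many rows as it consumes and is pure overhead.
  def isHighCardinality(threshold: Double = 0.8): Boolean =
    rows > 0 && hll.cardinality().toDouble / rows > threshold
}
```
   
   Whether paying the `update` cost on every input row is cheap enough is of 
course the open question; sampling a fraction of the rows would be one way to 
keep it light.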
   
   Have you already considered this approach? What worries me about the 
current implementation is that it complicates the code and is limited to hash 
aggregates w/ codegen only.

