maropu commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-673460228
btw, could this optimization be implemented on the adaptive execution framework (`AdaptiveSparkPlanExec`)? In the initial discussion (https://github.com/apache/spark/pull/28804/files#r447158417), it was pointed out that accurate statistics could not be collected, but I think we might be able to collect those stats via that framework. Yea, as we know, we'd need to look for a light-weight way to compute the cardinality of the shuffle output; if we find one, I think we can simply drop the partial aggregate in high-cardinality cases. Have you already considered this approach? What I'm worried about now is that the current implementation makes the code complicated, and it is limited to hash aggregates w/ codegen only.
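For illustration only (this is not Spark code): the kind of cardinality-based rule hinted at above might look like the sketch below, assuming the runtime could expose a per-stage row count and some light-weight distinct-group-key estimate. All function names and the threshold value are hypothetical.

```python
def partial_agg_reduction(row_count: int, distinct_group_keys: int) -> float:
    """Fraction of rows a map-side partial aggregate could eliminate:
    at best, each distinct group key survives as a single partial row."""
    if row_count == 0:
        return 0.0
    return 1.0 - distinct_group_keys / row_count


def should_skip_partial_agg(row_count: int,
                            distinct_group_keys: int,
                            min_reduction: float = 0.2) -> bool:
    """Skip the partial aggregate when it would shrink the shuffle input
    by less than `min_reduction`, i.e. when group keys are nearly unique
    and the map-side pass mostly burns CPU without reducing data."""
    return partial_agg_reduction(row_count, distinct_group_keys) < min_reduction
```

For example, with 1M input rows and ~950K distinct keys the best-case reduction is only 5%, so the partial aggregate would be skipped; with 100 distinct keys it reduces the shuffle input by ~99.99% and is clearly worth keeping. The hard part the comment points at is obtaining `distinct_group_keys` cheaply at runtime.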
