yjshen commented on issue #11680: URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2333198678
> Does anyone know what the rationale of having repartition + coalesce, what kind of query benefits from it The primary reason is scalability. Efficient aggregation requires multi-core CPUs to process data in parallel. To facilitate this and prevent contention from multiple threads altering a single hash table simultaneously (often managed with locks), a repartition phase is introduced. This repartition allows each thread to perform aggregation independently. Furthermore, pre-aggregation is employed to conduct preliminary calculations to streamline repartitioning, significantly reducing the volume of data that needs repartitioning(in cases either the cadinatlity is low or there are several hot keys). > The next thing is to find out when and how should I avoid EnforceDistribution and bypass SanityCheckPlan if needed. And ensure there is no other queries slow down. This when-and-how problem is difficult for query engines because it requires foreknowledge of the data characteristics on grouping keys. It's even harder for DataFusion since we have very limited table metadata that could help us with this decision. In an adaptive approach, the aggregate operator could internally start with one method and move to another without interacting with other components(the physical optimizer and other physical operators), making it more feasible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
