Rachelint commented on issue #11680: URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2362005473
> > > Does anyone know what the rationale of having repartition + coalesce is, and what kind of query benefits from it?
> >
> > The primary reason is scalability. Efficient aggregation requires multi-core CPUs to process data in parallel. To facilitate this and to prevent contention from multiple threads altering a single hash table simultaneously (often managed with locks), a repartition phase is introduced.
> > This repartition allows each thread to perform aggregation independently. Furthermore, pre-aggregation is employed to perform preliminary aggregation before repartitioning, significantly reducing the volume of data that needs to be repartitioned (in cases where either the cardinality is low or there are several hot keys).
> >
> > > The next thing is to find out when and how I should avoid EnforceDistribution and bypass SanityCheckPlan if needed, and ensure that no other queries slow down.
> >
> > This when-and-how problem is difficult for query engines because it requires foreknowledge of the data characteristics of the grouping keys. It's even harder for DataFusion since we have very limited table metadata that could help us with this decision.
> > In an adaptive approach, the aggregate operator could internally start with one method and move to another without interacting with other components (the physical optimizer and other physical operators), making it more feasible.
> >
> > Maybe reducing the cost of repartition is an alternative? I think the reason why `partial + final` leads to a performance regression in some situations may be that the cost of splitting batches is greater than the benefit of parallel merging.

I am trying a POC that fuses the repartition and the partial aggregation (a rough sketch of the idea is at the end of this comment). It is exciting that this idea seems promising! I tried some related queries locally, and they run faster.

- Main

```
Q16: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 16 iteration 0 took 5537.5 ms and returned 10 rows
Query 16 iteration 1 took 5308.1 ms and returned 10 rows
Query 16 iteration 2 took 5499.6 ms and returned 10 rows
Query 16 iteration 3 took 5561.5 ms and returned 10 rows
Query 16 iteration 4 took 5380.3 ms and returned 10 rows

Q17: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 17 iteration 0 took 4961.5 ms and returned 10 rows
Query 17 iteration 1 took 4926.8 ms and returned 10 rows
Query 17 iteration 2 took 5000.1 ms and returned 10 rows
Query 17 iteration 3 took 4952.6 ms and returned 10 rows
Query 17 iteration 4 took 4922.3 ms and returned 10 rows
```

- The POC #12526

```
Q16: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 16 iteration 0 took 5252.9 ms and returned 10 rows
Query 16 iteration 1 took 5092.9 ms and returned 10 rows
Query 16 iteration 2 took 5102.9 ms and returned 10 rows
Query 16 iteration 3 took 5181.0 ms and returned 10 rows
Query 16 iteration 4 took 5169.2 ms and returned 10 rows

Q17: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 17 iteration 0 took 4660.8 ms and returned 10 rows
Query 17 iteration 1 took 4619.3 ms and returned 10 rows
Query 17 iteration 2 took 4588.3 ms and returned 10 rows
Query 17 iteration 3 took 4717.4 ms and returned 10 rows
Query 17 iteration 4 took 4633.0 ms and returned 10 rows
```

I will check it more carefully again, and if it actually works, I will submit a formal PR.
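
For context on the repartition + coalesce question above: a GROUP BY like Q16/Q17 is currently planned as two aggregation phases with a hash repartition (plus a batch coalesce) between them. Schematically it looks roughly like this (simplified; the exact operators and options depend on the version and configuration, and the ORDER BY/LIMIT part of Q16 is omitted):

```
AggregateExec: mode=FinalPartitioned, gby=[UserID, SearchPhrase], aggr=[COUNT(*)]
  CoalesceBatchesExec: target_batch_size=8192
    RepartitionExec: partitioning=Hash([UserID, SearchPhrase], 16)
      AggregateExec: mode=Partial, gby=[UserID, SearchPhrase], aggr=[COUNT(*)]
        ... scan of hits ...
```

The partial phase shrinks the data when groups repeat, the hash repartition sends every group to exactly one core, and the final phase can then work on each partition independently without sharing a hash table.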
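
On the adaptive idea quoted above: if I remember correctly, DataFusion already has a related mechanism that can skip partial aggregation when a probe window shows the grouping keys are barely repeating. Just to make the general idea concrete, here is a purely illustrative sketch (made-up types, not DataFusion's actual operator or APIs):

```
use std::collections::HashMap;

/// Illustrative sketch only: a partial aggregation that probes its own
/// reduction ratio and adaptively switches to pass-through mode when
/// pre-aggregation is not paying off.
enum Mode {
    /// Pre-aggregate into a local hash table.
    Aggregating,
    /// Keys are mostly unique; forward rows and let the final phase do the work.
    PassThrough,
}

struct AdaptivePartialAgg {
    mode: Mode,
    counts: HashMap<String, u64>,
    rows_seen: usize,
    probe_rows: usize, // decide after this many input rows
    switch_ratio: f64, // groups / rows above this => aggregation is not helping
}

impl AdaptivePartialAgg {
    fn new(probe_rows: usize, switch_ratio: f64) -> Self {
        Self {
            mode: Mode::Aggregating,
            counts: HashMap::new(),
            rows_seen: 0,
            probe_rows,
            switch_ratio,
        }
    }

    /// Consume one batch of keys. Returns rows forwarded unaggregated (if any).
    fn push(&mut self, keys: &[&str]) -> Vec<(String, u64)> {
        match self.mode {
            Mode::Aggregating => {
                for &k in keys {
                    *self.counts.entry(k.to_string()).or_insert(0) += 1;
                }
                self.rows_seen += keys.len();
                // After the probe window, check whether the input actually shrank.
                // (A real operator would also flush the groups accumulated so far.)
                if self.rows_seen >= self.probe_rows
                    && self.counts.len() as f64 / self.rows_seen as f64 > self.switch_ratio
                {
                    self.mode = Mode::PassThrough;
                }
                Vec::new()
            }
            // Emit each row as a singleton group; no hash-table cost anymore.
            Mode::PassThrough => keys.iter().map(|k| ((*k).to_string(), 1)).collect(),
        }
    }
}

fn main() {
    // Tiny probe window so the switch is visible on toy data.
    let mut agg = AdaptivePartialAgg::new(4, 0.8);
    println!("{:?}", agg.push(&["a", "b", "c", "d"])); // probing: aggregated, returns []
    println!("{:?}", agg.push(&["e", "f"]));           // switched: rows pass through
}
```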
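
To make the fusion direction concrete, here is a rough, self-contained sketch of the idea using plain std collections instead of the real Arrow/DataFusion types (the names below are made up for illustration; this is not the actual code in #12526). The point is that the partial aggregation keeps one small hash table per output partition, so its output is already hash-partitioned and the row-by-row batch splitting done by the repartition step disappears:

```
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch of fusing repartition into the partial aggregation:
/// instead of building one hash table and letting RepartitionExec split every
/// output batch afterwards, keep one hash table per output partition so the
/// pre-aggregated state is already routed to its final partition.
struct FusedPartialAgg {
    /// counts[p] holds the groups whose keys hash to output partition p.
    counts: Vec<HashMap<String, u64>>,
}

impl FusedPartialAgg {
    fn new(output_partitions: usize) -> Self {
        Self {
            counts: vec![HashMap::new(); output_partitions],
        }
    }

    fn partition_of(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.counts.len()
    }

    /// Pre-aggregate a batch of keys, routing each group to its partition.
    fn push(&mut self, keys: &[&str]) {
        for &key in keys {
            let p = self.partition_of(key);
            *self.counts[p].entry(key.to_string()).or_insert(0) += 1;
        }
    }

    /// Emit the pre-aggregated state for one output partition; the final
    /// aggregation for that partition can consume this directly, with no
    /// separate repartition operator in between.
    fn drain_partition(&mut self, p: usize) -> Vec<(String, u64)> {
        self.counts[p].drain().collect()
    }
}

fn main() {
    let mut agg = FusedPartialAgg::new(4);
    agg.push(&["a", "b", "a", "c", "b", "a"]);
    for p in 0..4 {
        println!("partition {p}: {:?}", agg.drain_partition(p));
    }
}
```

The final aggregation for partition p then only ever sees groups that hash to p, which is the same invariant RepartitionExec provides today, just without materializing and splitting the intermediate batches.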
