yjshen commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2333198678

   > Does anyone know what the rationale of having repartition + coalesce, what 
kind of query benefits from it
   
   The primary reason is scalability. Efficient aggregation requires multi-core 
CPUs to process data in parallel. To facilitate this and prevent contention 
from multiple threads altering a single hash table simultaneously (often 
managed with locks), a repartition phase is introduced. 
   
   This repartition allows each thread to perform aggregation independently. 
Furthermore, pre-aggregation is employed to conduct preliminary calculations to 
streamline repartitioning, significantly reducing the volume of data that needs 
repartitioning(in cases either the cadinatlity is low or there are several hot 
keys).
   
   > The next thing is to find out when and how should I avoid 
EnforceDistribution and bypass SanityCheckPlan if needed. And ensure there is 
no other queries slow down.
   
   This when-and-how problem is difficult for query engines because it requires 
foreknowledge of the data characteristics on grouping keys. It's even harder 
for DataFusion since we have very limited table metadata that could help us 
with this decision. 
   
   In an adaptive approach, the aggregate operator could internally start with 
one method and move to another without interacting with other components(the 
physical optimizer and other physical operators), making it more feasible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to