2010YOUY01 commented on PR #22518:
URL: https://github.com/apache/datafusion/pull/22518#issuecomment-4657264875

   Thank you for working on this optimization. I think this is a really cool 
idea!
   
   Let me first reiterate the motivation from the issue to make sure I 
understand it correctly:
   
   > If we use a fixed `skip_partial_aggregation_probe_ratio_threshold`, one 
problematic scenario is that this default value was originally tuned for simple 
group-by keys like integers. If the group keys are complex, such as long 
strings, this value should probably be lower to be optimal, so we should 
introduce an adaptive strategy.
   
   I agree this is a valid concern. However, I would prefer to use a simpler 
heuristic for this. The aggregation code is already very hard to maintain, and 
adding an `A/B Sampling` stage may make it even more complex. It is also hard 
to ensure that it is compatible with other execution paths, such as spilling.
   
   We may want to optimize the aggregate operator in other directions in the 
future for bigger gains, and I think it is important to keep the current code 
manageable.
   
   I think there are two possible directions:
   
   1. Figure out a way to fully decouple the `ABSampling` state from the main 
execution state machine. If we can do that, then keeping a more sophisticated 
adaptive algorithm would make sense to me.
   2. Otherwise, we could simplify the heuristic to roughly the same complexity 
as the existing approach. For example, if the default threshold is 0.8, then if 
the group key size is large, we use `DEFAULT_THRESHOLD / 4`. This approach must 
be worse than a cost model, but I think it is more important to keep the core 
execution code maintainable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to