Re: [PR] rfc: optional skipping partial aggregation [datafusion]

via GitHub Sat, 27 Jul 2024 04:00:12 -0700


alamb commented on PR #11627:
URL: https://github.com/apache/datafusion/pull/11627#issuecomment-2254113651


   > @alamb thank you for sharing benchmark results -- I'll check out if any of 
them benefited from this feature (I suppose it shouldn't be triggered in many 
of them) and will look for the possible reasons of q32 (and other queries) 
slowdown (actually this one -- q32, should benefit most, when producing state 
will be implemented for `avg`) + additionally check if generating state can be 
done faster via kernels instead of loops.
   
   Awesome -- I am planning to look into them as well
   
   > 
   > > That this approach may overfit the problem (aka that it isn't 
generalizeable outside the context of the benchmark runs)
   > 
   > Probably, but I supposed this idea to be opposite to overfitting, since it 
relies more on the input data, rather then fixed settings (I may be wrong here 
however).
   
   The more I think about it, the more I agree with you. While there are tuning 
knobs (e.g. the fraction of tuples aggregates) I do think they are general.
   
   
   > > That this approach might preclude making some larger changes (like 
simply turning off the intermediate generation)
   > > ...
   > > I wonder if you have thought about some way to disable aggregation 
entirely in the partial aggregation phase
   > 
   > Initially I've been considering to make partial aggregation just propagate 
input batches as is, and adding some internal flag into their schema metadata 
(pointing that final aggregation to use `update_batch` instead of 
`merge_batch`), but decided that it may be too "pipeline breaking" due to 
different batch schemas (as you've pointed out) -- I suppose it'll require an 
additional logic in `CoalesceBatchesExec` (it wont be able to concat batches 
with different schemas, coming from on/off partial aggregation partitions), it 
also could be a blocker for any (unplanned yet) optimizations of 
`RepartitionExec` (in case there will be some buffers with embedded batch 
concatenation), and it also may produce some burden for DF-based projects doing 
data shuffling across the nodes and having their own shuffle operators. Overall 
-- I've decided that current approach is more safe (at least at this moment), 
as it doesn't affect anything besides aggregation operator.
   
   
   I think this makes sense and I agree with your conclusion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] rfc: optional skipping partial aggregation [datafusion]

Reply via email to