alamb commented on PR #11627: URL: https://github.com/apache/datafusion/pull/11627#issuecomment-2254113651
> @alamb thank you for sharing benchmark results -- I'll check out if any of them benefited from this feature (I suppose it shouldn't be triggered in many of them) and will look for the possible reasons of q32 (and other queries) slowdown (actually this one -- q32, should benefit most, when producing state will be implemented for `avg`) + additionally check if generating state can be done faster via kernels instead of loops. Awesome -- I am planning to look into them as well > > > That this approach may overfit the problem (aka that it isn't generalizeable outside the context of the benchmark runs) > > Probably, but I supposed this idea to be opposite to overfitting, since it relies more on the input data, rather then fixed settings (I may be wrong here however). The more I think about it, the more I agree with you. While there are tuning knobs (e.g. the fraction of tuples aggregates) I do think they are general. > > That this approach might preclude making some larger changes (like simply turning off the intermediate generation) > > ... > > I wonder if you have thought about some way to disable aggregation entirely in the partial aggregation phase > > Initially I've been considering to make partial aggregation just propagate input batches as is, and adding some internal flag into their schema metadata (pointing that final aggregation to use `update_batch` instead of `merge_batch`), but decided that it may be too "pipeline breaking" due to different batch schemas (as you've pointed out) -- I suppose it'll require an additional logic in `CoalesceBatchesExec` (it wont be able to concat batches with different schemas, coming from on/off partial aggregation partitions), it also could be a blocker for any (unplanned yet) optimizations of `RepartitionExec` (in case there will be some buffers with embedded batch concatenation), and it also may produce some burden for DF-based projects doing data shuffling across the nodes and having their own shuffle operators. Overall -- I've decided that current approach is more safe (at least at this moment), as it doesn't affect anything besides aggregation operator. I think this makes sense and I agree with your conclusion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org