Rachelint commented on PR #11943: URL: https://github.com/apache/datafusion/pull/11943#issuecomment-2296550942
> Benchmark looks amazing 🚀 > > I have an idea to minimize this PR (and make progress easier): > #### Problem with existing code > > Now aggregation execution code (the entry point in `row_hash.rs/poll_next()` has many specialized execution paths, I think the basic one is `GroupedHash`, other execution paths include spilling/streaming aggregate/skip partial aggregation/with soft limit/... > > It's hard to understand each of them, if multiple specializations can be triggered in the same execution, it's even trickier (e.g. can spilling and skipping partial aggregation happen within a single execution?) I believe these details are currently under-documented. > #### This PR > > I skimmed through this PR, it appears not to implement compatibility with streaming group-by, but tries to be compatible with spilling/skipping partial aggregation (at least reused some execution path) > #### Thoughts > > I think making these specializations compatible requires significant design/doc/testing effort, which could be leave to another PR. > > Given the major improvement of blocked aggregation is, avoding extra copy when groups/accumulators get resized, it seems possible to demonstrate the performance numbers without supporting other advanced execution logic. > > I'm wondering is it possible not to support spilling (also skipping partial aggregation and maybe other optimizations) at first? So that this PR's execution path is not mixing with other specialized paths Yes, in fact some other problems not related to this pr exists in spilling path... I planned to just ignore them and only disable features > Benchmark looks amazing 🚀 > > I have an idea to minimize this PR (and make progress easier): > #### Problem with existing code > > Now aggregation execution code (the entry point in `row_hash.rs/poll_next()` has many specialized execution paths, I think the basic one is `GroupedHash`, other execution paths include spilling/streaming aggregate/skip partial aggregation/with soft limit/... > > It's hard to understand each of them, if multiple specializations can be triggered in the same execution, it's even trickier (e.g. can spilling and skipping partial aggregation happen within a single execution?) I believe these details are currently under-documented. > #### This PR > > I skimmed through this PR, it appears not to implement compatibility with streaming group-by, but tries to be compatible with spilling/skipping partial aggregation (at least reused some execution path) > #### Thoughts > > I think making these specializations compatible requires significant design/doc/testing effort, which could be leave to another PR. > > Given the major improvement of blocked aggregation is, avoding extra copy when groups/accumulators get resized, it seems possible to demonstrate the performance numbers without supporting other advanced execution logic. > > I'm wondering is it possible not to support spilling (also skipping partial aggregation and maybe other optimizations) at first? So that this PR's execution path is not mixing with other specialized paths Thanks! It is clever to just disable this blocked optimization when spilling is enabled. Yes, actually I found some other problems not introduced by this pr exists in spilling path... And the current compatible impl for spilling may have the bad performance (we need to copy all blocks into single to do it...). I planned to just leave them to another prs... I am cleaning up the unnecessary codes, and will switch it to ready soon. I am -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org