Re: [PR] Sketch for aggregation intermediate results blocked management [datafusion]

via GitHub Mon, 19 Aug 2024 06:14:05 -0700


Rachelint commented on PR #11943:
URL: https://github.com/apache/datafusion/pull/11943#issuecomment-2296550942


   > Benchmark looks amazing 🚀
   > 
   > I have an idea to minimize this PR (and make progress easier):
   > #### Problem with existing code
   > 
   > Now aggregation execution code (the entry point in 
`row_hash.rs/poll_next()` has many specialized execution paths, I think the 
basic one is `GroupedHash`, other execution paths include spilling/streaming 
aggregate/skip partial aggregation/with soft limit/...
   > 
   > It's hard to understand each of them, if multiple specializations can be 
triggered in the same execution, it's even trickier (e.g. can spilling and 
skipping partial aggregation happen within a single execution?) I believe these 
details are currently under-documented.
   > #### This PR
   > 
   > I skimmed through this PR, it appears not to implement compatibility with 
streaming group-by, but tries to be compatible with spilling/skipping partial 
aggregation (at least reused some execution path)
   > #### Thoughts
   > 
   > I think making these specializations compatible requires significant 
design/doc/testing effort, which could be leave to another PR.
   > 
   > Given the major improvement of blocked aggregation is, avoding extra copy 
when groups/accumulators get resized, it seems possible to demonstrate the 
performance numbers without supporting other advanced execution logic.
   > 
   > I'm wondering is it possible not to support spilling (also skipping 
partial aggregation and maybe other optimizations) at first? So that this PR's 
execution path is not mixing with other specialized paths
   
   Yes, in fact some other problems not related to this pr exists in spilling 
path... I planned to just ignore them and only disable features 
   
   > Benchmark looks amazing 🚀
   > 
   > I have an idea to minimize this PR (and make progress easier):
   > #### Problem with existing code
   > 
   > Now aggregation execution code (the entry point in 
`row_hash.rs/poll_next()` has many specialized execution paths, I think the 
basic one is `GroupedHash`, other execution paths include spilling/streaming 
aggregate/skip partial aggregation/with soft limit/...
   > 
   > It's hard to understand each of them, if multiple specializations can be 
triggered in the same execution, it's even trickier (e.g. can spilling and 
skipping partial aggregation happen within a single execution?) I believe these 
details are currently under-documented.
   > #### This PR
   > 
   > I skimmed through this PR, it appears not to implement compatibility with 
streaming group-by, but tries to be compatible with spilling/skipping partial 
aggregation (at least reused some execution path)
   > #### Thoughts
   > 
   > I think making these specializations compatible requires significant 
design/doc/testing effort, which could be leave to another PR.
   > 
   > Given the major improvement of blocked aggregation is, avoding extra copy 
when groups/accumulators get resized, it seems possible to demonstrate the 
performance numbers without supporting other advanced execution logic.
   > 
   > I'm wondering is it possible not to support spilling (also skipping 
partial aggregation and maybe other optimizations) at first? So that this PR's 
execution path is not mixing with other specialized paths
   
   Thanks! It is clever to just disable this blocked optimization when spilling 
is enabled.
   
   Yes, actually I found some other problems not introduced by this pr exists 
in spilling path... And the current compatible impl for spilling may have the 
bad performance (we need to copy all blocks into single to do it...).
   
   I planned to just leave them to another prs... I am cleaning up the 
unnecessary codes, and will switch it to ready soon.
   
   
   I am 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Sketch for aggregation intermediate results blocked management [datafusion]

Reply via email to