Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]

via GitHub Sat, 10 Aug 2024 14:22:59 -0700


Rachelint commented on PR #11758:
URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2282280596


   > > Thank you @alamb 🙏. Let me analyze it further 🤔
   > 
   > In order to actually generate the output in multiple batches and gain performance, I think we would need to change:
   > 
   >     1. The `GroupValues` storage (so that it never creates a large contiguous range)
   > 
   >     2. The `GroupsAccumulators`, likewise, to manage their internal state as multiple chunks rather than as a single chunk
   > 
   > 
   > This would likely require some sort of API change to the accumulators / etc
   > 
   > I wonder if we could find some way to do the implementation incrementally
   
   I agree. Ultimately this should be a big change that switches the group 
values and related states to being managed in blocks, like DuckDB, and I am 
working on this.
   
   But maybe just splitting the emitted result still has benefits? It seems 
that it would also avoid calling the `slice` function many times, which 
really costs CPU.
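
   To make the block-based idea concrete, here is a minimal sketch in plain 
Rust. It assumes per-group state is kept in fixed-size blocks, so that no large 
contiguous buffer is ever built and each block can be handed out whole at emit 
time. `BlockedState`, `emit_next_block`, and `emit_block_lengths` are 
hypothetical names for illustration only; they are not the DataFusion 
`GroupValues` or accumulator APIs, and real Arrow arrays are replaced by `Vec`.

```rust
use std::collections::VecDeque;

/// Hypothetical block-based storage for per-group state (illustration only).
struct BlockedState<T> {
    block_size: usize,
    blocks: VecDeque<Vec<T>>,
}

impl<T> BlockedState<T> {
    fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self { block_size, blocks: VecDeque::new() }
    }

    /// Append the state for a new group. A new block is opened when the
    /// current one is full, so no block grows beyond `block_size` entries.
    fn push(&mut self, value: T) {
        let need_new = self
            .blocks
            .back()
            .map_or(true, |b| b.len() == self.block_size);
        if need_new {
            self.blocks.push_back(Vec::with_capacity(self.block_size));
        }
        self.blocks.back_mut().unwrap().push(value);
    }

    /// Look up a group's state by its global index. Only meaningful before
    /// emission starts, because emitting removes blocks from the front.
    fn get(&self, group_idx: usize) -> Option<&T> {
        self.blocks
            .get(group_idx / self.block_size)?
            .get(group_idx % self.block_size)
    }

    /// Emit the oldest block by moving it out: no copy and no slicing.
    fn emit_next_block(&mut self) -> Option<Vec<T>> {
        self.blocks.pop_front()
    }
}

/// Fill the state with `num_groups` values, then emit block by block and
/// report the size of each emitted block.
fn emit_block_lengths(num_groups: u64, block_size: usize) -> Vec<usize> {
    let mut state = BlockedState::new(block_size);
    for v in 0..num_groups {
        state.push(v);
    }
    let mut lengths = Vec::new();
    while let Some(block) = state.emit_next_block() {
        lengths.push(block.len());
    }
    lengths
}
```

   In this sketch each emitted block would map to one output batch, so the 
emit path never has to cut a large result into pieces afterwards.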
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


