[GitHub] [arrow-datafusion] alamb commented on issue #2723: Consolidate GroupByHash implementations `row_hash.rs` and `hash.rs` (remove duplication)

GitBox Wed, 23 Nov 2022 05:49:11 -0800


alamb commented on issue #2723:
URL: 
https://github.com/apache/arrow-datafusion/issues/2723#issuecomment-1325093786


   I general, other than the fact that this proposal sounds like a lot of work, 
I think it sounds wonderful 🏆 
   
   I did have a question about the proposed trait implementation:
   
   ```rust
   trait Aggregator: MemoryConsumer {
       /// Update aggregator state for given groups.
       fn update_batch(&mut self, keys: Rows, batch: &RecordBatch)) -> 
Result<()>;
       ...
   }
   ```
   
   Is the idea that each aggregator would contain a HashMap or some other way 
to map keys --> intermediates (rather than the hash map being in the 
aggregator? This seems like it would result in a fair amount of duplication
   
   I would have expected something like the following (basically push the 
`take` into the aggregator)
   
   ```rust
   trait Aggregator: MemoryConsumer {
       /// Update aggregator state for given rows
       fn update_batch(&mut self, indices: Vec<usize>, batch: &RecordBatch)) -> 
Result<()>;
       ...
   }
   ```
   
   
   The big danger of the plan initially seems like  that it wouldn't be 
finished and then we are in an even worse state (3 implementations!) but I 
think your idea to incrementally rewrite V2 to V3 and then remove V1 sounds 
like a good mitigation strategy
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on issue #2723: Consolidate GroupByHash implementations `row_hash.rs` and `hash.rs` (remove duplication)

Reply via email to