milenkovicm commented on code in PR #4202:
URL: https://github.com/apache/arrow-datafusion/pull/4202#discussion_r1023810458


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -270,10 +315,25 @@ fn group_aggregate_batch(
                 // Existing entry for this group value
                 Some((_hash, group_idx)) => {
                     let group_state = &mut group_states[*group_idx];
+
                     // 1.3
                     if group_state.indices.is_empty() {
                         groups_with_rows.push(*group_idx);
                     };
+
+                    // ensure we have enough indices allocated
+                    if group_state.indices.capacity() == 
group_state.indices.len() {
+                        // allocate more
+
+                        // growth factor: 2, but at least 2 elements
+                        let bump_elements = (group_state.indices.capacity() * 
2).max(2);

Review Comment:
   I tend to agree with with @alamb here, IMHO `group_aggregate_batch` is too 
busy at the moment, and some kind of separation of concerns would help. 
   
   What if `group_aggregate_batch` returns how much more memory it allocated, 
and  accounting is done after method call? This would help with encapsulation 
of aggregation algorithm and make it easier to swap. I'm aware that it might 
not produce 100% correct results but as we discussed in #3941 it is ok to have 
small discrepancy for short period of time



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to