ahmed-mez commented on code in PR #18906:
URL: https://github.com/apache/datafusion/pull/18906#discussion_r2653016944
##########
datafusion/physical-plan/src/aggregates/group_values/row.rs:
##########
@@ -206,37 +233,52 @@ impl GroupValues for GroupValuesRows {
output
}
EmitTo::First(n) => {
- let groups_rows = group_values.iter().take(n);
- let output = self.row_converter.convert_rows(groups_rows)?;
- // Clear out first n group keys by copying them to a new Rows.
- // TODO file some ticket in arrow-rs to make this more efficient?
- let mut new_group_values = self.row_converter.empty_rows(0, 0);
- for row in group_values.iter().skip(n) {
- new_group_values.push(row);
- }
- std::mem::swap(&mut new_group_values, &mut group_values);
-
- self.map.retain(|(_exists_hash, group_idx)| {
- // Decrement group index by n
- match group_idx.checked_sub(n) {
- // Group index was >= n, shift value down
- Some(sub) => {
- *group_idx = sub;
- true
- }
- // Group index was < n, so remove from table
- None => false,
+ if self.drain_mode {
Review Comment:
Hi @alamb
I took a stab at the `EmitTo::Next(n)` approach you suggested!
https://github.com/apache/datafusion/pull/19562 - early [benchmark
results](https://github.com/apache/datafusion/pull/19562/files#diff-71570c2f006317fc69e4be7742dcb9f33e94f05860aa0f0dd8620352bf638455R212)
are promising.
The change touches several files and some critical code paths, so before
polishing it further I wanted to check:
1. Does this approach look on track with what you had in mind?
2. Would it make sense to create an epic to break this down into smaller
   PRs? I'm thinking something like:
   a. Add `EmitTo::Next` variant (keeping `All` temporarily as
      `Next(usize::MAX)`)
   b. Update `GroupValues` implementations
   c. Update `GroupsAccumulator` implementations
   d. Add `Draining` state and wire it up
   e. Remove `EmitTo::All` and clean up
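
To make step (a) a bit more concrete, here is a rough, standalone sketch of the idea. It is an illustration only: this `EmitTo` is a simplified stand-in rather than DataFusion's real enum, and `ToyGroupValues`, `ALL`, `emit`, and `is_drained` are hypothetical names that do not necessarily match the code in #19562. It assumes the store keeps a cursor over already-emitted groups, so emitting a batch does not require copying the remaining rows or shifting group indices.

```rust
/// Simplified stand-in for `EmitTo` (assumption: NOT the real DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum EmitTo {
    /// Emit the next `n` groups that have not been emitted yet.
    Next(usize),
}

impl EmitTo {
    /// Transitional spelling of "emit everything" as `Next(usize::MAX)`.
    const ALL: EmitTo = EmitTo::Next(usize::MAX);
}

/// Toy group store (hypothetical): keys stay in place and a cursor
/// records how many of them have already been emitted.
struct ToyGroupValues {
    keys: Vec<String>,
    emitted: usize,
}

impl ToyGroupValues {
    fn new(keys: Vec<String>) -> Self {
        Self { keys, emitted: 0 }
    }

    /// Return up to `n` not-yet-emitted keys and advance the cursor.
    /// The remaining keys are neither copied nor re-indexed.
    fn emit(&mut self, emit_to: EmitTo) -> Vec<String> {
        let EmitTo::Next(n) = emit_to;
        let remaining = self.keys.len() - self.emitted;
        let take = n.min(remaining); // clamps `usize::MAX` to what is left
        let start = self.emitted;
        self.emitted += take;
        self.keys[start..start + take].to_vec()
    }

    /// True once every key has been emitted.
    fn is_drained(&self) -> bool {
        self.emitted == self.keys.len()
    }
}
```

In this sketch, calling `emit(EmitTo::Next(2))` repeatedly walks through the groups in batches, and `emit(EmitTo::ALL)` returns whatever is left. The real change would also need to handle the hash map and accumulator state, which this toy version leaves out.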
Happy to iterate on the approach or adjust the breakdown based on your
feedback!
cc @gabotechs - I'm interested in your opinion as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]