[GitHub] [arrow-datafusion] kazuyukitanimura commented on a diff in pull request #7400: feat: Support spilling for hash aggregation

via GitHub Wed, 13 Sep 2023 15:16:34 -0700


kazuyukitanimura commented on code in PR #7400:
URL: https://github.com/apache/arrow-datafusion/pull/7400#discussion_r1325129442



##########
datafusion/physical-expr/src/aggregate/first_last.rs:
##########
@@ -165,6 +165,8 @@ struct FirstValueAccumulator {
     orderings: Vec<ScalarValue>,
     // Stores the applicable ordering requirement.
     ordering_req: LexOrdering,
+    // Whether merge_batch() is called before

Review Comment:
   Removing those makes `run_first_last_multi_partitions` fail with `Error: 
ArrowError(InvalidArgumentError("number of columns(5) must match number of 
fields(4) in schema"))` (at least in my local test). 
   
   The existing code is somewhat stateful: every call to `state()` adds 
another column to the schema. With this PR, the final aggregation calls 
`merge_batch()` and then, if applicable, calls `state()` for spilling via 
`GroupedHashAggregateStream.emit()`. The existing code assumes that `state()` 
is called only once, for partial aggregation. Ideally, calling `state()` 
multiple times should be allowed without having to worry about the internal 
state.
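
   To illustrate the failure mode, here is a minimal sketch. It does not use 
the real DataFusion types: `Value`, `StatefulAccumulator`, and 
`IdempotentAccumulator` are hypothetical stand-ins, and the mutation shown is 
only an assumed example of how a `state()` call could grow its output. The 
first accumulator returns one more column on each call; the second builds its 
output from scratch, so repeated calls return the same columns.

```rust
// Hypothetical, simplified stand-in for a scalar value (not ScalarValue).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Bool(bool),
}

/// Illustrates the problem: `state()` mutates the accumulator, so each
/// call returns one more column than the previous call.
struct StatefulAccumulator {
    first: Value,
    orderings: Vec<Value>,
}

impl StatefulAccumulator {
    fn state(&mut self) -> Vec<Value> {
        // Assumed bug for illustration: the flag is appended to the stored
        // field instead of to a local copy.
        self.orderings.push(Value::Bool(true));
        let mut out = vec![self.first.clone()];
        out.extend(self.orderings.iter().cloned());
        out
    }
}

/// `state()` only reads the fields and builds a fresh vector, so the
/// column count is the same no matter how often it is called.
struct IdempotentAccumulator {
    first: Value,
    orderings: Vec<Value>,
    is_set: bool,
}

impl IdempotentAccumulator {
    fn state(&self) -> Vec<Value> {
        let mut out = vec![self.first.clone()];
        out.extend(self.orderings.iter().cloned());
        out.push(Value::Bool(self.is_set));
        out
    }
}
```

   With the second shape, a caller such as a spill path can call `state()` 
after `merge_batch()` as many times as needed, and the number of columns 
always matches a fixed schema.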



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

