kazuyukitanimura commented on code in PR #7400:
URL: https://github.com/apache/arrow-datafusion/pull/7400#discussion_r1325129442
##########
datafusion/physical-expr/src/aggregate/first_last.rs:
##########
@@ -165,6 +165,8 @@ struct FirstValueAccumulator {
orderings: Vec<ScalarValue>,
// Stores the applicable ordering requirement.
ordering_req: LexOrdering,
+ // Whether merge_batch() is called before
Review Comment:
Removing those makes `run_first_last_multi_partitions` fail with
`Error: ArrowError(InvalidArgumentError("number of columns(5) must match number of fields(4) in schema"))`
(at least in my local test).
The existing code is somewhat stateful: every call to `state()` adds another
column to the schema. With this PR, the final aggregation calls
`merge_batch()` and then, if applicable, calls `state()` to spill via
`GroupedHashAggregateStream.emit()`. The existing code assumes that `state()`
is called only once, for partial aggregation. Ideally, calling `state()`
multiple times should be allowed without worrying about the internal state.
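
To illustrate the general hazard, here is a minimal sketch with toy types. `StatefulAcc` and `IdempotentAcc` are hypothetical names and are not the actual `FirstValueAccumulator` code. The sketch only shows how a `state()` that mutates the accumulator yields a different column count on a second call, while a read-only `state()` does not.

```rust
/// Hypothetical accumulator whose `state()` mutates itself (illustration only).
struct StatefulAcc {
    columns: Vec<i64>,
}

impl StatefulAcc {
    /// Appends a flag column to the stored state on every call,
    /// so the returned width grows each time.
    fn state(&mut self) -> Vec<i64> {
        self.columns.push(1);
        self.columns.clone()
    }
}

/// Hypothetical accumulator whose `state()` only reads (illustration only).
struct IdempotentAcc {
    columns: Vec<i64>,
    is_set: bool,
}

impl IdempotentAcc {
    /// Builds the output from a copy, leaving the accumulator untouched,
    /// so repeated calls return the same width.
    fn state(&self) -> Vec<i64> {
        let mut out = self.columns.clone();
        out.push(self.is_set as i64);
        out
    }
}

fn main() {
    let mut stateful = StatefulAcc { columns: vec![10, 20, 30] };
    assert_eq!(stateful.state().len(), 4);
    // A second call returns 5 values and would no longer fit a 4-field schema.
    assert_eq!(stateful.state().len(), 5);

    let idempotent = IdempotentAcc { columns: vec![10, 20, 30], is_set: true };
    assert_eq!(idempotent.state().len(), 4);
    assert_eq!(idempotent.state().len(), 4);
}
```

Taking `&self` rather than `&mut self` in the second variant lets the compiler enforce that `state()` cannot change the accumulator.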