alamb opened a new pull request, #4488: URL: https://github.com/apache/arrow-datafusion/pull/4488
# Which issue does this PR close? closes https://github.com/apache/arrow-datafusion/issues/3105 # Rationale for this change I think median as currently implemented is unusable except for everything but the simplest of queries (where there is no intermediate repartitioning) Median was the only aggregate that stored it state using `AggregateState::Array` while all other aggregages use `AggregateState::Scalar` It turns out that `AggregateState::Array` doesn't support partial aggregates where partial state has to be marshalled into an `Array` and combined. Thus, `median` likely never worked for any DataFusion plan that has more than one partition (where the data is repartitioned for parallel aggregation). # What changes are included in this PR? I tried a few ways to implement partial state for `AggregateState::Array` and they all got messy very quickly. Thus, I opted for what I think is the simpler, possibly less performant, approach of using Scalars rather than Arrays. Median was added in https://github.com/apache/arrow-datafusion/pull/3009 by @andygrove (and sadly I think I have have suggested the array based approach without fully understanding the implications) For any real world dataset where the performance different between approaches would be measurable, I believe the the current implementation is going to error anyways. # Are these changes tested? Yes # Are there any user-facing changes? Less error # Proposed follow on If this approach is accepted, I propose removing the AggregateState enum and update the Accumulator trait to simplify it as a follow on PR. cc @andygrove as he originally wrote this code in https://github.com/apache/arrow-datafusion/pull/3009 cc @tustvold as I think he is thinking about how to improve the grouping situation overall -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
