alamb opened a new pull request, #4488:
URL: https://github.com/apache/arrow-datafusion/pull/4488

   # Which issue does this PR close?
   
   closes https://github.com/apache/arrow-datafusion/issues/3105
   
   # Rationale for this change
   
   I think median as currently implemented is unusable except for everything 
but the simplest of queries (where there is no intermediate repartitioning)
   
   Median was the only aggregate that stored it state using 
`AggregateState::Array` while all other aggregages use `AggregateState::Scalar`
   
   It turns out that `AggregateState::Array` doesn't support partial aggregates 
where partial state has to be marshalled into an `Array` and combined. Thus, 
`median` likely never worked for any DataFusion plan that has more than one 
partition (where the data is repartitioned for parallel aggregation).
   
   
   # What changes are included in this PR?
   
   I tried a few ways to implement partial state for `AggregateState::Array`  
and they all got messy very quickly.
   
   Thus, I opted for what I think is the simpler, possibly less performant, 
approach of using Scalars rather than Arrays. 
   
   Median was added in https://github.com/apache/arrow-datafusion/pull/3009 by 
@andygrove (and sadly I think I have have suggested the array based approach 
without fully understanding the implications)
   
   For any real world dataset where the performance different between 
approaches would be measurable, I believe the the current implementation is 
going to error anyways. 
   
   # Are these changes tested?
   Yes
   
   # Are there any user-facing changes?
   Less error
   
   # Proposed follow on
   
   If this approach is accepted, I propose removing the AggregateState enum and 
update the Accumulator trait to simplify it as a follow on PR.
   
   cc @andygrove as he originally wrote this code in 
https://github.com/apache/arrow-datafusion/pull/3009 
   
   cc @tustvold as I think he is thinking about how to improve the grouping 
situation overall
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to