[jira] [Commented] (ARROW-9937) [Rust] [DataFusion] Average is not correct

Jorge (Jira) Tue, 08 Sep 2020 00:11:09 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191993#comment-17191993
 ]


Jorge commented on ARROW-9937:
------------------------------

[~andygrove], I remember that you wanted to work on this. If not, let me know 
and I'll take a shot at it.

Looking at [Ballista's source code for 
this|https://github.com/ballista-compute/ballista/blob/main/rust/ballista/src/execution/operators/hash_aggregate.rs],
 I think that we have the same issue there. :/


> [Rust] [DataFusion] Average is not correct
> ------------------------------------------
>
>                 Key: ARROW-9937
>                 URL: https://issues.apache.org/jira/browse/ARROW-9937
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Jorge
>            Priority: Major
>
> The current design of aggregates makes the calculation of the average 
> incorrect.
> It also makes it impossible to compute the [geometric 
> mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
> operations. 
> The central issue is that an Accumulator returns a single `ScalarValue` from 
> each partial aggregation via {{get_value}}, but a single `ScalarValue` often 
> does not carry enough information to perform the full aggregation.
> A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
> distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current 
> calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
> reduces them using another average, i.e.
> {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}
> which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.
> I believe that our Accumulators need to pass more information from the 
> partial aggregations to the final aggregation.
> We could consider adopting an API equivalent to 
> [Spark's|https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html], 
> i.e. have an `update`, a `merge` and an `evaluate`.
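
To make the proposal above concrete, here is a minimal sketch of what an 
average accumulator with `update`, `merge` and `evaluate` could look like. 
This is an illustration only, not DataFusion's actual API: the 
`AvgAccumulator` struct and the two helper functions are hypothetical names, 
and values are assumed to be plain `f64` rather than `ScalarValue`. The point 
is that the partial state is a (sum, count) pair instead of a single number.

```rust
/// Hypothetical accumulator for the arithmetic mean (not DataFusion's API).
/// The partial state is a (sum, count) pair rather than a single value.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Fold one input value into the partial state.
    pub fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    /// Combine another accumulator's partial state into this one.
    pub fn merge(&mut self, other: &AvgAccumulator) {
        self.sum += other.sum;
        self.count += other.count;
    }

    /// Produce the final value; `None` when no values were seen.
    pub fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// Average obtained by merging one partial (sum, count) state per batch.
pub fn merged_average(batches: &[Vec<f64>]) -> Option<f64> {
    let mut total = AvgAccumulator::new();
    for batch in batches {
        let mut partial = AvgAccumulator::new();
        for &v in batch {
            partial.update(v);
        }
        total.merge(&partial);
    }
    total.evaluate()
}

/// The flawed approach from the description: average the per-batch averages.
pub fn average_of_averages(batches: &[Vec<f64>]) -> Option<f64> {
    let mut outer = AvgAccumulator::new();
    for batch in batches {
        let mut partial = AvgAccumulator::new();
        for &v in batch {
            partial.update(v);
        }
        // Only the evaluated mean survives; the batch size is lost here.
        if let Some(mean) = partial.evaluate() {
            outer.update(mean);
        }
    }
    outer.evaluate()
}
```

With the values 1 to 5 split into batches {[1, 2], [3, 4], [5]}, 
`merged_average` returns 3, while `average_of_averages` returns 10/3, which 
is the discrepancy described in the issue.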



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
