[GitHub] [arrow] ritchie46 opened a new pull request #9416: ARROW-11496: [Rust] WIP aggregation on NaN values.

GitBox Thu, 04 Feb 2021 06:04:48 -0800


ritchie46 opened a new pull request #9416:
URL: https://github.com/apache/arrow/pull/9416



   @jorgecarleitao @Dandandan @nevi-me @alamb 
   
   Before I continue with this, I want to start a discussion about how the 
aggregation kernels should behave on NaN values. I started looking into this 
after receiving this issue: https://github.com/ritchie46/polars/issues/328. 
   
   It boils down to this:
   
   ```rust
   let a = Float64Array::from_iter_values(vec![1.0, f64::NAN]);
   dbg!(min(&a));
   dbg!(max(&a));
   ```
   ```
   [arrow/src/compute/kernels/aggregate.rs:905] min(&a) = Some(
       1.0,
   )
   [arrow/src/compute/kernels/aggregate.rs:906] max(&a) = Some(
       NaN,
   )
   
   ```
   
   
   I initially thought this was a bug, but then I saw that this behavior is 
asserted as valid in the tests. 
   
   However, this differs from the behavior of most numerical tools I know 
(e.g. numpy, tensorflow, torch). The [IEEE 
754](https://en.wikipedia.org/wiki/IEEE_754) standard states that _"Any 
comparison with a NaN is treated as unordered."_ 
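
To make the "unordered" point concrete, here is a minimal sketch using only the Rust standard library (no Arrow involved). The helper name `compare_with_nan` is made up for illustration; it just collects the results of the usual comparison operators against NaN.

```rust
use std::cmp::Ordering;

/// Compares `x` against NaN and returns
/// (x < NaN, x > NaN, x == NaN, x.partial_cmp(NaN)).
/// Hypothetical helper, for illustration only.
fn compare_with_nan(x: f64) -> (bool, bool, bool, Option<Ordering>) {
    let nan = f64::NAN;
    (x < nan, x > nan, x == nan, x.partial_cmp(&nan))
}

fn main() {
    let (lt, gt, eq, ord) = compare_with_nan(1.0);
    // Every comparison against NaN is false ...
    assert!(!lt && !gt && !eq);
    // ... and partial_cmp reports that no ordering exists.
    assert_eq!(ord, None);
}
```

So a kernel built purely on `<` or `>` cannot place NaN consistently on its own; any result that ranks NaN above or below other values comes from an explicit choice in the kernel.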
   
   But if a max aggregation differs from a min aggregation with regard to NaN, 
this implies to me that we currently treat NaN as ordered, and as larger than 
any number.
   
   IMO we should return NaN from both the max and the min kernel, and we may 
also add a variant that excludes NaNs. 
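
As a rough sketch of the two behaviors proposed above, assuming plain `f64` slices rather than Arrow arrays (so null handling is left out entirely): the function names below are hypothetical and are not part of the arrow crate. The propagating pair returns NaN as soon as any input is NaN; the ignoring pair skips NaNs, much like numpy's `nanmin`/`nanmax`.

```rust
/// NaN-propagating minimum: NaN if any value is NaN, None for empty input.
fn min_propagate_nan(values: &[f64]) -> Option<f64> {
    values.iter().copied().reduce(|acc, v| {
        if acc.is_nan() || v.is_nan() { f64::NAN } else { acc.min(v) }
    })
}

/// NaN-propagating maximum: NaN if any value is NaN, None for empty input.
fn max_propagate_nan(values: &[f64]) -> Option<f64> {
    values.iter().copied().reduce(|acc, v| {
        if acc.is_nan() || v.is_nan() { f64::NAN } else { acc.max(v) }
    })
}

/// NaN-ignoring minimum: None if there are no non-NaN values.
fn min_ignore_nan(values: &[f64]) -> Option<f64> {
    values.iter().copied().filter(|v| !v.is_nan()).reduce(f64::min)
}

/// NaN-ignoring maximum: None if there are no non-NaN values.
fn max_ignore_nan(values: &[f64]) -> Option<f64> {
    values.iter().copied().filter(|v| !v.is_nan()).reduce(f64::max)
}

fn main() {
    let values = [1.0, f64::NAN];
    // Both propagating aggregations agree: the result is NaN.
    assert!(min_propagate_nan(&values).unwrap().is_nan());
    assert!(max_propagate_nan(&values).unwrap().is_nan());
    // Both ignoring aggregations agree as well: NaN is skipped.
    assert_eq!(min_ignore_nan(&values), Some(1.0));
    assert_eq!(max_ignore_nan(&values), Some(1.0));
}
```

With either pair, min and max treat NaN symmetrically, which is the property the current kernels lack.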


