[I] Inconsistent null handling in min/max [arrow-datafusion]

via GitHub Thu, 02 Nov 2023 08:15:42 -0700


westonpace opened a new issue, #8031:
URL: https://github.com/apache/arrow-datafusion/issues/8031


   ### Describe the bug
   
   The `datafusion_physical_expr::aggregate::min_max::MinAccumulator` and 
`MaxAccumulator` appear to use different approaches to calculating the minimum 
and maximum at different times and this can lead to inconsistent results.
   
   The min/max are extracted from batches using 
[`arrow_arith::aggregate::max`](https://arrow.apache.org/rust/arrow_arith/aggregate/fn.max.html)
 and 
[`arrow_arith::aggregate::min`](https://arrow.apache.org/rust/arrow_arith/aggregate/fn.min.html)
 which state:
   
   > For floating point arrays any NaN values are considered to be greater than 
any other non-null value
   
   Once these values are extracted from the batch they are then compared with 
the current `ScalarValue`.  I had assumed this was using the `PartialOrd` 
implementation of `ScalarValue` which uses total order (which would still be 
inconsistent).
   
   However, it appears that 
[`f32::min`](https://doc.rust-lang.org/std/primitive.f32.html#method.min) and 
[`f32::max`](https://doc.rust-lang.org/std/primitive.f32.html#method.max) are 
used instead.  These methods give us a delightful, third interpretation of 
floating point ordering which is:
   
   > If one of the arguments is NaN, then the other argument is returned.
   
   This means that `min` considers `NaN` to be greater and `max` considers 
`NaN` to be smaller :dizzy_face: 
   
   As a result, it seems the `MinAccumulator` is consistent (`NaN` is always 
greater) but not aligned with total ordering.  The `MaxAccumulator` is 
inconsistent and can give different results depending on how the data is 
batched up.
   
   
   ### To Reproduce
   
   ```
   use std::sync::Arc;
   
   use arrow::datatypes::DataType;
   use arrow_array::{ArrayRef, Float32Array};
   use datafusion::physical_plan::Accumulator;
   use datafusion_physical_expr::expressions::MaxAccumulator;
   
   pub fn main() {
       let pos_nan = f32::NAN;
       let zero = 0_f32;
   
       let vals: ArrayRef = Arc::new(Float32Array::from_iter_values(
           [zero, pos_nan].iter().copied(),
       ));
   
       // We accumulate the above two rows in two different ways.  First, we 
pass in both as a single batch
       // This will use `aggregate::max` which always puts `NaN` as the max
       let mut accumulator = 
MaxAccumulator::try_new(&DataType::Float32).unwrap();
       accumulator.update_batch(&[vals]).unwrap();
       dbg!(&accumulator.evaluate().unwrap()); // prints NaN
   
       // Next we pass the two values in two different batches.  This will use 
`f32.max` which never
       // puts `NaN` as the max
       let vals_a: ArrayRef = 
Arc::new(Float32Array::from_iter_values([pos_nan].iter().copied()));
       let vals_b: ArrayRef = 
Arc::new(Float32Array::from_iter_values([zero].iter().copied()));
   
       let mut accumulator = 
MaxAccumulator::try_new(&DataType::Float32).unwrap();
       accumulator.update_batch(&[vals_a]).unwrap();
       accumulator.update_batch(&[vals_b]).unwrap();
       dbg!(&accumulator.evaluate().unwrap()); // prints 0
   }
   ```
   
   ### Expected behavior
   
   My preference would be that everything be done in total order.  I am using 
`ScalarValue`'s `PartialOrd` internally and I would prefer that it be 
consistent with min & max.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Inconsistent null handling in min/max [arrow-datafusion]

Reply via email to