westonpace opened a new issue, #8031: URL: https://github.com/apache/arrow-datafusion/issues/8031
### Describe the bug The `datafusion_physical_expr::aggregate::min_max::MinAccumulator` and `MaxAccumulator` appear to use different approaches to calculating the minimum and maximum at different times and this can lead to inconsistent results. The min/max are extracted from batches using [`arrow_arith::aggregate::max`](https://arrow.apache.org/rust/arrow_arith/aggregate/fn.max.html) and [`arrow_arith::aggregate::min`](https://arrow.apache.org/rust/arrow_arith/aggregate/fn.min.html) which state: > For floating point arrays any NaN values are considered to be greater than any other non-null value Once these values are extracted from the batch they are then compared with the current `ScalarValue`. I had assumed this was using the `PartialOrd` implementation of `ScalarValue` which uses total order (which would still be inconsistent). However, it appears that [`f32::min`](https://doc.rust-lang.org/std/primitive.f32.html#method.min) and [`f32::max`](https://doc.rust-lang.org/std/primitive.f32.html#method.max) are used instead. These methods give us a delightful, third interpretation of floating point ordering which is: > If one of the arguments is NaN, then the other argument is returned. This means that `min` considers `NaN` to be greater and `max` considers `NaN` to be smaller :dizzy_face: As a result, it seems the `MinAccumulator` is consistent (`NaN` is always greater) but not aligned with total ordering. The `MaxAccumulator` is inconsistent and can give different results depending on how the data is batched up. ### To Reproduce ``` use std::sync::Arc; use arrow::datatypes::DataType; use arrow_array::{ArrayRef, Float32Array}; use datafusion::physical_plan::Accumulator; use datafusion_physical_expr::expressions::MaxAccumulator; pub fn main() { let pos_nan = f32::NAN; let zero = 0_f32; let vals: ArrayRef = Arc::new(Float32Array::from_iter_values( [zero, pos_nan].iter().copied(), )); // We accumulate the above two rows in two different ways. First, we pass in both as a single batch // This will use `aggregate::max` which always puts `NaN` as the max let mut accumulator = MaxAccumulator::try_new(&DataType::Float32).unwrap(); accumulator.update_batch(&[vals]).unwrap(); dbg!(&accumulator.evaluate().unwrap()); // prints NaN // Next we pass the two values in two different batches. This will use `f32.max` which never // puts `NaN` as the max let vals_a: ArrayRef = Arc::new(Float32Array::from_iter_values([pos_nan].iter().copied())); let vals_b: ArrayRef = Arc::new(Float32Array::from_iter_values([zero].iter().copied())); let mut accumulator = MaxAccumulator::try_new(&DataType::Float32).unwrap(); accumulator.update_batch(&[vals_a]).unwrap(); accumulator.update_batch(&[vals_b]).unwrap(); dbg!(&accumulator.evaluate().unwrap()); // prints 0 } ``` ### Expected behavior My preference would be that everything be done in total order. I am using `ScalarValue`'s `PartialOrd` internally and I would prefer that it be consistent with min & max. ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
