[I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]

via GitHub Tue, 22 Apr 2025 08:01:58 -0700


etseidl opened a new issue, #15812:
URL: https://github.com/apache/datafusion/issues/15812


   ### Describe the bug
   
   DataFusion is over-aggressive when pruning floating point Parquet columns 
that contain `NaN`. This was mentioned in 
https://github.com/apache/datafusion/issues/15742#issuecomment-2815595171 and 
discussed in detail in https://github.com/apache/parquet-format/pull/221. The 
issue appears with predicates of the form `x [gt|lt] literal`. Consider a 
column consisting of `[1.0, 0.0, -1.0, NaN, -2.0]`: because `NaN` is left out 
of the statistics, the max will be 1 and the min -2. A query like 
`select * from ... where x > 2` will return no rows because no chunk exists 
where `max > 2`, so every chunk is pruned before the `NaN` row is ever 
evaluated.
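
The mismatch can be sketched in a few lines of Python. This is only a model of the reasoning above, not DataFusion's pruning code: the helper names are hypothetical, and it assumes (consistent with the expected behavior in this issue) that row-level comparison treats `NaN` as greater than every other value while the chunk statistics skip `NaN`.

```python
import math

def chunk_stats(values):
    """Min/max as described in this issue: NaN is left out of the statistics."""
    non_nan = [v for v in values if not math.isnan(v)]
    return min(non_nan), max(non_nan)

def gt_total_order(value, literal):
    """`value > literal`, assuming NaN sorts above every other value."""
    if math.isnan(value):
        return not math.isnan(literal)
    return value > literal

def can_prune_gt(stats_max, literal):
    """Naive rule for `x > literal`: skip the chunk when its max is not > literal."""
    return not gt_total_order(stats_max, literal)

column = [1.0, 0.0, -1.0, float("nan"), -2.0]
col_min, col_max = chunk_stats(column)

# The statistics say nothing can match, so the chunk is skipped...
pruned = can_prune_gt(col_max, 2.0)

# ...even though evaluating the predicate row by row would keep the NaN row.
matching = [v for v in column if gt_total_order(v, 2.0)]
```

Under these assumptions `pruned` is true while `matching` still holds the `NaN` row, which is exactly the discrepancy reported here.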
   
   ### To Reproduce
   
   ```sql
   > select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' 
where x > arrow_cast(2.0, 'Float16');
   +---+
   | x |
   +---+
   +---+
   0 row(s) fetched. 
   ```
   
   ### Expected behavior
   
   The above query should return a single row containing `NaN`.
   
   ### Additional context
   
   The Parquet community is considering changes to allow for `NaN` in 
statistics, with the currently favored approach being adding a new 
`ColumnOrder` to the specification. This will correct the issue above, but 
datafusion will need to check the `ColumnOrder` to know whether or not floating 
point statistics can be trusted.
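
One conservative way to express that check is sketched below. Everything here is hypothetical: `FloatChunkStats` and its `nan_aware_order` flag stand in for whatever the reader would derive from the chunk's `ColumnOrder`, and none of these names come from DataFusion or the Parquet specification.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class FloatChunkStats:
    max_value: Optional[float]
    # Hypothetical flag: True only when the chunk's ColumnOrder guarantees
    # that NaN values were taken into account when the statistics were written.
    nan_aware_order: bool

def can_prune_gt(stats: FloatChunkStats, literal: float) -> bool:
    """Decide whether a chunk may be skipped for the predicate `x > literal`."""
    # Missing or untrusted statistics: keep the chunk and filter row by row.
    if stats.max_value is None or not stats.nan_aware_order:
        return False
    # A NaN max means the chunk may contain rows that satisfy the predicate.
    if math.isnan(stats.max_value):
        return False
    return not (stats.max_value > literal)
```

With this guard, a chunk whose statistics cannot be trusted is never pruned, so the cost is extra scanning rather than missing rows.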
   
   Also note that if/when https://github.com/apache/parquet-format/pull/221 is 
merged, other predicates such as `isnan(x)` might be candidates for pruning, 
but that is an optimization.


