etseidl opened a new issue, #15812: URL: https://github.com/apache/datafusion/issues/15812
### Describe the bug

DataFusion is over-aggressive in pruning floating point columns. This was mentioned in https://github.com/apache/datafusion/issues/15742#issuecomment-2815595171 and discussed in detail in https://github.com/apache/parquet-format/pull/221.

The issue appears with predicates of the form `x [gt|lt] literal`. Consider a column chunk consisting of `[1.0, 0.0, -1.0, NaN, -2.0]`. Because `NaN` is left out of the min/max statistics, the max will be 1 and the min -2. A query like `select * from ... where x > 2` will return no rows because no chunk exists where `max > 2`, so the chunk holding the `NaN` is pruned before the row is ever evaluated.

### To Reproduce

```sql
> select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > arrow_cast(2.0, 'Float16');
+---+
| x |
+---+
+---+
0 row(s) fetched.
```

### Expected behavior

The above query should return a single row containing `NaN`.

### Additional context

The Parquet community is considering changes to allow for `NaN` in statistics, with the currently favored approach being to add a new `ColumnOrder` to the specification. This will correct the issue above, but DataFusion will need to check the `ColumnOrder` to know whether or not floating point statistics can be trusted.

Also note that if/when https://github.com/apache/parquet-format/pull/221 is merged, other predicates such as `isnan(x)` might become candidates for pruning, but that is an optimization.
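
For anyone who wants to see the failure mode in isolation, here is a minimal Python sketch of the over-pruning described under "Describe the bug". It is not DataFusion code. All function names are hypothetical, the `stats_trusted` flag merely stands in for whatever `ColumnOrder` check ends up being needed, and the comparison helper assumes an ordering in which `NaN` sorts above every number (which is what the expected behavior above implies; plain IEEE 754 comparison would say `NaN > 2` is false).

```python
import math


def gt_nan_largest(value, literal):
    """Evaluate `value > literal` with NaN treated as larger than any number.

    Assumption: this models the ordering implied by the issue's expected
    result, not Python's own float comparison.
    """
    if math.isnan(value):
        return not math.isnan(literal)
    if math.isnan(literal):
        return False
    return value > literal


def chunk_stats(values):
    """Compute (min, max) while skipping NaN, as the issue describes."""
    non_nan = [v for v in values if not math.isnan(v)]
    return min(non_nan), max(non_nan)


def can_prune_gt(stats_max, literal):
    """Naive rule for `x > literal`: skip the chunk when max is not > literal."""
    return not gt_nan_largest(stats_max, literal)


def can_prune_gt_checked(stats_max, literal, stats_trusted):
    """Hypothetical safer rule: only prune when the statistics account for NaN."""
    return stats_trusted and can_prune_gt(stats_max, literal)


chunk = [1.0, 0.0, -1.0, float("nan"), -2.0]
lo, hi = chunk_stats(chunk)

# Rows that actually satisfy `x > 2` under the NaN-largest ordering.
matching = [v for v in chunk if gt_nan_largest(v, 2.0)]

# The naive rule discards the chunk even though one row matches.
pruned = can_prune_gt(hi, 2.0)

print("min/max:", lo, hi)
print("naive rule prunes chunk:", pruned)
print("rows that match:", len(matching))
print("checked rule prunes chunk:", can_prune_gt_checked(hi, 2.0, stats_trusted=False))
```

The naive rule prunes the chunk although one row matches. The checked variant keeps the chunk whenever the statistics cannot be trusted to reflect `NaN`, at the cost of scanning more data.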