alamb opened a new issue, #9171: URL: https://github.com/apache/arrow-datafusion/issues/9171
### Is your feature request related to a problem or challenge?

This is broken out from https://github.com/apache/arrow-datafusion/issues/7869, which describes a slightly different problem.

[`PruningPredicate`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html#) can't be told about columns that are known to contain only `NULL`. It can be told which columns have no nulls (via [`PruningStatistics::null_counts()`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.null_counts)).

Columns that contain only `NULL` occur in tables that have "schema evolution". For example, suppose you have two files such as

File 1: `col_a`
File 2: `col_a`, `col_b` (`col_b` was added later)

A predicate like `col_a != A AND col_b='bananas'` cannot be `true` for File 1 (as `col_b` is logically `NULL` for all rows).

This is subtly but importantly different from the case when *nothing is known* about the column, which, confusingly, is encoded by returning `NULL` from [`PruningStatistics::min_values()`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.min_values).

### Describe the solution you'd like

1. Add a new method `PruningStatistics::row_counts()` to get the total row counts in each container.
2. Use the information from `PruningStatistics::row_counts()` and `PruningStatistics::null_counts()` to determine containers where columns are entirely `NULL`.
3. Rewrite the predicate, replacing references to columns known to be `NULL` with a `NULL` literal, and try to simplify the expressions (e.g. `a = 5` --> `NULL = 5` --> `NULL`).

For the example in this ticket's description, with predicate `col_a != A AND col_b='bananas'`, where `col_b` is missing from the relevant container and that container has `100` rows:

1. The relevant `PruningStatistics` would return `col_b: {null_count = 100, row_count = 100}`.
2. `PruningPredicate::prune` would determine that `col_b` was entirely null, and would rewrite the predicate to be `col_a != A AND NULL = 'bananas'`.
3. The pruning rewrite would happen again, and this time it would not try to fetch min/max statistics for `col_b`, so the predicate could be proven to be not true.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
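To make the rewrite proposed under "Describe the solution you'd like" concrete, here is a minimal, self-contained sketch in Rust. It does not use DataFusion's real types: the `Expr` enum, the `ColumnCounts` struct, and the `rewrite` and `never_true` functions are all hypothetical names invented for illustration, and only the `null_count == row_count` test and the `NULL = 'bananas'` --> `NULL` simplification come from the proposal itself.

```rust
use std::collections::HashMap;

/// Toy expression type (hypothetical; NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Null,
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Per-column counts for one container (hypothetical stand-in for what
/// `null_counts()` and the proposed `row_counts()` would report).
struct ColumnCounts {
    null_count: u64,
    row_count: u64,
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(value: &str) -> Expr {
    Expr::Literal(value.to_string())
}

/// A comparison with a NULL operand evaluates to NULL in SQL.
fn compare(l: Expr, r: Expr, build: fn(Box<Expr>, Box<Expr>) -> Expr) -> Expr {
    if l == Expr::Null || r == Expr::Null {
        Expr::Null
    } else {
        build(Box::new(l), Box::new(r))
    }
}

/// Replace columns known to be entirely NULL with a NULL literal,
/// then simplify comparisons against NULL.
fn rewrite(expr: &Expr, stats: &HashMap<String, ColumnCounts>) -> Expr {
    match expr {
        Expr::Column(name) => match stats.get(name) {
            // Every row is NULL when the null count equals the row count.
            Some(c) if c.null_count == c.row_count => Expr::Null,
            // Unknown or partially NULL columns are left untouched.
            _ => expr.clone(),
        },
        Expr::Literal(_) | Expr::Null => expr.clone(),
        Expr::Eq(l, r) => compare(rewrite(l, stats), rewrite(r, stats), Expr::Eq),
        Expr::NotEq(l, r) => compare(rewrite(l, stats), rewrite(r, stats), Expr::NotEq),
        Expr::And(l, r) => Expr::And(
            Box::new(rewrite(l, stats)),
            Box::new(rewrite(r, stats)),
        ),
    }
}

/// True if the expression can be proven to never evaluate to `true`:
/// NULL is never true, and `a AND b` is never true if either side is not.
fn never_true(expr: &Expr) -> bool {
    match expr {
        Expr::Null => true,
        Expr::And(l, r) => never_true(l) || never_true(r),
        _ => false,
    }
}

fn main() {
    // col_a != A AND col_b = 'bananas'
    let predicate = Expr::And(
        Box::new(Expr::NotEq(Box::new(col("col_a")), Box::new(lit("A")))),
        Box::new(Expr::Eq(Box::new(col("col_b")), Box::new(lit("bananas")))),
    );

    // File 1: col_b is NULL in all 100 rows; nothing is recorded for col_a.
    let mut stats = HashMap::new();
    stats.insert(
        "col_b".to_string(),
        ColumnCounts { null_count: 100, row_count: 100 },
    );

    let rewritten = rewrite(&predicate, &stats);
    println!("can prune: {}", never_true(&rewritten));
}
```

With the statistics from the example (`null_count = 100`, `row_count = 100` for `col_b`), this prints `can prune: true`. With an empty statistics map, which models the "nothing is known" case, no column is replaced and the container is kept.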
