alamb commented on issue #7869: URL: https://github.com/apache/arrow-datafusion/issues/7869#issuecomment-1872530149
Here is an idea on how to extend [PruningPredicate](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html#) to handle this case

## Problem:

`PruningPredicate` can't be told about columns that are known to contain only `NULL`. It can be told which columns have no nulls (via [`PruningStatistics::null_counts()`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.null_counts)).

I think we could teach `PruningPredicate` about all-null columns like this:

1. Add a new method `PruningStatistics::row_counts()` to get the total row count in each container.
2. Use the information from `PruningStatistics::row_counts()` and `PruningStatistics::null_counts()` to determine containers where columns are entirely `NULL`.
3. Rewrite the predicate, replacing references to columns known to be `NULL` with a `NULL` literal, and try to simplify the expressions (e.g. `a = 5` --> `NULL = 5` --> `NULL`).

For the example in this ticket's description with predicate `col_a != A AND col_b='bananas'` where `col_b` is not known and the relevant container had `100` rows:

1. The relevant `PruningStatistics` would return `col_b: {null_count = 100, row_count = 100}`.
2. `PruningPredicate::prune` would determine `col_b` was entirely null, and would rewrite the predicate to be `col_a != A AND NULL = 'bananas'`.
3. The pruning rewrite would happen again, and this time it would not try to fetch min/max statistics for `col_b`, so the predicate could be proven to be not true.
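To make the proposed rewrite concrete, here is a minimal, self-contained sketch of the three steps. It does not use DataFusion's real types: `Expr`, `all_null`, `rewrite`, `can_be_true`, and `example_predicate` are all hypothetical names invented for illustration, and `row_counts()` itself is only the method proposed above. The simplification assumes standard SQL semantics, where comparing anything with `NULL` yields `NULL`.

```rust
/// Hypothetical, heavily simplified expression type (NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Null,
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Step 2: a column is entirely NULL in a container when both counts
/// are known and equal. Unknown statistics (`None`) prove nothing.
fn all_null(null_count: Option<u64>, row_count: Option<u64>) -> bool {
    matches!((null_count, row_count), (Some(n), Some(r)) if n == r)
}

/// Step 3: replace references to all-NULL columns with a NULL literal and
/// simplify comparisons against NULL (e.g. `NULL = 'bananas'` --> `NULL`).
fn rewrite(expr: &Expr, null_cols: &[&str]) -> Expr {
    match expr {
        Expr::Column(name) if null_cols.contains(&name.as_str()) => Expr::Null,
        Expr::Eq(l, r) | Expr::NotEq(l, r) => {
            let (l, r) = (rewrite(l, null_cols), rewrite(r, null_cols));
            if l == Expr::Null || r == Expr::Null {
                Expr::Null
            } else if matches!(expr, Expr::Eq(..)) {
                Expr::Eq(Box::new(l), Box::new(r))
            } else {
                Expr::NotEq(Box::new(l), Box::new(r))
            }
        }
        Expr::And(l, r) => Expr::And(
            Box::new(rewrite(l, null_cols)),
            Box::new(rewrite(r, null_cols)),
        ),
        other => other.clone(),
    }
}

/// A NULL conjunct means the whole AND can never evaluate to true, so the
/// container can be skipped. Anything else is conservatively kept, since
/// this sketch does not model min/max statistics.
fn can_be_true(expr: &Expr) -> bool {
    match expr {
        Expr::Null => false,
        Expr::And(l, r) => can_be_true(l) && can_be_true(r),
        _ => true,
    }
}

/// The predicate from the ticket: `col_a != A AND col_b = 'bananas'`.
fn example_predicate() -> Expr {
    Expr::And(
        Box::new(Expr::NotEq(
            Box::new(Expr::Column("col_a".into())),
            Box::new(Expr::Literal("A".into())),
        )),
        Box::new(Expr::Eq(
            Box::new(Expr::Column("col_b".into())),
            Box::new(Expr::Literal("bananas".into())),
        )),
    )
}
```

With `col_b: {null_count = 100, row_count = 100}`, `all_null(Some(100), Some(100))` holds, `rewrite(&example_predicate(), &["col_b"])` turns the right-hand conjunct into `NULL`, and `can_be_true` on the result returns `false`. Note that the rewritten predicate is `col_a != A AND NULL` rather than a bare `NULL`: it may still evaluate to `false` or `NULL`, but never to `true`, which is all pruning needs.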
