alamb commented on issue #7869:
URL: 
https://github.com/apache/arrow-datafusion/issues/7869#issuecomment-1872530149

   Here is an idea on how to extend  
[PruningPredicate](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html#)
 to handle this case
   
   ## Problem: 
   `PruningPredicate` can't be told about columns that are known to contain 
only `NULL`. It can be told which columns have no nulls (via the 
[`PruningStatistics::null_counts()`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.null_counts)).
 
   
   I think we could teach `PruningPredicate` about all-null columns like this:
   
   1. Add a new method `PruningStatistics::row_counts()` to get the total row 
counts in each container. 
   2. Use the information from `PruningStatistics::row_counts()` and 
`PruningStatistics::null_counts()` to determine containers where columns are 
entirely NULL
   3. Rewrite the predicate, replacing references to columns known to be `NULL` 
with a `NULL` literal, and try to simplify the expressions (e.g. `a = 5` --> 
`NULL = 5` --> `NULL`)
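
As a rough illustration of step 2, here is a minimal sketch of how the two statistics could be combined for one column. The helper name `all_null_containers` is hypothetical, and plain slices of `Option<u64>` stand in for the Arrow arrays that the real `PruningStatistics` methods return, so this is not DataFusion's actual API.

```rust
/// Hypothetical helper (not part of DataFusion): for a single column, decide
/// per container whether the column is known to be entirely NULL.
///
/// `null_counts[i]` and `row_counts[i]` describe container `i`; `None` means
/// the statistic is unavailable for that container.
fn all_null_containers(
    null_counts: &[Option<u64>],
    row_counts: &[Option<u64>],
) -> Vec<bool> {
    null_counts
        .iter()
        .zip(row_counts.iter())
        .map(|(nulls, rows)| match (nulls, rows) {
            // Only claim "all NULL" when both statistics are known and agree.
            // Empty containers are conservatively reported as `false` here;
            // that is a choice made for this sketch.
            (Some(n), Some(r)) => *r > 0 && n == r,
            // Missing statistics: we cannot prove anything.
            _ => false,
        })
        .collect()
}
```

With null counts `[100, 3, unknown]` and row counts `[100, 100, 100]`, only the first container would be flagged as all NULL. Missing statistics are treated as "not known to be all NULL" so that pruning stays conservative.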
   
   
   For the example in this ticket's description with predicate `col_a != A AND 
col_b='bananas'` where `col_b` is not known and the relevant container had 
`100` rows, 
   1.  the relevant `PruningStatistics` would return `col_b: {null_count = 100, 
row_count = 100}`
   2. `PruningPredicate::prune` would determine `col_b` was entirely null, and 
would rewrite the predicate to be `col_a != A AND NULL = 'bananas'`.
   3. The pruning rewrite would then run again on the rewritten predicate. This 
time it would not try to fetch min/max statistics for `col_b`, so the predicate 
could be proven to never be true for that container, and the container could 
be pruned.
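
The walkthrough above can be sketched with a toy expression type. Everything here (`Expr`, `replace_null_columns`, `simplify`, `never_true`) is a hypothetical stand-in written for illustration. It is not DataFusion's `Expr` or its simplifier, and it only covers the operators used in the example.

```rust
/// Toy expression tree, just large enough for the example predicate.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Null,
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Replace references to columns listed in `all_null` with a NULL literal.
fn replace_null_columns(expr: &Expr, all_null: &[&str]) -> Expr {
    let rec = |e: &Expr| Box::new(replace_null_columns(e, all_null));
    match expr {
        Expr::Column(name) if all_null.contains(&name.as_str()) => Expr::Null,
        Expr::Eq(l, r) => Expr::Eq(rec(l), rec(r)),
        Expr::NotEq(l, r) => Expr::NotEq(rec(l), rec(r)),
        Expr::And(l, r) => Expr::And(rec(l), rec(r)),
        other => other.clone(),
    }
}

/// Fold comparisons that have a NULL operand into NULL (`NULL = 5` --> `NULL`).
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(l, r) => {
            let (l, r) = (simplify(*l), simplify(*r));
            if l == Expr::Null || r == Expr::Null {
                Expr::Null
            } else {
                Expr::Eq(Box::new(l), Box::new(r))
            }
        }
        Expr::NotEq(l, r) => {
            let (l, r) = (simplify(*l), simplify(*r));
            if l == Expr::Null || r == Expr::Null {
                Expr::Null
            } else {
                Expr::NotEq(Box::new(l), Box::new(r))
            }
        }
        Expr::And(l, r) => Expr::And(Box::new(simplify(*l)), Box::new(simplify(*r))),
        other => other,
    }
}

/// True when the expression can never evaluate to TRUE: it is NULL, or it is
/// an AND with at least one side that can never be TRUE.
fn never_true(expr: &Expr) -> bool {
    match expr {
        Expr::Null => true,
        Expr::And(l, r) => never_true(l) || never_true(r),
        _ => false,
    }
}
```

Applied to `col_a != A AND col_b = 'bananas'` with `col_b` marked as all NULL, the rewrite yields `col_a != A AND NULL`. Under SQL three-valued logic that conjunction is either `NULL` or `FALSE`, never `TRUE`, which is what `never_true` checks. Note that `x AND NULL` is deliberately not folded to `NULL`, since `FALSE AND NULL` is `FALSE`.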
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
