alamb opened a new issue, #9171:
URL: https://github.com/apache/arrow-datafusion/issues/9171

   ### Is your feature request related to a problem or challenge?
   
   This is broken out from 
https://github.com/apache/arrow-datafusion/issues/7869 which is describing a 
slightly different problem
   
   
[`PruningPredicate`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html#)
  can't be told about columns that are known to contain only `NULL`. It can be 
told which columns have no nulls (via the 
[`PruningStatistics::null_counts()`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.null_counts)).
 
   
   Columns that contain only NULL occur in tables that have "schema evolution" 
-- for example if you have two files such as
   
   File 1: `col_a`
   File 2: `col_a`, `col_b` (`col_b` was added later)
   
   A predicate like `col_a != A AND col_b='bananas'` cannot be `true` for File 
1 (as `col_b` is logically `NULL` for all rows).
   
   This is subtly but importantly different from the case where *nothing is 
known* about the column, which is (confusingly) encoded by returning NULL from 
[`PruningStatistics::min_values()`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.min_values).
   
   
   ### Describe the solution you'd like
   
   
   
   1. Add a new method `PruningStatistics::row_counts()` to get the total row 
counts in each container. 
   2. Use the information from `PruningStatistics::row_counts()` and 
`PruningStatistics::null_counts()` to determine containers where columns are 
entirely NULL
   3. Rewrite the predicate, replacing references to columns known to be `NULL` 
with a `NULL` literal, and try to simplify the expressions (e.g. `a = 5` --> 
`NULL = 5` --> `NULL`)
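
   Step 2 above could look roughly like the sketch below. It does not use the real DataFusion API: the counts are modelled as plain slices of `Option<u64>` (one entry per container, `None` meaning "unknown"), and `all_null_containers` is a hypothetical helper name. The point is only that a column is treated as entirely NULL when both counts are known and equal, and that an unknown count never proves anything.

```rust
/// Sketch only: flags the containers in which a column is provably all NULL.
///
/// `null_counts[i]` and `row_counts[i]` are stand-ins for the values that
/// `PruningStatistics::null_counts()` and the proposed
/// `PruningStatistics::row_counts()` would report for container `i`.
/// `None` means the statistic is unknown for that container.
fn all_null_containers(
    null_counts: &[Option<u64>],
    row_counts: &[Option<u64>],
) -> Vec<bool> {
    null_counts
        .iter()
        .zip(row_counts.iter())
        .map(|(nulls, rows)| match (nulls, rows) {
            // Both counts known and equal: every row is NULL.
            (Some(n), Some(r)) => n == r,
            // Any unknown count: stay conservative, do not claim all-NULL.
            _ => false,
        })
        .collect()
}
```

   With counts of `[100, 3, unknown]` nulls against `[100, 100, 100]` rows, only the first container would be flagged, so only there could the column be replaced with a `NULL` literal.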
   
   
   For the example in this ticket's description, with predicate `col_a != A AND 
col_b='bananas'`, where `col_b` is absent from the file (and so entirely `NULL`) 
and the relevant container has `100` rows:
   1. The relevant `PruningStatistics` would return `col_b: {null_count = 100, 
row_count = 100}`.
   2. `PruningPredicate::prune` would determine that `col_b` is entirely null, 
and would rewrite the predicate to be `col_a != A AND NULL = 'bananas'`.
   3. The pruning rewrite would happen again. This time it would not try to 
fetch min/max statistics for `col_b`, and the predicate could be proven to be 
not true.
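
   The walk-through above can be illustrated with a toy expression tree. This is a sketch under assumptions, not DataFusion's `Expr` or its simplifier: the `Expr` enum and the functions `replace_all_null`, `simplify` and `may_be_true` are all hypothetical names, and only the comparison and `AND` cases needed for this example are handled.

```rust
/// Toy expression tree (hypothetical, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Null,
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Replaces every reference to a column known to be all NULL with `Expr::Null`.
fn replace_all_null(expr: Expr, all_null: &[&str]) -> Expr {
    let rec = |e: Box<Expr>| Box::new(replace_all_null(*e, all_null));
    match expr {
        Expr::Column(name) if all_null.contains(&name.as_str()) => Expr::Null,
        Expr::Eq(l, r) => Expr::Eq(rec(l), rec(r)),
        Expr::NotEq(l, r) => Expr::NotEq(rec(l), rec(r)),
        Expr::And(l, r) => Expr::And(rec(l), rec(r)),
        other => other,
    }
}

/// Folds comparisons against NULL into NULL (e.g. `NULL = 5` --> `NULL`).
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(l, r) => match (simplify(*l), simplify(*r)) {
            (Expr::Null, _) | (_, Expr::Null) => Expr::Null,
            (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
        },
        Expr::NotEq(l, r) => match (simplify(*l), simplify(*r)) {
            (Expr::Null, _) | (_, Expr::Null) => Expr::Null,
            (l, r) => Expr::NotEq(Box::new(l), Box::new(r)),
        },
        Expr::And(l, r) => Expr::And(Box::new(simplify(*l)), Box::new(simplify(*r))),
        other => other,
    }
}

/// Conservative check: returns false only when the predicate provably cannot
/// evaluate to true (NULL is not true, and `x AND NULL` is never true).
fn may_be_true(expr: &Expr) -> bool {
    match expr {
        Expr::Null => false,
        Expr::And(l, r) => may_be_true(l) && may_be_true(r),
        _ => true,
    }
}
```

   Applied to `col_a != A AND col_b = 'bananas'` with `col_b` marked as all NULL, the rewrite yields `col_a != A AND NULL`, which `may_be_true` rejects without consulting any min/max statistics for `col_b`.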
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
