[I] Don't error on unknown column when pruning if predicate can be proven false [arrow-datafusion]

via GitHub Thu, 19 Oct 2023 05:45:56 -0700


domodwyer opened a new issue, #7869:
URL: https://github.com/apache/arrow-datafusion/issues/7869


   ### Is your feature request related to a problem or challenge?
   
   At query time, our use case requires that we evaluate predicates against 
in-memory data that may have a schema that is a subset of the table schema. The 
predicate can reference columns that are not currently in memory or known at 
query time.
   
   For example, given the following in-memory data:
   
   | col_a | value |
   |--|--|
   | A | 42 |
   
   We may have to evaluate a predicate such as `col_a != A AND col_b = bananas`, 
where `col_b` is not present in the in-memory schema (it is unknown at pruning 
time) but is a valid column for the table in the system as a whole.
   
   Because at query time we have a limited subset of the schema, the schema and 
statistics provided when constructing the `PruningPredicate` cover only 
`col_a, value`.
   
   However, the `col_a != A` portion of the predicate can be proven FALSE 
irrespective of `col_b`, which makes the whole conjunction FALSE. Unfortunately, 
constructing the `PruningPredicate` eagerly validates the presence of statistics 
for all columns in the predicate, and errors stating that there are no fields 
named `col_b` before attempting to evaluate any portion of the predicate.
   
   ### Describe the solution you'd like
   
   Attempt to evaluate the predicate based on the available statistics, and 
return FALSE if possible. If the predicate cannot be proven FALSE, return a 
"missing column" error as it does today.
   
   For the example above, ideally pruning should return FALSE as it can be 
proven that `col_a != A` is FALSE even though `col_b` is unknown at pruning 
time.
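
The requested behaviour can be sketched with a toy evaluator. This is only an 
illustration under stated assumptions: `ColumnStats`, `Predicate`, `Prune` and 
`evaluate` are hypothetical names, not DataFusion's API, values are plain 
strings, and NULL handling is ignored. The point it shows is that an `AND` 
with one side proven FALSE is FALSE even when the other side references an 
unknown column, and that the "missing column" outcome is only surfaced when 
nothing can be proven.

```rust
use std::collections::HashMap;

/// Min/max statistics for one column (hypothetical type, not DataFusion's).
#[derive(Debug, Clone)]
struct ColumnStats {
    min: String,
    max: String,
}

/// A tiny predicate language used only for this sketch.
#[derive(Debug)]
enum Predicate {
    Eq(String, String),
    NotEq(String, String),
    And(Box<Predicate>, Box<Predicate>),
}

/// Outcome of evaluating a predicate against statistics alone.
#[derive(Debug, PartialEq)]
enum Prune {
    /// No row can match, so the data can be skipped.
    ProvenFalse,
    /// Some row might match.
    MaybeTrue,
    /// Undecidable because the named column has no statistics.
    UnknownColumn(String),
}

fn evaluate(pred: &Predicate, stats: &HashMap<String, ColumnStats>) -> Prune {
    match pred {
        // `col = lit` is impossible when `lit` lies outside [min, max].
        Predicate::Eq(col, lit) => match stats.get(col) {
            None => Prune::UnknownColumn(col.clone()),
            Some(s) if lit < &s.min || lit > &s.max => Prune::ProvenFalse,
            Some(_) => Prune::MaybeTrue,
        },
        // `col != lit` is impossible when every value equals `lit`.
        Predicate::NotEq(col, lit) => match stats.get(col) {
            None => Prune::UnknownColumn(col.clone()),
            Some(s) if &s.min == lit && &s.max == lit => Prune::ProvenFalse,
            Some(_) => Prune::MaybeTrue,
        },
        // A proven-false side wins over an unknown column on the other side.
        Predicate::And(l, r) => match (evaluate(l, stats), evaluate(r, stats)) {
            (Prune::ProvenFalse, _) | (_, Prune::ProvenFalse) => Prune::ProvenFalse,
            (Prune::UnknownColumn(c), _) | (_, Prune::UnknownColumn(c)) => {
                Prune::UnknownColumn(c)
            }
            _ => Prune::MaybeTrue,
        },
    }
}
```

With statistics for `col_a` only (min and max both `A`), evaluating 
`col_a != A AND col_b = bananas` in this sketch yields `ProvenFalse`, while 
`col_b = bananas` on its own yields `UnknownColumn("col_b")`, which a caller 
could turn into the same "missing column" error that is raised today.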
   
   ### Describe alternatives you've considered
   
   Inserting NULL statistics into the pruning schema to satisfy the presence 
check - this works around the issue, but unfortunately requires extra 
processing to prevent the missing field error.
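
As a rough picture of that extra processing, the workaround amounts to padding 
the statistics with an all-NULL entry for every referenced column the in-memory 
schema lacks. The sketch below is an assumption-laden illustration: 
`NullableStats` and `pad_missing_stats` are hypothetical names, and `None` 
stands in for a NULL statistic. It is not how DataFusion represents statistics.

```rust
use std::collections::{BTreeSet, HashMap};

/// Per-column min/max where `None` models a NULL (unknown) statistic.
/// Hypothetical type used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct NullableStats {
    min: Option<String>,
    max: Option<String>,
}

/// Insert an all-NULL entry for each referenced column that has no
/// statistics yet; existing entries are left untouched.
fn pad_missing_stats(
    referenced: &BTreeSet<String>,
    stats: &mut HashMap<String, NullableStats>,
) {
    for col in referenced {
        stats
            .entry(col.clone())
            .or_insert(NullableStats { min: None, max: None });
    }
}
```

The cost is that the set of referenced columns has to be collected from every 
predicate before pruning, purely to satisfy the presence check.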
   
   ### Additional context
   
   This change in behaviour might need to sit behind an opt-in flag/option 
rather than being the default.

