domodwyer opened a new issue, #7869: URL: https://github.com/apache/arrow-datafusion/issues/7869
### Is your feature request related to a problem or challenge?

At query time, our use case requires that we evaluate predicates against in-memory data that may have a schema that is a subset of the table schema. The predicate can reference columns that are not currently in memory or known at query time.

For example, given the following in-memory data:

| col_a | value |
|--|--|
| A | 42 |

We may have to evaluate a predicate such as `col_a != A AND col_b=bananas`, where `col_b` is not present in the in-memory schema / unknown at pruning time, but is a valid column for the table in the system as a whole.

Because at query time we have a limited subset of the schema, the schema and statistics provided when constructing the `PruningPredicate` cover only `col_a, value`. However, the `col_a != A` portion of the predicate can be proven FALSE irrespective of `col_b`.

Unfortunately, constructing the `PruningPredicate` eagerly validates the presence of statistics for all columns in the predicate, and fails with an error stating that there is no field named `col_b` before attempting to evaluate any portion of the predicate.

### Describe the solution you'd like

Attempt to evaluate the predicate based on the available statistics, and return FALSE if possible. If the predicate cannot be proven FALSE, return a "missing column" error as it does today.

For the example above, ideally pruning should return FALSE, as it can be proven that `col_a != A` is FALSE even though `col_b` is unknown at pruning time.

### Describe alternatives you've considered

Inserting NULL statistics into the pruning schema to satisfy the presence check - this works around the issue, but unfortunately requires extra processing to prevent the missing field error.

### Additional context

This change in behaviour might need to be placed behind a flag/option to opt into, rather than being the default.
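
To make the requested semantics concrete, here is a minimal, self-contained sketch of the evaluation order described above. It does not use DataFusion and does not reflect how `PruningPredicate` is implemented; `Outcome`, `MinMax`, `Expr`, `prune` and `issue_example` are hypothetical names made up for illustration, and column values are assumed to be plain strings with exact min/max bounds.

```rust
use std::collections::HashMap;

/// Result of trying to prune one container (hypothetical, not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
enum Outcome {
    /// No row in the container can satisfy the predicate.
    ProvenFalse,
    /// The available statistics cannot rule the container out.
    MayMatch,
    /// A referenced column has no statistics and nothing proved the predicate false.
    MissingColumn(String),
}

/// Min/max statistics for one column (assumed exact, string-valued).
struct MinMax {
    min: String,
    max: String,
}

/// A tiny predicate language: `col = lit`, `col != lit`, and `AND`.
enum Expr {
    Eq(String, String),
    NotEq(String, String),
    And(Box<Expr>, Box<Expr>),
}

fn prune(expr: &Expr, stats: &HashMap<String, MinMax>) -> Outcome {
    match expr {
        Expr::Eq(col, lit) => match stats.get(col) {
            None => Outcome::MissingColumn(col.clone()),
            // `col = lit` is impossible when lit lies outside [min, max].
            Some(s) if lit.as_str() < s.min.as_str() || lit.as_str() > s.max.as_str() => {
                Outcome::ProvenFalse
            }
            Some(_) => Outcome::MayMatch,
        },
        Expr::NotEq(col, lit) => match stats.get(col) {
            None => Outcome::MissingColumn(col.clone()),
            // `col != lit` is impossible when every value equals lit.
            Some(s) if s.min == *lit && s.max == *lit => Outcome::ProvenFalse,
            Some(_) => Outcome::MayMatch,
        },
        Expr::And(left, right) => match (prune(left, stats), prune(right, stats)) {
            // One side proven false is enough, even if the other side is unknown.
            (Outcome::ProvenFalse, _) | (_, Outcome::ProvenFalse) => Outcome::ProvenFalse,
            // Otherwise a missing column is still reported as an error.
            (Outcome::MissingColumn(c), _) | (_, Outcome::MissingColumn(c)) => {
                Outcome::MissingColumn(c)
            }
            _ => Outcome::MayMatch,
        },
    }
}

/// The example from this issue: statistics cover only `col_a`.
fn issue_example() -> Outcome {
    let mut stats = HashMap::new();
    stats.insert(
        "col_a".to_string(),
        MinMax { min: "A".to_string(), max: "A".to_string() },
    );
    let predicate = Expr::And(
        Box::new(Expr::NotEq("col_a".to_string(), "A".to_string())),
        Box::new(Expr::Eq("col_b".to_string(), "bananas".to_string())),
    );
    prune(&predicate, &stats)
}
```

With these assumptions, `issue_example()` yields `Outcome::ProvenFalse`, while a predicate that cannot be proven false and references `col_b` still surfaces `Outcome::MissingColumn`, matching the fallback to today's error.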
