westonpace commented on issue #36283:
URL: https://github.com/apache/arrow/issues/36283#issuecomment-1613227675
This is expected but could be improved. Parquet predicate pushdown works
like so:
* Extract a row group guarantee from parquet statistics (e.g. `30 < x < 70
&& 0 < y < 100`)
* Call `SimplifyWithGuarantee` on the filter, given the above guarantee
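* For example, given that guarantee, a filter `x == 100 && z < 20` would
simplify to `false`, because `x == 100` contradicts `30 < x < 70`, so the
row group can be skipped.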
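The steps above can be sketched in plain Python. This is only an
illustration of the idea, not Arrow's C++ implementation: the names
`Interval`, `simplify_equals`, `simplify_less` and `simplify_and` are
hypothetical, and `None` stands for "cannot be simplified".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Open interval lo < value < hi, standing in for a row group guarantee."""
    lo: float
    hi: float


def simplify_equals(field, value, guarantee):
    """Simplify `field == value`; False if it contradicts the guarantee."""
    bounds = guarantee.get(field)
    if bounds is None:
        return None  # no statistics for this field, leave it alone
    if value <= bounds.lo or value >= bounds.hi:
        return False  # value lies outside the guaranteed range
    return None


def simplify_less(field, value, guarantee):
    """Simplify `field < value` against the guarantee."""
    bounds = guarantee.get(field)
    if bounds is None:
        return None
    if bounds.hi <= value:
        return True  # every row has field < hi <= value
    if bounds.lo >= value:
        return False  # every row has field > lo >= value
    return None


def simplify_and(parts):
    """Combine simplified conjuncts: one False makes the whole filter False."""
    if any(p is False for p in parts):
        return False
    if all(p is True for p in parts):
        return True
    return None


# Guarantee from the example: 30 < x < 70 && 0 < y < 100
guarantee = {"x": Interval(30, 70), "y": Interval(0, 100)}

# Filter from the example: x == 100 && z < 20
result = simplify_and([
    simplify_equals("x", 100, guarantee),
    simplify_less("z", 20, guarantee),
])
```

Here `z < 20` stays unknown because the guarantee says nothing about `z`,
but the contradiction on `x` is enough to make the whole conjunction false.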
The `SimplifyWithGuarantee` method does not understand `isin`. It could be
improved to do so if someone were interested. The place to make the change
would be here I think:
https://github.com/apache/arrow/blob/apache-arrow-12.0.1/cpp/src/arrow/compute/expression.cc#L1230
First we "extract known values" (places in the guarantee where we have
something like `x == 7`). This step usually doesn't apply here because
equality guarantees come from partitioning and not from parquet statistics.
Second, we consider inequalities in the guarantee. This is the part that is
critical for parquet predicate pushdown. We then call
`Inequality::Simplify`, which looks for places in the filter that are:
* calls to is_valid or is_null (these might be simplified by an inequality)
* comparisons (these might also be simplified by an inequality)
I think the point you are making is that `isin` is another function that may
be simplified by an inequality. If we know that `x > 100` and the filter is
`isin(x, [0, 7, 12])` then we can simplify this to `literal(false)`.
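
A rough sketch of what such an `isin` simplification could look like, again
in plain Python rather than Arrow's C++. The function `simplify_isin` and
its parameters are hypothetical, and the bounds are assumed to be strict
(`lower < x < upper`).

```python
def simplify_isin(value_set, lower=None, upper=None):
    """Simplify `isin(x, value_set)` given strict bounds lower < x < upper.

    Returns False when no value can match, otherwise the values that remain
    possible (a smaller value set for the same `isin` call).
    """
    remaining = [
        v for v in value_set
        if (lower is None or v > lower) and (upper is None or v < upper)
    ]
    if not remaining:
        return False  # the guarantee rules out every value in the set
    return remaining


# Guarantee x > 100 with filter isin(x, [0, 7, 12]): nothing can match.
always_false = simplify_isin([0, 7, 12], lower=100)

# Guarantee x > 5: only 7 and 12 remain possible.
narrowed = simplify_isin([0, 7, 12], lower=5)
```

Narrowing the value set when only some values are ruled out is an optional
extra here; the case that matters for skipping row groups is the one that
simplifies to false.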