wombatu-kun commented on code in PR #16621:
URL: https://github.com/apache/iceberg/pull/16621#discussion_r3402572190
##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFilters.java:
##########
@@ -29,19 +33,31 @@
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.expressions.UnboundPredicate;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.DecimalUtil;
+import org.apache.iceberg.util.UUIDUtil;
+import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators;
import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
class ParquetFilters {
private ParquetFilters() {}
- static FilterCompat.Filter convert(Schema schema, Expression expr, boolean
caseSensitive) {
+ static FilterCompat.Filter convert(
+ MessageType parquetSchema, Expression expr, boolean caseSensitive) {
+ Schema schema = ParquetSchemaUtil.convert(parquetSchema);
Review Comment:
`convert` visits the expression without first calling
`Expressions.rewriteNot(expr)`, which the sibling row-group filters (metrics,
dictionary, bloom) all do. That used to be harmless, but it no longer is:
`decimalPred` now returns `AlwaysTrue` to mean "I can't push this literal down,
so let every row through" - for example `decimal_col = 12.345` on a
`decimal(9,2)` column, or any literal that overflows the column precision.
The negated form then breaks. `NOT (decimal_col = 12.345)` is true for every
row, so it should match everything, but the visitor turns `not(AlwaysTrue)`
into `AlwaysFalse` ("match nothing"). `AlwaysFalse` is an unfinished
placeholder (the `// TODO: handle AlwaysFalse.INSTANCE` here): at the top level
or under AND it throws when the filter is applied, and under OR the negated
branch is silently dropped so the query returns fewer rows than it should. This
is the same failure #16035 set out to remove, resurfacing for negated
predicates - it stayed hidden before only because decimal predicates used to
throw outright.
Calling `expr = Expressions.rewriteNot(expr)` at the start of `convert`, as
the other filters do, fixes it: `not(eq)` becomes `notEq`, which `decimalPred`
correctly treats as "can't push down, return NOOP, keep all rows". A
negated-literal case next to `skipsDecimalLiteralWithIncompatibleScale` would
guard against regressions.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]