cbb330 commented on code in PR #16692:
URL: https://github.com/apache/iceberg/pull/16692#discussion_r3369510790
##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFilters.java:
##########
@@ -51,6 +56,137 @@ static FilterCompat.Filter convert(Schema schema,
Expression expr, boolean caseS
}
}
+ /**
+ * Folds predicates on initial-default columns that are absent from a data
file against the column
+ * default, instead of letting them be applied to the (physically missing,
hence null) column.
+ *
+ * <p>A column added by schema evolution with an {@code initial-default} is
backfilled with the
+ * default at read time, but record-level filtering runs <em>before</em>
that injection. For a
+ * file written before the column existed the record filter would see the
column as null and drop
+ * every row — silently removing exactly the rows the default backfills
(including via the {@code
+ * IsNotNull} that engines infer for null-intolerant predicates). This
evaluates such predicates
+ * against the default value and folds them to {@code alwaysTrue}/{@code
alwaysFalse}, the same
+ * way partition predicates are folded out of the residual. Predicates on
columns the file
+ * actually contains are returned unchanged so that normal record, stats,
dictionary, and bloom
+ * filtering still applies (and still prunes those files on the column's
real values).
+ *
+ * @param expr a residual filter expression
+ * @param expectedSchema the table read schema, whose fields carry
initial-default values
+ * @param fileColumnIds the field ids physically present in the data file
being read
+ * @param caseSensitive whether column resolution is case sensitive
+ * @return the filter with absent initial-default columns folded to their
default value
+ */
+ static Expression replaceMissingColumnDefaults(
Review Comment:
Thanks @pvary for the nudge which moved this. The standalone helper here in
`ParquetFilters` is gone and the fix now lives in
`ParquetMetricsRowGroupFilter`, which is where engine reads actually drop the
rows.
The problem being solved is that a column missing from the file is
represented as all-null and skips the row group for col = <default> and the
inferred IsNotNull(col). So since that filter already gets the schema carrying
initialDefault, the solution can just evaluate the predicate against the
default instead of assuming null.
You explicitly mentioned `convert`, which builds the parquet-mr record
filter, but only the deprecated readSupport path reaches this. no engine read
goes through it. so it isn't where the drop happens. so happy to give it the
same treatment if you'd like that path covered too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]