Re: [PR] Parquet: Fix initial-default rows dropped when filtering on the defaulted column [iceberg]

via GitHub Sun, 07 Jun 2026 07:23:41 -0700


cbb330 commented on code in PR #16692:
URL: https://github.com/apache/iceberg/pull/16692#discussion_r3369510790



##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFilters.java:
##########
@@ -51,6 +56,137 @@ static FilterCompat.Filter convert(Schema schema, Expression expr, boolean caseS
     }
   }
 
+  /**
+   * Folds predicates on initial-default columns that are absent from a data file against the column
+   * default, instead of letting them be applied to the (physically missing, hence null) column.
+   *
+   * <p>A column added by schema evolution with an {@code initial-default} is backfilled with the
+   * default at read time, but record-level filtering runs <em>before</em> that injection. For a
+   * file written before the column existed the record filter would see the column as null and drop
+   * every row — silently removing exactly the rows the default backfills (including via the {@code
+   * IsNotNull} that engines infer for null-intolerant predicates). This evaluates such predicates
+   * against the default value and folds them to {@code alwaysTrue}/{@code alwaysFalse}, the same
+   * way partition predicates are folded out of the residual. Predicates on columns the file
+   * actually contains are returned unchanged so that normal record, stats, dictionary, and bloom
+   * filtering still applies (and still prunes those files on the column's real values).
+   *
+   * @param expr a residual filter expression
+   * @param expectedSchema the table read schema, whose fields carry initial-default values
+   * @param fileColumnIds the field ids physically present in the data file being read
+   * @param caseSensitive whether column resolution is case sensitive
+   * @return the filter with absent initial-default columns folded to their default value
+   */
+  static Expression replaceMissingColumnDefaults(

Review Comment:
   Thanks @pvary for the nudge, which prompted this move. The standalone helper here in `ParquetFilters` is gone, and the fix now lives in `ParquetMetricsRowGroupFilter`, which is where engine reads actually drop the rows.
   
   The problem is that the row-group filter treats a column missing from the file as all-null, so it skips the row group for `col = <default>` and for the inferred `IsNotNull(col)`. Since that filter already receives the schema carrying `initialDefault`, it can simply evaluate the predicate against the default instead of assuming null.
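
   To illustrate the idea only, here is a minimal, self-contained sketch. It is not the actual `ParquetMetricsRowGroupFilter` code: the class, the `Op` enum, and `mightMatchMissingColumn` are hypothetical names, and only the "column absent from the file" case is modelled. A `null` default stands for a column with no `initialDefault`.

```java
/** Hypothetical sketch; not Iceberg's Expression or row-group filter API. */
class MissingColumnDefaultSketch {

  /** Simplified predicate kinds on a single column. */
  enum Op { EQ, NOT_NULL, IS_NULL }

  /**
   * Decides whether a row group may contain matching rows when the column
   * is physically absent from the file.
   *
   * @param op the predicate kind
   * @param literal the literal compared against (used only by EQ)
   * @param initialDefault the column's default, or null if it has none
   * @return true if the row group must be read, false if it can be skipped
   */
  static boolean mightMatchMissingColumn(Op op, Object literal, Object initialDefault) {
    // Every row of an absent column reads back as the default, so the
    // predicate is evaluated once against that value. With no default
    // the column really is all-null.
    switch (op) {
      case EQ:
        return initialDefault != null && initialDefault.equals(literal);
      case NOT_NULL:
        return initialDefault != null;
      case IS_NULL:
        return initialDefault == null;
      default:
        return true; // unknown predicate: be conservative and read
    }
  }

  public static void main(String[] args) {
    // Column added later with a default of 5 (assumed value for the example).
    System.out.println(mightMatchMissingColumn(Op.EQ, 5, 5));
    System.out.println(mightMatchMissingColumn(Op.NOT_NULL, null, 5));
    // Same predicate on a column that has no default.
    System.out.println(mightMatchMissingColumn(Op.NOT_NULL, null, null));
  }
}
```

   With a default present, the equality and not-null checks keep the row group; without one, the absent column is still treated as all-null and the group can be skipped.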
   
   
   You explicitly mentioned `convert`, which builds the parquet-mr record filter, but only the deprecated readSupport path reaches it. No engine read goes through it, so it isn't where the drop happens. Happy to give it the same treatment if you'd like that path covered too.



