cbb330 opened a new pull request, #16692:
URL: https://github.com/apache/iceberg/pull/16692

   ## What
   
   A column added by schema evolution with an `initial-default` is backfilled with the
   default at read time, but the per-file row-group filters (`ReadConf`) evaluate
   predicates against the physically absent, and therefore all-null, column. For a file
   written before the column existed, a predicate referencing the column (including the
   `IsNotNull` that engines infer for null-intolerant predicates) skipped the row group,
   **silently dropping exactly the rows the default backfills**. The full (unfiltered)
   scan was correct, so the case went untested and unobserved.
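
The failure mode can be illustrated with a toy model. This is only a sketch: the class, method, and parameter names are hypothetical and are not Iceberg's row-group evaluators. It assumes the simplest possible rule, that a column missing from the file is treated as all-null.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of a stats-based row-group check; all names are hypothetical. */
final class AbsentColumnSkipSketch {

  /** Returns true if the row group might contain rows where column = literal. */
  static boolean mightMatchEq(
      Set<String> fileColumns, Map<String, String[]> minMax, String column, String literal) {
    if (!fileColumns.contains(column)) {
      // Column is not physically in the file, so it is treated as all-null:
      // nothing can equal a non-null literal and the row group is skipped.
      // This is the step that ignores the initial default.
      return false;
    }
    String[] bounds = minMax.get(column);
    if (bounds == null) {
      return true; // no stats for the column: the row group must be read
    }
    // Keep the row group only if the literal lies within [min, max].
    return bounds[0].compareTo(literal) <= 0 && literal.compareTo(bounds[1]) <= 0;
  }

  public static void main(String[] args) {
    // File written before column c existed.
    Set<String> oldFile = new HashSet<>(Arrays.asList("id", "name"));
    Map<String, String[]> noStats = Collections.emptyMap();
    // Prints false: the row group is skipped even though c reads back as the default.
    System.out.println(mightMatchEq(oldFile, noStats, "c", "US"));
  }
}
```

In this model the reader would later materialize `c` as the default for every row of the old file, but the check above has already discarded the row group.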
   
   Fixes #16690.
   
   ## Repro (Spark 3.5, Parquet, format-version 3)
   
   ```sql
   CREATE TABLE t (id bigint, name string) USING iceberg
     TBLPROPERTIES ('format-version'='3');
   INSERT INTO t VALUES (1, 'Alice');                          -- written before column c exists
   -- table.updateSchema().addColumn("c", StringType, lit("US")).commit();
   INSERT INTO t VALUES (2, 'Bob', 'US');                      -- c present
   ```
   
   | Query | Before | After |
   |---|---|---|
   | `SELECT id, c` | `(1,US),(2,US)` ✅ | `(1,US),(2,US)` |
   | `WHERE c = 'US'` | `(2)` ❌ | `(1),(2)` ✅ |
   | `WHERE upper(c) = 'US'` | `(2)` ❌ | `(1),(2)` ✅ |
   | `WHERE c IS NOT NULL` | `(2)` ❌ | `(1),(2)` ✅ |
   | `WHERE c = 'CA'` | `()` | `()` ✅ (absent file still excluded) |
   | `WHERE c IS NULL` | `()` | `()` ✅ |
   
   ## Fix
   
   `ParquetFilters.replaceMissingColumnDefaults` folds predicates on initial-default
   columns that are absent from a data file against the column default before the
   row-group filters run, the same way partition predicates are folded out of the
   residual. With a default of `'US'`: `c = 'US'` → `alwaysTrue`, `c = 'CA'` →
   `alwaysFalse`, `IsNotNull(c)` → `alwaysTrue`, `IsNull(c)` → `alwaysFalse`. Predicates
   on columns the file actually contains are returned unchanged, so those files are still
   pruned on the column's real values, and for tables with no initial-default columns the
   rewrite is a no-op.
   
   This is the read-path counterpart to how partition columns are already excluded from
   per-file filtering; it does not change the table spec.
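
The fold can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Pred`, `Op`, and `fold` are hypothetical stand-ins for Iceberg's expression classes, values are strings only, the default is assumed non-null, and compound predicates (conjunctions) are not modeled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy predicate fold; every name here is hypothetical, not Iceberg's API. */
final class MissingDefaultFoldSketch {
  enum Op { EQ, IS_NULL, NOT_NULL, ALWAYS_TRUE, ALWAYS_FALSE }

  static final class Pred {
    final Op op;
    final String column; // null for ALWAYS_TRUE / ALWAYS_FALSE
    final String value;  // only used by EQ

    Pred(Op op, String column, String value) {
      this.op = op;
      this.column = column;
      this.value = value;
    }
  }

  /**
   * Folds a predicate on a column that is absent from the file but has an
   * initial default. Any other predicate is returned unchanged.
   */
  static Pred fold(Pred pred, Set<String> fileColumns, Map<String, String> initialDefaults) {
    if (pred.column == null
        || fileColumns.contains(pred.column)
        || !initialDefaults.containsKey(pred.column)) {
      return pred; // constant, column present in the file, or no default
    }
    String dflt = initialDefaults.get(pred.column); // assumed non-null in this sketch
    boolean result;
    switch (pred.op) {
      case EQ:
        result = dflt.equals(pred.value); // every row reads back as the default
        break;
      case NOT_NULL:
        result = true;
        break;
      case IS_NULL:
        result = false;
        break;
      default:
        return pred;
    }
    return new Pred(result ? Op.ALWAYS_TRUE : Op.ALWAYS_FALSE, null, null);
  }

  public static void main(String[] args) {
    Set<String> oldFile = new HashSet<>(Arrays.asList("id", "name")); // no column c
    Map<String, String> defaults = Collections.singletonMap("c", "US");
    System.out.println(fold(new Pred(Op.EQ, "c", "US"), oldFile, defaults).op);
    System.out.println(fold(new Pred(Op.EQ, "c", "CA"), oldFile, defaults).op);
    System.out.println(fold(new Pred(Op.NOT_NULL, "c", null), oldFile, defaults).op);
    System.out.println(fold(new Pred(Op.IS_NULL, "c", null), oldFile, defaults).op);
  }
}
```

Running `main` prints `ALWAYS_TRUE`, `ALWAYS_FALSE`, `ALWAYS_TRUE`, `ALWAYS_FALSE`, matching the four cases listed above. A predicate on `name`, which the file contains, would come back as the same object.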
   
   ## Testing
   
   - `TestReplaceMissingColumnDefaults`: unit tests for the fold
     (match/non-match/null/present-column/no-default/conjunction).
   - `TestDefaultValuesFilteredRead` (Spark 3.5): end-to-end; the filtered reads above now
     return the backfilled row, and the test fails on `main` without the fix.
   - Full `:iceberg-parquet:test` passes.
   
   ## Format coverage
   
   Parquet is the only format that exhibits this silent drop, and this fixes both of its
   read paths (`ParquetReader` and `VectorizedParquetReader` both go through `ReadConf`):
   
   - **Parquet**: does row-group filtering (stats/dictionary/bloom) before the default is
     injected, so it dropped the backfilled rows. Fixed here.
   - **Avro**: applies no file-level filtering (`AvroFormatModel`'s `filter()` is a no-op);
     the engine applies the residual after the default is injected, so reads are already
     correct.
   - **ORC**: reading an absent initial-default column currently throws
     `UnsupportedOperationException` (`ORCSchemaUtil`), so there is no silent wrong result
     to fix here; ORC default-value reads are a separate, pre-existing gap.
   
   ## Possible follow-up
   
   A further enhancement could make the metrics evaluators default-aware so that they also
   *prune* absent-column files by the default (e.g. skip a file whose absent column is
   all-`'US'` for `c = 'CA'`); this PR keeps such files and reads them correctly, matching
   the behavior for files that contain the column but lack stats.
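
As a rough illustration of that follow-up (not something this PR implements), a default-aware file-level check for an equality predicate might look like the sketch below. The class and method names are hypothetical, and values are modeled as plain strings.

```java
/** Toy default-aware pruning check for `column = literal`; names are hypothetical. */
final class DefaultAwarePruneSketch {

  /** Returns true if the file provably contains no row matching `column = literal`. */
  static boolean canSkipFileForEq(boolean columnInFile, String initialDefault, String literal) {
    if (columnInFile) {
      return false; // defer to the file's real column stats, which are not modeled here
    }
    if (initialDefault == null) {
      return true; // absent and no default: every value is null, so nothing equals the literal
    }
    // Absent with a default: every row reads back as the default value.
    return !initialDefault.equals(literal);
  }

  public static void main(String[] args) {
    System.out.println(canSkipFileForEq(false, "US", "CA")); // true: file can be pruned
    System.out.println(canSkipFileForEq(false, "US", "US")); // false: file must be read
  }
}
```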
   



