cbb330 opened a new pull request, #16692:
URL: https://github.com/apache/iceberg/pull/16692
## What
A column added by schema evolution with an `initial-default` is backfilled with the default at read
time, but the per-file row-group filters (`ReadConf`) evaluate predicates against the physically
absent, and therefore all-null, column. For a file written before the column existed, a predicate
referencing the column (including the `IsNotNull` that engines infer for null-intolerant predicates)
skipped the row group, **silently dropping exactly the rows the default backfills**. Unfiltered full
scans were correct, so the case was untested and unobserved.
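
The failure mode can be illustrated with a minimal, self-contained sketch. It does not use Iceberg's classes: `RowGroupSkipSketch`, `Op`, and `mightMatch` are hypothetical names that only model a row-group check treating a column missing from the file as all-null, as described above.

```java
import java.util.Set;

/** Hypothetical model of a stats-based row-group check; not Iceberg's actual evaluator. */
public class RowGroupSkipSketch {
  enum Op { EQ, NOT_NULL, IS_NULL }

  /**
   * Returns true if the row group might contain matching rows.
   * A column absent from the file is treated as all-null, which is the
   * pre-fix behavior this PR describes.
   */
  static boolean mightMatch(Set<String> fileColumns, String column, Op op) {
    if (!fileColumns.contains(column)) {
      // All-null column: only IS NULL can match; EQ and NOT NULL cannot.
      return op == Op.IS_NULL;
    }
    // Column is present: a real evaluator would consult stats, dictionaries,
    // or bloom filters. This sketch conservatively keeps the row group.
    return true;
  }

  public static void main(String[] args) {
    Set<String> oldFile = Set.of("id", "name");      // written before c existed
    Set<String> newFile = Set.of("id", "name", "c"); // written after c was added

    System.out.println(mightMatch(oldFile, "c", Op.EQ));       // false: row group skipped
    System.out.println(mightMatch(oldFile, "c", Op.NOT_NULL)); // false: row group skipped
    System.out.println(mightMatch(newFile, "c", Op.EQ));       // true: row group kept
  }
}
```

In this model the old file's row group is skipped for `c = 'US'` even though every row in it would read back as `'US'` once the default is injected.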
Fixes #16690.
## Repro (Spark 3.5, Parquet, format-version 3)
```sql
CREATE TABLE t (id bigint, name string) USING iceberg
TBLPROPERTIES ('format-version'='3');
INSERT INTO t VALUES (1, 'Alice');     -- written before column c exists
-- table.updateSchema().addColumn("c", StringType, lit("US")).commit();
INSERT INTO t VALUES (2, 'Bob', 'US'); -- c present
```
| Query | Before | After |
|---|---|---|
| `SELECT id, c` | `(1,US),(2,US)` ✅ | `(1,US),(2,US)` |
| `WHERE c = 'US'` | `(2)` ❌ | `(1),(2)` ✅ |
| `WHERE upper(c) = 'US'` | `(2)` ❌ | `(1),(2)` ✅ |
| `WHERE c IS NOT NULL` | `(2)` ❌ | `(1),(2)` ✅ |
| `WHERE c = 'CA'` | `()` | `()` ✅ (absent file still excluded) |
| `WHERE c IS NULL` | `()` | `()` ✅ |
## Fix
`ParquetFilters.replaceMissingColumnDefaults` folds predicates on initial-default columns that are
absent from a data file against the column default before the row-group filters run, the same way
partition predicates are folded out of the residual. `c = 'US'` → `alwaysTrue`, `c = 'CA'` →
`alwaysFalse`, `IsNotNull(c)` → `alwaysTrue`, `IsNull(c)` → `alwaysFalse`. Predicates on columns the
file actually contains are returned unchanged, so those files are still pruned on the column's real
values, and for tables with no initial-default columns the rewrite is a no-op.
This is the read-path counterpart to how partition columns are already excluded from per-file
filtering; it does not change the table spec.
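
The folding rules above can be sketched in plain Java. This is an illustrative model under stated assumptions, not the code in this PR: `FoldDefaultsSketch`, `Op`, `Result`, and `fold` are hypothetical names, predicates are reduced to a single column reference with an optional literal, and initial defaults are assumed to be non-null. The real change operates on Iceberg expressions inside `ParquetFilters`.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical model of folding predicates on absent initial-default columns. */
public class FoldDefaultsSketch {
  enum Op { EQ, NOT_NULL, IS_NULL }
  enum Result { ALWAYS_TRUE, ALWAYS_FALSE, UNCHANGED }

  /**
   * Folds a single-column predicate against the column's initial default
   * when the column is absent from the file. Assumes defaults are non-null.
   */
  static Result fold(Set<String> fileColumns, Map<String, Object> initialDefaults,
                     String column, Op op, Object literal) {
    // Column is physically present, or has no initial default: leave the
    // predicate alone so normal row-group pruning still applies.
    if (fileColumns.contains(column) || !initialDefaults.containsKey(column)) {
      return Result.UNCHANGED;
    }
    Object columnDefault = initialDefaults.get(column);
    boolean matches;
    switch (op) {
      case EQ:
        matches = Objects.equals(columnDefault, literal);
        break;
      case NOT_NULL:
        matches = columnDefault != null;
        break;
      case IS_NULL:
        matches = columnDefault == null;
        break;
      default:
        throw new IllegalArgumentException("Unsupported op: " + op);
    }
    return matches ? Result.ALWAYS_TRUE : Result.ALWAYS_FALSE;
  }

  public static void main(String[] args) {
    Set<String> oldFile = Set.of("id", "name");      // written before c existed
    Set<String> newFile = Set.of("id", "name", "c"); // contains c
    Map<String, Object> defaults = Map.of("c", "US");

    System.out.println(fold(oldFile, defaults, "c", Op.EQ, "US"));       // ALWAYS_TRUE
    System.out.println(fold(oldFile, defaults, "c", Op.EQ, "CA"));       // ALWAYS_FALSE
    System.out.println(fold(oldFile, defaults, "c", Op.NOT_NULL, null)); // ALWAYS_TRUE
    System.out.println(fold(oldFile, defaults, "c", Op.IS_NULL, null));  // ALWAYS_FALSE
    System.out.println(fold(newFile, defaults, "c", Op.EQ, "US"));       // UNCHANGED
    System.out.println(fold(oldFile, Map.of(), "c", Op.EQ, "US"));       // UNCHANGED
  }
}
```

The last two calls correspond to the present-column and no-default cases, where the predicate is passed through untouched.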
## Testing
- `TestReplaceMissingColumnDefaults`: unit tests for the fold (match, non-match, null,
  present column, no default, conjunction).
- `TestDefaultValuesFilteredRead` (Spark 3.5): end-to-end test; the filtered reads above now return
  the backfilled row. It fails on `main` without the fix.
- Full `:iceberg-parquet:test` passes.
## Format coverage
Parquet is the only format that exhibits this silent drop, and this PR fixes both of its read paths
(`ParquetReader` and `VectorizedParquetReader` both go through `ReadConf`):
- **Parquet**: does row-group filtering (stats/dictionary/bloom) before the default is injected, so
  it dropped the backfilled rows. Fixed here.
- **Avro**: applies no file-level filtering (`AvroFormatModel`'s `filter()` is a no-op); the engine
  applies the residual after the default is injected, so reads are already correct.
- **ORC**: reading an absent initial-default column currently throws
  `UnsupportedOperationException` (`ORCSchemaUtil`), so there is no silent wrong result to fix here;
  ORC default-value reads are a separate, pre-existing gap.
## Possible follow-up
A further enhancement could make the metrics evaluators default-aware so that they also *prune*
absent-column files by the default (e.g. skip a file whose absent column is all-`'US'` for
`c = 'CA'`). This PR keeps such files and reads them correctly, matching the behavior for files
that contain the column but lack stats.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]