GayathriSrividya opened a new pull request, #3460:
URL: https://github.com/apache/iceberg-python/pull/3460
Closes #3148
## Root cause
`dynamic_partition_overwrite` builds its delete predicate using the
**current** partition spec only:
```python
delete_filter = self._build_partition_predicate(
    partition_records=partitions_to_overwrite,
    spec=self.table_metadata.spec(),  # always current spec
    schema=self.table_metadata.schema(),
)
```
After a partition spec evolution (e.g. adding a `region` field), data files
written under the **older** spec carry `NULL` for the new field — because
`region` simply wasn't part of the schema at write time.
The predicate produced for the new partition `{category=A, region=us}` is:
```
category = 'A' AND region = 'us'
```
The `_StrictMetricsEvaluator` correctly sees that spec-0 files have
`region = NULL` for every row, so `region = 'us'` can never be
`ROWS_MUST_MATCH`. Those files are silently kept, leaving stale data behind.
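To make the failure concrete, here is a minimal toy model of a strict equality check. It is a sketch under stated assumptions, not the actual `_StrictMetricsEvaluator`: `ColumnStats` and `strict_eq` are hypothetical names, and the metrics are reduced to counts and string bounds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical, simplified per-file metrics for one column."""

    value_count: int
    null_count: int
    lower: Optional[str] = None
    upper: Optional[str] = None


def strict_eq(stats: ColumnStats, literal: str) -> bool:
    """Return True only if every row in the file must equal `literal`."""
    if stats.null_count > 0:
        # A NULL row can never satisfy `col = literal`.
        return False
    return stats.lower == literal and stats.upper == literal


# A file written under the old spec: `category` is always 'A',
# while `region` is NULL for every row.
category = ColumnStats(value_count=2, null_count=0, lower="A", upper="A")
region = ColumnStats(value_count=2, null_count=2)

# category = 'A' AND region = 'us'
must_match = strict_eq(category, "A") and strict_eq(region, "us")
print(must_match)
```

In this model the old file passes the `category` clause but fails the `region` clause, so the conjunction is not a guaranteed match and the file would be kept.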
## Fix
Detect which fields in the current spec were absent from at least one
historical spec ("evolved" fields). For those fields, extend the per-field
clause to also accept `NULL`:
```
category = 'A' AND (region = 'us' OR region IS NULL)
```
This causes the metrics evaluator to flag pre-evolution files (all-null
`region`) as `ROWS_MUST_MATCH` and delete them, while correctly preserving
files in other non-overlapping partitions (e.g. `region = 'eu'`).
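The following sketch illustrates the idea in plain Python. It is not the code from this PR: `evolved_fields` and `build_predicate` are hypothetical helpers, specs are modelled as sets of field names rather than source IDs, and the predicate is rendered as a string instead of a pyiceberg expression.

```python
from typing import Dict, List, Set


def evolved_fields(current: List[str], historical: List[Set[str]]) -> Set[str]:
    """Fields of the current spec that at least one historical spec lacks."""
    return {
        name
        for name in current
        if any(name not in spec for spec in historical)
    }


def build_predicate(partition: Dict[str, str], evolved: Set[str]) -> str:
    """Render the delete predicate for one partition as a readable string."""
    clauses = []
    for name, value in partition.items():
        clause = f"{name} = {value!r}"
        if name in evolved:
            # Files written before the field existed hold NULL for it.
            clause = f"({clause} OR {name} IS NULL)"
        clauses.append(clause)
    return " AND ".join(clauses)


# Spec 0 partitions by `category` only; spec 1 adds `region`.
specs = [{"category"}, {"category", "region"}]
evolved = evolved_fields(["category", "region"], specs)
predicate = build_predicate({"category": "A", "region": "us"}, evolved)
print(predicate)
```

Passing an empty `evolved` set to the same helper yields the original, unextended predicate, so tables whose spec never changed are unaffected in this sketch.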
## Verification
Added two unit tests for `_build_partition_predicate` (with / without
`evolved_source_ids`) and manually confirmed the repro from the issue now
produces `[999]` instead of `[1, 2, 999]`.
```
make lint ✓
pytest tests/table/ ✓ (304 passed)
```