GayathriSrividya opened a new pull request, #3460:
URL: https://github.com/apache/iceberg-python/pull/3460

   Closes #3148
   
   ## Root cause
   
   `dynamic_partition_overwrite` builds its delete predicate using the 
**current** partition spec only:
   
   ```python
   delete_filter = self._build_partition_predicate(
       partition_records=partitions_to_overwrite,
       spec=self.table_metadata.spec(),   # always current spec
       schema=self.table_metadata.schema(),
   )
   ```
   
   After a partition spec evolution (e.g. adding a `region` field), data files 
written under the **older** spec carry `NULL` for the new field, because 
`region` was not part of the schema when they were written.
   
   The predicate produced for the new partition `{category=A, region=us}` is:
   
   ```
   category = 'A' AND region = 'us'
   ```
   
   The `_StrictMetricsEvaluator` correctly sees that `region` is `NULL` for 
every row in spec-0 files, so `region = 'us'` can never evaluate to 
`ROWS_MUST_MATCH`. Those files are silently kept, leaving stale data behind.
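
To make the failure mode concrete, here is a minimal, self-contained sketch of strict evaluation over per-file column metrics. It is a toy model, not pyiceberg code: `ColumnStats`, `strict_eq`, and the sample numbers are hypothetical names and values chosen for illustration, and the rule shown (an equality can only be a guaranteed match when the column has no nulls and its lower and upper bounds both equal the literal) is an assumption about how a strict evaluator behaves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical per-file column metrics (not a pyiceberg class)."""
    value_count: int
    null_count: int
    lower: Optional[str] = None
    upper: Optional[str] = None


def strict_eq(stats: ColumnStats, literal: str) -> bool:
    """Return True only if every row in the file is guaranteed to equal `literal`."""
    if stats.null_count > 0:
        # A NULL row can never satisfy `col = literal`.
        return False
    return stats.lower == literal and stats.upper == literal


# A file written before `region` existed: category is always 'A', region is all NULL.
old_file = {
    "category": ColumnStats(value_count=2, null_count=0, lower="A", upper="A"),
    "region": ColumnStats(value_count=2, null_count=2),
}

# category = 'A' AND region = 'us'
must_match = strict_eq(old_file["category"], "A") and strict_eq(old_file["region"], "us")
print(must_match)  # False: the file is not selected for deletion
```

Under these assumptions the `category` clause is a guaranteed match, but the `region` clause is not, so the conjunction fails and the old file survives the overwrite.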
   
   ## Fix
   
   Detect which fields in the current spec were absent from at least one 
historical spec ("evolved" fields). For those fields, extend the per-field 
clause to also accept `NULL`:
   
   ```
   category = 'A' AND (region = 'us' OR region IS NULL)
   ```
   
   This causes the metrics evaluator to flag pre-evolution files (all-null 
`region`) as `ROWS_MUST_MATCH` and delete them, while correctly preserving 
files in other non-overlapping partitions (e.g. `region = 'eu'`).
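
The detection and widening steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names `evolved_fields` and `build_predicate` are hypothetical, specs are modeled as lists of field names keyed by spec ID (the PR itself works with source IDs via `evolved_source_ids`), and the predicate is rendered as a string rather than as an expression tree.

```python
from typing import Dict, List, Set


def evolved_fields(specs: Dict[int, List[str]], current_spec_id: int) -> Set[str]:
    """Fields of the current spec that are missing from at least one historical spec."""
    current = specs[current_spec_id]
    return {
        name
        for name in current
        if any(name not in fields for fields in specs.values())
    }


def build_predicate(partition: Dict[str, str], evolved: Set[str]) -> str:
    """Build one clause per partition field, also accepting NULL for evolved fields."""
    clauses = []
    for name, value in partition.items():
        clause = f"{name} = '{value}'"
        if name in evolved:
            clause = f"({clause} OR {name} IS NULL)"
        clauses.append(clause)
    return " AND ".join(clauses)


# Spec 0 partitions by category only; spec 1 (current) adds region.
specs = {0: ["category"], 1: ["category", "region"]}
evolved = evolved_fields(specs, current_spec_id=1)
predicate = build_predicate({"category": "A", "region": "us"}, evolved)
print(predicate)  # category = 'A' AND (region = 'us' OR region IS NULL)
```

Fields present in every spec, such as `category` here, keep their plain equality clause, so the widening only applies where a historical spec could have produced files without the field.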
   
   ## Verification
   
   Added two unit tests for `_build_partition_predicate` (with / without 
`evolved_source_ids`) and manually confirmed the repro from the issue now 
produces `[999]` instead of `[1, 2, 999]`.
   
   ```
   make lint  ✓
   pytest tests/table/  ✓  (304 passed)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

