nazq opened a new pull request, #2566:
URL: https://github.com/apache/iceberg-rust/pull/2566

   ## Which issue does this PR close?
   
   - Closes #2565.
   
   ## What changes are included in this PR?
   
   `TableScanBuilder::build()` currently validates caller-supplied column names 
against `snapshot.schema(metadata)`, which returns the schema the snapshot was 
written under. For default scans (no explicit `snapshot_id`) on a table whose 
current schema differs from the snapshot's schema — i.e. any table that's been 
through an `UpdateSchemaAction` commit — this rejects columns that are valid 
against the post-evolution schema:
   
   ```
   DataInvalid => Column note not found in table. Schema: <pre-evolution schema>
   ```
   
   ### Fix
   
   
[`crates/iceberg/src/scan/mod.rs:221`](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/scan/mod.rs#L221):
   
   ```diff
   - let schema = snapshot.schema(self.table.metadata())?;
   + let schema = if self.snapshot_id.is_some() {
   +     snapshot.schema(self.table.metadata())?
   + } else {
   +     self.table.metadata().current_schema().clone()
   + };
   ```
   
   - **Explicit `snapshot_id`** (time-travel): keep the snapshot-time 
vocabulary. A caller asking for "the table as it existed at snapshot N" should 
see schema N's columns.
   - **Default scan** (no `snapshot_id`): use the table's current schema. Field 
IDs are stable across schemas, so the downstream Parquet projection 
(`get_arrow_projection_mask_with_field_ids`, which reads `PARQUET:field_id` 
metadata embedded by the writer) still finds the right on-disk columns even 
when the file's column names differ from the current schema's column names.
   
   This matches PyIceberg's approach in 
`pyiceberg/io/pyarrow.py::_task_to_record_batches` — project by field ID, 
rename the arrow batch on the way out.
   
   The column-name validation loop (lines 224–237) and the subsequent 
`field_id_by_name` lookup (lines 256–261) share the same `schema` variable, so 
the fix is one assignment with a 12-line comment explaining the invariant.
   
   ### Why this wasn't caught earlier
   
   `UpdateSchemaAction` (#2120) shipped with metadata-only tests in 
`crates/catalog/loader/tests/schema_update_suite.rs` — none of them exercise 
`table.scan().select_columns(...)` after the schema commit. The pre-existing 
`crates/integration_tests/tests/read_evolved_schema.rs` only uses 
`table.scan().build()` with no `select_columns`, which bypasses the validation 
loop entirely. So the regression has been latent in the add/delete path since 
#2120 merged; it's just easier to trip with rename.
   
   ## Are these changes tested?
   
   Yes — three regression tests in `scan::tests`, all on the same minimal 
fixture (`make_schema_evolved_table`) with `current-schema-id=1` (`id, value`) 
and a sole snapshot at `schema-id=0` (`id, tmp`):
   
   | Test | What it covers |
   |---|---|
   | `test_default_scan_uses_current_schema_after_evolution` | `select(["id", 
"value"])` succeeds in the default scan — the post-evolution column name 
resolves |
   | `test_default_scan_rejects_old_name_after_rename` | `select(["id", 
"tmp"])` fails with `DataInvalid` — the pre-evolution name is no longer part of 
the public vocabulary in a default scan |
   | `test_snapshot_id_scan_uses_snapshot_schema` | 
`snapshot_id(1).select(["id", "tmp"])` succeeds (time-travel keeps the old 
vocabulary); `snapshot_id(1).select(["id", "value"])` fails |
   
   All 34 existing `scan::tests` still pass. Full iceberg lib suite: 1299/1299. 
Clippy + rustfmt clean.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to