[I] Table scan rejects current-schema column names after `UpdateSchemaAction` commit [iceberg-rust]

via GitHub Mon, 01 Jun 2026 17:19:54 -0700


nazq opened a new issue, #2565:
URL: https://github.com/apache/iceberg-rust/issues/2565


   # Table scan rejects current-schema column names after `UpdateSchemaAction` commit
   
   **Label:** `bug`
   
   ## Is your feature request related to a problem or challenge?
   
   A default `TableScanBuilder::build()` validates caller-supplied column names 
against the *snapshot's* schema, not the *table's current* schema. After an 
`UpdateSchemaAction` commit changes the current schema (rename / add / delete 
column), pre-existing snapshots still point at the pre-evolution `schema_id`, 
so the scan rejects names that are valid against the post-evolution schema.
   
   ### Reproducer
   
   Setup: any iceberg table with at least one snapshot. Apply a 
schema-evolution transaction (uses the action shipped in #2120 / 
`UpdateSchemaAction`):
   
   ```rust
   let tx = Transaction::new(&table);
   let action = tx.update_schema()
       .add_column(AddColumn::optional("note", Type::Primitive(PrimitiveType::String)));
   let tx = action.apply(tx)?;
   let table = tx.commit(&catalog).await?;
   ```
   
   The catalog now reports the post-evolution schema (verified via 
`catalog.load_table().metadata().current_schema()`). But a scan over the same 
`Table`:
   
   ```rust
   table.scan().select(["note"]).build()
   ```
   
   returns:
   
   ```
   DataInvalid => Column note not found in table. Schema: table {
     1: id: optional long
     2: name: optional string
     3: tmp: optional double
   }
   ```
   
   The schema dump is the **snapshot's** schema — the column added a moment ago 
is missing.
   
   ### Root cause
   
   
[`crates/iceberg/src/scan/mod.rs:221`](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/scan/mod.rs#L221):
   
   ```rust
   let schema = snapshot.schema(self.table.metadata())?;
   ```
   
   `snapshot.schema(metadata)` resolves the snapshot's `schema_id` against 
`metadata.schemas` and returns *the schema the snapshot was written under*. For 
time-travel scans (`.snapshot_id(...)`) that's exactly right — the caller is 
asking for "the table as it existed at this snapshot." But for a default scan, 
the caller is asking for "the table as it is now," and the post-evolution 
columns are legitimately part of that vocabulary.
   
   The downstream Parquet projection 
(`crates/iceberg/src/arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids`)
 already maps field IDs to on-disk column names via `PARQUET:field_id` 
metadata, so resolving names against the current schema is safe end-to-end — 
field IDs are stable across schema versions, and the file's original column 
names live in the parquet metadata until the file is rewritten. PyIceberg's 
reader (`pyiceberg/io/pyarrow.py::_task_to_record_batches`) implements exactly 
this pattern: project by field ID, rename the arrow batch on the way out.
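
As a rough illustration of why field-ID projection makes current-schema names safe, here is a minimal, self-contained sketch. It does not use the iceberg-rust or Parquet APIs: `ToyFile`, `FileColumn`, `project_by_field_id`, the renamed column `temperature`, and the sample values are all hypothetical. Filling a column that is absent from the file with nulls is an assumption of the sketch, chosen to mirror how columns added after a file was written are usually read.

```rust
use std::collections::HashMap;

/// One column of a toy data file. Real Parquet files carry the field ID in
/// column metadata; in this sketch the map key plays that role.
struct FileColumn {
    /// Name the column had when the file was written (hypothetical field).
    written_name: &'static str,
    values: Vec<Option<i64>>,
}

/// Toy data file: field ID -> column, plus the number of rows.
struct ToyFile {
    columns: HashMap<i32, FileColumn>,
    row_count: usize,
}

/// A projected output column: current-schema name plus its values.
type Projected = (String, Vec<Option<i64>>);

/// Resolve `selected` names against the current schema, fetch data by field
/// ID, and label each output column with its current name.
fn project_by_field_id(
    file: &ToyFile,
    current_schema: &[(i32, &str)],
    selected: &[&str],
) -> Result<Vec<Projected>, String> {
    let mut out = Vec::with_capacity(selected.len());
    for name in selected {
        // Name -> field ID uses the *current* schema only.
        let (field_id, current_name) = current_schema
            .iter()
            .find(|(_, n)| n == name)
            .ok_or_else(|| format!("Column {name} not found in table"))?;
        // Field ID -> data uses the file. Assumption: a field the file does
        // not contain is read as all nulls.
        let values = match file.columns.get(field_id) {
            Some(col) => col.values.clone(),
            None => vec![None; file.row_count],
        };
        out.push((current_name.to_string(), values));
    }
    Ok(out)
}

/// Hypothetical file written while field 3 was still called `tmp` and
/// before `note` (field 4) existed.
fn example_file() -> ToyFile {
    let mut columns = HashMap::new();
    columns.insert(1, FileColumn { written_name: "id", values: vec![Some(1), Some(2)] });
    columns.insert(3, FileColumn { written_name: "tmp", values: vec![Some(10), Some(20)] });
    ToyFile { columns, row_count: 2 }
}
```

With a current schema of `[(1, "id"), (3, "temperature"), (4, "note")]`, selecting `temperature` reads the data stored under the old on-disk name `tmp`, selecting `note` yields nulls, and selecting `tmp` is rejected because that name is no longer part of the current schema.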
   
   ### Why this wasn't caught upstream
   
   `UpdateSchemaAction` (#2120) shipped with metadata-only tests in 
`crates/catalog/loader/tests/schema_update_suite.rs`; none of them call 
`table.scan().select(...)` after the schema commit. The pre-existing 
`crates/integration_tests/tests/read_evolved_schema.rs` only uses 
`table.scan().build()` with no `select`, which bypasses the column-name 
validation loop entirely (it falls through to 
`column_names.unwrap_or_else(|| schema.as_struct().fields()...)`).
   
   So a column-name lookup combined with a schema-evolved table is the gap. 
Both `add_column` and `delete_column` (already in `main`) trigger it; 
`rename_column` (#2563) trips it even more cleanly because the old name 
continues to exist on disk.
   
   ## Describe the solution you'd like
   
   Branch on whether the caller asked for a specific snapshot:
   
   ```rust
   let schema = if self.snapshot_id.is_some() {
       snapshot.schema(self.table.metadata())?
   } else {
       self.table.metadata().current_schema().clone()
   };
   ```
   
   - Explicit `snapshot_id` (time-travel): keep the snapshot-time vocabulary. A 
caller asking "what existed at snapshot N" should see schema N's columns.
   - Default scan (no `snapshot_id`): use the table's current schema. Field IDs 
are stable across schemas, so the downstream projection still finds the right 
on-disk columns.
   
   Both the column-name validation loop and the subsequent `field_id_by_name` 
lookup share the same `schema` variable, so the fix is one assignment.
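
To make the intended behavior concrete without depending on the crate, the sketch below models the proposed branch with toy types. `ToySchema`, `ToyMetadata`, `scan_schema`, `resolve_columns`, and the snapshot ID `100` are hypothetical stand-ins rather than iceberg-rust APIs; the column layout follows the schema dump from the reproducer.

```rust
use std::collections::HashMap;

/// Toy stand-in for a schema: a list of (field ID, column name) pairs.
#[derive(Debug)]
struct ToySchema {
    schema_id: i32,
    fields: Vec<(i32, &'static str)>,
}

impl ToySchema {
    fn field_id_by_name(&self, name: &str) -> Option<i32> {
        self.fields.iter().find(|(_, n)| *n == name).map(|(id, _)| *id)
    }
}

/// Toy stand-in for table metadata.
struct ToyMetadata {
    schemas: HashMap<i32, ToySchema>,
    current_schema_id: i32,
    /// Snapshot ID -> ID of the schema the snapshot was written under.
    snapshot_schema_ids: HashMap<i64, i32>,
}

/// Pick the schema used for column-name validation: the snapshot's schema
/// for an explicit time-travel scan, the current schema otherwise.
fn scan_schema(meta: &ToyMetadata, requested_snapshot: Option<i64>) -> Option<&ToySchema> {
    let schema_id = match requested_snapshot {
        Some(id) => *meta.snapshot_schema_ids.get(&id)?,
        None => meta.current_schema_id,
    };
    meta.schemas.get(&schema_id)
}

/// Validate the selected names and map them to field IDs.
fn resolve_columns(
    meta: &ToyMetadata,
    requested_snapshot: Option<i64>,
    columns: &[&str],
) -> Result<Vec<i32>, String> {
    let schema = scan_schema(meta, requested_snapshot)
        .ok_or_else(|| "unknown snapshot or schema".to_string())?;
    let mut ids = Vec::with_capacity(columns.len());
    for name in columns {
        let id = schema
            .field_id_by_name(name)
            .ok_or_else(|| format!("Column {name} not found in table"))?;
        ids.push(id);
    }
    Ok(ids)
}

/// Hypothetical table: snapshot 100 was written under schema 0, then
/// `note` (field 4) was added, producing current schema 1.
fn example_metadata() -> ToyMetadata {
    let old = ToySchema { schema_id: 0, fields: vec![(1, "id"), (2, "name"), (3, "tmp")] };
    let new = ToySchema {
        schema_id: 1,
        fields: vec![(1, "id"), (2, "name"), (3, "tmp"), (4, "note")],
    };
    ToyMetadata {
        schemas: HashMap::from([(0, old), (1, new)]),
        current_schema_id: 1,
        snapshot_schema_ids: HashMap::from([(100, 0)]),
    }
}
```

In this model, `resolve_columns(&meta, None, &["note"])` succeeds against the current schema, while `resolve_columns(&meta, Some(100), &["note"])` fails because snapshot 100 predates the column. That is the split between default and time-travel scans described above.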
   
   ## Willingness to contribute
   
   I can contribute this independently. I have a working branch with the fix + 
three regression tests (rename-then-read works, old-name-after-rename errors, 
time-travel still uses snapshot schema), all 1299 iceberg lib tests passing, 
clippy + rustfmt clean. PR ready to open once this issue is filed for reference.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

