nazq opened a new issue, #2565:
URL: https://github.com/apache/iceberg-rust/issues/2565
# Table scan rejects current-schema column names after `UpdateSchemaAction` commit
**Label:** `bug`
## Is your feature request related to a problem or challenge?
A default `TableScanBuilder::build()` validates caller-supplied column names
against the *snapshot's* schema, not the *table's current* schema. After an
`UpdateSchemaAction` commit changes the current schema (rename / add / delete
column), pre-existing snapshots still point at the pre-evolution `schema_id`,
so the scan rejects names that are valid against the post-evolution schema.
### Reproducer
Setup: any Iceberg table with at least one snapshot. Apply a
schema-evolution transaction using `UpdateSchemaAction` (shipped in #2120):
```rust
let tx = Transaction::new(&table);
let action = tx.update_schema()
    .add_column(AddColumn::optional("note", Type::Primitive(PrimitiveType::String)));
let tx = action.apply(tx)?;
let table = tx.commit(&catalog).await?;
```
The catalog now reports the post-evolution schema (verified via
`catalog.load_table().metadata().current_schema()`). But a scan over the same
`Table`:
```rust
table.scan().select(["note"]).build()
```
returns:
```
DataInvalid => Column note not found in table. Schema: table {
1: id: optional long
2: name: optional string
3: tmp: optional double
}
```
The schema dump is the **snapshot's** schema — the column added a moment ago
is missing.
### Root cause
[`crates/iceberg/src/scan/mod.rs:221`](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/scan/mod.rs#L221):
```rust
let schema = snapshot.schema(self.table.metadata())?;
```
`snapshot.schema(metadata)` resolves the snapshot's `schema_id` against
`metadata.schemas` and returns *the schema the snapshot was written under*. For
time-travel scans (`.snapshot_id(...)`) that's exactly right — the caller is
asking for "the table as it existed at this snapshot." But for a default scan,
the caller is asking for "the table as it is now," and the post-evolution
columns are legitimately part of that vocabulary.
The downstream Parquet projection
(`crates/iceberg/src/arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids`)
already maps field IDs to on-disk column names via `PARQUET:field_id`
metadata, so resolving names against the current schema is safe end-to-end —
field IDs are stable across schema versions, and the file's original column
names live in the parquet metadata until the file is rewritten. PyIceberg's
reader (`pyiceberg/io/pyarrow.py::_task_to_record_batches`) implements exactly
this pattern: project by field ID, rename the arrow batch on the way out.
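To make the field-ID argument concrete, here is a minimal, self-contained sketch of the "resolve the name in the current schema, then find the on-disk column by field ID" pattern. `ToySchema`, `ToyDataFile`, and `project` are hypothetical stand-ins for illustration, not iceberg-rust types, and the rename of `name` to `full_name` and the field IDs are assumed for the example.

```rust
use std::collections::HashMap;

/// Toy stand-in for an Iceberg schema: an ordered list of (field ID, name).
/// Hypothetical type for illustration; not the iceberg-rust `Schema`.
pub struct ToySchema {
    pub fields: Vec<(i32, String)>,
}

impl ToySchema {
    pub fn new(fields: &[(i32, &str)]) -> Self {
        let fields = fields
            .iter()
            .map(|(id, name)| (*id, name.to_string()))
            .collect();
        ToySchema { fields }
    }

    /// Look up a field ID by the name used in *this* schema version.
    pub fn field_id_by_name(&self, name: &str) -> Option<i32> {
        self.fields
            .iter()
            .find(|(_, n)| n.as_str() == name)
            .map(|(id, _)| *id)
    }
}

/// Toy stand-in for a data file footer: field ID -> column name as written.
pub struct ToyDataFile {
    pub columns: HashMap<i32, String>,
}

impl ToyDataFile {
    pub fn new(columns: &[(i32, &str)]) -> Self {
        let columns = columns
            .iter()
            .map(|(id, name)| (*id, name.to_string()))
            .collect();
        ToyDataFile { columns }
    }
}

/// Resolve each requested name against `schema`, then locate the on-disk
/// column by field ID. `None` means the file predates the column.
pub fn project(
    schema: &ToySchema,
    file: &ToyDataFile,
    names: &[&str],
) -> Result<Vec<(String, Option<String>)>, String> {
    let mut out = Vec::new();
    for name in names {
        let id = schema
            .field_id_by_name(name)
            .ok_or_else(|| format!("Column {} not found in table", name))?;
        out.push((name.to_string(), file.columns.get(&id).cloned()));
    }
    Ok(out)
}
```

With a current schema of `[(1, "id"), (2, "full_name"), (4, "note")]` and a file written as `[(1, "id"), (2, "name"), (3, "tmp")]`, projecting `full_name` finds the on-disk column `name` through field ID 2, while `note` resolves to a valid field ID with no on-disk column, which a reader would be expected to surface as nulls.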
### Why this wasn't caught upstream
`UpdateSchemaAction` (#2120) shipped with metadata-only tests in
`crates/catalog/loader/tests/schema_update_suite.rs`, and none of them call
`table.scan().select(...)` after the schema commit. The pre-existing
`crates/integration_tests/tests/read_evolved_schema.rs` only uses
`table.scan().build()` with no `select`, which bypasses the column-name
validation loop entirely (it falls through to `column_names.unwrap_or_else(||
schema.as_struct().fields()...)`).
So a column-name lookup combined with a schema-evolved table is the gap.
Both `add_column` and `delete_column` (already in `main`) trigger it;
`rename_column` (#2563) trips it even more cleanly because the old name
continues to exist on disk.
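The bypass described above can be modeled in a few lines. This is a simplified sketch of the control flow only, assuming a hypothetical `resolve_column_names` helper. It is not the actual code in `scan/mod.rs`.

```rust
/// Hypothetical model of the column-resolution step in a scan build.
/// Validation only runs when the caller supplied names; otherwise every
/// field of `schema_columns` is selected and nothing is checked.
pub fn resolve_column_names(
    schema_columns: &[&str],
    requested: Option<Vec<String>>,
) -> Result<Vec<String>, String> {
    if let Some(names) = &requested {
        for name in names {
            if !schema_columns.contains(&name.as_str()) {
                return Err(format!("Column {} not found in table", name));
            }
        }
    }
    // No explicit selection: fall through to "all columns of the schema".
    Ok(requested.unwrap_or_else(|| {
        schema_columns.iter().map(|c| c.to_string()).collect()
    }))
}
```

Called with `None`, the function succeeds regardless of which schema it was handed, which is why a test that never selects columns cannot observe the stale schema. Called with `Some(vec!["note".to_string()])` against the snapshot's columns `["id", "name", "tmp"]`, it returns the error from the reproducer.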
## Describe the solution you'd like
Branch on whether the caller asked for a specific snapshot:
```rust
let schema = if self.snapshot_id.is_some() {
    snapshot.schema(self.table.metadata())?
} else {
    self.table.metadata().current_schema().clone()
};
```
- Explicit `snapshot_id` (time-travel): keep the snapshot-time vocabulary. A
caller asking "what existed at snapshot N" should see schema N's columns.
- Default scan (no `snapshot_id`): use the table's current schema. Field IDs
are stable across schemas, so the downstream projection still finds the right
on-disk columns.
Both the column-name validation loop and the subsequent `field_id_by_name`
lookup share the same `schema` variable, so the fix is one assignment.
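The intended behavior of the branch can be checked against a toy model of table metadata. Everything below (`ToySchema`, `ToyMetadata`, `schema_for_scan`, `validate_columns`, `sample_metadata`, and the IDs used) is a hypothetical illustration of the rule, not iceberg-rust API.

```rust
use std::collections::HashMap;

/// Toy schema: just the column names valid under one schema version.
pub struct ToySchema {
    pub columns: Vec<String>,
}

/// Toy table metadata (hypothetical; not the iceberg-rust `TableMetadata`).
pub struct ToyMetadata {
    pub current_schema_id: i32,
    pub schemas: HashMap<i32, ToySchema>,
    /// Snapshot ID -> ID of the schema the snapshot was written under.
    pub snapshot_schema_ids: HashMap<i64, i32>,
}

/// Pick the schema used for name validation:
/// explicit snapshot -> that snapshot's schema; default scan -> current schema.
pub fn schema_for_scan(
    meta: &ToyMetadata,
    snapshot_id: Option<i64>,
) -> Result<&ToySchema, String> {
    let schema_id = match snapshot_id {
        Some(id) => *meta
            .snapshot_schema_ids
            .get(&id)
            .ok_or_else(|| format!("Snapshot {} not found", id))?,
        None => meta.current_schema_id,
    };
    meta.schemas
        .get(&schema_id)
        .ok_or_else(|| format!("Schema {} not found", schema_id))
}

/// Reject any requested name that the chosen schema does not contain.
pub fn validate_columns(schema: &ToySchema, names: &[&str]) -> Result<(), String> {
    for name in names {
        if !schema.columns.iter().any(|c| c.as_str() == *name) {
            return Err(format!("Column {} not found in table", name));
        }
    }
    Ok(())
}

fn toy_schema(names: &[&str]) -> ToySchema {
    ToySchema {
        columns: names.iter().map(|n| n.to_string()).collect(),
    }
}

/// Snapshot 100 was written under schema 0; schema 1 (current) adds `note`.
pub fn sample_metadata() -> ToyMetadata {
    ToyMetadata {
        current_schema_id: 1,
        schemas: HashMap::from([
            (0, toy_schema(&["id", "name", "tmp"])),
            (1, toy_schema(&["id", "name", "tmp", "note"])),
        ]),
        snapshot_schema_ids: HashMap::from([(100, 0)]),
    }
}
```

In this model, `schema_for_scan(&meta, None)` accepts `note`, while `schema_for_scan(&meta, Some(100))` still rejects it, matching the two bullets above.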
## Willingness to contribute
I can contribute this independently. I have a working branch with the fix +
three regression tests (rename-then-read works, old-name-after-rename errors,
time-travel still uses snapshot schema), all 1299 iceberg lib tests passing,
clippy + rustfmt clean. PR ready to open once this issue is filed for reference.