adriangb opened a new pull request, #22496:
URL: https://github.com/apache/datafusion/pull/22496

   ## Which issue does this PR close?
   
   - No separate issue. Follows up on #22372 (panic fix in 
`TableSchema::with_table_partition_cols`) and the API discussion it spawned, 
and is informed by #22026 (which adds a third column group, virtual columns, to 
`TableSchema`).
   
   ## Rationale for this change
   
   `TableSchema` has one required input (the file schema) and a growing set of 
*optional* column groups: partition columns today, virtual columns in #22026. 
The current API expresses this awkwardly:
   
   - `new(file_schema, partition_cols)` privileges partition columns with a 
positional slot while virtual columns only get a builder method — an asymmetry 
that grows with every new column kind.
   - `TableSchema` eagerly recomputes and caches the concatenated table schema 
on *every* incremental setter call, so 
`from_file_schema(s).with_table_partition_cols(p)` rebuilds it twice (three 
times once virtual columns are added). This is exactly why `new()`'s docs told 
callers to avoid the builder-style chain.
   - The setter mutated an inner `Arc<Vec<FieldRef>>` in place, which is what 
caused the shared-`Arc` panic fixed in #22372.
   
   A dedicated builder addresses all three, and mirrors the existing 
`FileScanConfigBuilder` (the type that *owns* a `TableSchema`).
   
   ## What changes are included in this PR?
   
   - **`TableSchemaBuilder`**: `new(file_schema)` → 
`.with_table_partition_cols(impl Into<Fields>)` → `.build()`. The concatenated 
table schema is computed exactly **once**, in `build()`. The setter takes `impl 
Into<Fields>`, so an existing schema's `Fields` is accepted zero-copy.
   - **Partition columns are now stored as `arrow::datatypes::Fields`** (an 
immutable `Arc<[FieldRef]>`) instead of `Arc<Vec<FieldRef>>`: one fewer 
indirection, shareable zero-copy, and — being immutable — the shared-`Arc` 
mutation panic is structurally impossible.
   - **`TableSchema::table_partition_cols()` and the delegating 
`FileScanConfig::table_partition_cols()` now return `&Fields`.** `Fields` 
derefs to `&[FieldRef]`, so iteration/indexing/`len`/`is_empty` are unchanged; 
only the arrow `FileFormat` path needed `.to_vec()`.
   - **`TableSchema::with_table_partition_cols` is deprecated** in favor of the 
builder. It now **replaces** rather than appends. (Note: `main` currently 
*appends* here — the replace change in #22372 was not captured by that PR's 
squash merge — so this also restores the intended replace semantics.)
   - `new` / `from_file_schema` are kept as conveniences that route through the 
builder.
   - Documented in the 54.0.0 upgrade guide.
   
   This intentionally leaves virtual columns out; #22026 should extend the 
builder with `with_virtual_columns` once it lands.
   
   ## Are these changes tested?
   
   Yes. New unit tests cover building with partition columns, 
replace-on-repeat, zero-copy `Fields` input, and the deprecated setter's 
behavior; existing `TableSchema` / `FileScanConfig` tests and doctests pass. 
`cargo clippy --all-targets -- -D warnings` is clean across the 
datasource/proto/arrow/parquet/catalog-listing crates.
   
   ## Are there any user-facing changes?
   
   Yes — please apply the `api change` label:
   
   - `TableSchema::table_partition_cols()` / 
`FileScanConfig::table_partition_cols()` return `&Fields` instead of 
`&Vec<FieldRef>` (source-compatible for most uses via `Deref`).
   - `TableSchema::with_table_partition_cols` is deprecated (use the builder) 
and now replaces rather than appends.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to