JanKaul opened a new pull request, #22515:
URL: https://github.com/apache/datafusion/pull/22515
## Which issue does this PR close?
- Closes #.
## Rationale for this change
The upstream `parquet` crate exposes "virtual columns": columns that are not
physically stored in the file but are materialized by the reader (e.g. the
per-row `RowNumber` extension type). DataFusion's parquet datasource did not
wire this through. A user who included a `RowNumber`-tagged field in their file
schema would get a "column not found in parquet schema" error or a misaligned
projection mask, for two reasons: the opener fed the full Arrow schema to
`ArrowReaderOptions::with_schema` (which expects real fields only), and it built
`ProjectionMask::roots` from indices that included virtual fields with no
parquet leaves.
## What changes are included in this PR?
- `datafusion/datasource-parquet/src/opener/mod.rs`
  - Add `split_virtual_fields` helper that partitions a schema into `(real-only schema, virtual field list)`.
  - In `PreparedParquetOpen::load`, extract virtuals from `logical_file_schema` and pass them to `ArrowReaderOptions::with_virtual_columns`; store the list on `MetadataLoadedParquetOpen` so later stages can re-supply the real-only schema.
  - In `MetadataLoadedParquetOpen`, strip virtuals before each `options.with_schema(...)` call via a `resupply_schema` closure (the underlying API rejects virtuals).
  - Drop a now-unneeded `#[cfg(feature = "parquet_encryption")] let mut options = options;` shadow. The new unconditional `with_virtual_columns` write site means `options` always has a potential mutation, so the cfg-gated shadow is no longer required to silence `unused_mut`.
- `datafusion/datasource-parquet/src/row_filter.rs`
  - In `build_projection_read_plan`, filter virtual roots out of the indices passed to `ProjectionMask::roots` / `leaf_indices_for_roots` (virtuals have no parquet column to mask).
  - Add a comment on `build_parquet_read_plan` documenting why no symmetric filter is needed: `leaf_indices_for_roots` silently drops indices with no matching leaf, and the decoder appends virtuals to every batch regardless of the mask, so `projected_schema` stays aligned.
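
For readers who want a concrete picture of the two filtering steps described above, here is a minimal, std-only sketch. It is not the code in this PR: `Field` is a hypothetical stand-in for an Arrow field, the `is_virtual` flag models a virtual-column extension type such as `RowNumber`, and `real_root_indices` is an illustrative name. The sketch also assumes virtual fields come after the real ones, so the surviving indices remain valid against the real-only schema.

```rust
/// Hypothetical stand-in for an Arrow field. `is_virtual` models a field
/// tagged with a virtual-column extension type such as `RowNumber`.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    is_virtual: bool,
}

impl Field {
    fn new(name: &str, is_virtual: bool) -> Self {
        Field { name: name.to_string(), is_virtual }
    }
}

/// Partition a schema into (real-only fields, virtual fields),
/// preserving the relative order within each group.
fn split_virtual_fields(schema: &[Field]) -> (Vec<Field>, Vec<Field>) {
    schema.iter().cloned().partition(|f| !f.is_virtual)
}

/// Keep only the root indices that point at real fields. Virtual fields
/// have no parquet leaves, so they must not reach the projection mask.
/// Out-of-range indices are dropped as well (an assumption of this sketch).
fn real_root_indices(schema: &[Field], roots: &[usize]) -> Vec<usize> {
    roots
        .iter()
        .copied()
        .filter(|&i| schema.get(i).map_or(false, |f| !f.is_virtual))
        .collect()
}
```

With a schema of one real field followed by one virtual field, `split_virtual_fields` returns the real field in the first vector and the virtual one in the second, and `real_root_indices` reduces the roots `[0, 1]` to `[0]`.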
## Are these changes tested?
Yes. Added two tests in `opener/mod.rs`:
- `test_virtual_row_number_column`: round-trips a `[Int32, RowNumber(Int64)]` schema across two row groups and asserts `row_num == [0..6)`.
- `test_virtual_row_number_column_with_row_group_pruning`: same setup with a predicate (`a >= 20`) that prunes the first row group and confirms the row numbers come back as `[3, 4, 5]` (i.e. the virtual column reflects the file's absolute row indices, not the post-pruning sequence).
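
The property checked by the second test can be illustrated with a small std-only model. This is a sketch, not the test code from this PR: `RowGroup`, `scan_with_row_numbers`, and `may_contain_at_least` are hypothetical names, and the column values used in the usage note are made up for illustration. The point is that each row's number is derived from its position in the file, so skipping a row group does not renumber the rows that survive.

```rust
/// Hypothetical model of a row group: just the values of column `a`.
struct RowGroup {
    a: Vec<i32>,
}

/// Scan the row groups accepted by `keep`, pairing each value with its
/// absolute row index in the file. The offset advances for every row
/// group, including skipped ones, so pruning never shifts row numbers.
fn scan_with_row_numbers(
    groups: &[RowGroup],
    keep: impl Fn(&RowGroup) -> bool,
) -> Vec<(i32, i64)> {
    let mut out = Vec::new();
    let mut first_row: i64 = 0;
    for g in groups {
        if keep(g) {
            for (i, &v) in g.a.iter().enumerate() {
                out.push((v, first_row + i as i64));
            }
        }
        first_row += g.a.len() as i64;
    }
    out
}

/// Statistics-style pruning for `a >= threshold`: a row group can only
/// contain matches if its maximum value reaches the threshold.
fn may_contain_at_least(g: &RowGroup, threshold: i32) -> bool {
    g.a.iter().copied().max().map_or(false, |m| m >= threshold)
}
```

For two row groups holding `[1, 2, 3]` and `[20, 21, 22]`, scanning with `|g| may_contain_at_least(g, 20)` skips the first group and yields row numbers 3, 4, and 5 for the second, rather than restarting at 0.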
## Are there any user-facing changes?
Yes — purely additive. A `Field` tagged with the `RowNumber` extension type
(or any other parquet virtual-column extension) in a file schema is now honored
by the parquet datasource. No existing API signatures change.
🤖 Generated with [Claude Code](https://claude.com/claude-code)