JanKaul opened a new pull request, #22515:
URL: https://github.com/apache/datafusion/pull/22515

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   The upstream `parquet` crate exposes "virtual columns": columns that are not physically stored in the file but are materialized by the reader (e.g. the per-row `RowNumber` extension type). DataFusion's parquet datasource did not wire this through. A user who included a `RowNumber`-tagged field in their file schema would get a "column not found in parquet schema" error or a misaligned projection mask, for two reasons: the opener passed the full Arrow schema to `ArrowReaderOptions::with_schema`, which expects real fields only, and it built `ProjectionMask::roots` from indices that included virtual fields, which have no parquet leaves.
   
   ## What changes are included in this PR?
   
   - `datafusion/datasource-parquet/src/opener/mod.rs`
     - Add `split_virtual_fields` helper that partitions a schema into 
`(real-only schema, virtual field list)`.
     - In `PreparedParquetOpen::load`, extract virtuals from 
`logical_file_schema` and pass them to 
`ArrowReaderOptions::with_virtual_columns`; store the list on 
`MetadataLoadedParquetOpen` so later stages can re-supply the real-only schema.
     - In `MetadataLoadedParquetOpen`, strip virtuals before each 
`options.with_schema(...)` call via a `resupply_schema` closure (the underlying 
API rejects virtuals).
     - Drop the now-unneeded `#[cfg(feature = "parquet_encryption")] let mut options = options;` shadow. The new `with_virtual_columns` call is an unconditional write site, so `options` may be mutated in every build configuration and the cfg-gated shadow is no longer required to silence `unused_mut`.
   - `datafusion/datasource-parquet/src/row_filter.rs`
     - In `build_projection_read_plan`, filter virtual roots out of the indices 
passed to `ProjectionMask::roots` / `leaf_indices_for_roots` (virtuals have no 
parquet column to mask).
     - Add a comment on `build_parquet_read_plan` documenting why no symmetric 
filter is needed: `leaf_indices_for_roots` silently drops indices with no 
matching leaf, and the decoder appends virtuals to every batch regardless of 
the mask, so `projected_schema` stays aligned.
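
The two filtering steps described above can be sketched with plain standard-library types. This is a simplified model, not the code in this PR: `SketchField`, its `is_virtual` flag, and `real_root_indices` are hypothetical stand-ins (the real helper works on Arrow schemas and detects virtual fields through their extension type), and any index remapping the actual implementation may need is not modelled here.

```rust
/// Hypothetical stand-in for an Arrow `Field`. The `is_virtual` flag replaces
/// the extension-type check that the real code would perform.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchField {
    pub name: String,
    pub is_virtual: bool,
}

/// Partition a schema into (real-only fields, virtual fields).
/// Relative order is preserved within each half.
pub fn split_virtual_fields(schema: &[SketchField]) -> (Vec<SketchField>, Vec<SketchField>) {
    schema.iter().cloned().partition(|f| !f.is_virtual)
}

/// Keep only the root indices that point at real fields, since a virtual
/// field has no parquet column to mask. Out-of-range indices are dropped too.
/// Assumption: `roots` are positions in the full (real + virtual) schema.
pub fn real_root_indices(schema: &[SketchField], roots: &[usize]) -> Vec<usize> {
    roots
        .iter()
        .copied()
        .filter(|&i| schema.get(i).map_or(false, |f| !f.is_virtual))
        .collect()
}
```

In this model, a schema of `[a, row_num]` with `row_num` marked virtual splits into `[a]` and `[row_num]`, and the root list `[0, 1]` reduces to `[0]`.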
   
   ## Are these changes tested?
   
   Yes. Added two tests in `opener/mod.rs`:
   - `test_virtual_row_number_column`: round-trips a `[Int32, RowNumber(Int64)]` schema across two row groups and asserts that `row_num` is `[0, 1, 2, 3, 4, 5]`.
   - `test_virtual_row_number_column_with_row_group_pruning`: same setup with a predicate (`a >= 20`) that prunes the first row group, and confirms that the row numbers come back as `[3, 4, 5]` (i.e. the virtual column reflects the file's absolute row indices, not the post-pruning sequence).
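
The behaviour these tests check can be illustrated with a small standalone model. It is only a sketch of the expected semantics, not of the parquet reader: `absolute_row_numbers` is a hypothetical function, and the layout of two row groups with three rows each is inferred from the expected values in the tests.

```rust
/// Compute the row numbers a reader would report for the row groups that
/// survive pruning, counting from the start of the file.
/// `row_group_sizes[i]` is the row count of group `i`; `keep[i]` says whether
/// group `i` is read. Extra entries in the longer slice are ignored.
pub fn absolute_row_numbers(row_group_sizes: &[i64], keep: &[bool]) -> Vec<i64> {
    let mut out = Vec::new();
    let mut first_row: i64 = 0; // file-level index of the current group's first row
    for (&size, &kept) in row_group_sizes.iter().zip(keep.iter()) {
        if kept {
            out.extend(first_row..first_row + size);
        }
        // Pruned groups still advance the offset, so later groups keep
        // their original positions in the file.
        first_row += size;
    }
    out
}
```

With sizes `[3, 3]`, keeping both groups yields `[0, 1, 2, 3, 4, 5]`, and pruning the first group yields `[3, 4, 5]` rather than `[0, 1, 2]`.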
   
   ## Are there any user-facing changes?
   
   Yes — purely additive. A `Field` tagged with the `RowNumber` extension type 
(or any other parquet virtual-column extension) in a file schema is now honored 
by the parquet datasource. No existing API signatures change.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

