schenksj opened a new issue, #22517:
URL: https://github.com/apache/datafusion/issues/22517

   ## Is your feature request related to a problem or challenge?
   
   `ParquetSource` / `ParquetOpener` (in `datafusion-datasource-parquet`) 
cannot emit the parquet reader's **row-number virtual column**, even though the 
underlying `parquet` crate (58.x) fully supports it:
   
   ```rust
   let row_number_field = Field::new("row_number", DataType::Int64, false)
       .with_extension_type(parquet::arrow::...::RowNumber);
   let builder = builder.with_virtual_columns(vec![row_number_field])?;
   ```
   
   The row-number virtual column gives each row its **true physical position 
within the file even under row-group / page / row-filter pruning**. This is 
exactly what engines need to reconstruct stable per-row identity while still 
benefiting from predicate pushdown.
   
   Concretely, this blocks **Delta Lake row tracking** (`_metadata.row_id` = 
`baseRowId + physical_row_index`) on top of DataFusion: to keep the synthesized 
`row_id`/`row_index` correct, an integrating engine must currently *disable* 
data-filter pushdown (so the reader returns every row in physical order and a 
running counter stays aligned). That defeats row-group skipping whenever 
`_metadata.row_id` is projected alongside a selective filter.
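
To illustrate the misalignment described above, here is a minimal, self-contained sketch that uses only the Rust standard library. It does not call DataFusion or the `parquet` crate. The `RowGroup` struct and both helper functions are hypothetical names invented for this illustration, and pruning is simplified to whole row groups.

```rust
/// Hypothetical model of one row group: how many rows it holds and whether
/// the scan's predicate allowed the reader to skip it entirely.
struct RowGroup {
    num_rows: u64,
    pruned: bool,
}

/// Row ids from a running counter over the rows the reader actually returns.
/// This is only correct when no rows are skipped.
fn row_ids_from_counter(base_row_id: u64, groups: &[RowGroup]) -> Vec<u64> {
    let emitted: u64 = groups
        .iter()
        .filter(|g| !g.pruned)
        .map(|g| g.num_rows)
        .sum();
    (0..emitted).map(|i| base_row_id + i).collect()
}

/// Row ids from each row's physical position in the file, which is the
/// information a row-number virtual column would carry.
fn row_ids_from_physical_index(base_row_id: u64, groups: &[RowGroup]) -> Vec<u64> {
    let mut ids = Vec::new();
    let mut first_row = 0u64; // physical index of the first row in this group
    for g in groups {
        if !g.pruned {
            ids.extend((first_row..first_row + g.num_rows).map(|i| base_row_id + i));
        }
        // Skipped groups still advance the physical position.
        first_row += g.num_rows;
    }
    ids
}

fn main() {
    // A file with two row groups; the first (3 rows) is skipped by pruning.
    let groups = [
        RowGroup { num_rows: 3, pruned: true },
        RowGroup { num_rows: 2, pruned: false },
    ];
    let base_row_id = 100;

    // The counter restarts at 0 for the first returned row: [100, 101].
    println!("{:?}", row_ids_from_counter(base_row_id, &groups));
    // Physical positions 3 and 4 give the intended ids: [103, 104].
    println!("{:?}", row_ids_from_physical_index(base_row_id, &groups));
}
```

With pruning enabled, the counter-based ids collide with the ids of the skipped rows, which is why the integrating engine currently has to turn pushdown off.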
   
   There is no hook to inject this today:
   - `ParquetOpener` never calls `with_virtual_columns`, and its 
`expr_adapter_factory` field is `pub(crate)`, so the opener can't be 
reused/extended from outside the crate.
   - `ParquetSource` exposes no builder-customization hook.
   - The `ParquetFileReaderFactory` provides only the `AsyncFileReader`, not 
builder configuration.
   
   So the only workaround is to re-implement a custom `FileOpener` (duplicating 
projection / row-filter / pruning plumbing), which is what we're doing 
downstream in Apache DataFusion Comet (apache/datafusion-comet — Delta contrib).
   
   ## Describe the solution you'd like
   
   Expose virtual columns on `ParquetSource` / `ParquetOpener`, e.g.:
   
   ```rust
   let source = ParquetSource::new(schema)
       .with_virtual_columns(vec![row_number_field]); // RowNumber-extension field(s)
   ```
   
   …and have `ParquetOpener` forward them to 
`ParquetRecordBatchStreamBuilder::with_virtual_columns(...)` and include them 
in the projected output schema, so the rest of the existing 
pruning/row-filter/projection logic is reused unchanged.
   
   ## Describe alternatives you've considered
   
   - Re-implementing a custom `FileOpener` that builds the stream with 
`with_virtual_columns` (our current downstream approach — works, but duplicates 
a lot of well-tested opener logic and is a maintenance burden).
   - A reader-factory hook — insufficient, since virtual columns are configured 
on the stream *builder*, not the reader.
   
   ## Additional context
   
   Downstream consumer: Apache DataFusion Comet's native Delta Lake scan 
(apache/datafusion-comet#4366). We'd be happy to contribute a PR if the API 
shape above is agreeable.
   


