mbutrovich commented on PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4400142871

   Addressed @comphead's [review](https://github.com/apache/datafusion/pull/22026#issuecomment-4398834689):
   
   - **P1 + P2 + P3:** Introduced `VirtualColumnsState` (`Arc`-shared, holds validated fields, `null_replacements`, and the logical-with-virtual schema). Built once per scan partition in `ParquetSource::create_morselizer`; stored as `Option<Arc<VirtualColumnsState>>` on `ParquetMorselizer` and `PreparedParquetOpen`.
   - **Per-file cost** for virtual-column scans drops to one `Arc::clone`. The two remaining per-file `append_fields` calls (`physical_for_rewrite`, `stream_schema`) depend on per-file coercions/projection mask and can't be cached.
   - **P4 skipped:** adding `OnceLock<SchemaRef>` to every `TableSchema` to save a one-shot `Vec` iteration on a planning-time path is not a necessary compute-vs-memory trade.
   - **[opener.rs:547](https://github.com/apache/datafusion/pull/22026#discussion_r3202953589):** Call site moved into `create_morselizer` with an inline comment explaining why predicate validation gates on `pushdown_filters` (when pushdown is off, the predicate stays above the scan as a `FilterExec` and resolves virtual columns there).
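
For readers skimming the thread, here is a rough, std-only sketch of the sharing pattern described in the first two bullets: state is validated and built once, then handed to each per-file open as an `Arc` clone. Everything in it is an illustrative assumption rather than the PR's code: the struct layout, the `String`-based stand-ins for Arrow fields and schemas, and the names `build`, `Morselizer`, `PreparedOpen`, and `prepare_open` are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `VirtualColumnsState`. Field types are
/// simplified to strings; the real struct holds Arrow fields and schemas.
#[derive(Debug)]
struct VirtualColumnsState {
    validated_fields: Vec<String>,
    null_replacements: HashMap<String, String>,
    logical_with_virtual_schema: Vec<String>,
}

impl VirtualColumnsState {
    /// Validates the virtual columns and assembles the combined schema.
    /// In this sketch, this is the only place that work happens.
    fn build(file_fields: &[&str], virtual_fields: &[&str]) -> Result<Arc<Self>, String> {
        for v in virtual_fields {
            if file_fields.contains(v) {
                return Err(format!("virtual column `{v}` collides with a file column"));
            }
        }
        let validated_fields: Vec<String> =
            virtual_fields.iter().map(|s| s.to_string()).collect();
        // Assumption: each virtual column gets a placeholder replacement value.
        let null_replacements = validated_fields
            .iter()
            .map(|name| (name.clone(), "NULL".to_string()))
            .collect();
        let logical_with_virtual_schema = file_fields
            .iter()
            .chain(virtual_fields.iter())
            .map(|s| s.to_string())
            .collect();
        Ok(Arc::new(Self {
            validated_fields,
            null_replacements,
            logical_with_virtual_schema,
        }))
    }
}

/// Hypothetical per-partition object holding the shared state.
struct Morselizer {
    virtual_columns: Option<Arc<VirtualColumnsState>>,
}

/// Hypothetical per-file object; it receives a clone of the `Arc` only.
struct PreparedOpen {
    virtual_columns: Option<Arc<VirtualColumnsState>>,
}

impl Morselizer {
    /// Per-file cost for the virtual-column state: one reference-count bump.
    fn prepare_open(&self) -> PreparedOpen {
        PreparedOpen {
            virtual_columns: self.virtual_columns.clone(),
        }
    }
}

fn main() {
    // Built once, as if per scan partition.
    let state = VirtualColumnsState::build(&["a", "b"], &["row_index"])
        .expect("no name collisions in this example");
    let morselizer = Morselizer {
        virtual_columns: Some(Arc::clone(&state)),
    };

    // Three simulated file opens share the same allocation.
    let opens: Vec<PreparedOpen> = (0..3).map(|_| morselizer.prepare_open()).collect();
    for open in &opens {
        let shared = open.virtual_columns.as_ref().expect("state was set");
        assert!(Arc::ptr_eq(shared, &state));
    }
    println!(
        "schema = {:?}, replacements = {:?}, handles = {}",
        state.logical_with_virtual_schema,
        state.null_replacements,
        Arc::strong_count(&state)
    );
}
```

The `Option` wrapper mirrors the `Option<Arc<VirtualColumnsState>>` mentioned above, so scans without virtual columns carry `None` and pay nothing.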
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

