Re: [I] Add benchmarks for queries against wide schema parquet files [datafusion]

via GitHub Mon, 08 Jun 2026 12:26:48 -0700


adriangb commented on issue #21968:
URL: https://github.com/apache/datafusion/issues/21968#issuecomment-4652647524


   ### Wide-schema perf — PR breakdown
   
   The investigation work is now split into focused PRs:
   
   **arrow-rs (primitives):**
   - apache/arrow-rs#9882 — O(1) `parquet_column`, 
`parquet_to_arrow_schema_and_field_levels` + 
`ArrowReaderMetadata::from_field_levels`, 
`StatisticsConverter::from_arrow_field`, 
`AsyncFileReader::get_arrow_reader_metadata`. Ready for review.
   
   **DataFusion:**
   - #22829 — the arrow-rs-independent wins: cache entry `memory_size` in 
`DefaultFilesMetadataCache`, and `apply_file_schema_type_coercions` 
early-return. Mergeable now.
   - #22830 (draft) — cache per-file `ArrowReaderMetadata` across opens. 
Depends on arrow-rs#9882, so it cannot build until those primitives ship in an 
arrow release that DataFusion upgrades to (DataFusion `main` is on arrow 
`58.3.0`; the primitives are on arrow-rs `main`/`59.0.0`).
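
To illustrate the general idea behind caching an entry's `memory_size`, here is a minimal sketch. It is not the code in #22829: `Metadata`, `Entry`, and `SizeTrackingCache` are hypothetical names, and the size calculation is a stand-in. The point is only that a size which is expensive to compute for wide schemas can be computed once on insert and reused for accounting.

```rust
use std::collections::HashMap;

/// Hypothetical cached value; `columns` stands in for per-column metadata.
struct Metadata {
    columns: Vec<String>,
}

impl Metadata {
    /// Walks every column, so the cost grows with schema width.
    fn memory_size(&self) -> usize {
        self.columns.iter().map(|c| c.len()).sum()
    }
}

/// Hypothetical cache entry that stores its size next to the value.
struct Entry {
    metadata: Metadata,
    memory_size: usize,
}

/// Hypothetical cache that tracks total usage from the stored sizes.
struct SizeTrackingCache {
    entries: HashMap<String, Entry>,
    used: usize,
}

impl SizeTrackingCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), used: 0 }
    }

    /// Computes the size once; a replaced entry gives back its stored size.
    fn put(&mut self, path: &str, metadata: Metadata) {
        let memory_size = metadata.memory_size();
        let entry = Entry { metadata, memory_size };
        if let Some(old) = self.entries.insert(path.to_string(), entry) {
            self.used -= old.memory_size;
        }
        self.used += memory_size;
    }

    /// Uses the stored size instead of walking the metadata again.
    fn remove(&mut self, path: &str) -> Option<Metadata> {
        let entry = self.entries.remove(path)?;
        self.used -= entry.memory_size;
        Some(entry.metadata)
    }

    fn used(&self) -> usize {
        self.used
    }
}
```

With this shape, replacing or evicting an entry never re-walks the metadata; the accounting only touches the stored `usize`.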
   
   The original `statistics_from_parquet_metadata` O(N²)→O(N) idea is **not** 
included — upstream #22462 already refactored that path, and combined with 
arrow-rs#9882's O(1) `parquet_column` it is already O(N).
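
The complexity argument can be sketched in isolation. The code below does not use the arrow-rs or DataFusion APIs; `Schema`, `find_linear`, `build_index`, and `resolve_all` are hypothetical names, and column names are assumed to be unique. It shows why a per-column linear lookup makes resolving all N columns O(N²), while a precomputed index keeps the whole pass at O(N).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema: leaf column names in file order.
struct Schema {
    names: Vec<String>,
}

/// Linear scan: O(N) per lookup, so resolving all N columns costs O(N^2).
fn find_linear(schema: &Schema, name: &str) -> Option<usize> {
    schema.names.iter().position(|n| n == name)
}

/// Builds a name -> index map once in O(N); later lookups are O(1) on average.
/// Assumes names are unique (a duplicate would keep the last index).
fn build_index(schema: &Schema) -> HashMap<&str, usize> {
    schema
        .names
        .iter()
        .enumerate()
        .map(|(i, n)| (n.as_str(), i))
        .collect()
}

/// Resolves every requested column against the precomputed index.
fn resolve_all(schema: &Schema, wanted: &[&str]) -> Vec<Option<usize>> {
    let index = build_index(schema);
    wanted.iter().map(|w| index.get(*w).copied()).collect()
}
```

Calling `find_linear` inside a loop over all columns is the quadratic pattern; `resolve_all` pays the O(N) index build once and then does constant-time lookups.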
   
   The earlier all-in-one draft #21987 is superseded by the split above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

