adriangb opened a new pull request, #22830:
URL: https://github.com/apache/datafusion/pull/22830

   > **Draft — blocked, not yet buildable.** See "Status" below. Opened for
   > visibility of the design while the dependency lands.
   
   ## Which issue does this PR close?
   
   - Part of #21968 (the wide-schema parquet read performance work).
   
   ## Rationale for this change
   
   Building an `ArrowReaderMetadata` walks every leaf of the parquet schema
   to produce the arrow `Schema` and the Dremel field levels. For wide-schema
   files (hundreds or thousands of columns) this `O(N_columns)` walk runs
   each time a file is opened, so every query repeats it even though the
   result is identical across queries for the same file. It is one of the
   larger remaining per-file costs identified in #21968.
   
   ## What changes are included in this PR?
   
   Cache the built `ArrowReaderMetadata` on `CachedParquetMetaData` (which
   already lives in the file metadata cache):
   
   - `arrow_reader_metadata()` lazily builds the base metadata via
     `parquet_to_arrow_schema_and_field_levels` + `from_field_levels` and
     memoises it in a `OnceLock`; warm hits are a cheap `Arc`-bump clone.
   - `coerced_arrow_reader_metadata()` memoises a single post-coercion
     build keyed by the supplied schema's `Arc` identity.
   - `CachedParquetFileReader` overrides
     `AsyncFileReader::get_arrow_reader_metadata` to serve both from cache.
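
The two memoisation patterns above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `Schema`, `ReaderMetadata`, `CachedMetadata`, `build_base` and `build_count` are hypothetical stand-ins (the real types are `ArrowReaderMetadata` and `CachedParquetMetaData`), and the `Mutex<Option<..>>` slot for the coerced build is an assumption about one way to hold "a single build keyed by `Arc` identity".

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for an arrow `Schema`.
#[derive(Debug)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Hypothetical stand-in for `ArrowReaderMetadata`.
/// Cloning only bumps the `Arc` reference count.
#[derive(Clone, Debug)]
pub struct ReaderMetadata {
    pub schema: Arc<Schema>,
}

/// Hypothetical stand-in for the cached per-file metadata entry.
pub struct CachedMetadata {
    leaf_columns: Vec<String>,
    /// Base metadata, built at most once.
    base: OnceLock<ReaderMetadata>,
    /// Single-slot cache: the schema `Arc` used as key, plus the build for it.
    coerced: Mutex<Option<(Arc<Schema>, ReaderMetadata)>>,
    /// Counts expensive builds; present only to make the sketch observable.
    builds: AtomicUsize,
}

impl CachedMetadata {
    pub fn new(leaf_columns: Vec<String>) -> Self {
        Self {
            leaf_columns,
            base: OnceLock::new(),
            coerced: Mutex::new(None),
            builds: AtomicUsize::new(0),
        }
    }

    /// Stands in for the O(N_columns) schema walk.
    fn build_base(&self) -> ReaderMetadata {
        self.builds.fetch_add(1, Ordering::Relaxed);
        ReaderMetadata {
            schema: Arc::new(Schema { fields: self.leaf_columns.clone() }),
        }
    }

    /// Lazily builds the base metadata; later calls return a cheap clone.
    pub fn reader_metadata(&self) -> ReaderMetadata {
        self.base.get_or_init(|| self.build_base()).clone()
    }

    /// Returns the cached build when `coerced` is the same allocation as the
    /// stored key; otherwise rebuilds and overwrites the single slot.
    pub fn coerced_reader_metadata(&self, coerced: &Arc<Schema>) -> ReaderMetadata {
        let mut slot = self.coerced.lock().unwrap();
        if let Some((key, cached)) = slot.as_ref() {
            if Arc::ptr_eq(key, coerced) {
                return cached.clone();
            }
        }
        self.builds.fetch_add(1, Ordering::Relaxed);
        let built = ReaderMetadata { schema: Arc::clone(coerced) };
        *slot = Some((Arc::clone(coerced), built.clone()));
        built
    }

    /// Number of expensive builds performed so far.
    pub fn build_count(&self) -> usize {
        self.builds.load(Ordering::Relaxed)
    }
}
```

In this sketch the slot stores a clone of the key `Arc`, which keeps the schema allocation alive so a pointer comparison cannot accidentally match a different schema that reused the same address. A schema that is equal in content but lives in a different allocation is treated as a miss, which trades some hit rate for an O(1) comparison.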
   
   ## Status / dependencies
   
   This depends on the arrow-rs primitives in **apache/arrow-rs#9882**
   (`from_field_levels`, `parquet_to_arrow_schema_and_field_levels`,
   `ArrowReaderOptions` accessors,
   `AsyncFileReader::get_arrow_reader_metadata`).
   
   It is **not buildable yet**: DataFusion `main` pins arrow `58.3.0` while
   those primitives are on arrow-rs `main` (`59.0.0`), so a Cargo `[patch]`
   pointing at arrow-rs `main` cannot satisfy the `^58.3.0` requirement. CI
   will be red until the arrow-rs changes ship in a release that DataFusion
   upgrades to; the `[patch]`/version wiring will be added here at that
   point. This PR stays a draft until then.
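
The version mismatch can be illustrated with a simplified model of Cargo's caret rule. The sketch below is an assumption-laden approximation, not Cargo's implementation: `Version` and `caret_matches` are hypothetical names, pre-release tags are ignored, and only requirements with a major version of at least 1 are covered.

```rust
/// A (major, minor, patch) triple. The derived ordering compares the fields
/// lexicographically in declaration order.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
}

/// Simplified caret rule for `^req` when `req.major >= 1`: the candidate
/// must share the major version and be at least as new as `req`.
pub fn caret_matches(req: Version, candidate: Version) -> bool {
    assert!(req.major >= 1, "sketch only covers requirements with major >= 1");
    candidate.major == req.major && candidate >= req
}
```

Under this model, `caret_matches` for a `58.3.0` requirement rejects `59.0.0` because the major versions differ, which is why a patched source at `59.0.0` is not picked up for a `^58.3.0` dependency.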
   
   The arrow-rs-independent pieces of the #21968 investigation
   (`memory_size` caching, coercion early-return) are split out into a
   separately-mergeable PR: #22829.
   
   ## Are these changes tested?
   
   Not yet (blocked on building, see above). Tests will be added once the
   dependency is available.
   
   ## Are there any user-facing changes?
   
   No public API changes — `CachedParquetMetaData` gains internal caching
   fields and methods.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

