adriangb opened a new pull request, #22830:
URL: https://github.com/apache/datafusion/pull/22830
> **Draft — blocked, not yet buildable.** See "Status" below. Opened for
> visibility of the design while the dependency lands.
## Which issue does this PR close?
- Part of #21968 (the wide-schema parquet read performance work).
## Rationale for this change
Building an `ArrowReaderMetadata` walks every leaf of the parquet schema
to produce the arrow `Schema` and the Dremel field levels. For wide-schema
files (hundreds or thousands of columns) this `O(N_columns)` walk runs
each time a file is opened, so every query repeats it even though the
result is identical across queries for the same file. It is one of the
larger remaining per-file costs identified in #21968.
## What changes are included in this PR?
Cache the built `ArrowReaderMetadata` on `CachedParquetMetaData` (which
already lives in the file metadata cache):
- `arrow_reader_metadata()` lazily builds the base metadata via
`parquet_to_arrow_schema_and_field_levels` + `from_field_levels` and
memoises it in a `OnceLock`; warm hits are a cheap `Arc`-bump clone.
- `coerced_arrow_reader_metadata()` memoises a single post-coercion
build keyed by the supplied schema's `Arc` identity.
- `CachedParquetFileReader` overrides
`AsyncFileReader::get_arrow_reader_metadata` to serve both from cache.
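
For readers unfamiliar with the pattern, the two memoisation strategies described above can be sketched with the standard library alone. This is only an illustration, not the code in this PR: `Schema`, `ReaderMetadata`, `CachedMeta`, and `build_count` are hypothetical stand-ins for the real arrow/parquet types, and the "expensive build" is faked with a counter.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical stand-in for an arrow schema.
#[derive(Debug)]
struct Schema {
    fields: Vec<String>,
}

// Hypothetical stand-in for `ArrowReaderMetadata`: cloning only bumps an `Arc`.
#[derive(Clone, Debug)]
struct ReaderMetadata {
    schema: Arc<Schema>,
}

// Hypothetical stand-in for the cached per-file metadata entry.
struct CachedMeta {
    leaf_names: Vec<String>,
    // Base metadata: built at most once, then shared.
    base: OnceLock<ReaderMetadata>,
    // Single-slot cache for the post-coercion build, keyed by `Arc` identity.
    coerced: Mutex<Option<(Arc<Schema>, ReaderMetadata)>>,
    // Counts simulated expensive builds (illustration only).
    builds: AtomicUsize,
}

impl CachedMeta {
    fn new(leaf_names: Vec<String>) -> Self {
        Self {
            leaf_names,
            base: OnceLock::new(),
            coerced: Mutex::new(None),
            builds: AtomicUsize::new(0),
        }
    }

    fn build_count(&self) -> usize {
        self.builds.load(Ordering::Relaxed)
    }

    // Lazily build the base metadata; warm hits are a cheap clone.
    fn arrow_reader_metadata(&self) -> ReaderMetadata {
        self.base
            .get_or_init(|| {
                // Stands in for the O(N_columns) schema walk.
                self.builds.fetch_add(1, Ordering::Relaxed);
                ReaderMetadata {
                    schema: Arc::new(Schema { fields: self.leaf_names.clone() }),
                }
            })
            .clone()
    }

    // Reuse the cached build only if the caller passes the *same* `Arc`.
    fn coerced_arrow_reader_metadata(&self, supplied: &Arc<Schema>) -> ReaderMetadata {
        let mut slot = self.coerced.lock().unwrap();
        if let Some((key, meta)) = slot.as_ref() {
            if Arc::ptr_eq(key, supplied) {
                return meta.clone();
            }
        }
        // Miss: rebuild and overwrite the single slot.
        self.builds.fetch_add(1, Ordering::Relaxed);
        let meta = ReaderMetadata { schema: Arc::clone(supplied) };
        *slot = Some((Arc::clone(supplied), meta.clone()));
        meta
    }
}
```

In this sketch, keying on `Arc::ptr_eq` avoids a structural schema comparison, which would itself be linear in the number of columns. The trade-off is that a structurally equal schema held in a different `Arc` counts as a miss and replaces the cached entry.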
## Status / dependencies
This depends on the arrow-rs primitives in **apache/arrow-rs#9882**
(`from_field_levels`, `parquet_to_arrow_schema_and_field_levels`,
`ArrowReaderOptions` accessors,
`AsyncFileReader::get_arrow_reader_metadata`).
It is **not buildable yet**: DataFusion `main` pins arrow `58.3.0`, while
those primitives exist only on arrow-rs `main` (`59.0.0`), so no Cargo
`[patch]` can satisfy `^58.3.0`. CI will stay red until the arrow-rs
changes ship in a release that DataFusion upgrades to; the
`[patch]`/version wiring will be added here at that point. This PR stays
a draft until then.
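
To make the version constraint concrete, here is a minimal sketch of Cargo's caret rule for requirements with a non-zero major version. The `Version` alias and `caret_matches` helper are hypothetical and written for this note; they are not Cargo's implementation, and they ignore pre-release tags.

```rust
/// (major, minor, patch). Hypothetical helper type for this sketch.
type Version = (u64, u64, u64);

/// Caret rule for a requirement whose major version is non-zero:
/// `^a.b.c` accepts any version `>= a.b.c` and `< (a+1).0.0`.
fn caret_matches(req: Version, candidate: Version) -> bool {
    assert!(req.0 > 0, "this sketch only covers major >= 1");
    // Tuples compare lexicographically, which matches version ordering here.
    candidate.0 == req.0 && candidate >= req
}
```

Under this rule `caret_matches((58, 3, 0), (59, 0, 0))` is `false`, which is why a `[patch]` pointing at arrow-rs `main` cannot stand in for the `^58.3.0` requirement.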
The arrow-rs-independent pieces of the #21968 investigation
(`memory_size` caching, coercion early-return) are split out into a
separately-mergeable PR: #22829.
## Are these changes tested?
Not yet (blocked on building, see above). Tests will be added once the
dependency is available.
## Are there any user-facing changes?
No public API changes — `CachedParquetMetaData` gains internal caching
fields and methods.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]