[PR] perf: cache full Parquet metadata (incl. page index) via DataFusion's CachedParquetFileReaderFactory [datafusion-comet]

via GitHub Mon, 22 Jun 2026 12:07:47 -0700


mbutrovich opened a new pull request, #4707:
URL: https://github.com/apache/datafusion-comet/pull/4707


   ## Which issue does this PR close?
   
   Part of #3978.
   
   ## Rationale for this change
   
   Comet's `native_datafusion` scan used a hand-rolled 
`CachingParquetReaderFactory` that cached only the Parquet footer. DataFusion's 
opener loads the page index in a separate step, and because the footer-only 
cache never held it, the opener re-fetched the page index from object storage 
on every open. For non-selective `IS NOT NULL` predicates on non-null join keys 
(e.g. TPC-DS q88), the page index prunes nothing yet is re-read for every split, 
adding gigabytes of wasted I/O at scale. The hand-rolled cache was also 
unbounded: it never evicted entries.
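
To make the cost concrete, here is a minimal, std-only sketch of the two caching strategies. It is a toy model, not Comet or DataFusion code: `Metadata`, `FakeStore`, `open_split`, and `simulate_reads` are hypothetical names, and the "store" only counts reads. It assumes a two-step open (footer first, page index second), as described above.

```rust
use std::collections::HashMap;

/// Toy stand-in for Parquet metadata: a footer plus an optional page index.
#[allow(dead_code)]
#[derive(Clone)]
struct Metadata {
    footer: Vec<u8>,
    page_index: Option<Vec<u8>>,
}

/// Hypothetical object store that only counts how often each part is read.
#[derive(Default)]
struct FakeStore {
    footer_reads: usize,
    page_index_reads: usize,
}

impl FakeStore {
    fn read_footer(&mut self, _path: &str) -> Vec<u8> {
        self.footer_reads += 1;
        vec![0u8; 8]
    }

    fn read_page_index(&mut self, _path: &str) -> Vec<u8> {
        self.page_index_reads += 1;
        vec![0u8; 64]
    }
}

/// Opens one split of `path`. When `cache_page_index` is false, only the
/// footer is ever stored in the cache (the footer-only behaviour).
fn open_split(
    store: &mut FakeStore,
    cache: &mut HashMap<String, Metadata>,
    path: &str,
    cache_page_index: bool,
) -> Metadata {
    // Step 1: footer, served from the cache when present.
    let cached = cache.get(path).cloned();
    let mut meta = match cached {
        Some(m) => m,
        None => {
            let fresh = Metadata {
                footer: store.read_footer(path),
                page_index: None,
            };
            cache.insert(path.to_string(), fresh.clone());
            fresh
        }
    };
    // Step 2: the page index is loaded separately if the metadata lacks it.
    if meta.page_index.is_none() {
        meta.page_index = Some(store.read_page_index(path));
        if cache_page_index {
            // Full-metadata strategy: keep the page index for later opens.
            cache.insert(path.to_string(), meta.clone());
        }
    }
    meta
}

/// Returns (footer reads, page index reads) after opening `splits` splits
/// of the same file.
fn simulate_reads(splits: usize, cache_page_index: bool) -> (usize, usize) {
    let mut store = FakeStore::default();
    let mut cache = HashMap::new();
    for _ in 0..splits {
        let _meta = open_split(&mut store, &mut cache, "part-0.parquet", cache_page_index);
    }
    (store.footer_reads, store.page_index_reads)
}
```

In this model, `simulate_reads(8, false)` reads the page index once per split, while `simulate_reads(8, true)` reads it once per file.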
   
   ## What changes are included in this PR?
   
   - Replace `CachingParquetReaderFactory` with DataFusion's 
`CachedParquetFileReaderFactory`, backed by the per-task `RuntimeEnv` 
file-metadata cache (bounded LRU, `metadata_cache_limit`). It loads the full 
metadata including the page index once per file, so the opener no longer 
re-fetches it.
   - Delete `parquet_read_cached_factory.rs`.
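
For readers unfamiliar with bounded metadata caches, the following std-only sketch illustrates the general idea of a byte-limited LRU keyed by file path. It is not DataFusion's implementation: `BoundedMetadataCache` and its methods are hypothetical, metadata is modelled as an opaque byte vector, and `limit_bytes` merely plays the role that `metadata_cache_limit` plays in the real configuration.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Hypothetical byte-bounded LRU cache keyed by file path.
struct BoundedMetadataCache {
    limit_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, Arc<Vec<u8>>>,
    /// Front is the least recently used path, back the most recently used.
    recency: VecDeque<String>,
}

impl BoundedMetadataCache {
    fn new(limit_bytes: usize) -> Self {
        Self {
            limit_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            recency: VecDeque::new(),
        }
    }

    /// Returns the cached metadata and marks the path as recently used.
    fn get(&mut self, path: &str) -> Option<Arc<Vec<u8>>> {
        let hit = self.entries.get(path).cloned()?;
        self.recency.retain(|p| p.as_str() != path);
        self.recency.push_back(path.to_string());
        Some(hit)
    }

    /// Inserts metadata, evicting least recently used entries until the
    /// total size fits within `limit_bytes`.
    fn put(&mut self, path: &str, metadata: Vec<u8>) {
        // Drop any previous entry for this path first.
        if let Some(old) = self.entries.remove(path) {
            self.used_bytes -= old.len();
            self.recency.retain(|p| p.as_str() != path);
        }
        // An entry larger than the whole budget is simply not cached.
        if metadata.len() > self.limit_bytes {
            return;
        }
        self.used_bytes += metadata.len();
        self.entries.insert(path.to_string(), Arc::new(metadata));
        self.recency.push_back(path.to_string());
        while self.used_bytes > self.limit_bytes {
            let Some(victim) = self.recency.pop_front() else { break };
            if let Some(evicted) = self.entries.remove(&victim) {
                self.used_bytes -= evicted.len();
            }
        }
    }
}
```

The linear scan in `retain` keeps the sketch short; a production cache would typically use a linked structure for constant-time recency updates.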
   
   Follow-up noted in a code TODO: metadata I/O is not reflected in 
`bytes_scanned` because `fetch_metadata` reads via `ObjectStore::get_ranges` 
directly. A byte-counting `ObjectStore` wrapper would surface it.
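
The byte-counting wrapper idea can be sketched as follows. This is an illustration only, under simplifying assumptions: `RangeStore` is a hypothetical, synchronous, infallible trait with a single method, whereas the real `ObjectStore` trait is async, fallible, and has many more methods. `InMemoryStore` and `CountingStore` are likewise made-up names.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical, simplified stand-in for an object store.
trait RangeStore {
    fn get_ranges(&self, path: &str, ranges: &[Range<usize>]) -> Vec<Vec<u8>>;
}

/// Toy backend that serves every path from one in-memory buffer.
struct InMemoryStore {
    data: Vec<u8>,
}

impl RangeStore for InMemoryStore {
    fn get_ranges(&self, _path: &str, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        ranges.iter().map(|r| self.data[r.clone()].to_vec()).collect()
    }
}

/// Wrapper that delegates to `inner` and records how many bytes came back.
struct CountingStore<S> {
    inner: S,
    bytes_read: AtomicU64,
}

impl<S: RangeStore> CountingStore<S> {
    fn new(inner: S) -> Self {
        Self {
            inner,
            bytes_read: AtomicU64::new(0),
        }
    }

    /// Total bytes returned through this wrapper so far.
    fn bytes_read(&self) -> u64 {
        self.bytes_read.load(Ordering::Relaxed)
    }
}

impl<S: RangeStore> RangeStore for CountingStore<S> {
    fn get_ranges(&self, path: &str, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        let out = self.inner.get_ranges(path, ranges);
        // Count the bytes actually returned, not the bytes requested.
        let n: u64 = out.iter().map(|buf| buf.len() as u64).sum();
        self.bytes_read.fetch_add(n, Ordering::Relaxed);
        out
    }
}
```

Because the wrapper implements the same trait as the store it wraps, a metadata reader that calls `get_ranges` directly would be counted without any change on its side, which is the property the TODO is after.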
   
   ## How are these changes tested?
   
   New `caches_full_metadata_with_page_index` unit test in `parquet_exec.rs`: 
writes a Parquet file with a page index, runs a scan, and asserts the 
`RuntimeEnv` metadata cache holds metadata with the column and offset index 
loaded.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

