mbutrovich opened a new pull request, #4707: URL: https://github.com/apache/datafusion-comet/pull/4707
## Which issue does this PR close?

Part of #3978.

## Rationale for this change

Comet's `native_datafusion` scan used a hand-rolled `CachingParquetReaderFactory` that cached only the Parquet footer. DataFusion's opener loads the page index in a separate step, so it re-fetched the page index from object storage on every open (the footer-only cache never held it). For non-selective `IS NOT NULL` predicates on non-null join keys (e.g. TPC-DS q88), the page index prunes nothing yet is re-read per split, adding gigabytes of wasted I/O at scale. The hand-rolled cache was also unbounded (no eviction).

## What changes are included in this PR?

- Replace `CachingParquetReaderFactory` with DataFusion's `CachedParquetFileReaderFactory`, backed by the per-task `RuntimeEnv` file-metadata cache (bounded LRU, `metadata_cache_limit`). It loads the full metadata including the page index once per file, so the opener no longer re-fetches it.
- Delete `parquet_read_cached_factory.rs`.

Follow-up noted in a code TODO: metadata I/O is not reflected in `bytes_scanned` because `fetch_metadata` reads via `ObjectStore::get_ranges` directly. A byte-counting `ObjectStore` wrapper would surface it.

## How are these changes tested?

New `caches_full_metadata_with_page_index` unit test in `parquet_exec.rs`: writes a Parquet file with a page index, runs a scan, and asserts the `RuntimeEnv` metadata cache holds metadata with the column and offset index loaded.
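To illustrate the caching behaviour described above, here is a minimal, std-only sketch of a byte-bounded LRU metadata cache. It is not DataFusion's implementation: `FileMetadata`, `MetadataCache`, `get_or_load`, and the byte sizes are hypothetical names and values chosen for illustration. The point it shows is that when the cached entry holds the full metadata (footer plus page index), repeated opens of the same file trigger a single load, and a byte limit forces eviction of the least recently used entry.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Hypothetical stand-in for full Parquet metadata (footer plus page index).
#[derive(Debug)]
struct FileMetadata {
    footer_bytes: usize,
    page_index_bytes: usize,
}

impl FileMetadata {
    fn size(&self) -> usize {
        self.footer_bytes + self.page_index_bytes
    }
}

/// Illustrative byte-bounded LRU cache keyed by file path.
struct MetadataCache {
    limit_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, Arc<FileMetadata>>,
    lru: VecDeque<String>, // front = least recently used
    fetches: usize,        // number of times `load` was actually called
}

impl MetadataCache {
    fn new(limit_bytes: usize) -> Self {
        Self {
            limit_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            lru: VecDeque::new(),
            fetches: 0,
        }
    }

    /// Return cached metadata, or call `load` once and cache the result.
    fn get_or_load(
        &mut self,
        path: &str,
        load: impl FnOnce() -> FileMetadata,
    ) -> Arc<FileMetadata> {
        if let Some(meta) = self.entries.get(path).cloned() {
            self.touch(path);
            return meta;
        }
        self.fetches += 1;
        let meta = Arc::new(load());
        let size = meta.size();
        if size > self.limit_bytes {
            return meta; // too large to cache; hand it back uncached
        }
        // Evict least recently used entries until the new one fits.
        while self.used_bytes + size > self.limit_bytes {
            match self.lru.pop_front() {
                Some(old) => {
                    if let Some(evicted) = self.entries.remove(&old) {
                        self.used_bytes -= evicted.size();
                    }
                }
                None => break,
            }
        }
        self.used_bytes += size;
        self.entries.insert(path.to_string(), Arc::clone(&meta));
        self.lru.push_back(path.to_string());
        meta
    }

    /// Mark `path` as most recently used.
    fn touch(&mut self, path: &str) {
        if let Some(pos) = self.lru.iter().position(|p| p == path) {
            if let Some(key) = self.lru.remove(pos) {
                self.lru.push_back(key);
            }
        }
    }
}

fn main() {
    let mut cache = MetadataCache::new(1024);
    // Four splits of the same file: only the first open loads metadata.
    for _split in 0..4 {
        let _meta = cache.get_or_load("part-0.parquet", || FileMetadata {
            footer_bytes: 100,
            page_index_bytes: 400,
        });
    }
    println!("metadata loads: {}", cache.fetches);
}
```

A footer-only cache corresponds to storing just `footer_bytes` in the entry, which leaves the page index to be fetched again on every open.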

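The byte-counting wrapper mentioned as a follow-up could look roughly like the sketch below. It assumes a simplified, synchronous `RangeStore` trait invented for this example; the real `ObjectStore` trait is async and has a different signature, so this only shows the decorator idea: delegate the ranged read to the inner store and add the returned byte count to a shared counter.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical, simplified stand-in for an object store's ranged read.
trait RangeStore {
    fn get_ranges(&self, path: &str, ranges: &[Range<usize>]) -> Vec<Vec<u8>>;
}

/// Toy backing store holding a single in-memory "file".
struct InMemoryStore {
    data: Vec<u8>,
}

impl RangeStore for InMemoryStore {
    fn get_ranges(&self, _path: &str, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        ranges.iter().map(|r| self.data[r.clone()].to_vec()).collect()
    }
}

/// Wrapper that counts every byte returned by the inner store.
struct CountingStore<S> {
    inner: S,
    bytes_read: AtomicU64,
}

impl<S: RangeStore> RangeStore for CountingStore<S> {
    fn get_ranges(&self, path: &str, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        let chunks = self.inner.get_ranges(path, ranges);
        let n: usize = chunks.iter().map(Vec::len).sum();
        self.bytes_read.fetch_add(n as u64, Ordering::Relaxed);
        chunks
    }
}

fn main() {
    let store = CountingStore {
        inner: InMemoryStore { data: vec![0u8; 64] },
        bytes_read: AtomicU64::new(0),
    };
    // Two ranged reads, e.g. a footer range and a page-index range.
    let _chunks = store.get_ranges("part-0.parquet", &[0..8, 32..48]);
    println!("bytes read: {}", store.bytes_read.load(Ordering::Relaxed));
}
```

Because the counter sits at the store boundary, metadata reads would be counted regardless of which code path issues them.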