[PR] perf: parallelize CPU-heavy parquet metadata parsing in `list_files_for_scan` [datafusion]

via GitHub Fri, 17 Apr 2026 03:13:07 -0700


Dandandan opened a new pull request, #21692:
URL: https://github.com/apache/datafusion/pull/21692


   ## Which issue does this PR close?
   
   - Part of #19971.
   
   ## Rationale for this change
   
   On cold runs, `list_files_for_scan` bottlenecks on a single thread. The 
heavy part is not file-listing IO but CPU work inside parquet metadata 
decode/merge/statistics extraction. Profiling shows all of this collapsed onto 
one async task even though `list_files_for_scan` already drives per-file work 
with `.buffer_unordered(meta_fetch_concurrency)`. `buffer_unordered` polls 
its futures concurrently but within a single task, so CPU-bound futures run 
one after another.
   
   ## What changes are included in this PR?
   
   Wrap the metadata fetch + statistics/ordering extraction in 
`ParquetFormat::infer_stats_and_ordering` with 
`SpawnedTask::spawn_blocking(move || handle.block_on(...))`, so each call runs 
on a blocking-pool thread instead of inside the task that drives the stream. 
Combined with the existing `buffer_unordered(meta_fetch_concurrency)` in 
`list_files_for_scan`, we now get real parallelism across files.
   
   This follows the same pattern as #19969 (parallelizing `infer_schema`).
   
   Trait signatures are unchanged; only the parquet implementation is touched.
   
   ## Are these changes tested?
   
   Covered by existing parquet / catalog-listing tests (`cargo test -p 
datafusion-datasource-parquet`, `cargo test -p datafusion-catalog-listing`).
   
   ## Are there any user-facing changes?
   
   No API changes. Cold-start listing of parquet tables with many files should 
be noticeably faster on multi-core systems.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


