Dandandan opened a new pull request, #21692: URL: https://github.com/apache/datafusion/pull/21692
## Which issue does this PR close?

- Part of #19971.

## Rationale for this change

On cold runs, `list_files_for_scan` bottlenecks on a single thread. The heavy part is not file listing IO but CPU work inside parquet metadata decode/merge/statistics extraction. Profiling shows all of this collapsed onto one async task even though `list_files_for_scan` already drives per-file work with `.buffer_unordered(meta_fetch_concurrency)`. `buffer_unordered` polls futures concurrently on a single task, so CPU-bound futures serialize.

## What changes are included in this PR?

Wrap the metadata fetch + statistics/ordering extraction in `ParquetFormat::infer_stats_and_ordering` with `SpawnedTask::spawn_blocking(move || handle.block_on(...))`, so each call runs on a separate worker thread. Combined with the existing `buffer_unordered(meta_fetch_concurrency)` in `list_files_for_scan`, we now get real parallelism across files.

This follows the same pattern as #19969 (parallelizing `infer_schema`). Trait signatures are unchanged; only the parquet implementation is touched.

## Are these changes tested?

Covered by existing parquet / catalog-listing tests (`cargo test -p datafusion-datasource-parquet`, `cargo test -p datafusion-catalog-listing`).

## Are there any user-facing changes?

No API changes. Cold-start listing of parquet tables with many files should be noticeably faster on multi-core systems.
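
To illustrate the rationale above, here is a minimal std-only sketch, not the code in this PR. It models the two situations: futures polled concurrently from one task (as `buffer_unordered` does) versus each call offloaded to its own thread. Everything in it is a hypothetical stand-in: `decode_metadata` only burns CPU to imitate metadata decoding, the hand-rolled poll loop imitates a single async task, and plain `std::thread::spawn` takes the place of `SpawnedTask::spawn_blocking` plus `handle.block_on(...)`.

```rust
use std::collections::HashSet;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, ThreadId};

/// Hypothetical stand-in for CPU-heavy parquet metadata decoding and
/// statistics extraction: burns some cycles and reports the thread it ran on.
fn decode_metadata(file_id: u64) -> (u64, ThreadId) {
    let mut acc = file_id;
    for i in 0..200_000u64 {
        acc = acc.wrapping_mul(6364136223846793005).wrapping_add(i);
    }
    std::hint::black_box(acc);
    (file_id, thread::current().id())
}

/// Waker that does nothing; enough for futures that never return `Pending`.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

type BoxFut = Pin<Box<dyn Future<Output = (u64, ThreadId)>>>;

/// Polls every per-file future from the calling thread, which is how a single
/// async task drives the futures held by a buffered stream. The CPU work
/// inside each future therefore runs one after another on that thread.
fn poll_all_on_one_task(n: u64) -> Vec<(u64, ThreadId)> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut pending: Vec<BoxFut> = (0..n)
        .map(|id| -> BoxFut { Box::pin(async move { decode_metadata(id) }) })
        .collect();
    let mut done = Vec::new();
    while !pending.is_empty() {
        pending.retain_mut(|fut| match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => {
                done.push(out);
                false // finished: drop it from the pending set
            }
            Poll::Pending => true,
        });
    }
    done
}

/// Moves each per-file call onto its own thread, so the CPU work can run in
/// parallel. This sketch does not bound the number of threads; in the PR the
/// bound comes from the existing `buffer_unordered(meta_fetch_concurrency)`.
fn offload_each_to_thread(n: u64) -> Vec<(u64, ThreadId)> {
    let handles: Vec<_> = (0..n)
        .map(|id| thread::spawn(move || decode_metadata(id)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let single: HashSet<ThreadId> =
        poll_all_on_one_task(4).into_iter().map(|(_, t)| t).collect();
    let offloaded: HashSet<ThreadId> =
        offload_each_to_thread(4).into_iter().map(|(_, t)| t).collect();
    println!("threads used when polled on one task: {}", single.len());
    println!("threads used when offloaded: {}", offloaded.len());
}
```

In the first function every result carries the caller's thread ID, because nothing ever moves the work elsewhere. In the second, each call reports a different thread. In a real Tokio runtime the blocking pool would reuse threads rather than spawn one per file.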
