Tushar7012 commented on issue #19971: URL: https://github.com/apache/datafusion/issues/19971#issuecomment-3795665464
Hi @Dandandan, I'd like to take on this issue! I've been actively contributing to DataFusion and am excited to work on this performance optimization, which is part of the ClickBench EPIC (#18489). I've analyzed the current implementation; here is my understanding and proposed approach.

**Current bottleneck:** The `list_files_for_scan` method in `catalog-listing/src/table.rs` currently drives the `pruned_partition_list` calls through `try_join_all`, and while statistics collection is already buffered with `buffer_unordered` (controlled by `meta_fetch_concurrency`), the core file listing and partition evaluation can be parallelized further.

**Proposed solution:**
1. **Parallelize `pruned_partition_list` calls** - when multiple `table_paths` exist, process them concurrently using `FuturesUnordered` or similar (see the sketch after this list).
2. **Leverage the existing `meta_fetch_concurrency` config** - reuse this setting (default 32) to bound the parallelism, staying consistent with existing patterns.
3. **Stream-based parallelism** - use `flatten_unordered` more aggressively in the file discovery phase, not just for metadata fetching.
4. **Parallel partition filter evaluation** - the `filter_partitions` step in `helpers.rs` could benefit from parallel evaluation when dealing with many partitioned files.

**Testing plan:**
- Benchmark with ClickBench queries to measure the cold-start improvement.
- Test with partitioned datasets (e.g. TPC-H, TPC-DS) where `list_files_for_scan` is called frequently.
- Ensure there is no regression in single-threaded scenarios.

I'm familiar with the async patterns used in DataFusion and the `object_store` crate. I'd be happy to discuss the approach before implementation if you'd prefer. Could you please assign this to me?
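To make point 1 concrete, here is a minimal, self-contained sketch of the fan-out pattern I have in mind. `list_partition_files` is a hypothetical stand-in for the per-path work that `pruned_partition_list` does today, and `meta_fetch_concurrency` is shown as a plain `usize` rather than read from the session config; the real change would live in `list_files_for_scan`.

```rust
use futures::stream::{self, StreamExt, TryStreamExt};

/// Hypothetical stand-in for listing (and partition-pruning) one table path.
/// In DataFusion this would hit the object store and apply partition filters.
async fn list_partition_files(table_path: String) -> Result<Vec<String>, std::io::Error> {
    Ok(vec![format!("{table_path}/part-0.parquet")])
}

#[tokio::main]
async fn main() -> Result<(), std::io::Error> {
    let table_paths = vec![
        "s3://bucket/table/a".to_string(),
        "s3://bucket/table/b".to_string(),
    ];
    // Reuse the existing `meta_fetch_concurrency` setting (default 32) as the
    // bound on how many paths are listed at once.
    let meta_fetch_concurrency = 32;

    let file_groups: Vec<Vec<String>> = stream::iter(table_paths)
        .map(list_partition_files)
        // Drive up to `meta_fetch_concurrency` listings concurrently instead
        // of awaiting each path one after another.
        .buffer_unordered(meta_fetch_concurrency)
        .try_collect()
        .await?;

    println!("{} file groups listed", file_groups.len());
    Ok(())
}
```

If preserving per-path ordering turns out not to matter, the same shape also works with `flatten_unordered` to merge the per-path file streams directly, which is the direction point 3 suggests.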
