Tushar7012 commented on issue #19971:
URL: https://github.com/apache/datafusion/issues/19971#issuecomment-3795665464

   Hi @Dandandan 
   
   I'd like to take on this issue! I've been actively contributing to 
DataFusion and am excited to work on this performance optimization that's part 
of the ClickBench EPIC (#18489).
   
   I've analyzed the current implementation and here's my understanding and 
proposed approach:
   
   **Current Bottleneck:**
   The `list_files_for_scan` method in `catalog-listing/src/table.rs` currently 
awaits the `pruned_partition_list` calls through `try_join_all`, which 
completes only once every call has finished, so later stages cannot start on 
early results. There is already `buffer_unordered` for statistics collection 
(controlled by `meta_fetch_concurrency`), but the core file listing and 
partition evaluation can be parallelized further.
   
   **Proposed Solution:**
   1. **Parallelize `pruned_partition_list` calls** - When multiple 
`table_paths` exist, process them concurrently using `FuturesUnordered` or 
similar
   2. **Leverage existing `meta_fetch_concurrency` config** - Reuse this 
setting (default=32) to control parallelism, maintaining consistency with 
existing patterns
   3. **Stream-based parallelism** - Use `flatten_unordered` more aggressively 
in the file discovery phase, not just for metadata fetching
   4. **Consider partition filter evaluation** - The `filter_partitions` step 
in `helpers.rs` could benefit from parallel evaluation when dealing with many 
partitioned files
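   
   To make points 1 and 2 concrete, here is a minimal sketch of bounded 
concurrent listing across several table paths. It is not DataFusion code: 
`list_prefix` and `list_all` are hypothetical names, the listing is faked, and 
it uses OS threads from the standard library instead of async streams so that 
it needs no external crates. In the real code path the analogous tool would 
presumably be a stream combinator such as `buffer_unordered` from the `futures` 
crate, with the limit taken from `meta_fetch_concurrency`.

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Mutex};
use std::thread;

/// Hypothetical stand-in for listing one table path. A real implementation
/// would query an object store instead of fabricating file names.
fn list_prefix(prefix: &str) -> Vec<String> {
    (0..3).map(|i| format!("{prefix}/part-{i}.parquet")).collect()
}

/// Lists every prefix using at most `concurrency` worker threads.
/// Results arrive in completion order, so they are sorted before returning.
fn list_all(prefixes: &[&str], concurrency: usize) -> Vec<String> {
    // Shared work queue: each worker repeatedly takes the next prefix.
    let queue = Mutex::new(prefixes.iter().copied().collect::<VecDeque<&str>>());
    let (tx, rx) = mpsc::channel::<Vec<String>>();
    // Never spawn more workers than there are prefixes, and at least one.
    let workers = concurrency.max(1).min(prefixes.len().max(1));

    thread::scope(|s| {
        for _ in 0..workers {
            let tx = tx.clone();
            let queue = &queue;
            s.spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so the (slow) listing below runs without holding the lock.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(prefix) => tx.send(list_prefix(prefix)).unwrap(),
                    None => break,
                }
            });
        }
    });

    // All workers have exited; drop the last sender so the receiver ends.
    drop(tx);
    let mut files: Vec<String> = rx.into_iter().flatten().collect();
    files.sort();
    files
}
```

   With this sketch, `list_all(&["a", "b"], 32)` yields the six fabricated 
paths under `a` and `b`, and passing a concurrency of 1 degrades to sequential 
listing, which is the single-threaded case the testing plan should cover.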
   
   **Testing Plan:**
   - Benchmark with ClickBench queries to measure cold-start improvement
   - Test with partitioned datasets (like TPC-H, TPC-DS) where 
`list_files_for_scan` is called frequently
   - Ensure no regression in single-threaded scenarios
   
   I'm familiar with the async patterns used in DataFusion and the 
`object_store` crate. I'd be happy to discuss the approach before 
implementation if you'd prefer.
   
   Could you please assign this to me?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

