Re: [I] Parallelize `list_files_for_scan` [datafusion]

via GitHub Sun, 25 Jan 2026 01:05:13 -0800


Tushar7012 commented on issue #19971:
URL: https://github.com/apache/datafusion/issues/19971#issuecomment-3796243862


   Thanks for the assignment and the pointer to PR #19969! 
   
   I've reviewed your `infer_schema` parallelization approach and understand 
the pattern now: spawning tokio tasks directly within the function rather than 
building a multi-layered solution.
   
   For `list_files_for_scan`, I'll follow the same approach:
   1. Spawn parallel tasks within the function to list files concurrently
   2. Use `JoinSet` or similar to collect results
   3. Keep the existing API surface unchanged
   
   I'll also note that benchmarks may not capture the improvement well since 
the gain is primarily on cold start / first query (before caching kicks in).
   
   I'll start working on a draft PR following this pattern. Let me know if 
there's anything specific I should be aware of for the `list_files_for_scan` 
case!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

