Tushar7012 commented on PR #20023:
URL: https://github.com/apache/datafusion/pull/20023#issuecomment-3816000733
Hey @2010YOUY01,
Thank you for the feedback and for sharing the guide. I apologize if my
previous responses felt disconnected; I’ve spent the last few hours doing a
deep dive into the implementation and the performance trade-offs to ensure I
fully understand the impact.
I have updated the PR with a few key refinements:
Parallelized IO: Switched the listing logic to use tokio::task::JoinSet.
This allows us to process multiple table paths concurrently, which is critical
for large datasets distributed across many prefixes.
Performance Verification: I’ve added a benchmark_parallel_listing test
directly in table.rs. On my local machine, I verified that for 10 paths with a
100ms simulated network latency, the execution time dropped from 1000ms
(sequential) to ~102ms (parallel).
WASM Compatibility: I kept the try_join_all fallback specifically for WASM
targets since JoinSet isn't supported there, ensuring the build remains stable
across all platforms.
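To illustrate where the numbers above come from, here is a minimal sketch, not
the code in this PR. It uses std::thread::scope as a stand-in for
tokio::task::JoinSet so that it runs without external crates. The names
list_path, list_sequential, and list_parallel, and the returned file names, are
hypothetical. With N paths and a fixed per-path latency, the sequential version
waits roughly N times the latency, while the concurrent version waits roughly
one latency plus scheduling overhead.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for listing one table path.
/// The sleep simulates network latency; the returned file name is made up.
fn list_path(path: &str, latency: Duration) -> Vec<String> {
    thread::sleep(latency);
    vec![format!("{path}/part-0.parquet")]
}

/// Lists every path one after another: total time is about N * latency.
fn list_sequential(paths: &[String], latency: Duration) -> Vec<Vec<String>> {
    paths.iter().map(|p| list_path(p, latency)).collect()
}

/// Lists every path concurrently: total time is about one latency.
/// Scoped threads are used here only as an illustration; the PR itself
/// describes using tokio::task::JoinSet for the same purpose.
fn list_parallel(paths: &[String], latency: Duration) -> Vec<Vec<String>> {
    thread::scope(|s| {
        // Spawn one worker per path, keeping handles in input order.
        let handles: Vec<_> = paths
            .iter()
            .map(|p| s.spawn(move || list_path(p, latency)))
            .collect();
        // Joining in spawn order preserves the order of the input paths.
        handles
            .into_iter()
            .map(|h| h.join().expect("listing thread panicked"))
            .collect()
    })
}

fn main() {
    let paths: Vec<String> = (0..10).map(|i| format!("prefix-{i}")).collect();
    let latency = Duration::from_millis(100);

    let start = Instant::now();
    let seq = list_sequential(&paths, latency);
    let seq_elapsed = start.elapsed();

    let start = Instant::now();
    let par = list_parallel(&paths, latency);
    let par_elapsed = start.elapsed();

    // Both strategies must return the same listings in the same order.
    assert_eq!(seq, par);
    println!("sequential: {seq_elapsed:?}, parallel: {par_elapsed:?}");
}
```

Exact timings depend on the machine, so the sketch prints them rather than
asserting on them.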
I’ve also cleaned up the imports and resolved the linting issues. I’m
genuinely interested in improving DataFusion's performance here and would
appreciate a fresh review of these technical changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]