Dandandan commented on PR #19761: URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3743876310
So in summary about >10% faster on average, but some slowdowns. * I am wondering what the speedup would be if we just load a single batch per partition instead of based on size - perhaps the speedup is similar with reduced overhead? * While I am excited about the speedup, I wonder if this is not mostly "hiding" some "issues" with our current join implementation: hashing and concatenation is currently single-threaded for CollectLeft, both could be parallelized. Also we could lower the overhead of the parallel hash join and reduce the threshold to make the total query run more in parallel and avoid time spent waiting on one not well parallelizable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
