gabotechs commented on PR #19761: URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3883127052
> In my opinion, it would be better to have a more focused solution for this problem (right deep join)

This is not exactly the problem this PR tries to solve. Deeply nested joins just aggravate it, but the problem is still there even in a single hash join.

It boils down to how streams work in Rust. Unlike in other languages, streams in Rust do nothing unless polled. This means that not polling the probe side early is a missed opportunity to make progress while the build side is being built. In real-world scenarios, where queries might not be CPU-bound, this gets worse.

> would a better solution be to make the parallelism more explicit?

The PR attempts to provide that explicitness through a new node in the plan (BufferExec) that explicitly models and measures data buffering.

> Some way to explicitly parallelize just the lookup table sides (left sides)

This could show some speedups, although there are still opportunities to make progress on the probe side while the build sides are being built. Note that speedups would also be visible with a single hash join, as both sides can be worked on in parallel.
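
To illustrate the point about laziness, here is a minimal sketch using only the standard library. It uses a plain `Future` rather than a `Stream`, since both follow the same poll-driven model. The names `NoopWake` and `probe_started_before_and_after_poll` are hypothetical and exist only for this example; nothing here is DataFusion code or the PR's implementation.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical waker that does nothing; it only lets us poll by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns whether the "probe side" had started (before polling, after polling).
fn probe_started_before_and_after_poll() -> (bool, bool) {
    let started = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&started);

    // Stand-in for the probe side: creating it runs none of its body.
    let probe = async move {
        flag.store(true, Ordering::SeqCst);
    };

    // Imagine the build side being built here. The probe side is idle
    // the whole time because nobody has polled it yet.
    let before = started.load(Ordering::SeqCst);

    // Poll the probe side once, by hand.
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut probe = pin!(probe);
    let _poll: Poll<()> = probe.as_mut().poll(&mut cx);

    let after = started.load(Ordering::SeqCst);
    (before, after)
}
```

Calling `probe_started_before_and_after_poll()` reports that the probe body had not run before the first poll and had run after it. A buffering node sitting above the probe side would, under this model, be the component that issues those early polls.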
