Samyak2 commented on PR #19761: URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3861751421
I'm trying to understand the problem that this PR solves. Please correct me if I'm wrong. (Thanks to @faizkothari for explaining this!)

In a right-deep join (many lookup tables joined against one probe table), we would ideally build all the lookup tables in parallel. Currently, the right side (probe side) of a join is not executed at all until the left side (the lookup table) has been fully executed and built. If the probe side itself contains more lookup tables to build, those builds are deferred until the first one completes. This is the cascading problem you mentioned:

> This gets worst when multiple hash joins are chained together: they will get executed in cascade as if they were domino pieces, which has the benefit of leaving a small memory footprint, but underutilizes the resources of the machine for executing the query faster.

This PR solves the problem by unconditionally starting the probe side of the join and running it until there is enough data to fill the buffer. That also explains the degradation with dynamic filters: the probe side starts before the dynamic filters have been built, so it cannot benefit from them.

In my opinion, a more focused solution for the right-deep join case would be better. Instead of unconditionally starting the probe side, could we make the parallelism explicit, with some way to parallelize only the lookup table sides (left sides)? This could potentially avoid the slowdowns we see in certain queries. I haven't looked into the slowdowns yet, so I could be wrong here.

What do you all think?
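To make the suggestion concrete, here is a rough sketch of the difference between cascaded builds and explicitly parallel builds. It is a toy model in plain Rust, not DataFusion code: `LookupRows`, `build_cascaded`, `build_in_parallel`, and `probe` are hypothetical names, and it assumes every lookup table fits in memory at the same time.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical build-side input: (join key, payload) rows of one lookup table.
type LookupRows = Vec<(u64, &'static str)>;
/// Hypothetical hash table built from one lookup table.
type HashTable = HashMap<u64, &'static str>;

/// Build the hash table for a single lookup table.
fn build(rows: &LookupRows) -> HashTable {
    rows.iter().copied().collect()
}

/// Cascade: each build starts only after the previous one has finished.
fn build_cascaded(inputs: &[LookupRows]) -> Vec<HashTable> {
    inputs.iter().map(build).collect()
}

/// Explicit parallelism: all builds are spawned up front, and the caller
/// gets the tables back only once every build has completed.
fn build_in_parallel(inputs: &[LookupRows]) -> Vec<HashTable> {
    thread::scope(|s| {
        // Spawn every build before joining any of them.
        let handles: Vec<_> = inputs
            .iter()
            .map(|rows| s.spawn(move || build(rows)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("build thread panicked"))
            .collect()
    })
}

/// Probe side of a chain of inner joins: a key survives only if it
/// matches in every lookup table. Probing starts after all builds are done.
fn probe(probe_keys: &[u64], tables: &[HashTable]) -> Vec<(u64, Vec<&'static str>)> {
    probe_keys
        .iter()
        .filter_map(|k| {
            let payloads: Option<Vec<_>> =
                tables.iter().map(|t| t.get(k).copied()).collect();
            payloads.map(|p| (*k, p))
        })
        .collect()
}
```

Both build functions return the same tables; only the scheduling differs. In this model the probe side still starts after all builds finish, so anything derived from the build sides (such as dynamic filters) would already be available, at the cost of holding all the hash tables in memory at once.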
