Samyak2 commented on PR #19761:
URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3861751421

   I'm trying to understand the problem that this PR solves. Please correct me 
if I'm wrong. (Thanks to @faizkothari for explaining this!)
   
   When there's a right-deep join (many lookup tables joined to one probe 
table), we would ideally want to build all the lookup tables in parallel. What 
happens currently is that the right side (probe side) of a join is not 
executed at all until its left side (the lookup table) has been completely 
executed and built. Even if the probe side contains more lookup tables to 
build, they are deferred until the first one completes. This is the cascading 
problem you mentioned:
   
   > This gets worst when multiple hash joins are chained together: they will 
get executed in cascade as if they were domino pieces, which has the benefit of 
leaving a small memory footprint, but underutilizes the resources of the 
machine for executing the query faster.
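   
   To make the cascade concrete, here is a rough back-of-the-envelope model. 
It is only a sketch: the function names and the millisecond figures are made 
up, it assumes there are enough cores to run every build at once, and it 
ignores probe time entirely.

```rust
/// Hypothetical model: each entry is the time (ms) to build one lookup table.
/// In the cascade, build N+1 only starts once build N has finished.
fn cascaded_build_ms(builds: &[u64]) -> u64 {
    builds.iter().sum()
}

/// If all builds start together, the slowest one determines the wait.
fn concurrent_build_ms(builds: &[u64]) -> u64 {
    builds.iter().copied().max().unwrap_or(0)
}

fn main() {
    let builds = [300, 200, 100]; // illustrative numbers, not measurements
    println!("cascaded:   {} ms", cascaded_build_ms(&builds));
    println!("concurrent: {} ms", concurrent_build_ms(&builds));
}
```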
   
   The way this PR solves the problem is by unconditionally starting the probe 
side of the join and running it until there's enough data to fill the buffer. 
This also explains why there is a degradation with dynamic filters: the probe 
side starts before the dynamic filters are built, so the early probe reads 
cannot benefit from them.
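   
   A simplified way to picture the dynamic filter effect, as I understand it: 
rows pulled into the buffer before the build finishes are read unfiltered, 
while rows read afterwards can be pruned at the scan. The function below is a 
hypothetical model of that, not DataFusion code; `buffered` stands in for the 
buffer size, counted in rows.

```rust
/// Hypothetical model: counts the probe rows that get materialized when the
/// first `buffered` rows are read before the dynamic filter exists and the
/// remaining rows are read after it is available.
fn rows_materialized(
    probe_keys: &[u32],
    buffered: usize,
    filter: impl Fn(u32) -> bool,
) -> usize {
    let eager = buffered.min(probe_keys.len());
    // Rows read after the build finishes are pruned by the dynamic filter.
    let pruned_scan = probe_keys[eager..]
        .iter()
        .filter(|&&k| filter(k))
        .count();
    // Rows buffered early are all materialized, matching or not.
    eager + pruned_scan
}

fn main() {
    let keys: Vec<u32> = (0..10).collect();
    let matches_build = |k: u32| k < 2; // pretend only keys 0 and 1 join

    // Probe waits for the build: only the 2 matching rows are read.
    println!("lazy probe:  {}", rows_materialized(&keys, 0, matches_build));
    // Probe starts early with a 4-row buffer: 4 rows are read.
    println!("eager probe: {}", rows_materialized(&keys, 4, matches_build));
}
```

   The extra rows in the eager case are the ones the filter would otherwise 
have skipped, which is one plausible source of the regressions.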
   
   In my opinion, it would be better to have a more focused solution for this 
problem (right-deep joins). Instead of unconditionally starting the probe side, 
would a better solution be to make the parallelism more explicit, with some 
way to parallelize just the lookup table sides (left sides)? This could 
potentially avoid the slowdowns we see in certain queries. I haven't looked 
into the slowdowns yet, so I could be wrong here.
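   
   Roughly what I have in mind, as a toy sketch using plain threads and 
`HashMap`s rather than DataFusion's execution plans. `build_all` and `probe` 
are hypothetical names, and for simplicity every lookup table is joined on the 
same key. All build sides run concurrently, and the probe only starts once 
every table is ready.

```rust
use std::collections::HashMap;
use std::thread;

/// Builds one hash table per lookup table, each on its own scoped thread.
fn build_all(lookups: &[Vec<(u32, &'static str)>]) -> Vec<HashMap<u32, &'static str>> {
    thread::scope(|s| {
        // Start every build before waiting on any of them.
        let handles: Vec<_> = lookups
            .iter()
            .map(|rows| s.spawn(move || rows.iter().copied().collect::<HashMap<_, _>>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("build thread panicked"))
            .collect()
    })
}

/// Probes every table; a key survives only if all lookups match it.
fn probe(
    keys: &[u32],
    tables: &[HashMap<u32, &'static str>],
) -> Vec<(u32, Vec<&'static str>)> {
    keys.iter()
        .filter_map(|&k| {
            let hits: Option<Vec<_>> = tables.iter().map(|t| t.get(&k).copied()).collect();
            hits.map(|h| (k, h))
        })
        .collect()
}

fn main() {
    let lookups = vec![vec![(1, "a"), (2, "b")], vec![(2, "x"), (3, "y")]];
    let tables = build_all(&lookups); // the probe side has not been touched yet
    println!("{:?}", probe(&[1, 2, 3], &tables));
}
```

   Because the probe side is untouched until the builds finish, any dynamic 
filters would already be available when it starts, at the cost of holding all 
the hash tables in memory at once.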
   
   What do you all think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

