szehon-ho opened a new pull request, #41398:
URL: https://github.com/apache/spark/pull/41398

   ### What changes were proposed in this pull request?
   Add support for shuffle-hash join for following scenarios:
   
   * left outer join with left-side build
   * right outer join with right-side build
   The algorithm is similar to SPARK-32399, which supports shuffle-hash join 
for full outer join.
   
   The same methods fullOuterJoinWithUniqueKey and 
fullOuterJoinWithNonUniqueKey are improved to support the new case. These 
methods are called after the HashedRelation is already constructed of the build 
side, and do these two iterations:
   
   1.  Iterate Stream side.
     a. If find match on build side, mark.
     b. If no match on build side, join with null build-side row and add to 
result
   2. Iterate build side.
     a. If find marked for match, add joined row to result
     b. If no match marked, join with null stream-side row
   
   The left outer join with left-side build, and right outer join with 
right-side build, need only a subset of these logics, namely replacing 1b above 
with a no-op.
   
   Codegen is left for a follow-up PR.
   
   ### Why are the changes needed?
   For joins of these types, shuffle-hash join can be more performant than 
sort-merge join, especially if the big table is large, as it skips an expensive 
sort of the big table.
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   Unit test in JoinSuite.scala


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to