szehon-ho opened a new pull request, #41398:
URL: https://github.com/apache/spark/pull/41398
### What changes were proposed in this pull request?
Add support for shuffle-hash join for following scenarios:
* left outer join with left-side build
* right outer join with right-side build
The algorithm is similar to SPARK-32399, which supports shuffle-hash join
for full outer join.
The same methods fullOuterJoinWithUniqueKey and
fullOuterJoinWithNonUniqueKey are improved to support the new case. These
methods are called after the HashedRelation is already constructed of the build
side, and do these two iterations:
1. Iterate Stream side.
a. If find match on build side, mark.
b. If no match on build side, join with null build-side row and add to
result
2. Iterate build side.
a. If find marked for match, add joined row to result
b. If no match marked, join with null stream-side row
The left outer join with left-side build, and right outer join with
right-side build, need only a subset of these logics, namely replacing 1b above
with a no-op.
Codegen is left for a follow-up PR.
### Why are the changes needed?
For joins of these types, shuffle-hash join can be more performant than
sort-merge join, especially if the big table is large, as it skips an expensive
sort of the big table.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test in JoinSuite.scala
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]