GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/3173
[SPARK-2213][SQL] Sort Merge Join
This PR adds a MergeJoin operator to Spark SQL. The semantics of the MergeJoin
operator are similar to those of Hive's sort-merge bucket join.
The MergeJoin operator relies on SortBasedShuffle to create partitions that
are sorted by the join key. Within each partition, we merge the two child
iterators.
The tricky part of the merge step is handling duplicate join keys. To handle
them, we use a buffer that stores all elements of the right iterator that
match the current join key. The buffer is reused to generate join tuples when
the next left element has the same join key as the current one.
MergeJoin reduces extra memory consumption: in the current implementation, it
only needs enough memory to hold the elements of the key that has the most
duplicates in the right iterator.
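To illustrate the merge step described above, here is a minimal standalone
sketch on plain Scala iterators. It is not the code in MergeJoin.scala: the
names MergeJoinSketch and mergeJoin are hypothetical, rows are simplified to
(key, value) pairs, and both inputs are assumed to be sorted in ascending
order by key already (which is what the sort-based shuffle would provide).

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration only; not the operator added by this PR.
object MergeJoinSketch {
  /** Inner merge join over two iterators that are ALREADY sorted ascending by key. */
  def mergeJoin[K, L, R](
      left: Iterator[(K, L)],
      right: Iterator[(K, R)])(implicit ord: Ordering[K]): Iterator[(K, L, R)] = {
    val rightBuf = right.buffered          // lets us peek at the next right row
    val matches = ArrayBuffer.empty[R]     // right rows matching the buffered key
    var bufferedKey: Option[K] = None      // key that `matches` currently belongs to

    left.flatMap { case (k, l) =>
      if (!bufferedKey.exists(ord.equiv(_, k))) {
        // New left key: drop the old buffer, skip smaller right keys,
        // then collect every right row whose key equals k.
        matches.clear()
        while (rightBuf.hasNext && ord.lt(rightBuf.head._1, k)) rightBuf.next()
        while (rightBuf.hasNext && ord.equiv(rightBuf.head._1, k)) {
          matches += rightBuf.next()._2
        }
        bufferedKey = Some(k)
      }
      // Same key as the previous left row: the buffer is reused as-is.
      // The result is copied so later buffer changes cannot affect it.
      matches.map(r => (k, l, r))
    }
  }
}
```

In this sketch the only state kept per partition is the buffer for one join
key, so peak memory is bounded by the right-side rows of the most duplicated
key (plus the copied output for a single left row). For example, joining left
rows with keys 1, 2, 2, 4 against right rows with keys 2, 2, 3, 4 fills the
buffer once for key 2 and reuses it for the second left row with that key.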
For query optimization, we may resolve to MergeJoin when both relations
are large and neither of the two fits in memory. Currently, this heuristic
has not been added to the optimizer. I would appreciate comments on how to
resolve to MergeJoin in the optimizer.
Currently, MergeJoin only supports inner joins. However, it can be extended
to support outer joins, which I will handle in separate PRs.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Ishiihara/spark SparkSQL-merge
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3173.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3173
----
commit 1c41f6f248f1145c7d730129795e50bdd8a53f2b
Author: Liquan Pei <[email protected]>
Date: 2014-10-28T23:47:35Z
initial commit
commit dc6a6840e2d2b1681e70a6a3eeb10d7a9e6437ce
Author: Liquan Pei <[email protected]>
Date: 2014-10-29T00:17:59Z
add MergeJoin.scala
commit f5ef4624aea5304ffdcc8daf5fbebc20943c3cf4
Author: Liquan Pei <[email protected]>
Date: 2014-11-09T04:05:56Z
Merge join working
commit b13cc4526f0098386b64bce50b2f983f95709f23
Author: Liquan Pei <[email protected]>
Date: 2014-11-09T04:09:11Z
Merge remote-tracking branch 'upstream/master' into SparkSQL-merge
Conflicts:
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
commit d6b6e7b8194682c713400823a4fd17e0419d89e4
Author: Liquan Pei <[email protected]>
Date: 2014-11-09T04:51:04Z
add inline comments for merge join
commit 837eb081e6382a23b4fd67a5265188aab1c7e305
Author: Liquan Pei <[email protected]>
Date: 2014-11-09T05:02:17Z
use merge join as inner join operator in JoinSuite
commit 5cb98c306f76183e4148d9b0a6b0a8ce4d58368e
Author: Liquan Pei <[email protected]>
Date: 2014-11-09T05:30:52Z
improve inline comments
----