GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/734
[SQL] SPARK-1800 Add broadcast hash join operator
WIP: A few things remain, but looking for feedback on this approach.
- [ ] Figure out how to configure this. The immutability of SparkConf is
probably not great for things like query hints.
- [ ] Figure out how to test this.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark broadcastHashJoin
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/734.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #734
----
commit a8420ca0c4cbc5988607d0cd235ffeb2cb51d052
Author: Michael Armbrust <[email protected]>
Date: 2014-05-11T18:23:02Z
Copy records in executeCollect to avoid issues with mutable rows.
commit cf6b3818fbe7d1908bcbdc7f18c5773c01d05541
Author: Michael Armbrust <[email protected]>
Date: 2014-05-11T18:30:56Z
Split out generic logic for hash joins and create two concrete physical
operators: BroadcastHashJoin and ShuffledHashJoin.
commit 76ca4341036b95f71763f631049fdae033990ab5
Author: Michael Armbrust <[email protected]>
Date: 2014-05-11T18:31:20Z
A simple strategy that broadcasts tables only when they are found in a
configuration hint.
----
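For readers unfamiliar with the two operators named in the commits above, the following is a minimal in-memory sketch of the idea, not the code in this pull request. It assumes rows are plain tuples and partitions are plain lists; every function name here (build_table, probe, broadcast_hash_join, shuffled_hash_join, choose_join) is hypothetical and does not correspond to Spark's API. The shared build/probe step stands in for the "generic logic for hash joins", and choose_join loosely mirrors a strategy that broadcasts a table only when it is named in a configuration hint.

```python
from collections import defaultdict


def build_table(rows, key):
    """Build phase: index the rows of one side by their join key."""
    table = defaultdict(list)
    for row in rows:
        table[key(row)].append(row)
    return table


def probe(table, rows, key):
    """Probe phase: look up each streamed row in the hash table."""
    return [(b, s) for s in rows for b in table.get(key(s), ())]


def broadcast_hash_join(small_rows, big_partitions, small_key, big_key):
    """Hash the small side once and reuse it for every partition of the
    large side, so the large side never has to be repartitioned."""
    table = build_table(small_rows, small_key)  # stands in for a broadcast
    out = []
    for part in big_partitions:
        out.extend(probe(table, part, big_key))
    return out


def shuffle(rows, key, n):
    """Assign each row to one of n partitions by the hash of its key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def shuffled_hash_join(build_rows, stream_rows, build_key, stream_key, n=4):
    """Repartition both sides by key, then build and probe per partition."""
    out = []
    build_parts = shuffle(build_rows, build_key, n)
    stream_parts = shuffle(stream_rows, stream_key, n)
    for b_part, s_part in zip(build_parts, stream_parts):
        out.extend(probe(build_table(b_part, build_key), s_part, stream_key))
    return out


def choose_join(table_name, broadcast_hint):
    """Hypothetical strategy: broadcast only tables listed in the hint."""
    return "broadcast" if table_name in broadcast_hint else "shuffled"


first = lambda row: row[0]  # both sides join on their first column
users = [(1, "ann"), (2, "bob")]
event_partitions = [[(1, "click"), (3, "view")], [(2, "buy"), (1, "view")]]
all_events = [row for part in event_partitions for row in part]

broadcast_result = broadcast_hash_join(users, event_partitions, first, first)
shuffled_result = shuffled_hash_join(users, all_events, first, first)
```

Both variants return the same set of matched pairs; they differ only in how the data is moved. The broadcast variant is attractive when one side is small enough to copy to every worker, which is why the sketch selects it only for tables named in the hint.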