GitHub user koertkuipers opened a pull request:
https://github.com/apache/spark/pull/6883
SPARK-4644 blockjoin
Although the discussion (and design doc) under SPARK-4644 seem focussed on
other aspects of skew (OOM mostly) than this pullreq (which focusses on
avoiding a single reducer taking a long time), i decided to put this pullreq
under SPARK-4644 anyhow, to avoid the proliferation of JIRA tickets. If this is
not the right place let me know and i will move it.
Inspired by block join in scalding.
From scalding docs:
This is useful in cases where the data has extreme skew. A symptom of this
is that we may see a job stuck for a very long time on a small number of
reducers.
A block join is way to get around this: we add a random integer field and a
replica field to every tuple in the left and right pipes. We then join on the
original keys and on these new dummy fields. These dummy fields make it less
likely that the skewed keys will be hashed to the same reducer.
The final data size is right * rightReplication + left * leftReplication
but because of the fragmentation, we are guaranteed the same number of hits as
the original join.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tresata/spark feat-blockjoin
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6883.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6883
----
commit 77d8fee6ad7ba5f83eb0c82b7f1625e2206a5446
Author: Koert Kuipers <[email protected]>
Date: 2015-06-17T20:35:18Z
add blockJoin, blockLeftOuterJoin and blockRightOuterJoin to spark core
commit d1fd3e020812c72c44a6461d9c94065e2784cdbb
Author: Koert Kuipers <[email protected]>
Date: 2015-06-17T23:48:43Z
correct scaladocs for block join functions
commit 2114df748f62b53155d7db5524e163504cead228
Author: Koert Kuipers <[email protected]>
Date: 2015-06-18T03:36:21Z
add block joins to java api
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]