GitHub user koertkuipers opened a pull request:

    https://github.com/apache/spark/pull/6883

    SPARK-4644 blockjoin

    Although the discussion (and design doc) under SPARK-4644 seem focussed on 
other aspects of skew (OOM mostly) than this pullreq (which focusses on 
avoiding a single reducer taking a long time), i decided to put this pullreq 
under SPARK-4644 anyhow, to avoid the proliferation of JIRA tickets. If this is 
not the right place let me know and i will move it.
    
    Inspired by block join in scalding.
    From scalding docs:
    
    This is useful in cases where the data has extreme skew. A symptom of this 
is that we may see a job stuck for a very long time on a small number of 
reducers.
    A block join is way to get around this: we add a random integer field and a 
replica field to every tuple in the left and right pipes. We then join on the 
original keys and on these new dummy fields. These dummy fields make it less 
likely that the skewed keys will be hashed to the same reducer.
    
    The final data size is right * rightReplication + left * leftReplication 
but because of the fragmentation, we are guaranteed the same number of hits as 
the original join.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tresata/spark feat-blockjoin

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6883.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6883
    
----
commit 77d8fee6ad7ba5f83eb0c82b7f1625e2206a5446
Author: Koert Kuipers <[email protected]>
Date:   2015-06-17T20:35:18Z

    add blockJoin, blockLeftOuterJoin and blockRightOuterJoin to spark core

commit d1fd3e020812c72c44a6461d9c94065e2784cdbb
Author: Koert Kuipers <[email protected]>
Date:   2015-06-17T23:48:43Z

    correct scaladocs for block join functions

commit 2114df748f62b53155d7db5524e163504cead228
Author: Koert Kuipers <[email protected]>
Date:   2015-06-18T03:36:21Z

    add block joins to java api

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to