[GitHub] spark pull request: [HOTFIX] Some clean-up in shuffle code.

pwendell Mon, 30 Mar 2015 23:06:38 -0700

GitHub user pwendell opened a pull request:

    https://github.com/apache/spark/pull/5286


    [HOTFIX] Some clean-up in shuffle code.

    Before diving into review #4450 I did a look through the existing shuffle
    code to learn how it works. Unfortunately, there are some very 
    confusing things in this code. This patch makes a few small changes
    to simplify things. It is not easily to concisely describe the changes 
    because of how convoluted the issues were, but they are fairly small
    logically:
    
    1. There is a trait named `ShuffleBlockManager` that only deals with
       one logical function which is retrieving shuffle block data given shuffle
       block coordinates. This trait has two implementors 
FileShuffleBlockManager
       and IndexShuffleBlockManager. Confusingly the vast majority of those
       implementations have nothing to do with this particular functionality.
       So I've renamed the trait to ShuffleBlockResolver and documented it.
    2. The aforementioned trait had two almost identical methods, for no good
       reason. I removed one method (getBytes) and modified callers to use the
       other one. I think the behavior is preserved in all cases.
    3. The sort shuffle code uses an identifier "0" in the reduce slot of a
       BlockID as a placeholder. I made it into a constant since it needs to
       be consistent across multiple places.
    
    I think for (3) there is actually a better solution that would avoid the
    need to do this type of workaround/hack in the first place, but it's more
    complex so I'm punting it for now.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pwendell/spark cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5286.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5286
    
----
commit a4060797a83c591be288e8fca48affecb17412c6
Author: Patrick Wendell <patr...@databricks.com>
Date:   2015-03-31T01:12:43Z

    [HOTFIX] Some clean-up in shuffle code.
    
    Before diving into review #4450 I did a look through the existing shuffle
    code. Unfortunately, there are some very confusing things in this code.
    This patch makes a few small changes to simplify things. It is not easily
    to concisely describe the changes because of how convoluted the issues were:
    
    1. There was a trait named ShuffleBlockManager that only deals with
       one logical function which is retrieving shuffle block data given shuffle
       block coordinates. This trait has two implementors 
FileShuffleBlockManager
       and IndexShuffleBlockManager. Confusingly the vast majority of those
       implementations have nothing to do with this particular functionality.
       So I've renamed the trait to ShuffleBlockResolver and documented it.
    2. The aformentioned trait had two almost identical methods, for no good
       reason. I removed one method (getBytes) and modified callers to use the
       other one. I think the behavior is preserved in all cases.
    3. The sort shuffle code uses an identifier "0" in the reduce slot of a
       BlockID as a placeholder. I made it into a constant since it needs to
       be consistent across multiple places.
    
    I think for (3) there is actually a better solution that would avoid the
    need to do this type of workaround/hack in the first place, but it's more
    complex so I'm punting it for now.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [HOTFIX] Some clean-up in shuffle code.

Reply via email to