GitHub user pwendell opened a pull request:
https://github.com/apache/spark/pull/5286
[HOTFIX] Some clean-up in shuffle code.
Before diving into review #4450 I did a look through the existing shuffle
code to learn how it works. Unfortunately, there are some very
confusing things in this code. This patch makes a few small changes
to simplify things. It is not easily to concisely describe the changes
because of how convoluted the issues were, but they are fairly small
logically:
1. There is a trait named `ShuffleBlockManager` that only deals with
one logical function which is retrieving shuffle block data given shuffle
block coordinates. This trait has two implementors
FileShuffleBlockManager
and IndexShuffleBlockManager. Confusingly the vast majority of those
implementations have nothing to do with this particular functionality.
So I've renamed the trait to ShuffleBlockResolver and documented it.
2. The aforementioned trait had two almost identical methods, for no good
reason. I removed one method (getBytes) and modified callers to use the
other one. I think the behavior is preserved in all cases.
3. The sort shuffle code uses an identifier "0" in the reduce slot of a
BlockID as a placeholder. I made it into a constant since it needs to
be consistent across multiple places.
I think for (3) there is actually a better solution that would avoid the
need to do this type of workaround/hack in the first place, but it's more
complex so I'm punting it for now.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/pwendell/spark cleanup
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5286.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5286
----
commit a4060797a83c591be288e8fca48affecb17412c6
Author: Patrick Wendell <[email protected]>
Date: 2015-03-31T01:12:43Z
[HOTFIX] Some clean-up in shuffle code.
Before diving into review #4450 I did a look through the existing shuffle
code. Unfortunately, there are some very confusing things in this code.
This patch makes a few small changes to simplify things. It is not easily
to concisely describe the changes because of how convoluted the issues were:
1. There was a trait named ShuffleBlockManager that only deals with
one logical function which is retrieving shuffle block data given shuffle
block coordinates. This trait has two implementors
FileShuffleBlockManager
and IndexShuffleBlockManager. Confusingly the vast majority of those
implementations have nothing to do with this particular functionality.
So I've renamed the trait to ShuffleBlockResolver and documented it.
2. The aformentioned trait had two almost identical methods, for no good
reason. I removed one method (getBytes) and modified callers to use the
other one. I think the behavior is preserved in all cases.
3. The sort shuffle code uses an identifier "0" in the reduce slot of a
BlockID as a placeholder. I made it into a constant since it needs to
be consistent across multiple places.
I think for (3) there is actually a better solution that would avoid the
need to do this type of workaround/hack in the first place, but it's more
complex so I'm punting it for now.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]