GitHub user mengxr reopened a pull request:
https://github.com/apache/spark/pull/506
[SPARK-1485][MLLIB] Implement Butterfly AllReduce
The current implementations of machine learning algorithms rely on the
driver for some computation and data broadcasting. This may create a bottleneck
at the driver for both computation and communication, especially in multi-model
training. An efficient implementation of AllReduce can help free up the driver.
This PR contains a simple butterfly AllReduce implementation. I compared it with
reduce + broadcast (HTTP) on a 16-node EC2 cluster (with a slow connection) and
saw a 2x speed-up on vectors of sizes 1k to 10m.
Possible improvements:
1. Each executor only needs to keep one copy of the data.
2. Better handling when the number of partitions is not a power of two.
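To illustrate the communication pattern the PR describes, here is a minimal local sketch of a butterfly AllReduce (a hypothetical helper, not the PR's actual Spark code). It assumes the number of partitions is a power of two: in each round, partition i combines its partial result with that of partner i XOR dist, so after log2(n) rounds every partition holds the full reduction without routing data through the driver.

```python
def butterfly_all_reduce(values, op):
    """Simulate butterfly AllReduce over a list of per-partition values.

    Hypothetical sketch: `values[i]` stands for partition i's local result,
    and `op` is an associative, commutative combine function.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "requires a power-of-two partition count"
    cur = list(values)
    dist = 1
    while dist < n:
        # Each partition i exchanges with partner i ^ dist; both keep the
        # combined value, doubling the span of data each partition has seen.
        cur = [op(cur[i], cur[i ^ dist]) for i in range(n)]
        dist <<= 1
    return cur  # every slot now holds the full reduction

print(butterfly_all_reduce([1, 2, 3, 4], lambda a, b: a + b))  # [10, 10, 10, 10]
```

Handling a non-power-of-two partition count (improvement 2 above) typically requires folding the extra partitions into the nearest power of two before the butterfly rounds.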
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mengxr/spark butterfly
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/506.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #506
----
commit 76f4bb7b29e0b605561dac048535380b645839fd
Author: Xiangrui Meng <[email protected]>
Date: 2014-04-18T04:43:32Z
init impl of allReduce
commit d14300540da65900a2a19f8edc5adc5ce9c3e72d
Author: Xiangrui Meng <[email protected]>
Date: 2014-04-22T21:11:25Z
move allReduce to mllib
commit 98c329d5ac12ec341a9cf6355cd67ea031189a24
Author: Xiangrui Meng <[email protected]>
Date: 2014-04-23T09:44:51Z
allow arbitrary number of partitions
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes to have it, or if the feature is enabled but not working,
please contact infrastructure at [email protected] or file a JIRA
ticket with INFRA.
---