Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/6652#issuecomment-109439594
@ash211 Yeah, so there are two sets of experiments that I've done which
show the benefits of this change.
1. The first is in the [OSDI 2014
paper](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-venkataraman.pdf).
There are a bunch of things in that paper, but Figure 19 which compares
Baseline to KMN-M/K=1.0 is very similar to the current patch.
2. I've run a bunch of machine learning / linear algebra workloads which
use aggregation trees implemented as a series of shuffles. In these scenarios
you get ~50% reduction in data sent over the network for each stage of the
reduction. Here is an example of a time-breakdown diagram
[before](https://www.dropbox.com/s/vjkcu3bc9cscelr/120_waterfall.pdf?dl=0) and
[after](https://www.dropbox.com/s/820vlt4xuefvr7f/1016_waterfall.pdf?dl=0). In
the second case we are able to completely avoid network transfer in the
second stage by aggregating across multiple partitions on the same machine.
3. @JoshRosen also included a couple of other papers which have done this
in [the
JIRA](https://issues.apache.org/jira/browse/SPARK-2774?focusedCommentId=14081300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14081300)
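The savings in case 2 can be illustrated with a toy model. This is a hypothetical sketch, not the actual patch or the Spark API: it just counts how many partial aggregates cross the network in a baseline single-reducer plan versus a two-stage plan that first combines partitions co-located on the same machine.

```python
from collections import defaultdict

# Hypothetical cluster layout: 4 machines, 4 partitions each,
# every partition holding some records to be summed.
partitions = {(machine, p): list(range(8)) for machine in range(4) for p in range(4)}

def baseline_network_records(parts):
    # Baseline: every partition's partial sum is shipped to a single
    # reducer, so all 16 partial aggregates cross the network.
    return len(parts)

def tree_network_records(parts):
    # Stage 1: combine partitions that live on the same machine --
    # no network transfer, since the data never leaves the machine.
    per_machine = defaultdict(int)
    for (machine, _), records in parts.items():
        per_machine[machine] += sum(records)
    # Stage 2: only one partial aggregate per machine crosses the network.
    return len(per_machine)

print(baseline_network_records(partitions))  # 16 partial sums over the network
print(tree_network_records(partitions))      # 4, one per machine
```

With 4 partitions per machine the intermediate stage sends 4x less data; the ~50% per-stage reduction reported above depends on the actual partition-to-machine ratio in the workload.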