GitHub user weiwee opened a pull request:
https://github.com/apache/spark/pull/20821
[SPARK-23678][GraphX] a more efficient partition strategy
## What changes were proposed in this pull request?
Add a new partition strategy, EdgePartitionTriangle, with several advantages:
1. A tighter bound on vertex replication, sqrt(2 * numParts), which is about
29% lower than the bound of the EdgePartition2D partition strategy,
2 * sqrt(numParts). This reduces the shuffle size in several operations, such as
aggregateMessages and triplets.
2. It colocates all edges between two vertices, regardless of direction.
3. It keeps the same work balance as EdgePartition2D.
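
To make the replication bound concrete, here is a minimal standalone sketch of one way a triangular layout could work. It is not the Scala code in this patch: the function names, the plain modulo bucketing of vertex ids, and the choice to leave any partitions beyond k * (k + 1) / 2 unused are all assumptions made for illustration. Vertices are hashed into k buckets, and each unordered bucket pair (i, j) with i <= j owns one partition, so a vertex can appear in at most k partitions, and k is at most sqrt(2 * numParts).

```python
import math
from collections import defaultdict


def triangle_side(num_parts: int) -> int:
    """Largest k such that k * (k + 1) / 2 <= num_parts (hypothetical helper)."""
    k = (math.isqrt(8 * num_parts + 1) - 1) // 2
    return max(k, 1)


def triangle_partition(src: int, dst: int, num_parts: int) -> int:
    """Map an edge to a cell of the upper triangle of a k x k grid.

    Assumption: vertices are bucketed with a plain modulo; a real strategy
    would likely mix the ids first to spread them more evenly.
    """
    k = triangle_side(num_parts)
    # Sorting the two buckets makes (src, dst) and (dst, src) land together.
    i, j = sorted((src % k, dst % k))
    # Row i of the upper triangle starts after rows 0..i-1, which hold
    # k, k-1, ..., k-i+1 cells respectively.
    return i * k - i * (i - 1) // 2 + (j - i)


def max_replication(edges, num_parts: int) -> int:
    """Largest number of distinct partitions that any single vertex touches."""
    parts_of = defaultdict(set)
    for s, d in edges:
        p = triangle_partition(s, d, num_parts)
        parts_of[s].add(p)
        parts_of[d].add(p)
    return max(len(parts) for parts in parts_of.values())


if __name__ == "__main__":
    num_parts = 36
    # Complete directed graph on 64 vertices, no self loops.
    edges = [(s, d) for s in range(64) for d in range(64) if s != d]
    print("triangle side k:", triangle_side(num_parts))
    print("max replication:", max_replication(edges, num_parts))
    print("sqrt(2 * numParts):", math.sqrt(2 * num_parts))
    print("2 * sqrt(numParts):", 2 * math.sqrt(num_parts))
```

Because the two bucket indices are sorted before the cell is computed, both directions of an edge fall into the same partition, which is the colocation property in point 2.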
## How was this patch tested?
Manual tests; see
https://github.com/weiwee/edgePartitionTri/blob/master/EdgePartitionTriangle.ipynb
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/weiwee/spark edge-partition-triangle
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20821.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20821
----
commit 05df5c809f91c59e45bb22411a8a5828f3e30512
Author: wenbinwei <wenbinwei@...>
Date: 2018-03-14T07:28:18Z
add new partition strategy: EdgePartitionTriangle
commit 200b1716fe90604f8068ba5309c7673e5586b1cd
Author: wenbinwei <wenbinwei@...>
Date: 2018-03-14T07:30:48Z
add case clause EdgePartitionTriangle to method fromString
----