GitHub user ankurdave opened a pull request:
https://github.com/apache/spark/pull/497
Unify GraphImpl RDDs + other graph load optimizations
This PR makes the following changes, primarily in commit
e4fbd329aef85fe2c38b0167255d2a712893d683:
1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs:
vertices, edges, routing table, and triplet view. This commit merges them down
to two: vertices (with routing table), and edges (with replicated vertices).
2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles
when building a graph: one to extract routing information from the edges and
move it to the vertices, and another to find vertices that are referred to by
edges but missing from the vertex set. With this commit, the latter is done as
a side effect of the former.
3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side
effect of unifying the edges and the triplet view.
4. *Join elimination for mapTriplets.*
5. *Ship only the needed vertex attributes when upgrading the triplet
view.* If the triplet view already contains source attributes, and we now need
both attributes, only ship destination attributes rather than re-shipping both.
This is done in `ReplicatedVertexView#upgrade`.
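The shipping rule in item 5 can be sketched roughly as follows. This is a minimal model, not the code in this PR: `UpgradeSketch`, `ViewState`, `Shipment`, `toShip`, and `upgraded` are hypothetical names chosen for illustration, and the actual signature of `ReplicatedVertexView#upgrade` is not shown here. The sketch assumes the view only tracks two flags, one per attribute side.

```scala
// Hypothetical model of the "ship only what is missing" rule.
// None of these names come from GraphX itself.
object UpgradeSketch {
  // Which vertex attributes the edge partitions already hold.
  final case class ViewState(hasSrcAttrs: Boolean, hasDstAttrs: Boolean)

  // Which vertex attributes still have to be shipped.
  final case class Shipment(shipSrc: Boolean, shipDst: Boolean)

  // Ship a side only if it is requested and not already present.
  def toShip(current: ViewState, needSrc: Boolean, needDst: Boolean): Shipment =
    Shipment(
      shipSrc = needSrc && !current.hasSrcAttrs,
      shipDst = needDst && !current.hasDstAttrs)

  // State of the view after the upgrade: sides are only ever added.
  def upgraded(current: ViewState, needSrc: Boolean, needDst: Boolean): ViewState =
    ViewState(
      hasSrcAttrs = current.hasSrcAttrs || needSrc,
      hasDstAttrs = current.hasDstAttrs || needDst)
}
```

For the case described in item 5, `UpgradeSketch.toShip(UpgradeSketch.ViewState(true, false), true, true)` requests only the destination side, since the source attributes are already in place.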
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ankurdave/spark unify-rdds
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/497.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #497
----
commit d64e8d49ec43d2fd6bde519b8aeafb5cc4f1be61
Author: Ankur Dave <[email protected]>
Date: 2014-04-20T03:09:24Z
Log current Pregel iteration
commit 62c7b7851301890814d523513fc5b67a0eb781ab
Author: Ankur Dave <[email protected]>
Date: 2014-04-20T03:10:36Z
In Analytics, take PageRank numIter
commit d6d60e21bfc97fa39b25aa875c96ca9fc05a9973
Author: Ankur Dave <[email protected]>
Date: 2014-04-20T03:08:57Z
In GraphLoader, coalesce to minEdgePartitions
commit e4fbd329aef85fe2c38b0167255d2a712893d683
Author: Ankur Dave <[email protected]>
Date: 2014-04-13T02:18:37Z
Unify GraphImpl RDDs + other graph load optimizations
This commit makes the following changes:
1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs:
vertices, edges, routing table, and triplet view. This commit merges
them down to two: vertices (with routing table), and edges (with
replicated vertices).
2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles
when building a graph: one to extract routing information from the edges
and move it to the vertices, and another to find vertices that are
referred to by edges but missing from the vertex set. With this commit,
the latter is done as a side effect of the former.
3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side
effect of unifying the edges and the triplet view.
4. *Join elimination for mapTriplets.*
5. *Ship only the needed vertex attributes when upgrading the
triplet view.* If the triplet view already contains source attributes,
and we now need both attributes, only ship destination attributes rather
than re-shipping both. This is done in `ReplicatedVertexView#upgrade`.
----