[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...

ankurdave Tue, 22 Apr 2014 21:49:23 -0700

GitHub user ankurdave opened a pull request:

    https://github.com/apache/spark/pull/497


    Unify GraphImpl RDDs + other graph load optimizations

    This PR makes the following changes, primarily in
    e4fbd329aef85fe2c38b0167255d2a712893d683:

    1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs:
    vertices, edges, routing table, and triplet view. This commit merges
    them down to two: vertices (with routing table), and edges (with
    replicated vertices).

    2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles
    when building a graph: one to extract routing information from the edges
    and move it to the vertices, and another to find nonexistent vertices
    referred to by edges. With this commit, the latter is done as a side
    effect of the former.

    3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side
    effect of unifying the edges and the triplet view.

    4. *Join elimination for mapTriplets.*

    5. *Ship only the needed vertex attributes when upgrading the
    triplet view.* If the triplet view already contains source attributes,
    and we now need both attributes, only ship destination attributes rather
    than re-shipping both. This is done in `ReplicatedVertexView#upgrade`.
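
To illustrate item 2, here is a minimal sketch in plain Python, not GraphX code. It assumes a toy model in which each edge partition is a list of `(src, dst)` pairs and the routing table maps a vertex id to the set of edge partitions that reference it. The function name `build_routing_table` and the `default_attr` parameter are hypothetical. The point is only that the vertices referred to by edges but missing from the vertex set fall out of the same pass that builds the routing information, so no second pass is needed.

```python
from collections import defaultdict


def build_routing_table(edge_partitions, vertex_attrs, default_attr=None):
    """Toy model of graph building with a single pass over the edges.

    edge_partitions: list of lists of (src, dst) pairs; the list index
        stands in for the edge partition id.
    vertex_attrs: dict mapping vertex id -> attribute for the vertices
        that were supplied explicitly.
    Returns (routing, vertices).
    """
    # Routing information: which edge partitions mention each vertex.
    routing = defaultdict(set)
    for pid, edges in enumerate(edge_partitions):
        for src, dst in edges:
            routing[src].add(pid)
            routing[dst].add(pid)

    # Side effect of the same pass: every vertex id seen in the routing
    # table is known to exist, so ids absent from vertex_attrs can be
    # filled in with a default attribute without scanning the edges again.
    vertices = dict(vertex_attrs)
    for vid in routing:
        vertices.setdefault(vid, default_attr)

    return dict(routing), vertices
```

In the real system the first loop corresponds to a shuffle from edge partitions to vertex partitions; the sketch collapses that into an in-memory dictionary.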

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ankurdave/spark unify-rdds

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #497
    
----
commit d64e8d49ec43d2fd6bde519b8aeafb5cc4f1be61
Author: Ankur Dave <[email protected]>
Date:   2014-04-20T03:09:24Z

    Log current Pregel iteration

commit 62c7b7851301890814d523513fc5b67a0eb781ab
Author: Ankur Dave <[email protected]>
Date:   2014-04-20T03:10:36Z

    In Analytics, take PageRank numIter

commit d6d60e21bfc97fa39b25aa875c96ca9fc05a9973
Author: Ankur Dave <[email protected]>
Date:   2014-04-20T03:08:57Z

    In GraphLoader, coalesce to minEdgePartitions

commit e4fbd329aef85fe2c38b0167255d2a712893d683
Author: Ankur Dave <[email protected]>
Date:   2014-04-13T02:18:37Z

    Unify GraphImpl RDDs + other graph load optimizations
    
    This commit makes the following changes:
    
    1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs:
    vertices, edges, routing table, and triplet view. This commit merges
    them down to two: vertices (with routing table), and edges (with
    replicated vertices).
    
    2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles
    when building a graph: one to extract routing information from the edges
    and move it to the vertices, and another to find nonexistent vertices
    referred to by edges. With this commit, the latter is done as a side
    effect of the former.
    
    3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side
    effect of unifying the edges and the triplet view.
    
    4. *Join elimination for mapTriplets.*
    
    5. *Ship only the needed vertex attributes when upgrading the
    triplet view.* If the triplet view already contains source attributes,
    and we now need both attributes, only ship destination attributes rather
    than re-shipping both. This is done in `ReplicatedVertexView#upgrade`.
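
The incremental shipping described in item 5 can be sketched as follows. This is a toy model in plain Python, not the actual `ReplicatedVertexView#upgrade` code: the class name `ReplicatedViewSketch`, the boolean flags, and the `shipped` log are all assumptions made for illustration. It only shows the bookkeeping idea: remember which attributes the view already holds and ship just the ones that are newly required.

```python
class ReplicatedViewSketch:
    """Tracks which vertex attributes an edge-side view already holds."""

    def __init__(self):
        self.has_src = False  # source attributes already shipped?
        self.has_dst = False  # destination attributes already shipped?
        self.shipped = []     # log of (ship_src, ship_dst) shipments

    def upgrade(self, include_src, include_dst):
        """Ship only the attributes that are requested and not yet present.

        Returns the pair (ship_src, ship_dst) describing what was shipped.
        """
        ship_src = include_src and not self.has_src
        ship_dst = include_dst and not self.has_dst
        if ship_src or ship_dst:
            # Stand-in for actually moving vertex attributes to the edges.
            self.shipped.append((ship_src, ship_dst))
        # The view never loses attributes it already holds.
        self.has_src = self.has_src or include_src
        self.has_dst = self.has_dst or include_dst
        return ship_src, ship_dst
```

With this model, a view that already holds source attributes and is asked for both sides ships only the destination side, and a repeated request ships nothing.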

----


