[GitHub] spark pull request: [SPARK-3936] Add aggregateMessages, which supe...

ankurdave Tue, 04 Nov 2014 17:57:06 -0800

GitHub user ankurdave opened a pull request:

    https://github.com/apache/spark/pull/3100


    [SPARK-3936] Add aggregateMessages, which supersedes mapReduceTriplets

    aggregateMessages enables neighborhood computation similarly to
    mapReduceTriplets, but it introduces two API improvements:

    1. Messages are sent using an imperative interface based on EdgeContext
    rather than by returning an iterator of messages. This is more
    efficient, providing a 20.2% speedup on PageRank over
    apache/spark#3054 (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge
    machines, sped up from 403 s to 322 s).

    2. Rather than attempting bytecode inspection, the required triplet
    fields must be explicitly specified by the user by passing a
    TripletFields object. This fixes SPARK-3936.

    Subsumes apache/spark#2815.
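
To illustrate the first improvement, here is a minimal plain-Scala sketch contrasting the two message-passing styles. It is a toy model and not GraphX code: Edge, MiniEdgeContext, aggregateViaIterator and aggregateViaContext are hypothetical names chosen for illustration, and messages are fixed to Double for brevity. The point is only that the iterator style allocates an Iterator and tuples per edge, while the context style lets the user function call a send method directly.

```scala
import scala.collection.mutable

// Toy edge type (hypothetical; vertex ids are plain Ints here).
final case class Edge(src: Int, dst: Int, attr: Double)

// Iterator style: the user function returns (vertex id, message) pairs,
// so each edge allocates an Iterator plus one tuple per message.
def aggregateViaIterator(
    edges: Array[Edge],
    mapFunc: Edge => Iterator[(Int, Double)],
    merge: (Double, Double) => Double): Map[Int, Double] = {
  val acc = mutable.Map.empty[Int, Double]
  edges.foreach { e =>
    mapFunc(e).foreach { case (vid, msg) =>
      acc(vid) = acc.get(vid).map(prev => merge(prev, msg)).getOrElse(msg)
    }
  }
  acc.toMap
}

// Imperative style: one reusable context object is pointed at each edge,
// and the user function sends messages by calling methods on it.
final class MiniEdgeContext(merge: (Double, Double) => Double) {
  private val acc = mutable.Map.empty[Int, Double]
  private var current: Edge = null

  def set(e: Edge): Unit = { current = e }
  def attr: Double = current.attr
  def sendToSrc(msg: Double): Unit = send(current.src, msg)
  def sendToDst(msg: Double): Unit = send(current.dst, msg)
  def result: Map[Int, Double] = acc.toMap

  private def send(vid: Int, msg: Double): Unit = {
    acc(vid) = acc.get(vid).map(prev => merge(prev, msg)).getOrElse(msg)
  }
}

def aggregateViaContext(
    edges: Array[Edge],
    sendMsg: MiniEdgeContext => Unit,
    merge: (Double, Double) => Double): Map[Int, Double] = {
  val ctx = new MiniEdgeContext(merge)
  var i = 0
  while (i < edges.length) {
    ctx.set(edges(i))
    sendMsg(ctx)
    i += 1
  }
  ctx.result
}

// Usage: sum incoming edge attributes at each destination vertex.
val edges = Array(Edge(0, 1, 0.5), Edge(0, 2, 0.5), Edge(1, 2, 1.0))
val viaIterator = aggregateViaIterator(edges, e => Iterator((e.dst, e.attr)), _ + _)
val viaContext = aggregateViaContext(edges, ctx => ctx.sendToDst(ctx.attr), _ + _)
```

Both calls produce the same aggregates. Any performance difference in this toy says nothing about the PageRank numbers reported above, which were measured on the real implementation.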

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ankurdave/spark aggregateMessages

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3100
    
----
commit 4a566dc86624ac3f6dfa747d344c86e4be44adc2
Author: Ankur Dave <[email protected]>
Date:   2014-08-14T02:33:47Z

    Optimizations for mapReduceTriplets and EdgePartition
    
    1. EdgePartition now stores local vertex ids instead of global ids. This
       avoids hash lookups when looking up vertex attributes and aggregating
       messages.
    
    2. Internal iterators in mapReduceTriplets are inlined into a while
       loop.
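
The idea behind item 1 can be sketched in plain Scala as follows. This is an illustrative model under assumptions, not the actual EdgePartition: LocalEdgePartition, buildPartition and sumSrcAttrIntoDst are hypothetical names, and vertex attributes are fixed to Double. Global ids are translated to dense local indices once, when the partition is built, so the per-edge work afterwards is array indexing rather than hash lookups.

```scala
import scala.collection.mutable

// Hypothetical partition layout: edges refer to vertices by local index.
final class LocalEdgePartition(
    val localSrc: Array[Int],        // per-edge local index of the source
    val localDst: Array[Int],        // per-edge local index of the destination
    val local2global: Array[Long],   // local index -> global vertex id
    val vertexAttrs: Array[Double])  // vertex attribute, indexed by local index

def buildPartition(
    edges: Array[(Long, Long)],
    attrOf: Long => Double): LocalEdgePartition = {
  val global2local = mutable.HashMap.empty[Long, Int]
  val local2global = mutable.ArrayBuffer.empty[Long]
  // The hash lookup happens here, once per edge endpoint at build time.
  def localId(g: Long): Int =
    global2local.getOrElseUpdate(g, { local2global += g; local2global.length - 1 })

  val src = new Array[Int](edges.length)
  val dst = new Array[Int](edges.length)
  var i = 0
  while (i < edges.length) {
    src(i) = localId(edges(i)._1)
    dst(i) = localId(edges(i)._2)
    i += 1
  }
  val l2g = local2global.toArray
  new LocalEdgePartition(src, dst, l2g, l2g.map(attrOf))
}

// Sends each source attribute to the destination and sums the messages,
// using only array indexing inside the edge loop.
def sumSrcAttrIntoDst(p: LocalEdgePartition): Map[Long, Double] = {
  val agg = new Array[Double](p.local2global.length)
  val touched = new Array[Boolean](p.local2global.length)
  var i = 0
  while (i < p.localSrc.length) {
    val d = p.localDst(i)
    agg(d) += p.vertexAttrs(p.localSrc(i))
    touched(d) = true
    i += 1
  }
  // Translate back to global ids once, after the loop.
  p.local2global.indices.collect {
    case l if touched(l) => p.local2global(l) -> agg(l)
  }.toMap
}

// Usage: three edges over global ids 10, 20 and 30; attribute = id as Double.
val partition = buildPartition(Array((10L, 20L), (10L, 30L), (20L, 30L)), _.toDouble)
val sums = sumSrcAttrIntoDst(partition)
```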

commit b567be2825ea22f2e61fbd9caa34940f5bc404df
Author: Ankur Dave <[email protected]>
Date:   2014-11-04T09:56:48Z

    iter.foreach -> while loop
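
As a generic illustration of this kind of rewrite (not the code touched by the commit), the two functions below compute the same sum. sumWithForeach and sumWithWhile are hypothetical names. The foreach version invokes a closure per element, while the while version is a plain indexed loop; whether that matters in practice depends on the JIT and the surrounding code.

```scala
// Closure-based traversal: one function call per element.
def sumWithForeach(xs: Array[Double]): Double = {
  var total = 0.0
  xs.iterator.foreach(x => total += x)
  total
}

// Hand-written loop: same result, no iterator or closure in the hot path.
def sumWithWhile(xs: Array[Double]): Double = {
  var total = 0.0
  var i = 0
  while (i < xs.length) {
    total += xs(i)
    i += 1
  }
  total
}

val sample = Array(1.0, 2.0, 3.5)
val a = sumWithForeach(sample)
val b = sumWithWhile(sample)
```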

commit c85076de62b4c3344c443d4e85fce8fc47274aac
Author: Ankur Dave <[email protected]>
Date:   2014-11-04T09:58:00Z

    Readability improvements

commit e0f8ecc7b678de2b011650ed96b974369730947e
Author: Ankur Dave <[email protected]>
Date:   2014-11-04T09:58:23Z

    Take activeSet in ExistingEdgePartitionBuilder
    
    Also rename VertexPreservingEdgePartitionBuilder to
    ExistingEdgePartitionBuilder to better reflect its usage.

commit 194a2df94768be9c08ed50654170bad937bd115a
Author: Ankur Dave <[email protected]>
Date:   2014-11-04T10:03:34Z

    Test triplet iterator in EdgePartition serialization test

commit 1e80aca308463b0ec7dbeee58c7d1935ebb59e77
Author: Ankur Dave <[email protected]>
Date:   2014-11-01T07:01:21Z

    Add aggregateMessages, which supersedes mapReduceTriplets
    
    aggregateMessages enables neighborhood computation similarly to
    mapReduceTriplets, but it introduces two API improvements:
    
    1. Messages are sent using an imperative interface based on EdgeContext
    rather than by returning an iterator of messages. This is more
    efficient, providing a 20.2% speedup on PageRank over
    apache/spark#3054 (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge
    machines, sped up from 403 s to 322 s).
    
    2. Rather than attempting bytecode inspection, the required triplet
    fields must be explicitly specified by the user by passing a
    TripletFields object. This fixes SPARK-3936.
    
    Subsumes apache/spark#2815.
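
The second improvement, declaring the needed triplet fields up front instead of inferring them, can be modeled with the plain-Scala sketch below. It is a hypothetical stand-in and not the GraphX TripletFields class: Fields, FieldSets, Triplet and buildTriplets are names invented for illustration. In later GraphX releases the declaration is passed as an argument such as TripletFields.Src, though the exact signature at this revision of the PR may differ.

```scala
// Hypothetical declaration of which vertex attributes the caller will read.
final case class Fields(useSrc: Boolean, useDst: Boolean)

object FieldSets {
  val EdgeOnly = Fields(useSrc = false, useDst = false)
  val Src = Fields(useSrc = true, useDst = false)
  val Dst = Fields(useSrc = false, useDst = true)
  val All = Fields(useSrc = true, useDst = true)
}

final case class Triplet(
    srcId: Long,
    dstId: Long,
    srcAttr: Option[Double],
    dstAttr: Option[Double],
    attr: Double)

// Builds triplets, copying only the vertex attributes the caller declared.
// Nothing is inferred from the user's function, so there is no inspection
// step that could guess wrong.
def buildTriplets(
    edges: Array[(Long, Long, Double)],
    vertexAttr: Map[Long, Double],
    fields: Fields): Array[Triplet] =
  edges.map { case (s, d, a) =>
    Triplet(
      s,
      d,
      if (fields.useSrc) vertexAttr.get(s) else None,
      if (fields.useDst) vertexAttr.get(d) else None,
      a)
  }

// Usage: request only the source attribute.
val attrs = Map(1L -> 10.0, 2L -> 20.0)
val srcOnly = buildTriplets(Array((1L, 2L, 0.5)), attrs, FieldSets.Src)
val edgeOnly = buildTriplets(Array((1L, 2L, 0.5)), attrs, FieldSets.EdgeOnly)
```

The trade-off of the explicit approach is that a caller who under-declares fields gets missing attributes, which is why the declaration has to match what the send function actually reads.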

----

