[ 
https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3427:
------------------------------
    Description: 
GraphX's current implementation of static (fixed iteration count) PageRank uses 
the Pregel API. This unnecessarily tracks active vertices, even though in 
static PageRank all vertices are always active. Active vertex tracking incurs 
the following costs:

1. A shuffle per iteration to ship the active sets to the edge partitions.
2. A hash table creation per iteration at each partition to index the active 
sets for lookup.
3. A hash lookup per edge to check whether the source vertex is active.

I reimplemented static PageRank using the lower-level GraphX API instead of the 
Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% 
speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank 
on a synthetic graph with 10M vertices and 1.27B edges.

  was:
GraphX's current implementation of static (fixed iteration count) PageRank is 
implemented using the Pregel API. This unnecessarily tracks active vertices, 
even though in static PageRank all vertices are always active. Active vertex 
tracking incurs the following costs:

1. A shuffle per iteration to ship the active sets to the edge partitions.
2. A hash table creation per iteration at each partition to index the active 
sets for lookup.
3. A hash lookup per edge to check whether the source vertex is active.

I reimplemented static PageRank using the GraphX API instead of the 
higher-level Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this 
provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 
iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges.


> Avoid active vertex tracking in static PageRank
> -----------------------------------------------
>
>                 Key: SPARK-3427
>                 URL: https://issues.apache.org/jira/browse/SPARK-3427
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>
> GraphX's current implementation of static (fixed iteration count) PageRank 
> uses the Pregel API. This unnecessarily tracks active vertices, even though 
> in static PageRank all vertices are always active. Active vertex tracking 
> incurs the following costs:
> 1. A shuffle per iteration to ship the active sets to the edge partitions.
> 2. A hash table creation per iteration at each partition to index the active 
> sets for lookup.
> 3. A hash lookup per edge to check whether the source vertex is active.
> I reimplemented static PageRank using the lower-level GraphX API instead of 
> the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided 
> a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of 
> PageRank on a synthetic graph with 10M vertices and 1.27B edges.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to