[
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734
]
Larry Xiao edited comment on SPARK-3523 at 12/2/14 4:50 PM:
------------------------------------------------------------
Hi [~lianhuiwang], that's great!
Currently, to allow more complex partitioning, I changed the interface of
partitioner, because default partitioners are hash based, and didn't need much
information about graph.
An option is to partition the graph externally, as [~ankurd] suggested.
What do you think?
was (Author: larryxiao):
Hi [~lianhuiwang], that's great!
As you can see, to allow more complex partitioning, the interface need to be
change, because default partitioners are hash based.
Another option is to partition the graph externally, as [~ankurd] suggested.
What do you think?
> GraphX graph partitioning strategy
> ----------------------------------
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
> Issue Type: Improvement
> Components: GraphX
> Affects Versions: 1.0.2
> Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the
> workload on few high-degree vertex should be decomposed by all machines
> * For low-degree vertex, The computation on one vertex is quite small. Thus
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
> * a heuristic algorithm for bipartite
> h5. Result
> *
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
> * The left Y axis is replication factor, right axis is the balance
> (measured using CV, coefficient of variation) of either vertices or edges of
> all partitions. The balance of edges can infer computation balance, and the
> balance of vertices can infer communication balance.
> *
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
>
> * This is an example of a balanced partitioning achieving 20% saving on
> communication.
> *
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
> * This is a simple partitioning result of BiCut.
> * in-2.0-1m is a generated power law graph with alpha equals 2.0
> h5. Code
> *
> https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
> * Because the implementation breaks the current separation with
> PartitionStrategy.scala, so need to think of a way to support access to graph.
> h5. Reference
> - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed
> Graphs
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]