[
https://issues.apache.org/jira/browse/FLINK-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742216#comment-14742216
]
ASF GitHub Bot commented on FLINK-2661:
---------------------------------------
GitHub user andralungu opened a pull request:
https://github.com/apache/flink/pull/1124
[FLINK-2661] Added Node Splitting Methods
Social media graphs, citation networks or even protein networks have a
common property: their degree distribution follows a power-law curve. This
structure raises challenges to both vertex-centric and GSA/GAS models because
they uniformly process vertices, regardless of their degree distribution. This
leads to large execution time stalls: vertices wait for skewed nodes to finish
computation [synchronous].
This PR aims to diminish the impact of high-degree nodes by proposing four
main functions: `determinieSkewedVertices`, `treeDeAggregate` (splits a node
into subnodes, recursively, in levels), `propagateValuesToSplitVertices`
(useful when the algorithm performs more than one superstep), `treeAggregate`
(brings the graph back to its initial state).
These functions modify a graph at a high-level, making its degree
distribution more uniform. The method does not need any special partitioning or
runtime modification and (for skewed networks and computationally intensive
algorithms) can speed up the run time by a factor of two.
I added an example: NodeSplittingJaccardSimilarityMeasure, for which I
needed to split the overall sequence of operations to two functions to be able
to test the result. Calling the entire main method would have resulted in the
"Two few memory segments etc" exception - too many operations called within one
test, in other words.
For more info, please consult the additional entry in the documentation.
If we reach a common point and this PR gets merged, I will also follow
@fhueske 's suggestion from the mailing list - adding a Split version of each
of the library methods to allow users to verify whether their run time
improves.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andralungu/flink splitJaccardFlink
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1124.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1124
----
commit b02d0917edcd5f3c8846fe01044afd7444a58c08
Author: Andra Lungu <[email protected]>
Date: 2015-09-12T10:25:20Z
[FLINK-2661] Added Node Splitting Methods
[FLINK-2661] Minor modifications in the docs
[FLINK-2661] pdf to png
----
> Add a Node Splitting Technique to Overcome the Limitations of Skewed Graphs
> ---------------------------------------------------------------------------
>
> Key: FLINK-2661
> URL: https://issues.apache.org/jira/browse/FLINK-2661
> Project: Flink
> Issue Type: Task
> Components: Gelly
> Affects Versions: 0.10
> Reporter: Andra Lungu
> Assignee: Andra Lungu
>
> Skewed graphs raise unique challenges to computation models such as Gelly's
> vertex-centric or GSA iterations. This is mainly because of the fact that
> these approaches uniformly process vertices regardless of their degree
> distribution.
> In vertex-centric, for instance, a skewed node will take more time to process
> its neighbors compared to the other nodes in the graph. The first will act as
> a straggler causing the latter to remain idle until it finishes its
> computation.
> This issue can be mitigated by splitting a high-degree node into subnodes and
> evenly distributing the edges to the the resulted subvertices. The
> computation will then be performed on the split vertex.
> To this end, we should add a Splitting API on top of Gelly which can help:
> - determine skewed nodes
> - split them
> - merge them back at the end of the computation, given a user defined
> combiner.
> To illustrate the usage of these methods, we should add an example as well as
> a separate entry in the documentation.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)