zhengruifeng created SPARK-13714: ------------------------------------ Summary: Another ConnectedComponents based on Max-Degree Propagation Key: SPARK-13714 URL: https://issues.apache.org/jira/browse/SPARK-13714 Project: Spark Issue Type: New Feature Components: GraphX Reporter: zhengruifeng Priority: Minor
Current ConnectedComponents algorithm was based on Min-VertexId Propagation, which is sensitive to the place of Min-VertexId. While this implementation is based on Max-Degree Propagation. First, the degree graph is computed. And in the pregel progress, the vertex with the max degree in a CC is the start point of propagation. This new method has advantages over the old one: 1, The convergence is only determined by the structs of CC, and is robust to the place of vertex with Min-ID. 2, For spherical CCs in which there may be a concept like 'center', it can accelerate the convergence. For example, GraphGenerators.gridGraph(sc, 3, 3), the old CC need 4 supersteps, while the new one only need 2 supersteps. 3, If we limit the number of iteration, the new method tend to generate more acceptable results. 4, The output for each CC is the vertex with max degree in it, which may be more meaningful. And because the vertex-ID is nominal in most cases, the vertex with min-ID in a CC is somewhat meanless. But there are still two disadvantages: 1,The message boy grows, from (VID) to (VID, Degree). that is (Long) -> (Long, Int) 2,For graph with simple CCs, it may be slower than old one. Because it need a extra degree computation. The api is the same like ConnectedComponents: val graph = ... val cc = graph.ConnectedComponentsWithDegree(100) or val cc = ConnectedComponentsWithDegree.run(graph, 100) -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org