Repository: tinkerpop Updated Branches: refs/heads/centrality-recipes [created] e97255f20
Updated the closeness and betweeness centrality recipes. 1) added a disclaimer that both recipes should be used with care (only on small (sub)graphs) 2) optimized the execution plan 3) took multiple shortest paths between a pair of vertices into account 4) updated the sample graph to one that has multiple shortest paths between certain pairs of vertices Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/e97255f2 Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/e97255f2 Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/e97255f2 Branch: refs/heads/centrality-recipes Commit: e97255f205832d052ac747404d3a12d9e06b0fec Parents: 97cc07d Author: Daniel Kuppitz <daniel_kupp...@hotmail.com> Authored: Fri Jan 20 16:36:42 2017 +0100 Committer: Daniel Kuppitz <daniel_kupp...@hotmail.com> Committed: Fri Jan 20 16:36:42 2017 +0100 ---------------------------------------------------------------------- docs/src/recipes/centrality.asciidoc | 121 +++++++++++++++++------------- 1 file changed, 67 insertions(+), 54 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/e97255f2/docs/src/recipes/centrality.asciidoc ---------------------------------------------------------------------- diff --git a/docs/src/recipes/centrality.asciidoc b/docs/src/recipes/centrality.asciidoc index cbce418..f6fba30 100644 --- a/docs/src/recipes/centrality.asciidoc +++ b/docs/src/recipes/centrality.asciidoc @@ -67,43 +67,48 @@ image:betweeness-example.png[width=600] [gremlin-groovy ] ---- -g.addV('name','a').as('a'). - addV('name','b').as('b'). - addV('name','c').as('c'). - addV('name','d').as('d'). - addV('name','e').as('e'). +g.addV(id,'A').as('a'). + addV(id,'B').as('b'). + addV(id,'C').as('c'). + addV(id,'D').as('d'). + addV(id,'E').as('e'). + addV(id,'F').as('f'). addE('next').from('a').to('b'). addE('next').from('b').to('c'). - addE('next').from('c').to('d'). - addE('next').from('d').to('e').iterate() -g.withSack(0).V().store("x").repeat(both().simplePath()).emit().path(). <1> - group().by(project("a","b").by(limit(local, 1)). <2> - by(tail(local, 1))). - by(order().by(count(local))). <3> - select(values).as("shortestPaths"). <4> - select("x").unfold().as("v"). <5> - select("shortestPaths"). <6> - map(unfold().filter(unfold().where(eq("v"))).count()). <7> - sack(sum).sack().as("betweeness"). <8> - select("v","betweeness") + addE('next').from('b').to('d'). + addE('next').from('c').to('e'). + addE('next').from('d').to('e'). + addE('next').from('e').to('f').iterate() +g.V().as("v"). <1> + repeat(both().simplePath().as("v")).emit(). <2> + filter(project("x","y","z").by(select(first, "v")). <3> + by(select(last, "v")). + by(select(all, "v").count(local)).as("triple"). + coalesce(select("x","y").as("a"). <4> + select("triples").unfold().as("t"). + select("x","y").where(eq("a")). + select("t"), + store("triples")). <5> + select("z").as("length"). + select("triple").select("z").where(eq("length"))). <6> + select(all, "v").unfold(). <7> + groupCount().next() <8> ---- -<1> Defines a Gremlin link:http://tinkerpop.apache.org/docs/x.y.z/reference/#sack-step[sack] with a value of zero, -which represents the initial betweeness score for each vertex, and traverses on both incoming and outgoing edges -avoiding <<cycle-detection, cyclic paths>>. -<2> Group each path by the first and last vertex. -<3> Reduce the list of paths to the shortest path between the first and last vertex by ordering on their lengths. -<4> Recall that at this point, there is a `Map` keyed by first and last vertex and with a value of just the shortest -path. Extract the shortest path with `select(values)`, since that's the only portion required for the remainder of -the traversal. -<5> The "x" key contains the list of vertices stored from step 1 - unfold that list into "v" for later use. This step -will unwrap the vertex that is stored in the `Traverser` as -link:http://tinkerpop.apache.org/javadocs/x.y.z/full/org/apache/tinkerpop/gremlin/process/traversal/step/util/BulkSet.html[BulkSet] -so that it can be used directly in the `Traversal`. -<6> Iterate the set of shortest paths. At this point, it is worth noting that the traversal is iterating each vertex -in "v" and for each vertex in "v" it is iterating each `Path` in "shortestpaths". -<7> For each path, transform it to a count of the number of times that "v" from step 5 is encountered. -<8> Sum the counts for each vertex using `sack()`, normalize the value and label it as the "betweeness" to be the score. +<1> Starting from each vertex in the graph... +<2> ...traverse on both - incoming and outgoing - edges, avoiding <<cycle-detection, cyclic paths>>. +<3> Create a triple consisting of the first vertex, the last vertex and the length of the path between them. +<4> Determine whether a path between those two vertices was already found. +<5> If this is the first path between the two vertices, store the triple in an internal collection named "triples". +<6> Keep only those paths between a pair of vertices that have the same length as the first path that was found between them. +<7> Select all shortest paths and unfold them. +<8> Count the number of occurrences of each vertex, which is ultimately its betweeness score. + +WARNING: Since the betweeness centrality algorithm requires the shortest path between any pair of vertices in the graph, +its practical applications are very limited. It's recommended to use this algorithm only on small subgraphs (graphs like +the link:http://tinkerpop.apache.org/docs/current/reference/#grateful-dead[Grateful Dead graph] with only 808 vertices +and 8049 edges already require a massive amount of compute resources to determine the shortest paths between all vertex +pairs). [[closeness-centrality]] Closeness Centrality @@ -114,28 +119,36 @@ other reachable vertices in the graph. The following examples use the modern toy [gremlin-groovy,modern] ---- -g.withSack(1f).V().repeat(both().simplePath()).emit().path(). <1> - group().by(project("a","b").by(limit(local, 1)). <2> - by(tail(local, 1))). - by(order().by(count(local))). <3> - select(values).unfold(). <4> - project("v","length"). - by(limit(local, 1)). <5> - by(count(local).sack(div).sack()). <6> - group().by(select("v")).by(select("length").sum()) <7> +g = TinkerFactory.createModern().traversal() +g.withSack(1f).V().as("v"). <1> + repeat(both().simplePath().as("v")).emit(). <2> + filter(project("x","y","z").by(select(first, "v")). <3> + by(select(last, "v")). + by(select(all, "v").count(local)).as("triple"). + coalesce(select("x","y").as("a"). <4> + select("triples").unfold().as("t"). + select("x","y").where(eq("a")). + select("t"), + store("triples")). <5> + select("z").as("length"). + select("triple").select("z").where(eq("length"))). <6> + group().by(select(first, "v")). <7> + by(select(all, "v").count(local).sack(div).sack().sum()).next() ---- -<1> Defines a Gremlin link:http://tinkerpop.apache.org/docs/x.y.z/reference/#sack-step[sack] with a value of one, -and traverses on both incoming and outgoing edges avoiding <<cycle-detection, cyclic paths>>. -<2> Group each path by the first and last vertex. -<3> Reduce the list of paths to the shortest path between the first and last vertex by ordering on their lengths. -<4> Recall that at this point, there is a `Map` keyed by first and last vertex and with a value of just the shortest -path. Extract the shortest path with `select(values)`, since that's the only portion required for the remainder of -the traversal. -<5> The first `by()` modulator for `project()` extracts the first vertex in the path. -<6> The second `by()` modulator for `project()` extracts the path length and divides that distance by the value of -the `sack()` which was initialized to 1 at the start of the traversal. -<7> Group the resulting `Map` objects on "v" and sum their lengths to get the centrality score for each. +<1> Defines a Gremlin link:http://tinkerpop.apache.org/docs/x.y.z/reference/#sack-step[sack] with a value of one. +<2> Traverses on both - incoming and outgoing - edges, avoiding <<cycle-detection, cyclic paths>>. +<3> Create a triple consisting of the first vertex, the last vertex and the length of the path between them. +<4> Determine whether a path between those two vertices was already found. +<5> If this is the first path between the two vertices, store the triple in an internal collection named "triples". +<6> Keep only those paths between a pair of vertices that have the same length as the first path that was found between them. +<7> For each vertex divide 1 by the product of the lengths of all shortest paths that start with this particular vertex. + +WARNING: Since the closeness centrality algorithm requires the shortest path between any pair of vertices in the graph, +its practical applications are very limited. It's recommended to use this algorithm only on small subgraphs (graphs like +the link:http://tinkerpop.apache.org/docs/current/reference/#grateful-dead[Grateful Dead graph] with only 808 vertices +and 8049 edges already require a massive amount of compute resources to determine the shortest paths between all vertex +pairs). [[eigenvector-centrality]] Eigenvector Centrality @@ -165,4 +178,4 @@ link:http://tinkerpop.apache.org/docs/current/reference/#timelimit-step[time lim traversal from taking longer than one hundred milliseconds to execute (the previous example takes considerably longer than that). While the answer provided with the `timeLimit()` is not the absolute ranking, it does provide a relative ranking that closely matches the absolute one. The use of `timeLimit()` in certain algorithms (e.g. recommendations) -can shorten the time required to get a reasonable and usable result. \ No newline at end of file +can shorten the time required to get a reasonable and usable result.