Added duplicate edge detection recipe
Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/7783c838 Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/7783c838 Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/7783c838 Branch: refs/heads/master Commit: 7783c838e15cfdf3ee567790c657f34403766936 Parents: 043d1a3 Author: Stephen Mallette <sp...@genoprime.com> Authored: Tue Jan 10 08:49:02 2017 -0500 Committer: Stephen Mallette <sp...@genoprime.com> Committed: Tue Jan 10 08:49:02 2017 -0500 ---------------------------------------------------------------------- docs/src/recipes/duplicate-edge.asciidoc | 142 ++++++++++++++++++++++++++ docs/src/recipes/index.asciidoc | 18 ++++ 2 files changed, 160 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/7783c838/docs/src/recipes/duplicate-edge.asciidoc ---------------------------------------------------------------------- diff --git a/docs/src/recipes/duplicate-edge.asciidoc b/docs/src/recipes/duplicate-edge.asciidoc new file mode 100644 index 0000000..ed56106 --- /dev/null +++ b/docs/src/recipes/duplicate-edge.asciidoc @@ -0,0 +1,142 @@ +//// +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +//// +[[duplicate-edge]] +Duplicate Edge Detection +------------------------ + +Whether part of a graph maintenance process or for some other analysis need, it is sometimes necessary to detect +if there is more than one edge between two vertices. The following examples will assume that an edge with the same +label and direction will be considered "duplicate". + +The "modern" graph does not have any duplicate edges that fit that definition, so the following example adds one +that is duplicative of the "created" edge between vertex "1" and "3". + +[gremlin-groovy,modern] +---- +g.V(1).as("a").V(3).addE("created").from("a").iterate() +g.V(1).outE("created") +---- + +One way to find the duplicate edges would be to do something like this: + +[gremlin-groovy,existing] +---- +g.V().outE(). + project("a","b"). <1> + by().by(inV().path().by().by(label)). + group(). <2> + by(select("b")). + by(select("a").fold()). + unfold(). <3> + select(values). <4> + where(count(local).is(gt(1))) +---- + +<1> The "a" and "b" from the `project` contain the edge and the path respectively. The path consists of a the outgoing +vertex, an edge, and the incoming vertex. The use of `by().by(label))` converts the edge to its label (recall that `by` +are applied in round-robin fashion), so the path will look something like: `[v[1],created,v[3]]`. +<2> Group by the path from "b" and construct a list of edges from "a". Any value in this `Map` that has a list of edges +greater than one means that there is more than one edge for that edge label between those two vertices (i.e. the `Map` +key). +<3> Unroll the key-value pairs in the `Map` of paths-edges. +<4> Only the values from the `Map` are needed and as mentioned earlier, those lists with more than one edge would +containa duplicate. + +This method find the duplicates, but does require more memory than other approaches. A slightly more complex approach +that uses less memory might look like this: + +[gremlin-groovy,existing] +---- +g.V().as("ov"). + outE().as("e"). + inV().as("iv"). + inE(). <1> + where(neq("e")). <2> + where(eq("e")).by(label). + where(outV().as("ov")). + group(). + by(select("ov","e","iv").by().by(label)). <3> + unfold(). <4> + select(values). + where(count(local).is(gt(1))) +---- + +<1> To this point in the traversal, the outgoing edges of a vertex are being iterated with the current edge labelled +as "e". For "e", Gremlin traverses to the incoming vertex and back on in edges of that vertex. +<2> Those incoming edges are filtered with the following `where` steps. The first ensures that it does not traverse +back over "e" (i.e. the current edge). The second determines if the edge label is equivalent (i.e. the test for the +working definition of "duplicate"). The third determines if the outgoing vertex matches the one that started the +path labelled as "ov". +<3> This line is quite similar to the output achieved in the previous example at step 2. A `Map` is produced that uses +the outgoing vertex, the edge label, and the incoming vertex as the key, with the list of edges for that path as the +value. +<4> The rest of the traversal is the same as the previous one. + +A third way to approach this problem would be to force a link:https://en.wikipedia.org/wiki/Depth-first_search[depth-first search]. +The previous examples invoke traversal strategies that force a link:https://en.wikipedia.org/wiki/Breadth-first_search[breadth first search] +as a performance optimization. + +[gremlin-groovy,existing] +---- +g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov"). <1> + outE().as("e1"). + inV().as("iv"). + inE(). + where(neq("e1")). + where(outV().as("ov")).as("e2"). <2> + filter(select("e1","e2").by(label).where("e1", eq("e2"))) <3> +---- + +<1> Remove strategies that will optimize for breadth first searches and thus allow Gremlin to go depth first. +<2> To this point, the traversal is very much like the previous one. Review step 2 in the previous example to see the +parallels here. +<3> The final `filter` simply looks for edges that match on label, which would then meet the working definition of +"duplicate". + +The basic pattern at play here is to compare the path of the outgoing vertex, its outgoing edge label and the incoming +vertex. This model can obviously be contracted or expanded as needed to fit different definitions of "duplicate" For +example, if the definition of "duplicate" was just two vertices with more than one edge of any label between them, then +the approach might look something like this: + +[gremlin-groovy,existing] +---- +---- + +Alternatively, a "duplicate" definition could extended to the label and properties of the edge. For purposes of +demonstration, an additional edge is added to the "modern" graph: + +[gremlin-groovy] +---- +g = TinkerFactory.createModern().traversal() +g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate() +g.V(1).as("a").V(3).addE("created").property("weight",0.5d).from("a").iterate() +g.V(1).outE("created").valueMap(true) +---- + +To identify the duplicate with this revised definition, the previous traversal can be modified to: + +[gremlin-groovy,existing] +---- +g.withoutStrategies(LazyBarrierStrategy, PathRetractionStrategy).V().as("ov"). + outE().as("e1"). + inV().as("iv"). + inE(). + where(neq("e1")). + where(outV().as("ov")).as("e2"). + filter(select("e1","e2").by(label).where("e1", eq("e2"))). + filter(select("e1","e2").by("weight").where("e1", eq("e2"))).valueMap(true) +---- \ No newline at end of file http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/7783c838/docs/src/recipes/index.asciidoc ---------------------------------------------------------------------- diff --git a/docs/src/recipes/index.asciidoc b/docs/src/recipes/index.asciidoc index 2e88ac5..9c42f0f 100644 --- a/docs/src/recipes/index.asciidoc +++ b/docs/src/recipes/index.asciidoc @@ -44,6 +44,8 @@ include::connected-components.asciidoc[] include::cycle-detection.asciidoc[] +include::duplicate-edge.asciidoc[] + include::if-then-based-grouping.asciidoc[] include::pagination.asciidoc[] @@ -212,4 +214,20 @@ g.V().hasLabel("student").as("s"). by(select("s").by("name")). by(group().by(select("t").by(valueMap("fromTime","toTime"))). by(select("c").dedup().values("name").fold())).next() +---- + +[[appendix-d]] +_In the "modern" graph, with a duplicate edge added, find the vertex pairs that have more than one edge between them._ + +[gremlin-groovy,modern] +---- +g.V(1).as("a").V(3).addE("created").property("weight",0.4d).from("a").iterate() +g.V(1).outE("created") +g.V().as("a"). + out().as("b"). + groupCount(). + by(select("a","b")). + unfold(). + filter(select(values).is(gt(1))). + select(keys) ---- \ No newline at end of file