[ 
https://issues.apache.org/jira/browse/TINKERPOP-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133224#comment-15133224
 ] 

ASF GitHub Bot commented on TINKERPOP-962:
------------------------------------------

GitHub user okram opened a pull request:

    https://github.com/apache/incubator-tinkerpop/pull/210

    TINKERPOP-962: Provide "vertex query" selectivity when importing data in 
OLAP.

    https://issues.apache.org/jira/browse/TINKERPOP-962
    
    (For TinkerPop 3.2.0 -- Breaking Change for GraphComputer Implementations)
    
    This feature lets us push a `GraphFilter` predicate down to the 
underlying OLAP graph system. For instance, if `g.V().count()` is executed by 
`SparkGraphComputer`, there is no reason to load the edges; a 
`GraphFilter` predicate that filters out edges can be pushed down instead. 
Graph database providers like Titan can then send up only the subset of the 
graph that the OLAP job requires, instead of filtering on the OLAP cluster 
machines. In the future, we will provide a `GraphFilterTraversalStrategy` that 
analyzes the traversal and automatically generates a `GraphFilter`, so the user 
need not know which subsets of the full graph the OLAP engine actually 
accesses.
    
    This pull request yields a breaking change for graph system providers that 
have their own `GraphComputer` implementation. There are two new methods on 
`GraphComputer` and one new method on `GraphReader`.
    
    ```
    GraphComputer vertices(Traversal<Vertex,Vertex> vertexFilter)
    GraphComputer edges(Traversal<Vertex,Edge> edgeFilter)
    GraphReader.readVertex(InputStream inputStream, GraphFilter graphFilter)
    ```
    
    TinkerPop provides a `GraphFilter` object that does a lot of the heavy 
lifting, so at minimum the graph system provider only needs to call 
`GraphFilter.isLegal()` on the vertices and edges it loads. Note that if the 
graph system provider relies on `GiraphGraphComputer` or `SparkGraphComputer`, 
then no change is required on their part unless they want to apply the 
`GraphFilter` locally before sending their data to Giraph or Spark (an 
optimization that can be done at a later date without impacting users).
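
A minimal sketch of that load-time check, modeled with plain Java predicates 
rather than traversals. `LoadedVertex`, `LoadedEdge`, and `LoadFilterSketch` 
are hypothetical stand-ins invented for illustration and are not TinkerPop 
classes; the real `GraphFilter` wraps the traversals passed to `vertices()` 
and `edges()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical stand-in for an edge incident to a loaded vertex.
final class LoadedEdge {
    final String label;
    final double weight;

    LoadedEdge(String label, double weight) {
        this.label = label;
        this.weight = weight;
    }
}

// Hypothetical stand-in for a vertex together with its incident edges.
final class LoadedVertex {
    final String label;
    final List<LoadedEdge> edges;

    LoadedVertex(String label, List<LoadedEdge> edges) {
        this.label = label;
        this.edges = new ArrayList<>(edges);
    }
}

// Assumed shape of a load-time filter: one vertex predicate, one edge predicate.
final class LoadFilterSketch {
    private final Predicate<LoadedVertex> vertexFilter;
    private final Predicate<LoadedEdge> edgeFilter;

    LoadFilterSketch(Predicate<LoadedVertex> vertexFilter, Predicate<LoadedEdge> edgeFilter) {
        this.vertexFilter = vertexFilter;
        this.edgeFilter = edgeFilter;
    }

    // The vertex is checked first; its edges are examined only if the vertex is legal.
    Optional<LoadedVertex> apply(LoadedVertex vertex) {
        if (!vertexFilter.test(vertex)) {
            return Optional.empty();
        }
        List<LoadedEdge> kept = new ArrayList<>();
        for (LoadedEdge edge : vertex.edges) {
            if (edgeFilter.test(edge)) {
                kept.add(edge);
            }
        }
        return Optional.of(new LoadedVertex(vertex.label, kept));
    }
}
```

A provider-side loader would call `apply()` on each vertex as it is read and 
hand only the non-empty results to the OLAP engine.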
    
    A host of changes were needed to build this feature. When merged, the 
`CHANGELOG.txt` will have the following new items:
    
    ```
    * Added `GraphFilter` to support filtering out vertices and edges that 
won't be touched by an OLAP job.
    * Added `GraphComputer.vertices()` and `GraphComputer.edges()` for 
`GraphFilter` construction (*breaking*).
    * `SparkGraphComputer`, `GiraphGraphComputer`, and `TinkerGraphComputer` 
all support `GraphFilter`.
    * Added `GraphComputerTest.shouldSupportGraphFilter()` which verifies all 
filtered graphs have the same topology.
    * Added `GraphFilterAware` interface to `hadoop-gremlin/` which tells the 
OLAP engine that the `InputFormat` handles filtering.
    * `GryoInputFormat` and `ScriptInputFormat` both implement 
`GraphFilterAware`.
    * Fixed a bug in `TraversalUtil.isLocalStarGraph()` which allowed certain 
illegal traversals to pass.
    * Added `TraversalUtil.isLocalVertex()` to verify that the traversal does 
not touch incident edges.
    * `GraphReader` IO interface now has `Optional<Vertex> 
readVertex(InputStream, GraphFilter)`. Default `UnsupportedOperationException`.
    * `GryoReader` does not materialize edges that will be filtered out and 
this greatly reduces GC and load times.
    * Created custom `Serializers` for `SparkGraphComputer` message-passing 
classes which reduce graph sizes significantly.
    ```
    
    Ran `mvn clean install` and integration tests. Passed.
    
    VOTE +1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-962

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/210.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #210
    
----
commit 873174e8218aef31f2220928ab16463aeda650cd
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-01T16:29:14Z

    Started working on GraphComputer.vertices() and GraphComputer.edges(). Have 
it working (untested) for SparkGraphComputer. The same pattern will flow over 
to GiraphGraphComputer. There are some issues regarding semantics in 
TinkerGraphComputer. Will bring up with a [DISCUSS].

commit 3b3e008ce03d1f63610b92ff79886376d9dc55f7
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-01T19:38:42Z

    GraphComputerTest now verifies that graph filters work -- 
GraphComputer.vertices() and GraphComputer.edges(). SparkGraphComputer 
implements graph filters correctly. TinkerGraph and Giraph throw 
UnsupportedOperationException at this point (i.e. TODO). Had to add remove() 
methods to many of the inner Iterator anonymous classes in IteratorUtils and 
MultiIterator. Basically, they just call remove() on the wrapped iterator. 
Thus, cleanly backwards compatible. The added GraphFilterAware interface will 
allow InputFormats to say whether or not they do vertex/edge-filtering on graph 
load. Nothing is connected to that yet, but GryoInputFormat (and smart 
providers) will be able to leverage this interface. Still a work in progress....
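
A minimal sketch of the delegating remove() pattern described above. 
`MappingIteratorSketch` is a hypothetical helper written for illustration and 
is not the actual IteratorUtils code; it only shows how an anonymous wrapping 
Iterator can forward remove() to the iterator it wraps.

```java
import java.util.Iterator;
import java.util.function.Function;

// Hypothetical helper; not the actual IteratorUtils implementation.
final class MappingIteratorSketch {

    // Wraps an iterator, transforming each element on the way out.
    static <S, E> Iterator<E> map(final Iterator<S> wrapped, final Function<S, E> function) {
        return new Iterator<E>() {
            @Override
            public boolean hasNext() {
                return wrapped.hasNext();
            }

            @Override
            public E next() {
                return function.apply(wrapped.next());
            }

            // Forward removal to the wrapped iterator so the underlying
            // collection drops the element most recently returned by next().
            @Override
            public void remove() {
                wrapped.remove();
            }
        };
    }
}
```

Because Iterator.remove() otherwise defaults to throwing 
UnsupportedOperationException, callers that never invoked remove() on these 
wrappers see no change in behavior.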

commit 3485d8454855938fd7c0c24d5c3f9c3eb6ab308a
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-01T21:45:48Z

    Created a CommonFileInputFormat abstract class that both GryoInputFormat 
and ScriptInputFormat now extend. It handles all vertex/edge filter 
construction and has helper methods for filtering the StarVertex prior to being 
fully loaded by the InputFormat. This is really nice as we can now tweak vertex 
loading to a pretty intense degree especially with GryoInputFormat (e.g. once 
properties are loaded, check vertex filter and thus, don't even deserialize the 
edges). How it is right now, the full Vertex is materialized, then validated 
before the InputFormat will nextKeyValue().

commit 77732ddd5f60bbd65a445390e590da34bea1db2f
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-01T21:59:04Z

    tweaks to filtered boolean check.

commit 64c684065143b75697ccac755b9dfbf943c8c54c
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-01T23:46:07Z

    GiraphGraphComputer now has support for vertexFilters and edgeFilters. 
Consolidated a bunch of code to make it easy for future InputFormats to be 
GraphFilterAware. Will most likely make a filterMap so variables are bundled 
nicely.

commit bc417dbf01fee817aa325ee8e4b582fef8ab6788
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T00:44:19Z

    created a GraphFilter container object that makes storing and applying 
filters easy. Very clean model. GraphFilter will next contain stuff like 
inferences on the filters so easy push-down predicates are available to the 
graph system provider.

commit d0ac65277702b703c1ab2257adcbf67b0699b959
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T15:55:14Z

    GraphFilter is now a really cool class. It is part of gremlin-core/computer 
and provides access to GraphComputer vertices() and edges() load filters. It 
also provides direct support for filtering StarVertex vertices (as most OLAP 
systems will leverage StarVertex). Its StarVertex support is nice in that 
GraphFilter analyzes the edgesFilter and can do bulk dropEdges() to prune the 
StarVertex fast. Whatever it can't do in bulk, it then runs the edgeFilter over 
the remaining edges. GraphComputerTest.shouldSupportGraphFilter() ensures that 
the graph is properly pruned. I have some ideas about pushing GraphFilter down 
to the StarVertex deserializer, but will need @spmallette help on that. If we 
can do that, then we can get some BLAZING speeds for highly pruned OLAP 
operations.
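
A minimal sketch of the two-phase pruning described above, assuming the edges 
of one vertex are grouped by label and that the set of legal labels has 
already been inferred from the edge filter. `PrunableEdge` and 
`StarPruneSketch` are hypothetical names and not TinkerPop classes; how 
GraphFilter actually analyzes the traversal is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical edge type; not a TinkerPop class.
final class PrunableEdge {
    final String label;
    final double weight;

    PrunableEdge(String label, double weight) {
        this.label = label;
        this.weight = weight;
    }
}

final class StarPruneSketch {

    // edgesByLabel: the edges of a single vertex, grouped by edge label.
    // legalLabels:  labels assumed to have been inferred from the edge filter.
    // edgeFilter:   the remaining per-edge check that cannot be done in bulk.
    static Map<String, List<PrunableEdge>> prune(
            Map<String, List<PrunableEdge>> edgesByLabel,
            Set<String> legalLabels,
            Predicate<PrunableEdge> edgeFilter) {
        Map<String, List<PrunableEdge>> pruned = new LinkedHashMap<>();
        for (Map.Entry<String, List<PrunableEdge>> entry : edgesByLabel.entrySet()) {
            // Phase 1: drop a whole label group without inspecting its edges.
            if (!legalLabels.contains(entry.getKey())) {
                continue;
            }
            // Phase 2: run the per-edge filter over what is left.
            List<PrunableEdge> kept = new ArrayList<>();
            for (PrunableEdge edge : entry.getValue()) {
                if (edgeFilter.test(edge)) {
                    kept.add(edge);
                }
            }
            if (!kept.isEmpty()) {
                pruned.put(entry.getKey(), kept);
            }
        }
        return pruned;
    }
}
```

The bulk phase is where the savings come from: a label group that is not 
legal is skipped without evaluating the filter on any of its edges.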

commit eee16c9354602a49e7dfb7738f2ce4d9fe36152c
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T19:17:59Z

    TinkerGraph now supports GraphComputer GraphFilter. Sort of an elegant 
solution that makes use of tagging elements that are legal or not. As of right 
now, the full test suite passes (integration too). GraphFilter works -- this is 
going to be huge for speeding up OLAP times.

commit 7ad48f20586ec58b1fea7018fa8f37ec8c95c9b9
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T19:44:57Z

    added a MapReduce test. We now verify that GraphFilter works for both 
VertexProgram+MapReduce and MapReduce only. TinkerGraph and Spark integration 
tests pass.

commit 72e388c4a1eadb6654a422988857006ed27b6158
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T19:53:54Z

    added nice GraphFilter.legalVertex() and GraphFilter.legalEdges() methods 
so that the provider doesn't have to be smart about how to apply the underlying 
filter traversal.

commit e4cf925b496ee250f7dca48d094e8b93816ca075
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T20:24:45Z

    Added a state-based test case to GraphFilter. About to run this thing on 
the Blade cluster against Friendster to see how well we do now.

commit 7023987a0b5154646a4d77e0b2b3506e850ed3d2
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T20:35:27Z

    Forgot to add vertices() and edges() to the 
ComputerTraversalEngine.Builder. I can't wait for this model to go away in 
favor of a fluent TraversalSource.

commit 6cfb1f22f43fa82be10d04fc28e86e8f3db9d28e
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T21:24:38Z

    found a bug in TraversalUtil.isLocalStarGraph(). Added 
TraversalUtil.isLocalVertex() (for only checking properties -- no edge access). 
Added JavaDoc to new GraphComputer methods. Added verification that the provided 
traversals don't leave their respective boundaries.

commit f7ad5c4f6a7b197cebb86fa22d4c263ce6b3365b
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-02T21:49:22Z

    Added standard GraphComputer.Exceptions for GraphFilter and verify 
Exceptions are thrown correctly in GraphComputerTest. Tweaks to JavaDoc.

commit b824d0c0994276e3714dc59341aa24526127eafe
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-03T18:03:23Z

    Created specialized serializers for common classes in Spark to avoid the 
overhead of JavaSerialization.

commit 4afe29a80fb15f965924297fafda942adeb36b06
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-03T18:09:46Z

    forgot a Serialization that popped up when taking things to the cluster.

commit 1c9a31c4c3d3f09c829d135363ad7ebff6590c8d
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-03T19:49:53Z

    Learned about ExternalizableSerializer which makes registration of Kryo 
serializers a lot simpler. Ran this code on the cluster -- what took 25 
minutes now takes 6.8 minutes.

commit 097e09a39a151e6dbb8ebb268bc1792baac8765a
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-03T20:48:31Z

    minor nothings.

commit 001a13dec5d3bb7ffa269fa2e392947d5c600a5e
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-03T21:28:59Z

    Merge branch 'master' into TINKERPOP-962

commit 569496f671f4e532fc459cee54da3e6e62522ac1
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-04T14:36:31Z

    Merge branch 'master' into TINKERPOP-962

commit 07f7a8c614493de4bd13d2e75292609c5ee7183c
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-04T17:42:24Z

    Moved GraphFilterTest to gremlin-groovy/ so I can use reflection and not 
have to make internal variables protected for testing purposes. 
Optional<Vertex> GraphReader.readVertex(InputStream,GraphFilter) now exists at 
the interface level with an UnsupportedOperationException default. GryoReader 
can now read vertices from a GraphFilter-perspective and only materialize those 
vertices/edges that are legal. Should be fairly trivial to add to 
GraphSONReader.

commit b3d3116e5f287e61d44993af3e709c7d04bf77ac
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-04T20:00:00Z

    was using null to represent a filtered vertex. went with Optional 
throughout so the API is consistent.

commit a28b1fdc673bb6a11b741d306e3706efc4510592
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-04T20:23:12Z

    method rename. pointless twiddling.

commit 25e5b24049ef22d1bb64ae652d6ff5cba4786451
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-04T20:51:12Z

    ensure that the context is closed after the test suite has completed.

commit ed18cd9382ee1e2db7f4618a72e9d28ed6b2fb2a
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-02-04T22:51:08Z

    OMG, the most insane bug for the last two hours. Painful......

----


> Provide "vertex query" selectivity when importing data in OLAP.
> ---------------------------------------------------------------
>
>                 Key: TINKERPOP-962
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-962
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: process
>    Affects Versions: 3.1.0-incubating
>            Reporter: Marko A. Rodriguez
>            Assignee: Marko A. Rodriguez
>              Labels: breaking
>             Fix For: 3.2.0-incubating
>
>
> Currently, when you do:
> {code}
> graph.compute().program(PageRankVertexProgram).submit()
> {code}
> We are pulling the entire {{graph}} into the OLAP engine. We should allow the 
> user to limit the amount of data pulled via "vertex query"-type filter. For 
> instance, we could support the following two new methods on {{GraphComputer}}.
> {code}
> graph.compute().program(PageRankVertexProgram).vertices(hasLabel('person')).edges(out,
>  hasLabel('knows','friend').has('weight',gt(0.8))).submit()
> {code}
> The two methods would be defined as:
> {code}
> public interface GraphComputer {
> ...
> GraphComputer vertices(final Traversal<Vertex,Vertex> vertexFilter)
> GraphComputer edges(final Direction direction, final Traversal<Edge,Edge> 
> edgeFilter)
> {code}
> If the user does NOT provide a {{vertices()}} (or {{edges()}}) call, then the 
> {{Traversal}} is assumed to be {{IdentityTraversal}}. Finally, in terms of 
> execution order, the {{vertices()}} filter is applied first; if it rejects the 
> vertex, the {{edges()}} filter is not applied. Otherwise, the {{edges()}} 
> filter is applied to all of the vertex's incoming and outgoing edges. I don't 
> really like {{Direction}} there, and perhaps it's just:
> {code}
> GraphComputer edges(final Traversal<Vertex,Edge> edgeFilter)
> {code}
> And then all edges that pass through are added to the OLAP vertex. You don't 
> want {{both}}? Then it's {{outE('knows','friend').has('weight',gt(0.8))}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
