[Discussed] Integrating SPARQL-Gremlin 0.2 Plugin with the TinkerPop codebase

Harsh Thakkar Mon, 18 Dec 2017 14:01:59 -0800

Hello Stephen,

Alright, I will remove the bsbm file from the repository and I refer to it in 
the docs (with some examples) sharing a link to download from the website if 
that is acceptable. No worries. 
Should I also remove the northwind file?



Your expectations are reasonable, it was just that I wasn't very clear about 
what needs to be done. Now it is pretty much clear. It will take some time for 
me to wrap my head around the specifics of the tinkerpop codebase in order to 
satisfy the 3 requirements. I will need some help (quite possibly) with getting 
things right as far as the DSL pattern for the gremlin language variants is 
concerned. I am already reading the dev-docs on this, from here:
http://tinkerpop.apache.org/docs/current/reference/#dsl

Also, since you are very well versed in the test suite, I would also request 
some assistance for the same when we are there :) as it is our first time 
pushing a work to the production level. So bear with us :)

I agree with you on not having any API shifts, this does certainly not give a 
good impression, also its a lot of effort down the drain. Quality must be 
ensured. 

One question, though there is not a strict deadline, when is the 3.3.2 release 
planned?

Cheers,
Harsh


On 2017-12-18 20:48, Stephen Mallette <[email protected]> wrote: 
> A quick note about (4) - Having some sample data for user convenience is
> good. Files like that though should not be "resources", but should be added
> here:
> 
> https://github.com/harsh9t/tinkerpop/tree/master/data
> 
> Placing those files there will allow them to be included in the the .zip
> distribution files we produce for Gremlin Console and Gremlin Server. Now,
> that BSBM file is a bit much. It's 90M in size and 22M compressed to zip.
> Either way, that's going to push our already large zip distributions bigger
> than they should be. I don't think the value of this file is worth the
> that. We can definitely make it available as a separate download from the
> web site if everyone thinks it's that important and then provide links to
> it, but I don't think it should be in the source repository as it is now.
> 
> Aside from (4) I just wanted to make some general points about my
> expectations for a sparql-gremlin being part of a TinkerPop release branch.
> Apologies if this wasn't clear from when we started. I think we need to see
> sparql-gremlin as close to a final form as possible before we look to merge
> it. By "final" I mean:
> 
> 1. sparql-gremlin has a full test suite - that means good unit test
> coverage at a minimum and integration tests as necessary (and I sense they
> will be necessary). I agree with marko, that we also have to consider the
> testing pattern carefully, so that we set the stage properly for future
> languages.
> 2. sparql-gremlin has a clear and easy method of usage that is consistent
> with how TinkerPop works - this is crucial prior to merge because TinkerPop
> has high profile production usage. once merged sparql-gremlin will
> immediately be consumed by users and we will not want to shift that API
> once it is available. we will break the code of too many people if we do
> that. we need to strive to get this right from the start.
> 3. sparql-gremlin has a good body of user documentation.
> 
> I don't think any of this is insurmountable, but it does mean there is a
> fair bit of work to do and it won't happen overnight. We held
> gremlin-dotnet to the same rigorous level before merging and even
> gremlin-javascript all these months later is still not merged for basically
> the same reasons, so this is just the process that we tend to go through.
> If we follow what we did for the GLVs, we will likely follow this basic
> process:
> 
> 1. You get sparql-gremlin "pretty close" to final in your fork
> 2. Once we all agree that you are "pretty close", you offer the pull request
> 3. We merge it into a TinkerPop branch for further evaluation (this will be
> a development branch and not a release branch)
> 4. We work together to get the development branch "final"
> 5. We issue a pull request from that development branch
> 6. The pull request goes through the standard review/vote process and we
> merge to a release branch.
> 7. sparql-gremlin will likely be part of 3.3.2 release
> 
> I hope that make sense.
> 
> 
> On Mon, Dec 18, 2017 at 12:26 PM, Marko Rodriguez <[email protected]>
> wrote:
> 
> > Actually, my (3) is bad. Given that query() would always return a
> > Traversal<S, Map<String,E>>, it would be necessary to have that linearized
> > to Traversal<Vertex,Vertex> for the test suite to validate it. That would
> > mean making SPARQLTraversal support extended Traversal methods like
> > flatMap(), blah, blahâ¦ That seems excessive, though convenient.
> >
> > Hmâ¦â¦ Thoughts?,
> > Marko.
> >
> > http://markorodriguez.com
> >
> >
> >
> > > On Dec 18, 2017, at 9:21 AM, Marko Rodriguez <[email protected]>
> > wrote:
> > >
> > > Hello,
> > >
> > > A couple of items worth considering.
> > >
> > > Regarding (7), that should be done prior to master/ merge. It is
> > necessary to follow the patterns that are established in TinkerPop
> > regarding language interoperability. The DSL pattern developed for Gremlin
> > language variants seems to be the best pattern for distinct languages as
> > well. In essence, if your language is not a fluent language, and instead,
> > uses a String, then it should be wrapped as such in a fluent interface
> > using all the Strategy, Step, and Traversal methods that makes sense so it
> > works within the larger infrastructure of TinkerPop (e.g. testing! â see
> > below). What I proposed in my previous email seems the easiest and cleanest
> > way to do things.
> > >
> > > Regarding (3), testing is crucial. Given that this would be TinkerPopâs
> > first distinct language, we donât have a pattern set forth for testing.
> > However, this doesnât mean we canât improvise on our current model. Off 
> > the
> > top of my head, perhaps the best way would be to follow the
> > ProcessTestSuite and do the SPARQL variants of those. For instance:
> > >
> > >       https://github.com/apache/tinkerpop/blob/master/gremlin-
> > test/src/main/java/org/apache/tinkerpop/gremlin/process/
> > traversal/step/map/VertexTest.java#L62 <https://github.com/apache/
> > tinkerpop/blob/master/gremlin-test/src/main/java/org/apache/
> > tinkerpop/gremlin/process/traversal/step/map/VertexTest.java#L62>
> > >
> > > The SPARQL test version would be:
> > >
> > > @Override
> > > public Traversal<Vertex, Vertex> get_g_VX1X_out(final Object v1Id) {
> > >   return sparql.query(âSELECT ?x WHERE {â + toURI(v1Id) + â ?a ?x 
> > > }â);
> > > }
> > >
> > > In this way, sparql is your SPARQLTraversalSource for each test and
> > query() will return a Traversal typed according (query() will have to have
> > solid generic support). From there, you would implement each and every test
> > that is semantically possible with SPARQL (where SPARQ wonât be able to
> > semantically cover all Gremlin tests).
> > >
> > > Stephen has done a lot of recent work to generalize our test suite out
> > of Java so it is in a language agnostic form. I havenât been following 
> > that
> > work so Iâm not sure what Iâm am saying above is exactly as it should be
> > done, but it is a start.
> > >
> > > HTH,
> > > Marko.
> > >
> > > http://markorodriguez.com <http://markorodriguez.com/>
> > >
> > >
> > >
> > >> On Dec 18, 2017, at 7:43 AM, Harsh Thakkar <[email protected] <mailto:
> > [email protected]>> wrote:
> > >>
> > >> Hi Stephen and All,
> > >>
> > >> Thanks for going through the code. I address your questions below (in
> > the same order):
> > >>
> > >> 1. Yes, this file can be removed. It was just to test the traversal
> > method.
> > >>
> > >> 2. Yes, I have commented out the block of tests at this moment since we
> > do not need to run tests at mvn clean install time. However, I kept it (in
> > commented out form) if there arose a need in future for the same. It can
> > surely be removed if you think, it won't be necessary.
> > >>
> > >> 3. There were two testing units (we continued them from Daniel's
> > version), one to check whether the prefixes are being encoded correctly,
> > the second one is to test whether the generated traversal is correct (in
> > short the compiler is functioning as it should). Since, we extended
> > previous work supporting a variety of SPARQL operators, more test cases can
> > be added to validate that each of these is functioning as expected.
> > However, as I mentioned in point #2. we need not do it explicitly as we
> > (Dharmen and I) have already tested them on 3-4 different datasets and
> > query-sets. Now, since we did not know if that was going to be formally
> > required in the future or not, we left them as it is, just commented it out.
> > >>
> > >> 4. These resources are the graphml files that we wish to provide the
> > users, for (i) loading and querying famous datasets - the Berlin SPARQL
> > Benchmark (BSBM)  (famous in the Semantic Web-RDF community) so that they
> > do not have to look elsewhere for the same. (ii) Also, it provides a strong
> > use-case for demonstrating the applicability of sparql-gremlin (creates
> > trust in the SW community users) and (iii) to keep the plug-in pretty much
> > self-dependent.
> > >>
> > >> 5 & 6  YES, damn it. The IDE did this. I will revert these changes.
> > It's like when you are not looking, the IDE does things on it own :-/
> > apologies!
> > >>
> > >> 7. Regarding, Marko's thoughts -- Yes, I was waiting for you to reply
> > to the thread. I do have some thoughts on this. But first, I was wondering
> > if this (what Marko suggested) is supposed to be entirely implemented in
> > the current version of sparql-gremlin 0.2, i.e. including the
> > withStrategies() and withStrategies() and remote() features, or it is to be
> > supported eventually (after the sparql-gremlin 0.2.0) plugin is rolled out.
> > Also, I am not entirely sure I got what Marko was exactly suggesting. I
> > bring this to light in the in-line style reply to Marko's comment later
> > here.
> > >>
> > >> The current implementation is more of a typical compiler, the users,
> > however, can use it by specifying the query file and the dataset against
> > which it is to be executed via the command (once in the gremlin shell):
> > >>
> > >> gremlin> graph = TinkerGraph.open(..)
> > >> gremlin> SparqlToGremlinCompiler.convertToGremlinTraversal(graph,
> > "SELECT ?a WHERE {....} ")
> > >> ==>{?x:marko, ?y:29}
> > >> ==>{?x:josh, ?y:32}
> > >>
> > >>
> > >> i.e. load a graph using pre-defined tinkerpop methods ( graph.io <
> > http://graph.io/>(IoCore.gryo()).readGraph(graphName),
> > TinkerGraph.open(), etc ) , then execute the traversal as above with
> > arguments -- (graph, queryString), where queryString = "SPARQL query".
> > >>
> > >> Now Let me quote Marko's comment and reply in-line to bring more
> > clarity:
> > >>
> > >> 1. There should be a SPARQLTraversalSource which supports one spawn
> > method Ã¢â¬â query(String).
> > >>      This is already happening inside the code. Therefore, we do not
> > need to mention it explicitly. Please correct me if I got it wrong here.
> > >>
> > >> 2. SPARQLTraversal is spawned and it only supports only the Traversal
> > methods Ã¢â¬â next(), toList(), iterate(), etc.
> > >>      All traversal methods that are supported, available to a regular
> > gremlin traversal, can be used by the sparql-gremlin compiler generated
> > traversal as well.
> > >>
> > >> 3. query(String) adds a ConstantStep(String).
> > >>             This is happening internally (as shown in the example
> > above), we can also make explicit. i.e. let the user only provide the
> > queryString instead of the whole "SparqlToGremlinCompiler.
> > convertToGremlinTraversal(graph, "SELECT ?a WHERE {....} ")" command.
> > Does this make sense? or am I missing something here.
> > >>
> > >>
> > >> 4. SPARQLTraversalSource has a registered SPARQLStrategy.
> > >>      At this moment, we leave it to the default setting for this
> > strategy selection.
> > >>
> > >> 5. SPARQLTraversalSource should also support withStrategies(),
> > withoutStrategies(), withRemote(), etc.
> > >>      Once the traversal is generated, it can support all strategies
> > like any other gremlin traversal. Does this make sense to you?
> > >>
> > >> In a nutshell,
> > >> What is happening is that we are converting the SPARQL queryString into
> > a gremlin traversal and leave it upto the tinkerpop compiler to choose what
> > is best for it.
> > >> We only map a SPARQL query to its corresponding pattern matching
> > gremlin traversal (i.e. using with .match() clause). Since, the
> > expressibility of SPARQL is less than that of Gremlin (i.e. SPARQL 1.0
> > doesn't support/allow  performing looping and traversing operations), we
> > can only map what is in the scope of SPARQL language to Gremlin. Once the
> > traversal is generated, it is left to the tinkerpop compiler to select and
> > execute a wide range of strategies ( various levels of optimizations, et).
> > >>
> > >> NOTE - Also, Right now the sparql-gremlin compiler returns the
> > traversal (string) and not Bytecode. Returning the Bytecode is also
> > completely possible, if you want so. We can just perform
> > traversal.asAdmin().getBytecode() for this and it is done.
> > >>
> > >> Since, we extended Daniel's work, we have not changed the names of
> > classes, methods and variable which were used. This, however, can be
> > changed, if you suggest so.
> > >>
> > >> 8. Yes, working in the academia doesn't groom you much on the
> > importance of commenting in the code by default, or for that matter any
> > "good-practices". I will add appropriate comment block in each class for
> > the javadocs.
> > >>
> > >> I hope the above reply address your questions to quite some extent.
> > Most of the issues are already handled internally (as I stated above). We
> > can also leave some advanced features such as remote(), for the 0.2.1
> > release (though this is just an option) :D
> > >> Having said that, Of course, we are in no hurry for the pull request. I
> > also believe it makes complete sense to get things right.
> > >>
> > >> Cheers!
> > >>
> > >> On 2017-12-18 14:11, Stephen Mallette <[email protected] <mailto:
> > [email protected]>> wrote:
> > >>> Harsh, I looked at the code in a bit more detail than I have. Here's
> > some
> > >>> thoughts/questions I had as I was going through things:
> > >>>
> > >>> 1. Can this file be removed - it doesn't appear to have any usage that
> > I
> > >>> can see:
> > >>>
> > >>> https://github.com/harsh9t/tinkerpop/blob/master/sparql-
> > gremlin/src/main/java/org/apache/tinkerpop/gremlin/sparql/Runable.java <
> > https://github.com/harsh9t/tinkerpop/blob/master/sparql-
> > gremlin/src/main/java/org/apache/tinkerpop/gremlin/sparql/Runable.java>
> > >>>
> > >>> 2. I note that this entire block of tests is commented out - should
> > that be
> > >>> removed?:
> > >>>
> > >>> https://github.com/harsh9t/tinkerpop/blob/master/sparql-
> > gremlin/src/test/java/org/apache/tinkerpop/gremlin/sparql/
> > SparqlToGremlinCompilerTest.java
> > >>>
> > >>> 3. I could be wrong, but even if you didn't remove the tests above, it
> > >>> seems like unit testing is rather thin at this point. Am I missing
> > >>> something? Is there more work to do there?
> > >>>
> > >>> 4. I don't understand the nature of these resources:
> > >>>
> > >>> https://github.com/harsh9t/tinkerpop/tree/master/sparql-
> > gremlin/src/main/resources
> > >>>
> > >>> Is there any need to package those with the jar? Should those be "test"
> > >>> resources instead? Do we need the really large data/bsbm1m.graphml
> > file for
> > >>> any specific reason?
> > >>>
> > >>> 5. What are these changes to these poms?
> > >>>
> > >>> https://github.com/harsh9t/tinkerpop/commit/
> > cb3b6512ea3536f556108e5a257c4586aa4d157a
> > >>>
> > >>> I assume that your IDE did that accidentally and it was not intended.
> > >>> Please revert that change.
> > >>>
> > >>> 6. This looks odd too - gremlin-shaded repeated again and again and
> > again:
> > >>>
> > >>> https://github.com/harsh9t/tinkerpop/commit/
> > 143d16f20dcaa9c915b96cdd4adf7b1504db5d36#diff-
> > 9e90009f097eabeb25c28159571fc6a2R118
> > >>>
> > >>> 7. Did you have any thoughts in reference to Marko's earlier reply that
> > >>> described how sparql-gremlin should be used? Right now, it seems like
> > the
> > >>> code you have there is just the "engine" but lacks the piece that
> > connects
> > >>> it into the rest of the stack. From my perspective, I think we need to
> > be
> > >>> sure that users have an easy, clear and consistent way to use
> > >>> sparql-gremlin before we can merge this work. Obviously, having that
> > aspect
> > >>> of the code thought through will impact the documentation that you
> > write as
> > >>> well, so I think you need to go down this path a bit further before we
> > get
> > >>> to the pull request stage.
> > >>>
> > >>> 8. We aren't big javadoc sticklers here, but we try to at least get
> > class
> > >>> level javadoc in place for most classes. I don't see much javadoc or
> > >>> comments in the code right now. I think I'd like to see a modicum of
> > >>> javadoc/comments present as part of this work.
> > >>>
> > >>> So, that's my broad level feedback at this point. It seems as though
> > there
> > >>> are some reasonably large issues there to contend with before a pull
> > >>> request is worth issuing. That's not a problem, of course....we will
> > just
> > >>> keep iterating toward the goal. I'm not aware of anything that is
> > pushing
> > >>> us to rush to a pull request - I'm of the opinion that we can take the
> > time
> > >>> to get this right.
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Stephen
> > >>>
> > >>>
> > >>> On Fri, Dec 15, 2017 at 1:46 PM, Joshua Shinavier <[email protected]>
> > wrote:
> > >>>
> > >>>> Hi Marko,
> > >>>>
> > >>>> I think we're more or less on the same page here; it's clear that TP3
> > has a
> > >>>> different API than TP2. If you look at the guts of TP3 GraphSail [1],
> > it
> > >>>> uses the modern APIs, and yet does adapt them to the Sail interface.
> > >>>>
> > >>>> Something like PropertyGraphSail (or an equivalent Jena thing) still
> > makes
> > >>>> sense in TP3, as well. One interesting detail here is that in TP3,
> > vertices
> > >>>> can have labels, which can be turned into rdf:type statements (that,
> > in
> > >>>> turn, can be used to enable subclass/superclass inheritance if the
> > graph is
> > >>>> combined with a RDF schema.
> > >>>>
> > >>>> A TP3 equivalent of SailGraph would indeed be quite different in
> > >>>> implementation -- strategies, not wrapper graph -- than what we had
> > for
> > >>>> Blueprints, and yet would serve the same purpose.
> > >>>>
> > >>>> Josh
> > >>>>
> > >>>>
> > >>>> [1]
> > >>>> https://github.com/joshsh/graphsail/tree/master/src/
> > >>>> main/java/net/fortytwo/tpop/sail
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Fri, Dec 15, 2017 at 10:22 AM, Marko Rodriguez <
> > [email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> The model proposed below is in-line with TinkerPop2Ã¢â¬â¢s way of
> > thinking.
> > >>>>> Unfortunately, TinkerPop3 and more so for TinkerPop4, the Graph
> > >>>> Ã¢â¬Åstructure"
> > >>>>> API will become deprecated. This means that the notion of
> > Ã¢â¬Åwrapping the
> > >>>>> Graph APIÃ¢â¬  has gone away for TP3 and will be completely gone in
> > TP4. In
> > >>>>> TP4, there will not even be a Graph API Ã¢â¬â no more Vertex, Edge,
> > Property,
> > >>>>> etc. Only the concept of a Graph with only methods like
> > >>>> Graph.traversal(),
> > >>>>> Graph.partitions(), etc.
> > >>>>>
> > >>>>> Why was this route taken? In TinkerPop3, there was a need to support
> > any
> > >>>>> language besides Java. This was why Gremlin bytecode and the concept
> > of
> > >>>> the
> > >>>>> Gremlin traversal machine was introduced. A provider simply gets
> > Gremlin
> > >>>>> bytecode and has to do something with it. For the Java-based Gremlin
> > >>>>> traversal machine, this is why providers implement their own
> > GraphStep,
> > >>>>> VertexStep, etc. For a Python-based Gremlin traversal machine,
> > likewiseÃ¢â¬Â¦
> > >>>>>
> > >>>>> This means that SailGraph, GraphSail, PropertyGraphSail as stated
> > below
> > >>>>> donÃ¢â¬â¢t make sense in the current and future architectures.
> > >>>>>
> > >>>>> The next question becomes, "well how would you turn an RDF store
> > into a
> > >>>>> PropertyGraph?Ã¢â¬  Easy Ã¢â¬â implement your own custom 
> > >>>>> GraphStep,
> > VertexStep,
> > >>>>> etc. and respective ProviderStrategies that will handle the bytecode
> > >>>>> compilation accordingly.
> > >>>>>
> > >>>>> The next question becomes, Ã¢â¬Åwell how would a PropertyGraph 
> > >>>>> support
> > >>>>> reasoning?Ã¢â¬  Easy Ã¢â¬â implement your own custom 
> > >>>>> DecorationStrategy
> > that will
> > >>>>> insert reasoning into the traversal giving the RDFS schema. For
> > instance:
> > >>>>>        g.V().out(Ã¢â¬ÅlikesÃ¢â¬ )
> > >>>>>                ==>
> > >>>>>        g.V().out(Ã¢â¬ÅknowsÃ¢â¬ ,Ã¢â¬ likesÃ¢â¬ )
> > >>>>>                iff Ã¢â¬ÅlikesÃ¢â¬  is a sub-property of 
> > >>>>> Ã¢â¬ÅknowsÃ¢â¬
> > >>>>>
> > >>>>> In essence, it is possible to do this integration of RDF and
> > TinkerPop,
> > >>>> it
> > >>>>> just needs to be done at the correct level of abstraction so that it
> > >>>> stays
> > >>>>> in line with how TinkerPop is evolving, not how it was back in 2012.
> > >>>>>
> > >>>>> Take care,
> > >>>>> Marko.
> > >>>>>
> > >>>>> http://markorodriguez <http://markorodriguez/>.com
> > >>>>>
> > >>>>>
> > >>>>> On 2017-12-13 07:46, Joshua Shinavier <[email protected]> wrote:
> > >>>>>> Hi Harsh,>
> > >>>>>>
> > >>>>>> Glad you are taking Daniel's work forward. In porting the code to
> > the>
> > >>>>>> TinkerPop code base, might I suggest we allow for not only
> > >>>>> SPARQL-Gremlin,>
> > >>>>>> but a whole suite of RDF tools as in TP2. Perhaps call the module>
> > >>>>>> rdf-gremlin. Then we could have all of:>
> > >>>>>>
> > >>>>>> * SPARQL-Gremlin: executes standard SPARQL queries over a Property
> > >>>> Graph>
> > >>>>>> database>
> > >>>>>> * GraphSail [1,2]: stores RDF quads in the database, explicitly,
> > and>
> > >>>>>> enables SPARQL and triple pattern queries over the quads>
> > >>>>>> * PropertyGraphSail [3]: exposes a Property Graph with of two
> > mappings
> > >>>>> to>
> > >>>>>> the RDF data model>
> > >>>>>> * SailGraph [4]: takes an RDF triple store (not natively supporting>
> > >>>>>> Gremlin) and enables Gremlin queries>
> > >>>>>> * others? I have often thought that a continuous SPARQL
> > implementation>
> > >>>>>> built on Gremlin would be powerful>
> > >>>>>>
> > >>>>>> The biggest mismatch between the TP2 suite and what might be built
> > for>
> > >>>>>> Apache TinkerPop is that the previous suite was implemented using
> > >>>>> (Eclipse)>
> > >>>>>> RDF4j, whereas things seem to be leaning towards (Apache) Jena now.>
> > >>>>>> However, the same principles could be applied.>
> > >>>>>>
> > >>>>>> Josh>
> > >>>>>>
> > >>>>>>
> > >>>>>> [1] https://github.com/tinkerpop/blueprints/wiki/Sail-
> > Ouplementation>
> > >>>>>> [2] https://github.com/joshsh/graphsail>
> > >>>>>> [3]>
> > >>>>>> https://github.com/tinkerpop/blueprints/wiki/PropertyGraphSail-
> > >>>>> Ouplementation>
> > >>>>>> [4] https://github.com/tinkerpop/blueprints/wiki/Sail-
> > Implementation
> > >>>>>
> > >>>>> http://markorodriguez.com
> > >
> >
> >
>

[Discussed] Integrating SPARQL-Gremlin 0.2 Plugin with the TinkerPop codebase

Reply via email to