Stian Soiland-Reyes created COMMONSRDF-5:
--------------------------------------------

             Summary: Is a graph API the right thing to do?
                 Key: COMMONSRDF-5
                 URL: https://issues.apache.org/jira/browse/COMMONSRDF-5
             Project: Apache Commons RDF
          Issue Type: Wish
            Reporter: Stian Soiland-Reyes
            Priority: Minor


From https://github.com/commons-rdf/commons-rdf/issues/35

larsga:
{quote}


I have a need for an RDF library in several different projects, and really like 
the idea of a commons-rdf. However, I'm not sure the current proposal provides 
the functionality that is actually necessary.

How should Java code interact with RDF? The most common case will be via 
SPARQL. So a common SPARQL client library with a convenient API, support for 
persistent connections, SSL, basic auth, etc etc, would be a very valuable 
thing.

Another case will be to parse (or generate) RDF. For this, a simple streaming 
interface would be perfect.

An API for accessing RDF as an object model is something I have to say I'm 
deeply skeptical of, for two reasons. The first reason is that it's very rarely 
a good idea. In the vast majority of cases, your data either is in a database 
or should be in a database. SPARQL is the right answer in these cases.

The second reason is that I see many people adopting this API approach to RDF 
even when they obviously should not. The reason seems to be that developers 
want an API, and given an API that's what they choose. Even when, 
architecturally, this is crazy. As a point of comparison, it's very rare for 
people to interact with relational data via interfaces named Database, Relation, 
Row, etc. But for RDF this has somehow become the norm. Some of the triple 
store vendors (like Oracle) even boast of supporting the Jena APIs, even though 
one should under no circumstances use APIs of that kind to work with triples 
stored in Oracle.

So my fear is that an API like the one currently proposed will not only fail to 
provide the functionality that is most commonly needed, but also lead 
developers astray.

I guess this is probably not the most pleasant feedback to receive, but I felt 
it had to be said. Sorry about that.

{quote}

ansell:
{quote}


No need to apologise (@wikier and I asked you to expand on your Twitter 
comments!)

From my perspective, I would love to port (and improve where necessary) 
RDFHandler from Sesame to Commons RDF. However, we felt that it was not 
applicable in the first version that we requested comments from people on, 
based on a very narrow scope of solely relying on the RDF-1.1 Abstract Model 
terminology.

As you point out, the level of terminology used in the Abstract Model is too 
low for common application usage. @afs has pointed out difficulties with using 
Graph as the actual access layer for a database. In Sesame, the equivalent 
Graph interface is never used for access to permanent data stores, only for 
in-memory filtering and analysis between the database and users, which happens 
fairly often in my applications so I am glad that it exists.

A good fast portable SPARQL client library would still need an object model to 
represent the results in, to send them to a typesafe API. Before we do that we 
wanted to get the object model to a relatively mature stage.

From this point on we have a few paths that we can follow to expand out to an 
RDF Streaming API and a SPARQL client library, particularly as we have a focus 
on Java-8 with Lambdas.

For example, we could have something like:

{code}
interface TupleHandler {
  void handleHeaders(List<String> headers);
  void handleTuple(List<RDFTerm> row);
}

interface RDFHandler {
  void handleRDF(Triple triple);
}

interface SPARQLClient {
  // handler-registration methods return the client so calls can be chained
  SPARQLClient tupleHandler(TupleHandler handler);
  SPARQLClient rdfHandler(RDFHandler handler);
  boolean ask(String query);
  void select(String query);
  void construct(String query);
  void describe(IRI iri);
  void describe(String query);
}
{code}

Usage may be:

{code}
client.ask(myQuery);
client.tupleHandler(handler).select(myQuery);
client.rdfHandler(handler).construct(myQuery);
client.rdfHandler(handler).describe(iri);
client.rdfHandler(handler).describe(myQuery);
{code}

Could you suggest a few possible alternative models that would suit you and 
critique that model?
{quote}
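
(For illustration only: with Java-8 lambdas, and assuming the handler-registration 
methods return the client for chaining as in the usage above, a caller of such a 
sketch might look like the following. SPARQLClient, TupleHandler, RDFHandler, 
Triple and RDFTerm are the draft names from the comment, not an existing 
Commons RDF API.)

{code}
import java.util.List;

// Hypothetical caller of the interfaces sketched in the comment above.
public class SketchUsage {
  public static void run(SPARQLClient client, String myQuery) {
    // RDFHandler has a single abstract method, so a lambda fits.
    client.rdfHandler(triple -> System.out.println(triple))
          .construct(myQuery);

    // TupleHandler has two methods, so an anonymous class is used instead.
    client.tupleHandler(new TupleHandler() {
      public void handleHeaders(List<String> headers) {
        System.out.println("columns: " + headers);
      }
      public void handleTuple(List<RDFTerm> row) {
        System.out.println(row);
      }
    }).select(myQuery);
  }
}
{code}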

afs:
{quote}


Commons-RDF allows an application to switch between implementations. A variation 
of @larsga's point is that SPARQL (languages and protocols) gives that 
separation already.

I would like to hear more as to what is special about a portable SPARQL Client 
because it seems to be a purely local choice for the application. You can issue 
SPARQL queries over JDBC (see jena-jdbc). People are already querying DBpedia 
etc from Jena or Sesame or javascript or python or ... . DBpedia does not 
impose client choice.

There are processing models that are not so SPARQL-amenable, such as some graph 
analytics (think map/reduce or RDD), where handling the data at the RDF 1.1 
data model is important and the RDF graph does matter as a concept, because 
the application wishes to walk the graph, following links.

What would make working with SPARQL easier, but does not need portability, is 
mini-languages that make SPARQL easier to write in programs, maybe 
specialised to particular usage patterns. There is no need for mega-toolkits 
everywhere.

@larsga - what's in your ideal RDF library?

(To Oracle, and others, the "Jena API" includes the SPARQL interface and then 
how to deal with the results.)
{quote}
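
(As a rough sketch of the "walk the graph, following links" case above, assuming 
the draft Graph, Triple, RDFTerm and BlankNodeOrIRI interfaces and a 
getTriples(subject, predicate, object) method returning a Stream of matching 
triples - the method name and signature are assumptions, not settled API:)

{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

// Sketch only: breadth-first walk over a Graph from a starting resource,
// following every outgoing link whose object is itself a resource.
public class GraphWalk {
  public static Set<BlankNodeOrIRI> reachableFrom(Graph graph, BlankNodeOrIRI start) {
    Set<BlankNodeOrIRI> visited = new HashSet<>();
    Deque<BlankNodeOrIRI> queue = new ArrayDeque<>();
    queue.add(start);
    while (!queue.isEmpty()) {
      BlankNodeOrIRI node = queue.poll();
      if (!visited.add(node)) {
        continue; // already walked
      }
      try (Stream<? extends Triple> out = graph.getTriples(node, null, null)) {
        out.map(Triple::getObject)
           .filter(o -> o instanceof BlankNodeOrIRI)
           .map(o -> (BlankNodeOrIRI) o)
           .forEach(queue::add);
      }
    }
    return visited;
  }
}
{code}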

larsga:
{quote}
What's special about a common SPARQL client is that none seems to exist in Java 
at the moment. So if commons-rdf could provide one that would be great.

Getting results via JDBC may be preferable in some cases, but in general it's 
not ideal. How do you get the result as it really was in that case? With data 
type URIs and language tags? How do you get the parsed results of CONSTRUCT 
queries? In addition, the API is not very convenient.

jena-jdbc requires jena-jdbc-core, which in turn requires ARQ, which then 
requires ... That's a non-starter. If I simply want to send SPARQL queries over 
HTTP having to pull in the entire Jena stack is just not on.

> There are processing models that are not so SPARQL-amenable, such as some 
> graph analytics (think map/reduce or RDD), where handling the data at the 
> RDF 1.1 data model is important and the RDF graph does matter as a concept, 
> because the application wishes to walk the graph, following links.

Yes. This is a corner case, though, and it's very far from obvious that a 
full-blown object model for graph traversal is the best way to approach this. 
Or that it will even scale. But never mind that.

What's missing in the Java/RDF space are the main tools you really need to build 
an RDF application in Java: a streaming API to parsers plus a SPARQL client. 
Something like this can be provided very easily in a very light-weight package, 
and would provide immense value.

An object model representing the RDF Data Model directly would, imho, do more 
harm than good, simply because it would mislead people into thinking that this 
is the right way to interact with RDF in general.
{quote}

afs:
{quote}


At the minimum, you don't need anything other than an HTTP client and to 
retrieve the JSON!

If you want to work with RDF concepts in your application, Jena provides 
streaming parsers plus a SPARQL client, as does Sesame. Each provides exactly 
what you describe! Yes, a system that was minimal would be smaller, but (1) is 
the size difference that critical (solution: strip down a toolkit)? Data is 
large, code is small; and (2) in what way is it not yet-another-toolkit, and all 
that goes with that?

{quote}
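
(To make the "just an HTTP client and JSON" option concrete, a minimal sketch 
using nothing but the JDK - the endpoint URL is whatever the caller has, and the 
raw application/sparql-results+json body is returned unparsed:)

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: send a SPARQL SELECT over HTTP GET, ask for the JSON results
// format, and hand the raw body back to the caller to parse as they wish.
public class BareSparqlSelect {
  public static String select(String endpoint, String query) throws IOException {
    String url = endpoint + "?query=" + URLEncoder.encode(query, "UTF-8");
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestProperty("Accept", "application/sparql-results+json");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      StringBuilder body = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line).append('\n');
      }
      return body.toString();
    } finally {
      conn.disconnect();
    }
  }
}
{code}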

wikier:
{quote}


OK, I think now I understand @larsga's point...

I do agree that SPARQL should in theory be such a "common interface". But what 
happens right now is that each library serializes the results using its own 
terms. So one of the goals of commons-rdf would be to align the interfaces 
there too.

Of course you could always say you can stay decoupled by parsing the results 
yourself. But that has two problems: on the one hand, you are reimplementing 
code you do not need to, and probably making mistakes. On the other hand, that 
only works if your code is not going to be used by anyone else; as soon as it 
is going to be used, instead of solving a problem you are causing another one.

In case this helps the discussion: we discussed the idea of commons-rdf 
because in two consecutive weeks I had to deal with the same problem: I needed 
to provide a simple client library, and I realized that the decision I made in 
terms of which library I chose forced people to use that library.

Those two client libraries are the Redlink SDK and MICO. Both with different 
purposes and different targets, but in the end dealing with the same problem.

https://github.com/redlink-gmbh/redlink-java-sdk
http://www.mico-project.eu/commons-rdf/
{quote}

larsga:
{quote}


Yes, this is getting closer to what I meant. As you say, a SPARQL client 
library is fine for stuff like ASK, SELECT, INSERT and so on. The problem is 
CONSTRUCT, or if you want to parse a file. However, even in those cases I do 
not want an in-memory representation of the resulting RDF. I want it streamed, 
kind of like SAX for XML. Then, if I need an in-memory representation I will 
build one from the resulting stream.

Now if you argue that there will be people for whom an in-memory representation 
is the best choice I guess that's OK. But I think it's wrong to force people to 
go via such a representation. Ideally, I'd like to see:

*    a simple streaming interface,
*    a simple abstraction for parsers and writers,
*    a SPARQL client library that represents RDF as callback streams.

If there also has to be a full-blown API with Graph, Statement, and the like, 
so be it. But it would IMHO be best if that were layered on top of the rest as 
an option, so that if you wanted you could build such a model by streaming 
statements into it, but you wouldn't be forced to go via those interfaces if 
you didn't want to.

{quote}
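
(A minimal sketch of the SAX-style, callback-based contract being asked for 
here - TripleSink and StreamingParser are made-up names for illustration, while 
Triple and Graph refer to the draft commons-rdf interfaces; an in-memory Graph 
is then just one optional sink among others:)

{code}
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: a parser pushes triples to a sink as it reads them,
// so no in-memory model is required unless the caller wants one.
interface TripleSink {
  void start();
  void triple(Triple triple);
  void end();
}

interface StreamingParser {
  void parse(InputStream in, String baseIRI, TripleSink sink) throws IOException;
}

// Optional layering: only callers who want an in-memory model pay for one.
class GraphBuildingSink implements TripleSink {
  private final Graph graph;
  GraphBuildingSink(Graph graph) { this.graph = graph; }
  public void start() { }
  public void triple(Triple triple) { graph.add(triple); }
  public void end() { }
}
{code}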

ansell:
{quote}


I understand that Graph is an abstraction that many people do not need, 
particularly if they are streaming, but Statement seems to be a very useful 
abstraction in an object oriented language, and it should be very low cost to 
create, even if you are streaming.

As Andy says, both Sesame and Jena currently offer streaming parsers for both 
SPARQL Results formats and RDF formats, so your main argument seems to be 
possible in practice already. The choice is just not interchangeable after you 
decide which library to use, which is the reason we stopped where we did so 
far: the current model is at least enough to get streaming parsers going.

All parts of the API are loosely linked at this point, with a clear theoretical 
model from RDF-1.1. Hence, you don't need to implement or use Graph if you just 
want a streaming API that accepts Statement or a combination of the available 
RDFTerms.

{quote}

drlivingstone:
{quote}
I think Statement seems like something that would be essential / useful - it's 
the smallest "functional" piece of RDF. (A use case where you want to iterate 
over parts of a Graph response in units smaller than triples seems weird to 
me - why not use a SELECT query then? But anyway.) Whether Graph gets its own 
class/API, or whether Statement could be a (potentially implicit) quad instead, 
is probably where the different underlying libraries will have differing goals.

Regarding the goal of the library to have common abstractions / vocabulary - I 
would bet most people using RDF are also using (at least some) SPARQL. You can 
build a generic interface for querying and streaming through results that 
covers both Jena and Sesame; I have done so in Clojure, anyway, in my KR 
library. This requires more than just agreeing that results are in terms of the 
common RDFTerm class, though; as pointed out above, a common SPARQL API is 
needed to agree on how tuples or graphs etc. are returned and iterated over. 
But it wasn't that hard to do. Having the underlying library maintainers do it 
for me (possibly more efficiently) would certainly have been better. This goes 
beyond the scope of just defining core RDF terms, though.
{quote}

stain:
{quote}


I think the Graph concept is useful - not everyone is accessing pre-existing 
data on a pre-existing SPARQL server. For instance, a light-weight container 
for annotations might want to expose just a couple of Graph instances without 
exposing the underlying RDF framework. Someone who is generating RDF as a 
side-product can chuck their triples in a Graph and then pass it to an 
arbitrary RDF framework for serialization or for sending to a LOD server.

I can see many libraries that would not use Graph, but could use the other 
RDFTerms.

This would be the case for OWLAPI, for instance, which has Ontology as a core 
concept rather than a graph. Operations like Graph.add() don't make much sense 
there in general, as you have to serialize the ontology as RDF before you get a 
graph.

I don't think it should be a requirement for implementors to provide a Graph 
implementation - thus RDFTermFactory.createGraph() is optional.
{quote}
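
(One way such optionality could look, purely as a sketch - whether createGraph() 
is a default method on the draft RDFTermFactory, and what it throws when 
unsupported, are assumptions made here for illustration:)

{code}
// Sketch: a factory that provides the RDFTerm types but opts out of Graph.
interface RDFTermFactory {
  IRI createIRI(String iri);
  // ... other term-creating methods ...

  // Optional operation: implementations without a Graph simply inherit this.
  default Graph createGraph() {
    throw new UnsupportedOperationException("createGraph() not supported");
  }
}
{code}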



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
