Hi Sebastian,

On Aug 7, 2012 5:39 PM, "Sebastian Schaffert" <
[email protected]> wrote:
> Just a meta-comment: before, I was told that Clerezza is a kind of
> wrapper that abstracts away from the concrete triple store implementations
> (Jena and Sesame). What you explained to me now is a completely alternative
> triple store implementation. Which may have justifications, but also
> additional problems. Comments follow below…

The part of Clerezza I'm arguing in favour of here is the Java API modelling
the RDF data model. It is agnostic about the actual storage of the RDF data.
The API should work well for the 56 triples of your Android app's
configuration as well as for huge graphs stored in a triple store.


> Another meta-comment: I see this as an intellectual challenge where we
> both learn; it is not my intention to criticise Clerezza, I just take the
> Sesame position in this discussion because I know it quite well and have
> made very good experiences with it, so I need really convincing arguments
> to switch to a different API. Maybe the discussion with me can also help
> you convince others to use/develop for Clerezza.

I very much appreciate this discussion and the review you're giving. It has
been a couple of years since the decision was taken to propose a new RDF API
rather than using one of the existing ones (Jena, Sesame, RDF2Go), so it's
good to review this discussion.

..

> Merely counting classes does not say much, especially since Sesame
> provides much more functionality. Many of the Sesame classes I see are
> actually from the abstract syntax tree of the SPARQL Query and Update
> parsers (and as we all know there are many different ways to implement a
> parser), from the HTTP Server functionality, and from the different
> serializers and parsers. If you can live without these, you will easily
> have a package as small as the Clerezza core.

I think it is a good thing to have a core artifact containing all the APIs:
a single jar with a self-contained javadoc that both backends and clients
can be compiled against. This is a design advantage rather than a matter of
removing dead code.
>
> >
> > What I mean by separating utility classes is mainly the separate
> > resource centric API provided by RDF utils (
> > http://incubator.apache.org/clerezza/mvn-site/utils/apidocs/index.html). I
> > think this is mainly a difference to the Jena API where the resource
> > oriented and the triple oriented approach are provided by the same API.
>
> This is probably a philosophical issue: have all functionality in one
> place vs. having things cleanly separated. Both have advantages and
> disadvantages. Personally, I like the Sesame value factory because it is a
> single place where to look for the suitable factory methods.

In the Jena API a resource returned by a triple iterator is directly an
object similar to a GraphNode in Clerezza: something that is tied to the
graph ('model' in Jena) and can thus provide methods to list its properties
and their values. This is not the case for the Sesame API (at least not in
org.openrdf.model).
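
To make the difference concrete, here is a minimal, self-contained sketch of
the resource-centric view. The names are illustrative stand-ins, not the
actual Clerezza or Jena signatures: a node object keeps a back-reference to
its graph and can therefore enumerate its own property values.

    import java.util.*;

    // Illustrative stand-ins, not the real API types.
    interface Node {}
    record Iri(String value) implements Node {}
    record Triple(Node subject, Iri predicate, Node object) {}

    // The resource-centric view: the node is tied to its graph, so it can
    // list its own properties, as Jena's Resource or Clerezza's GraphNode do.
    class GraphNodeSketch {
        private final Node node;
        private final Collection<Triple> graph;

        GraphNodeSketch(Node node, Collection<Triple> graph) {
            this.node = node;
            this.graph = graph;
        }

        // List the values of the given property of this node.
        List<Node> getObjects(Iri predicate) {
            List<Node> result = new ArrayList<>();
            for (Triple t : graph)
                if (t.subject().equals(node) && t.predicate().equals(predicate))
                    result.add(t.object());
            return result;
        }
    }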

As for the value factory, it is not clear to me from the javadoc whether the
triples added to a graph must consist exclusively of values created by the
respective value factory.


> >
> > - The zz API defines identity criteria for graphs. The Sesame API doesn't
> > define when equals should return true; the zz API defines clear rules
> > which are distinct for mutable and for immutable graphs. Similarly the
> > hashCode method for graphs is defined. In Sesame it seems that an instance
> > is equal only to itself. This doesn't take into account what RDF semantics
> > say about graph identity.
>
> This is honestly a functionality I have never needed in 10 years. I see
> its use case in small in-memory graphs (like they are used in Stanbol), but
> for a multi-billion triple graph this is an irrelevant functionality.

Even your multi-billion-triple application will probably communicate with
others, and that communication will typically happen with small graphs. The
fact that graph isomorphism may be neither needed nor wanted in most cases
is no argument against clearly defining in the API when and how triple
collections are to be compared for equality.


>
> BTW, if you want to implement graph equivalence, you have to implement
> the bijection as specified in
> http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality.
> This is a very expensive operation anyways, especially when taking into
> account blank nodes.

That's correct and consistent with the equals definition of the Clerezza
graph API. It is implemented by the utility AbstractGraph class, so
implementors don't have to care about it.
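
For small graphs, a brute-force implementation of that contract is short
enough to sketch. The following is a hedged, self-contained illustration
(not the actual AbstractGraph code): two graphs are equal iff some bijection
between their blank nodes makes the triple sets identical.

    import java.util.*;

    // Hedged, brute-force sketch of the equality contract for immutable
    // graphs. All names are illustrative.
    final class IsoSketch {
        interface Node {}
        record Iri(String value) implements Node {}
        record Literal(String lexicalForm) implements Node {}
        static final class BNode implements Node {}  // identity only, no exposed id
        record Triple(Node subject, Iri predicate, Node object) {}

        static boolean isomorphic(Set<Triple> g1, Set<Triple> g2) {
            if (g1.size() != g2.size()) return false;
            List<BNode> b1 = bnodes(g1), b2 = bnodes(g2);
            if (b1.size() != b2.size()) return false;
            return tryMap(g1, g2, b1, b2, new HashMap<>(), 0);
        }

        private static List<BNode> bnodes(Set<Triple> g) {
            List<BNode> out = new ArrayList<>();
            for (Triple t : g)
                for (Node n : List.of(t.subject(), t.object()))
                    if (n instanceof BNode b && !out.contains(b)) out.add(b);
            return out;
        }

        // Try every injective mapping of g1's bnodes onto g2's bnodes.
        private static boolean tryMap(Set<Triple> g1, Set<Triple> g2,
                List<BNode> b1, List<BNode> b2, Map<BNode, BNode> map, int i) {
            if (i == b1.size()) return mapped(g1, map).equals(g2);
            for (BNode candidate : b2) {
                if (map.containsValue(candidate)) continue;
                map.put(b1.get(i), candidate);
                if (tryMap(g1, g2, b1, b2, map, i + 1)) return true;
                map.remove(b1.get(i));
            }
            return false;
        }

        private static Set<Triple> mapped(Set<Triple> g, Map<BNode, BNode> map) {
            Set<Triple> out = new HashSet<>();
            for (Triple t : g)
                out.add(new Triple(subst(t.subject(), map), t.predicate(),
                        subst(t.object(), map)));
            return out;
        }

        private static Node subst(Node n, Map<BNode, BNode> map) {
            return n instanceof BNode b && map.containsKey(b) ? map.get(b) : n;
        }
    }

As you note, this is expensive; the exhaustive search is only acceptable
because the graphs compared this way are typically small.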

> For example, the following two graphs could be considered equivalent
> (because the blank nodes are existentially qualified):
>
> Graph 1:
> <http://www.example.com/r/1> <http://www.example.com/p/1> _:a
> _:a <http://www.example.com/p/2> "123"
> _:a <http://www.example.com/p/3> "456"
>
> Graph 2:
> <http://www.example.com/r/1> <http://www.example.com/p/1> _:b
> <http://www.example.com/r/1> <http://www.example.com/p/1> _:c
> _:b <http://www.example.com/p/2> "123"
> _:c <http://www.example.com/p/3> "456"

No, even if the two graphs mutually entailed each other or 'expressed the
same content' (to use the wording of RDF Semantics) they would still be
distinct graphs, see: http://www.w3.org/TR/rdf-mt/#graphdefs

In your example, however, the two graphs do not mutually entail each other:
only Graph 1 entails Graph 2, not the other way round. In every possible
world in which <r/1> has a <p/1> with a <p/2> of "123" and a <p/3> of "456",
both Graph 1 and Graph 2 are true; however, only Graph 2 is true in the
possible worlds in which <r/1> has a <p/1> with a <p/2> of "123" and a
distinct <p/1> with a <p/3> of "456". (To make this less abstract: replace
<r/1> with Alice, <p/1> with hasChild, <p/2> with hasFirstName and <p/3>
with hasMiddleName. For the first graph to be true, Alice needs at least one
child that has both names; for the second graph to be true, it is enough if
one of her children has that first name and one has that middle name.)


>
>
> >
> > - In Sesame Graphs triples are added with one or several contexts. Such a
> > context is not defined in RDF semantics or in the abstract syntax. In
> > Sesame a Graph is a collection of Statements, where a Statement is not
> > the same as a Triple in RDF.
>
> The currently official RDF specification dates back to 1998 with a minor
> revision in 2004 [1]. The definition of named graphs is undergoing
> specification in the course of the work on RDF 1.1 at the W3C until 2013
> [2]. Whether it is technically represented as quadruples (Sesame) or as in
> the proposal for the abstract RDF model (as in Clerezza) is merely an
> implementation detail, and also still under discussion. The Sesame approach
> implements essentially the named graph specification of SPARQL 1.1 [3] (the
> only one that currently officially exists) and has the advantage of
> offering more efficient implementations, and especially of being very
> convenient to the user (e.g. give me triples matching a certain pattern and
> occurring either in graph1 or in graph2).
>
> [1] http://www.w3.org/TR/REC-rdf-syntax/
> [2] http://www.w3.org/TR/rdf11-concepts/#section-dataset
> [3] http://www.w3.org/TR/sparql11-query/#rdfDataset

A dataset is indeed very similar to the Clerezza concept of a TcProvider.
An RDF (and SPARQL) dataset is a collection of graphs, so it seems
consistent to model it that way in Java, rather than having a graph object
which corresponds to an RDF dataset and a parameter named 'context' which
corresponds to what is a graph in the RDF 1.1 and SPARQL specs.


>
> >
> > - Value Factory: In Sesame a value-factory is tied to the Graph. In ZZ
> > triples can be added to any graph and need not be created via a method
> > specific to that graph (it is left to the implementation to
> > transparently do the optimization for nodes that originate from its
> > backend).
>
> Sesame follows a classical factory pattern from object oriented design
> here to allow the backend to choose which implementation it wants to return
> to give the best results. Without a factory, you will always have
> additional maintenance overhead for converting into the correct
> implementation, e.g. when adding a triple to a graph backed by a database.
> With a factory, the implementation will immediately give you a
> database-backed version of the triple.

It is true that some optimizations are possible if a graph accepts only
nodes that were created by its associated factory. In practice, however,
with the Clerezza approach all nodes the client gets from accessing a graph
can be backend-optimized objects; a mapping to the native objects is only
needed for nodes that the application got from a graph originating from
another source, or created in the client itself (with a straightforward
'new'). As this mapping is only needed as long as the client holds a
reference to that object, the backend should keep only a weak reference to
the node and can forget the mapping as soon as the node becomes eligible for
garbage collection. So we are talking about adding a few bytes per object
that would be in memory anyway.
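
A minimal sketch of that idea using Java's WeakHashMap, whose keys are only
weakly referenced (class and method names are mine, not from Clerezza):

    import java.util.Map;
    import java.util.WeakHashMap;

    // Hedged sketch: mapping foreign node objects to backend-internal ids.
    // WeakHashMap holds its keys weakly, so as soon as the client no longer
    // references a node, the entry can be garbage collected and the backend
    // forgets the mapping, as described above.
    final class NodeMapping<N> {
        private final Map<N, Long> foreignNodeToId = new WeakHashMap<>();
        private long nextId = 1;

        synchronized long internalId(N node) {
            return foreignNodeToId.computeIfAbsent(node, n -> nextId++);
        }
    }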


>
> >
> > - Ids for BNode. In ZZ BNodes are just what they are according to the
> > specs: anonymous resources. They are not Java serializable objects, so a
> > client can only reference a BNode as long as the object is alive. This
> > allows implementations to remove obsolete triples/duplicate bnodes when
> > nobody holds a reference to that bnode. In Sesame BNodes have an ID and
> > can be reconstructed with an ID. This means that an implementation
> > doesn't know how long a bnode is referenced. When a duplicate is detected
> > it should internally keep all the aliases of the node, as it doesn't know
> > for sure that clients will not reference this bnode by a specific id it
> > was once exposed with.
>
> The semantics of BNodes are an issue of open debate and even dispute
> until today. In practice, it is often a disadvantage to not expose an ID,
> and this is why both Sesame and Jena do it, and most serialization formats
> also do it.

I can't think of a generic way to serialize graphs without having bnode
ids in the serialization syntax. Also, for backends, I think having node ids
is typically a reasonable design choice. I only think that it's bad for an
RDF API to expose such an id.

> Actually I had some troubles with Clerezza in Stanbol for exactly this
> reason. The case that does not work easily here is incremental updates of
> graphs between two systems involving blank nodes. In the specification,
> this case is forbidden (blank nodes are always distinct). In practice, it
> is very useful to still be able to do it.

You are misusing bnode ids as identifiers, which is exactly why an API
should not expose them. We have a well-working solution for identifiable
nodes: named nodes. If you want to use bnodes, what should be used is an
algorithm such as RDFSync:
http://data.semanticweb.org/pdfs/iswc-aswc/2007/ISWC2007_RT_Tummarello(1).pdf

In which case you might find the ability to compare small graphs for
equality quite useful. But again, you can also name your nodes; that is what
(internationalized) resource identifiers are for.

> And actually, since this is also very common practice in logics
> (so-called Skolemization) the RDF specification takes this into account and
> explicitly acknowledges it [4]:
>
> "Blank node identifiers are local identifiers that are used in some
> concrete RDF syntaxes or RDF store implementations. They are always locally
> scoped to the file or RDF store, and are not persistent or portable
> identifiers for blank nodes…."
>
> [4] http://www.w3.org/TR/rdf11-concepts/#dfn-blank-node

Exactly, but when a bnode identifier is exposed to the outside world the
implementation can no longer skolemize, as people will still use the old
bnode id as an identifier.

>
> >
> > - Namespaces: what are they doing in the core of the Sesame API? There
> > is no such thing in RDF.
>
> There is in RDF 1.1: http://www.w3.org/TR/rdf11-concepts/#vocabularies
>
> Again, this is a point where people, after working with RDF, discovered it
> would be extremely useful to be able to use abbreviated ways of writing
> URIs or IRIs, so they included it in the software systems. The fact that
> both Sesame and Jena do this is proof of it. And the fact that they take
> this up in RDF 1.1 also.

I wouldn't say they pick it up in RDF 1.1: "The term “namespace” on its own
does not have a well-defined meaning in the context of RDF". People
serializing IRIs often want to abbreviate them; for that we have the CURIE
spec. While it's certainly good to have Java utilities that implement this
spec, as well as application servers providing centralized namespace
management, there is no reason to integrate such support into an API that's
supposed to model RDF.
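
For illustration, such a utility can be written in a few lines on top of the
core. A sketch along these lines (hypothetical names, not an existing
Clerezza class):

    import java.util.Map;

    // Hedged sketch: expanding a CURIE such as "foaf:name" against a prefix
    // map. This lives on top of the RDF core, not inside it.
    final class CurieExpander {
        private final Map<String, String> prefixes;

        CurieExpander(Map<String, String> prefixes) {
            this.prefixes = prefixes;
        }

        String expand(String curie) {
            int colon = curie.indexOf(':');
            if (colon < 0)
                throw new IllegalArgumentException("not a CURIE: " + curie);
            String namespace = prefixes.get(curie.substring(0, colon));
            if (namespace == null)
                throw new IllegalArgumentException("unknown prefix: " + curie);
            return namespace + curie.substring(colon + 1);
        }
    }

    // Usage: new CurieExpander(Map.of("foaf", "http://xmlns.com/foaf/0.1/"))
    //            .expand("foaf:name") yields "http://xmlns.com/foaf/0.1/name"
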
>
> > Also the Sesame URI class, which (probably) represents
> > what the RDF spec describes as "URI Reference", has methods to split it
> > into namespace (not using the Namespace class here) and local name.
>
> RDF 1.1 no longer uses the term URI reference, it speaks of IRI. I do not
> find the fact that Sesame uses "URI" as interface name very fortunate,
> however, but mainly because it sometimes clashes with the existing Java URI
> class. The namespace handling methods are merely convenience methods.

The old RDF spec anticipated the introduction of IRIs; the criticism is not
about the naming but about mixing the IRIs with the short form supported by
some serialization formats (some don't support it, and others may offer full
support for CURIEs rather than just the namespacing described by the Sesame
API).

>
> >
> > - Literals: The ZZ API differentiates between typed and plain literals.
> > The Sesame API has one literal datatype with some utility methods to
> > access its value for common XSD datatypes.
>
> If I look here I see many different literal datatypes. Sesame just hides
> them using the factory pattern:
>
> http://www.openrdf.org/doc/sesame2/api/org/openrdf/model/impl/package-summary.html
>
> > The ZZ approach of having a literal factory
> > to convert Java types to literals is more flexible and can be extended
> > to new types.
>
> Which is (strictly speaking) not really foreseen in the RDF
> specification. But I agree that it can be convenient … ;-)

It is foreseen. The possibility of datatype URIs other than the built-in one
and the ones for XSD is acknowledged in
http://www.w3.org/TR/rdf-concepts/#section-Literal-Value which says: "There
may be other, implementation dependent, mechanisms by which URIs refer to
datatypes."
>
...
> >
> > All in all I think the ZZ core is not only closer to the spec
>
> … depending on which version of the spec you are talking about - 1998 or
> evolving 2013 spec?
>
> > it is also
> > easier to implement a Graph representation of data. The implementor of a
> > Graph need not care about splitting URIs or providing utility methods
> > to get values from a literal or
> >
> Which are in most cases very straightforward to implement, and in the few
> cases where they are NOT easy, they are actually needed (e.g. for mapping
> database types to Java types).

That should still be possible with the literal factory; this should be
improved with CLEREZZA-423.

...
> The Sesame SAIL API adds negligible overhead, but considerable additional
> functionality. It is a plugin mechanism that allows adding additional
> layers in between where you want them (e.g. for implementing a native
> SPARQL query, for adding a reasoner, etc.).

Separating SPI from API always implies an overhead. The zz approach is to
have a minimal API that's very close to the spec and to provide utility
classes (like the mentioned AbstractGraph) that an implementor can choose to
use for convenience, or not.
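
A self-contained sketch of that split, with illustrative names: the
implementor supplies only iteration over triples, and conveniences come from
an optional base class they may opt into.

    import java.util.*;
    import java.util.function.Predicate;

    // Hedged sketch of the "minimal API plus optional utilities" design.
    record SimpleTriple(String subject, String predicate, String object) {}

    abstract class GraphBaseSketch implements Iterable<SimpleTriple> {
        // A convenience the implementor gets for free (in Clerezza,
        // equals/hashCode via AbstractGraph play a similar role).
        public List<SimpleTriple> filter(Predicate<SimpleTriple> pattern) {
            List<SimpleTriple> out = new ArrayList<>();
            for (SimpleTriple t : this) if (pattern.test(t)) out.add(t);
            return out;
        }
    }

    // A complete read-only backend needs nothing more than an iterator.
    final class SensorGraph extends GraphBaseSketch {
        private final List<SimpleTriple> triples = List.of(
                new SimpleTriple("urn:sensor:1", "urn:prop:temperature", "21.5"));

        @Override
        public Iterator<SimpleTriple> iterator() { return triples.iterator(); }
    }
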
>
> >
> >> - a custom SPARQL implementation that allows a very efficient native
> >> evaluation; at the same time, Sesame provides me an abstract syntax tree
> >> and completely frees me from all the nasty parsing stuff and implements
> >> the latest SPARQL specification for queries, updates and federation
> >> extensions without me having to actually care
> >>
> > I already mentioned that zz should improve here: SPARQL fastlane to allow
> > backend optimization. The abstract syntax tree you describe however is
> > implemented in zz as well, for the latest released SPARQL spec (i.e. not
> > yet for SPARQL 1.1).
>
> So browsers should also not support HTML5? Last time I checked it was
> also still a working draft (http://www.w3.org/TR/html5/) … ;-)

Why? Did I vote against your SPARQL 1.1 patch for Clerezza? ;)
>
...
> You would typically implement the abstract class RDFParserBase in Sesame,
> leaving you only the two parse() methods and the getRDFFormat() to
> override, but giving you the option to also implement optional features
> like namespace support. Not so complex, I have written several Sesame
> parsers already.

There is namespace support in the Clerezza platform, but it is not tied to
the RDF API.
...

> > No, Literals, Triples, Resources and others are just interfaces as well.
> > A BNode is indeed a class (just an empty subclass of Object); the same
> > goes for URIRefs. The reason for this is that we couldn't find a use case
> > where providing different implementations would provide benefits, while
> > it would provide great potential for misuse (e.g. I take my custom object
> > that implements BNode or UriRef, add it to a triple store and expect to
> > get my object back when querying).
>
> No one who uses some sort of storage should ever expect to get exactly the
> same object back.

We agree on that.

> The benefit of leaving BNode and UriRef an interface is to leave the
> implementation open to others. Just because YOU did not find a use case
> doesn't mean such a use case doesn't exist for others.

A framework, and every API, by nature constrains the applications using and
implementing it. Being as generic and flexible as possible generally doesn't
lead to the best API.

> For efficient storage and querying, as well as caching, there is indeed a
> definite benefit in being able to e.g. associate identifiers with BNodes.

Nothing is hindering you from returning instances of your own BNode subclass
with everything in it your backend needs; this is what non-trivial
implementations do. You must, however, also support bnodes from other
sources, i.e. any instance of BNode. In practice this needs some mapping,
but only as long as the objects are referenced by the client anyway (see
above about weak references).
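
A sketch of that pattern (the BNode class here is a stand-in for the API
class, which has identity semantics and no public id):

    // Stand-in for the API class: identity only, no exposed id.
    class BNode {}

    // A backend's own subclass can carry whatever it needs internally.
    final class StoreBNode extends BNode {
        final long internalId;   // never exposed through the RDF API

        StoreBNode(long internalId) {
            this.internalId = internalId;
        }
    }

    // In the backend: if (node instanceof StoreBNode s) use s.internalId
    // directly; any other BNode instance goes through the weak-reference
    // mapping sketched earlier.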

UriRefs are indeed not designed for a backend to return objects that only
reference the full IRI in the backend. This is a design choice where
simplicity was rated higher than this optimization potential; if somebody
wants to use data: URIs for large data chunks rather than literals, this
could indeed be limiting.

> You are thinking too much in-memory.

I'm thinking Java API: a static view with classes close to the spec, and a
dynamic view which is about objects in memory. The latter takes into account
that the data may be large and stored outside the modelled scope, and thus
adds no overhead that grows with the size of the stored data (only an
overhead linear in the amount of memory that would be used anyway).


> > Agreed for direct SPARQL. For transactions I don't know what you mean by
> > "already there"; yes, some triple stores support some sorts of
> > transactions, but requiring all backends to support this would be quite a
> > strong requirement and probably not what users want in many cases, see
> > http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/201009.mbox/%[email protected]%3E
> > for thoughts on this issue.
>
> Sesame has transactions and Jena has transactions, so other intelligent
> people might have thought it is a good idea.

In Sesame, transactions behave differently depending on which backend is
used. I think associating a transaction with a version/patch would be a
consistent approach; requiring this for all mutable graphs would be too
strong and unnecessary a requirement. The locking mechanism provided by
Clerezza is implemented by wrapping implementations of MGraph from providers
that do not already provide support for locking.
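
A sketch of such a locking wrapper, with the graph interface simplified to a
plain Collection (the real MGraph interface is richer):

    import java.util.Collection;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hedged sketch: wrapping an unsynchronized triple collection to provide
    // locking guarantees for backends without native locking support.
    final class LockingGraph<T> {
        private final Collection<T> wrapped;
        private final ReadWriteLock lock = new ReentrantReadWriteLock();

        LockingGraph(Collection<T> wrapped) {
            this.wrapped = wrapped;
        }

        // Clients may also take the lock themselves to span several operations.
        ReadWriteLock getLock() {
            return lock;
        }

        boolean add(T triple) {
            lock.writeLock().lock();
            try {
                return wrapped.add(triple);
            } finally {
                lock.writeLock().unlock();
            }
        }

        boolean contains(Object triple) {
            lock.readLock().lock();
            try {
                return wrapped.contains(triple);
            } finally {
                lock.readLock().unlock();
            }
        }
    }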

> I need them, because transactions are the natural level of granularity
> for our versioning and for our incremental reasoning. And when you build a
> web application on top of some sort of triple store where users interact
> with data, this is in fact a major requirement, because you want your
> system to be always in a consistent state even in case of multithreading,
> you want to take into account users simply clicking the back button in the
> middle of some workflow etc.

This seems to be based on a session-based web application model rather than
a RESTful one.
>
> If you would say exactly this sentence in a database forum ("probably not
> what users want in many cases") you would be completely disintegrated in a
> flame war. ;-)

Well, glad we're not in an enterprise DB forum here.

>
> >
> >
> >> - support for named graphs / contexts
> >>
> >
> > Named graphs are supported; not sure why you should attach contexts to
> > the individual triples.
>
> Technical simplification and one of the suggested implementations for
> named graphs (quadruples or quads).

It's not a good idea for an API to be designed with just one implementation
strategy in mind. And the main point is the naming, which contrasts with the
one in the RDF specs.
...
>
>
> What I am missing is a convincing argument that shows me how I as
> experienced RDF developer can benefit from Clerezza over e.g. Sesame, and I
> am sure that many other RDF developers will have the same hesitation. What
> you have argued is that Clerezza follows the 1998 RDF spec more closely,
> but the fact is that Sesame has already gone beyond that and tries to
> anticipate many features of the upcoming RDF 1.1 and SPARQL 1.1
> specifications (which are not really secrets for many years now).
> Furthermore, you have argued why Clerezza ALSO can do what Sesame (and Jena
> for that matter) already does, but you did not show me what it can do MORE.

Mainly not MORE but BETTER. And this allows implementations to do more (like
removing duplicate bnodes using RDF or OWL entailment without disturbing
clients). I think the Clerezza core might even be too big, and it might be
good to separate the API for the RDF 1.1 spec from the SPARQL spec, so that
a mobile client rendering FOAF profiles may need just the core without the
SPARQL stuff. A graph provider could choose to just provide graphs that can
be queried with a generic SPARQL engine, or to also provide a fastlane
(CLEREZZA-468) so that queries against graphs that are all provided by that
backend can be processed more efficiently.
....
>
> As it happens, the initial founder of RDF2Go (Max Völkel) is one of my
> friends. The main rationale behind the project was strictly to be able to
> exchange the underlying triple store more easily, because you might want to
> build applications first and think about the triple store later (and then
> of course get the bigger infrastructure). It is only used in research
> projects as far as I know, and not really under active development anymore.

Well, this is close to the rationale of the Clerezza API: code against the
RDF specs rather than against a particular triple store implementation.
>
> As to big infrastructure: the Sesame jar is 2.1 MB altogether.
> Smaller than Lucene or Xerces. So I would not consider it big
> infrastructure. Comes in many Maven modules, though (which is IMO a good
> thing).

The thing is that RDF is a small and nice model for many kinds of data.
Whether I want to expose the data from a few sensors as RDF, access a big
triple repository, use RDFa-enhanced web pages or compare small graphs from
different sources, the same data model is used. So let's expose this data
model in the simplest possible way in Java; it's not about reducing the size
of the app but about simplicity. I think that to add additional stuff to
this core, one needs to argue why it cannot be implemented on top of it.
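
To make "the simplest possible way" concrete, here is a sketch of how small
the static view can be (illustrative names, not the actual Clerezza
classes):

    // Hedged sketch: a few kinds of nodes and a triple. Utilities, SPARQL
    // support and storage all layer on top of this.
    interface RdfTerm {}
    record Iri(String unicodeString) implements RdfTerm {}
    record PlainLiteral(String lexicalForm, String languageTag) implements RdfTerm {}
    record TypedLiteral(String lexicalForm, Iri datatype) implements RdfTerm {}
    final class BlankNode implements RdfTerm {}  // identity only, no exposed id
    record Triple(RdfTerm subject, Iri predicate, RdfTerm object) {}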

Cheers,

Reto
