Hi Sebastian,

On Fri, Aug 3, 2012 at 2:26 PM, Sebastian Schaffert <[email protected]> wrote:
> Hi Reto,
>
> comments inline. ;-)

same here ;)

> On 03.08.2012 at 12:54, Reto Bachmann-Gmür wrote:
>
>> I agree that clerezza should finally have the fastlane for sparql query,
>> the current approach makes only sense if a query is against graphs from
>> multiple backends. This is definitively a bottleneck now.
>>
>> What puzzles me is that you seem to think that the sesame api is cleaner
>> than the clerezza one. The clerezza api was introduced as none of the apis
>> available would model the rdf abstract syntax without tying additional
>> concepts and utility classes into the core. If you find anything that's
>> not clean I'd like to address this.
>
> Essentially, Sesame already provides many of the things Clerezza promises,
> and it is well proven, established, and highly performant. So I don't
> completely understand the rationale of reinventing the wheel with Clerezza.
>
> I also don't understand your argument of tying additional concepts and
> utility classes into the core. Clerezza implements utility classes in the
> same way as Sesame, so where is the difference? Additionally, the Sesame
> utility classes simply extend the Java core functionality (e.g. with
> iterators that can throw checked exceptions).

Looking at http://www.openrdf.org/doc/sesame2/api/ I see 1188 classes, many of them related to implementation-specific aspects and transport, while the Clerezza core API (http://incubator.apache.org/clerezza/mvn-site/rdf.core/apidocs/index.html) contains 125 classes. These are the classes in the jar that an API client or implementor depends on. What I mean by separating utility classes is mainly the separate resource-centric API provided by RDF utils (http://incubator.apache.org/clerezza/mvn-site/utils/apidocs/index.html). I think this is the main difference to the Jena API, where the resource-oriented and the triple-oriented approach are provided by the same API.
The classes in org.openrdf.model are indeed similar to the ones in org.apache.clerezza.rdf.core, so let me argue why I think the zz variants are better:

- Identity criteria for graphs: the zz API defines clear rules for when equals must return true, and these rules are distinct for mutable and for immutable graphs; the hashCode method for graphs is defined accordingly. The Sesame API doesn't define when equals should return true; in Sesame it seems that an instance is equal only to itself, which doesn't take into account what RDF semantics say about graph identity.
- Contexts: in Sesame, triples are added to a graph with one or several contexts. Such a context is not defined in the RDF semantics or in the abstract syntax. In Sesame a Graph is a collection of Statements, where a Statement is not the same as a Triple in RDF.
- Value factory: in Sesame a value factory is tied to the graph. In zz, triples can be added to any graph and need not be created via a method specific to that graph (it is left to the implementation to transparently do the optimization for nodes that originate from its backend).
- IDs for BNodes: in zz, BNodes are just what they are according to the specs: anonymous resources. They are not Java-serializable objects, so a client can only reference a BNode as long as the object is alive. This allows an implementation to remove obsolete triples and duplicate bnodes when nobody holds a reference to them. In Sesame, BNodes have an ID and can be reconstructed from an ID. This means an implementation doesn't know how long a bnode is referenced: when a duplicate is detected, it must internally keep all the aliases of the node, as it cannot be sure that clients will never reference the bnode by a specific ID it was once exposed with.
- Namespaces: what are they doing in the core of the Sesame API? There is no such thing in RDF.
Also, the Sesame URI class, which (probably) represents what the RDF spec describes as a "URI reference", has methods to split it into a namespace (not using the Namespace class here) and a local name.

- Literals: the zz API differentiates between typed and plain literals. The Sesame API has one literal datatype with some utility methods to access its value for common XSD datatypes. The zz approach of having a literal factory to convert Java types to literals is more flexible and can be extended to new types.
- Statement: a Sesame Statement has a context, but this context is irrelevant for the identity criterion defined by equals and hashCode. This doesn't make a clean impression on me: either the contexts are relevant, and then adding two statements with different contexts to a Set should give a Set of size two, or they aren't, in which case they should disappear from the API.

All in all I think the zz core is not only closer to the spec, it also makes it easier to implement a Graph representation of data. The implementor of a Graph need not care about splitting URIs or providing utility methods to get values from a literal.

> One aspect I like about the Sesame API is its completely modular structure
> at several levels. This allows me to easily and cleanly add functionality
> as needed, e.g.:
> - a custom triple store like the one I described before; you can e.g.
> easily provide a Jena TDB backend for Sesame (see
> http://sjadapter.sourceforge.net/)

In zz a custom triple store or a gateway just provides graph implementations. The Sesame solution you're referring to implements a separate SPI (Sail), which adds a level of complexity and is also a bit less performant, as in zz there is (potentially) nothing between your implementation and the client; you can use zz utility classes like AbstractMGraph, but you don't have to.
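The identity rules argued for above can be sketched in plain Java. This is an illustrative model only, not the actual Clerezza classes: the Triple and ImmutableGraph types here are simplified stand-ins, and nodes are plain strings.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Simplified stand-in for a triple with content-based identity.
final class Triple {
    final String subject, predicate, object;
    Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
    @Override public boolean equals(Object other) {
        if (!(other instanceof Triple)) return false;
        Triple t = (Triple) other;
        return subject.equals(t.subject) && predicate.equals(t.predicate)
                && object.equals(t.object);
    }
    @Override public int hashCode() { return Objects.hash(subject, predicate, object); }
}

// An immutable graph is equal to another graph iff both contain the same
// triples, which is the content-based identity argued for above. (A mutable
// graph, by contrast, would keep the default identity: equal only to itself.)
final class ImmutableGraph {
    private final Set<Triple> triples;
    ImmutableGraph(Set<Triple> triples) { this.triples = new HashSet<>(triples); }
    @Override public boolean equals(Object other) {
        return other instanceof ImmutableGraph
                && triples.equals(((ImmutableGraph) other).triples);
    }
    @Override public int hashCode() { return triples.hashCode(); }
}

public class GraphIdentitySketch {
    public static void main(String[] args) {
        Set<Triple> a = new HashSet<>();
        a.add(new Triple("ex:s", "ex:p", "ex:o"));
        Set<Triple> b = new HashSet<>();
        b.add(new Triple("ex:s", "ex:p", "ex:o"));
        // Two independently built immutable graphs with the same content are equal:
        System.out.println(new ImmutableGraph(a).equals(new ImmutableGraph(b)));
    }
}
```

This only covers ground graphs; the equality rules for graphs containing blank nodes are more subtle, but the point stands that the criterion is defined rather than left to object identity.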
> - a custom SPARQL implementation that allows a very efficient native
> evaluation; at the same time, Sesame provides me an abstract syntax tree
> and completely frees me from all the nasty parsing stuff and implements the
> latest SPARQL specification for queries, updates and federation extensions
> without me having to actually care

I already mentioned that zz should improve here: a SPARQL fastlane to allow backend optimization. The abstract syntax tree you describe, however, is implemented in zz as well, for the latest released SPARQL spec (i.e. not yet for SPARQL 1.1).

> - a very clean Java SPI based approach for registering RDF parsers and
> serializers that can operate natively on the respective triple store
> implementations

So comparing org.openrdf.rio with org.apache.clerezza.rdf.core.serializedform — or is there a separate SPI package? The Sesame RDFParser interface seems much more complex than zz's ParsingProvider. Clerezza supports registering parsers and serializers for any media type (which is identified just by its media type, without introducing an RDF-Format class), both using OSGi and the META-INF/services approach for non-OSGi environments. Parsers and serializers have to work with data from any backend; they can however be optimized for data from a particular backend.

> - easily wrap filters around triple stores or iterators
>
> You can easily see the modularity by looking at the many Maven artifacts
> the project is composed of. Essentially, if I don't need a functionality I
> can simply leave it out, and if I need it, adding it is a no-brainer.

The minimum Clerezza jar you need is 240K; this contains all you need to access and query graphs. It also contains the infrastructure for serializing and parsing, but you have to add the jar for the formats you need (just adding the jar to the classpath, or loading the bundle when using OSGi, is enough).
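The media-type-keyed registration described above can be sketched as follows. The ParserProvider interface and the hand-built registry here are hypothetical stand-ins, not the actual Clerezza SPI; in Clerezza the lookup would be wired up via OSGi or META-INF/services rather than explicit register() calls.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser SPI: formats are identified by plain media-type
// strings, with no dedicated RDF-Format class in between.
interface ParserProvider {
    void parse(InputStream in, String mediaType);
}

// A registry keyed by media type; adding support for a new format means
// registering one more provider, nothing else.
class ParserRegistry {
    private final Map<String, ParserProvider> providers = new HashMap<>();
    void register(String mediaType, ParserProvider p) { providers.put(mediaType, p); }
    ParserProvider get(String mediaType) {
        ParserProvider p = providers.get(mediaType);
        if (p == null) throw new IllegalArgumentException("no parser for " + mediaType);
        return p;
    }
}

public class MediaTypeRegistrySketch {
    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // A dummy provider that only reports which media type it was asked for:
        registry.register("text/turtle", (in, mt) -> System.out.println("parsing " + mt));
        registry.get("text/turtle").parse(null, "text/turtle");
    }
}
```

In a META-INF/services setup the providers would be discovered with java.util.ServiceLoader instead of being registered programmatically, but the media-type-keyed dispatch stays the same.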
> In addition, the Sesame data model is completely based on lightweight
> interfaces instead of wrapper objects. This makes it very easy to provide
> efficient implementations and is IMHO very clean. In contrast, Clerezza
> provides its complete own RDF model based on your own version of a triple,
> your own version of a node, your own version of a URI, …

No, Literals, Triples, Resources and others are just interfaces as well. A BNode is indeed a class (just an empty subclass of Object), and the same goes for UriRefs. The reason for this is that we couldn't find a use case where providing different implementations would bring benefits, while it would provide great potential for misuse (e.g. I create my own object that implements BNode or UriRef, add it to a triple store, and expect to get my object back when querying).

> What I am missing from Clerezza:
> - a lightweight data model that does not require additional instances in
> main memory

Yes, bnodes and urirefs have to be in memory as long as they are used by a client. For BNodes I think this brings a significant advantage: what I described above about the backend knowing when redundancy can be removed without risk.

> - a good reuse of functionality that is already there, e.g. direct SPARQL,
> transactions, …

Agreed for direct SPARQL. For transactions I don't know what you mean by "already there"; yes, some triple stores support some sort of transactions, but requiring all backends to support this would be quite a strong requirement and probably not what users want in many cases, see http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/201009.mbox/%[email protected]%3E for thoughts on this issue.

> - support for named graphs / contexts

Named graphs are supported. I'm not sure why you would attach contexts to the individual triples, and I'm missing a clear description of this in the Sesame API. As naming graphs goes beyond the core RDF specs, the names come into play in org.apache.clerezza.rdf.core.access.
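The point about blank nodes can be illustrated with a minimal model. The BNode class below is a stand-alone sketch mirroring the idea of an "empty subclass of Object", not the actual Clerezza class.

```java
// A blank node carries no ID: its identity is plain Java object identity.
// A client can only refer to it while it holds the reference, so a backend
// is free to clean up duplicates once no live reference remains.
final class BNode { }

public class BNodeSketch {
    public static void main(String[] args) {
        BNode b1 = new BNode();
        BNode b2 = new BNode();
        System.out.println(b1.equals(b1)); // the same object is itself
        System.out.println(b1.equals(b2)); // two distinct anonymous resources
    }
}
```

Contrast this with an ID-bearing BNode: any string ID ever exposed could in principle be used to reconstruct the node later, so the store can never safely forget an alias.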
> - support for SPARQL Update

+1

> - new TripleImpl(…) is a Java anti-pattern. If something is called …Impl
> it should not be instantiated with new.

Well, this is just a utility; it was called TripleImpl rather than SimpleTriple as other implementations are typically backend-provided. The API doesn't mandate using it: any implementation of Triple will do.

> I have been working with RDF APIs for about 10 years now in various
> programming languages (even in Prolog, Haskell and Python). And my
> conclusion at the moment is that Sesame by far offers the most convenient
> API for a developer. But I am of course open to switching in case I get
> convincing arguments. ;-)

Not sure how convincing you found them, or if the missing SPARQL fastlane is a blocker for you.

>> It seems that what you are using of sesame is mainly the spi/api and not
>> the actual triple store. This is definitively something clerezza should
>> have a good offer for.
>
> So convince me why it is BETTER than the established project ;-)

Well, before Clerezza there were other attempts to provide a thinner backend-agnostic layer, like RDF2Go. This seems to confirm that others, too, think that the APIs provided by Sesame or Jena aren't that thin RDF API one can use and implement without endorsing a big infrastructure or a large set of concepts one doesn't necessarily want to deal with.

>> When I last looked at it sesame was not ready to be used in apache
>> projects, not sure if license issues are the cause of it not being
>> available in maven central.
>
> Sesame is under a BSD license, so it should be compatible with Apache
> projects:
>
> http://www.openrdf.org/download.jsp

Is this true for the dependencies it requires as well?

> The main issue might be that it is not yet completely OSGi compatible (at
> least not on a per-component level).

The zz Sesame backend isn't modularized but integrates Sesame as one large bundle; we had to create this bundle ourselves.
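That the API contracts only on the Triple interface, with TripleImpl being merely one convenience implementation, can be sketched as follows. The interface and class names are simplified stand-ins for the Clerezza ones (nodes are plain strings here), and BackendTriple is a hypothetical backend-specific implementation.

```java
import java.util.HashSet;
import java.util.Set;

// The API only requires this interface; nothing forces clients to use
// any particular implementation class.
interface Triple {
    String getSubject();
    String getPredicate();
    String getObject();
}

// Convenience implementation, analogous in spirit to TripleImpl.
final class TripleImpl implements Triple {
    private final String s, p, o;
    TripleImpl(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    public String getSubject() { return s; }
    public String getPredicate() { return p; }
    public String getObject() { return o; }
}

// A hypothetical backend-provided implementation, equally acceptable
// anywhere a Triple is expected.
final class BackendTriple implements Triple {
    public String getSubject() { return "ex:s"; }
    public String getPredicate() { return "ex:p"; }
    public String getObject() { return "ex:o"; }
}

public class AnyTripleSketch {
    // Code written against the interface accepts both implementations.
    static void add(Set<Triple> graph, Triple t) { graph.add(t); }

    public static void main(String[] args) {
        Set<Triple> graph = new HashSet<>();
        add(graph, new TripleImpl("ex:s", "ex:p", "ex:o"));
        add(graph, new BackendTriple());
        System.out.println(graph.size());
    }
}
```

(The sketch deliberately defines no equals on the triple classes, so the two instances stay distinct in the set; a real store would deduplicate on content.)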
> If it is indeed not compatible with Apache projects, it should be easy to
> contact the developers and simply ask them whether this can be changed.

Got no reply here: http://sourceforge.net/mailarchive/forum.php?thread_name=4B71ABD4.4080904%40apache.org&forum_name=sesame-general

Cheers,
Reto
