Hi Sebastian,

On Fri, Aug 3, 2012 at 2:26 PM, Sebastian Schaffert <[email protected]> wrote:
> Hi Reto,
>
> comments inline. ;-)

same here ;)

> On 03.08.2012 at 12:54, Reto Bachmann-Gmür wrote:
>
>> I agree that clerezza should finally have the fastlane for sparql query,
>> the current approach makes only sense if a query is against graphs from
>> multiple backends. This is definitively a bottleneck now.
>>
>> What puzzles me is that you seem to think that the sesame api is cleaner
>> than the clerezza one. The clerezza api was introduced as none of the apis
>> available would model the rdf abstract syntax without tying additional
>> concepts and utility classes into the core. If you find anything that's
>> not clean I'd like to address this.
>
> Essentially, Sesame already provides many of the things Clerezza promises,
> and it is well proven, established, and highly performant. So I don't
> completely understand the rationale of reinventing the wheel with Clerezza.
>
> I also don't understand your argument of tying additional concepts and
> utility classes into the core. Clerezza implements utility classes in the
> same way as Sesame, so where is the difference? Additionally, the Sesame
> utility classes simply extend the Java core functionality (e.g. with
> iterators that can throw checked exceptions).

Looking at http://www.openrdf.org/doc/sesame2/api/ I see 1188 classes, many of them related to implementation-specific aspects and transport, while the Clerezza core API (http://incubator.apache.org/clerezza/mvn-site/rdf.core/apidocs/index.html) contains 125 classes. These are the classes in the jar that an API client or implementor depends on. What I mean by separating utility classes is mainly the separate resource-centric API provided by RDF utils (http://incubator.apache.org/clerezza/mvn-site/utils/apidocs/index.html). I think this is the main difference to the Jena API, where the resource-oriented and the triple-oriented approach are provided by the same API.
The classes in org.openrdf.model are indeed similar to the ones in org.apache.clerezza.rdf.core, so let me argue why I think the zz variants are better:

- Identity criteria for graphs: the zz API defines clear rules for when equals must return true, and these rules are distinct for mutable and for immutable graphs; the hashCode method for graphs is defined accordingly. The Sesame API doesn't define when equals should return true; in Sesame it seems that an instance is equal only to itself, which doesn't take into account what RDF semantics say about graph identity.
- Contexts: in Sesame, triples are added to a graph with one or several contexts. Such a context is not defined in the RDF semantics or in the abstract syntax. In Sesame a Graph is a collection of Statements, where a Statement is not the same as a Triple in RDF.
- Value factory: in Sesame a value factory is tied to the graph. In zz, triples can be added to any graph and need not be created via a method specific to that graph (it is left to the implementation to transparently do the optimization for nodes that originate from its backend).
- IDs for BNodes: in zz, BNodes are just what they are according to the specs: anonymous resources. They are not Java-serializable objects, so a client can only reference a BNode as long as the object is alive. This allows an implementation to remove obsolete triples and duplicate bnodes when nobody holds a reference to them. In Sesame, BNodes have an ID and can be reconstructed from an ID. This means an implementation doesn't know how long a bnode is referenced: when a duplicate is detected, it must internally keep all the aliases of the node, as it cannot be sure that clients will never reference the bnode by a specific ID it was once exposed with.
- Namespaces: what are they doing in the core of the Sesame API? There is no such thing in RDF.
Also, the Sesame URI class, which (probably) represents what the RDF spec describes as a "URI reference", has methods to split it into a namespace (not using the Namespace class here) and a local name.

- Literals: the zz API differentiates between typed and plain literals. The Sesame API has one literal datatype with some utility methods to access its value for common XSD datatypes. The zz approach of having a literal factory to convert Java types to literals is more flexible and can be extended to new types.
- Statement: a Sesame Statement has a context, but this context is irrelevant for the identity criterion defined by equals and hashCode. This doesn't make a clean impression on me: either the contexts are relevant, and then adding two statements with different contexts to a Set should give a Set of size two, or they aren't, in which case they should disappear from the API.

All in all I think the zz core is not only closer to the spec, it also makes it easier to implement a Graph representation of data. The implementor of a Graph need not care about splitting URIs or providing utility methods to get values from a literal.

> One aspect I like about the Sesame API is its completely modular structure
> at several levels. This allows me to easily and cleanly add functionality
> as needed, e.g.:
> - a custom triple store like the one I described before; you can e.g.
> easily provide a Jena TDB backend for Sesame (see
> http://sjadapter.sourceforge.net/)

In zz a custom triple store or a gateway just provides graph implementations. The Sesame solution you're referring to implements a separate SPI (Sail), which adds a level of complexity and is also a bit less performant, as in zz there is (potentially) nothing between your implementation and the client; you can use zz utility classes like AbstractMGraph, but you don't have to.
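The identity rules argued for above can be sketched in plain Java. This is an illustrative model only, not the actual Clerezza classes: the Triple and ImmutableGraph types here are simplified stand-ins, and nodes are plain strings.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Simplified stand-in for a triple with content-based identity.
final class Triple {
    final String subject, predicate, object;
    Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
    @Override public boolean equals(Object other) {
        if (!(other instanceof Triple)) return false;
        Triple t = (Triple) other;
        return subject.equals(t.subject) && predicate.equals(t.predicate)
                && object.equals(t.object);
    }
    @Override public int hashCode() { return Objects.hash(subject, predicate, object); }
}

// An immutable graph is equal to another graph iff both contain the same
// triples, which is the content-based identity argued for above. (A mutable
// graph, by contrast, would keep the default identity: equal only to itself.)
final class ImmutableGraph {
    private final Set<Triple> triples;
    ImmutableGraph(Set<Triple> triples) { this.triples = new HashSet<>(triples); }
    @Override public boolean equals(Object other) {
        return other instanceof ImmutableGraph
                && triples.equals(((ImmutableGraph) other).triples);
    }
    @Override public int hashCode() { return triples.hashCode(); }
}

public class GraphIdentitySketch {
    public static void main(String[] args) {
        Set<Triple> a = new HashSet<>();
        a.add(new Triple("ex:s", "ex:p", "ex:o"));
        Set<Triple> b = new HashSet<>();
        b.add(new Triple("ex:s", "ex:p", "ex:o"));
        // Two independently built immutable graphs with the same content are equal:
        System.out.println(new ImmutableGraph(a).equals(new ImmutableGraph(b)));
    }
}
```

This only covers ground graphs; the equality rules for graphs containing blank nodes are more subtle, but the point stands that the criterion is defined rather than left to object identity.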
> - a custom SPARQL implementation that allows a very efficient native
> evaluation; at the same time, Sesame provides me an abstract syntax tree
> and completely frees me from all the nasty parsing stuff and implements the
> latest SPARQL specification for queries, updates and federation extensions
> without me having to actually care

I already mentioned that zz should improve here: a SPARQL fastlane to allow backend optimization. The abstract syntax tree you describe, however, is implemented in zz as well, for the latest released SPARQL spec (i.e. not yet for SPARQL 1.1).

> - a very clean Java SPI based approach for registering RDF parsers and
> serializers that can operate natively on the respective triple store
> implementations

So comparing org.openrdf.rio with org.apache.clerezza.rdf.core.serializedform — or is there a separate SPI package? The Sesame RDFParser interface seems much more complex than zz's ParsingProvider. Clerezza supports registering parsers and serializers for any media type (which is identified just by its media type, without introducing an RDF-Format class), both using OSGi and the META-INF/services approach for non-OSGi environments. Parsers and serializers have to work with data from any backend; they can however be optimized for data from a particular backend.

> - easily wrap filters around triple stores or iterators
>
> You can easily see the modularity by looking at the many Maven artifacts
> the project is composed of. Essentially, if I don't need a functionality I
> can simply leave it out, and if I need it, adding it is a no-brainer.

The minimum Clerezza jar you need is 240K; this contains all you need to access and query graphs. It also contains the infrastructure for serializing and parsing, but you have to add the jar for the formats you need (just adding the jar to the classpath, or loading the bundle when using OSGi, is enough).
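The media-type-keyed registration described above can be sketched as follows. The ParserProvider interface and the hand-built registry here are hypothetical stand-ins, not the actual Clerezza SPI; in Clerezza the lookup would be wired up via OSGi or META-INF/services rather than explicit register() calls.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser SPI: formats are identified by plain media-type
// strings, with no dedicated RDF-Format class in between.
interface ParserProvider {
    void parse(InputStream in, String mediaType);
}

// A registry keyed by media type; adding support for a new format means
// registering one more provider, nothing else.
class ParserRegistry {
    private final Map<String, ParserProvider> providers = new HashMap<>();
    void register(String mediaType, ParserProvider p) { providers.put(mediaType, p); }
    ParserProvider get(String mediaType) {
        ParserProvider p = providers.get(mediaType);
        if (p == null) throw new IllegalArgumentException("no parser for " + mediaType);
        return p;
    }
}

public class MediaTypeRegistrySketch {
    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // A dummy provider that only reports which media type it was asked for:
        registry.register("text/turtle", (in, mt) -> System.out.println("parsing " + mt));
        registry.get("text/turtle").parse(null, "text/turtle");
    }
}
```

In a META-INF/services setup the providers would be discovered with java.util.ServiceLoader instead of being registered programmatically, but the media-type-keyed dispatch stays the same.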
> In addition, the Sesame data model is completely based on lightweight
> interfaces instead of wrapper objects. This makes it very easy to provide
> efficient implementations and is IMHO very clean. In contrast, Clerezza
> provides its complete own RDF model based on your own version of a triple,
> your own version of a node, your own version of a URI, …

No, Literals, Triples, Resources and others are just interfaces as well. A BNode is indeed a class (just an empty subclass of Object), and the same goes for UriRefs. The reason for this is that we couldn't find a use case where providing different implementations would bring benefits, while it would provide great potential for misuse (e.g. I create my own object that implements BNode or UriRef, add it to a triple store, and expect to get my object back when querying).

> What I am missing from Clerezza:
> - a lightweight data model that does not require additional instances in
> main memory

Yes, bnodes and urirefs have to be in memory as long as they are used by a client. For BNodes I think this brings a significant advantage: what I described above about the backend knowing when redundancy can be removed without risk.

> - a good reuse of functionality that is already there, e.g. direct SPARQL,
> transactions, …

Agreed for direct SPARQL. For transactions I don't know what you mean by "already there"; yes, some triple stores support some sort of transactions, but requiring all backends to support this would be quite a strong requirement and probably not what users want in many cases, see http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/201009.mbox/%[email protected]%3E for thoughts on this issue.

> - support for named graphs / contexts

Named graphs are supported. I'm not sure why you would attach contexts to the individual triples, and I'm missing a clear description of this in the Sesame API. As naming graphs goes beyond the core RDF specs, the names come into play in org.apache.clerezza.rdf.core.access.
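The point about blank nodes can be illustrated with a minimal model. The BNode class below is a stand-alone sketch mirroring the idea of an "empty subclass of Object", not the actual Clerezza class.

```java
// A blank node carries no ID: its identity is plain Java object identity.
// A client can only refer to it while it holds the reference, so a backend
// is free to clean up duplicates once no live reference remains.
final class BNode { }

public class BNodeSketch {
    public static void main(String[] args) {
        BNode b1 = new BNode();
        BNode b2 = new BNode();
        System.out.println(b1.equals(b1)); // the same object is itself
        System.out.println(b1.equals(b2)); // two distinct anonymous resources
    }
}
```

Contrast this with an ID-bearing BNode: any string ID ever exposed could in principle be used to reconstruct the node later, so the store can never safely forget an alias.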
> - support for SPARQL Update

+1

> - new TripleImpl(…) is a Java anti-pattern. If something is called …Impl
> it should not be instantiated with new.

Well, this is just a utility; it was called TripleImpl rather than SimpleTriple as other implementations are typically backend-provided. The API doesn't mandate using it: any implementation of Triple will do.

> I have been working with RDF APIs for about 10 years now in various
> programming languages (even in Prolog, Haskell and Python). And my
> conclusion at the moment is that Sesame by far offers the most convenient
> API for a developer. But I am of course open to switching in case I get
> convincing arguments. ;-)

Not sure how convincing you found them, or if the missing SPARQL fastlane is a blocker for you.

>> It seems that what you are using of sesame is mainly the spi/api and not
>> the actual triple store. This is definitively something clerezza should
>> have a good offer for.
>
> So convince me why it is BETTER than the established project ;-)

Well, before Clerezza there were other attempts to provide a thinner backend-agnostic layer, like RDF2Go. This seems to confirm that others, too, think that the APIs provided by Sesame or Jena aren't that thin RDF API one can use and implement without endorsing a big infrastructure or a large set of concepts one doesn't necessarily want to deal with.

>> When I last looked at it sesame was not ready to be used in apache
>> projects, not sure if license issues are the cause of it not being
>> available in maven central.
>
> Sesame is under a BSD license, so it should be compatible with Apache
> projects:
>
> http://www.openrdf.org/download.jsp

Is this true for the dependencies it requires as well?

> The main issue might be that it is not yet completely OSGi compatible (at
> least not on a per-component level).

The zz Sesame backend isn't modularized but integrates Sesame as one large bundle; we had to create this bundle ourselves.
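That the API contracts only on the Triple interface, with TripleImpl being merely one convenience implementation, can be sketched as follows. The interface and class names are simplified stand-ins for the Clerezza ones (nodes are plain strings here), and BackendTriple is a hypothetical backend-specific implementation.

```java
import java.util.HashSet;
import java.util.Set;

// The API only requires this interface; nothing forces clients to use
// any particular implementation class.
interface Triple {
    String getSubject();
    String getPredicate();
    String getObject();
}

// Convenience implementation, analogous in spirit to TripleImpl.
final class TripleImpl implements Triple {
    private final String s, p, o;
    TripleImpl(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    public String getSubject() { return s; }
    public String getPredicate() { return p; }
    public String getObject() { return o; }
}

// A hypothetical backend-provided implementation, equally acceptable
// anywhere a Triple is expected.
final class BackendTriple implements Triple {
    public String getSubject() { return "ex:s"; }
    public String getPredicate() { return "ex:p"; }
    public String getObject() { return "ex:o"; }
}

public class AnyTripleSketch {
    // Code written against the interface accepts both implementations.
    static void add(Set<Triple> graph, Triple t) { graph.add(t); }

    public static void main(String[] args) {
        Set<Triple> graph = new HashSet<>();
        add(graph, new TripleImpl("ex:s", "ex:p", "ex:o"));
        add(graph, new BackendTriple());
        System.out.println(graph.size());
    }
}
```

(The sketch deliberately defines no equals on the triple classes, so the two instances stay distinct in the set; a real store would deduplicate on content.)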
> If it is indeed not compatible with Apache projects, it should be easy to
> contact the developers and simply ask them whether this can be changed.

Got no reply here: http://sourceforge.net/mailarchive/forum.php?thread_name=4B71ABD4.4080904%40apache.org&forum_name=sesame-general

Cheers,
Reto
