Hi Sebastian

On Aug 7, 2012 5:39 PM, "Sebastian Schaffert" <[email protected]> wrote:

> Just a meta-comment: before, I was told that Clerezza is a kind of wrapper that abstracts away from the concrete triple store implementations (Jena and Sesame). What you explained to me now is a completely alternative triple store implementation. Which may have justifications, but also additional problems. Comments follow below…
The part of Clerezza I'm arguing in favour of here is the Java API modelling the RDF data model. It is agnostic about the actual storage of the RDF data. The API should work well for the 56 triples of the configuration of your Android app as well as for huge graphs stored in a triple store.

> Another meta-comment: I see this as an intellectual challenge where we both learn; it is not my intention to criticise Clerezza, I just take the Sesame position in this discussion because I know it quite well and have made very good experiences with it, so I need really convincing arguments to switch to a different API. Maybe the discussion with me can also help you convince others to use/develop for Clerezza.

I very much appreciate this discussion and the review you're giving. It has been a couple of years since the decision was taken to propose a new RDF API rather than using one of the existing ones (Jena, Sesame, RDF2Go), so it's good to review that decision now.

> Merely counting classes does not say much, especially since Sesame provides much more functionality. Many of the Sesame classes I see are actually from the abstract syntax tree of the SPARQL Query and Update parsers (and as we all know there are many different ways to implement a parser), from the HTTP Server functionality, and from the different serializers and parsers. If you can live without these, you will easily have a package as small as the Clerezza core.

I think it is a good thing to have a core artifact containing all the APIs: a single jar and a self-contained javadoc that you can compile both backends and clients against. This is more a design advantage than a matter of removing dead code.

> > What I mean by separating utility classes is mainly the separate resource centric api provided by RDF utils (http://incubator.apache.org/clerezza/mvn-site/utils/apidocs/index.html). I think this is mainly a difference to the Jena API where the resource oriented and the triple oriented approach are provided by the same API.

> This is probably a philosophical issue: have all functionality in one place vs. having things cleanly separated. Both have advantages and disadvantages. Personally, I like the Sesame value factory because it is a single place where to look for the suitable factory methods.

In the Jena API a resource returned in a triple iterator is directly an object similar to a GraphNode in Clerezza: something that is tied to the graph ('model' in Jena) and can thus provide methods to list its properties and their values. This is not the case for the Sesame API (at least not in org.openrdf.model). As for the value factory, it is not clear to me from the javadoc whether the triples added to a graph must consist exclusively of values created by the respective value factory.
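To make the two styles concrete, here is a minimal sketch against the Clerezza core and utils APIs (method names from memory, so double-check them against the javadoc): triples are added with a plain 'new', no factory involved, and a GraphNode provides the resource centric view:

import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.UriRef;
import org.apache.clerezza.rdf.core.impl.PlainLiteralImpl;
import org.apache.clerezza.rdf.core.impl.SimpleMGraph;
import org.apache.clerezza.rdf.core.impl.TripleImpl;
import org.apache.clerezza.rdf.utils.GraphNode;

public class TwoStyles {
    public static void main(String[] args) {
        MGraph graph = new SimpleMGraph();
        UriRef alice = new UriRef("http://example.org/alice");
        UriRef name = new UriRef("http://xmlns.com/foaf/0.1/name");

        // triple centric: an MGraph is just a java.util.Collection of
        // triples, nodes are created with a straightforward 'new'
        graph.add(new TripleImpl(alice, name, new PlainLiteralImpl("Alice")));

        // resource centric: a GraphNode ties a node to a graph, comparable
        // to a Resource in Jena
        GraphNode node = new GraphNode(alice, graph);
        System.out.println(node.getObjects(name).next());
    }
}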
> > - The zz api defines identity criteria for graphs. The Sesame API doesn't define when equals should return true; the zz API defines clear rules which are distinct for mutable and for immutable graphs. Similarly the hashCode method for graphs is defined. In Sesame it seems that an instance is equal only to itself. This doesn't take into account what RDF semantics say about graph identity.

> This is honestly a functionality I have never needed in 10 years. I see its use case in small in-memory graphs (like they are used in Stanbol), but for a multi-billion triple graph this is an irrelevant functionality.

Even your multi-billion-triple backed application will probably communicate with others, and that will typically happen with small graphs. The fact that graph isomorphism may be neither needed nor wanted in most cases is no argument against clearly defining in the API when and how triple collections are to be compared for equality.

> BTW, if you want to implement graph equivalence, you have to implement the bijection as specified in http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality. This is a very expensive operation anyways, especially when taking into account blank nodes.

That's correct and consistent with the equals definition of the Clerezza graph API. It is implemented by the utility AbstractGraph class so that implementors don't have to care about it.

> For example, the following two graphs could be considered equivalent (because the blank nodes are existentially qualified):
>
> Graph 1:
> <http://www.example.com/r/1> <http://www.example.com/p/1> _:a
> _:a <http://www.example.com/p/2> "123"
> _:a <http://www.example.com/p/3> "456"
>
> Graph 2:
> <http://www.example.com/r/1> <http://www.example.com/p/1> _:b
> <http://www.example.com/r/1> <http://www.example.com/p/1> _:c
> _:b <http://www.example.com/p/2> "123"
> _:c <http://www.example.com/p/3> "456"

No. Even if the two graphs mutually entailed each other or 'expressed the same content' (to use the wording of RDF semantics) they would still be distinct graphs, see: http://www.w3.org/TR/rdf-mt/#graphdefs

In your example, however, the two graphs do not mutually entail each other: Graph 1 entails Graph 2, but not the other way round. In every possible world in which <r/1> has a <p/1> with a <p/2> of "123" and a <p/3> of "456", both Graph 1 and Graph 2 are true; but in the possible worlds in which <r/1> has a <p/1> with a <p/2> of "123" and a distinct <p/1> with a <p/3> of "456", only Graph 2 is true. (To make things less abstract: replace <r/1> with Alice, <p/1> with hasChild, <p/2> with hasFirstName and <p/3> with hasMiddleName. For the first graph to be true, Alice needs to have at least one child which has both names; for the second graph to be true, it is enough if one of the children has that firstName and one has that middleName.)
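To make the defined contract concrete, a sketch (assuming, as I recall the API, that MGraph.getGraph() returns an immutable Graph snapshot):

import org.apache.clerezza.rdf.core.BNode;
import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.UriRef;
import org.apache.clerezza.rdf.core.impl.SimpleMGraph;
import org.apache.clerezza.rdf.core.impl.TripleImpl;

public class EqualsSemantics {
    public static void main(String[] args) {
        UriRef r1 = new UriRef("http://www.example.com/r/1");
        UriRef p1 = new UriRef("http://www.example.com/p/1");

        MGraph m1 = new SimpleMGraph();
        MGraph m2 = new SimpleMGraph();
        m1.add(new TripleImpl(r1, p1, new BNode()));
        m2.add(new TripleImpl(r1, p1, new BNode()));

        // mutable graphs are equal only to themselves
        System.out.println(m1.equals(m2)); // false

        // immutable graphs compare by isomorphism, the bijection being
        // implemented once in AbstractGraph
        System.out.println(m1.getGraph().equals(m2.getGraph())); // true
    }
}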
> > - In Sesame Graphs triples are added with one or several contexts. Such a context is not defined in RDF semantics or in the abstract syntax. In Sesame a Graph is a collection of Statements, where a Statement is not the same as a Triple in RDF.

> The currently official RDF specification dates back to 1998 with a minor revision in 2004 [1]. The definition of named graphs is undergoing specification in the course of the work on RDF 1.1 at the W3C until 2013 [2]. Whether it is technically represented as quadruples (Sesame) or as in the proposal for the abstract RDF model (as in Clerezza) is merely an implementation detail, and also still under discussion. The Sesame approach implements essentially the named graph specification of SPARQL 1.1 [3] (the only one that currently officially exists) and has the advantage of offering more efficient implementations, and especially of being very convenient to the user (e.g. give me triples matching a certain pattern and occurring either in graph1 or in graph2).
>
> [1] http://www.w3.org/TR/REC-rdf-syntax/
> [2] http://www.w3.org/TR/rdf11-concepts/#section-dataset
> [3] http://www.w3.org/TR/sparql11-query/#rdfDataset

A dataset is indeed very similar to the Clerezza concept of a TcProvider. An RDF (and SPARQL) dataset is a collection of graphs, so it seems consistent to model it that way in Java, rather than having a graph object which corresponds to an RDF dataset and a parameter named 'context' which corresponds to what is a graph in the RDF 1.1 and SPARQL specs.

> > - Value Factory: In Sesame a value-factory is tied to the Graph. In ZZ triples can be added to any graph and need not be created via a method specific to that graph (it is left to the implementation to transparently do the optimization for nodes that originate from its backend).

> Sesame follows a classical factory pattern from object oriented design here to allow the backend to choose which implementation it wants to return to give the best results. Without a factory, you will always have additional maintenance overhead for converting into the correct implementation, e.g. when adding a triple to a graph backed by a database. With a factory, the implementation will immediately give you a database-backed version of the triple.

It is true that some optimizations are possible if a graph accepts only nodes that were created by its associated factory. In practice, however, with the Clerezza approach all nodes the client gets from accessing a graph can be backend-optimized objects; a mapping to the backend's native objects is only needed for nodes that the application got from a graph originating from another source, or created itself (with a straightforward 'new'). As this mapping is only needed as long as the client holds a reference to such an object, the backend should keep only a weak reference to the node and can forget the mapping as soon as the node becomes eligible for garbage collection. So we are talking about adding a few bytes for objects that would be in memory anyway.
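A sketch of that weak-reference bookkeeping, with made-up names, just to illustrate the idea:

import java.util.WeakHashMap;

import org.apache.clerezza.rdf.core.BNode;

// Illustrative sketch only: a backend maps foreign BNode instances to its
// internal ids for as long as the client still references them.
class ForeignNodeMap {

    // WeakHashMap references its keys weakly: once the client drops the
    // BNode, the entry becomes eligible for garbage collection and the
    // backend forgets the alias.
    private final WeakHashMap<BNode, Long> foreignNodes =
            new WeakHashMap<BNode, Long>();

    private long nextId = 0;

    synchronized long internalId(BNode node) {
        Long id = foreignNodes.get(node);
        if (id == null) {
            id = nextId++;
            foreignNodes.put(node, id);
        }
        return id;
    }
}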
> > - Ids for BNode. In ZZ BNodes are just what they are according to the specs: anonymous resources. They are not Java-serializable objects, so a client can only reference a BNode as long as the object is alive. This allows implementations to remove obsolete triples/duplicate bnodes when nobody holds a reference to that bnode. In Sesame BNodes have an ID and can be reconstructed with an ID. This means that an implementation doesn't know how long a bnode is referenced. When a duplicate is detected it should internally keep all the aliases of the node, as it doesn't know for sure that clients will not reference this bnode by a specific ID it was once exposed with.

> The semantics of BNodes are an issue of open debate and even dispute until today. In practice, it is often a disadvantage to not expose an ID, and this is why both Sesame and Jena do it, and most serialization formats also do it.

I can't think of a generic way to serialize graphs without having bnode IDs in the serialization syntax. Also for backends I think having node IDs is typically a reasonable design choice. I only think that it's bad for an RDF API to expose such an ID.

> Actually I had some troubles with Clerezza in Stanbol for exactly this reason. The case that does not work easily here is incremental updates of graphs between two systems involving blank nodes. In the specification, this case is forbidden (blank nodes are always distinct). In practice, it is very useful to still be able to do it.

You are misusing the bnode ID as an identifier, which is exactly why an API should not expose it. We have a well-working solution for identifiable nodes: named nodes. What should be used if you want to use bnodes is an algorithm such as RDFSync: http://data.semanticweb.org/pdfs/iswc-aswc/2007/ISWC2007_RT_Tummarello(1).pdf In which case you might find the ability to compare small graphs for equality quite useful. But again, you can also name your nodes; that is what (internationalized) resource identifiers are for.

> And actually, since this is also very common practice in logics (so-called Skolemization) the RDF specification takes this into account and explicitly acknowledges it [4]:
>
> "Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes…."
>
> [4] http://www.w3.org/TR/rdf11-concepts/#dfn-blank-node

Exactly, but once a bnode identifier has been exposed to the outside world the implementation can no longer do skolemization, as people will still use the old id as an identifier.

> > - Namespaces: what are they doing in the core of the Sesame API, there is no such thing in RDF.

> There is in RDF 1.1: http://www.w3.org/TR/rdf11-concepts/#vocabularies
>
> Again, this is a point where people after working with RDF discovered it would be extremely useful to be able to use abbreviated ways of writing URIs or IRIs, so they included it in their software systems. The fact that both Sesame and Jena do this is proof of it. And the fact that they take this up in RDF 1.1 also.

I wouldn't say they pick it up in RDF 1.1: "The term 'namespace' on its own does not have a well-defined meaning in the context of RDF". People serializing IRIs often want to abbreviate them; for that we have the CURIE spec. While it's certainly good to have Java utilities that implement this spec, as well as application servers providing centralized namespace management, there is no reason to integrate such support into an API that's supposed to model RDF.
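Such a utility can live entirely on top of the core API; a hypothetical sketch (class and method names invented for illustration, no error handling for unknown prefixes):

import java.util.HashMap;
import java.util.Map;

import org.apache.clerezza.rdf.core.UriRef;

// Hypothetical utility: CURIE expansion on top of the core API rather
// than inside it.
class CurieExpander {

    private final Map<String, String> prefixes = new HashMap<String, String>();

    void registerPrefix(String prefix, String namespace) {
        prefixes.put(prefix, namespace);
    }

    // expands e.g. "foaf:name" to <http://xmlns.com/foaf/0.1/name>
    UriRef expand(String curie) {
        int colon = curie.indexOf(':');
        String namespace = prefixes.get(curie.substring(0, colon));
        return new UriRef(namespace + curie.substring(colon + 1));
    }
}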
> > Also the Sesame URI class, which (probably) represents what the RDF spec describes as "URI reference", has methods to split it into namespace (not using the Namespace class here) and local name.

> RDF 1.1 no longer uses the term URI reference, it speaks of IRI. I do not find the fact that Sesame uses "URI" as interface name very fortunate, however, but mainly because it sometimes clashes with the existing Java URI class. The namespace handling methods are merely convenience methods.

The old RDF spec anticipated the introduction of IRIs. The criticism is not about the naming but about mixing the IRIs with the short form, which is supported by some serialization formats (some don't support it, and others may offer full support for CURIEs rather than just the namespacing as described by the Sesame API).

> > - Literals: The ZZ API differentiates between typed and plain literals. The Sesame API has one literal datatype with some utility methods to access its value for common XSD datatypes.

> If I look here I see many different literal datatypes. Sesame just hides them using the factory pattern:
>
> http://www.openrdf.org/doc/sesame2/api/org/openrdf/model/impl/package-summary.html

> > The ZZ approach of having a literal factory to convert java types to literals is more flexible and can be extended to new types.

> Which is (strictly speaking) not really foreseen in the RDF specification. But I agree that it can be convenient … ;-)

It is foreseen. The possibility of datatype URIs other than the built-in one and the ones for XSD is acknowledged in http://www.w3.org/TR/rdf-concepts/#section-Literal-Value which says: "There may be other, implementation dependent, mechanisms by which URIs refer to datatypes."

> ...

> > All in all I think the ZZ core is not only closer to the spec

> … depending on which version of the spec you are talking about - 1998 or evolving 2013 spec?

> > it is also easier to implement a Graph representation of data. The implementor of a Graph need not care about splitting URIs or providing utility methods to get values from a literal.

> Which are in most cases very straightforward to implement and in the few cases where they are NOT easy, they are actually needed (e.g. for mapping database types to Java types).

That should still be possible with the literal factory; this should be improved with CLEREZZA-423.
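For illustration, roughly how the literal factory is used (method names as far as I recall them from the javadoc, so verify before relying on them):

import org.apache.clerezza.rdf.core.LiteralFactory;
import org.apache.clerezza.rdf.core.TypedLiteral;

public class LiteralExample {
    public static void main(String[] args) {
        LiteralFactory factory = LiteralFactory.getInstance();

        // Java value to typed literal (here an xsd integer type) and back
        TypedLiteral literal = factory.createTypedLiteral(Integer.valueOf(42));
        Integer value = factory.createObject(Integer.class, literal);
        System.out.println(literal.getDataType() + " -> " + value);
    }
}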
...

> The Sesame SAIL API adds negligible overhead, but considerable additional functionality. It is a plugin mechanism that allows adding additional layers inbetween where you want them (e.g. for implementing a native SPARQL query, for adding a reasoner, etc.).

Separating the SPI from the API always implies some overhead. The ZZ approach is to have a minimal API that's very close to the spec and to provide utility classes (like the mentioned AbstractGraph) that an implementor can choose to use for convenience, or not.

> > > - a custom SPARQL implementation that allows a very efficient native evaluation; at the same time, Sesame provides me an abstract syntax tree and completely frees me from all the nasty parsing stuff and implements the latest SPARQL specification for queries, updates and federation extensions without me having to actually care

> > I already mentioned that zz should improve here: a SPARQL fastlane to allow backend optimization. The abstract syntax tree you describe however is implemented in zz as well, for the latest released SPARQL spec (i.e. not yet for SPARQL 1.1).

> So browsers should also not support HTML5? Last time I checked it was also still a working draft (http://www.w3.org/TR/html5/) … ;-)

Why? Did I vote against your SPARQL 1.1 patch for Clerezza? ;)

> ...

> You would typically implement the abstract class RDFParserBase in Sesame, leaving you only the two parse() methods and the getRDFFormat() to override. But giving you the option to also implement optional features like namespace support. Not so complex, I have written several Sesame parsers already.

There is namespace support in the Clerezza platform, but this is not tied to the rdf-api.

> ...

> > No, Literals, Triples, Resources and others are just interfaces as well. A BNode is indeed a class (just an empty subclass of Object), the same goes for UriRef. The reason for this is that we couldn't find a use case where providing different implementations would provide benefits, while it would provide great potential for misuse (e.g. I take my custom object that implements BNode or UriRef, add it to a triple store and expect to get my object back when querying).

> Noone who uses some sort of storage should ever expect to get exactly the same object back.

We agree on that.

> The benefit of leaving BNode and UriRef an interface is to leave the implementation open to others. Just because YOU did not find a use case doesn't mean such a use case doesn't exist for others.

A framework, and by its nature every API, constrains the applications using and implementing it. Being as generic and flexible as possible generally doesn't lead to the best API.

> For efficient storage and querying, as well as caching there is indeed a definite benefit of being able to e.g. associate identifiers with BNodes.

Nothing is hindering you from returning instances of your own BNode subclass with everything in it your backend needs; this is what non-trivial implementations do. You must, however, also support bnodes from other sources, i.e. any instance of BNode. In practice this needs some mapping, but only for as long as the objects are referenced by the client anyway (see above about weak references). UriRefs are indeed not designed for a backend to return objects that merely point to the full IRI in the backend. This is a design choice where simplicity was rated higher than this optimization potential; if somebody wants to use data: URIs for large data chunks rather than literals, this could indeed be limiting.

> You are thinking too much in-memory.

I'm thinking Java API: a static view with classes close to the spec, and a dynamic view which is about objects in memory. The latter takes into account that the data may be large and stored outside the modelled scope, and thus adds no overhead that grows with the size of the stored data (but only an overhead linear in the amount of memory that would be used anyway).

> > Agreed for direct sparql. For transactions I don't know what you mean by "already there"; yes, some triple stores support some sorts of transactions, but requiring all backends to support this would be quite a strong requirement and probably not what users want in many cases, see http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/201009.mbox/%[email protected]%3E for thoughts on this issue.

> Sesame has transactions and Jena has transactions, so other intelligent people might have thought it is a good idea.

In Sesame transactions behave differently depending on which backend is used. I think associating a transaction with a version/patch would be a consistent approach; requiring this for all mutable graphs would be a too strong and unnecessary requirement. The locking mechanism provided by Clerezza is implemented by wrapping implementations of MGraph from providers that do not already provide support for locking.
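Roughly like this, as a simplified sketch (the real wrapper in Clerezza also guards iteration and, if I remember the interface correctly, exposes the lock to clients via LockableMGraph.getLock()):

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.Triple;

// Simplified sketch of wrapping a non-locking MGraph implementation.
class LockingMGraphWrapper {

    private final MGraph wrapped;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    LockingMGraphWrapper(MGraph wrapped) {
        this.wrapped = wrapped;
    }

    // writes take the exclusive write lock
    public boolean add(Triple triple) {
        lock.writeLock().lock();
        try {
            return wrapped.add(triple);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // reads can proceed concurrently under the shared read lock
    public boolean contains(Triple triple) {
        lock.readLock().lock();
        try {
            return wrapped.contains(triple);
        } finally {
            lock.readLock().unlock();
        }
    }
}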
> I need them, because transactions are the natural level of granularity for our versioning and for our incremental reasoning. And when you build a web application on top of some sort of triple store where users interact with data, this is in fact a major requirement, because you want your system to be always in a consistent state even in case of multithreading, you want to take into account users simply clicking the back button in the middle of some workflow etc.

This seems to be based on a session-based web application model rather than a RESTful web app.

> If you would say exactly this sentence in a database forum ("probably not what users want in many cases") you would be completely disintegrated in a flame war. ;-)

Well, glad we're not in an enterprisey DB forum here.

> > > - support for named graphs / contexts

> > Named graphs are supported; I'm not sure why you should have contexts on the individual triples.

> Technical simplification and one of the suggested implementations for named graphs (quadruples or quads).

It's not a good idea for an API to be designed with just one implementation strategy in mind. And the main point is that the naming contrasts with the one in the RDF specs.

...

> What I am missing is a convincing argument that shows me how I as an experienced RDF developer can benefit from Clerezza over e.g. Sesame, and I am sure that many other RDF developers will have the same hesitation. What you have argued is that Clerezza follows the 1998 RDF spec more closely, but the fact is that Sesame has already gone beyond that and tries to anticipate many features of the upcoming RDF 1.1 and SPARQL 1.1 specifications (which are not really secrets for many years now). Furthermore, you have argued why Clerezza ALSO can do what Sesame (and Jena for that matter) already does, but you did not show me what it can do MORE.

Mainly not MORE but BETTER. And this allows implementations to do more (like removing duplicate bnodes using RDF or OWL entailment, without disturbing clients). I think the Clerezza core might even be too big, and it might be good to separate the API for the RDF 1.1 spec from the one for the SPARQL spec, so that a mobile client rendering FOAF profiles may need just the core without the SPARQL stuff. Graph providers could choose to just provide graphs that can be queried with a generic SPARQL engine, or to also provide a fastlane (CLEREZZA-468) so that queries against graphs that are all provided by the same backend can be processed more efficiently.

....

> As it happens, the initial founder of RDF2Go (Max Völkel) is one of my friends. The main rationale behind the project was strictly to be able to exchange the underlying triple store more easily, because you might want to build applications first and think about the triple store later (and then of course get the bigger infrastructure). It is only used in research projects as far as I know, and not really under active development anymore.

Well, this is close to the rationale of the Clerezza API: code against the RDF specs rather than against a particular triple store implementation.

> As to big infrastructure: the Sesame jar is 2.1 MB big, altogether. Smaller than Lucene or Xerces. So I would not consider it big infrastructure. Comes in many Maven modules, though (which is IMO a good thing).

The thing is that RDF is a small and nice model for many kinds of data. Whether I want to expose the data from a few sensors as RDF, access a big triple repository, use RDFa-enhanced webpages or compare small graphs from different sources, the same data model is used. So let's expose this data model in the simplest possible way in Java; it's not about reducing the size of the app but about simplicity. I think to add additional stuff to this core one needs to argue why it cannot be implemented on top of it.

Cheers,
Reto
