(NB multiple dev@ mailing lists)
On 13/11/12 12:13, Rupert Westenthaler wrote:
Hi all,
I would like to share some thoughts/comments and suggestions from my side:
Thanks - these are interesting to hear.
ResourceFactory: Clerezza is missing a Factory for RDF resources. I
would like to have such a Factory. The Factory should be obtainable
via the Graph - the Collection of Triples. IMO such a Factory is
required if all resource types (IRI, Bnode, Literal) are represented
by interfaces.
Yes - a factory is needed if the resource types are interfaces.
Whether they should be interfaces or fixed classes is an
interesting design point. I can see arguments both ways.
The argument for interfaces is presumably different implementations for
different storage layers (e.g. with hidden internal pointers related to
the storage). It is also a case of "it's the Java way".
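As a sketch of what such a factory could look like (all names here are made up, not an existing API): the Graph hands out the factory, so a store can supply its own term implementations.

```java
// Hypothetical sketch only - none of these names are an existing API.
interface Iri { String iriString(); }

interface RdfTermFactory {
    Iri createIri(String iri);
}

// The factory is obtainable via the Graph, as suggested above.
interface Graph { RdfTermFactory getTermFactory(); }

// Trivial in-memory implementation; the record gives value-based
// equality (two SimpleIri with the same IRI string are equal).
record SimpleIri(String iriString) implements Iri {}

class SimpleFactory implements RdfTermFactory {
    public Iri createIri(String iri) { return new SimpleIri(iri); }
}

public class FactoryDemo {
    public static void main(String[] args) {
        RdfTermFactory f = new SimpleFactory();
        System.out.println(f.createIri("http://example.org/x")
                .equals(f.createIri("http://example.org/x")));  // true
    }
}
```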
But two RDFTerms (resources) are equal by value - if they have the same
IRI they are equal and equality is tied to putting in Java collections.
I think the consequence is that a specific subsystem can't assume that
RDF terms passed to it must have come from that component. There's no
way to guarantee where a term was created, and some RDF terms created
elsewhere will be equal by value to the component's own.
[[RDF Term is the term invented in SPARQL to cover
IRIs/bnodes/literals because there wasn't one in RDF: "resource" is
used for "web resource", so it means either the thing being described,
not its name, and/or a general concept, not something specific to RDF]]
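To illustrate the value-equality point (hypothetical classes, standing in for terms produced by two different components): equals/hashCode are defined on the IRI string, not the implementing class, so terms from anywhere compare correctly in Java collections.

```java
// Hypothetical sketch: two independent Iri implementations whose
// equality is defined on the value (the IRI string), not the class.
interface Iri { String iriString(); }

final class StoreIri implements Iri {
    private final String iri;
    StoreIri(String iri) { this.iri = iri; }
    public String iriString() { return iri; }
    @Override public boolean equals(Object o) {
        return o instanceof Iri other && iri.equals(other.iriString());
    }
    @Override public int hashCode() { return iri.hashCode(); }
}

final class MemIri implements Iri {
    private final String iri;
    MemIri(String iri) { this.iri = iri; }
    public String iriString() { return iri; }
    @Override public boolean equals(Object o) {
        return o instanceof Iri other && iri.equals(other.iriString());
    }
    @Override public int hashCode() { return iri.hashCode(); }
}

public class EqualityDemo {
    public static void main(String[] args) {
        // Terms from different implementations land in the same set slot.
        var set = new java.util.HashSet<Iri>();
        set.add(new StoreIri("http://example.org/a"));
        set.add(new MemIri("http://example.org/a"));
        System.out.println(set.size());  // 1
    }
}
```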
Interesting design point for literals is value vs lexical form/datatype.
It is the value that matters (OK - should matter), whether it's written
"+1"^^xsd:integer or "01"^^xsd:byte. Does anyone have a use case
example where the derived datatype matters semantically?
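The value comparison for that numeric case is easy to sketch in plain Java (no RDF library involved):

```java
import java.math.BigInteger;

// Sketch of value-space vs lexical-space comparison for numeric
// literals: "+1" and "01" differ as lexical forms but denote the
// same integer value.
public class LiteralValueDemo {
    static BigInteger numericValue(String lexicalForm) {
        // xsd numeric lexical forms like "+1" and "01" parse to
        // the same mathematical value.
        return new BigInteger(lexicalForm);
    }

    public static void main(String[] args) {
        String a = "+1";  // as in "+1"^^xsd:integer
        String b = "01";  // as in "01"^^xsd:byte
        System.out.println(a.equals(b));                              // lexical: false
        System.out.println(numericValue(a).equals(numericValue(b)));  // value: true
    }
}
```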
BNodes: If BNode is an interface then any implementation is free to
internally use a "bnode id". One argument in favour of such ids (that
was not yet mentioned) is that they allow you to avoid in-memory
mappings for bnodes when wrapping a native implementation. In Clerezza
you currently need these bidirectional maps.
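A sketch of that idea (names made up): the wrapper's BNode simply carries the native store's id, so mapping back to the native layer is a field access rather than a bidirectional map lookup.

```java
// Hypothetical sketch: a wrapper bnode that carries the native
// store's id inside the term itself, avoiding BNode <-> id maps.
interface BNode {}

record NativeBNode(long nativeId) implements BNode {}

public class BNodeDemo {
    // Getting back to the native layer is just a field access.
    static long toNativeId(BNode b) {
        return ((NativeBNode) b).nativeId();
    }

    public static void main(String[] args) {
        BNode b = new NativeBNode(42L);
        System.out.println(toNativeId(b));  // 42
    }
}
```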
Triple, Quads: While for some use cases the Triple-in-Graph based API
(Quad := Triple t =
TripleStore#getGraph(context).filter(subject,predicate,object)) is
sufficient, this is no longer the case as soon as applications want to
work with a Graph that contains Quads with several contexts. So I
would vote for having support for Quads.
That is what an RDF dataset is supposed to be, but it's not completely
transparent - just working with the default graph is very much like
working with one graph.
The full-blown quads-in-graph would be N3-style formulae, where a graph
node can itself be a graph. Also called "graph literals".
At this point, they are not going to happen for RDF but if building an
API or component, I would at least put the hooks in for it to prepare
for a possible future.
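A Quad sketch with that hook left in (hypothetical names): the graph name is typed as a general RdfTerm rather than an IRI, so a graph-valued node could slot in later without changing the signature.

```java
// Hypothetical sketch of a Quad: a Triple plus a graph name.
interface RdfTerm {}
record Iri(String iriString) implements RdfTerm {}

record Quad(RdfTerm graphName, RdfTerm subject,
            RdfTerm predicate, RdfTerm object) {
    // graphName == null marks the default graph of the dataset;
    // typing it as RdfTerm leaves room for graph-valued nodes.
    boolean isDefaultGraph() { return graphName == null; }
}

public class QuadDemo {
    public static void main(String[] args) {
        Iri s = new Iri("http://example.org/s");
        Iri p = new Iri("http://example.org/p");
        Iri o = new Iri("http://example.org/o");
        Quad inNamed = new Quad(new Iri("http://example.org/g"), s, p, o);
        Quad inDefault = new Quad(null, s, p, o);
        System.out.println(inNamed.isDefaultGraph());    // false
        System.out.println(inDefault.isDefaultGraph());  // true
    }
}
```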
Dataset, Graph: From a user perspective, Dataset (how the TripleStore
looks at the triples) and Graph (how RDF looks at the triples) are not
so different. Because of that I would like to have a single domain
object fitting both. The API should focus on the Graph aspects (as
Clerezza does) while still allowing efficient implementations that do
not load all triples into memory (e.g. using closeable iterators).
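A sketch of the closeable-iterator style of filter (hypothetical names; null acts as a wildcard, and a plain list stands in for a real store that would stream matches):

```java
import java.util.*;

// Hypothetical sketch: filter(..) returns a closeable iterator so a
// store-backed Graph can stream results and release resources after.
interface RdfTerm {}
record Iri(String iri) implements RdfTerm {}
record Triple(RdfTerm s, RdfTerm p, RdfTerm o) {}

interface ClosableIterator<T> extends Iterator<T>, AutoCloseable {
    @Override void close();  // narrowed: no checked exception
}

class MemGraph {
    private final List<Triple> triples = new ArrayList<>();
    void add(Triple t) { triples.add(t); }

    // null acts as a wildcard for any position.
    ClosableIterator<Triple> filter(RdfTerm s, RdfTerm p, RdfTerm o) {
        Iterator<Triple> it = triples.stream()
            .filter(t -> (s == null || t.s().equals(s))
                      && (p == null || t.p().equals(p))
                      && (o == null || t.o().equals(o)))
            .iterator();
        return new ClosableIterator<Triple>() {
            public boolean hasNext() { return it.hasNext(); }
            public Triple next() { return it.next(); }
            public void close() { /* a real store releases resources here */ }
        };
    }
}

public class FilterDemo {
    public static void main(String[] args) {
        MemGraph g = new MemGraph();
        Iri s = new Iri("http://example.org/s");
        Iri p = new Iri("http://example.org/p");
        g.add(new Triple(s, p, new Iri("http://example.org/o1")));
        g.add(new Triple(s, p, new Iri("http://example.org/o2")));
        int n = 0;
        try (ClosableIterator<Triple> it = g.filter(s, null, null)) {
            while (it.hasNext()) { it.next(); n++; }
        }
        System.out.println(n);  // 2
    }
}
```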
Immutable Graphs: I had real problems getting this right and the
current Clerezza API does not help with that task (resulting in things
like read-only mutable graphs that are not Graphs, as they only provide
a read-only view on a Graph that might still be changed by other
means). I think read-only Graphs (like
Collections.unmodifiableCollection(..)) should be sufficient. IMHO the
use case of protecting a returned graph from modifications by the
caller of the method is much more prominent than truly immutable graphs.
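The read-only-view approach, sketched with java.util collections standing in for a real Graph: the wrapper rejects writes from the caller, but the underlying graph may still change via other means, so this is "unmodifiable", not "immutable".

```java
import java.util.*;

// Sketch: a read-only view in the Collections.unmodifiableCollection
// style. Names are hypothetical; a Set of strings stands in for triples.
class Graph {
    final Set<String> triples = new HashSet<>();
}

class ReadOnlyGraphs {
    static Set<String> readOnlyView(Graph g) {
        return Collections.unmodifiableSet(g.triples);
    }
}

public class ReadOnlyDemo {
    public static void main(String[] args) {
        Graph g = new Graph();
        Set<String> view = ReadOnlyGraphs.readOnlyView(g);
        g.triples.add("t1");             // the view still sees updates...
        System.out.println(view.size()); // 1
        try {
            view.add("t2");              // ...but callers cannot write through it
        } catch (UnsupportedOperationException e) {
            System.out.println("read-only");
        }
    }
}
```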
SPARQL: I would not deal with parsing SPARQL queries but rather
forward them as-is to the underlying implementation. If doing so, the
API would only need to bother with result sets. This would also avoid
the need to deal with "Datasets". This is not arguing against a
fallback (e.g. the trick Clerezza does by using the Jena SPARQL
implementation), but in practice efficient SPARQL execution can only
happen natively within the TripleStore. Trying to do otherwise will
only trick users into use cases that will not scale.
Agreed - and memory is a precious resource at scale. It's usually
better to give it to the data storage to avoid I/O. Too much overhead
in higher level APIs keeping state competes with the I/O caching.
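A pass-through sketch of that style (hypothetical names): the API commits only to the shape of the result rows and forwards the query string to the backing engine untouched.

```java
import java.util.*;

// Hypothetical sketch: SPARQL strings pass straight through; the API
// only defines the result-set shape, not query parsing or execution.
interface TripleStore {
    List<Map<String, String>> select(String sparqlQuery);
}

// Stand-in implementation; a real one would hand sparqlQuery to the
// native engine and adapt its result set.
class StubStore implements TripleStore {
    public List<Map<String, String>> select(String sparqlQuery) {
        return List.of(Map.of("s", "http://example.org/x"));
    }
}

public class SparqlDemo {
    public static void main(String[] args) {
        TripleStore store = new StubStore();
        var rows = store.select("SELECT ?s WHERE { ?s ?p ?o }");
        System.out.println(rows.size());  // 1
    }
}
```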
Andy
best
Rupert