On 11 April 2015 at 01:28, Andy Seaborne <[email protected]> wrote:
> On 10/04/15 13:21, Peter Ansell wrote:
>>
>> There is no guarantee that between JVM instances, in the simple
>> implementations, the internal blank node identifiers parsed from a
>> single document will be the same.
>
> In fact, they must be different if there is a way to compare them.
> "simple" can't store things across JVM runs - it might be used in
> conjunction with a system that can though.
The discussion around the exact semantics of the test harness, which is
currently called "simple" and gives the impression that it should be used
as the basis for actual implementations, is distracting from the core API.
I suggest that we rename the module to commons-rdf-tester and thereby
remove any notion that its particular semantics are meant to be reused.
Then we can focus on removing those particular semantics from the core API
contracts and making them as general as possible, while leaving open the
option for implementations to map BlankNodes as necessary, as long as they
do it consistently.

> Making the internal identifier different every time is a way to be clear
> about that.
>
> If "simple" is limited to one JVM run, no external contact, then a weak
> salt is OK. But "simple" must note this assumption/limitation. For me,
> wanting something to support distributed use, these assumptions are a
> barrier.
>
> What I don't understand is why in simple it isn't done the simple way
> (:-) - uses globally seeded UUID.

I am not sure what you mean by globally seeded, and why it relates to
distributed use. The current system uses a salt, which is different to a
seed: a seed is the initial value for a sequence, while a salt is merged
with the actual identifier to give it context. If there were a single
global salt per JVM, what would stop identifiers from different RDF
documents creating the same BlankNode internally, and hence overlapping
incorrectly? On the other hand, if there were a seed, how would it be
incremented and mapped back to the particular identifier without using a
Map to store the mapping? The current test harness does not require a Map,
and hence its design could be used for streaming parsers in constant
memory (the first sketch below illustrates this).

> RDF says there is a set of bnodes; it's separate from IRIs and from
> literals.

It also assumes infinite memory for permanently storing the identity of
blank nodes to make sure they never overlap between documents, so it needs
to be reduced to a finite memory concept to be implementable. In this
case, relying on one of the UUID schemes seems to be the simplest way to
do that.

> So choose a notional one-to-one map between UUIDs and bnodes. The JVM
> provides a good seed; UUID.randomUUID.

I don't understand why UUID.randomUUID would remove the need for any
mappings, given that we do not want to require a physical Map object to be
stored in memory for parsers to operate at scale.

> We should encourage safe use across implementations and basing around
> that global correspondence reduces the number of concepts that need to
> be explained.
>
> Does that work for Sesame?

Not really. If you are promoting universally unique BlankNodes, then how
do you map identifiers to BlankNode objects during the parse of a
document?

> Putting in the remapping in the test suite/reference is adding
> complexity. I think that a clear, clean implementation does not need to
> add in those concepts.

One of the requirements for me is the constant memory, streaming parser
case across several documents containing the same physical identifier,
with each document being parsed as an internally consistent set whose
BlankNode objects are never equal to BlankNode objects coming from other
documents.
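To make the salt approach concrete, here is a rough sketch (the class and
method names are hypothetical, not a proposal for the API) of how an
internal reference can be derived purely from a per-document salt and the
local label, with no Map in sight:

    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    // Rough sketch only: derive a stable internal reference for a blank
    // node from a per-document salt plus the label used in that document.
    final class SaltedBlankNodeSketch {

        // One fresh salt per parsed document (or per factory instance).
        private final UUID documentSalt = UUID.randomUUID();

        // The same label within the same document always derives the same
        // reference, without keeping a Map of labels to BlankNode objects.
        String internalReference(String localLabel) {
            String salted = documentSalt + ":" + localLabel;
            return UUID.nameUUIDFromBytes(
                    salted.getBytes(StandardCharsets.UTF_8)).toString();
        }
    }

Because the reference is a pure function of (salt, label), a streaming
parser stays constant memory, and parsing the same document twice, with
two different salts, produces two disjoint sets of BlankNodes.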
By going with a UUID.randomUUID call for each and every blank node (I am
still not sure what you mean by seed, so that may not be what you mean),
there needs to be a Map stored in each parser to locate the first
BlankNode object created for each identifier coming out of a physical
document. That may be simple to understand, but requiring it is beyond
what Jena and Sesame require at their scales, and both of them are
reasonably constant memory and streaming currently, so to require a Map
would be a backwards step (the second sketch below shows the extra state
involved). Rather than providing a simple partial use case, I would think
that the test harness, which is still called "simple" at this point,
should exercise the boundaries of the full contract.

>> In addition once blank node identifiers reach concrete syntaxes the
>> identifier is opaque so when it reaches another parser, even if it is
>> parsed into the same JVM, it will not be parsed in as an equivalent
>> blank node. Even if the same RDFTermFactory instance is used, there is
>> a one to one mapping for the original document and a second separate
>> one to one mapping for the second document.
>
> Agreed - it's a feature of the RDF syntax.
>
>> Until Stian proposed opening up the simple implementation, there were
>> only two sources for the salt. I don't quite understand why it is
>> lossy and somehow inaccurate. Both of the salts are short and unique
>> and there may be a finite chance of overlap but they are both
>> generated using the same UUID scheme to minimise that small chance of
>> overlap.
>>
>> Using an external UUID just moves the issue, and removes the design
>> to fix the scheme and salt sources. I am not sure why it is
>> inaccurate. Maybe an example of how a hypothetical overlap may occur
>> would be useful to explain that comment.
>
> Firstly, if DEFAULT_SALT is a UUID.random, then internal identifiers
> will be different each run. This is good - it stops hidden accidental
> ordering creeping in.

I don't mind changing both the DEFAULT_SALT and the factory based salt to
UUID.random based values. There doesn't need to be any change to the way
the method works to support that, and it would benefit distributed systems
(while not having any effect on cases where the data was serialised out to
a physical document/database before being parsed back in).

> Secondly, it removes the need to explain the remapping and why adding a
> triple to a graph results in it not being in the graph. A graph is a set
> of triples, all you can and have to rely on is the contract for triples.
> add(t) then not contains(t) is quite strange.

RDF isn't simple, due solely to its use of BlankNodes. IRIs and Literals
are very easy to conceptualise and it is clear that their nature does not
change between Graphs. BlankNodes are hard to interpret in general, other
than to say that they are a set that isn't one of the easy to
conceptualise sets.

> In using "simple" as a light weight container, I'd like to be able to
> get out what I put in, whether they are sesame, jena or simple triples.

I disagree that all Triple and RDFTerm objects should be immutable. In
particular, if a Triple from Jena is added to a Sesame database, none of
the Jena annotations are important, or necessary to keep, at that stage.
In addition, some Sesame specific annotations may need to be attached if
the Triple is sent to a Sesame datastore/Graph, which would require the
object to change.
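Going back to the Map point above, this second sketch (again hypothetical
names, not a proposed API) shows the per-document state a
UUID.randomUUID-per-node parser has to carry so that a repeated label gets
back the same BlankNode:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    // Rough sketch only: the per-document lookup that a randomUUID-per-node
    // parser needs, which grows with the number of distinct labels and so
    // breaks the constant memory, streaming case.
    final class RandomUuidBlankNodeSketch {

        private final Map<String, UUID> seen = new HashMap<>();

        UUID referenceFor(String localLabel) {
            // First occurrence mints a random UUID; later occurrences must
            // look it up, since a second randomUUID() call would not be
            // equal to the first.
            return seen.computeIfAbsent(localLabel,
                    label -> UUID.randomUUID());
        }
    }

With the salted derivation in the first sketch that lookup disappears,
because the same (salt, label) pair always derives the same reference.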
The way to guarantee that future Triples within a single JVM instance can
be added consistently to the same Graph, with the same BlankNode in the
relevant position, is either to preserve the identifier and rely on it
being globally unique, which is your preferred method, to preserve the
actual BlankNode object (not practical for a streaming parser), or to
consistently map them to a new BlankNode object with a new identifier,
which is my preferred method. If they are globally unique and fixed, then
the Sesame database would need to preserve that particular value in a
dataset somewhere to locate it in future and give it back using the same
identifier. In the current method, where .contains(t) may return false
immediately after .add(t) due to the mapping semantics, Sesame would not
need to store the original value, as long as it knew how to map the given
value to its internal set in the scope of the current query.

We are also running into many issues with the confusion about the "simple"
implementation being something that people would reuse in the wild, hence
I would like it to be renamed, to reduce its weight in discussions to that
of a proof of concept that is not intended for wide use as an interchange
or as a basis for any interoperability.

In the long term we can't make any assumptions about the origin of
objects, except for the very specific contract that we are working on for
the interfaces in the commons-rdf-api module. All of the other
abstractions have to be implementable based on that general contract.

>> I am not sure what you mean by bit slicing the constructor
>> parameters.
>
> It's a minor point but rather than relying on forming a string, and
> using the type 3/MD5 hash. Although strictly there are some other
> requirements for type 3.

If you could expand on how UUID.randomUUID would be used to replace the
concatenation it would be very useful for clarifying this. If all you are
referring to is the limited address/hashCode space of the existing method,
then I am fine with replacing that with a UUID and concatenating, but I am
not sure what you are considering replacing it with in the absence of a
code example.

Thanks,

Peter
