On 11 April 2015 at 01:28, Andy Seaborne <[email protected]> wrote:
> On 10/04/15 13:21, Peter Ansell wrote:
>>
>> There is no guarantee that between JVM instances, in the simple
>> implementations, the internal blank node identifiers parsed from a
>> single document will be the same.
>
> In fact, they must be different if there is a way to compare them.
> "simple" can't store things across JVM runs - it might be used in
> conjunction with a system that can though.
The discussion around the exact semantics of the test harness, which is
currently called "simple" and gives the impression that it should be used
as the basis for actual implementations, is distracting from the core API.
I suggest that we rename the module to commons-rdf-tester and thereby
remove any notion that its particular semantics are meant to be reused.
Then we can focus on removing those particular semantics from the core API
contracts and making them as general as possible, while leaving open the
option for implementations to map BlankNodes as necessary, as long as they
do it consistently.

> Making the internal identifier different every time is a way to be clear
> about that.
>
> If "simple" is limited to one JVM run, no external contact, then a weak
> salt is OK. But "simple" must note this assumption/limitation. For me,
> wanting something to support distributed use, these assumptions are a
> barrier.
>
> What I don't understand is why in simple it isn't done the simple way
> (:-) - uses globally seeded UUID.

I am not sure what you mean by globally seeded, and why it relates to
distributed use. The current system uses a salt, which is different to a
seed: a seed is the initial value for a sequence, while a salt is merged
with the actual identifier to give it context. If there were a single
global salt per JVM, what would stop identifiers from different RDF
documents creating the same BlankNode internally, and hence overlapping
incorrectly? On the other hand, if there were a seed, how would it be
incremented and mapped back to the particular identifier without using a
Map to store the mapping? The current test harness does not require a Map,
and hence its design could be used for streaming parsers in constant
memory (the first sketch below illustrates this).

> RDF says there is a set of bnodes; it's separate from IRIs and from
> literals.

It also assumes infinite memory for permanently storing the identity of
blank nodes to make sure they never overlap between documents, so it needs
to be reduced to a finite memory concept to be implementable. In this
case, relying on one of the UUID schemes seems to be the simplest way to
do that.

> So choose a notional one-to-one map between UUIDs and bnodes. The JVM
> provides a good seed; UUID.randomUUID.

I don't understand why UUID.randomUUID would remove the need for any
mappings, given that we do not want to require a physical Map object to be
stored in memory for parsers to operate at scale.

> We should encourage safe use across implementations and basing around
> that global correspondence reduces the number of concepts that need to
> be explained.
>
> Does that work for Sesame?

Not really. If you are promoting universally unique BlankNodes, then how
do you map identifiers to BlankNode objects during the parse of a
document?

> Putting in the remapping in the test suite/reference is adding
> complexity. I think that a clear, clean implementation does not need to
> add in those concepts.

One of the requirements for me is the constant memory, streaming parser
case across several documents containing the same physical identifier,
with each document being parsed as an internally consistent set whose
BlankNode objects are never equal to BlankNode objects coming from other
documents.
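To make the salt approach concrete, here is a rough sketch (the class and
method names are hypothetical, not a proposal for the API) of how an
internal reference can be derived purely from a per-document salt and the
local label, with no Map in sight:

    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    // Rough sketch only: derive a stable internal reference for a blank
    // node from a per-document salt plus the label used in that document.
    final class SaltedBlankNodeSketch {

        // One fresh salt per parsed document (or per factory instance).
        private final UUID documentSalt = UUID.randomUUID();

        // The same label within the same document always derives the same
        // reference, without keeping a Map of labels to BlankNode objects.
        String internalReference(String localLabel) {
            String salted = documentSalt + ":" + localLabel;
            return UUID.nameUUIDFromBytes(
                    salted.getBytes(StandardCharsets.UTF_8)).toString();
        }
    }

Because the reference is a pure function of (salt, label), a streaming
parser stays constant memory, and parsing the same document twice, with
two different salts, produces two disjoint sets of BlankNodes.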
By going with a UUID.randomUUID call for each and every blank node (I am
still not sure what you mean by seed, so that may not be what you mean),
there needs to be a Map stored in each parser to locate the first
BlankNode object created for each identifier coming out of a physical
document. That may be simple to understand, but requiring it is beyond
what Jena and Sesame require at their scales, and both of them are
reasonably constant memory and streaming currently, so to require a Map
would be a backwards step (the second sketch below shows the extra state
involved). Rather than providing a simple partial use case, I would think
that the test harness, which is still called "simple" at this point,
should exercise the boundaries of the full contract.

>> In addition once blank node identifiers reach concrete syntaxes the
>> identifier is opaque so when it reaches another parser, even if it is
>> parsed into the same JVM, it will not be parsed in as an equivalent
>> blank node. Even if the same RDFTermFactory instance is used, there is
>> a one to one mapping for the original document and a second separate
>> one to one mapping for the second document.
>
> Agreed - it's a feature of the RDF syntax.
>
>> Until Stian proposed opening up the simple implementation, there were
>> only two sources for the salt. I don't quite understand why it is
>> lossy and somehow inaccurate. Both of the salts are short and unique
>> and there may be a finite chance of overlap but they are both
>> generated using the same UUID scheme to minimise that small chance of
>> overlap.
>>
>> Using an external UUID just moves the issue, and removes the design
>> to fix the scheme and salt sources. I am not sure why it is
>> inaccurate. Maybe an example of how a hypothetical overlap may occur
>> would be useful to explain that comment.
>
> Firstly, if DEFAULT_SALT is a UUID.random, then internal identifiers
> will be different each run. This is good - it stops hidden accidental
> ordering creeping in.

I don't mind changing both the DEFAULT_SALT and the factory based salt to
UUID.random based values. There doesn't need to be any change to the way
the method works to support that, and it would benefit distributed systems
(while not having any effect on cases where the data was serialised out to
a physical document/database before being parsed back in).

> Secondly, it removes the need to explain the remapping and why adding a
> triple to a graph results in it not being in the graph. A graph is a set
> of triples, all you can and have to rely on is the contract for triples.
> add(t) then not contains(t) is quite strange.

RDF isn't simple, due solely to its use of BlankNodes. IRIs and Literals
are very easy to conceptualise and it is clear that their nature does not
change between Graphs. BlankNodes are hard to interpret in general, other
than to say that they are a set that isn't one of the easy to
conceptualise sets.

> In using "simple" as a light weight container, I'd like to be able to
> get out what I put in, whether they are sesame, jena or simple triples.

I disagree that all Triple and RDFTerm objects should be immutable. In
particular, if a Triple from Jena is added to a Sesame database, none of
the Jena annotations are important, or necessary to keep, at that stage.
In addition, some Sesame specific annotations may need to be attached if
the Triple is sent to a Sesame datastore/Graph, which would require the
object to change.
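Going back to the Map point above, this second sketch (again hypothetical
names, not a proposed API) shows the per-document state a
UUID.randomUUID-per-node parser has to carry so that a repeated label gets
back the same BlankNode:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    // Rough sketch only: the per-document lookup that a randomUUID-per-node
    // parser needs, which grows with the number of distinct labels and so
    // breaks the constant memory, streaming case.
    final class RandomUuidBlankNodeSketch {

        private final Map<String, UUID> seen = new HashMap<>();

        UUID referenceFor(String localLabel) {
            // First occurrence mints a random UUID; later occurrences must
            // look it up, since a second randomUUID() call would not be
            // equal to the first.
            return seen.computeIfAbsent(localLabel,
                    label -> UUID.randomUUID());
        }
    }

With the salted derivation in the first sketch that lookup disappears,
because the same (salt, label) pair always derives the same reference.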
The way to guarantee that future Triples within a single JVM instance can
be added consistently to the same Graph, with the same BlankNode in the
relevant position, is either to preserve the identifier and rely on it
being globally unique, which is your preferred method, to preserve the
actual BlankNode object (not practical for a streaming parser), or to
consistently map them to a new BlankNode object with a new identifier,
which is my preferred method. If they are globally unique and fixed, then
the Sesame database would need to preserve that particular value in a
dataset somewhere to locate it in future and give it back using the same
identifier. In the current method, where .contains(t) may return false
immediately after .add(t) due to the mapping semantics, Sesame would not
need to store the original value, as long as it knew how to map the given
value to its internal set in the scope of the current query.

We are also running into many issues with the confusion about the "simple"
implementation being something that people would reuse in the wild, hence
I would like it to be renamed, to reduce its weight in discussions to that
of a proof of concept that is not intended for wide use as an interchange
or as a basis for any interoperability.

In the long term we can't make any assumptions about the origin of
objects, except for the very specific contract that we are working on for
the interfaces in the commons-rdf-api module. All of the other
abstractions have to be implementable based on that general contract.

>> I am not sure what you mean by bit slicing the constructor
>> parameters.
>
> It's a minor point but rather than relying on forming a string, and
> using the type 3/MD5 hash. Although strictly there are some other
> requirements for type 3.

If you could expand on how UUID.randomUUID would be used to replace the
concatenation it would be very useful for clarifying this. If all you are
referring to is the limited address/hashCode space of the existing method,
then I am fine with replacing that with a UUID and concatenating, but I am
not sure what you are considering replacing it with in the absence of a
code example.

Thanks,

Peter
