Heads-up for a change to blank nodes produced by RIOT when parsing (but not when using RDF/XML).

tl;dr

Parsing data with vast numbers of blank nodes in a single file now scales better.

Appearance of N-Triples and N-Quads output changes slightly.

Fully compatible with existing data.

https://issues.apache.org/jira/browse/JENA-352

Details:

The blank node allocator has to ensure that two uses of the same label always generate the same blank node, even if the label is used in the first line of a file and again in the last line.

To do this, RIOT was keeping a map of label to allocated node. At scale, this fails because the map grows with every distinct label and uses up memory (although you do need a lot of blank node labels for it to become serious).

The new policy uses a seed value per parser run, which is combined with each string label to produce a globally unique id. There is also an LRU cache of 1000 slots to do map-like sharing and avoid excessive calls to the MD5 digest engine. Typically, a blank node label is used within a short section of the file (think blank nodes as subjects, or blank nodes in structured values and lists).
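As a rough illustration of the caching idea, here is a minimal sketch of a fixed-size LRU cache in front of a label-to-id function. The class name LabelCache and its methods are hypothetical and are not RIOT's actual code; the only details taken from the description above are the LRU policy and the slot count.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not the class RIOT uses.
final class LabelCache<V> {
    private final Map<String, V> cache;

    LabelCache(final int slots) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Drop the least recently used entry once the cache is over capacity.
                return size() > slots;
            }
        };
    }

    // Return the cached value for the label, or compute and remember it.
    V get(String label, Function<String, V> allocator) {
        V value = cache.get(label);
        if (value == null) {
            value = allocator.apply(label);
            cache.put(label, value);
        }
        return value;
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        LabelCache<String> cache = new LabelCache<>(1000);
        // Stand-in allocator; the real one would hash seed + label.
        System.out.println(cache.get("b0", label -> "id-for-" + label));
    }
}
```

Eviction is safe here only because the allocator is deterministic for a given parser run: a label that falls out of the cache is simply recomputed and yields the same id.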

The seed is a random UUID (122 bits of randomness). The label is combined with the seed by converting to UTF-8 bytes and using MD5 to give a 128-bit hashed value, which is assumed to be globally unique. Using MD5 makes the id fixed length, which is convenient. As we do not require an unattackable policy, MD5 is acceptable.
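A minimal sketch of the seed-plus-label hashing, using only the JDK. The class name SeededLabelHash is hypothetical, and the exact way seed and label bytes are combined (seed string bytes first, then label bytes) is an assumption made for illustration; RIOT's implementation may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical illustration only; not RIOT's allocator.
final class SeededLabelHash {
    private final byte[] seedBytes;

    // One instance per parser run, each with its own random seed.
    SeededLabelHash(UUID seed) {
        this.seedBytes = seed.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Same seed and label always give the same 32-hex-digit id.
    String idFor(String label) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(seedBytes); // assumption: seed bytes are fed first
            md5.update(label.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        SeededLabelHash run = new SeededLabelHash(UUID.randomUUID());
        System.out.println("_:B" + run.idFor("b0"));
    }
}
```

Two parser runs get different seeds, so the same label in two files maps to different ids, while repeated uses within one run agree without any map being kept.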

This change is observable - the format of blank nodes printed in N-Triples and N-Quads changes slightly. N-Triples and N-Quads print bNodes using the internal label (so work at arbitrary scale, and can even be used to restore blank nodes as described below).

The old allocator used a java.rmi.server.UID, which has : and - characters in it. These were encoded as Xhh, an X followed by two hex digits (X3A and X2D).
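To make the Xhh encoding concrete, here is a small sketch that turns an internal label into a printed one. The class name XhhLabelEncoder is hypothetical, and the rule that every character other than an ASCII letter or digit gets encoded is an assumption; the text above only states that : and - were encoded.

```java
// Hypothetical illustration only; not the encoder RIOT uses.
final class XhhLabelEncoder {

    // Prefix "B", then copy ASCII letters and digits and encode the rest as Xhh.
    // Assumes the label contains only ASCII characters.
    static String encode(String internalLabel) {
        StringBuilder out = new StringBuilder("B");
        for (char ch : internalLabel.toCharArray()) {
            boolean plain = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z');
            if (plain) {
                out.append(ch);
            } else {
                out.append('X').append(String.format("%02X", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An old-style internal label containing ':' and '-'.
        System.out.println("_:" + encode("-5bbaf4a1:13dbc7e7182:-7fff"));
    }
}
```

A label made only of hex digits passes through unchanged apart from the "B" prefix, which is why new-style labels contain no Xhh sequences.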

The new format is slightly shorter, and does not have Xhh encoded characters.

Old:
_:BX2D5bbaf4a1X3A13dbc7e7182X3AX2D7fff

New:
_:B70db88eb40afc13d2ab37d161e36392e

Printed labels start "B", a letter, to keep them compatible with pre RDF 1.1 parsers. Blank node labels can begin with a digit in RDF 1.1. _:1 is a legal bnode label in RDF 1.1.

This change does not invalidate any existing data (nothing should depend on the format of blank nodes, only on the uniqueness of ids). Specifically, all existing persistently stored data is still valid. It'll print old style.

Parsing speed should not be affected.

Restoring blank nodes:

Blank nodes in NT and NQ dumps can be restored by rewriting the NT/NQ blank nodes _:Blabel as <_:label>, a pseudo URI scheme that tells RIOT (and SPARQL) to use the given label. Use with care. Remember to remove the 'B'.
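A naive sketch of that rewrite for one line of an NT/NQ dump. The class name BNodeRestore is hypothetical. It is a plain regex substitution, so it would also rewrite anything that looks like _:B... inside a literal or IRI, and it makes no attempt to decode Xhh sequences in old-style labels; treat it as a starting point, not a safe tool.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; a naive text rewrite, not a real parser.
final class BNodeRestore {
    // Matches a printed blank node: "_:B" followed by the label characters.
    private static final Pattern PRINTED = Pattern.compile("_:B([A-Za-z0-9]+)");

    // Rewrite _:Blabel as <_:label>, dropping the leading 'B'.
    static String rewriteLine(String line) {
        return PRINTED.matcher(line).replaceAll("<_:$1>");
    }

    public static void main(String[] args) {
        String line = "_:B70db88eb40afc13d2ab37d161e36392e <http://example/p> \"x\" .";
        System.out.println(rewriteLine(line));
    }
}
```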

To restore old behaviour:

If you think anything odd has changed, you can check by restoring the old behaviour. In class 'SyntaxLabels', in the second static function, replace:

LabelToNode.createScopeByDocumentHash()

with

LabelToNode.createScopeByDocument()

        Andy
