RIOT, blank nodes and JENA-352

Andy Seaborne Sat, 30 Mar 2013 11:52:30 -0700

Heads-up for a change to the blank nodes produced by RIOT when parsing (but not when parsing RDF/XML).


tl;dr

Parsing data with vast numbers of blank nodes in a single file now scales better.


Appearance of N-Triples and N-Quads output changes slightly.

Fully compatible with existing data.

https://issues.apache.org/jira/browse/JENA-352

Details:

The blank node allocator has to ensure that two uses of the same label always generate the same blank node, even if the label is used in the first line of a file and again in the last line.

To do this, RIOT was keeping a map of label to allocated node. At scale, this fails because the map grows with the number of distinct labels and uses up memory (although you do need a lot of blank node labels for it to become serious).

The new policy is to use an allocator with a seed value per parser run, which is combined with any string label to produce a globally unique id. There is also an LRU cache of 1000 slots to do map-like sharing and avoid excessive calls to the MD5 digest engine. Typically, a blank node label is used in a short section of the file much of the time (think blank nodes as subjects, or blank nodes in structured values and lists).
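As an illustration of the caching idea only, here is a minimal sketch of a bounded LRU cache sitting in front of a label-to-id function. The class name LruLabelCache and its methods are hypothetical and are not the RIOT classes; it assumes the id is a pure function of the label, so an evicted label simply gets recomputed to the same id later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: a small LRU cache in front of a label-to-id function. */
final class LruLabelCache {
    private final int capacity;
    private final Function<String, String> allocator;
    private final Map<String, String> cache;

    LruLabelCache(int capacity, Function<String, String> allocator) {
        this.capacity = capacity;
        this.allocator = allocator;
        // accessOrder = true: the eldest entry is the least recently used one.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > LruLabelCache.this.capacity;
            }
        };
    }

    /** Returns the cached id for the label, computing it on a miss. */
    String get(String label) {
        String id = cache.get(label);
        if (id == null) {
            id = allocator.apply(label);
            cache.put(label, id);
        }
        return id;
    }

    /** Number of labels currently held in the cache. */
    int cachedCount() {
        return cache.size();
    }
}
```

With a capacity of 1000, memory stays bounded however many distinct labels the file contains; the cost of a miss is one extra call to the underlying function.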

The seed is a random UUID (122 bits of randomness). The label is combined with the seed by converting to UTF-8 bytes and using MD5 to give a 128-bit hashed value, which is assumed to be globally unique. Using MD5 makes it fixed length, which is convenient. As we are not requiring an unattackable policy, MD5 is acceptable.
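The seed-plus-label hashing can be sketched roughly as follows. This is not the RIOT code: the class name SeededLabelHash is hypothetical, and the way the seed and label bytes are concatenated (seed string first, then the label) is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

/** Hypothetical sketch: derive a fixed-length id from a per-run seed and a label. */
final class SeededLabelHash {
    private final byte[] seedBytes;

    SeededLabelHash(UUID seed) {
        // Assumption: the seed is fed to the digest as the UTF-8 bytes of its string form.
        this.seedBytes = seed.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns 32 lowercase hex digits (128 bits) for the given label. */
    String idFor(String label) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(seedBytes);
            md5.update(label.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : md5.digest()) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

One SeededLabelHash would be created per parser run, e.g. new SeededLabelHash(UUID.randomUUID()). The same label then always maps to the same id within a run without storing anything, and to a different id in another run.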

This change is observable: the format of blank nodes printed in N-Triples and N-Quads changes slightly. N-Triples and N-Quads print bNodes using the internal label (so they work at arbitrary scale, and can even be used to restore blank nodes as described below).

The old allocator used a java.rmi.server.UID, which had : and - characters in it. These were encoded as Xhh, where hh is two hex digits (X3A and X2D).

The new format is slightly shorter and does not have Xhh-encoded characters:


Old:
_:BX2D5bbaf4a1X3A13dbc7e7182X3AX2D7fff

New:
_:B70db88eb40afc13d2ab37d161e36392e
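To make the old encoding concrete, here is a small sketch that reproduces the old-style label above. The class XhhEncoder is hypothetical, not the Jena encoder, and the raw input string is inferred by decoding the printed example. It only assumes that ASCII letters and digits pass through and everything else becomes X followed by two hex digits; how the real encoder treats other cases (such as a literal X) is not covered.

```java
/** Hypothetical sketch of the old Xhh escaping of non-alphanumeric characters. */
final class XhhEncoder {
    /** Returns "B" followed by the escaped form of the raw internal label. */
    static String encode(String raw) {
        StringBuilder sb = new StringBuilder("B");
        for (char ch : raw.toCharArray()) {
            boolean alphanumeric = (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z')
                    || (ch >= '0' && ch <= '9');
            if (alphanumeric) {
                sb.append(ch);
            } else {
                // ':' becomes X3A and '-' becomes X2D.
                sb.append(String.format("X%02X", (int) ch));
            }
        }
        return sb.toString();
    }
}
```

For example, XhhEncoder.encode("-5bbaf4a1:13dbc7e7182:-7fff") gives the part after "_:" in the old-style label shown above.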

Printed labels start with "B", a letter, to keep them compatible with pre-RDF 1.1 parsers. Blank node labels can begin with a digit in RDF 1.1: _:1 is a legal bnode label in RDF 1.1.

This change does not invalidate any existing data (nothing should depend on the format of blank nodes, only on the uniqueness of ids). Specifically, all existing persistently stored data is still valid. It'll print old style.


Parsing speed should not be affected.

Restoring blank nodes:

Blank nodes in NT and NQ dumps can be restored by rewriting the NT/NQ blank nodes _:Blabel as <_:label>, a pseudo URI scheme that tells RIOT (and SPARQL) to use the given label. Use with care. Remember to remove the 'B'.
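A naive sketch of that rewrite for a single line of a dump is shown here. The class BNodeRewriter is hypothetical. It assumes new-style labels made only of ASCII letters and digits, makes no attempt to decode old-style Xhh labels, and is not syntax-aware: it would also rewrite matching text inside a string literal, so check the data first.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: rewrite _:Blabel as <_:label> in one N-Triples/N-Quads line. */
final class BNodeRewriter {
    // "_:B" followed by the label; the leading 'B' is dropped in the output.
    private static final Pattern BNODE = Pattern.compile("_:B([A-Za-z0-9]+)");

    static String rewriteLine(String line) {
        return BNODE.matcher(line).replaceAll("<_:$1>");
    }
}
```

Applied to each line of a dump, this turns _:B70db88eb40afc13d2ab37d161e36392e into <_:70db88eb40afc13d2ab37d161e36392e>.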


To restore old behaviour:

If you think anything odd has changed, you can check by restoring the old behaviour. In class 'SyntaxLabels', in the second static function, replace:


LabelToNode.createScopeByDocumentHash()

with

LabelToNode.createScopeByDocument()

        Andy
