On 16/05/14 18:41, Marco Neumann wrote:
sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?
At the moment ...
The hashing is done for the mapping of Node -> NodeId;
Triples (quads) are (NodeId, NodeId, NodeId) and NodeIds are not hashes.
NodeIds are 8 bytes: 1 byte for the kind of id and 7 bytes value.
For integers, decimals, dates, datetimes, booleans, the system attempts
to encode them directly inline into the NodeId (so its 56 bits of signed
integer, for example; timezones to the nearest 15 mins)
This is why the datatypes get lost : xsd:byte comes out as xsd:integer.
For all other nodes, they are stored in the node file and the 7 bytes is
the address into that file.
To get from Node to NodeId: happens rarely for query and a lot for
loading. You only look up constants in the query.
1/ If the Node can be inlined, calculate it.
2/ If it can't, hash (type marker+lexicalform+datatype+lang or similar
for the other nodes), and look in the node hase to NodeId table ('node2id')
I have been considering using hashes for the NodeId itself in a related
but different system. About 96 bits looks to be safe, or shorter with
some clash management. Then loading should be faster (and
parallelisable). You need NodeId->real Node for getting query results.
Andy
On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne <[email protected]> wrote:
On 14/05/14 19:10, Yi Liao wrote:
Hi Andy,
Thanks for your answer.
In you answer "Node ->(by calculation) 128 bit value", did you use
MD5 hash for this step? Is MD5 collision resistant? I guess it is
highly unlikely for two nodes to be hashed into the same value, so we
might just take the risk?
Thanks, Yi Liao
Hi there,
Yes - it's MD5.
128 bits is a simply huge number.
http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5
2^64 different RDF terms. That's RDF terms (Nodes) - not triples. Triples
aren't id'ed by hash.
If you are worried by using hashing and assuming uniqueness in 128 bits,
then there are a few others things to worry about: RAM corruption, even
in error-detecting RAM; ethernet, and indeed any data storage or transfer.
And coding bugs :-)
Andy
-----Original Message----- From: Andy Seaborne
[mailto:[email protected]] Sent: Tuesday, November 26, 2013 3:39 PM To:
[email protected] Subject: Re: how does the node to nodeId mapping
work?
On 26/11/13 19:47, Yi Liao wrote:
Hi,
Hi there,
Can anybody explain to me how does Jena map node to nodeId? The
following is stated in
http://jena.apache.org/documentation/tdb/architecture.html
"The Node to NodeId mapping is based on hash of the Node (a 128
bit MD5 hash - the length was found not to major performance
factor).
The default storage of the node table is a sequential access file
for the NodeId to Node mapping and a B+Tree for the Node to NodeId
mapping."
My understanding is that Jena hashes the node into a long integer,
Node ->(by calculation) 128 bit value ->(by index) file offset
and somehow converts the hashed value into an address offset to
the node table, and the node information is stored at the address
offset in the node table.
There is a hash to offset index.
The NodeTable itself is heavily cached.
Is my understanding correct?
Yes!
How does Jena converts the hashed value into an address offset? How
is B+ tree used in this process?
TDB uses a B+tree for the hash to address offset. While it only
needed to be a pure key->value mapping, the B+Tree code is used as
it's heavily tested.
There is in the codebase an external hash table which is pure
key->value. Using it did not make an observable difference (see teh
cache) so using the B+Tree code was easy and it doesn't have the
reallocate burstiness of the external hash table.
Thanks! Yi Liao
Andy