On 14/05/14 19:10, Yi Liao wrote:
Hi Andy,

Thanks for your answer.

In you answer "Node ->(by calculation) 128 bit value", did you use
MD5 hash for this step? Is MD5 collision resistant? I guess it is
highly unlikely for two nodes to be hashed into the same value, so we
might just take the risk?

Thanks, Yi Liao

Hi there,

Yes - it's MD5.

128 bits is a simply huge number.

http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5

2^64 different RDF terms. That's RDF terms (Nodes) - not triples. Triples aren't id'ed by hash.

If you are worried by using hashing and assuming uniqueness in 128 bits,
then there are a few others things to worry about: RAM corruption, even
in error-detecting RAM; ethernet, and indeed any data storage or transfer.

And coding bugs :-)

        Andy




-----Original Message----- From: Andy Seaborne
[mailto:a...@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
work?

On 26/11/13 19:47, Yi Liao wrote:
Hi,

Hi there,


Can anybody explain to me how does Jena map node to nodeId? The
following is stated in
http://jena.apache.org/documentation/tdb/architecture.html


"The Node to NodeId mapping is based on hash of the Node (a 128
bit MD5 hash - the length was found not to major performance
factor).

The default storage of the node table is a sequential access file
for the NodeId to Node mapping and a B+Tree for the Node to NodeId
mapping."

My understanding is that Jena hashes the node into a long integer,

Node ->(by calculation) 128 bit value ->(by index) file offset

and somehow converts the hashed value into an address offset to
the node table, and the node information is stored at the address
offset in the node table.

There is a hash to offset index.

The NodeTable itself is heavily cached.


Is my understanding correct?

Yes!

How does Jena converts the hashed value into an address offset? How
is B+ tree used in this process?

TDB uses a B+tree for the hash to address offset.  While it only
needed to be a pure key->value mapping, the B+Tree code is used as
it's heavily tested.

There is in the codebase an external hash table which is pure
key->value.  Using it did not make an observable difference (see teh
cache) so using the B+Tree code was easy and it doesn't have the
reallocate burstiness of the external hash table.


Thanks! Yi Liao

 Andy





Reply via email to