On 26/11/13 19:47, Yi Liao wrote:
Hi,
Hi there,
Can anybody explain to me how Jena maps a Node to a NodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html: "The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping." My understanding is that Jena hashes the node into a long integer,
Node ->(by calculation) 128 bit value ->(by index) file offset
and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table.
There is a hash-to-offset index. The NodeTable itself is heavily cached.
Is my understanding correct?
Yes!
How does Jena convert the hashed value into an address offset? How is the B+Tree used in this process?
TDB uses a B+Tree for the hash-to-offset index. While it only needs to be a pure key->value mapping, the B+Tree code is used as it's heavily tested.
The codebase also contains an external hash table, which is a pure key->value store. Using it did not make an observable difference (see the cache), so using the B+Tree code was easy, and it doesn't have the reallocation burstiness of the external hash table.
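
To make the flow above concrete, here is a rough sketch in Python (not TDB's actual Java code). It only illustrates the shape Node -> 128 bit MD5 -> index -> file offset, with a cache in front. The class name, the method names, the length-prefixed record layout, and the use of a plain dict where TDB uses a B+Tree are all assumptions made for illustration.

```python
import hashlib


class NodeTableSketch:
    """Illustrative only: a toy node table, not TDB's implementation."""

    def __init__(self):
        # Stand-in for the sequential access file (NodeId -> Node).
        self._data = bytearray()
        # Stand-in for the B+Tree: 16-byte MD5 hash -> file offset.
        # Only a key->value mapping is needed, so a dict is enough here.
        self._index = {}
        # Stand-in for the cache in front of the index.
        self._cache = {}

    @staticmethod
    def _hash(node: str) -> bytes:
        # 128 bit MD5 hash of the node's string form (assumed encoding).
        return hashlib.md5(node.encode("utf-8")).digest()

    def get_or_allocate(self, node: str) -> int:
        # 1. Try the cache first.
        cached = self._cache.get(node)
        if cached is not None:
            return cached
        # 2. Node -> (by calculation) 128 bit value -> (by index) offset.
        key = self._hash(node)
        offset = self._index.get(key)
        if offset is None:
            # 3. Unknown node: append it to the file; the offset of the
            #    new record serves as its id in this sketch.
            offset = len(self._data)
            encoded = node.encode("utf-8")
            # Hypothetical record layout: 4-byte length, then the bytes.
            self._data += len(encoded).to_bytes(4, "big") + encoded
            self._index[key] = offset
        self._cache[node] = offset
        return offset

    def node_for(self, offset: int) -> str:
        # NodeId -> Node: read the record stored at that offset.
        length = int.from_bytes(self._data[offset:offset + 4], "big")
        start = offset + 4
        return self._data[start:start + length].decode("utf-8")


table = NodeTableSketch()
node_id = table.get_or_allocate("<http://example.org/a>")
print(node_id, table.node_for(node_id))
```

The hash is only used to find the offset; the reverse direction never touches the index and is a direct read at the offset, which is why a sequential file is enough for the NodeId to Node mapping.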
Thanks! Yi Liao
Andy