Hi Andy, Thanks for your answer.
In your answer "Node ->(by calculation) 128 bit value", did you use an MD5 hash for this step? Is MD5 collision resistant? I guess it is highly unlikely for two nodes to be hashed to the same value, so we might just take the risk?

Thanks,
Yi Liao

-----Original Message-----
From: Andy Seaborne [mailto:a...@apache.org]
Sent: Tuesday, November 26, 2013 3:39 PM
To: dev@jena.apache.org
Subject: Re: how does the node to nodeId mapping work?

On 26/11/13 19:47, Yi Liao wrote:
> Hi,

Hi there,

> Can anybody explain to me how does Jena map node to nodeId? The
> following is stated in
> http://jena.apache.org/documentation/tdb/architecture.html
>
> "The Node to NodeId mapping is based on hash of the Node (a 128 bit
> MD5 hash - the length was found not to major performance factor).
>
> The default storage of the node table is a sequential access file for
> the NodeId to Node mapping and a B+Tree for the Node to NodeId
> mapping."
>
> My understanding is that Jena hashes the node into a long integer,

Node ->(by calculation) 128 bit value ->(by index) file offset

> and somehow converts the hashed value into an address offset to the
> node table, and the node information is stored at the address offset
> in the node table.

There is a hash to offset index.

The NodeTable itself is heavily cached.

> Is my understanding correct?

Yes!

> How does Jena converts the hashed value into an address offset? How is
> B+ tree used in this process?

TDB uses a B+tree for the hash to address offset. While it only needed to be a pure key->value mapping, the B+Tree code is used as it's heavily tested.

There is in the codebase an external hash table which is pure key->value. Using it did not make an observable difference (see the cache), so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table.

> Thanks! Yi Liao

Andy
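
For readers following the thread, here is a minimal sketch of the two-step lookup described above (Node -> 128 bit hash -> file offset). It is not TDB's code: the class `NodeTableSketch`, its method names, the length-prefixed record layout, and the use of a plain dict in place of the B+Tree are all assumptions made for illustration. The helper `accidental_collision_bound` is likewise hypothetical and only evaluates the standard birthday approximation n^2 / 2^(bits+1) for the "take the risk" question. It says nothing about deliberately constructed MD5 collisions, which are known to be feasible.

```python
import hashlib


class NodeTableSketch:
    """Toy model of the path: node -> 128-bit hash -> offset -> node bytes.

    Assumptions (not taken from TDB): a dict stands in for the B+Tree index,
    a bytearray stands in for the sequential node file, and each record is
    stored as a 4-byte big-endian length followed by UTF-8 bytes.
    """

    def __init__(self):
        self.index = {}          # 16-byte MD5 digest -> offset into self.data
        self.data = bytearray()  # append-only "file" of node records

    @staticmethod
    def node_hash(node: str) -> bytes:
        # Step 1: Node ->(by calculation) 128 bit value
        return hashlib.md5(node.encode("utf-8")).digest()

    def get_or_allocate(self, node: str) -> int:
        # Step 2: 128 bit value ->(by index) file offset
        key = self.node_hash(node)
        offset = self.index.get(key)
        if offset is None:
            # Unknown node: append a record and remember where it starts.
            offset = len(self.data)
            encoded = node.encode("utf-8")
            self.data += len(encoded).to_bytes(4, "big") + encoded
            self.index[key] = offset
        return offset

    def node_at(self, offset: int) -> str:
        # Reverse direction: offset -> node, read straight from the "file".
        length = int.from_bytes(self.data[offset:offset + 4], "big")
        return self.data[offset + 4:offset + 4 + length].decode("utf-8")


def accidental_collision_bound(n: int, bits: int = 128) -> float:
    """Birthday approximation for any collision among n random hashes."""
    return n * n / 2.0 ** (bits + 1)


if __name__ == "__main__":
    table = NodeTableSketch()
    off = table.get_or_allocate("http://example.org/a")
    print(off, table.node_at(off))
    print(accidental_collision_bound(10**9))
```

Note that the sketch trusts the hash: two different nodes with the same digest would be treated as the same node, which is exactly the risk asked about above.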