Hi Andy,

Thanks for your answer. 

In your answer "Node ->(by calculation) 128 bit value", did you use an MD5 hash
for this step? Is MD5 collision resistant? I guess it is highly unlikely for two 
nodes to hash to the same value, so we might just take the risk?
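
As a rough check on that guess, here is a back-of-the-envelope birthday-bound
estimate. It assumes MD5 output behaves like a uniformly random 128-bit value
for non-adversarial input, and the node count is a made-up figure, not a
measurement. (MD5 is known to be broken against deliberately constructed
collisions, so this only speaks to accidental ones.)

```python
def accidental_collision_probability(n_items: int, hash_bits: int = 128) -> float:
    """Birthday approximation: p ~ n*(n-1) / 2^(bits+1), valid while p is small."""
    return n_items * (n_items - 1) / 2 ** (hash_bits + 1)

# Hypothetical store with one billion distinct nodes.
p = accidental_collision_probability(10**9)
print(f"{p:.3e}")
```

Under those assumptions the chance of any two nodes colliding is on the order
of 1e-21.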

Thanks,
Yi Liao


-----Original Message-----
From: Andy Seaborne [mailto:a...@apache.org] 
Sent: Tuesday, November 26, 2013 3:39 PM
To: dev@jena.apache.org
Subject: Re: how does the node to nodeId mapping work?

On 26/11/13 19:47, Yi Liao wrote:
> Hi,

Hi there,

>
> Can anybody explain to me how Jena maps a node to a nodeId? The 
> following is stated in 
> http://jena.apache.org/documentation/tdb/architecture.html
>
>
> "The Node to NodeId mapping is based on hash of the Node (a 128 bit
> MD5 hash - the length was found not to be a major performance factor).
>
> The default storage of the node table is a sequential access file for 
> the NodeId to Node mapping and a B+Tree for the Node to NodeId 
> mapping."
>
> My understanding is that Jena hashes the node into a long integer,

Node ->(by calculation) 128 bit value ->(by index) file offset

> and somehow converts the hashed value into an address offset to the 
> node table, and the node information is stored at the address offset 
> in the node table.

There is a hash to offset index.

The NodeTable itself is heavily cached.
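
To make the "Node -> 128 bit value -> file offset" chain concrete, here is a
minimal sketch. It is not TDB's code: the class and method names are
hypothetical, a Python dict stands in for the hash-to-offset index, a bytearray
stands in for the sequential node file, and encoding the node as a UTF-8 string
with a 4-byte length prefix is an assumption made for illustration.

```python
import hashlib


class NodeTableSketch:
    """Illustrative only: hash -> offset index plus an append-only node log."""

    def __init__(self):
        self.log = bytearray()  # stand-in for the sequential access file
        self.index = {}         # stand-in for the hash -> offset index

    def get_or_allocate(self, node: str) -> int:
        data = node.encode("utf-8")
        key = hashlib.md5(data).digest()  # 16 bytes = 128 bits
        offset = self.index.get(key)
        if offset is None:
            # New node: append it to the log and remember where it starts.
            offset = len(self.log)
            self.log += len(data).to_bytes(4, "big") + data
            self.index[key] = offset
        return offset

    def node_for(self, offset: int) -> str:
        # Reverse direction: read the length prefix, then the node bytes.
        length = int.from_bytes(self.log[offset:offset + 4], "big")
        return self.log[offset + 4:offset + 4 + length].decode("utf-8")


table = NodeTableSketch()
first = table.get_or_allocate("<http://example/a>")
second = table.get_or_allocate("<http://example/b>")
again = table.get_or_allocate("<http://example/a>")
```

In this sketch the offset returned for a node plays the role of its id, and
asking for the same node twice returns the same offset without appending again.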

>
> Is my understanding correct?

Yes!

> How does Jena convert the hashed value into an address offset? How is 
> the B+ tree used in this process?

TDB uses a B+Tree for the hash to address offset mapping.  While it only needs
to be a pure key->value mapping, the B+Tree code is used as it's heavily tested.

There is an external hash table in the codebase which is a pure
key->value store.  Using it did not make an observable difference (see the
cache), so using the B+Tree code was the easy choice, and it doesn't have the
reallocation burstiness of the external hash table.

>
> Thanks! Yi Liao
>
        
        Andy



