sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?

On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne <[email protected]> wrote:
> On 14/05/14 19:10, Yi Liao wrote:
>>
>> Hi Andy,
>>
>> Thanks for your answer.
>>
>> In you answer "Node ->(by calculation) 128 bit value", did you use
>> MD5 hash for this step? Is MD5 collision resistant? I guess it is
>> highly unlikely for two nodes to be hashed into the same value, so we
>> might just take the risk?
>>
>> Thanks, Yi Liao
>
>
> Hi there,
>
> Yes - it's MD5.
>
> 128 bits is a simply huge number.
>
> http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5
>
> 2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. Triples
> aren't id'ed by hash.
>
> If you are worried by using hashing and assuming uniqueness in 128 bits,
> then there are a few others things to worry about: RAM corruption, even
> in error-detecting RAM; ethernet, and indeed any data storage or transfer.
>
> And coding bugs :-)
>
>         Andy
>
>
>
>>
>>
>> -----Original Message----- From: Andy Seaborne
>> [mailto:[email protected]] Sent: Tuesday, November 26, 2013 3:39 PM To:
>> [email protected] Subject: Re: how does the node to nodeId mapping
>> work?
>>
>> On 26/11/13 19:47, Yi Liao wrote:
>>>
>>> Hi,
>>
>>
>> Hi there,
>>
>>>
>>> Can anybody explain to me how does Jena map node to nodeId? The
>>> following is stated in
>>> http://jena.apache.org/documentation/tdb/architecture.html
>>>
>>>
>>> "The Node to NodeId mapping is based on hash of the Node (a 128
>>> bit MD5 hash - the length was found not to major performance
>>> factor).
>>>
>>> The default storage of the node table is a sequential access file
>>> for the NodeId to Node mapping and a B+Tree for the Node to NodeId
>>> mapping."
>>>
>>> My understanding is that Jena hashes the node into a long integer,
>>
>>
>> Node ->(by calculation) 128 bit value ->(by index) file offset
>>
>>> and somehow converts the hashed value into an address offset to
>>> the node table, and the node information is stored at the address
>>> offset in the node table.
>>
>>
>> There is a hash to offset index.
>>
>> The NodeTable itself is heavily cached.
>>
>>>
>>> Is my understanding correct?
>>
>>
>> Yes!
>>
>>> How does Jena converts the hashed value into an address offset? How
>>> is B+ tree used in this process?
>>
>>
>> TDB uses a B+tree for the hash to address offset.  While it only
>> needed to be a pure key->value mapping, the B+Tree code is used as
>> it's heavily tested.
>>
>> There is in the codebase an external hash table which is pure
>> key->value.  Using it did not make an observable difference (see teh
>> cache) so using the B+Tree code was easy and it doesn't have the
>> reallocate burstiness of the external hash table.
>>
>>>
>>> Thanks! Yi Liao
>>>
>>  Andy
>>
>>
>>
>>
>



-- 


---
Marco Neumann
KONA

Reply via email to