Re: how does the node to nodeId mapping work?

2014-05-17 Thread Andy Seaborne

On 16/05/14 18:41, Marco Neumann wrote:

sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?


At the moment ...

The hashing is done for the mapping of Node - NodeId;

Triples (quads) are (NodeId, NodeId, NodeId) and NodeIds are not hashes.

NodeIds are 8 bytes: 1 byte for the kind of id and 7 bytes value.

For integers, decimals, dates, datetimes, booleans, the system attempts 
to encode them directly inline into the NodeId (so its 56 bits of signed 
integer, for example; timezones to the nearest 15 mins)


This is why the datatypes get lost : xsd:byte comes out as xsd:integer.

For all other nodes, they are stored in the node file and the 7 bytes is 
the address into that file.


To get from Node to NodeId: happens rarely for query and a lot for 
loading.  You only look up constants in the query.


1/ If the Node can be inlined, calculate it.

2/ If it can't, hash (type marker+lexicalform+datatype+lang or similar 
for the other nodes), and look in the node hase to NodeId table ('node2id')




I have been considering using hashes for the NodeId itself in a related 
but different system.  About 96 bits looks to be safe, or shorter with 
some clash management.  Then loading should be faster (and 
parallelisable).  You need NodeId-real Node for getting query results.


Andy



On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne a...@apache.org wrote:

On 14/05/14 19:10, Yi Liao wrote:


Hi Andy,

Thanks for your answer.

In you answer Node -(by calculation) 128 bit value, did you use
MD5 hash for this step? Is MD5 collision resistant? I guess it is
highly unlikely for two nodes to be hashed into the same value, so we
might just take the risk?

Thanks, Yi Liao



Hi there,

Yes - it's MD5.

128 bits is a simply huge number.

http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5

2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. Triples
aren't id'ed by hash.

If you are worried by using hashing and assuming uniqueness in 128 bits,
then there are a few others things to worry about: RAM corruption, even
in error-detecting RAM; ethernet, and indeed any data storage or transfer.

And coding bugs :-)

 Andy






-Original Message- From: Andy Seaborne
[mailto:a...@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
work?

On 26/11/13 19:47, Yi Liao wrote:


Hi,



Hi there,



Can anybody explain to me how does Jena map node to nodeId? The
following is stated in
http://jena.apache.org/documentation/tdb/architecture.html


The Node to NodeId mapping is based on hash of the Node (a 128
bit MD5 hash - the length was found not to major performance
factor).

The default storage of the node table is a sequential access file
for the NodeId to Node mapping and a B+Tree for the Node to NodeId
mapping.

My understanding is that Jena hashes the node into a long integer,



Node -(by calculation) 128 bit value -(by index) file offset


and somehow converts the hashed value into an address offset to
the node table, and the node information is stored at the address
offset in the node table.



There is a hash to offset index.

The NodeTable itself is heavily cached.



Is my understanding correct?



Yes!


How does Jena converts the hashed value into an address offset? How
is B+ tree used in this process?



TDB uses a B+tree for the hash to address offset.  While it only
needed to be a pure key-value mapping, the B+Tree code is used as
it's heavily tested.

There is in the codebase an external hash table which is pure
key-value.  Using it did not make an observable difference (see teh
cache) so using the B+Tree code was easy and it doesn't have the
reallocate burstiness of the external hash table.



Thanks! Yi Liao


  Andy














Re: how does the node to nodeId mapping work?

2014-05-16 Thread Andy Seaborne

On 14/05/14 19:10, Yi Liao wrote:

Hi Andy,

Thanks for your answer.

In you answer Node -(by calculation) 128 bit value, did you use
MD5 hash for this step? Is MD5 collision resistant? I guess it is
highly unlikely for two nodes to be hashed into the same value, so we
might just take the risk?

Thanks, Yi Liao


Hi there,

Yes - it's MD5.

128 bits is a simply huge number.

http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5

2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. 
Triples aren't id'ed by hash.


If you are worried by using hashing and assuming uniqueness in 128 bits,
then there are a few others things to worry about: RAM corruption, even
in error-detecting RAM; ethernet, and indeed any data storage or transfer.

And coding bugs :-)

Andy





-Original Message- From: Andy Seaborne
[mailto:a...@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
work?

On 26/11/13 19:47, Yi Liao wrote:

Hi,


Hi there,



Can anybody explain to me how does Jena map node to nodeId? The
following is stated in
http://jena.apache.org/documentation/tdb/architecture.html


The Node to NodeId mapping is based on hash of the Node (a 128
bit MD5 hash - the length was found not to major performance
factor).

The default storage of the node table is a sequential access file
for the NodeId to Node mapping and a B+Tree for the Node to NodeId
mapping.

My understanding is that Jena hashes the node into a long integer,


Node -(by calculation) 128 bit value -(by index) file offset


and somehow converts the hashed value into an address offset to
the node table, and the node information is stored at the address
offset in the node table.


There is a hash to offset index.

The NodeTable itself is heavily cached.



Is my understanding correct?


Yes!


How does Jena converts the hashed value into an address offset? How
is B+ tree used in this process?


TDB uses a B+tree for the hash to address offset.  While it only
needed to be a pure key-value mapping, the B+Tree code is used as
it's heavily tested.

There is in the codebase an external hash table which is pure
key-value.  Using it did not make an observable difference (see teh
cache) so using the B+Tree code was easy and it doesn't have the
reallocate burstiness of the external hash table.



Thanks! Yi Liao


 Andy








Re: how does the node to nodeId mapping work?

2014-05-16 Thread Marco Neumann
sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?

On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne a...@apache.org wrote:
 On 14/05/14 19:10, Yi Liao wrote:

 Hi Andy,

 Thanks for your answer.

 In you answer Node -(by calculation) 128 bit value, did you use
 MD5 hash for this step? Is MD5 collision resistant? I guess it is
 highly unlikely for two nodes to be hashed into the same value, so we
 might just take the risk?

 Thanks, Yi Liao


 Hi there,

 Yes - it's MD5.

 128 bits is a simply huge number.

 http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5

 2^64 different RDF terms.  That's RDF terms (Nodes) - not triples. Triples
 aren't id'ed by hash.

 If you are worried by using hashing and assuming uniqueness in 128 bits,
 then there are a few others things to worry about: RAM corruption, even
 in error-detecting RAM; ethernet, and indeed any data storage or transfer.

 And coding bugs :-)

 Andy





 -Original Message- From: Andy Seaborne
 [mailto:a...@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To:
 dev@jena.apache.org Subject: Re: how does the node to nodeId mapping
 work?

 On 26/11/13 19:47, Yi Liao wrote:

 Hi,


 Hi there,


 Can anybody explain to me how does Jena map node to nodeId? The
 following is stated in
 http://jena.apache.org/documentation/tdb/architecture.html


 The Node to NodeId mapping is based on hash of the Node (a 128
 bit MD5 hash - the length was found not to major performance
 factor).

 The default storage of the node table is a sequential access file
 for the NodeId to Node mapping and a B+Tree for the Node to NodeId
 mapping.

 My understanding is that Jena hashes the node into a long integer,


 Node -(by calculation) 128 bit value -(by index) file offset

 and somehow converts the hashed value into an address offset to
 the node table, and the node information is stored at the address
 offset in the node table.


 There is a hash to offset index.

 The NodeTable itself is heavily cached.


 Is my understanding correct?


 Yes!

 How does Jena converts the hashed value into an address offset? How
 is B+ tree used in this process?


 TDB uses a B+tree for the hash to address offset.  While it only
 needed to be a pure key-value mapping, the B+Tree code is used as
 it's heavily tested.

 There is in the codebase an external hash table which is pure
 key-value.  Using it did not make an observable difference (see teh
 cache) so using the B+Tree code was easy and it doesn't have the
 reallocate burstiness of the external hash table.


 Thanks! Yi Liao

  Andy








-- 


---
Marco Neumann
KONA


Re: how does the node to nodeId mapping work?

2014-05-16 Thread Marco Neumann
Andy, are Literals in Statements also hashed with the same MD5 in TDB?

On Tue, Nov 26, 2013 at 3:38 PM, Andy Seaborne a...@apache.org wrote:
 On 26/11/13 19:47, Yi Liao wrote:

 Hi,


 Hi there,



 Can anybody explain to me how does Jena map node to nodeId? The
 following is stated in
 http://jena.apache.org/documentation/tdb/architecture.html


 The Node to NodeId mapping is based on hash of the Node (a 128 bit
 MD5 hash - the length was found not to major performance factor).

 The default storage of the node table is a sequential access file for
 the NodeId to Node mapping and a B+Tree for the Node to NodeId
 mapping.

 My understanding is that Jena hashes the node into a long integer,


 Node -(by calculation) 128 bit value -(by index) file offset


 and somehow converts the hashed value into an address offset to the
 node table, and the node information is stored at the address offset
 in the node table.


 There is a hash to offset index.

 The NodeTable itself is heavily cached.


 Is my understanding correct?


 Yes!


 How does Jena converts the hashed value
 into an address offset? How is B+ tree used in this process?


 TDB uses a B+tree for the hash to address offset.  While it only needed to
 be a pure key-value mapping, the B+Tree code is used as it's heavily
 tested.

 There is in the codebase an external hash table which is pure key-value.
 Using it did not make an observable difference (see teh cache) so using the
 B+Tree code was easy and it doesn't have the reallocate burstiness of the
 external hash table.


 Thanks! Yi Liao


 Andy





-- 


---
Marco Neumann
KONA


RE: how does the node to nodeId mapping work?

2014-05-15 Thread Yi Liao
Hi Andy,

Thanks for your answer. 

In you answer Node -(by calculation) 128 bit value, did you use MD5 hash for 
this step? Is MD5 collision resistant? I guess it is highly unlikely for two 
nodes to be hashed into the same value, so we might just take the risk?

Thanks,
Yi Liao


-Original Message-
From: Andy Seaborne [mailto:a...@apache.org] 
Sent: Tuesday, November 26, 2013 3:39 PM
To: dev@jena.apache.org
Subject: Re: how does the node to nodeId mapping work?

On 26/11/13 19:47, Yi Liao wrote:
 Hi,

Hi there,


 Can anybody explain to me how does Jena map node to nodeId? The 
 following is stated in 
 http://jena.apache.org/documentation/tdb/architecture.html


 The Node to NodeId mapping is based on hash of the Node (a 128 bit
 MD5 hash - the length was found not to major performance factor).

 The default storage of the node table is a sequential access file for 
 the NodeId to Node mapping and a B+Tree for the Node to NodeId 
 mapping.

 My understanding is that Jena hashes the node into a long integer,

Node -(by calculation) 128 bit value -(by index) file offset

 and somehow converts the hashed value into an address offset to the 
 node table, and the node information is stored at the address offset 
 in the node table.

There is a hash to offset index.

The NodeTable itself is heavily cached.


 Is my understanding correct?

Yes!

 How does Jena converts the hashed value into an address offset? How is 
 B+ tree used in this process?

TDB uses a B+tree for the hash to address offset.  While it only needed to be a 
pure key-value mapping, the B+Tree code is used as it's heavily tested.

There is in the codebase an external hash table which is pure 
key-value.  Using it did not make an observable difference (see teh
cache) so using the B+Tree code was easy and it doesn't have the reallocate 
burstiness of the external hash table.


 Thanks! Yi Liao


Andy






how does the node to nodeId mapping work?

2013-11-26 Thread Yi Liao
Hi,

Can anybody explain to me how does Jena map node to nodeId? The following is 
stated in http://jena.apache.org/documentation/tdb/architecture.html


The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - 
the length was found not to major performance factor).

The default storage of the node table is a sequential access file for the 
NodeId to Node mapping and a B+Tree for the Node to NodeId mapping.

My understanding is that Jena hashes the node into a long integer, and somehow 
converts the hashed value into an address offset to the node table, and the 
node information is stored at the address offset in the node table.

Is my understanding correct? How does Jena converts the hashed value into an 
address offset? How is B+ tree used in this process?

Thanks!
Yi Liao


Re: how does the node to nodeId mapping work?

2013-11-26 Thread Andy Seaborne

On 26/11/13 19:47, Yi Liao wrote:

Hi,


Hi there,



Can anybody explain to me how does Jena map node to nodeId? The
following is stated in
http://jena.apache.org/documentation/tdb/architecture.html


The Node to NodeId mapping is based on hash of the Node (a 128 bit
MD5 hash - the length was found not to major performance factor).

The default storage of the node table is a sequential access file for
the NodeId to Node mapping and a B+Tree for the Node to NodeId
mapping.

My understanding is that Jena hashes the node into a long integer,


Node -(by calculation) 128 bit value -(by index) file offset


and somehow converts the hashed value into an address offset to the
node table, and the node information is stored at the address offset
in the node table.


There is a hash to offset index.

The NodeTable itself is heavily cached.



Is my understanding correct?


Yes!


How does Jena converts the hashed value
into an address offset? How is B+ tree used in this process?


TDB uses a B+tree for the hash to address offset.  While it only needed 
to be a pure key-value mapping, the B+Tree code is used as it's heavily 
tested.


There is in the codebase an external hash table which is pure 
key-value.  Using it did not make an observable difference (see teh 
cache) so using the B+Tree code was easy and it doesn't have the 
reallocate burstiness of the external hash table.




Thanks! Yi Liao



Andy