Re: how does the node to nodeId mapping work?
On 16/05/14 18:41, Marco Neumann wrote: sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...? At the moment ... The hashing is done for the mapping of Node - NodeId; Triples (quads) are (NodeId, NodeId, NodeId) and NodeIds are not hashes. NodeIds are 8 bytes: 1 byte for the kind of id and 7 bytes value. For integers, decimals, dates, datetimes, booleans, the system attempts to encode them directly inline into the NodeId (so its 56 bits of signed integer, for example; timezones to the nearest 15 mins) This is why the datatypes get lost : xsd:byte comes out as xsd:integer. For all other nodes, they are stored in the node file and the 7 bytes is the address into that file. To get from Node to NodeId: happens rarely for query and a lot for loading. You only look up constants in the query. 1/ If the Node can be inlined, calculate it. 2/ If it can't, hash (type marker+lexicalform+datatype+lang or similar for the other nodes), and look in the node hase to NodeId table ('node2id') I have been considering using hashes for the NodeId itself in a related but different system. About 96 bits looks to be safe, or shorter with some clash management. Then loading should be faster (and parallelisable). You need NodeId-real Node for getting query results. Andy On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne a...@apache.org wrote: On 14/05/14 19:10, Yi Liao wrote: Hi Andy, Thanks for your answer. In you answer Node -(by calculation) 128 bit value, did you use MD5 hash for this step? Is MD5 collision resistant? I guess it is highly unlikely for two nodes to be hashed into the same value, so we might just take the risk? Thanks, Yi Liao Hi there, Yes - it's MD5. 128 bits is a simply huge number. http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5 2^64 different RDF terms. That's RDF terms (Nodes) - not triples. Triples aren't id'ed by hash. If you are worried by using hashing and assuming uniqueness in 128 bits, then there are a few others things to worry about: RAM corruption, even in error-detecting RAM; ethernet, and indeed any data storage or transfer. And coding bugs :-) Andy -Original Message- From: Andy Seaborne [mailto:a...@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To: dev@jena.apache.org Subject: Re: how does the node to nodeId mapping work? On 26/11/13 19:47, Yi Liao wrote: Hi, Hi there, Can anybody explain to me how does Jena map node to nodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping. My understanding is that Jena hashes the node into a long integer, Node -(by calculation) 128 bit value -(by index) file offset and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table. There is a hash to offset index. The NodeTable itself is heavily cached. Is my understanding correct? Yes! How does Jena converts the hashed value into an address offset? How is B+ tree used in this process? TDB uses a B+tree for the hash to address offset. While it only needed to be a pure key-value mapping, the B+Tree code is used as it's heavily tested. There is in the codebase an external hash table which is pure key-value. Using it did not make an observable difference (see teh cache) so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table. Thanks! Yi Liao Andy
Re: how does the node to nodeId mapping work?
On 14/05/14 19:10, Yi Liao wrote: Hi Andy, Thanks for your answer. In you answer Node -(by calculation) 128 bit value, did you use MD5 hash for this step? Is MD5 collision resistant? I guess it is highly unlikely for two nodes to be hashed into the same value, so we might just take the risk? Thanks, Yi Liao Hi there, Yes - it's MD5. 128 bits is a simply huge number. http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5 2^64 different RDF terms. That's RDF terms (Nodes) - not triples. Triples aren't id'ed by hash. If you are worried by using hashing and assuming uniqueness in 128 bits, then there are a few others things to worry about: RAM corruption, even in error-detecting RAM; ethernet, and indeed any data storage or transfer. And coding bugs :-) Andy -Original Message- From: Andy Seaborne [mailto:a...@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To: dev@jena.apache.org Subject: Re: how does the node to nodeId mapping work? On 26/11/13 19:47, Yi Liao wrote: Hi, Hi there, Can anybody explain to me how does Jena map node to nodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping. My understanding is that Jena hashes the node into a long integer, Node -(by calculation) 128 bit value -(by index) file offset and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table. There is a hash to offset index. The NodeTable itself is heavily cached. Is my understanding correct? Yes! How does Jena converts the hashed value into an address offset? How is B+ tree used in this process? TDB uses a B+tree for the hash to address offset. While it only needed to be a pure key-value mapping, the B+Tree code is used as it's heavily tested. There is in the codebase an external hash table which is pure key-value. Using it did not make an observable difference (see teh cache) so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table. Thanks! Yi Liao Andy
Re: how does the node to nodeId mapping work?
sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...? On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne a...@apache.org wrote: On 14/05/14 19:10, Yi Liao wrote: Hi Andy, Thanks for your answer. In you answer Node -(by calculation) 128 bit value, did you use MD5 hash for this step? Is MD5 collision resistant? I guess it is highly unlikely for two nodes to be hashed into the same value, so we might just take the risk? Thanks, Yi Liao Hi there, Yes - it's MD5. 128 bits is a simply huge number. http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5 2^64 different RDF terms. That's RDF terms (Nodes) - not triples. Triples aren't id'ed by hash. If you are worried by using hashing and assuming uniqueness in 128 bits, then there are a few others things to worry about: RAM corruption, even in error-detecting RAM; ethernet, and indeed any data storage or transfer. And coding bugs :-) Andy -Original Message- From: Andy Seaborne [mailto:a...@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To: dev@jena.apache.org Subject: Re: how does the node to nodeId mapping work? On 26/11/13 19:47, Yi Liao wrote: Hi, Hi there, Can anybody explain to me how does Jena map node to nodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping. My understanding is that Jena hashes the node into a long integer, Node -(by calculation) 128 bit value -(by index) file offset and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table. There is a hash to offset index. The NodeTable itself is heavily cached. Is my understanding correct? Yes! How does Jena converts the hashed value into an address offset? How is B+ tree used in this process? TDB uses a B+tree for the hash to address offset. While it only needed to be a pure key-value mapping, the B+Tree code is used as it's heavily tested. There is in the codebase an external hash table which is pure key-value. Using it did not make an observable difference (see teh cache) so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table. Thanks! Yi Liao Andy -- --- Marco Neumann KONA
Re: how does the node to nodeId mapping work?
Andy, are Literals in Statements also hashed with the same MD5 in TDB? On Tue, Nov 26, 2013 at 3:38 PM, Andy Seaborne a...@apache.org wrote: On 26/11/13 19:47, Yi Liao wrote: Hi, Hi there, Can anybody explain to me how does Jena map node to nodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping. My understanding is that Jena hashes the node into a long integer, Node -(by calculation) 128 bit value -(by index) file offset and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table. There is a hash to offset index. The NodeTable itself is heavily cached. Is my understanding correct? Yes! How does Jena converts the hashed value into an address offset? How is B+ tree used in this process? TDB uses a B+tree for the hash to address offset. While it only needed to be a pure key-value mapping, the B+Tree code is used as it's heavily tested. There is in the codebase an external hash table which is pure key-value. Using it did not make an observable difference (see teh cache) so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table. Thanks! Yi Liao Andy -- --- Marco Neumann KONA
RE: how does the node to nodeId mapping work?
Hi Andy, Thanks for your answer. In you answer Node -(by calculation) 128 bit value, did you use MD5 hash for this step? Is MD5 collision resistant? I guess it is highly unlikely for two nodes to be hashed into the same value, so we might just take the risk? Thanks, Yi Liao -Original Message- From: Andy Seaborne [mailto:a...@apache.org] Sent: Tuesday, November 26, 2013 3:39 PM To: dev@jena.apache.org Subject: Re: how does the node to nodeId mapping work? On 26/11/13 19:47, Yi Liao wrote: Hi, Hi there, Can anybody explain to me how does Jena map node to nodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping. My understanding is that Jena hashes the node into a long integer, Node -(by calculation) 128 bit value -(by index) file offset and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table. There is a hash to offset index. The NodeTable itself is heavily cached. Is my understanding correct? Yes! How does Jena converts the hashed value into an address offset? How is B+ tree used in this process? TDB uses a B+tree for the hash to address offset. While it only needed to be a pure key-value mapping, the B+Tree code is used as it's heavily tested. There is in the codebase an external hash table which is pure key-value. Using it did not make an observable difference (see teh cache) so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table. Thanks! Yi Liao Andy
how does the node to nodeId mapping work?
Hi, Can anybody explain to me how does Jena map node to nodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping. My understanding is that Jena hashes the node into a long integer, and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table. Is my understanding correct? How does Jena converts the hashed value into an address offset? How is B+ tree used in this process? Thanks! Yi Liao
Re: how does the node to nodeId mapping work?
On 26/11/13 19:47, Yi Liao wrote: Hi, Hi there, Can anybody explain to me how does Jena map node to nodeId? The following is stated in http://jena.apache.org/documentation/tdb/architecture.html The Node to NodeId mapping is based on hash of the Node (a 128 bit MD5 hash - the length was found not to major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping. My understanding is that Jena hashes the node into a long integer, Node -(by calculation) 128 bit value -(by index) file offset and somehow converts the hashed value into an address offset to the node table, and the node information is stored at the address offset in the node table. There is a hash to offset index. The NodeTable itself is heavily cached. Is my understanding correct? Yes! How does Jena converts the hashed value into an address offset? How is B+ tree used in this process? TDB uses a B+tree for the hash to address offset. While it only needed to be a pure key-value mapping, the B+Tree code is used as it's heavily tested. There is in the codebase an external hash table which is pure key-value. Using it did not make an observable difference (see teh cache) so using the B+Tree code was easy and it doesn't have the reallocate burstiness of the external hash table. Thanks! Yi Liao Andy