[
https://issues.apache.org/jira/browse/MARMOTTA-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Schaffert updated MARMOTTA-469:
-----------------------------------------
Description:
The KiWi triple store currently generates unique IDs for nodes and triples
using a kind of sequence generator. Snowflake is generally very fast, but to
ensure that the same object always gets the same ID a lot of synchronization is
necessary (immediate commit for nodes, triple registry for triples), which has
a considerable performance impact, particularly in clustered environments.
A much faster approach would be to compute the ID from the objects themselves,
e.g. using an efficient and good hashing function. With a 64bit hash, the
probability for conflicts starts getting serious at around 2 billion objects
(probability 10%), so it might make sense switching to 128bit keys as well.
A good overview over clash probabilities is given in:
http://preshing.com/20110504/hash-collision-probabilities/
Changes would affect the API for ID generation (IDGenerator) as well as the
value factory. In addition, we would need to ignore duplicate IDs for database
inserts, e.g. using triggers or merge. Finally, we need to rethink the
behaviour of deleted/non-deleted triples.
was:
The KiWi triple store currently generates unique IDs for nodes and triples
using a kind of sequence generator. Snowflake is generally very fast, but to
ensure that the same object always gets the same ID a lot of synchronization is
necessary (immediate commit for nodes, triple registry for triples), which has
a considerable performance impact, particularly in clustered environments.
A much faster approach would be to compute the ID from the objects themselves,
e.g. using an efficient and good hashing function. With a 64bit hash, the
probability for conflicts starts getting serious at around 2 billion objects
(probability 10%), so it might make sense switching to 128bit keys as well.
A good overview over clash probabilities is given in:
> KiWi: Hashing-Based ID Generation
> ---------------------------------
>
> Key: MARMOTTA-469
> URL: https://issues.apache.org/jira/browse/MARMOTTA-469
> Project: Marmotta
> Issue Type: Improvement
> Components: KiWi Triple Store
> Reporter: Sebastian Schaffert
> Assignee: Sebastian Schaffert
> Fix For: 3.3
>
>
> The KiWi triple store currently generates unique IDs for nodes and triples
> using a kind of sequence generator. Snowflake is generally very fast, but to
> ensure that the same object always gets the same ID a lot of synchronization
> is necessary (immediate commit for nodes, triple registry for triples), which
> has a considerable performance impact, particularly in clustered environments.
> A much faster approach would be to compute the ID from the objects
> themselves, e.g. using an efficient and good hashing function. With a 64bit
> hash, the probability for conflicts starts getting serious at around 2
> billion objects (probability 10%), so it might make sense switching to 128bit
> keys as well.
> A good overview over clash probabilities is given in:
> http://preshing.com/20110504/hash-collision-probabilities/
> Changes would affect the API for ID generation (IDGenerator) as well as the
> value factory. In addition, we would need to ignore duplicate IDs for
> database inserts, e.g. using triggers or merge. Finally, we need to rethink
> the behaviour of deleted/non-deleted triples.
--
This message was sent by Atlassian JIRA
(v6.2#6252)