Re: [jr3] Index on randomly distributed data

Thomas Mueller Tue, 06 Mar 2012 10:19:40 -0800

Hi,

As for clustering, there are multiple solutions that don't require
*randomly distributed* node ids:


- if node ids are only used internally within a cluster node, then having
conflicting node ids isn't a problem.

- the path (compressed or uncompressed) could be used as the node id

- you could still use UUIDs, but sequential ones, since even those are
guaranteed to be globally unique - see also

http://www.sqlmag.com/article/quering/using-newsequentialid-instead-of-newid-

"In the past, many people used NewID() to assign primary keys to columns,
and these columns were unique across servers. When SQL Server 2000 first
came out, I too used NewID() to create primary keys. Alas, I soon
determined (as did many others) that using the NewID() function this way
can cause a variety of performance problems and that NewID() doesn't
provide a unique value that increases in order, as the IDENTITY-based key
does. You get some important performance benefits when a key value is
always higher than the last assigned value; generally these benefits are
related to how data is stored on the pages." (NewID() is the randomly
generated one we currently use in Jackrabbit 2).
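
To make the "sequential UUID" idea concrete, here is a minimal sketch in
Java. It is only an illustration under my own assumptions: the class name
SequentialUuidGenerator and the bit layout (clock-seeded counter in the
upper 64 bits, a random per-generator value in the lower 64 bits) are
hypothetical. It is not what SQL Server's NEWSEQUENTIALID does, it is not
the JCR-2857 implementation, and it does not set the RFC 4122 version and
variant bits.

```java
import java.security.SecureRandom;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: 128-bit ids that increase within one generator. */
final class SequentialUuidGenerator {

    // Random 64-bit value picked once per generator (for example once per
    // cluster node). Assumption: this is what keeps ids from different
    // generators apart, with high probability rather than by construction.
    private final long nodePart;

    // Counter seeded from the wall clock, so a restarted generator usually
    // starts above the ids it handed out earlier.
    private final AtomicLong counter;

    SequentialUuidGenerator() {
        this.nodePart = new SecureRandom().nextLong();
        this.counter = new AtomicLong(System.currentTimeMillis() << 16);
    }

    /** Returns an id whose upper 64 bits are larger than those of the previous id. */
    UUID next() {
        return new UUID(counter.incrementAndGet(), nodePart);
    }
}
```

Two ids from the same generator sort in creation order (UUID.compareTo
looks at the upper 64 bits first), so new entries go to the end of a
sorted index instead of to random positions.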


Regards,
Thomas


On 3/6/12 6:00 PM, "Felix Meschberger" wrote:

>Hi,
>
>I see and understand your points. But then, when it comes to clustering,
>generating a strictly monotonic sequential ID without collisions and
>without hampering performance is probably a hard problem to solve, right?
>
>I wonder, who defines the node IDs? Is it the layer above or below the
>MicroKernel API?
>
>If it were the MicroKernel implementation itself (thus below the API), why
>would it matter to the upper layers?
>
>Regards
>Felix
>
>On 06.03.2012 at 01:34, Thomas Mueller wrote:
>
>> Hi,
>> 
>> In Jackrabbit 2, we currently use a randomly generated UUID as the node
>> id. For Jackrabbit 3 this is an open question. I was looking for ways to
>> index randomly distributed data, but so far didn't find a solution. A
>> Google query for "uuid primary key performance" gave me:
>> 
>> http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
>> my job, we use UUID as PKs. What I can tell you from experience is DO NOT
>> USE THEM as PKs ... It's one of those things that when you have less than
>> 1000 records it's ok, but when you have millions, it's the worst thing you
>> can do. Why? Because UUID are not sequential..."
>> 
>> http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
>> hours to insert 15 million records into an empty UUID table"
>> 
>> http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
>> "For auto_increment key load process took 1 hour 50 minutes ... For UUID
>> process took over 12 hours and is still going...  So in this little case
>> we have about 200 times performance difference"
>> 
>> I believe if we rely on an index on randomly distributed data, performance
>> will degrade (by a factor of 10 or more, depending on the repository size,
>> the memory, and potentially on the number of changes). For Jackrabbit 2, to
>> solve this performance problem, we can actually switch to sequential node
>> ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
>> node id, then it wouldn't be possible to switch (it is not possible to
>> generate sequential content hashes). With content hashes, one option is to
>> make sure the index is always in memory. However, I believe we should not
>> build a system that has such constraints, unless the alternative
>> (sequential node ids) has problems we cannot solve otherwise.
>> 
>> Regards,
>> Thomas
>> 
>
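
The locality argument in the quoted message (random ids scatter index
accesses, sequential ids cluster them) can be illustrated with a toy model.
This is only a sketch under simplifying assumptions: the class
IndexLocalityDemo and its methods are hypothetical, an "index page" is
modelled as a fixed range of keys, and nothing here is a B-tree or a
measurement of Jackrabbit.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Hypothetical toy model: how many index pages does a batch of keys touch? */
final class IndexLocalityDemo {

    /** Counts distinct pages, where page number = key / keysPerPage. */
    static int pagesTouched(long[] keys, long keysPerPage) {
        Set<Long> pages = new HashSet<>();
        for (long key : keys) {
            pages.add(Math.floorDiv(key, keysPerPage));
        }
        return pages.size();
    }

    /** Keys first, first + 1, ..., as a sequential id generator would produce. */
    static long[] sequentialKeys(long first, int count) {
        long[] keys = new long[count];
        for (int i = 0; i < count; i++) {
            keys[i] = first + i;
        }
        return keys;
    }

    /** Pseudo-random keys in [0, keySpace), standing in for random UUIDs or hashes. */
    static long[] randomKeys(long keySpace, int count, long seed) {
        Random random = new Random(seed);
        long[] keys = new long[count];
        for (int i = 0; i < count; i++) {
            keys[i] = Math.floorMod(random.nextLong(), keySpace);
        }
        return keys;
    }
}
```

With 100 keys per page, pagesTouched(sequentialKeys(0, 1000), 100) touches
10 pages, while 1000 random keys drawn from a key space of 100 million
touch close to 1000 different pages. Once the index no longer fits in
memory, each of those pages may be a separate disk read.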