Re: [jr3] Index on randomly distributed data

Felix Meschberger Tue, 06 Mar 2012 09:01:12 -0800

Hi,

I see and understand your points. But when it comes to clustering, 
generating a strictly monotonic sequential ID without collisions and without 
hampering performance is probably a hard problem to solve, right?
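
To illustrate what I mean, here is a minimal sketch of one common compromise: 
each cluster node reserves a block of IDs from a shared allocator and hands 
them out locally. All names (BlockAllocator, ClusterNodeIds, block_size) are 
hypothetical and not part of any Jackrabbit API. Note that this only gives 
unique IDs that increase per cluster node; it does not give a globally strict 
monotonic order, which is exactly the hard part.

```python
import threading


class BlockAllocator:
    """Hypothetical shared coordinator that hands out disjoint ID ranges."""

    def __init__(self, block_size):
        self._block_size = block_size
        self._next_start = 0
        self._lock = threading.Lock()
        self.requests = 0  # number of (expensive) coordination round trips

    def next_block(self):
        # Only this call needs cluster-wide synchronization.
        with self._lock:
            start = self._next_start
            self._next_start += self._block_size
            self.requests += 1
            return start, start + self._block_size


class ClusterNodeIds:
    """Hypothetical per-cluster-node generator drawing from reserved blocks."""

    def __init__(self, allocator):
        self._allocator = allocator
        self._next = 0
        self._end = 0  # empty range, forces a block request on first use

    def next_id(self):
        if self._next == self._end:
            # Local block exhausted: reserve a new one from the coordinator.
            self._next, self._end = self._allocator.next_block()
        value = self._next
        self._next += 1
        return value


if __name__ == "__main__":
    allocator = BlockAllocator(block_size=1000)
    node_a = ClusterNodeIds(allocator)
    node_b = ClusterNodeIds(allocator)
    print(node_a.next_id(), node_b.next_id(), node_a.next_id())
```

The block size trades coordination cost against gaps: larger blocks mean 
fewer round trips to the allocator, but IDs from different cluster nodes are 
interleaved more coarsely and unused IDs are lost when a node stops.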


I wonder: who defines the node IDs? Is it the layer above or below the 
MicroKernel API?

If it is the MicroKernel implementation itself (thus below the API), why would 
it matter to the upper layers?

Regards
Felix

On 06.03.2012 at 01:34, Thomas Mueller wrote:

> Hi,
> 
> In Jackrabbit 2, we currently use a randomly generated UUID as the node
> id. For Jackrabbit 3 this is an open question. I was looking for ways to
> index randomly distributed data, but so far didn't find a solution. A
> Google query for "uuid primary key performance" gave me:
> 
> http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
> my job, we use UUID as PKs. What I can tell you from experience is DO NOT
> USE THEM as PKs ... It's one of those things that when you have less than
> 1000 records it's ok, but when you have millions, it's the worst thing you
> can do. Why? Because UUID are not sequential..."
> 
> http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
> hours to insert 15 million records into an empty UUID table"
> 
> http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
> "For auto_increment key load process took 1 hour 50 minutes ... For UUID
> process took over 12 hours and is still going...  So in this little case
> we have about 200 times performance difference"
> 
> I believe if we rely on an index on randomly distributed data, performance
> will degrade (factor 10 or more, depending on the repository size, the
> memory, and potentially on the number of changes). For Jackrabbit 2, to
> solve this performance problem, we can actually switch to sequential node
> ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
> node id, then it wouldn't be possible to switch (it is not possible to
> generate sequential content hashes). With content hashes, one option is to
> make sure the index is always in memory. However, I believe we should not
> build a system that has such constraints, unless the alternative
> (sequential node ids) has problems we cannot solve otherwise.
> 
> Regards,
> Thomas
>
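
To make the degradation argument above a bit more concrete, here is a rough 
toy simulation, not a benchmark. It assumes an index whose leaf pages hold a 
fixed number of keys and an LRU page cache much smaller than the index. The 
function names and all parameters (inserts, keys_per_page, cache_pages) are 
assumptions chosen for illustration; the random case is simplified by treating 
the index as if it already had its final size.

```python
import random
from collections import OrderedDict


def count_page_misses(page_accesses, cache_pages):
    """Count misses of an LRU page cache for a sequence of leaf-page accesses."""
    cache = OrderedDict()
    misses = 0
    for page in page_accesses:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            misses += 1                    # page would have to be read from disk
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return misses


def simulate(inserts=100_000, keys_per_page=100, cache_pages=50, seed=42):
    """Return (sequential_misses, random_misses) for the toy model."""
    leaf_pages = inserts // keys_per_page
    # Sequential ids: insert i always lands on the current right-most page.
    sequential = (i // keys_per_page for i in range(inserts))
    # Random ids: each insert lands on a uniformly chosen leaf page.
    rng = random.Random(seed)
    scattered = (rng.randrange(leaf_pages) for _ in range(inserts))
    return (count_page_misses(sequential, cache_pages),
            count_page_misses(scattered, cache_pages))


if __name__ == "__main__":
    seq_misses, rnd_misses = simulate()
    print("sequential ids:", seq_misses, "page misses")
    print("random ids:    ", rnd_misses, "page misses")
```

In this model the sequential case misses exactly once per leaf page, while the 
random case misses on most inserts because the cache holds only 5% of the 
pages. Real storage engines behave differently in detail, so the ratio here 
should not be read as a prediction.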
