Re: [xwiki-devs] [VOTE] Change document id stored in the database to reduce the likelihood of duplicate id

Ludovic Dubost Mon, 09 Jan 2012 03:20:45 -0800

Hi,

I have one small concern which leads to a big concern, and I was wondering
about something.


1/ Small concern: what did we do to verify the potential level of
collisions and whether they are likely to happen in our case?

I see we want to truncate the MD5 hash to 64 bits. I was wondering whether
this truncation increases the risk of collisions.
My question here is: what did we do to verify the level of collisions on
real data?

We could provide some XWiki SAS client DBs for testing, including our
Intranet, which is quite big, if there were a testing program.
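
As a rough baseline only (it does not replace a test on real data), the
birthday bound estimates the chance of at least one collision among n ids,
assuming the truncated hash behaves like a uniformly distributed value. A
minimal Java sketch; the class and method names are made up for
illustration:

```java
final class CollisionEstimate {
    /**
     * Birthday-bound approximation of the probability that at least two of
     * n uniformly distributed ids collide in a space of 2^bits values:
     * p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
     */
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -((double) n * (double) (n - 1)) / (2.0 * space);
        // -expm1(x) equals 1 - e^x and stays accurate for tiny probabilities.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        long[] sizes = {10_000L, 100_000L, 1_000_000L, 10_000_000L};
        for (long n : sizes) {
            System.out.printf("n=%d  32-bit: %.3e  64-bit: %.3e%n",
                n, collisionProbability(n, 32), collisionProbability(n, 64));
        }
    }
}
```

This only says what an ideal 64-bit hash would give; whether our actual
document references behave that way is exactly what a run on real DBs would
show.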

2/ Bigger concern: wouldn't it be better to have a way to
activate/deactivate the new feature? This would allow us to still upgrade
and run tests on real-life data without risking being backed into a corner.

3/ Wondering: wouldn't it be better to use the real reference as the ID and
move to strings for it?

Given that in an XWiki database this part is really small (compared to
attachments and the data itself), are there really any reasons to use
numeric IDs for this reference? Wouldn't the use of a String be better in
the end? We already use this for the join between xwikidoc and xwikiobjects
and haven't seen any big problem with that, have we?

If we used that method, wouldn't it mean ZERO collisions?


4/ Small additional stuff

There is also the migration of Object IDs, right? The object IDs use the
same system and also have a risk of collision (which would lead to property
data being shared with completely unrelated documents).

Ludovic


2012/1/7 Denis Gervalle <[email protected]>

> Now that the database migration mechanism has been improved, I would like
> to go ahead with my patch to improve document ids.
>
> Currently, ids are the plain string hashcode of a locally serialized
> document reference, including the language for translated documents. The
> likelihood of having duplicates with the string hashing algorithm of Java
> is really high.
>
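
To illustrate the collision risk described above, here is a minimal Java
sketch. "Aa" and "BB" are a well-known String.hashCode() collision, and a
shared prefix preserves it; the document names "Main.Aa" and "Main.BB" are
hypothetical, chosen only to show the effect:

```java
final class HashCodeCollision {
    /** Returns true when two distinct strings share the same 32-bit hashCode. */
    static boolean sameHash(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // 'A'*31 + 'a' == 'B'*31 + 'B' == 2112
        System.out.println(sameHash("Aa", "BB"));
        // The same suffix collision survives behind a common prefix.
        System.out.println(sameHash("Main.Aa", "Main.BB"));
    }
}
```
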
> What I propose is:
>
>  1) use MD5 hashing, which is particularly good at distributing values.
>  2) truncate the hash to its first 64 bits, since the XWD_ID column is a
> 64-bit long.
>  3) use a better string representation as the source of hashing
>
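
Points 1) and 2) above could be sketched in Java roughly as follows. This is
not the actual patch: the class and method names are invented, and the UTF-8
encoding and big-endian reading of the first 8 digest bytes are assumptions,
since the proposal only says "first 64 bits":

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class TruncatedMd5Id {
    /** Hashes the key with MD5 and keeps the first 8 bytes as a signed long. */
    static long idFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: the key is encoded as UTF-8 before hashing.
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Assumption: the first 8 of the 16 digest bytes, read big-endian.
            return ByteBuffer.wrap(digest, 0, 8).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idFor("7:mySpace5:myDoc0:"));
        System.out.println(idFor("7:mySpace5:myDoc2:fr"));
    }
}
```

The result is a signed long, so about half of the ids will be negative,
which is fine for a 64-bit integer column.
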
> Based on previous discussion, points 1) and 2) have already been agreed,
> and this vote is in particular about the string used for 3).
> I propose it in 2 steps:
>
>  1) before locales are fully supported in document references, use this
> format:
>
> <lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
>
>    where language would be an empty string for the default document, so it
> would look like 7:mySpace5:myDoc0: and its French translation would be
> 7:mySpace5:myDoc2:fr
>  2) when locales are included in references, we will replace the
> implementation with a reference serializer that would produce the same kind
> of representation, but that will include all spaces (not only the last
> one), to be prepared for the future.
>
> While doing so, I also propose to fix the cache key issue by using the same
> reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous
> examples will have the following keys in the document cache:
> 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
>
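
The length-prefixed format and the wiki-prefixed cache key described above
can be sketched like this. The class and method names are hypothetical, and
measuring lengths with String.length() (UTF-16 code units) is an assumption;
the proposal does not say how lengths are counted:

```java
final class LengthPrefixedKey {
    /** Appends <length>:<value> for each part, in order. */
    static String serialize(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part.length()).append(':').append(part);
        }
        return sb.toString();
    }

    /** Key used as the hashing source: last space, document, language. */
    static String localKey(String space, String document, String language) {
        return serialize(space, document, language);
    }

    /** Same key prefixed with the wiki name, for the document cache. */
    static String cacheKey(String wiki, String space, String document, String language) {
        return serialize(wiki, space, document, language);
    }

    public static void main(String[] args) {
        System.out.println(localKey("mySpace", "myDoc", ""));            // 7:mySpace5:myDoc0:
        System.out.println(cacheKey("xwiki", "mySpace", "myDoc", "fr")); // 5:xwiki7:mySpace5:myDoc2:fr
    }
}
```

Because every part carries its own length, names containing dots or colons
need no escaping: for instance, space "a.b" with document "c" yields
3:a.b1:c0: while space "a" with document "b.c" yields 1:a3:b.c0:.
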
> Using such a key (compared to the usual serialization) has the following
> advantages:
>  - ensures uniqueness of the reference without requiring a complex escaping
> algorithm, which is unneeded here.
>  - potentially reversible
>  - faster than the usual serialization
>  - supports language
>  - independent of the current serialization, which may evolve on its own,
> so it will be stable over time, which is really important when it is used
> as a base for the hashing algorithm used for document ids stored in the
> database.
>
> I would like to introduce this as early as possible, which means as soon
> as we are confident with the migration mechanism recently introduced.
> Since the migration of ids will convert 32-bit hashes into 64-bit ones, the
> risk of collision is really low, and to be careful, I have written a
> migration algorithm that would support such a collision (unless it causes a
> circular reference collision, but this is really unexpected). However,
> changing ids again later, if we change our mind, will be much more risky
> and the migration difficult to implement, so it is really important that
> we agree on the way we compute these ids, once and for all.
>
> Here is my +1,
>
> --
> Denis Gervalle
> SOFTEC sa - CEO
> eGuilde sarl - CTO
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs
>



-- 
Ludovic Dubost
Founder and CEO
Blog: http://blog.ludovic.org/
XWiki: http://www.xwiki.com
Skype: ldubost GTalk: ldubost
