Re: [xwiki-devs] [VOTE] Change document id stored in the database to reduce the likelihood of duplicate id

Thomas Mortagne Sat, 07 Jan 2012 23:29:42 -0800

+1

On Sun, Jan 8, 2012 at 2:44 AM, Caleb James DeLisle
<[email protected]> wrote:
> +1
>
> Caleb
>
> On 01/07/2012 04:39 PM, Denis Gervalle wrote:
>> Now that the database migration mechanism has been improved, I would like
>> to go ahead with my patch to improve document ids.
>>
>> Currently, ids are simply the string hashcode of a locally serialized
>> document reference, including the language for translated documents. The
>> likelihood of having duplicates with Java's string hashing algorithm is
>> really high.
>>
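To illustrate how easily Java's String.hashCode() collides, here is a minimal sketch. The class name and the reference strings are made up for illustration; they are not necessarily the exact locally serialized form that XWiki hashes.

```java
// Hypothetical demo class; the reference strings are illustrative only.
class HashCodeCollisionSketch {
    /** Returns true when two distinct strings share the same 32-bit String.hashCode(). */
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a classic colliding pair: 65*31 + 97 == 66*31 + 66 == 2112.
        System.out.println(collide("Aa", "BB"));
        // A shared prefix preserves the collision, so two different document
        // names in the same space would end up with the same id.
        System.out.println(collide("Main.Aa", "Main.BB"));
    }
}
```

Both calls report a collision, because String.hashCode() is a simple polynomial in the characters and equal-length suffixes with equal hashes stay equal behind any common prefix.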
>> What I propose is:
>>
>>  1) use MD5 hashing, which is particularly good at distributing values.
>>  2) truncate the hash to its first 64 bits, since the XWD_ID column is a
>> 64-bit long.
>>  3) use a better string representation as the source of hashing
>>
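Steps 1) and 2) could be sketched as follows. This is only an illustration under stated assumptions, not the actual patch: the class and method names are hypothetical, the string is assumed to be encoded as UTF-8, and "first 64 bits" is assumed to mean the leading 8 bytes of the digest read in big-endian order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not the actual XWiki implementation.
class Md5IdSketch {
    /** Hashes the key with MD5 and keeps the first 8 bytes (64 bits) as a long. */
    static long md5First64Bits(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: the key is encoded as UTF-8 before hashing.
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8)); // 16 bytes
            // Assumption: keep the leading 8 bytes, read big-endian (ByteBuffer's default).
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative input only; the string actually hashed is the subject of point 3).
        System.out.println(md5First64Bits("Main.WebHome"));
    }
}
```

The resulting long can be negative, which is fine for a signed 64-bit column.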
>> Based on previous discussions, points 1) and 2) have already been agreed
>> on, and this vote is specifically about the string used for 3).
>> I propose doing it in 2 steps:
>>
>>  1) until locales are fully supported in document references, use this
>> format:
>>
>>  <lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
>>     where language would be an empty string for the default document, so it
>> would look like 7:mySpace5:myDoc0: and its French translation would be
>> 7:mySpace5:myDoc2:fr
>>  2) once locales are included in references, we will replace the
>> implementation with a reference serializer that produces the same kind
>> of representation, but that includes all spaces (not only the last
>> one), to be prepared for the future.
>>
>> While doing so, I also propose to fix the cache key issue by using the same
>> representation, prefixed by <lengthOfWikiName>:<wikiName>, so the previous
>> examples will have the following keys in the document cache:
>> 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
>>
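A minimal sketch of how the two keys described above could be built. The class, method, and parameter names are hypothetical, and the lengths are assumed to be counted in Java chars (String.length()), which the proposal does not specify.

```java
// Hypothetical helper; names are illustrative, not the actual XWiki API.
class DocumentKeySketch {
    /** Appends "<length>:<value>" for one component of the key. */
    private static StringBuilder appendPart(StringBuilder sb, String value) {
        // Assumption: the length is the number of Java chars in the value.
        return sb.append(value.length()).append(':').append(value);
    }

    /** String hashed into the database id: last space, document name, language. */
    static String localKey(String space, String document, String language) {
        StringBuilder sb = new StringBuilder();
        appendPart(sb, space);
        appendPart(sb, document);
        appendPart(sb, language); // empty string for the default document
        return sb.toString();
    }

    /** Document cache key: the same string prefixed by the wiki name. */
    static String cacheKey(String wiki, String space, String document, String language) {
        return appendPart(new StringBuilder(), wiki)
            .append(localKey(space, document, language))
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(localKey("mySpace", "myDoc", ""));            // 7:mySpace5:myDoc0:
        System.out.println(localKey("mySpace", "myDoc", "fr"));          // 7:mySpace5:myDoc2:fr
        System.out.println(cacheKey("xwiki", "mySpace", "myDoc", ""));   // 5:xwiki7:mySpace5:myDoc0:
        System.out.println(cacheKey("xwiki", "mySpace", "myDoc", "fr")); // 5:xwiki7:mySpace5:myDoc2:fr
    }
}
```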
>> Using such a key (compared to the usual serialization) has the following
>> advantages:
>>  - ensures uniqueness of the reference without requiring a complex escaping
>> algorithm, which is unneeded here.
>>  - potentially reversible
>>  - faster than the usual serialization
>>  - supports language
>>  - independent of the current serialization, which may evolve independently,
>> so it will be stable over time, which is really important when it is used as
>> the basis for the hashing algorithm that computes the document ids stored in
>> the database.
>>
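The uniqueness and reversibility claims can be checked with a small parser. This is a sketch only: the class name is hypothetical, and it assumes each length counts Java chars. A naive join with dots would map both (a.b, c) and (a, b.c) to "a.b.c", whereas the length-prefixed form keeps them apart without any escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the "<length>:<value>" key format described above.
class KeyParserSketch {
    /** Splits a length-prefixed key back into its components. */
    static List<String> parse(String key) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < key.length()) {
            // The digits up to the next ':' give the length of the next value.
            int colon = key.indexOf(':', pos);
            if (colon < 0) {
                throw new IllegalArgumentException("missing ':' after position " + pos);
            }
            int length = Integer.parseInt(key.substring(pos, colon));
            int end = colon + 1 + length;
            if (length < 0 || end > key.length()) {
                throw new IllegalArgumentException("bad length at position " + pos);
            }
            // The value is taken by length, so it may itself contain ':' or digits.
            parts.add(key.substring(colon + 1, end));
            pos = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("7:mySpace5:myDoc2:fr")); // [mySpace, myDoc, fr]
        System.out.println(parse("3:a.b1:c0:"));           // space "a.b", document "c"
        System.out.println(parse("1:a3:b.c0:"));           // space "a", document "b.c"
    }
}
```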
>> I would like to introduce this as early as possible, which means as soon
>> as we are confident in the recently introduced migration mechanism.
>> Since the migration of ids will convert 32-bit hashes into 64-bit ones, the
>> risk of collision is really low, and to be careful, I have written a
>> migration algorithm that supports such collisions (unless one causes a
>> circular reference collision, but this is really unexpected). However,
>> changing ids again later, if we change our mind, would be much riskier
>> and the migration difficult to implement, so it is really important that
>> we agree on the way we compute these ids, once and for all.
>>
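To put rough numbers on the 32-bit versus 64-bit collision risk, the standard birthday-bound approximation can be used. The sketch below assumes ids are uniformly distributed, which is optimistic for String.hashCode(), and the figure of 100,000 documents is a hypothetical wiki size chosen for illustration.

```java
// Hypothetical estimate; assumes uniformly distributed ids.
class CollisionEstimateSketch {
    /**
     * Birthday-bound approximation of the probability that at least two of
     * n ids collide among 2^bits possible values:
     * p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
     */
    static double collisionProbability(long n, int bits) {
        double pairs = (double) n * (n - 1) / 2.0; // number of id pairs
        double space = Math.pow(2.0, bits);        // number of possible ids
        // expm1 keeps precision when the probability is tiny.
        return -Math.expm1(-pairs / space);
    }

    public static void main(String[] args) {
        long documents = 100_000L; // hypothetical number of documents
        System.out.println(collisionProbability(documents, 32));
        System.out.println(collisionProbability(documents, 64));
    }
}
```

Under these assumptions the estimate is roughly 0.69 for 32-bit ids and about 2.7e-10 for 64-bit ids.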
>> Here is my +1,
>>
>
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs




-- 
Thomas Mortagne
