+1

Thanks,
Marius

On Sat, Jan 7, 2012 at 11:39 PM, Denis Gervalle <[email protected]> wrote:
> Now that the database migration mechanism has been improved, I would like
> to go ahead with my patch to improve document ids.
>
> Currently, ids are the plain String hash code of a locally serialized document
> reference, including the language for translated documents. With Java's
> string hashing algorithm, the likelihood of duplicates is really high.
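To illustrate the duplicate problem described above, here is a minimal sketch. The class name and the sample page names are hypothetical; only the behaviour of `String.hashCode()` is standard Java. Because that hash is a simple polynomial over the characters, two strings that differ only by swapping "Aa" for "BB" always share a hash code.

```java
// Hypothetical demo class: shows how easily String.hashCode() collides.
final class HashCodeCollisionDemo {

    // True when two different strings share the same 32-bit hash code.
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" both hash to 65*31+97 = 66*31+66 = 2112.
        System.out.println(collide("Aa", "BB"));
        // The collision survives any shared prefix and suffix.
        System.out.println(collide("Space.PageAa.fr", "Space.PageBB.fr"));
    }
}
```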
>
> What I propose is:
>
>  1) use MD5 hashing, which distributes values particularly well.
>  2) truncate the hash to its first 64 bits, since the XWD_ID column is a
> 64-bit long.
>  3) use a better string representation as the source for hashing
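Points 1) and 2) could be sketched roughly as follows. This is not the actual patch: the class and method names are made up, and both the UTF-8 encoding and the big-endian reading of the first 8 digest bytes are assumptions the mail does not specify.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: derive a 64-bit id from the MD5 digest of a key string.
final class DocumentIdSketch {

    static long idFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: the key is encoded as UTF-8 before hashing.
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Keep only the first 8 of the 16 digest bytes, read big-endian.
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Whatever encoding and byte order are chosen, they have to stay fixed forever, since the resulting value is what gets stored in the database.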
>
> Based on the previous discussion, points 1) and 2) have already been agreed
> on, and this vote is in particular about the string used for 3).
> I propose to do it in 2 steps:
>
>  1) before locales are fully supported in document references, use this
> format:
>
>  <lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
>    where language would be an empty string for the default document, so it
> would look like 7:mySpace5:myDoc0: and its French translation would be
> 7:mySpace5:myDoc2:fr
>  2) when locales are included in references, we will replace the
> implementation with a reference serializer that produces the same kind
> of representation but includes all spaces (not only the last
> one), to be prepared for the future.
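The step 1 format can be sketched in a few lines. The class and method names are hypothetical, and counting the length in Java chars (UTF-16 units) rather than bytes or code points is an assumption, since the mail does not say which is meant.

```java
// Hypothetical sketch of the proposed length-prefixed local key.
final class LocalKeySketch {

    // Builds <len>:<space><len>:<document><len>:<language>.
    // Pass an empty language for the default (untranslated) document.
    static String localKey(String space, String document, String language) {
        StringBuilder sb = new StringBuilder();
        appendPart(sb, space);
        appendPart(sb, document);
        appendPart(sb, language);
        return sb.toString();
    }

    // Assumption: the length is String.length(), i.e. UTF-16 code units.
    private static void appendPart(StringBuilder sb, String part) {
        sb.append(part.length()).append(':').append(part);
    }
}
```

With these assumptions, `localKey("mySpace", "myDoc", "")` gives `7:mySpace5:myDoc0:` and `localKey("mySpace", "myDoc", "fr")` gives `7:mySpace5:myDoc2:fr`, matching the examples in the proposal.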
>
> While doing so, I also propose to fix the cache key issue by using the same
> reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous
> examples will have the following keys in the document cache:
> 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
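Seen this way, the cache key is just the id source string with one more length-prefixed part in front. A minimal sketch, with hypothetical names:

```java
// Hypothetical sketch: derive the document cache key from the local key.
final class CacheKeySketch {

    // Prepends <lengthOfWikiName>:<wikiName> to an already built local key.
    static String cacheKey(String wikiName, String localKey) {
        return wikiName.length() + ":" + wikiName + localKey;
    }
}
```

For example, `cacheKey("xwiki", "7:mySpace5:myDoc0:")` yields `5:xwiki7:mySpace5:myDoc0:`.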
>
> Using such a key (compared to the usual serialization) has the following
> advantages:
>  - ensures uniqueness of the reference without requiring a complex escaping
> algorithm, which is unneeded here.
>  - potentially reversible
>  - faster than the usual serialization
>  - supports language
>  - independent of the current serialization, which may evolve independently,
> so it will be stable over time, which is really important when it is used as
> the basis for the hashing algorithm used for document ids stored in the
> database.
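The "potentially reversible" and "no escaping needed" points can be illustrated with a small parser. This is only a sketch under the same assumptions as the format itself (lengths counted in Java chars); the class and method names are hypothetical. Since each part announces its own length, a name may contain ':' or digits without any escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a length-prefixed key back into its parts.
final class KeyParserSketch {

    static List<String> parse(String key) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < key.length()) {
            // The digits up to the next ':' give the length of the part.
            int colon = key.indexOf(':', pos);
            if (colon < 0) {
                throw new IllegalArgumentException("No ':' after index " + pos);
            }
            int length = Integer.parseInt(key.substring(pos, colon));
            int end = colon + 1 + length;
            if (length < 0 || end > key.length()) {
                throw new IllegalArgumentException("Bad length at index " + pos);
            }
            // Take exactly 'length' chars, whatever they contain.
            parts.add(key.substring(colon + 1, end));
            pos = end;
        }
        return parts;
    }
}
```

Parsing `5:xwiki7:mySpace5:myDoc2:fr` returns the parts `xwiki`, `mySpace`, `myDoc` and `fr`; a key ending in `0:` yields an empty last part for the default language.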
>
> I would like to introduce this as early as possible, which means as soon
> as we are confident in the migration mechanism recently introduced.
> Since the migration of ids will convert 32-bit hashes into 64-bit ones, the
> risk of collision is really low, and to be careful, I have written a
> migration algorithm that supports such collisions (unless they cause a
> circular reference collision, but this is really unexpected). However,
> changing ids again later, if we change our mind, would be much riskier
> and the migration difficult to implement, so it is really important that
> we agree on the way we compute these ids, once and for all.
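To put rough numbers on "really low", here is a sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)). It assumes a uniformly distributed hash, which is plausible for truncated MD5 but optimistic for `String.hashCode()`, so the 32-bit figure is a lower bound on the current situation. The class name and the sample document count are hypothetical.

```java
// Hypothetical sketch: birthday-bound estimate of the collision risk.
final class BirthdayBoundSketch {

    // Approximate probability that at least two of 'documents' ids collide
    // when ids are uniform over 2^bits values.
    static double collisionProbability(long documents, int bits) {
        double n = (double) documents;
        double buckets = Math.pow(2.0, bits);
        // -expm1(-x) computes 1 - exp(-x) accurately even for tiny x.
        return -Math.expm1(-n * (n - 1) / (2.0 * buckets));
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(100_000L, 32));
        System.out.println(collisionProbability(100_000L, 64));
    }
}
```

For a wiki with 100,000 documents, this estimate puts the chance of at least one collision at roughly two in three with 32-bit ids, and well below one in a billion with 64-bit ids.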
>
> Here is my +1,
>
> --
> Denis Gervalle
> SOFTEC sa - CEO
> eGuilde sarl - CTO
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs