+1

Thanks,
Marius

On Sat, Jan 7, 2012 at 11:39 PM, Denis Gervalle <[email protected]> wrote:
> Now that the database migration mechanism has been improved, I would like
> to go ahead with my patch to improve document ids.
>
> Currently, ids are a simple string hashcode of a locally serialized document
> reference, including the language for translated documents. The likelihood
> of having duplicates with the string hashing algorithm of Java is really
> high.
>
> What I propose is:
>
> 1) use MD5 hashing, which is particularly good at distributing values.
> 2) truncate the hash to the first 64 bits, since the XWD_ID column is a
>    64-bit long.
> 3) use a better string representation as the source of hashing
>
> Based on previous discussion, points 1) and 2) have already been agreed, and
> this vote is in particular about the string used for 3).
> I propose it in 2 steps:
>
> 1) before locales are fully supported in document references, use this
>    format:
>
>    <lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
>
>    where language would be an empty string for the default document, so it
>    would look like 7:mySpace5:myDoc0: and its French translation could be
>    7:mySpace5:myDoc2:fr
> 2) when locales are included in references, we will replace the
>    implementation with a reference serializer that would produce the same
>    kind of representation, but that will include all spaces (not only the
>    last one), to be prepared for the future.
>
> While doing so, I also propose to fix the cache key issue by using the same
> reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous
> examples will have the following keys in the document cache:
> 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
>
> Using such a key (compared to the usual serialization) has the following
> advantages:
> - ensures uniqueness of the reference without requiring a complex escaping
>   algorithm, which is unneeded here.
> - potentially reversible
> - faster than the usual serialization
> - supports language
> - independent of the current serialization, which may evolve independently,
>   so it will be stable over time, which is really important when it is used
>   as a base for the hashing algorithm used for document ids stored in the
>   database.
>
> I would like to introduce this as early as possible, which means as soon
> as we are confident with the migration mechanism recently introduced.
> Since the migration of ids will convert 32-bit hashes into 64-bit ones, the
> risk of collision is really low, and to be careful, I have written a
> migration algorithm that would support such collisions (unless they cause a
> circular reference collision, but this is really unexpected). However,
> changing ids again later, if we change our mind, will be much riskier
> and the migration difficult to implement, so it is really important that
> we agree on the way we compute these ids, once and for all.
>
> Here is my +1,
>
> --
> Denis Gervalle
> SOFTEC sa - CEO
> eGuilde sarl - CTO
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs
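
For readers following the thread, here is a minimal sketch of how the proposed key format and the truncated MD5 id could fit together. It is not the actual patch: the class and method names (DocumentIdSketch, localKey, cacheKey, documentId) are hypothetical, and the UTF-8 encoding and big-endian reading of the first 8 digest bytes are assumptions, since the proposal only says "the first 64 bits".

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of the proposal; names are not from the XWiki code base.
public class DocumentIdSketch {

    // Appends one length-prefixed segment, e.g. "mySpace" becomes "7:mySpace".
    static void appendSegment(StringBuilder sb, String value) {
        sb.append(value.length()).append(':').append(value);
    }

    // Step 1 format: <len>:<lastSpaceName><len>:<documentName><len>:<language>.
    // The language is the empty string for the default document.
    static String localKey(String space, String document, String language) {
        StringBuilder sb = new StringBuilder();
        appendSegment(sb, space);
        appendSegment(sb, document);
        appendSegment(sb, language);
        return sb.toString();
    }

    // Cache key: the same string, prefixed by <lengthOfWikiName>:<wikiName>.
    static String cacheKey(String wiki, String space, String document, String language) {
        StringBuilder sb = new StringBuilder();
        appendSegment(sb, wiki);
        sb.append(localKey(space, document, language));
        return sb.toString();
    }

    // MD5 of the key, truncated to its first 64 bits.
    // Assumptions: UTF-8 bytes as input, first 8 digest bytes read big-endian.
    static long documentId(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(localKey("mySpace", "myDoc", ""));            // 7:mySpace5:myDoc0:
        System.out.println(cacheKey("xwiki", "mySpace", "myDoc", "fr")); // 5:xwiki7:mySpace5:myDoc2:fr
        System.out.println(documentId(localKey("mySpace", "myDoc", "fr")));
    }
}
```

The length prefixes are what make escaping unnecessary: a space named "a5:b" and a document named "c" cannot produce the same string as a space "a" and a document "b5:c", because the leading lengths differ.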

