+1

On Sun, Jan 8, 2012 at 2:44 AM, Caleb James DeLisle <[email protected]> wrote:
> +1
>
> Caleb
>
> On 01/07/2012 04:39 PM, Denis Gervalle wrote:
>> Now that the database migration mechanism has been improved, I would like
>> to go ahead with my patch to improve document ids.
>>
>> Currently, ids are a simple string hashcode of a locally serialized document
>> reference, including the language for translated documents. The likelihood
>> of having duplicates with Java's string hashing algorithm is really
>> high.
>>
>> What I propose is:
>>
>> 1) use MD5 hashing, which is particularly good at distributing.
>> 2) truncate the hash to the first 64 bits, since the XWD_ID column is a
>> 64-bit long.
>> 3) use a better string representation as the source of hashing
>>
>> Based on previous discussion, points 1) and 2) have already been agreed, and
>> this vote is in particular about the string used for 3).
>> I propose it in 2 steps:
>>
>> 1) before locales are fully supported in document references, use this
>> format:
>>
>> <lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
>> where language would be an empty string for the default document, so it
>> would look like 7:mySpace5:myDoc0: and its French translation could be
>> 7:mySpace5:myDoc2:fr
>> 2) when locales are included in references, we will replace the
>> implementation by a reference serializer that would produce the same kind
>> of representation, but that will include all spaces (not only the last
>> one), to be prepared for the future.
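
For readers following the thread, here is a minimal sketch of the
length-prefixed format described in step 1). It is only an illustration, not
the patch itself: the class and method names are hypothetical, and it assumes
the length is counted in Java chars (UTF-16 code units), which the proposal
does not specify.

```java
// Hypothetical illustration of the proposed key format; not XWiki code.
final class LocalKeySketch {
    private LocalKeySketch() {
    }

    // Emits "<length>:<value>". Assumption: length is String.length(),
    // i.e. the number of UTF-16 code units.
    static String part(String value) {
        return value.length() + ":" + value;
    }

    // Builds <len>:<lastSpaceName><len>:<documentName><len>:<language>.
    // The language is the empty string for the default document.
    static String localKey(String lastSpaceName, String documentName, String language) {
        return part(lastSpaceName) + part(documentName) + part(language);
    }
}
```

With this sketch, localKey("mySpace", "myDoc", "") gives 7:mySpace5:myDoc0:
and localKey("mySpace", "myDoc", "fr") gives 7:mySpace5:myDoc2:fr, matching
the examples above. The length prefix is what removes the need for escaping:
space "a" with document "b.c" and space "a.b" with document "c" both become
a.b.c under a naive dot join, but here they stay distinct as 1:a3:b.c0: and
3:a.b1:c0:.
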
>>
>> While doing so, I also propose to fix the cache key issue by using the same
>> reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous
>> examples will have the following keys in the document cache:
>> 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
>>
>> Using such a key (compared to the usual serialization) has the following
>> advantages:
>> - ensures uniqueness of the reference without requiring a complex escaping
>> algorithm, which is unneeded here.
>> - potentially reversible
>> - faster than the usual serialization
>> - supports language
>> - independent of the current serialization, which may evolve independently,
>> so it will be stable over time, which is really important when it is used as
>> a base for the hashing algorithm used for document ids stored in the
>> database.
>>
>> I would like to introduce this as early as possible, which means as soon
>> as we are confident with the migration mechanism recently introduced.
>> Since the migration of ids will convert 32-bit hashes into 64-bit ones, the
>> risk of collision is really low, and to be careful, I have written a
>> migration algorithm that would support such collisions (unless one causes a
>> circular reference collision, but this is really unexpected). However,
>> changing ids again later, if we change our mind, will be much more risky
>> and the migration difficult to implement, so it is really important that
>> we agree on the way we compute these ids, once and for all.
>>
>> Here is my +1,
>>
>
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs
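
To make the hashing and cache key parts concrete, here is a rough sketch of
how a 64-bit id could be derived from such a key. It is not the actual patch:
the class and method names are hypothetical, and both the UTF-8 encoding of
the key and the big-endian reading of the first 8 digest bytes are
assumptions that the proposal leaves open.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of the proposed id computation; not XWiki code.
final class DocumentIdSketch {
    private DocumentIdSketch() {
    }

    // Cache key: the local key prefixed by <lengthOfWikiName>:<wikiName>.
    static String cacheKey(String wikiName, String localKey) {
        return wikiName.length() + ":" + wikiName + localKey;
    }

    // MD5 of the local key, truncated to its first 64 bits.
    // Assumptions: the key is encoded as UTF-8, and the first 8 bytes of
    // the digest are read as a big-endian long (ByteBuffer's default order).
    static long idFor(String localKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(localKey.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide MD5.
            throw new IllegalStateException("MD5 is unavailable", e);
        }
    }
}
```

Here cacheKey("xwiki", "7:mySpace5:myDoc0:") gives 5:xwiki7:mySpace5:myDoc0:
as in the example above, and idFor("7:mySpace5:myDoc0:") would be the value
stored in XWD_ID. Whatever encoding and byte order are finally chosen, they
have to be frozen along with the key format, since changing either one
changes every id.
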
-- 
Thomas Mortagne

