+1 with the following caveats:

* We need to guarantee that a migration cannot corrupt the DB. For example, imagine that we change a document id, but this id is also used in some other tables and the migration stops before it has been changed in those tables. The change needs to be done in one transaction per doc being changed, covering all tables. Said differently, it should be possible to ctrl-c the migrator at any time, safely restart XWiki, and have the migrator just carry on from where it was.
* OR we need to have a configuration parameter for deciding whether to run this migration, so that users run it only when they decide to, thus ensuring that they have backed up their DBs properly first.

I prefer the first option but we need to guarantee it.

Thanks
-Vincent

On Jan 7, 2012, at 10:39 PM, Denis Gervalle wrote:

> Now that the database migration mechanism has been improved, I would like
> to go ahead with my patch to improve document ids.
>
> Currently, ids are a simple string hashcode of a locally serialized document
> reference, including the language for translated documents. The likelihood
> of having duplicates with the string hashing algorithm of Java is really
> high.
>
> What I propose is:
>
> 1) use MD5 hashing, which is particularly good at distributing.
> 2) truncate the hash to the first 64 bits, since the XWD_ID column is a
> 64-bit long.
> 3) use a better string representation as the source of hashing
>
> Based on previous discussion, points 1) and 2) have already been agreed, and
> this vote is in particular about the string used for 3).
> I propose it in 2 steps:
>
> 1) before locales are fully supported in document references, use this
> format:
>
> <lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
> where language would be an empty string for the default document, so it
> would look like 7:mySpace5:myDoc0: and its French translation could be
> 7:mySpace5:myDoc2:fr
> 2) when locales are included in references, we will replace the
> implementation with a reference serializer that would produce the same kind
> of representation, but that will include all spaces (not only the last
> one), to be prepared for the future.
>
> While doing so, I also propose to fix the cache key issue by using the same
> reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous
> examples will have the following keys in the document cache:
> 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
>
> Using such a key (compared to the usual serialization) has the following
> advantages:
> - ensures uniqueness of the reference without requiring a complex escaping
> algorithm, which is unneeded here.
> - potentially reversible
> - faster than the usual serialization
> - supports language
> - independent of the current serialization, which may evolve independently,
> so it will be stable over time, which is really important when it is used as
> a base for the hashing algorithm used for document ids stored in the
> database.
>
> I would like to introduce this as early as possible, which means as soon
> as we are confident with the migration mechanism recently introduced.
> Since the migration of ids will convert 32-bit hashes into 64-bit ones, the
> risk of collision is really low, and to be careful, I have written a
> migration algorithm that would support such collisions (unless one causes a
> circular reference collision, but this is really unexpected). However,
> changing ids again later, if we change our mind, will be much more risky
> and the migration difficult to implement, so it is really important that
> we agree on the way we compute these ids, once and for all.
>
> Here is my +1,
>
> --
> Denis Gervalle
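
For illustration, the key format and id computation proposed in the quoted message could be sketched in Java roughly as follows. This is only a minimal sketch, not the actual patch: the class and method names are hypothetical, and the UTF-8 encoding, the use of String.length() for the length prefixes, and the big-endian reading of the first 8 digest bytes are assumptions that the message does not specify.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not XWiki's actual implementation.
final class DocumentIdSketch {
    private DocumentIdSketch() {
    }

    // Appends "<length>:<value>"; the length prefix removes the need for escaping.
    // Assumption: length is counted in Java chars (String.length()).
    private static void appendPart(StringBuilder sb, String value) {
        sb.append(value.length()).append(':').append(value);
    }

    // Builds <lenSpace>:<space><lenDoc>:<doc><lenLang>:<lang>;
    // language is the empty string for the default document.
    static String localKey(String lastSpace, String docName, String language) {
        StringBuilder sb = new StringBuilder();
        appendPart(sb, lastSpace);
        appendPart(sb, docName);
        appendPart(sb, language);
        return sb.toString();
    }

    // Same key prefixed by <lenWiki>:<wiki>, as proposed for the document cache.
    static String cacheKey(String wiki, String lastSpace, String docName, String language) {
        StringBuilder sb = new StringBuilder();
        appendPart(sb, wiki);
        return sb.append(localKey(lastSpace, docName, language)).toString();
    }

    // MD5 of the local key, truncated to the first 64 bits.
    // Assumptions: UTF-8 input bytes, first 8 digest bytes read big-endian.
    static long documentId(String localKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(localKey.getBytes(StandardCharsets.UTF_8));
            long id = 0L;
            for (int i = 0; i < 8; i++) {
                id = (id << 8) | (digest[i] & 0xFF);
            }
            return id;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With this sketch, localKey("mySpace", "myDoc", "") gives 7:mySpace5:myDoc0: and cacheKey("xwiki", "mySpace", "myDoc", "fr") gives 5:xwiki7:mySpace5:myDoc2:fr, matching the examples in the message. The 64-bit id is then derived from the local key only, so it does not depend on the wiki name.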

