On Jan 9, 2012, at 11:36 AM, Denis Gervalle wrote:

> On Mon, Jan 9, 2012 at 11:23, Vincent Massol <[email protected]> wrote:
> 
>> 
>> On Jan 9, 2012, at 11:09 AM, Denis Gervalle wrote:
>> 
>>> On Mon, Jan 9, 2012 at 10:07, Vincent Massol <[email protected]> wrote:
>>> 
>>>> +1 with the following caveats:
>>>> 
>>>> * We need to guarantee that a migration cannot corrupt the DB.
>>> 
>>> 
>>> The evolution of the migration mechanism was the first step in that
>>> direction, since accessing a DB with an inappropriate XWiki core could
>>> have corrupted it.
>>> 
>>> 
>>>> For example imagine that we change a document id but this id is also
>> used
>>>> in some other tables and the migration stops before it's changed in the
>>>> other tables. The change needs to be done in transactions for each doc
>>>> being changed across all tables.
>>> 
>>> 
>>> That would be nice, but MySQL does not support transactions on MyISAM
>>> tables. I use a single transaction for the whole migration process,
>> 
>> I think we should have one transaction per document update instead. We've
>> had this problem in the past when upgrading very large systems: the
>> migration would never go through in one pass, for a reason I have
>> forgotten, so we needed several transactions so that the migration could
>> be restarted when it failed and eventually complete.
>> 
> 
> This could be done easily if you want it. Just note that all other
> migrations are single-transaction based, AFAICS.

I'm pretty sure this isn't the case.
See R4359XWIKI1459DataMigration and R6079XWIKI1878DataMigration for example.
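For illustration, the per-document transaction approach under discussion could look like the following self-contained sketch. This is plain Java with in-memory maps standing in for the real Hibernate store, and each loop iteration standing in for one DB transaction; all class and method names are mine, not XWiki API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: in-memory maps stand in for the database,
// and each loop iteration stands in for one per-document transaction.
public class PerDocumentMigrationSketch {
    // rows keyed by the old 32-bit id (value = serialized reference)
    final Map<Integer, String> oldRows = new LinkedHashMap<>();
    // rows keyed by the new 64-bit id
    final Map<Long, String> newRows = new LinkedHashMap<>();

    /** Convert every remaining document, one "transaction" at a time. */
    void migrate() {
        for (Map.Entry<Integer, String> row : new ArrayList<>(oldRows.entrySet())) {
            long newId = computeNewId(row.getValue());
            if (newRows.containsKey(newId)) {
                continue; // already converted by a previous, interrupted run
            }
            // In the real migration, the document row and all related
            // tables would be updated inside a single DB transaction here.
            newRows.put(newId, row.getValue());
            oldRows.remove(row.getKey());
        }
    }

    static long computeNewId(String reference) {
        // placeholder; the actual proposal is MD5 truncated to 64 bits
        return 0x9E3779B97F4A7C15L * reference.hashCode();
    }
}
```

Because each document is committed independently, killing the process mid-run and restarting simply resumes with the documents that were not yet converted.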

Thanks
-Vincent

>>> so on systems
>>> that support transactions (Oracle?), the migration will either complete
>>> or not happen at all. But I cannot secure MySQL better than it allows.
>> 
>> It should work fine on MySQL with InnoDB, which we recommend (see
>> http://platform.xwiki.org/xwiki/bin/view/AdminGuide/InstallationMySQL).
>> 
> 
> I have been on MyISAM for a long time, since there are other drawbacks
> to using InnoDB. I have not experienced many corruption issues so far,
> so you can expect others to have a similar setup.
> 
> 
>> 
>> Thanks
>> -Vincent
>> 
>>>> Said differently, the migrator should be able to be ctrl-c-ed at any
>>>> time; you can then safely restart XWiki and the migrator will just
>>>> carry on from where it was.
>>>> 
>>> 
>>> The migrator will restart where it left off, but the granularity is the
>>> document. I process the updates document by document, updating all
>>> tables for each one. If there is an issue during the migration, say on
>>> MySQL, and it is restarted, it will start again, skipping documents
>>> that have already been converted. So any corruption would be limited
>>> to a single document.
>>> 
>>> 
>>>> * OR we need to have a configuration parameter for deciding to run this
>>>> migration or not so that users run it only when they decide thus
>> ensuring
>>>> that they've done the proper backups and saving of DBs.
>>>> 
>>> 
>>> This is true using the new migration procedure, but it is not as
>>> flexible as you seem to expect. Supporting two hashing algorithms is
>>> not a feature but, to me, an increased risk of corruption.
>>> Now, if you use a recent core that uses the new ids but have not
>>> activated migrations and you access an old DB, you will simply be
>>> unable to access the database: you will receive a "db requires
>>> migration" exception.
>>> 
>>> Anyway, migrations are disabled by default and must be enabled by an
>>> administrator in xwiki.cfg. The release notes will mention the need to
>>> run them and, of course, to make a backup first. And you are always
>>> supposed to have a backup when you upgrade, or you are not a system
>>> admin ;)
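For reference, the switch mentioned here lives in xwiki.cfg; it looks roughly like this (the exact property name may vary between versions, so check the comments in your own xwiki.cfg):

```
#-# Enable the database migration manager at startup (disabled by default)
xwiki.store.migration=1
```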
>>> 
>>> 
>>>> I prefer the first option but we need to guarantee it.
>>>> 
>>> 
>>> We will never be able to guarantee it, but I have done my best to make
>>> it as safe as possible.
>>> 
>>> 
>>>> 
>>>> Thanks
>>>> -Vincent
>>>> 
>>>> On Jan 7, 2012, at 10:39 PM, Denis Gervalle wrote:
>>>> 
>>>>> Now that the database migration mechanism has been improved, I would
>>>>> like to go ahead with my patch to improve document ids.
>>>>> 
>>>>> Currently, ids are the plain Java string hashcode of a locally
>>>>> serialized document reference, including the language for translated
>>>>> documents. The likelihood of duplicates with Java's string hashing
>>>>> algorithm is really high.
>>>>> 
>>>>> What I propose is:
>>>>> 
>>>>> 1) use MD5 hashing, which distributes particularly well.
>>>>> 2) truncate the hash to its first 64 bits, since the XWD_ID column is
>>>>> a 64-bit long.
>>>>> 3) use a better string representation as the source of the hash.
>>>>> 
>>>>> Based on previous discussion, points 1) and 2) have already been
>>>>> agreed on, and this vote is specifically about the string used for 3).
>>>>> I propose it in 2 steps:
>>>>> 
>>>>> 1) before locales are fully supported in document references, use this
>>>>> format:
>>>>> 
>>>>> <lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
>>>>> 
>>>>> where language would be an empty string for the default document, so
>>>>> it would look like 7:mySpace5:myDoc0: and its French translation would
>>>>> be 7:mySpace5:myDoc2:fr
>>>>> 2) when locales are included in references, we will replace the
>>>>> implementation with a reference serializer producing the same kind of
>>>>> representation, but including all spaces (not only the last one), to
>>>>> be prepared for the future.
>>>>> 
>>>>> While doing so, I also propose to fix the cache key issue by using the
>>>>> same reference, prefixed by <lengthOfWikiName>:<wikiName>, so the
>>>>> previous examples will have the following keys in the document cache:
>>>>> 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
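For concreteness, the proposed computation could be sketched like this in plain Java (class and method names are mine, not the actual patch; only the key format and the MD5-truncated-to-64-bits idea come from the proposal):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of the proposed document id: a length-prefixed local key
// (e.g. 7:mySpace5:myDoc0:) hashed with MD5 and truncated to its
// first 64 bits so it fits the XWD_ID long column.
public class DocumentIdSketch {
    /** Length-prefixed key: lastSpace, docName, language. */
    static String localKey(String space, String doc, String language) {
        return space.length() + ":" + space
             + doc.length() + ":" + doc
             + language.length() + ":" + language;
    }

    /** Cache key: the local key prefixed with the wiki name. */
    static String cacheKey(String wiki, String space, String doc, String language) {
        return wiki.length() + ":" + wiki + localKey(space, doc, language);
    }

    /** First 64 bits of the MD5 digest of the key, as a long. */
    static long documentId(String key) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5")
            .digest(key.getBytes(StandardCharsets.UTF_8));
        long id = 0;
        for (int i = 0; i < 8; i++) {
            id = (id << 8) | (md5[i] & 0xFF); // keep the first 8 bytes
        }
        return id;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(localKey("mySpace", "myDoc", ""));   // 7:mySpace5:myDoc0:
        System.out.println(localKey("mySpace", "myDoc", "fr")); // 7:mySpace5:myDoc2:fr
        System.out.println(cacheKey("xwiki", "mySpace", "myDoc", ""));
        System.out.println(documentId(localKey("mySpace", "myDoc", "")));
    }
}
```

Note how the length prefixes make escaping unnecessary: no two distinct references can produce the same key string, since each component's boundary is explicit.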
>>>>> 
>>>>> Using such a key (compared to the usual serialization) has the
>>>>> following advantages:
>>>>> - it ensures uniqueness of the reference without requiring a complex
>>>>> escaping algorithm, which is unneeded here
>>>>> - it is potentially reversible
>>>>> - it is faster than the usual serialization
>>>>> - it supports the language
>>>>> - it is independent of the current serialization, which may evolve
>>>>> separately; it will therefore be stable over time, which is really
>>>>> important since it is used as the base of the hashing algorithm for
>>>>> the document ids stored in the database.
>>>>> 
>>>>> I would like to introduce this as early as possible, which means as
>>>>> soon as we are confident in the recently introduced migration
>>>>> mechanism. Since the migration of ids converts 32-bit hashes into
>>>>> 64-bit ones, the risk of collision is really low; to be careful, I
>>>>> have nevertheless written a migration algorithm that supports such
>>>>> collisions (unless they cause a circular reference collision, which
>>>>> is really unexpected). However, changing ids again later, if we
>>>>> change our mind, would be much riskier and the migration difficult to
>>>>> implement, so it is really important that we agree on the way we
>>>>> compute these ids, once and for all.
>>>>> 
>>>>> Here is my +1,
>>>>> 
>>>>> --
>>>>> Denis Gervalle
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
