Re: [xwiki-devs] [VOTE] Change document id stored in the database to reduce the likelihood of duplicate id

Denis Gervalle Mon, 09 Jan 2012 08:02:02 -0800

On Mon, Jan 9, 2012 at 12:20, Ludovic Dubost <[email protected]> wrote:


> Hi,
>
> I have one small concern which leads to a big concern, and I was wondering
> about something.
>
> 1/ Small concern: what did we do to verify the potential level of
> collisions and if there is a chance they happen in our case
>

In theory, there is a risk, the same one you have with the current ids, but
since that risk is much reduced, it would take really bad luck to hit it.
Note that we move from a poorly distributed 32-bit hash to a well-suited
64-bit one. So the risk is not zero, but a collision would be really
unexpected, since we double the hash width (which squares the hash space,
from 2^32 to 2^64 values) and use an algorithm with a better distribution.
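
To put rough numbers on this, here is a minimal sketch of the usual
birthday-bound approximation, p ~ 1 - exp(-n(n-1) / (2 * 2^bits)). It assumes
an ideal, uniformly distributed hash, which is optimistic for the current
String hashcode; the class and method names are only illustrative and are not
part of the patch.

```java
public class CollisionEstimate {
    /**
     * Approximate probability that at least two of n ids collide when ids
     * are drawn uniformly from a space of 2^bits values (birthday bound).
     */
    public static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -((double) n * (n - 1)) / (2.0 * space);
        // -expm1(x) computes 1 - e^x without losing precision for tiny x
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        long[] sizes = {10_000L, 100_000L, 1_000_000L};
        for (long n : sizes) {
            System.out.printf("n=%d  32-bit: %.3e  64-bit: %.3e%n", n,
                collisionProbability(n, 32), collisionProbability(n, 64));
        }
    }
}
```

Under that uniformity assumption, 100,000 ids give a probability of roughly
0.69 of at least one collision with 32 bits, against roughly 2.7e-10 with
64 bits.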


> I see we want to truncate the MD5 hash to 64 bits. I was wondering if there
> is a not a risk of having more collisions.
>

Sure, using the lower 64 bits is not as good as using the full 128 bits.
Using more bits would require a change in the mapping and schema (see below).
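
For illustration only, here is a minimal sketch of what the truncation could
look like, using the length-prefixed key format from the proposal quoted
below. The class and method names are hypothetical, and so are the choices of
UTF-8 encoding, of counting lengths in Java chars, and of keeping the first
8 bytes of the digest in big-endian order; the actual patch may differ on
each of these points.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DocumentIdSketch {
    /** Builds the length-prefixed key, e.g. 7:mySpace5:myDoc2:fr */
    public static String localKey(String space, String name, String language) {
        StringBuilder sb = new StringBuilder();
        for (String part : new String[] {space, name, language}) {
            // Each part is written as <length>:<value>, so no escaping is needed
            sb.append(part.length()).append(':').append(part);
        }
        return sb.toString();
    }

    /** MD5 of the key, truncated to 64 bits (assumption: first 8 bytes, big-endian). */
    public static long hash64(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on every Java platform
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String defaultDoc = localKey("mySpace", "myDoc", "");
        String french = localKey("mySpace", "myDoc", "fr");
        System.out.println(defaultDoc + " -> " + hash64(defaultDoc));
        System.out.println(french + " -> " + hash64(french));
    }
}
```

The length prefixes are what keep the key unambiguous without escaping: a
space "a.b" with document "c" gives 3:a.b1:c0:, while a space "a" with
document "b.c" gives 1:a3:b.c0:.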


> My question here is what did we do to verify the level of collisions on
> real data.
>
> We could provide some XWiki SAS client DBs, including our Intranet which is
> quite big for testing if there was a testing program.
>

That will be the purpose of the non-final release: to have some of you check
that your large DBs work well with it. In particular, I would appreciate
some tests in non-MySQL environments, especially Oracle...


> 2/ Bigger concern: wouldn't it be better to have a way to
> activate/deactivate the new feature. This would allow to still upgrade and
> make tests on real life data without risking being in a corner
>

We are already in a corner, since we have already run into id collisions, so
I am only trying to push that corner further away until we can move to a
fully unique id. I know this is not easy, but we cannot stay stuck.

Providing both would require a two-way migrator, and this would also
introduce more risk of mistakes that could cause database corruption. I
have built a solid migrator that ensures you migrate properly and
deliberately before using the new core.
Note that if you plug an old core into the new DB, it will corrupt it
somewhat, by not seeing any documents and recreating some default initial
documents using the old ids. This is more concerning, since even if
documents do not get mixed up, their objects partly will, which will cause
really annoying issues. But what can we do? We cannot change the old core
retroactively. So even if we provide a rock-solid solution, an old core will
still corrupt data. We cannot prevent all administrator mistakes.

Also, providing both would mean that we do not trust our 64-bit hash to be
better than the 32-bit one. As I said, there is no zero risk, but the
probability is really near zero.
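
To illustrate how weak the 32-bit Java String hashcode is as an id, here is a
small sketch. The document names are made up, and the "Space.Name" strings
only approximate the serialized reference that the old core actually hashes,
so take it as a demonstration of the hash function, not of real colliding
documents.

```java
public class HashCodeCollision {
    /** True when two distinct strings share the same String.hashCode(). */
    public static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" both hash to 2112: 65*31 + 97 == 66*31 + 66
        System.out.println("Aa".hashCode() + " " + "BB".hashCode());

        // The hash is a polynomial in 31, so colliding suffixes of equal
        // length still collide behind any common prefix, and they can be
        // concatenated to build as many colliding names as desired.
        System.out.println(collide("Main.Aa", "Main.BB"));
        System.out.println(collide("Main.AaAa", "Main.BBBB"));
        System.out.println(collide("Main.AaBB", "Main.BBAa"));
    }
}
```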

> 3/ Wondering: wouldn't it be better to use the real reference as the ID and
> move to strings for it
>
> Given that in an XWiki database, this part is really small (compared to
> attachments and the data itself), are there really any reasons to use IDs
> for this reference. Wouldn't the use of a String be better in the end ? We
> already use this for the join between xwikidoc and xwikiobjects and haven't
> seen any big problem with that did we ?
>

But objects still use an id (for Hibernate, and) to link properties to
objects, as you mention later. Maybe you mean that we use a mix of
objects for properties.


> If we used that method wouldn't it means ZERO collision ?
>

Sure, it would have been, had we done so from the beginning, but we did not,
and this is now really difficult to change. The best would have been not to
use a meaningful ID at all. Changing to strings now would require an external
migration process that uses the old mapping to create a new id, then another
process that uses the new mapping and removes the old ids. This is really a
separate job, best done when we fully review the model.


> 4/ Small additional stuff
>
> There is also the migration of Object IDs right ? The object IDs use the
> same system and also have a risk of collision (which would lead to property
> data being shared with completely unrelevant documents)
>

Good catch! I had not seen it, since object ids do not directly depend on
document ids, but indirectly depend on similarly calculated ones. I need to
look at these as well, since this could mean that two objects of previously
colliding documents will collide.

Thanks,


>
> Ludovic
>
>
> 2012/1/7 Denis Gervalle <[email protected]>
>
> > Now that the database migration mechanism has been improved, I would like
> > to go ahead with my patch to improve document ids.
> >
> > Currently, ids are simple string hashcode of a locally serialized document
> > reference, including the language for translated documents. The likelihood
> > of having duplicates with the string hashing algorithm of java is really
> > high.
> >
> > What I propose is:
> >
> >  1) use an MD5 hashing which is particularly good at distributing.
> >  2) truncate the hash to the first 64bits, since the XWD_ID column is a
> > 64bit long.
> >  3) use a better string representation as the source of hashing
> >
> > Based on previous discussion, point 1) and 2) has already been agreed, and
> > this vote is in particular about the string used for 3).
> > I propose it in 2 steps:
> >
> >  1) before locale are fully supported in document reference, use this
> > format:
> >
> > <lengthOfLastSpaceName>:<lastSpaceName><lengthOfDocumentName>:<documentName><lengthOfLanguage>:<language>
> >
> >    where language would be an empty string for the default document, so it
> > would look like 7:mySpace5:myDoc0: and its french translation could be
> > 7:mySpace5:myDoc2:fr
> >  2) when locale are included in reference, we will replace the
> > implementation by a reference serializer that would produce the same kind
> > of representation, but that will include all spaces (not only the last
> > one), to be prepared for the future.
> >
> > While doing so, I also propose to fix the cache key issue by using the same
> > reference, but prefixed by <lengthOfWikiName>:<wikiName>, so the previous
> > examples will have the following key in the document cache:
> > 5:xwiki7:mySpace5:myDoc0: and 5:xwiki7:mySpace5:myDoc2:fr
> >
> > Using such a key (compared to the usual serialization) has the following
> > advantages:
> >  - ensure uniqueness of the reference without requiring a complex escaping
> > algorithm, which is unneeded here.
> >  - potentially reversible
> >  - faster than the usual serialization
> >  - support language
> >  - independent of the current serialization that may evolve independently,
> > so it will be stable over time which is really important when it is used as
> > a base for the hashing algorithm used for document ids stored in the
> > database.
> >
> > I would like to introduce this as early as possible, which means as soon
> > as we are confident with the migration mechanism recently introduced.
> > Since the migration of ids will convert 32bits hashes into 64bits ones, the
> > risk of collision is really low, and to be careful, I have written a
> > migration algorithm that would support such collision (unless it causes a
> > circular reference collision, but this is really unexpected). However,
> > changing ids again later, if we change our mind, will be really more risky
> > and the migration difficult to implement, so it is really important that
> > we agree on the way we compute these ids, once and for all.
> >
> > Here is my +1,
> >
> > --
> > Denis Gervalle
> > SOFTEC sa - CEO
> > eGuilde sarl - CTO
> > _______________________________________________
> > devs mailing list
> > [email protected]
> > http://lists.xwiki.org/mailman/listinfo/devs
> >
>
>
>
> --
> Ludovic Dubost
> Founder and CEO
> Blog: http://blog.ludovic.org/
> XWiki: http://www.xwiki.com
> Skype: ldubost GTalk: ldubost
>



-- 
Denis Gervalle
SOFTEC sa - CEO
eGuilde sarl - CTO
