For document what could help a lot already without any performance penalty is to compare the loaded document reference and the passed one in XWikiHibernateStore#loadXWikiDoc. That's because most of the code in XWiki apply the following logic: getDocument(), modify it, saveDocument().
On Fri, Feb 2, 2018 at 12:19 PM, Marc Sladek <marc.sla...@synventis.com> wrote: > Hi Denis, > > Thanks a lot for your answer. I know it's been a while, but I'd still like > to follow up on it since it's quite the fundamental issue. > >> Therefore, by improving the hash algorithm, the size of the ids, and the > quality of the hashed key, we have considered ourselves to be saved enough > for a normal usage. > > Still, with enough bad luck, documents and objects may be overwritten > without a trace. This is not a stable implementation. And even worse, if on > any XWiki installation hash collisions will happen in the future (or have > already happened since 4.x), they probably won't be easily associated with > this issue because it's nearly impossible to debug. > > While I do now understand the motivation to stick with hashes, I'm still > not sure why a collision detection would be difficult to introduce and why > it's even "impossible for some API". Let me briefly outline an idea: > > In XWikiHibernateStore#saveXWikiDoc on L615 > <https://github.com/xwiki/xwiki-platform/blob/stable-9.11.x/xwiki-platform-core/xwiki-platform-oldcore/src/main/java/com/xpn/xwiki/store/XWikiHibernateStore.java#L615> > an exists check on the doc id is already performed. If now > xwikidoc.fullName is also selected in the HQL, a comparison to > doc.getDocumentReference() can expose an imminent collision before data is > overwritten. At least an XWikiException should be thrown in this case. A > similar thing could be done before saving BaseObjects on L1203 > <https://github.com/xwiki/xwiki-platform/blob/stable-9.11.x/xwiki-platform-core/xwiki-platform-oldcore/src/main/java/com/xpn/xwiki/store/XWikiHibernateStore.java#L1203> > to avoid collisions on Object IDs. > > I don't think a change like this would be difficult to implement, I could > provide a PR of that sort. The performance penalty has to be tested for > your systems though, since the full name isn't indexed afaik. > > Regards > > Marc Sladek > synventis gmbh > > On 30 November 2017 at 15:21, Denis Gervalle <d...@softec.lu> wrote: > >> Hi Marc, >> >> Here are some answers: >> >> 1) MD5 was already a dependency of our oldcore and using SHA1 would have >> added a dependency without bringing much benefit. Since we only used 64 >> bits of the MD5 anyway, I doubt using SHA1 would have provided a better >> distribution. >> >> 2) Such a collision detection is difficult to be introduced in the >> existing code base, for some API it is even impossible. What you experience >> with the 32-bit ids had been my motivation to the changes in 4.x and I >> could say, based on my long XWiki experience, that even with the poor >> 32 bit ids, very few users had been affected. Therefore, by improving the >> hash algorithm, the size of the ids, and the quality of the hashed key, we >> have considered ourselves to be saved enough for a normal usage. >> >> 3) That’s the worst point. I cannot answer about the first decision, I >> wasn’t yet involve, but regarding the changes introduced in 4.0, a change >> had been considered. The ids are only there to satisfy Hibernate and its >> loading mechanism. If we had used a counter, we had to manage a conversion >> table between ids and entity references with all the additional complexity >> (consistency issues, caching, ...). This is so because we use entity >> reference to point directly to document (or even objects) everywhere in >> XWiki. This would have been a huge work to introduce that behaviour and at >> the same time keeping all the existing API unchanged. It would probably >> have introduced a performance penalty as well. This is why we resigned and >> go for an improved hash solution. IMO, if we had to make such a change, we >> are even better rewriting the storage service completely, and even stop >> using Hibernate, which, to be honest, does not bring much benefit to >> XWiki with its ORM aspects. >> >> But if you really want the complete answers, you can look at those threads: >> http://xwiki.markmail.org/thread/fuprtrnupz2uy37f >> http://xwiki.markmail.org/thread/fsd25bvft74xwgcx >> >> Regards, >> >> -- >> Denis Gervalle >> SOFTEC sa - CEO >> >> On 30 Nov 2017, 14:14 +0100, Marc Sladek <marc.sla...@synventis.com>, >> wrote: >> > Dear XWiki devs >> > >> > We are using the XWiki platform for our applications but sadly are still >> > stuck with 2.7.2. Lately we ran into issues on a large database and >> noticed >> > "disappearing" BaseObjects. We were able to link it to XWIKI-6990 >> > <http://jira.xwiki.org/browse/XWIKI-6990>, where hibernate IDs collided >> > (hash collisions) and overwrote other objects without any trace - neither >> > visible in the history nor in a log file. >> > >> > We analysed your implemented solution from 4.0+ in XWikiDocument >> > <https://github.com/xwiki/xwiki-platform/blob/stable-8.4.x/x >> wiki-platform-core/xwiki-platform-oldcore/src/main/java/com/ >> xpn/xwiki/doc/XWikiDocument.java#L841 >> > and BaseElement >> > <https://github.com/xwiki/xwiki-platform/blob/stable-8.4.x/x >> wiki-platform-core/xwiki-platform-oldcore/src/main/java/com/ >> xpn/xwiki/objects/BaseElement.java#L237 >> > and >> > noticed that you changed the 32bit String#hashCode to 64bit MD5, which >> > makes a collision less likely. I have a few questions regarding your >> > solution: >> > >> > 1) Is there any specific reason why you have chosen MD5 over SHA-1 or 2? >> > >> > 2) Collisions are still possible and would be extremely hard to notice >> > since they are completely silent. Have you considered to implement a >> > collision detection to at least log occurring collisions - or even better >> > reserve 1-2bits of the 64bit to be used as collision counter in the case >> of >> > it happening? >> > >> > 3) To question the concept of generating a hash for an ID in general: >> > Wouldn't a database defined "auto increment" be a much more robust >> solution >> > for the hibernate IDs? A collision would be impossible and >> > clustering/scalability is still possible with e.g. the InnoDB >> “interleaved” >> > autoincrement lock mode. Why have you chosen a hash based solution in the >> > first place? >> > >> > I'm sorry if these questions were already answered in the dev mailing >> list >> > or on issues, please link me to them since I couldn't find any concrete >> > answers. >> > >> > Thanks for your time and regards >> > >> > Marc Sladek >> > synventis gmbh >> -- Thomas Mortagne