Thanks Mike. I Googled and found a claim that with MD5 one would need about 2^64 hashes before the chance of a collision rose to 50% (the birthday problem). If that's right, then even if everyone on Earth (about 7 billion people) performed the operation 100 times/sec, it would take on the order of a year for the chance of a collision to reach 50%.
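As a rough check of that arithmetic, here is a small Python sketch. The world population of about 7 billion is my assumption, the 100 operations/sec rate comes from the estimate above, and the 50% threshold uses the standard birthday approximation n ~ sqrt(2 * N * ln 2) with N = 2^128, which treats MD5 as an ideal hash.

```python
import math

HASH_BITS = 128                 # MD5 digest size in bits
POPULATION = 7_000_000_000      # assumption: roughly the 2012 world population
OPS_PER_PERSON_PER_SEC = 100    # rate taken from the estimate above
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def hashes_for_collision_probability(bits: int, p: float = 0.5) -> float:
    """Birthday approximation: n ~ sqrt(2 * N * ln(1 / (1 - p))), N = 2**bits."""
    return math.sqrt(2.0 * (2.0 ** bits) * math.log(1.0 / (1.0 - p)))


def collision_probability(n: float, bits: int) -> float:
    """Approximate chance of at least one collision among n ideal hashes."""
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-(n * (n - 1)) / (2.0 * (2.0 ** bits)))


n_half = hashes_for_collision_probability(HASH_BITS)
years = n_half / (POPULATION * OPS_PER_PERSON_PER_SEC) / SECONDS_PER_YEAR

print(f"hashes for a 50% collision chance: {n_half:.3e}")
print(f"time at the assumed global rate:   {years:.2f} years")
print(f"collision chance for 8 million documents: "
      f"{collision_probability(8e6, HASH_BITS):.1e}")
```

Under these assumptions the threshold is a little above 2^64 hashes, which the assumed global rate would reach in about a year. For a collection of 8 million documents the accidental collision chance comes out around 1e-25, again assuming an ideal hash.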
Best,

Tim Finney

On Wed, 2012-08-29 at 12:44 -0700, Michael Blakeley wrote:
> If md5 were a perfect hash, you would have its entire 128-bit space. But it
> is not: see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for some
> background. Flaws in the design make collisions more likely than they ought
> to be.
>
> Even if md5 were perfect, I would still recommend implementing a
> hash-collision strategy of some kind. For example, you might treat the
> md5-based URI as the directory prefix, and put multiple documents underneath
> it based on an additional naming scheme. Most directories would contain only
> one document, but the occasional hash-collision would be taken care of
> automatically. Sometimes folks will suggest concatenating multiple hashes...
> I have always been suspicious of that approach, but I don't have any firm
> arguments against it.
>
> For most purposes I think random numbers are more straightforward. If you
> want to be able to track down documents that hash to a particular md5, it may
> be just as effective to create an md5 element or add a property.
>
> -- Mike
>
> On 29 Aug 2012, at 12:32 , Tim Finney wrote:
>
> > Hi All,
> >
> > On a slightly related topic, what's the chance of a collision if one
> > uses something like this to generate a (hopefully) unique URI?
> >
> > fn:concat("/foo/", xdmp:md5(xdmp:get-original-url()), ".bar")
> >
> > Best,
> >
> > Tim
> >
> > On Wed, 2012-08-29 at 12:00 -0700, Danny Sinang wrote:
> >> Hi,
> >>
> >> ML support suggested we do this to generate a unique ID for our
> >> documents :
> >>
> >> declare function choose-uri() as xs:string
> >> {
> >>   let $uri := fn:concat("/document-", xdmp:random(), ".xml")
> >>   return if (fn:exists(fn:doc($uri))) then choose-uri() else $uri
> >> };
> >>
> >> My question is, will the call to fn:exists(fn:doc($uri)) be fast,
> >> considering that we now have 8 million documents ?
> >>
> >> The fn:exists(fn:doc($uri)) call is needed to obtain a read lock, which
> >> will be upgraded to a write lock when xdmp:document-insert is called.
> >>
> >> Regards,
> >> Danny
> >
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://developer.marklogic.com/mailman/listinfo/general
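
The directory-prefix strategy Mike describes in the quoted message above can be sketched as follows. This is a Python illustration, not MarkLogic code: the in-memory dict stands in for the database, and the names `store` and `insert_document`, as well as the `/foo/<md5>/<n>.bar` layout, are hypothetical choices for the "additional naming scheme".

```python
import hashlib

# Hypothetical stand-in for the database: URI -> (source URL, content).
store: dict[str, tuple[str, str]] = {}


def insert_document(source_url: str, content: str) -> str:
    """Store content under /foo/<md5>/<n>.bar and return the URI used.

    The md5 of the source URL is only a directory prefix. Documents whose
    source URLs collide end up side by side in the same directory, told
    apart by a counter (one possible "additional naming scheme").
    """
    digest = hashlib.md5(source_url.encode("utf-8")).hexdigest()
    prefix = f"/foo/{digest}/"
    n = 0
    while True:
        uri = f"{prefix}{n}.bar"
        existing = store.get(uri)
        if existing is None or existing[0] == source_url:
            # Free slot, or the same source URL again: (over)write it.
            store[uri] = (source_url, content)
            return uri
        n += 1  # slot is taken by a different URL with the same md5


first = insert_document("http://example.com/a", "<doc>a</doc>")
again = insert_document("http://example.com/a", "<doc>a, revised</doc>")
other = insert_document("http://example.com/b", "<doc>b</doc>")
print(first == again, first != other)
```

Almost every directory would hold a single `0.bar`; a second entry appears only when two different source URLs share an md5. The sketch ignores locking, so in a real database the check and the insert would need to happen within one transaction.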
