Thanks Mike. I Googled and found a claim that with MD5 one would need
about 2^64 hashes before the chance of a collision rose to 50% (the
birthday problem). If that's right, then even if everyone on Earth
(roughly 7 billion people) performed the operation 100 times/sec, it
would take on the order of a year for the chance of a collision to
reach 50%, assuming MD5 behaves like an ideal hash.
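
A rough sanity check of that estimate, as a Python sketch. It assumes
MD5 behaves like an ideal 128-bit hash (which, per Mike's note, it does
not quite), a world population of about 7 billion, and the usual
birthday approximation n ~ sqrt(2 * N * ln(1 / (1 - p))). The function
names are only illustrative.

```python
import math


def draws_for_collision(bits: int, p: float = 0.5) -> float:
    """Approximate number of uniformly random hashes needed before the
    chance of at least one collision reaches p (birthday approximation)."""
    space = 2.0 ** bits  # number of possible hash values
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


def years_to_reach(draws: float, people: float, per_second: float) -> float:
    """Years needed to compute `draws` hashes at the given combined rate."""
    seconds = draws / (people * per_second)
    return seconds / (365.25 * 24 * 3600)


if __name__ == "__main__":
    n = draws_for_collision(128)  # MD5 output is 128 bits
    print(f"hashes for a 50% collision chance: {n:.3e}")
    # Assumed workload: 7 billion people, 100 hashes per second each.
    print(f"years at that rate: {years_to_reach(n, 7e9, 100):.2f}")
```

Under those assumptions the count comes out a little above 2^64 and the
time is roughly a year. The known weaknesses in MD5 matter mainly for
deliberately constructed collisions, which this estimate does not cover.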

Best,

Tim Finney

On Wed, 2012-08-29 at 12:44 -0700, Michael Blakeley wrote:
> If md5 were a perfect hash, you would have its entire 128-bit space. But it 
> is not: see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for some 
> background. Flaws in the design make collisions more likely than they ought 
> to be.
> 
> Even if md5 were perfect, I would still recommend implementing a 
> hash-collision strategy of some kind. For example, you might treat the 
> md5-based URI as the directory prefix, and put multiple documents underneath 
> it based on an additional naming scheme. Most directories would contain only 
> one document, but the occasional hash-collision would be taken care of 
> automatically. Sometimes folks will suggest concatenating multiple hashes... 
> I have always been suspicious of that approach, but I don't have any firm 
> arguments against it.
> 
> For most purposes I think random numbers are more straightforward. If you 
> want to be able to track down documents that hash to a particular md5, it may 
> be just as effective to create an md5 element or add a property.
> 
> -- Mike
> 
> On 29 Aug 2012, at 12:32 , Tim Finney wrote:
> 
> > Hi All,
> > 
> > On a slightly related topic, what's the chance of a collision if one
> > uses something like this to generate a (hopefully) unique URI?
> > 
> > fn:concat("/foo/", xdmp:md5(xdmp:get-original-url()), ".bar")
> > 
> > Best,
> > 
> > Tim
> > 
> > On Wed, 2012-08-29 at 12:00 -0700, Danny Sinang wrote:
> >> Hi,
> >> 
> >> ML support suggested we do this to generate a unique ID for our
> >> documents:
> >> 
> >> declare function local:choose-uri() as xs:string
> >> {
> >>    (: pick a random URI; retry if a document already exists there :)
> >>    let $uri := fn:concat("/document-", xdmp:random(), ".xml")
> >>    return if (fn:exists(fn:doc($uri))) then local:choose-uri() else $uri
> >> };
> >> 
> >> 
> >> My question is, will the call to fn:exists(fn:doc($uri)) be fast,
> >> considering that we now have 8 million documents?
> >> 
> >> The fn:exists(fn:doc($uri)) call is needed to obtain a read lock,
> >> which will be upgraded to a write lock when xdmp:document-insert is
> >> called.
> >> 
> >> Regards,
> >> Danny
> >> 
> >> 
> > 
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://developer.marklogic.com/mailman/listinfo/general
> > 
> 

