If md5 were a perfect hash, you would have its entire 128-bit space to work 
with, and an accidental collision would be vanishingly unlikely. But it is 
not: see http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf for some background. 
Flaws in the design let an attacker construct collisions far more cheaply than 
a 128-bit hash ought to allow.
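
To put a rough number on the ideal case: the usual birthday-bound 
approximation puts the chance of at least one collision among n uniformly 
distributed b-bit values at about 1 - exp(-n(n-1) / 2^(b+1)). Here is a 
minimal sketch of that arithmetic, in Python only because it is easy to run 
anywhere. The function name is mine, and the estimate assumes uniform, 
non-adversarial inputs, so it says nothing about deliberately constructed md5 
collisions. The 64-bit case is included on the assumption that xdmp:random() 
draws uniformly from a 64-bit range.

```python
import math


def collision_probability(n_items: int, hash_bits: int = 128) -> float:
    """Birthday-bound estimate of at least one collision among n_items
    values drawn uniformly from a space of 2**hash_bits values."""
    # p ~= 1 - exp(-n(n-1) / 2^(bits+1)); expm1 keeps precision for tiny p.
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    return -math.expm1(exponent)


if __name__ == "__main__":
    docs = 8_000_000  # the document count mentioned in this thread
    print("ideal 128-bit hash:", collision_probability(docs, 128))
    print("64-bit random id:  ", collision_probability(docs, 64))
```

For 8 million documents the ideal 128-bit figure comes out on the order of 
1e-25, and the 64-bit figure on the order of 1e-6. Both are small, which is 
why the remaining concern is md5's design flaws rather than the size of its 
output.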

Even if md5 were perfect, I would still recommend implementing a hash-collision 
strategy of some kind. For example, you might treat the md5-based URI as the 
directory prefix, and put multiple documents underneath it based on an 
additional naming scheme. Most directories would contain only one document, but 
the occasional hash-collision would be taken care of automatically. Sometimes 
folks will suggest concatenating multiple hashes... I have always been 
suspicious of that approach, but I don't have any firm arguments against it.
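
The directory-prefix idea can be sketched roughly as follows. This is Python 
with a plain dict standing in for the database, not MarkLogic code; 
choose_uri, store, and the numeric naming scheme under the prefix are all 
hypothetical choices of mine for illustration. The digest parameter exists 
only so that a collision can be simulated.

```python
import hashlib
from typing import Callable, Dict, Tuple


def md5_hex(text: str) -> str:
    """Hex md5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def choose_uri(
    store: Dict[str, Tuple[str, str]],
    source_url: str,
    content: str,
    digest: Callable[[str], str] = md5_hex,
) -> str:
    """Store content under /foo/<digest>/<n>.bar and return that URI.

    The digest of the source URL is the directory prefix. Documents whose
    source URLs collide get successive numbers under the same prefix.
    """
    prefix = "/foo/" + digest(source_url) + "/"
    n = 0
    while True:
        uri = f"{prefix}{n}.bar"
        existing = store.get(uri)
        # Free slot, or the same source seen before: (over)write it here.
        if existing is None or existing[0] == source_url:
            store[uri] = (source_url, content)
            return uri
        # A different source already occupies this slot: try the next one.
        n += 1
```

In the common case every prefix holds a single 0.bar document; only a 
colliding source URL ever produces 1.bar and beyond.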

For most purposes I think random numbers are more straightforward. If you want 
to be able to track down documents that hash to a particular md5, it may be 
just as effective to create an md5 element or add a property.

-- Mike

On 29 Aug 2012, at 12:32 , Tim Finney wrote:

> Hi All,
> 
> On a slightly related topic, what's the chance of a collision if one
> uses something like this to generate a (hopefully) unique URI?
> 
> fn:concat("/foo/", xdmp:md5(xdmp:get-original-url()), ".bar")
> 
> Best,
> 
> Tim
> 
> On Wed, 2012-08-29 at 12:00 -0700, Danny Sinang wrote:
>> Hi,
>> 
>> ML support suggested we do this to generate a unique ID for our
>> documents :
>> 
>> declare function local:choose-uri() as xs:string
>>    {
>>       let $uri := fn:concat("/document-", xdmp:random(), ".xml")
>>       return if (fn:exists(fn:doc($uri))) then local:choose-uri() else $uri
>>    };
>> 
>> 
>> My question is, will the call to fn:exists(fn:doc($uri)) be fast,
>> considering that we now have 8 million documents ?
>> 
>> The fn:exists(fn:doc($uri)) call is needed to obtain a read lock, which
>> will be upgraded to a write lock when xdmp:document-insert is called.
>> 
>> Regards,
>> Danny
>> 
>> 
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
> 
