I read that the original MD5 algorithm has known issues with collisions.
Do I need to test for collisions myself...or is it pretty well tested (or proved?) to stand up to an intensive application?
A bit of googling turned up this:
http://www.cs.bham.ac.uk/~mdr/teaching/modules04/security/lectures/hash.html
SHA-1 is a direct competitor with MD5, since both of them were developed at the same time in an effort to strengthen MD4. SHA-1 produces 160 bits; thus its "collision difficulty" is much higher, 2^80 = 10^24 compared to 2^64 = 10^19 for MD5. So it is 10000 times as secure against brute force attacks. ... SHA-1 is slower, because it has to do more rounds and process more data. I guess it's between 2 and 10 times slower.
And: http://www.freedom-to-tinker.com/archives/000661.html
A fundamental property of CH functions is the probability of a hash collision given two different inputs. This is determined by considering the Birthday Paradox, and approaches 2^(n/2), where n = bits in hash, as the hash function gets better. For SHA-1, which is 160 bits, this yields a best-case collision probability of 1 in 2^80. You would *expect* to find a collision, given enough pairs of inputs, and so would quite normally chose a hash length based on the expected population of objects you want to digest in any domain of potential collision.
Posted by: John Wainwright at August 17, 2004 11:50 AM
This is part of a thread discussing a rumored break of the SHA-1 algorithm. Apparently the discussion got muddled, and the author of this comment was clarifying that the fact that collisions exist isn't significant, and in fact is an expected and known property of the algorithm.
What new, as John Saylor pointed out in his post, is that people are figuring out ways to generate data that when hashed, will match an existing digest.
This weakens the algorithm from a cryptography perspective, but doesn't have a bearing on your application.
So it looks like MD5 gives you a 1 in 100000000000000000000 chance of collision. Given you said you'd have < 100,000 keys, this should be adequate. If you don't think so, try SHA-1, as John Saylor also pointed out in his post.
Any experiences with how well Digest::MD5 does when used with many
millions of keys?
I believe Digest::MD5 is a direct implementation of RSA's publicly released code, so it'll behave the same as most other MD5 implementation out there. There's a pure Perl implementation also available, but I'd expect it too to perform identically, just slower.
-Tom _______________________________________________ Boston-pm mailing list [email protected] http://mail.pm.org/mailman/listinfo/boston-pm

