[trimmed cc list, nobody wants to read this noise] On Sat, Apr 16, 2005 at 11:35:39PM +0200, Brian O'Mahoney wrote: > >> (1) I _have_ seen real-life collisions with MD5, in the context of > >> Document management systems containing ~10^6 ms-WORD documents. > > > > Dude! You could have been *famous*! Why the > > aitch-ee-double-hockey-sticks didn't you publish this when you found it? > > Seriously, man. > > The MD5 has was fine, or at least the code (a) produced the correct > results on the published test cases, and, (b) was properly applied to > all bytes of the file(s). I was surprised when it happened, which is why > I bothered to post to this list at this time, so I make two more points
OK, I guess it's time for some remedial math. There are 2^128 = 340282366920938463463374607431768211456 different MD5 hashes. You are suggesting that you found a collision using ~1e6 = ~1,000,000 plaintexts. Let's suppose there were actually 100,000,000 = 1e8 plaintexts, just in case you underestimated the number. Applying the birthday paradox, we have a 50% probability that you'd find one collision if there were ~7,213,475,309,916,173 possible hash values. If you extend the birthday argument ("what is the probability of a collision if you take N samples from a set of size M?") you get the following results, with N = 1e8: 50% (1 in 2) probability of collision in 7,213,475,309,916,173. 1% (1 in 100) probability of collision in 497,496,027,172,833,194. .05% (1 in 1845) probability of collision in 9,223,372,036,854,775,806. That's where my quick-and-dirty solver craps out, but we're still a really long ways from 340,282,366,920,938,463,463,374,607,431,768,211,456. A simple linear extrapolation (certainly wrong, but not by more than a few dozen orders of magnitude) says that the odds would be 1 in 68,056,473,384,187,692,692 for the full MD5 hash (I'm not even going to dignify that with a percentage). I'm not going to do the sums, but I would hazard a guess that it's more likely your PC suffered a cosmic-ray-induced memory fault - EACH OF THE FOUR TIMES YOU TESTED IT - causing it to report the same MD5, than that you actually discovered a collision with a measly million (or even hundred million) plaintexts. (Of course, I don't know how many tests of the hash you actually did. But the point stands.) Hell, if you're *that* lucky, what are you doing in IT? You could be making a killing at the roulette table. Or, even more likely, there was some other factor in the system (most likely that it was only using a few bits, probably 32, of the hash when looking for collisions) that resulted in a false alarm. If you had actual evidence of a collision, I'd love to see it - even if it's just the equivalent of % md5 foo d3b07384d113edec49eaa6238ad5ff00 foo % md5 bar d3b07384d113edec49eaa6238ad5ff00 bar % cmp foo bar foo bar differ: byte 25, line 1 % But in the absence of actual evidence, we have to assume (just based on the probabilities) that there was some error in your testing. -andy - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html