Adam Olsen wrote:
On Apr 16, 4:27 pm, "Rhodri James" <rho...@wildebst.demon.co.uk>
wrote:
On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rha...@gmail.com> wrote:
On Apr 16, 3:16 am, Nigel Rantor <wig...@wiggly.org> wrote:
Okay, before I tell you about the empirical, real-world evidence I
have, could you please accept that hashes collide, and that no matter
how many samples you use, the probability of finding two files that do
collide is small but not zero.
I'm afraid you will need to back up your claims with real files.
So that would be a "no" then.  If the implementation of dicts in
Python, say, were to assert, as you are asserting, that the hashes
aren't going to collide, then I'd have to walk away from it.  There's
no point in using something that guarantees a non-zero chance of
corrupting your data.

Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys
(about 65 thousand) will give you a decent chance of a collision.  In
contrast, MD5 needs around 2**64 inputs, and a *good* hash needs
2**128 (SHA-256) or 2**256 (SHA-512).  The two are at totally
different extremes.
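
For anyone who wants to check those figures, they follow from the
birthday bound: a b-bit hash gives roughly even odds of a collision
after about 2**(b/2) random inputs.  A rough sketch of the arithmetic
(the function name is mine, and the formula is the usual
approximation):

    import math

    def collision_probability(num_items, hash_bits):
        """Approximate chance that at least two of num_items uniformly
        random values collide in a space of 2**hash_bits values."""
        space = 2.0 ** hash_bits
        # Birthday-bound approximation: 1 - exp(-n*(n-1) / (2*space)).
        # expm1 keeps precision when the probability is tiny.
        return -math.expm1(-num_items * (num_items - 1) / (2.0 * space))

    print(collision_probability(2**16, 32))   # roughly 0.39 for a 32-bit hash
    print(collision_probability(2**16, 128))  # around 6e-30 for an MD5-sized space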

I'm just going to go ahead and take the above as an admission by you
that the chance of a collision is non-zero, and that, once we accept
that fact, you cannot rely on a hash function to tell you whether two
files are identical.

Thanks.

There is *always* a non-zero chance of corruption, due to software
bugs, hardware defects, or even operator error.  It is only in that
broader context that you can realize just how minuscule the risk is.

Please explain why you're dragging the notion of corruption into this;
it seems to be beside the point.

Can you explain to me why such lengths of paranoia are justified, when
the risk is so much lower?

Because in the real world, where I work, in practical, real, factual terms I have seen it happen. Not once. Not twice. But many, many times.

Why are you advocating a solution to the OP's problem that is more
computationally expensive than a simple byte-by-byte comparison and
doesn't guarantee the correct answer?

For a single, one-off comparison I have no problem with a byte-by-byte
comparison.  There's a decent chance the files won't be in the OS's
cache anyway, so disk IO will be your bottleneck.

Only if you're doing multiple comparisons is a hash database
justified.  Even then, if you expect matching files to be fairly rare,
I won't lose any sleep if you're paranoid and do a byte-by-byte
comparison anyway.  New vulnerabilities are found, and if you don't
update promptly there is a small (but significant) chance of a
malicious file leading to a collision.
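
For what it's worth, a chunked byte-by-byte comparison is only a few
lines of Python, and the standard library's filecmp.cmp(a, b,
shallow=False) does much the same job.  A sketch (the helper name and
chunk size are arbitrary choices of mine):

    import os

    def files_identical(path_a, path_b, chunk_size=64 * 1024):
        """Compare two files byte-by-byte, reading in chunks so large
        files never have to fit in memory."""
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            while True:
                block_a = fa.read(chunk_size)
                block_b = fb.read(chunk_size)
                if block_a != block_b:
                    return False
                if not block_a:      # both files exhausted, no mismatch
                    return True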

If I have a number of files then I would certainly use a hash as a
quick test, but if two files hash to the same value I still have to go
and compare them.  Hashing in this case saves doing a full comparison
against every file, but it doesn't obviate the need for a byte-by-byte
comparison to confirm that files which hash to the same value really
are the same.
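
That hash-then-verify workflow is easy to sketch.  Assuming the
files_identical() helper above, something like this buckets candidates
by MD5 and only byte-compares files within the same bucket (comparing
each bucket member against the first is a simplification):

    import hashlib
    from collections import defaultdict

    def md5_of(path, chunk_size=64 * 1024):
        """MD5 of a file's contents, read in chunks."""
        digest = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
        return digest.hexdigest()

    def find_duplicates(paths):
        """Bucket files by MD5, then confirm candidates byte-by-byte."""
        buckets = defaultdict(list)
        for path in paths:
            buckets[md5_of(path)].append(path)
        confirmed = []
        for group in buckets.values():
            first = group[0]
            for other in group[1:]:
                if files_identical(first, other):
                    confirmed.append((first, other))
        return confirmed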

That's not my concern though.  What I'm responding to is Nigel
Rantor's grossly incorrect statements about probability.  The chance
of a collision, in our lifetime, is *insignificant*.

Please tell me which statements.  That the chance of two files hashing
to the same value is non-zero?  You admit as much above.

Also, please remember I gave you a way of verifying what I said: go
crawl the web for images, pages, whatever, build a hash DB, and tell
me how long it takes you to find a collision using MD5 (which is the
hash I was talking about when I told you I had real-world experience
to back up the theoretical claim that collisions occur).
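
If anyone does want to run that experiment, the bookkeeping is small.
A rough sketch, reusing the md5_of() and files_identical() helpers
above and walking a local mirror rather than doing the crawling
itself:

    import os

    def scan_for_md5_collisions(root):
        """Hash every file under root with MD5 and report any pair that
        shares a digest but is not byte-for-byte identical."""
        first_seen = {}  # hex digest -> first path seen with that digest
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = md5_of(path)
                if digest not in first_seen:
                    first_seen[digest] = path
                elif not files_identical(first_seen[digest], path):
                    print('MD5 collision: %s vs %s'
                          % (first_seen[digest], path))
        return first_seen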

Regards,

  Nigel

