Adam Olsen wrote:
On Apr 16, 4:27 pm, "Rhodri James" <rho...@wildebst.demon.co.uk>
wrote:
On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rha...@gmail.com> wrote:
On Apr 16, 3:16 am, Nigel Rantor <wig...@wiggly.org> wrote:
Okay, before I tell you about the empirical, real-world evidence I
have, could you please accept that hashes collide, and that no matter
how many samples you use, the probability of finding two files that do
collide is small but not zero.
I'm afraid you will need to back up your claims with real files.
So that would be a "no" then.  If the implementation of dicts in
Python, say, were to assert, as you are asserting, that the hashes
aren't going to collide, then I'd have to walk away from it.  There's
no point in using something that guarantees a non-zero chance of
corrupting your data.

Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys
(about 65 thousand) will give you a decent chance of a collision.  In
contrast, MD5 needs around 2**64 inputs, and a *good* hash needs
2**128 (SHA-256) or 2**256 (SHA-512).  The two are at totally
different extremes.
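
For anyone who wants to check those figures, they follow from the
birthday bound: a b-bit hash gives roughly even odds of a collision
after about 2**(b/2) random inputs.  A rough sketch of the arithmetic
(the function name is mine, and the formula is the usual
approximation):

    import math

    def collision_probability(num_items, hash_bits):
        """Approximate chance that at least two of num_items uniformly
        random values collide in a space of 2**hash_bits values."""
        space = 2.0 ** hash_bits
        # Birthday-bound approximation: 1 - exp(-n*(n-1) / (2*space)).
        # expm1 keeps precision when the probability is tiny.
        return -math.expm1(-num_items * (num_items - 1) / (2.0 * space))

    print(collision_probability(2**16, 32))   # roughly 0.39 for a 32-bit hash
    print(collision_probability(2**16, 128))  # around 6e-30 for an MD5-sized space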

I'm just going to go ahead and take the above as an admission by you
that the chance of a collision is non-zero, and that, once we accept
that fact, you cannot rely on a hash function to tell you whether two
files are identical.

Thanks.

There is *always* a non-zero chance of corruption, due to software
bugs, hardware defects, or even operator error.  It is only in that
broader context that you can realize just how minuscule the risk is.

Please explain why you're dragging the notion of corruption into this;
it seems to be beside the point.

Can you explain to me why such lengths of paranoia are justified, when
the risk is so much lower?

Because in the real world, where I work, in practical, real, factual terms I have seen it happen. Not once. Not twice. But many, many times.

Why are you advocating a solution to the OP's problem that is more
computationally expensive than a simple byte-by-byte comparison and
doesn't guarantee the correct answer?

For a single, one-off comparison I have no problem with a byte-by-byte
comparison.  There's a decent chance the files won't be in the OS's
cache anyway, so disk IO will be your bottleneck.

Only if you're doing multiple comparisons is a hash database
justified.  Even then, if you expect matching files to be fairly rare,
I won't lose any sleep if you're paranoid and do a byte-by-byte
comparison anyway.  New vulnerabilities are found, and if you don't
update promptly there is a small (but significant) chance of a
malicious file leading to a collision.
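
For what it's worth, a chunked byte-by-byte comparison is only a few
lines of Python, and the standard library's filecmp.cmp(a, b,
shallow=False) does much the same job.  A sketch (the helper name and
chunk size are arbitrary choices of mine):

    import os

    def files_identical(path_a, path_b, chunk_size=64 * 1024):
        """Compare two files byte-by-byte, reading in chunks so large
        files never have to fit in memory."""
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            while True:
                block_a = fa.read(chunk_size)
                block_b = fb.read(chunk_size)
                if block_a != block_b:
                    return False
                if not block_a:      # both files exhausted, no mismatch
                    return True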

If I have a number of files then I would certainly use a hash as a
quick test, but if two files hash to the same value I still have to go
and compare them.  Hashing in this case saves doing a full comparison
against every file, but it doesn't obviate the need for a byte-by-byte
comparison to confirm that files which hash to the same value really
are the same.
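
That hash-then-verify workflow is easy to sketch.  Assuming the
files_identical() helper above, something like this buckets candidates
by MD5 and only byte-compares files within the same bucket (comparing
each bucket member against the first is a simplification):

    import hashlib
    from collections import defaultdict

    def md5_of(path, chunk_size=64 * 1024):
        """MD5 of a file's contents, read in chunks."""
        digest = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
        return digest.hexdigest()

    def find_duplicates(paths):
        """Bucket files by MD5, then confirm candidates byte-by-byte."""
        buckets = defaultdict(list)
        for path in paths:
            buckets[md5_of(path)].append(path)
        confirmed = []
        for group in buckets.values():
            first = group[0]
            for other in group[1:]:
                if files_identical(first, other):
                    confirmed.append((first, other))
        return confirmed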

That's not my concern though.  What I'm responding to is Nigel
Rantor's grossly incorrect statements about probability.  The chance
of a collision, in our lifetime, is *insignificant*.

Please tell me which statements.  That the chance of two files hashing
to the same value is non-zero?  You admit as much above.

Also, please remember I gave you a way of verifying what I said: go
crawl the web for images, pages, whatever, build a hash DB, and tell
me how long it takes you to find a collision using MD5 (which is the
hash I was talking about when I told you I had real-world experience
to back up the theoretical claim that collisions occur).
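
If anyone does want to run that experiment, the bookkeeping is small.
A rough sketch, reusing the md5_of() and files_identical() helpers
above and walking a local mirror rather than doing the crawling
itself:

    import os

    def scan_for_md5_collisions(root):
        """Hash every file under root with MD5 and report any pair that
        shares a digest but is not byte-for-byte identical."""
        first_seen = {}  # hex digest -> first path seen with that digest
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = md5_of(path)
                if digest not in first_seen:
                    first_seen[digest] = path
                elif not files_identical(first_seen[digest], path):
                    print('MD5 collision: %s vs %s'
                          % (first_seen[digest], path))
        return first_seen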

Regards,

  Nigel

