On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein <[EMAIL PROTECTED]> might have written:
>More seriously, the best I can think of that doesn't use a strong slow
>hash would be to group files by (file size, cheap hash) then compare
>each file in a group with a representative of each distinct file found
>among earlier files in the same group -- that leads to an average of
>about three reads per duplicated file copy: one to hash it, and two for
>the comparison between it and its representative (almost all of the
>comparisons will turn out equal but you still need to check unless you
>use a strong hash).

The code I posted in another thread (and provided a link to in this one)
does exactly that: a quick hash of the first few K before calculating the
whole file's md5 sum.  However, Patrick's code is faster, reading only
what's necessary (he does what I intended to do, but I was too lazy; I
actually rewrote from scratch one of the first programs I wrote in Python,
which was obviously too amateurish for me to publish :)

It seems your objections are related to Xah Lee's specifications; I have
no objections to your objections (-:), other than that we are just trying
to produce something of practical value out of an otherwise doomed
thread...

--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
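
[For readers following the thread: below is a minimal Python sketch of the
approach described above -- group files by (size, cheap hash of the first
few KB), then compare each candidate byte-for-byte against a representative
of each distinct file already seen in its group.  This is not the code
posted elsewhere in the thread; names and constants are illustrative.]

    import os
    import hashlib
    from collections import defaultdict

    CHEAP_HASH_BYTES = 8192  # how much of each file the cheap hash reads

    def cheap_hash(path):
        """Hash only the first few KB -- a cheap way to split size groups."""
        with open(path, 'rb') as f:
            return hashlib.md5(f.read(CHEAP_HASH_BYTES)).hexdigest()

    def same_contents(path_a, path_b, blocksize=65536):
        """Byte-for-byte comparison; stops at the first differing block."""
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            while True:
                a = fa.read(blocksize)
                b = fb.read(blocksize)
                if a != b:
                    return False
                if not a:          # both files exhausted at the same time
                    return True

    def find_duplicates(paths):
        """Yield (original, duplicate) pairs among the given file paths."""
        groups = defaultdict(list)       # (size, cheap hash) -> list of paths
        for path in paths:
            size = os.path.getsize(path)
            groups[(size, cheap_hash(path))].append(path)

        for candidates in groups.values():
            representatives = []         # one path per distinct content seen
            for path in candidates:
                for rep in representatives:
                    if same_contents(rep, path):
                        yield rep, path
                        break
                else:
                    representatives.append(path)

    if __name__ == '__main__':
        import sys
        for original, duplicate in find_duplicates(sys.argv[1:]):
            print('%s duplicates %s' % (duplicate, original))

[A more careful version would compute the cheap hash only for sizes that
occur more than once, so files with a unique size are never read at all,
which is the "read only what's necessary" refinement mentioned above.]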