In article <[EMAIL PROTECTED]>, "John Machin" <[EMAIL PROTECTED]> wrote:
> Just look at the efficiency of processing N files of the same size S,
> where they differ after d bytes: [If they don't differ, d = S]

I think this misses the point. It's easy to find the files that are
different. Just a file size comparison will get most of them, and most
of the rest can be detected in the first few kbytes. So I think it's
safe to assume both of these filtering mechanisms would be incorporated
into any good duplicate-detection code, and that any remaining tests
are between files that are very likely to be the same as each other.

The hard part is verifying that the files that look like duplicates
really are duplicates. Doing so for a group of m files that appear to
be the same requires 2(m-1) reads through the whole files if you use a
comparison-based method, or m reads if you use a strong hashing method.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.

The question to me is: when you have a group of m > 2
likely-to-be-equal files, is 2(m-1) reads and very little computation
better than m reads plus a strong hash computation? I haven't yet seen
an answer to this, but it's a question for experimentation rather than
theory.

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
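
For concreteness, here is a rough sketch (in Python, since this is
python-list) of the two verification strategies being compared, applied
to a group of paths that have already passed the size and first-few-KB
filters. The function names and the 64 KB chunk size are just
illustrative choices, not part of anyone's actual code:

    import hashlib

    CHUNK = 1 << 16  # 64 KB per read; an arbitrary but reasonable size

    def verify_by_hash(paths):
        # m reads: hash each file once, then compare digests.
        digests = set()
        for p in paths:
            h = hashlib.sha256()
            with open(p, 'rb') as f:
                for chunk in iter(lambda: f.read(CHUNK), b''):
                    h.update(chunk)
            digests.add(h.digest())
        return len(digests) == 1

    def verify_by_comparison(paths):
        # 2(m-1) reads: compare each remaining file byte-for-byte
        # against the first one, no hashing.
        first = paths[0]
        for other in paths[1:]:
            with open(first, 'rb') as f1, open(other, 'rb') as f2:
                while True:
                    b1 = f1.read(CHUNK)
                    b2 = f2.read(CHUNK)
                    if b1 != b2:
                        return False
                    if not b1:
                        break
        return True

Timing these two on groups of genuinely identical files of various
sizes would be one way to get the experimental answer asked for above.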