Patrick Useldinger <[EMAIL PROTECTED]> writes:

> David Eppstein wrote:
>
> > The hard part is verifying that the files that look like duplicates
> > really are duplicates. To do so, for a group of m files that appear
> > to be the same, requires 2(m-1) reads through the whole files if you
> > use a comparison-based method, or m reads if you use a strong
> > hashing method. You can't hope to cut the reads off early when
> > using comparisons, because the files won't be different.
>
> If you read them in parallel, it's _at most_ m (m is the worst case
> here), not 2(m-1). In my tests, it has always been significantly less
> than m.
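To make the parallel-read idea concrete, here is a minimal sketch of how
I understand it (my own illustration, assuming lock-step block reads,
not Patrick's actual code): compare the candidate files a block at a
time, and regroup them whenever their current blocks diverge, so reads
stop early for anything that turns out to be unique.

    from collections import defaultdict

    BLOCK_SIZE = 64 * 1024  # bytes per read; an arbitrary choice

    def duplicate_groups(paths):
        """Partition `paths` into groups of byte-identical files."""
        handles = {p: open(p, 'rb') for p in paths}
        try:
            pending = [list(paths)]
            dups = []
            while pending:
                group = pending.pop()
                # Read the next block of every file in the group,
                # in lock-step.
                blocks = {p: handles[p].read(BLOCK_SIZE) for p in group}
                # Files whose blocks differ can never be duplicates,
                # so regroup by block content.
                by_block = defaultdict(list)
                for p in group:
                    by_block[blocks[p]].append(p)
                for block, subgroup in by_block.items():
                    if len(subgroup) < 2:
                        continue              # unique so far: stop reading it
                    if block == b'':          # all hit EOF together
                        dups.append(subgroup)  # true duplicates
                    else:
                        pending.append(subgroup)  # keep comparing
            return dups
        finally:
            for f in handles.values():
                f.close()

The worst case (all m files really are identical) still reads every
byte of every file once, matching Patrick's "at most m"; any mismatch
prunes a subgroup early, which is presumably why his measured totals
came in well under m.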
Hmm, Patrick's right, David, isn't he? Except when m gets really BIG
(say, M), in which case I suppose m is no longer the worst-case number
of whole-file reads (you can't keep M files open and buffered at once,
so some blocks would have to be read more than once). And I'm not sure
what the trade-off between disk seeks and disk reads does to the
problem in practice (with caching and realistic memory constraints).

John