In article <[EMAIL PROTECTED]>, [EMAIL PROTECTED] (John J. Lee) wrote:

> > If you read them in parallel, it's _at most_ m (m is the worst case
> > here), not 2(m-1). In my tests, it has always been significantly less
> > than m.
>
> Hmm, Patrick's right, David, isn't he?

Yes, I was only considering pairwise comparisons. As he says,
simultaneously comparing all files in a group would avoid repeated reads
without the CPU overhead of a strong hash. Assuming you use a system
that allows you to have enough files open at once...

> And I'm not sure what the trade off between disk seeks and disk reads
> does to the problem, in practice (with caching and realistic memory
> constraints).

Another interesting point.

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
-- 
http://mail.python.org/mailman/listinfo/python-list
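
For the record, here is a minimal sketch of the simultaneous comparison
described above.  It assumes the candidate files have already been grouped
by size and that the whole group stays under the OS open-file limit; the
names (identical_groups, BLOCK_SIZE) are placeholders, not anyone's actual
code.

from collections import defaultdict

BLOCK_SIZE = 64 * 1024   # illustrative block size; tune for the disks involved

def identical_groups(paths):
    """Split a group of same-sized candidate files into subgroups whose
    contents are byte-identical.  Each file is read at most once, block
    by block, with no hashing."""
    handles = {p: open(p, 'rb') for p in paths}
    try:
        pending = [list(paths)]   # groups still being compared
        identical = []            # groups whose files reached EOF together
        while pending:
            group = pending.pop()
            if len(group) < 2:    # a lone file has no possible duplicate
                continue
            # Read the next block of every file in the group and
            # partition the group by block contents.
            by_block = defaultdict(list)
            for p in group:
                by_block[handles[p].read(BLOCK_SIZE)].append(p)
            for block, subgroup in by_block.items():
                if len(subgroup) < 2:
                    continue                  # unique content: drop it
                if block == b'':              # all hit EOF together: duplicates
                    identical.append(subgroup)
                else:
                    pending.append(subgroup)  # keep comparing this subgroup
        return identical
    finally:
        for f in handles.values():
            f.close()

A file stops being read as soon as its subgroup shrinks to a single member,
so no file is ever read more than once; that is where the "at most m"
figure quoted above comes from, and there is no strong-hash CPU cost.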