Falcolas wrote:
> On Feb 27, 10:07 am, Steve Holden <st...@holdenweb.com> wrote:
>> Assuming no duplicates, how does this help? You still have to verify
>> collisions.
>
> Absolutely. But a decent hashing function (particularly since you know
> quite a bit about the data beforehand) will have very few collisions
> (theoretically no collisions, again since we know the data
> beforehand).
>
>> The problem is you can't guarantee any hash value will be unique, so
>> ultimately you have to check whether list entries hashing to the same
>> value are or are not equal. Which would mean either having the whole
>> list in memory or having access to it via some other random access
>> method.
>
> Again, absolutely. But as Tim pointed out, with a decent hash the
> number of superfluous checks should be minimal, so your checks will
> mostly occur on valid duplicate values.
>
> It's worth noting that I picked this algorithm with the assumption
> that any storage to physical memory (such as required by using a set)
> would be out of the question.

I take your point about the relatively rare collisions. Devising a
zero-collision algorithm is far from trivial, since at the minimum it
requires storing all hashes in a set to verify each new hash doesn't
collide with previous ones.
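Even that much already looks roughly like the following (a minimal
sketch only; record_source, the encoding and the choice of md5 are
placeholders for whatever the OP actually has, not anything specified
in this thread):

import hashlib

def first_collision(record_source):
    """Return the first record whose hash matches an earlier record's.

    A hit here only means "possible duplicate": the two records still
    have to be compared before you can call them equal.
    """
    seen = set()
    for record in record_source:
        digest = hashlib.md5(record.encode("utf-8")).digest()
        if digest in seen:
            return record
        seen.add(digest)
    return None

md5 is just an example here; any digest wide enough to make accidental
collisions rare would do.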
But unless you can *guarantee* zero collisions you do still need a
mapping from hash values back to the original data, since that is the
only way to determine whether records hashing to the same value really
are duplicates. That's not a trivial amount of bookkeeping for fifteen
million values, though it depends somewhat on the format of the OP's
"records". I suppose a list of file positions stored with each hash
might be usable, assuming either fixed-length or properly delimited
records (roughly along the lines of the sketch at the end of this
message).

regards
 Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/
--
http://mail.python.org/mailman/listinfo/python-list
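For what it's worth, here is a rough sketch of that file-position idea,
assuming one newline-delimited record per line in a single file; the
file layout, the use of md5 and every name below are illustrative
assumptions, not anything the OP has specified:

from collections import defaultdict
import hashlib

def find_duplicate_groups(path):
    """Return lists of byte offsets whose records are genuinely identical."""
    positions = defaultdict(list)          # digest -> [byte offsets]
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            positions[hashlib.md5(line).digest()].append(offset)

    duplicates = []
    with open(path, "rb") as f:
        for offsets in positions.values():
            if len(offsets) < 2:
                continue                   # unique hash, nothing to check
            # Re-read only the colliding records and group the ones that
            # compare equal, so two different records that happen to share
            # a hash are not reported as duplicates.
            groups = defaultdict(list)     # record bytes -> [byte offsets]
            for off in offsets:
                f.seek(off)
                groups[f.readline()].append(off)
            duplicates.extend(g for g in groups.values() if len(g) > 1)
    return duplicates

Only records whose digests collide ever get re-read, so what stays in
memory is one digest and one offset per record rather than the records
themselves; for fifteen million records that is still a fair amount of
storage, which is really the point above.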