Falcolas wrote:
> On Feb 27, 8:33 am, Tim Rowe <digi...@gmail.com> wrote:
>> 2009/2/27 odeits <ode...@gmail.com>:
>>
>>> How big of a list are we talking about? If the list is so big that
>>> the entire list cannot fit in memory at the same time this approach
>>> won't work, e.g. removing duplicate lines from a very large file.
>>
>> We were told in the original question: more than 15 million records,
>> and it won't all fit into memory. So your observation is pertinent.
>>
>> --
>> Tim Rowe
>
> If order did matter, and the list itself couldn't be stored in memory,
> I would personally do some sort of hash of each item (or something as
> simple as first 5 bytes, last 5 bytes and length), keeping a reference
> to which item the hash belongs, sort and identify duplicates in the
> hash, and using the reference check to see if the actual items in
> question match as well.
>
Assuming no duplicates, how does this help? You still have to verify
collisions.
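
For what it's worth, a rough sketch of that kind of two-pass scheme
might look like the following. This is purely illustrative: it assumes
the records are newline-terminated lines in a single file, uses an md5
digest rather than the 5-byte signature you describe, and still needs
the digest/offset index itself to fit in memory (far smaller than the
lines, but not free at 15 million records).

import hashlib
from collections import defaultdict

def find_duplicate_lines(path):
    # Pass 1: keep only a digest and a file offset per line, which is
    # much smaller than keeping the lines themselves.
    offsets_by_hash = defaultdict(list)
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            offsets_by_hash[hashlib.md5(line).digest()].append(offset)

    # Pass 2: a shared digest is only a *candidate* duplicate; seek back
    # into the file and compare the actual lines to rule out collisions.
    duplicates = []  # (offset of first occurrence, offset of duplicate)
    with open(path, 'rb') as f:
        for offsets in offsets_by_hash.values():
            if len(offsets) < 2:
                continue
            seen = []  # (line, offset) pairs confirmed distinct so far
            for off in offsets:
                f.seek(off)
                line = f.readline()
                for prev_line, prev_off in seen:
                    if line == prev_line:
                        duplicates.append((prev_off, off))
                        break
                else:
                    seen.append((line, off))
    return duplicates

Note that the second pass still has to read the candidate lines back
from disk to rule out collisions, which brings us to the point below.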
> Pretty brutish and slow, but it's the first algorithm which comes to
> mind. Of course, I'm assuming that the list items are long enough to
> warrant using a hash and not the values themselves.
>
The problem is that you can't guarantee any hash value will be unique,
so ultimately you have to check whether list entries hashing to the same
value are or are not equal. That means either having the whole list in
memory or having access to it via some other random-access method.

regards
 Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/