> How big of a list are we talking about?  If the list is so big that
> the entire list cannot fit in memory at the same time, this approach
> won't work, e.g. removing duplicate lines from a very large file.
We were told in the original question: more than 15 million records,
and it won't all fit into memory. So your observation is pertinent.
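Whether the *unique* values alone will fit is a separate question; a
rough back-of-envelope can give a feel for it (the per-entry byte cost
below is an assumed figure, not something stated in the thread):

# Very rough estimate, assumed numbers only: a short Python 2 str plus
# its slot in a set costs somewhere around 100 bytes, so 15 million
# unique records would need very roughly 1.5 GB.  Measure on real data.
records = 15 * 10**6
bytes_per_entry = 100   # assumed average cost per string + set slot
print records * bytes_per_entry / float(2**30), "GiB (very rough)"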
Assuming the working set of unique items will still fit within
memory, it can be done with the following regardless of the input
file's size:
def deduplicator(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

s = [7,6,5,4,3,6,9,5,4,3,2,5,4,3,2,1]
print list(deduplicator(s))

for line in deduplicator(file('huge_test.txt')):
    print line
It maintains order, emitting only new items as they're encountered.
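For the big-file case you'd probably want to write the unique lines to
a new file rather than print them; a minimal sketch along those lines
(the output filename here is made up, not part of the thread):

out = open('deduped.txt', 'w')   # hypothetical output name
for line in deduplicator(file('huge_test.txt')):
    out.write(line)              # lines keep their trailing newlines
out.close()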
-tkc
--
http://mail.python.org/mailman/listinfo/python-list