Fisherking wrote:
> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or
> more of these dictionaries contain the same data (not more than five
> or six of them per item), e.g.
> [{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6},
>  {'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6}, ...]
> (The dictionaries really contain phone numbers, duration times and
> call times.)
I'd assume the dictionaries look like

{"phone": "0123...", "duration": 5.67, "calltime": "10:42"}

What do the different keys ("a", "b", "c") versus ("d", "e", "f") mean?

> Which is the best way of searching through the list and extracting the
> items that are the same? In the first run I want to find every item
> that is the same as {'a':1,'b':'2','c':3}, in the second
> {'d':4,'e':'5','f':6}, etc. The list is not read-only and I can pop
> items out of it when I have found them!

items = [dict(a=1, b=2), dict(a=1, b=2), dict(a=3, c=2),
         dict(a=3, c=2.2), dict(a=3, c=2.6)]

# Python 2.5: a frozenset of the items is hashable and independent of
# key order, so it can serve as a fingerprint of the row
seen = set()
for row in items:
    rowkey = frozenset(row.iteritems())
    if rowkey in seen:
        print "dupe", row
    else:
        seen.add(rowkey)

# Python 2.3 (and older); not tested -- a sorted tuple of the items
# serves the same purpose
seen = {}
for row in items:
    rowkey = row.items()
    rowkey.sort()
    rowkey = tuple(rowkey)
    if rowkey in seen:
        print "dupe", row
    else:
        seen[rowkey] = 1

> (How can I find items that are almost the same? e.g.
> {'a':1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2})

If you are content with detecting similarities on a single "axis":

def window(items):
    # yield each value together with its predecessor
    items = iter(items)
    prev = items.next()
    for cur in items:
        yield cur, prev
        prev = cur

def todict(k, v, rest):
    # rebuild a complete row from the frozen "rest" plus the axis value
    d = dict(rest)
    d[k] = v
    return d

axis = "c"
d = {}
eps = 0.5

for row in items:
    row = dict(row)  # copy, so pop() doesn't mutate the original
    try:
        value = row.pop(axis)
    except KeyError:
        pass  # row has no value on the chosen axis
    else:
        # group the axis values of rows that are otherwise identical
        key = frozenset(row.iteritems())
        d.setdefault(key, []).append(value)

for key, values in d.iteritems():
    if len(values) > 1:
        # report neighbouring values that differ by less than eps
        for cur, prev in window(sorted(values)):
            if abs(cur - prev) < eps:
                print "similar rows", todict(axis, cur, key), \
                    todict(axis, prev, key)

Otherwise you have to define a distance function and use a clustering
algorithm; a rough sketch of that idea follows below. Segaran's
"Programming Collective Intelligence" has some examples in Python,
though the code can hardly be called idiomatic.

Peter
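Following up on the clustering remark: below is a rough, untested sketch
of what a hand-rolled distance function plus a very naive greedy
grouping could look like, reusing the items list from above. The numeric
weighting, the flat mismatch penalty and the 0.5 threshold are all
invented for illustration; this is not one of the algorithms from
Segaran's book.

def distance(row1, row2):
    # smaller result means more similar rows
    d = 0.0
    for key in set(row1) | set(row2):
        v1, v2 = row1.get(key), row2.get(key)
        if v1 == v2:
            continue
        if isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
            d += abs(v1 - v2)  # numeric fields: absolute difference
        else:
            d += 1.0  # missing or non-numeric mismatch: flat penalty
    return d

def group(rows, threshold=0.5):
    # greedily put each row into the first cluster whose representative
    # is closer than the threshold; otherwise start a new cluster
    clusters = []  # list of (representative, members) pairs
    for row in rows:
        for rep, members in clusters:
            if distance(rep, row) < threshold:
                members.append(row)
                break
        else:
            clusters.append((row, [row]))
    return clusters

for rep, members in group(items):  # items as defined above
    print rep, "->", members

Note that this compares every row against one representative per
cluster, so it degrades to quadratic time in the worst case; for
250 000 records a proper clustering algorithm, or at least pre-grouping
the rows by phone number, would be the sensible thing to do.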