On May 15, 3:06 am, Peter Otten <[EMAIL PROTECTED]> wrote: > George Sakkis wrote: > > I spent several hours debugging some bogus data results that turned > > out to be caused by the fact that heapq.nlargest doesn't respect rich > > comparisons: > > > import heapq > > import random > > > class X(object): > > def __init__(self, x): self.x=x > > def __repr__(self): return 'X(%s)' % self.x > > if True: > > # this succeeds > > def __cmp__(self, other): return cmp(self.x , other.x) > > else: > > # this fails > > def __lt__(self, other): return self.x < other.x > > > s = [X(i) for i in range(10)] > > random.shuffle(s) > > > s1 = heapq.nlargest(5, s) > > s2 = sorted(s, reverse=True)[:5] > > assert s1 == s2, (s,s1,s2) > > > s1 = heapq.nsmallest(5, s) > > s2 = sorted(s)[:5] > > assert s1 == s2, (s,s1,s2) > > > According to the docs, nlargest is equivalent to: "sorted(iterable, > > key=key, reverse=True)[:n]" and similarly for nsmallest. So that must > > be at least a documentation bug, if not an implementation one. > > Implementing a subset of the rich comparisons is always dangerous. According > to my ad hoc test you need <, <=, and == for nlargest()/nsmallest() to > work: > > import heapq > import random > > used_rich = set() > > class X(object): > def __init__(self, x): self.x=x > def __repr__(self): return 'X(%s)' % self.x > def __lt__(self, other): > used_rich.add("lt") > return self.x < other.x > def __eq__(self, other): > used_rich.add("eq") > return self.x == other.x > def __gt__(self, other): > used_rich.add("gt") > return self.x > other.x > def __ge__(self, other): > used_rich.add("ge") > return self.x >= other.x > def __ne__(self, other): > used_rich.add("ne") > return self.x != other.x > def __le__(self, other): > used_rich.add("le") > return self.x <= other.x > > s = [X(8), X(0), X(3), X(4), X(5), X(2), X(1), X(6), X(7), X(9)] > > smallest = sorted(s)[:5] > largest = sorted(s, reverse=True)[:5] > > print "used by sorted:", used_rich > used_rich = set() > > for i in range(10000): > > s1 = heapq.nlargest(5, s) > assert s1 == largest, (s, s1, largest) > > s1 = heapq.nsmallest(5, s) > assert s1 == smallest, (s, s1, smallest) > > random.shuffle(s) > > print "used by nlargest/nsmallest:", used_rich > > Output: > used by sorted: set(['lt']) > used by nlargest/nsmallest: set(['lt', 'le', 'eq'])
Interesting, I don't get 'eq' on one box ("2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)]") but I get it on another ("2.5.1 (r251:54863, May 9 2007, 15:27:54) [GCC 3.3.5 (Debian 1:3.3.5-13)]"). A more interesting question is how many (and which) of the rich comparisons can you omit while maintaining correctness (see script below). The results on my v2.5 on Windows are: 1. Use ('lt', 'le') if both present 2. If only 'lt' is missing use ('le', 'gt') 3. If only 'le' is missing use ('lt', 'ge') 4. If both ('lt', 'le') are missing use ('gt', 'ge') 5. If 3 or more of ('lt', 'le', 'gt', 'ge') are missing use the remaining one (if any) plus 'cmp' 6. If (5) holds and there is no 'cmp', nlargest/nsmaller are WRONG The results on the v2.5.1 debian box are the same, with the addition of 'eq' in all steps; if 'eq' is not present, 'cmp' is called, and if neither 'cmp' is present, the results are still correct (provided any of the cases 1-4 above hold of course). IOW, it does some extra equality checks if it can, but these are optional since it can be correct without them. For sorted(), the results are identical on both boxes: - Use only 'lt' if present else - Use only 'gt' if present else - Use only 'cmp' if present else - WRONG RESULTS IOW, no attempt to use 'le','ge','eq','ne' is made. I wonder how stable is this behavior across different versions and implementations (Jython, IronPython, etc). George ==== testing script ================================== import heapq from random import shuffle used = set(); add = used.add class X(object): def __init__(self, x): self.x=x def __repr__(self): return 'X(%s)' % self.x # try to remove one or more of these included in a used set # and see what happens def __gt__(self, other): add('gt'); return self.x > other.x def __ge__(self, other): add('ge'); return self.x >= other.x def __lt__(self, other): add('lt'); return self.x < other.x def __le__(self, other): add('le'); return self.x <= other.x def __eq__(self, other): add('eq'); return self.x == other.x def __ne__(self, other): add('ne'); return self.x != other.x def __cmp__(self, other): add('cmp'); return cmp(self.x, other.x) if __name__ == '__main__': check_heapq = True n = 1000 s = [X(8), X(0), X(3), X(4), X(5), X(2), X(1), X(6), X(7), X(9)] used.clear() if check_heapq: for i in xrange(n): s2 = heapq.nlargest(5, s) assert [a.x for a in s2] == range(9,4,-1), s2 shuffle(s) else: for i in xrange(n): s2 = sorted(s, reverse=True)[:5] assert [a.x for a in s2] == range(9,4,-1), s2 shuffle(s) used_largest = used.copy() print "largest:", used_largest used.clear() if check_heapq: for i in xrange(n): s2 = heapq.nsmallest(5, s) assert [a.x for a in s2] == range(5), s2 shuffle(s) else: for i in xrange(n): s2 = sorted(s)[:5] assert [a.x for a in s2] == range(5), s2 shuffle(s) used_smallest = used.copy() print "smallest:", used_smallest assert used_largest == used_smallest -- http://mail.python.org/mailman/listinfo/python-list