Bugs item #1460340, was opened at 2006-03-28 19:05 Message generated for change (Comment added) made by rhettinger You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1460340&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Python Library Group: Python 2.3 Status: Open Resolution: None Priority: 5 Submitted By: paul rubin (phr) Assigned to: Tim Peters (tim_one) Summary: random.sample can raise KeyError Initial Comment: I have only tested this in 2.3 and the relevant code in random.py has changed in the current svn branch, but from inspection it looks to me like the bug may still be there. If you initialize a dictionary as follows: a={}.fromkeys(range(10)+range(10,100,2)+range(100,110)) then random.sample(a,3) raises KeyError most times that you call it. ---------------------------------------------------------------------- >Comment By: Raymond Hettinger (rhettinger) Date: 2006-03-31 04:54 Message: Logged In: YES user_id=80475 It probably should not support dicts at all and should make some effort to detect them and raise a TypeError. ---------------------------------------------------------------------- Comment By: Tim Peters (tim_one) Date: 2006-03-30 22:55 Message: Logged In: YES user_id=31435 Assigned to me. The current situation is unacceptable, because internal code comments and the test suite were left implying that random.sample() "should work" for dicts -- but it doesn't, and the failure modes are both subtle and silent. Note that my first example was utterly vanilla, using a small dict with a contiguous range of integer keys. That's not asking sample() to crack a safe, it's asking it to borrow candy from a dead baby ;-) I don't care a lot about the second example, but it would would also work right if dicts were forced into sample()'s first internal algorithm (and potential optimization be damned in the case of a dict). ---------------------------------------------------------------------- Comment By: Raymond Hettinger (rhettinger) Date: 2006-03-30 05:18 Message: Logged In: YES user_id=80475 Bah. I'm not overly concerned. It is a Python fact of life that objects defining __getitem__ cannot aways be clearly categorized as being either a sequence or a mapping but not both. You can add some additional checks like checking for a keys() method, but there is a limit to what you can do against these safe-cracking style efforts to trick the routine. I hope this quest for theoretical perfection doesn't lead to throwing the baby out with the bathwater. It would be ashamed to lose the automated choice of the best performing algorithm. If that happens, somebody's real-world use cases are certain to suffer. ---------------------------------------------------------------------- Comment By: Tim Peters (tim_one) Date: 2006-03-29 08:02 Message: Logged In: YES user_id=31435 Ya, this is flaky for dicts; re-opening. For example, >>> d = dict((i, complex(i, i)) for i in xrange(30)) >>> random.sample(d, 5) # ask for 5 and it samples values [(25+25j), (4+4j), (16+16j), (13+13j), (17+17j)] >>> random.sample(d, 6) # ask for 6 and it samples keys [20, 11, 9, 4, 14, 1] That one doesn't have to do with internal exceptions, it has only to do with which of sample()'s internal algorithms gets used. Paul's point about potential bias is also a worry. Here's a full example: """ from random import sample d = dict.fromkeys(range(24)) d['x'] = 'y' count = hits = 0 while 1: count += 1 hits += sample(d, 1) == ['x'] if count % 10000 == 0: print count, "%.2f%%" % (100.0 * hits / count) """ Since len(d) == 25, I'd hope to see 'x' selected 1/25 = 4% of the time. Instead it gets selected 0.16% of the time (1/25**2 -- and Paul's analysis of why is on target). ---------------------------------------------------------------------- Comment By: paul rubin (phr) Date: 2006-03-29 05:07 Message: Logged In: YES user_id=72053 Actually the previous comment is wrong too; 99% of the time, sample(a,1) will return None since that's the value connected to every key in the dictionary, i.e. it's population[j] for every j. The other 1% of the time, the dict gets converted to a list, and the sample returns a key from the dict rather than a value, which is certainly wrong. And you can see how the probabilities are still messed up if the values in the dict are distinct. I think it's ok to give up on dicts, but some warning should about it be added to the manual unless dict-like things somehow get detected in the code. It would be best to test for the object really being a sequence, but I don't know if such a test exists. Maybe one should be added. I'll leave it to you guys to reopen this bug if appropriate. ---------------------------------------------------------------------- Comment By: paul rubin (phr) Date: 2006-03-29 04:46 Message: Logged In: YES user_id=72053 I don't think the fix at 43421 is quite right, but I can't easily test it in my current setup. Suppose a = dict.fromkeys(range(99) + ['x']) b = random.sample(a,1) 99% of the time, there's no KeyError and b gets set to [j] where j is some random integer. 1% of the time, there's a KeyError, random.sample is called recursively, and the recursive call returns [some integer j] 99% of the time, and returns ['x'] 1% of the time. So in total, ['x'] gets returned .01% of the time instead of 1% of the time. I think it's better to not set result[i]=population[j] inside the loop. Instead, just build up the selected set until it has enough indices; then try to make a result list using those indices, and if there's a KeyError, convert the population to a list and use the same selection set to make the results. gbrandl also correctly points out that a dict is not a sequence type, so maybe it's ok to just punt on dicts. But it's obvious from the code comments that somebody once wanted dicts to work, and it's reasonable for people to want sets to work. ---------------------------------------------------------------------- Comment By: Raymond Hettinger (rhettinger) Date: 2006-03-29 04:17 Message: Logged In: YES user_id=80475 Checked-in a fix. See revision 43420 and 43421. ---------------------------------------------------------------------- Comment By: Georg Brandl (gbrandl) Date: 2006-03-29 04:11 Message: Logged In: YES user_id=849994 random.sample is documented to take a sequence as its first argument. A dict is not a sequence type. I think you want random.sample(a.keys(), 3) or, for large dicts, random.sample(a.iterkeys(), 3) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1460340&group_id=5470 _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com