On Wed, Nov 30, 2016 at 11:57:46AM -0800, Chris Kaynor wrote: > Consider that this code will not produce the "correct" results (for a > reasonable definition of correct): > > a = (i for i in range(100)) # Pretend this does something more > interesting, and isn't a trivial generator - maybe a file object > reading by line. > randomEntries = [random.choice(a) for i in range(10)] # May not be > quite as obvious, such as the choices could be chosen in a more > complex loop. > > randomEntries may not contain 10 items (or the generation may error > due to having insufficent items, depending on implementation),
Indeed. Given a iterator with 100 items, there's a 9% chance that the first choice will be in the last nine items, which implies that failures here will be common. > it will > not contain duplicates, and will be sorted. Right. In other words, its a highly biased, non-random "random sample". It may be worth adding a "reservoir sample" function, but I think it is a mistake to try modifying the existing functions to deal with iterators. Its too hard to get the same characteristics when sampling from a sequence and an iterator. Better to keep them as separate functions so that the caller knows what they're getting. Here's my first attempt at implementing Algorithm R from Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R from random import randrange, shuffle import itertools def reservoir_sample(iterable, count=1): """Return a list of count items sampled without replacement from iterable, using reservoir sampling "Algorithm R". This will exhaust the iterable, which must be finite. """ it = iter(iterable) reservoir = list(itertools.islice(it, count)) if len(reservoir) < count: raise ValueError('iterator is too short to sample %d items' % count) shuffle(reservoir) for i, value in enumerate(it, count+1): j = randrange(0, i) if j < count: reservoir[j] = value return reservoir -- Steve _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/