[Abe Dillon]
> Randomly sampling from some population is often done because the entire
> population is impractically large, which is also a motivation for using
> iterators, so it seems natural that one would be able to sample from an
> iterator. A naive implementation could use a heap queue:
>
>     import heapq
>     import random
>
>     def stream():
>         while True:
>             yield random.random()
>
>     def sample(population, size):
>         q = [tuple()] * size
>         for el in zip(stream(), population):
>             if el > q[0]:
>                 heapq.heapreplace(q, el)
>         return [el[1] for el in q if el]
[Steven D'Aprano]
> Is that an improvement over:
>
>     sample(list(itertools.islice(population, size)))
>
> and if so, please explain.

Different things entirely. Your spelling is missing `sample()`'s required second argument, and the difference should be clear if it's supplied:

    sample(list(itertools.islice(population, size)), size)

That is, it merely returns some permutation of the _initial_ `size` items in the iterable. The rest of the population is ignored.

In Python today, the easiest way to spell Abe's intent is, e.g.,

    >>> from heapq import nlargest  # or nsmallest - doesn't matter
    >>> from random import random
    >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
    [75260, 45880, 99486, 13478]
    >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
    [31732, 72288, 26584, 72672]
    >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
    [14180, 86084, 22639, 2004]

That also arranges to preserve `sample()`'s promise that all sub-slices of the result are valid random samples too (because `nlargest` sorts by the randomly generated keys before returning the list).

However, it does _not_ preserve - and nothing can preserve for arbitrary iterables - `sample()`'s promise to "[leave] the original population unchanged". We can't examine an arbitrary iterable's population at all without exhausting the iterable, and that can be destructive. So while this can indeed be useful, it would require changing `sample()` to break that promise in some cases.

BTW, using a heap for this is uncommon. Search on "reservoir sampling" for more common ways. Most common is probably Vitter's "Algorithm R", which runs in O(len(iterable)) time (no additional log factor for a heap - it doesn't use a heap).

I'd prefer to leave `sample()` alone, and introduce some spelling of `possibly_destructive_sample()` for arbitrary iterables - if that's wanted enough for someone to do the work ;-)
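For reference, here is a minimal sketch of the Vitter "Algorithm R" reservoir sampling mentioned above, assuming only the standard library (the function name `reservoir_sample` is my own, not anything in the stdlib):

```python
import random

def reservoir_sample(iterable, k):
    """Algorithm R: one pass, O(n) time, O(k) space.

    Keeps a k-element reservoir; the i-th item (1-based, i > k)
    replaces a uniformly chosen reservoir slot with probability k/i,
    which makes every item equally likely to end up in the result.
    """
    it = iter(iterable)
    # Fill the reservoir with the first k items.
    reservoir = []
    for _ in range(k):
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir  # the iterable had fewer than k items
    # Consume the rest of the stream, replacing slots with
    # decreasing probability k/i.
    for i, item in enumerate(it, start=k + 1):
        j = random.randrange(i)  # uniform in [0, i)
        if j < k:
            reservoir[j] = item
    return reservoir
```

Note that, unlike the `nlargest` spelling, the reservoir is not returned in random order, so sub-slices of the result are not themselves valid samples unless the result is shuffled afterward.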
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/