> On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin <oscar.j.benja...@gmail.com> > wrote: >> >> I posted this in the thread about indexing dict items but it seems to >> have been buried there so I'm starting a new thread. >> >> Maybe there could be a function in the random module for making a >> random choice from an arbitrary (finite) iterable. This implementation >> can choose a random element from an iterable by fully iterating over >> it so is O(N) in terms of CPU operations but O(1) for memory usage: >> On Mon, 13 Jul 2020 at 13:37, David Mertz <me...@gnosis.cx> wrote: > > This is an inefficient reservoir sampling. The optimized version does not > need to call a random inclusion switch on every element, but can skip a > geometrically ordered collection of (random) skip lengths to avoid most > random inclusion decisions. > > Obviously, all items must be iterated over no matter what, but if randrange() > is significant compared to the cost of next(), the skip-then-decide version > is much preferred, especially as size grows.
Yes, I was deliberately keeping the implementation simple because I find the naive version understandable just by looking at it. I knew there would be faster version but thanks for the pointer which quickly lead me here: https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm In Python that's: from math import exp, log, floor from random import random, randrange from itertools import islice def sample_iter(iterable, k=1): """Select k items uniformly from iterable. Returns the whole population if there are k or fewer items """ iterator = iter(iterable) values = list(islice(iterator, k)) W = exp(log(random())/k) while True: # skip is geometrically distributed skip = floor( log(random())/log(1-W) ) selection = list(islice(iterator, skip, skip+1)) if selection: values[randrange(k)] = selection[0] W *= exp(log(random())/k) else: return values I haven't quite convinced myself that using floating point like this for W is okay but this seems to work (maybe it would fail for really large k or something...). If there was a proper implementation of a geometric variate in the random module then that could be used instead. By my timings this can choose from a large dict in less time than a conversion to list although the time taken depends on k: In [2]: n = 6 In [3]: d = dict(zip(range(10**n), range(10**n))) In [4]: %timeit sample_iter(d) 15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each In [5]: %timeit list(d) 26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [6]: %timeit sample_iter(d, 2) 15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [7]: %timeit sample_iter(d, 20) 17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [8]: %timeit sample_iter(d, 100) 19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) -- Oscar _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/CQKW76VN65GVSVCQ3UX33YXNS5YG3SC3/ Code of Conduct: http://python.org/psf/codeofconduct/