New submission from Oscar Benjamin <oscar.j.benja...@gmail.com>: The random.choice/random.sample functions will only accept a sequence to select from. Can there be a function in the random module for selecting from an arbitrary iterable?
It is possible to make an efficient function that can make random selections from an arbitrary iterable e.g.: from math import exp, log, floor from random import random, randrange from itertools import islice def sample_iter(iterable, k=1): """Select k items uniformly from iterable. Returns the whole population if there are k or fewer items """ iterator = iter(iterable) values = list(islice(iterator, k)) W = exp(log(random())/k) while True: # skip is geometrically distributed skip = floor( log(random())/log(1-W) ) selection = list(islice(iterator, skip, skip+1)) if selection: values[randrange(k)] = selection[0] W *= exp(log(random())/k) else: return values https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm This could be used for random sampling from sets/dicts or also to choose something like a random line from a text file. The algorithm needs to fully consume the iterable but does so efficiently using islice. In the case of a dict this is faster than converting to a list and using random.choice: In [2]: n = 6 In [3]: d = dict(zip(range(10**n), range(10**n))) In [4]: %timeit sample_iter(d) 15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each In [5]: %timeit list(d) 26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [6]: %timeit sample_iter(d, 2) 15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [7]: %timeit sample_iter(d, 20) 17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [8]: %timeit sample_iter(d, 100) 19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) This was already discussed on python-ideas: https://mail.python.org/archives/list/python-id...@python.org/thread/4OZTRD7FLXXZ6R6RU4BME6DYR3AXHOBD/ ---------- components: Library (Lib) messages: 373733 nosy: oscarbenjamin priority: normal severity: normal status: open title: Add a function to get a random sample from an iterable (reservoir sampling) _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue41311> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com