> On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin <oscar.j.benja...@gmail.com> 
> wrote:
>>
>> I posted this in the thread about indexing dict items but it seems to
>> have been buried there so I'm starting a new thread.
>>
>> Maybe there could be a function in the random module for making a
>> random choice from an arbitrary (finite) iterable. This implementation
>> can choose a random element from an iterable by fully iterating over
>> it so is O(N) in terms of CPU operations but O(1) for memory usage:
>>
On Mon, 13 Jul 2020 at 13:37, David Mertz <me...@gnosis.cx> wrote:
>
> This is an inefficient reservoir sampling. The optimized version does not 
> need to call a random inclusion switch on every element, but can skip a 
> geometrically ordered collection of (random) skip lengths to avoid most 
> random inclusion decisions.
>
> Obviously, all items must be iterated over no matter what, but if randrange() 
> is significant compared to the cost of next(), the skip-then-decide version 
> is much preferred, especially as size grows.

Yes, I was deliberately keeping the implementation simple because I
find the naive version understandable just by looking at it. I knew
there would be faster version but thanks for the pointer which quickly
lead me here:
https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm

In Python that's:

from math import exp, log, floor
from random import random, randrange
from itertools import islice

def sample_iter(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values

I haven't quite convinced myself that using floating point like this
for W is okay but this seems to work (maybe it would fail for really
large k or something...). If there was a proper implementation of a
geometric variate in the random module then that could be used
instead.

By my timings this can choose from a large dict in less time than a
conversion to list although the time taken depends on k:

In [2]: n = 6

In [3]: d = dict(zip(range(10**n), range(10**n)))

In [4]: %timeit sample_iter(d)
15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each

In [5]: %timeit list(d)
26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit sample_iter(d, 2)
15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit sample_iter(d, 20)
17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit sample_iter(d, 100)
19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

--
Oscar
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/CQKW76VN65GVSVCQ3UX33YXNS5YG3SC3/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to