Raymond Hettinger <[email protected]> added the comment:
Thanks for the suggestion. I'll give it some thought over the next few days.
Here are a few initial thoughts:
* The use of islice() really helps going through a small population quickly.
* The current sample() uses _randbelow() instead of random() to guarantee
equidistribution and to eliminate the small biases that come from using
random(). So this is a step backwards.
* The current sample() guarantees that slices of the sample are all in
randomized order. The proposed algorithm doesn't:
# Fully shuffled
>>> sample(range(20), k=19)
>>> [3, 19, 11, 16, 5, 12, 10, 7, 14, 0, 6, 18, 8, 9, 4, 13, 15, 2, 1]
# Not random at all
>>> sample_iter(range(20), k=19)
>>> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, 18]
* The current sample runs in speed proportional to *k* rather than *n*. This
means that the proposed algorithm slows down considerably for large populations:
In [12]: %timeit sample(range(100_000_000), k=20)
15.7 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit sample_iter(range(100_000_000), k=20)
1.59 s ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
* For an apples-to-apples comparison with choices(), use the version in Python
3.9 which was updated to use floor() which is several times faster than int().
* Instead of listing the islice object, consider using next() instead. That
would likely be faster:
def sample_iter(iterable, k=1):
iterator = iter(iterable)
values = list(islice(iterator, k))
W = exp(log(random())/k)
try:
while True:
# skip is geometrically distributed
skip = floor( log(random())/log(1-W) )
values[randrange(k)] = next(islice(iterator, skip, skip+1))
W *= exp(log(random())/k)
except StopIteration:
return values
* Using mock's call-count feature shows that the proposed algorithm uses more
entropy that the current sample() code. It seems that random() and
_randbelow() are both being called for every selection. I haven't yet
investigated to what causes this, but it is unfavorable especially for slow
generators like SystemRandom().
----------
assignee: -> rhettinger
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue41311>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com