Raymond Hettinger <raymond.hettin...@gmail.com> added the comment:

ISTM that if a generator produces so much data that it cannot feasibly be held in
memory, then it will also take a long time to loop over it and generate a
random value for each entry.

FWIW, every time we've looked at reservoir sampling it has been less performant 
than what we have now.  The calls to randbelow() are the slowest part, so doing 
more calls makes the overall performance worse.  Also, doing more calls eats 
more entropy.  
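
For reference, the reservoir approach in question looks roughly like this
(a sketch of Algorithm R with a hypothetical helper name, not the exact code
that was benchmarked); note the RNG call for every element past the first k:

    import itertools
    import random

    def reservoir_sample(iterable, k):
        # Algorithm R: keep the first k items, then give the i-th item
        # (0-based) a k/(i+1) chance of replacing a random reservoir slot.
        it = iter(iterable)
        reservoir = list(itertools.islice(it, k))
        if len(reservoir) < k:
            raise ValueError("sample larger than population")
        for i, item in enumerate(it, start=k):
            j = random.randrange(i + 1)   # one RNG call per remaining element
            if j < k:
                reservoir[j] = item
        return reservoir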

In general, it is okay for functions to accept only sequences if they exploit 
indexing in some way.  For example, the current approach works great with 
sample(range(100_000_000_000), k=50).  We really don't have to make everything 
accept all iterators.  Besides, it is trivially easy to call list() if needed.
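
To make that concrete (the variable names here are just illustrative):

    import random

    # range() is an indexable sequence, so no hundred-billion-element
    # list is ever built.
    winners = random.sample(range(100_000_000_000), k=50)

    # For an arbitrary iterator, calling list() first is the fallback.
    gen = (x * x for x in range(1_000_000))
    squares = random.sample(list(gen), k=50)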

Overall, I'm -1 on redesigning the sampling algorithm to accommodate 
non-sequence iterators.  AFAICT, it isn't important at all, and as Serhiy 
pointed out, writing your own reservoir sampling is easy to do.  Lastly, the 
standard library doesn't try to be all things to all people; it is okay to 
leave many things for external packages -- we mostly provide a baseline of 
tools that cover common use cases and defer the rest to the Python ecosystem.

----------
assignee:  -> rhettinger
versions:  -Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37682>
_______________________________________