Raymond Hettinger added the comment:

Latest draft patch attached (w/o tests or docs).
Incorporates consultation from Allen Downey and Jake VanderPlas.

* Population and weights are separate arguments (like numpy.random.choice() and 
sample() in R).  This matches the way data would arrive in Pandas and is easily 
extracted from a Counter or dict using keys() and values().  It also suits 
applications that sample the population multiple times but with different 
weights.  See 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html and 
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
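
For example, the two arguments are trivial to extract from a Counter (the 
choices() call below is hypothetical, assuming the draft's signature):

    from collections import Counter

    tallies = Counter(red=6, blue=3, green=1)
    population, weights = list(tallies.keys()), list(tallies.values())
    # Hypothetical draft API: ten selections with replacement
    # result = choices(population, weights, k=10)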

* Excludes a replacement=False option.  That use case necessarily has integer 
weights and may be better served by the existing random.sample() than by 
recomputing a CDF on every iteration, as this function would have to do.

* Allows cumulative weights (cum_weights) to be submitted instead of individual 
weights.  This supports use cases where the CDF already exists (as in the 
ThinkBayes examples) and where we want to periodically reuse the same CDF for 
repeated samples of the same population -- this occurs in resampling 
applications, Gibbs sampling, and Markov chain Monte Carlo (MCMC) applications.  
Per Jake, "MCMC/Gibbs Sampling approaches generally boil down to a simple 
weighted coin toss at each step" and "It's definitely common to do aggregation 
of multiple samples, e.g. to compute sample statistics".
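
For instance, the CDF can be built once with itertools.accumulate and then 
reused across many draws (the choices() call is the hypothetical draft API):

    from itertools import accumulate

    population = ['a', 'b', 'c']
    weights = [0.1, 0.6, 0.3]
    cdf = list(accumulate(weights))        # [0.1, 0.7, 1.0]
    # Hypothetical draft API: reuse the same CDF for repeated resampling
    # for _ in range(trials):
    #     resample = choices(population, cum_weights=cdf, k=len(population))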

* The API allows the weights to be integers, fractions, decimals, or floats.  
Likewise, the population and weights can be any Sequence.  Population elements 
need not be hashable.
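
To illustrate, exact Fraction weights accumulate cleanly into a bisectable CDF, 
and the population may hold unhashable elements such as lists:

    from fractions import Fraction
    from itertools import accumulate

    population = [[1, 2], [3, 4], [5, 6]]      # unhashable elements are fine
    weights = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
    cum_weights = list(accumulate(weights))    # exact, monotonically increasing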

* Returning a list means that we don't have to save state mid-stream (which is 
why we can't use a generator).  A list feeds nicely into Counter, mean, median, 
stdev, etc. for summary statistics.  Returning a list also parallels what 
random.sample() does, keeping the module internally consistent.
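
For example, with a stand-in result list (fabricated here purely for 
illustration):

    from collections import Counter
    from statistics import mean, median, stdev

    results = [2, 3, 3, 5, 2, 3]    # stand-in for the function's output
    Counter(results)                # Counter({3: 3, 2: 2, 5: 1})
    mean(results), median(results), stdev(results)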

* Default uniform weighting falls back to random.choice(), which is more 
efficient than bisecting (see the uniform fast path in the sketch below).

* Bisecting tends to beat other approaches in the general case.  See 
http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python

* Incorporates error checks for len(population)==len(cum_weights) and for 
conflicting specification of both weights and cumulative weights.
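
Putting the above together, here is a minimal sketch of the bisecting approach 
with those error checks (the name weighted_choices and the exact signature are 
illustrative; the real code is in the attached diff):

    import bisect
    import itertools
    import random

    def weighted_choices(population, weights=None, *, cum_weights=None, k=1):
        """Illustrative sketch: k weighted selections with replacement."""
        if cum_weights is None:
            if weights is None:
                # Uniform fast path: no CDF needed at all.
                return [random.choice(population) for _ in range(k)]
            cum_weights = list(itertools.accumulate(weights))
        elif weights is not None:
            raise TypeError('Cannot specify both weights and cum_weights')
        if len(cum_weights) != len(population):
            raise ValueError('The number of weights does not match the population')
        total = cum_weights[-1]
        return [population[bisect.bisect(cum_weights, random.random() * total)]
                for _ in range(k)]

Each selection costs one call to random() plus an O(log n) bisect, which is 
where the win over linear scanning comes from.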

The API is not perfect and there are some aspects that give me heartburn:

1) Not saving the computed CDF is wasteful and forces the user to pre-build the 
CDF if they want to save it for later use (the API could return both the 
selections and the CDF, but that would be awkward and atypical).

2) For the common case of small integer weights on a small population, the 
bisecting approach is slower than calling random.choice() on a population 
expanded to repeat each element in proportion to its weight (that said, short 
of passing in a flag, there is no cheap, easy way for this function to detect 
that case and give it a fast path).

3) Outputting a list is inefficient if all you're doing with the result is 
summarizing it with a Counter, histogram tool, mean, median, or stdev.

4) There is no cheap way to check whether a user-supplied cum_weights is sorted 
or whether the weights contain negative values.
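
For reference, the expanded-population fast path described in point 2, viable 
only when the weights are small integers:

    import random
    from collections import Counter

    tallies = Counter(red=6, blue=3, green=1)
    expanded = list(tallies.elements())   # each element repeated per its weight
    picks = [random.choice(expanded) for _ in range(1000)]   # with replacement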

----------
Added file: http://bugs.python.org/file44394/weighted_choice.diff
