Since this looks a lot like recreating a distribution from a frequency measurement, another tool that comes to mind (courtesy of an old question I asked on julia-users during the development of the `hist` method in Base) is to compute the (continuous) kernel density estimate first and sample from that instead (http://en.wikipedia.org/wiki/Kernel_density_estimation, https://github.com/JuliaStats/KernelDensity.jl).
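
For example, something along these lines. This is only a rough sketch: it assumes you still have the underlying observations (not just the probability vector), it rolls its own Gaussian KDE rather than going through the KernelDensity.jl API, and the bandwidth rule and the `sample_kde` name are placeholders of my choosing.

    using Statistics  # for std (in Base on older Julia versions)

    # Draw n samples from a Gaussian KDE of `data`: pick an observation
    # uniformly at random and perturb it with kernel noise of width h.
    function sample_kde(data::Vector{Float64}, n::Int)
        h = 1.06 * std(data) * length(data)^(-1/5)   # Silverman's rule of thumb
        return [data[rand(1:end)] + h * randn() for _ in 1:n]
    end

A few sketches of the approaches discussed further down the thread are appended after the quoted messages below.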
On Wed, May 27, 2015 at 12:36 AM David P. Sanders <dpsand...@gmail.com> wrote:

> On Wednesday, May 27, 2015 at 0:27:41 (UTC+2), Sebastian Nowozin wrote:
>>
>> Hi Pontus,
>>
>> I have not seen this approach before, but it is certainly interesting
>> and potentially useful.
>
> I hadn't seen it either, and it is nice, yes.
>
> As far as I am aware, the "standard" method for simulating from a given
> discrete distribution is to construct the *cumulative* distribution,
> which would be
>
> cumulative = [0.1, 0.3, 0.8, 0.9, 1.0]
>
> in the case you gave.
>
> You can now generate a standard uniform random variate r in [0,1] with
> r = rand() and find the element of cumulative that r is less than while
> being greater than the preceding one.
>
> This is inefficient, O(N), for large vectors, since on average you have
> to cover, say, half the array. Alternatively you can use a bisection
> method, which reduces it to O(log N).
>
> Something like this is already implemented in Distributions.jl:
>
> http://distributionsjl.readthedocs.org/en/latest/univariate.html#categorical-distribution
>
> At a glance, the implementation of the table for the cumulative
> distribution seems to be in StatsBase.jl, and I couldn't work out
> exactly what it is doing.
>
> David.
>
>> Without giving an answer to your questions, some things to consider:
>>
>> 1. Small probabilities will be the hardest to represent.
>>
>> 2. Maybe there is an exact method possible by storing in each bin not
>> just the value but also a real value, then using a rejection-sampling
>> step. By adjusting the real value you can make the sampler exact,
>> leaving the total number of bins as a free choice (determining
>> efficiency), as long as the number of bins is at least as large as the
>> number of outcomes.
>>
>> 3. The key reference on random sampling is Marsaglia's book on the
>> topic, which may be worth checking out:
>> https://books.google.co.uk/books?id=KJzlBwAAQBAJ
>>
>> 4. If your goal is to generate multiple random draws (instead of just
>> one), a very interesting idea for efficiency comes from the literature
>> on resampling methods in particle filtering. There the so-called
>> systematic resampling method (e.g. Section 2.4 in
>> http://arxiv.org/pdf/cs/0507025.pdf) manages to draw N "samples" in
>> O(N) time and O(1) space from a distribution over N outcomes with an
>> arbitrary probability distribution over these N elements. It is
>> probably the most commonly used resampling method in particle
>> filtering. However, these samples are not iid but, much like
>> quasi-Monte Carlo samples, guarantee unbiased expectations; see the
>> above paper for some further pointers.
>>
>> In any case, an interesting idea (trading runtime for space); let us
>> know if you find literature on this topic.
>>
>> Best,
>> Sebastian
>>
>> On Tuesday, 26 May 2015 16:42:20 UTC+1, Pontus Stenetorp wrote:
>>>
>>> Everyone,
>>>
>>> tl;dr: When trading memory for computation time by sampling from a
>>> discrete probability distribution via uniform sampling from a vector,
>>> is there a good way to decide on the size of the vector? Code
>>> attached to the e-mail.
>>>
>>> First of all, my apologies if this has been answered previously. I
>>> seem to remember it being at least mentioned, but I cannot seem to
>>> find anything on it.
>>> Given that we have a discrete distribution defined by a vector of
>>> probabilities, say:
>>>
>>> probs = [0.1, 0.2, 0.5, 0.1, 0.1]
>>>
>>> a fairly common and straightforward way to sample from this
>>> distribution, while trading memory for computation time, is to
>>> allocate a vector from which we can then sample uniformly using
>>> `rand(1:end)`. For the above example:
>>>
>>> tbl = [1, 2, 2, 3, 3, 3, 3, 3, 4, 5]
>>>
>>> which we can generate as follows:
>>>
>>> i = 1
>>> for j in 1:length(tbl)
>>>     tbl[j] = i
>>>     if 1/length(tbl)*j >= sum(probs[1:i])
>>>         i += 1
>>>     end
>>> end
>>>
>>> What struck me, however, was that deciding on the size of `tbl` can
>>> be a lot trickier for real examples. If the table is too small, we
>>> end up with too coarse a granularity and are unable to represent all
>>> the entries in `probs`. If the table is too large, we waste memory.
>>>
>>> It seems like there should be some sort of sweet spot, and what I
>>> usually do at this point is reach for my copy of "Numerical Recipes";
>>> however, either I skimmed it very poorly or this approach is not
>>> covered. Even worse, I have been unable to search for what the
>>> approach above is even called. Does anyone know what this approach
>>> would be referred to? Or, even better, how would one best decide on
>>> the size of `tbl`? I am currently using `max(1/(minimum(probs)),
>>> 2_000)`, which guarantees a granularity of at least 0.05% and that
>>> the smallest probability in `probs` will be covered. This works fine
>>> for my application, but the perfectionist in me squirms whenever I
>>> think that I cannot come up with something more elegant.
>>>
>>> Thank you in advance for any and all responses; code attached to the
>>> e-mail.
>>>
>>> Pontus
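
As promised above, a few rough sketches of the approaches discussed in the thread. None of this is the actual Distributions.jl or StatsBase.jl code; the function names and the small implementation choices (rounding, guards against probabilities not summing exactly to 1) are my own.

David's cumulative-distribution method, with the bisection step handled by `searchsortedfirst`, so a single draw costs O(log N):

    probs = [0.1, 0.2, 0.5, 0.1, 0.1]
    cumulative = cumsum(probs)            # [0.1, 0.3, 0.8, 0.9, 1.0]

    # Return an index in 1:length(probs), drawn with the given probabilities.
    function sample_cdf(cumulative::Vector{Float64})
        r = rand() * cumulative[end]      # tolerate sums slightly off 1.0
        return searchsortedfirst(cumulative, r)
    end

The systematic resampling scheme from Sebastian's point 4, which turns a single uniform random number into N (non-iid) draws in O(N) time:

    # Map the N evenly spaced points u, u + 1/N, ..., u + (N-1)/N onto the
    # cumulative distribution; each point selects the outcome whose
    # cumulative interval contains it.
    function systematic_resample(probs::Vector{Float64}, N::Int)
        out = zeros(Int, N)
        u = rand() / N
        c = probs[1]
        i = 1
        for j in 1:N
            while u > c && i < length(probs)
                i += 1
                c += probs[i]
            end
            out[j] = i
            u += 1 / N
        end
        return out
    end

And a self-contained version of Pontus's fixed-size table, using his `max(1/minimum(probs), 2_000)` size heuristic; once the table is built, every draw is an O(1) `tbl[rand(1:end)]`:

    # Cell j of the table covers the probability interval ((j-1)/n, j/n]
    # and stores the first outcome whose cumulative mass reaches j/n.
    function build_table(probs::Vector{Float64})
        n = max(ceil(Int, 1 / minimum(probs)), 2_000)
        c = cumsum(probs)
        tbl = zeros(Int, n)
        i = 1
        for j in 1:n
            while i < length(probs) && j / n > c[i]
                i += 1
            end
            tbl[j] = i
        end
        return tbl
    end

    tbl = build_table(probs)
    tbl[rand(1:end)]                      # one O(1) draw

The trade-off in a nutshell: the cumulative table is exact with only N entries but costs O(log N) per draw, while the fixed-size table buys O(1) draws at the price of memory that grows like 1/minimum(probs).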