Since this looks a lot like recreating a distribution from a frequency
measurement, one of the other tools that comes to mind (courtesy of an
old question I asked on julia-users during the development of the hist
method in Base) is to compute the (continuous) kernel density estimate
first, and sample from that instead (
http://en.wikipedia.org/wiki/Kernel_density_estimation,
https://github.com/JuliaStats/KernelDensity.jl).
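
To make the idea concrete, here is a minimal hand-rolled sketch of sampling
from a Gaussian KDE, assuming the raw measurements are still available (a
placeholder data vector below) rather than only the binned frequencies; it
uses Silverman's rule-of-thumb bandwidth directly instead of going through
the KernelDensity.jl API:

    using Statistics  # std (lives in Base on older Julia versions)

    data = randn(1_000)                          # placeholder measurements
    h = 1.06 * std(data) * length(data)^(-1/5)   # Silverman's rule of thumb

    # A draw from the Gaussian KDE: pick a data point uniformly at random
    # and add kernel noise with standard deviation equal to the bandwidth.
    sample_kde(data, h) = data[rand(1:end)] + h * randn()

    draws = [sample_kde(data, h) for _ in 1:10_000]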


On Wed, May 27, 2015 at 12:36 AM David P. Sanders <dpsand...@gmail.com>
wrote:

>
>
> On Wednesday, 27 May 2015 at 0:27:41 (UTC+2), Sebastian Nowozin
> wrote:
>>
>>
>> Hi Pontus,
>>
>> I have not seen this approach before, but it is certainly interesting and
>> potentially useful.
>>
>
> I hadn't seen it either, and it is nice, yes.
>
> As far as I am aware, the "standard" method for simulating from a given
> discrete distribution is to
> construct the *cumulative* distribution, which would be
>
> cumulative=[0.1, 0.3, 0.8, 0.9, 1.0]
>
> in the case you gave.
>
> You can now generate a standard uniform random variate r in [0, 1] with
> r = rand()
> and find the first element of cumulative that r falls below (so r is at
> least as large as the preceding element); its index is the sampled outcome.
>
> Doing this with a linear scan is inefficient, O(N), for large vectors,
> since on average you have to traverse about half the array.
> Alternatively you can use a bisection method, which reduces it to O(log N).
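>
> A minimal sketch of that bisection step (not the Distributions.jl
> internals), using cumsum for the cumulative table and searchsortedfirst
> for the O(log N) search:
>
>     probs = [0.1, 0.2, 0.5, 0.1, 0.1]
>     cumulative = cumsum(probs)          # [0.1, 0.3, 0.8, 0.9, 1.0]
>
>     # Index of the first entry of cumulative that is >= rand(); the min
>     # guards against floating-point round-off in the last entry.
>     sample_cdf(cumulative) = min(searchsortedfirst(cumulative, rand()),
>                                  length(cumulative))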
>
> Something like this is already implemented in Distributions.jl:
>
> http://distributionsjl.readthedocs.org/en/latest/univariate.html#categorical-distribution
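>
> For reference, a minimal usage sketch of that interface:
>
>     using Distributions
>
>     d = Categorical([0.1, 0.2, 0.5, 0.1, 0.1])
>     rand(d)         # one draw from 1:5
>     rand(d, 100)    # a vector of 100 draws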
>
> At a glance, the implementation of the table for the cumulative
> distribution seems to be in StatsBase.jl
> and I couldn't work out exactly what it is doing.
>
> David.
>
>
>
>> Without giving an answer to your questions, some things to consider:
>>
>> 1. Small probabilities will be the hardest to represent.
>>
>> 2. Maybe an exact method is possible by storing in each bin not just
>> the value but also a real number, then using a rejection-sampling step.
>> By adjusting the real number you can make the sampler exact, leaving the
>> total number of bins as a free choice (determining efficiency), as long
>> as the number of bins is at least as large as the number of outcomes
>> (see the sketch after this list).
>>
>> 3. The key reference on random sampling is Marsaglia's book on the topic,
>> which may be worth checking out,
>> https://books.google.co.uk/books?id=KJzlBwAAQBAJ
>>
>> 4. If your goal is to generate multiple random draws (instead of just
>> one), a very interesting idea for efficiency comes from the literature on
>> resampling methods in particle filtering.  There the so-called systematic
>> resampling method (e.g. Section 2.4 in
>> http://arxiv.org/pdf/cs/0507025.pdf) manages to draw N "samples" in O(N)
>> time and O(1) space from an arbitrary distribution over N outcomes.  It
>> is probably the most commonly used resampling method in particle
>> filtering.  However, these samples are not i.i.d.; much like quasi-Monte
>> Carlo samples, they still guarantee unbiased expectations.  See the above
>> paper for some further pointers, and the sketch after this list.
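>>
>> Regarding point 2, a minimal sketch of the simplest case, with exactly
>> one bin per outcome, so the stored real value is just an acceptance
>> probability (this is plain rejection sampling from a uniform proposal,
>> not the more refined scheme with a freely chosen bin count):
>>
>>     probs = [0.1, 0.2, 0.5, 0.1, 0.1]
>>     acc = probs ./ maximum(probs)      # the real value stored in each bin
>>
>>     function sample_reject(acc)
>>         while true
>>             j = rand(1:length(acc))      # pick a bin uniformly
>>             rand() < acc[j] && return j  # accept bin j, otherwise retry
>>         end
>>     end
>>
>> The expected number of proposals per draw is length(probs) *
>> maximum(probs), which is the efficiency that extra bins would improve.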
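>>
>> And a sketch of the systematic resampling in point 4, assuming the
>> weights in probs are normalized; a single random offset is shared by all
>> n draws, giving O(n + length(probs)) time and O(1) auxiliary space:
>>
>>     function systematic_resample(probs, n)
>>         out = zeros(Int, n)
>>         u = rand() / n                 # one random offset for all draws
>>         c = probs[1]                   # running cumulative sum
>>         i = 1
>>         for k in 1:n
>>             while u > c && i < length(probs)
>>                 i += 1
>>                 c += probs[i]
>>             end
>>             out[k] = i
>>             u += 1 / n                 # evenly spaced probe points
>>         end
>>         return out
>>     end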
>>
>> In any case, an interesting idea (trading space for runtime); let us know
>> if you find literature on this topic.
>>
>> Best,
>> Sebastian
>>
>> On Tuesday, 26 May 2015 16:42:20 UTC+1, Pontus Stenetorp wrote:
>>>
>>> Everyone,
>>>
>>> tl;dr: When trading memory for computation time by sampling from a
>>> discrete probability distribution via uniform draws from a precomputed
>>> vector, is there a good way to decide on the size of that vector?  Code
>>> attached to the e-mail.
>>>
>>> First of all, my apologies if this has been answered previously.  I
>>> seem to remember it being at least mentioned, but I cannot find
>>> anything on it.
>>>
>>> Given that we have a discrete distribution defined by a vector of
>>> probabilities, say:
>>>
>>>     probs = [0.1, 0.2, 0.5, 0.1, 0.1]
>>>
>>> A fairly common and straightforward way to sample from this
>>> distribution, while trading memory for computation time, is to
>>> allocate a vector from which we can then sample uniformly using
>>> `rand(1:end)`.  For the above example:
>>>
>>>     tbl = [1, 2, 2, 3, 3, 3, 3, 3, 4, 5]
>>>
>>> We can generate this table as follows:
>>>
>>>     tbl = zeros(Int, 10)   # allocate the table; size chosen up front
>>>     i = 1
>>>     for j in 1:length(tbl)
>>>         tbl[j] = i
>>>         if 1/length(tbl)*j >= sum(probs[1:i])
>>>             i += 1
>>>         end
>>>     end
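>>>
>>> As a quick sanity check (not part of the attached code), drawing from
>>> the table and comparing the empirical frequencies with `probs`:
>>>
>>>     counts = zeros(Int, length(probs))
>>>     for _ in 1:100_000
>>>         counts[tbl[rand(1:end)]] += 1
>>>     end
>>>     counts ./ sum(counts)   # roughly [0.1, 0.2, 0.5, 0.1, 0.1]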
>>>
>>> What struck me, however, was that deciding on the size of `tbl` can be
>>> a lot trickier for real examples.  If the table is too small, the
>>> granularity will be too coarse and we will be unable to represent all
>>> the entries in `probs`.  If the table is too large, we will be wasting
>>> memory.
>>>
>>> It seems like there should be some sort of sweet spot, and what I
>>> usually do at times like this is reach for a copy of "Numerical
>>> Recipes"; however, either I skimmed it very poorly or this approach is
>>> not covered.  Even worse, I have been unable to find out what the
>>> approach above is even called.  Does anyone know what it is called?
>>> Or, even better, how would one best decide on the size of `tbl`?  I am
>>> currently using `max(1/(minimum(probs)), 2_000)`, which guarantees a
>>> granularity of at least 0.05% and that the smallest probability in
>>> `probs` will be covered.  This works fine for my application, but the
>>> perfectionist in me squirms whenever I think that I cannot come up
>>> with something more elegant.
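>>>
>>> For concreteness, here is that sizing rule applied to a hypothetical
>>> probs vector with one small entry:
>>>
>>>     probs = [0.003, 0.497, 0.5]
>>>     n = ceil(Int, max(1/minimum(probs), 2_000))  # 1/0.003 is about 334, so n = 2_000
>>>     1/n                                          # granularity 0.0005, i.e. 0.05%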
>>>
>>> Thank you in advance for any and all responses, code attached to the
>>> e-mail.
>>>
>>>     Pontus
>>>
>>
