Everyone,

tl;dr: When trading memory for computation time by sampling from a
discrete probability distribution by sampling uniformly from a vector,
is there a good way to decide on the size of the vector?  Code
attached to the e-mail.

First of all, my apologies if this has been answered previously.  I
seem to remember it being at least mentioned, but I can not seem to
find anything on it.

Given that we have a discrete distribution defined by a vector of
probabilities, say:

    probs = [0.1, 0.2, 0.5, 0.1, 0.1]

A fairly common and straightforward way to sample from this
distribution, while trading memory for computation time, is to
allocate a vector from which can then sample uniformly using
`rand(1:end)`.  For the above example:

    tbl = [1, 2, 2, 3, 3, 3, 3, 3, 4, 5]

Which we can generate as follows:

    i = 1
    for j in 1:length(tbl)
        tbl[j] = i
        if 1/length(tbl)*j >= sum(probs[1:i])
            i += 1
        end
    end

What however struck me was that deciding on the size of `tbl` can be a
lot trickier for real examples.  If the table is too small, we will
end up with too coarse granularity and will be unable to represent all
the entries in `probs`.  If the table is too large, we will be wasting
memory.

It seems like there should be some sort of sweet spot and what I
usually do at that time is reach for a copy of "Numerical Recipes",
however, either I skimmed it very poorly or this approach was not
covered.  Even worse, I have been unable to search for what the
approach above is even called.  Does anyone know what this approach
would be referred to?  Or, even better, how would one best decide on
the size of `tbl`?  I am currently using `max(1/(minimum(probs)),
2_000)`, which guarantees a granularity of at least 0.05% and that the
smallest probability in `probs` will be covered.  This works fine for
my application, but the perfectionist in me squirms whenever I think
that I can not come up with something more elegant.

Thank you in advance for any and all responses, code attached to the e-mail.

    Pontus

Attachment: probtbl.jl
Description: Binary data

Reply via email to