Everyone,

tl;dr: When trading memory for computation time by sampling from a discrete probability distribution via uniform sampling from a precomputed vector, is there a good way to decide on the size of that vector? Code attached to the e-mail.
First of all, my apologies if this has been answered previously. I seem to remember it being mentioned at least once, but I cannot find anything on it.

Given a discrete distribution defined by a vector of probabilities, say:

    probs = [0.1, 0.2, 0.5, 0.1, 0.1]

a fairly common and straightforward way to sample from it, trading memory for computation time, is to allocate a lookup vector from which we can then sample uniformly using `rand(1:end)`. For the above example:

    tbl = [1, 2, 2, 3, 3, 3, 3, 3, 4, 5]

which we can generate as follows:

    tbl = zeros(Int, 10)  # pre-allocate the lookup table
    i = 1
    for j in 1:length(tbl)
        tbl[j] = i
        if 1/length(tbl)*j >= sum(probs[1:i])
            i += 1
        end
    end

What struck me, however, is that deciding on the size of `tbl` can be a lot trickier for real examples. If the table is too small, the granularity is too coarse and we cannot represent all the entries in `probs`. If the table is too large, we waste memory. It seems like there should be some sort of sweet spot, and at that point I usually reach for a copy of "Numerical Recipes"; however, either I skimmed it very poorly or this approach is not covered there. Even worse, I have been unable to find out what the approach above is even called.

Does anyone know what this approach is referred to? Or, even better, how would one best decide on the size of `tbl`?

I am currently using `max(1/(minimum(probs)), 2_000)`, which guarantees a granularity of at least 0.05% and that the smallest probability in `probs` is covered. This works fine for my application, but the perfectionist in me squirms whenever I think that I cannot come up with something more elegant.

Thank you in advance for any and all responses; code attached to the e-mail.

Pontus
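For reference, here is a minimal, self-contained version of what I am doing, putting the table construction and the sizing heuristic together. The function names (`build_table`, `table_size`, `draw`) are just my own for illustration and not from any package:

    # Build a lookup table of length n for the probability vector `probs`:
    # slot j holds the index i whose cumulative probability covers j/n.
    function build_table(probs, n)
        tbl = zeros(Int, n)
        cum = cumsum(probs)          # cumulative probabilities
        i = 1
        for j in 1:n
            tbl[j] = i
            if j / n >= cum[i] && i < length(probs)
                i += 1
            end
        end
        return tbl
    end

    # Sizing heuristic from above: at least 2_000 slots, and enough slots
    # that the smallest probability maps to at least one of them.
    table_size(probs) = ceil(Int, max(1 / minimum(probs), 2_000))

    # Draw one sample (an index into `probs`) by sampling uniformly from the table.
    draw(tbl) = tbl[rand(1:length(tbl))]

    # Usage:
    probs = [0.1, 0.2, 0.5, 0.1, 0.1]
    tbl   = build_table(probs, table_size(probs))
    xs    = [draw(tbl) for _ in 1:10_000]   # approximate samples from `probs`

For the example `probs` above with n = 10 this reproduces `[1, 2, 2, 3, 3, 3, 3, 3, 4, 5]`, so hopefully it makes the question about choosing n concrete.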
probtbl.jl