[ 
https://issues.apache.org/jira/browse/RNG-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830967#comment-16830967
 ] 

Alex D Herbert commented on RNG-91:
-----------------------------------

Previous results show the Kemp method cannot compete with the current SmallMean 
method when the mean is above 8 and the generator is fast. Thus a new version 
was introduced that computes the sample value at the cumulative probability of 
50%. This performs as well as the SmallMean method at all means <40.
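
Roughly, the idea of the 50% version is as follows. This is a simplified sketch of the approach (field names are illustrative and the benchmarked {{KempSmallMeanPoissonSamplerP50}} may differ in detail): pre-compute the probability and cumulative probability at the first sample value where the CDF exceeds 0.5, then run the Kemp recurrence upwards or downwards from that hinge so the search starts near the middle of the distribution.
{code:java}
import org.apache.commons.rng.UniformRandomProvider;

/** Sketch of a Kemp sampler hinged at ~50% cumulative probability (illustrative only). */
class KempP50Sketch {
    private final UniformRandomProvider rng;
    private final double mean;
    /** First sample value where the CDF exceeds 0.5. */
    private final int hinge;
    /** p(hinge). */
    private final double pHinge;
    /** F(hinge) = p(0) + ... + p(hinge). */
    private final double fHinge;

    KempP50Sketch(UniformRandomProvider rng, double mean) {
        this.rng = rng;
        this.mean = mean;
        // Walk the Kemp recurrence once up-front to locate the hinge.
        double p = Math.exp(-mean);
        double f = p;
        int x = 0;
        while (f <= 0.5) {
            x++;
            p = p * mean / x;
            f += p;
        }
        hinge = x;
        pHinge = p;
        fHinge = f;
    }

    public int sample() {
        final double u = rng.nextDouble();
        int x = hinge;
        double p = pHinge;
        double f = fHinge;
        if (u > f) {
            // Search upwards using p(n+1) = p(n) * mean / (n+1).
            while (u > f && p != 0) {
                x++;
                p = p * mean / x;
                f += p;
            }
            return x;
        }
        // Search downwards using p(n-1) = p(n) * n / mean.
        while (x > 0 && u <= f - p) {
            f -= p;
            p = p * x / mean;
            x--;
        }
        return x;
    }
}
{code}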

This analysis was done for repeat sampling. However, it would not fare well for single samples as it requires pre-computation. This indicates there are use cases not covered by previous testing. I think we have the following combinations:
||RNG||Mean||Number of samples||
|Fast|<1|1|
|Fast|<1|repeat|
|Fast|1 - 40|1|
|Fast|1 - 40|repeat|
|Slow|<1|1|
|Slow|<1|repeat|
|Slow|1 - 40|1|
|Slow|1 - 40|repeat|

The mean <1 category is because the LargeMeanPoissonSampler is optimised for 
integer means. It creates a sample by adding a sample from a large integer mean 
to a sample from a mean in the range <1.
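
This works because independent Poisson variables are additive:
{noformat}
X1 ~ Poisson(n), X2 ~ Poisson(f), independent  =>  X1 + X2 ~ Poisson(n + f)
{noformat}
so a sample for mean n + f (with integer n and f < 1) is the sum of a sample at the integer mean and a sample at the fractional mean.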

The mean 1 - 40 category is because the library switches to the LargeMean 
sampler when the mean is above 40. This was previously shown to be a reasonable 
limit in terms of performance.

The number of samples is a split between creating a single sample or lots. This 
effectively measures the start-up cost of creating a generator. The 
PoissonSamplerCache was created to exploit the re-use of previous computations 
for a LargeMean sampler. This hits the use case of a mean <1 and a single 
sample.

The use case of single samples for means in the range 1 - 40 is for 
completeness (and because I have a use case for it). 

I noted in the PR 91 comments that a test should be done on a sampler that fully computes the cumulative probability table. I have done this using various methods. Note that since the Poisson distribution is unbounded, the cumulative probability table is computed up to approximately 99.999%. The remaining part of the distribution can be computed using the Kemp method, which uses the recurrence relation to compute each successive probability value. All that is required is that the sampler stores the probability and sample value of the first value after the tabulated probabilities. Here are the methods I have tried for searching the probability table:
 * [Alias Method|https://en.wikipedia.org/wiki/Alias_method]
 * Guide tables
 * Binary search

The Alias method takes the input probabilities and shares them out into a pair of tables that are built in linear time and can then be sampled in constant time. It is a very nice algorithm and well explained by Keith Schwarz in [Sampling from a Discrete Distribution|http://www.keithschwarz.com/darts-dice-coins/].
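
A simplified, self-contained sketch of the alias table construction (Vose's method, as described in the link above) and its constant-time sample step (the class and variable names are illustrative, not the benchmarked implementation):
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.commons.rng.UniformRandomProvider;

/** Sketch of an alias table built from probabilities p[0..n-1] that sum to ~1. */
class AliasTableSketch {
    private final UniformRandomProvider rng;
    private final double[] probability;  // chance of keeping the chosen column
    private final int[] alias;           // fall-back sample for each column

    AliasTableSketch(UniformRandomProvider rng, double[] p) {
        this.rng = rng;
        final int n = p.length;
        probability = new double[n];
        alias = new int[n];
        final double[] scaled = new double[n];
        final Deque<Integer> small = new ArrayDeque<>();
        final Deque<Integer> large = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            scaled[i] = p[i] * n;
            (scaled[i] < 1 ? small : large).push(i);
        }
        // Pair each deficient column with a surplus column.
        while (!small.isEmpty() && !large.isEmpty()) {
            final int s = small.pop();
            final int l = large.pop();
            probability[s] = scaled[s];
            alias[s] = l;
            scaled[l] = (scaled[l] + scaled[s]) - 1;
            (scaled[l] < 1 ? small : large).push(l);
        }
        // Any remaining columns are (numerically) exactly full.
        while (!large.isEmpty()) {
            probability[large.pop()] = 1;
        }
        while (!small.isEmpty()) {
            probability[small.pop()] = 1;
        }
    }

    public int sample() {
        // O(1): pick a column, then a biased coin picks the column or its alias.
        final int column = rng.nextInt(probability.length);
        return rng.nextDouble() < probability[column] ? column : alias[column];
    }
}
{code}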

The reference for a guide table is from:
{noformat}
Devroye, Luc (1986). Non-Uniform Random Variate Generation.
New York: Springer-Verlag. Chapter 3.2.4 "The method of guide tables" p. 
96.{noformat}
Here is [chapter 3|http://luc.devroye.org/chapter_three.pdf] as a PDF from the 
author's website.

The idea is to store a table that maps a uniform deviate (i.e. a cumulative probability) onto a starting index in the cumulative probability table. The cumulative probability table is then searched linearly from that point. If the cumulative probability distribution is well spaced then the search takes only a few steps and performance is fast.
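
A simplified sketch of the guide table idea (again the names are illustrative, not the benchmarked implementation). With one guide entry per table entry it stores a {{double}} and an {{int}} per entry, i.e. roughly the 12 bytes per entry mentioned in the conclusions below:
{code:java}
import org.apache.commons.rng.UniformRandomProvider;

/** Sketch of a guide table built from probabilities p[0..n-1] that sum to ~1. */
class GuideTableSketch {
    private final UniformRandomProvider rng;
    private final double[] cdf;  // cdf[i] = p(0) + ... + p(i)
    private final int[] guide;   // guide[j] = smallest i with cdf[i] > j / guide.length

    GuideTableSketch(UniformRandomProvider rng, double[] p) {
        this.rng = rng;
        final int n = p.length;
        cdf = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += p[i];
            cdf[i] = sum;
        }
        // Guard against round-off so the linear search always terminates.
        cdf[n - 1] = 1.0;
        // One guide slot per entry; more slots make the linear search shorter.
        guide = new int[n];
        int i = 0;
        for (int j = 0; j < n; j++) {
            while (cdf[i] <= (double) j / n) {
                i++;
            }
            guide[j] = i;
        }
    }

    public int sample() {
        final double u = rng.nextDouble();
        // The deviate maps straight to a starting index ...
        int i = guide[(int) (u * guide.length)];
        // ... and the cumulative table is searched linearly from there.
        while (cdf[i] <= u) {
            i++;
        }
        return i;
    }
}
{code}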

*Samplers*
||Sampler||Description||
|BoundedKempSmallMeanPoissonSampler|The Kemp method bounded to 1000 * mean as 
per the SmallMean sampler|
|KempSmallMeanPoissonSamplerAliasMethod|Kemp sampler using the Alias method to 
search the tabulated probabilities|
|KempSmallMeanPoissonSamplerBinarySearch|Kemp sampler using a binary search of 
the tabulated probabilities|
|KempSmallMeanPoissonSamplerGuideTable|Kemp sampler using a guide table to search the tabulated probabilities|
|KempSmallMeanPoissonSamplerP50|Kemp sampler with the sample value stored at 
cumulative probability 50%|
|OriginalKempSmallMeanPoissonSampler|This is the original Kemp method. Its 
upper limit is unbounded until the current sample probability p(x) is zero.|
|SmallMeanPoissonSampler|The current library implementation|
|LargeMeanPoissonSampler|The current library implementation|
|TinyMeanPoissonSampler|An implementation of the algorithm from the SmallMean 
sampler using optimised 32-bit arithmetic. Can only handle means up to 22.|

*Single Samples*

For single samples, ideally a method will not have to pre-compute much. All methods have to compute {{Math.exp(-mean)}}, so here is what that costs:
||mean||Score||Error||Median||Error/Score||
|0.25|54.74|17.19|52.71|0.31|
|0.50|52.71|0.02|52.71|0.00|
|1.00|54.74|17.19|52.72|0.31|
|2.00|50.90|0.02|50.90|0.00|
|4.00|50.90|0.02|50.90|0.00|
|8.00|54.60|16.25|52.72|0.30|
|16.00|52.52|13.90|50.91|0.26|
|32.00|52.54|13.73|50.90|0.26|

*Generators*
 Here are the computation times for a double from the generators I tested 
(chosen to have a variety of speeds):
||randomSourceName||Method||Score||Error||Median||Error/Score||
|XO_RO_SHI_RO_128_PLUS|baselineNextDouble|5.18|1.60|5.00|0.31|
|WELL_44497_B|baselineNextDouble|24.58|8.96|22.92|0.36|
|ISAAC|baselineNextDouble|13.36|4.73|12.48|0.35|

The XO_RO_SHI_RO_128_PLUS is the fastest generator for double values in the 
library.

*Results*

I have a lot of data so I will confine the results to avoid repeating those that show the same thing.

*Mean <1*

The simple methods are best:
||randomSourceName||samplerType||mean||Method||Score||Error||Median||Error/Score||
|XO_RO_SHI_RO_128_PLUS|BoundedKempSmallMeanPoissonSampler|0.25|sample|8.39|0.09|8.38|0.01|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerAliasMethod|0.25|sample|15.07|0.05|15.07|0.00|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerBinarySearch|0.25|sample|10.55|0.01|10.55|0.00|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerGuideTable|0.25|sample|12.99|0.01|12.99|0.00|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerP50|0.25|sample|8.91|2.72|8.43|0.31|
|XO_RO_SHI_RO_128_PLUS|OriginalKempSmallMeanPoissonSampler|0.25|sample|9.30|3.12|9.87|0.34|
|XO_RO_SHI_RO_128_PLUS|SmallMeanPoissonSampler|0.25|sample|9.87|3.10|9.77|0.31|
|XO_RO_SHI_RO_128_PLUS|TinyMeanPoissonSampler|0.25|sample|7.12|0.02|7.12|0.00|
|XO_RO_SHI_RO_128_PLUS|BoundedKempSmallMeanPoissonSampler|0.25|singleSample|62.12|19.95|59.56|0.32|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerAliasMethod|0.25|singleSample|456.29|198.98|424.38|0.44|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerBinarySearch|0.25|singleSample|95.36|24.08|95.64|0.25|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerGuideTable|0.25|singleSample|126.96|93.26|141.90|0.73|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerP50|0.25|singleSample|80.20|14.03|80.32|0.17|
|XO_RO_SHI_RO_128_PLUS|OriginalKempSmallMeanPoissonSampler|0.25|singleSample|58.78|0.42|58.74|0.01|
|XO_RO_SHI_RO_128_PLUS|SmallMeanPoissonSampler|0.25|singleSample|63.76|20.81|59.87|0.33|
|XO_RO_SHI_RO_128_PLUS|TinyMeanPoissonSampler|0.25|singleSample|61.90|17.50|59.88|0.28|

Here the Kemp sampler, either bounded to 1000 * mean or the original, is the best algorithm. The time is within 10% of {{Math.exp(-mean)}} + {{nextDouble()}} and includes the set-up cost of constructing the sampler, which validates the input mean.

Note: It is not quite as fast as the TinyMean sampler. The TinyMean sampler is very fast but suffers with a slower generator (data not shown) as it requires multiple random samples per deviate, so it is not a robust algorithm across generators. It is also not as numerically accurate since it uses truncation to compute the products of fractions with unsigned 32-bit arithmetic:
{noformat}
 Compute the fraction:
  r       [0,2^32)      2^32
 ----  *  --------   /  ----
 2^32       2^32        2^32

with r in [0,2^32)
{noformat}
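
A sketch of that truncated multiply, assuming both values are held as unsigned 32-bit fractions of 2^32 (illustrative; the TinyMean implementation may differ in detail):
{code:java}
/**
 * Multiplies two unsigned 32-bit fixed-point values, each representing value/2^32,
 * returning the product as a fraction of 2^32. The low 32 bits of the 64-bit
 * product are discarded, i.e. the result is truncated rather than rounded.
 */
static long multiplyFractions(long r, long p) {
    // r, p in [0, 2^32). The full product fits in 64 bits; the unsigned shift
    // keeps the high 32 bits even if the signed long product is negative.
    return (r * p) >>> 32;
}
{code}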
Note in the single-sample table above that the computation cost for the tabulated methods is high and in the order:
{noformat}
BinarySearch < Guide table < Alias method
{noformat}
There is no advantage to be gained from tabulation when the mean is low. Note 
that for a Poisson distribution of mean 1 the cumulative probability F(X<=1) is 
0.7358, F(X<=6) is 0.9999. So the table is very small and does not require an 
efficient search method.

The remaining use cases are:
||RNG||Mean||Number of samples||
|Fast|1 - 40|1|
|Fast|1 - 40|repeat|
|Slow|1 - 40|1|
|Slow|1 - 40|repeat|

It is to be expected that repeat sampling benefits from tabulating the cumulative probability, while single sampling should not pre-compute anything and may depend on the speed of the generator.

*Repeat Sampling, mean 1 - 40*

I've tested means from 2 - 64 in powers of 2. Here are the extremes in the 
appropriate range 1 - 40:
||randomSourceName||samplerType||mean||Method||Score||Error||Median||Error/Score||
|WELL_44497_B|BoundedKempSmallMeanPoissonSampler|2|sample|39.87|2.84|39.74|0.07|
|WELL_44497_B|KempSmallMeanPoissonSamplerAliasMethod|2|sample|33.87|1.78|33.67|0.05|
|WELL_44497_B|KempSmallMeanPoissonSamplerBinarySearch|2|sample|37.53|0.22|37.50|0.01|
|WELL_44497_B|KempSmallMeanPoissonSamplerGuideTable|2|sample|31.19|0.11|31.18|0.00|
|WELL_44497_B|KempSmallMeanPoissonSamplerP50|2|sample|33.91|0.17|33.88|0.01|
|WELL_44497_B|LargeMeanPoissonSampler|2|sample|216.71|68.81|208.71|0.32|
|WELL_44497_B|OriginalKempSmallMeanPoissonSampler|2|sample|40.67|12.78|39.20|0.31|
|WELL_44497_B|SmallMeanPoissonSampler|2|sample|75.00|15.24|73.31|0.20|
|WELL_44497_B|TinyMeanPoissonSampler|2|sample|41.79|0.23|41.80|0.01|
|XO_RO_SHI_RO_128_PLUS|BoundedKempSmallMeanPoissonSampler|2|sample|19.64|0.03|19.64|0.00|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerAliasMethod|2|sample|15.45|0.10|15.44|0.01|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerBinarySearch|2|sample|19.83|0.55|19.89|0.03|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerGuideTable|2|sample|16.91|5.25|16.30|0.31|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerP50|2|sample|19.22|5.36|19.59|0.28|
|XO_RO_SHI_RO_128_PLUS|LargeMeanPoissonSampler|2|sample|99.14|0.23|99.14|0.00|
|XO_RO_SHI_RO_128_PLUS|OriginalKempSmallMeanPoissonSampler|2|sample|20.53|0.11|20.52|0.01|
|XO_RO_SHI_RO_128_PLUS|SmallMeanPoissonSampler|2|sample|22.64|5.55|21.99|0.24|
|XO_RO_SHI_RO_128_PLUS|TinyMeanPoissonSampler|2|sample|17.81|5.30|17.20|0.30|
|WELL_44497_B|BoundedKempSmallMeanPoissonSampler|32|sample|222.68|0.09|222.68|0.00|
|WELL_44497_B|KempSmallMeanPoissonSamplerAliasMethod|32|sample|32.49|0.16|32.48|0.00|
|WELL_44497_B|KempSmallMeanPoissonSamplerBinarySearch|32|sample|51.95|14.52|50.70|0.28|
|WELL_44497_B|KempSmallMeanPoissonSamplerGuideTable|32|sample|27.39|0.11|27.37|0.00|
|WELL_44497_B|KempSmallMeanPoissonSamplerP50|32|sample|115.04|0.41|115.00|0.00|
|WELL_44497_B|LargeMeanPoissonSampler|32|sample|142.60|45.26|137.37|0.32|
|WELL_44497_B|OriginalKempSmallMeanPoissonSampler|32|sample|187.18|63.47|175.31|0.34|
|WELL_44497_B|SmallMeanPoissonSampler|32|sample|628.07|177.89|607.05|0.28|
|XO_RO_SHI_RO_128_PLUS|BoundedKempSmallMeanPoissonSampler|32|sample|215.19|62.99|207.90|0.29|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerAliasMethod|32|sample|16.22|0.23|16.20|0.01|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerBinarySearch|32|sample|33.17|10.01|32.14|0.30|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerGuideTable|32|sample|12.34|0.05|12.35|0.00|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerP50|32|sample|97.97|3.53|97.58|0.04|
|XO_RO_SHI_RO_128_PLUS|LargeMeanPoissonSampler|32|sample|64.46|0.22|64.47|0.00|
|XO_RO_SHI_RO_128_PLUS|OriginalKempSmallMeanPoissonSampler|32|sample|173.57|56.88|175.52|0.33|
|XO_RO_SHI_RO_128_PLUS|SmallMeanPoissonSampler|32|sample|94.20|0.19|94.18|0.00|

At a small mean of 2 the advanced methods are fast for repeat sampling. The guide table and Alias methods are best and outperform the method of choice for mean <1 (the Kemp sampler). They are about 25% faster than the current library choice, the SmallMean sampler.

At a large mean of 32 the advanced methods are *very* fast for repeat sampling. The guide table method is best and outperforms the method of choice for mean >40 (the LargeMean sampler) by about 5-fold, and the SmallMean sampler by more than 7-fold with a fast generator.

Note:

At a higher mean the BoundedKemp sampler is slower than the original Kemp method. The difference is that the bounded version terminates the loop at 1000 * mean, using a stored field to hold the limit, whereas the original terminates when the probability reaches 0, using a local variable within the method body. This indicates the original Kemp method is more robust to different sizes of mean.

*Single Sample, mean 1 - 40*
||randomSourceName||samplerType||mean||Method||Score||Error||Median||Error/Score||
|WELL_44497_B|BoundedKempSmallMeanPoissonSampler|2|singleSample|108.29|69.04|113.05|0.64|
|WELL_44497_B|KempSmallMeanPoissonSamplerAliasMethod|2|singleSample|525.66|376.40|515.74|0.72|
|WELL_44497_B|KempSmallMeanPoissonSamplerBinarySearch|2|singleSample|142.65|67.88|145.47|0.48|
|WELL_44497_B|KempSmallMeanPoissonSamplerGuideTable|2|singleSample|172.43|44.55|169.59|0.26|
|WELL_44497_B|KempSmallMeanPoissonSamplerP50|2|singleSample|119.02|51.82|122.14|0.44|
|WELL_44497_B|LargeMeanPoissonSampler|2|singleSample|522.55|203.89|499.21|0.39|
|WELL_44497_B|OriginalKempSmallMeanPoissonSampler|2|singleSample|98.28|31.79|103.76|0.32|
|WELL_44497_B|SmallMeanPoissonSampler|2|singleSample|134.25|42.85|126.99|0.32|
|WELL_44497_B|TinyMeanPoissonSampler|2|singleSample|92.69|1.60|92.77|0.02|
|XO_RO_SHI_RO_128_PLUS|BoundedKempSmallMeanPoissonSampler|2|singleSample|65.80|0.12|65.82|0.00|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerAliasMethod|2|singleSample|509.69|160.74|518.71|0.32|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerBinarySearch|2|singleSample|115.96|39.04|117.25|0.34|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerGuideTable|2|singleSample|126.69|73.05|130.58|0.58|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerP50|2|singleSample|88.62|46.91|84.66|0.53|
|XO_RO_SHI_RO_128_PLUS|LargeMeanPoissonSampler|2|singleSample|340.79|70.63|346.05|0.21|
|XO_RO_SHI_RO_128_PLUS|OriginalKempSmallMeanPoissonSampler|2|singleSample|72.77|22.76|70.13|0.31|
|XO_RO_SHI_RO_128_PLUS|SmallMeanPoissonSampler|2|singleSample|67.58|22.31|64.99|0.33|
|XO_RO_SHI_RO_128_PLUS|TinyMeanPoissonSampler|2|singleSample|66.17|0.17|66.18|0.00|
|WELL_44497_B|BoundedKempSmallMeanPoissonSampler|32|singleSample|277.38|6.57|277.50|0.02|
|WELL_44497_B|KempSmallMeanPoissonSamplerAliasMethod|32|singleSample|1,948.94|404.13|1,975.78|0.21|
|WELL_44497_B|KempSmallMeanPoissonSamplerBinarySearch|32|singleSample|544.13|325.30|515.00|0.60|
|WELL_44497_B|KempSmallMeanPoissonSamplerGuideTable|32|singleSample|623.56|374.97|566.25|0.60|
|WELL_44497_B|KempSmallMeanPoissonSamplerP50|32|singleSample|384.99|102.22|388.65|0.27|
|WELL_44497_B|LargeMeanPoissonSampler|32|singleSample|520.72|111.48|510.25|0.21|
|WELL_44497_B|OriginalKempSmallMeanPoissonSampler|32|singleSample|241.60|57.03|239.76|0.24|
|WELL_44497_B|SmallMeanPoissonSampler|32|singleSample|683.65|214.16|660.73|0.31|
|XO_RO_SHI_RO_128_PLUS|BoundedKempSmallMeanPoissonSampler|32|singleSample|246.12|78.70|236.85|0.32|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerAliasMethod|32|singleSample|1,707.91|1,459.88|1,676.27|0.85|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerBinarySearch|32|singleSample|514.01|295.10|552.04|0.57|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerGuideTable|32|singleSample|574.12|293.58|571.64|0.51|
|XO_RO_SHI_RO_128_PLUS|KempSmallMeanPoissonSamplerP50|32|singleSample|355.89|79.52|366.59|0.22|
|XO_RO_SHI_RO_128_PLUS|LargeMeanPoissonSampler|32|singleSample|434.10|113.69|427.25|0.26|
|XO_RO_SHI_RO_128_PLUS|OriginalKempSmallMeanPoissonSampler|32|singleSample|209.38|0.07|209.39|0.00|
|XO_RO_SHI_RO_128_PLUS|SmallMeanPoissonSampler|32|singleSample|146.22|36.94|148.36|0.25|

When the mean is small the alias method set-up cost is too high for it to be worthwhile. The set-up cost of the binary search and guide table methods is lower. They are within 2-fold of the best method, which is either the Kemp method (with a slow RNG) or the SmallMean method (with a fast RNG).

When the mean is high the tabulated methods are relatively more expensive. They are within 4-fold of the best method, which again is either the Kemp method (with a slow RNG) or the SmallMean method (with a fast RNG).

Using the time for taking a single sample (construction plus one sample) and the time for repeat samples (the cost per additional sample) we can find the number of samples {{x}} where the tabulated method [equals|https://en.wikipedia.org/wiki/Line%E2%80%93line_intersection#Given_the_equations_of_the_lines] the simple method:
{noformat}
ax + c = bx + d
x = (d - c) / (a - b){noformat}
Here {{c}} and {{a}} are the single-sample and per-sample times of the simple method, and {{d}} and {{b}} are those of the tabulated method:
||randomSourceName||Sampler||Mean||c||a||Sampler||d||b||x||
|WELL_44497_B|OriginalKempSmallMeanPoissonSampler|2|98.28|40.67|KempSmallMeanPoissonSamplerGuideTable|172.43|31.19|7.8217299578|
|WELL_44497_B|SmallMeanPoissonSampler|2|134.25|75|KempSmallMeanPoissonSamplerGuideTable|172.43|31.19|0.8714905273|
|XO_RO_SHI_RO_128_PLUS|OriginalKempSmallMeanPoissonSampler|2|72.77|20.53|KempSmallMeanPoissonSamplerGuideTable|126.69|16.91|14.8950276243|
|XO_RO_SHI_RO_128_PLUS|SmallMeanPoissonSampler|2|67.58|22.64|KempSmallMeanPoissonSamplerGuideTable|126.69|16.91|10.3158813264|
|WELL_44497_B|OriginalKempSmallMeanPoissonSampler|32|241.6|187.18|KempSmallMeanPoissonSamplerGuideTable|623.56|27.39|2.3903873834|
|WELL_44497_B|SmallMeanPoissonSampler|32|683.65|628.07|KempSmallMeanPoissonSamplerGuideTable|623.56|27.39|-0.1000366252|
|XO_RO_SHI_RO_128_PLUS|OriginalKempSmallMeanPoissonSampler|32|209.38|173.57|KempSmallMeanPoissonSamplerGuideTable|574.12|12.34|2.2622340755|
|XO_RO_SHI_RO_128_PLUS|SmallMeanPoissonSampler|32|146.22|94.2|KempSmallMeanPoissonSamplerGuideTable|574.12|12.34|5.2272172001|

Note: x is the number of *additional* samples after construction before a 
performance gain.
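
As a worked example, taking the XO_RO_SHI_RO_128_PLUS row for the OriginalKemp sampler versus the guide table at mean 2:
{noformat}
x = (d - c) / (a - b)
  = (126.69 - 72.77) / (20.53 - 16.91)
  = 53.92 / 3.62
  = 14.895
{noformat}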

So in the worst case of a mean of 2 and a fast generator the guide table would have to be used for about 15 additional samples (16 samples in total) before there is a performance gain.

When the mean is high the cost of tabulation is paid back after 3 samples.

*Conclusions*

For means <1 the Kemp method is the best. It does not suffer performance issues 
with a slow RNG. This should be used to replace the SmallMean sampler to 
compute the fractional mean sample in the LargeMean sampler.

For means >1 repeat sampling benefits from tabulating the cumulative 
probability. The examples here show that the guide table approach performs as 
well as the Alias approach but has lower set-up cost.

For a single sample it is best to use either the SmallMean sampler (for a fast RNG) or the Kemp sampler (for a slow RNG). When more samples are required, tabulation becomes optimal after between 3 samples (mean 32) and 16 samples (mean 2) for a mean in the range 1 - 40.

It should be noted that the guide table requires approximately 12 bytes of storage per entry in the cumulative probability table, whose length is capped at 2 * mean. So this sampler will require approximately 24 * mean bytes of storage. This may be undesirable as a default. So the guide table sampler could be added to the library as an option but it should not be used within the {{PoissonSampler}} or {{PoissonSamplerCache}} by default. These should stick to the current functionality of returning a {{SmallMeanPoissonSampler}} when the mean is <40 (and users should stick to a fast RNG).

 

*Recommendations*
 * Add the Kemp sampler to the library with notes detailing the relative performance versus the SmallMean sampler for fast / slow RNGs.
 * Update the Large mean sampler to use the Kemp sampler
 * Add the guide table Poisson sampler to the library with notes indicating 
performance benefits for repeat sampling at the cost of storage space.

 

*Note*

The Alias method and guide table approach are generic for any discrete probability distribution defined by a set of probabilities. Samplers for these methods can be added to the library. This would be similar to the current {{DiscreteProbabilityCollectionSampler}}, which samples from a collection. However, these new samplers would generate an index corresponding directly to a position in the input {{double[]}} probability array.

 

> Kemp small mean poisson sampler
> -------------------------------
>
>                 Key: RNG-91
>                 URL: https://issues.apache.org/jira/browse/RNG-91
>             Project: Commons RNG
>          Issue Type: New Feature
>          Components: sampling
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>             Fix For: 1.3
>
>         Attachments: kemp.jpg, poisson-samplers.jpg
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> The current algorithm for the {{SmallMeanPoissonSampler}} is used to generate 
> Poisson samples for any mean up to 40. The sampler requires approximately n 
> samples from a RNG to generate 1 Poisson deviate for a mean of n.
> The [Kemp (1981)|https://www.jstor.org/stable/2346348] algorithm requires 1 
> sample from the RNG and then accrues a cumulative probability using a 
> recurrence relation to compute each successive Poisson probability:
> {noformat}
> p(n+1) = p(n) * mean / (n+1)
> {noformat}
> The full algorithm is here:
> {code:java}
>     mean = ...;
>     final double p0 = Math.exp(-mean);
>     @Override
>     public int sample() {
>         double u = rng.nextDouble();
>         int x = 0;
>         double p = p0;
>         // The algorithm listed in Kemp (1981) does not check that the rolling
>         // probability is positive. This check is added to ensure no errors when
>         // the limit of the summation 1 - sum(p(x)) is above 0 due to cumulative
>         // error in floating point arithmetic.
>         while (u > p && p != 0) {
>             u -= p;
>             x++;
>             p = p * mean / x;
>         }
>         return x;
>     }
> {code}
> The limit for the sampler is set by the ability to compute p0. This is 
> approximately 744.440 when Math.exp(-mean) returns 0.
> A conservative limit of 700 sets an initial probability p0 of 
> 9.85967654375977E-305. When run through the summation series for the limit (u 
> initialised to 1) the result when the summation ends (p is zero) leaves u = 
> 3.335439283623915E-15. This is close to the expected tolerance for floating 
> point error (Note: 1 - Math.nextDown(1) = 1.1102230246251565E-16).
> Using a mean of 10 leaves u = 4.988586742717954E-17. So smaller means have a 
> similar error. The error in the cumulative sum is expected to result in 
> truncation of the long tail of the Poisson distribution (which should be 
> bounded at infinity).
> This sampler should outperform the current {{SmallMeanPoissonSampler}} as it 
> requires 1 uniform deviate per sample.
> Note that the {{SmallMeanPoissonSampler}} uses a limit for the mean of 
> Integer.MAX_VALUE / 2. This should be updated since it also relies on p0 
> being above zero.


