[ 
https://issues.apache.org/jira/browse/STATISTICS-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433383#comment-17433383
 ] 

Alex Herbert commented on STATISTICS-35:
----------------------------------------

The PoissonSampler in RNG has the limit for the mean of 0.5 * Integer.MAX_VALUE.

When the mean is large (>1000) the normal distribution is a good approximation 
to the Poisson with or without a half correction (see [Wikipedia Poisson 
Distribution Related 
Distributions|https://en.wikipedia.org/wiki/Poisson_distribution#Related_distributions)]):
{noformat}
Poisson(mu) P(X <= x) = Normal(mu, sqrt(mu)) P(X <= x)
                      = Normal(mu, sqrt(mu)) P(X <= x+0.5)     Half-correction
{noformat}
At the limit of mean = 2^30 plotting the CDF of the Poisson against the Normal 
cannot distinguish them. This is a plot of the difference between the two when 
computed using matlab:

!normal_approx_delta.jpg!

The half correction has the lower error.

The maximum error=2.03e-06 at the mean. Here the PMF of the Poisson is 
p=1.212e-05. It would require approximately 1/p = 8.2137e+04 samples to see one 
sample at the mean. It would require approximately 1/error = 4.9282e+05 samples 
to see a difference of 1 in the sample empirical CDF at the point of the 
maximum difference. Combined this is 4.0479e+10 samples to observe a difference 
of 1 between the normal approximation and the Poisson at the point of maximum 
error on the condition the empirical distribution is perfect. However there 
will be randomness in sampling and any difference between a Poisson and 
Gaussian at this level of error will not be realistically observed.

I suggest returning a normal distribution sampler for large mean. Since the 
mean of the Normal distribution is a location parameter the half correction can 
be applied to the mean:
{noformat}
random Poisson(mu) == random Normal(mu + 0.5, sqrt(mu))
{noformat}
The float value from the normal sample can then be rounded down by casting to 
an integer. This will naturally clip the value to the integer max for any 
larger values. 

The result should be checked that it is not negative. At this size of mean this 
may be an unnecessary step. The Gaussian samplers in RNG have to have extreme 
random deviates to push the sample more than 10 standard deviations from the 
mean. The CDF for the standard normal for x=-10 is 7.6199e-24. It is unlikely 
the sampler will generate a sample more than the 32768 standard deviations from 
the mean which is what is required when the mean is 2^30. However for 
completeness the check against zero can be made. The implementation for a 
lambda function returning int value samples is:
{code:java}
final SharedStateContinuousSampler s =
    GaussianSampler.of(ZigguratSampler.NormalizedGaussian.of(rng),
                       mu + 0.5, Math.sqrt(mu));
return () -> {
    final double x = s.sample();
    return Math.max(0, (int) x);
};
{code}
 
This will support any finite mean up to the range of the maximum integer. Using 
a mean close to or beyond the maximum integer will work but the distribution 
will be clipped. It would not be recommended to use the distribution for such 
an extreme parameterisation. However the sampler will function within the 
limits imposed by a 32-bit signed integer result.


> PoissonDistribution cannot create a sampler with a mean above 2^30
> ------------------------------------------------------------------
>
>                 Key: STATISTICS-35
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-35
>             Project: Apache Commons Statistics
>          Issue Type: Bug
>          Components: distribution
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Minor
>         Attachments: normal_approx_delta.jpg
>
>
> The PoissonDistribution can be parameterised with any mean and yet the 
> PoissonSampler in RNG excludes a mean above 2^30 to avoid truncation. This 
> change was made to avoid distribution truncation and because the algorithm 
> uses (int) floor(mean) (see [RNG-52]).
> Commons RNG now has a LongSampler interface and it may be possible to create 
> a PoissonSampler that outputs a long with the same algorithm and avoid 
> truncation to int values. The final PoissonDistribution is still bounded by 
> int values but the sampler can be less restrictive if it uses long. At large 
> means the Poisson sampler mainly uses a Gaussian
> sampler and runtime performance should not be impacted.
> Investigate if the PoissonSampler algorithm can create samples from the 
> PoissonDistribution that passes a chi-square test when the mean is large. If 
> this is possible then implement this in RNG. Otherwise inverse transform 
> sampling will have to be used for large mean (with an anticipated large 
> performance impact).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to