[
https://issues.apache.org/jira/browse/STATISTICS-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433383#comment-17433383
]
Alex Herbert commented on STATISTICS-35:
----------------------------------------
The PoissonSampler in RNG has the limit for the mean of 0.5 * Integer.MAX_VALUE.
When the mean is large (>1000) the normal distribution is a good approximation
to the Poisson with or without a half correction (see [Wikipedia Poisson
Distribution Related
Distributions|https://en.wikipedia.org/wiki/Poisson_distribution#Related_distributions)]):
{noformat}
Poisson(mu) P(X <= x) = Normal(mu, sqrt(mu)) P(X <= x)
= Normal(mu, sqrt(mu)) P(X <= x+0.5) Half-correction
{noformat}
At the limit of mean = 2^30 plotting the CDF of the Poisson against the Normal
cannot distinguish them. This is a plot of the difference between the two when
computed using matlab:
!normal_approx_delta.jpg!
The half correction has the lower error.
The maximum error=2.03e-06 at the mean. Here the PMF of the Poisson is
p=1.212e-05. It would require approximately 1/p = 8.2137e+04 samples to see one
sample at the mean. It would require approximately 1/error = 4.9282e+05 samples
to see a difference of 1 in the sample empirical CDF at the point of the
maximum difference. Combined this is 4.0479e+10 samples to observe a difference
of 1 between the normal approximation and the Poisson at the point of maximum
error on the condition the empirical distribution is perfect. However there
will be randomness in sampling and any difference between a Poisson and
Gaussian at this level of error will not be realistically observed.
I suggest returning a normal distribution sampler for large mean. Since the
mean of the Normal distribution is a location parameter the half correction can
be applied to the mean:
{noformat}
random Poisson(mu) == random Normal(mu + 0.5, sqrt(mu))
{noformat}
The float value from the normal sample can then be rounded down by casting to
an integer. This will naturally clip the value to the integer max for any
larger values.
The result should be checked that it is not negative. At this size of mean this
may be an unnecessary step. The Gaussian samplers in RNG have to have extreme
random deviates to push the sample more than 10 standard deviations from the
mean. The CDF for the standard normal for x=-10 is 7.6199e-24. It is unlikely
the sampler will generate a sample more than the 32768 standard deviations from
the mean which is what is required when the mean is 2^30. However for
completeness the check against zero can be made. The implementation for a
lambda function returning int value samples is:
{code:java}
final SharedStateContinuousSampler s =
GaussianSampler.of(ZigguratSampler.NormalizedGaussian.of(rng),
mu + 0.5, Math.sqrt(mu));
return () -> {
final double x = s.sample();
return Math.max(0, (int) x);
};
{code}
This will support any finite mean up to the range of the maximum integer. Using
a mean close to or beyond the maximum integer will work but the distribution
will be clipped. It would not be recommended to use the distribution for such
an extreme parameterisation. However the sampler will function within the
limits imposed by a 32-bit signed integer result.
> PoissonDistribution cannot create a sampler with a mean above 2^30
> ------------------------------------------------------------------
>
> Key: STATISTICS-35
> URL: https://issues.apache.org/jira/browse/STATISTICS-35
> Project: Apache Commons Statistics
> Issue Type: Bug
> Components: distribution
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Priority: Minor
> Attachments: normal_approx_delta.jpg
>
>
> The PoissonDistribution can be parameterised with any mean and yet the
> PoissonSampler in RNG excludes a mean above 2^30 to avoid truncation. This
> change was made to avoid distribution truncation and because the algorithm
> uses (int) floor(mean) (see [RNG-52]).
> Commons RNG now has a LongSampler interface and it may be possible to create
> a PoissonSampler that outputs a long with the same algorithm and avoid
> truncation to int values. The final PoissonDistribution is still bounded by
> int values but the sampler can be less restrictive if it uses long. At large
> means the Poisson sampler mainly uses a Gaussian
> sampler and runtime performance should not be impacted.
> Investigate if the PoissonSampler algorithm can create samples from the
> PoissonDistribution that passes a chi-square test when the mean is large. If
> this is possible then implement this in RNG. Otherwise inverse transform
> sampling will have to be used for large mean (with an anticipated large
> performance impact).
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)