[
https://issues.apache.org/jira/browse/STATISTICS-46?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457391#comment-17457391
]
Alex Herbert commented on STATISTICS-46:
----------------------------------------
I've investigated computation of the mean and variance for the truncated normal
distribution. The paper listed in the header does not contain a suitable method
that will work for very small variances. It computes the variance of a narrow
truncation by computing the variance above and below the truncation and
generating the missing variance by subtraction from 1 using an expression for
the variance of a mixture.
The wikipedia page lists a reference for alternative computation of the moments:
[Truncated Normal by Jorge
Fernandez-de-Cossio-Diaz|https://github.com/cossio/TruncatedNormal.jl]
This is implemented in Julia but freely translatable under the MIT license.
This exploits a rearrangement of the formulas for the truncated normal
distribution on wikipedia using the erfcx function:
{noformat}
erfcx(z) = erfc(z) * exp(z^2)
{noformat}
Given the PDF and CDF of the normal distribution use the exponential and error
function the following rearrangement can be used in the computation of moments:
{noformat}
exp(-0.5*x*x) 1
----------------- = ------------------
erfc(x / sqrt(2)) erfcx(x / sqrt(2))
{noformat}
The implementation provides formulas for the first moment (mean) and second
moment from which the variance can be obtained. I have coded this function
using the erfcx function added in NUMBERS-177. It works very well for the
current unit test data. Since the erfcx function is computable with a very
large argument it can compute the mean and variance when an implementation
using the exponential and erfc functions fails as these compute 0. This
provides support for the moments up to the limit of +/- 40 standard deviations
from the mean. Beyond this the truncated normal distribution cannot compute a
probability as it spans no range of the parent normal distribution. In this
case the truncation is invalid and the constructor can throw an exception.
In all tests I have tried the computation of the mean is stable. The variance
still has some issues.
There is one unit test that does not work for the variance. It is an extreme
truncation around the mean. The truncation results in the bound on the parent
distribution as [-1.0101010039621505E-8, 1.0101010129336497E-8]. This covers a
range of p=8.05e-9 of the standard normal distribution. Here the bounds [a, b]
are very close to the mean. So close that the error function, and the normal
distribution CDF, is effectively linear. The moments are computed using this
component:
{noformat}
0.5 * erf(a / sqrt(2)) - a * exp(-0.5*a*a) / sqrt(2 pi){noformat}
Since the exp component is the gradient of the error function, when a is small
and erf is linear this results in two values that are equal, one subtracted
from the other. Here is a plot of the ulp difference between the error function
and its gradient multiplied by the argument plotted using a log 2 scale on the
x axis:
!erf_minus_x_derf_dx.png!
Below around 5e-8 the terms cancel. The exact point is around the value where
exp(-0.5*x*x) remains 1, i.e. the gradient is constant. This value is about
1.82e-8 but differs on different JDKs due to variations in the exp function.
The current implementation of the error function uses a linear function when
the argument is below 1e-10. So this sets a limit as sqrt(2) * 1e-10, which is
lower than a limit imposed by accuracy of the exp function.
An argument could be made that for such small truncations the distribution can
be approximated as a uniform distribution. If this assumption is made then
computation of the moments can use the uniform distribution as an
approximation. This fixes the test case of truncation close to the mean with an
appropriate variance. I tried a few thresholds for switching to a uniform
approximation from 1.5e-8 to 1e-6. None provide a smooth transition from
computing the variance as a truncated normal and approximating it as a uniform
distribution. This is due to there being two values for which the same
computation is made (a and b). If the limit requires the larger magnitude of a
or b to be below it then there are some truncations where the larger magnitude
is above the threshold but the smaller magnitude can be below it and subject to
cancellation. The variance is then computed as if the smaller term is zero but
only for the numerator of the fraction P/Q leading to under or over estimation
of the second moment depending on whether the bounds [a, b] span the mean of
zero or are entirely above or below it.
The current computation appears stable until the true variance approaches the
machine epsilon of 2^-52. The computation uses a rational function P/Q to
express the second moment. When P is less than 1 ulp of Q then the second
moment cannot be computed. The result is a variance that degrades in accuracy
as it approaches 2.22e-16. Eventually the variance cannot be computed and is
returned as 0.
This effect also occurs when the truncation covers a small range that is not
close to the mean, for example [10, 10.00000001], variance=8.33e-18 (computed
using a 128-bit quad precision implementation). In this case a uniform
distribution is not a good approximation as the CDF is not effectively flat.
Also note that a uniform distribution is general is not a good approximation as
it will have the same variance irrespective of distance from the mean:
||a||b||tnvar||approximation||
|a|b|-|(b-a)^2 / 12|
|1|2|0.0727|1/12 (0.0833)|
|10|11|0.00942|1/12|
|100|101|9.99e-5|1/12|
In testing the current implementation is far superior to direct implementation
of the formulas provided on the Wikipedia page. The previous version could not
compute variances far from the mean such as the truncations in the table above.
I propose to update the implementation to use the formulas provided by
Fernandez-de-Cossio-Diaz. The working code has been annotated to indicate that
the variance computation suffers from cancellation when the true variance
approaches machine epsilon. In practice this will be of very little consequence
as (1) the variance is not used for any internal functionality; and (2) the
truncation required covers a very small range of the parent normal distribution
and it will not be a useful truncation to generate probabilities. If a user
questions why the variance is 0 then they can investigate their truncation and
realise that it is a very small range of the original normal distribution and
is most likely unsuitable to provide the probability data they require.
I propose to also update the constructor to throw an exception if the
distribution does not cover an computable range of the parent normal
distribution, e.g:
{code:java}
double mean = 0;
double sd = 1;
double lower = 40;
double upper = 1000;
// +40 standard deviations will result in an invalid truncation.
TruncatedNormalDistribution dist = TruncatedNormalDistribution.of(mean, sd,
lower, upper);{code}
If the constructor does not throw for this condition then probability
computations will generate NaNs due to zero divide by zero.
> TruncatedNormalDistribution
> ---------------------------
>
> Key: STATISTICS-46
> URL: https://issues.apache.org/jira/browse/STATISTICS-46
> Project: Apache Commons Statistics
> Issue Type: Improvement
> Components: distribution
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Priority: Minor
> Attachments: erf_minus_x_derf_dx.png
>
>
> The TruncatedNormalDistribution computes the moments using the formulas on
> the Wikipedia page:
> [https://en.wikipedia.org/wiki/Truncated_normal_distribution#Moments]
> For the two sided truncation this computation is unstable. This is noted for
> the case when the bound [a, b] is far from the mean. It also effects a
> current test case for the following:
> {noformat}
> mean=7.1, sd=9.9, a=7.0999999, b=7.1000001
> {noformat}
> The implementation should be improved. I found one method based on the paper:
> {noformat}
> Foulley JL. A completion simulator for the two-sided truncated
> normal distribution. Genetics, selection, evolution 2000
> Nov-Dec;32(6): p. 631-635.
> {noformat}
> When the bound [a, b] is far from the mean the range is in the long flat tail
> of the normal distribution. Here the moments can be computed as if the
> distribution is a continuous uniform distribution.
> Note that the mean and variance are not used by the implementation. Improving
> the computation will simply provide a better result if the moments are
> desired.
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)