[ 
https://issues.apache.org/jira/browse/STATISTICS-46?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457391#comment-17457391
 ] 

Alex Herbert commented on STATISTICS-46:
----------------------------------------

I've investigated computation of the mean and variance for the truncated normal 
distribution. The paper listed in the header does not contain a suitable method 
that will work for very small variances. It computes the variance of a narrow 
truncation by computing the variance above and below the truncation and 
generating the missing variance by subtraction from 1 using an expression for 
the variance of a mixture.

The wikipedia page lists a reference for alternative computation of the moments:

[Truncated Normal by Jorge 
Fernandez-de-Cossio-Diaz|https://github.com/cossio/TruncatedNormal.jl]

This is implemented in Julia but freely translatable under the MIT license. 
This exploits a rearrangement of the formulas for the truncated normal 
distribution on wikipedia using the erfcx function:
{noformat}
erfcx(z) = erfc(z) * exp(z^2)
{noformat}
Given the PDF and CDF of the normal distribution use the exponential and error 
function the following rearrangement can be used in the computation of moments:
{noformat}
  exp(-0.5*x*x)                   1
-----------------      =  ------------------
erfc(x / sqrt(2))         erfcx(x / sqrt(2))
{noformat}
The implementation provides formulas for the first moment (mean) and second 
moment from which the variance can be obtained. I have coded this function 
using the erfcx function added in NUMBERS-177. It works very well for the 
current unit test data. Since the erfcx function is computable with a very 
large argument it can compute the mean and variance when an implementation 
using the exponential and erfc functions fails as these compute 0. This 
provides support for the moments up to the limit of +/- 40 standard deviations 
from the mean. Beyond this the truncated normal distribution cannot compute a 
probability as it spans no range of the parent normal distribution. In this 
case the truncation is invalid and the constructor can throw an exception.

In all tests I have tried the computation of the mean is stable. The variance 
still has some issues.

There is one unit test that does not work for the variance. It is an extreme 
truncation around the mean. The truncation results in the bound on the parent 
distribution as [-1.0101010039621505E-8, 1.0101010129336497E-8]. This covers a 
range of p=8.05e-9 of the standard normal distribution. Here the bounds [a, b] 
are very close to the mean. So close that the error function, and the normal 
distribution CDF, is effectively linear. The moments are computed using this 
component:
{noformat}
0.5 * erf(a / sqrt(2)) - a * exp(-0.5*a*a) / sqrt(2 pi){noformat}
Since the exp component is the gradient of the error function, when a is small 
and erf is linear this results in two values that are equal, one subtracted 
from the other. Here is a plot of the ulp difference between the error function 
and its gradient multiplied by the argument plotted using a log 2 scale on the 
x axis:

!erf_minus_x_derf_dx.png!

Below around 5e-8 the terms cancel. The exact point is around the value where 
exp(-0.5*x*x) remains 1, i.e. the gradient is constant. This value is about 
1.82e-8 but differs on different JDKs due to variations in the exp function. 
The current implementation of the error function uses a linear function when 
the argument is below 1e-10. So this sets a limit as sqrt(2) * 1e-10, which is 
lower than a limit imposed by accuracy of the exp function.

An argument could be made that for such small truncations the distribution can 
be approximated as a uniform distribution. If this assumption is made then 
computation of the moments can use the uniform distribution as an 
approximation. This fixes the test case of truncation close to the mean with an 
appropriate variance. I tried a few thresholds for switching to a uniform 
approximation from 1.5e-8 to 1e-6. None provide a smooth transition from 
computing the variance as a truncated normal and approximating it as a uniform 
distribution. This is due to there being two values for which the same 
computation is made (a and b). If the limit requires the larger magnitude of a 
or b to be below it then there are some truncations where the larger magnitude 
is above the threshold but the smaller magnitude can be below it and subject to 
cancellation. The variance is then computed as if the smaller term is zero but 
only for the numerator of the fraction P/Q leading to under or over estimation 
of the second moment depending on whether the bounds [a, b] span the mean of 
zero or are entirely above or below it.

The current computation appears stable until the true variance approaches the 
machine epsilon of 2^-52. The computation uses a rational function P/Q to 
express the second moment. When P is less than 1 ulp of Q then the second 
moment cannot be computed. The result is a variance that degrades in accuracy 
as it approaches 2.22e-16. Eventually the variance cannot be computed and is 
returned as 0.

This effect also occurs when the truncation covers a small range that is not 
close to the mean, for example [10, 10.00000001], variance=8.33e-18 (computed 
using a 128-bit quad precision implementation). In this case a uniform 
distribution is not a good approximation as the CDF is not effectively flat. 
Also note that a uniform distribution is general is not a good approximation as 
it will have the same variance irrespective of distance from the mean:
||a||b||tnvar||approximation||
|a|b|-|(b-a)^2 / 12|
|1|2|0.0727|1/12 (0.0833)|
|10|11|0.00942|1/12|
|100|101|9.99e-5|1/12|

In testing the current implementation is far superior to direct implementation 
of the formulas provided on the Wikipedia page. The previous version could not 
compute variances far from the mean such as the truncations in the table above.

I propose to update the implementation to use the formulas provided by 
Fernandez-de-Cossio-Diaz. The working code has been annotated to indicate that 
the variance computation suffers from cancellation when the true variance 
approaches machine epsilon. In practice this will be of very little consequence 
as (1) the variance is not used for any internal functionality; and (2) the 
truncation required covers a very small range of the parent normal distribution 
and it will not be a useful truncation to generate probabilities. If a user 
questions why the variance is 0 then they can investigate their truncation and 
realise that it is a very small range of the original normal distribution and 
is most likely unsuitable to provide the probability data they require.

I propose to also update the constructor to throw an exception if the 
distribution does not cover an computable range of the parent normal 
distribution, e.g:
{code:java}
double mean = 0;
double sd = 1;
double lower = 40;
double upper = 1000;
// +40 standard deviations will result in an invalid truncation.
TruncatedNormalDistribution dist = TruncatedNormalDistribution.of(mean, sd, 
lower, upper);{code}
If the constructor does not throw for this condition then probability 
computations will generate NaNs due to zero divide by zero.

> TruncatedNormalDistribution
> ---------------------------
>
>                 Key: STATISTICS-46
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-46
>             Project: Apache Commons Statistics
>          Issue Type: Improvement
>          Components: distribution
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Minor
>         Attachments: erf_minus_x_derf_dx.png
>
>
> The TruncatedNormalDistribution computes the moments using the formulas on 
> the Wikipedia page:
> [https://en.wikipedia.org/wiki/Truncated_normal_distribution#Moments]
> For the two sided truncation this computation is unstable. This is noted for 
> the case when the bound [a, b] is far from the mean. It also effects a 
> current test case for the following:
> {noformat}
> mean=7.1, sd=9.9, a=7.0999999, b=7.1000001
> {noformat}
> The implementation should be improved. I found one method based on the paper:
> {noformat}
> Foulley JL. A completion simulator for the two-sided truncated
> normal distribution. Genetics, selection, evolution 2000
> Nov-Dec;32(6): p. 631-635.
> {noformat}
> When the bound [a, b] is far from the mean the range is in the long flat tail 
> of the normal distribution. Here the moments can be computed as if the 
> distribution is a continuous uniform distribution.
> Note that the mean and variance are not used by the implementation. Improving 
> the computation will simply provide a better result if the moments are 
> desired.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to