[ 
https://issues.apache.org/jira/browse/STATISTICS-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638674#comment-17638674
 ] 

Alex Herbert commented on STATISTICS-61:
----------------------------------------

To investigate the difference between the two parameterizations I created the 
rate parameterization of the exponential distribution and a benchmark to:
 * Generate a random rate log-uniformly distributed in the interval [0.125, 8].
 * Generate random x uniformly within the quantile range [1e-9, 1 - 1e9]
 * Generate random p uniformly in the range [1e-9, 1 - 1e-9]  (note p is 
different from cdf( x ))
 * Compute the max and root mean square error (RMS) from samples using units of 
least precision (ULP)

The following are the results from 1000 distributions with 10000 samples each.
||Method||Max||RMS||Worst case (exp(mean) - exp(rate))||
|mean|0|0.00000|null|
|variance|2|0.809938|rate=1.6080718216142256 : 0.38671330676322496 - 
0.38671330676322485|
|pdf|34|6.52397|rate=3.725630847722884,x=5.1906412250568374 : 
1.4881055514427038E-8 - 1.4881055514427094E-8|
|logpdf|65536|50.5570|rate=2.6412145358824937,x=0.3677187341299205 : 
1.4797137377509095E-5 - 1.4797137377398073E-5|
|cdf|2|0.157186|rate=0.1502575428942362,x=1.66411665023078 : 
0.22123510242653516 - 0.2212351024265351|
|sf|32|6.65660|rate=0.1502575428942362,x=119.9539652107624 : 
1.486898892618959E-8 - 1.4868988926189644E-8|
|inverse cdf|1|0.521090|rate=0.1502575428942362,p=0.904577943262729 : 
15.636123680047765 - 15.636123680047767|
|inverse sf|1|0.520814|rate=0.1502575428942362,p=0.5555662207019987 : 
3.911733524329665 - 3.9117335243296654|

Note that in all cases the value of 1/rate is inexact (requires rounding). This 
can be tested using:
{code:java}
BigDecimal.valueOf(1 / rate).compareTo(new BigDecimal(1 / rate)) == 0{code}
The computations are generally very close with the exception of the survival 
function, PDF and in particular the log PDF. The worst case occurs when there 
is large cancellation in the log pdf computation as the x value is close to the 
mean, the density approaches 1 and the log density approaches zero:
{noformat}
logpdf(x) = -x / mean - log(mean)
          = -x * rate + log(rate)
x = 0.3677187341299205
rate = 2.6412145358824937
mean = 1 / rate

logpdf(x) = -0.9712240657002561 - -0.9712388628376336 = 1.4797137377509095E-5
{noformat}
Note that although the logpdf is different the pdf is the same at this point 
and the error is due to a single ulp difference between -log(mean) and 
log(rate):
{noformat}
exp(-x / mean - log(mean)) == exp(-x * rate + log(rate))
(x / mean - x * rate) ==> 0.0
(-log(mean) - log(rate)) / ulp(log(rate)) ==>  1.0{noformat}
I have tested the worst case scenarios using R and the values are the same as 
the exp(mean) computation (despite R using the rate parameterization leading me 
to believe that R inverts the rate):
{noformat}
> print(dexp(5.1906412250568374, rate=3.725630847722884), digits=17)
[1] 1.4881055514427038e-08
> print(dexp(0.3677187341299205, rate=2.6412145358824937, log=TRUE), digits=17)
[1] 1.4797137377509095e-05
> print(pexp(119.9539652107624, rate=0.1502575428942362, lower.tail=FALSE), 
> digits=17)
[1] 1.4868988926189591e-08
{noformat}
Using Mathematica the log PDF is not available. The results for the PDF and SF 
are:
{noformat}
NumberForm[
 PDF[ExponentialDistribution[
   3.725630847722884109174401601194404065608978271484375], 
  5.19064122505683744890347952605225145816802978515625], 20]

1.4881055514427066576e-8

NumberForm[
 SurvivalFunction[
  ExponentialDistribution[
   0.15025754289423620679855275739100761711597442626953125], 
  119.9539652107624050358936074189841747283935546875], 20]

1.4868988926189618127e-8{noformat}
Values were converted to the exact decimal representation using BigDecimal. 
Here the PDF is between the exp(rate) and exp(mean) versions but the SF is 
actually below both exp(mean) and exp(rate) version.

Error in 1/rate can be computed using BigDecimal to perform the inversion as:
{code:java}
@ParameterizedTest
@ValueSource(doubles = {3.725630847722884, 2.6412145358824937, 
0.1502575428942362})
void testReciprocal(double x) {
    double inv = 1 / x;
    BigDecimal binv = BigDecimal.ONE.divide(new BigDecimal(x), 
MathContext.DECIMAL128);
    double error = binv.subtract(new BigDecimal(inv))
                       .divide(new BigDecimal(Math.ulp(inv)), 
MathContext.DECIMAL128)
                       .doubleValue();
    System.out.printf("1 / %s = %s : ulp error %s%n", x, inv, error);
}{code}
With the error for the 3 worst cases as:
{noformat}
1 / 3.725630847722884 = 0.2684109190826576 : ulp error 0.07253105270286596
1 / 2.6412145358824937 = 0.378613696999769 : ulp error 0.4434044643835858
1 / 0.1502575428942362 = 6.655239934968745 : ulp error 0.1135090256596343
{noformat}
It may be no coincidence that the worst case error for the logpdf is when the 1 
/ rate computation has rounding of the result by almost 0.5 ulp.
h2. Notes

The two parameterizations of the exponential distribution can compute quite 
different results for the log density due to cancellation. When this value is 
used to compute the PDF with exp(logpdf(x)) the values are the same as 
cancellation in the log pdf occurs when the pdf is close to 1.

The other computations are similar and I did not find a conclusive argument 
that a rate parameterization is an improvement. The differences are simply due 
to propagation of rounding errors through the respective computations. Only the 
logpdf computation uses addition and can suffer cancellation of positive and 
negative terms. All other computations use multiplication and error propagation 
is small.

Note that the rate functions should be more accurate for the rate 
parameterization as there is no rounding error when computing mean = 1 / rate; 
they also exchange division for the faster multiplication operation 
(performance was not measured):
{noformat}
pdf(x) = Math.exp(-x * rate) * rate
       = Math.exp(-x / mean) / mean
cdf(x) = -Math.expm1(-x * rate)
       = -Math.expm1(-x / mean)
sf(x) = Math.exp(-x * rate)
      = Math.exp(-x / mean){noformat}
The survival function had a RMS of 6.657 in my test which is a relative error 
of 1.478e-15; the PDF was slightly less at 6.524 ulps or a relative error  of 
1.449e-15.

One situation where using the rate is absolutely required is when 1/rate is 
infinite. This is an extreme distribution and computations with it may not be 
of any practical use. For example consider the CDF of such a distribution at 
close to Double.MAX_VALUE:
{noformat}
rate = 1e-315
cdf(x) = 
-Math.expm1(-1e307 * rate)            ==> 9.99999993481684E-9
-Math.expm1(-Double.MAX_VALUE * rate) ==> 1.7976929705478288E-7{noformat}
h2. Conclusion

The benefits of the rate parameterization are marginal. However there do not 
appear to be negatives beyond the maintenance of two versions and possible user 
confusion in the situation where an instance of ExponentialDistribution can 
have the same mean but compute slightly different output.

Given the importance of the exponential distribution it may be worth adding 
this second parameterization to the library. The distribution can then be used 
directly with a rate parameter to compute the formulas. If included then it 
should be documented that the rate parameterization can compute different 
results than the mean parameterization using 1 / rate due to rounding error.

Any thoughts on this are welcome.

 

> Exponential distribution parameterized by the rate (1 / mean)
> -------------------------------------------------------------
>
>                 Key: STATISTICS-61
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-61
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: distribution
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Minor
>
> The exponential distribution is currently parameterized by the mean. This 
> matches the parameterization used by SciPy and Matlab.
> The distribution can also be parameterized by the rate. This is the 
> parameterization used by R and Mathematica.
> A distribution for the rate can be created as:
> {code:java}
> double rate = ...
> ExponentialDistribution dist = ExponentialDistribution.of(1 / rate);{code}
> The formulas for the two parameterization are simple rearrangements. In 
> certain cases direct use of the rate formula can have a large difference to 
> the mean formula.
> Investigate the option to provide a rate parameterization as:
> {noformat}
> ExponentialDistribution dist = ExponentialDistribution.ofRate(rate);{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to