[
https://issues.apache.org/jira/browse/STATISTICS-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638674#comment-17638674
]
Alex Herbert commented on STATISTICS-61:
----------------------------------------
To investigate the difference between the two parameterizations I created the
rate parameterization of the exponential distribution and a benchmark to:
* Generate a random rate log-uniformly distributed in the interval [0.125, 8].
* Generate random x uniformly within the quantile range [1e-9, 1 - 1e9]
* Generate random p uniformly in the range [1e-9, 1 - 1e-9] (note p is
different from cdf( x ))
* Compute the max and root mean square error (RMS) from samples using units of
least precision (ULP)
The following are the results from 1000 distributions with 10000 samples each.
||Method||Max||RMS||Worst case (exp(mean) - exp(rate))||
|mean|0|0.00000|null|
|variance|2|0.809938|rate=1.6080718216142256 : 0.38671330676322496 -
0.38671330676322485|
|pdf|34|6.52397|rate=3.725630847722884,x=5.1906412250568374 :
1.4881055514427038E-8 - 1.4881055514427094E-8|
|logpdf|65536|50.5570|rate=2.6412145358824937,x=0.3677187341299205 :
1.4797137377509095E-5 - 1.4797137377398073E-5|
|cdf|2|0.157186|rate=0.1502575428942362,x=1.66411665023078 :
0.22123510242653516 - 0.2212351024265351|
|sf|32|6.65660|rate=0.1502575428942362,x=119.9539652107624 :
1.486898892618959E-8 - 1.4868988926189644E-8|
|inverse cdf|1|0.521090|rate=0.1502575428942362,p=0.904577943262729 :
15.636123680047765 - 15.636123680047767|
|inverse sf|1|0.520814|rate=0.1502575428942362,p=0.5555662207019987 :
3.911733524329665 - 3.9117335243296654|
Note that in all cases the value of 1/rate is inexact (requires rounding). This
can be tested using:
{code:java}
BigDecimal.valueOf(1 / rate).compareTo(new BigDecimal(1 / rate)) == 0{code}
The computations are generally very close with the exception of the survival
function, PDF and in particular the log PDF. The worst case occurs when there
is large cancellation in the log pdf computation as the x value is close to the
mean, the density approaches 1 and the log density approaches zero:
{noformat}
logpdf(x) = -x / mean - log(mean)
= -x * rate + log(rate)
x = 0.3677187341299205
rate = 2.6412145358824937
mean = 1 / rate
logpdf(x) = -0.9712240657002561 - -0.9712388628376336 = 1.4797137377509095E-5
{noformat}
Note that although the logpdf is different the pdf is the same at this point
and the error is due to a single ulp difference between -log(mean) and
log(rate):
{noformat}
exp(-x / mean - log(mean)) == exp(-x * rate + log(rate))
(x / mean - x * rate) ==> 0.0
(-log(mean) - log(rate)) / ulp(log(rate)) ==> 1.0{noformat}
I have tested the worst case scenarios using R and the values are the same as
the exp(mean) computation (despite R using the rate parameterization leading me
to believe that R inverts the rate):
{noformat}
> print(dexp(5.1906412250568374, rate=3.725630847722884), digits=17)
[1] 1.4881055514427038e-08
> print(dexp(0.3677187341299205, rate=2.6412145358824937, log=TRUE), digits=17)
[1] 1.4797137377509095e-05
> print(pexp(119.9539652107624, rate=0.1502575428942362, lower.tail=FALSE),
> digits=17)
[1] 1.4868988926189591e-08
{noformat}
Using Mathematica the log PDF is not available. The results for the PDF and SF
are:
{noformat}
NumberForm[
PDF[ExponentialDistribution[
3.725630847722884109174401601194404065608978271484375],
5.19064122505683744890347952605225145816802978515625], 20]
1.4881055514427066576e-8
NumberForm[
SurvivalFunction[
ExponentialDistribution[
0.15025754289423620679855275739100761711597442626953125],
119.9539652107624050358936074189841747283935546875], 20]
1.4868988926189618127e-8{noformat}
Values were converted to the exact decimal representation using BigDecimal.
Here the PDF is between the exp(rate) and exp(mean) versions but the SF is
actually below both exp(mean) and exp(rate) version.
Error in 1/rate can be computed using BigDecimal to perform the inversion as:
{code:java}
@ParameterizedTest
@ValueSource(doubles = {3.725630847722884, 2.6412145358824937,
0.1502575428942362})
void testReciprocal(double x) {
double inv = 1 / x;
BigDecimal binv = BigDecimal.ONE.divide(new BigDecimal(x),
MathContext.DECIMAL128);
double error = binv.subtract(new BigDecimal(inv))
.divide(new BigDecimal(Math.ulp(inv)),
MathContext.DECIMAL128)
.doubleValue();
System.out.printf("1 / %s = %s : ulp error %s%n", x, inv, error);
}{code}
With the error for the 3 worst cases as:
{noformat}
1 / 3.725630847722884 = 0.2684109190826576 : ulp error 0.07253105270286596
1 / 2.6412145358824937 = 0.378613696999769 : ulp error 0.4434044643835858
1 / 0.1502575428942362 = 6.655239934968745 : ulp error 0.1135090256596343
{noformat}
It may be no coincidence that the worst case error for the logpdf is when the 1
/ rate computation has rounding of the result by almost 0.5 ulp.
h2. Notes
The two parameterizations of the exponential distribution can compute quite
different results for the log density due to cancellation. When this value is
used to compute the PDF with exp(logpdf(x)) the values are the same as
cancellation in the log pdf occurs when the pdf is close to 1.
The other computations are similar and I did not find a conclusive argument
that a rate parameterization is an improvement. The differences are simply due
to propagation of rounding errors through the respective computations. Only the
logpdf computation uses addition and can suffer cancellation of positive and
negative terms. All other computations use multiplication and error propagation
is small.
Note that the rate functions should be more accurate for the rate
parameterization as there is no rounding error when computing mean = 1 / rate;
they also exchange division for the faster multiplication operation
(performance was not measured):
{noformat}
pdf(x) = Math.exp(-x * rate) * rate
= Math.exp(-x / mean) / mean
cdf(x) = -Math.expm1(-x * rate)
= -Math.expm1(-x / mean)
sf(x) = Math.exp(-x * rate)
= Math.exp(-x / mean){noformat}
The survival function had a RMS of 6.657 in my test which is a relative error
of 1.478e-15; the PDF was slightly less at 6.524 ulps or a relative error of
1.449e-15.
One situation where using the rate is absolutely required is when 1/rate is
infinite. This is an extreme distribution and computations with it may not be
of any practical use. For example consider the CDF of such a distribution at
close to Double.MAX_VALUE:
{noformat}
rate = 1e-315
cdf(x) =
-Math.expm1(-1e307 * rate) ==> 9.99999993481684E-9
-Math.expm1(-Double.MAX_VALUE * rate) ==> 1.7976929705478288E-7{noformat}
h2. Conclusion
The benefits of the rate parameterization are marginal. However there do not
appear to be negatives beyond the maintenance of two versions and possible user
confusion in the situation where an instance of ExponentialDistribution can
have the same mean but compute slightly different output.
Given the importance of the exponential distribution it may be worth adding
this second parameterization to the library. The distribution can then be used
directly with a rate parameter to compute the formulas. If included then it
should be documented that the rate parameterization can compute different
results than the mean parameterization using 1 / rate due to rounding error.
Any thoughts on this are welcome.
> Exponential distribution parameterized by the rate (1 / mean)
> -------------------------------------------------------------
>
> Key: STATISTICS-61
> URL: https://issues.apache.org/jira/browse/STATISTICS-61
> Project: Commons Statistics
> Issue Type: New Feature
> Components: distribution
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Priority: Minor
>
> The exponential distribution is currently parameterized by the mean. This
> matches the parameterization used by SciPy and Matlab.
> The distribution can also be parameterized by the rate. This is the
> parameterization used by R and Mathematica.
> A distribution for the rate can be created as:
> {code:java}
> double rate = ...
> ExponentialDistribution dist = ExponentialDistribution.of(1 / rate);{code}
> The formulas for the two parameterization are simple rearrangements. In
> certain cases direct use of the rate formula can have a large difference to
> the mean formula.
> Investigate the option to provide a rate parameterization as:
> {noformat}
> ExponentialDistribution dist = ExponentialDistribution.ofRate(rate);{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)