[
https://issues.apache.org/jira/browse/NUMBERS-174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440092#comment-17440092
]
Alex Herbert edited comment on NUMBERS-174 at 11/7/21, 10:13 PM:
-----------------------------------------------------------------
h2. Gamma P Derivative
This function is not in the gamma package. It is used in statistics for the
gamma distribution.
This function is defined as:
{noformat}
f(a, z) = z^a * exp(-z) / Gamma(a) / a
= z^(a - 1) * exp(-z) / Gamma(a){noformat}
It can be implemented with logs:
{noformat}
f(a, z) = exp(a * log(z) - z - logGamma(a) - log(z)){noformat}
The code in Commons Statistics attempts to compute this with high accuracy but
the choice of function can be incorrect (see STATISTICS-39).
I created derivative data using the same a and z values in the Boost datasets
for the incomplete gamma functions. The result was computed using maxima to 64
digits. The simple implementation using logs was tested against the function
from Commons Stats and the new Boost implementation.
||Data||Log|| ||Stats|| ||Boost|| ||
|| ||Max||RMS||Max||RMS||Max||RMS||
|GAMMA_P_DERIV_INT|221.051|44.0093|88.5116|8.13219|22.4857|6.10843|
|GAMMA_P_DERIV_SMALL|104.312|20.5646|452.647|67.9785|3.12638|0.901515|
|GAMMA_P_DERIV_MED|476.683|128.642|66.1587|3.14963|57.2834|14.5155|
|GAMMA_P_DERIV_BIG|1.01E+07|1.02E+06|555.477|62.9605|538.126|52.8767|
h3. Int Data
The use of logs is worse than the stats or Boost method. The Boost method has
much lower max error than the stats method but the RMS is similar.
h3. Medium Data
The use of logs is worse than the stats or Boost method. The Boost method has
lower max error than the stats method but the RMS is worse for Boost. This can
be traced to the following errors above 50 ulp:
{noformat}
stats:
GAMMA_P_DERIV_MED a=0.9759566783905029, z=0.009759566746652126: expected
<1.091103827394005722772134927295910273305494319535696726782190113> != actual
<1.091103827393991> (ulps=-66.15874662180883)
boost:
GAMMA_P_DERIV_MED a=79.829345703125, z=0.7982934713363647: expected
<2.057948567669980623962755091227670264152698891095754996007072457E-125> !=
actual <2.0579485676699656E-125> (ulps=-57.28338674936721)
GAMMA_P_DERIV_MED a=84.98836517333984, z=0.8498836755752563: expected
<1.584723140811257432756401992975844384590043887470012851849659925E-133> !=
actual <1.5847231408112464E-133> (ulps=-56.62359924370867)
{noformat}
So the Boost method has two occurrences where it fails to compute as
accurately. It has a different method for a > 1 using a scaled Lanczos
approximation.
If changed to the direct computation:
{noformat}
Math.pow(z, a - 1) * Math.exp(-z) / tgamma(a)
{noformat}
The errors above 50 ulp are eliminated:
{noformat}
GAMMA_P_DERIV_MED max 45.9872 RMS 11.8938
{noformat}
This fix does not work for all parameters. The method should not be changed
without further investigation of appropriate limits to switch computations.
h3. Small Data
Direct use of logs performs better on this data than the method from
statistics. This is data that demonstrates the incorrect choice of evaluation
method as discussed in STATISTICS-39. The Boost method has far superior
performance on this data as it correctly chooses to compute the direct formula
(Math.pow(z, a) * Math.exp(-z) / tgamma(a) / a) rather than using logs or
alternative computations with the Lanczos approximation.
h3. Big Data
Use of logs here is incorrect. Both the stats and Boost method are using an
appropriate rearrangement of the computation to avoid cancellation errors. The
stats method is based on the documentation for the Boost function. The
difference may be due to the use of the computation log1p(w) - w which has a
special method in Boost (log1pmx(w)) to compute this without loss of accuracy
when the argument w is close to 0.
h2. Overall
The method using logs is outperformed by other methods. This is the subject of
MATH-753. Using logs can result in cancellation for some parameters where
alternative computation is superior.
The method from statistics is successful in computing the result for large z.
It fails when z is small. The Boost method is robust across all data. It would
be recommended to expose it through a public API.
was (Author: alexherbert):
h2. Gamma P Derivative
This function is not in the gamma package. It is used in statistics for the
gamma distribution.
This function is defined as:
{noformat}
f(a, z) = z^a * exp(-z) / Gamma(a) / a
= z^(a - 1) * exp(-z) / Gamma(a){noformat}
It can be implemented with logs:
{noformat}
f(a, z) = exp(a * log(z) - z - logGamma(a) - log(z)){noformat}
The code in Commons Statistics attempts to compute this with high accuracy but
the choice of function can be incorrect (see STATISTICS-39).
I created derivative data using the same a and z values in the Boost datasets
for the incomplete gamma functions. The result was computed using maxima to 64
digits. The simple implementation using logs was tested against the function
from Commons Stats and the new Boost implementation.
||Data||Log|| ||Stats|| ||Boost|| ||
|| ||Max||RMS||Max||RMS||Max||RMS||
|GAMMA_P_DERIV_INT|221.051|44.0093|88.5116|8.13219|22.4857|6.10843|
|GAMMA_P_DERIV_SMALL|104.312|20.5646|452.647|67.9785|3.12638|0.901515|
|GAMMA_P_DERIV_MED|476.683|128.642|66.1587|3.14963|57.2834|14.5155|
|GAMMA_P_DERIV_BIG|1.01E+07|1.02E+06|555.477|62.9605|538.126|52.8767|
h3. Int Data
The use of logs is worse than the stats or Boost method. The Boost method has
much lower max error than the stats method but the RMS is similar.
h3. Medium Data
The use of logs is worse than the stats or Boost method. The Boost method has
lower max error than the stats method but the RMS is worse for Boost. This can
be traced to the following errors above 50 ulp:
{noformat}
stats:
GAMMA_P_DERIV_MED a=0.9759566783905029, z=0.009759566746652126: expected
<1.091103827394005722772134927295910273305494319535696726782190113> != actual
<1.091103827393991> (ulps=-66.15874662180883)
boost:
GAMMA_P_DERIV_MED a=79.829345703125, z=0.7982934713363647: expected
<2.057948567669980623962755091227670264152698891095754996007072457E-125> !=
actual <2.0579485676699656E-125> (ulps=-57.28338674936721)
GAMMA_P_DERIV_MED a=84.98836517333984, z=0.8498836755752563: expected
<1.584723140811257432756401992975844384590043887470012851849659925E-133> !=
actual <1.5847231408112464E-133> (ulps=-56.62359924370867)
{noformat}
So the Boost method has two occurrences where it fails to compute as
accurately. It has a different method when z < 1 which computes:
{noformat}
Math.pow(z, a) * Math.exp(-z) / tgamma(a) / a
{noformat}
If changed to:
{noformat}
Math.pow(z, a - 1) * Math.exp(-z) / tgamma(a)
{noformat}
The errors above 50 ulp are eliminated:
{noformat}
GAMMA_P_DERIV_MED max 45.9872 RMS 11.8938
{noformat}
This fix does not work for all parameters. The method should not be changed
without further investigation of appropriate limits to switch computations.
h3. Small Data
Direct use of logs performs better on this data than the method from
statistics. This is data that demonstrates the incorrect choice of evaluation
method as discussed in STATISTICS-39. The Boost method has far superior
performance on this data as it correctly chooses to compute the direct formula
(Math.pow(z, a) * Math.exp(-z) / tgamma(a) / a) rather than using logs or
alternative computations with the Lanczos approximation.
h3. Big Data
Use of logs here is incorrect. Both the stats and Boost method are using an
appropriate rearrangement of the computation to avoid cancellation errors. The
stats method is based on the documentation for the Boost function. The
difference may be due to the use of the computation log1p(w) - w which has a
special method in Boost (log1pmx(w)) to compute this without loss of accuracy
when the argument w is close to 0.
h2. Overall
The method using logs is outperformed by other methods. This is the subject of
MATH-753. Using logs can result in cancellation for some parameters where
alternative computation is superior.
The method from statistics is successful in computing the result for large z.
It fails when z is small. The Boost method is robust across all data. It would
be recommended to expose it through a public API.
> Update Gamma functions using the Boost implementation
> -----------------------------------------------------
>
> Key: NUMBERS-174
> URL: https://issues.apache.org/jira/browse/NUMBERS-174
> Project: Commons Numbers
> Issue Type: Improvement
> Components: gamma
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Assignee: Alex Herbert
> Priority: Major
> Fix For: 1.1
>
>
> The current regularised incomplete gamma functions for P and Q compute using
> a series representation for all input. This method is not robust to extreme
> arguments.
> The Gamma functions are used in Commons Statistics in the Gamma and Poisson
> distributions.
> Large values of the mean in the Poisson distribution have low
> precision for the CDF. The distribution also has a configurable
> threshold for convergence of the gamma function evaluation
> (STATISTICS-38). Ideally this implementation detail should be
> removed from the API.
> The Gamma distribution has a function switch based on a threshold to
> compute the PDF. This threshold results in incorrect values for
> certain parameters (STATISTICS-39).
> The function switch in the Gamma distribution is based on the
> documentation for the Boost gamma functions [1]. However it does not
> implement all the suggested optimisations detailed in the most recent
> Boost documentation. This includes avoiding using the converging
> series of the gamma function when the convergence is slow or unstable,
> i.e. addresses the need for a configurable convergence threshold in
> the Poisson distribution.
> One part of the evaluation of the incomplete gamma functions require
> accuracy in the leading term:
> {noformat}
> x^a exp(-x) / gamma(a) = exp(a log(x) - x - loggamma(a){noformat}
> When x and a are large then using logs to compute this leads to
> cancellation. Use of logs to compute this term is done in the current
> implementation in numbers for RegularizedGamma P and Q. It is the
> source of low precision for the CDF for the Poisson distribution when
> the mean is large.
> I propose to port the Boost gamma functions to numbers. This will be a
> process similar to the recent port of the error functions to the same
> package in numbers. These functions improved both accuracy and
> speed over the existing implementation. Once the gamma functions are
> ported a comparison can be made between the existing and new
> implementations.
> [1]
> [https://www.boost.org/doc/libs/1_77_0/libs/math/doc/html/math_toolkit/sf_gamma/igamma.html]
--
This message was sent by Atlassian Jira
(v8.20.1#820001)