[jira] [Comment Edited] (NUMBERS-174) Update Gamma functions using the Boost implementation

Alex Herbert (Jira) Sun, 07 Nov 2021 14:14:08 -0800


    [ 
https://issues.apache.org/jira/browse/NUMBERS-174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440092#comment-17440092
 ]


Alex Herbert edited comment on NUMBERS-174 at 11/7/21, 10:13 PM:
-----------------------------------------------------------------

h2. Gamma P Derivative

This function is not in the gamma package. It is used in statistics for the 
gamma distribution.

This function is defined as:
{noformat}
f(a, z) = z^a * exp(-z) / Gamma(a) / a
        = z^(a - 1) * exp(-z) / Gamma(a){noformat}
It can be implemented with logs:
{noformat}
f(a, z) = exp(a * log(z) - z - logGamma(a) - log(z)){noformat}
The code in Commons Statistics attempts to compute this with high accuracy but 
the choice of function can be incorrect (see STATISTICS-39).

I created derivative data using the same a and z values in the Boost datasets 
for the incomplete gamma functions. The result was computed using maxima to 64 
digits. The simple implementation using logs was tested against the function 
from Commons Stats and the new Boost implementation.
||Data||Log|| ||Stats|| ||Boost|| ||
|| ||Max||RMS||Max||RMS||Max||RMS||
|GAMMA_P_DERIV_INT|221.051|44.0093|88.5116|8.13219|22.4857|6.10843|
|GAMMA_P_DERIV_SMALL|104.312|20.5646|452.647|67.9785|3.12638|0.901515|
|GAMMA_P_DERIV_MED|476.683|128.642|66.1587|3.14963|57.2834|14.5155|
|GAMMA_P_DERIV_BIG|1.01E+07|1.02E+06|555.477|62.9605|538.126|52.8767|
h3. Int Data

The use of logs is worse than the stats or Boost method. The Boost method has 
much lower max error than the stats method but the RMS is similar.
h3. Medium Data

The use of logs is worse than the stats or Boost method. The Boost method has 
lower max error than the stats method but the RMS is worse for Boost. This can 
be traced to the following errors above 50 ulp:
{noformat}
stats:
GAMMA_P_DERIV_MED a=0.9759566783905029, z=0.009759566746652126: expected 
<1.091103827394005722772134927295910273305494319535696726782190113> != actual 
<1.091103827393991> (ulps=-66.15874662180883)

boost:
GAMMA_P_DERIV_MED a=79.829345703125, z=0.7982934713363647: expected 
<2.057948567669980623962755091227670264152698891095754996007072457E-125> != 
actual <2.0579485676699656E-125> (ulps=-57.28338674936721)
GAMMA_P_DERIV_MED a=84.98836517333984, z=0.8498836755752563: expected 
<1.584723140811257432756401992975844384590043887470012851849659925E-133> != 
actual <1.5847231408112464E-133> (ulps=-56.62359924370867)
{noformat}
So the Boost method has two occurrences where it fails to compute as 
accurately. It has a different method for a > 1 using a scaled Lanczos 
approximation.
If changed to the direct computation:
{noformat}
Math.pow(z, a - 1) * Math.exp(-z) / tgamma(a)
{noformat}
The errors above 50 ulp are eliminated:
{noformat}
GAMMA_P_DERIV_MED           max    45.9872   RMS    11.8938
{noformat}
This fix does not work for all parameters. The method should not be changed 
without further investigation of appropriate limits to switch computations.
h3. Small Data

Direct use of logs performs better on this data than the method from 
statistics. This is data that demonstrates the incorrect choice of evaluation 
method as discussed in STATISTICS-39. The Boost method has far superior 
performance on this data as it correctly chooses to compute the direct formula 
(Math.pow(z, a) * Math.exp(-z) / tgamma(a) / a) rather than using logs or 
alternative computations with the Lanczos approximation.
h3. Big Data

Use of logs here is incorrect. Both the stats and Boost method are using an 
appropriate rearrangement of the computation to avoid cancellation errors. The 
stats method is based on the documentation for the Boost function. The 
difference may be due to the use of the computation log1p(w) - w which has a 
special method in Boost (log1pmx(w)) to compute this without loss of accuracy 
when the argument w is close to 0.
h2. Overall

The method using logs is outperformed by other methods. This is the subject of 
MATH-753. Using logs can result in cancellation for some parameters where 
alternative computation is superior.

The method from statistics is successful in computing the result for large z. 
It fails when z is small. The Boost method is robust across all data. It would 
be recommended to expose it through a public API.


was (Author: alexherbert):
h2. Gamma P Derivative

This function is not in the gamma package. It is used in statistics for the 
gamma distribution.

This function is defined as:
{noformat}
f(a, z) = z^a * exp(-z) / Gamma(a) / a
        = z^(a - 1) * exp(-z) / Gamma(a){noformat}
It can be implemented with logs:
{noformat}
f(a, z) = exp(a * log(z) - z - logGamma(a) - log(z)){noformat}
The code in Commons Statistics attempts to compute this with high accuracy but 
the choice of function can be incorrect (see STATISTICS-39).

I created derivative data using the same a and z values in the Boost datasets 
for the incomplete gamma functions. The result was computed using maxima to 64 
digits. The simple implementation using logs was tested against the function 
from Commons Stats and the new Boost implementation.
||Data||Log|| ||Stats|| ||Boost|| ||
|| ||Max||RMS||Max||RMS||Max||RMS||
|GAMMA_P_DERIV_INT|221.051|44.0093|88.5116|8.13219|22.4857|6.10843|
|GAMMA_P_DERIV_SMALL|104.312|20.5646|452.647|67.9785|3.12638|0.901515|
|GAMMA_P_DERIV_MED|476.683|128.642|66.1587|3.14963|57.2834|14.5155|
|GAMMA_P_DERIV_BIG|1.01E+07|1.02E+06|555.477|62.9605|538.126|52.8767|
h3. Int Data

The use of logs is worse than the stats or Boost method. The Boost method has 
much lower max error than the stats method but the RMS is similar.
h3. Medium Data

The use of logs is worse than the stats or Boost method. The Boost method has 
lower max error than the stats method but the RMS is worse for Boost. This can 
be traced to the following errors above 50 ulp:
{noformat}
stats:
GAMMA_P_DERIV_MED a=0.9759566783905029, z=0.009759566746652126: expected 
<1.091103827394005722772134927295910273305494319535696726782190113> != actual 
<1.091103827393991> (ulps=-66.15874662180883)

boost:
GAMMA_P_DERIV_MED a=79.829345703125, z=0.7982934713363647: expected 
<2.057948567669980623962755091227670264152698891095754996007072457E-125> != 
actual <2.0579485676699656E-125> (ulps=-57.28338674936721)
GAMMA_P_DERIV_MED a=84.98836517333984, z=0.8498836755752563: expected 
<1.584723140811257432756401992975844384590043887470012851849659925E-133> != 
actual <1.5847231408112464E-133> (ulps=-56.62359924370867)
{noformat}
So the Boost method has two occurrences where it fails to compute as 
accurately. It has a different method when z < 1 which computes:
{noformat}
Math.pow(z, a) * Math.exp(-z) / tgamma(a) / a
{noformat}
If changed to:
{noformat}
Math.pow(z, a - 1) * Math.exp(-z) / tgamma(a)
{noformat}
The errors above 50 ulp are eliminated:
{noformat}
GAMMA_P_DERIV_MED           max    45.9872   RMS    11.8938
{noformat}
This fix does not work for all parameters. The method should not be changed 
without further investigation of appropriate limits to switch computations.
h3. Small Data

Direct use of logs performs better on this data than the method from 
statistics. This is data that demonstrates the incorrect choice of evaluation 
method as discussed in STATISTICS-39. The Boost method has far superior 
performance on this data as it correctly chooses to compute the direct formula 
(Math.pow(z, a) * Math.exp(-z) / tgamma(a) / a) rather than using logs or 
alternative computations with the Lanczos approximation.
h3. Big Data

Use of logs here is incorrect. Both the stats and Boost method are using an 
appropriate rearrangement of the computation to avoid cancellation errors. The 
stats method is based on the documentation for the Boost function. The 
difference may be due to the use of the computation log1p(w) - w which has a 
special method in Boost (log1pmx(w)) to compute this without loss of accuracy 
when the argument w is close to 0.
h2. Overall

The method using logs is outperformed by other methods. This is the subject of 
MATH-753. Using logs can result in cancellation for some parameters where 
alternative computation is superior.

The method from statistics is successful in computing the result for large z. 
It fails when z is small. The Boost method is robust across all data. It would 
be recommended to expose it through a public API.

> Update Gamma functions using the Boost implementation
> -----------------------------------------------------
>
>                 Key: NUMBERS-174
>                 URL: https://issues.apache.org/jira/browse/NUMBERS-174
>             Project: Commons Numbers
>          Issue Type: Improvement
>          Components: gamma
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Major
>             Fix For: 1.1
>
>
> The current regularised incomplete gamma functions for P and Q compute using 
> a series representation for all input. This method is not robust to extreme 
> arguments.
> The Gamma functions are used in Commons Statistics in the Gamma and Poisson 
> distributions.
> Large values of the mean in the Poisson distribution have low
>  precision for the CDF. The distribution also has a configurable
>  threshold for convergence of the gamma function evaluation
>  (STATISTICS-38). Ideally this implementation detail should be
>  removed from the API.
> The Gamma distribution has a function switch based on a threshold to
>  compute the PDF. This threshold results in incorrect values for
>  certain parameters (STATISTICS-39).
> The function switch in the Gamma distribution is based on the
>  documentation for the Boost gamma functions [1]. However it does not
>  implement all the suggested optimisations detailed in the most recent
>  Boost documentation. This includes avoiding using the converging
>  series of the gamma function when the convergence is slow or unstable,
>  i.e. addresses the need for a configurable convergence threshold in
>  the Poisson distribution.
> One part of the evaluation of the incomplete gamma functions require
>  accuracy in the leading term:
> {noformat}
> x^a exp(-x) / gamma(a) = exp(a log(x) - x - loggamma(a){noformat}
> When x and a are large then using logs to compute this leads to
>  cancellation. Use of logs to compute this term is done in the current
>  implementation in numbers for RegularizedGamma P and Q. It is the
>  source of low precision for the CDF for the Poisson distribution when
>  the mean is large.
> I propose to port the Boost gamma functions to numbers. This will be a
>  process similar to the recent port of the error functions to the same
>  package in numbers. These functions improved both accuracy and
>  speed over the existing implementation. Once the gamma functions are
>  ported a comparison can be made between the existing and new
>  implementations. 
> [1] 
> [https://www.boost.org/doc/libs/1_77_0/libs/math/doc/html/math_toolkit/sf_gamma/igamma.html]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Comment Edited] (NUMBERS-174) Update Gamma functions using the Boost implementation

Reply via email to