[
https://issues.apache.org/jira/browse/STATISTICS-39?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458232#comment-17458232
]
Alex Herbert commented on STATISTICS-39:
----------------------------------------
I investigated updating the GammaDistribution to use the Boost
RegularizedGamma.P.derivative function. This improves performance on the
current test data with one exception. A fix has been added to Commons Numbers
to allow the current test thresholds to pass.
The GammaDistribution PDF accuracy is tested using various shape parameters.
{noformat}
pdf = z^(a-1) exp(-z) / gamma(a)
a = shape{noformat}
New data has been added for shape [0.25, 0.5, 0.75]. The test measures the mean
and standard deviation of ULP error against reference data. This is done for
the condition where the regular computation does not overflow and when it does.
The existing implementation detects overflow and switches to a different
computation. This switch is not suitable for small shape values. However the
switch to a specialised overflow computation is not optimal in the new
implementation either. The table below shows performance of both methods
without any changes:
| | |Old| | | |New| | | |
|Shape|Overflow|Mean|Max|SD|RMS|Mean|Max|SD|RMS|
|0.25| |7.50|865.00|42.18|41.53|0.53|2.00|0.79|0.58|
|0.5| |7.47|2001.00|60.84|60.41|0.49|2.00|0.75|0.56|
|0.75| |6.33|697.00|33.80|33.22|0.61|3.00|0.87|0.61|
|1| |1.41|3.00|1.49|0.50|0.11|2.00|0.33|0.31|
|8| |0.42|2.00|0.67|0.52|0.48|3.00|0.73|0.55|
|10| |0.60|3.00|0.85|0.60|0.48|2.00|0.72|0.54|
|100| |1.90|4.00|2.03|0.71|0.47|2.00|0.73|0.56|
|142| |3.22|26.00|3.57|1.54|1.21|113.00|7.04|6.94|
|142|Y|38.26|250.00|53.92|38.06|38.49|123.00|46.53|26.18|
|1000| |0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|
|1000|Y|157.49|2153.00|278.37|229.57|154.61|1094.00|242.75|187.18|
The old implementation does not work for small shape (< 1). The new
implementation is better for shape 1, 8 and 100. Shape 10 is slightly worse due
to a higher max error of 3 ulp rather than 2. This is a minor difference.
At shape 142 the regular function will overflow the power function when z is
149 (149^142 = infinity). This value of z occurs at CDF(z=149) ~ 0.72. This is
well within the range that is required for the computation. Note that the mode
of the gamma distribution is (shape - 1) and so for large shape a lot of the
probability density is outside the range that can be computed using a direct
implementation of the PDF.
For very large shape (1000) the new implementation is better. The old and the
new implementation have available a very similar alternative computation for
the PDF. The difference is thresholds used to choose the method to evaluate the
PDF. Here is a chart of the ULP error of the old and new implementation for
shape 142 on the test data. A rolling average with a window size of 20 has been
added for clarity of the trend:
!pdf_error.jpg!
Note that error is very well controlled in both functions before the overflow
point at z=149. At the overflow point the new implementation error increases
dramatically. The old implementation has a gradual decline in accuracy. At
large z the old method is worse as the new method controls the error for very
large z.
The computation used in the old method for overflow is always the same. The new
method can use this computation or choose a different one. It chooses this
computation using the following logic:
{noformat}
a ~ z && a > 150{noformat}
Logic for a ~ z not shown. If this is updated to use a > 100 then the
computation is used for shape=142 when z is in [148, 271]. This fixes the large
errors. Looking at the chart above the upper limit of z=271 is around the point
where the two functions moving average cross. So the logic for a ~ z is
choosing the correct function. The issue is that a > 150 excludes the shape=142
case.
I updated the numbers gamma implementation to lower the threshold to 100. This
does not affect any tests in the gamma package. However further optimisation is
required to ensure the threshold is suitable.
For reference the old implementation can be fixed for small shape parameters to
use the direct computation. Here are the updated performance figures:
| | |Old| | | |New| | | |
|Shape|Overflow|Mean|Max|SD|RMS|Mean|Max|SD|RMS|
|0.25| |1.307|3|1.484|0.704|0.53|2.00|0.79|0.58|
|0.5| |1.313|3|1.505|0.736|0.49|2.00|0.75|0.56|
|0.75| |0.89|3|1.118|0.677|0.61|3.00|0.87|0.61|
|1| |1.41|3.00|1.49|0.50|0.11|2.00|0.33|0.31|
|8| |0.42|2.00|0.67|0.52|0.48|3.00|0.73|0.55|
|10| |0.60|3.00|0.85|0.60|0.48|2.00|0.72|0.54|
|100| |1.90|4.00|2.03|0.71|0.47|2.00|0.73|0.56|
|142| |3.22|26.00|3.57|1.54|0.585|8|0.938|0.733|
|142|Y|38.26|250.00|53.92|38.06|27.87|123|38.95|27.26|
|1000| |0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|
|1000|Y|157.49|2153.00|278.37|229.57|154.61|1094.00|242.75|187.18|
When the old implementation is fixed for small shape it is worse than the new
implementation. This is due to the improved accuracy of the gamma function
(since the power and exponential functions are the same).
When the new implementation is fixed for shape=142 it is better than the old
implementation.
I will prepare a change to the STATISTICS code to use the gamma function
derivative from NUMBERS. Any optimisation of thresholds used to switch
computations should be done as a new issue in the NUMBERS project.
> Correct the switch of function evaluation in the GammaDistribution
> ------------------------------------------------------------------
>
> Key: STATISTICS-39
> URL: https://issues.apache.org/jira/browse/STATISTICS-39
> Project: Apache Commons Statistics
> Issue Type: Improvement
> Components: distribution
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Priority: Major
> Attachments: pdf_error.jpg
>
>
> The Gamma distribution density is not accurate when shape < 1. There are
> extensive tests related to MATH-753 [7] for accuracy of the Gamma
> distribution. These test shape >= 1. The work requires extension to smaller
> shapes where the density -> infinite as X -> 0. When shape >= 1 then density
> -> 0 as X -> 0 (see [https://en.wikipedia.org/wiki/Gamma_distribution]).
> This results in inaccuracy of the Chi-squared distribution which uses
> Gamma(df/2, 2) when the degrees of freedom (df) are < 2. I've read the code
> and the documentation for the Boost functions which the code is based on and
> I think the switch to the alternative computation to avoid overflow is
> currently not working for all cases.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)