[
https://issues.apache.org/jira/browse/STATISTICS-36?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462308#comment-17462308
]
Alex Herbert commented on STATISTICS-36:
----------------------------------------
I created a JMH benchmark to test the inverse probability function. The
absolute and relative accuracy tolerances were varied from the current settings
(abs 1e-9, rel 1e-14) to extreme settings (abs 5e-324, rel ~1e-16, actually
2^-53). The distributions were parameterised using the worst known cases for
accuracy from the current test suite. The benchmark calls the inverse function
using a random p-value in [0, 1]. This will not exercise very small probability
values (the minimum non-zero random value is 2^-53) but does provide a picture
of the performance with different thresholds.
The mean time for a single function call is listed in the table for each
combination of relative (Rel) and absolute (Abs) tolerance:
||Distribution||Params||Function||Rel 1e-14, Abs 1e-9||Rel 1e-14, Abs 5e-324||Rel ~1e-16, Abs 1e-9||Rel ~1e-16, Abs 5e-324||
|Beta|4, 0.1|cdf|6551.1|7368.6|6754.9|7683.4|
|ChiSquared|0.1|cdf|4267.4|7189.3|*4249.4*|*7377.3*|
|F|5, 6|cdf|4326.4|5251.8|4352.9|5971.1|
|Gamma|4, 2|cdf|1597.1|1666.9|1578.7|1728.2|
|Nakagami|1/3, 1|cdf|3077.1|3282.1|3078.8|3411.7|
|T|5|cdf|3490.1|3697.5|3486.9|3986.5|
|Beta|4, 0.1|sf|5996.2|6792.2|6033.0|7053.7|
|ChiSquared|0.1|sf|4834.3|7918.8|*4700.2*|*8264.9*|
|F|5, 6|sf|4781.0|5191.4|4759.7|5445.6|
|Gamma|4, 2|sf|1671.3|1755.1|1680.8|1845.1|
|Nakagami|1/3, 1|sf|3393.0|3655.0|3393.4|3884.2|
|T|5|sf|3533.4|3718.3|3509.6|4595.2|
These functions are slow. For reference, the mean time for a no-operation
function that returns the input p-value is about 4.8.
Changing the relative threshold from 1e-14 to ~1e-16 makes little difference
(the worst case slowdown is ~23.6%, for the T distribution survival function
when the absolute tolerance is effectively 0).
Changing the absolute threshold from 1e-9 to the minimum value makes some
difference for Beta (17% sf) and F (37% cdf), and the most difference for
ChiSquared (74% cdf, 76% sf) when the relative tolerance is 2^-53. However,
this parameterisation of ChiSquared is exactly the case where the inverse does
not work at all when the absolute accuracy is 1e-9. So this compares a
functionally broken inversion to an accurate inversion.
Even with the most extreme changes to the accuracy thresholds the run time does
not even double. This suggests updating the accuracy thresholds to the tighter
tolerances to allow correct inversion of the probabilities.
h2. Inversion slowness
In the worst-case scenario the inversion is performed using bisection. Each
iteration halves the bracket [a, b]. The number of iterations n required to
achieve the absolute tolerance eps is:
{noformat}
eps = (b - a) / 2^n
2^n = (b - a) / eps
n = (log(b - a) - log(eps)) / log(2)
= log2(b - a) - log2(eps){noformat}
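The formula can be checked directly. A small sketch (plain Java; the helper name is illustrative, not library code):

```java
public class BisectionCount {
    /**
     * Number of bisection iterations needed to reduce the bracket [a, b]
     * to a width of at most eps: n = ceil(log2((b - a) / eps)).
     * Note: computes b - a directly, so very wide brackets would overflow;
     * for those the exponents can be subtracted instead.
     */
    static int iterations(double a, double b, double eps) {
        return (int) Math.ceil((Math.log(b - a) - Math.log(eps)) / Math.log(2));
    }

    public static void main(String[] args) {
        // A unit bracket reduced to the current absolute threshold of 1e-9
        // needs 30 halvings (1e-9 is roughly 2^-30)
        System.out.println(iterations(0.0, 1.0, 1e-9)); // 30
    }
}
```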
For an extreme bracket of [-2^1023, 2^1023] (width 2^1024) with the current
absolute threshold of 1e-9, which is roughly 2^-30, this is:
||a||b||abs eps||n||
|-2^1023|2^1023|2^-30|1054|
| | |2^-53|1077|
| | |2^-1074|2098|
So in the theoretical worst case of bisection, changing the absolute threshold
from 1e-9 to the minimum value roughly doubles the run time to search from a
maximum-value bracket down to a result at effectively zero. If the result is
far from zero then the eps cannot be so small (it is limited to 1 ULP of the
result) and the performance drop is less significant.
The large number of CDF evaluations (up to thousands in the worst case) shows
why the performance of the inversion search is generally slow. The general
slowness of the inversion could be addressed using:
# The PDF to provide gradient information for the CDF or SF
# Better initial estimate and bracketing of the bounds of the root
Note that the PDF may be as expensive to compute as the CDF. So any
performance gain from using it has to be more significant than applying 2
iterations of the Brent solver using only the CDF. It should be noted that
during porting of the Boost C++ gamma function for NUMBERS-174 the simultaneous
computation of the CDF and PDF was removed, as this functionality was only used
in the inverse gamma methods. For the gamma function the PDF is naturally
computed as part of the CDF computation and this is exploited in a custom
inversion method.
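As an illustration of option 1, a Newton iteration using the PDF as the gradient of the CDF can be sketched for a distribution with cheap closed forms. This is a toy sketch, not the Commons Statistics implementation; the standard exponential distribution is used so the exact inverse -log(1 - p) is available for comparison:

```java
public class NewtonInverseCdf {
    // Standard exponential distribution: cdf(x) = 1 - exp(-x), pdf(x) = exp(-x)
    static double cdf(double x) { return -Math.expm1(-x); }
    static double pdf(double x) { return Math.exp(-x); }

    /** Invert the CDF using Newton's method with the PDF as the gradient. */
    static double inverseCdf(double p, double x0) {
        double x = x0;
        for (int i = 0; i < 50; i++) {
            // Newton step: x_{n+1} = x_n - (F(x_n) - p) / f(x_n)
            double dx = (cdf(x) - p) / pdf(x);
            x -= dx;
            if (Math.abs(dx) <= Math.ulp(x)) {
                break;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        // Exact inverse of the CDF at p is -log(1 - p)
        System.out.println(inverseCdf(0.5, 1.0) - Math.log(2.0));
    }
}
```

Newton converges in a handful of CDF/PDF evaluations here, versus dozens of CDF calls for bisection from a wide bracket; the trade-off noted above is that for many distributions the PDF costs about as much as the CDF.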
A better target for improvement is bracketing and initial estimation. For
example, a T distribution with large degrees of freedom approaches a normal
distribution, so the inverse of a normal distribution could be used to provide
an initial estimate and closer bounds for the bracket.
This is something to be investigated in future work.
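The bracketing idea can be sketched generically: expand an interval around an initial estimate until the objective changes sign, then hand the small bracket to the solver. A minimal illustration (the helper name and symmetric doubling strategy are my own, not the library's; there is no guard against a non-bracketable function, for brevity):

```java
import java.util.function.DoubleUnaryOperator;

public class BracketExpansion {
    /**
     * Symmetrically expand an interval around an initial guess until it
     * brackets a sign change of f. A good initial estimate keeps the
     * bracket, and hence the solver iteration count, small.
     */
    static double[] bracket(DoubleUnaryOperator f, double guess, double step) {
        double lo = guess - step;
        double hi = guess + step;
        // Double the step until f changes sign over [lo, hi]
        while (Math.signum(f.applyAsDouble(lo)) == Math.signum(f.applyAsDouble(hi))) {
            step *= 2;
            lo = guess - step;
            hi = guess + step;
        }
        return new double[] {lo, hi};
    }

    public static void main(String[] args) {
        // Objective with root at cbrt(7) ~ 1.913, deliberately poor guess of 10
        double[] b = bracket(x -> x * x * x - 7, 10.0, 1.0);
        System.out.println(b[0] + " " + b[1]);
    }
}
```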
> Optimise the default inverse CDF function for continuous distributions
> ----------------------------------------------------------------------
>
> Key: STATISTICS-36
> URL: https://issues.apache.org/jira/browse/STATISTICS-36
> Project: Apache Commons Statistics
> Issue Type: Improvement
> Components: distribution
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Assignee: Alex Herbert
> Priority: Minor
>
> Many distributions only compute the inverse CDF to 1e-9 relative error. Some
> fail to compute when the input p value maps to a very small x value (e.g.
> 1e-50). This is related to the defaults for the brent solver used in the
> default implementation.
> Investigation of tolerances in the brent solver discovered a bug when
> evaluating very small objective functions ([NUMBERS-168]).
> The brent solver requires a release to be used with this fix so work can use
> a local copy until all issues with it have been resolved, and the changes
> ported back to numbers.
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)