[
https://issues.apache.org/jira/browse/STATISTICS-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480738#comment-17480738
]
Alex Herbert commented on STATISTICS-25:
----------------------------------------
The issue with an unstable CDF at large degrees of freedom is also present in
the PDF.
{noformat}
df pdf(x=0.025)
1.0E9 0.3988172087667682
1.0E10 0.39882101220230587
1.0E11 0.3988149267230468
1.0E12 0.3984256491290572
1.0E13 0.4167317706498295 <-- Incorrect from here
1.0E14 0.36776449680706863
1.0E15 0.13529299756808105
1.0E16 1.2660208615663138E-14
1.0E17 0.39881763041638174 <-- Switched to normal approximation{noformat}
The computation is clearly wrong once DF reaches 1e13.
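For reference, the values above can be reproduced with something along these
lines. This is only a sketch: it uses the commons-math3 TDistribution from the
report below, which computes the density through the same log-gamma form;
whether it shows these exact figures (and the switch to the normal
approximation at the end) is an assumption.
{code:java}
import org.apache.commons.math3.distribution.TDistribution;

public class TDensityRepro {
    public static void main(String[] args) {
        final double x = 0.025;
        // Step the degrees of freedom through the range shown in the table.
        for (double df = 1e9; df <= 1e17; df *= 10) {
            final TDistribution dist = new TDistribution(df);
            System.out.println(df + "  " + dist.density(x));
        }
    }
}
{code}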
The density computation uses the form:
{noformat}
gamma((v+1) / 2)
--------------------- * (1 + x^2/v)^-((v+1)/2)
sqrt(v*pi) gamma(v/2){noformat}
See [Wikipedia: Student's T distribution
PDF|https://en.wikipedia.org/wiki/Student%27s_t-distribution#Probability_density_function]
This is computed in log form, with the density evaluated as exp(logDensity).
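The evaluation is roughly of this form (a sketch only; the parameter and
variable names are illustrative, not the library's):
{code:java}
// density = exp(factor - (v + 1) / 2 * log(1 + x^2 / v))
// where 'factor' is the gamma term discussed next.
static double density(double x, double v, double factor) {
    final double logDensity = factor - (v + 1) / 2 * Math.log1p(x * x / v);
    return Math.exp(logDensity);
}
{code}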
The gamma term for the log form:
{code:java}
factor = LogGamma.value(degreesOfFreedom/2 + 0.5) -
         0.5 * (Math.log(Math.PI) + Math.log(degreesOfFreedom)) -
         LogGamma.value(degreesOfFreedom/2); {code}
is not computed accurately when the degrees of freedom are large: both LogGamma
terms grow like (v/2) log(v/2) while their difference is only about 0.5 log(v/2),
so nearly all significant digits are lost in the subtraction.
The term can be expressed using the log beta function:
{code:java}
factor = -0.5 * Math.log(degreesOfFreedom) -
         LogBeta.value(0.5, degreesOfFreedom/2); {code}
The two computations should be mathematically identical, but at large DF they are not:
||DF||LogGamma||LogBeta||ULP difference||
|1.0|-1.1447298858494002|-1.1447298858494004|-1|
|10.0|-0.9438973521509522|-0.9438973521509524|-2|
|100.0|-0.9214384915430003|-0.9214384915430047|-40|
|1000.0|-0.9191885331629237|-0.9191885331630059|-740|
|10000.0|-0.918963533193164|-0.9189635332046309|-103284|
|100000.0|-0.9189410332473926|-0.918941033204673|384784|
|1000000.0|-0.918938783928752|-0.9189387832046734|6521920|
|2000000.0|-0.918938659131527|-0.9189386582046728|8348360|
|2980000.0|-0.9189386144280434|-0.9189386170972904|-24042440|
|2990000.0|-0.9189386144280434|-0.9189386168167131|-21515224|
|3000000.0|-0.9189386181533337|-0.9189386165380062|14549576|
|4000000.0|-0.9189385958015919|-0.918938595704673|872968|
|1.0E7|-0.9189385622739792|-0.9189385582046734|36653048|
|1.0E8|-0.9189385175704956|-0.9189385357046742|-163338160|
|1.0E9|-0.9189395904541016|-0.9189385334546714|9520604480|
|1.0E10|-0.9189300537109375|-0.9189385332296744|-76376714848|
|1.0E11|-0.9189453125|-0.9189385332071733|61062441296|
|1.0E12|-0.919921875|-0.9189385332049227|8857155483776|
|1.0E13|-0.875|-0.9189385332046989|-395763123535776|
|1.0E14|-1.0|-0.9189385332046758|730136783307056|
|1.0E15|-2.0|-0.918938533204674|5233736410677568|
|1.0E16|-32.0|-0.918938533204674|23248134920159552|
As DF increases, the difference between the two terms grows until, for
DF >= 1e13, the LogGamma computation suffers extreme cancellation.
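The table can be regenerated with something like the following sketch. The
signed "ULP difference" is assumed to be the difference of the raw long bits of
the two results, which counts the representable doubles between two finite
values of the same sign.
{code:java}
import org.apache.commons.numbers.gamma.LogBeta;
import org.apache.commons.numbers.gamma.LogGamma;

public class FactorCompare {
    public static void main(String[] args) {
        for (double df : new double[] {1, 10, 100, 1e6, 1e12, 1e16}) {
            // Current form: difference of two log-gamma values.
            final double logGammaFactor = LogGamma.value(df / 2 + 0.5)
                - 0.5 * (Math.log(Math.PI) + Math.log(df))
                - LogGamma.value(df / 2);
            // Proposed form: a single log-beta evaluation.
            final double logBetaFactor = -0.5 * Math.log(df) - LogBeta.value(0.5, df / 2);
            final long ulps = Double.doubleToLongBits(logGammaFactor)
                - Double.doubleToLongBits(logBetaFactor);
            System.out.println(df + "  " + logGammaFactor + "  " + logBetaFactor + "  " + ulps);
        }
    }
}
{code}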
Updating the density computation to use the LogBeta function corrects this
issue. [NUMBERS-181] added a Beta function, so the density computation can
precompute its factor using Beta and the log density can precompute its
additive term using LogBeta (a sketch follows the results below). This fixes
the result:
{noformat}
DF PDF(x=0.025)
1.0E9 0.398817630316553
1.0E10 0.3988176304063989
1.0E11 0.39881763041538365
1.0E12 0.39881763041628193
1.0E13 0.3988176304163718
1.0E14 0.3988176304163808
1.0E15 0.39881763041638174
1.0E16 0.39881763041638174
1.0E17 0.39881763041638174
1.0E18 0.39881763041638174{noformat}
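For illustration, a minimal sketch of the corrected precomputation, using only
LogBeta (the actual change can also use the Beta class added by NUMBERS-181 to
precompute the plain-density factor; reproducing the exact digits above is an
assumption):
{code:java}
import org.apache.commons.numbers.gamma.LogBeta;

public class CorrectedTDensity {
    public static void main(String[] args) {
        final double x = 0.025;
        for (double df = 1e9; df <= 1e18; df *= 10) {
            // factor = -0.5 * log(v) - log B(1/2, v/2): no cancellation at large v.
            final double factor = -0.5 * Math.log(df) - LogBeta.value(0.5, df / 2);
            final double logDensity = factor - (df + 1) / 2 * Math.log1p(x * x / df);
            System.out.println(df + "  " + Math.exp(logDensity));
        }
    }
}
{code}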
> T Distribution Inverse Cumulative Probability Function gives the Wrong Answer
> -----------------------------------------------------------------------------
>
> Key: STATISTICS-25
> URL: https://issues.apache.org/jira/browse/STATISTICS-25
> Project: Apache Commons Statistics
> Issue Type: Bug
> Reporter: Andreas Stefik
> Priority: Major
>
> Hi There,
> Given code like this:
>
> import org.apache.commons.math3.analysis.UnivariateFunction;
> import org.apache.commons.math3.analysis.solvers.BrentSolver;
> import org.apache.commons.math3.distribution.TDistribution;
>
> public class Main {
>     public static void main(String[] args) {
>         double df = 1E38;
>         double t = 0.975;
>         TDistribution dist = new TDistribution(df);
>
>         double prob = dist.inverseCumulativeProbability(1.0 - t);
>
>         System.out.println("Prob: " + prob);
>     }
> }
>
> It is possible I am misunderstanding, but that seems equivalent to:
>
> scipy.stats.t.cdf(1.0 - 0.975, 1e38)
>
> In Python. They give different answers. Python gives 0.509972518193, which
> seems correct, whereas Apache Commons gives Prob: -6.462184036284304E-10.
> That's a huge difference.
> My hunch is that as you get closer to infinity it begins to fail, but I
> haven't checked carefully. For calls with much smaller degrees of freedom,
> you get answers that are basically the same as Python or online calculators.
>