[ 
https://issues.apache.org/jira/browse/STATISTICS-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480738#comment-17480738
 ] 

Alex Herbert commented on STATISTICS-25:
----------------------------------------

The issue with an unstable CDF at large degrees of freedom is also present in 
the PDF.
{noformat}
df     pdf(x=0.025)  
1.0E9  0.3988172087667682
1.0E10 0.39882101220230587
1.0E11 0.3988149267230468
1.0E12 0.3984256491290572
1.0E13 0.4167317706498295       <-- Incorrect from here
1.0E14 0.36776449680706863
1.0E15 0.13529299756808105
1.0E16 1.2660208615663138E-14
1.0E17 0.39881763041638174      <-- Switched to normal approximation{noformat}
The computation is wrong when DF > 1e13.

The density computation uses the form:
{noformat}
  gamma((v+1) / 2)
---------------------  * (1 + x^2/v)^-((v+1)/2)
sqrt(v*pi) gamma(v/2){noformat}
See [Wikipedia: Student's T distribution 
PDF|https://en.wikipedia.org/wiki/Student%27s_t-distribution#Probability_density_function]

This is computed using the log form and the density as exp(logDensity).

The gamma term for the log form:
{code:java}
factor = LogGamma.value(degreesOfFreedom/2 + 0.5) -
         0.5 * (Math.log(Math.PI) + Math.log(degreesOfFreedom)) -
         LogGamma.value(degreesOfFreedom/2); {code}
is not computed accurately as the degrees of freedom is large.

The term can be expressed using the log beta function:
{code:java}
factor = -0.5 * Math.log(degreesOfFreedom) - LogBeta.value(0.5, 
degreesOfFreedom/2); {code}
The two computations should be the same. But at large DF they are not:
||DF||LogGamma||LogBeta||ULP difference||
|1.0|-1.1447298858494002|-1.1447298858494004|-1|
|10.0|-0.9438973521509522|-0.9438973521509524|-2|
|100.0|-0.9214384915430003|-0.9214384915430047|-40|
|1000.0|-0.9191885331629237|-0.9191885331630059|-740|
|10000.0|-0.918963533193164|-0.9189635332046309|-103284|
|100000.0|-0.9189410332473926|-0.918941033204673|384784|
|1000000.0|-0.918938783928752|-0.9189387832046734|6521920|
|2000000.0|-0.918938659131527|-0.9189386582046728|8348360|
|2980000.0|-0.9189386144280434|-0.9189386170972904|-24042440|
|2990000.0|-0.9189386144280434|-0.9189386168167131|-21515224|
|3000000.0|-0.9189386181533337|-0.9189386165380062|14549576|
|4000000.0|-0.9189385958015919|-0.918938595704673|872968|
|1.0E7|-0.9189385622739792|-0.9189385582046734|36653048|
|1.0E8|-0.9189385175704956|-0.9189385357046742|-163338160|
|1.0E9|-0.9189395904541016|-0.9189385334546714|9520604480|
|1.0E10|-0.9189300537109375|-0.9189385332296744|-76376714848|
|1.0E11|-0.9189453125|-0.9189385332071733|61062441296|
|1.0E12|-0.919921875|-0.9189385332049227|8857155483776|
|1.0E13|-0.875|-0.9189385332046989|-395763123535776|
|1.0E14|-1.0|-0.9189385332046758|730136783307056|
|1.0E15|-2.0|-0.918938533204674|5233736410677568|
|1.0E16|-32.0|-0.918938533204674|23248134920159552|

As DF increases the difference between the two terms increases until as DF > 
1e13 the LogGamma computation has extreme cancellation.

Updating the density computation to use the LogBeta function corrects this 
issue. [Numbers-181] added a Beta function. So the density computation can 
precompute the factor using Beta and the log density can precompute the 
addition using LogBeta. This fixes the result:
{noformat}
DF     PDF(x=0.025)
1.0E9  0.398817630316553
1.0E10 0.3988176304063989
1.0E11 0.39881763041538365
1.0E12 0.39881763041628193
1.0E13 0.3988176304163718
1.0E14 0.3988176304163808
1.0E15 0.39881763041638174
1.0E16 0.39881763041638174
1.0E17 0.39881763041638174
1.0E18 0.39881763041638174{noformat}

> T Distribution Inverse Cumulative Probability Function gives the Wrong Answer
> -----------------------------------------------------------------------------
>
>                 Key: STATISTICS-25
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-25
>             Project: Apache Commons Statistics
>          Issue Type: Bug
>            Reporter: Andreas Stefik
>            Priority: Major
>
> Hi There,
> Given code like this:
>  
> import org.apache.commons.math3.analysis.UnivariateFunction;
> import org.apache.commons.math3.analysis.solvers.BrentSolver;
> import org.apache.commons.math3.distribution.TDistribution;
> public class Main {
>  public static void main(String[] args) {
>  double df = 1E38;
>  double t = 0.975;
>  TDistribution dist = new TDistribution(df);
>  
>  double prob = dist.inverseCumulativeProbability(1.0 - t);
>  
>  System.out.println("Prob: " + prob);
>  }
> }
>  
> It is possible I am misunderstanding, but that seems equivalent to:
>  
> scipy.stats.t.cdf(1.0 - 0.975, 1e38)
>  
> In Python. They give different answers. Python gives 0.509972518193, which 
> seems correct, whereas Apache Commons gives  Prob: -6.462184036284304E-10. 
> That's a huge difference.
> My hunch is that as you get closer to infinity it begins to fail, but I 
> haven't checked carefully. For calls with much smaller degrees of freedom, 
> you get answers that are basically the same as Python or online calculators.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to