[ 
https://issues.apache.org/jira/browse/STATISTICS-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480439#comment-17480439
 ] 

Alex Herbert commented on STATISTICS-25:
----------------------------------------

[NUMBERS-181] updated the regularized beta function to improve support for 
large a and/or b arguments, and added a complement function. This was ported 
from the Boost C++ beta routines.
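
For reference, the updated API can be exercised roughly as follows (a 
minimal sketch, assuming the RegularizedBeta class in commons-numbers-gamma 
with argument order (x, a, b); the complement method is the one added by 
NUMBERS-181):
{noformat}
import org.apache.commons.numbers.gamma.RegularizedBeta;

// Regularized incomplete beta I_x(a, b)
double p = RegularizedBeta.value(0.25, 14.0, 0.5);
// Complement 1 - I_x(a, b), computed directly to avoid the
// cancellation in 1 - value(...) when the result is close to 1
double q = RegularizedBeta.complement(0.25, 14.0, 0.5);
// p + q should be 1 (up to rounding)
{noformat}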

The T-distribution CDF has been updated to use the new function. The [Boost 
documentation for the 
T-distribution|https://www.boost.org/doc/libs/1_78_0/libs/math/doc/html/math_toolkit/dist_ref/dists/students_t_dist.html]
 states that the CDF is computed using the regularized beta function or its 
complement. The choice depends on the size of t*t relative to the degrees of 
freedom:
{noformat}
v = degrees of freedom

// regularized incomplete beta taking arguments (x, a, b)
if (v < 2*t*t)
  z = incomplete beta(v / (v + t*t), v / 2, 0.5) / 2;
else
  z = incomplete beta complement(t*t / (v + t*t), 0.5, v / 2) / 2;

p = t > 0 ? 1 - z : z;
{noformat}
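
Translated to Java against the Commons Numbers API this is roughly (a 
sketch, not the exact committed code):
{noformat}
import org.apache.commons.numbers.gamma.RegularizedBeta;

/** Sketch of CDF(t) for the t-distribution with v degrees of freedom. */
static double cdf(double t, double v) {
    final double t2 = t * t;
    final double z;
    if (v < 2 * t2) {
        // t is large relative to v: I_x(a, b) with x = v / (v + t^2)
        z = RegularizedBeta.value(v / (v + t2), v / 2, 0.5) / 2;
    } else {
        // t is small relative to v: complement with x = t^2 / (v + t^2)
        z = RegularizedBeta.complement(t2 / (v + t2), 0.5, v / 2) / 2;
    }
    return t > 0 ? 1 - z : z;
}
{noformat}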

I updated the function to use this logic. Following this change the CDF is 
computed accurately and approaches the normal distribution at large degrees of 
freedom. The previous threshold for switching to the normal approximation 
(2.99e6) is either no longer required or is too low.
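
The convergence can be checked empirically, e.g. (a sketch assuming the 
commons-statistics factory API; the probed values are illustrative):
{noformat}
import org.apache.commons.statistics.distribution.NormalDistribution;
import org.apache.commons.statistics.distribution.TDistribution;

// The t CDF should converge to the standard normal CDF as v grows
double n = NormalDistribution.of(0, 1).cumulativeProbability(0.025);
for (double v : new double[] {1e6, 1e9, 1e12, 1e15}) {
    double t = TDistribution.of(v).cumulativeProbability(0.025);
    System.out.printf("v=%.0e  cdf=%.17g  delta=%.3g%n", v, t, t - n);
}
{noformat}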

Reference test values for the original case (CDF(t=0.025)) have been computed 
using Python for degrees of freedom up to 1e10. I updated the test values to 
20 significant digits. The T-distribution CDF is accurate to within 6 ULP for 
degrees of freedom up to 1e18.
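
The ULP distance is measured with the usual bit-pattern trick, roughly (a 
sketch; valid for finite values of the same sign):
{noformat}
// ULP distance between a computed CDF value and its reference.
// Assumes both values are finite and have the same sign.
static long ulpDistance(double computed, double reference) {
    return Math.abs(Double.doubleToLongBits(computed)
                  - Double.doubleToLongBits(reference));
}
{noformat}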

However, a look at the source code for scipy's t-distribution shows that it 
uses the Boost implementation. That implementation switches to the normal 
approximation at a threshold of 1 / machine epsilon, which is 2^52, or 
4.50e15. So the reference values should be verified against a different 
implementation.
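
For reference, the Boost threshold in double precision (machine epsilon for 
double is Math.ulp(1.0) = 2^-52):
{noformat}
double eps = Math.ulp(1.0);      // machine epsilon, 2^-52
double threshold = 1.0 / eps;    // 2^52
System.out.println(threshold);   // 4.503599627370496E15
{noformat}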

I've updated the code to use a higher threshold of 2^52 in commit:

876b5ec9782a36887f08612bee5b0ee8917df6f2

Note that the T-distribution currently switches to a normal approximation in 
the factory constructor when the variance is 1.0. That occurs at DF = 3.60e16. 
So perhaps the normal approximation is no longer required in the CDF for the 
regular T-distribution. Further work should be done with more input x values 
to determine whether 2^52 is a suitable threshold.
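
The constructor threshold can be probed numerically, e.g. (a sketch; the 
loop bounds are illustrative):
{noformat}
// Probe where the computed t-distribution variance v / (v - 2)
// becomes exactly 1.0 in double precision
for (double v = 1e16; v < 1e17; v *= 1.25) {
    System.out.printf("v=%.4e  variance==1.0: %b%n", v, v / (v - 2) == 1.0);
}
{noformat}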


> T Distribution Inverse Cumulative Probability Function gives the Wrong Answer
> -----------------------------------------------------------------------------
>
>                 Key: STATISTICS-25
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-25
>             Project: Apache Commons Statistics
>          Issue Type: Bug
>            Reporter: Andreas Stefik
>            Priority: Major
>
> Hi There,
> Given code like this:
>  
> import org.apache.commons.math3.analysis.UnivariateFunction;
> import org.apache.commons.math3.analysis.solvers.BrentSolver;
> import org.apache.commons.math3.distribution.TDistribution;
> public class Main {
>  public static void main(String[] args) {
>  double df = 1E38;
>  double t = 0.975;
>  TDistribution dist = new TDistribution(df);
>  
>  double prob = dist.inverseCumulativeProbability(1.0 - t);
>  
>  System.out.println("Prob: " + prob);
>  }
> }
>  
> It is possible I am misunderstanding, but that seems equivalent to:
>  
> scipy.stats.t.cdf(1.0 - 0.975, 1e38)
>  
> In Python. They give different answers. Python gives 0.509972518193, which 
> seems correct, whereas Apache Commons gives  Prob: -6.462184036284304E-10. 
> That's a huge difference.
> My hunch is that as you get closer to infinity it begins to fail, but I 
> haven't checked carefully. For calls with much smaller degrees of freedom, 
> you get answers that are basically the same as Python or online calculators.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
