[ 
https://issues.apache.org/jira/browse/MATH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606534#comment-14606534
 ] 

Phil Steitz commented on MATH-1179:
-----------------------------------

I did a little more research.  The 2-sample case is different because the test 
statistic has a discrete distribution.  This makes exact computation possible, 
which we do for small m,n.  By default, we use a naively computed Smirnov 
approximation for large n*m and Monte Carlo for "moderate" n*m.  There are two 
things we can do to improve this:

1.  I can't find a freely available reference, but I am chasing down dead trees 
for a much better exact computation algorithm that I have seen referenced [1].  
What the exactP code does now is simple and correct but ridiculously slow.  I 
conjecture that the R code, which I have not been able to figure out, may be 
using that algorithm for small n,m.  By implementing a better exact algorithm, 
we can use it up to the cutoff R uses, which is n * m < 10000.  That basically 
eliminates the need for the Monte Carlo stuff.
2.  The sum computed in approximateP has known bad numerical properties and our 
code does not really do anything to correct for this.  The magic numbers in 
some of the code you reference above have to do with continuity corrections.  
We need to research this a little more and apply the corrections to the 
computation of the sum.
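
To make the two items concrete, here is a minimal Java sketch.  It is not the 
Commons Math code and not necessarily the algorithm in [1] or in R: the class 
and method names are hypothetical.  exactP counts lattice paths with a 
row-by-row recurrence, which costs O(m*n) time, and asymptoticP sums the 
standard Kolmogorov series, which I am assuming is the sum that approximateP 
computes.  The exact method assumes there are no ties in the combined sample.

```java
import java.util.Arrays;

/** Illustrative sketch only; all names here are hypothetical. */
final class TwoSampleKsSketch {

    private TwoSampleKsSketch() { }

    /** Returns m*n*D as an exact integer, where D = sup |F_x - F_y|. */
    static long scaledStatistic(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int m = a.length;
        int n = b.length;
        int i = 0;
        int j = 0;
        long max = 0;
        while (i < m && j < n) {
            double z = Math.min(a[i], b[j]);
            while (i < m && a[i] == z) { i++; }
            while (j < n && b[j] == z) { j++; }
            // |i/m - j/n| scaled by m*n, kept in integer arithmetic
            max = Math.max(max, Math.abs((long) i * n - (long) j * m));
        }
        return max;
    }

    /** Exact P(D >= d) under the null, with d given as dScaled / (m*n). */
    static double exactP(long dScaled, int m, int n) {
        // After row i, u[j] = (#admissible paths (0,0)->(i,j)) / C(i+n, i).
        // The factor w rescales each row so the counts never overflow.
        double[] u = new double[n + 1];
        for (int j = 0; j <= n; j++) {
            u[j] = admissible(0, j, m, n, dScaled) ? 1.0 : 0.0;
        }
        for (int i = 1; i <= m; i++) {
            double w = (double) i / (i + n);
            u[0] = admissible(i, 0, m, n, dScaled) ? w * u[0] : 0.0;
            for (int j = 1; j <= n; j++) {
                u[j] = admissible(i, j, m, n, dScaled)
                    ? w * u[j] + u[j - 1] : 0.0;
            }
        }
        return 1.0 - u[n]; // u[n] is P(D < d)
    }

    /** A lattice point is admissible if |i/m - j/n| < d. */
    private static boolean admissible(int i, int j, int m, int n, long dScaled) {
        return Math.abs((long) i * n - (long) j * m) < dScaled;
    }

    /** Naive asymptotic P(D >= d): 2 * sum (-1)^(k-1) exp(-2 k^2 t^2). */
    static double asymptoticP(double d, int m, int n) {
        double t = d * Math.sqrt((double) m * n / (m + n));
        // Assumption of this sketch: for small t the alternating series
        // cancels badly and the true value is 1 to double precision.
        if (t < 0.2) {
            return 1.0;
        }
        double sum = 0.0;
        double sign = 1.0;
        for (int k = 1; k <= 1000; k++) {
            double kk = (double) k * k;
            double term = Math.exp(-2.0 * kk * t * t);
            sum += sign * term;
            if (term < 1e-16) {
                break;
            }
            sign = -sign;
        }
        return Math.min(1.0, Math.max(0.0, 2.0 * sum));
    }
}
```

For the sizes in this issue, one would call 
TwoSampleKsSketch.exactP(TwoSampleKsSketch.scaledStatistic(x, y), 5, 45).  
Keeping the statistic as the integer m*n*D avoids floating-point comparisons 
when deciding which lattice points are inside the band.  No continuity 
correction is applied in asymptoticP.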

[1] Kim, P. J. and Jennrich, R. I. (1973). Tables of exact sampling distribution 
of the two sample Kolmogorov–Smirnov criterion D_mn (m <= n). Selected Tables 
in Mathematical Statistics, 80–129. American Mathematical Society.

> kolmogorovSmirnovTest poor performance in monteCarloP method
> ------------------------------------------------------------
>
>                 Key: MATH-1179
>                 URL: https://issues.apache.org/jira/browse/MATH-1179
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Gilad
>             Fix For: 4.0
>
>         Attachments: KSTest-JavaAndR.txt, KSTestSnippet.txt
>
>
> I'm using the kolmogorovSmirnovTest method to calculate p-values.
> However, when I try running the test on two double[] arrays of sizes 5 and 
> 45, the results take over 10 seconds to calculate.
> This seems very long, whereas in R it takes a few milliseconds for the same 
> calculation.
> I'd be very happy to hear any comments you may have on the subject.
>    Gilad



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
