[
https://issues.apache.org/jira/browse/MATH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606534#comment-14606534
]
Phil Steitz commented on MATH-1179:
-----------------------------------
I did a little more research. The 2-sample case is different because the test
statistic has a discrete distribution. This makes exact computation possible,
which we do for small m,n. We use a naively computed Smirnov approximation for
large n*m and Monte Carlo for "moderate" n*m by default. There are two things
we can do to improve this:
1. I can't find a freely available reference, but I am chasing down dead trees
for a much better exact computation algorithm that I have seen referenced [1].
What the exactP code does now is simple and correct but ridiculously slow. I
conjecture that the code that I have not been able to figure out in R may be
using that algorithm for small n,m. By implementing a better exact algorithm,
we can use it up to n * m <= 10000, which is the cut R uses. That basically
eliminates the need for the Monte Carlo stuff.
2. The sum computed in approximateP has known bad numerical properties and our
code does not really do anything to correct for this. The magic numbers in
some of the code you reference above have to do with continuity corrections.
We need to research this a little more and apply the corrections to the
computation of the sum.
[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the
two sample Kolmogorov–Smirnov criterion D_mn(m<=n) Selected Tables in
Mathematical Statistics 80–129 American Mathematical Society
> kolmogorovSmirnovTest poor performance in monteCarloP method
> ------------------------------------------------------------
>
> Key: MATH-1179
> URL: https://issues.apache.org/jira/browse/MATH-1179
> Project: Commons Math
> Issue Type: Bug
> Reporter: Gilad
> Fix For: 4.0
>
> Attachments: KSTest-JavaAndR.txt, KSTestSnippet.txt
>
>
> I'm using the kolmogovSmirnovTest method to calculate pvalues.
> However, when i try running the test on two double[] of sizes 5 and 45 the
> results take over 10 seconds to calculate.
> This seems very long, whereas in R it takes a few miliseconds for the same
> calculation.
> I'd be very happy to hear any comment you may have on the subject.
> Gilad
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)