[
https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742726#comment-14742726
]
Phil Steitz commented on MATH-1246:
-----------------------------------
I have done some research and I am convinced that the definition based on the
empirical distributions as given is correct. In other words:
1. The statistic that we should use is that given by comparing the empirical
distributions with ties contributing the mass that they do. This is the
Kolmogorov metric that is part of the definition of the test. Distributions
with point masses should be allowed and empirical distributions based on data
including repeated values should be taken as presented by the data.
2. The correct definition of p-value (with or without ties) is the probability
that when an n-set and m-set are randomly selected from the combined dataset
the associated D-value is strictly greater than (resp. greater than or equal
to) the observed D value (with ties included). Equivalently, it is the
probability that when group assignment is done randomly, the resulting
empirical distributions are separated by a Kolmogorov distance at least as
large as the observed D.
This is supported theoretically in [1], recommended in [2] and implemented in
ks.boot (from the R Matching package), which the R community recommends when
ties are present in the data.
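To illustrate point 1, here is a rough sketch of the D statistic computed
directly from the two empirical distributions, with tied values contributing
their full mass. The class and method names are hypothetical, this is not the
Commons Math code, and it assumes non-empty samples with no NaN values.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical, not the Commons Math API. */
final class KsStatisticSketch {

    /**
     * Largest absolute difference between the empirical CDFs of x and y.
     * Assumes both arrays are non-empty and contain no NaN values.
     */
    static double dStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < n || j < m) {
            // Smallest value not yet consumed in the combined data.
            double z = (j >= m || (i < n && sx[i] <= sy[j])) ? sx[i] : sy[j];
            // Step past every copy of z in both samples before comparing,
            // so a tied value adds its whole mass to each empirical CDF.
            while (i < n && sx[i] == z) {
                i++;
            }
            while (j < m && sy[j] == z) {
                j++;
            }
            d = Math.max(d, Math.abs((double) i / n - (double) j / m));
        }
        return d;
    }
}
```

For example, with x = {1, 1, 2} and y = {1, 2, 2} the empirical CDFs differ
only at the value 1 (2/3 versus 1/3), so the sketch returns D = 1/3.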
The current small-sample exactP method computes the probability defined above
by actually enumerating all n-m splits and agrees with tabulated data and R for
samples with no ties. As explained above, in the presence of ties, exact
computation requires that the partition enumeration be over the actual combined
data (including the ties). The fix committed in
ce98d00852e21ce34d8d247db7f6be138967b559 does that, so I think it is correct.
I will run some comparisons with ks.boot (see below) to check consistency /
find errors in the implementation.
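A minimal sketch of exact enumeration over the actual combined data, ties
included, may make the definition concrete. It is not the code from the commit
mentioned above: the class, the method names and the "strict" flag are
hypothetical, and as a choice of this sketch D is scaled by n*m so that
comparisons are exact integer comparisons. It assumes non-empty samples with
no NaN values.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical, not the Commons Math API. */
final class ExactKsSketch {

    /** n*m*D for the split in which inX[k] marks sorted[k] as a first-sample value. */
    static long scaledD(double[] sorted, boolean[] inX, int n, int m) {
        long cx = 0;
        long cy = 0;
        long best = 0;
        for (int k = 0; k < sorted.length; k++) {
            if (inX[k]) {
                cx++;
            } else {
                cy++;
            }
            // Compare the two CDFs only at the end of a run of tied values.
            if (k == sorted.length - 1 || sorted[k + 1] != sorted[k]) {
                best = Math.max(best, Math.abs(cx * m - cy * n));
            }
        }
        return best;
    }

    /** Fraction of all n-subsets of the combined data whose D exceeds (or reaches) the observed D. */
    static double exactP(double[] x, double[] y, boolean strict) {
        int n = x.length;
        int m = y.length;
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        // Merge into one sorted array, remembering the observed group labels.
        double[] all = new double[n + m];
        boolean[] observed = new boolean[n + m];
        int i = 0;
        int j = 0;
        for (int k = 0; k < n + m; k++) {
            if (j >= m || (i < n && sx[i] <= sy[j])) {
                all[k] = sx[i++];
                observed[k] = true;
            } else {
                all[k] = sy[j++];
            }
        }
        long dObs = scaledD(all, observed, n, m);

        // Walk through every choice of n positions out of n + m, in lexicographic order.
        int[] idx = new int[n];
        for (int k = 0; k < n; k++) {
            idx[k] = k;
        }
        boolean[] mask = new boolean[n + m];
        long hits = 0;
        long total = 0;
        while (true) {
            Arrays.fill(mask, false);
            for (int k : idx) {
                mask[k] = true;
            }
            long d = scaledD(all, mask, n, m);
            if (strict ? d > dObs : d >= dObs) {
                hits++;
            }
            total++;
            int p = n - 1;
            while (p >= 0 && idx[p] == m + p) {
                p--; // position p already holds its largest allowed index
            }
            if (p < 0) {
                break;
            }
            idx[p]++;
            for (int q = p + 1; q < n; q++) {
                idx[q] = idx[q - 1] + 1;
            }
        }
        return (double) hits / total;
    }
}
```

The loop visits C(n+m, n) splits, so this is only workable for small samples.
For x = {1, 2} and y = {3, 4}, two of the six splits reach the observed D = 1,
giving 1/3 in the non-strict case and 0 in the strict case.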
Happily, in [2] I found a much more efficient way to compute exactP in the
no-ties case. Unfortunately, I can't find [2], or the algorithm it presents,
freely available anywhere. I am going to try to implement it. Once that is
done, we can likely restrict Monte Carlo to moderate-size samples with ties,
since the faster algorithm should work for the non-tied case up to the level
where the asymptotic approximation is fine (this is basically what R does). In
any case, I think our Monte Carlo implementation should use the combined
sample semantics (as in [1]), which means that in the presence of ties it will
have to use the multi-set as sampling universe.
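One way the combined-sample semantics could look in a Monte Carlo estimate is
sketched below. It is an assumption-laden illustration, not the Commons Math
implementation: the names, the seed parameter and the "strict" flag are
hypothetical. It draws random group assignments by shuffling the pooled
multi-set (sampling without replacement), which is a choice of this sketch,
and it assumes non-empty samples with no NaN values.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; names are hypothetical, not the Commons Math API. */
final class MonteCarloKsSketch {

    /** n*m*D for samples x and y, with tied values contributing their full mass. */
    static long scaledD(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        long best = 0;
        while (i < n || j < m) {
            double z = (j >= m || (i < n && sx[i] <= sy[j])) ? sx[i] : sy[j];
            while (i < n && sx[i] == z) {
                i++;
            }
            while (j < m && sy[j] == z) {
                j++;
            }
            best = Math.max(best, Math.abs((long) i * m - (long) j * n));
        }
        return best;
    }

    /** Estimated p-value from random regroupings of the pooled data. */
    static double monteCarloP(double[] x, double[] y, int iterations,
                              boolean strict, long seed) {
        int n = x.length;
        int m = y.length;
        // The sampling universe is the pooled multi-set: repeated values stay repeated.
        double[] pool = new double[n + m];
        System.arraycopy(x, 0, pool, 0, n);
        System.arraycopy(y, 0, pool, n, m);
        long dObs = scaledD(x, y);
        Random rng = new Random(seed);
        long hits = 0;
        for (int it = 0; it < iterations; it++) {
            // Fisher-Yates shuffle, then treat the first n entries as the n-set.
            for (int k = pool.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                double t = pool[k];
                pool[k] = pool[r];
                pool[r] = t;
            }
            long d = scaledD(Arrays.copyOfRange(pool, 0, n),
                             Arrays.copyOfRange(pool, n, n + m));
            if (strict ? d > dObs : d >= dObs) {
                hits++;
            }
        }
        return (double) hits / iterations;
    }
}
```

With enough iterations the estimate should approach the value obtained by
enumerating every split of the same pooled data.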
[1] Abadie, Alberto. 2002. "Bootstrap Tests for Distributional Treatment
Effects in Instrumental Variable Models." Journal of the American Statistical
Association, 97:457 (March) 284-292. Currently available online at
http://hks.harvard.edu/fs/aabadie/dtep.pdf
[2] Wilcox, Rand. 2012. "Introduction to Robust Estimation and Hypothesis
Testing", 3rd Ed. Academic Press.
> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
> Key: MATH-1246
> URL: https://issues.apache.org/jira/browse/MATH-1246
> Project: Commons Math
> Issue Type: Bug
> Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the
> distribution of a D-statistic for m-n sets with no ties. No warning or
> special handling is delivered in the presence of ties.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)