[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties

Phil Steitz (JIRA) Sun, 13 Sep 2015 16:37:12 -0700

    [ 
https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742726#comment-14742726
 ]


Phil Steitz commented on MATH-1246:
-----------------------------------

I have done some research and I am convinced that the definition based on the 
empirical distributions as given is correct.  In other words:

1.  The statistic that we should use is that given by comparing the empirical 
distributions with ties contributing the mass that they do.  This is the 
Kolmogorov metric that is part of the definition of the test.  Distributions 
with point masses should be allowed and empirical distributions based on data 
including repeated values should be taken as presented by the data.
2.  The correct definition of p-value (with or without ties) is the probability 
that when an n-set and m-set are randomly selected from the combined dataset 
the associated D-value is strictly greater than (or, for the non-strict 
version, greater than or equal to) the observed D value (with ties included).  
Equivalently, it is the probability that when group assignment is done 
randomly, the resulting empirical distributions are separated by a Kolmogorov 
distance at least as large as the observed D.
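
To make point 1 concrete, here is a minimal sketch of the D statistic computed 
directly from the two empirical distributions, with tied values contributing 
their full mass.  The class and method names are hypothetical (this is not the 
Commons Math code), and the input is assumed to contain no NaNs.

```java
import java.util.Arrays;

public class KsStatisticSketch {

    /**
     * Two-sample Kolmogorov distance sup_t |F_x(t) - F_y(t)| between the
     * empirical distributions of x and y. Repeated values are kept as given.
     */
    public static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < sx.length && j < sy.length) {
            // Next jump point of either empirical distribution.
            double t = Math.min(sx[i], sy[j]);
            // Step past every copy of t in both samples before comparing,
            // so a tied value moves both distributions at the same point.
            while (i < sx.length && sx[i] == t) {
                i++;
            }
            while (j < sy.length && sy[j] == t) {
                j++;
            }
            double diff = Math.abs((double) i / sx.length - (double) j / sy.length);
            d = Math.max(d, diff);
        }
        return d;
    }
}
```

For example, KsStatisticSketch.ksStatistic(new double[] {1, 2, 2, 3}, 
new double[] {2, 2, 4}) evaluates the difference only after all copies of 2 
have been consumed from both samples.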

This is supported theoretically in [1], recommended in [2] and implemented in 
ks.boot (a function in the R package Matching), which the R community 
recommends when ties are present in the data.

The current small sample, exactP method computes the probability defined above 
by actually enumerating all n-m splits and agrees with tabulated data and R for 
samples with no ties.  As explained above, in the presence of ties, exact 
computation requires that the partition enumeration be over the actual combined 
data (including the ties).  The fix committed in 
ce98d00852e21ce34d8d247db7f6be138967b559 does that, so I think it is correct.  
I will run some comparisons with ks.boot (see below) to check consistency / 
find errors in the implementation.
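
The enumeration described above can be sketched as follows.  This is only an 
illustration under stated assumptions, not the code in the commit: the class 
and method names are hypothetical, the D values are compared with a small 
tolerance (an assumption made here to guard against floating-point rounding), 
and the brute-force recursion is only practical for very small n + m.

```java
import java.util.Arrays;

public class ExactPSketch {

    /** Kolmogorov distance between the empirical distributions of x and y. */
    public static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < sx.length && j < sy.length) {
            double t = Math.min(sx[i], sy[j]);
            while (i < sx.length && sx[i] == t) {
                i++;
            }
            while (j < sy.length && sy[j] == t) {
                j++;
            }
            d = Math.max(d, Math.abs((double) i / sx.length - (double) j / sy.length));
        }
        return d;
    }

    /**
     * Fraction of all splits of the combined data (ties included) into an
     * n-set and an m-set whose D is > (strict) or >= (non-strict) the observed D.
     */
    public static double exactP(double[] x, double[] y, boolean strict) {
        int n = x.length;
        double[] combined = new double[n + y.length];
        System.arraycopy(x, 0, combined, 0, n);
        System.arraycopy(y, 0, combined, n, y.length);
        double observed = ksStatistic(x, y);
        long[] counts = new long[2]; // counts[0] = extreme splits, counts[1] = all splits
        enumerate(combined, n, 0, 0, new boolean[combined.length], observed, strict, counts);
        return (double) counts[0] / counts[1];
    }

    /** Recursively chooses which positions of the combined array go to the n-set. */
    private static void enumerate(double[] combined, int n, int start, int chosen,
                                  boolean[] inFirst, double observed, boolean strict,
                                  long[] counts) {
        if (chosen == n) {
            double[] a = new double[n];
            double[] b = new double[combined.length - n];
            int ai = 0;
            int bi = 0;
            for (int k = 0; k < combined.length; k++) {
                if (inFirst[k]) {
                    a[ai++] = combined[k];
                } else {
                    b[bi++] = combined[k];
                }
            }
            double d = ksStatistic(a, b);
            double eps = 1e-12; // tolerance for rounding; an assumption of this sketch
            if (strict ? d > observed + eps : d >= observed - eps) {
                counts[0]++;
            }
            counts[1]++;
            return;
        }
        // Leave enough positions to the right to complete the n-set.
        for (int k = start; k <= combined.length - (n - chosen); k++) {
            inFirst[k] = true;
            enumerate(combined, n, k + 1, chosen + 1, inFirst, observed, strict, counts);
            inFirst[k] = false;
        }
    }
}
```

Positions, not values, are enumerated, so repeated values in the combined data 
are counted once per occurrence, which is what taking the data "as presented" 
requires.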

Happily, in [2] I found a much more efficient way to compute exactP in the 
no-ties case.  Unfortunately, I can't find either [2] or the algorithm it 
presents freely available anywhere.  I am going to try to implement it and once that is 
done, we can likely use Monte Carlo only for moderate size samples with ties 
(since the faster algorithm should work for the non-tied case up to the level 
where the asymptotic approximation is fine - this is basically what R does).  I 
think in any case, our Monte Carlo implementation should use the combined 
sample semantics (as in [1]), which means in the presence of ties, it will have 
to use the multi-set as sampling universe.
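
One way to read "multi-set as sampling universe" is sketched below: pool both 
samples with ties kept, randomly reassign group membership by shuffling, and 
count how often the resulting D reaches the observed D.  This is a 
random-assignment (permutation) sketch and not the bootstrap with replacement 
of [1]; the class name, method names, seed parameter and tolerance are all 
assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

public class MonteCarloPSketch {

    /** Kolmogorov distance between the empirical distributions of x and y. */
    public static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < sx.length && j < sy.length) {
            double t = Math.min(sx[i], sy[j]);
            while (i < sx.length && sx[i] == t) {
                i++;
            }
            while (j < sy.length && sy[j] == t) {
                j++;
            }
            d = Math.max(d, Math.abs((double) i / sx.length - (double) j / sy.length));
        }
        return d;
    }

    /**
     * Estimates the p-value by repeatedly shuffling the pooled data (every
     * occurrence of a tied value is a separate element) and splitting it
     * into an n-set and an m-set.
     */
    public static double monteCarloP(double[] x, double[] y, int iterations,
                                     boolean strict, long seed) {
        int n = x.length;
        double[] pooled = new double[n + y.length];
        System.arraycopy(x, 0, pooled, 0, n);
        System.arraycopy(y, 0, pooled, n, y.length);
        double observed = ksStatistic(x, y);
        double eps = 1e-12; // tolerance for rounding; an assumption of this sketch
        Random rng = new Random(seed);
        int extreme = 0;
        for (int it = 0; it < iterations; it++) {
            // Fisher-Yates shuffle of the pooled multiset.
            for (int k = pooled.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                double tmp = pooled[k];
                pooled[k] = pooled[r];
                pooled[r] = tmp;
            }
            double[] a = Arrays.copyOfRange(pooled, 0, n);
            double[] b = Arrays.copyOfRange(pooled, n, pooled.length);
            double d = ksStatistic(a, b);
            if (strict ? d > observed + eps : d >= observed - eps) {
                extreme++;
            }
        }
        return (double) extreme / iterations;
    }
}
```

With enough iterations the estimate should approach the value obtained by full 
enumeration of the splits of the same pooled data.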

[1] Abadie, Alberto. 2002.  "Bootstrap Tests for Distributional Treatment 
Effects in Instrumental Variable Models." Journal of the American Statistical 
Association, 97:457 (March) 284-292.  Currently available online at 
http://hks.harvard.edu/fs/aabadie/dtep.pdf

[2] Wilcox, Rand. 2012.  "Introduction to Robust Estimation and Hypothesis 
Testing", 3rd Ed. Academic Press.

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1246
>                 URL: https://issues.apache.org/jira/browse/MATH-1246
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the 
> distribution of a D-statistic for m-n sets with no ties.  No warning or 
> special handling is delivered in the presence of ties.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
