[ 
https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746500#comment-14746500
 ] 

Phil Steitz commented on MATH-1246:
-----------------------------------

I don't think there is really a question about the definition of the p-value: 
every reference I can find confirms it to be what I described above.  It is 
also well-defined in the presence of ties, just messy to compute, because the 
distribution of D then depends not just on n and m but also on the location of 
the ties.  The permutation method does correspond to ks.boot, and it is what I 
would propose for the Monte Carlo implementation in the presence of ties.  The 
method currently implemented for exactP(x,y,b) computes p-values by full 
enumeration of the underlying sample space sampled by ks.boot (resp. the 
permutation method).
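
To make the full-enumeration idea concrete, here is a minimal Java sketch.  It 
is not the Commons Math code: the class and method names are made up for 
illustration, the inequality is taken to be non-strict (D >= observed), and the 
bitmask enumeration is only practical for very small n + m.  It assumes the 
samples contain no NaNs.

```java
import java.util.Arrays;

public class ExactTiesPValueSketch {

    /**
     * Two-sample K-S statistic scaled by n*m, i.e. n*m*D, returned as an
     * integer so that comparisons between splits are exact. Tied values are
     * consumed from both samples before the ECDF difference is evaluated.
     */
    static long scaledD(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length, m = sy.length, i = 0, j = 0;
        long best = 0;
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            while (i < n && sx[i] == t) i++;
            while (j < m && sy[j] == t) j++;
            best = Math.max(best, Math.abs((long) i * m - (long) j * n));
        }
        return best;
    }

    /**
     * Fraction of all C(n+m, n) splits of the combined sample whose D is at
     * least the observed D (hypothetical helper, non-strict inequality).
     */
    static double exactPermutationP(double[] x, double[] y) {
        int n = x.length, m = y.length, total = n + m;
        if (total > 24) {
            throw new IllegalArgumentException("sketch scans 2^(n+m) bitmasks");
        }
        double[] combined = new double[total];
        System.arraycopy(x, 0, combined, 0, n);
        System.arraycopy(y, 0, combined, n, m);
        long observed = scaledD(x, y);
        long splits = 0, extreme = 0;
        double[] px = new double[n];
        double[] py = new double[m];
        for (int mask = 0; mask < (1 << total); mask++) {
            if (Integer.bitCount(mask) != n) continue; // keep size-n subsets only
            int a = 0, b = 0;
            for (int k = 0; k < total; k++) {
                if ((mask & (1 << k)) != 0) px[a++] = combined[k];
                else py[b++] = combined[k];
            }
            splits++;
            if (scaledD(px, py) >= observed) extreme++;
        }
        return (double) extreme / splits;
    }
}
```

Because tied values stay in the combined sample, the count automatically 
reflects where the ties fall, which is why the result depends on more than n 
and m.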

I understand, though, the inefficiency of doing full enumeration and the 
convenience of working with D statistics that depend only on n and m.  So I 
think it may be best to do as you suggest and make the behavior in the presence 
of ties configurable.  I like the idea of introducing a (public or protected) 
jitter method that just randomly perturbs combined samples with ties.  Then, if 
you configure ties handling to use jitter, the implementation just applies the 
jitter and then uses the (not yet implemented) fast no-ties method for exactP 
and the current monteCarlo implementation (which does not handle ties) for 
monteCarloP.  [An interesting theorem to prove is that the expected p-value 
computed using random jitter in the presence of ties equals the true p-value, 
which in turn equals the expectation of the permutation method (the last part 
is what [1] shows).]  Once we have the fast no-ties method implemented, we may 
be able to dispense with the version of monteCarloP that does not handle ties, 
as the fast exact method should be usable (and more accurate) up to the sample 
sizes where the method based on the K-S distribution is OK.
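
One possible shape for such a jitter method is sketched below in Java.  This is 
only an illustration under my own assumptions: the names are hypothetical, 
every value is perturbed (not just the tied ones), and the noise width is half 
the smallest positive gap in the combined sample so that the order of distinct 
values is preserved.  The arrays are modified in place.

```java
import java.util.Arrays;
import java.util.Random;

public class TiesJitterSketch {

    /** Concatenates the two samples into a new array. */
    static double[] concat(double[] x, double[] y) {
        double[] all = new double[x.length + y.length];
        System.arraycopy(x, 0, all, 0, x.length);
        System.arraycopy(y, 0, all, x.length, y.length);
        return all;
    }

    /** True if any value occurs more than once in the combined sample. */
    static boolean hasTies(double[] x, double[] y) {
        double[] all = concat(x, y);
        Arrays.sort(all);
        for (int i = 1; i < all.length; i++) {
            if (all[i] == all[i - 1]) return true;
        }
        return false;
    }

    /** Smallest positive gap between combined values, or 1.0 if all are equal. */
    static double minGap(double[] x, double[] y) {
        double[] all = concat(x, y);
        Arrays.sort(all);
        double gap = Double.POSITIVE_INFINITY;
        for (int i = 1; i < all.length; i++) {
            double d = all[i] - all[i - 1];
            if (d > 0 && d < gap) gap = d;
        }
        return Double.isInfinite(gap) ? 1.0 : gap;
    }

    /**
     * Adds uniform noise in [-gap/4, gap/4) to every value, in place.
     * Distinct values keep their relative order; tied values are split at
     * random. A tie can survive only if two noise draws collide, which is
     * possible but extremely unlikely, so callers may want to re-check.
     */
    static void jitter(double[] x, double[] y, Random rng) {
        double width = minGap(x, y) / 2;
        for (int i = 0; i < x.length; i++) {
            x[i] += (rng.nextDouble() - 0.5) * width;
        }
        for (int j = 0; j < y.length; j++) {
            y[j] += (rng.nextDouble() - 0.5) * width;
        }
    }
}
```

After the call, the no-ties code paths can be applied to the perturbed samples; 
the resulting p-value is itself random, which is what the bracketed remark 
about expectations is getting at.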

Assuming you are OK with this, I will proceed with a) the fast implementation 
of the no-ties exactP and b) a version of monteCarloP that basically does what 
ks.boot does (resamples the combined dataset with ties included).
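
As a rough Java sketch of item b), assuming the permutation flavor of 
resampling: the combined sample is shuffled without replacement, split into 
parts of size n and m, and the p-value is estimated as the fraction of 
shuffles whose D is at least the observed D.  This is not a port of ks.boot, 
whose own resampling scheme may differ in detail, and the names below are 
hypothetical.  It assumes the samples contain no NaNs.

```java
import java.util.Arrays;
import java.util.Random;

public class PermutationMonteCarloSketch {

    /** n*m*D as an exact integer; tied values are consumed from both samples. */
    static long scaledD(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length, m = sy.length, i = 0, j = 0;
        long best = 0;
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            while (i < n && sx[i] == t) i++;
            while (j < m && sy[j] == t) j++;
            best = Math.max(best, Math.abs((long) i * m - (long) j * n));
        }
        return best;
    }

    /** Estimates P(D >= observed D) over random splits of the combined sample. */
    static double monteCarloP(double[] x, double[] y, int iterations, Random rng) {
        int n = x.length, m = y.length;
        double[] combined = new double[n + m];
        System.arraycopy(x, 0, combined, 0, n);
        System.arraycopy(y, 0, combined, n, m);
        long observed = scaledD(x, y);
        int extreme = 0;
        for (int it = 0; it < iterations; it++) {
            // Fisher-Yates shuffle: every ordering of the combined sample
            // (ties included) is equally likely.
            for (int k = combined.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                double tmp = combined[k];
                combined[k] = combined[r];
                combined[r] = tmp;
            }
            double[] px = Arrays.copyOfRange(combined, 0, n);
            double[] py = Arrays.copyOfRange(combined, n, n + m);
            if (scaledD(px, py) >= observed) extreme++;
        }
        return (double) extreme / iterations;
    }
}
```

No special casing of ties is needed here, since the tied values simply travel 
with the shuffle.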

If we agree on this approach, we need to decide:
# How configuration should work
# How (if at all) we signal to the user that there are ties in the data
# What the default behavior should be

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1246
>                 URL: https://issues.apache.org/jira/browse/MATH-1246
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the 
> distribution of a D-statistic for m-n sets with no ties.  No warning or 
> special handling is delivered in the presence of ties.


