[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746500#comment-14746500 ]
Phil Steitz commented on MATH-1246:
-----------------------------------
I don't think there is really a question about the definition of the p-value:
every reference I can find confirms it to be what I described above. It is
well-defined in the presence of ties, just messy to compute, as the
distribution of D depends not just on n and m but also on the location of the
ties. The permutation method does correspond to ks.boot and to what I would
propose for the Monte Carlo implementation in the presence of ties. The method
currently implemented for exactP(x,y,b) computes p-values by full enumeration
of the underlying sample space sampled by ks.boot (resp. the permutation
method).
I understand, though, the inefficiency of doing full enumeration and the
convenience of working with D statistics that depend only on n and m. So I
think it may be best to do as you suggest and make the behavior in the presence
of ties configurable. I like the idea of introducing a (public or protected)
jitter method that just randomly perturbs combined samples with ties. Then if
you configure ties handling to use jitter, the implementation just applies the
jitter and uses the (not yet implemented) fast method for exact computation
without ties and the current monteCarlo implementation (that does not handle
ties) for monteCarloP. [An interesting theorem to prove is that the expected
p-value computed using random jitter in the presence of ties equals the true
p-value, which in turn equals the expectation of the permutation method (the
last part is what [1] shows).] Once we have the fast no-ties method
implemented, we may be able to dispense with the version of monteCarloP that
does not handle ties, as the fast, exact method should be usable up to the
sample sizes where the method based on the K-S distribution is OK (and the
exact method is more accurate).
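
To make the jitter idea concrete, here is a minimal sketch of what such a
method could look like. The class and method names (JitterSketch, hasTies,
minGap, jitter) are hypothetical and not part of Commons Math, and scaling the
noise to a quarter of the smallest gap between distinct pooled values is an
assumption made here so that values that were already distinct keep their
relative order.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of tie-breaking by jitter; not Commons Math code. */
public class JitterSketch {

    /** Returns the pooled sample (x followed by y), sorted ascending. */
    static double[] sortedPool(double[] x, double[] y) {
        double[] all = new double[x.length + y.length];
        System.arraycopy(x, 0, all, 0, x.length);
        System.arraycopy(y, 0, all, x.length, y.length);
        Arrays.sort(all);
        return all;
    }

    /** True if any value occurs more than once in the pooled sample. */
    static boolean hasTies(double[] x, double[] y) {
        double[] all = sortedPool(x, y);
        for (int i = 1; i < all.length; i++) {
            if (all[i] == all[i - 1]) {
                return true;
            }
        }
        return false;
    }

    /** Smallest positive gap between pooled values (1.0 if all values are equal). */
    static double minGap(double[] x, double[] y) {
        double[] all = sortedPool(x, y);
        double gap = Double.POSITIVE_INFINITY;
        for (int i = 1; i < all.length; i++) {
            double diff = all[i] - all[i - 1];
            if (diff > 0 && diff < gap) {
                gap = diff;
            }
        }
        return Double.isInfinite(gap) ? 1.0 : gap;
    }

    /**
     * Perturbs both samples in place with uniform noise in
     * [-minGap/4, minGap/4). Two perturbed values can move toward each other
     * by less than minGap/2 in total, so distinct values keep their order
     * while tied values are split at random.
     */
    static void jitter(double[] x, double[] y, Random rng) {
        double halfWidth = minGap(x, y) / 4;
        for (int i = 0; i < x.length; i++) {
            x[i] += (2 * rng.nextDouble() - 1) * halfWidth;
        }
        for (int i = 0; i < y.length; i++) {
            y[i] += (2 * rng.nextDouble() - 1) * halfWidth;
        }
    }
}
```

A caller would check hasTies(x, y) first and apply jitter to copies of the
data only when ties are present. If the gaps are tiny relative to the
magnitude of the data, the noise can be lost to rounding, so a real
implementation would probably need to re-check for ties afterwards.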
Assuming you are OK with this, I will proceed with a) the fast implementation
of the no-ties exactP and b) a version of monteCarloP that basically does what
ks.boot does (resamples the combined dataset with ties included).
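
For reference, the resampling approach in b) could be sketched roughly as
follows. This is an illustration only: the class and method names are
hypothetical, the resampling is done by random permutation of the pooled data
(the permutation method discussed above), and the strict flag is an assumed
parameter that chooses between counting D > observed and D >= observed. The
statistic is accumulated in integer form so that equal D values compare
exactly.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of a permutation-based p-value; not Commons Math code. */
public class PermutationKsSketch {

    /** Two-sample K-S statistic D; tied values are stepped over together. */
    static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        long maxDiff = 0; // max of |i*m - j*n|, i.e. n*m*D
        while (i < n && j < m) {
            double z = Math.min(sx[i], sy[j]);
            while (i < n && sx[i] == z) {
                i++;
            }
            while (j < m && sy[j] == z) {
                j++;
            }
            maxDiff = Math.max(maxDiff, Math.abs((long) i * m - (long) j * n));
        }
        return (double) maxDiff / ((long) n * m);
    }

    /** Estimates the p-value by randomly relabeling the pooled data. */
    static double permutationP(double[] x, double[] y, int iterations,
                               boolean strict, Random rng) {
        int n = x.length;
        int m = y.length;
        double observed = ksStatistic(x, y);
        double[] pooled = new double[n + m];
        System.arraycopy(x, 0, pooled, 0, n);
        System.arraycopy(y, 0, pooled, n, m);
        int count = 0;
        for (int it = 0; it < iterations; it++) {
            // Fisher-Yates shuffle of the pooled sample, ties included.
            for (int k = pooled.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                double tmp = pooled[k];
                pooled[k] = pooled[r];
                pooled[r] = tmp;
            }
            double d = ksStatistic(Arrays.copyOfRange(pooled, 0, n),
                                   Arrays.copyOfRange(pooled, n, n + m));
            if (d > observed || (!strict && d == observed)) {
                count++;
            }
        }
        return (double) count / iterations;
    }
}
```

Because the ties stay in the pooled data, each relabeling produces a D value
from the same tie-dependent distribution as the observed one, which is the
point of the approach.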
If we agree on this approach, we need to decide:
# How configuration should work
# How (if at all) we signal to the user that there are ties in the data
# What the default behavior should be
> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
> Key: MATH-1246
> URL: https://issues.apache.org/jira/browse/MATH-1246
> Project: Commons Math
> Issue Type: Bug
> Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the
> distribution of a D-statistic for m-n sets with no ties. No warning or
> special handling is delivered in the presence of ties.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)