[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744680#comment-14744680 ]
Phil Steitz commented on MATH-1246:
-----------------------------------
Thanks again, Otmar, for looking carefully at this. It is a little painful to
try to do this in JIRA comments, but since we started here and it will be best
to keep the comments together, I will try to respond to each of your points
above.
# I must be missing something here. In the case x = (1, 3, 3, 5) and y = (2,
3, 3, 6), I don't see how there is ambiguity in the D statistic, which looks
correct to me at .25. The D statistic is the maximum difference in the
empirical distributions. In this case, the max is .25, attained at two domain
values: 1 and 5.
# The D statistics are the same, but the empirical distributions and underlying
datasets are different. The p-value depends on both the D-statistic and the
empirical distributions. When there are no ties, D_n,m has the same
distribution regardless of the underlying sample data. When ties are present,
the distribution is still discrete, but it depends on the number and location
of the ties.
# What is proven is that bootstrapping gives asymptotically correct results.
The bootstrapping is over the combined dataset, including ties (as ks.boot
does). Exact computation by enumerating all possible splits of the combined
dataset gives the same result that bootstrapping yields in expectation.
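To make the three points above concrete, here is a minimal sketch in plain
Python (not Commons Math code). It computes D for the example samples, counts
D over every split of a pooled dataset, and derives a p-value from those
counts. The function names and the `strict` flag are hypothetical, introduced
only for this illustration, and the enumeration is only practical for very
small samples.

```python
from bisect import bisect_right
from collections import Counter
from fractions import Fraction
from itertools import combinations


def ecdf(sorted_sample, v):
    """Empirical CDF at v: share of observations <= v (exact arithmetic)."""
    return Fraction(bisect_right(sorted_sample, v), len(sorted_sample))


def ks_statistic(x, y):
    """Return (D, points where D is attained); ties are allowed."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))
    gaps = {v: abs(ecdf(xs, v) - ecdf(ys, v)) for v in points}
    d = max(gaps.values())
    return d, [v for v in points if gaps[v] == d]


def split_distribution(pooled, n):
    """Count D over every split of pooled into sizes n and len(pooled) - n.

    Splits are taken over positions, so tied values stay distinguishable.
    """
    counts = Counter()
    indices = range(len(pooled))
    for idx in combinations(indices, n):
        chosen = set(idx)
        first = [pooled[i] for i in idx]
        second = [pooled[i] for i in indices if i not in chosen]
        counts[ks_statistic(first, second)[0]] += 1
    return counts


def exact_p_value(x, y, strict=False):
    """Share of splits whose D is >= (or >, if strict) the observed D."""
    d_obs, _ = ks_statistic(x, y)
    counts = split_distribution(list(x) + list(y), len(x))
    hits = sum(c for d, c in counts.items()
               if (d > d_obs if strict else d >= d_obs))
    return Fraction(hits, sum(counts.values()))


x, y = [1, 3, 3, 5], [2, 3, 3, 6]

d, where = ks_statistic(x, y)
print(d, where)  # 1/4 [1, 5]

# Same sample sizes, with and without ties in the pooled data.
tied = split_distribution(x + y, len(x))
untied = split_distribution(list(range(1, 9)), 4)
print({str(k): v for k, v in sorted(tied.items())})    # {'1/4': 24, '1/2': 46}
print({str(k): v for k, v in sorted(untied.items())})  # {'1/4': 16, '1/2': 38, '3/4': 14, '1': 2}

print(exact_p_value(x, y), exact_p_value(x, y, strict=True))  # 1 23/35
```

With the ties in place, the 70 splits only ever produce D = 1/4 or D = 1/2,
whereas eight distinct values spread D over 1/4, 1/2, 3/4 and 1. So the same
observed D of 1/4 sits in two different reference distributions.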
It could be that we are not agreeing on the core definition of what the p-value
is supposed to be. To me, ties in the data just add mass to the empirical
distributions where they fall and the 2-sample test is really just assessing
the null hypothesis that the distributions represent draws from the same
underlying population distribution. The common underlying distribution under
the null hypothesis is best represented by the pooled data. This is the
interpretation that [1] appears to agree with, and [2] (sadly not free) as
well.
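As a rough illustration of that interpretation, the sketch below treats the
pooled data as the common distribution under the null hypothesis and draws
both samples from it with replacement. This is a plain Python sketch under
that assumption, not a port of ks.boot and not Commons Math code; the
function names, the number of draws and the seed are hypothetical choices.

```python
import random
from bisect import bisect_right
from fractions import Fraction


def ks_d(x, y):
    """Two-sample D statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(Fraction(bisect_right(xs, v), len(xs))
            - Fraction(bisect_right(ys, v), len(ys)))
        for v in set(xs) | set(ys)
    )


def pooled_bootstrap_p_value(x, y, draws=5000, seed=0):
    """Estimate P(D* >= D_obs), resampling both samples from the pooled data."""
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    pooled = list(x) + list(y)     # ties keep their extra mass here
    d_obs = ks_d(x, y)
    hits = 0
    for _ in range(draws):
        bx = rng.choices(pooled, k=len(x))  # draw with replacement
        by = rng.choices(pooled, k=len(y))
        if ks_d(bx, by) >= d_obs:
            hits += 1
    return hits / draws


print(pooled_bootstrap_p_value([1, 3, 3, 5], [2, 3, 3, 6]))
print(pooled_bootstrap_p_value([1, 2, 3, 4], [5, 6, 7, 8]))
```

The returned value is a Monte Carlo estimate, so it carries resampling noise
that shrinks as `draws` grows.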
> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
> Key: MATH-1246
> URL: https://issues.apache.org/jira/browse/MATH-1246
> Project: Commons Math
> Issue Type: Bug
> Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the
> distribution of a D-statistic for m-n sets with no ties. No warning or
> special handling is delivered in the presence of ties.