[ 
https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744680#comment-14744680
 ] 

Phil Steitz commented on MATH-1246:
-----------------------------------

Thanks again, Otmar, for looking carefully at this.  It is a little painful to 
try to do this in JIRA comments, but since we started here and it will be best 
to keep the comments together, I will try to respond to each of your points 
above.

# I must be missing something here.  In the case x = (1, 3, 3, 5) and y = (2, 
3, 3, 6), I don't see how there is ambiguity in the D statistic, which looks 
correct to me at .25.  The D statistic is the maximum difference in the 
empirical distributions.  In this case, the max is .25, attained at two domain 
values: 1 and 5.
# The D statistics are the same, but the empirical distributions and underlying 
datasets are different.  The p-value depends on both the D-statistic and the 
empirical distributions.  When there are no ties, D_n,m has the same 
distribution regardless of the underlying sample data.  When ties are present, 
the distribution is still discrete, but it depends on the number and location 
of the ties.
# What is proven is that bootstrapping gives asymptotically correct results.  
The bootstrapping is over the combined dataset, including ties (as ks.boot 
does).  Exact computation by enumerating all possible splits of the combined 
dataset will give the same result that bootstrapping is expected to produce.
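
To make the first point concrete, here is a minimal sketch of the D statistic 
computed directly from the two empirical distributions.  The class and method 
names (KsStatisticSketch, dStatistic) are made up for illustration and are not 
the Commons Math implementation.  Tied values are consumed from both samples 
before the difference is measured, so ties only add mass at that point.

```java
import java.util.Arrays;

public class KsStatisticSketch {

    /**
     * Hypothetical helper: maximum absolute difference between the
     * empirical distribution functions of x and y.
     */
    static double dStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        long maxScaled = 0; // max of |F_x - F_y| multiplied by n * m
        while (i < n || j < m) {
            // Next distinct domain value in the merged samples.
            double v = (j >= m || (i < n && sx[i] <= sy[j])) ? sx[i] : sy[j];
            // Step past every observation equal to v in both samples,
            // so the difference is taken after all ties at v are counted.
            while (i < n && sx[i] == v) {
                i++;
            }
            while (j < m && sy[j] == v) {
                j++;
            }
            // |i/n - j/m| scaled by n * m, kept in integers to stay exact.
            long scaled = Math.abs((long) i * m - (long) j * n);
            maxScaled = Math.max(maxScaled, scaled);
        }
        return (double) maxScaled / ((long) n * m);
    }

    public static void main(String[] args) {
        double[] x = {1, 3, 3, 5};
        double[] y = {2, 3, 3, 6};
        System.out.println(dStatistic(x, y)); // prints 0.25
    }
}
```

For the x and y above, the scaled difference is 4/16 at the domain values 1 
and 5 and zero everywhere else, which is the .25 quoted in the first point.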

It could be that we are not agreeing on the core definition of what the p-value 
is supposed to be.  To me, ties in the data just add mass to the empirical 
distributions where they fall and the 2-sample test is really just assessing 
the null hypothesis that the distributions represent draws from the same 
underlying population distribution.   The common underlying distribution under 
the null hypothesis is best represented by the pooled data.  This is the 
interpretation that [1] appears to agree with, and [2] (sadly not free) as 
well.  
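
A rough sketch of that interpretation, assuming the p-value is defined as the 
fraction of all n-versus-m splits of the pooled data whose D statistic is at 
least as large as the observed one.  The names (KsPooledSplitSketch, scaledD, 
exactPValue) are hypothetical, the non-strict comparison is my assumption, and 
the bitmask enumeration is only practical for small samples.  It is not the 
Commons Math code.

```java
import java.util.Arrays;

public class KsPooledSplitSketch {

    /** Max |F_a - F_b| multiplied by a.length * b.length (exact integer). */
    static long scaledD(double[] a, double[] b) {
        double[] sa = a.clone();
        double[] sb = b.clone();
        Arrays.sort(sa);
        Arrays.sort(sb);
        int n = sa.length;
        int m = sb.length;
        int i = 0;
        int j = 0;
        long max = 0;
        while (i < n || j < m) {
            double v = (j >= m || (i < n && sa[i] <= sb[j])) ? sa[i] : sb[j];
            // Count all ties at v in both samples before comparing.
            while (i < n && sa[i] == v) {
                i++;
            }
            while (j < m && sb[j] == v) {
                j++;
            }
            max = Math.max(max, Math.abs((long) i * m - (long) j * n));
        }
        return max;
    }

    /** Fraction of pooled-data splits with D >= the observed D. */
    static double exactPValue(double[] x, double[] y) {
        int n = x.length;
        int m = y.length;
        int total = n + m;
        if (total > 20) {
            throw new IllegalArgumentException("enumeration is for small samples only");
        }
        // Pool the data, ties included.
        double[] pooled = new double[total];
        System.arraycopy(x, 0, pooled, 0, n);
        System.arraycopy(y, 0, pooled, n, m);

        long observed = scaledD(x, y);
        long splits = 0;
        long atLeastAsExtreme = 0;
        // Each bitmask with exactly n bits set picks the first sample.
        for (int mask = 0; mask < (1 << total); mask++) {
            if (Integer.bitCount(mask) != n) {
                continue;
            }
            double[] a = new double[n];
            double[] b = new double[m];
            int ia = 0;
            int ib = 0;
            for (int k = 0; k < total; k++) {
                if (((mask >> k) & 1) == 1) {
                    a[ia++] = pooled[k];
                } else {
                    b[ib++] = pooled[k];
                }
            }
            splits++;
            if (scaledD(a, b) >= observed) {
                atLeastAsExtreme++;
            }
        }
        return (double) atLeastAsExtreme / splits;
    }

    public static void main(String[] args) {
        double[] x = {1, 3, 3, 5};
        double[] y = {2, 3, 3, 6};
        System.out.println(exactPValue(x, y));
    }
}
```

Because tied values stay in the pool, splits that differ only in which copy 
of a tied value goes where are counted separately, so the resulting 
distribution of D reflects the number and location of the ties.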

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1246
>                 URL: https://issues.apache.org/jira/browse/MATH-1246
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the 
> distribution of a D-statistic for m-n sets with no ties.  No warning or 
> special handling is delivered in the presence of ties.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
