[ 
https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616134#comment-14616134
 ] 

Phil Steitz commented on MATH-1246:
-----------------------------------

I think the current implementation can be fixed as follows.  If we move to a 
faster implementation, the strategy below may not work.

What exactP does now is exhaustively compute the D-statistic for every split 
of the m+n combined indices into an m-set and an n-set, and tally the number 
that exceed (strict) or are at least as large as (not strict) the observed D.  
If there are ties in the data, it is not correct to look at partitions of m+n, 
since not all partitions of an (m+n)-element multi-set with duplicates are 
distinct, and the set of possible D values is different in the presence of 
ties.

I think we can correctly handle ties in the data if we compute and tally D 
statistics based on a combined multi-set sample with duplicates in the 
positions corresponding to what is observed in the data.  For example, suppose 
that the two samples are x = [0, 3, 6, 9, 9, 10] and y = [1, 3, 4, 8, 11].  
Then the multi-set universe is U = {0, 1, 3, 3, 4, 6, 8, 9, 9, 10, 11}.  As 
before, we generate partitions of the 11 indices 0..10 into a 6-set and a 
5-set, but instead of computing the D-statistic on the index subsets 
themselves, we use them as indexes into U.  So if a generated split is 
mSet = {0, 2, 3, 7, 8, 9}, nSet = {1, 4, 5, 6, 10}, we compute D for 
[0, 3, 3, 9, 9, 10] and [1, 4, 6, 8, 11].  The rationale here is that the 
p-value is the probability that, if U is split randomly into a 6-set and a 
5-set, the resulting D value exceeds (strict) or is at least as large as (not 
strict) the observed D.
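
To make the proposal concrete, here is a minimal, unoptimized sketch of the 
index-into-U strategy.  It is not the Commons Math code: the class 
TieAwareExactP and the methods scaledD and exactP are hypothetical names used 
only for illustration.  One further assumption is that D is carried as the 
integer |i*n - j*m| (that is, D times m*n), so that comparisons against the 
observed value are exact rather than subject to floating-point rounding.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Commons Math.
class TieAwareExactP {

    /**
     * Returns m*n*D for two SORTED samples, where D = sup |F_x(t) - F_y(t)|.
     * All copies of a tied value are consumed before the ECDFs are compared.
     */
    static long scaledD(double[] x, double[] y) {
        final int m = x.length, n = y.length;
        int i = 0, j = 0;
        long d = 0;
        while (i < m && j < n) {
            final double t = Math.min(x[i], y[j]);
            while (i < m && x[i] == t) i++;
            while (j < n && y[j] == t) j++;
            d = Math.max(d, Math.abs((long) i * n - (long) j * m));
        }
        // Once one sample is exhausted the difference can only shrink.
        return d;
    }

    /** Exact p-value obtained by enumerating every m-subset of indexes into U. */
    static double exactP(double[] x, double[] y, boolean strict) {
        final int m = x.length, n = y.length;
        final double[] xs = x.clone(), ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        final long observed = scaledD(xs, ys);

        // U: the combined multi-set, sorted, with duplicates kept.
        final double[] u = new double[m + n];
        System.arraycopy(xs, 0, u, 0, m);
        System.arraycopy(ys, 0, u, m, n);
        Arrays.sort(u);

        final int[] idx = new int[m];            // current m-subset of 0..m+n-1
        for (int k = 0; k < m; k++) idx[k] = k;
        final double[] mSet = new double[m], nSet = new double[n];
        long tail = 0, total = 0;
        while (true) {
            // Split U by index; both halves stay sorted because U is sorted.
            int a = 0, b = 0;
            for (int k = 0; k < m + n; k++) {
                if (a < m && idx[a] == k) mSet[a++] = u[k];
                else nSet[b++] = u[k];
            }
            final long d = scaledD(mSet, nSet);
            if (strict ? d > observed : d >= observed) tail++;
            total++;

            // Advance to the next m-subset in lexicographic order.
            int p = m - 1;
            while (p >= 0 && idx[p] == n + p) p--;
            if (p < 0) break;
            idx[p]++;
            for (int q = p + 1; q < m; q++) idx[q] = idx[q - 1] + 1;
        }
        return (double) tail / total;
    }

    public static void main(String[] args) {
        double[] x = {0, 3, 6, 9, 9, 10};
        double[] y = {1, 3, 4, 8, 11};
        System.out.println("non-strict p = " + exactP(x, y, false));
        System.out.println("strict p     = " + exactP(x, y, true));
    }
}
```

Every index split is counted, including splits that yield identical value 
sets, which matches the rationale of splitting U at random.  The enumeration 
visits all C(m+n, m) splits, so like the current approach it is only 
practical for small samples.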

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1246
>                 URL: https://issues.apache.org/jira/browse/MATH-1246
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the 
> distribution of a D-statistic for m-n sets with no ties.  No warning or 
> special handling is delivered in the presence of ties.


