[
https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739536#comment-14739536
]
Otmar Ertl commented on MATH-1246:
----------------------------------
The Monte Carlo approach can be modified by simultaneously sampling D. Here is
an outline of how this sampling could be achieved:
# First, determine the set of points P = (p_i) for which equal values exist in
both samples.
# Determine the maximum difference of the CDFs over all values not included in P.
# Determine for each point p_i whether it is possible at all to get a CDF
difference that is larger than the calculated maximum. If not, that point can
be excluded from P. Otherwise, remember the CDF difference d_i just before
that point and the numbers of equal values in the two samples, n_i and m_i,
respectively.
# Within each Monte Carlo iteration, generate for each point p_i a random
ordering of the n_i and m_i equal values (using a function similar to
fillBooleanArrayRandomlyWithFixedNumberTrueValues). Determine the maximum
differences of the CDFs at all points p_i using the random ordering and d_i,
and take the maximum of these and the maximum calculated in step 2, which gives
the sampled (observed) D-statistic that is finally compared to curD.
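The steps above could be sketched roughly as follows. This is only an
illustration under stated assumptions, not a proposed patch: the names
TieAwareKsSketch, TiePoint, sampleD and randomOrdering are hypothetical and
not part of Commons Math, randomOrdering is a simple shuffle standing in for
fillBooleanArrayRandomlyWithFixedNumberTrueValues, the inputs are assumed to
contain no NaN, and CDF differences are stored as integers scaled by n*m to
avoid rounding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of the outlined sampling; not Commons Math API. */
final class TieAwareKsSketch {

    /** A value present in both samples (a point p_i). */
    static final class TiePoint {
        final long before; // scaled CDF difference d_i just before p_i
        final int nx;      // multiplicity of p_i in the first sample
        final int ny;      // multiplicity of p_i in the second sample
        TiePoint(long before, int nx, int ny) {
            this.before = before;
            this.nx = nx;
            this.ny = ny;
        }
    }

    final int n;
    final int m;
    final long fixedMax; // step 2: max |F_x - F_y| * n * m over values not in P
    final List<TiePoint> ties = new ArrayList<>();

    TieAwareKsSketch(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        n = sx.length;
        m = sy.length;
        List<TiePoint> candidates = new ArrayList<>();
        long diff = 0; // (F_x - F_y) * n * m
        long max = 0;
        int i = 0;
        int j = 0;
        while (i < n || j < m) {
            // Next distinct value in the merged order.
            double v = (j >= m || (i < n && sx[i] <= sy[j])) ? sx[i] : sy[j];
            int a = 0;
            int b = 0;
            while (i < n && sx[i] == v) { i++; a++; }
            while (j < m && sy[j] == v) { j++; b++; }
            if (a > 0 && b > 0) {
                candidates.add(new TiePoint(diff, a, b)); // step 1
                diff += (long) a * m - (long) b * n;
            } else {
                diff += (long) a * m - (long) b * n;
                max = Math.max(max, Math.abs(diff));      // step 2
            }
        }
        fixedMax = max;
        // Step 3: keep only tie points that could exceed the fixed maximum.
        for (TiePoint t : candidates) {
            long reach = Math.max(Math.abs(t.before + (long) t.nx * m),
                                  Math.abs(t.before - (long) t.ny * n));
            if (reach > fixedMax) {
                ties.add(t);
            }
        }
    }

    /** Step 4: one sampled D-statistic with ties broken at random. */
    double sampleD(Random rng) {
        long max = fixedMax;
        for (TiePoint t : ties) {
            long diff = t.before;
            for (boolean fromX : randomOrdering(t.nx, t.ny, rng)) {
                diff += fromX ? m : -n;
                max = Math.max(max, Math.abs(diff));
            }
        }
        return (double) max / ((long) n * m);
    }

    /** Array with exactly {@code trues} true entries in uniformly random order. */
    static boolean[] randomOrdering(int trues, int falses, Random rng) {
        boolean[] b = new boolean[trues + falses];
        Arrays.fill(b, 0, trues, true);
        for (int k = b.length - 1; k > 0; k--) { // Fisher-Yates shuffle
            int r = rng.nextInt(k + 1);
            boolean tmp = b[k];
            b[k] = b[r];
            b[r] = tmp;
        }
        return b;
    }
}
```

For example, with x = {1, 2, 3, 4} and y = {4, 5, 6, 7} the single tie at 4 is
kept, and sampleD returns 1.0 or 0.75 depending on which of the two equal
values is ordered first. With x = {1, 2, 3, 9} and y = {4, 5, 6, 9} the tie at
9 is excluded in step 3 and every iteration yields 0.75.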
Anyway, we should find the right definition first before implementing anything.
> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
> Key: MATH-1246
> URL: https://issues.apache.org/jira/browse/MATH-1246
> Project: Commons Math
> Issue Type: Bug
> Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the
> distribution of a D-statistic for m-n sets with no ties. No warning or
> special handling is delivered in the presence of ties.