[jira] [Comment Edited] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties

Phil Steitz (JIRA) Mon, 09 Nov 2015 15:33:47 -0800

    [ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997372#comment-14997372 ]


Phil Steitz edited comment on MATH-1246 at 11/9/15 11:32 PM:
-------------------------------------------------------------

I did some more extensive testing against R's ks.boot and found significant
differences from the code in ce98d00852e21ce34d8d247db7f6be138967b559.  I have
determined why the results differ, and that my initial approach was
incorrect.  ks.boot samples "with replacement" from the combined empirical
distribution, while my approach constrains the n-m split to one that can be
achieved by partitioning the combined dataset.  I interpreted the p-value to
be essentially the same as in the no-ties case: the probability that, when
the combined set of values is split into an n-set and an m-set, the KS
statistic is greater than or equal to what we observe in the data.  The
theoretical development in [1] and the implementation in ks.boot instead
define the p-value to be the probability that, when an m-set and an n-set are
drawn independently from the combined empirical distribution, the KS
statistic exceeds what we see in the data.  These are not the same, and when
there are a lot of ties the estimates diverge.  Apologies for being a little
dense on this.
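
To make the two definitions concrete, here is a minimal Java sketch of both
Monte Carlo estimates.  It is not the code from the commit above and not
ks.boot itself; the class and method names (KsTiesSketch, permutationP,
bootstrapP) are made up for illustration, and the use of ">=" with a small
tolerance in both estimates is an assumption.  permutationP shuffles the
combined data and cuts it into an n-set and an m-set (my original reading);
bootstrapP draws both sets independently, with replacement, from the combined
data (the reading used by [1] and ks.boot as described above).

```java
import java.util.Arrays;
import java.util.Random;

class KsTiesSketch {

    /** Tolerance for comparing D values (an assumption of this sketch). */
    static final double TOL = 1e-12;

    /** Two-sample KS statistic; ECDFs are compared only after all copies of a tied value are consumed. */
    static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0;
        while (i < sx.length && j < sy.length) {
            double t = Math.min(sx[i], sy[j]);
            while (i < sx.length && sx[i] == t) i++;   // step past every x equal to t
            while (j < sy.length && sy[j] == t) j++;   // step past every y equal to t
            d = Math.max(d, Math.abs((double) i / sx.length - (double) j / sy.length));
        }
        return d;
    }

    /** Split-based estimate: each round partitions the combined data into an n-set and an m-set. */
    static double permutationP(double[] x, double[] y, int iterations, Random rng) {
        double observed = ksStatistic(x, y);
        double[] pool = combine(x, y);
        int hits = 0;
        for (int k = 0; k < iterations; k++) {
            // Fisher-Yates shuffle: every value is used exactly once per split.
            for (int i = pool.length - 1; i > 0; i--) {
                int j = rng.nextInt(i + 1);
                double tmp = pool[i];
                pool[i] = pool[j];
                pool[j] = tmp;
            }
            double[] a = Arrays.copyOfRange(pool, 0, x.length);
            double[] b = Arrays.copyOfRange(pool, x.length, pool.length);
            if (ksStatistic(a, b) >= observed - TOL) hits++;
        }
        return (double) hits / iterations;
    }

    /** Bootstrap estimate: each round draws both sets with replacement from the combined data. */
    static double bootstrapP(double[] x, double[] y, int iterations, Random rng) {
        double observed = ksStatistic(x, y);
        double[] pool = combine(x, y);
        int hits = 0;
        for (int k = 0; k < iterations; k++) {
            double[] a = draw(pool, x.length, rng);
            double[] b = draw(pool, y.length, rng);
            if (ksStatistic(a, b) >= observed - TOL) hits++;
        }
        return (double) hits / iterations;
    }

    /** Draws size values from pool with replacement, so a value can repeat more often than in the data. */
    static double[] draw(double[] pool, int size, Random rng) {
        double[] out = new double[size];
        for (int i = 0; i < size; i++) out[i] = pool[rng.nextInt(pool.length)];
        return out;
    }

    /** Concatenates x and y into one array. */
    static double[] combine(double[] x, double[] y) {
        double[] pool = Arrays.copyOf(x, x.length + y.length);
        System.arraycopy(y, 0, pool, x.length, y.length);
        return pool;
    }

    public static void main(String[] args) {
        // Made-up data with many ties, only to exercise both estimates.
        double[] x = {1, 1, 2, 2, 3, 3};
        double[] y = {2, 2, 3, 3, 4, 4, 4};
        System.out.println("D           = " + ksStatistic(x, y));
        System.out.println("permutation = " + permutationP(x, y, 10000, new Random(42)));
        System.out.println("bootstrap   = " + bootstrapP(x, y, 10000, new Random(42)));
    }
}
```

The only difference between the two methods is how the resampled sets are
built: a shuffled split can never contain a value more often than it appears
in the combined data, while a draw with replacement can.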



> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1246
>                 URL: https://issues.apache.org/jira/browse/MATH-1246
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the 
> distribution of a D-statistic for m-n sets with no ties.  No warning is 
> issued and no special handling is applied when ties are present.


