[ 
https://issues.apache.org/jira/browse/MATH-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119332#comment-17119332
 ] 

Steffen Herbold commented on MATH-1535:
---------------------------------------

Yes, that could work. I checked how R handles this data. It seems to work fine, 
apart from the usual warning about ties.
{noformat}
> x = c(0.8767630865438496, 0.9998809418147052, 0.9999999715463531, 
> 0.9999985849345421, 0.973584315883326, 0.9999999875782982, 0.999999999999994, 
> 0.9999999999908233, 1.0, 0.9999999890925574, 0.9999998345734327, 
> 0.9999999350772448, 0.999999999999426, 0.9999147040688201, 
> 0.9999999999999922, 1.0, 1.0, 0.9919050954798272, 0.8649014770687263, 
> 0.9990869497973084, 0.9993222540990464, 0.999999999998189, 
> 0.9999999999999365, 0.9790934801762917, 0.9999578695006303, 
> 0.9999999999999998, 0.999999999996166, 0.9999999999995546, 
> 0.9999999999908036, 0.99999999999744, 0.9999998802655555, 0.9079334221214075, 
> 0.9794398308007372, 0.9999044231134367, 0.9999999999999813, 
> 0.9999957841707683, 0.9277678892094009, 0.999948269893843, 
> 0.9999999886132888, 0.9999998909699096, 0.9999099536620326, 
> 0.9999999962217623, 0.9138936987350447, 0.9999999999779976, 
> 0.999999999998822, 0.999979247207911, 0.9926904388316407, 1.0, 
> 0.9999999999998814, 1.0, 0.9892505696426215, 0.9999996514123723, 
> 0.9999999999999429, 0.9999999995399116, 0.999999999948221, 
> 0.7358264887843119, 0.9999999994098534, 1.0, 0.9999986456748472, 1.0, 
> 0.9999999999921501, 0.9999999999999996, 0.9999999999999944, 
> 0.9473070068606853, 0.9993714060209042, 0.9999999409098718, 
> 0.9999999592791519, 0.9999999999999805)
> y = c(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
> ks.test(x,y)
Two-sample Kolmogorov-Smirnov test
data:  x and y
D = 0.89706, p-value < 2.2e-16
alternative hypothesis: two-sided
Warning message:
In ks.test(x, y) : cannot compute exact p-value with ties
{noformat}
 

I also looked up the source code in R for the KS test.

Here is the R code 
[https://github.com/wch/r-source/blob/5ddd903da06ad1493582dcc754677d1bd992ca1f/src/library/stats/R/ks.test.R]
and the underlying C code 
([https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/stats/src/ks.c],
starting at line 92).

There does not seem to be any tiebreaking. Instead, it looks like the tied data 
is removed before this goes into the C code (line 51 in the R code).
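To illustrate what "no tiebreaking" can mean in practice, here is a minimal sketch that computes the two-sample statistic D by comparing the two empirical distribution functions only at distinct pooled values, so a run of tied observations is consumed in one step. This is only my reading of the idea, not a port of the R or Commons Math code, and the class and method names (TieAwareKs, ksStatistic) are made up for the example.

```java
import java.util.Arrays;

final class TieAwareKs {

    /**
     * Two-sample KS statistic D = max |F_x(t) - F_y(t)|, where the
     * difference is evaluated once per distinct pooled value t, after
     * all observations equal to t have been counted in both samples.
     */
    static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);

        int i = 0; // number of x values <= current t
        int j = 0; // number of y values <= current t
        double d = 0.0;

        // Once one sample is exhausted its ECDF is 1 and the gap can
        // only shrink, so the maximum has already been seen.
        while (i < sx.length && j < sy.length) {
            double t = Math.min(sx[i], sy[j]);
            while (i < sx.length && sx[i] == t) {
                i++;
            }
            while (j < sy.length && sy[j] == t) {
                j++;
            }
            double diff = Math.abs((double) i / sx.length - (double) j / sy.length);
            d = Math.max(d, diff);
        }
        return d;
    }

    public static void main(String[] args) {
        double[] x = {0.5, 0.9, 1.0, 1.0};
        double[] y = {1.0, 1.0, 1.0, 1.0};
        System.out.println(ksStatistic(x, y));
    }
}
```

For the data above, 61 of the 68 values in x lie below 1.0 and every value in y equals 1.0, so this approach gives D = 61/68, which is about 0.89706 and agrees with the R output.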

 

SciPy's stats module seems to use this approach as well 
([https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py], line 6700). 

 

The code below fails at i=16. So the current code can already handle more than 
10 ties. 
{code:java}
double[] sampleX = new double[] {0.8767630865438496, 0.9998809418147052, 
0.9999999715463531, 0.9999985849345421, 0.973584315883326, 0.9999999875782982, 
0.999999999999994, 0.9999999999908233, 1.0, 0.9999999890925574, 
0.9999998345734327, 0.9999999350772448, 0.999999999999426, 0.9999147040688201, 
0.9999999999999922, 1.0, 1.0, 0.9919050954798272, 0.8649014770687263, 
0.9990869497973084, 0.9993222540990464, 0.999999999998189, 0.9999999999999365, 
0.9790934801762917, 0.9999578695006303, 0.9999999999999998, 0.999999999996166, 
0.9999999999995546, 0.9999999999908036, 0.99999999999744, 0.9999998802655555, 
0.9079334221214075, 0.9794398308007372, 0.9999044231134367, 0.9999999999999813, 
0.9999957841707683, 0.9277678892094009, 0.999948269893843, 0.9999999886132888, 
0.9999998909699096, 0.9999099536620326, 0.9999999962217623, 0.9138936987350447, 
0.9999999999779976, 0.999999999998822, 0.999979247207911, 0.9926904388316407, 
1.0, 0.9999999999998814, 1.0, 0.9892505696426215, 0.9999996514123723, 
0.9999999999999429, 0.9999999995399116, 0.999999999948221, 0.7358264887843119, 
0.9999999994098534, 1.0, 0.9999986456748472, 1.0, 0.9999999999921501, 
0.9999999999999996, 0.9999999999999944, 0.9473070068606853, 0.9993714060209042, 
0.9999999409098718, 0.9999999592791519, 0.9999999999999805};
double[] sampleY = new double[] {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0};
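// Assumes java.util.Arrays and
// org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest are imported.
KolmogorovSmirnovTest ksTest = new KolmogorovSmirnovTest();
// Grow both samples one element at a time to find the first failing size.
for (int i = 5; i < sampleX.length; i++) {
    System.out.println(i);
    ksTest.kolmogorovSmirnovTest(Arrays.copyOf(sampleX, i),
                                 Arrays.copyOf(sampleY, i));
}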
{code}
 

I honestly do not know the KS test well enough to understand the difference 
between adding jitter to break ties and removing tied instances from the data. 
I would suggest checking what happens if the tiebreaking is extended beyond 
10 instances. Alternatively, you could just throw an exception and document 
that at most 10 ties can be handled.
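For the "detect it up front" option, a caller-side check could look roughly like the sketch below. It counts how many values in the pooled sample equal their predecessor after sorting, so the caller can decide whether to run the test at all. The helper (TieCheck.countTies) is hypothetical and not part of Commons Math, and the number of ties that Commons Math actually tolerates is not encoded here.

```java
import java.util.Arrays;

final class TieCheck {

    /**
     * Counts tied observations in the pooled sample: after sorting,
     * every value that equals the value before it counts as one tie.
     */
    static int countTies(double[] x, double[] y) {
        double[] pooled = new double[x.length + y.length];
        System.arraycopy(x, 0, pooled, 0, x.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        Arrays.sort(pooled);

        int ties = 0;
        for (int k = 1; k < pooled.length; k++) {
            if (pooled[k] == pooled[k - 1]) {
                ties++;
            }
        }
        return ties;
    }

    public static void main(String[] args) {
        double[] x = {0.5, 1.0};
        double[] y = {1.0, 1.0};
        System.out.println(countTies(x, y));
    }
}
```

A caller could compare the returned count against whatever limit turns out to be safe and fall back to another strategy, or report an error, when it is exceeded.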

 

> MathInternalError in KolmogorovSmirnovTest in case of many ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1535
>                 URL: https://issues.apache.org/jira/browse/MATH-1535
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.6.1
>         Environment: commons-math 3.6.1, oracle jdk11 for windows
>            Reporter: Steffen Herbold
>            Priority: Major
>
> I encounter a math internal error with some very ugly data that has lots of 
> ties. The code below triggers the exception. I could try to build in a 
> detection in my code that identifies this strange case where the generated 
> data has many ties, to avoid this. But I guess the MathInternalError in 
> commons-math should still be avoided. 
>  
> {code:java}
> // works
> double[] sample1 = new double[] {0.8767630865438496, 0.9998809418147052, 
> 0.9999999715463531, 0.9999985849345421};
> double[] sample2 = new double[] {1.0, 1.0, 1.0, 1.0};
> ksTest.kolmogorovSmirnovTest(sample1, sample2);
> // fails with illegal state
> double[] sample3 = new double[] {0.8767630865438496, 0.9998809418147052, 
> 0.9999999715463531, 0.9999985849345421, 0.973584315883326, 
> 0.9999999875782982, 0.999999999999994, 0.9999999999908233, 1.0, 
> 0.9999999890925574, 0.9999998345734327, 0.9999999350772448, 
> 0.999999999999426, 0.9999147040688201, 0.9999999999999922, 1.0, 1.0, 
> 0.9919050954798272, 0.8649014770687263, 0.9990869497973084, 
> 0.9993222540990464, 0.999999999998189, 0.9999999999999365, 
> 0.9790934801762917, 0.9999578695006303, 0.9999999999999998, 
> 0.999999999996166, 0.9999999999995546, 0.9999999999908036, 0.99999999999744, 
> 0.9999998802655555, 0.9079334221214075, 0.9794398308007372, 
> 0.9999044231134367, 0.9999999999999813, 0.9999957841707683, 
> 0.9277678892094009, 0.999948269893843, 0.9999999886132888, 
> 0.9999998909699096, 0.9999099536620326, 0.9999999962217623, 
> 0.9138936987350447, 0.9999999999779976, 0.999999999998822, 0.999979247207911, 
> 0.9926904388316407, 1.0, 0.9999999999998814, 1.0, 0.9892505696426215, 
> 0.9999996514123723, 0.9999999999999429, 0.9999999995399116, 
> 0.999999999948221, 0.7358264887843119, 0.9999999994098534, 1.0, 
> 0.9999986456748472, 1.0, 0.9999999999921501, 0.9999999999999996, 
> 0.9999999999999944, 0.9473070068606853, 0.9993714060209042, 
> 0.9999999409098718, 0.9999999592791519, 0.9999999999999805};
> double[] sample4 = new double[] {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 
> 1.0};
> ksTest.kolmogorovSmirnovTest(sample3, sample4);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
