[ 
https://issues.apache.org/jira/browse/MATH-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046995#comment-14046995
 ] 

Phil Steitz commented on MATH-1131:
-----------------------------------

A few comments not directly related to the performance issue, but likely 
relevant to the OP and anyone using KolmogorovSmirnovTest to evaluate the null 
hypothesis that a sample comes from a normal (Gaussian) distribution:

1. The KS test using parameters estimated from the data is in general not the 
best test to use to test normality.  We do not currently implement the 
Lilliefors test or other normality tests.  Patches welcome :)  (Discuss first 
on the mailing list, then open separate tickets for these if interested.)
2.  *No* classical frequentist test really works for large samples.  KS, 
Lilliefors, Shapiro-Wilk et al. are uniformly too powerful to be meaningful 
for samples even as small as 5000 observations; that is, they reject normality 
for departures too small to matter in practice.  See, e.g., [1].
3.  An interesting alternative for large samples is [2].   Here again, patches 
welcome.  A similar approach implementable using Commons Math version 3.x would 
be to bin the data in standard deviation units and then apply a G-test with 
expected counts computed using quantiles of the normal distribution.
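
The binning idea in item 3 can be sketched roughly as follows.  This is a 
minimal, library-free illustration, not a Commons Math API: the class name, the 
six bins cut at -2, -1, 0, 1 and 2 standard deviations, and the hard-coded bin 
probabilities (rounded standard normal values) are all assumptions made for the 
example.  With Commons Math 3.x on the classpath, 
NormalDistribution.cumulativeProbability could supply the bin probabilities and 
GTest.gTest(expected, observed) a p-value.  Note that estimating the mean and 
standard deviation from the same data lowers the appropriate degrees of 
freedom, which a plain goodness-of-fit p-value would not account for.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: bin data in standard deviation units and compute a G statistic.
class NormalityGTestSketch {

    // Assumed bin edges in standard deviation units:
    // (-inf,-2], (-2,-1], (-1,0], (0,1], (1,2], (2,+inf)
    static final double[] EDGES = {-2.0, -1.0, 0.0, 1.0, 2.0};

    // Standard normal probability of each bin (rounded tabulated values, summing to 1).
    static final double[] BIN_PROBS = {0.02275, 0.13591, 0.34134, 0.34134, 0.13591, 0.02275};

    /** Counts observations per bin after standardizing with the sample mean and std. dev. */
    static long[] binCounts(double[] data) {
        double mean = Arrays.stream(data).average().orElse(Double.NaN);
        double sumSq = 0.0;
        for (double x : data) {
            sumSq += (x - mean) * (x - mean);
        }
        double sd = Math.sqrt(sumSq / (data.length - 1));

        long[] counts = new long[BIN_PROBS.length];
        for (double x : data) {
            double z = (x - mean) / sd;
            int bin = 0;
            // Advance until z no longer exceeds the upper edge of the current bin.
            while (bin < EDGES.length && z > EDGES[bin]) {
                bin++;
            }
            counts[bin]++;
        }
        return counts;
    }

    /** G statistic: 2 * sum(O_i * ln(O_i / E_i)); empty bins contribute zero. */
    static double gStatistic(long[] observed, double[] probs) {
        long n = Arrays.stream(observed).sum();
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] > 0) {
                double expected = n * probs[i];
                sum += observed[i] * Math.log(observed[i] / expected);
            }
        }
        return 2.0 * sum;
    }

    public static void main(String[] args) {
        // Synthetic Gaussian data, used only to exercise the sketch.
        Random rng = new Random(42L);
        double[] data = new double[10000];
        for (int i = 0; i < data.length; i++) {
            data[i] = 5.0 + 2.0 * rng.nextGaussian();
        }
        long[] counts = binCounts(data);
        System.out.println("Bin counts:  " + Arrays.toString(counts));
        System.out.println("G statistic: " + gStatistic(counts, BIN_PROBS));
    }
}
```

The resulting statistic would then be compared against a chi-square 
distribution; finer bins give more resolution in the tails at the cost of 
smaller expected counts per bin.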

[1] 
http://www.statisticalmisses.nl/index.php/frequently-asked-questions/77-what-is-wrong-with-tests-of-normality
[2] 
https://ideals.illinois.edu/bitstream/handle/2142/29878/largesamplenorma93171bera.pdf

> Kolmogorov-Smirnov Tests takes 'forever' on 10,000 item dataset
> ---------------------------------------------------------------
>
>                 Key: MATH-1131
>                 URL: https://issues.apache.org/jira/browse/MATH-1131
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.3
>         Environment: Java 8
>            Reporter: Schalk W. Cronjé
>         Attachments: 1.txt, MATH-1131.patch, ReproduceKsIssue.groovy, 
> ReproduceKsIssue.java
>
>
> I have code simplified to the following:
>     KolmogorovSmirnovTest kst = new KolmogorovSmirnovTest();
>     NormalDistribution nd = new NormalDistribution(mean,stddev);
>     kst.kolmogorovSmirnovTest(nd,dataset);
> I find that for my dataset of 10,000 items, the call to kolmogorovSmirnovTest 
> takes 'forever'. It has not returned after nearly 15 minutes, and in one of my 
> tests memory usage has gone over 150MB. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
