[
https://issues.apache.org/jira/browse/MATH-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046995#comment-14046995
]
Phil Steitz commented on MATH-1131:
-----------------------------------
A few comments not directly related to the performance issue, but likely
relevant to the OP and anyone using KolmogorovSmirnovTest to evaluate the null
hypothesis that a sample comes from a normal (Gaussian) distribution:
1. The KS test using parameters estimated from the data is in general not the
best test of normality. We do not currently implement the Lilliefors test or
other alternatives. Patches welcome :) (Discuss first on the mailing
list, then open separate tickets for these if interested.)
2. *No* classical frequentist test really works for large samples. KS,
Lilliefors, Shapiro-Wilk et al. are uniformly too powerful to be meaningful for
samples even as small as 5000 observations. See, e.g., [1].
3. An interesting alternative for large samples is [2]. Here again, patches
welcome. A similar approach implementable using Commons Math version 3.x would
be to bin the data in standard deviation units and then apply a G-test with
expected counts computed using quantiles of the normal distribution.
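
As a rough illustration of the binning approach in item 3, here is a minimal
sketch, not a tested implementation. The class name, the method names and the
choice of bin edges at -2, -1, 0, 1, 2 standard deviations are all assumptions
made for the example. To keep it dependency-free it uses a textbook
approximation of the normal CDF and computes the G statistic by hand; with
Commons Math 3.x one would more likely take the bin probabilities from
NormalDistribution.cumulativeProbability and pass the counts to GTest.

```java
import java.util.Arrays;

/** Sketch only: bin standardized data and compare with normal expectations. */
final class NormalityGTestSketch {

    // Bin edges in standard deviation units (illustrative choice, an assumption).
    static final double[] EDGES = {-2.0, -1.0, 0.0, 1.0, 2.0};

    /** Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation. */
    static double phi(double z) {
        double x = Math.abs(z) / Math.sqrt(2.0);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        double erf = 1.0 - poly * Math.exp(-x * x);
        return z >= 0 ? 0.5 * (1.0 + erf) : 0.5 * (1.0 - erf);
    }

    /** Counts per bin after standardizing with the sample mean and std. dev. */
    static long[] observedCounts(double[] data) {
        double mean = Arrays.stream(data).average().orElse(Double.NaN);
        double ss = Arrays.stream(data).map(v -> (v - mean) * (v - mean)).sum();
        double sd = Math.sqrt(ss / (data.length - 1));
        long[] counts = new long[EDGES.length + 1];
        for (double v : data) {
            double z = (v - mean) / sd;
            int bin = 0;
            // Advance past every edge that z has reached or exceeded.
            while (bin < EDGES.length && z >= EDGES[bin]) {
                bin++;
            }
            counts[bin]++;
        }
        return counts;
    }

    /** Expected counts: n times the normal probability mass of each bin. */
    static double[] expectedCounts(int n) {
        double[] expected = new double[EDGES.length + 1];
        double lower = 0.0;
        for (int i = 0; i <= EDGES.length; i++) {
            double upper = i < EDGES.length ? phi(EDGES[i]) : 1.0;
            expected[i] = n * (upper - lower);
            lower = upper;
        }
        return expected;
    }

    /** G = 2 * sum(O_i * ln(O_i / E_i)); empty bins contribute zero. */
    static double gStatistic(double[] expected, long[] observed) {
        double g = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] > 0) {
                g += observed[i] * Math.log(observed[i] / expected[i]);
            }
        }
        return 2.0 * g;
    }
}
```

A caller would compute observedCounts(dataset) and
expectedCounts(dataset.length) and compare the resulting G statistic with a
chi-square distribution. In Commons Math 3.x, GTest.gTest(expected, observed)
returns a p-value using (number of bins - 1) degrees of freedom; since the mean
and standard deviation are estimated from the data here, that p-value should be
treated as approximate.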
[1]
http://www.statisticalmisses.nl/index.php/frequently-asked-questions/77-what-is-wrong-with-tests-of-normality
[2]
https://ideals.illinois.edu/bitstream/handle/2142/29878/largesamplenorma93171bera.pdf
> Kolmogorov-Smirnov Tests takes 'forever' on 10,000 item dataset
> ---------------------------------------------------------------
>
> Key: MATH-1131
> URL: https://issues.apache.org/jira/browse/MATH-1131
> Project: Commons Math
> Issue Type: Bug
> Affects Versions: 3.3
> Environment: Java 8
> Reporter: Schalk W. Cronjé
> Attachments: 1.txt, MATH-1131.patch, ReproduceKsIssue.groovy,
> ReproduceKsIssue.java
>
>
> I have code simplified to the following:
> KolmogorovSmirnovTest kst = new KolmogorovSmirnovTest();
> NormalDistribution nd = new NormalDistribution(mean,stddev);
> kst.kolmogorovSmirnovTest(nd, dataset);
> I find that for my dataset of 10,000 items, the call to kolmogorovSmirnovTest
> takes 'forever'. It has not returned after nearly 15 minutes and in one of my
> tests has gone over 150MB in memory usage.