I am working on MATH-437 (turning K-S distribution into a proper K-S test impl) and have to decide how to implement 2-sample tests. Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a K-S distribution, so for large samples just using the cdf we already have is appropriate. For small samples (actually for any size sample), the test statistic distribution is discrete and can be computed exactly. A brute force way to do that is to enumerate all of the n-m partitions of {0, ..., n+m-1} and compute all the possible D_n,m values. R seems to use a more clever way to do this. Does anyone have a reference for an efficient way to compute the exact distribution, or background on where R got their implementation?
Absent a "clever" approach, I see three alternatives and would appreciate some feedback: 0) just use the asymptotic distribution even for small samples 1) fully enumerate all n-m partitions and compute the D_n,m as above 1) use a monte carlo approach - instead of full enumeration of the D_n,m, randomly generate a large number of splits and compute the p-value for observed D_n,m by computing the number of random n-m splits generate a D value less than what is observed. Thanks in advance for any feedback / pointers. Phil [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org