On 8/10/13 10:41 AM, Ajo Fod wrote:
> This paper provides some approximations and situations where you could use
> exact computations:
> http://www.iro.umontreal.ca/~lecuyer/myftp/papers/ksdist.pdf
> ... they even refer to a java program at the end of page 11.
He he. This is the reference that our current K-S impl is based on :) So
our impl correctly handles small samples for the 1-sample test. This
paper does provide some references for exact computation of D_n.
Unfortunately, what I need is D_n,m for the two-sample test, and while
asymptotically D_n,m is K-S distributed, I don't think I can adapt the
exact algorithms for D_n to compute an exact D_n,m. I could be wrong here
- would love to be shown how to do it. For concreteness, rough sketches of
the single-pass D_n,m computation and of the enumeration and Monte Carlo
p-value approaches are appended below the quoted thread.

Phil

>
> Cheers,
> Ajo.
>
>
> On Sat, Aug 10, 2013 at 10:06 AM, Phil Steitz <phil.ste...@gmail.com> wrote:
>
>> On 8/10/13 9:35 AM, Ajo Fod wrote:
>>> In:
>>> http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf
>>> Take a look at 2 sample KS stats and the relationship to the 1 sample ...
>>> page 88. You already have the 1 sample table.
>> The problem is that I don't have the *exact* 1 sample table, or more
>> precisely the means to generate it. What our current machinery
>> allows us to compute is an asymptotically valid approximation to the
>> distribution of D_n,m. Theorem 2 on page 87 of the reference above
>> justifies the large-sample approach, but it is an asymptotic
>> result. Using it is fine for large n, m, but not so good for small
>> samples. For that we need the exact, discrete distribution of
>> D_n,m. Like other math stat references I have consulted, the one
>> above states that the discrete distribution has been tabulated for
>> small n, m and that those tables are available in most math stat texts.
>> What I need is the algorithm used to generate those tables.
>>
>> Phil
>>
>>> Cheers,
>>> Ajo
>>>
>>>
>>> On Sat, Aug 10, 2013 at 9:16 AM, Phil Steitz <phil.ste...@gmail.com> wrote:
>>>> On 8/10/13 8:59 AM, Ajo Fod wrote:
>>>>> This depends on data size. If it fits in memory, a single pass through
>>>>> the sorted array to find the biggest differences would suffice.
>>>>>
>>>>> If the data doesn't fit, you probably need a StorelessQuantile estimator
>>>>> like QuantileBin1D from the Colt libraries. Then pick a resolution and do
>>>>> the single-pass search.
>>>> Thanks, Ajo. I have no problem computing the D_n,m statistic. My
>>>> problem is in computing the exact p-values for the test. For that,
>>>> I need to compute the exact distribution of D_n,m. Brute-forcing
>>>> requires that you examine all n + m choose n possible splits. R seems
>>>> to use a clever approach, but there is no documentation in the R
>>>> sources on how the algorithm works. Moreover, my first attempts at
>>>> Monte Carlo simulation don't give the same results. Most likely, I
>>>> have not set the simulation up correctly. Any better ideas or
>>>> references on how to compute the exact distribution would be
>>>> appreciated.
>>>>
>>>> Phil
>>>>
>>>>> Cheers,
>>>>> -Ajo
>>>>>
>>>>>
>>>>> On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <phil.ste...@gmail.com> wrote:
>>>>>> I am working on MATH-437 (turning the K-S distribution into a proper K-S
>>>>>> test impl) and have to decide how to implement 2-sample tests.
>>>>>> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a
>>>>>> K-S distribution, so for large samples just using the cdf we already
>>>>>> have is appropriate. For small samples (actually for any size
>>>>>> sample), the test statistic distribution is discrete and can be
>>>>>> computed exactly. A brute-force way to do that is to enumerate all
>>>>>> of the n-m partitions of {0, ..., n+m-1} and compute all the
>>>>>> possible D_n,m values.
>>>>>> R seems to use a more clever way to do this. Does anyone have a
>>>>>> reference for an efficient way to compute the exact distribution,
>>>>>> or background on where R got their implementation?
>>>>>>
>>>>>> Absent a "clever" approach, I see three alternatives and would
>>>>>> appreciate some feedback:
>>>>>>
>>>>>> 0) just use the asymptotic distribution even for small samples
>>>>>> 1) fully enumerate all n-m partitions and compute the D_n,m values as above
>>>>>> 2) use a Monte Carlo approach - instead of full enumeration of the
>>>>>> D_n,m, randomly generate a large number of splits and compute the
>>>>>> p-value for the observed D_n,m from the proportion of random n-m
>>>>>> splits that generate a D value less than what is observed (the
>>>>>> p-value is then one minus that proportion).
>>>>>>
>>>>>> Thanks in advance for any feedback / pointers.
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
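
For reference, here is a rough sketch of the single-pass D_n,m computation
Ajo describes above - just one possible way to walk two sorted samples and
track the largest gap between the empirical CDFs, not the actual Commons
Math implementation. The class and method names (TwoSampleKS, ksStatistic)
are made up for the example, and ties are assumed away (continuous data).

/**
 * Rough sketch only - not the Commons Math implementation.
 * Computes the two-sample K-S statistic D_{n,m} = sup_t |F_n(t) - G_m(t)|
 * with a single merge-style pass over the two sorted samples.
 * Assumes continuous data (no ties between or within samples).
 */
public class TwoSampleKS {

    public static double ksStatistic(double[] x, double[] y) {
        // x and y must be sorted in ascending order
        final int n = x.length;
        final int m = y.length;
        int i = 0;  // number of x values seen so far => F_n = i / n
        int j = 0;  // number of y values seen so far => G_m = j / m
        double d = 0.0;
        while (i < n && j < m) {
            final double xi = x[i];
            final double yj = y[j];
            if (xi <= yj) {
                i++;              // step the x empirical CDF
            }
            if (yj <= xi) {
                j++;              // step the y empirical CDF (both step on a tie)
            }
            final double diff = Math.abs((double) i / n - (double) j / m);
            if (diff > d) {
                d = diff;
            }
        }
        return d;   // once one sample is exhausted the gap can only shrink
    }
}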
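
Along the lines of alternative 1) above, a brute-force sketch of the exact
p-value for tiny samples: under the null hypothesis (and assuming continuous
data, so no ties), every placement of the n "x positions" among the n+m
combined order statistics is equally likely, so the exact distribution of
D_n,m follows from enumerating all n + m choose n placements. This is only
practical while that count stays small; the method name exactPValue is made
up for the example.

/**
 * Rough sketch only - brute-force enumeration of the exact null distribution
 * of D_{n,m} for very small n and m (alternative 1 in the thread).
 * Under H0 with continuous data, each of the C(n+m, n) ways of placing the
 * n x-observations among the n+m combined order statistics is equally likely.
 */
public class ExactTwoSampleKS {

    public static double exactPValue(int n, int m, double dObserved) {
        final int total = n + m;                 // must be small enough to enumerate
        long splits = 0;
        long atLeastObserved = 0;
        for (long mask = 0; mask < (1L << total); mask++) {
            if (Long.bitCount(mask) != n) {
                continue;                        // keep only splits with exactly n x-positions
            }
            splits++;
            int i = 0;                           // x values seen so far
            int j = 0;                           // y values seen so far
            double d = 0.0;
            for (int pos = 0; pos < total; pos++) {
                if ((mask & (1L << pos)) != 0) {
                    i++;
                } else {
                    j++;
                }
                d = Math.max(d, Math.abs((double) i / n - (double) j / m));
            }
            if (d >= dObserved - 1e-12) {        // small tolerance for rounding
                atLeastObserved++;
            }
        }
        // splits == C(n+m, n); p-value = P(D_{n,m} >= observed) under H0
        return (double) atLeastObserved / splits;
    }
}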
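
And a sketch of alternative 2), the Monte Carlo / permutation approach: pool
the two samples, repeatedly re-split the pool at random into groups of size
n and m, and report the proportion of splits whose D is at least the
observed value (equivalently, one minus the proportion that fall below it).
It reuses ksStatistic from the first sketch; the class name is mine, and the
estimate is only as good as the number of iterations.

import java.util.Arrays;
import java.util.Random;

/**
 * Rough sketch only - Monte Carlo estimate of the two-sample K-S p-value
 * (alternative 2 in the thread). Relies on the ksStatistic sketch above.
 */
public class MonteCarloTwoSampleKS {

    public static double monteCarloPValue(double[] x, double[] y, int iterations, Random rng) {
        final int n = x.length;
        final int m = y.length;

        // Observed statistic from the actual samples
        final double[] xs = x.clone();
        final double[] ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        final double dObserved = TwoSampleKS.ksStatistic(xs, ys);

        // Pool the data and repeatedly re-split it at random
        final double[] pooled = new double[n + m];
        System.arraycopy(x, 0, pooled, 0, n);
        System.arraycopy(y, 0, pooled, n, m);

        int atLeastObserved = 0;
        for (int iter = 0; iter < iterations; iter++) {
            shuffle(pooled, rng);                          // Fisher-Yates shuffle
            final double[] xr = Arrays.copyOfRange(pooled, 0, n);
            final double[] yr = Arrays.copyOfRange(pooled, n, n + m);
            Arrays.sort(xr);
            Arrays.sort(yr);
            if (TwoSampleKS.ksStatistic(xr, yr) >= dObserved - 1e-12) {
                atLeastObserved++;
            }
        }
        return (double) atLeastObserved / iterations;      // estimated P(D >= observed)
    }

    private static void shuffle(double[] a, Random rng) {
        for (int k = a.length - 1; k > 0; k--) {
            final int idx = rng.nextInt(k + 1);
            final double tmp = a[k];
            a[k] = a[idx];
            a[idx] = tmp;
        }
    }
}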