On 8/10/13 10:41 AM, Ajo Fod wrote:
> This paper provides some approximations and situations where you could use
> exact computations:
> http://www.iro.umontreal.ca/~lecuyer/myftp/papers/ksdist.pdf
> ... they even refer to a java program at the end of page 11.

He he.  This is the reference that our current K-S impl is based on
:)  So our impl correctly handles small samples for the 1-sample test.

This paper does provide some references for exact computation of the
distribution of D_n.  Unfortunately, what I need is the distribution of
D_n,m for the two-sample test, and while (suitably scaled) D_n,m is
asymptotically K-S distributed, I don't think I can adapt the exact
algorithms for D_n to compute the exact distribution of D_n,m.  I could
be wrong here - would love to be shown how to do it.
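
For large n, m the idea would be to use the limiting distribution directly:
under the null, sqrt(nm/(n+m)) * D_n,m converges to the Kolmogorov
distribution, so the p-value can be approximated from the standard series
for its cdf.  Roughly, as a method sketch (the name is just for
illustration; the series converges slowly for very small arguments, where
the p-value is essentially 1 anyway):

    // Asymptotic p-value for the two-sample statistic D_{n,m}.
    // Under H0, sqrt(nm/(n+m)) * D_{n,m} -> Kolmogorov distribution K, so
    // p ~ 1 - K(t) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * t^2).
    public static double approximateP(double d, int n, int m) {
        double t = Math.sqrt((double) n * m / (n + m)) * d;
        double sum = 0;
        for (int k = 1; k <= 100; k++) {
            double term = Math.exp(-2.0 * k * k * t * t);
            sum += (k % 2 == 1) ? term : -term;
            if (term < 1e-12) {
                break;
            }
        }
        return Math.min(1.0, Math.max(0.0, 2 * sum));
    }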

Phil
>
> Cheers,
> Ajo.
>
>
> On Sat, Aug 10, 2013 at 10:06 AM, Phil Steitz <phil.ste...@gmail.com> wrote:
>
>> On 8/10/13 9:35 AM, Ajo Fod wrote:
>>> In:
>>>
>> http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf
>>> Take a look at 2 sample KS stats and the relationship to the 1 sample ...
>>> page 88. You already have the 1 sample table.
>> The problem is that I don't have the *exact* 1 sample table, or more
>> precisely the means to generate it.  What our current machinery
>> allows us to compute is an asymptotically valid approximation to the
>> distribution of D_n,m.  Theorem 2 on page 87 of the reference above
>> justifies the large-sample approach, but it is an asymptotic
>> result.  Using it is fine for large n, m, but not so good for small
>> samples.  For that we need the exact, discrete distribution of
>> D_n,m.  Like other math stat references I have consulted, the one
>> above states that the discrete distribution has been tabulated for
>> small n,m and those tables are available in most math stat texts.
>> What I need is the algorithm used to generate those tables.
>>
>> Phil
>>> Cheers,
>>> Ajo
>>>
>>>
>>> On Sat, Aug 10, 2013 at 9:16 AM, Phil Steitz <phil.ste...@gmail.com>
>> wrote:
>>>> On 8/10/13 8:59 AM, Ajo Fod wrote:
>>>>> This depends on data size. If it fits in memory, a single pass
>>>>> through the sorted arrays to find the biggest difference would suffice.
>>>>>
>>>>> If the data doesn't fit, you probably need a storeless quantile
>>>>> estimator like QuantileBin1D from the colt libraries. Then pick a
>>>>> resolution and do the single-pass search.
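
(Right - computing the statistic itself is the easy part: one merge pass
over the two sorted samples.  Something like the method sketch below,
assuming both samples fit in memory; the name is just illustrative.)

    // Two-sample K-S statistic D_{n,m}: sort both samples, walk them in
    // merge order and track the gap between the two empirical CDFs.
    // Ties across samples are handled by advancing past all copies of the
    // current value on both sides before comparing.
    public static double dStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        java.util.Arrays.sort(sx);
        java.util.Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        double d = 0;
        while (i < n && j < m) {
            double v = Math.min(sx[i], sy[j]);
            while (i < n && sx[i] == v) {
                i++;
            }
            while (j < m && sy[j] == v) {
                j++;
            }
            double gap = Math.abs((double) i / n - (double) j / m);
            if (gap > d) {
                d = gap;
            }
        }
        return d;
    }
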
>>>> Thanks, Ajo.  I have no problem computing the D_n,m statistics.  My
>>>> problem is in computing the exact p-values for the test.  For that,
>>>> I need to compute the exact distribution of D_n,m.  Brute-forcing
>>>> requires examining each of the n + m choose n possible splits.  R seems
>>>> to use a clever approach, but there is no documentation in the R
>>>> sources on how the algorithm works.  Moreover, my first attempts at
>>>> Monte Carlo simulation don't give results that agree with R.  Most likely, I
>>>> have not set the simulation up correctly.  Any better ideas or
>>>> references on how to compute the exact distribution would be
>>>> appreciated.
>>>>
>>>> Phil
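
For concreteness, the kind of simulation in question looks roughly like
the method sketch below.  It assumes continuous distributions (no ties in
the pooled sample), in which case the null distribution of D_n,m depends
only on which of the n + m ordered pooled positions come from the first
sample - so it is enough to shuffle a 0/1 label vector and recompute D for
each shuffle.  Names are illustrative only.

    // Monte Carlo estimate of P(D_{n,m} >= observedD) under H0.
    public static double monteCarloP(double observedD, int n, int m,
                                     int iterations, java.util.Random rng) {
        int[] labels = new int[n + m];
        for (int i = 0; i < n; i++) {
            labels[i] = 1;              // 1 marks a first-sample position
        }
        int extreme = 0;
        for (int iter = 0; iter < iterations; iter++) {
            // Fisher-Yates shuffle: uniformly random split of the pooled positions
            for (int k = labels.length - 1; k > 0; k--) {
                int swap = rng.nextInt(k + 1);
                int tmp = labels[k];
                labels[k] = labels[swap];
                labels[swap] = tmp;
            }
            // walk the shuffled labels, tracking both empirical CDFs
            double d = 0;
            int cx = 0;
            int cy = 0;
            for (int label : labels) {
                if (label == 1) {
                    cx++;
                } else {
                    cy++;
                }
                double gap = Math.abs((double) cx / n - (double) cy / m);
                if (gap > d) {
                    d = gap;
                }
            }
            if (d >= observedD) {
                extreme++;
            }
        }
        return (double) extreme / iterations;
    }
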
>>>>> Cheers,
>>>>> -Ajo
>>>>>
>>>>>
>>>>> On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <phil.ste...@gmail.com>
>>>> wrote:
>>>>>> I am working on MATH-437 (turning K-S distribution into a proper K-S
>>>>>> test impl) and have to decide how to implement 2-sample tests.
>>>>>> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a
>>>>>> K-S distribution, so for large samples just using the cdf we already
>>>>>> have is appropriate.  For small samples (actually for any size
>>>>>> sample), the test statistic distribution is discrete and can be
>>>>>> computed exactly.  A brute-force way to do that is to enumerate all
>>>>>> splits of {0, ..., n+m-1} into an n-set and an m-set and compute
>>>>>> all the possible D_n,m values.  R seems to use a more clever way to do
>>>>>> this.  Does anyone have a reference for an efficient way to compute
>>>>>> the exact distribution, or background on where R got their
>>>>>> implementation?
>>>>>>
>>>>>> Absent a "clever" approach, I see three alternatives and would
>>>>>> appreciate some feedback:
>>>>>>
>>>>>> 0) just use the asymptotic distribution even for small samples
>>>>>> 1) fully enumerate all n/m splits and compute the D_n,m values as above
>>>>>> 2) use a Monte Carlo approach - instead of full enumeration of the
>>>>>> D_n,m, randomly generate a large number of splits and estimate the
>>>>>> p-value for the observed D_n,m as the proportion of random n/m
>>>>>> splits that generate a D value at least as large as the one observed.
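
For reference, a minimal sketch of what full enumeration in 1) amounts to,
again assuming continuous data (no ties): each split of the pooled order
statistics corresponds to a lattice path from (0, 0) to (n, m), and the
recursion below visits all C(n+m, n) of them once, tracking the largest
CDF gap along each path.  Clearly only feasible for small n and m; the
names are illustrative, not a proposed API.

    // Exact p-value P(D_{n,m} >= observedD) by exhaustive enumeration.
    public static double exactP(double observedD, int n, int m) {
        // counts[0] = paths visited, counts[1] = paths with D >= observedD
        long[] counts = new long[2];
        enumerate(observedD, n, m, 0, 0, 0.0, counts);
        return (double) counts[1] / counts[0];
    }

    // cx, cy = x / y observations placed so far; dMax = largest gap on this path
    private static void enumerate(double observedD, int n, int m,
                                  int cx, int cy, double dMax, long[] counts) {
        if (cx == n && cy == m) {
            counts[0]++;
            if (dMax >= observedD) {
                counts[1]++;
            }
            return;
        }
        if (cx < n) {
            double gap = Math.abs((double) (cx + 1) / n - (double) cy / m);
            enumerate(observedD, n, m, cx + 1, cy, Math.max(dMax, gap), counts);
        }
        if (cy < m) {
            double gap = Math.abs((double) cx / n - (double) (cy + 1) / m);
            enumerate(observedD, n, m, cx, cy + 1, Math.max(dMax, gap), counts);
        }
    }
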
>>>>>>
>>>>>> Thanks in advance for any feedback / pointers.
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
