Alex Herbert created STATISTICS-65:
--------------------------------------

             Summary: Reimplement the Kolmogorov-Smirnov Test
                 Key: STATISTICS-65
                 URL: https://issues.apache.org/jira/browse/STATISTICS-65
             Project: Commons Statistics
          Issue Type: New Feature
          Components: inference
            Reporter: Alex Herbert
            Assignee: Alex Herbert
             Fix For: 1.1


The inference module contains the Kolmogorov-Smirnov (KS) test. The two-sample 
test computes the D statistic as the supremum (effectively the maximum) 
difference between the empirical CDFs of the two samples:

 
{noformat}
D  = sup | F(x) - G(x) |  = max(D+, D-)
D+ = sup   F(x) - G(x)
D- = sup   G(x) - F(x){noformat}
Two-sided tests use D; one-sided tests use D+ (greater than) or D- (less than).

 

The implementation in CM contains computation of the D value and p-values for 
D. Computing D involves sorting the combined dataset of the two samples (x_n 
and y_m) and walking it, stepping the difference F(x) - G(x) up by 1/n when an 
x is observed, or down by 1/m when a y is observed. The resulting D is the 
maximum deviation from 0 and lies in [0, 1] (D+ is the deviation above 0; D- is 
the deviation below 0). This procedure does not work in the presence of ties, 
as it is not known whether to step up or down, so the entire tied region is 
skipped and only its cumulative effect on D is used.
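
A minimal sketch of this walk is below (illustrative names only, not the CM or 
Commons Statistics API). It assumes distinct values, so the difference steps by 
exactly one increment per observation; with ties the whole tied region must 
instead be consumed in a single step, as described above.
{code:java}
import java.util.Arrays;

public class KSTwoSampleSketch {
    /** Two-sample D statistic for samples with no tied values across x and y. */
    static double statisticD(double[] x, double[] y) {
        final double[] sx = x.clone();
        final double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        final int n = sx.length;
        final int m = sy.length;
        int i = 0;
        int j = 0;
        double d = 0;     // current F(x) - G(x)
        double dMax = 0;  // sup |F - G| = max(D+, D-)
        while (i < n && j < m) {
            if (sx[i] < sy[j]) {
                d += 1.0 / n;  // an x is observed: F steps up by 1/n
                i++;
            } else {
                d -= 1.0 / m;  // a y is observed: G steps up by 1/m
                j++;
            }
            dMax = Math.max(dMax, Math.abs(d));
        }
        // Once one sample is exhausted the difference only returns towards 0.
        return dMax;
    }
}
{code}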

The computation of the p-value is adjusted based on the presence of ties in the 
data. Currently, if ties are present, the exact p-value is noted to be invalid. 
The method then randomly adjusts the input values to eliminate ties so that a D 
value can be created; the exact p-value is then computed for what is a single 
random sample from all possible D values. This behaviour occurs without a user 
choice. The adjustment uses small steps (a few ULP) in order to minimise any 
change to the ranking of x and y. But depending on the data a change to the 
ranking can occur, and it can occur outside the tied region because all of the 
data is subject to the random jitter, not just the tied values. Effectively the 
method randomly adjusts the input data, hoping not to change the non-tied 
regions, just to generate a random path through the tied region and a single 
sample for D.

Note the following observations:
 * Ties do not make the p-value computation incorrect; they make the D value 
computation a single sample of the possible D values. The p-value computation 
is still exact, but it uses an input D that is potentially lower than it could 
be, depending on how the ties are resolved.
 * Ties within the same sample will not change the result.
 * Ties between the two samples *may* change the result.
 * The distribution of D is discrete.
 * The distribution of D in the absence of ties has PMF(D=0) = 0: with no ties 
each observation steps either up or down, so the walk moves away from 0 by at 
least one step (see the note after this list).
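
As a short worked note on the last two points: after i values of x and j values 
of y have been observed the difference is i/n - j/m, so every attainable D has 
the form
{noformat}
D = | i/n - j/m | = | i*m - j*n | / (n*m){noformat}
which is a multiple of 1/lcm(n, m). Without ties the very first observation 
already moves |F - G| to 1/n or 1/m, so D >= min(1/n, 1/m) and PMF(D=0) = 0.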

What is missing from the implementation is a detection of whether the ties 
matter. Any tied region can be traversed in a number of different ways. If none 
of the possible orderings changes the D value then the tied region is located 
where it does not matter, e.g.
{noformat}
y: [             4, 5, 6, 7, 8]
x: [ 0, 1, 2, 3,       6      ]{noformat}
This square data (n = m) takes 4 x-steps and 2 y-steps before the single tie, 
then 2 more y-steps. Whether the tie resolves as (1x, 1y) or (1y, 1x), the 
maximum D is unchanged.

A tied region with a tied values from x and b tied values from y has 
binom(a+b, b) possible orderings. However, the largest change to the D value 
comes from the two most extreme paths: (a x-steps then b y-steps) or (b y-steps 
then a x-steps). Since the entire tied region is traversed in a single step, 
these are trivial to compute as the number of steps in x and y is known. The 
implementation then only needs to track the D value when tied regions are 
skipped (the current behaviour) together with the most extreme D value 
obtainable through any tied region. If the resulting maximum D is the same, 
then none of the tied regions has an effect on D, the D value is unique among 
all possible D, and the p-value is exact. If it is possible to traverse a tied 
region and create a more extreme D, then the p-value is not exact.
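
A sketch of this tracking, extending the walk above (illustrative names, not a 
proposed API): each tied region of a x values and b y values is consumed in one 
step for the skipped D, while the extreme D additionally considers the peak of 
the (a x, b y) path at d + a/n and the trough of the (b y, a x) path at d - b/m.
{code:java}
import java.util.Arrays;

public class KSTiedExtremeSketch {
    /**
     * Returns {D with tied regions skipped, most extreme D through any
     * ordering of the tied regions}. The p-value for D is exact only when
     * the two values are equal.
     */
    static double[] statisticDWithExtreme(double[] x, double[] y) {
        final double[] sx = x.clone();
        final double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        final int n = sx.length;
        final int m = sy.length;
        int i = 0;
        int j = 0;
        double d = 0;            // current F(x) - G(x)
        double dMax = 0;         // tied regions skipped (current behaviour)
        double dMaxExtreme = 0;  // most extreme value through any tie ordering
        while (i < n && j < m) {
            final double v = Math.min(sx[i], sy[j]);
            int a = 0;
            int b = 0;
            while (i < n && sx[i] == v) { a++; i++; }
            while (j < m && sy[j] == v) { b++; j++; }
            if (a != 0 && b != 0) {
                // Cross-sample tie: the two extreme paths through the region
                // reach d + a/n (all x first) and d - b/m (all y first).
                dMaxExtreme = Math.max(dMaxExtreme,
                    Math.max(Math.abs(d + (double) a / n),
                             Math.abs(d - (double) b / m)));
            }
            d += (double) a / n - (double) b / m;
            dMax = Math.max(dMax, Math.abs(d));
            dMaxExtreme = Math.max(dMaxExtreme, Math.abs(d));
        }
        return new double[] {dMax, dMaxExtreme};
    }
}
{code}
For identical x and y this yields {0, 1}, the extreme case noted below.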

I suggest updating the implementation to:
 * Compute D
 * Compute the most extreme D value obtainable through tied regions
 * Return a result that indicates whether the p-value is an estimate. 
Optionally the most extreme D statistic can be returned (a possible shape for 
the result is sketched after this list).
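
A possible shape for that result (hypothetical names, not an agreed API), 
sketched only to illustrate the fields involved:
{code:java}
public final class KSTestResult {
    private final double statistic;        // D with tied regions skipped
    private final double extremeStatistic; // most extreme D through ties
    private final double pValue;
    private final boolean exactPValue;     // false if ties could change D

    KSTestResult(double statistic, double extremeStatistic,
                 double pValue, boolean exactPValue) {
        this.statistic = statistic;
        this.extremeStatistic = extremeStatistic;
        this.pValue = pValue;
        this.exactPValue = exactPValue;
    }

    public double getStatistic()        { return statistic; }
    public double getExtremeStatistic() { return extremeStatistic; }
    public double getPValue()           { return pValue; }
    public boolean isExactPValue()      { return exactPValue; }
}
{code}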

Note that the most extreme D value is informative if, for example, it is much 
larger than D. This occurs when the tied regions are long. The most extreme 
case is input of matching data, where D = 0 but the extreme D value is 1.

In the case where the tied regions can change the D value there are at least 
two options:
 # Use the bootstrap sampling method in the current CM implementation. This 
generates N random samples (sx_n and sy_m) from the combined distribution of x 
and y and a D value for each. The p-value is the proportion of samples whose D 
value for sx_n and sy_m is at least as large as the D for x and y. This only 
applies if the combined sample is representative of the shared distribution, so 
it is suited to large sample sizes.
 # Walk N random paths through the tied regions to create a distribution for D. 
Generate the average p-value over all D in the sample. Note that if the number 
of paths through the tied regions is small it is possible to enumerate them. 
The binom(a+b, b) orderings can be efficiently enumerated using the 
Combinations class from the Numbers combinatorics module (see the sketch after 
this list).
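
A sketch of the enumeration step in option 2 for a single tied region, assuming 
the Combinations class from commons-numbers-combinatorics with a 
Combinations.of(n, k) factory that iterates each combination as an int[] of 
indices (method and class names here are illustrative):
{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.numbers.combinatorics.Combinations;

public class TiedRegionEnumerationSketch {
    /**
     * Enumerates every ordering of a tied region containing a x values and
     * b y values, returning the most extreme |F - G| reached on each path.
     * d0 is the difference on entry to the region; n, m are the sample sizes.
     */
    static List<Double> tiedRegionPathMaxima(double d0, int a, int b, int n, int m) {
        final int steps = a + b;
        final List<Double> maxima = new ArrayList<>();
        // Choose the positions of the a x-steps: binom(a+b, a) = binom(a+b, b) paths.
        for (final int[] xPositions : Combinations.of(steps, a)) {
            final boolean[] isX = new boolean[steps];
            for (final int p : xPositions) {
                isX[p] = true;
            }
            double d = d0;
            double max = Math.abs(d0);
            for (int p = 0; p < steps; p++) {
                d += isX[p] ? 1.0 / n : -1.0 / m;
                max = Math.max(max, Math.abs(d));
            }
            maxima.add(max);
        }
        return maxima;
    }
}
{code}
With several tied regions the per-region path counts multiply, so full 
enumeration is only practical when that product is small; otherwise the 
random-path sampling of option 2 applies.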

The option to generate an estimate of the p-value using a sampled distribution 
for D should not be performed in the main KS test routine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
