[jira] [Commented] (STATISTICS-65) Reimplement the Kolmogorov-Smirnov Test

Alex Herbert (Jira) Fri, 03 Feb 2023 08:53:06 -0800


    [ 
https://issues.apache.org/jira/browse/STATISTICS-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683960#comment-17683960
 ]


Alex Herbert commented on STATISTICS-65:
----------------------------------------

Note that ties are handled differently by two commonly used implementations.

The R implementation will detect ties as any duplicate data in the combined x 
and y array. In this case the p-value computation is changed to use an 
asymptotic p-value. A warning is printed to the console that the test is cannot 
compute an exact p-value with ties. The statistic is computed by traversing the 
entire tied region(s) in one step. Note that ties within the same data, or ties 
between the data both trigger the warning.

The SciPy implementation ignores ties. The statistic is computed by traversing 
the entire tied region(s) in one step.

These provide two extreme behaviours for ties. The suggested implementation is 
a compromise. The D statistic will be computed the same as SciPy and R. The 
p-value will be computed as done by SciPy. The result will contain a warning 
flag that the P-value is not exact as done by R. This warning should document 
that the computed D value is the minimum of a population of possible D values 
generated by traversing all paths through tied regions.

 

> Reimplement the Kolmogorov-Smirnov Test
> ---------------------------------------
>
>                 Key: STATISTICS-65
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-65
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: inference
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Major
>             Fix For: 1.1
>
>
> The inference module contains the Kolmogorov-Smirnov (KS) test. The 
> two-sample test computes the D statistic as the supremum (effectively the 
> maximum) difference between the empirical CDF of two samples:
>  
> {noformat}
> D  = sup | F(x) - G(x) |  = max(D+, D-)
> D+ = sup   F(x) - G(x)
> D- = sup   G(x) - F(x){noformat}
> Two-sided tests use D, one sided use D+ (greater than) or D- (less than).
>  
> The implementation in CM contains computation of the D value and p-values for 
> D. Computation of D involves sorting the combined dataset of the two-samples 
> (x_n and y_m) and moving D down 1/n if x is observed, otherwise moving up by 
> 1/m if y is observed. The resulting D is the maximum deviation from 0, with 
> the result in [0, 1]. (D+ is above 0; D- is below 0). This procedure does not 
> work in the presence of ties as you do not know whether to adjust up or down, 
> so the entire tied region is skipped and the cumulative effect on D is used.
> The computation of the p-value is adjusted based on the presence of ties in 
> the data. Currently if ties are present the exact p-value is noted to be 
> invalid. The method then randomly adjusts the input values to eliminate ties 
> so that a D value can be created and the exact p-value computed for the 
> random sample D of all possible D. This behaviour occurs without a user 
> choice. Adjustment of data uses small steps (a few ULP) in order to minimise 
> any change to the ranking of x and y. But depending on the data a change to 
> the ranking can occur, and it can occur outside the tied region as all data 
> is subject to random jitter, not just the tied values. Effectively the method 
> is randomly adjusting the input data and hoping that it will not change 
> non-tied regions just to generate a random path through the tied region and a 
> single sample for D.
> Note the following observations:
>  * Ties do not make the p-value computation incorrect; they make the D value 
> computation a single sample of possible D values. The p-value computation is 
> still exact but it is using an input D with a value that is potentially lower 
> than it could be, depending on how ties are resolved.
>  * Ties within the same sample will not change the result
>  * Ties between the two samples *may* change the result
>  * The distribution of D is discrete
>  * The distribution of D in the absence of ties has PMF(D=0) = 0. If there 
> are no ties you either go up or down. This will move away from 0 at least 1 
> step.
> What is missing from the implementation is a detection of whether the ties 
> matter. Any tied region can be traversed a number of different ways. If none 
> of these possible combinations change the D value then the tied region is 
> located where it does not matter, e.g.
> {noformat}
> [             4, 5, 6, 7, 8]
> [ 0, 1, 2, 3,       6      ]{noformat}
> This square data n=m takes 4x, 2y before the single tie then 2y. If the tie 
> resolves as (1x, 1y) or (1y, 1x) then the maximum D is unchanged.
> A region of a and b ties for x and y will have binom(a+b, b) possible 
> combinations. However the largest change to the D value is the two most 
> extreme paths (ax, by) or (by, ax). Since the entire tied region is traversed 
> in a single step this is trivial to compute as the number of steps in x and y 
> is known. The implementation must then only track the D value when tied 
> regions are skipped (current behaviour), or the most extreme D value through 
> any tied regions. If the resulting max D is the same then none of the tied 
> regions has an effect on D then the D value is unique among all possible D 
> and the p-value is exact. If it is possible to traverse the tied regions and 
> create a more extreme D then the p-value is not exact.
> I suggest updating the implementation to:
>  * Compute D
>  * Compute the most extreme D value generated through tied regions
>  * Return a result that indicates if the p-value is an estimate. Optionally 
> the most extreme D statistic can be returned.
> Note that the most extreme D value is informative if for example it is much 
> larger than D. This will occur when the tied regions are long. The most 
> extreme case is input of matching data where D=0 but the extreme D value is 1.
> In the case where the tied regions can change the D value there are at least 
> two options:
>  # Use the bootstrap sampling method in the current CM implementation. This 
> generates n random samples (sx_n and sy_m) from the combined distribution of 
> x and y and D values for each. The p-value is the number of times the sample 
> D value for sx_n and sy_m was larger than the D for x and y. This only 
> applies if the combined sample is representative of the shared distribution 
> so is for large sample sizes.
>  # Walk n random paths through tied regions to create a distribution for D. 
> Generate the average p value for all D in the sample. Note that if the number 
> of the paths though the tied regions is small it is possible to enumerate 
> them. The combinations of binom(a+b, b) can be efficiently enumerated using 
> the Combinations class from the Numbers combinatorics module.
> The option to generate an estimate for P using a sampled distribution for D 
> should not be performed in the main KS test routine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (STATISTICS-65) Reimplement the Kolmogorov-Smirnov Test

Reply via email to