[ 
https://issues.apache.org/jira/browse/STATISTICS-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Herbert resolved STATISTICS-65.
------------------------------------
    Resolution: Implemented

The updated KS test has been implemented in the port of the inference module 
(see STATISTICS-62)

This provides a result for the two-sample test that contains:
 * The D statistic
 * P-value for the D statistic
 * Flag indicating if there were ties in the data that could result in a 
different D statistic depending on the ordering of tied data
 * The upper D statistic (most extreme D possible from all possible resolutions 
of ties)
 * The p-value for the upper D statistic

This is the default behaviour. Computation of the upper bound on D is a 
negligible cost. Computation of the upper p-value does add a performance cost. 
This could be refactored to compute dynamically if required subject to 
performance testing (as computation of D may be a significant part of the 
entire cost of the function; currently the relative cost of computing the 
p-value is unknown).

An option is provided in the test to compute the p-value using an ESTIMATE 
method. Currently this only support the sampling method: N iterations are used 
to sample from the joint distribution of the two samples, each sample has the D 
value computed. The p-value is the fraction of D that was more extreme than the 
observed test statistic. Significant digits in the p-value are thus upper 
bounded by log10(iterations).

In this case the p-value for the upper D statistic is not computed. To compute 
it using the sampling method (from the same samples) would require modification 
of the short-circuit condition used on the D statistic and degrade performance. 
Note that comparison of the p-values of D and upper D are less relevant for 
estimates than for the exact p computation as the results are only one possible 
result from random sampling.

An option to sample/enumerate paths through the tied regions has not been 
added. This would provide a sample for, or the entire set of, all possible D 
and could thus provide an average p-value from the distribution of D. This is 
equivalent to performing a KS test on the data multiple times and resolving 
ties differently each time. The resulting p-value may result in either a type I 
(false positive) or type II (false negative) error. The current functionality 
to provide p-values for the extreme range of D provides a bound on the p-value 
to minimise type I or type II errors. It would be a lot more computation to 
create an average p-value with a questionable benefit for hypothesis testing. 
If the D statistic and upper D have very different p-values then the data has a 
large number of ties and the KS test is not suitable for the data. Thus this 
functionality has been omitted and can be added later if required.

 

> Reimplement the Kolmogorov-Smirnov Test
> ---------------------------------------
>
>                 Key: STATISTICS-65
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-65
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: inference
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Major
>             Fix For: 1.1
>
>
> The inference module contains the Kolmogorov-Smirnov (KS) test. The 
> two-sample test computes the D statistic as the supremum (effectively the 
> maximum) difference between the empirical CDF of two samples:
>  
> {noformat}
> D  = sup | F(x) - G(x) |  = max(D+, D-)
> D+ = sup   F(x) - G(x)
> D- = sup   G(x) - F(x){noformat}
> Two-sided tests use D, one sided use D+ (greater than) or D- (less than).
>  
> The implementation in CM contains computation of the D value and p-values for 
> D. Computation of D involves sorting the combined dataset of the two-samples 
> (x_n and y_m) and moving D down 1/n if x is observed, otherwise moving up by 
> 1/m if y is observed. The resulting D is the maximum deviation from 0, with 
> the result in [0, 1]. (D+ is above 0; D- is below 0). This procedure does not 
> work in the presence of ties as you do not know whether to adjust up or down, 
> so the entire tied region is skipped and the cumulative effect on D is used.
> The computation of the p-value is adjusted based on the presence of ties in 
> the data. Currently if ties are present the exact p-value is noted to be 
> invalid. The method then randomly adjusts the input values to eliminate ties 
> so that a D value can be created and the exact p-value computed for the 
> random sample D of all possible D. This behaviour occurs without a user 
> choice. Adjustment of data uses small steps (a few ULP) in order to minimise 
> any change to the ranking of x and y. But depending on the data a change to 
> the ranking can occur, and it can occur outside the tied region as all data 
> is subject to random jitter, not just the tied values. Effectively the method 
> is randomly adjusting the input data and hoping that it will not change 
> non-tied regions just to generate a random path through the tied region and a 
> single sample for D.
> Note the following observations:
>  * Ties do not make the p-value computation incorrect; they make the D value 
> computation a single sample of possible D values. The p-value computation is 
> still exact but it is using an input D with a value that is potentially lower 
> than it could be, depending on how ties are resolved.
>  * Ties within the same sample will not change the result
>  * Ties between the two samples *may* change the result
>  * The distribution of D is discrete
>  * The distribution of D in the absence of ties has PMF(D=0) = 0. If there 
> are no ties you either go up or down. This will move away from 0 at least 1 
> step.
> What is missing from the implementation is a detection of whether the ties 
> matter. Any tied region can be traversed a number of different ways. If none 
> of these possible combinations change the D value then the tied region is 
> located where it does not matter, e.g.
> {noformat}
> [             4, 5, 6, 7, 8]
> [ 0, 1, 2, 3,       6      ]{noformat}
> This square data n=m takes 4x, 2y before the single tie then 2y. If the tie 
> resolves as (1x, 1y) or (1y, 1x) then the maximum D is unchanged.
> A region of a and b ties for x and y will have binom(a+b, b) possible 
> combinations. However the largest change to the D value is the two most 
> extreme paths (ax, by) or (by, ax). Since the entire tied region is traversed 
> in a single step this is trivial to compute as the number of steps in x and y 
> is known. The implementation must then only track the D value when tied 
> regions are skipped (current behaviour), or the most extreme D value through 
> any tied regions. If the resulting max D is the same then none of the tied 
> regions has an effect on D then the D value is unique among all possible D 
> and the p-value is exact. If it is possible to traverse the tied regions and 
> create a more extreme D then the p-value is not exact.
> I suggest updating the implementation to:
>  * Compute D
>  * Compute the most extreme D value generated through tied regions
>  * Return a result that indicates if the p-value is an estimate. Optionally 
> the most extreme D statistic can be returned.
> Note that the most extreme D value is informative if for example it is much 
> larger than D. This will occur when the tied regions are long. The most 
> extreme case is input of matching data where D=0 but the extreme D value is 1.
> In the case where the tied regions can change the D value there are at least 
> two options:
>  # Use the bootstrap sampling method in the current CM implementation. This 
> generates n random samples (sx_n and sy_m) from the combined distribution of 
> x and y and D values for each. The p-value is the number of times the sample 
> D value for sx_n and sy_m was larger than the D for x and y. This only 
> applies if the combined sample is representative of the shared distribution 
> so is for large sample sizes.
>  # Walk n random paths through tied regions to create a distribution for D. 
> Generate the average p value for all D in the sample. Note that if the number 
> of the paths though the tied regions is small it is possible to enumerate 
> them. The combinations of binom(a+b, b) can be efficiently enumerated using 
> the Combinations class from the Numbers combinatorics module.
> The option to generate an estimate for P using a sampled distribution for D 
> should not be performed in the main KS test routine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to