[
https://issues.apache.org/jira/browse/STATISTICS-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Herbert resolved STATISTICS-65.
------------------------------------
Resolution: Implemented
The updated KS test has been implemented in the port of the inference module
(see STATISTICS-62)
This provides a result for the two-sample test that contains:
* The D statistic
* P-value for the D statistic
* Flag indicating if there were ties in the data that could result in a
different D statistic depending on the ordering of tied data
* The upper D statistic (most extreme D possible from all possible resolutions
of ties)
* The p-value for the upper D statistic
This is the default behaviour. Computation of the upper bound on D is a
negligible cost. Computation of the upper p-value does add a performance cost.
This could be refactored to compute dynamically if required subject to
performance testing (as computation of D may be a significant part of the
entire cost of the function; currently the relative cost of computing the
p-value is unknown).
An option is provided in the test to compute the p-value using an ESTIMATE
method. Currently this only support the sampling method: N iterations are used
to sample from the joint distribution of the two samples, each sample has the D
value computed. The p-value is the fraction of D that was more extreme than the
observed test statistic. Significant digits in the p-value are thus upper
bounded by log10(iterations).
In this case the p-value for the upper D statistic is not computed. To compute
it using the sampling method (from the same samples) would require modification
of the short-circuit condition used on the D statistic and degrade performance.
Note that comparison of the p-values of D and upper D are less relevant for
estimates than for the exact p computation as the results are only one possible
result from random sampling.
An option to sample/enumerate paths through the tied regions has not been
added. This would provide a sample for, or the entire set of, all possible D
and could thus provide an average p-value from the distribution of D. This is
equivalent to performing a KS test on the data multiple times and resolving
ties differently each time. The resulting p-value may result in either a type I
(false positive) or type II (false negative) error. The current functionality
to provide p-values for the extreme range of D provides a bound on the p-value
to minimise type I or type II errors. It would be a lot more computation to
create an average p-value with a questionable benefit for hypothesis testing.
If the D statistic and upper D have very different p-values then the data has a
large number of ties and the KS test is not suitable for the data. Thus this
functionality has been omitted and can be added later if required.
> Reimplement the Kolmogorov-Smirnov Test
> ---------------------------------------
>
> Key: STATISTICS-65
> URL: https://issues.apache.org/jira/browse/STATISTICS-65
> Project: Commons Statistics
> Issue Type: New Feature
> Components: inference
> Reporter: Alex Herbert
> Assignee: Alex Herbert
> Priority: Major
> Fix For: 1.1
>
>
> The inference module contains the Kolmogorov-Smirnov (KS) test. The
> two-sample test computes the D statistic as the supremum (effectively the
> maximum) difference between the empirical CDF of two samples:
>
> {noformat}
> D = sup | F(x) - G(x) | = max(D+, D-)
> D+ = sup F(x) - G(x)
> D- = sup G(x) - F(x){noformat}
> Two-sided tests use D, one sided use D+ (greater than) or D- (less than).
>
> The implementation in CM contains computation of the D value and p-values for
> D. Computation of D involves sorting the combined dataset of the two-samples
> (x_n and y_m) and moving D down 1/n if x is observed, otherwise moving up by
> 1/m if y is observed. The resulting D is the maximum deviation from 0, with
> the result in [0, 1]. (D+ is above 0; D- is below 0). This procedure does not
> work in the presence of ties as you do not know whether to adjust up or down,
> so the entire tied region is skipped and the cumulative effect on D is used.
> The computation of the p-value is adjusted based on the presence of ties in
> the data. Currently if ties are present the exact p-value is noted to be
> invalid. The method then randomly adjusts the input values to eliminate ties
> so that a D value can be created and the exact p-value computed for the
> random sample D of all possible D. This behaviour occurs without a user
> choice. Adjustment of data uses small steps (a few ULP) in order to minimise
> any change to the ranking of x and y. But depending on the data a change to
> the ranking can occur, and it can occur outside the tied region as all data
> is subject to random jitter, not just the tied values. Effectively the method
> is randomly adjusting the input data and hoping that it will not change
> non-tied regions just to generate a random path through the tied region and a
> single sample for D.
> Note the following observations:
> * Ties do not make the p-value computation incorrect; they make the D value
> computation a single sample of possible D values. The p-value computation is
> still exact but it is using an input D with a value that is potentially lower
> than it could be, depending on how ties are resolved.
> * Ties within the same sample will not change the result
> * Ties between the two samples *may* change the result
> * The distribution of D is discrete
> * The distribution of D in the absence of ties has PMF(D=0) = 0. If there
> are no ties you either go up or down. This will move away from 0 at least 1
> step.
> What is missing from the implementation is a detection of whether the ties
> matter. Any tied region can be traversed a number of different ways. If none
> of these possible combinations change the D value then the tied region is
> located where it does not matter, e.g.
> {noformat}
> [ 4, 5, 6, 7, 8]
> [ 0, 1, 2, 3, 6 ]{noformat}
> This square data n=m takes 4x, 2y before the single tie then 2y. If the tie
> resolves as (1x, 1y) or (1y, 1x) then the maximum D is unchanged.
> A region of a and b ties for x and y will have binom(a+b, b) possible
> combinations. However the largest change to the D value is the two most
> extreme paths (ax, by) or (by, ax). Since the entire tied region is traversed
> in a single step this is trivial to compute as the number of steps in x and y
> is known. The implementation must then only track the D value when tied
> regions are skipped (current behaviour), or the most extreme D value through
> any tied regions. If the resulting max D is the same then none of the tied
> regions has an effect on D then the D value is unique among all possible D
> and the p-value is exact. If it is possible to traverse the tied regions and
> create a more extreme D then the p-value is not exact.
> I suggest updating the implementation to:
> * Compute D
> * Compute the most extreme D value generated through tied regions
> * Return a result that indicates if the p-value is an estimate. Optionally
> the most extreme D statistic can be returned.
> Note that the most extreme D value is informative if for example it is much
> larger than D. This will occur when the tied regions are long. The most
> extreme case is input of matching data where D=0 but the extreme D value is 1.
> In the case where the tied regions can change the D value there are at least
> two options:
> # Use the bootstrap sampling method in the current CM implementation. This
> generates n random samples (sx_n and sy_m) from the combined distribution of
> x and y and D values for each. The p-value is the number of times the sample
> D value for sx_n and sy_m was larger than the D for x and y. This only
> applies if the combined sample is representative of the shared distribution
> so is for large sample sizes.
> # Walk n random paths through tied regions to create a distribution for D.
> Generate the average p value for all D in the sample. Note that if the number
> of the paths though the tied regions is small it is possible to enumerate
> them. The combinations of binom(a+b, b) can be efficiently enumerated using
> the Combinations class from the Numbers combinatorics module.
> The option to generate an estimate for P using a sampled distribution for D
> should not be performed in the main KS test routine.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)