[jira] [Commented] (STATISTICS-62) Port o.a.c.math.stat.inference to a commons-statistics-inference module

Alex Herbert (Jira) Fri, 03 Feb 2023 07:25:06 -0800


    [ 
https://issues.apache.org/jira/browse/STATISTICS-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683918#comment-17683918
 ]


Alex Herbert commented on STATISTICS-62:
----------------------------------------

I have finished the work on the computations performed by the new inference 
module. This has led to development of the API based on the currently supported 
options.

The first observation is that every test separates into three steps: creating 
a statistic; creating a p-value for the statistic; and computing a boolean 
value to reject the null hypothesis given a significance level. The last step 
is trivially:
{code:java}
return p < alpha;
{code}
It is extreme code bloat to duplicate methods just to pass a significance level 
and perform this boolean expression. Also note that if you require a p-value 
then you also have to have a statistic, so these should be paired in a result 
(statistic, p-value).

I have written each test to have the following generic API where methods have 
compulsory arguments and optional ones. The syntax below is akin to a language 
that supports optional named arguments:
{code:java}
double statistic(x, y, option1=a)
SignificanceResult test(x, y, option1=a, option2=b, option3=c){code}
The test result is:
{code:java}
public interface SignificanceResult {
    double getStatistic();
    double getPValue();
    default boolean reject(double alpha) {
        // validate alpha in (0, 0.5], then
        return getPValue() < alpha;
    }
} {code}
Tests may return more information by extending the SignificanceResult. This is 
actually useful for some tests which have a lot more information, for example 
the OneWayAnova test can return all data typically reported for ANOVA tests 
(degrees of freedom between and within groups, variances between and within 
groups).
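
A minimal sketch of how a test might extend the result with extra information. The interface repeats the one above with the alpha validation written out; OneWayResult, SimpleOneWayResult and the accessor names are hypothetical and only illustrate the idea:

```java
// Sketch only: OneWayResult and its accessor names are hypothetical.
interface SignificanceResult {
    double getStatistic();
    double getPValue();
    default boolean reject(double alpha) {
        // Written so that NaN is rejected as well as out-of-range values.
        if (!(alpha > 0 && alpha <= 0.5)) {
            throw new IllegalArgumentException("alpha not in (0, 0.5]: " + alpha);
        }
        return getPValue() < alpha;
    }
}

/** Adds the degrees of freedom typically reported with an ANOVA F statistic. */
interface OneWayResult extends SignificanceResult {
    int getDegreesOfFreedomBetweenGroups();
    int getDegreesOfFreedomWithinGroups();
}

/** Immutable value holder; the numbers are supplied by the caller. */
final class SimpleOneWayResult implements OneWayResult {
    private final double statistic;
    private final double pValue;
    private final int dfBetween;
    private final int dfWithin;

    SimpleOneWayResult(double statistic, double pValue, int dfBetween, int dfWithin) {
        this.statistic = statistic;
        this.pValue = pValue;
        this.dfBetween = dfBetween;
        this.dfWithin = dfWithin;
    }

    @Override public double getStatistic() { return statistic; }
    @Override public double getPValue() { return pValue; }
    @Override public int getDegreesOfFreedomBetweenGroups() { return dfBetween; }
    @Override public int getDegreesOfFreedomWithinGroups() { return dfWithin; }
}
```

Code written against SignificanceResult keeps working with the extended result, and callers that know the concrete test can read the additional values.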

Note that the statistic method is seemingly redundant as you can call test and 
extract the statistic from the result. However the use case is when you have to 
compare a statistic against a pre-computed critical value (e.g. from a table of 
critical values). Here you do not require the computation effort to generate 
the p-value. An extreme example is that each build of the Commons RNG core 
module performs approximately 17*500 chi-square tests for uniformity per RNG 
implementation (50 currently tested instances), which is at least 425,000 tests 
per build, all using the same critical value. There are other places where a 
critical value is used too, so this is an underestimate.
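
As an illustration of the critical value use case, the sketch below computes a chi-square statistic for uniformity and compares it to a table value without ever computing a p-value. The class name is hypothetical and the uniform expectation is an assumption about what statistic(long[] observed) would compute:

```java
// Sketch only: assumes the expected count is the same in every category.
final class ChiSquareUniformity {
    private ChiSquareUniformity() {}

    /** Chi-square statistic for observed counts against a uniform expectation. */
    static double statistic(long[] observed) {
        long total = 0;
        for (long o : observed) {
            total += o;
        }
        double expected = (double) total / observed.length;
        double sum = 0;
        for (long o : observed) {
            double d = o - expected;
            sum += d * d / expected;
        }
        return sum;
    }

    /** Rejects the null hypothesis when the statistic exceeds the critical value. */
    static boolean reject(long[] observed, double criticalValue) {
        return statistic(observed) > criticalValue;
    }
}
```

With 10 categories there are 9 degrees of freedom, for which the tabulated upper-tail critical value at a significance level of 0.05 is 16.919. The same constant can be reused for every call, e.g. ChiSquareUniformity.reject(counts, 16.919).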

Also note that this removes the ability to compute a p-value given a statistic. 
However this is functionality that belongs in the Statistics distribution 
package. The only distributions not there that are required are the 
distributions for the Kolmogorov-Smirnov, Mann-Whitney U and the Wilcoxon signed 
rank statistic. Since these only require the p-value from the survival function 
the implementations are partial and are missing CDF, PDF and moments to allow 
inclusion in the distribution package. The implementations could be ported 
there if a full implementation is completed. I am not aware of the usefulness 
of these distributions outside of inference testing.
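
To illustrate why this belongs with the distributions: a p-value for an upper-tail test is just the survival function evaluated at the statistic. The sketch below uses the closed form for a chi-square distribution with 2 degrees of freedom, sf(x) = exp(-x/2); the class and member names are hypothetical and a real caller would obtain the survival function from a distribution object:

```java
import java.util.function.DoubleUnaryOperator;

// Sketch only: names are hypothetical.
final class PValueFromStatistic {
    /** Survival function of the chi-square distribution with 2 degrees of freedom. */
    static final DoubleUnaryOperator CHI_SQUARE_2DF =
        x -> x <= 0 ? 1.0 : Math.exp(-x / 2);

    private PValueFromStatistic() {}

    /** Upper-tail p-value: P(X >= statistic) under the null distribution. */
    static double pValue(double statistic, DoubleUnaryOperator survivalFunction) {
        return survivalFunction.applyAsDouble(statistic);
    }
}
```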

Since Java does not support optional arguments there are a few ways to 
implement the API. Options can be strongly typed as immutable objects with 
properties. The example below shows this using a builder pattern for the 
Kolmogorov-Smirnov test. It is based on SciPy's test signature, to which I 
have added the ability to compute the p-value with a strict inequality (an 
option carried over from the CM implementation):
{noformat}
scipy.stats.ks_2samp(data1, data2, alternative='two-sided', method='auto', 
strict=False){noformat}
Java with Options:
{code:java}
public final class KolmogorovSmirnovTest {
    public static class Options {
        public static class Builder {
            public Builder setAlternative(AlternativeHypothesis v);
            public Builder setPValueMethod(PValueMethod v);
            public Builder setStrictInequality(boolean v);
            public Options build();
        }
        public static Options defaults();
        public static Builder builder();
        public Builder toBuilder();
        public AlternativeHypothesis getAlternative();
        public PValueMethod getPValueMethod();
        public boolean isStrictInequality();
    }
    public static double statistic(double[] x, double[] y,
                                   AlternativeHypothesis alternative);
    public static SignificanceResult test(double[] x, double[] y) {
        return test(x, y, Options.defaults());
    }
    public static SignificanceResult test(double[] x, double[] y, Options options);
} {code}
Calling it with the defaults is simple, with any other options is quite verbose:
{code:java}
double[] x, y;
SignificanceResult r1 = KolmogorovSmirnovTest.test(x, y);
SignificanceResult r2 = KolmogorovSmirnovTest.test(x, y,
    Options.builder().setAlternative(AlternativeHypothesis.GREATER_THAN)
                     .setPValueMethod(PValueMethod.EXACT)
                     .setStrictInequality(true)
                     .build());{code}
Note that for repeat testing the options can be pre-built and passed in.
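
A minimal sketch of what the immutable Options and its builder could look like, which is what makes pre-building and sharing safe. The enum constants TWO_SIDED, LESS_THAN and AUTO and the default values are assumptions for illustration:

```java
import java.util.Objects;

// Sketch only: enum constants other than GREATER_THAN and EXACT are assumed.
enum AlternativeHypothesis { TWO_SIDED, GREATER_THAN, LESS_THAN }
enum PValueMethod { AUTO, EXACT }

/** Immutable options; an instance can be built once and shared between calls. */
final class Options {
    private final AlternativeHypothesis alternative;
    private final PValueMethod pValueMethod;
    private final boolean strictInequality;

    private Options(Builder b) {
        alternative = b.alternative;
        pValueMethod = b.pValueMethod;
        strictInequality = b.strictInequality;
    }

    static Options defaults() { return new Builder().build(); }
    static Builder builder() { return new Builder(); }

    /** Creates a builder initialised with the current values. */
    Builder toBuilder() {
        return new Builder().setAlternative(alternative)
                            .setPValueMethod(pValueMethod)
                            .setStrictInequality(strictInequality);
    }

    AlternativeHypothesis getAlternative() { return alternative; }
    PValueMethod getPValueMethod() { return pValueMethod; }
    boolean isStrictInequality() { return strictInequality; }

    /** Mutable builder holding the assumed defaults. */
    static final class Builder {
        private AlternativeHypothesis alternative = AlternativeHypothesis.TWO_SIDED;
        private PValueMethod pValueMethod = PValueMethod.AUTO;
        private boolean strictInequality;

        Builder setAlternative(AlternativeHypothesis v) {
            alternative = Objects.requireNonNull(v, "alternative");
            return this;
        }
        Builder setPValueMethod(PValueMethod v) {
            pValueMethod = Objects.requireNonNull(v, "pValueMethod");
            return this;
        }
        Builder setStrictInequality(boolean v) {
            strictInequality = v;
            return this;
        }
        Options build() { return new Options(this); }
    }
}
```

The cost is visible here: roughly fifty lines per test class just to carry three values, which is the bloat the alternative below tries to avoid.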

A simpler API without the bloat of strongly typed options (with some way to 
build them) is to have optional arguments as a varargs array:
{code:java}
public final class KolmogorovSmirnovTest {
    public static double statistic(double[] x, double[] y,
                                   AlternativeHypothesis alternative);
    public static SignificanceResult test(double[] x, double[] y, Object... options);
}  {code}
Calling it then becomes:
{code:java}
double[] x, y;
SignificanceResult r1 = KolmogorovSmirnovTest.test(x, y);
SignificanceResult r2 = KolmogorovSmirnovTest.test(x, y,
    AlternativeHypothesis.GREATER_THAN,
    PValueMethod.EXACT,
    Inequality.STRICT); {code}
Here the Object[] must be parsed by the test method to extract any options it 
recognises. This is similar to the Optimizer API in CM4 (see 
[BaseOptimizer.optimize|https://commons.apache.org/proper/commons-math/javadocs/api-4.0-beta1/org/apache/commons/math4/legacy/optim/BaseOptimizer.html#optimize(org.apache.commons.math4.legacy.optim.OptimizationData...)])
 but does not require every option to implement a marker interface. A variant 
with a marker interface would be:
{code:java}
public final class KolmogorovSmirnovTest {     
    // ...
    public static SignificanceResult test(double[] x, double[] y, TestOption... options);
} {code}
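
A minimal sketch of how a test method might parse the Object[] to extract the options it recognises. The ParsedOptions class, the enum constants not shown above and the default values are hypothetical:

```java
// Sketch only: constants other than GREATER_THAN, EXACT and STRICT are assumed.
enum AlternativeHypothesis { TWO_SIDED, GREATER_THAN, LESS_THAN }
enum PValueMethod { AUTO, EXACT }
enum Inequality { NON_STRICT, STRICT }

/** Holds the options recognised by a test, initialised to assumed defaults. */
final class ParsedOptions {
    AlternativeHypothesis alternative = AlternativeHypothesis.TWO_SIDED;
    PValueMethod method = PValueMethod.AUTO;
    Inequality inequality = Inequality.NON_STRICT;

    /**
     * Scans the varargs array by type. Unrecognised entries are ignored and,
     * if an option type is repeated, the last occurrence wins.
     */
    static ParsedOptions parse(Object... options) {
        ParsedOptions p = new ParsedOptions();
        for (Object o : options) {
            if (o instanceof AlternativeHypothesis) {
                p.alternative = (AlternativeHypothesis) o;
            } else if (o instanceof PValueMethod) {
                p.method = (PValueMethod) o;
            } else if (o instanceof Inequality) {
                p.inequality = (Inequality) o;
            }
        }
        return p;
    }
}
```

Silently ignoring unrecognised entries is one choice; throwing an IllegalArgumentException instead would catch options passed to a test that does not support them, at the cost of being stricter than a pass-through API. A marker interface would move part of that check to compile time.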
When using varargs any primitive values must be wrapped with a class that can 
be uniquely identified. Hence the API for the chi-square test with an optional 
degrees of freedom adjustment is called using:
{code:java}
public final class ChiSquareTest {
    // ...
    public static SignificanceResult test(double[] expected, long[] observed, Object... options);
}

ChiSquareTest.test(expected, observed, DegreesOfFreedomAdjustment.of(1));{code}
This highlights the issue where tests only have a single option. For 
consistency the API would specify the varargs. But for simplicity the method 
can be provided with the optional parameter as an overloaded method.
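
A minimal sketch of the wrapper class and of an overload that delegates to the varargs form. The DofExample class is hypothetical and computes only the degrees of freedom (categories - 1 - adjustment) rather than a full test, to keep the example short:

```java
// Sketch only: DofExample is a stand-in for a real test method.
final class DegreesOfFreedomAdjustment {
    private final int value;

    private DegreesOfFreedomAdjustment(int value) {
        this.value = value;
    }

    /** Wraps a primitive so it can be identified by type in a varargs array. */
    static DegreesOfFreedomAdjustment of(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("Negative adjustment: " + value);
        }
        return new DegreesOfFreedomAdjustment(value);
    }

    int getAsInt() { return value; }
}

final class DofExample {
    private DofExample() {}

    /** Varargs form: extracts the adjustment if one is present. */
    static int degreesOfFreedom(int categories, Object... options) {
        int adjustment = 0;
        for (Object o : options) {
            if (o instanceof DegreesOfFreedomAdjustment) {
                adjustment = ((DegreesOfFreedomAdjustment) o).getAsInt();
            }
        }
        return categories - 1 - adjustment;
    }

    /** Convenience overload for the single option; delegates to the varargs form. */
    static int degreesOfFreedom(int categories, int adjustment) {
        return degreesOfFreedom(categories, DegreesOfFreedomAdjustment.of(adjustment));
    }
}
```

A call such as degreesOfFreedom(10, 1) resolves to the (int, int) overload because Java prefers a match without boxing or varargs, so the two forms can coexist.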

What I do not wish to happen is that the API is expanded over time with a daisy 
chain of overloaded methods as more options are added to existing tests. So to 
prevent this I would recommend some type of minimum API that naturally expands 
to accommodate additional options.

Currently the API consists of:
{noformat}
BinomialTest:
// statistic = numberOfSuccesses / numberOfTrials so is omitted from the API
test(int numberOfTrials, int numberOfSuccesses, double probability, alternative=two-sided)

ChiSquareTest:
statistic(long[] observed)
statistic(double[] expected, long[] observed)
statistic(long[][] counts)
statistic(long[] observed1, long[] observed2)
test(long[] observed, degreesOfFreedomAdjustment=0)
test(double[] expected, long[] observed, degreesOfFreedomAdjustment=0)
test(long[][] counts)
test(long[] observed1, long[] observed2)

GTest:
statistic(long[] observed)
statistic(double[] expected, long[] observed)
statistic(long[][] counts)
test(long[] observed, degreesOfFreedomAdjustment=0)
test(double[] expected, long[] observed, degreesOfFreedomAdjustment=0)
test(long[][] counts)

KolmogorovSmirnovTest:
statistic(double[] x, DoubleUnaryOperator cdf, alternative=two-sided)
statistic(double[] x, double[] y, alternative=two-sided)
test(double[] x, DoubleUnaryOperator cdf, alternative=two-sided, method=auto)
test(double[] x, double[] y, alternative=two-sided, method=auto, strict=false)
estimateP(double[] x, double[] y,
          UniformRandomProvider rng,
          int iterations,
          method=[sampling, random-walk],
          alternative=two-sided, strict=false)

MannWhitneyUTest:
statistic(double[] x, double[] y)
test(double[] x, double[] y, alternative=two-sided, method=auto, correct=true)

OneWayAnova:
// statistic is omitted as the statistic must be specified with degrees of
// freedom: (F, df_bg, df_wg)
test(Collection<double[]> data)

TTest:
statistic(m, v, n, mu=0)
statistic(double[] x, mu=0)
pairedStatistic(double[] x, double[] y, mu=0)
statistic(m1, v1, n1, m2, v2, n2, mu=0, homoscedastic=false)
statistic(double[] x, double[] y, mu=0, homoscedastic=false)
test(m, v, n, mu=0, alternative=two-sided)
test(double[] x, mu=0, alternative=two-sided)
pairedTest(double[] x, double[] y, mu=0, alternative=two-sided)
test(m1, v1, n1, m2, v2, n2, mu=0, homoscedastic=false, alternative=two-sided)
test(double[] x, double[] y, mu=0, homoscedastic=false, alternative=two-sided)

WilcoxonSignedRankTest:
statistic(double[] z)
statistic(double[] x, double[] y)
test(double[] z, alternative=two-sided, method=auto, correct=true)
test(double[] x, double[] y, alternative=two-sided, method=auto, correct=true){noformat}
Note that the paired TTest could be provided as an option for the two-sample 
test, i.e. paired or unpaired. This is the way it is implemented in R. In SciPy 
they provide a method for two-sample independent (scipy.stats.ttest_ind) and 
two-sample related (scipy.stats.ttest_rel).
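
For reference, the paired statistic is the one-sample t statistic of the differences x[i] - y[i], which is why it can be exposed either as a separate method or as an option. A minimal sketch with a hypothetical class name:

```java
// Sketch only: PairedT is a hypothetical name, not the module's TTest class.
final class PairedT {
    private PairedT() {}

    /** One-sample t statistic: (mean - mu) / sqrt(variance / n). */
    static double statistic(double[] x, double mu) {
        int n = x.length;
        double mean = 0;
        for (double v : x) {
            mean += v;
        }
        mean /= n;
        double ss = 0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        double variance = ss / (n - 1);
        return (mean - mu) / Math.sqrt(variance / n);
    }

    /** Paired t statistic: the one-sample statistic of the differences. */
    static double pairedStatistic(double[] x, double[] y, double mu) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Paired samples must have equal length");
        }
        double[] d = new double[x.length];
        for (int i = 0; i < d.length; i++) {
            d[i] = x[i] - y[i];
        }
        return statistic(d, mu);
    }
}
```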

The KolmogorovSmirnovTest has a method to estimate p-values. The CM 
implementation has two estimation methods requiring a random generator and also 
functionality to remove ties in the data using randomness. I have changed the 
functionality but the details should be under a separate ticket. Here we will 
assume that the standard statistic and p-value computation are deterministic 
and any non-deterministic estimation is in a separate method, thus the user is 
aware they are using randomness to generate the result. The API choice then 
becomes how to pass non-default parameters to the estimation method, e.g. those 
controlling the estimation procedure.
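
To make the separation concrete, below is a rough sketch of a randomised estimate kept apart from the deterministic statistic. It is a plain permutation estimate for the two-sided two-sample statistic and is not the sampling or random-walk method of the module; java.util.SplittableRandom stands in for UniformRandomProvider and the class name is hypothetical:

```java
import java.util.Arrays;
import java.util.SplittableRandom;

// Sketch only: a simple permutation estimate, not the module's algorithm.
final class KsEstimate {
    private KsEstimate() {}

    /** Deterministic two-sided statistic D = sup |F_x - F_y|. */
    static double statistic(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0;
        int j = 0;
        double d = 0;
        while (i < a.length && j < b.length) {
            // Step both empirical CDFs past the next distinct value (handles ties).
            double v = Math.min(a[i], b[j]);
            while (i < a.length && a[i] <= v) {
                i++;
            }
            while (j < b.length && b[j] <= v) {
                j++;
            }
            d = Math.max(d, Math.abs((double) i / a.length - (double) j / b.length));
        }
        return d;
    }

    /** Fraction of random re-partitions of the pooled data with D >= observed D. */
    static double estimateP(double[] x, double[] y, SplittableRandom rng, int iterations) {
        double observed = statistic(x, y);
        double[] pooled = Arrays.copyOf(x, x.length + y.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        int count = 0;
        for (int k = 0; k < iterations; k++) {
            // Fisher-Yates shuffle of the pooled sample.
            for (int i = pooled.length - 1; i > 0; i--) {
                int j = rng.nextInt(i + 1);
                double tmp = pooled[i];
                pooled[i] = pooled[j];
                pooled[j] = tmp;
            }
            double d = statistic(Arrays.copyOfRange(pooled, 0, x.length),
                                 Arrays.copyOfRange(pooled, x.length, pooled.length));
            if (d >= observed) {
                count++;
            }
        }
        return (double) count / iterations;
    }
}
```

Because the generator and iteration count are explicit arguments of estimateP, a caller cannot obtain a non-deterministic result by accident. The comparison d >= observed is the non-strict form; a strict option would use d > observed.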

 

Currently I am favouring the test(x, y, Object... options) API to remove all 
the bloat of builders for Options. It allows more options to be added with no 
API changes. Any opinions on this would be welcome.

 

> Port o.a.c.math.stat.inference to a commons-statistics-inference module
> -----------------------------------------------------------------------
>
>                 Key: STATISTICS-62
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-62
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: inference
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Major
>
> The o.a.c.math4.legacy.stat.inference package contains:
>  
> {noformat}
> AlternativeHypothesis.java
> BinomialTest.java
> ChiSquareTest.java
> GTest.java
> InferenceTestUtils.java
> KolmogorovSmirnovTest.java
> MannWhitneyUTest.java
> OneWayAnova.java
> TTest.java
> WilcoxonSignedRankTest.java{noformat}
> There are few dependencies on other math packages. The notable exceptions are:
>  
> 1. KolmogorovSmirnovTest which requires matrix support. This is for 
> multiplication of a square matrix to support a matrix power function. This 
> uses a double matrix and the same code is duplicated for a BigFraction 
> matrix. Such code can be ported internally to support only the required 
> functions. It can also drop the defensive copy strategy used by Commons Math 
> in matrices to allow multiply in-place where appropriate for performance 
> gains.
> 2. OneWayAnova which collates the sum, sum of squares and count using 
> SummaryStatistics. This can be done using an internal class. It is possible 
> to call the test method using already computed SummaryStatistics. The method 
> that does this using the SummaryStatistics as part of the API can be dropped, 
> or supported using an interface that returns: getSum, getSumOfSquares, getN.
> All the inference Test classes have instance methods but no state. The 
> InferenceTestUtils is a static class that holds references to a singleton for 
> each class and provides static methods to pass through the underlying 
> instances.
> I suggest changing the test classes to have only static methods and dropping 
> InferenceTestUtils.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
