[
https://issues.apache.org/jira/browse/STATISTICS-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683918#comment-17683918
]
Alex Herbert commented on STATISTICS-62:
----------------------------------------
I have finished the work on the computations performed by the new inference
module. This has led to development of the API based on the currently supported
options.
The first observation is that all tests are separated into: creating a
statistic; creating a p-value for the statistic; computing a boolean value to
reject the null hypothesis given a significance level. The last step is trivially:
{code:java}
return p < alpha;
{code}
It is extreme code bloat to duplicate methods just to pass a significance level
and perform this boolean expression. Also note that if you require a p-value
then you also have to have a statistic, so these should be paired in a result
(statistic, p-value).
I have written each test to have the following generic API where methods have
compulsory arguments and optional ones. The syntax below is akin to a language
that supports optional named arguments:
{code:java}
double statistic(x, y, option1=a)
SignificanceResult test(x, y, option1=a, option2=b, option3=c){code}
The test result is:
{code:java}
public interface SignificanceResult {
double getStatistic();
double getPValue();
default boolean reject(double alpha) {
// validate alpha in (0, 0.5], then
return getPValue() < alpha;
}
} {code}
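As a minimal sketch of how a result could be produced and used, the block below repeats the interface and adds a hypothetical immutable implementation. The class name BaseSignificanceResult and the exception message are assumptions for illustration; only the interface methods and the (0, 0.5] range come from the text above.

```java
/** The result interface from the comment above, repeated so the sketch compiles alone. */
interface SignificanceResult {
    double getStatistic();

    double getPValue();

    default boolean reject(double alpha) {
        // The range (0, 0.5] comes from the comment in the interface above.
        // The negated condition also rejects NaN.
        if (!(alpha > 0 && alpha <= 0.5)) {
            throw new IllegalArgumentException("Significance level not in (0, 0.5]: " + alpha);
        }
        return getPValue() < alpha;
    }
}

/** Hypothetical immutable (statistic, p-value) pair; the name is an assumption. */
final class BaseSignificanceResult implements SignificanceResult {
    private final double statistic;
    private final double pValue;

    BaseSignificanceResult(double statistic, double pValue) {
        this.statistic = statistic;
        this.pValue = pValue;
    }

    @Override
    public double getStatistic() {
        return statistic;
    }

    @Override
    public double getPValue() {
        return pValue;
    }
}
```

With this shape a caller that wants a decision writes result.reject(0.05), and no test class needs a duplicate method that accepts a significance level.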
Tests may return more information by extending the SignificanceResult. This is
actually useful for some tests which have a lot more information, for example
the OneWayAnova test can return all data typically reported for ANOVA tests
(degrees of freedom between and within groups, variances between and within
groups).
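One way the extension could look is sketched below. The interface name AnovaResult, its accessor names and the SimpleAnovaResult class are all hypothetical; the p-value is supplied by the caller here because computing it requires the F distribution, which is outside the scope of the sketch.

```java
/** Base result, repeated so the sketch compiles alone (alpha validation omitted for brevity). */
interface SignificanceResult {
    double getStatistic();

    double getPValue();

    default boolean reject(double alpha) {
        return getPValue() < alpha;
    }
}

/** Hypothetical extended result; all accessor names are assumptions. */
interface AnovaResult extends SignificanceResult {
    int getDegreesOfFreedomBetweenGroups();

    int getDegreesOfFreedomWithinGroups();

    double getVarianceBetweenGroups();

    double getVarianceWithinGroups();
}

/** Hypothetical immutable holder for the extended result. */
final class SimpleAnovaResult implements AnovaResult {
    private final int dfBetween;
    private final int dfWithin;
    private final double varBetween;
    private final double varWithin;
    private final double pValue;

    SimpleAnovaResult(int dfBetween, int dfWithin,
                      double varBetween, double varWithin, double pValue) {
        this.dfBetween = dfBetween;
        this.dfWithin = dfWithin;
        this.varBetween = varBetween;
        this.varWithin = varWithin;
        this.pValue = pValue;
    }

    /** The F statistic is the ratio of the between-group to within-group variance. */
    @Override
    public double getStatistic() {
        return varBetween / varWithin;
    }

    @Override
    public double getPValue() {
        return pValue;
    }

    @Override
    public int getDegreesOfFreedomBetweenGroups() {
        return dfBetween;
    }

    @Override
    public int getDegreesOfFreedomWithinGroups() {
        return dfWithin;
    }

    @Override
    public double getVarianceBetweenGroups() {
        return varBetween;
    }

    @Override
    public double getVarianceWithinGroups() {
        return varWithin;
    }
}
```

Code that only needs the decision can continue to treat the value as a SignificanceResult, while reporting code can use the extended type.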
Note that the statistic method is seemingly redundant as you can call test and
extract the statistic from the result. However the use case is when you have to
compare a statistic against a pre-computed critical value (e.g. from a table of
critical values). Here you do not require the computation effort to generate
the p-value. An extreme example: each build of the Commons RNG core module
performs approximately 17*500 chi-square tests for uniformity per RNG
implementation (50 current tested instances) which is at least 425,000 tests
per build, all using the same critical value. There are other places where a
critical value is used too so this is an underestimate.
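The critical value use case can be sketched as follows. This is an illustration only, not the module's implementation: the class name is hypothetical, the statistic assumes a uniform expected frequency, and the critical value 7.815 is the standard table entry for 3 degrees of freedom at a significance level of 0.05.

```java
/** Hypothetical demo: compare a chi-square statistic against a tabulated critical value. */
final class ChiSquareCriticalValueDemo {
    /** Critical value for 3 degrees of freedom at alpha = 0.05 (from standard tables). */
    static final double CRITICAL_VALUE_DF3_ALPHA_005 = 7.815;

    private ChiSquareCriticalValueDemo() {}

    /** Chi-square statistic assuming every category has the same expected count. */
    static double statistic(long[] observed) {
        long total = 0;
        for (final long o : observed) {
            total += o;
        }
        final double expected = (double) total / observed.length;
        double sum = 0;
        for (final long o : observed) {
            final double d = o - expected;
            sum += d * d / expected;
        }
        return sum;
    }

    /** Reject uniformity when the statistic exceeds the critical value; no p-value is computed. */
    static boolean reject(long[] observed, double criticalValue) {
        return statistic(observed) > criticalValue;
    }
}
```

The critical value is looked up once and reused for every test, so the cost per test is a single pass over the observed counts.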
Also note that this removes the ability to compute a p-value given a statistic.
However this is functionality that belongs in the Statistics distribution
package. The only distributions not there that are required are the
distributions for the Kolmogorov-Smirnov, Mann-Whitney U and Wilcoxon signed
rank statistics. Since these only require the p-value from the survival function
the implementations are partial and are missing CDF, PDF and moments to allow
inclusion in the distribution package. The implementations could be ported
there if a full implementation is completed. I am not aware of the usefulness
of these distributions outside of inference testing.
Since Java does not support optional arguments there are a few ways to
implement the API. Options can be strongly typed as immutable objects with
properties. The example below shows this using a builder pattern for the
Kolmogorov-Smirnov test. For reference, this is SciPy's signature for the
test, to which I have added the ability to compute the p-value with a strict
inequality (an option carried over from the CM implementation):
{noformat}
scipy.stats.ks_2samp(data1, data2, alternative='two-sided', method='auto',
strict=False){noformat}
Java with Options:
{code:java}
public final class KolmogorovSmirnovTest {
public static class Options {
public static class Builder {
public Builder setAlternative(AlternativeHypothesis v);
public Builder setPValueMethod(PValueMethod v);
public Builder setStrictInequality(boolean v);
public Options build();
}
public static Options defaults();
public static Builder builder();
public Builder toBuilder();
public AlternativeHypothesis getAlternative();
public PValueMethod getPValueMethod();
public boolean isStrictInequality();
}
public static double statistic(double[] x, double[] y,
AlternativeHypothesis alternative);
public static SignificanceResult test(double[] x, double[] y) {
return test(x, y, Options.defaults());
}
public static SignificanceResult test(double[] x, double[] y, Options
options);
} {code}
Calling it with the defaults is simple; calling it with any other options is quite verbose:
{code:java}
double[] x, y;
SignificanceResult r1 = KolmogorovSmirnovTest.test(x, y);
SignificanceResult r2 = KolmogorovSmirnovTest.test(x, y,
Options.builder().setAlternative(AlternativeHypothesis.GREATER_THAN)
.setPValueMethod(PValueMethod.EXACT)
.setStrictInequality(true)
.build());{code}
Note that for repeat testing the options can be pre-built and passed in.
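To show how much code the strongly typed approach needs, here is a minimal sketch of an immutable Options class with its builder. It is written as a top-level class for brevity rather than nested in the test class. The enum constants other than GREATER_THAN and EXACT, and the default values, are assumptions based on the defaults shown in the SciPy signature.

```java
import java.util.Objects;

// Assumed constants: only GREATER_THAN and EXACT appear in the comment above.
enum AlternativeHypothesis { TWO_SIDED, GREATER_THAN, LESS_THAN }

enum PValueMethod { AUTO, EXACT }

/** Hypothetical immutable options; instances are safe to pre-build and reuse. */
final class Options {
    private static final Options DEFAULTS =
        new Options(AlternativeHypothesis.TWO_SIDED, PValueMethod.AUTO, false);

    private final AlternativeHypothesis alternative;
    private final PValueMethod pValueMethod;
    private final boolean strictInequality;

    private Options(AlternativeHypothesis alternative, PValueMethod pValueMethod,
                    boolean strictInequality) {
        this.alternative = alternative;
        this.pValueMethod = pValueMethod;
        this.strictInequality = strictInequality;
    }

    static Options defaults() { return DEFAULTS; }

    static Builder builder() { return new Builder(DEFAULTS); }

    /** Creates a builder initialised with the properties of this instance. */
    Builder toBuilder() { return new Builder(this); }

    AlternativeHypothesis getAlternative() { return alternative; }

    PValueMethod getPValueMethod() { return pValueMethod; }

    boolean isStrictInequality() { return strictInequality; }

    /** Mutable builder; each call to build() creates a new immutable Options. */
    static final class Builder {
        private AlternativeHypothesis alternative;
        private PValueMethod pValueMethod;
        private boolean strictInequality;

        private Builder(Options source) {
            alternative = source.alternative;
            pValueMethod = source.pValueMethod;
            strictInequality = source.strictInequality;
        }

        Builder setAlternative(AlternativeHypothesis v) {
            alternative = Objects.requireNonNull(v, "alternative");
            return this;
        }

        Builder setPValueMethod(PValueMethod v) {
            pValueMethod = Objects.requireNonNull(v, "pValueMethod");
            return this;
        }

        Builder setStrictInequality(boolean v) {
            strictInequality = v;
            return this;
        }

        Options build() {
            return new Options(alternative, pValueMethod, strictInequality);
        }
    }
}
```

For repeat testing the options are built once, for example as a constant, and passed to every call; toBuilder() derives a variant without modifying the original.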
A simpler API without the bloat of strongly typed options (with some way to
build them) is to have optional arguments as a varargs array:
{code:java}
public final class KolmogorovSmirnovTest {
public static double statistic(double[] x, double[] y,
AlternativeHypothesis alternative);
public static SignificanceResult test(double[] x, double[] y, Object...
options);
} {code}
Calling it then becomes:
{code:java}
double[] x, y;
SignificanceResult r1 = KolmogorovSmirnovTest.test(x, y);
SignificanceResult r2 = KolmogorovSmirnovTest.test(x, y,
AlternativeHypothesis.GREATER_THAN,
PValueMethod.EXACT,
Inequality.STRICT); {code}
Here the Object[] must be parsed by the test method to extract any options it
recognises. This is similar to the Optimizer API in CM4 (see
[BaseOptimizer.optimize|https://commons.apache.org/proper/commons-math/javadocs/api-4.0-beta1/org/apache/commons/math4/legacy/optim/BaseOptimizer.html#optimize(org.apache.commons.math4.legacy.optim.OptimizationData...)])
except that CM4 requires all options to implement a marker interface. The
equivalent here would be:
{code:java}
public final class KolmogorovSmirnovTest {
// ...
public static SignificanceResult test(double[] x, double[] y, TestOption...
options);
} {code}
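The parsing of an Object... array could be sketched as below. The ParsedOptions class, the default values and the enum constants other than GREATER_THAN, EXACT and STRICT are assumptions. The text above does not say whether unrecognised options are ignored or rejected; this sketch rejects them, and lets a later option override an earlier one of the same type.

```java
// Assumed constants: only GREATER_THAN, EXACT and STRICT appear in the comment above.
enum AlternativeHypothesis { TWO_SIDED, GREATER_THAN, LESS_THAN }

enum PValueMethod { AUTO, EXACT }

enum Inequality { NON_STRICT, STRICT }

/** Hypothetical holder for the options extracted from a varargs array. */
final class ParsedOptions {
    final AlternativeHypothesis alternative;
    final PValueMethod method;
    final Inequality inequality;

    private ParsedOptions(AlternativeHypothesis alternative, PValueMethod method,
                          Inequality inequality) {
        this.alternative = alternative;
        this.method = method;
        this.inequality = inequality;
    }

    /** Identifies each option by its type; the order of the arguments does not matter. */
    static ParsedOptions parse(Object... options) {
        // Start from the defaults and overwrite with any recognised option.
        AlternativeHypothesis alternative = AlternativeHypothesis.TWO_SIDED;
        PValueMethod method = PValueMethod.AUTO;
        Inequality inequality = Inequality.NON_STRICT;
        for (final Object o : options) {
            if (o instanceof AlternativeHypothesis) {
                alternative = (AlternativeHypothesis) o;
            } else if (o instanceof PValueMethod) {
                method = (PValueMethod) o;
            } else if (o instanceof Inequality) {
                inequality = (Inequality) o;
            } else {
                // Design choice of this sketch: fail fast on anything unknown (including null).
                throw new IllegalArgumentException("Unrecognised option: " + o);
            }
        }
        return new ParsedOptions(alternative, method, inequality);
    }
}
```

With Object... a mistyped option is only detected at run time. A marker interface such as TestOption moves part of that check to compile time, at the cost of every option type having to implement it.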
When using varargs any primitive values must be wrapped with a class that can
be uniquely identified. Hence the API for the chi-square test with an optional
degrees of freedom adjustment is called using:
{code:java}
public final class ChiSquareTest {
// ...
public static SignificanceResult test(double[] expected, long[] observed,
Object... options);
}
ChiSquareTest.test(expected, observed, DegreesOfFreedomAdjustment.of(1));{code}
This highlights the issue where tests only have a single option. For
consistency the API would specify the varargs. But for simplicity the method
can be provided with the optional parameter as an overloaded method.
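A minimal sketch of such a wrapper and of how a varargs method could pick it out is shown below. Only the name DegreesOfFreedomAdjustment and its of(int) factory come from the example above; the accessor, the validation and the ChiSquareOptionsDemo helper are assumptions. The helper uses the usual goodness-of-fit rule of (categories - 1 - adjustment) degrees of freedom.

```java
/** Wrapper giving an int option a unique type; accessor and validation are assumptions. */
final class DegreesOfFreedomAdjustment {
    private final int value;

    private DegreesOfFreedomAdjustment(int value) {
        this.value = value;
    }

    static DegreesOfFreedomAdjustment of(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("Negative adjustment: " + value);
        }
        return new DegreesOfFreedomAdjustment(value);
    }

    int getAsInt() {
        return value;
    }
}

/** Hypothetical helper showing why a bare primitive cannot be used as an option. */
final class ChiSquareOptionsDemo {
    private ChiSquareOptionsDemo() {}

    /** Degrees of freedom for a goodness-of-fit test with an optional adjustment. */
    static int degreesOfFreedom(int categories, Object... options) {
        int adjustment = 0;
        for (final Object o : options) {
            if (o instanceof DegreesOfFreedomAdjustment) {
                adjustment = ((DegreesOfFreedomAdjustment) o).getAsInt();
            } else {
                // A bare int arrives here boxed as an Integer: its meaning is ambiguous.
                throw new IllegalArgumentException("Unrecognised option: " + o);
            }
        }
        return categories - 1 - adjustment;
    }
}
```

A call such as degreesOfFreedom(10, 1) compiles, because the int is boxed into the Object array, but fails at run time. This is the cost of the untyped varargs form compared with an overload that takes the int directly.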
What I do not wish to happen is that the API is expanded over time with a daisy
chain of overloaded methods as more options are added to existing tests. So to
prevent this I would recommend some type of minimum API that naturally expands
to accommodate additional options.
Currently the API consists of:
{noformat}
BinomialTest:
// statistic = numberOfSuccesses / numberOfTrials so is omitted from the API
test(int numberOfTrials, int numberOfSuccesses, double probability,
alternative=two-sided)
ChiSquareTest:
statistic(long[] observed)
statistic(double[] expected, long[] observed)
statistic(long[][] counts)
statistic(long[] observed1, long[] observed2)
test(long[] observed, degreesOfFreedomAdjustment=0)
test(double[] expected, long[] observed, degreesOfFreedomAdjustment=0)
test(long[][] counts)
test(long[] observed1, long[] observed2)
GTest:
statistic(long[] observed)
statistic(double[] expected, long[] observed)
statistic(long[][] counts)
test(long[] observed, degreesOfFreedomAdjustment=0)
test(double[] expected, long[] observed, degreesOfFreedomAdjustment=0)
test(long[][] counts)
KolmogorovSmirnovTest:
statistic(double[] x, DoubleUnaryOperator cdf, alternative=two-sided)
statistic(double[] x, double[] y, alternative=two-sided)
test(double[] x, DoubleUnaryOperator cdf, alternative=two-sided, method=auto)
test(double[] x, double[] y, alternative=two-sided, method=auto, strict=false)
estimateP(double[] x, double[] y,
UniformRandomProvider rng,
int iterations,
method=[sampling, random-walk],
alternative=two-sided, strict=false)
MannWhitneyUTest:
statistic(double[] x, double[] y)
test(double[] x, double[] y, alternative=two-sided, method=auto, correct=true)
OneWayAnova:
// statistic is omitted as the statistic must be specified with degrees of
// freedom: (F, df_bg, df_wg)
test(Collection<double[]> data)
TTest:
statistic(m, v, n, mu=0)
statistic(double[] x, mu=0)
pairedStatistic(double[] x, double[] y, mu=0)
statistic(m1, v1, n1, m2, v2, n2, mu=0, homoscedastic=false)
statistic(double[] x, double[] y, mu=0, homoscedastic=false)
test(m, v, n, mu=0, alternative=two-sided)
test(double[] x, mu=0, alternative=two-sided)
pairedTest(double[] x, double[] y, mu=0, alternative=two-sided)
test(m1, v1, n1, m2, v2, n2, mu=0, homoscedastic=false, alternative=two-sided)
test(double[] x, double[] y, mu=0, homoscedastic=false, alternative=two-sided)
WilcoxonSignedRankTest:
statistic(double[] z)
statistic(double[] x, double[] y)
test(double[] z, alternative=two-sided, method=auto, correct=true)
test(double[] x, double[] y, alternative=two-sided, method=auto,
correct=true){noformat}
Note that the paired TTest could be provided as an option for the two-sample
test, i.e. paired or unpaired. This is the way it is implemented in R. In SciPy
they provide a method for two-sample independent (scipy.stats.ttest_ind) and
two-sample related (scipy.stats.ttest_rel).
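For context on why either layout is workable: the paired t statistic is the one-sample t statistic of the pairwise differences, so a paired option can delegate to the one-sample code. The sketch below illustrates that identity only. The class name is hypothetical and this is not a statement of how the module implements it.

```java
/** Hypothetical demo of the paired t statistic as a one-sample statistic on differences. */
final class PairedTStatisticDemo {
    private PairedTStatisticDemo() {}

    /** One-sample t statistic: (mean - mu) / sqrt(variance / n), using the unbiased variance. */
    static double statistic(double[] x, double mu) {
        final int n = x.length;
        if (n < 2) {
            throw new IllegalArgumentException("At least 2 values are required: " + n);
        }
        double sum = 0;
        for (final double v : x) {
            sum += v;
        }
        final double mean = sum / n;
        double ss = 0;
        for (final double v : x) {
            final double d = v - mean;
            ss += d * d;
        }
        final double variance = ss / (n - 1);
        return (mean - mu) / Math.sqrt(variance / n);
    }

    /** Paired t statistic: the one-sample statistic of the differences x[i] - y[i]. */
    static double pairedStatistic(double[] x, double[] y, double mu) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Paired samples must have the same length");
        }
        final double[] d = new double[x.length];
        for (int i = 0; i < d.length; i++) {
            d[i] = x[i] - y[i];
        }
        return statistic(d, mu);
    }
}
```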
The KolmogorovSmirnovTest has a method to estimate p-values. The CM
implementation has two estimation methods requiring a random generator and also
functionality to remove ties in the data using randomness. I have changed the
functionality but the details should be under a separate ticket. Here we will
assume that the standard statistic and p-value computation are deterministic
and any non-deterministic estimation is in a separate method, thus the user is
aware they are using randomness to generate the result. The API choice then
becomes how to pass non-default parameters to the estimation method, e.g. those
controlling the estimation procedure.
Currently I am favouring the test(x, y, Object... options) API to remove all
the bloat of builders for Options. It allows more options to be added with no
API changes. Any opinions on this would be welcome.
> Port o.a.c.math.stat.inference to a commons-statistics-inference module
> -----------------------------------------------------------------------
>
> Key: STATISTICS-62
> URL: https://issues.apache.org/jira/browse/STATISTICS-62
> Project: Commons Statistics
> Issue Type: New Feature
> Components: inference
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Priority: Major
>
> The o.a.c.math4.legacy.stat.inference package contains:
>
> {noformat}
> AlternativeHypothesis.java
> BinomialTest.java
> ChiSquareTest.java
> GTest.java
> InferenceTestUtils.java
> KolmogorovSmirnovTest.java
> MannWhitneyUTest.java
> OneWayAnova.java
> TTest.java
> WilcoxonSignedRankTest.java{noformat}
> There are few dependencies on other math packages. The notable exceptions are:
>
> 1. KolmogorovSmirnovTest which requires matrix support. This is for
> multiplication of a square matrix to support a matrix power function. This
> uses a double matrix and the same code is duplicated for a BigFraction
> matrix. Such code can be ported internally to support only the required
> functions. It can also drop the defensive copy strategy used by Commons Math
> in matrices to allow multiply in-place where appropriate for performance
> gains.
> 2. OneWayAnova which collates the sum, sum of squares and count using
> SummaryStatistics. This can be done using an internal class. It is possible
> to call the test method using already computed SummaryStatistics. The method
> that does this using the SummaryStatistics as part of the API can be dropped,
> or supported using an interface that returns: getSum, getSumOfSquares, getN.
> All the inference Test classes have instance methods but no state. The
> InferenceTestUtils is a static class that holds references to a singleton for
> each class and provides static methods to pass through the underlying
> instances.
> I suggest changing the test classes to have only static methods and dropping
> InferenceTestUtils.
>