[ 
https://issues.apache.org/jira/browse/STATISTICS-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645994#comment-17645994
 ] 

Alex Herbert commented on STATISTICS-62:
----------------------------------------

In short, yes to refactoring. But I do not know the best way to consolidate the 
package.

I have done the initial work to remove all the math dependencies. Next step is 
to get the test suite working. I've made notes on all changes made.

There is a lot of inconsistency across the classes so I hope to resolve all 
this. Most classes could be static. But the MannWhitneyUTest and 
WilcoxonSignedRankTest can be passes a RankingAlgorithm to change their 
behaviour. The KolmogorovSmirnovTest has 3 methods to compute the p-value for 
the single sample test, and 4 methods for the 2 sample test. It also has a 
method to resolve ties when the sample size is small and the p-value method is 
exact with no explanation of why this is required. It would work without tie 
resolution. If tie resolution is required it should be modified to perform this 
in a single pass rather than using random jitter. I have to check the repo 
commit history to see why this has been done.

In order to support all tests in their current form I think that we can keep 
the instance based methods and maintain a singleton for the default settings. 
Any other settings could be accessed through a builder pattern to create an 
instance to perform the desired computation. The test instances then have these 
methods:
{code:java}
XyzTest t = XyzTest.instance();
XyzTest t = XyzTest.builder()
                   .configure(...)
                   .build();

double statistic(...);           // compute the test statistic
double test(...);                // compute the test p-value
boolean test(..., double alpha); // compute a reject/accept test{code}
The idea behind the builder is to support a cleaner API where the tests are 
performed using the same methods, irrespective of the method used to compute 
the statistic, or p-value. It allows the classes where many options are 
available such as the KolmogorovSmirnovTest to be configured beyond the 
defaults:
{code:java}
double[] x;
double[] y;

// Old
public double kolmogorovSmirnovTest(double[] x, double[] y)
public double monteCarloP(double d,
                          int n,
                          int m,
                          boolean strict,
                          int iterations,
                          UniformRandomProvider rng);

KolmogorovSmirnovTest t = new KolmogorovSmirnovTest();
double = t.monteCarloP(t.kolmogorovSmirnovTest(x, y), 
                       x.length,
                       y.length,
                       true,
                       100000,
                       rng);

// New
double p1 = KolmogorovSmirnovTest.instance().test(x, y);
double p2 = KolmogorovSmirnovTest.builder()
                                 
.method(KolmogorovSmirnovTest.TwoSample.MONTE_CARLO)
                                 .strict(true)
                                 .iterations(100000)
                                 .randomSource(rng)
                                 .build()
                                 .test(x, y);{code}
Here the amount of typing has not really changed.

It does make documenting all the options more difficult as the test methods are 
separated from their arguments. In the case above an enum 
KolmogorovSmirnovTest.TwoSample has been used to provide computation options 
for the two-sample test, then the parameters strict, iterations and 
randomSource must be documented to apply to this two-sample method.

However a builder is not always suitable. See the TTest which offers:
 * One-sample or two-sample
 * One-sided or two-sided
 * Paired or unpaired (for two-sample tests)
 * Homoscedastic (equal variance assumption) or heteroscedastic (for two sample 
tests)

In this case a builder is not really required. This can be done using a 
singleton instance for each test type. However the paired test and assumptions 
on variance only applies to two sample computation.

Thus there are issues with the builder pattern as the returned objects are not 
conforming to a set interface. Some tests support one-sample and two-sample 
tests. Some support input as a double[], and also using some form of summary 
statistics (OneWayAnova accepts mean, variance and size of each sample). Some 
tests have variations based on test type (T-test homo/heteroscedastic, 
paired/unpaired). Some tests have many options to compute the p-value 
(KolmogorovSmirnovTest).

I will get the test suite working and can commit an initial port with the API 
mostly unchanged. Then we can discuss how to refactor and keep this refactoring 
in the revision history. Or we can discuss how to refactor first and commit a 
clean revision history with the new API.

 

> Port o.a.c.math.stat.inference to a commons-statistics-inference module
> -----------------------------------------------------------------------
>
>                 Key: STATISTICS-62
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-62
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: inference
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Major
>
> The o.a.c.math4.legacy.stat.inference package contains:
>  
> {noformat}
> AlternativeHypothesis.java
> BinomialTest.java
> ChiSquareTest.java
> GTest.java
> InferenceTestUtils.java
> KolmogorovSmirnovTest.java
> MannWhitneyUTest.java
> OneWayAnova.java
> TTest.java
> WilcoxonSignedRankTest.java{noformat}
> The are few dependencies on other math packages. The notable exceptions are:
>  
> 1. KolmogorovSmirnovTest which requires matrix support. This is for 
> multiplication of a square matrix to support a matrix power function. This 
> uses a double matrix and the same code is duplicated for a BigFraction 
> matrix. Such code can be ported internally to support only the required 
> functions. It can also drop the defensive copy strategy used by Commons Math 
> in matrices to allow multiply in-place where appropriate for performance 
> gains.
> 2. OneWayAnova which collates the sum, sum of squares and count using 
> SummaryStatistics. This can be done using an internal class. It is possible 
> to call the test method using already computed SummaryStatistics. The method 
> that does this using the SummaryStatistics as part of the API can be dropped, 
> or supported using an interface that returns: getSum, getSumOfSquares, getN.
> All the inference Test classes have instance methods but no state. The 
> InferenceTestUtils is a static class that holds references to a singleton for 
> each class and provides static methods to pass through the underlying 
> instances.
> I suggest changing the test classes to have only static methods and dropping 
> InferenceTestUtils.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to