[jira] [Commented] (STATISTICS-71) Implementation of Univariate Statistics

Alex Herbert (Jira) Tue, 21 Nov 2023 04:35:20 -0800


    [ 
https://issues.apache.org/jira/browse/STATISTICS-71?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788399#comment-17788399
 ]


Alex Herbert commented on STATISTICS-71:
----------------------------------------

I have created an initial implementation for a summary of composite statistics. 
The implementation is described below. Please raise a discussion on any of the 
points; e.g. changes to the API; functionality that is missing; functionality 
that is not required.
h2. Background

The equivalent in the JDK is {{DoubleSummaryStatistics}} which provides the 
following fixed set of statistics:
{code:java}
void accept(double);
void combine(DoubleSummaryStatistics);

long getCount();
double getSum();
double getMin();
double getMax();
double getAverage();

// "DoubleSummaryStatistics{count=%d, sum=%f, min=%f, average=%f, max=%f}"
String toString();{code}
Note that the void return type of the combine method prevents the use of 
combine as a function reference in a reduce operation for a stream since that 
requires a {{{}BinaryOperator{}}}:
{code:java}
interface BinaryOperator<T> extends BiFunction<T,T,T> {
    T apply(T t, T u);
    // ...
}

T reduce(T identity, BinaryOperator<T> accumulator);
Optional<T> reduce(BinaryOperator<T> accumulator);

// Requires
BinaryOperator<T> accumulator = (a, b) -> { a.combine(b); return a; };{code}
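As a minimal sketch of that workaround using only the JDK (the class and method names {{CombineAsReduceDemo}} and {{summarize}} are illustrative and not part of any proposed API), the lambda wraps the void {{combine}} so it can be passed to {{reduce}}:

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.function.BinaryOperator;

final class CombineAsReduceDemo {
    /** Summarises each row separately, then merges the per-row summaries. */
    static DoubleSummaryStatistics summarize(double[][] data) {
        // combine() returns void, so DoubleSummaryStatistics::combine cannot
        // be used directly as a BinaryOperator; the lambda returns the
        // updated left-hand side instead.
        BinaryOperator<DoubleSummaryStatistics> combiner = (a, b) -> {
            a.combine(b);
            return a;
        };
        return Arrays.stream(data)
            .map(row -> Arrays.stream(row).summaryStatistics())
            .reduce(combiner)
            .orElseGet(DoubleSummaryStatistics::new);
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2, 3, 4}, {5, 6, 7, 8}};
        System.out.println(summarize(data));
    }
}
```

A {{combine}} method that returns the updated instance would make the lambda unnecessary.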
The equivalent in Commons Math is an implementation of the 
{{StatisticalSummary}} interface:
{code:java}
public interface StatisticalSummary {
    double getMean();
    double getVariance();
    double getStandardDeviation();
    double getMax();
    double getMin();
    long getN();
    double getSum();
}
{code}
This is implemented by {{SummaryStatistics}} and {{{}DescriptiveStatistics{}}}. 
However, these classes also add some of the other statistic implementations 
within the codebase (Skewness, Kurtosis, GeometricMean, etc). The additional 
implementations are not the complete set of available stats: for example, 
there is no Product, and both PopulationVariance and Variance are provided 
but only StandardDeviation (no population StandardDeviation). 
{{SummaryStatistics}} provides only {{StorelessUnivariateStatistic}} 
implementations. {{DescriptiveStatistics}} adds to this by providing the 
percentile, using an expandable store of the observed values. Each summary 
class allows the underlying implementation for each statistic to be set or 
obtained using setters and getters. The two summary classes are not related 
(i.e. {{DescriptiveStatistics}} does not extend {{{}SummaryStatistics{}}}).

This API does not provide a method to choose the statistics computed. The only 
workaround is to supply a NoOperation implementation for each statistic that is 
not required; this cannot be done with a simple method reference as 
{{StorelessUnivariateStatistic}} is not a {{{}FunctionalInterface{}}}. 
CM does not provide a NoOperation {{StorelessUnivariateStatistic}} 
implementation, so a user would have to create one and pass it in as 
the implementation for all unwanted stats.

There is no way to combine {{SummaryStatistics}} or 
{{{}DescriptiveStatistics{}}}. It is possible to copy them.
h2. New API

For the composite statistics the goals are:
 * Provide an API that allows choosing which statistics to compute.
 * Allow combination of statistics that have been computed in parallel.

This is the working API:
{code:java}
public enum Statistic {
    MIN,
    MAX,
    MEAN,
    STANDARD_DEVIATION,
    VARIANCE,
    SKEWNESS,
    KURTOSIS,
    PRODUCT,
    SUM,
    SUM_OF_LOGS,
    SUM_OF_SQUARES,
    GEOMETRIC_MEAN;
}

public final class DoubleStatistics implements DoubleConsumer {
    public final class Builder {
        public DoubleStatistics build();
        public DoubleStatistics build(double...);
    }

    public static DoubleStatistics of(Statistic...);
    public static Builder builder(Statistic...);

    public void accept(double);
    public long getCount();
    public boolean isSupported(Statistic);
    public double get(Statistic);
    public DoubleStatistics combine(DoubleStatistics);
}
{code}
h3. Notes
 # The builder is required to allow the set-up of the underlying computation to 
be a one time cost. The builder is then used as a factory to build instances of 
the DoubleStatistics. These instances will be compatible using the combine 
method.
 # An empty instance to compute a selected set of statistics is created using 
the {{of}} method. This is a simple one line implementation provided for 
convenience. However unlike the {{of}} method for individual statistics it does 
not accept a {{double[]}} and the instance is initially empty.
{code:java}
// One-liner could be removed ...
public static DoubleStatistics of(Statistic... statistics) {
    return builder(statistics).build();
}

// Could be added:
public static DoubleStatistics of(double[] values, Statistic... statistics) {
    return builder(statistics).build(values);
}{code}

 # The get(Statistic) method will throw an exception (IAE) if the statistic is 
not supported. This is not user friendly so the API provides a method to 
indicate which statistics are being computed: isSupported(Statistic) which will 
not throw exceptions.
 # The combine method requires that all the computations in the object are also 
provided by the other DoubleStatistics, i.e. the combine method requires they 
are {_}compatible{_}. Only the left-hand side is updated by combine so 
effectively this means that the object must be able to copy what it requires 
from the other object. For example it is possible to combine one way and not 
the other:
{code:java}
DoubleStatistics varStats =
    DoubleStatistics.builder(Statistic.VARIANCE).build(data);
DoubleStatistics meanStats =
    DoubleStatistics.builder(Statistic.MEAN).build(data);
// Throws
varStats.combine(meanStats);
// Allowed: mean is computed during the computation of variance
meanStats.combine(varStats);
{code}
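To illustrate why compatibility is one-directional, here is a toy sketch. It is not the actual implementation: the names {{CompatibleCombineSketch}}, {{Stat}} and {{Stats}} are hypothetical, and the use of a running mean plus a sum of squared deviations is my assumption about a typical storeless variance computation. A variance instance carries everything a mean instance needs, but not the reverse:

```java
final class CompatibleCombineSketch {
    enum Stat { MEAN, VARIANCE }

    /** Toy accumulator: always tracks the mean; optionally the second moment. */
    static final class Stats {
        private final boolean trackSecondMoment;
        private long n;
        private double mean;
        private double m2; // sum of squared deviations from the mean

        Stats(Stat stat, double... values) {
            trackSecondMoment = stat == Stat.VARIANCE;
            for (double v : values) {
                accept(v);
            }
        }

        void accept(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            if (trackSecondMoment) {
                m2 += delta * (x - mean);
            }
        }

        boolean isSupported(Stat s) {
            return s == Stat.MEAN || trackSecondMoment;
        }

        /** Sample variance is NaN or infinite for fewer than two values. */
        double get(Stat s) {
            if (!isSupported(s)) {
                throw new IllegalArgumentException("Unsupported: " + s);
            }
            return s == Stat.MEAN ? mean : m2 / (n - 1);
        }

        /** Updates this instance only; the other must supply all state needed. */
        Stats combine(Stats other) {
            if (trackSecondMoment && !other.trackSecondMoment) {
                throw new IllegalArgumentException("Incompatible: no variance state");
            }
            if (other.n == 0) {
                return this;
            }
            long total = n + other.n;
            double delta = other.mean - mean;
            if (trackSecondMoment) {
                m2 += other.m2 + delta * delta * n * other.n / total;
            }
            mean += delta * other.n / total;
            n = total;
            return this;
        }
    }
}
```

In this sketch {{meanStats.combine(varStats)}} succeeds because the mean is part of the variance state, while {{varStats.combine(meanStats)}} fails before any state is modified.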

h3. Example usage
 * A single array
{code:java}
double[] data = {1, 2, 3, 4, 5, 6, 7, 8};
DoubleStatistics stats = DoubleStatistics.builder(
    Statistic.MIN, Statistic.MAX)
    .build(data);
stats.get(Statistic.MIN);
stats.get(Statistic.MAX);
{code}

 * Arrays in parallel; provide all available statistics by the underlying 
computation
{code:java}
double[][] data = {
    {1, 2, 3, 4},
    {5, 6, 7, 8},
};
DoubleStatistics.Builder builder = DoubleStatistics.builder(Statistic.VARIANCE);
DoubleStatistics stats = Arrays.stream(data)
    .map(builder::build)
    .reduce(DoubleStatistics::combine)
    .get();
stats.get(Statistic.VARIANCE);
// Get other statistics supported by the underlying computations
stats.get(Statistic.STANDARD_DEVIATION);
stats.get(Statistic.MEAN);
{code}

 * Arrays in parallel multiple times using a {{Collector}}
{code:java}
double[][] data = {
    {1, 2, 3, 4},
    {5, 6, 7, 8},
};
DoubleStatistics.Builder builder = DoubleStatistics.builder(
    Statistic.MIN, Statistic.MAX);
Collector<double[], DoubleStatistics, DoubleStatistics> collector =
    Collector.of(builder::build,
                 (s, d) -> s.combine(builder.build(d)),
                 DoubleStatistics::combine);
DoubleStatistics stats = Arrays.stream(data).collect(collector);
stats.get(Statistic.MIN);
stats.get(Statistic.MAX);

// Different data...
DoubleStatistics stats2 = Arrays.stream(data2).collect(collector);
{code}
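For comparison, the same reusable-collector pattern can be sketched today with the JDK's {{DoubleSummaryStatistics}}. This is only an analogy: the class name {{ReusableCollectorDemo}} is illustrative, and the extra lambdas are needed because the JDK class has no array factory and its {{combine}} returns void:

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.stream.Collector;

final class ReusableCollectorDemo {
    /** A collector over double[] chunks; it is stateless so it can be reused. */
    static final Collector<double[], DoubleSummaryStatistics, DoubleSummaryStatistics> COLLECTOR =
        Collector.of(
            DoubleSummaryStatistics::new,
            // Accumulate every value of one chunk into the running summary
            (s, d) -> {
                for (double v : d) {
                    s.accept(v);
                }
            },
            // Merge two partial summaries created by a parallel stream
            (a, b) -> {
                a.combine(b);
                return a;
            });

    static DoubleSummaryStatistics summarize(double[][] data) {
        return Arrays.stream(data).parallel().collect(COLLECTOR);
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2, 3, 4}, {5, 6, 7, 8}};
        double[][] data2 = {{-1, 0}, {10, 20, 30}};
        System.out.println(summarize(data));
        System.out.println(summarize(data2));
    }
}
```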

h2. API considerations to discuss
h3. How to choose stats

The API uses a varargs of Statistic parameter to define what to compute. This 
could also be provided by an EnumSet. However for a single usage this is more 
verbose:
{code:java}
DoubleStatistics stats = DoubleStatistics.of(Statistic.MIN, Statistic.MAX);
// Or
DoubleStatistics stats = DoubleStatistics.of(
    EnumSet.of(Statistic.MIN, Statistic.MAX));
{code}
The JDK API for variable arguments often uses varargs (e.g. nio LinkOption 
enum/OpenOption interface). Sometimes the use of an EnumSet is not possible as 
the arguments are an interface. The JDK stream/spliterator API uses integer bit 
flags; this functionality is provided by EnumSet for enums with fewer than 65 
values but use of int types is more explicit. Using bit flags removes the 
option to add additional behaviour to each stat constant. There is no current 
need for this, but it could provide behaviour such as specifying a default 
value for each stat:

 
{code:java}
public enum Statistic {
    MIN {
        @Override
        double defaultAsDouble() {
            // Identity for min: any observed value is <= +infinity
            return Double.POSITIVE_INFINITY;
        }
    },
    // ...

    double defaultAsDouble() {
        return Double.NaN;
    }

    int defaultAsInt() {
        return (int) defaultAsDouble();
    }
} {code}
To directly use bit flags would for example be implemented using:
{code:java}
public final class Statistic {
    public static final int MIN = 0x1;
    public static final int MAX = 0x2;
    // ...
}

public final class DoubleStatistics implements DoubleConsumer {
    // ...

    public static DoubleStatistics of(int);
    public static Builder builder(int);

    // ...
}

double[] data = {1, 2, 3, 4, 5, 6, 7, 8};
DoubleStatistics stats = DoubleStatistics.builder(
    Statistic.MIN | Statistic.MAX)
    .build(data);
stats.get(Statistic.MIN);
stats.get(Statistic.MAX);
{code}
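The three selection styles carry the same information. The following sketch (all names are illustrative; the four-constant enum and the use of the ordinal as the bit index are assumptions made for the example) converts varargs to an {{EnumSet}} and an {{EnumSet}} to int flags:

```java
import java.util.Arrays;
import java.util.EnumSet;

final class SelectionStyles {
    enum Statistic { MIN, MAX, MEAN, VARIANCE }

    // Bit flags matching the enum ordinals (an assumption of this sketch)
    static final int MIN = 0x1;
    static final int MAX = 0x2;
    static final int MEAN = 0x4;
    static final int VARIANCE = 0x8;

    /** Varargs to EnumSet: duplicates collapse; an empty selection is allowed. */
    static EnumSet<Statistic> toSet(Statistic... statistics) {
        EnumSet<Statistic> set = EnumSet.noneOf(Statistic.class);
        set.addAll(Arrays.asList(statistics));
        return set;
    }

    /** EnumSet to int flags using the ordinal as the bit index. */
    static int toFlags(EnumSet<Statistic> set) {
        int flags = 0;
        for (Statistic s : set) {
            flags |= 1 << s.ordinal();
        }
        return flags;
    }

    /** Tests a single flag. */
    static boolean isSelected(int flags, int flag) {
        return (flags & flag) != 0;
    }
}
```

Note that the int form accepts any int, including bits that match no statistic, so the implementation would have to validate the flags. The enum forms are checked by the compiler.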
h3. Optimising obtaining a stat

Currently the get(Statistic) method is implemented using a large switch 
statement that identifies the function used to compute the statistic and then 
computes it. We could provide a method to obtain the function itself. This 
would perform a one time look-up of the function. The stat can then be computed 
with minimum cost for each invocation of the supplier:
{code:java}
public DoubleSupplier getSupplier(Statistic statistic);

DoubleStatistics stats = DoubleStatistics.of(Statistic.VARIANCE);

DoubleSupplier fm = stats.getSupplier(Statistic.MEAN);
DoubleSupplier fv = stats.getSupplier(Statistic.VARIANCE);

// As data is added ...
double m = fm.getAsDouble();
double v = fv.getAsDouble();
{code}
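A toy sketch of the idea follows. The names are hypothetical and only three trivial statistics are tracked. The switch is evaluated once in {{getSupplier}}; the returned supplier reads the live state each time it is invoked:

```java
import java.util.function.DoubleConsumer;
import java.util.function.DoubleSupplier;

final class SupplierLookupSketch {
    enum Statistic { MIN, MAX, SUM }

    static final class Stats implements DoubleConsumer {
        private double min = Double.POSITIVE_INFINITY;
        private double max = Double.NEGATIVE_INFINITY;
        private double sum;

        @Override
        public void accept(double x) {
            min = Math.min(min, x);
            max = Math.max(max, x);
            sum += x;
        }

        /** One-time look-up; the supplier is bound to this instance. */
        DoubleSupplier getSupplier(Statistic statistic) {
            switch (statistic) {
            case MIN:
                return () -> min;
            case MAX:
                return () -> max;
            case SUM:
                return () -> sum;
            default:
                throw new IllegalArgumentException("Unsupported: " + statistic);
            }
        }

        /** Convenience method: pays the look-up cost on every call. */
        double get(Statistic statistic) {
            return getSupplier(statistic).getAsDouble();
        }
    }
}
```

The supplier can be obtained before any data is added and will reflect later updates.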
h3. Configuring stats

Currently there are some statistics that have only one of many possible 
implementations. For example the population or sample variance/standard 
deviation; the skewness and the kurtosis. The API should provide a method to 
configure the Statistic.

One option is to add methods to the DoubleStatistics class. The same class can 
then be used to compute different versions of the statistic. If the options are 
added to the builder then this could set the computation on construction. Since 
the examples noted above are a case of adjusting normalisation constants based 
on the sample size (not changing the underlying accruing computation) it makes 
sense to allow configuring the instance. If you configure the builder then all 
instances will be configured the same. This will be useful if each instance is 
to be used for reporting separately, e.g. one instance per thread reporting to 
a log file.

A possible API would be to add configuration options to the individual 
statistic implementations. The summary then accepts a function to configure the 
computation. The following demonstrates a generic usage:
{code:java}
public final class Variance implements DoubleStatistic, 
DoubleStatisticAccumulator<Variance> {
    // ...
    public void setBiased(boolean biased);
}

public final class DoubleStatistics implements DoubleConsumer {
    // ...
    public void addConfiguration(Consumer<DoubleStatistic> action);
}

DoubleStatistics varStats =
    DoubleStatistics.builder(Statistic.VARIANCE).build(data);
varStats.addConfiguration(s -> {
    if (s instanceof Variance) {
        ((Variance) s).setBiased(false);
    }
});
{code}
However the configuration options would be limited by the implementations 
provided in the library. An alternative would be a builder for configuration 
options that supports all the library options and a set of defaults:
{code:java}
public final class StatisticOptions {
    public static final class Builder {
        public Builder setBiased(boolean biased);
        public StatisticOptions build();
    }

    public static StatisticOptions defaults();

    public static Builder builder();
    public Builder toBuilder();

    public boolean isBiased();
}
{code}
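A minimal sketch of how such an immutable options object might be implemented and consumed is given below. The nesting, the unbiased default and the two-pass {{variance}} helper are assumptions made for illustration only. It shows that the option changes only the normalisation constant and not the accumulated sum of squared deviations:

```java
final class StatisticOptionsSketch {
    static final class Options {
        // Assumed default for this sketch: unbiased (sample) statistics
        private static final Options DEFAULTS = new Options(false);
        private final boolean biased;

        private Options(boolean biased) {
            this.biased = biased;
        }

        static Options defaults() {
            return DEFAULTS;
        }

        static Builder builder() {
            return new Builder();
        }

        /** Creates a builder initialised with the current options. */
        Builder toBuilder() {
            return new Builder().setBiased(biased);
        }

        boolean isBiased() {
            return biased;
        }

        static final class Builder {
            private boolean biased;

            Builder setBiased(boolean biased) {
                this.biased = biased;
                return this;
            }

            Options build() {
                return new Options(biased);
            }
        }
    }

    /** Two-pass variance; only the final divisor depends on the options. */
    static double variance(double[] values, Options options) {
        int n = values.length;
        double mean = 0;
        for (double v : values) {
            mean += v;
        }
        mean /= n;
        double ss = 0;
        for (double v : values) {
            ss += (v - mean) * (v - mean);
        }
        return ss / (options.isBiased() ? n : n - 1);
    }
}
```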
h3. Int and Long statistics

The JDK streams support int and long primitives with {{IntSummaryStatistics}} 
and {{{}LongSummaryStatistics{}}}. It is possible to compute the stats using 
double values. However some statistics are simpler to compute using the native 
primitive type (e.g. mean can be sum/n); and in the case of long may be 
inaccurate using a double due to limited precision (e.g. sum).

The above API can be replicated for int and long types. The statistics that 
have advantages using the native type can be implemented using a type specific 
implementation: e.g. IntMin; LongMin.
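
The precision issue for long can be shown with a small sketch (the class and method names are illustrative). A double has a 53-bit significand, so a sum above 2^53 cannot always be represented exactly, whereas a long accumulator is exact until it overflows:

```java
final class LongSumPrecision {
    /** Exact sum using the native type (overflow is not checked here). */
    static long sumAsLong(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Sum accumulated in a double; low-order bits can be lost. */
    static double sumAsDouble(long[] values) {
        double sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] values = {1L << 53, 1};
        System.out.println(sumAsLong(values));
        System.out.println((long) sumAsDouble(values));
    }
}
```

For the input {{{2^53, 1}}} the long sum is 2^53 + 1 but the double sum rounds back to 2^53.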

 

> Implementation of Univariate Statistics
> ---------------------------------------
>
>                 Key: STATISTICS-71
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-71
>             Project: Commons Statistics
>          Issue Type: Task
>          Components: descriptive
>            Reporter: Anirudh Joshi
>            Assignee: Anirudh Joshi
>            Priority: Minor
>              Labels: gsoc, gsoc2023
>
> Jira ticket to track the implementation of the Univariate statistics required 
> for the updated SummaryStatistics API. 
> The implementation would be "storeless". It should be used for calculating 
> statistics that can be computed in one pass through the data without storing 
> the sample values.
> Currently I have the definition of API as (this might evolve as I continue 
> working)
> {code:java}
> public interface DoubleStorelessUnivariateStatistic extends DoubleSupplier {
>     DoubleStorelessUnivariateStatistic add(double v);
>     long getCount();
>     void combine(DoubleStorelessUnivariateStatistic other);
> } {code}
>  



