[
https://issues.apache.org/jira/browse/STATISTICS-71?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788399#comment-17788399
]
Alex Herbert commented on STATISTICS-71:
----------------------------------------
I have created an initial implementation for a summary of composite statistics.
The implementation is described below. Please raise a discussion on any of the
points; e.g. changes to the API; functionality that is missing; functionality
that is not required.
h2. Background
The equivalent in the JDK is {{DoubleSummaryStatistics}} which provides the
following fixed API:
{code:java}
void accept(double);
void combine(DoubleSummaryStatistics);
long getCount();
double getSum();
double getMin();
double getMax();
double getAverage();
String toString(); // "DoubleSummaryStatistics{count=%d, sum=%f, min=%f,
                   //  average=%f, max=%f}"
{code}
Note that the void return type of the combine method prevents the use of
combine as a function reference in a reduce operation for a stream since that
requires a {{BinaryOperator}}:
{code:java}
interface BinaryOperator<T> extends BiFunction<T,T,T> {
T apply(T t, T u);
// ...
}
T reduce(T identity, BinaryOperator<T> accumulator);
Optional<T> reduce(BinaryOperator<T> accumulator);
// Requires
BinaryOperator<T> accumulator = (a, b) -> { a.combine(b); return a; };{code}
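As a concrete illustration of the point above, the following is a minimal sketch using only the JDK. The class name {{ReduceWithVoidCombine}} and its helper methods are hypothetical and exist only for this example; it shows the adapter lambda that the void {{combine}} forces on the caller.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

final class ReduceWithVoidCombine {
    /** Summarises a single array using the JDK class. */
    static DoubleSummaryStatistics summarise(double[] values) {
        return DoubleStream.of(values).summaryStatistics();
    }

    /** Merges the per-array summaries with a reduce operation. */
    static DoubleSummaryStatistics summariseAll(double[][] data) {
        return Arrays.stream(data)
            .map(ReduceWithVoidCombine::summarise)
            // DoubleSummaryStatistics::combine cannot be used here because it
            // returns void; the lambda adapts it to a BinaryOperator.
            .reduce((a, b) -> { a.combine(b); return a; })
            .orElseGet(DoubleSummaryStatistics::new);
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2, 3, 4}, {5, 6, 7, 8}};
        DoubleSummaryStatistics s = summariseAll(data);
        System.out.println(s.getCount() + " " + s.getMin() + " " + s.getMax());
    }
}
```

A {{combine}} method that returns the updated instance removes the need for this adapter.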
The equivalent in Commons Math is an implementation of the
{{StatisticalSummary}} interface:
{code:java}
public interface StatisticalSummary {
double getMean();
double getVariance();
double getStandardDeviation();
double getMax();
double getMin();
long getN();
double getSum();
}
{code}
This is implemented by {{SummaryStatistics}} and {{DescriptiveStatistics}}.
However these classes also add some of the other statistic implementations
within the codebase (Skewness, Kurtosis, GeometricMean, etc). The additional
implementations are not the complete set of available stats: for example
there is no Product, and both PopulationVariance and Variance are provided
but only one StandardDeviation (no population StandardDeviation). The
{{SummaryStatistics}} provides only {{StorelessUnivariateStatistic}}
implementations. The {{DescriptiveStatistics}} adds to this by providing the
percentile by using an expandable store of the observed values. Each summary
class allows the underlying implementation for each statistic to be set or
obtained using setters and getters. The two summary classes are not related
(i.e. {{DescriptiveStatistics}} does not extend {{SummaryStatistics}}).
This API does not provide a method to choose the statistics computed. The only
workaround is to supply a NoOperation implementation for each statistic that is
not required; this cannot be done with a simple method reference as the
{{StorelessUnivariateStatistic}} is not a {{FunctionalInterface}}.
There is no NoOperation {{StorelessUnivariateStatistic}} implementation
provided by the CM code so a user would have to create one and pass it in as
the implementation for all unwanted stats.
There is no way to combine {{SummaryStatistics}} or
{{DescriptiveStatistics}}. It is possible to copy them.
h2. New API
For the composite statistics the goals are:
* Provide an API that allows choosing which statistics to compute.
* Allow combination of statistics that have been computed in parallel.
This is the working API:
{code:java}
public enum Statistic {
MIN,
MAX,
MEAN,
STANDARD_DEVIATION,
VARIANCE,
SKEWNESS,
KURTOSIS,
PRODUCT,
SUM,
SUM_OF_LOGS,
SUM_OF_SQUARES,
GEOMETRIC_MEAN;
}
public final class DoubleStatistics implements DoubleConsumer {
public final class Builder {
public DoubleStatistics build();
public DoubleStatistics build(double...);
}
public static DoubleStatistics of(Statistic...);
public static Builder builder(Statistic...);
public void accept(double);
public long getCount();
public boolean isSupported(Statistic);
public double get(Statistic);
public DoubleStatistics combine(DoubleStatistics);
}
{code}
h3. Notes
# The builder is required to allow the set-up of the underlying computation to
be a one time cost. The builder is then used as a factory to build instances of
the DoubleStatistics. These instances will be compatible with each other when
used with the combine method.
# An empty instance to compute a selected set of statistics is created using
the {{of}} method. This is a simple one line implementation provided for
convenience. However unlike the {{of}} method for individual statistics it does
not accept a {{double[]}} and the instance is initially empty.
{code:java}
// One-liner could be removed ...
public static DoubleStatistics of(Statistic... statistics) {
return builder(statistics).build();
}
// Could be added:
public static DoubleStatistics of(double[] values, Statistic... statistics) {
return builder(statistics).build(values);
}{code}
# The get(Statistic) method will throw an exception (IAE) if the statistic is
not supported. This is not user friendly so the API provides a method to
indicate which statistics are being computed: isSupported(Statistic) which will
not throw exceptions.
# The combine method requires that all the computations in the object are also
provided by the other DoubleStatistics, i.e. the combine method requires they
are _compatible_. Only the left-hand side is updated by combine so
effectively this means that the object must be able to copy what it requires
from the other object. For example it is possible to combine one way and not
the other:
{code:java}
DoubleStatistics varStats =
DoubleStatistics.builder(Statistic.VARIANCE).build(data);
DoubleStatistics meanStats =
DoubleStatistics.builder(Statistic.MEAN).build(data);
// Throws
varStats.combine(meanStats);
// Allowed: mean is computed during the computation of variance
meanStats.combine(varStats);
{code}
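One way to picture the compatibility rule in note 4 is as a subset test on the underlying computations. The sketch below is only a model of that idea: the {{Computation}} enum, the {{required}} mapping and the {{canCombine}} method are hypothetical names and are not taken from the implementation.

```java
import java.util.EnumSet;
import java.util.Set;

final class CombineCompatibility {
    /** Hypothetical underlying computations; not the library's internal names. */
    enum Computation { FIRST_MOMENT, SUM_OF_SQUARED_DEVIATIONS }

    /** Hypothetical mapping from a requested statistic to what must be computed. */
    static Set<Computation> required(String statistic) {
        switch (statistic) {
        case "MEAN":
            return EnumSet.of(Computation.FIRST_MOMENT);
        case "VARIANCE":
            // The variance computation also tracks the mean
            return EnumSet.of(Computation.FIRST_MOMENT,
                              Computation.SUM_OF_SQUARED_DEVIATIONS);
        default:
            throw new IllegalArgumentException("Unsupported: " + statistic);
        }
    }

    /** The left-hand side can combine if the other side computes all it needs. */
    static boolean canCombine(Set<Computation> left, Set<Computation> right) {
        return right.containsAll(left);
    }

    public static void main(String[] args) {
        Set<Computation> mean = required("MEAN");
        Set<Computation> variance = required("VARIANCE");
        System.out.println("mean.combine(variance) allowed: " + canCombine(mean, variance));
        System.out.println("variance.combine(mean) allowed: " + canCombine(variance, mean));
    }
}
```

Under this model the asymmetry is just set containment: the mean needs a subset of what the variance computes, but not the reverse.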
h3. Example usage
* A single array
{code:java}
double[] data = {1, 2, 3, 4, 5, 6, 7, 8};
DoubleStatistics stats = DoubleStatistics.builder(
Statistic.MIN, Statistic.MAX)
.build(data);
stats.get(Statistic.MIN);
stats.get(Statistic.MAX);
{code}
* Arrays in parallel; provide all available statistics by the underlying
computation
{code:java}
double[][] data = {
{1, 2, 3, 4},
{5, 6, 7, 8},
};
DoubleStatistics.Builder builder = DoubleStatistics.builder(Statistic.VARIANCE);
DoubleStatistics stats = Arrays.stream(data)
.map(builder::build)
.reduce(DoubleStatistics::combine)
.get();
stats.get(Statistic.VARIANCE);
// Get other statistics supported by the underlying computations
stats.get(Statistic.STANDARD_DEVIATION);
stats.get(Statistic.MEAN);
{code}
* Arrays in parallel multiple times using a {{Collector}}
{code:java}
double[][] data = {
{1, 2, 3, 4},
{5, 6, 7, 8},
};
DoubleStatistics.Builder builder = DoubleStatistics.builder(
Statistic.MIN, Statistic.MAX);
Collector<double[], DoubleStatistics, DoubleStatistics> collector =
Collector.of(builder::build, (s, d) -> s.combine(builder.build(d)),
DoubleStatistics::combine);
DoubleStatistics stats = Arrays.stream(data).collect(collector);
stats.get(Statistic.MIN);
stats.get(Statistic.MAX);
// Different data...
DoubleStatistics stats2 = Arrays.stream(data2).collect(collector);
{code}
h2. API considerations to discuss
h3. How to choose stats
The API uses a varargs of Statistic parameter to define what to compute. This
could also be provided by an EnumSet. However for a single usage this is more
verbose:
{code:java}
DoubleStatistics stats = DoubleStatistics.of(Statistic.MIN, Statistic.MAX);
// Or
DoubleStatistics stats = DoubleStatistics.of(
EnumSet.of(Statistic.MIN, Statistic.MAX));
{code}
The JDK API for variable arguments often uses varargs (e.g. nio LinkOption
enum/OpenOption interface). Sometimes the use of an EnumSet is not possible as
the arguments are an interface. The JDK stream/spliterator API uses integer bit
flags; this functionality is provided by EnumSet for enums with fewer than 65
values but use of int types is more explicit. Using bit flags removes the
option to add additional behaviour to each stat constant. There is no current
need for this but an enum could provide behaviour such as specifying the
default values for all stats:
{code:java}
public enum Statistic {
MIN {
@Override
double defaultAsDouble() {
return Double.POSITIVE_INFINITY;
}
},
// ...
;
double defaultAsDouble() {
return Double.NaN;
}
int defaultAsInt() {
return (int) defaultAsDouble();
}
} {code}
Using bit flags directly could for example be implemented as:
{code:java}
public final class Statistic {
public static final int MIN = 0x1;
public static final int MAX = 0x2;
// ...
}
public final class DoubleStatistics implements DoubleConsumer {
// ...
public static DoubleStatistics of(int);
public static Builder builder(int);
// ...
}
double[] data = {1, 2, 3, 4, 5, 6, 7, 8};
DoubleStatistics stats = DoubleStatistics.builder(
Statistic.MIN | Statistic.MAX)
.build(data);
stats.get(Statistic.MIN);
stats.get(Statistic.MAX);
{code}
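To make the comparison between the two options concrete, here is a small sketch showing that an {{EnumSet}} and an int bit mask carry the same information for a small enum. The {{Stat}} enum is a cut-down stand-in for {{Statistic}}, and the flag values and helper methods are assumptions made for the example only.

```java
import java.util.EnumSet;

final class FlagsVersusEnumSet {
    /** Cut-down stand-in for the Statistic enum. */
    enum Stat { MIN, MAX, MEAN }

    // Hypothetical bit flags, one bit per statistic in ordinal order
    static final int MIN = 0x1;
    static final int MAX = 0x2;
    static final int MEAN = 0x4;

    /** Converts a set of enum constants to the equivalent bit mask. */
    static int toFlags(EnumSet<Stat> set) {
        int flags = 0;
        for (Stat s : set) {
            flags |= 1 << s.ordinal();
        }
        return flags;
    }

    /** Tests if a single flag is present in the mask. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }

    public static void main(String[] args) {
        int flags = toFlags(EnumSet.of(Stat.MIN, Stat.MAX));
        System.out.println(flags == (MIN | MAX));
        System.out.println(isSet(flags, MEAN));
    }
}
```

The enum form keeps type safety and allows per-constant behaviour; the int form accepts any integer, including bits that correspond to no statistic.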
h3. Optimising obtaining a stat
Currently the get(Statistic) method is implemented using a large switch
statement that identifies the function used to compute the statistic and then
computes it. We could provide a method to obtain the function itself. This
would perform a one time look-up of the function. The stat can then be computed
with minimum cost for each invocation of the supplier:
{code:java}
public DoubleSupplier getSupplier(Statistic statistic);
DoubleStatistics stats = DoubleStatistics.of(Statistic.VARIANCE);
DoubleSupplier fm = stats.getSupplier(Statistic.MEAN);
DoubleSupplier fv = stats.getSupplier(Statistic.VARIANCE);
// As data is added ...
double m = fm.getAsDouble();
double v = fv.getAsDouble();
{code}
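The one-time look-up can be sketched with a toy accumulator. This is not the proposed implementation: the {{SupplierLookup}} class and its two-value {{Stat}} enum are hypothetical and only demonstrate that the switch is evaluated once while the returned supplier reads the live state.

```java
import java.util.function.DoubleConsumer;
import java.util.function.DoubleSupplier;

final class SupplierLookup implements DoubleConsumer {
    /** Cut-down stand-in for the Statistic enum. */
    enum Stat { SUM, MEAN }

    private long n;
    private double sum;

    @Override
    public void accept(double value) {
        n++;
        sum += value;
    }

    /** Resolves the statistic once; the supplier reads the current state on each call. */
    DoubleSupplier getSupplier(Stat statistic) {
        switch (statistic) {
        case SUM:
            return () -> sum;
        case MEAN:
            return () -> sum / n;
        default:
            throw new IllegalArgumentException("Unsupported: " + statistic);
        }
    }

    public static void main(String[] args) {
        SupplierLookup stats = new SupplierLookup();
        // Look-up happens before any data is added
        DoubleSupplier mean = stats.getSupplier(Stat.MEAN);
        stats.accept(1);
        stats.accept(2);
        stats.accept(3);
        System.out.println(mean.getAsDouble());
    }
}
```

The supplier stays valid as more data is added, so a caller that reports a statistic repeatedly pays the look-up cost only once.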
h3. Configuring stats
Currently there are some statistics that have only one of many possible
implementations. For example the population or sample variance/standard
deviation; the skewness and the kurtosis. The API should provide a method to
configure the Statistic.
One option is to add methods to the DoubleStatistics class. The same class can
then be used to compute different versions of the statistic. If the options are
added to the builder then this could set the computation on construction. Since
the examples noted above are a case of adjusting normalisation constants based
on the sample size (not changing the underlying accruing computation) it makes
sense to allow configuring the instance. If you configure the builder then all
instances will be configured the same. This will be useful if each instance is
to be used for reporting separately, e.g. one instance per thread reporting to
a log file.
A possible API would be to add configuration options to the individual
statistic implementations. The summary then accepts a function to configure the
computation. The following demonstrates a generic usage:
{code:java}
public final class Variance implements DoubleStatistic,
DoubleStatisticAccumulator<Variance> {
// ...
public void setBiased(boolean biased);
}
public final class DoubleStatistics implements DoubleConsumer {
// ...
public void addConfiguration(Consumer<DoubleStatistic> action);
}
DoubleStatistics varStats =
DoubleStatistics.builder(Statistic.VARIANCE).build(data);
varStats.addConfiguration(s -> {
if (s instanceof Variance) {
((Variance) s).setBiased(false);
}
});
{code}
However the configuration options would be limited by the implementations
provided in the library. An alternative would be a builder for configuration
options that supports all the library options and a set of defaults:
{code:java}
public final class StatisticOptions {
public static final class Builder {
public Builder setBiased(boolean biased);
public StatisticOptions build();
}
public static StatisticOptions defaults();
public static Builder builder();
public Builder toBuilder();
public boolean isBiased();
}
{code}
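For discussion, the options class outlined above could be fleshed out roughly as follows. This is a sketch under assumptions: the default of {{biased = false}} and the {{variance}} helper (which applies the option to a sum of squared deviations) are hypothetical and only show how such an option changes the normalisation and not the accrued computation.

```java
final class StatisticOptions {
    // Assumed default: unbiased (sample) statistics
    private static final StatisticOptions DEFAULTS = new StatisticOptions(false);

    private final boolean biased;

    private StatisticOptions(boolean biased) {
        this.biased = biased;
    }

    static final class Builder {
        private boolean biased;

        private Builder(boolean biased) {
            this.biased = biased;
        }

        Builder setBiased(boolean biased) {
            this.biased = biased;
            return this;
        }

        StatisticOptions build() {
            return new StatisticOptions(biased);
        }
    }

    static StatisticOptions defaults() {
        return DEFAULTS;
    }

    static Builder builder() {
        return new Builder(DEFAULTS.biased);
    }

    Builder toBuilder() {
        return new Builder(biased);
    }

    boolean isBiased() {
        return biased;
    }

    /** Hypothetical helper: normalises a sum of squared deviations using the option. */
    static double variance(double sumSquaredDeviations, long n, StatisticOptions options) {
        return sumSquaredDeviations / (options.isBiased() ? n : n - 1);
    }

    public static void main(String[] args) {
        StatisticOptions population = StatisticOptions.builder().setBiased(true).build();
        System.out.println(variance(10, 5, population));
        System.out.println(variance(10, 5, StatisticOptions.defaults()));
    }
}
```

Since the options object is immutable, the same instance can be shared by a builder and by every instance it creates.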
h3. Int and Long statistics
The JDK streams support int and long primitives with {{IntSummaryStatistics}}
and {{LongSummaryStatistics}}. It is possible to compute the stats using
double values. However some statistics are simpler to compute using the native
primitive type (e.g. mean can be sum/n); and in the case of long may be
inaccurate using a double due to limited precision (e.g. sum).
The above API can be replicated for int and long types. The statistics that
have advantages using the native type can be implemented using a type specific
implementation: e.g. IntMin; LongMin.
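The precision issue for a long sum can be shown with a small example. The class and method names below are hypothetical; the sketch only contrasts accumulating the same values in a {{double}} and in a {{long}}.

```java
final class LongSumPrecision {
    /** Accumulates the values in a double; each addition is rounded to 53 bits. */
    static double sumAsDouble(long[] values) {
        double sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Accumulates the values in a long; exact unless the sum overflows. */
    static long sumAsLong(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // 2^53 is the point above which a double cannot represent every integer
        long[] values = {1L << 53, 1, 1};
        System.out.println(sumAsLong(values));          // 9007199254740994
        System.out.println((long) sumAsDouble(values)); // 9007199254740992
    }
}
```

Each addition of 1 to 2^53 in a {{double}} is rounded back to 2^53, so the two small values are lost entirely. A native long sum avoids this, though an implementation would still have to consider overflow.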
> Implementation of Univariate Statistics
> ---------------------------------------
>
> Key: STATISTICS-71
> URL: https://issues.apache.org/jira/browse/STATISTICS-71
> Project: Commons Statistics
> Issue Type: Task
> Components: descriptive
> Reporter: Anirudh Joshi
> Assignee: Anirudh Joshi
> Priority: Minor
> Labels: gsoc, gsoc2023
>
> Jira ticket to track the implementation of the Univariate statistics required
> for the updated SummaryStatistics API.
> The implementation would be "storeless". It should be used for calculating
> statistics that can be computed in one pass through the data without storing
> the sample values.
> Currently I have the definition of API as (this might evolve as I continue
> working)
> {code:java}
> public interface DoubleStorelessUnivariateStatistic extends DoubleSupplier {
> DoubleStorelessUnivariateStatistic add(double v);
> long getCount();
> void combine(DoubleStorelessUnivariateStatistic other);
> } {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)