[
https://issues.apache.org/jira/browse/STATISTICS-71?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794965#comment-17794965
]
Alex Herbert commented on STATISTICS-71:
----------------------------------------
I have added provisional support for computing a biased or bias-corrected
statistic. This applies to the moment-based statistics; see Wikipedia:
[Population and sample
variance|https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance]
[Sample Skewness|https://en.wikipedia.org/wiki/Skewness#Sample_skewness]
[Sample kurtosis|https://en.wikipedia.org/wiki/Kurtosis#Sample_kurtosis]
h2. Background
Libraries in other languages provide the following options:
||Library||Function||Options||
|Python numpy|std/var|Allows setting the delta degrees of freedom (ddof) used
to normalise the sum of squared deviations to the variance (N - ddof). Default
is 0.|
|Python scipy|skew|Allows choosing biased/unbiased; default bias=True.|
|Python scipy|kurtosis|Allows changing the computation between Pearson and
Fisher kurtosis. Allows choosing biased/unbiased;
default bias=True; fisher=True.|
|Matlab|std/var|Allows choosing biased/unbiased; default bias=False.|
| |skew|Allows choosing biased/unbiased; default bias=False.|
| |kurtosis|Allows choosing biased/unbiased; default bias=False. Computes the
Pearson definition.|
|R|var|Allows choosing biased/unbiased; default bias=False.|
|R moments|skew/kurtosis|Computes Pearson's skewness/kurtosis. No
biased/unbiased option.|
|Commons Math|std/var|Allows choosing biased/unbiased; default bias=False. Uses
a property setter method so the choice is mutable.|
| |Skewness|Computes the unbiased skewness.|
| |Kurtosis|Computes the unbiased Fisher kurtosis.|
The Python libraries default to the biased statistic. Matlab and R default to
unbiased. Only Python provides an option to change the degrees of freedom used
to normalise the sum of squared deviations for the variance. Other libraries
assume either 0 or 1. A scan through the source code of numpy and scipy does
not find a case where the value is anything other than 0 or 1. However, the
code allows a float, so the normalisation can be e.g. (N - 1.5). The use case
for this is unclear, although Wikipedia indicates it can be used to compute an
unbiased standard deviation (depending on the assumed underlying distribution,
e.g. see [Bessel's correction
(Wikipedia)|https://en.wikipedia.org/wiki/Bessel%27s_correction]).
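To make the ddof normalisation concrete, here is a minimal sketch in plain Java. The class and method names are hypothetical and are not part of Commons Statistics, numpy or any other library mentioned above; it only illustrates dividing the sum of squared deviations by (N - ddof).

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not an API of any library discussed here. */
final class DdofVariance {
    private DdofVariance() {}

    /**
     * Two-pass variance: sum of squared deviations from the mean,
     * divided by (n - ddof). ddof = 0 gives the biased (population)
     * variance; ddof = 1 gives the Bessel-corrected sample variance.
     */
    static double variance(double[] x, double ddof) {
        final int n = x.length;
        final double mean = Arrays.stream(x).average().orElse(Double.NaN);
        double ss = 0;
        for (final double v : x) {
            final double d = v - mean;
            ss += d * d;
        }
        return ss / (n - ddof);
    }

    public static void main(String[] args) {
        final double[] x = {1, 2, 3};
        System.out.println(variance(x, 0));   // biased
        System.out.println(variance(x, 1));   // bias corrected
        System.out.println(variance(x, 1.5)); // fractional ddof is allowed
    }
}
```

For the sample (1, 2, 3) the sum of squared deviations is 2, so ddof = 0 divides by 3 and ddof = 1 divides by 2.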
The kurtosis has two definitions: Fisher and Pearson. Scipy provides both
options; the Fisher result is created simply by subtracting 3 from the Pearson
statistic. Matlab returns the Pearson statistic. Adding support for this
difference does not seem essential.
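The relationship between the two definitions can be sketched as follows. This is an illustration of the biased (population) form only, using hypothetical names; it is not the code from scipy, Matlab or Commons Statistics.

```java
/** Hypothetical sketch of the biased Pearson and Fisher kurtosis. */
final class KurtosisSketch {
    private KurtosisSketch() {}

    /** Pearson kurtosis: fourth central moment over the squared second central moment. */
    static double pearson(double[] x) {
        final int n = x.length;
        double sum = 0;
        for (final double v : x) {
            sum += v;
        }
        final double mean = sum / n;
        double s2 = 0;
        double s4 = 0;
        for (final double v : x) {
            final double d2 = (v - mean) * (v - mean);
            s2 += d2;
            s4 += d2 * d2;
        }
        final double m2 = s2 / n;
        final double m4 = s4 / n;
        return m4 / (m2 * m2);
    }

    /** Fisher (excess) kurtosis: the Pearson value shifted so a normal distribution gives 0. */
    static double fisher(double[] x) {
        return pearson(x) - 3;
    }

    public static void main(String[] args) {
        final double[] x = {1, 2, 3, 4, 5};
        System.out.println(pearson(x));
        System.out.println(fisher(x));
    }
}
```

For the sample (1, 2, 3, 4, 5) the central moments are m2 = 2 and m4 = 6.8, giving a Pearson value of 1.7 and a Fisher value of -1.3.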
h2. Goals
Since the choice between the biased and unbiased formulas for these statistics
is applied in the final stage of the computation, the option can be controlled
dynamically on the same instance. This is the method employed in Commons Math
(CM) for Variance/Std. However, CM does not support the option for skewness or
kurtosis.
* Allow biased computation of std/var/skew/kurtosis
* Allow changing the statistic instance using a setter to modify the computed
result
* Allow changing the summary statistics using a setter to modify the computed
stats
* Allow repeat building the summary statistics using preconfigured options
* Allow the summary statistics to create immutable suppliers of the currently
configured statistic
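The point that the bias option only affects the final stage can be sketched with a storeless accumulator. The class below is a hypothetical illustration using Welford's updating algorithm; it is not the Commons Statistics implementation, and the return values chosen for zero or one observation are assumptions of this sketch.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical storeless variance; the bias flag is only read when the result is requested. */
final class VarianceSketch implements DoubleSupplier {
    private long n;
    private double mean;
    /** Running sum of squared deviations from the mean. */
    private double m2;
    private boolean biased;

    /** Welford update: no sample values are stored. */
    VarianceSketch add(double v) {
        n++;
        final double delta = v - mean;
        mean += delta / n;
        m2 += delta * (v - mean);
        return this;
    }

    /** Fluent setter; changes the normalisation, not the accumulated state. */
    VarianceSketch setBiased(boolean value) {
        biased = value;
        return this;
    }

    @Override
    public double getAsDouble() {
        if (n == 0) {
            return Double.NaN; // assumption: no data gives NaN
        }
        if (n == 1) {
            return 0; // assumption: a single value has zero variance
        }
        return biased ? m2 / n : m2 / (n - 1);
    }
}
```

Because the accumulated state is independent of the flag, the same instance can report either result, e.g. `new VarianceSketch().add(1).add(2).add(3)` returns 1.0 by default and 2/3 after `setBiased(true)`.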
h2. API
{code:java}
// Same for StandardDeviation/Skewness/Kurtosis
Variance {
    public Variance setBiased(boolean);
}

// Immutable config
public final class StatisticsConfiguration {
    public static StatisticsConfiguration withDefaults();
    public StatisticsConfiguration withBiased(boolean);
    public boolean isBiased();
}

DoubleStatistics {
    Builder {
        public Builder setConfiguration(StatisticsConfiguration);
    }
    public DoubleStatistics setConfiguration(StatisticsConfiguration);
}
{code}
h2. Examples
{code:java}
// 0.6666666
double v = Variance.of(1, 2, 3).setBiased(true).getAsDouble();

DoubleStatistics stats = DoubleStatistics.builder(Statistic.VARIANCE).build(1, 2, 3);
DoubleSupplier v1 = stats.getSupplier(Statistic.VARIANCE);
DoubleSupplier v2 = stats.setConfiguration(
        StatisticsConfiguration.withDefaults().withBiased(true))
    .getSupplier(Statistic.VARIANCE);

// Supplier functions are frozen to the configuration at their creation time
v1.getAsDouble(); // 1.0
v2.getAsDouble(); // 0.66666666
{code}
h2. Discussion
I added primitive setters for the individual statistics and a configuration
class for the DoubleStatistics aggregator. This avoids adding many properties
to the aggregator class if further options are supported in the future. These
can be collected into the configuration class, which may be shared in the
future with Int and Long specialisations of the same class. Adding primitive
setters for the stats ensures only those properties that affect the statistic
can be set. The alternative of setting a configuration would allow passing in
an object with irrelevant properties. The aggregator class must support all
potential properties across all statistics, so a single configuration class
encapsulates that.
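One way an immutable configuration and a supplier frozen at creation time could fit together is sketched below. All names here (ConfigSketch, AggregatorSketch, varianceSupplier) are hypothetical and only mirror the shape of the API proposed in this comment; the actual implementation may differ.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical immutable options object; "with" methods return a new instance. */
final class ConfigSketch {
    private static final ConfigSketch DEFAULTS = new ConfigSketch(false);
    private final boolean biased;

    private ConfigSketch(boolean biased) {
        this.biased = biased;
    }

    static ConfigSketch withDefaults() {
        return DEFAULTS;
    }

    ConfigSketch withBiased(boolean value) {
        return value == biased ? this : new ConfigSketch(value);
    }

    boolean isBiased() {
        return biased;
    }
}

/** Hypothetical aggregator holding the count and the sum of squared deviations. */
final class AggregatorSketch {
    private final long n;
    private final double sumSqDev;
    private ConfigSketch config = ConfigSketch.withDefaults();

    AggregatorSketch(double... values) {
        n = values.length;
        double sum = 0;
        for (final double v : values) {
            sum += v;
        }
        final double mean = sum / n;
        double ss = 0;
        for (final double v : values) {
            ss += (v - mean) * (v - mean);
        }
        sumSqDev = ss;
    }

    AggregatorSketch setConfiguration(ConfigSketch c) {
        config = c;
        return this;
    }

    /** The flag is copied when the supplier is created, so later configuration changes do not affect it. */
    DoubleSupplier varianceSupplier() {
        final boolean biased = config.isBiased();
        return () -> sumSqDev / (biased ? n : n - 1);
    }
}
```

In this sketch a supplier obtained before `setConfiguration` keeps returning the unbiased value, while one obtained afterwards uses the new option, which matches the v1/v2 behaviour in the Examples section.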
I did not add an option to choose between the Pearson and Fisher kurtosis. This
could be added in the future if required. The code currently computes the
Fisher version.
h2. Variants
* Change the setters to void return types. This is classic Java but prevents
fluent API usage (see examples).
* Change the statistics to accept the StatisticsConfiguration object.
* Remove the StatisticsConfiguration object (it only has one option at
present). Future options for statistics would require adding more property
setters to the summary statistics class (and builder).
Feedback welcome.
> Implementation of Univariate Statistics
> ---------------------------------------
>
> Key: STATISTICS-71
> URL: https://issues.apache.org/jira/browse/STATISTICS-71
> Project: Commons Statistics
> Issue Type: Task
> Components: descriptive
> Reporter: Anirudh Joshi
> Assignee: Anirudh Joshi
> Priority: Minor
> Labels: gsoc, gsoc2023
>
> Jira ticket to track the implementation of the Univariate statistics required
> for the updated SummaryStatistics API.
> The implementation would be "storeless". It should be used for calculating
> statistics that can be computed in one pass through the data without storing
> the sample values.
> Currently I have the definition of API as (this might evolve as I continue
> working)
> {code:java}
> public interface DoubleStorelessUnivariateStatistic extends DoubleSupplier {
> DoubleStorelessUnivariateStatistic add(double v);
> long getCount();
> void combine(DoubleStorelessUnivariateStatistic other);
> } {code}
>