[ 
https://issues.apache.org/jira/browse/STATISTICS-71?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794965#comment-17794965
 ] 

Alex Herbert commented on STATISTICS-71:
----------------------------------------

I have added provisional support for computing a biased or bias-corrected 
statistic. This applies to the moment-based statistics; see Wikipedia:

[Population and sample 
variance|https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance]
[Sample Skewness|https://en.wikipedia.org/wiki/Skewness#Sample_skewness]
[Sample kurtosis|https://en.wikipedia.org/wiki/Kurtosis#Sample_kurtosis]
h2. Background

In other language libraries the following options are found:
||Library||Function||Options||
|Python numpy|std/var|Allows setting the delta degrees of freedom (ddof) used 
to normalise the sum of squared deviations to the variance (N - ddof). Default 
is 0.|
|Python scipy|skew|Allows choosing biased/unbiased; default bias=True.|
|Python scipy|kurtosis|Allows changing the computation between Pearson and 
Fisher kurtosis. Allows choosing biased/unbiased;
default bias=True; fisher=True.|
|Matlab|std/var|Allows choosing biased/unbiased; default bias=False.|
| |skew|Allows choosing biased/unbiased; default bias=False.|
| |kurtosis|Allows choosing biased/unbiased; default bias=False. Computes the 
Pearson definition.|
|R|var|Allows choosing biased/unbiased; default bias=False.|
|R moments|skew/kurtosis|Computes Pearson's skewness/kurtosis. No 
biased/unbiased option.|
|Commons Math|std/var|Allows choosing biased/unbiased; default bias=False. Uses 
a property setter method so the choice is mutable.|
| |Skewness|Computes the unbiased skewness.|
| |Kurtosis|Computes the unbiased Fisher kurtosis.|

The Python libraries default to the biased statistic. Matlab and R default to 
unbiased. Only Python provides an option to change the degrees of freedom used 
to normalise the sum of squared deviations for the variance. Other libraries 
assume either 0 or 1. A scan through the source code of numpy and scipy does 
not find a case where the value is anything other than 0 or 1. However, the 
code allows a float, so the normalisation can be e.g. (N - 1.5). The use case 
for this is unclear, although Wikipedia indicates it can be used to compute an 
unbiased standard deviation (depending on the assumed underlying distribution, 
e.g. see [Bessel's correction 
(Wikipedia)|https://en.wikipedia.org/wiki/Bessel%27s_correction]).
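
For illustration, the degrees-of-freedom normalisation described above can be sketched as follows. This is a minimal two-pass sketch; the class and method names ({{DdofVariance}}, {{variance}}) are hypothetical and are not part of numpy or Commons Statistics.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only. */
final class DdofVariance {
    private DdofVariance() {}

    /**
     * Two-pass variance: the sum of squared deviations from the mean
     * divided by (n - ddof). Using ddof = 0 gives the biased statistic;
     * ddof = 1 gives the bias-corrected statistic.
     */
    static double variance(double[] x, double ddof) {
        final int n = x.length;
        final double mean = Arrays.stream(x).sum() / n;
        double ss = 0;
        for (final double v : x) {
            final double d = v - mean;
            ss += d * d;
        }
        return ss / (n - ddof);
    }

    public static void main(String[] args) {
        final double[] data = {1, 2, 3};
        // The sum of squared deviations is 2 for this data
        System.out.println(variance(data, 0));   // 2 / 3
        System.out.println(variance(data, 1));   // 2 / 2
        System.out.println(variance(data, 1.5)); // 2 / 1.5
    }
}
```

A boolean biased flag corresponds to restricting ddof to 0 (biased) or 1 (unbiased).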

The kurtosis has two definitions: Fisher and Pearson. Scipy provides both 
options; the Fisher result is obtained simply by subtracting 3 from the Pearson 
statistic, so that a normal distribution has a kurtosis of 0. Matlab returns 
the Pearson statistic. Adding support for this difference does not seem 
essential.
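
The relationship between the two definitions can be sketched as below. This is an illustrative sketch of the biased (population) form only; {{KurtosisSketch}} and its methods are hypothetical names and not an existing API.

```java
/** Hypothetical helper for illustration only. */
final class KurtosisSketch {
    private KurtosisSketch() {}

    /** Biased Pearson kurtosis: m4 / m2^2, with central moments normalised by n. */
    static double pearson(double[] x) {
        final int n = x.length;
        double sum = 0;
        for (final double v : x) {
            sum += v;
        }
        final double mean = sum / n;
        double m2 = 0;
        double m4 = 0;
        for (final double v : x) {
            final double d2 = (v - mean) * (v - mean);
            m2 += d2;
            m4 += d2 * d2;
        }
        m2 /= n;
        m4 /= n;
        return m4 / (m2 * m2);
    }

    /** Fisher (excess) kurtosis: the Pearson kurtosis minus 3. */
    static double fisher(double[] x) {
        return pearson(x) - 3;
    }

    public static void main(String[] args) {
        final double[] data = {1, 2, 3, 4, 5};
        // m2 = 2 and m4 = 6.8, so Pearson is approximately 1.7 and Fisher -1.3
        System.out.println(pearson(data));
        System.out.println(fisher(data));
    }
}
```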
h2. Goals

Since the computation of the biased or unbiased formulas for these statistics 
is applied in the final stage, the option can be controlled dynamically on the 
same instance. This is the method employed in Commons Math for Variance/Std. 
However, Commons Math does not support the option for skewness or kurtosis.
 * Allow biased computation of std/var/skew/kurtosis
 * Allow changing the statistic instance using a setter to modify the computed 
result
 * Allow changing the summary statistics using a setter to modify the computed 
stats
 * Allow repeated building of the summary statistics using preconfigured options
 * Allow the summary statistics to create immutable suppliers of the currently 
configured statistic
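
To illustrate why the option can be changed on the same instance, here is a minimal sketch of a storeless variance where the bias choice only affects the final division. It uses Welford's online update as a stand-in; the actual Commons Statistics implementation may differ, and {{SketchVariance}} is a hypothetical name.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical storeless variance for illustration only. */
final class SketchVariance implements DoubleSupplier {
    private long n;
    private double mean;
    /** Running sum of squared deviations from the mean. */
    private double m2;
    private boolean biased;

    /** Welford's update: the accumulated state does not depend on the bias option. */
    SketchVariance add(double x) {
        n++;
        final double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        return this;
    }

    /** Fluent setter; only changes how the final result is normalised. */
    SketchVariance setBiased(boolean v) {
        biased = v;
        return this;
    }

    @Override
    public double getAsDouble() {
        if (n == 0) {
            return Double.NaN;
        }
        // The bias option is applied here, in the final stage.
        // Choice made for this sketch: a single value with the
        // unbiased option evaluates 0 / 0 and returns NaN.
        final double divisor = biased ? n : n - 1.0;
        return m2 / divisor;
    }

    public static void main(String[] args) {
        final SketchVariance v = new SketchVariance().add(1).add(2).add(3);
        System.out.println(v.getAsDouble());                 // unbiased: 2 / 2
        System.out.println(v.setBiased(true).getAsDouble()); // biased: 2 / 3
    }
}
```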

h2. API
{code:java}
// API outline only (not compilable): bodies and unrelated members are omitted.

// Same for StandardDeviation/Skewness/Kurtosis
Variance {
    public Variance setBiased(boolean v);
}

// Immutable config
public final class StatisticsConfiguration {
    public static StatisticsConfiguration withDefaults();
    public StatisticsConfiguration withBiased(boolean v);
    public boolean isBiased();
}

DoubleStatistics {
    Builder {
        public Builder setConfiguration(StatisticsConfiguration v);
    }
    public DoubleStatistics setConfiguration(StatisticsConfiguration v);
}
{code}
h2. Examples
{code:java}
// Biased variance: 2 / 3 = 0.6666666
double v = Variance.of(1, 2, 3).setBiased(true).getAsDouble();

DoubleStatistics stats = DoubleStatistics.builder(Statistic.VARIANCE).build(1, 2, 3);
// Created with the default (unbiased) configuration
DoubleSupplier v1 = stats.getSupplier(Statistic.VARIANCE);
// Created after switching the aggregator to the biased configuration
DoubleSupplier v2 = stats.setConfiguration(
        StatisticsConfiguration.withDefaults().withBiased(true))
    .getSupplier(Statistic.VARIANCE);

// Supplier functions are frozen to the configuration at their creation time
v1.getAsDouble(); // 1.0
v2.getAsDouble(); // 0.66666666
{code}
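
The frozen supplier behaviour can be sketched with stand-in classes. This is not the proposed implementation: {{SketchConfig}} and {{SketchStats}} are hypothetical names, and the sketch assumes that only the configuration is captured when the supplier is created.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical immutable configuration; withBiased returns a new instance. */
final class SketchConfig {
    private static final SketchConfig DEFAULTS = new SketchConfig(false);
    private final boolean biased;

    private SketchConfig(boolean biased) {
        this.biased = biased;
    }

    static SketchConfig withDefaults() {
        return DEFAULTS;
    }

    SketchConfig withBiased(boolean v) {
        return v == biased ? this : new SketchConfig(v);
    }

    boolean isBiased() {
        return biased;
    }
}

/** Hypothetical aggregator that only supports the variance. */
final class SketchStats {
    private final long n;
    private final double sumSq;
    private SketchConfig config = SketchConfig.withDefaults();

    SketchStats(double... values) {
        double sum = 0;
        for (final double v : values) {
            sum += v;
        }
        final double mean = sum / values.length;
        double ss = 0;
        for (final double v : values) {
            ss += (v - mean) * (v - mean);
        }
        n = values.length;
        sumSq = ss;
    }

    SketchStats setConfiguration(SketchConfig c) {
        config = c;
        return this;
    }

    DoubleSupplier getVarianceSupplier() {
        // Copy the option now so that later configuration changes
        // do not affect suppliers that were already created.
        final boolean biased = config.isBiased();
        return () -> sumSq / (biased ? n : n - 1);
    }
}
```

Because the configuration object is immutable, the same instance can be shared between aggregators without defensive copies.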
h2. Discussion

I added primitive setters for the individual statistics and a configuration 
class for the DoubleStatistics aggregator. This avoids adding many properties 
to the aggregator class if further options are supported in the future. These 
can be collected into the configuration class, which may later be shared with 
Int and Long specialisations of the same class. Adding primitive setters for 
the stats ensures only those properties that affect the statistic can be set. 
The alternative, setting a configuration object on each statistic, would allow 
passing in an object with irrelevant properties. The aggregator class must 
support all potential properties across all statistics, so a single 
configuration class encapsulates that.

I did not add an option to choose between the Pearson and Fisher kurtosis. 
This could be added in the future if required. The code currently computes the 
Fisher version.
h2. Variants
 * Change the setters to void return types. This is classic Java but prevents 
fluent API usage (see examples).
 * Change the statistics to accept the StatisticsConfiguration object.
 * Remove the StatisticsConfiguration object (it only has one option at 
present). Future options for statistics would require adding more property 
setters to the summary statistics class (and builder).

Feedback welcome.

 

 

> Implementation of Univariate Statistics
> ---------------------------------------
>
>                 Key: STATISTICS-71
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-71
>             Project: Commons Statistics
>          Issue Type: Task
>          Components: descriptive
>            Reporter: Anirudh Joshi
>            Assignee: Anirudh Joshi
>            Priority: Minor
>              Labels: gsoc, gsoc2023
>
> Jira ticket to track the implementation of the Univariate statistics required 
> for the updated SummaryStatistics API. 
> The implementation would be "storeless". It should be used for calculating 
> statistics that can be computed in one pass through the data without storing 
> the sample values.
> Currently I have the definition of API as (this might evolve as I continue 
> working)
> {code:java}
> public interface DoubleStorelessUnivariateStatistic extends DoubleSupplier {
>     DoubleStorelessUnivariateStatistic add(double v);
>     long getCount();
>     void combine(DoubleStorelessUnivariateStatistic other);
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
