[
https://issues.apache.org/jira/browse/STATISTICS-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Herbert resolved STATISTICS-85.
------------------------------------
Fix Version/s: 1.1
Resolution: Implemented
Add in commit:
2e3303f637dc3a94a41facd42f6b2baa0a9b1c5b
> Quantile implementation
> -----------------------
>
> Key: STATISTICS-85
> URL: https://issues.apache.org/jira/browse/STATISTICS-85
> Project: Commons Statistics
> Issue Type: New Feature
> Components: descriptive
> Reporter: Alex Herbert
> Priority: Major
> Fix For: 1.1
>
> Attachments: 100QuantilesRandomDataLength10000.png,
> MedianRandomData.png
>
>
> Add a quantile implementation. This will interpolate the value of a sorted
> array of data for probability p in [0, 1].
> Replace the legacy API from Commons Math Percentile with an updated API. The
> new API should:
> * Decouple estimation of quantile positions inside data of length n; and the
> selection of correctly-ordered indices in array data.
> * Support multiple data types.
> * Support pre-sorted data.
> * Avoid performance issues observed in the CM Percentile implementation.
> h2. Proposed API
> {code:java}
> org.apache.commons.statistics.descriptive
> public final class Quantile {
> // overwrite=true; EstimationMethod.HF8; NaNPolicy.ERROR
> public static Quantile withDefaults();
> public Quantile withOverwrite(boolean);
> public Quantile with(EstimationMethod);
> // Could support NaN handling ... see below for NaNPolicy
> public Quantile with(NaNPolicy);
> // Create n uniform probabilities in range [p1, p2]
> public static double[] probabilities(int n);
> public static double[] probabilities(int n, double p1, double p2);
> // Quantiles on sorted data a of size n
> public double evaluate(int n, java.util.function.IntToDoubleFunction a,
> double p);
> public double[] evaluate(int n, java.util.function.IntToDoubleFunction a,
> double... p);
> // Quantiles on the primitive types that cannot be easily sorted
> public double evaluate(double[] a, double p);
> public double[] evaluate(double[] a, double... p);
> public double evaluate(int[] a, double p);
> public double[] evaluate(int[] a, double... p);
> public double evaluate(long[] a, double p);
> public double[] evaluate(long[] a, double... p);
> public double evaluate(float[] a, double p);
> public double[] evaluate(float[] a, double... p);
> // Provide the 9 methods in Hyndman and Fan (1996)
> // Sample Quantiles in Statistical Packages.
> // The American Statistician, 50, 361-365.
> public abstract class Quantile$EstimationMethod extends
> java.lang.Enum<Quantile$EstimationMethod> {
> public static final Quantile$EstimationMethod HF1;
> public static final Quantile$EstimationMethod HF2;
> public static final Quantile$EstimationMethod HF3;
> public static final Quantile$EstimationMethod HF4;
> public static final Quantile$EstimationMethod HF5;
> public static final Quantile$EstimationMethod HF6;
> public static final Quantile$EstimationMethod HF7;
> public static final Quantile$EstimationMethod HF8;
> public static final Quantile$EstimationMethod HF9;
> }
> }
> Note: The CM API used the 9 methods from Hyndman and Fann but labelled them
> as R1-9; this may be derived from the same names used in the R language. I
> propose to rename as HF1-9 to reflect the origin.
> {code}
> h2. NaNPolicy
> There are multiple options here. For reference R and Python's numpy only
> provide the option to exclude NaN:
> * R: quantile errors if NaN is present. median returns NaN. They is an
> option to exclude NaN.
> * numpy: two methods are provided: median/nanmedian + quantile/nanquantile
> (the non-nan versions will return NaN if any NaNs are present)
> Commons Math provides a remapping. Note the Statistics ranking module has the
> same NaNStrategy as that in CM:
> * MINIMAL: map to -infinity
> * MAXIMAL: map to +infinity
> * REMOVED: ignore from the data
> * FIXED: leave in place. This makes no sense for quantiles. It is done by
> moving to the end following the order imposed by Double.compare.
> * FAILED: raise an exception
> I favour the simpler option of: treating NaN so they are above/below all
> other values; removing them from the data; or raising an exception. I do not
> see the requirement to remap NaN to infinity. This can be done by the user.
> The API can be simplified by using:
> {code:java}
> public final class NaNPolicy extends java.lang.Enum<NaNPolicy> {
> public static final NaNPolicy LAST; // Move to end of data
> public static final NaNPolicy FIRST; // Move to start of data
> public static final NaNPolicy REMOVE; // Remove from data
> public static final NaNPolicy ERROR; // Raise exception
> }
> {code}
> Note that the FIRST option is not strictly required. But if there is an
> option to order the NaNs (i.e. LAST) then both orders should be supported.
> Which option to use as the default is not clear. As a drop in substitute for
> Arrays.sort the default would be handle NaN with a move to the end (LAST). As
> an API to signal to the user that the quantiles are possibly corrupted then
> ERROR would be the default. A user can then decide what to do if they receive
> errors during their analysis. Note that a separation of the Quantile API from
> the partitioning API (see below) may result in doing NaN processing twice
> introducing a performance overhead. If this will occur by default it should
> be documented so a user can override the NaN behaviour if they do not expect
> NaNs.
> Commons Math places the Percentile class in the rank package. I propose to
> move this implementation to place it in the descriptive package where it will
> sit beside other descriptive statistics for data such as mean and standard
> deviation.
> h2. Examples
> {code:java}
> double[] data = ...
> double q = Quantile.withDefaults().evaluate(data, 0.25);
> double[] quantiles = Quantile.withDefaults().evaluate(data, 0.25, 0.5, 0.75);
> // Use cases:
> // 1. Median / other quantile estimation
> // 2. Box plot of data
> // 3. Interquartile range analysis
> double[] p = Quantile.probabilities(100, 0.01, 0.99);
> double[] quantiles = Quantile.withDefaults().evaluate(data, p);
> // Use cases:
> // 1. plot p vs quantiles
> // 2. use p with an expected (inverse) probability distribution to create a
> QQ plot
> // Sorted input / unsupported datatype
> short[] data = ...
> Arrays.sort(data);
> double[] quantiles = Quantile.withDefaults().evaluate(data.length, i ->
> data[i], p);
> {code}
> h2. Implementation
> The Quantile API and the underlying implementation can be separated.
> Performing quantile estimation requires knowing the value of elements (i,
> i+1) in a sorted array a. Note that i+1 is used for interpolation:
> {noformat}
> value = a[i] + alpha * (a[i+1] - a[i]) ; alpha in [0, 1)
> {noformat}
> Note: The alpha factor and the index i are chosen from the percentile p using
> the EstimationMethod.
> The array does not have to be fully sorted. It can be partitioned so that
> each required element i is equal to the value in a fully sorted array. A
> partitioning API can be placed in Commons Numbers:
> {code:java}
> org.apache.commons.numbers.arrays
> public final class Selection {
> // Operate in-place and support a range as per Arrays.sort
> public void select(double[] a, int k);
> public void select(double[] a, int... k);
> public void select(int from, int to, double[] a, int k);
> public void select(int from, int to, double[] a, int... k);
> // Extend to other primitive types where sorting is slow
> // Sorting of 16-bit data is fast using a histogram so selection
> // has no major speed gain above a small length:
> // byte[] does counting sort at length 64 (temurin JDK 11)
> // short[]/char[] at length 1750 (temurin JDK 11)
> }
> {code}
> h3. Examples
> {code:java}
> // Find the bottom 100 (with 0-based indexing)
> Selection.select(data, 99);
> double[] bottom100 = Arrays.copyOf(data, 100);
> // Find the bottom and top 10
> Selection.select(data, 9, data.length - 10);
> double[] bottom10 = Arrays.copyOf(data, 10);
> double[] top10 = Arrays.copyOfRange(data, data.length - 10, data.length);
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)