[
https://issues.apache.org/jira/browse/MATH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033967#comment-14033967
]
Gilles commented on MATH-1129:
------------------------------
The [Javadoc|http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/descriptive/rank/Percentile.html] for {{Percentile}} does provide some warning about NaN within data:
{noformat}
To compute percentiles, the data must be at least partially ordered. Input
arrays are copied and recursively partitioned using an ordering definition. The
ordering used by Arrays.sort(double[]) is the one determined by
Double.compareTo(Double). This ordering makes Double.NaN larger than any other
value (including Double.POSITIVE_INFINITY). Therefore, for example, the median
(50th percentile) of {0, 1, 2, 3, 4, Double.NaN} evaluates to 2.5.
Since percentile estimation usually involves interpolation between array
elements, arrays containing NaN or infinite values will often result in NaN or
infinite values returned.
{noformat}
but the caveat does not appear in {{DescriptiveStatistics}}.
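To make the quoted ordering concrete, here is a minimal standalone sketch using only the JDK (it is not the Commons Math code). The class and method names are made up for illustration, and interpolating at the 1-based position (n + 1) / 2 is an assumption chosen because it reproduces the 2.5 from the Javadoc example.

```java
import java.util.Arrays;

final class NaNOrderingDemo {

    /** Median of a fully sorted copy, interpolated at 1-based position (n + 1) / 2. */
    static double medianOfSortedCopy(double[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("empty data");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);                 // total order: NaN sorts after +Infinity
        int n = sorted.length;
        if (n == 1) {
            return sorted[0];
        }
        double pos = 0.5 * (n + 1);          // 1-based position of the estimate
        int k = (int) Math.floor(pos);       // 1-based index of the lower neighbour
        double d = pos - k;                  // fractional part used for interpolation
        double lower = sorted[k - 1];
        double upper = sorted[k];
        return lower + d * (upper - lower);
    }

    public static void main(String[] args) {
        double[] data = {0, 1, Double.NaN, 2, 3, 4};
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        System.out.println(Arrays.toString(sorted));   // [0.0, 1.0, 2.0, 3.0, 4.0, NaN]
        System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY) > 0); // true
        System.out.println(medianOfSortedCopy(data));  // 2.5
        // When NaN is one of the two interpolated neighbours, the result is NaN.
        System.out.println(medianOfSortedCopy(new double[] {0, Double.NaN})); // NaN
    }
}
```

With a full sort the NaN always lands in the last slot, so the answer cannot depend on where the NaN sat in the input.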
Even when no NaN is returned, the result varies with the position of the NaN
value in the data array. :(
It looks like the sorting is wrong in the presence of NaN. See below.
bq. This also creates doubts that the other methods handle NaN values correctly.
I don't know whether the intention was that the result should always be
considered undefined in the presence of NaN.
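As a generic illustration of how a sort can go wrong with NaN (this is not claimed to be the actual code path in {{Percentile}}), the sketch below runs the same insertion sort twice: once with the primitive {{>}} operator, for which every comparison involving NaN is false, and once with {{Double.compare}}, which defines a total order. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class NaNComparisonDemo {

    /** Insertion sort using the primitive operator; NaN never compares greater or smaller. */
    static double[] sortWithPrimitiveCompare(double[] data) {
        double[] a = data.clone();
        for (int i = 1; i < a.length; i++) {
            double key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {   // false whenever a[j] or key is NaN
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
        return a;
    }

    /** Same algorithm, but ordered by Double.compare, which places NaN last. */
    static double[] sortWithTotalOrder(double[] data) {
        double[] a = data.clone();
        for (int i = 1; i < a.length; i++) {
            double key = a[i];
            int j = i - 1;
            while (j >= 0 && Double.compare(a[j], key) > 0) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
        return a;
    }

    public static void main(String[] args) {
        double[] nanInMiddle = {3, Double.NaN, 1};
        double[] nanFirst = {Double.NaN, 3, 1};
        // The NaN acts as a barrier, so the outcome depends on where it started.
        System.out.println(Arrays.toString(sortWithPrimitiveCompare(nanInMiddle))); // [3.0, NaN, 1.0]
        System.out.println(Arrays.toString(sortWithPrimitiveCompare(nanFirst)));    // [NaN, 1.0, 3.0]
        System.out.println(Arrays.toString(sortWithTotalOrder(nanInMiddle)));       // [1.0, 3.0, NaN]
        System.out.println(Arrays.toString(sortWithTotalOrder(nanFirst)));          // [1.0, 3.0, NaN]
    }
}
```

The primitive-comparison variant leaves the finite values on either side of the NaN unsorted relative to each other, which is one way a percentile could end up depending on the NaN's position.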
Local sort:
{noformat}
Without NaN: 25th percentile -0.1773147094639404 75th percentile 0.2748649403760461
With NaN: 25th percentile 0.24166759508327315 75th percentile -0.028075857595882995
With +inf: 25th percentile -0.15595963093172435 75th percentile 0.37445697625436497
{noformat}
{{java.util.Arrays.sort}} (sorting the whole data array):
{noformat}
Without NaN: 25th percentile -0.1773147094639404 75th percentile 0.2748649403760461
With NaN: 25th percentile -0.15595963093172435 75th percentile 0.37445697625436497
With +inf: 25th percentile -0.15595963093172435 75th percentile 0.37445697625436497
{noformat}
I've attempted to fix the local sort:
{noformat}
Without NaN: 25th percentile -0.1773147094639404 75th percentile 0.2748649403760461
With NaN: 25th percentile -0.15595963093172435 75th percentile 0.37445697625436497
With +inf: 25th percentile -0.15595963093172435 75th percentile 0.37445697625436497
{noformat}
If nobody objects, I'll commit this modification, and further tests can be
devised to ensure that it works correctly for other inputs.
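One possible shape for such a test, offered only as a sketch: insert a NaN at every position of a finite data set and check that the estimator under test agrees with a full-sort reference each time. The class, the method names and the reference formula (interpolation at position p * (n + 1) / 100, which reproduces the {{Arrays.sort}} numbers above) are assumptions for illustration; the exact equality check might need to become a tolerance for other inputs.

```java
import java.util.Arrays;
import java.util.function.ToDoubleBiFunction;

final class NaNPositionCheck {

    /** Reference estimate: sort a copy, then interpolate at 1-based position p * (n + 1) / 100. */
    static double referencePercentile(double[] data, double p) {
        if (data.length == 0) {
            throw new IllegalArgumentException("empty data");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);                       // NaN ends up last
        int n = sorted.length;
        double pos = p * (n + 1) / 100.0;
        if (pos < 1) {
            return sorted[0];
        }
        if (pos >= n) {
            return sorted[n - 1];
        }
        int k = (int) Math.floor(pos);             // 1-based index of the lower neighbour
        double d = pos - k;
        return sorted[k - 1] + d * (sorted[k] - sorted[k - 1]);
    }

    /** True if the estimator matches the reference wherever a single NaN is inserted. */
    static boolean agreesForEveryNaNPosition(double[] finiteData, double p,
                                             ToDoubleBiFunction<double[], Double> estimator) {
        for (int nanIdx = 0; nanIdx <= finiteData.length; nanIdx++) {
            double[] withNaN = new double[finiteData.length + 1];
            System.arraycopy(finiteData, 0, withNaN, 0, nanIdx);
            withNaN[nanIdx] = Double.NaN;
            System.arraycopy(finiteData, nanIdx, withNaN, nanIdx + 1, finiteData.length - nanIdx);
            double expected = referencePercentile(withNaN, p);
            double actual = estimator.applyAsDouble(withNaN, p);
            if (Double.compare(expected, actual) != 0) {   // also treats NaN as equal to NaN
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] finite = {3, 1, 2, 5, 4};
        // Stand-in estimator; a real test would plug in the code under test here.
        System.out.println(agreesForEveryNaNPosition(finite, 25,
                NaNPositionCheck::referencePercentile));
    }
}
```

In an actual unit test the estimator argument could be something like {{(d, p) -> new Percentile().evaluate(d, p)}}, with the data from the issue description as the finite input.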
> Percentile Computation errs
> ---------------------------
>
> Key: MATH-1129
> URL: https://issues.apache.org/jira/browse/MATH-1129
> Project: Commons Math
> Issue Type: Bug
> Affects Versions: 3.2
> Environment: Java 1.8.0
> Reporter: Carl Witt
>
> In the following test, the 75th percentile is _smaller_ than the 25th
> percentile, leaving me with a negative interquartile range.
> {code:title=Bar.java|borderStyle=solid}
> @Test
> public void negativePercentiles() {
>     double[] data = new double[] {
>         -0.012086732064244697,
>         -0.24975668704012527,
>         0.5706168483164684,
>         -0.322111769955327,
>         0.24166759508327315,
>         Double.NaN,
>         0.16698443218942854,
>         -0.10427763937565114,
>         -0.15595963093172435,
>         -0.028075857595882995,
>         -0.24137994506058857,
>         0.47543170476574426,
>         -0.07495595384947631,
>         0.37445697625436497,
>         -0.09944199541668033
>     };
>     DescriptiveStatistics descriptiveStatistics = new DescriptiveStatistics(data);
>     double threeQuarters = descriptiveStatistics.getPercentile(75);
>     double oneQuarter = descriptiveStatistics.getPercentile(25);
>     double IQR = threeQuarters - oneQuarter;
>
>     System.out.println(String.format("25th percentile %s 75th percentile %s", oneQuarter, threeQuarters));
>
>     assert IQR >= 0;
> }
> {code}