My current work on Certified values exposes a limitation in how UnivariateImpl
calculates the variance: certain data (an example is shown further below)
induces roundoff error. The current implementation is:
public double getVariance() {
    double variance = Double.NaN;
    if (n == 1) {
        variance = 0.0;
    } else if (n > 1) {
        double xbar = getMean();
        // "textbook" form: (sum of squares - n * mean^2) / (n - 1)
        variance = (sumsq - xbar * xbar * ((double) n)) / ((double) (n - 1));
    }
    return variance;
}
Basically, when "sumsq" approaches "xbar*xbar*((double) n)", the numerator in
the variance calculation can become negative, so getVariance() returns a
negative value and Math.sqrt then goes on to return NaN for the standard
deviation. Of course, negative variances are not in the domain of possible
return values for variance. The test where I discovered this uses the
NumAcc4.txt dataset, which is now under
/src/test/org/apache/commons/math/stat/data.
That data set basically looks like the following:
10000000.2
10000000.1
10000000.3
10000000.1
10000000.3
10000000.1
10000000.3
10000000.1
10000000.3
...
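To see the cancellation outside of UnivariateImpl, here is a small standalone
sketch. The generated array only mimics this structure (roughly a thousand rows
with a large offset and a 0.1 spread); it is not the actual NumAcc4.txt file,
and the local sum/sumsq variables just stand in for the class's running sums:

public class CurrentFormulaDemo {
    public static void main(String[] args) {
        // Large offset, tiny spread: 10000000.2 followed by alternating
        // 10000000.1 and 10000000.3 values, in the spirit of NumAcc4.
        int n = 1001;
        double sum = 0.0;
        double sumsq = 0.0;
        for (int i = 0; i < n; i++) {
            double v = (i == 0) ? 10000000.2 : (i % 2 == 1 ? 10000000.1 : 10000000.3);
            // Accumulate the same running sums UnivariateImpl keeps.
            sum += v;
            sumsq += v * v;
        }

        double xbar = sum / n;
        // sumsq and xbar*xbar*n agree in nearly all of their significant
        // digits, so the subtraction cancels almost everything and can
        // come out negative (which is what happened in the NumAcc4 test).
        double variance = (sumsq - xbar * xbar * n) / (n - 1);

        // The exact sample variance of this pattern is 0.01 (std dev 0.1).
        System.out.println("variance (current formula) = " + variance);
        System.out.println("std dev                    = " + Math.sqrt(variance));
    }
}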
I think the following alternative approach handles extreme cases like this more
stably:
public double getVariance() {
    double variance = Double.NaN;
    if (n == 1) {
        variance = 0.0;
    } else if (n > 1) {
        // numerator built from the running sum instead of the mean:
        // (sum of squares - (sum)^2 / n) / (n - 1)
        variance = (sumsq - (Math.pow(sum, 2) / (double) n)) / (double) (n - 1);
    }
    return variance;
}
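For comparison, the two numerators can be pulled out into small helpers and
evaluated on the same accumulated sums. Again, this is only an illustrative
sketch with made-up NumAcc4-style values, not the actual test code:

public class VarianceComparison {

    // Current form: numerator built from the mean.
    static double viaMean(double sum, double sumsq, int n) {
        double xbar = sum / n;
        return (sumsq - xbar * xbar * n) / (n - 1);
    }

    // Proposed form: numerator built from the running sum.
    static double viaSum(double sum, double sumsq, int n) {
        return (sumsq - (sum * sum) / n) / (n - 1);
    }

    public static void main(String[] args) {
        int n = 1001;
        double sum = 0.0;
        double sumsq = 0.0;
        for (int i = 0; i < n; i++) {
            double v = (i == 0) ? 10000000.2 : (i % 2 == 1 ? 10000000.1 : 10000000.3);
            sum += v;
            sumsq += v * v;
        }
        // The exact value for this pattern is 0.01; printing both results
        // side by side shows how each expression is affected by roundoff.
        System.out.println("via mean: " + viaMean(sum, sumsq, n));
        System.out.println("via sum : " + viaSum(sum, sumsq, n));
    }
}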
In exact arithmetic, "sumsq - (sum*sum)/n" is just the sum of squared
deviations from the mean, so it can never be negative. The roundoff error in
the "sum of squares" is always greater than that of the "square of the sum",
so this form does not induce negative numbers and instability when dealing
with extreme situations like this. I think that while one should try to avoid
situations where roundoff errors arise, dealing with such cases in a stable
fashion should help to produce a robust algorithm.
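That identity can be checked with small, exactly representable values
(made-up numbers, purely for illustration):

public class IdentityCheck {
    public static void main(String[] args) {
        // Small integer data, so every intermediate result is exact in double.
        double[] x = {2, 4, 4, 4, 5, 5, 7, 9};
        int n = x.length;

        double sum = 0.0;
        double sumsq = 0.0;
        for (double v : x) {
            sum += v;
            sumsq += v * v;
        }
        double xbar = sum / n; // 5.0 exactly

        double viaSums = sumsq - (sum * sum) / n; // 232 - 200 = 32
        double viaDeviations = 0.0;
        for (double v : x) {
            viaDeviations += (v - xbar) * (v - xbar);
        }

        // Both print 32.0: the proposed numerator equals the sum of squared
        // deviations from the mean, which is never negative in exact arithmetic.
        System.out.println(viaSums + " == " + viaDeviations);
    }
}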
I wanted to open this to discussion prior to committing this change.
Mark