[EMAIL PROTECTED] wrote:
My current work on Certified values exposes a limitation in calculating Variance in UnivariateImpl.

The following is the current implementation; the example data shown further down induces roundoff error in it:

public double getVariance() {
   double variance = Double.NaN;
   if( n == 1 ) {
      variance = 0.0;
   } else if( n > 1 ) {
      double xbar = getMean();
      variance = (sumsq - xbar*xbar*((double) n)) /
                 ((double)(n-1));
   }
   return variance;
}

Basically, when "sumsq" approaches "xbar*xbar*((double) n)", the numerator in the variance calculation can become negative, so getVariance() returns a negative value and Math.sqrt then goes on to return NaN. Of course, negative variances are not in the domain of possible return values for variance. The test I discovered this on uses the NumAcc4.txt dataset that is now under /src/test/org/apache/commons/math/stat/data.

This data set basically looks like the following:

  10000000.2
  10000000.1
  10000000.3
  10000000.1
  10000000.3
  10000000.1
  10000000.3
  10000000.1
  10000000.3
  ...
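The cancellation is easy to reproduce outside of UnivariateImpl. The sketch below is illustrative only -- the two hypothetical values just mimic the large-offset, small-spread shape of the data above. The one-pass formula computes its numerator as a difference of two quantities near 2e16, where the spacing between adjacent doubles is 4, so the true variance of 0.02 is lost entirely; subtracting the mean first (two-pass) recovers it:

```java
public class CancellationDemo {

    // Naive one-pass sample variance, as in the current getVariance().
    static double onePassVariance(double[] x) {
        int n = x.length;
        double sum = 0.0, sumsq = 0.0;
        for (double v : x) {
            sum += v;
            sumsq += v * v;
        }
        double xbar = sum / n;
        // Difference of two values near 2e16: granularity there is 4,
        // so the true variance (0.02) is below the rounding noise.
        return (sumsq - xbar * xbar * (double) n) / (double) (n - 1);
    }

    // Two-pass sample variance: subtract the mean before squaring,
    // so only the small deviations (~0.1) are squared.
    static double twoPassVariance(double[] x) {
        int n = x.length;
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        double xbar = sum / n;
        double ss = 0.0;
        for (double v : x) {
            ss += (v - xbar) * (v - xbar);
        }
        return ss / (n - 1);
    }

    public static void main(String[] args) {
        // Large common offset, tiny spread -- same shape as the data above
        // (hypothetical values, not the actual NumAcc4 file).
        double[] x = { 1.0e8 + 0.1, 1.0e8 + 0.3 };
        System.out.println("one-pass: " + onePassVariance(x)); // far from 0.02
        System.out.println("two-pass: " + twoPassVariance(x)); // close to 0.02
    }
}
```

A two-pass computation is not really an option for UnivariateImpl's running-sums design, which is why the choice among one-pass formulas matters here.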

I think that this different approach below handles extreme cases like this with more stability:

public double getVariance() {
   double variance = Double.NaN;

   if( n == 1 ) {
    variance = 0.0;
   } else if( n > 1 ) {
    variance = (sumsq - Math.pow(sum, 2) / (double) n) / (double) (n - 1);
   }
   return variance;
}

The roundoff error in the "sum of squares" is always positively greater than that in the "square of the sum", so this form avoids negative numerators and instability in extreme situations like this one. I think that while one should try to avoid situations where roundoff errors arise, dealing with such cases in a stable fashion should help to produce a robust algorithm.

Assuming that the unit tests all succeed and the accuracy against the reference data sets is at least as good as the current implementation for all examples, I am OK with this change, but not for the reasons that you give.


Since xbar = sum/n, the change has no impact on which sums are computed or squared. Instead of (sum/n)*(sum/n)*n your change just computes sum**2/n. The difference is that you are a) eliminating one division by n and one multiplication by n (no doubt a good thing) and b) replacing direct multiplication with Math.pow(-, 2). The second of these used to be discouraged, but I doubt it makes any difference with modern compilers. I would suggest collapsing the denominators of the sum terms and doing just one explicit cast -- i.e., use

(1) variance = sumsq / (n - 1) - sum * (sum / (double) (n * (n - 1)))

If

(2) variance = sumsq / (n - 1) - (sum * sum) / (double) (n * (n - 1)) or

(3) variance = sumsq / (n - 1) - Math.pow(sum, 2) / (double) (n * (n - 1)) give

better accuracy, use one of them; but I would favor (1) since it will be able to handle larger positive sums.
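To illustrate why (1) handles larger sums, here is a small sketch; the count and sum are hypothetical, chosen only to make the effect visible. Squaring the sum first overflows to infinity, while dividing by the collapsed denominator before the second multiplication keeps the intermediate representable:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        long n = 1_000_000_000L;  // hypothetical large count
        double sum = 1.0e155;     // hypothetical large positive sum

        double denom = (double) (n * (n - 1));  // ~1e18, fits in a long before the cast

        // Square first: sum * sum is ~1e310, beyond Double.MAX_VALUE.
        double squaredFirst = (sum * sum) / denom;  // Infinity

        // Divide first: sum / denom is ~1e137, and the product ~1e292 is finite.
        double dividedFirst = sum * (sum / denom);

        System.out.println("square first: " + squaredFirst);
        System.out.println("divide first: " + dividedFirst);
    }
}
```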

I would also recommend forcing getVariance() to return 0 if the result is negative (which can happen in the right circumstances for any of these formulas).
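Putting the two recommendations together, getVariance() might look like the sketch below. This is illustrative, not the committed code: the fields n, sum, and sumsq are assumed to match UnivariateImpl's running sums, form (1) is used for the computation, and a negative result is clamped to zero.

```java
public class VarianceSketch {
    private long n;        // count of values seen (assumed field)
    private double sum;    // running sum (assumed field)
    private double sumsq;  // running sum of squares (assumed field)

    public void addValue(double v) {
        n++;
        sum += v;
        sumsq += v * v;
    }

    public double getVariance() {
        double variance = Double.NaN;
        if (n == 1) {
            variance = 0.0;
        } else if (n > 1) {
            // Form (1): divide before the second multiplication so that
            // large sums do not overflow, with the denominators collapsed
            // into a single n * (n - 1) term.
            variance = sumsq / (n - 1)
                     - sum * (sum / (double) (n * (n - 1)));
            // Roundoff can still drive the result slightly negative;
            // clamp to the mathematically valid range.
            variance = Math.max(0.0, variance);
        }
        return variance;
    }

    public static void main(String[] args) {
        VarianceSketch s = new VarianceSketch();
        for (double v : new double[] { 1, 2, 3, 4 }) {
            s.addValue(v);
        }
        System.out.println(s.getVariance()); // close to 5/3
    }
}
```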

Phil


I wanted to open this to discussion prior to committing this change. Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




