Hi,

I was wondering: which algorithms does Calc use when implementing the various statistical functions?

[ Please accept my apologies if you received this e-mail twice. Although I subscribed to the mailing list, I sent the first mail from my second e-mail address. As that e-mail was not posted to the mailing list, I decided to send it again. This is largely the same e-mail, although I have added more information.]


BACKGROUND
==========
Although two formulas may be mathematically identical, they can give different *error estimates* when coded on a computer.

EXAMPLE
========
The variance can be calculated using either of two formulas (whether one divides by n or by n-1 does not affect the point I wish to make):
(1) sigma^2 = (1/n) * SUM(Xi^2) - mean^2,   OR
(2) sigma^2 = (1/n) * SUM(Xi - mean)^2,  where i = 1 to n;

The two formulas are mathematically identical, yet they provide different results and, most importantly, different *accuracy* when implemented as a computer algorithm.
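
To make this concrete, here is a minimal sketch (in Python, purely for illustration; it says nothing about Calc's actual implementation) comparing the two formulas on a data set with a large mean and a tiny spread:

  # Data with a large offset: the true population variance is 0.0125
  data = [1e8 + x for x in (0.1, 0.2, 0.3, 0.4)]
  n = len(data)
  mean = sum(data) / n

  # Formula (1): sum of squares minus squared mean -- cancellation-prone
  var1 = sum(x * x for x in data) / n - mean * mean

  # Formula (2): sum of squared deviations from the mean
  var2 = sum((x - mean) ** 2 for x in data) / n

  print(var1)   # far from 0.0125 in double precision, may even be negative
  print(var2)   # very close to 0.0125

In double precision, formula (1) loses essentially all significant digits on such data (the true variance 0.0125 is far below the rounding granularity of the sums near 1e16), while formula (2) stays accurate.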

[This is also discussed in the following article: http://links.jstor.org/sici?sici=0003-1305%28198308%2937%3A3%3C242%3AAFCTSV%3E2.0.CO%3B2-0&origin=crossref Unfortunately, I do NOT have access to the full article; I would appreciate it if somebody could send it to me.]

REASON
=======
There are two main reasons why mathematically identical formulas give different results on a computer (a short illustration follows the list):
- non-integer numbers can NOT be represented exactly on a computer
- numbers resulting from the various arithmetic operations must be rounded (for t significant decimal digits):
 -- symmetric rounding introduces a relative error of up to 1/2 * 10^(1-t)
 -- rounding by truncation (chopping) introduces a relative error of up to 10^(1-t)
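
Again just as an illustration (binary doubles rather than t-digit decimal numbers, but the principle is the same), a few lines of Python show both effects:

  # 0.1 and 0.2 have no exact binary representation
  print(0.1 + 0.2 == 0.3)          # False
  print(f"{0.1:.20f}")             # 0.10000000000000000555...

  import sys
  # epsilon plays the role of 10^(1-t) for doubles; half of it bounds the
  # relative error of a single correctly (symmetrically) rounded operation
  print(sys.float_info.epsilon)    # about 2.22e-16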

It can be proved that the first formula for the variance gives a *MUCH* higher error. I recently took part in an advanced mathematics course for programmers (I am a physician and got there by chance, so I did not necessarily understand everything). However, the idea is the following:

e = error of the computed variance;
d = upper bound on the rounding error (rounding errors will be abbreviated with R below);
D = maximum of all Delta(i), where Delta(i) = error of (Xi - mean).

Formula (1) then gives:
(1) |e| <= 3/2 * d * mean^2 * n,
while formula (2) gives:
(2) |e| <= 1/2 * d * D^2 * n

What one sees is that:
|E2/E1| = 1/3 * D^2 / mean^2, where E1 and E2 denote the maximal errors of formulas *(1)* and *(2)*.

Because D (being an error term) is *much* smaller than the mean, it follows that E2 <<< E1 (both taken as positive numbers). Therefore *formula (2)* is the preferred way to implement the variance in a computer program.
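
As a side note (this goes beyond the argument above, and I do not know what Calc actually does internally), there is a well-known single-pass method, Welford's algorithm, which achieves accuracy comparable to formula (2) without a second pass over the data. A rough Python sketch:

  def variance_welford(xs):
      """Population variance via Welford's running update."""
      n = 0
      mean = 0.0
      m2 = 0.0                      # running sum of squared deviations
      for x in xs:
          n += 1
          delta = x - mean
          mean += delta / n
          m2 += delta * (x - mean)
      return m2 / n                 # use (n - 1) for the sample variance

  print(variance_welford([1e8 + x for x in (0.1, 0.2, 0.3, 0.4)]))  # ~0.0125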

Admittedly, the error is small. It becomes more important with *huge* data sets (and especially with certain particular data sets, although one cannot predict which ones). However, I would like to know whether anybody (e.g. the developers) has paid some attention to these issues while implementing the various functions.

For me it was counterintuitive that the second formula is *much* better (I was fully convinced that the first formula, having fewer calculations, would be more accurate). This seems even more important for more advanced statistical techniques, where such errors may play an even greater role.

Let me encourage you to develop a state-of-the-art program, and please take my comments as a wish to improve Calc even further.

Kind regards,

Leonard Mada
[aka discoleo]

P.S. The exact error estimate for formula (2) is the following (do not ask me exactly how one gets it; the mathematics is a bit too advanced for me):

e =   1/n * SUM(i = 1 to n) [ (2*Rsi + Rmi) * (Xi - mean)^2 ]
    + 1/n * SUM(j = 1 to n-1) [ Raj * SUM(i = 1 to j+1) (Xi - mean)^2 ]
    + 1/n * Rd * SUM(i = 1 to n) (Xi - mean)^2,

where Rsi, Rmi, Raj and Rd are the rounding errors of the subtractions (s), multiplications (m), additions (a) and the final division (d).

If we set D = max of all Delta(i), where Delta(i) = error of (Xi - mean),
and d = the upper bound on the rounding error, then the previous formula simplifies to:
|e| <= d * D^2 * (n^2 + 9*n - 2) / (2*n),
which for large n behaves like:
|e| <= 1/2 * d * D^2 * n
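
If it helps, the simplification can be checked numerically; a tiny Python snippet (again purely for illustration) shows that the exact coefficient approaches n/2:

  # Compare the exact coefficient (n^2 + 9n - 2)/(2n) with the simplified n/2
  for n in (10, 100, 10_000, 1_000_000):
      exact = (n * n + 9 * n - 2) / (2 * n)
      print(n, exact, n / 2, exact / (n / 2))   # the ratio tends to 1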
