Hi,
I was wondering which algorithms Calc uses when implementing the
various statistical functions.
[Please accept my apologies if you received this e-mail twice. Although
I subscribed to the mailing list, I sent the first mail from my second
e-mail address. As that mail never appeared on the list, I decided to
send it again. It is largely the same e-mail, although I have added
more information.]
REASON
=======
Formulas that are mathematically identical can nevertheless produce
different results, and different *error bounds*, when coded on a computer.
EXAMPLE
========
The variance can be calculated using either of two formulas (whether one
uses n or n-1 does not affect the point I wish to make):
(1) sigma^2 = (1/n) * SUM(Xi^2) - mean^2, OR
(2) sigma^2 = (1/n) * SUM((Xi - mean)^2), where i = 1 to n.
The two formulas are mathematically identical, yet they provide
different results, and, MOST IMPORTANTLY, different *accuracy* when
implemented as a computer algorithm.
[This is discussed in the following article, too:
http://links.jstor.org/sici?sici=0003-1305%28198308%2937%3A3%3C242%3AAFCTSV%3E2.0.CO%3B2-0&origin=crossref
Unfortunately I do NOT have access to the full article; I would
appreciate it if somebody could send it to me.]
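The difference is easy to see numerically. Below is a small sketch (in
Python, which is of course not Calc's implementation language; it is
used here only for illustration) comparing the two formulas on a data
set with a large mean, where formula (1) cancels catastrophically:

```python
# Population variance computed two ways; data with a large mean
# exposes the cancellation in formula (1).
xs = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]  # true variance = 22.5
n = len(xs)
mean = sum(xs) / n

# Formula (1): mean of squares minus square of mean.
# Two huge, nearly equal numbers are subtracted.
var1 = sum(x * x for x in xs) / n - mean * mean

# Formula (2): mean of squared deviations.
# The deviations are small and exact, so no cancellation occurs.
var2 = sum((x - mean) ** 2 for x in xs) / n

print(var1)  # far from 22.5 (catastrophic cancellation)
print(var2)  # 22.5
```

With these inputs the deviations (Xi - mean) happen to be exact in
binary floating point, so formula (2) returns the true value, while
formula (1) has already lost all significant digits.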
REASON
=======
There are two main reasons why identical formulas give different results
on a computer:
- non-integer numbers can, in general, NOT be represented exactly on a computer
- numbers resulting from the various arithmetic operations must be rounded:
-- symmetric rounding introduces a relative error of up to 1/2 * 10^(1-t)
-- rounding by truncation (chopping) introduces a relative error of up to 10^(1-t)
(where t is the number of significant digits carried).
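The first point can be demonstrated in one line; any binary
floating-point system (Python is used here purely as an example)
behaves the same way:

```python
# 0.1 and 0.2 have no exact binary representation, so even this
# simple sum is already rounded away from the mathematical result.
a = 0.1 + 0.2
print(a == 0.3)      # False
print(abs(a - 0.3))  # a tiny, but non-zero, difference
```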
It can be shown that the first variance formula has a *MUCH* larger
error bound. I recently attended an advanced mathematics course for
programmers (I am a physician and ended up there by chance, so I did not
necessarily understand everything). The idea, however, is this: let
e = error of the computed variance;
d = upper bound of the rounding error (rounding error will be
abbreviated with R);
D = maximum of all Delta(i), where Delta(i) = error of (Xi - mean).
Then formula (1) gives:
(1) |e| <= 3/2 * d * mean^2 * n, while formula (2) gives:
(2) |e| <= 1/2 * d * D^2 * n
Comparing the two bounds, one sees that:
|E2/E1| = 1/3 * D^2 / mean^2, where E1 and E2 are the maximal errors
of formulas *(1)* and *(2)*.
Because D (being an error term) is *<<<* the mean, it follows that
E2 <<< E1 (taking both as positive numbers). Therefore *formula (2)* is
the preferred way to implement the variance in a computer program.
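For completeness: besides the two-pass formula (2), there is a
well-known numerically stable *one-pass* method (Welford's recurrence),
useful when the data cannot be traversed twice. The sketch below is not
taken from Calc's sources; it is just the standard textbook technique:

```python
def welford_pvariance(xs):
    """One-pass population variance via Welford's recurrence:
    the running mean and the running sum of squared deviations (m2)
    are updated together, avoiding the cancellation of formula (1)."""
    mean = 0.0
    m2 = 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / len(xs)

print(welford_pvariance([1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]))  # 22.5
```

On the same large-mean data set that breaks formula (1), this one-pass
method recovers the true variance.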
Admittedly, the error is small. It becomes more important with *huge*
data sets (and especially with certain particular data sets, though one
cannot predict in advance which ones). However, I would like to know
whether anybody (e.g. the developers) has paid attention to these issues
while implementing the various functions?
For me it was counterintuitive that the second formula is *much* better
(I was fully convinced that the first formula, requiring fewer
calculations, would be more accurate). This seems even more important for more
advanced statistical techniques, where such errors may play an even
greater role.
Let me encourage you to develop a state-of-the-art program, and please
take my comments as a wish to improve Calc even further.
Kind regards,
Leonard Mada
[aka discoleo]
P.S. The exact error estimate for formula (2) is (do not ask me exactly
how one derives it; it involves a bit too much high mathematics):

e = 1/n * SUM(i = 1 to n) [(2*Rsi + Rmi) * (Xi - mean)^2]
  + 1/n * SUM(j = 1 to n-1) [Raj * SUM(i = 1 to j+1) (Xi - mean)^2]
  + 1/n * Rd * SUM(i = 1 to n) (Xi - mean)^2,

where Rs, Rm, Ra, Rd = rounding errors for subtraction (s),
multiplication (m), addition (a), and division (d), respectively.
If we set D = max of all Delta(i), where Delta(i) = error of (Xi - mean),
and d = the upper bound of the rounding error, then the previous formula
simplifies to:
|e| <= d * D^2 * (n^2 + 9*n - 2) / (2*n)
and asymptotically, as n -> infinity, |e| <= 1/2 * d * D^2 * n.