Hi,

I was wondering: which algorithms does Calc use when implementing the various statistical functions?

[ Please accept my apologies if you received this e-mail twice. Although I subscribed to the mailing list, I sent the first mail from my second e-mail address. As that e-mail was not posted to the mailing list, I decided to send it again. This is largely the same e-mail, although I have added more information.]


BACKGROUND
==========
Although two formulas may be mathematically identical, they can give different *error estimates* when coded on a computer.

EXAMPLE
========
The variance can be calculated using either of two formulas (whether one divides by n or by n-1 does not affect the point I wish to make):
(1) sigma^2 = (1/n) * SUM(Xi^2) - mean^2,   OR
(2) sigma^2 = (1/n) * SUM(Xi - mean)^2,  where i = 1 to n;

The two formulas are mathematically identical, yet they provide different results and, most importantly, different *accuracy* when implemented as a computer algorithm.
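
To make this concrete, here is a minimal sketch (in Python, purely for illustration; it says nothing about Calc's actual implementation) comparing the two formulas on a data set with a large mean and a tiny spread:

  # Data with a large offset: the true population variance is 0.0125
  data = [1e8 + x for x in (0.1, 0.2, 0.3, 0.4)]
  n = len(data)
  mean = sum(data) / n

  # Formula (1): sum of squares minus squared mean -- cancellation-prone
  var1 = sum(x * x for x in data) / n - mean * mean

  # Formula (2): sum of squared deviations from the mean
  var2 = sum((x - mean) ** 2 for x in data) / n

  print(var1)   # far from 0.0125 in double precision, may even be negative
  print(var2)   # very close to 0.0125

In double precision, formula (1) loses essentially all significant digits on such data (the true variance 0.0125 is far below the rounding granularity of the sums near 1e16), while formula (2) stays accurate.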

[This is also discussed in the following article: http://links.jstor.org/sici?sici=0003-1305%28198308%2937%3A3%3C242%3AAFCTSV%3E2.0.CO%3B2-0&origin=crossref Unfortunately, I do NOT have access to the full article; I would appreciate it if somebody could send it to me.]

REASON
=======
There are two main reasons why mathematically identical formulas give different results on a computer (a short illustration follows the list):
- non-integer numbers can NOT be represented exactly on a computer
- numbers resulting from the various arithmetic operations must be rounded (for t significant decimal digits):
 -- symmetric rounding introduces a relative error of up to 1/2 * 10^(1-t)
 -- rounding by truncation (chopping) introduces a relative error of up to 10^(1-t)
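
Again just as an illustration (binary doubles rather than t-digit decimal numbers, but the principle is the same), a few lines of Python show both effects:

  # 0.1 and 0.2 have no exact binary representation
  print(0.1 + 0.2 == 0.3)          # False
  print(f"{0.1:.20f}")             # 0.10000000000000000555...

  import sys
  # epsilon plays the role of 10^(1-t) for doubles; half of it bounds the
  # relative error of a single correctly (symmetrically) rounded operation
  print(sys.float_info.epsilon)    # about 2.22e-16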

It can be proved that the first formula for the variance gives a *MUCH* higher error. I recently took part in an advanced mathematics course for programmers (I am a physician and got there by chance, so I did not necessarily understand everything). However, the idea is the following:

e = error of the computed variance;
d = upper bound on the rounding error (rounding errors will be abbreviated with R below);
D = maximum of all Delta(i), where Delta(i) = error of (Xi - mean).

Formula (1) then gives:
(1) |e| <= 3/2 * d * mean^2 * n,
while formula (2) gives:
(2) |e| <= 1/2 * d * D^2 * n

What one sees is that:
|E2/E1| = 1/3 * D^2 / mean^2, where E1 and E2 denote the maximal errors of formulas *(1)* and *(2)*.

Because D (being an error term) is *much* smaller than the mean, it follows that E2 <<< E1 (both taken as positive numbers). Therefore *formula (2)* is the preferred way to implement the variance in a computer program.
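
As a side note (this goes beyond the argument above, and I do not know what Calc actually does internally), there is a well-known single-pass method, Welford's algorithm, which achieves accuracy comparable to formula (2) without a second pass over the data. A rough Python sketch:

  def variance_welford(xs):
      """Population variance via Welford's running update."""
      n = 0
      mean = 0.0
      m2 = 0.0                      # running sum of squared deviations
      for x in xs:
          n += 1
          delta = x - mean
          mean += delta / n
          m2 += delta * (x - mean)
      return m2 / n                 # use (n - 1) for the sample variance

  print(variance_welford([1e8 + x for x in (0.1, 0.2, 0.3, 0.4)]))  # ~0.0125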

Admittedly, the error is small. It becomes more important with *huge* data sets (and especially with certain particular data sets, although one cannot predict which ones). However, I would like to know whether anybody (e.g. the developers) has paid some attention to these issues while implementing the various functions.

For me it was counterintuitive that the second formula is *much* better (I was fully convinced that the first formula, having fewer calculations, would be more accurate). This seems even more important for more advanced statistical techniques, where such errors may play an even greater role.

Let me encourage you to develop a state-of-the-art program, and please take my comments as a wish to improve Calc even further.

Kind regards,

Leonard Mada
[aka discoleo]

P.S. The exact error estimate for formula (2) is the following (do not ask me exactly how one gets it; the mathematics is a bit too advanced for me):

e =   1/n * SUM(i = 1 to n) [ (2*Rsi + Rmi) * (Xi - mean)^2 ]
    + 1/n * SUM(j = 1 to n-1) [ Raj * SUM(i = 1 to j+1) (Xi - mean)^2 ]
    + 1/n * Rd * SUM(i = 1 to n) (Xi - mean)^2,

where Rsi, Rmi, Raj and Rd are the rounding errors of the subtractions (s), multiplications (m), additions (a) and the final division (d).

If we set D = max of all Delta(i), where Delta(i) = error of (Xi - mean),
and d = the upper bound on the rounding error, then the previous formula simplifies to:
|e| <= d * D^2 * (n^2 + 9*n - 2) / (2*n),
which for large n behaves like:
|e| <= 1/2 * d * D^2 * n
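
If it helps, the simplification can be checked numerically; a tiny Python snippet (again purely for illustration) shows that the exact coefficient approaches n/2:

  # Compare the exact coefficient (n^2 + 9n - 2)/(2n) with the simplified n/2
  for n in (10, 100, 10_000, 1_000_000):
      exact = (n * n + 9 * n - 2) / (2 * n)
      print(n, exact, n / 2, exact / (n / 2))   # the ratio tends to 1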
