--- [EMAIL PROTECTED] wrote:
Phil Steitz wrote:
Phil Steitz wrote:
[EMAIL PROTECTED] wrote:
Phil Steitz wrote:
Since xbar = sum/n, the change has no impact on which sums are computed or squared. Instead of (sum/n)*(sum/n)*n, your change just computes sum**2/n. The difference is that you are (a) eliminating one division by n and one multiplication by n (no doubt a good thing) and (b) replacing direct multiplication with pow(-,2). The second of these used to be discouraged, but I doubt it makes any difference with modern compilers. I would suggest collapsing the denominators and doing just one cast -- i.e., use
(1) variance = sumsq - sum * (sum/(double) (n * (n - 1)))
If
(2) variance = sumsq - (sum * sum)/((double) (n * (n - 1))) or
(3) variance = sumsq - Math.pow(sum,2)/((double) (n * (n - 1))) give
better accuracy, use one of them; but I would favor (1) since it will be able to handle larger positive sums.
I would also recommend forcing getVariance() to return 0 if the result is negative (which can happen in the right circumstances for any of these formulas).
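For illustration (my own sketch, not the commons-math code): the collapsed formula together with the clamp-to-zero guard might look like this, where n, sum, and sumsq are assumed running accumulators:

```java
// Illustration only: collapsed sample-variance formula with the suggested
// clamp to zero. n, sum, and sumsq are hypothetical running accumulators
// (count, sum of x, and sum of x^2).
public final class ClampedVariance {
    public static double variance(long n, double sum, double sumsq) {
        if (n < 2) {
            return Double.NaN; // sample variance is undefined for n < 2
        }
        // One cast, denominators collapsed: (n*sumsq - sum^2) / (n*(n-1))
        double variance = (n * sumsq - sum * sum) / ((double) n * (n - 1));
        // Roundoff can drive the result slightly negative; clamp as suggested.
        return variance < 0.0 ? 0.0 : variance;
    }

    public static void main(String[] args) {
        // x = {1, 2, 3}: sum = 6, sumsq = 14, sample variance = 1
        System.out.println(variance(3, 6.0, 14.0)); // prints 1.0
    }
}
```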
Phil
Collapsing is definitely good, but I'm not sure about these equations. From my experience, an approach like (2) would look more like
variance = (((double)n)*sumsq - (sum * sum)) / (double) (n * (n - 1));
see (5) in http://mathworld.wolfram.com/k-Statistic.html
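As a quick sanity check (my own example, not part of the thread), that collapsed one-pass form agrees with the definitional sum-of-squared-deviations computation:

```java
// Check on hypothetical example data that the collapsed one-pass formula
// (n*sumsq - sum^2) / (n*(n-1)) matches the definitional sample variance.
public final class CollapsedVsDefinition {
    public static double collapsed(double[] x) {
        int n = x.length;
        double sum = 0.0, sumsq = 0.0;
        for (double v : x) {
            sum += v;
            sumsq += v * v;
        }
        return (n * sumsq - sum * sum) / ((double) n * (n - 1));
    }

    // Definitional form: sum of squared deviations about the mean, over n-1.
    public static double definitional(double[] x) {
        int n = x.length;
        double sum = 0.0;
        for (double v : x) sum += v;
        double mean = sum / n;
        double ss = 0.0;
        for (double v : x) ss += (v - mean) * (v - mean);
        return ss / (n - 1);
    }

    public static void main(String[] args) {
        double[] x = {2, 4, 4, 4, 5, 5, 7, 9}; // sample variance = 32/7
        System.out.println(collapsed(x) + " vs " + definitional(x));
    }
}
```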
That formula is the formula for the 2nd k-statistic, which is *not* the same as the sample variance. The standard formula for the sample variance is presented in equation (3) here: http://mathworld.wolfram.com/SampleVariance.html or in any elementary statistics text. Formulas (1)-(3) above (and the current implementation) are all equivalent to the standard definition.
What you have above is not. The relation between the variance and the second k-statistic is presented in (9) on http://mathworld.wolfram.com/k-Statistic.html
I just realized that I misread Wolfram's definitions. What he is defining as the 2nd k-statistic is the correct formula for the sample variance. I am also missing some parentheses above. Your formula is correct. Sorry.
Phil
No problem. I just wanted to make sure we're all on the same page; there are plenty of times I am wrong too.
Thanks for double checking, -Mark
Uh, did we drop the idea of using the "corrected two-pass" algorithm for the variance in the non-rolling case? I excerpted that thread below.
I was going to mention that. The discussion above regards the "non-stored vector" approach (UnivariateImpl). I submitted some patches to the "stored" classes (StoredUnivariate,AbstractStoreUnivariate) last weekend (adding percentiles -- http://issues.apache.org/bugzilla/show_bug.cgi?id=20377) and I was waiting for them to get committed before suggesting this change/submitting the patch.
Given our recent NR entanglements, we need to make sure that if and when we apply that patch we document using the original source of the formula.
Thanks for the reminder.
What is relevant to the UnivariateImpl computations is Brent's suggestion re: making "running" error corrections. Have you investigated this further?
Al
Date: Mon, 26 May 2003 18:28:45 -0700 (PDT) From: Al Chou <[EMAIL PROTECTED]> Subject: [math] rolling formulas for statistics (was RE: Greetings from a newcomer (b
--- Phil Steitz <[EMAIL PROTECTED]> wrote:
Al Chou wrote:
--- Phil Steitz <[EMAIL PROTECTED]> wrote:
Brent Worden wrote:
Mark R. Diggory wrote:
There are easy formulas for skewness and kurtosis based on the central moments which could be used for the stored, univariate implementations: http://mathworld.wolfram.com/Skewness.html http://mathworld.wolfram.com/Kurtosis.html
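For illustration (my sketch, not the commons-math code), the central-moment formulas from those pages -- skewness g1 = m3/m2^(3/2) and excess kurtosis g2 = m4/m2^2 - 3, with m_k the k-th sample central moment -- could be computed as:

```java
// Illustration only: skewness and excess kurtosis from sample central
// moments, per the MathWorld definitions (population/biased forms).
public final class MomentStats {
    // k-th sample central moment: (1/n) * sum((x_i - mean)^k)
    static double centralMoment(double[] x, int k) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double m = 0.0;
        for (double v : x) {
            double d = v - mean, p = 1.0;
            for (int i = 0; i < k; i++) p *= d; // d^k by repeated multiplication
            m += p;
        }
        return m / x.length;
    }

    // Skewness g1 = m3 / m2^(3/2)
    public static double skewness(double[] x) {
        return centralMoment(x, 3) / Math.pow(centralMoment(x, 2), 1.5);
    }

    // Excess kurtosis g2 = m4 / m2^2 - 3
    public static double kurtosisExcess(double[] x) {
        double m2 = centralMoment(x, 2);
        return centralMoment(x, 4) / (m2 * m2) - 3.0;
    }

    public static void main(String[] args) {
        double[] symmetric = {1, 2, 3, 4, 5};
        // Symmetric data: skewness is 0; excess kurtosis is -1.3 for this set.
        System.out.println(skewness(symmetric) + " " + kurtosisExcess(symmetric));
    }
}
```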
As for the rolling implementations, there might be some more research involved before using this method because of their memoryless property. But for starters, the sum and sumsq can easily be replaced with their central moment counterparts, mean and variance. There are formulas that update those metrics when a new value is added. Weisberg's "Applied Linear Regression" outlines two such updating formulas for mean and sum of squares which are numerically superior to direct computation and the raw moment methods.
Why, exactly, are these numerically superior? For what class of examples? Looks like lots more operations to me, especially in the UnivariateImpl case where the mean and variance are computed only when demanded -- i.e., there is no requirement to generate mean[0], mean[1], ..., etc. I understand that adding (or better, swapping) operations can sometimes add precision, but I am having a hard time seeing exactly where the benefit is in this case, especially given the amount of additional computation required.
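For concreteness (my illustration, not necessarily Weisberg's exact presentation), one widely used updating scheme of the kind Brent describes, often attributed to Welford, maintains the mean and the sum of squared deviations incrementally:

```java
// Illustration only: Welford-style running mean/variance updates, one
// concrete instance of the "updating formulas" under discussion.
public final class RunningMoments {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean

    public void addValue(double x) {
        n++;
        double delta = x - mean;  // deviation from the old mean
        mean += delta / n;        // incremental mean update
        m2 += delta * (x - mean); // uses old and new deviations; avoids sumsq cancellation
    }

    public double getMean() { return mean; }

    public double getVariance() { // sample variance (n-1 denominator)
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }

    // Convenience: run a whole array through and return {mean, variance}.
    public static double[] summarize(double[] xs) {
        RunningMoments r = new RunningMoments();
        for (double x : xs) r.addValue(x);
        return new double[]{r.getMean(), r.getVariance()};
    }

    public static void main(String[] args) {
        // mean = 5, sample variance = 32/7 for this data
        double[] stats = summarize(new double[]{2, 4, 4, 4, 5, 5, 7, 9});
        System.out.println(stats[0] + " " + stats[1]);
    }
}
```

The benefit is that each update subtracts the current mean before squaring, so the accumulated quantity stays small relative to sumsq, which is where the cancellation error in the raw-moment formulas comes from.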
_Numerical Recipes in C_, 2nd ed., p. 613 (http://lib-www.lanl.gov/numerical/bookcpdf/c14-1.pdf) explains: "Many textbooks use the binomial theorem to expand out the definitions [of statistical quantities] into sums of various powers of the data, ...[,] but this can magnify the roundoff error by a large factor... . A clever way to minimize roundoff error, especially for large samples, is to use the corrected two-pass algorithm [1]: First calculate x[bar, the mean of x], then calculate [formula for variance in terms of x - xbar.] ..."
[1] Chan, T.F., Golub, G.H., and LeVeque, R.J., "Algorithms for Computing the Sample Variance: Analysis and Recommendations", 1983, American Statistician, vol. 37, pp. 242-247.
Thank you, Al!
I am convinced by this that, at least for the variance and higher order moments, whenever we actually store the values (that would include UnivariateImpl with a finite window), we should use the "corrected two-pass" formula (14.1.8) from http://lib-www.lanl.gov/numerical/bookcpdf/c14-1.pdf. It is a clever idea to explicitly correct for error in the mean.
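A sketch of that corrected two-pass algorithm (my illustration of NR eq. 14.1.8 and the Chan/Golub/LeVeque reference, not the commons-math code):

```java
// Illustration only: "corrected two-pass" sample variance. First pass
// computes the mean; second pass accumulates deviations, and the
// (ideally zero) residual sum of deviations corrects for roundoff
// error in the computed mean.
public final class CorrectedTwoPass {
    public static double variance(double[] x) {
        int n = x.length;
        if (n < 2) {
            return Double.NaN; // undefined for fewer than two values
        }
        double sum = 0.0;
        for (double v : x) sum += v;
        double mean = sum / n;

        double sumDev = 0.0;   // sum of deviations; exactly 0 in exact arithmetic
        double sumDevSq = 0.0; // sum of squared deviations
        for (double v : x) {
            double d = v - mean;
            sumDev += d;
            sumDevSq += d * d;
        }
        // The sumDev*sumDev/n term is the explicit correction for error in the mean.
        return (sumDevSq - sumDev * sumDev / n) / (n - 1);
    }

    public static void main(String[] args) {
        // sample variance = 32/7 for this data
        System.out.println(variance(new double[]{2, 4, 4, 4, 5, 5, 7, 9}));
    }
}
```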
===== Albert Davidson Chou
Get answers to Mac questions at http://www.Mac-Mgrs.org/ .
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
