--- [EMAIL PROTECTED] wrote:
> Phil Steitz wrote:
> > Phil Steitz wrote:
> >> [EMAIL PROTECTED] wrote:
> >>> Phil Steitz wrote:
> >>>> Since xbar = sum/n, the change has no impact on which sums are
> >>>> computed or squared. Instead of (sum/n)*(sum/n)*n your change just
> >>>> computes sum**2/n. The difference is that you are a) eliminating
> >>>> one division by n and one multiplication by n (no doubt a good
> >>>> thing) and b) replacing direct multiplication with pow(-,2). The
> >>>> second of these used to be discouraged, but I doubt it makes any
> >>>> difference with modern compilers. I would suggest collapsing the
> >>>> denominators and doing just one cast -- i.e., use
> >>>>
> >>>> (1) variance = sumsq - sum * (sum/(double) (n * (n - 1)))
> >>>>
> >>>> If
> >>>>
> >>>> (2) variance = sumsq - (sum * sum)/(double) (n * (n - 1))) or
> >>>>
> >>>> (3) variance = sumsq - Math.pow(sum,2)/(double) (n * (n - 1)))
> >>>>
> >>>> give better accuracy, use one of them; but I would favor (1) since
> >>>> it will be able to handle larger positive sums.
> >>>>
> >>>> I would also recommend forcing getVariance() to return 0 if the
> >>>> result is negative (which can happen in the right circumstances for
> >>>> any of these formulas).
> >>>>
> >>>> Phil
> >>>
> >>> Collapsing is definitely good, but I'm not sure about these
> >>> equations. From my experience, the corrected version of (2) would
> >>> look something more like
> >>>
> >>> variance = (((double) n) * sumsq - (sum * sum)) / (double) (n * (n - 1));
> >>>
> >>> see (5) in http://mathworld.wolfram.com/k-Statistic.html
> >>
> >> That formula is the formula for the 2nd k-statistic, which is *not*
> >> the same as the sample variance. The standard formula for the sample
> >> variance is presented in equation (3) here:
> >> http://mathworld.wolfram.com/SampleVariance.html or in any elementary
> >> statistics text. Formulas (1)-(3) above (and the current
> >> implementation) are all equivalent to the standard definition.
> >>
> >> What you have above is not. The relation between the variance and the
> >> second k-statistic is presented in (9) on
> >> http://mathworld.wolfram.com/k-Statistic.html
> >
> > I just realized that I misread Wolfram's definitions. What he is
> > defining as the 2nd k-statistic is the correct formula for the sample
> > variance. I am also missing some parentheses above. Your formula is
> > correct. Sorry.
> >
> > Phil
>
> No problem, I just wanted to make sure we're all on the same page; there
> are many times when I am wrong too.
>
> Thanks for double checking,
> -Mark
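For concreteness, a minimal Java sketch of the formula the thread settles
on: the single-cast form Mark gives, with the clamp-to-zero guard Phil
recommends. The names sum, sumsq, and n follow the discussion; the class
and method names are illustrative, not actual commons-math code.

public final class VarianceSketch {
    // Bias-corrected sample variance from the running sum and sum of
    // squares: (n*sumsq - sum^2) / (n*(n-1)), computed with a single cast.
    public static double varianceFromSums(double sum, double sumsq, int n) {
        if (n < 2) {
            return Double.NaN; // sample variance needs at least two values
        }
        double variance = (n * sumsq - sum * sum) / (double) (n * (n - 1));
        // Cancellation can drive the result slightly negative; clamp to 0.
        return Math.max(0.0, variance);
    }
}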
Uh, did we drop the idea of using the "corrected two-pass" algorithm for the
variance in the non-rolling case? I excerpted that thread below.

Al

Date: Mon, 26 May 2003 18:28:45 -0700 (PDT)
From: Al Chou <[EMAIL PROTECTED]>
Subject: [math] rolling formulas for statistics (was RE: Greetings from a newcomer (b

--- Phil Steitz <[EMAIL PROTECTED]> wrote:
> Al Chou wrote:
> > --- Phil Steitz <[EMAIL PROTECTED]> wrote:
> >> Brent Worden wrote:
> >>>> Mark R. Diggory wrote:
> >>> There are easy formulas for skewness and kurtosis based on the central
> >>> moments which could be used for the stored, univariate implementations:
> >>> http://mathworld.wolfram.com/Skewness.html
> >>> http://mathworld.wolfram.com/Kurtosis.html
> >>>
> >>> As for the rolling implementations, there might be some more research
> >>> involved before using this method because of their memoryless property.
> >>> But for starters, the sum and sumsq can easily be replaced with their
> >>> central moment counterparts, mean and variance. There are formulas that
> >>> update those metrics when a new value is added. Weisberg's "Applied
> >>> Linear Regression" outlines two such updating formulas for mean and sum
> >>> of squares which are numerically superior to direct computation and the
> >>> raw moment methods.
> >>
> >> Why, exactly, are these numerically superior? For what class of
> >> examples? Looks like lots more operations to me, especially in the
> >> UnivariateImpl case where the mean and variance are computed only when
> >> demanded -- i.e., there is no requirement to generate mean[0],
> >> mean[1], ...etc. I understand that adding (or better, swapping)
> >> operations can sometimes add precision, but I am having a hard time
> >> seeing exactly where the benefit is in this case, especially given the
> >> amount of additional computation required.
> >
> > _Numerical Recipes in C_, 2nd ed., p. 613
> > (http://lib-www.lanl.gov/numerical/bookcpdf/c14-1.pdf) explains:
> > "Many textbooks use the binomial theorem to expand out the definitions
> > [of statistical quantities] into sums of various powers of the data,
> > ...[,] but this can magnify the roundoff error by a large factor... . A
> > clever way to minimize roundoff error, especially for large samples, is
> > to use the corrected two-pass algorithm [1]: First calculate x[bar, the
> > mean of x], then calculate [formula for variance in terms of x - xbar.]
> > ..."
> >
> > [1] "Algorithms for Computing the Sample Variance: Analysis and
> > Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. 1983,
> > American Statistician, vol. 37, pp. 242-247.
>
> Thank you, Al!
>
> I am convinced by this that, at least for the variance and higher order
> moments, whenever we actually store the values (that would include
> UnivariateImpl with a finite window), we should use the "corrected
> two-pass" formula (14.1.8) from
> http://lib-www.lanl.gov/numerical/bookcpdf/c14-1.pdf. It is a clever
> idea to explicitly correct for error in the mean.

=====
Albert Davidson Chou
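On the updating formulas Brent mentions for the rolling case, a standard
Welford-style recurrence looks like the sketch below. This is an assumption
about the general technique, not necessarily the exact formulation in
Weisberg's "Applied Linear Regression", and the class name is made up.

public final class RunningMoments {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean

    // Memoryless update: incorporates one new value without storing the data.
    public void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;        // updated mean
        m2 += delta * (x - mean); // updated squared-deviation sum
    }

    public double getVariance() {
        return n < 2 ? Double.NaN : m2 / (n - 1); // bias-corrected variance
    }
}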
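And for the stored-values case the thread converges on, a sketch of the
corrected two-pass algorithm (formula 14.1.8 in Numerical Recipes, after
Chan, Golub, and LeVeque); the class and method names are illustrative only.

public final class TwoPassVariance {
    // Corrected two-pass algorithm: pass one computes the mean; pass two
    // accumulates squared deviations plus a correction term built from the
    // (ideally zero) sum of deviations, which absorbs roundoff in the mean.
    public static double variance(double[] values) {
        int n = values.length;
        if (n < 2) {
            return Double.NaN; // sample variance needs at least two values
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;
        double sumDev = 0.0;   // nonzero only because of roundoff
        double sumSqDev = 0.0; // sum of squared deviations
        for (double v : values) {
            double dev = v - mean;
            sumDev += dev;
            sumSqDev += dev * dev;
        }
        return (sumSqDev - sumDev * sumDev / n) / (n - 1);
    }
}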
