Two replies to Robert's question:

(1) There was a thread on edstat a year or two ago that included several versions of an updating algorithm that was rather elegant. That was before my then ISP went bankrupt and all my files (including all my e-mail files) vanished into cyberspace, so I can't look them up conveniently; but it might be worth browsing the edstat archive.

(2) You may care to consult my dissertation, "Computer-generated errors in statistical analysis" (Cornell, 1969), the core of which was published in the first issue of J. Statistical Computing and Simulation.

Thumbnail result: the number of decimal digits of precision lost in using the usual "computing formula" is approximately log(k^2), where k is the ratio of the mean to the standard deviation in the data.

One implication of that result is that if 0 lies in the range of the data, the loss in precision will not ordinarily exceed one decimal digit: k = 3 is fairly extreme even when 0 is at one end of the set of data values, and log(9) < 1 in base 10. I conjectured that k could not in any case then exceed 5 (log(25) < 1.5, base 10) -- I think there's a theorem there, but I never developed it.

A further implication: if one stores the first (non-missing) value of each variable in a vector, subtracts that vector from each subsequent observation, and uses those differences as one's data (so that zero is a fortiori within the range of values), the loss in precision due to rounding error in the computational formula is at most one decimal digit. An advantage of this scheme is that the data supplied to the algorithm are always *without rounding error* (unless the original data required more precision than the floating-point hardware supplies), which is not always true of data from an updating-algorithm approach, whose precision depends on the precision of the current temporary mean.

In practice this also means that the differences seldom require more than three decimal digits to express; stored in ASCII form (at, in those days, six characters per 36-bit word, or eight characters per 48-bit word in CDC machines), they took up notably less storage space than if stored in floating-point form, one value per word. (In the late 1960s, memory space was a rather more important consideration than it now is.) But I always considered the more important implication to be the absence of *any* rounding error in the data supplied to the algorithm.
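To make the thumbnail result and the shift-by-the-first-value scheme concrete, here is a small made-up illustration (written for this note, not taken from the dissertation; the seed, sample size, and function names are invented). It applies the usual computing formula to data whose mean-to-SD ratio k is about 10^8 -- by the result above, roughly log(k^2) = 16 digits lost, enough to exhaust double precision -- and then applies the same formula to differences from the first value:

    # A made-up illustration of the precision-loss result and of the
    # shift-by-the-first-value remedy; names, seed, and data are invented.

    import random

    def var_computing_formula(x):
        """The usual 'computing formula': (sum x^2 - (sum x)^2 / n) / (n - 1)."""
        n = len(x)
        s = sum(x)
        ss = sum(v * v for v in x)
        return (ss - s * s / n) / (n - 1)

    def var_two_pass(x):
        """Debiased mean square of residuals about the mean (two passes)."""
        n = len(x)
        m = sum(x) / n
        return sum((v - m) ** 2 for v in x) / (n - 1)

    def var_shifted(x):
        """Computing formula applied to differences from the first value.

        Because every value here lies close to the first one, each subtraction
        is exact, so the data fed to the formula carry no new rounding error;
        zero lies in their range and their mean-to-SD ratio is small."""
        first = x[0]
        return var_computing_formula([v - first for v in x])

    random.seed(2)
    k = 1.0e8                  # mean-to-SD ratio; roughly log10(k**2) = 16 digits lost
    data = [k + random.gauss(0.0, 1.0) for _ in range(1000)]   # true variance near 1

    print("computing formula:", var_computing_formula(data))  # far from 1; may even be negative
    print("two-pass formula :", var_two_pass(data))           # close to 1
    print("shifted formula  :", var_shifted(data))            # close to 1

The two-pass formula is included only as a reference value; the point is that the shifted data rescue the computing formula itself.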
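As for the updating algorithms mentioned in (1): I can't reproduce the edstat thread's versions, but for concreteness here is a sketch of a typical one-pass updating scheme of that kind (the form usually attributed to Welford), which carries the "current temporary mean" mentioned above. The code and names are mine, for illustration only:

    # A sketch of a typical one-pass updating algorithm for the mean and
    # variance (Welford's form); illustrative only, not the edstat thread's code.

    def updating_mean_variance(data):
        """Return (mean, sample variance), updating a running mean as data arrive."""
        n = 0
        mean = 0.0          # the current temporary mean
        sum_sq_dev = 0.0    # running sum of squared deviations from that mean
        for x in data:
            n += 1
            delta = x - mean
            mean += delta / n
            sum_sq_dev += delta * (x - mean)   # note: uses the updated mean
        if n < 2:
            return mean, float("nan")
        return mean, sum_sq_dev / (n - 1)

    # Example with a large mean relative to the spread:
    print(updating_mean_variance([10000.0 + v for v in (0.1, 0.2, 0.3, 0.4, 0.5)]))
    # -> mean 10000.3, sample variance about 0.025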
On Thu, 26 Feb 2004, Robert J. MacG. Dawson wrote:

> The thing usually presented as a "computational formula" for
> variance does NOT avoid rounding error, but on the contrary is known
> to be a far worse offender in this regard than the conceptually simple
> debiased-mean-square-residual formula. In particular, in the
> nightmare scenario, it can yield a negative value for variance,
> causing the program to crash when standard deviation is computed.
>
> It is used (when it is) because it reduces (slightly) the number
> of operations and (significantly) the space complexity and number of
> memory calls.
>
> I seem to recall from my long-ago exposure to numerical analysis
> that there are formulae that work significantly better than the
> "standard" ones in terms of rounding error, but I've never seen them
> in a stats text. Anybody know a good reference?
>
> -Robert Dawson

------------------------------------------------------------
Donald F. Burrill                              [EMAIL PROTECTED]
56 Sebbins Pond Drive, Bedford, NH 03110        (603) 626-0816
