Robert Kern wrote:
> David M. Cooke wrote:
>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>> Let me offer a third path: the algorithms used for .mean() and .var()
>>> are substandard. There are much better incremental algorithms that
>>> entirely avoid the need to accumulate such large (and therefore
>>> precision-losing) intermediate values. The algorithms look like the
>>> following for 1D arrays in Python:
>>>
>>> def mean(a):
>>>     m = a[0]
>>>     for i in range(1, len(a)):
>>>         m += (a[i] - m) / (i + 1)
>>>     return m
>>>
>> This isn't really going to be any better than using a simple sum.
>> It'll also be slower (a division per iteration).
>
> With one exception, every test that I've thrown at it shows that it's
> better for float32. That exception is uniformly spaced arrays, like
> linspace().
>
>> You do avoid accumulating large sums, but then doing the division
>> a[i]/len(a) and adding that will do the same.
>
> Okay, this is true.
>
>> Now, if you want to avoid losing precision, you want to use a better
>> summation technique, like compensated (or Kahan) summation:
>>
>> def mean(a):
>>     s = e = a.dtype.type(0)
>>     for i in range(0, len(a)):
>>         temp = s
>>         y = a[i] + e
>>         s = temp + y
>>         e = (temp - s) + y
>>     return s / len(a)
>>
>> Some numerical experiments in Maple using 5-digit precision show that
>> your mean is maybe a bit better in some cases, but can also be much
>> worse, than sum(a)/len(a), but both are quite poor in comparison to
>> the Kahan summation.
>>
>> (We could probably use a fast implementation of Kahan summation in
>> addition to a.sum())
>
> +1
>
>>> def var(a):
>>>     m = a[0]
>>>     t = a.dtype.type(0)
>>>     for i in range(1, len(a)):
>>>         q = a[i] - m
>>>         r = q / (i + 1)
>>>         m += r
>>>         t += i * q * r
>>>     t /= len(a)
>>>     return t
>>>
>>> Alternatively, from Knuth:
>>>
>>> def var_knuth(a):
>>>     m = a.dtype.type(0)
>>>     variance = a.dtype.type(0)
>>>     for i in range(len(a)):
>>>         delta = a[i] - m
>>>         m += delta / (i + 1)
>>>         variance += delta * (a[i] - m)
>>>     variance /= len(a)
>>>     return variance
>>>
>> These formulas are good when you can only do one pass over the data
>> (like in a calculator where you don't store all the data points), but
>> are slightly worse than doing two passes. Kahan summation would
>> probably also be good here too.

On the flip side, it doesn't take a very large array (somewhere in the
vicinity of 10,000 values, in my experience) before memory bandwidth
becomes the limiting factor. In that regime a two-pass algorithm could
well be twice as slow as a single-pass algorithm, even if the former is
somewhat simpler in terms of numeric operations.
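A two-pass variance of the kind being discussed looks something like
this (a rough sketch; the name var_two_pass is mine, purely for
illustration, not a proposed implementation):

import numpy as np

def var_two_pass(a):
    # Pass 1: read the whole array once to compute the mean.
    m = a.sum() / len(a)
    # Pass 2: read it again to accumulate the squared residuals.
    # This second trip through memory is what hurts once the array
    # no longer fits in cache.
    return ((a - m) ** 2).sum() / len(a)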
-tim

> Again, my tests show otherwise for float32. I'll condense my ipython
> log into a module for everyone's perusal. It's possible that the Kahan
> summation of the squared residuals will work better than the current
> two-pass algorithm and the implementations I give above.
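In the meantime, a quick way to see the float32 effect for yourself is
something along these lines (a rough sketch only -- mean_naive and
mean_kahan are names made up for illustration, and the exact error
magnitudes will vary with platform and data):

import numpy as np

def mean_naive(a):
    # Straight left-to-right accumulation in the array's own dtype;
    # this is where a float32 running sum sheds low-order bits.
    s = a.dtype.type(0)
    for x in a:
        s = s + x
    return s / len(a)

def mean_kahan(a):
    # Compensated (Kahan) summation, as in David's version above:
    # e carries each addition's rounding error into the next one.
    s = e = a.dtype.type(0)
    for x in a:
        temp = s
        y = x + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)

if __name__ == '__main__':
    # A large constant offset plus many values is close to a worst
    # case for a float32 running sum.
    a = (np.random.random(100000) + 1000).astype(np.float32)
    exact = a.astype(np.float64).mean()
    print('naive:', abs(float(mean_naive(a)) - exact))
    print('kahan:', abs(float(mean_kahan(a)) - exact))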