On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.k...@gmail.com> wrote: > On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgood...@gmail.com> wrote: > >> But this sample gives a difference: >> >>>> a = np.random.rand(100) >>>> a.var() >> 0.080232196646619805 >>>> var(a) >> 0.080232196646619791 >> >> As you know, I'm trying to make a drop-in replacement for >> scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in >> part. Either that, or suck it up and store the damn mean. > > The difference is less than eps. Quite possibly, the one-pass version > is even closer to the true value than the two-pass version.
I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays: >> a = np.random.rand(1000000) # numpy; doesn't take care of NaNs >> a.std() 0.28852169850186793 # cython of np.sqrt(((arr - arr.mean())**2).mean()) >> nanstd_twopass(a, ddof=0) 0.28852169850186798 # cython of np.sqrt((arr*arr).mean() - arr.mean()**2) >> nanstd_simple(a, ddof=0) 0.28852169850187437 # http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm >> nanstd_online(a, ddof=0) 0.28852169850187243 # My target, scipy version >> from scipy.stats import nanstd >> nanstd(a, bias=True) 0.28852169850186798 Timing: >> timeit nanstd(a, bias=True) 10 loops, best of 3: 27.8 ms per loop >> timeit a.std() 100 loops, best of 3: 11.5 ms per loop >> timeit nanstd_twopass(a, ddof=0) 100 loops, best of 3: 3.24 ms per loop >> timeit nanstd_simple(a, ddof=0) 1000 loops, best of 3: 1.6 ms per loop >> timeit nanstd_online(a, ddof=0) 100 loops, best of 3: 10.8 ms per loop nanstd_simple is the fastest but I assume the algorithm is no good for general use? I think I'll go with nanstd_twopass. It will most closely match numpy/scipy, is more robust than nanstd_simple, and is the second fastest. Here's the code. Improvements welcomed. @cython.boundscheck(False) @cython.wraparound(False) def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof): "nanstd of 1d numpy array with dtype=np.float64 along axis=0." cdef Py_ssize_t i cdef int a0 = a.shape[0], count = 0 cdef np.float64_t asum = 0, a2sum=0, ai for i in range(a0): ai = a[i] if ai == ai: asum += ai a2sum += ai * ai count += 1 if count > 0: asum = asum * asum return sqrt((a2sum - asum / count) / (count - ddof)) else: return np.float64(NAN) @cython.boundscheck(False) @cython.wraparound(False) def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof): "nanstd of 1d numpy array with dtype=np.float64 along axis=0." cdef Py_ssize_t i cdef int a0 = a.shape[0], n = 0 cdef np.float64_t mean = 0, M2 = 0, delta, x for i in range(a0): x = a[i] if x == x: n += 1 delta = x - mean mean = mean + delta / n M2 = M2 + delta * (x - mean) if n > 0: return np.float64(sqrt(M2 / (n - ddof))) else: return np.float64(NAN) @cython.boundscheck(False) @cython.wraparound(False) def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof): "nanstd of 1d numpy array with dtype=np.float64 along axis=0." cdef Py_ssize_t i cdef int a0 = a.shape[0], count = 0 cdef np.float64_t asum = 0, a2sum=0, amean, ai, da for i in range(a0): ai = a[i] if ai == ai: asum += ai count += 1 if count > 0: amean = asum / count asum = 0 for i in range(a0): ai = a[i] if ai == ai: da = ai - amean asum += da a2sum += (da * da) asum = asum * asum return sqrt((a2sum - asum / count) / (count - ddof)) else: return np.float64(NAN) _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion