On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman <[email protected]> wrote:
> On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <[email protected]> > wrote: > > On Sun, Nov 21, 2010 at 19:49, Keith Goodman <[email protected]> > wrote: > > > >> But this sample gives a difference: > >> > >>>> a = np.random.rand(100) > >>>> a.var() > >> 0.080232196646619805 > >>>> var(a) > >> 0.080232196646619791 > >> > >> As you know, I'm trying to make a drop-in replacement for > >> scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in > >> part. Either that, or suck it up and store the damn mean. > > > > The difference is less than eps. Quite possibly, the one-pass version > > is even closer to the true value than the two-pass version. > > I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays: > > >> a = np.random.rand(1000000) > > # numpy; doesn't take care of NaNs > >> a.std() > 0.28852169850186793 > > # cython of np.sqrt(((arr - arr.mean())**2).mean()) > >> nanstd_twopass(a, ddof=0) > 0.28852169850186798 > > # cython of np.sqrt((arr*arr).mean() - arr.mean()**2) > >> nanstd_simple(a, ddof=0) > 0.28852169850187437 > > # > http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm > >> nanstd_online(a, ddof=0) > 0.28852169850187243 > > # My target, scipy version > >> from scipy.stats import nanstd > >> nanstd(a, bias=True) > 0.28852169850186798 > > Timing: > > >> timeit nanstd(a, bias=True) > 10 loops, best of 3: 27.8 ms per loop > >> timeit a.std() > 100 loops, best of 3: 11.5 ms per loop > >> timeit nanstd_twopass(a, ddof=0) > 100 loops, best of 3: 3.24 ms per loop > >> timeit nanstd_simple(a, ddof=0) > 1000 loops, best of 3: 1.6 ms per loop > >> timeit nanstd_online(a, ddof=0) > 100 loops, best of 3: 10.8 ms per loop > > nanstd_simple is the fastest but I assume the algorithm is no good for > general use? > > I think I'll go with nanstd_twopass. It will most closely match > numpy/scipy, is more robust than nanstd_simple, and is the second > fastest. > > Here's the code. Improvements welcomed. > > @cython.boundscheck(False) > @cython.wraparound(False) > def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof): > "nanstd of 1d numpy array with dtype=np.float64 along axis=0." > cdef Py_ssize_t i > cdef int a0 = a.shape[0], count = 0 > cdef np.float64_t asum = 0, a2sum=0, ai > for i in range(a0): > ai = a[i] > if ai == ai: > asum += ai > a2sum += ai * ai > count += 1 > if count > 0: > asum = asum * asum > return sqrt((a2sum - asum / count) / (count - ddof)) > else: > return np.float64(NAN) > > @cython.boundscheck(False) > @cython.wraparound(False) > def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof): > "nanstd of 1d numpy array with dtype=np.float64 along axis=0." > cdef Py_ssize_t i > cdef int a0 = a.shape[0], n = 0 > cdef np.float64_t mean = 0, M2 = 0, delta, x > for i in range(a0): > x = a[i] > if x == x: > n += 1 > delta = x - mean > mean = mean + delta / n > M2 = M2 + delta * (x - mean) > if n > 0: > return np.float64(sqrt(M2 / (n - ddof))) > else: > return np.float64(NAN) > > @cython.boundscheck(False) > @cython.wraparound(False) > def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof): > "nanstd of 1d numpy array with dtype=np.float64 along axis=0." > cdef Py_ssize_t i > cdef int a0 = a.shape[0], count = 0 > cdef np.float64_t asum = 0, a2sum=0, amean, ai, da > for i in range(a0): > ai = a[i] > if ai == ai: > asum += ai > count += 1 > if count > 0: > amean = asum / count > asum = 0 > for i in range(a0): > ai = a[i] > if ai == ai: > da = ai - amean > asum += da > a2sum += (da * da) > asum = asum * asum > return sqrt((a2sum - asum / count) / (count - ddof)) > else: > return np.float64(NAN) > I wonder how the results would change if the size of the array was larger than the processor cache? I still can't seem to wrap my head around the idea that a two-pass algorithm would be faster than a single-pass. Is this just a big-O thing where sometimes one algorithm will be faster than the other based on the size of the problem? Ben Root
_______________________________________________ NumPy-Discussion mailing list [email protected] http://mail.scipy.org/mailman/listinfo/numpy-discussion
