On Thu, Feb 10, 2011 at 10:31 AM, Pauli Virtanen <p...@iki.fi> wrote:
> Thu, 10 Feb 2011 12:16:12 -0600, Robert Kern wrote:
> [clip]
> > One thing that might be worthwhile is to make
> > implementations of sum() and cumsum() that avoid the ufunc machinery and
> > do their iterations more quickly, at least for some common combinations
> > of dtype and contiguity.
>
> I wonder what is the balance between the iterator overhead and the time
> taken in the reduction inner loop. This should be straightforward to
> benchmark.
>
> Apparently, some overhead decreased with the new iterators, since current
> Numpy master outperforms 1.5.1 by a factor of 2 for this benchmark:
>
>     In [8]: %timeit M.sum(1)     # Numpy 1.5.1
>     10 loops, best of 3: 85 ms per loop
>
>     In [8]: %timeit M.sum(1)     # Numpy master
>     10 loops, best of 3: 49.5 ms per loop
>
> I don't think this is explainable by the new memory layout optimizations,
> since M is C-contiguous.
>
> Perhaps there would be room for more optimization, even within the ufunc
> framework?
>

I played around with this in einsum, where it's a bit easier to specialize
this case than in the ufunc machinery. What I found made the biggest
difference is to use SSE prefetching instructions to prepare the cache in
advance. Here are the kind of numbers I get, all from the current Numpy
master:

In [7]: timeit M.sum(1)
10 loops, best of 3: 44.6 ms per loop

In [8]: timeit dot(M, o)
10 loops, best of 3: 36.8 ms per loop

In [9]: timeit einsum('ij->i', M)
10 loops, best of 3: 32.1 ms per loop

...

In [14]: timeit M.sum(1)
10 loops, best of 3: 41.5 ms per loop

In [15]: timeit dot(M, o)
10 loops, best of 3: 42.1 ms per loop

In [16]: timeit einsum('ij->i', M)
10 loops, best of 3: 30 ms per loop

-Mark
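
For anyone who wants to rerun the comparison, here is a minimal sketch of the three row-sum variants timed above. The thread does not give the shape or dtype of M, and it does not define o, so the 1000x1000 float64 array and the ones vector below are assumptions, and the helper names row_sum_variants and bench are made up for illustration. Absolute timings will differ from the numbers quoted here.

```python
import timeit

import numpy as np


def row_sum_variants(M):
    """Build the three row-sum callables compared in the thread."""
    # Assumption: 'o' is a vector of ones, so dot(M, o) sums each row.
    o = np.ones(M.shape[1], dtype=M.dtype)
    return {
        "M.sum(1)": lambda: M.sum(1),
        "dot(M, o)": lambda: np.dot(M, o),
        "einsum('ij->i', M)": lambda: np.einsum("ij->i", M),
    }


def bench(shape=(1000, 1000), number=10, repeat=3, seed=0):
    """Return the best per-call time in seconds for each variant."""
    rng = np.random.default_rng(seed)
    M = rng.random(shape)  # C-contiguous float64; shape is an assumption
    variants = row_sum_variants(M)
    reference = variants["M.sum(1)"]()
    timings = {}
    for name, fn in variants.items():
        # All three variants should produce the same row sums.
        assert np.allclose(fn(), reference)
        # Mirror "N loops, best of R": best total time divided by loop count.
        timings[name] = min(timeit.repeat(fn, number=number, repeat=repeat)) / number
    return timings


if __name__ == "__main__":
    for name, seconds in bench().items():
        print(f"{name:>20}: {seconds * 1e3:.2f} ms per loop")
```

The sketch reports the best of several repeats, as %timeit does, since the minimum is less sensitive to noise from other processes than the mean.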