dear all, I've routinely found that:
1) ndarray.take is up to 1 order of magnitude faster than fancy indexing 2) Hand-coded Cython boolean indexing is many times faster than ndarray indexing 3) putmask is significantly faster than ndarray indexing For example, I stumbled on this tonight: straightforward cython code: def row_bool_subset(ndarray[float64_t, ndim=2] values, ndarray[uint8_t, cast=True] mask): cdef: Py_ssize_t i, j, n, k, pos = 0 ndarray[float64_t, ndim=2] out n, k = (<object> values).shape assert(n == len(mask)) out = np.empty((mask.sum(), k), dtype=np.float64) for i in range(n): if mask[i]: for j in range(k): out[pos, j] = values[i, j] pos += 1 return out In [1]: values = randn(1000000, 4) In [2]: mask = np.ones(1000000, dtype=bool) In [3]: import pandas._sandbox as sbx In [4]: result = sbx.row_bool_subset(values, mask) In [5]: timeit result = sbx.row_bool_subset(values, mask) 100 loops, best of 3: 9.58 ms per loop pure NumPy: In [6]: timeit values[mask] 10 loops, best of 3: 81.6 ms per loop Here's the kind of take performance problems that I routinely experience: In [12]: values = randn(1000000, 4) v In [13]: values.shape Out[13]: (1000000, 4) In [14]: indexer = np.random.permutation(1000000)[:500000] In [15]: timeit values[indexer] 10 loops, best of 3: 70.7 ms per loop In [16]: timeit values.take(indexer, axis=0) 100 loops, best of 3: 13.3 ms per loop When I can spare the time in the future I will personally work on these indexing routines in the C codebase, but I suspect that I'm not the only one adversely affected by these kinds of performance problems, and it might be worth thinking about a community effort to split up the work of retooling these methods to be more performant. We could use a tool like my vbench project (github.com/wesm/vbench) to make a list of the performance benchmarks of interest (like the ones above). Unfortunately I am too time constrained at least for the next 6 months to devote to a complete rewrite of the code in question. Possibly sometime in 2013 if no one has gotten to it yet, but this seems like someplace that we should be concerned about as the language performance wars continue to rage. - Wes _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion