On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson <pelson....@gmail.com> wrote:
> The ticket https://github.com/numpy/numpy/issues/2269 discusses the > possibility of implementing a "find first" style function which can > optimise the process of finding the first value(s) which match a predicate > in a given 1D array. For example: > > > >>> a = np.sin(np.linspace(0, np.pi, 200)) > >>> print find_first(a, lambda a: a > 0.9) > ((71, ), 0.900479032457) > > > This has been discussed in several locations: > > https://github.com/numpy/numpy/issues/2269 > https://github.com/numpy/numpy/issues/2333 > > http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item > > > *Rationale* > > For small arrays there is no real reason to avoid doing: > > >>> a = np.sin(np.linspace(0, np.pi, 200)) > >>> ind = (a > 0.9).nonzero()[0][0] > >>> print (ind, ), a[ind] > (71,) 0.900479032457 > > > But for larger arrays, this can lead to massive amounts of work even if > the result is one of the first to be computed. Example: > > >>> a = np.arange(1e8) > >>> print (a == 5).nonzero()[0][0] > 5 > > > So a function which terminates when the first matching value is found is > desirable. > > As mentioned in #2269, it is possible to define a consistent ordering > which allows this functionality for >1D arrays, but IMHO it overcomplicates > the problem and was not a case that I personally needed, so I've limited > the scope to 1D arrays only. > > > *Implementation* > > My initial assumption was that to get any kind of performance I would need > to write the *find* function in C, however after prototyping with some > array chunking it became apparent that a trivial python function would be > quick enough for my needs. > > The approach I've implemented in the code found in #2269 simply breaks the > array into sub-arrays of maximum length *chunk_size* (2048 by default, > though there is no real science to this number), applies the given > predicating function, and yields the results from *nonzero()*. The given > function should be a python function which operates on the whole of the > sub-array element-wise (i.e. the function should be vectorized). Returning > a generator also has the benefit of allowing users to get the first > *n*matching values/indices. > > > *Results* > > > I timed the implementation of *find* found in my comment at > https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an > obvious test: > > > In [1]: from np_utils import find > > In [2]: import numpy as np > > In [3]: import numpy.random > > In [4]: np.random.seed(1) > > In [5]: a = np.random.randn(1e8) > > In [6]: a.min(), a.max() > Out[6]: (-6.1194900990552776, 5.9632246301166321) > > In [7]: next(find(a, lambda a: np.abs(a) > 6)) > Out[7]: ((33105441,), -6.1194900990552776) > > In [8]: (np.abs(a) > 6).nonzero() > Out[8]: (array([33105441]),) > > In [9]: %timeit (np.abs(a) > 6).nonzero() > 1 loops, best of 3: 1.51 s per loop > > In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6)) > 1 loops, best of 3: 912 ms per loop > > In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=100000)) > 1 loops, best of 3: 470 ms per loop > > In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=1000000)) > 1 loops, best of 3: 483 ms per loop > > > This shows that picking a sensible *chunk_size* can yield massive > speed-ups (nonzero is x3 slower in one case). A similar example with a much > smaller 1D array shows similar promise: > > In [41]: a = np.random.randn(1e4) > > In [42]: %timeit next(find(a, lambda a: np.abs(a) > 3)) > 10000 loops, best of 3: 35.8 us per loop > > In [43]: %timeit (np.abs(a) > 3).nonzero() > 10000 loops, best of 3: 148 us per loop > > > As I commented on the issue tracker, if you think this function is worth > taking forward, I'd be happy to open up a pull request. > > Feedback greatfully received. > > Cheers, > > Phil > > > In the interest of generalizing code and such, could such approaches be used for functions like np.any() and np.all() for short-circuiting if True or False (respectively) are found? I wonder what other sort of functions in NumPy might benefit from this? Ben Root
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion