[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Neal Becker Wed, 01 Nov 2023 05:33:44 -0700

Thanks Juan, this is really great!  I plan to make use of this right away.

On Wed, Nov 1, 2023 at 8:13 AM Juan Nunez-Iglesias <j...@fastmail.com> wrote:


> Have you tried timing things? Thankfully this is easy to test because the
> Python source of numba-jitted functions is available at jitted_func.py_func.
>
> In [23]: @numba.njit
>     ...: def _first(arr, pred):
>     ...:     for i, elem in enumerate(arr):
>     ...:         if pred(elem):
>     ...:             return i
>     ...:
>     ...: def first(arr, pred):
>     ...:     _pred = numba.njit(pred)
>     ...:     return _first(arr, _pred)
>     ...:
>
> In [24]: arr = np.random.random(100_000_000)
>
> In [25]: %timeit first(arr, lambda x: x > 5)
> 72 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>
> In [26]: %timeit arr + 5
> 90.3 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [27]: %timeit _first.py_func(arr, lambda x: x > 5)
> 7.8 s ± 46.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>
> So numba gives a >100x speedup. It's still not as fast as a NumPy function
> call that doesn't have an allocation overhead:
>
> In [30]: arr2 = np.empty_like(arr, dtype=bool)
>
> In [32]: %timeit np.greater(arr, 5, out=arr2)
> 13.9 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> But it's certainly much better than pure Python! And it's not a huge cost
> for the flexibility.
>
> Juan.
>
> On Wed, 1 Nov 2023, at 10:42 AM, Dom Grigonis wrote:
>
> This results in a very slow code. The function calls of
>
> _pred = numba.njit(pred)
>
> are expensive and this sort of approach will be comparable to pure python
> functions.
>
> This is only recommended for sourcing functions that are not called
> frequently, but rather have a large computational content within them. In
> other words not suitable for predicates.
>
> Regards,
> DG
>
> On 1 Nov 2023, at 01:05, Juan Nunez-Iglesias <j...@fastmail.com> wrote:
>
> If you add a layer of indirection with Numba you can get a *very* nice API:
>
> @numba.njit
> def _first(arr, pred):
>     for i, elem in enumerate(arr):
>         if pred(elem):
>             return i
>
> def first(arr, pred):
>     _pred = numba.njit(pred)
>     return _first(arr, _pred)
>
> This even works with lambdas! (TIL, thanks Numba devs!)
>
> >>> first(np.random.random(10_000_000), lambda x: x > 0.99)
> 215
>
> Since Numba has ufunc support I don't suppose it would be hard to make it
> work with an axis= argument, but I've never played with that API myself.
>
> On Tue, 31 Oct 2023, at 6:49 PM, Lev Maximov wrote:
>
> I've implemented such functions in Cython and packaged them into a library
> called numpy_illustrated <https://pypi.org/project/numpy-illustrated/>
>
> It exposes the following functions:
>
> find(a, v)  # returns the index of the first occurrence of v in a
> first_above(a, v)   # returns the index of the first element in a that is
> strictly above v
> first_nonzero(a)   # returns the index of the first nonzero element
>
> They scan the array and bail out immediately once the match is found. Have
> a significant performance gain if the element to be
> found is closer to the beginning of the array. Have roughly the same speed
> as alternative methods if the value is missing.
>
> The complete signatures of the functions look like this:
>
> find(a, v, rtol=1e-05, atol=1e-08, sorted=False, default=-1, raises=False)
> first_above(a, v, sorted=False, missing=-1, raises=False)
> first_nonzero(a, missing=-1, raises=False)
>
> This covers the most common use cases and does not accept Python callbacks
> because accepting them would nullify any speed gain
> one would expect from such a function. A Python callback can be
> implemented with Numba, but anyone who can write the callback
> in Numba has no need for a library that wraps it into a dedicated function.
>
> The library has a 100% test coverage. Code style 'black'. It should be
> easy to add functions like 'first_below' if necessary.
>
> A more detailed description of these functions can be found here
> <https://betterprogramming.pub/the-numpy-illustrated-library-7531a7c43ffb?sk=8dd60bfafd6d49231ac76cb148a4d16f>
> .
>
> Best regards,
>   Lev Maximov
>
> On Tue, Oct 31, 2023 at 3:50 AM Dom Grigonis <dom.grigo...@gmail.com>
> wrote:
>
> I juggled a bit and found pretty nice solution using numba. Which is
> probably not very robust, but proves that such thing can be optimised while
> retaining flexibility. Check if it works for your use cases and let me know
> if anything fails or if it is slow compared to what you used.
>
>
> first_true_str = """def first_true(arr, n):    result = np.full((n, 
> arr.shape[1]), -1, dtype=np.int32)    for j in range(arr.shape[1]):        k 
> = 0        for i in range(arr.shape[0]):            x = arr[i:i + 1, j]       
>      if cond(x):                result[k, j] = i                k += 1        
>         if k >= n:                    break    return result"""
>
> *class* *FirstTrue*:
>     CONTEXT = {'np': np}
>
>     *def* __init__(self, expr):
>         self.expr = expr
>         self.expr_ast = ast.parse(expr, mode='exec').body[0].value
>         self.func_ast = ast.parse(first_true_str, mode='exec')
>         self.func_ast.body[0].body[1].body[1].body[1].test = self.expr_ast
>         self.func_cmp = compile(self.func_ast, filename="<ast>", mode="exec")
>         *exec*(self.func_cmp, self.CONTEXT)
>         self.func_nb = nb.njit(self.CONTEXT[self.func_ast.body[0].name])
>
>     *def* __call__(self, arr, n=1, axis=None):
>         *# PREPARE INPUTS*
>         in_1d = False
>         *if* axis *is* None:
>             arr = np.ravel(arr)[:, None]
>             in_1d = True
>         *elif* axis == 0:
>             *if* arr.ndim == 1:
>                 in_1d = True
>                 arr = arr[:, None]
>         *else*:
>             *raise* *ValueError*('axis ~in (None, 0)')
>         res = self.func_nb(arr, n)
>         *if* in_1d:
>             res = res[:, 0]
>         *return* res
>
> *if* __name__ == '__main__':
>     arr = np.arange(125).reshape((5, 5, 5))
>     ft = FirstTrue('np.sum(x) > 30')
>     *print*(ft(arr, n=2, axis=0))
>
> [[1 0 0 0 0]
>  [2 1 1 1 1]]
>
>
> In [16]: %timeit ft(arr, 2, axis=0)1.31 µs ± 3.94 ns per loop (mean ± std. 
> dev. of 7 runs, 1,000,000 loops each)
>
> Regards,
> DG
>
> On 29 Oct 2023, at 23:18, rosko37 <rosk...@gmail.com> wrote:
>
> An example with a 1-D array (where it is easiest to see what I mean) is
> the following. I will follow Dom Grigonis's suggestion that the range not
> be provided as a separate argument, as it can be just as easily "folded
> into" the array by passing a slice. So it becomes just:
> idx = first_true(arr, cond)
>
> As Dom also points out, the "cond" would likely need to be a "function
> pointer" (i.e., the name of a function defined elsewhere, turning
> first_true into a higher-order function), unless there's some way to pass a
> parseable expression for simple cases. A few special cases like the first
> zero/nonzero element could be handled with dedicated options (sort of like
> matplotlib colors), but for anything beyond that it gets unwieldy fast.
>
> So let's say we have this:
> ******************
> def cond(x):
>     return x>50
>
> search_arr = np.exp(np.arange(0,1000))
>
> print(np.first_true(search_arr, cond))
> *******************
>
> This should print 4, because the element of search_arr at index 4 (i.e.
> the 5th element) is e^4, which is slightly greater than 50 (while e^3 is
> less than 50). It should return this *without testing the 6th through
> 1000th elements of the array at all to see whether they exceed 50 or not*.
> This example is rather contrived, because simply taking the natural log of
> 50 and rounding up is far superior, not even *evaluating the array of
> exponentials *(which my example clearly still does--and in the use cases
> I've had for such a function, I can't predict the array elements like
> this--they come from loaded data, the output of a simulation, etc., and are
> all already in a numpy array). And in this case, since the values are
> strictly increasing, search_sorted() would work as well. But it illustrates
> the idea.
>
>
>
>
> On Thu, Oct 26, 2023 at 5:54 AM Dom Grigonis <dom.grigo...@gmail.com>
> wrote:
>
> Could you please give a concise example? I know you have provided one, but
> it is engrained deep in verbose text and has some typos in it, which makes
> hard to understand exactly what inputs should result in what output.
>
> Regards,
> DG
>
> > On 25 Oct 2023, at 22:59, rosko37 <rosk...@gmail.com> wrote:
> >
> > I know this question has been asked before, both on this list as well as
> several threads on Stack Overflow, etc. It's a common issue. I'm NOT asking
> for how to do this using existing Numpy functions (as that information can
> be found in any of those sources)--what I'm asking is whether Numpy would
> accept inclusion of a function that does this, or whether (possibly more
> likely) such a proposal has already been considered and rejected for some
> reason.
> >
> > The task is this--there's a large array and you want to find the next
> element after some index that satisfies some condition. Such elements are
> common, and the typical number of elements to be searched through is small
> relative to the size of the array. Therefore, it would greatly improve
> performance to avoid testing ALL elements against the conditional once one
> is found that returns True. However, all built-in functions that I know of
> test the entire array.
> >
> > One can obviously jury-rig some ways, like for instance create a "for"
> loop over non-overlapping slices of length slice_length and call something
> like np.where(cond) on each--that outer "for" loop is much faster than a
> loop over individual elements, and the inner loop at most will go
> slice_length-1 elements past the first "hit". However, needing to use such
> a convoluted piece of code for such a simple task seems to go against the
> Numpy spirit of having one operation being one function of the form
> func(arr)".
> >
> > A proposed function for this, let's call it "np.first_true(arr,
> start_idx, [stop_idx])" would be best implemented at the C code level,
> possibly in the same code file that defines np.where. I'm wondering if I,
> or someone else, were to write such a function, if the Numpy developers
> would consider merging it as a standard part of the codebase. It's possible
> that the idea of such a function is bad because it would violate some
> existing broadcasting or fancy indexing rules. Clearly one could make it
> possible to pass an "axis" argument to np.first_true() that would select an
> axis to search over in the case of multi-dimensional arrays, and then the
> result would be an array of indices of one fewer dimension than the
> original array. So np.first_true(np.array([1,5],[2,7],[9,10],cond) would
> return [1,1,0] for cond(x): x>4. The case where no elements satisfy the
> condition would need to return a "signal value" like -1. But maybe there
> are some weird cases where there isn't a sensible return val
>  ue, hence why such a function has not been added.
> >
> > -Andrew Rosko
> > _______________________________________________
> > NumPy-Discussion mailing list -- numpy-discussion@python.org
> > To unsubscribe send an email to numpy-discussion-le...@python.org
> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> > Member address: dom.grigo...@gmail.com
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: rosk...@gmail.com
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com
>
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: lev.maxi...@gmail.com
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: j...@fastmail.com
>
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: j...@fastmail.com
>
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: ndbeck...@gmail.com
>


-- 
*Those who don't understand recursion are doomed to repeat it*

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Reply via email to