[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Dom Grigonis Tue, 31 Oct 2023 16:44:54 -0700

This results in a very slow code. The function calls of 
_pred = numba.njit(pred)
are expensive and this sort of approach will be comparable to pure python 
functions.


This is only recommended for sourcing functions that are not called frequently, 
but rather have a large computational content within them. In other words not 
suitable for predicates.

Regards,
DG

> On 1 Nov 2023, at 01:05, Juan Nunez-Iglesias <j...@fastmail.com> wrote:
> 
> If you add a layer of indirection with Numba you can get a *very* nice API:
> 
> @numba.njit
> def _first(arr, pred):
>     for i, elem in enumerate(arr):
>         if pred(elem):
>             return i
> 
> def first(arr, pred):
>     _pred = numba.njit(pred)
>     return _first(arr, _pred)
> 
> This even works with lambdas! (TIL, thanks Numba devs!)
> 
> >>> first(np.random.random(10_000_000), lambda x: x > 0.99)
> 215
> 
> Since Numba has ufunc support I don't suppose it would be hard to make it 
> work with an axis= argument, but I've never played with that API myself.
> 
> On Tue, 31 Oct 2023, at 6:49 PM, Lev Maximov wrote:
>> I've implemented such functions in Cython and packaged them into a library 
>> called numpy_illustrated <https://pypi.org/project/numpy-illustrated/>
>> 
>> It exposes the following functions:
>> 
>> find(a, v)  # returns the index of the first occurrence of v in a
>> first_above(a, v)   # returns the index of the first element in a that is 
>> strictly above v
>> first_nonzero(a)   # returns the index of the first nonzero element
>> 
>> They scan the array and bail out immediately once the match is found. Have a 
>> significant performance gain if the element to be
>> found is closer to the beginning of the array. Have roughly the same speed 
>> as alternative methods if the value is missing.
>> 
>> The complete signatures of the functions look like this:
>> 
>> find(a, v, rtol=1e-05, atol=1e-08, sorted=False, default=-1, raises=False)
>> first_above(a, v, sorted=False, missing=-1, raises=False)
>> first_nonzero(a, missing=-1, raises=False)
>> 
>> This covers the most common use cases and does not accept Python callbacks 
>> because accepting them would nullify any speed gain
>> one would expect from such a function. A Python callback can be implemented 
>> with Numba, but anyone who can write the callback
>> in Numba has no need for a library that wraps it into a dedicated function.
>> 
>> The library has a 100% test coverage. Code style 'black'. It should be easy 
>> to add functions like 'first_below' if necessary.
>> 
>> A more detailed description of these functions can be found here 
>> <https://betterprogramming.pub/the-numpy-illustrated-library-7531a7c43ffb?sk=8dd60bfafd6d49231ac76cb148a4d16f>.
>> 
>> Best regards,
>>   Lev Maximov
>> 
>> On Tue, Oct 31, 2023 at 3:50 AM Dom Grigonis <dom.grigo...@gmail.com 
>> <mailto:dom.grigo...@gmail.com>> wrote:
>> I juggled a bit and found pretty nice solution using numba. Which is 
>> probably not very robust, but proves that such thing can be optimised while 
>> retaining flexibility. Check if it works for your use cases and let me know 
>> if anything fails or if it is slow compared to what you used.
>> 
>> 
>> 
>> first_true_str = """
>> def first_true(arr, n):
>>     result = np.full((n, arr.shape[1]), -1, dtype=np.int32)
>>     for j in range(arr.shape[1]):
>>         k = 0
>>         for i in range(arr.shape[0]):
>>             x = arr[i:i + 1, j]
>>             if cond(x):
>>                 result[k, j] = i
>>                 k += 1
>>                 if k >= n:
>>                     break
>>     return result
>> """
>> 
>> 
>> class FirstTrue:
>>     CONTEXT = {'np': np}
>> 
>>     def __init__(self, expr):
>>         self.expr = expr
>>         self.expr_ast = ast.parse(expr, mode='exec').body[0].value
>>         self.func_ast = ast.parse(first_true_str, mode='exec')
>>         self.func_ast.body[0].body[1].body[1].body[1].test = self.expr_ast
>>         self.func_cmp = compile(self.func_ast, filename="<ast>", mode="exec")
>>         exec(self.func_cmp, self.CONTEXT)
>>         self.func_nb = nb.njit(self.CONTEXT[self.func_ast.body[0].name])
>> 
>>     def __call__(self, arr, n=1, axis=None):
>>         # PREPARE INPUTS
>>         in_1d = False
>>         if axis is None:
>>             arr = np.ravel(arr)[:, None]
>>             in_1d = True
>>         elif axis == 0:
>>             if arr.ndim == 1:
>>                 in_1d = True
>>                 arr = arr[:, None]
>>         else:
>>             raise ValueError('axis ~in (None, 0)')
>>         res = self.func_nb(arr, n)
>>         if in_1d:
>>             res = res[:, 0]
>>         return res
>> 
>> 
>> if __name__ == '__main__':
>>     arr = np.arange(125).reshape((5, 5, 5))
>>     ft = FirstTrue('np.sum(x) > 30')
>>     print(ft(arr, n=2, axis=0))
>> 
>> [[1 0 0 0 0]
>>  [2 1 1 1 1]]
>> 
>> 
>> In [16]: %timeit ft(arr, 2, axis=0)
>> 1.31 µs ± 3.94 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
>> 
>> Regards,
>> DG
>> 
>>> On 29 Oct 2023, at 23:18, rosko37 <rosk...@gmail.com 
>>> <mailto:rosk...@gmail.com>> wrote:
>>> 
>>> An example with a 1-D array (where it is easiest to see what I mean) is the 
>>> following. I will follow Dom Grigonis's suggestion that the range not be 
>>> provided as a separate argument, as it can be just as easily "folded into" 
>>> the array by passing a slice. So it becomes just:
>>> idx = first_true(arr, cond)
>>> 
>>> As Dom also points out, the "cond" would likely need to be a "function 
>>> pointer" (i.e., the name of a function defined elsewhere, turning 
>>> first_true into a higher-order function), unless there's some way to pass a 
>>> parseable expression for simple cases. A few special cases like the first 
>>> zero/nonzero element could be handled with dedicated options (sort of like 
>>> matplotlib colors), but for anything beyond that it gets unwieldy fast.
>>> 
>>> So let's say we have this:
>>> ******************
>>> def cond(x):
>>>     return x>50
>>> 
>>> search_arr = np.exp(np.arange(0,1000))
>>> 
>>> print(np.first_true(search_arr, cond))
>>> *******************
>>> 
>>> This should print 4, because the element of search_arr at index 4 (i.e. the 
>>> 5th element) is e^4, which is slightly greater than 50 (while e^3 is less 
>>> than 50). It should return this without testing the 6th through 1000th 
>>> elements of the array at all to see whether they exceed 50 or not. This 
>>> example is rather contrived, because simply taking the natural log of 50 
>>> and rounding up is far superior, not even evaluating the array of 
>>> exponentials (which my example clearly still does--and in the use cases 
>>> I've had for such a function, I can't predict the array elements like 
>>> this--they come from loaded data, the output of a simulation, etc., and are 
>>> all already in a numpy array). And in this case, since the values are 
>>> strictly increasing, search_sorted() would work as well. But it illustrates 
>>> the idea.
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Oct 26, 2023 at 5:54 AM Dom Grigonis <dom.grigo...@gmail.com 
>>> <mailto:dom.grigo...@gmail.com>> wrote:
>>> Could you please give a concise example? I know you have provided one, but 
>>> it is engrained deep in verbose text and has some typos in it, which makes 
>>> hard to understand exactly what inputs should result in what output.
>>> 
>>> Regards,
>>> DG
>>> 
>>> > On 25 Oct 2023, at 22:59, rosko37 <rosk...@gmail.com 
>>> > <mailto:rosk...@gmail.com>> wrote:
>>> > 
>>> > I know this question has been asked before, both on this list as well as 
>>> > several threads on Stack Overflow, etc. It's a common issue. I'm NOT 
>>> > asking for how to do this using existing Numpy functions (as that 
>>> > information can be found in any of those sources)--what I'm asking is 
>>> > whether Numpy would accept inclusion of a function that does this, or 
>>> > whether (possibly more likely) such a proposal has already been 
>>> > considered and rejected for some reason.
>>> > 
>>> > The task is this--there's a large array and you want to find the next 
>>> > element after some index that satisfies some condition. Such elements are 
>>> > common, and the typical number of elements to be searched through is 
>>> > small relative to the size of the array. Therefore, it would greatly 
>>> > improve performance to avoid testing ALL elements against the conditional 
>>> > once one is found that returns True. However, all built-in functions that 
>>> > I know of test the entire array. 
>>> > 
>>> > One can obviously jury-rig some ways, like for instance create a "for" 
>>> > loop over non-overlapping slices of length slice_length and call 
>>> > something like np.where(cond) on each--that outer "for" loop is much 
>>> > faster than a loop over individual elements, and the inner loop at most 
>>> > will go slice_length-1 elements past the first "hit". However, needing to 
>>> > use such a convoluted piece of code for such a simple task seems to go 
>>> > against the Numpy spirit of having one operation being one function of 
>>> > the form func(arr)".
>>> > 
>>> > A proposed function for this, let's call it "np.first_true(arr, 
>>> > start_idx, [stop_idx])" would be best implemented at the C code level, 
>>> > possibly in the same code file that defines np.where. I'm wondering if I, 
>>> > or someone else, were to write such a function, if the Numpy developers 
>>> > would consider merging it as a standard part of the codebase. It's 
>>> > possible that the idea of such a function is bad because it would violate 
>>> > some existing broadcasting or fancy indexing rules. Clearly one could 
>>> > make it possible to pass an "axis" argument to np.first_true() that would 
>>> > select an axis to search over in the case of multi-dimensional arrays, 
>>> > and then the result would be an array of indices of one fewer dimension 
>>> > than the original array. So 
>>> > np.first_true(np.array([1,5],[2,7],[9,10],cond) would return [1,1,0] for 
>>> > cond(x): x>4. The case where no elements satisfy the condition would need 
>>> > to return a "signal value" like -1. But maybe there are some weird cases 
>>> > where there isn't a sensible return val
>>>  ue, hence why such a function has not been added.
>>> > 
>>> > -Andrew Rosko
>>> > _______________________________________________
>>> > NumPy-Discussion mailing list -- numpy-discussion@python.org 
>>> > <mailto:numpy-discussion@python.org>
>>> > To unsubscribe send an email to numpy-discussion-le...@python.org 
>>> > <mailto:numpy-discussion-le...@python.org>
>>> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>> > <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>> > Member address: dom.grigo...@gmail.com <mailto:dom.grigo...@gmail.com>
>>> 
>>> _______________________________________________
>>> NumPy-Discussion mailing list -- numpy-discussion@python.org 
>>> <mailto:numpy-discussion@python.org>
>>> To unsubscribe send an email to numpy-discussion-le...@python.org 
>>> <mailto:numpy-discussion-le...@python.org>
>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>> Member address: rosk...@gmail.com <mailto:rosk...@gmail.com>
>>> _______________________________________________
>>> NumPy-Discussion mailing list -- numpy-discussion@python.org 
>>> <mailto:numpy-discussion@python.org>
>>> To unsubscribe send an email to numpy-discussion-le...@python.org 
>>> <mailto:numpy-discussion-le...@python.org>
>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>> Member address: dom.grigo...@gmail.com <mailto:dom.grigo...@gmail.com>
>> 
>> 
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org 
>> <mailto:numpy-discussion@python.org>
>> To unsubscribe send an email to numpy-discussion-le...@python.org 
>> <mailto:numpy-discussion-le...@python.org>
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>> Member address: lev.maxi...@gmail.com <mailto:lev.maxi...@gmail.com>
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org 
>> <mailto:numpy-discussion@python.org>
>> To unsubscribe send an email to numpy-discussion-le...@python.org 
>> <mailto:numpy-discussion-le...@python.org>
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>> Member address: j...@fastmail.com <mailto:j...@fastmail.com>
>> 
> 
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org 
> <mailto:numpy-discussion@python.org>
> To unsubscribe send an email to numpy-discussion-le...@python.org 
> <mailto:numpy-discussion-le...@python.org>
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
> Member address: dom.grigo...@gmail.com <mailto:dom.grigo...@gmail.com>

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Reply via email to