[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Bill Ross Tue, 31 Oct 2023 16:18:24 -0700

Could a layer below Python spin up extra threads (or processes)? A search
like this could benefit easily.


I switched back to Java for number crunching because one gets to share
memory without using OS-supplied shared memory. Maybe put a JVM behind
python, or do python in the JVM?

Bill 

--

Phobrain.com 

On 2023-10-31 16:05, Juan Nunez-Iglesias wrote:

> If you add a layer of indirection with Numba you can get a *very* nice API: 
> 
> import numba 
> import numpy as np 
> 
> @numba.njit 
> def _first(arr, pred): 
>     for i, elem in enumerate(arr): 
>         if pred(elem): 
>             return i 
> 
> def first(arr, pred): 
>     _pred = numba.njit(pred) 
>     return _first(arr, _pred) 
> 
> This even works with lambdas! (TIL, thanks Numba devs!) 
> 
> >>> first(np.random.random(10_000_000), lambda x: x > 0.99) 
> 215 
> 
> Since Numba has ufunc support I don't suppose it would be hard to make it 
> work with an axis= argument, but I've never played with that API myself. 
> 
> On Tue, 31 Oct 2023, at 6:49 PM, Lev Maximov wrote: 
> 
> I've implemented such functions in Cython and packaged them into a library 
> called numpy_illustrated [1] 
> 
> It exposes the following functions: 
> 
> find(a, v)          # returns the index of the first occurrence of v in a 
> first_above(a, v)   # returns the index of the first element in a that is strictly above v 
> first_nonzero(a)    # returns the index of the first nonzero element 
> 
> They scan the array and bail out as soon as a match is found. They give a 
> significant performance gain when the element to be found is close to the 
> beginning of the array, and run at roughly the same speed as the alternative 
> methods when the value is missing. 
> 
> The complete signatures of the functions look like this: 
> 
> find(a, v, rtol=1e-05, atol=1e-08, sorted=False, default=-1, raises=False)
> first_above(a, v, sorted=False, missing=-1, raises=False) 
> first_nonzero(a, missing=-1, raises=False) 
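
A minimal pure-Python sketch of what the first_above signature above appears to describe. This is an illustration only and not numpy_illustrated's Cython code: the name first_above_sketch is hypothetical, the `sorted` parameter is left out, and the behaviour of `missing` and `raises` (including the choice of ValueError) is an assumption inferred from the parameter names.

```python
import numpy as np

def first_above_sketch(a, v, missing=-1, raises=False):
    """Index of the first element of 1-D `a` that is strictly above `v`.

    Illustration only; the real library is compiled and much faster.
    """
    for i, x in enumerate(np.asarray(a)):
        if x > v:
            return i  # bail out at the first match
    if raises:
        # Assumption: the exception type is a guess, not the library's.
        raise ValueError(f"no element above {v!r}")
    return missing

print(first_above_sketch([1, 5, 2, 7], 4))      # 1
print(first_above_sketch([1, 2], 4))            # -1
```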
> 
> This covers the most common use cases and does not accept Python callbacks 
> because accepting them would nullify any speed gain 
> one would expect from such a function. A Python callback can be implemented 
> with Numba, but anyone who can write the callback 
> in Numba has no need for a library that wraps it into a dedicated function. 
> 
> The library has 100% test coverage and follows the 'black' code style. It 
> should be easy to add functions like 'first_below' if necessary. 
> 
> A more detailed description of these functions can be found here [2]. 
> 
> Best regards, 
> Lev Maximov 
> 
> On Tue, Oct 31, 2023 at 3:50 AM Dom Grigonis <dom.grigo...@gmail.com> wrote: 
> 
> I juggled a bit and found a pretty nice solution using numba. It is probably 
> not very robust, but it proves that such a thing can be optimised while 
> retaining flexibility. Check whether it works for your use cases and let me 
> know if anything fails or if it is slow compared to what you used. 
> 
> import ast
> 
> import numba as nb
> import numpy as np
> 
> # Template; the `cond(x)` test is replaced by the user's parsed expression.
> first_true_str = """
> def first_true(arr, n):
>     result = np.full((n, arr.shape[1]), -1, dtype=np.int32)
>     for j in range(arr.shape[1]):
>         k = 0
>         for i in range(arr.shape[0]):
>             x = arr[i:i + 1, j]
>             if cond(x):
>                 result[k, j] = i
>                 k += 1
>                 if k >= n:
>                     break
>     return result
> """
> 
> class FirstTrue:
>     CONTEXT = {'np': np}
> 
>     def __init__(self, expr):
>         self.expr = expr
>         self.expr_ast = ast.parse(expr, mode='exec').body[0].value
>         self.func_ast = ast.parse(first_true_str, mode='exec')
>         # Swap the placeholder test of `if cond(x):` for the expression.
>         self.func_ast.body[0].body[1].body[1].body[1].test = self.expr_ast
>         self.func_cmp = compile(self.func_ast, filename="<ast>", mode="exec")
>         exec(self.func_cmp, self.CONTEXT)
>         self.func_nb = nb.njit(self.CONTEXT[self.func_ast.body[0].name])
> 
>     def __call__(self, arr, n=1, axis=None):
>         # PREPARE INPUTS
>         in_1d = False
>         if axis is None:
>             arr = np.ravel(arr)[:, None]
>             in_1d = True
>         elif axis == 0:
>             if arr.ndim == 1:
>                 in_1d = True
>                 arr = arr[:, None]
>         else:
>             raise ValueError('axis ~in (None, 0)')
>         res = self.func_nb(arr, n)
>         if in_1d:
>             res = res[:, 0]
>         return res
> 
> if __name__ == '__main__':
>     arr = np.arange(125).reshape((5, 5, 5))
>     ft = FirstTrue('np.sum(x) > 30')
>     print(ft(arr, n=2, axis=0))
> 
> [[1 0 0 0 0]
>  [2 1 1 1 1]]
> 
> In [16]: %timeit ft(arr, 2, axis=0)
> 1.31 µs ± 3.94 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> 
> Regards, 
> DG 
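
The core trick in the message above, splicing a parsed expression into a function template through the AST, can be sketched without Numba. This is a simplified illustration under stated assumptions, not DG's code: the template, the names TEMPLATE, first_index and build_first_index are hypothetical, and it handles plain 1-D sequences only. Because the expression string is compiled and executed, it should only come from trusted input.

```python
import ast

# Hypothetical template: `cond(x)` is a placeholder that gets replaced.
TEMPLATE = """
def first_index(values):
    for i, x in enumerate(values):
        if cond(x):
            return i
    return -1
"""

def build_first_index(expr):
    """Build a function returning the first index where `expr` (in x) holds."""
    expr_node = ast.parse(expr, mode="eval").body   # the bare expression node
    tree = ast.parse(TEMPLATE, mode="exec")
    func_def = tree.body[0]          # def first_index(...)
    for_loop = func_def.body[0]      # for i, x in enumerate(values)
    if_stmt = for_loop.body[0]       # if cond(x)
    if_stmt.test = expr_node         # swap the placeholder for the expression
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, filename="<ast>", mode="exec"), namespace)
    return namespace["first_index"]

first_gt_3 = build_first_index("x > 3")
print(first_gt_3([1, 2, 5, 7]))   # 2
print(first_gt_3([1, 2, 3]))      # -1
```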
> 
> On 29 Oct 2023, at 23:18, rosko37 <rosk...@gmail.com> wrote: 
> 
> An example with a 1-D array (where it is easiest to see what I mean) is the 
> following. I will follow Dom Grigonis's suggestion that the range not be 
> provided as a separate argument, as it can be just as easily "folded into" 
> the array by passing a slice. So it becomes just: 
> idx = first_true(arr, cond) 
> 
> As Dom also points out, the "cond" would likely need to be a "function 
> pointer" (i.e., the name of a function defined elsewhere, turning first_true 
> into a higher-order function), unless there's some way to pass a parseable 
> expression for simple cases. A few special cases like the first zero/nonzero 
> element could be handled with dedicated options (sort of like matplotlib 
> colors), but for anything beyond that it gets unwieldy fast. 
> 
> So let's say we have this: 
> ****************** 
> 
> def cond(x): 
>     return x > 50 
> 
> search_arr = np.exp(np.arange(0,1000)) 
> 
> print(np.first_true(search_arr, cond)) 
> ******************* 
> 
> This should print 4, because the element of search_arr at index 4 (i.e. the 
> 5th element) is e^4, which is slightly greater than 50 (while e^3 is less 
> than 50). It should return this _without testing the 6th through 1000th 
> elements of the array at all to see whether they exceed 50 or not_. This 
> example is rather contrived, because simply taking the natural log of 50 and 
> rounding up is far superior, not even _evaluating the array of exponentials_ 
> (which my example clearly still does--and in the use cases I've had for such 
> a function, I can't predict the array elements like this--they come from 
> loaded data, the output of a simulation, etc., and are all already in a numpy 
> array). And in this case, since the values are strictly increasing, 
> np.searchsorted() would work as well. But it illustrates the idea. 
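
For the sorted special case mentioned above, a short sketch of both shortcuts. It uses only 20 exponentials rather than the 1000 in the example (an assumption made here so the values stay finite); np.searchsorted with side="right" gives the index of the first element strictly greater than the threshold.

```python
import math

import numpy as np

# Shorter than the 1000-element example so that np.exp does not overflow.
search_arr = np.exp(np.arange(0, 20))

# Binary search on the strictly increasing array: first element > 50.
idx = int(np.searchsorted(search_arr, 50, side="right"))
print(idx)                       # 4

# Closed form for this contrived case: smallest integer k with e**k > 50.
print(math.ceil(math.log(50)))   # 4
```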
> 
> On Thu, Oct 26, 2023 at 5:54 AM Dom Grigonis <dom.grigo...@gmail.com> wrote: 
> Could you please give a concise example? I know you have provided one, but it 
> is buried deep in verbose text and has some typos in it, which makes it hard 
> to understand exactly which inputs should produce which output. 
> 
> Regards, 
> DG 
> 
>> On 25 Oct 2023, at 22:59, rosko37 <rosk...@gmail.com> wrote: 
>> 
>> I know this question has been asked before, both on this list as well as 
>> several threads on Stack Overflow, etc. It's a common issue. I'm NOT asking 
>> for how to do this using existing Numpy functions (as that information can 
>> be found in any of those sources)--what I'm asking is whether Numpy would 
>> accept inclusion of a function that does this, or whether (possibly more 
>> likely) such a proposal has already been considered and rejected for some 
>> reason. 
>> 
>> The task is this--there's a large array and you want to find the next 
>> element after some index that satisfies some condition. Such elements are 
>> common, and the typical number of elements to be searched through is small 
>> relative to the size of the array. Therefore, it would greatly improve 
>> performance to avoid testing ALL elements against the conditional once one 
>> is found that returns True. However, all built-in functions that I know of 
>> test the entire array. 
>> 
>> One can obviously jury-rig a workaround, for instance by creating a "for" 
>> loop over non-overlapping slices of length slice_length and calling 
>> something like np.where(cond) on each--that outer "for" loop is much faster 
>> than a loop over individual elements, and the inner loop will go at most 
>> slice_length-1 elements past the first "hit". However, needing to use such a 
>> convoluted piece of code for such a simple task seems to go against the 
>> Numpy spirit of having one operation being one function of the form 
>> "func(arr)". 
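
The slice-by-slice workaround described above might look roughly like this. It is a sketch under stated assumptions: the name first_true_chunked, the default chunk size of 4096 and the -1 sentinel are illustrative choices, the array is assumed 1-D, and `cond` is assumed to be vectorized (array in, boolean array out).

```python
import numpy as np

def first_true_chunked(arr, cond, start=0, chunk=4096):
    """Index of the first element at or after `start` where cond is True.

    Returns -1 if no element matches. Only whole chunks are evaluated,
    so at most chunk-1 elements past the first hit are tested.
    """
    arr = np.asarray(arr)
    for lo in range(start, arr.shape[0], chunk):
        hits = np.flatnonzero(cond(arr[lo:lo + chunk]))
        if hits.size:
            return lo + int(hits[0])   # stop at the first chunk with a hit
    return -1

data = np.arange(100_000)
print(first_true_chunked(data, lambda x: x > 50))            # 51
print(first_true_chunked(data, lambda x: x > 50, start=60))  # 60
```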
>> 
>> A proposed function for this, let's call it "np.first_true(arr, start_idx, 
>> [stop_idx])" would be best implemented at the C code level, possibly in the 
>> same code file that defines np.where. I'm wondering if I, or someone else, 
>> were to write such a function, if the Numpy developers would consider 
>> merging it as a standard part of the codebase. It's possible that the idea 
>> of such a function is bad because it would violate some existing 
>> broadcasting or fancy indexing rules. Clearly one could make it possible to 
>> pass an "axis" argument to np.first_true() that would select an axis to 
>> search over in the case of multi-dimensional arrays, and then the result 
>> would be an array of indices of one fewer dimension than the original array. 
>> So np.first_true(np.array([[1,5],[2,7],[9,10]]), cond) would return [1,1,0] 
>> for cond(x): x>4. The case where no elements satisfy the condition would 
>> need to return a "signal value" like -1. But maybe there are some weird 
>> cases where there isn't a sensible return value, hence why such a function 
>> has not been added. 
>> 
>> -Andrew Rosko 
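
The return shape and the -1 signal value proposed above can be illustrated with existing NumPy calls. Note that this sketch evaluates the condition on the whole array, so it shows only the intended output, not the short-circuiting that motivates the proposal. The name first_true_along_axis and the `missing` parameter are hypothetical, and `cond` is assumed to be vectorized.

```python
import numpy as np

def first_true_along_axis(arr, cond, axis=-1, missing=-1):
    """First index along `axis` where cond holds, or `missing` if none."""
    mask = cond(np.asarray(arr))
    idx = mask.argmax(axis=axis)     # first True along axis (0 if none)
    found = mask.any(axis=axis)      # distinguishes "index 0" from "no match"
    return np.where(found, idx, missing)

a = np.array([[1, 5], [2, 7], [9, 10]])
print(first_true_along_axis(a, lambda x: x > 4, axis=1))   # [1 1 0]
```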
>> _______________________________________________ 
>> NumPy-Discussion mailing list -- numpy-discussion@python.org 
>> To unsubscribe send an email to numpy-discussion-le...@python.org 
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>> Member address: dom.grigo...@gmail.com 
> 

Links:
------
[1] https://pypi.org/project/numpy-illustrated/
[2] https://betterprogramming.pub/the-numpy-illustrated-library-7531a7c43ffb?sk=8dd60bfafd6d49231ac76cb148a4d16f
