On May 9, 2017 9:47 AM, "Martin Spacek" <nu...@mspacek.mm.st> wrote:
Hello, I've opened up a pull request to add a function called np.search(), or something like it, to complement np.searchsorted(): https://github.com/numpy/numpy/pull/9055 There's also this issue I opened before starting the PR: https://github.com/numpy/numpy/issues/9052 Proposed API changes require discussion on the list, so here I am! This proposed function (and perhaps array method?) does the same as np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly checks if all the values in `v` are a subset of those in `a`. If not, it currently raises an error, but that could be controlled via a kwarg. As I mentioned in the PR, I often find myself abusing np.searchsorted() by not explicitly checking these assumptions. The temptation to use it is great, because it's such a fast and convenient function, and most of the time that I use it, the assumptions are indeed valid. Explicitly checking those assumptions each and every time before I use np.searchsorted() is tedious, and easy to forget to do. I wouldn't be surprised if many others abuse np.searchsorted() in the same way. It's worth noting though that the "sorted" part is a critical part of what makes it fast. If we're looking for k needles in an n-item haystack, then: If the haystack is already sorted and we know it, using searchsorted does it in k*log2(n) comparisons. (Could be reduced to average case O(k log log n) for simple scalars by using interpolation search, but I don't think searchsorted is that clever atm.) If the haystack is not sorted, then sorting it and then using searchsorted requires a total of O(n log n) + k*log2(n) comparisons. And if the haystack is not sorted, then doing linear search to find the first item like list.index does requires on average 0.5*k*n comparisons. This analysis ignores memory effects, which are important -- linear memory access is faster than random access, and searchsorted is all about making memory access maximally unpredictable. But even so, I think sorting-then-searching will be reasonably competitive pretty much from the start, and for moderately large k and n values the difference between (n + k)*log(n) and n*k is huge. Another issue is that sorting requires an O(n)-sized temporary buffer (assuming you can't mutate the haystack in place). But if your haystack is a large enough fraction of memory that you can't afford is buffer, then it's likely large enough that you can't afford linear searching either... Looking at my own habits and uses, it seems to me that finding the indices of matching values of one array in another is a more common use case than finding insertion indices of one array into another sorted array. So, I propose that np.search(), or something like it, could be even more useful than np.searchsorted(). My main concern here would be creating a trap for the unwary, where people use search() naively because it's so nice and convenient, and then eventually get surprised by a nasty quadratic slowdown. There's a whole blog about these traps :-) https://accidentallyquadratic.tumblr.com/ Otoh there are also huge number of numpy use cases where it doesn't matter if some calculation is 1000x slower than it should be, as long as it works and is discoverable... So it sounds like one obvious thing would be to have a version of searchsorted that checks for matches (maybe side="exact"? Though that's not easy to find...). That's clearly useful, and orthogonal to the linear/binary search issue, so we shouldn't make it a reason people are tempted to choose the inferior algorithm. ...ok, how's this for a suggestion. Give np.search a strategy= kwarg, with options "linear", "searchsorted", and "auto". Linear does the obvious thing, searchsorted generates a sorter array using argsort (unless the user provided one) and then calls searchsorted, and auto picks one of them depending on whether a sorter array was provided and how large the arrays are. The default is auto. In all cases it looks for exact matches. I guess by default "not found" should be signaled with an exception, and then there should be some option to have it return a sentinel value instead? The problem is that since we're returning integers then there's no sentinel value that's necessarily an invalid index (e.g. indexing will happily accept -1). -n
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion