On Sun, 2025-03-23 at 10:17 +0800, Fang Zhang wrote:
> Sebastian,
> 
> I think the core issue you pointed out is indeed correct, but your
> detailed explanation is backwards, since `maximum(arr[:, 0], arr[:, 1])`
> implements


Ah, right, I explained things thinking of axis=1, thanks!  As you said
the arguments should be identical, but reversed.
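To make the axis mix-up concrete, here is a quick sketch (not tied to NumPy internals) showing which reduction the pairwise form actually computes, and what the bbox/transpose method computes instead:

```python
import numpy as np

arr = np.random.random((1000, 2))

# Pairwise maximum of the two columns reduces along axis=1
# (one value per row/point):
assert np.array_equal(np.maximum(arr[:, 0], arr[:, 1]), arr.max(axis=1))

# The bbox case reduces along axis=0 (one value per coordinate),
# which the transpose method computes one column at a time:
assert np.array_equal(
    np.array([arr.T[0].max(), arr.T[1].max()]),
    arr.max(axis=0),
)
```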


> `arr.max(axis=1)` instead of `arr.max(axis=0)`. So OP's transpose
> method is essentially approach 1, which for this array shape has less
> overhead than approach 2 (and since 2 is small enough, it doesn't
> matter that we are paying this overhead in Python instead of in C).
> 
> From my tests, when n = 1_000_000, the advantage of the transpose
> method persists up to m = 6. But for smaller n the normal method
> becomes better again, so it's not that clear cut (although it would be
> nice to have a way to use the transpose method but pay the overhead in
> C instead of Python!)
> 
> Test code:
> 
> import numpy as np
> from IPython import get_ipython
> 
> def calculate_bbox_normal(vertices: np.ndarray):
>     return vertices.min(axis=0), vertices.max(axis=0)
> 
> 
> def calculate_bbox_transpose(vertices: np.ndarray):
>     return (np.array([col.min() for col in vertices.T]),
>             np.array([col.max() for col in vertices.T]))
> 
> n = 1_000_000
> for m in range(4, 9):
>     print(f"shape: {n, m}")
>     vertices = np.random.random((n, m))
>     vmin, vmax = calculate_bbox_normal(vertices)
>     vmin_t, vmax_t = calculate_bbox_transpose(vertices)
>     assert (vmin == vmin_t).all()
>     assert (vmax == vmax_t).all()
>     print("Normal:    ", end="")
>     get_ipython().run_line_magic("timeit", "calculate_bbox_normal(vertices)")
>     print("Transpose: ", end="")
>     get_ipython().run_line_magic("timeit", "calculate_bbox_transpose(vertices)")
>     print()
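The wish above ("pay the overhead in C") can be approximated today by making the columns contiguous once, so the axis=0 reduction walks each column linearly in memory. This is only a sketch: the copy itself costs O(n*m), so it pays off only if the array is reduced repeatedly or can be built in F-order to begin with.

```python
import numpy as np

vertices = np.random.random((100_000, 6))

# Copy once so each column is contiguous in memory; afterwards the
# plain axis=0 reduction iterates each column without striding.
vertices_f = np.asfortranarray(vertices)
vmin, vmax = vertices_f.min(axis=0), vertices_f.max(axis=0)

# Same result as reducing the original C-order array:
assert np.array_equal(vmin, vertices.min(axis=0))
assert np.array_equal(vmax, vertices.max(axis=0))
```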
> 
> - Fang
> 
> On Sat, Mar 22, 2025 at 7:10 PM Sebastian Berg <sebast...@sipsolutions.net> wrote:
> 
> > On Fri, 2025-03-21 at 23:22 +0100, Tiziano Zito via NumPy-Discussion wrote:
> > > Hi George,
> > > 
> > > what you see is due to the memory layout of numpy arrays. If you
> > > switch your array to F-order you'll see that the two functions have
> > > the same timings, i.e. both are fast (on my machine 25 times faster
> > > for the 1_000_000 points case).
> > 
> > 
> > Memory order is indeed very important, but it should be indirectly so!
> > 
> > The thing is that NumPy tries to re-order most operations for memory
> > access order (nothing particularly advanced).
> > 
> > I.e. NumPy can calculate this type of operation in two ways (the
> > inner loop is always 1-D, so in C the first, outer, loop here is separate):
> > 1. for i in arr.shape[0]: res[i] = arr[i].max()
> > 2. for i in arr.shape[1]: res[:] = maximum(res, arr[:, i])
> >    (for i == 0, you would just copy `arr[:, 0]` or initialize `res` first)
> > 
> > NumPy chooses the first form here, precisely because it is (normally)
> > better for the memory layout.
> > But for approach 1, `arr[i].max()` is actually a function call inside
> > NumPy (deep in C, not on an actual NumPy array object, of course).
> > 
> > So for `arr.shape[1] == 2`, that just gets in the way because of the
> > overheads!  Writing it as 2. is much better (memory order doesn't
> > even matter); calculating it as `maximum(arr[:, 0], arr[:, 1])` would
> > be best.
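Written out in Python, the two loop orders look like this (for the axis=1 case the quoted text had in mind; a sketch of the loop structure, not NumPy's actual C code):

```python
import numpy as np

arr = np.random.random((100_000, 2))

# Approach 1: one reduction call per row -- n tiny max() calls.
# (Only a small slice is shown; doing this in Python for every row
# would be very slow, which is the overhead being discussed.)
res1 = np.array([row.max() for row in arr[:100]])

# Approach 2: a single pairwise maximum over the two columns --
# one C-level call covering all n elements at once.
res2 = np.maximum(arr[:, 0], arr[:, 1])

assert np.array_equal(res2, arr.max(axis=1))
assert np.array_equal(res1, arr.max(axis=1)[:100])
```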
> > 
> > But even for other small values of `arr.shape[1]` it would probably
> > be better to use the second version; memory order will start to
> > matter more and more (I could imagine that even for arr.shape[1] == 3
> > or 4 it quickly dominates for huge arrays, but I didn't try).
> > 
> > So what you see should be the overheads of using approach 1.  Taking
> > the maximum of 2 elements is simply fast compared to even a
> > lightweight outer for loop and just calling the function to take the
> > maximum.  (And the core function additionally does a bit of extra
> > work to optimize for many elements [1].)
> > You really don't need large inner loops to hide most of those
> > overheads, but 2 elements just isn't enough.
> > 
> > Anyway, of course it could be cool to add a heuristic to pick
> > approach 2 here when we obviously have `arr.shape[1] == 2`, or maybe
> > `arr.shape[1] < small_value`.
> > There are also probably some low-hanging fruits to just reduce the
> > overheads here (for this specific case I know the outer iteration can
> > be optimized, although I am not sure it would improve things; also,
> > some inner-loop optimizations could have been moved out of the inner
> > loop for a few years now -- but only casts actually do so).
> > 
> > In the end, `maximum(arr[:, 0], arr[:, 1])` will always be the
> > fastest way for this specific thing, because it explicitly picks the
> > clearly best approach.
> > It would be good to close that gap at least a bit, though.
> > 
> > - Sebastian
> > 
> > 
> > [1] locally I can actually see a small speed-up if I pick less
> > optimized loops at run-time via `NPY_DISABLE_CPU_FEATURES=AVX2`.
> > 
> > > 
> > > Try:
> > > 
> > > vertices = np.array(np.random.random((n, 2)), order='F')
> > > 
> > > When your array doesn't fit in L1 cache anymore, either order 'C'
> > > or order 'F' becomes (much) more efficient depending on which
> > > dimension you are internally looping through.
> > > 
> > > You can read more about it here:
> > > https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays
> > > and
> > > https://numpy.org/doc/stable/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray
> > > 
> > > Hope that helps,
> > > Tiziano
> > > 
> > > 
> > > On Fri 21 Mar, 12:10 +0200, George Tsiamasiotis via NumPy-Discussion
> > > <numpy-discussion@python.org> wrote:
> > > > Hello NumPy community!
> > > > 
> > > > I was writing a function that calculates the bounding box of a
> > > > polygon (the smallest rectangle that fully contains the polygon,
> > > > and whose sides are parallel to the x and y axes). The input is an
> > > > (N, 2) array containing the vertices of the polygon, and the
> > > > output is a 4-tuple containing the coordinates of the 2 corners
> > > > of the bounding box.
> > > > 
> > > > I found two ways to do that using the np.min() and np.max()
> > > > methods, and I was surprised to see a significant speed
> > > > difference, even though they seemingly do the same thing.
> > > > 
> > > > While for small N the speed is essentially the same, the
> > > > difference becomes noticeable for larger N. From my testing, the
> > > > difference seems to plateau, with the one way being around 4-5
> > > > times faster than the other.
> > > > 
> > > > Is there an explanation for this?
> > > > 
> > > > Here is a small benchmark I wrote (must be executed with IPython):
> > > > 
> > > > import numpy as np
> > > > from IPython import get_ipython
> > > > 
> > > > vertices = np.random.random((1000, 2))
> > > > 
> > > > def calculate_bbox_normal(vertices: np.ndarray) -> tuple[np.float64, ...]:
> > > >     xmin, ymin = vertices.min(axis=0)
> > > >     xmax, ymax = vertices.max(axis=0)
> > > >     return xmin, ymin, xmax, ymax
> > > > 
> > > > def calculate_bbox_transpose(vertices: np.ndarray) -> tuple[np.float64, ...]:
> > > >     xmin = vertices.T[0].min()
> > > >     xmax = vertices.T[0].max()
> > > >     ymin = vertices.T[1].min()
> > > >     ymax = vertices.T[1].max()
> > > >     return xmin, ymin, xmax, ymax
> > > > 
> > > > bbox_normal = calculate_bbox_normal(vertices)
> > > > bbox_transpose = calculate_bbox_transpose(vertices)
> > > > 
> > > > print(f"Equality: {bbox_normal == bbox_transpose}")
> > > > 
> > > > for n in [10, 100, 1000, 10_000, 100_000, 1_000_000]:
> > > >     print(f"Number of points: {n}")
> > > >     vertices = np.random.random((n, 2))
> > > >     print("Normal:    ", end="")
> > > >     get_ipython().run_line_magic("timeit", "calculate_bbox_normal(vertices)")
> > > >     print("Transpose: ", end="")
> > > >     get_ipython().run_line_magic("timeit", "calculate_bbox_transpose(vertices)")
> > > >     print()
> > > > 
> > > 
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list -- numpy-discussion@python.org
> > > > To unsubscribe send an email to
> > > > numpy-discussion-le...@python.org
> > > > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> > > > Member address: opossumn...@gmail.com
> > > 
> > 
> > 

