On Thu, 2021-11-11 at 01:04 +0100, Ilhan Polat wrote:
> Hmm not sure I understand the question but this is what I mean by naive
> looping, suppose I allocate a scratch register work3, then
>
>     for i in range(n):
>         for j in range(n):
>             work3[j*n + i] = work2[i*n + j]
>
NumPy does not end up doing anything special. Special would be a blocked
iteration, and NumPy unfortunately doesn't have one. The only thing it does
is use pointers to cut some overheads, something (very rough) like:

    ptr1 = arr1.data
    ptr2_col = arr2.data
    strides2_col = arr2.strides[0]
    strides2_row = arr2.strides[1]

    for i in range(n):
        ptr2 = ptr2_col
        for j in range(n):
            *ptr2 = *ptr1
            ptr1++
            ptr2 += strides2_row
        ptr2_col += strides2_col

And if you write that in Cython, you are likely faster since you can cut
quite a few corners (all is aligned, contiguous, etc.).  (With potential
loop-unrolling/compiler-optimization fluctuations; NumPy probably tells GCC
to unroll and optimize the innermost loop there.)

I would not be surprised if you can find a lightweight, fast copy-transpose
out there, or if tools like MKL/CUDA just include one. It is too bad NumPy
is missing it.

Cheers,

Sebastian
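As a rough illustration of the blocked iteration mentioned above, a
copy-transpose over cache-sized tiles might look like the following Cython
sketch. It assumes a square n-by-n, C-contiguous double array; the function
name and the tile size of 64 are placeholders, not anything NumPy itself
provides:

    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef void blocked_copy_transpose(double[:, ::1] src,
                                     double[:, ::1] dst) nogil:
        # Walk src in small square tiles so that the scattered writes into
        # dst stay within a cache-sized window instead of touching a new
        # cache line on every single store.
        cdef Py_ssize_t n = src.shape[0]
        cdef Py_ssize_t bs = 64        # tile edge; untuned guess
        cdef Py_ssize_t ib = 0, jb, i, j, imax, jmax
        while ib < n:
            imax = ib + bs
            if imax > n:
                imax = n
            jb = 0
            while jb < n:
                jmax = jb + bs
                if jmax > n:
                    jmax = n
                # Copy one bs-by-bs tile, transposed.
                for i in range(ib, imax):
                    for j in range(jb, jmax):
                        dst[j, i] = src[i, j]
                jb += bs
            ib += bs

The same double loop can of course be written directly on raw work buffers
with i*n + j style indexing; only the tiling matters for the cache
behaviour, and the best tile size would have to be measured.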
> This is basically doing the row-to-column based indexing, and obviously we
> create a lot of cache misses since work3 entries are accessed in a
> shuffled fashion. The idea of all this Cython attempt is to avoid such
> access; hence, if the original some_C_layout_func takes 10 units of time,
> 6 of it is spent on this loop when the data doesn't fit the cache. When I
> discard the correctness of the function, comment out this loop and then
> remeasure, the original func spends roughly 3 units of time. However, take
> any random array in C order in NumPy using regular Python and use
> np.asfortranarray() and it spends roughly 0.1 units of time. So apparently
> it is possible to do this somehow at the low level in a performant way.
> That's what I would like to understand, or clear up my misunderstanding.
>
> On Thu, Nov 11, 2021 at 12:56 AM Andras Deak <deak.and...@gmail.com> wrote:
>
> > On Thursday, November 11, 2021, Ilhan Polat <ilhanpo...@gmail.com> wrote:
> >
> > > I've asked this in the Cython mailing list but probably I should also
> > > get some feedback here too.
> > >
> > > I have the following function defined in Cython, using flat memory
> > > pointers to hold n by n array data.
> > >
> > >     cdef some_C_layout_func(double[:, :, ::1] Am) nogil:
> > >         # ...
> > >         cdef double *work1 = <double*>malloc(n*n*sizeof(double))
> > >         cdef double *work2 = <double*>malloc(n*n*sizeof(double))
> > >         # ...
> > >         # Lots of C-layout operations here
> > >         # ...
> > >         dgetrs(<char*>'T', &n, &n, &work1[0], &n, &ipiv[0], &work2[0], &n, &info)
> > >         dcopy(&n2, &work2[0], &int1, &Am[0, 0, 0], &int1)
> > >         free(...)
> > >
> > > Here, I have done everything in C layout with work1 and work2, but I
> > > have to convert work2 into Fortran layout to be able to solve AX = B.
> > > A can be transposed in LAPACK internally via the flag 'T', so the only
> > > obstacle I have now is to shuffle work2, which holds B transpose in
> > > the eyes of Fortran since it is still in C layout.
> > >
> > > If I go naively and make loops to get from one layout to the other,
> > > that actually spoils all the speed benefits of this Cythonization due
> > > to cache misses. In fact 60% of the time is spent in that naive loop
> > > across the whole function.
> >
> > Sorry if this is a dumb question, but is this true whether or not you
> > loop over contiguous blocks of the input vs the output array? Or is the
> > faster of the two options still slower than the linsolve?
> >
> > András
>
> > > Same goes for the copy_fortran() of memoryviews.
> > >
> > > I have measured the regular NumPy np.asfortranarray() and the
> > > performance is quite good compared to the actual linear solve. Hence,
> > > whatever it is doing underneath, I would like to reach out and do the
> > > same, possibly via the C-API. But my C knowledge basically failed me
> > > around this line:
> > > https://github.com/numpy/numpy/blob/8dbd507fb6c854b362c26a0dd056cd04c9c10f25/numpy/core/src/multiarray/multiarraymodule.c#L1817
> > >
> > > I have found the SO post
> > > https://stackoverflow.com/questions/45143381/making-a-memoryview-c-contiguous-fortran-contiguous
> > > but I am not sure if that is the canonical way to do it in newer
> > > Python versions.
> > >
> > > Can anyone show me how to go about it without interacting with
> > > Python objects?
> > >
> > > Best,
> > > ilhan
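Regarding the np.asfortranarray() / C-API question quoted above: a minimal
Cython sketch of one way to request a Fortran-ordered copy through the
C-API is shown below. It assumes PyArray_NewCopy and NPY_FORTRANORDER are
exposed via `cimport numpy` (they are declared in the numpy.pxd shipped
with NumPy), and the helper name is made up; the result is effectively what
np.asfortranarray() gives you for a C-ordered input, though not necessarily
the exact code path the quoted link points at:

    cimport numpy as cnp

    cnp.import_array()

    def as_fortran_copy(cnp.ndarray a):
        # Ask NumPy's C-API for a copy of `a` laid out in Fortran order;
        # for a C-ordered square matrix the data movement is the same
        # row-to-column shuffle discussed above.
        return cnp.PyArray_NewCopy(a, cnp.NPY_FORTRANORDER)

Note that this still returns a Python-level ndarray, so it addresses the
"canonical way via the C-API" part of the quoted question rather than the
"without interacting with Python objects" part.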