Here are some actual numbers in the context of the operation (nogil removed and the function def'd for line tracing):
Line #  Hits    Time  Per Hit  % Time  Line Contents
==============================================================
    80                                 # Bilinear identity to shave off some flops
    81                                 # inv(V-U) (V+U) = inv(V-U) (V-U+2V) = I + 2 inv(V-U) U
    82     1      15.0     15.0    0.1 daxpy(&n2, &neg_one, &work2[0], &int1, &work3[0], &int1)
    83
    84                                 # Convert array layout for solving AX = B
    85     1       3.0      3.0    0.0 for i in range(n):
    86    60     137.0      2.3    0.9     for j in range(n):
    87  3600    8437.0      2.3   57.7         work4[j*n+i] = work2[i*n+j]
    88
    89     1     408.0    408.0    2.8 dgetrf(&n, &n, &work3[0], &n, &ipiv[0], &info)
    90     1    4122.0   4122.0   28.2 dgetrs(<char*>'T', &n, &n, &work3[0], &n, &ipiv[0], &work4[0], &n, &info)
    91     1      25.0     25.0    0.2 dscal(&n2, &two, &work4[0], &int1)
    92                                 # Add identity matrix
    93     1       4.0      4.0    0.0 for i in range(n):
    94    60     146.0      2.4    1.0     work4[i*(n+1)] += 1.
    95     1      16.0     16.0    0.1 dcopy(&n2, &work4[0], &int1, &Am[0, 0, 0], &int1)

On Thu, Nov 11, 2021 at 1:04 AM Ilhan Polat <ilhanpo...@gmail.com> wrote:

> Hmm, not sure I understand the question, but this is what I mean by
> naive looping. Suppose I allocate a scratch array work3; then
>
>     for i in range(n):
>         for j in range(n):
>             work3[j*n+i] = work2[i*n+j]
>
> This is basically doing the row-to-column indexing, and obviously we
> create a lot of cache misses since the work3 entries are accessed in a
> shuffled fashion. The whole point of this Cython attempt is to avoid
> such access patterns; if the original some_C_layout_func takes 10 units
> of time, 6 of them are spent on this loop when the data doesn't fit the
> cache. When I discard the correctness of the function and comment out
> this loop and then remeasure, the original func spends roughly 3 units
> of time. However, take any random array in C order in NumPy using
> regular Python and call np.asfortranarray() on it: that spends roughly
> 0.1 units of time. So apparently it is possible to do this somehow at
> the low level in a performant way. That's what I would like to
> understand, or clear up my misunderstanding.
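[Editorial aside: the classic remedy for the shuffled access pattern in the naive loop quoted above is to transpose in small square tiles, so that both the source rows and destination columns being touched fit in cache at once. A minimal pure-Python sketch of the tiling idea; the function name and tile size are made up here, and a real Cython version would run this as a typed loop over flat malloc'd buffers:]

```python
import numpy as np

def tiled_transpose(src, dst, n, bs=16):
    """Copy the flat, row-major n x n matrix `src` into `dst` transposed,
    one bs x bs tile at a time, so each tile of both buffers stays
    cache-resident while it is being read and written."""
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            # Transpose one tile; min() handles the ragged edge tiles.
            for i in range(ii, min(ii + bs, n)):
                for j in range(jj, min(jj + bs, n)):
                    dst[j * n + i] = src[i * n + j]

n = 60
a = np.random.rand(n * n)
b = np.empty_like(a)
tiled_transpose(a, b, n)
# b, viewed as n x n, is the transpose of a viewed as n x n.
assert np.array_equal(b.reshape(n, n), a.reshape(n, n).T)
```

In pure Python the tiling gains nothing (the interpreter overhead dominates), but the same loop nest compiled in Cython turns the strided writes into mostly-contiguous ones per tile, which is essentially what optimized copy routines do.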
> On Thu, Nov 11, 2021 at 12:56 AM Andras Deak <deak.and...@gmail.com> wrote:
>
>> On Thursday, November 11, 2021, Ilhan Polat <ilhanpo...@gmail.com> wrote:
>>
>>> I've asked this on the Cython mailing list, but I should probably also
>>> get some feedback here.
>>>
>>> I have the following function defined in Cython, using flat memory
>>> pointers to hold n-by-n array data:
>>>
>>>     cdef some_C_layout_func(double[:, :, ::1] Am) nogil:
>>>         # ...
>>>         cdef double *work1 = <double*>malloc(n*n*sizeof(double))
>>>         cdef double *work2 = <double*>malloc(n*n*sizeof(double))
>>>         # ...
>>>         # Lots of C-layout operations here
>>>         # ...
>>>         dgetrs(<char*>'T', &n, &n, &work1[0], &n, &ipiv[0], &work2[0], &n, &info)
>>>         dcopy(&n2, &work2[0], &int1, &Am[0, 0, 0], &int1)
>>>         free(...)
>>>
>>> Here I have done everything in C layout with work1 and work2, but I
>>> have to convert work2 into Fortran layout to be able to solve AX = B.
>>> A can be transposed internally in LAPACK via the flag 'T', so the only
>>> obstacle left is to shuffle work2, which holds B transposed in the
>>> eyes of Fortran since it is still in C layout.
>>>
>>> If I go naively and write loops to get from one layout to the other,
>>> that spoils all the speed benefits of this Cythonization due to cache
>>> misses. In fact, 60% of the time is spent in that naive loop across
>>> the whole function.
>>
>> Sorry if this is a dumb question, but is this true whether or not you
>> loop over contiguous blocks of the input vs the output array? Or is the
>> faster of the two options still slower than the linsolve?
>>
>> AndrĂ¡s
>>
>>> Same goes for copy_fortran() on memoryviews.
>>>
>>> I have measured the regular NumPy np.asfortranarray() and the
>>> performance is quite good compared to the actual linear solve. Hence,
>>> whatever it is doing underneath, I would like to reach out and do the
>>> same, possibly via the C-API.
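[Editorial aside: at the Python level, the factor-once / solve-with-'T' pattern that dgetrs exposes can be sketched with SciPy's LU wrappers (scipy.linalg.lu_factor and lu_solve wrap LAPACK ?getrf/?getrs). This is only an illustration of what the trans flag buys, assuming SciPy is available; it is not the Cython code above:]

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# One LU factorization of A (?getrf underneath)...
lu, piv = lu_factor(A)

# ...then solve A.T @ X = B without ever forming A.T explicitly:
# trans=1 maps to the 'T' flag of ?getrs.
X = lu_solve((lu, piv), B, trans=1)

assert np.allclose(A.T @ X, B)
```

The flag only spares transposing A; the right-hand side B still has to be laid out the way LAPACK expects (column-major), which is exactly the conversion being discussed in the thread.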
>>> But my C knowledge basically failed me around this line:
>>> https://github.com/numpy/numpy/blob/8dbd507fb6c854b362c26a0dd056cd04c9c10f25/numpy/core/src/multiarray/multiarraymodule.c#L1817
>>>
>>> I have found the SO post at
>>> https://stackoverflow.com/questions/45143381/making-a-memoryview-c-contiguous-fortran-contiguous
>>> but I am not sure if that is the canonical way to do it in newer
>>> Python versions.
>>>
>>> Can anyone show me how to go about this without interacting with
>>> Python objects?
>>>
>>> Best,
>>> ilhan
>>
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: ilhanpo...@gmail.com
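[Editorial aside: for readers who want to see the NumPy-level conversion being benchmarked against, np.asfortranarray performs a single layout-changing copy without touching the values. A small illustrative check:]

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)   # C-contiguous by construction
f = np.asfortranarray(a)            # one optimized layout-changing copy

# Same values, opposite memory order.
assert a.flags['C_CONTIGUOUS'] and not a.flags['F_CONTIGUOUS']
assert f.flags['F_CONTIGUOUS'] and not f.flags['C_CONTIGUOUS']
assert np.array_equal(a, f)
```

Under the hood this goes through NumPy's copy machinery (the multiarraymodule.c region linked above), which uses blocked, stride-aware copy loops rather than a naive element-by-element transpose.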