Here are some actual numbers in the context of the operation (nogil removed
and the function def'd for line tracing):


Line #   Hits     Time  Per Hit  % Time  Line Contents
==============================================================
    80                                   # Bilinear identity to shave off some flops
    81                                   # inv(V-U) (V+U) = inv(V-U) (V-U+2U) = I + 2 inv(V-U) U
    82        1     15.0     15.0    0.1 daxpy(&n2, &neg_one, &work2[0], &int1, &work3[0], &int1)
    83
    84                                   # Convert array layout for solving AX = B
    85        1      3.0      3.0    0.0 for i in range(n):
    86       60    137.0      2.3    0.9     for j in range(n):
    87     3600   8437.0      2.3   57.7         work4[j*n+i] = work2[i*n+j]
    88
    89        1    408.0    408.0    2.8 dgetrf( &n, &n, &work3[0], &n, &ipiv[0], &info )
    90        1   4122.0   4122.0   28.2 dgetrs(<char*>'T', &n, &n, &work3[0], &n, &ipiv[0], &work4[0], &n, &info )
    91        1     25.0     25.0    0.2 dscal(&n2, &two, &work4[0], &int1)
    92                                   # Add identity matrix
    93        1      4.0      4.0    0.0 for i in range(n):
    94       60    146.0      2.4    1.0     work4[i*(n+1)] += 1.
    95        1     16.0     16.0    0.1 dcopy(&n2, &work4[0], &int1, &Am[0, 0, 0], &int1)
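For orientation, the 57.7% line in the profile is nothing more than an out-of-place transpose of a flat n*n buffer. A minimal NumPy sketch of the same access pattern (sizes are hypothetical; the profile suggests n = 60):

```python
import numpy as np

n = 60
work2 = np.arange(n * n, dtype=float)  # flat C-ordered buffer, as in the Cython code

# The hot loop from the profile: an out-of-place transpose, element by element
work4 = np.empty(n * n)
for i in range(n):
    for j in range(n):
        work4[j * n + i] = work2[i * n + j]

# Vectorized equivalent for comparison
ref = work2.reshape(n, n).T.ravel()
```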

On Thu, Nov 11, 2021 at 1:04 AM Ilhan Polat <ilhanpo...@gmail.com> wrote:

> Hmm, not sure I understand the question, but this is what I mean by naive
> looping: suppose I allocate a scratch buffer work3, then
>
> for i in range(n):
>     for j in range(n):
>         work3[j*n+i] = work2[i*n+j]
>
>
>
> This is basically doing row-to-column based indexing, and obviously we
> create a lot of cache misses since work3 entries are accessed in a
> shuffled fashion. The idea of this whole Cython attempt is to avoid such
> access; hence, if the original some_C_layout_func takes 10 units of time, 6
> of it is spent on this loop when the data doesn't fit the cache. When I
> discard the correctness of the function, comment out this loop, and then
> remeasure, the original func spends roughly 3 units of time. However, if I
> take any random array in C order in NumPy using regular Python and use
> np.asfortranarray(), it spends roughly 0.1 units of time. So
> apparently it is possible to do this somehow at a low level in a
> performant way. That's what I would like to understand, or else clear up my
> misunderstanding.
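[Editorial note: the thread does not say how NumPy achieves this, and the sketch below is not necessarily what np.asfortranarray does internally; but the classic low-level answer to the cache-miss problem described above is to tile ("block") the transpose so that each tile's reads and writes both fit in cache. A minimal Python illustration, with a hypothetical block size:]

```python
import numpy as np

def blocked_transpose(a, bs=32):
    """Copy-transpose one bs-by-bs tile at a time so that the working
    sets of both source and destination stay cache-resident per tile."""
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            # NumPy slicing clips at the array edge, so ragged tiles are fine
            out[j0:j0 + bs, i0:i0 + bs] = a[i0:i0 + bs, j0:j0 + bs].T
    return out

a = np.arange(60 * 60, dtype=float).reshape(60, 60)
b = blocked_transpose(a)
```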
>
> On Thu, Nov 11, 2021 at 12:56 AM Andras Deak <deak.and...@gmail.com>
> wrote:
>
>> On Thursday, November 11, 2021, Ilhan Polat <ilhanpo...@gmail.com> wrote:
>>
>>> I've asked this in Cython mailing list but probably I should also get
>>> some feedback here too.
>>>
>>> I have the following function defined in Cython and using flat memory
>>> pointers to hold n by n array data.
>>>
>>>
>>> cdef some_C_layout_func(double[:, :, ::1] Am) nogil:
>>>     # ...
>>>     cdef double *work1 = <double*>malloc(n*n*sizeof(double))
>>>     cdef double *work2 = <double*>malloc(n*n*sizeof(double))
>>>     # ...
>>>     # Lots of C-layout operations here
>>>     # ...
>>>     dgetrs(<char*>'T', &n, &n, &work1[0], &n, &ipiv[0], &work2[0], &n, &info)
>>>     dcopy(&n2, &work2[0], &int1, &Am[0, 0, 0], &int1)
>>>     free(...)
>>>
>>> Here, I have done everything in C layout with work1 and work2, but I have
>>> to convert work2 into Fortran layout to be able to solve AX = B. A can be
>>> transposed internally in LAPACK via the flag 'T', so the only obstacle I
>>> have now is to shuffle work2, which holds B transposed in the eyes of
>>> Fortran since it is still in C layout.
>>>
>>> If I go naively and make loops to get from one layout to the other, that
>>> actually spoils all the speed benefit of this Cythonization due to cache
>>> misses. In fact, 60% of the time is spent in that naive loop across the
>>> whole function.
>>>
>>>
>> Sorry if this is a dumb question, but is this true whether or not you
>> loop over contiguous blocks of the input vs the output array? Or is the
>> faster of the two options still slower than the linsolve?
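[Editorial note: to make the two options in the question concrete, the variants differ only in whether the source or the destination is swept contiguously; this illustrative sketch is not from the thread:]

```python
import numpy as np

n = 60
src = np.arange(n * n, dtype=float)
dst_a = np.empty(n * n)
dst_b = np.empty(n * n)

# Order 1: reads of src are contiguous, writes to dst_a are strided
for i in range(n):
    for j in range(n):
        dst_a[j * n + i] = src[i * n + j]

# Order 2: writes to dst_b are contiguous, reads of src are strided
for j in range(n):
    for i in range(n):
        dst_b[j * n + i] = src[i * n + j]
```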
>>
>> AndrĂ¡s
>>
>>
>>>
>>> The same goes for copy_fortran() on memoryviews.
>>>
>>> I have measured the regular NumPy np.asfortranarray(), and its
>>> performance is quite good compared to the actual linear solve. Hence,
>>> whatever it is doing underneath, I would like to reach in and do the same,
>>> possibly via the C-API. But my C knowledge basically failed me around this
>>> line:
>>> https://github.com/numpy/numpy/blob/8dbd507fb6c854b362c26a0dd056cd04c9c10f25/numpy/core/src/multiarray/multiarraymodule.c#L1817
>>>
>>> I have found the SO post from
>>> https://stackoverflow.com/questions/45143381/making-a-memoryview-c-contiguous-fortran-contiguous
>>> but I am not sure if that is the canonical way to do it in newer Python
>>> versions.
>>>
>>> Can anyone show me how to go about it without interacting with Python
>>> objects?
>>>
>>> Best,
>>> ilhan
>>>
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: ilhanpo...@gmail.com
>>
>