On Thu, 2021-11-11 at 01:04 +0100, Ilhan Polat wrote:
> Hmm not sure I understand the question but this is what I mean by naive
> looping, suppose I allocate a scratch register work3, then
> 
> for i in range(n):
>     for j in range(n):
>         work3[j*n + i] = work2[i*n + j]
> 

NumPy does not end up doing anything special.  Special would be using
blocked iteration, and NumPy unfortunately doesn't have it.
The only thing it does is use raw pointers to cut some overheads,
(very roughly) like:

    ptr1 = arr1.data
    ptr2_col = arr2.data
    
    strides2_col = arr2.strides[0]
    strides2_row = arr2.strides[1]

    for i in range(n):
        ptr2 = ptr2_col
        for j in range(n):
             *ptr2 = *ptr1
             ptr1++
             ptr2 += strides2_row

        ptr2_col += strides2_col

And if you write that in Cython, you are likely faster, since you can
cut quite a few corners (everything is aligned, contiguous, etc.).
(There may be loop-unrolling/compiler-optimization fluctuations;
NumPy probably tells GCC to unroll and optimize the innermost loop
there.)
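
For what it's worth, the "blocked iteration" mentioned above can be sketched at the NumPy level.  This is only an illustration of the tiled access pattern (the function name and the block size of 64 are made up here; a real implementation would live in C), not a claim about how NumPy itself works:

```python
import numpy as np

def blocked_transpose_copy(a, block=64):
    # Copy the transpose of 2-D array `a` into a new C-contiguous array,
    # working in `block` x `block` tiles so that both the reads and the
    # writes stay cache-friendly.  NumPy's slice assignment performs the
    # inner copies in C; the Python loops only walk the tile grid.
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            out[j:j + block, i:i + block] = a[i:i + block, j:j + block].T
    return out
```

The slicing handles ragged edges automatically when the dimensions are not multiples of the block size.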

I would not be surprised if you can find a lightweight, fast copy-
transpose out there, or if tools like MKL/CUDA just include one. It
is too bad NumPy is missing it.

Cheers,

Sebastian


> 
> 
> This is basically doing the row-to-column based indexing, and obviously
> we create a lot of cache misses since the work3 entries are accessed in
> a shuffled fashion. The idea of all this Cython attempt is to avoid such
> access; hence if the original some_C_layout_func takes 10 units of time,
> 6 of those are spent on this loop when the data doesn't fit the cache.
> When I discard the correctness of the function and comment out this loop
> and then remeasure, the original func spends roughly 3 units of time.
> However, take any random array in C order in NumPy using regular Python
> and use np.asfortranarray(): it spends roughly 0.1 units of time. So
> apparently it is possible to do this somehow at the low level in a
> performant way. That's what I would like to understand, or clear up my
> misunderstanding.
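
For concreteness, the NumPy-level operation being timed there is just a layout-changing copy; a minimal sanity check of what np.asfortranarray() produces (not of its internals):

```python
import numpy as np

a = np.arange(16.0).reshape(4, 4)   # C-contiguous to start with
f = np.asfortranarray(a)            # Fortran-ordered copy of the same values

assert a.flags['C_CONTIGUOUS'] and not a.flags['F_CONTIGUOUS']
assert f.flags['F_CONTIGUOUS'] and not f.flags['C_CONTIGUOUS']
assert np.array_equal(f, a)         # identical values, different memory order
```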
> 
> On Thu, Nov 11, 2021 at 12:56 AM Andras Deak <deak.and...@gmail.com>
> wrote:
> 
> > On Thursday, November 11, 2021, Ilhan Polat <ilhanpo...@gmail.com>
> > wrote:
> > 
> > > I've asked this in the Cython mailing list but probably I should
> > > also get some feedback here too.
> > > 
> > > I have the following function defined in Cython, using flat memory
> > > pointers to hold n-by-n array data.
> > > 
> > > 
> > > cdef some_C_layout_func(double[:, :, ::1] Am) nogil:
> > >     # ...
> > >     cdef double *work1 = <double*>malloc(n*n*sizeof(double))
> > >     cdef double *work2 = <double*>malloc(n*n*sizeof(double))
> > >     # ...
> > >     # Lots of C-layout operations here
> > >     # ...
> > >     dgetrs(<char*>'T', &n, &n, &work1[0], &n, &ipiv[0], &work2[0], &n, &info)
> > >     dcopy(&n2, &work2[0], &int1, &Am[0, 0, 0], &int1)
> > >     free(...)
> > > 
> > > Here, I have done everything in C layout with work1 and work2, but
> > > I have to convert work2 into Fortran layout to be able to solve
> > > AX = B. A can be transposed internally in LAPACK via the flag 'T',
> > > so the only obstacle I have now is to shuffle work2, which holds B
> > > transpose in the eyes of Fortran since it is still in C layout.
> > > 
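
(The layout relationship at play here, at the NumPy level: a C-contiguous matrix read in Fortran order is its transpose, so `.T` gives a zero-copy F-contiguous view of the same buffer -- the copy is only needed to get the *untransposed* values into Fortran order:)

```python
import numpy as np

b = np.arange(9.0).reshape(3, 3)    # C-contiguous
bt = b.T                            # no copy: same buffer, Fortran order

assert bt.flags['F_CONTIGUOUS']
assert np.shares_memory(b, bt)      # a view, not a copy
assert np.array_equal(bt, b.transpose())
```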
> > > If I go naively and make loops to get from one layout to the other,
> > > that actually spoils all the speed benefits of this Cythonization
> > > due to cache misses. In fact, 60% of the time is spent in that naive
> > > loop across the whole function.
> > > 
> > > 
> > Sorry if this is a dumb question, but is this true whether you loop
> > over contiguous blocks of the input or of the output array? Or is the
> > faster of the two options still slower than the linsolve?
> > 
> > András
> > 
> > 
> > > 
> > > Same goes for the copy_fortran() of memoryviews.
> > > 
> > > I have measured the regular NumPy np.asfortranarray() and the
> > > performance is quite good compared to the actual linear solve.
> > > Hence, whatever it is doing underneath, I would like to reach out
> > > and do the same, possibly via the C-API. But my C knowledge
> > > basically failed me around this line:
> > > https://github.com/numpy/numpy/blob/8dbd507fb6c854b362c26a0dd056cd04c9c10f25/numpy/core/src/multiarray/multiarraymodule.c#L1817
> > > 
> > > I have found this SO post:
> > > https://stackoverflow.com/questions/45143381/making-a-memoryview-c-contiguous-fortran-contiguous
> > > but I am not sure if that is the canonical way to do it in newer
> > > Python versions.
> > > 
> > > Can anyone show me how to go about it without interacting with
> > > Python objects?
> > > 
> > > Best,
> > > ilhan
> > > 
> > _______________________________________________
> > NumPy-Discussion mailing list -- numpy-discussion@python.org
> > To unsubscribe send an email to numpy-discussion-le...@python.org
> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> > Member address: ilhanpo...@gmail.com
> > 
