Amir wrote:
> Dag Sverre Seljebotn <da...@...> writes:
>>
>> Amir wrote:
>> > The test script at the bottom runs 1.8 times faster when I expand the
>> > numpy calls into simple for loops (n,m = 1000,1500); weave.inline is
>> > 2.7 times faster. Looking at the cython -a output, I'm not sure where
>> > most of that time is lost. It looks like strided access generates many
>> > more calls, and dot products, for example, are done through Python
>> > calls for the multiplications.
>>
>> Yes, unfortunately that's the current status; the only thing Cython
>> optimizes is element indexing (i.e. your theta[j] and v[j]). That is
>> where the real bottleneck gets removed in some code, but it means that
>> "mixed" code like yours doesn't benefit that much.
>>
>> Remember though that in your case, as n and m go to infinity, the
>> Python overhead will be rather small.
>>
>
> I see. Well, it's great that it can understand regular numpy code.
>
> If I only use pointers to ndarray.data in my inner loop and no buffer
> striding, I get more than a factor-of-3 speedup. The only difference in
> the generated code is the __Pyx_BufPtrStrided1d and __Pyx_BufPtrStrided2d
> calls. These should be very fast; do they really cost that much more than
> direct pointers?

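For reference, the two variants you're comparing presumably boil down to
something like the sketch below (a made-up minimal example, not your
actual script; the function name and loop body are invented):

from numpy cimport ndarray
from numpy import empty

def strided_vs_pointer(Py_ssize_t m):
    # Buffer indexing: each v[j] goes through __Pyx_BufPtrStrided1d,
    # i.e. roughly a lookup at (buf + j*stride), with the stride read
    # from the buffer info at run time.
    cdef ndarray[double, ndim=1] v = empty((m,))
    cdef Py_ssize_t j
    for j in range(m):
        v[j] = 2.0 * j

    # Raw pointer into ndarray.data: plain C indexing; safe here only
    # because empty() returns a contiguous array.
    cdef double *vp = <double*>v.data
    for j in range(m):
        vp[j] = 2.0 * j
    return v

The strided lookup itself is only one extra multiplication per access,
but it multiplies by a stride that is only known at run time.
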
You can try adding mode="c", i.e.

cdef ndarray[double, ndim=1, mode="c"] theta_old = empty((m,))

This saves one multiplication per access by a value that is unknown at
compile time (the stride); the index is instead always multiplied by the
constant 8, which the C compiler can handle much more cheaply. The
requirement, of course, is that the buffer really is C-contiguous.
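
In context it might look something like the sketch below (only the
theta_old declaration above is real; the function name, the use of
theta.shape[0] and the copy loop are made up for illustration):

from numpy cimport ndarray
from numpy import empty

def copy_theta(ndarray[double, ndim=1, mode="c"] theta):
    # mode="c" promises a C-contiguous buffer, so theta[j] and
    # theta_old[j] are addressed as base + j*8 rather than base + j*stride.
    cdef Py_ssize_t m = theta.shape[0]
    cdef ndarray[double, ndim=1, mode="c"] theta_old = empty((m,))
    cdef Py_ssize_t j
    for j in range(m):
        theta_old[j] = theta[j]
    return theta_old

If an array might not be C-contiguous to begin with, you can use
numpy.ascontiguousarray() to get a contiguous copy before passing it in.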

Dag Sverre
