Amir wrote:
> Dag Sverre Seljebotn <da...@...> writes:
>>
>> Amir wrote:
>> > A test script at bottom is 1.8 times faster when I expand numpy
>> > calls into simple for loops (n,m = 1000,1500). weave.inline is 2.7
>> > times faster. Looking at the cython -a output, I'm not sure where most
>> > of that time is lost. It looks like strides generate many more calls,
>> > and dot products are done using Python calls for the multiplications,
>> > for example.
>>
>> Yes, unfortunately that's the current status; the only thing that is
>> optimized by Cython is element indexing (i.e. your theta[j] and
>> v[j]). This is where you'd really remove a bottleneck in some code,
>> but it means that "mixed" code like yours doesn't benefit that much.
>>
>> Remember though that in your case, as n and m go to infinity, the
>> Python overhead will be rather small.
>>
>
> I see. Well, it's great that it can understand regular numpy code.
>
> If I only use pointers to ndarray.data in my inner loop and no buffer
> striding, I get a more than factor 3 speedup. The only difference in
> the generated code is the __Pyx_BufPtrStrided1d and __Pyx_BufPtrStrided2d
> calls. These should be very fast. Do these cost that much more than
> using direct pointers?
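For reference, a minimal sketch of the kind of loop being discussed; the original test script is not shown here, so the function and variable names (dot_loop, X, theta, v) are illustrative. The point is that element accesses like X[i, j], theta[j] and v[i] on typed buffers compile to C-level indexing, whereas whole-array numpy calls such as numpy.dot still go through Python:

    cimport numpy as np
    import numpy as np

    def dot_loop(np.ndarray[np.double_t, ndim=2] X,
                 np.ndarray[np.double_t, ndim=1] theta):
        # Illustrative only: a matrix-vector product written so that every
        # access is typed element indexing, which is what Cython's buffer
        # support optimizes.
        cdef Py_ssize_t i, j
        cdef Py_ssize_t n = X.shape[0]
        cdef Py_ssize_t m = X.shape[1]
        cdef np.ndarray[np.double_t, ndim=1] v = np.zeros(n)
        cdef double acc
        for i in range(n):
            acc = 0.0
            for j in range(m):
                acc += X[i, j] * theta[j]   # C-level strided element access
            v[i] = acc
        return v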
You can try adding mode="c", i.e.

    cdef ndarray[double, ndim=1, mode="c"] theta_old = empty((m,))

This saves one multiplication per access by a stride variable that is only
known at run time (the index is instead always multiplied by 8, the size of
a double). The requirement, of course, is that the buffer is C-contiguous.

Dag Sverre
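To make the suggestion concrete, a minimal sketch of the mode="c" variant in context (the surrounding function and names are illustrative; only the theta_old declaration mirrors the line above). The difference is that indexing a C-contiguous buffer scales the index by the fixed item size instead of a per-axis stride read from the buffer at run time:

    cimport numpy as np
    import numpy as np

    def copy_theta(np.ndarray[np.double_t, ndim=1, mode="c"] theta):
        # Illustrative only: element-wise copy of a contiguous 1-D buffer.
        cdef Py_ssize_t j
        cdef Py_ssize_t m = theta.shape[0]
        cdef np.ndarray[np.double_t, ndim=1, mode="c"] theta_old = np.empty((m,))
        for j in range(m):
            # mode="c" promises C-contiguity, so this index is scaled by
            # sizeof(double) == 8 rather than by a runtime stride variable.
            theta_old[j] = theta[j]
        return theta_old

Passing a non-contiguous array to such a function raises an error when the buffer is acquired, which is the contiguity requirement mentioned above.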
