On Nov 10, 2007 1:12 AM, Travis E. Oliphant <[EMAIL PROTECTED]> wrote: > Hans Meine wrote: > > | static void > > | DOUBLE_add(char **args, intp *dimensions, intp *steps, void *func) > > | { > > | register intp i; > > | intp is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0]; > > | char *i1=args[0], *i2=args[1], *op=args[2]; > > | for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) { > > | *((double *)op)=*((double *)i1) + *((double *)i2); > > | } > > | } > > > > If I understood David correctly, he suggested an unstrided variant of > > this inner loop, which simply uses pointer increments. > I'm not sure if he was saying that or not. But, more generally, I'm all > for optimizations at this level. But, they will have to be done > differently for different cases with this loop as the fall-back.
I was not really clear, but yes, his was part of my argument: I don't think compilers can optimize the above well when there is no stride (more exactly when the stride could be done by simply using array indexing). This would need some benchmarks, but I have always read that using pointer arithmetics should be avoided when speed matters (e.g. *a + n * sizeof(*a) compared to a[n]), because it becomes much more difficult for the compiler to optimize, Generally, if you can get to a function which does the thing the "obvious way", this is better. Of course, you have to separate the case where this is possible and where it is not. But such work would also be really helpful if/when we optimize some basic things with MMX/SSE and co, and I think the above is impossible to auto vectorize (gcc 4.3, not released yet, gives some really helpful analysis for that, and can tell you when it fails to auto-vectorize, and why). > > While this is > > a good idea (also probably quite some work), the real thing bugging me is > > that the above DOUBLE_add could (and should!) be called by the ufunc > > framework in such a way that it is equally efficient for C and Fortran > > arrays. > Yes, that's what I was talking about. There is actually a path through > the ufunc code where this loop is called only once. The requirement > right now is that all the arrays are C-contiguous, but this should be > changed to all arrays have the same contiguousness (and the output-array > creation code changed to create Fortran-order arrays when the inputs are > all Fortran-order). This was my other point. For some of my own C code, that's what I do: I have some function to detect whether the array can be treated sequentially, for cases where C vs F does not matter at all. I don't see difficulty to provide such a facility in numpy, at least for the common operations we are talking about, but maybe I am missing something. I should take a look a the ufunc machinery David _______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion