On Nov 9, 2007 7:23 PM, David Cournapeau <[EMAIL PROTECTED]> wrote:
> On Nov 10, 2007 1:12 AM, Travis E. Oliphant <[EMAIL PROTECTED]> wrote:
> > Hans Meine wrote:
> > > | static void
> > > | DOUBLE_add(char **args, intp *dimensions, intp *steps, void *func)
> > > | {
> > > |     register intp i;
> > > |     intp is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0];
> > > |     char *i1=args[0], *i2=args[1], *op=args[2];
> > > |     for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
> > > |         *((double *)op)=*((double *)i1) + *((double *)i2);
> > > |     }
> > > | }
> > >
> > > If I understood David correctly, he suggested an unstrided variant of
> > > this inner loop, which simply uses pointer increments.
> >
> > I'm not sure if he was saying that or not. But, more generally, I'm all
> > for optimizations at this level. But, they will have to be done
> > differently for different cases with this loop as the fall-back.
>
> I was not really clear, but yes, this was part of my argument: I don't
> think compilers can optimize the above well when there is no stride
> (more exactly, when the striding could be handled by simply using array
> indexing).
>
> This would need some benchmarks, but I have always read that pointer
> arithmetic should be avoided when speed matters (e.g. *a + n * sizeof(*a)
> compared to a[n]), because it becomes much more difficult for the
> compiler to optimize. Generally, if you can get to a function which does
> the thing the "obvious way", this is better. Of course, you have to
> separate the cases where this is possible from those where it is not.
> But such work would also be really helpful if/when we optimize some
> basic things with MMX/SSE and co, and I think the above is impossible
> to auto-vectorize (gcc 4.3, not released yet, gives some really helpful
> analysis for that, and can tell you when it fails to auto-vectorize,
> and why).
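For concreteness, here is a minimal sketch (not code from the thread) of the
kind of contiguous special case being discussed, assuming the caller has
already checked that all three operands are unit-stride double arrays; the
name DOUBLE_add_contig and the stand-alone intp typedef are only for
illustration:

/* Hypothetical contiguous fast path for the strided DOUBLE_add loop
 * quoted above.  Assumes steps[0] == steps[1] == steps[2] == sizeof(double),
 * so plain array indexing suffices and the compiler has a chance to
 * auto-vectorize the loop.  "intp" stands in for NumPy's index type. */
typedef int intp;

static void
DOUBLE_add_contig(char **args, intp *dimensions, intp *steps, void *func)
{
    const intp n = dimensions[0];
    const double *i1 = (const double *)args[0];
    const double *i2 = (const double *)args[1];
    double *op = (double *)args[2];
    intp i;

    for (i = 0; i < n; i++) {
        op[i] = i1[i] + i2[i];   /* unit stride: no pointer bumping */
    }
}

Whether gcc actually vectorizes such a loop still depends on what it can
prove about aliasing between op, i1 and i2 (restrict qualifiers or runtime
overlap checks would be needed), so treat this purely as an illustration of
the shape of the loop.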
The relative speed of pointers vs indexing depends on a lot of things:

1) the architecture (registers, instruction sets, implementation)
2) the compiler
3) the code (number of base addresses, etc.)

My subjective feeling is that on newer Intel-type hardware, with a recent
compiler, indexing is generally faster. But it does depend. I wrote an
indexed version of the code above:

typedef int intp;

void
DOUBLE_add_ptr(char **args, intp *dimensions, intp *steps, void *func)
{
    register intp i;
    intp is1=steps[0], is2=steps[1], os=steps[2], n=dimensions[0];
    char *i1=args[0], *i2=args[1], *op=args[2];

    for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
        *((double *)op) = *((double *)i1) + *((double *)i2);
    }
}

void
DOUBLE_add_ind(char **args, intp *dimensions, intp *steps, void *func)
{
    const intp i1s = steps[0]/sizeof(double);
    const intp i2s = steps[1]/sizeof(double);
    const intp ios = steps[2]/sizeof(double);
    const intp n = dimensions[0];
    double * const p1 = (double*)args[0];
    double * const p2 = (double*)args[1];
    double * const po = (double*)args[2];
    intp i, io, i1, i2;

    for(i=0, io=0, i1=0, i2=0; i<n; ++i) {
        po[io] = p1[i1] + p2[i2];
        io += ios;
        i1 += i1s;
        i2 += i2s;
    }
}

Compiled with gcc -O2 -fno-strict-aliasing -march=prescott -m32 -S, the
inner loop of the pointer version looks like

.L4:
        fldl    (%edx)
        faddl   (%ecx)
        fstpl   (%eax)
        addl    $1, %ebx
        addl    -20(%ebp), %edx
        addl    -16(%ebp), %ecx
        addl    %edi, %eax
        cmpl    %esi, %ebx
        jne     .L4

while the indexed version looks like

.L11:
        movl    -24(%ebp), %esi
        fldl    (%esi,%ecx,8)
        movl    -20(%ebp), %esi
        faddl   (%esi,%edx,8)
        movl    -16(%ebp), %esi
        fstpl   (%esi,%eax,8)
        addl    -36(%ebp), %eax
        addl    -32(%ebp), %ecx
        addl    %edi, %edx
        addl    $1, %ebx
        cmpl    -28(%ebp), %ebx
        jne     .L11

Note that the indexed version is rather short of registers and has to load
all of the pointers and two of the indexes from the stack.

Chuck
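For anyone who wants to reproduce the comparison, below is a minimal,
hypothetical driver (not part of Chuck's post) showing how args, dimensions
and steps are wired up for contiguous double arrays, timing one pass of each
loop with clock(); a single pass is far too coarse for a serious benchmark,
so this only illustrates the calling convention.

/* Hypothetical harness: compile and link together with the file that
 * defines DOUBLE_add_ptr and DOUBLE_add_ind above.  The array size and
 * the use of clock() are arbitrary choices for illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef int intp;

void DOUBLE_add_ptr(char **args, intp *dimensions, intp *steps, void *func);
void DOUBLE_add_ind(char **args, intp *dimensions, intp *steps, void *func);

int main(void)
{
    const intp n = 1000000;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    char *args[3];
    intp dimensions[1];
    intp steps[3];
    intp i;
    clock_t t0, t1;

    if (!a || !b || !c)
        return 1;

    for (i = 0; i < n; i++) {
        a[i] = (double)i;
        b[i] = 2.0 * i;
    }
    args[0] = (char *)a;
    args[1] = (char *)b;
    args[2] = (char *)c;
    dimensions[0] = n;
    steps[0] = steps[1] = steps[2] = sizeof(double);  /* contiguous doubles */

    t0 = clock();
    DOUBLE_add_ptr(args, dimensions, steps, NULL);
    t1 = clock();
    printf("pointer version: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    DOUBLE_add_ind(args, dimensions, steps, NULL);
    t1 = clock();
    printf("indexed version: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    free(a);
    free(b);
    free(c);
    return 0;
}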