Travis E. Oliphant wrote:
> Anne Archibald wrote:
>> On 22/03/2008, Travis E. Oliphant <[EMAIL PROTECTED]> wrote:
>>> James Philbin wrote:
>>> > Personally, I think that the time would be better spent optimizing
>>> > routines for single-threaded code and relying on BLAS and LAPACK
>>> > libraries to use multiple cores for more complex calculations. In
>>> > particular, doing some basic loop unrolling and SSE versions of the
>>> > ufuncs would be beneficial. I have some experience writing SSE code
>>> > using intrinsics and would be happy to give it a shot if people tell
>>> > me what functions I should focus on.
>>>
>>> Fabulous! This is on my project list of to-do items for NumPy. See
>>> http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas. I should spend
>>> some time refactoring the ufunc loops so that the templating does not
>>> get in the way of doing this on a case-by-case basis.
>>>
>>> 1) You should focus on the math operations: add, subtract, multiply,
>>> divide, and so forth.
>>> 2) Then, for "combined operations", we should expose the functionality
>>> at a high level, so that somebody could write code to take advantage
>>> of it.
>>>
>>> It would be easiest to use intrinsics, which would then work for AMD
>>> and Intel on multiple compilers.
>>
>> I think even heavier use of code generation would be a good idea here.
>> There are so many different versions of each loop, and the fastest way
>> to run each one is going to be different for different versions and
>> different platforms, that a routine that assembled the code from
>> chunks and picked the fastest combination for each instance might make
>> a big difference. This is roughly what FFTW and ATLAS do.
>>
>> There are also some optimizations to be made at a higher level that
>> might give these optimizations more traction.
>> For example:
>>
>>     A = randn(100*100)
>>     A.shape = (100,100)
>>     A*A
>>
>> There's no reason the multiply ufunc couldn't flatten A and use a
>> single unstrided loop to do the multiplication.
>
> Good idea; it already does that :-) The ufunc machinery is also a
> good place for an optional thread pool.
>
> Perhaps we could drum up interest in a Need for Speed sprint on NumPy
> sometime over the next few months.
>
> -Travis O.
Hi,

I have very limited knowledge of OpenMP, but please consider this test case:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <omp.h>

    #define N 100000000

    int main(void)
    {
        double *data = malloc(N * sizeof(double));
        long i, j;

        #pragma omp parallel for
        for (i = 0; i < N; i++)
            data[i] = i;

        for (j = 0; j < 4; j++) {
            #pragma omp parallel for
            for (i = 0; i < N; i++)
                data[i] = cos(data[i]);
        }

        free(data);
        return 0;
    }

Compiled with:

    gcc -fopenmp -Wall -lm -O3 sin.c -o sinopenmp

and, without OpenMP (the pragmas are then ignored, giving the serial baseline):

    gcc -Wall -lm -O3 sin.c -o sin

On my Core 2 Duo:

    time ./sin
    real    0m15.910s
    user    0m15.249s
    sys     0m0.646s

and

    time ./sinopenmp
    real    0m8.699s
    user    0m16.287s
    sys     0m0.893s

It scales very well :) (gcc 4.2). It would be so nice to see that using
numpy.sin(a). OK, it is a very simple case, but numpy.sin(a) is exactly
such a case, isn't it? Please give it a try ;)

Xavier
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion