Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to the specific memory location), and then declare dist_ inside the first loop also with a restrict. Then, I would run valgrind or a PAPI profil on your code to see what causes the issue (false sharing, ...)
Matthieu 2011/2/15 Sebastian Haase <seb.ha...@gmail.com> > Hi, > I assume that someone here could maybe help me, and I'm hoping it's > not too much off topic. > I have 2 arrays of 2d point coordinates and would like to calculate > all pairwise distances as fast as possible. > Going from Python/Numpy to a (Swigged) C extension already gave me a > 55x speedup. > (.9ms vs. 50ms for arrays of length 329 and 340). > I'm using gcc on Linux. > Now I'm wondering if I could go even faster !? > My hope that the compiler might automagically do some SSE2 > optimization got disappointed. > Since I have a 4 core CPU I thought OpenMP might be an option; > I never used that, and after some playing around I managed to get > (only) 50% slowdown(!) :-( > > My code in short is this: > (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) > -------<Ccode> ---------- > void dists2d( > double *a_ps, int nx1, int na, > double *b_ps, int nx2, int nb, > double *dist, int nx3, int ny3) throw (char*) > { > if(nx1 != 2) throw (char*) "a must be of shape (n,2)"; > if(nx2 != 2) throw (char*) "b must be of shape (n,2)"; > if(nx3 != nb || ny3 != na) throw (char*) "c must be of shape (na,nb)"; > > double *dist_; > int i, j; > > #pragma omp parallel private(dist_, j, i) > { > #pragma omp for nowait > for(i=0;i<na;i++) > { > //num_threads=omp_get_num_threads(); --> 4 > dist_ = dist+i*nb; // dists_ is only > introduced for OpenMP > for(j=0;j<nb;j++) > { > *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + > sq(a_ps[i*nx1+1] - > b_ps[j*nx2+1]) ); > } > } > } > } > -------</Ccode> ---------- > There is probably a simple mistake in this code - as I said I never > used OpenMP before. > It should be not too difficult to use OpenMP correctly here > or - maybe better - > is there a simple SSE(2,3,4) version that might be even better than > OpenMP... !? > > I supposed, that I did not get the #pragma omp lines right - any idea ? > Or is it in general not possible to speed this kind of code up using OpenMP > !? > > Thanks, > Sebastian Haase > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > -- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion