Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-20 Thread Francesc Alted
On Sunday, 20 February 2011 00:01:59, Sturla Molden wrote: pthreads will give you better control than OpenMP, but they are messy and painful to work with. With MPI you have separate processes, so everything is completely isolated. It's more difficult to program and debug than OpenMP code, but

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-19 Thread Sebastian Haase
Thanks a lot. Very informative. I guess what you say about the cache line being dirtied is related to the info I got with valgrind (see my email in this thread: L1 Data Write Miss 3636). Can one assume that the cache line is always a few megabytes? Thanks, Sebastian On Sat, Feb 19, 2011 at 12:40 AM,

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-19 Thread Pauli Virtanen
On Sat, 19 Feb 2011 18:13:44 +0100, Sebastian Haase wrote: Thanks a lot. Very informative. I guess what you say about the cache line being dirtied is related to the info I got with valgrind (see my email in this thread: L1 Data Write Miss 3636). Can one assume that the cache line is always a few mega

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-19 Thread Sturla Molden
On 19.02.2011 18:13, Sebastian Haase wrote: Can one assume that the cache line is always a few megabytes? Don't confuse the size of a cache with the size of a cache line. A cache line (which is the unit that gets marked dirty) is typically 8-512 bytes. Make sure your OpenMP threads stay
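[Editor's note: not from the thread, but a minimal sketch of the false-sharing point Sturla is making, assuming a 64-byte cache line; the counter names, array sizes, and kernel are illustrative only.]

#include <omp.h>

#define CACHE_LINE 64   /* assumed line size in bytes */

/* Adjacent per-thread counters share a cache line: every write by one
   thread marks the line dirty and forces the others to refetch it. */
double sums_bad[8];

/* Padding gives each counter its own cache line, so threads never
   invalidate each other's copies. */
struct padded { double v; char pad[CACHE_LINE - sizeof(double)]; };
struct padded sums_good[8];

void accumulate(const double *x, int n)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int i;
        #pragma omp for
        for (i = 0; i < n; i++)
            sums_good[t].v += x[i];   /* no line ping-pong between threads */
    }
}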

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-18 Thread Sturla Molden
On 17.02.2011 16:31, Matthieu Brucher wrote: It may also be the size of the chunks OMP uses. You can/should specify them in the OMP pragma so that it is a multiple of the cache line size or something close. Matthieu. Also beware of false sharing among the threads. When one

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Sebastian Haase
Eric, thanks for insisting on this. I noticed it when I first saw it, only to forget about it again ... The new timings on my machine are:
$: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
$: gcc -shared -o the_lib.so the_lib.o -lgomp -lm
$: python2.5 the_python_prog.py
c_threads 1 time

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Matthieu Brucher
Do you think one could get even better? And where does the 7% slow-down (for a single thread) come from? Is it possible to have the OpenMP option in a code without _any_ penalty for 1-core machines? There will always be a penalty for parallel code that runs on one core. You have at least
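[Editor's note: not part of the message, but OpenMP's if clause is one standard way to shrink the single-core penalty being discussed; a minimal sketch, with the 10000-iteration threshold chosen arbitrarily.]

#include <omp.h>

void scale(double *x, int n, double f)
{
    int i;
    /* Fork threads only when the loop is long enough to amortize the
       parallel-region overhead; below the threshold it runs serially. */
    #pragma omp parallel for if(n > 10000)
    for (i = 0; i < n; i++)
        x[i] *= f;
}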

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Matthieu Brucher
Then where does the overhead come from? -- The call to omp_set_dynamic(dynamic); Or the #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y) It may be this. You initialize a thread pool, even if it has only one thread, and there is the dynamic part, so OpenMP may create several
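[Editor's note: for reference, a small sketch of the two calls under discussion, not Sebastian's actual code. Passing 0 to omp_set_dynamic pins the team size so the runtime cannot resize it between parallel regions, which removes one source of variability.]

#include <omp.h>

void setup_threads(int num_threads)
{
    omp_set_dynamic(0);                /* forbid dynamic team resizing */
    omp_set_num_threads(num_threads);  /* every later parallel region
                                          uses exactly this many threads */
}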

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Matthieu Brucher
It may also be the size of the chunks OMP uses. You can/should specify them in the OMP pragma so that it is a multiple of the cache line size or something close. Matthieu 2011/2/17 Sebastian Haase seb.ha...@gmail.com Hi, More surprises: shaase@iris:~/code/SwiggedDistOMP: gcc -O3 -c the_lib.c
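[Editor's note: the pragma syntax is not quoted in the digest, but the chunk size presumably refers to OpenMP's schedule clause; a sketch with an arbitrary chunk of 64 rows and a placeholder kernel.]

void zero_rows(double *dist, int na, int nb)
{
    int i, j;
    /* static schedule with an explicit chunk: each thread takes blocks
       of 64 consecutive rows, keeping its writes far from the others' */
    #pragma omp parallel for schedule(static, 64) private(j)
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            dist[i * nb + j] = 0.0;   /* placeholder for the real kernel */
}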

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-17 Thread Eric Carlson
For 4 cores, on your system, your conclusion makes some sense. That said, I played around with this on both a Core 2 Duo and the 12-core system. On the 12-core system, in my tests the 0 case ran extremely close to the 2-thread case for all my sizes. The Core 2 Duo runs Windows 7, and after

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-16 Thread Sebastian Haase
Chris, OK, sorry -- I misread (the cdist doc says A and B must have the same number of columns(!), not rows). On my machine I got the exact same timing as my (non-OpenMP) C code. That is really good, compared to normal ufunc-based numpy code. But my question in this thread is how to get better than that,

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-16 Thread Sebastian Haase
Eric, this is amazing!! Thanks very much, I have rarely seen such a compact source example that just worked. The timings I get are:
c_threads 1 time 0.00155731916428
c_threads 2 time 0.000829789638519
c_threads 3 time 0.00061688839
c_threads 4 time 0.000704760551453
c_threads 5

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-16 Thread Sebastian Haase
Update: I just noticed that using Eric's OpenMP code gave me only a 1.35x speedup when comparing 3 threads vs. my non-OpenMP code. However, when comparing 3 threads vs. 1 thread, I could call this a 2.55x speedup (in other words, the 1-thread OpenMP run was itself about 2.55/1.35 ≈ 1.9x slower than my plain C code). This sounds much better, but is obviously not the number that matters...

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-16 Thread Eric Carlson
Sebastian, Optimization appears to be important here. I used no optimization in my previous post, so you could try the -O3 compile option:
gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math
For na=329 and nb=340 I get (about a 7.5x speedup):
c_threads 1 time 0.00103106021881
c_threads 2 time

[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Sebastian Haase
Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible. Going from Python/Numpy to a (Swigged) C extension already gave me a 55x speedup. (.9ms

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Matthieu Brucher
Hi, My first move would be to add a restrict keyword to dist (i.e. dist is the only pointer to that specific memory location), and then declare dist_ inside the first loop, also with restrict. Then I would run valgrind or a PAPI profile on your code to see what causes the issue (false sharing,
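[Editor's note: a minimal sketch of the suggestion as I read it. The row-pointer name dist_ comes from the message; the array names and the kernel body are placeholders, not the original code.]

/* C99: restrict promises the compiler that the memory behind dist is
   not accessed through any other pointer, enabling better optimization
   and vectorization of the stores. */
void fill_rows(double *restrict dist, const double *restrict a,
               const double *restrict b, int na, int nb)
{
    for (int i = 0; i < na; i++) {
        double *restrict dist_ = &dist[i * nb];  /* row-local alias */
        for (int j = 0; j < nb; j++)
            dist_[j] = a[i] + b[j];              /* placeholder kernel */
    }
}

(Compile with -std=c99; as Matthieu notes in a later message, g++'s __restrict__ is a related but not identical extension.)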

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Sebastian Haase
Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what a PAPI profile is ...? -Sebastian On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Hi, My first move would be to add a restrict

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Matthieu Brucher
Use restrict directly in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profile, you can check my blog ( http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/ ). Basically, if you have a python script, you can run valgrind --optionsinmyblog python myscript.py For
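[Editor's note: for the specific figures quoted earlier in the thread (the L1 Data Write Miss counts), the relevant valgrind tool is cachegrind; this invocation is my example, not the one from the blog post:

$: valgrind --tool=cachegrind python myscript.py

cachegrind reports instruction- and data-cache miss counts per function, which is where numbers like the L1 Data Write Miss count come from.]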

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Wes McKinney
On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: Use restrict directly in C99 mode (__restrict does not have exactly the same semantics). For a valgrind profile, you can check my blog (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/) Basically,

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Sebastian Haase
Wes, I think I should have a couple of GPUs. I would be ready for anything ... if you think that I could do some easy(!) CUDA programming here, maybe you could guide me in the right direction... Thanks, Sebastian. On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote: On

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Wes McKinney
On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote: Wes, I think I should have a couple of GPUs. I would be ready for anything ... if you think that I could do some easy(!) CUDA programming here, maybe you could guide me in the right direction... Thanks, Sebastian.

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread eat
Hi, On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.com wrote: Hi, I assume that someone here could maybe help me, and I'm hoping it's not too much off topic. I have 2 arrays of 2d point coordinates and would like to calculate all pairwise distances as fast as possible.

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Sebastian Haase
Hi Eat, I will surely try these routines tomorrow, but I still think that neither scipy function does the complete distance calculation of all possible pairs as done by my C code. For 2 arrays, X and Y, of nX and nY 2d coordinates respectively, I need to get nX times nY distances computed. From

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Chris Colbert
The `cdist` function in scipy spatial does what you want, and takes ~1ms on my machine.
In [1]: import numpy as np
In [2]: from scipy.spatial.distance import cdist
In [3]: a = np.random.random((340, 2))
In [4]: b = np.random.random((329, 2))
In [5]: c = cdist(a, b)
In [6]: c.shape
Out[6]:

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Jonathan Taylor
Take a look at a nice project coming out of my department: http://code.google.com/p/cudamat/ Best, Jon. On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote: Wes, I think I should have a couple of GPUs. I would be ready for anything ... if you think that I could do

Re: [Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

2011-02-15 Thread Eric Carlson
I don't have the slightest idea what I'm doing, but file name - the_lib.c
___
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <math.h>

void dists2d(double *a_ps, int na,
             double *b_ps, int nb,
             double *dist, int num_threads)
{
    int
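[Editor's note: the digest cuts the listing off here. Below is a plausible completion, reconstructed from details quoted elsewhere in the thread (the private(j, i, ax, ay, dif_x, dif_y) pragma and the num_threads argument); Eric's original may well differ.]

#include <omp.h>
#include <math.h>

void dists2d(double *a_ps, int na,
             double *b_ps, int nb,
             double *dist, int num_threads)
{
    int i, j;
    double ax, ay, dif_x, dif_y;

    omp_set_num_threads(num_threads);

    /* all pairwise Euclidean distances between the na points in a_ps
       and the nb points in b_ps; both arrays hold packed (x, y) pairs */
    #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
    for (i = 0; i < na; i++) {
        ax = a_ps[i * 2];
        ay = a_ps[i * 2 + 1];
        for (j = 0; j < nb; j++) {
            dif_x = ax - b_ps[j * 2];
            dif_y = ay - b_ps[j * 2 + 1];
            dist[i * nb + j] = sqrt(dif_x * dif_x + dif_y * dif_y);
        }
    }
}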