On Sunday 20 February 2011 00:01:59, Sturla Molden wrote:
pthreads will give you better control than OpenMP, but are messy and
painful to work with.
With MPI you have separate processes, so everything is completely
isolated. It's more difficult to program and debug than OpenMP code,
but
Thanks a lot. Very informative. I guess what you say about the cache line
being dirtied is related to the info I got with valgrind (see my email
in this thread: L1 Data Write Miss 3636).
Can one assume that the cache line is always a few megabytes?
Thanks,
Sebastian
On Sat, Feb 19, 2011 at 12:40 AM,
On Sat, 19 Feb 2011 18:13:44 +0100, Sebastian Haase wrote:
Thanks a lot. Very informative. I guess what you say about the cache line
being dirtied is related to the info I got with valgrind (see my email in
this thread: L1 Data Write Miss 3636). Can one assume that the cache
line is always a few mega
On 19.02.2011 18:13, Sebastian Haase wrote:
Can one assume that the cache line is always a few megabytes?
Don't confuse the size of a cache with the size of a cache line.
A cache line (which is the unit that gets marked dirty) is typically
8-512 bytes.
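For what it's worth, on Linux with glibc the actual line size can be queried at run time; a minimal sketch, assuming the glibc-specific _SC_LEVEL1_DCACHE_LINESIZE name is available (it is not portable):

/* query_line.c -- compile with: gcc query_line.c -o query_line */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc extension; returns 0 or -1 where the value is unknown */
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line_size);
    return 0;
}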
Make sure your OpenMP threads stay
On 17.02.2011 16:31, Matthieu Brucher wrote:
It may also be the sizes of the chunk OMP uses. You can/should specify
them in the OMP pragma so that it is a multiple of the cache line size or
something close.
Matthieu
Also beware of false sharing among the threads. When one
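That sentence is cut off in this excerpt; the usual point is that when two threads keep writing to data that happens to sit on the same cache line, each write invalidates the line for the other core, even though no element is actually shared. A minimal sketch of one common remedy, padding per-thread accumulators to a whole cache line (the 64-byte size, the names, and the sum-of-squares example are assumptions, not code from this thread):

/* compile with: gcc -O2 -std=c99 -fopenmp -c false_sharing.c */
#include <omp.h>

#define LINE_SIZE   64   /* assumed cache line size */
#define MAX_THREADS 64

/* one accumulator per cache line, so the threads' writes never collide */
typedef struct { double value; char pad[LINE_SIZE - sizeof(double)]; } padded_double;

double sum_of_squares(const double *x, int n)
{
    padded_double partial[MAX_THREADS] = {{0}};
    int nthreads = 1;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();

        /* with a plain double partial[MAX_THREADS] the neighbouring slots
           would share cache lines and every update would false-share */
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[tid].value += x[i] * x[i];
    }

    double total = 0.0;
    for (int t = 0; t < nthreads; t++)
        total += partial[t].value;
    return total;
}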
Eric,
thanks for insisting on this. I noticed that when I saw it first,
only to forget about it again ...
The new timings on my machine are:
$: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
$: gcc -shared -o the_lib.so the_lib.o -lgomp -lm
$: python2.5 the_python_prog.py
c_threads 1 time
Do you think one could get even better?
And where does the 7% slow-down (for a single thread) come from?
Is it possible to have the OpenMP option in the code without _any_
penalty on 1-core machines?
There will always be a penalty for parallel code that runs on one core. You
have at least
Then, where does the overhead come from ? --
The call to omp_set_dynamic(dynamic);
Or the
#pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
It may be this. You initialize a thread pool, even if it has only one
thread, and there is the dynamic part, so OpenMP may create several
It may also be the sizes of the chunk OMP uses. You can/should specify
them in the OMP pragma so that it is a multiple of the cache line size or
something close.
Matthieu
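A small self-contained sketch (not Sebastian's code) that makes this cost visible: even with a single thread, the first parallel region pays for creating the thread pool, and every region pays a fork/join cost.

/* overhead.c -- compile with: gcc -O2 -fopenmp overhead.c -o overhead */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int k;
    double t0, t1;

    omp_set_dynamic(0);       /* keep the thread count fixed */
    omp_set_num_threads(1);

    for (k = 0; k < 3; k++) {
        t0 = omp_get_wtime();
        #pragma omp parallel
        {
            /* empty region: measures only pool start-up and fork/join */
        }
        t1 = omp_get_wtime();
        printf("parallel region %d: %.3g s\n", k, t1 - t0);
    }
    return 0;
}

On the single-core question: OpenMP's standard if() clause, e.g. #pragma omp parallel for if(num_threads > 1), lets the loop run serially when the condition is false, which removes most (though not necessarily all) of this overhead.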
2011/2/17 Sebastian Haase seb.ha...@gmail.com
Hi,
More surprises:
shaase@iris:~/code/SwiggedDistOMP: gcc -O3 -c the_lib.c
For 4 cores, on your system, your conclusion makes some sense. That
said, I played around with this on both a Core 2 Duo and a 12-core
system. On the 12-core system, in my tests the 0 case ran extremely
close to the 2-thread case for all my sizes.
The Core 2 Duo runs Windows 7, and after
Chris,
OK, sorry -- I misread (the cdist doc says A and B must have the same number
of columns(!), not rows).
On my machine I got the exact same timing as my (non-OpenMP) C code.
That is really good, compared to normal ufunc-based numpy code.
But my question in this thread is how to get better than that,
Eric,
this is amazing !! Thanks very much, I have rarely seen such a compact
source example that just worked.
The timings I get are:
c_threads 1 time 0.00155731916428
c_threads 2 time 0.000829789638519
c_threads 3 time 0.00061688839
c_threads 4 time 0.000704760551453
c_threads 5
Update:
I just noticed that using Eric's OpenMP code gave me only a 1.35x
speedup when comparing 3 threads vs. my non OpenMP code. However, when
comparing 3 threads vs. 1 thread, I could call this a 2.55x speedup.
This sounds much better, of course, but it is not the number
that matters...
Sebastian,
Optimization appears to be important here. I used no optimization in my
previous post, so you could try the -O3 compile option:
gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math
for na=329 and nb=340 I get (about a 7.5x speedup)
c_threads 1 time 0.00103106021881
c_threads 2 time
Hi,
I assume that someone here could maybe help me, and I'm hoping it's
not too much off topic.
I have 2 arrays of 2d point coordinates and would like to calculate
all pairwise distances as fast as possible.
Going from Python/Numpy to a (Swigged) C extension already gave me a
55x speedup.
(.9ms
Hi,
My first move would be to add a restrict keyword to dist (i.e. dist is the
only pointer to the specific memory location), and then declare dist_ inside
the first loop also with a restrict.
Then, I would run valgrind or a PAPI profile on your code to see what causes
the issue (false sharing,
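A generic sketch of what the restrict suggestion looks like in C99 (this is an illustration, not the thread's code; in the dists2d function the qualifier would go on dist and on the per-row dist_ pointer):

/* without restrict the compiler must assume out could alias a or b and
   re-load them on every iteration; restrict promises they never overlap */
void scaled_diff(const double *restrict a,
                 const double *restrict b,
                 double *restrict out, int n, double s)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = s * (a[i] - b[i]);
}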
Thanks Matthieu,
using __restrict__ with g++ did not change anything. How do I use
valgrind with C extensions?
I don't know what a PAPI profile is ...?
-Sebastian
On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
matthieu.bruc...@gmail.com wrote:
Hi,
My first move would be to add a restrict
Use restrict directly in C99 mode (__restrict does not have exactly the same
semantics).
For a valgrind profile, you can check my blog (
http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
Basically, if you have a Python script, you can run: valgrind [options from my blog]
python myscript.py
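The exact options are in the blog post above; as a minimal example (assuming the extension was built with debug info), the cache behaviour and the hot spots of the whole Python process can be inspected with:

$: valgrind --tool=cachegrind python2.5 the_python_prog.py   # L1/L2 miss counts
$: valgrind --tool=callgrind python2.5 the_python_prog.py    # per-function costs
$: callgrind_annotate callgrind.out.*                        # text report

cachegrind is the valgrind tool that produces L1 data read/write miss counts of the kind quoted earlier in the thread.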
For
On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
matthieu.bruc...@gmail.com wrote:
Use restrict directly in C99 mode (__restrict does not have exactly the same
semantics).
For a valgrind profile, you can check my blog
(http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
Basically,
Wes,
I think I should have a couple of GPUs. I would be ready for anything
... if you think that I could do some easy(!) CUDA programming here,
maybe you could guide me in the right direction...
Thanks,
Sebastian.
On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney wesmck...@gmail.com wrote:
On
On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote:
Wes,
I think I should have a couple of GPUs. I would be ready for anything
... if you think that I could do some easy(!) CUDA programming here,
maybe you could guide me in the right direction...
Thanks,
Sebastian.
Hi,
On Tue, Feb 15, 2011 at 5:50 PM, Sebastian Haase seb.ha...@gmail.com wrote:
Hi,
I assume that someone here could maybe help me, and I'm hoping it's
not too much off topic.
I have 2 arrays of 2d point coordinates and would like to calculate
all pairwise distances as fast as possible.
Hi Eat,
I will surely try these routines tomorrow,
but I still think that neither scipy function does the complete
distance calculation of all possible pairs as done by my C code.
For 2 arrays, X and Y, of nX and nY 2d coordinates respectively, I
need to get nX times nY distances computed.
From
The `cdist` function in scipy.spatial does what you want, and takes ~1 ms on
my machine.
In [1]: import numpy as np
In [2]: from scipy.spatial.distance import cdist
In [3]: a = np.random.random((340, 2))
In [4]: b = np.random.random((329, 2))
In [5]: c = cdist(a, b)
In [6]: c.shape
Out[6]: (340, 329)
Take a look at a nice project coming out of my department:
http://code.google.com/p/cudamat/
Best,
Jon.
On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase seb.ha...@gmail.com wrote:
Wes,
I think I should have a couple of GPUs. I would be ready for anything
... if you think that I could do
I don't have the slightest idea what I'm doing, but
file name - the_lib.c
___
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <math.h>
void dists2d( double *a_ps, int na,
double *b_ps, int nb,
double *dist, int num_threads)
{
int
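The listing breaks off here. Below is a hedged reconstruction of what the rest of such a function could look like, pieced together only from the signature above and the pragma quoted earlier in the thread; it is not Eric's actual code, and the guard on num_threads is an assumption:

void dists2d( double *a_ps, int na,
              double *b_ps, int nb,
              double *dist, int num_threads)
{
    int i, j;
    double ax, ay, dif_x, dif_y;

    if (num_threads > 0)
        omp_set_num_threads(num_threads);   /* assumed: honour the argument */

    #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
    for (i = 0; i < na; i++) {
        ax = a_ps[i*2];
        ay = a_ps[i*2 + 1];
        for (j = 0; j < nb; j++) {
            dif_x = ax - b_ps[j*2];
            dif_y = ay - b_ps[j*2 + 1];
            dist[i*nb + j] = sqrt(dif_x*dif_x + dif_y*dif_y);
        }
    }
}

A schedule(static, chunk) clause on the same pragma is where the chunk-size advice from earlier in the thread would go.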