Re: Adding OpenMP support for some of the GSL functions

Maxime Boissonneault Thu, 13 Dec 2012 13:05:33 -0800

Hi Rhys,

I did the comparison you requested. Here is the load balancinginformation for my test run :

These were all obtained with gcc 4.7.2. The total times for the runs were :
- Without vectorization or OpenMP : 1m21s
- With vectorization : 1m14s
- With OpenMP : 1m01s


Here is load-balance information obtained with OpenSpeedshop.
Without vectorization or OpenMP :

Max Posix ThreadId Min Posix ThreadId AverageFunction (defining location)

  Exclusive           of Max    Exclusive           of Min Exclusive
Time Across                   Time Across                   Time Across
      Posix                         Posix Posix
ThreadIds(s)                   ThreadIds(s) ThreadIds(s)

28.085714 140384673677232 28.085714 140384673677232 28.085714rkf45_apply (libgsl.so.0.16.0)


With OpenMP :

Max Posix ThreadId Min Posix ThreadId AverageFunction (defining location)

  Exclusive           of Max    Exclusive           of Min Exclusive
Time Across                   Time Across                   Time Across
      Posix                         Posix Posix
ThreadIds(s)                   ThreadIds(s) ThreadIds(s)

[...]

2.800000 1146448192 1.228571 1088166208 1.778571rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)2.171429 1146448192 0.971429 1129662784 1.460714rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)2.142857 1146448192 1.400000 140073673240496 1.671429rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)2.085714 1146448192 1.257143 1112877376 1.725000rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)2.085714 1146448192 1.171429 1088166208 1.578571rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)1.600000 1146448192 0.885714 140073673240496 1.139286rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)0.714286 1146448192 0.457143 1129662784 0.557143rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)

[...]
The 7 loop make a total of about ~12s.


With vectorization :

Max Posix ThreadId Min Posix ThreadId AverageFunction (defining location)

  Exclusive           of Max    Exclusive           of Min Exclusive
Time Across                   Time Across                   Time Across
      Posix                         Posix Posix
ThreadIds(s)                   ThreadIds(s) ThreadIds(s)

24.542857 140440801677232 24.542857 140440801677232 24.542857rkf45_apply (libgsl.so.0.16.0)

This was obtained with gcc 4.7.2, with 2 quad-core CPUs, for a total of8 threads. To summarize :- Vectorization makes a difference, but it is minor (gained 4s out ofthe 28s used by the normal rkf45_apply ).

- OpenMP is the clear winner (gained 20s out of 28s).


Maxime





Le 2012-12-13 10:51, Rhys Ulerich a écrit :

I am doing that too, but any gain we can get is an important one, and it
turns out that by parallelizing rkf45_apply, my simulation runs 30% faster
on 8 cores.

That's a parallel efficiency of 0.18% ( = 1 time unit / (8 cores *
0.70 time units)).  This feels like you're getting a small
memory/cache bandwidth increase for the rkf45_apply level-1-BLAS-like
operations by using multiple cores but the cores are otherwise not
being used effectively.  I say this because a state vector 1e6 doubles
long will not generally fit in cache.  Adding more cores increases the
amount of cache available.

I will have a deeper look at vectorization of GSL, but in my understanding,
vectorizing can only be done with simple operations, while algorithms like
RKF45 involve about 10 operations per loop iterations.

The compilers are generally very good.  Intel's icc 11.1 has to be
told that the last four loops you annotated are vectorizable.  GCC
nails it out of the box.

On GCC 4.4.3 with something like
     CFLAGS="-g -O2 -march=native -mtune=native
-ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make
shows every one of those 6 loops vectorizing.  You can check this by
configuring with those options, running make and waiting for the build
to finish, and then cd-ing into ode-initval2 and running
     rm rkf45*o && make
and observing all those beautiful
     LOOP VECTORIZED
messages.  Better yet, with those options, 'make check' passes for me
on the 'ode-initval2' subdirectory.

Try ripping out your OpenMP pragmas in GSL, building with
vectorization against stock GSL as I suggested, and then seeing how
fast your code runs with GSL vectorized on 1 core versus GSL's
unvectorized rkf45_apply parallelized over 8 cores.  I suspect it will
be comparable.

- Rhys

Re: Adding OpenMP support for some of the GSL functions

Reply via email to