Dear all,

My apologies if this has already been asked before.  I'm new to the list 
and couldn't find it in the archives.

I experience bad performance when running the multi-locale compiled 
version on an InfiniBand-equipped cluster 
(http://cs.vu.nl/das4/clusters.shtml, VU-site), even with only one node. 
Below is a minimal example that exhibits the same performance 
problems as all my programs.

I compiled chapel-1.14.0 with the following steps:

export CHPL_TARGET_ARCH=native
make -j
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv
make clean
make -j

I compile the following Chapel code:

vectoradd.chpl:
---------------
use Time;
use Random;
use BlockDist;

config const n = 1024**3;

// for single-locale
// const ProblemDomain : domain(1) = {0..#n};
// for multi-locale
const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
     {0..#n};

type float = real(32);

proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
     b : [ProblemDomain] float) {
   forall i in ProblemDomain {
     c[i] = a[i] + b[i];
   }
}

proc main() {
   var c : [ProblemDomain] float;
   var a : [ProblemDomain] float;
   var b : [ProblemDomain] float;
   var t : Timer;

   fillRandom(a, 0);
   fillRandom(b, 42);

   t.start();
   add(c, a, b);
   t.stop();

   writeln("n: ", n);
   writeln("Time: ", t.elapsed(), "s");
   writeln("GFLOPS: ", n / t.elapsed() / 1e9, "s");
}
----

I compile this for single-locale with (using no domain maps, see the 
comment above in the source):

chpl -o vectoradd --fast vectoradd.chpl

I run it with (the node is a dual quad-core with 2 hardware threads per core):

export CHPL_RT_NUM_THREADS_PER_LOCALE=16
./vectoradd

And get as output:

n: 1073741824
Time: 0.558806s
GFLOPS: 1.92149s

However, the performance for multi-locale is much worse.

I compile this for multi-locale with (using domain maps, see the comment 
in the source):

CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
   vectoradd.chpl

I run it on the same type of node with:

SSH_SERVERS=`uniq $TMPDIR/machines  | tr '\n' ' '`

export GASNET_PHYSMEM_MAX=1G
export GASNET_IBV_SPAWNER=ssh
export GASNET_SSH_SERVERS="$SSH_SERVERS"

export CHPL_RT_NUM_THREADS_PER_LOCALE=16
export CHPL_LAUNCHER=gasnetrun_ibv
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv

./vectoradd -nl 1

And get as output:

n: 1073741824
Time: 8.65082s
GFLOPS: 0.12412s

I would understand a performance difference of, say, 10% because of 
multi-locale execution, but not a factor of about 15.  Is this to be 
expected from the current state of Chapel?  This performance difference 
is representative of basically all my programs, which are more realistic 
and use larger inputs.  The performance is strange as no communication is 
necessary (only one node) and the program is using the same number of 
threads.
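
As a rough sanity check on these numbers, the ratio of the two timings and 
the effective memory bandwidth can be worked out as below.  This is only a 
sketch: it assumes each element costs two 4-byte loads and one 4-byte store 
(ignoring any write-allocate traffic), and the helper names slowdown and 
bandwidth are made up for illustration.

```shell
#!/bin/sh
# Rough arithmetic on the timings reported above (values copied from the output).
# Assumption: c[i] = a[i] + b[i] on real(32) moves 3 * 4 bytes per element.

# slowdown MULTI SINGLE -> ratio of the two elapsed times
slowdown() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.2f\n", m / s }'
}

# bandwidth N SECONDS -> effective GB/s for an N-element vector add
bandwidth() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 3 * 4 * n / t / 1e9 }'
}

echo "slowdown:      $(slowdown 8.65082 0.558806)x"
echo "single-locale: $(bandwidth 1073741824 0.558806) GB/s"
echo "multi-locale:  $(bandwidth 1073741824 8.65082) GB/s"
```

Under that assumption the single-locale run moves roughly 23 GB/s and the 
one-node multi-locale run roughly 1.5 GB/s, so the kernel is memory-bound 
in the first case and far from it in the second.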

Is there any way for me to investigate this using profiling for example?

By the way, the program does scale well to multiple nodes (which is not 
difficult given the baseline):

nodes | time
    1 | 8.65s
    2 | 2.67s
    4 | 1.69s
    8 | 0.87s
   16 | 0.41s
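
To put the scaling in perspective, the speedup and parallel efficiency 
relative to the one-node multi-locale run can be computed from the table.  
This is a small sketch; the function name speedups is hypothetical, and the 
input is just the node counts and times listed above.

```shell
#!/bin/sh
# speedups: reads "nodes seconds" pairs on stdin and prints, for each row,
# the speedup over the first row and the parallel efficiency (speedup / nodes).
speedups() {
    awk 'NR == 1 { base = $2 }
         { printf "%d nodes: %.2fx speedup, %.0f%% efficiency\n",
                  $1, base / $2, 100 * base / $2 / $1 }'
}

# Timings copied from the table above (multi-locale runs).
speedups <<'EOF'
1 8.65
2 2.67
4 1.69
8 0.87
16 0.41
EOF
```

The efficiency comes out above 100% at every node count, which says more 
about the slow one-node baseline than about the scaling itself: measured 
against the single-locale run (0.558806s), 16 nodes are only about 1.36 
times faster.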

Thanks in advance for your input.

Kind regards,

Pieter Hijma

_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users
