Dear all,

My apologies if this has been asked before. I'm new to the list and couldn't find it in the archives.

I am seeing poor performance when running a multi-locale build on an InfiniBand-equipped cluster (http://cs.vu.nl/das4/clusters.shtml, VU site), even with only one node. Below is a minimal example that exhibits the same performance problem as all my programs.

I built chapel-1.14.0 with the following steps:

  export CHPL_TARGET_ARCH=native
  make -j
  export CHPL_COMM=gasnet
  export CHPL_COMM_SUBSTRATE=ibv
  make clean
  make -j

I compile the following Chapel code:

vectoradd.chpl:
---------------
use Time;
use Random;
use BlockDist;

config const n = 1024**3;

// for single-locale
// const ProblemDomain : domain(1) = {0..#n};
// for multi-locale
const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
  {0..#n};

type float = real(32);

proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
         b : [ProblemDomain] float) {
  forall i in ProblemDomain {
    c[i] = a[i] + b[i];
  }
}

proc main() {
  var c : [ProblemDomain] float;
  var a : [ProblemDomain] float;
  var b : [ProblemDomain] float;
  var t : Timer;

  fillRandom(a, 0);
  fillRandom(b, 42);

  t.start();
  add(c, a, b);
  t.stop();

  writeln("n: ", n);
  writeln("Time: ", t.elapsed(), "s");
  writeln("GFLOPS: ", n / t.elapsed() / 1e9, "s");
}
----

I compile this for single-locale (using no domain maps, see the comment in the source above) with:

  chpl -o vectoradd --fast vectoradd.chpl

I run it with (dual quad-core node with 2 hardware threads per core):

  export CHPL_RT_NUM_THREADS_PER_LOCALE=16
  ./vectoradd

And get as output:

  n: 1073741824
  Time: 0.558806s
  GFLOPS: 1.92149s

However, the performance for multi-locale is much worse. I compile this for multi-locale (with domain maps, see the comment in the source) with:

  CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
    vectoradd.chpl

I run it on the same type of node with:

  SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`

  export GASNET_PHYSMEM_MAX=1G
  export GASNET_IBV_SPAWNER=ssh
  export GASNET_SSH_SERVERS="$SSH_SERVERS"

  export CHPL_RT_NUM_THREADS_PER_LOCALE=16
  export CHPL_LAUNCHER=gasnetrun_ibv
  export CHPL_COMM=gasnet
  export CHPL_COMM_SUBSTRATE=ibv

  ./vectoradd -nl 1

And get as output:

  n: 1073741824
  Time: 8.65082s
  GFLOPS: 0.12412s

I would understand a performance difference of, say, 10% because of multi-locale execution, but not a slowdown by a factor of about 15 (0.56s versus 8.65s). Is this to be expected from the current state of Chapel? This performance difference is representative of basically all my programs, which are more realistic and use larger inputs.

The performance is strange because no communication is necessary (only one node) and the program uses the same number of threads. Is there any way for me to investigate this, for example with profiling?

By the way, the program does scale well to multiple nodes (which is not difficult given the baseline):

  nodes | time
      1 | 8.65s
      2 | 2.67s
      4 | 1.69s
      8 | 0.87s
     16 | 0.41s

Thanks in advance for your input.

Kind regards,

Pieter Hijma
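
The timings reported in the message above can be put in perspective with a few lines of arithmetic. The following is a minimal sketch and not part of the original message: it assumes a POSIX shell with awk, and the names timings, single_locale and scaling_report are made up for this illustration. It takes the 1-locale run of the multi-locale build (8.65s) as the baseline for speedup and parallel efficiency, and also compares every run against the single-locale build (0.558806s).

```shell
#!/bin/sh
# Timings copied from the message: "<locales> <seconds>" per line.
timings='1 8.65
2 2.67
4 1.69
8 0.87
16 0.41'

# Time of the single-locale build (no domain map) on one node, from the message.
single_locale=0.558806

# Hypothetical helper: prints speedup and efficiency for each row of $timings.
scaling_report() {
  printf '%s\n' "$timings" | awk -v ref="$single_locale" '
    NR == 1 { base = $2 }      # first row is the 1-locale multi-locale run
    {
      speedup = base / $2      # relative to the 1-locale multi-locale run
      eff     = speedup / $1   # parallel efficiency; above 1 means superlinear
      vs_ref  = ref / $2       # relative to the single-locale build
      printf "%2d locales: %5.2fs  speedup %5.2f  efficiency %4.2f  vs single-locale build %4.2f\n", $1, $2, speedup, eff, vs_ref
    }'
}

scaling_report
```

With the numbers from the message, the efficiency column stays above 1 at every node count (1.62, 1.28, 1.24 and 1.32 for 2, 4, 8 and 16 locales), and the 16-locale run comes out only about 1.36 times faster than the single-locale build on one node. Both suggest that the 1-locale multi-locale baseline is unusually slow, not that the scaling is unusually good.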
