Hi Ben,
Good suggestion. I tested with the 1D-convolution program, using the
Chapel version compiled with GCC 6.2.0. The single-locale version is in
the directory 'datapar' and the multi-locale version is in the directory
'datapar-dist'.
The Makefiles are now the same:
$ diff datapar/Makefile datapar-dist/Makefile
This means that I compile both programs with:
$ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o 1D-convolution \
--fast 1D-convolution.chpl
The job files are also the same:
$ diff datapar/das4.job datapar-dist/das4.job
The contents of das4.job (an SGE script):
-----
#!/bin/bash
#$ -l h_rt=0:15:00
#$ -N CONVOLUTION_1D
#$ -cwd
. ~/.bashrc
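# Build the list of ssh targets for GASNet from the machine file SGE provides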
SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
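# Chapel/GASNet runtime configuration for the InfiniBand (ibv) conduit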
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_LAUNCHER=gasnetrun_ibv
export CHPL_RT_NUM_THREADS_PER_LOCALE=16
export GASNET_IBV_SPAWNER=ssh
export GASNET_PHYSMEM_MAX=1G
export GASNET_SSH_SERVERS="$SSH_SERVERS"
APP=./1D-convolution
ARGS=$*
$APP $ARGS
------
The difference between the two source files is now basically that the
single-locale version has only the default domain map, whereas the
multi-locale version adds the Block domain map (from the BlockDist
module):
$ diff datapar/1D-convolution.chpl datapar-dist/1D-convolution.chpl
2a3
> use BlockDist;
6c7
< const ProblemDomain : domain(1) = {0..#n};
---
> const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
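As an aside, the two source files could probably be collapsed into one by
selecting the domain map with a compile-time param. An untested sketch
('useBlockDist' is a name I made up; set it with '-suseBlockDist=true'):
-----
use BlockDist;
config const n = 1024**3/2;
config param useBlockDist = false;
// A param conditional is folded at compile time, so the two branches may
// produce domains with different domain-map types.
const ProblemDomain = if useBlockDist
                      then {0..#n} dmapped Block(boundingBox = {0..#n})
                      else {0..#n};
-----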
The output of 'datapar', the original single-locale version without
BlockDist:
convolveIndices, n: 536870912
Time: 0.319077s
GFLOPS: 5.04772
convolveZip, n: 536870912
Time: 0.320788s
GFLOPS: 5.0208
The output of 'datapar-dist', the original multi-locale version with
BlockDist:
convolveIndices, n: 536870912
Time: 3.1422s
GFLOPS: 0.512575
convolveZip, n: 536870912
Time: 3.54989s
GFLOPS: 0.453708
I guess we can conclude that the addition of the Block domain map to
ProblemDomain, by itself, results in a factor-of-10 slowdown.
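One experiment that might narrow this down further: wrapping the kernel in
a 'local' block, which asserts that no communication occurs (and halts at
run time if a remote access slips through, which should not happen with
-nl 1). An untested sketch against the BlockDist version:
-----
proc convolveLocal(output : [ProblemDomain] real(32),
                   input : [ProblemDomain] real(32)) {
  local {
    // With -nl 1 every element is local, so the assertion should hold.
    forall i in 1..n-2 {
      output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
    }
  }
}
-----
If that variant runs at single-locale speed, the slowdown would seem to be
wide-pointer/locality-check overhead rather than actual communication.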
Kind regards,
Pieter Hijma
On 14/11/16 20:21, Ben Harshbarger wrote:
> Hi Pieter,
>
> My next suggestion would be to try compiling and running the "single locale"
> variation with the same environment variables that you use for multilocale.
> I'm wondering if the use of IBV is impacting performance in some way. I don't
> see the performance issue on our internal ibv cluster, but it's worth
> checking.
>
> -Ben Harshbarger
>
> On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote:
>
> Hi Ben,
>
> Thanks for your help.
>
> On 07/11/16 18:59, Ben Harshbarger wrote:
> > When CHPL_COMM is set to 'none', our compiler can avoid introducing
> > some overhead that is necessary for multi-locale programs. You can
> > force this overhead when CHPL_COMM == none by compiling with the flag
> > "--no-local". If you compile your single-locale program with that
> > flag, does the performance get worse?
>
> It makes some difference, but not much:
>
> chpl -o vectoradd --fast vectoradd.chpl
>
> addNoDomain n: 1073741824
> Time: 0.57211s
> GFLOPS: 1.87681
>
> addZip n: 1073741824
> Time: 0.571799s
> GFLOPS: 1.87783
>
> addForall n: 1073741824
> Time: 0.571623s
> GFLOPS: 1.87841
>
> addCollective n: 1073741824
> Time: 0.571395s
> GFLOPS: 1.87916
>
>
> chpl -o vectoradd --fast --no-local vectoradd.chpl
>
> addNoDomain n: 1073741824
> Time: 0.62087s
> GFLOPS: 1.72941
>
> addZip n: 1073741824
> Time: 0.619997s
> GFLOPS: 1.73185
>
> addForall n: 1073741824
> Time: 0.620645s
> GFLOPS: 1.73004
>
> addCollective n: 1073741824
> Time: 0.620254s
> GFLOPS: 1.73113
>
>
> > If that's the case, I'm not entirely sure what the next step would
> > be. Do you have access to a newer version of GCC? The backend C
> > compiler can matter when it comes to optimizing the multi-locale
> > overhead.
>
> It is indeed an old one. We also have GCC 4.9.0, Intel 13.3, and I
> compiled GCC 6.2.0 to check:
>
> * intel/compiler/64/13.3/2013.3.163
>
> I basically see the same behavior:
>
> single locale:
>
> addNoDomain n: 536870912
> Time: 0.285186s
> GFLOPS: 1.88253
>
> addZip n: 536870912
> Time: 0.284819s
> GFLOPS: 1.88495
>
> addForall n: 536870912
> Time: 0.287904s
> GFLOPS: 1.86476
>
> addCollective n: 536870912
> Time: 0.284912s
> GFLOPS: 1.88434
>
> multi-locale, one node:
>
> addNoDomain n: 536870912
> Time: 3.24471s
> GFLOPS: 0.16546
>
> addZip n: 536870912
> Time: 3.01287s
> GFLOPS: 0.178192
>
> addForall n: 536870912
> Time: 7.23895s
> GFLOPS: 0.0741642
>
> addCollective n: 536870912
> Time: 2.59501s
> GFLOPS: 0.206886
>
>
> * GCC 4.9.0
>
> This is encouraging: the performance improves to within a factor of two
> of single-locale, except for the explicit indices in the forall:
>
> single locale:
>
> addNoDomain n: 536870912
> Time: 0.277222s
> GFLOPS: 1.93661
>
> addZip n: 536870912
> Time: 0.27566s
> GFLOPS: 1.94758
>
> addForall n: 536870912
> Time: 0.27609s
> GFLOPS: 1.94455
>
> addCollective n: 536870912
> Time: 0.275303s
> GFLOPS: 1.95011
>
> multi-locale, single node:
>
> addNoDomain n: 536870912
> Time: 0.492954s
> GFLOPS: 1.08909
>
> addZip n: 536870912
> Time: 0.493039s
> GFLOPS: 1.0889
>
> addForall n: 536870912
> Time: 2.85323s
> GFLOPS: 0.188162
>
> addCollective n: 536870912
> Time: 0.492135s
> GFLOPS: 1.0909
>
>
> * GCC 6.2.0
>
> The performance on multi-locale is now even better. Still very low for
> explicit indices in the forall.
>
> single locale:
>
> addNoDomain n: 536870912
> Time: 0.283272s
> GFLOPS: 1.89525
>
> addZip n: 536870912
> Time: 0.281942s
> GFLOPS: 1.90419
>
> addForall n: 536870912
> Time: 0.282291s
> GFLOPS: 1.90184
>
> addCollective n: 536870912
> Time: 0.281629s
> GFLOPS: 1.90631
>
> multi-locale, single node:
>
> addNoDomain n: 536870912
> Time: 0.358012s
> GFLOPS: 1.49959
>
> addZip n: 536870912
> Time: 0.356696s
> GFLOPS: 1.50512
>
> addForall n: 536870912
> Time: 2.92173s
> GFLOPS: 0.183751
>
> addCollective n: 536870912
> Time: 0.343808s
> GFLOPS: 1.56154
>
>
>
> Since this is encouraging, I also verified the performance of the
> 1D-stencils:
>
> * GCC 4.4.7
>
> For reference, the old compiler that I used initially:
>
> single locale:
>
> convolveIndices, n: 536870912
> Time: 0.82361s
> GFLOPS: 1.95555
>
> convolveZip, n: 536870912
> Time: 0.810028s
> GFLOPS: 1.98834
>
> multi-locale, one node:
>
> convolveIndices, n: 536870912
> Time: 4.25951s
> GFLOPS: 0.378122
>
> convolveZip, n: 536870912
> Time: 4.88046s
> GFLOPS: 0.330012
>
> * intel/compiler/64/13.3/2013.3.163
>
> With this compiler the single-locale performance is better than with
> the previous compiler. However, the multi-locale, one-node performance
> is about a factor of 3 slower than with the previous compiler.
>
> single locale:
>
> convolveIndices, n: 536870912
> Time: 0.554139s
> GFLOPS: 2.90651
>
> convolveZip, n: 536870912
> Time: 0.556653s
> GFLOPS: 2.89339
>
>
> multi-locale, one node:
>
> convolveIndices, n: 536870912
> Time: 10.5368s
> GFLOPS: 0.152856
>
> convolveZip, n: 536870912
> Time: 12.7625s
> GFLOPS: 0.126198
>
>
> * GCC 4.9.0
>
> The single-locale performance is much better than with GCC 4.4.7; the
> multi-locale, one-node configuration is still poor, although a bit
> better.
>
> single locale:
>
> convolveIndices, n: 536870912
> Time: 0.207055s
> GFLOPS: 7.77867
>
> convolveZip, n: 536870912
> Time: 0.206783s
> GFLOPS: 7.7889
>
> multi-locale, one node:
>
> convolveIndices, n: 536870912
> Time: 3.20851s
> GFLOPS: 0.501981
>
> convolveZip, n: 536870912
> Time: 3.652s
> GFLOPS: 0.441023
>
>
> * GCC 6.2.0
>
> Strangely enough, the single-locale performance is a bit lower than
> with the previous compiler, while the multi-locale, one-node
> performance is about the same.
>
> single-locale:
>
> convolveIndices, n: 536870912
> Time: 0.263151s
> GFLOPS: 6.12049
>
> convolveZip, n: 536870912
> Time: 0.262234s
> GFLOPS: 6.14189
>
> multi-locale, one node:
>
> convolveIndices, n: 536870912
> Time: 3.12716s
> GFLOPS: 0.515039
>
> convolveZip, n: 536870912
> Time: 3.58663s
> GFLOPS: 0.44906
>
>
> The conclusion is that the compiler indeed has a large impact on the
> multi-locale performance, but probably only in simple cases such as
> vector addition. With the stencil code, although it is not very
> complicated, the performance falls back into the pattern that I came
> across originally.
>
> However, perhaps this gives you an idea of which optimizations impact
> the performance? If we can't find a solution, I would at least like to
> understand the lack of performance.
>
> I also checked the performance of the stencils using BlockDist instead
> of StencilDist, and it makes no difference.
>
> > You may also want to consider setting CHPL_TARGET_ARCH to something
> > else if you're compiling on a machine architecture different from the
> > compute nodes. There's more information about CHPL_TARGET_ARCH here:
> >
> > http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
>
> The head node and compute nodes are all Intel Xeon Westmeres, so I
> don't think that makes a difference. To be absolutely sure, I also
> compiled Chapel and the applications on a compute node and indeed, the
> performance is comparable to all measurements above.
>
> Kind regards,
>
> Pieter Hijma
>
>
> > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:
> >
> > Dear Ben,
> >
> > Sorry for my late replies. Unfortunately, for some reason, these
> > emails are marked as spam even though I marked the list and your
> > address as safe. I will make sure I check my spam folders
> > meticulously from now on.
> >
> > On 28/10/16 23:34, Ben Harshbarger wrote:
> > > Hi Pieter,
> > >
> > > Sorry that you're still having issues. I think we'll need some
> > > more information before going forward:
> > >
> > > 1) Could you send us the output of "$CHPL_HOME/util/printchplenv
> > > --anonymize"? It's a script that displays the various CHPL_
> > > environment variables. "--anonymize" strips the output of
> > > information you may prefer to keep private (machine info, paths).
> >
> > This would be the setup if running single-locale programs:
> >
> > $ printchplenv --anonymize
> > CHPL_TARGET_PLATFORM: linux64
> > CHPL_TARGET_COMPILER: gnu
> > CHPL_TARGET_ARCH: native *
> > CHPL_LOCALE_MODEL: flat
> > CHPL_COMM: none
> > CHPL_TASKS: qthreads
> > CHPL_LAUNCHER: none
> > CHPL_TIMERS: generic
> > CHPL_UNWIND: none
> > CHPL_MEM: jemalloc
> > CHPL_MAKE: gmake
> > CHPL_ATOMICS: intrinsics
> > CHPL_GMP: gmp
> > CHPL_HWLOC: hwloc
> > CHPL_REGEXP: re2
> > CHPL_WIDE_POINTERS: struct
> > CHPL_AUX_FILESYS: none
> >
> > When I run multi-locale programs, I set the following environment
> > variables:
> >
> > export CHPL_COMM=gasnet
> > export CHPL_COMM_SUBSTRATE=ibv
> >
> > Then the Chapel environment would be:
> >
> > $ printchplenv --anonymize
> > CHPL_TARGET_PLATFORM: linux64
> > CHPL_TARGET_COMPILER: gnu
> > CHPL_TARGET_ARCH: native *
> > CHPL_LOCALE_MODEL: flat
> > CHPL_COMM: gasnet *
> > CHPL_COMM_SUBSTRATE: ibv *
> > CHPL_GASNET_SEGMENT: large
> > CHPL_TASKS: qthreads
> > CHPL_LAUNCHER: gasnetrun_ibv
> > CHPL_TIMERS: generic
> > CHPL_UNWIND: none
> > CHPL_MEM: jemalloc
> > CHPL_MAKE: gmake
> > CHPL_ATOMICS: intrinsics
> > CHPL_NETWORK_ATOMICS: none
> > CHPL_GMP: gmp
> > CHPL_HWLOC: hwloc
> > CHPL_REGEXP: re2
> > CHPL_WIDE_POINTERS: struct
> > CHPL_AUX_FILESYS: none
> >
> >
> > > 2) What C compiler are you using?
> >
> > $ gcc --version
> > gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
> > Copyright (C) 2010 Free Software Foundation, Inc.
> > This is free software; see the source for copying conditions. There
> > is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> > PARTICULAR PURPOSE.
> >
> > > 3) Are you sure that the programs are being launched correctly?
> > > This might seem silly, but it's worth double-checking that the
> > > programs are actually running on the same hardware (not necessarily
> > > the same node though).
> >
> > I am completely certain that the single-locale program, the
> > multi-locale program for one node, and the multi-locale program for
> > multiple nodes are all running on the compute nodes. I'm not
> > completely sure what you mean by "the same hardware". All compute
> > nodes have the same hardware, if that is what you mean.
> >
> > > I'd also like to clarify what you mean by "multi-locale compiled".
> > > Is the difference between the programs just the use of the Block
> > > domain map, or do you compile with different environment variables
> > > set?
> >
> > I compile different programs and I use different environment
> > variables:
> >
> > The single-locale version of vectoradd is located in the datapar
> > directory, whereas the multi-locale version is located in the
> > datapar-dist directory. What follows is the diff for the .chpl file:
> >
> > $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
> > 8c8
> > < const ProblemDomain : domain(1) = {0..#n};
> > ---
> > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> >
> > The diff for the Makefile:
> >
> > $ diff datapar/Makefile datapar-dist/Makefile
> > 2a3
> > > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
> > 8c9
> > < $(CHPL) -o $@ $(FLAGS) $<
> > ---
> > > $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
> > 11c12
> > < rm -f $(APP)
> > ---
> > > rm -f $(APP) $(APP)_real
> >
> > Thanks for your help, and again my apologies for the delayed
> > answers.
> >
> > Kind regards,
> >
> > Pieter Hijma
> >
> > >
> > > -Ben Harshbarger
> > >
> > > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > >
> > > Hi Ben,
> > >
> > > Thank you for your fast reply and suggestions! I did some more
> > > tests and also included stencil operations.
> > >
> > > First, the vector addition:
> > >
> > > vectoradd.chpl
> > > --------------
> > > use Time;
> > > use Random;
> > > use BlockDist;
> > > //use VisualDebug;
> > >
> > > config const n = 1024**3/2;
> > >
> > > // for multi-locale
> > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > > // for single-locale (comment out the declaration above and use this one)
> > > // const ProblemDomain : domain(1) = {0..#n};
> > >
> > > type float = real(32);
> > >
> > > proc addNoDomain(c : [] float, a : [] float, b : [] float) {
> > > forall (ci, ai, bi) in zip(c, a, b) {
> > > ci = ai + bi;
> > > }
> > > }
> > >
> > > proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > >             b : [ProblemDomain] float) {
> > > forall (ci, ai, bi) in zip(c, a, b) {
> > > ci = ai + bi;
> > > }
> > > }
> > >
> > > proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > >                b : [ProblemDomain] float) {
> > > //startVdebug("vdata");
> > > forall i in ProblemDomain {
> > > c[i] = a[i] + b[i];
> > > }
> > > //stopVdebug();
> > > }
> > >
> > > proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > >                    b : [ProblemDomain] float) {
> > > c = a + b;
> > > }
> > >
> > > proc output(t : Timer, n, testName) {
> > > t.stop();
> > > writeln(testName, " n: ", n);
> > > writeln("Time: ", t.elapsed(), "s");
> > > writeln("GFLOPS: ", n / t.elapsed() / 1e9);
> > > writeln();
> > > t.clear();
> > > }
> > >
> > > proc main() {
> > > var c : [ProblemDomain] float;
> > > var a : [ProblemDomain] float;
> > > var b : [ProblemDomain] float;
> > > var t : Timer;
> > >
> > > fillRandom(a, 0);
> > > fillRandom(b, 42);
> > >
> > > t.start();
> > > addNoDomain(c, a, b);
> > > output(t, n, "addNoDomain");
> > >
> > > t.start();
> > > addZip(c, a, b);
> > > output(t, n, "addZip");
> > >
> > > t.start();
> > > addForall(c, a, b);
> > > output(t, n, "addForall");
> > >
> > > t.start();
> > > addCollective(c, a, b);
> > > output(t, n, "addCollective");
> > > }
> > > -----
> > >
> > > On a single locale I get as output:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.27961s
> > > GFLOPS: 1.92007
> > >
> > > addZip n: 536870912
> > > Time: 0.278657s
> > > GFLOPS: 1.92664
> > >
> > > addForall n: 536870912
> > > Time: 0.278015s
> > > GFLOPS: 1.93109
> > >
> > > addCollective n: 536870912
> > > Time: 0.278379s
> > > GFLOPS: 1.92856
> > >
> > > On multi-locale (-nl 1) I get as output:
> > >
> > > addNoDomain n: 536870912
> > > Time: 2.16806s
> > > GFLOPS: 0.247627
> > >
> > > addZip n: 536870912
> > > Time: 2.17024s
> > > GFLOPS: 0.247378
> > >
> > > addForall n: 536870912
> > > Time: 4.78443s
> > > GFLOPS: 0.112212
> > >
> > > addCollective n: 536870912
> > > Time: 2.19838s
> > > GFLOPS: 0.244212
> > >
> > > So, indeed, your suggestion improves it by more than a factor of
> > > two, but it is still close to a factor of 8 slower than
> > > single-locale.
> > >
> > > I also used chplvis and verified that there are no gets and puts
> > > when running multi-locale with more than one node. The profiling
> > > information is clear, but not very helpful (to me):
> > >
> > > multi-locale (-nl 1):
> > >
> > > | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > | 4.8777 | wrapon_fn_chpl35 | vectoradd.chpl:26 |
> > >
> > > single-locale:
> > >
> > > | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > >
> > >
> > >
> > > For stencil operations, I used the following program:
> > >
> > > 1d-convolution.chpl
> > > -------------------
> > > use Time;
> > > use Random;
> > > use StencilDist;
> > >
> > > config const n = 1024**3/2;
> > >
> > > const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
> > >                                                 fluff = (1,)) = {0..#n};
> > > const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
> > >
> > > proc convolveIndices(output : [ProblemDomain] real(32),
> > > input : [ProblemDomain] real(32)) {
> > > forall i in InnerDomain {
> > > output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
> > > }
> > > }
> > >
> > > proc convolveZip(output : [ProblemDomain] real(32),
> > > input : [ProblemDomain] real(32)) {
> > > forall (im1, i, ip1) in zip(InnerDomain.translate(-1),
> > > InnerDomain,
> > > InnerDomain.translate(1)) {
> > > output[i] = ((input[im1] + input[i] + input[ip1])/3:real(32));
> > > }
> > > }
> > >
> > > proc print(t : Timer, n, s) {
> > > t.stop();
> > > writeln(s, ", n: ", n);
> > > writeln("Time: ", t.elapsed(), "s");
> > > writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
> > > writeln();
> > > t.clear();
> > > }
> > >
> > > proc main() {
> > > var input : [ProblemDomain] real(32);
> > > var output : [ProblemDomain] real(32);
> > > var t : Timer;
> > >
> > > fillRandom(input, 42);
> > >
> > > t.start();
> > > convolveIndices(output, input);
> > > print(t, n, "convolveIndices");
> > >
> > > t.start();
> > > convolveZip(output, input);
> > > print(t, n, "convolveZip");
> > > }
> > > ------
> > >
> > > Interestingly, in contrast to your earlier suggestion, the direct
> > > indexing works a bit better in this program than the zipped
> > > version:
> > >
> > > Multi-locale (-nl 1):
> > >
> > > convolveIndices, n: 536870912
> > > Time: 4.27148s
> > > GFLOPS: 0.377062
> > >
> > > convolveZip, n: 536870912
> > > Time: 4.87291s
> > > GFLOPS: 0.330524
> > >
> > > Single-locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.548804s
> > > GFLOPS: 2.93477
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.538754s
> > > GFLOPS: 2.98951
> > >
> > >
> > > Again, the multi-locale version is about a factor of 8 slower than
> > > single-locale. By the way, the Stencil distribution is a bit
> > > faster than the Block distribution.
> > >
> > > Thanks in advance for your input,
> > >
> > > Pieter
> > >
> > >
> > >
> > > On 24/10/16 19:20, Ben Harshbarger wrote:
> > > > Hi Pieter,
> > > >
> > > > Thanks for providing the example, that's very helpful.
> > > >
> > > > Multi-locale performance in Chapel is not yet where we'd like it
> > > > to be, but we've done a lot of work over the past few releases
> > > > to get cases like yours performing well. It's surprising that
> > > > using Block results in that much of a difference, but I think
> > > > you would see better performance by iterating over the arrays
> > > > directly:
> > > >
> > > > ```
> > > > // replace the loop in the 'add' function with this:
> > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > ci = ai + bi;
> > > > }
> > > > ```
> > > >
> > > > Block-distributed arrays can leverage the fast-follower
> > > > optimization to perform better when all arrays being iterated
> > > > over share the same domain. You can also write that loop in a
> > > > cleaner way by leveraging array promotion:
> > > >
> > > > ```
> > > > // This is equivalent to the first loop
> > > > c = a + b;
> > > > ```
> > > >
> > > > However, when I tried the promoted variation on my machine I
> > > > observed worse performance than the explicit forall-loop. It
> > > > seems to be related to the way the arguments of the 'add'
> > > > function are declared. If you replace "[ProblemDomain] float"
> > > > with "[] float", performance seems to improve. That surprised a
> > > > couple of us on the development team, and I'll be looking at
> > > > that some more today.
> > > >
> > > > If you're still seeing significantly worse performance with
> > > > Block compared to the default rectangular domain, and the
> > > > programs are launched in the same way, that would be odd. You
> > > > could try profiling using chplvis. I agree though that there
> > > > shouldn't be any communication in this program. You can find
> > > > more information on chplvis in the online 1.14 release
> > > > documentation:
> > > >
> > > > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
> > > >
> > > > I hope that rewriting the loops solves the problem, but let us
> > > > know if it doesn't, and we can continue investigating.
> > > >
> > > > -Ben Harshbarger
> > > >
> > > > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > > >
> > > > Dear all,
> > > >
> > > > My apologies if this has already been asked before. I'm new to
> > > > the list and couldn't find it in the archives.
> > > >
> > > > I experience bad performance when running the multi-locale
> > > > compiled version on an InfiniBand-equipped cluster
> > > > (http://cs.vu.nl/das4/clusters.shtml, VU site), even with only
> > > > one node. Below you find a minimal example that exhibits the
> > > > same performance problems as all my programs:
> > > >
> > > > I compiled chapel-1.14.0 with the following steps:
> > > >
> > > > export CHPL_TARGET_ARCH=native
> > > > make -j
> > > > export CHPL_COMM=gasnet
> > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > make clean
> > > > make -j
> > > >
> > > > I compile the following Chapel code:
> > > >
> > > > vectoradd.chpl:
> > > > ---------------
> > > > use Time;
> > > > use Random;
> > > > use BlockDist;
> > > >
> > > > config const n = 1024**3;
> > > >
> > > > // for single-locale
> > > > // const ProblemDomain : domain(1) = {0..#n};
> > > > // for multi-locale
> > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > > >
> > > > type float = real(32);
> > > >
> > > > proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > >          b : [ProblemDomain] float) {
> > > > forall i in ProblemDomain {
> > > > c[i] = a[i] + b[i];
> > > > }
> > > > }
> > > >
> > > > proc main() {
> > > > var c : [ProblemDomain] float;
> > > > var a : [ProblemDomain] float;
> > > > var b : [ProblemDomain] float;
> > > > var t : Timer;
> > > >
> > > > fillRandom(a, 0);
> > > > fillRandom(b, 42);
> > > >
> > > > t.start();
> > > > add(c, a, b);
> > > > t.stop();
> > > >
> > > > writeln("n: ", n);
> > > > writeln("Time: ", t.elapsed(), "s");
> > > > writeln("GFLOPS: ", n / t.elapsed() / 1e9);
> > > > }
> > > > ----
> > > >
> > > > I compile this for single-locale (using no domain maps; see the
> > > > comment above in the source) with:
> > > >
> > > > chpl -o vectoradd --fast vectoradd.chpl
> > > >
> > > > I run it as follows (dual quad-core node with 2 hardware threads per core):
> > > >
> > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > ./vectoradd
> > > >
> > > > And get as output:
> > > >
> > > > n: 1073741824
> > > > Time: 0.558806s
> > > > GFLOPS: 1.92149
> > > >
> > > > However, the performance for multi-locale is much worse:
> > > >
> > > > I compile this for multi-locale (with domain maps; see the
> > > > comment in the source) with:
> > > >
> > > > CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
> > > >   vectoradd.chpl
> > > >
> > > > I run it on the same type of node with:
> > > >
> > > > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
> > > >
> > > > export GASNET_PHYSMEM_MAX=1G
> > > > export GASNET_IBV_SPAWNER=ssh
> > > > export GASNET_SSH_SERVERS="$SSH_SERVERS"
> > > >
> > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > export CHPL_LAUNCHER=gasnetrun_ibv
> > > > export CHPL_COMM=gasnet
> > > > export CHPL_COMM_SUBSTRATE=ibv
> > > >
> > > > ./vectoradd -nl 1
> > > >
> > > > And get as output:
> > > >
> > > > n: 1073741824
> > > > Time: 8.65082s
> > > > GFLOPS: 0.12412
> > > >
> > > > I would understand a performance difference of, say, 10% because
> > > > of multi-locale execution, but not whole factors. Is this to be
> > > > expected from the current state of Chapel? This performance
> > > > difference is exemplary for basically all my programs, which are
> > > > also more realistic and use larger inputs. The performance is
> > > > strange, as there is no communication necessary (only one node)
> > > > and the program is using the same number of threads.
> > > >
> > > > Is there any way for me to investigate this, using profiling for
> > > > example?
> > > >
> > > > By the way, the program does scale well to multiple nodes (which
> > > > is not difficult given the baseline):
> > > >
> > > > 1 | 8.65s
> > > > 2 | 2.67s
> > > > 4 | 1.69s
> > > > 8 | 0.87s
> > > > 16 | 0.41s
> > > >
> > > > Thanks in advance for your input.
> > > >
> > > > Kind regards,
> > > >
> > > > Pieter Hijma
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
------------------------------------------------------------------------------
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users