Hi Ben,
Same setup, testing vectoradd with GCC 6.2.0: the single-locale version is
in directory 'datapar' and the multi-locale version in 'datapar-dist'.
As in the 1D-convolution case, the Makefiles are the same:
$ diff datapar/Makefile datapar-dist/Makefile
This means that I compile both programs with:
$ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
vectoradd.chpl
The job files are also the same:
$ diff datapar/das4.job datapar-dist/das4.job
The contents of das4.job (an SGE script) are basically the same as for the
1D-convolution case:
---
#!/bin/bash
#$ -l h_rt=0:15:00
#$ -N VECTORADD
#$ -cwd
. ~/.bashrc
SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_LAUNCHER=gasnetrun_ibv
export CHPL_RT_NUM_THREADS_PER_LOCALE=16
export GASNET_IBV_SPAWNER=ssh
export GASNET_PHYSMEM_MAX=1G
export GASNET_SSH_SERVERS="$SSH_SERVERS"
APP=./vectoradd
ARGS=$*
$APP $ARGS
---
The two source files differ only in their use of domain maps:
$ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
2a3
> use BlockDist;
7c8
< const ProblemDomain : domain(1) = {0..#n};
---
> const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
= {0..#n};
ERROR: 1
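
To make the change explicit: the only difference is the declaration of
ProblemDomain (plus the corresponding 'use BlockDist'). Reconstructed from
the diff above, the two variants look roughly like this (a minimal sketch,
not the full program; the complete vectoradd.chpl is quoted further down in
the thread):

---
use BlockDist;

config const n = 1024**3/2;

// datapar (single-locale): default rectangular domain
//const ProblemDomain : domain(1) = {0..#n};

// datapar-dist (multi-locale): Block-distributed domain
const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
  = {0..#n};

type float = real(32);
// arrays declared over ProblemDomain automatically follow its distribution
var a, b, c : [ProblemDomain] float;
---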
The output of 'datapar', the single-locale version without BlockDist:
addNoDomain n: 536870912
Time: 0.329722s
GFLOPS: 1.62825
addZip n: 536870912
Time: 0.328751s
GFLOPS: 1.63306
addForall n: 536870912
Time: 0.325768s
GFLOPS: 1.64802
addCollective n: 536870912
Time: 0.330918s
GFLOPS: 1.62237
The output of 'datapar-dist', the multi-locale version with BlockDist:
addNoDomain n: 536870912
Time: 0.373368s
GFLOPS: 1.43791
addZip n: 536870912
Time: 0.372561s
GFLOPS: 1.44103
addForall n: 536870912
Time: 2.66822s
GFLOPS: 0.201209
addCollective n: 536870912
Time: 0.36856s
GFLOPS: 1.45667
I guess the conclusion is that in this case too, the use of BlockDist has an
effect: a minor one overall, but a major one when indexing directly
(addForall).
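
For reference, the two loop styles behind that difference are roughly the
following (a minimal sketch modelled on addForall and addZip from the
vectoradd.chpl quoted below; the comments reflect my understanding of your
earlier explanation about per-index overhead, so please correct me if I have
that wrong):

---
use BlockDist;

config const n = 1024**3/2;

const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
  = {0..#n};

type float = real(32);
var a, b, c : [ProblemDomain] float;

// addForall-style: direct indexing. Each c[i], a[i], b[i] access goes
// through the distributed array, which (as I understand it) carries extra
// overhead because the index is not known to be local.
forall i in ProblemDomain do
  c[i] = a[i] + b[i];

// addZip-style: zippered iteration. The elements are yielded to the loop
// body directly, so no per-element index lookup is needed; this is the
// form that stays close to single-locale performance in my runs.
forall (ci, ai, bi) in zip(c, a, b) do
  ci = ai + bi;
---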
Kind regards,
Pieter Hijma
On 16/11/16 20:32, Ben Harshbarger wrote:
> Hi Pieter,
>
> Do you still see a problem with vectorAdd.chpl? I think 1D-convolution has a
> somewhat separate performance issue due to accessing with arbitrary indices
> (more overhead because we don't know if the index is local). If vectorAdd
> isn't performing well, then that could hurt 1D-convolution too.
>
> -Ben Harshbarger
>
> On 11/16/16, 3:33 AM, "Pieter Hijma" <[email protected]> wrote:
>
> Hi Ben,
>
> Good suggestion, I'm going to test with the 1D-convolution program and I
> use the Chapel version compiled with GCC 6.2.0. The single-locale
> version is in directory 'datapar' and the multi-locale version is in
> directory 'datapar-dist'.
>
> The Makefiles are now the same:
>
> $ diff datapar/Makefile datapar-dist/Makefile
>
> This means that I compile both programs with:
>
> $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o 1D-convolution \
> --fast 1D-convolution.chpl
>
> The job files are also the same:
>
> $ diff datapar/das4.job datapar-dist/das4.job
>
> The contents of das4.job (an SGE script):
>
> -----
> #!/bin/bash
> #$ -l h_rt=0:15:00
> #$ -N CONVOLUTION_1D
> #$ -cwd
>
> . ~/.bashrc
>
> SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
>
> export CHPL_COMM=gasnet
> export CHPL_COMM_SUBSTRATE=ibv
> export CHPL_LAUNCHER=gasnetrun_ibv
> export CHPL_RT_NUM_THREADS_PER_LOCALE=16
>
> export GASNET_IBV_SPAWNER=ssh
> export GASNET_PHYSMEM_MAX=1G
> export GASNET_SSH_SERVERS="$SSH_SERVERS"
>
> APP=./1D-convolution
> ARGS=$*
>
> $APP $ARGS
> ------
>
> The difference between the two source files is now basically that the
> single-locale version has only the default domain map, whereas the
> multi-locale has the BlockDist domain map:
>
> $ diff datapar/1D-convolution.chpl datapar-dist/1D-convolution.chpl
> 2a3
> > use BlockDist;
> 6c7
> < const ProblemDomain : domain(1) = {0..#n};
> ---
> > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
> = {0..#n};
> ERROR: 1
>
>
> The output of 'datapar', the original single-locale version without
> BlockDist:
>
> convolveIndices, n: 536870912
> Time: 0.319077s
> GFLOPS: 5.04772
>
> convolveZip, n: 536870912
> Time: 0.320788s
> GFLOPS: 5.0208
>
>
> The output of 'datapar-dist', the original multi-locale version with
> BlockDist:
>
> convolveIndices, n: 536870912
> Time: 3.1422s
> GFLOPS: 0.512575
>
> convolveZip, n: 536870912
> Time: 3.54989s
> GFLOPS: 0.453708
>
>
> I guess we can conclude that merely adding the BlockDist domain map to
> the ProblemDomain results in a factor of 10 slowdown.
>
> Kind regards,
>
> Pieter Hijma
>
> On 14/11/16 20:21, Ben Harshbarger wrote:
> > Hi Pieter,
> >
> > My next suggestion would be to try compiling and running the "single
> locale" variation with the same environment variables that you use for
> multilocale. I'm wondering if the use of IBV is impacting performance in some
> way. I don't see the performance issue on our internal ibv cluster, but it's
> worth checking.
> >
> > -Ben Harshbarger
> >
> > On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote:
> >
> > Hi Ben,
> >
> > Thanks for your help.
> >
> > On 07/11/16 18:59, Ben Harshbarger wrote:
> > > When CHPL_COMM is set to 'none', our compiler can avoid
> introducing some overhead that is necessary for multi-locale programs. You
> can force this overhead when CHPL_COMM == none by compiling with the flag
> "--no-local". If you compile your single-locale program with that flag, does
> the performance get worse?
> >
> > It makes some difference, but not much:
> >
> > chpl -o vectoradd --fast vectoradd.chpl
> >
> > addNoDomain n: 1073741824
> > Time: 0.57211s
> > GFLOPS: 1.87681
> >
> > addZip n: 1073741824
> > Time: 0.571799s
> > GFLOPS: 1.87783
> >
> > addForall n: 1073741824
> > Time: 0.571623s
> > GFLOPS: 1.87841
> >
> > addCollective n: 1073741824
> > Time: 0.571395s
> > GFLOPS: 1.87916
> >
> >
> > chpl -o vectoradd --fast --no-local vectoradd.chpl
> >
> > addNoDomain n: 1073741824
> > Time: 0.62087s
> > GFLOPS: 1.72941
> >
> > addZip n: 1073741824
> > Time: 0.619997s
> > GFLOPS: 1.73185
> >
> > addForall n: 1073741824
> > Time: 0.620645s
> > GFLOPS: 1.73004
> >
> > addCollective n: 1073741824
> > Time: 0.620254s
> > GFLOPS: 1.73113
> >
> >
> > > If that's the case, I'm not entirely sure what the next step
> would be. Do you have access to a newer version of GCC? The backend C
> compiler can matter when it comes to optimizing the multi-locale overhead.
> >
> > It is indeed an old one. We also have GCC 4.9.0, Intel 13.3, and I
> > compiled GCC 6.2.0 to check:
> >
> > * intel/compiler/64/13.3/2013.3.163
> >
> > I basically see the same behavior:
> >
> > single locale:
> >
> > addNoDomain n: 536870912
> > Time: 0.285186s
> > GFLOPS: 1.88253
> >
> > addZip n: 536870912
> > Time: 0.284819s
> > GFLOPS: 1.88495
> >
> > addForall n: 536870912
> > Time: 0.287904s
> > GFLOPS: 1.86476
> >
> > addCollective n: 536870912
> > Time: 0.284912s
> > GFLOPS: 1.88434
> >
> > multi-locale, one node:
> >
> > addNoDomain n: 536870912
> > Time: 3.24471s
> > GFLOPS: 0.16546
> >
> > addZip n: 536870912
> > Time: 3.01287s
> > GFLOPS: 0.178192
> >
> > addForall n: 536870912
> > Time: 7.23895s
> > GFLOPS: 0.0741642
> >
> > addCollective n: 536870912
> > Time: 2.59501s
> > GFLOPS: 0.206886
> >
> >
> > * GCC 4.9.0
> >
> > This is encouraging: the performance improves to within a factor of two
> > of the single-locale version, except for the explicit indices in the forall:
> >
> > single locale:
> >
> > addNoDomain n: 536870912
> > Time: 0.277222s
> > GFLOPS: 1.93661
> >
> > addZip n: 536870912
> > Time: 0.27566s
> > GFLOPS: 1.94758
> >
> > addForall n: 536870912
> > Time: 0.27609s
> > GFLOPS: 1.94455
> >
> > addCollective n: 536870912
> > Time: 0.275303s
> > GFLOPS: 1.95011
> >
> > multi-locale, single node:
> >
> > addNoDomain n: 536870912
> > Time: 0.492954s
> > GFLOPS: 1.08909
> >
> > addZip n: 536870912
> > Time: 0.493039s
> > GFLOPS: 1.0889
> >
> > addForall n: 536870912
> > Time: 2.85323s
> > GFLOPS: 0.188162
> >
> > addCollective n: 536870912
> > Time: 0.492135s
> > GFLOPS: 1.0909
> >
> >
> > * GCC 6.2.0
> >
> > The performance on multi-locale is now even better. Still very low for
> > explicit indices in the forall.
> >
> > single locale:
> >
> > addNoDomain n: 536870912
> > Time: 0.283272s
> > GFLOPS: 1.89525
> >
> > addZip n: 536870912
> > Time: 0.281942s
> > GFLOPS: 1.90419
> >
> > addForall n: 536870912
> > Time: 0.282291s
> > GFLOPS: 1.90184
> >
> > addCollective n: 536870912
> > Time: 0.281629s
> > GFLOPS: 1.90631
> >
> > Multi-locale, single node:
> >
> > addNoDomain n: 536870912
> > Time: 0.358012s
> > GFLOPS: 1.49959
> >
> > addZip n: 536870912
> > Time: 0.356696s
> > GFLOPS: 1.50512
> >
> > addForall n: 536870912
> > Time: 2.92173s
> > GFLOPS: 0.183751
> >
> > addCollective n: 536870912
> > Time: 0.343808s
> > GFLOPS: 1.56154
> >
> >
> >
> > Since this is encouraging, I also verified the performance of the
> > 1D-stencils:
> >
> > * GCC 4.4.7
> >
> > For reference, the old compiler that I used initially:
> >
> > single locale:
> >
> > convolveIndices, n: 536870912
> > Time: 0.82361s
> > GFLOPS: 1.95555
> >
> > convolveZip, n: 536870912
> > Time: 0.810028s
> > GFLOPS: 1.98834
> >
> > multi-locale, one node:
> >
> > convolveIndices, n: 536870912
> > Time: 4.25951s
> > GFLOPS: 0.378122
> >
> > convolveZip, n: 536870912
> > Time: 4.88046s
> > GFLOPS: 0.330012
> >
> > * intel/compiler/64/13.3/2013.3.163
> >
> > On this compiler the single-node performance is better than with the
> > previous compiler. However, the multi-locale, one node performance is about a
> > factor 3 slower than the previous compiler.
> >
> > single locale:
> >
> > convolveIndices, n: 536870912
> > Time: 0.554139s
> > GFLOPS: 2.90651
> >
> > convolveZip, n: 536870912
> > Time: 0.556653s
> > GFLOPS: 2.89339
> >
> >
> > multi-locale, one node:
> >
> > convolveIndices, n: 536870912
> > Time: 10.5368s
> > GFLOPS: 0.152856
> >
> > convolveZip, n: 536870912
> > Time: 12.7625s
> > GFLOPS: 0.126198
> >
> >
> > * GCC 4.9.0
> >
> > The performance of single locale is much better than with GCC 4.4.7;
> > however, it is still poor for the multi-locale, one node configuration,
> > although a bit better.
> >
> > single locale:
> >
> > convolveIndices, n: 536870912
> > Time: 0.207055s
> > GFLOPS: 7.77867
> >
> > convolveZip, n: 536870912
> > Time: 0.206783s
> > GFLOPS: 7.7889
> >
> > multi-locale, one node:
> >
> > convolveIndices, n: 536870912
> > Time: 3.20851s
> > GFLOPS: 0.501981
> >
> > convolveZip, n: 536870912
> > Time: 3.652s
> > GFLOPS: 0.441023
> >
> >
> > * GCC 6.2.0
> >
> > Strangely enough, the single-locale performance is a bit lower than with
> > the previous compiler, while the multi-locale, one node performance is
> > about the same as before.
> >
> > single-locale:
> >
> > convolveIndices, n: 536870912
> > Time: 0.263151s
> > GFLOPS: 6.12049
> >
> > convolveZip, n: 536870912
> > Time: 0.262234s
> > GFLOPS: 6.14189
> >
> > multi-locale, one node:
> >
> > convolveIndices, n: 536870912
> > Time: 3.12716s
> > GFLOPS: 0.515039
> >
> > convolveZip, n: 536870912
> > Time: 3.58663s
> > GFLOPS: 0.44906
> >
> >
> > The conclusion is that the compiler has indeed a large impact on the
> > multi-locale performance, but probably only in the simple cases such as
> > vector addition. With the stencil code, although it is not very
> > complicated, the performance falls back into the pattern that I came
> > across originally.
> >
> > However, perhaps this gives you an idea of the optimizations that impact
> > the performance? If we can't find a solution, I would at least like to
> > understand the lack of performance.
> >
> > I also checked the performance of the stencils by not using the
> > StencilDist but just the BlockDist and it makes no difference.
> >
> > > You may also want to consider setting CHPL_TARGET_ARCH to
> something else if you're compiling on a machine architecture different from
> the compute nodes. There's more information about CHPL_TARGET_ARCH here:
> > >
> > >
> http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
> >
> > The head-node and compute-nodes are all Intel Xeon Westmere's, so I
> > don't think that makes a difference. To be absolutely sure, I also
> > compiled Chapel and the applications on a compute node and indeed, the
> > performance is comparable to all measurements above.
> >
> > Kind regards,
> >
> > Pieter Hijma
> >
> >
> > > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:
> > >
> > > Dear Ben,
> > >
> > > Sorry for my late reactions. Unfortunately, for some reason,
> these
> > > emails are marked as spam even though I marked the list and
> your address
> > > as safe. I will make sure I check my spam folders
> meticulously from now on.
> > >
> > > On 28/10/16 23:34, Ben Harshbarger wrote:
> > > > Hi Pieter,
> > > >
> > > > Sorry that you're still having issues. I think we'll need
> some more information before going forward:
> > > >
> > > > 1) Could you send us the output of
> "$CHPL_HOME/util/printchplenv --anonymize" ? It's a script that displays the
> various CHPL_ environment variables. "--anonymize" strips the output of
> information you may prefer to keep private (machine info, paths).
> > >
> > > This would be the setup if running single-locale programs:
> > >
> > > $ printchplenv --anonymize
> > > CHPL_TARGET_PLATFORM: linux64
> > > CHPL_TARGET_COMPILER: gnu
> > > CHPL_TARGET_ARCH: native *
> > > CHPL_LOCALE_MODEL: flat
> > > CHPL_COMM: none
> > > CHPL_TASKS: qthreads
> > > CHPL_LAUNCHER: none
> > > CHPL_TIMERS: generic
> > > CHPL_UNWIND: none
> > > CHPL_MEM: jemalloc
> > > CHPL_MAKE: gmake
> > > CHPL_ATOMICS: intrinsics
> > > CHPL_GMP: gmp
> > > CHPL_HWLOC: hwloc
> > > CHPL_REGEXP: re2
> > > CHPL_WIDE_POINTERS: struct
> > > CHPL_AUX_FILESYS: none
> > >
> > > When I run multi-locale programs, I set the following
> environment variables:
> > >
> > > export CHPL_COMM=gasnet
> > > export CHPL_COMM_SUBSTRATE=ibv
> > >
> > > Then the Chapel environment would be:
> > >
> > > $ printchplenv --anonymize
> > > CHPL_TARGET_PLATFORM: linux64
> > > CHPL_TARGET_COMPILER: gnu
> > > CHPL_TARGET_ARCH: native *
> > > CHPL_LOCALE_MODEL: flat
> > > CHPL_COMM: gasnet *
> > > CHPL_COMM_SUBSTRATE: ibv *
> > > CHPL_GASNET_SEGMENT: large
> > > CHPL_TASKS: qthreads
> > > CHPL_LAUNCHER: gasnetrun_ibv
> > > CHPL_TIMERS: generic
> > > CHPL_UNWIND: none
> > > CHPL_MEM: jemalloc
> > > CHPL_MAKE: gmake
> > > CHPL_ATOMICS: intrinsics
> > > CHPL_NETWORK_ATOMICS: none
> > > CHPL_GMP: gmp
> > > CHPL_HWLOC: hwloc
> > > CHPL_REGEXP: re2
> > > CHPL_WIDE_POINTERS: struct
> > > CHPL_AUX_FILESYS: none
> > >
> > >
> > > > 2) What C compiler are you using?
> > >
> > > $ gcc --version
> > > gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
> > > Copyright (C) 2010 Free Software Foundation, Inc.
> > > This is free software; see the source for copying conditions.
> There is NO
> > > warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.
> > >
> > > > 3) Are you sure that the programs are being launched
> correctly? This might seem silly, but it's worth double-checking that the
> programs are actually running on the same hardware (not necessarily the same
> node though).
> > >
> > > I am completely certain that the single-locale program, the
> multi-locale
> > > program for one node, and the multi-locale for multiple nodes
> are
> > > running on the compute nodes. I'm not completely sure what
> you mean by
> > > "the same hardware". All compute nodes have the same
> hardware if that
> > > is what you mean.
> > >
> > > > I'd also like to clarify what you mean by "multi-locale
> compiled". Is the difference between the programs just the use of the Block
> domain map, or do you compile with different environment variables set?
> > >
> > > I compile different programs and I use different environment
> variables:
> > >
> > > The single-locale version vectoradd is located in the datapar
> directory,
> > > whereas the multi-locale version is located in the
> datapar-dist
> > > directory. What follows is the diff for the .chpl file:
> > >
> > > $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
> > > 8c8
> > > < const ProblemDomain : domain(1) = {0..#n};
> > > ---
> > > > const ProblemDomain : domain(1) dmapped Block(boundingBox
> = {0..#n})
> > > = {0..#n};
> > >
> > > The diff for the Makefile:
> > >
> > > $ diff datapar/Makefile datapar-dist/Makefile
> > > 2a3
> > > > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
> > > 8c9
> > > < $(CHPL) -o $@ $(FLAGS) $<
> > > ---
> > > > $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
> > > 11c12
> > > < rm -f $(APP)
> > > ---
> > > > rm -f $(APP) $(APP)_real
> > >
> > > Thanks for your help, and again my apologies for the delayed
> answers.
> > >
> > > Kind regards,
> > >
> > > Pieter Hijma
> > >
> > > >
> > > > -Ben Harshbarger
> > > >
> > > > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > > >
> > > > Hi Ben,
> > > >
> > > > Thank you for your fast reply and suggestions! I did
> some more tests
> > > > and also included stencil operations.
> > > >
> > > > First, the vector addition:
> > > >
> > > > vectoradd.chpl
> > > > --------------
> > > > use Time;
> > > > use Random;
> > > > use BlockDist;
> > > > //use VisualDebug;
> > > >
> > > > config const n = 1024**3/2;
> > > >
> > > > // for multi-locale
> > > > const ProblemDomain : domain(1) dmapped
> Block(boundingBox = {0..#n})
> > > > = {0..#n};
> > > > // for single-locale
> > > > const ProblemDomain : domain(1) = {0..#n};
> > > >
> > > > type float = real(32);
> > > >
> > > > proc addNoDomain(c : [] float, a : [] float, b : []
> float) {
> > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > ci = ai + bi;
> > > > }
> > > > }
> > > >
> > > > proc addZip(c : [ProblemDomain] float, a :
> [ProblemDomain] float,
> > > > b : [ProblemDomain] float) {
> > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > ci = ai + bi;
> > > > }
> > > > }
> > > >
> > > > proc addForall(c : [ProblemDomain] float, a :
> [ProblemDomain] float,
> > > > b : [ProblemDomain] float) {
> > > > //startVdebug("vdata");
> > > > forall i in ProblemDomain {
> > > > c[i] = a[i] + b[i];
> > > > }
> > > > //stopVdebug();
> > > > }
> > > >
> > > > proc addCollective(c : [ProblemDomain] float, a :
> [ProblemDomain] float,
> > > > b : [ProblemDomain] float) {
> > > > c = a + b;
> > > > }
> > > >
> > > > proc output(t : Timer, n, testName) {
> > > > t.stop();
> > > > writeln(testName, " n: ", n);
> > > > writeln("Time: ", t.elapsed(), "s");
> > > > writeln("GFLOPS: ", n / t.elapsed() / 1e9, "");
> > > > writeln();
> > > > t.clear();
> > > > }
> > > >
> > > > proc main() {
> > > > var c : [ProblemDomain] float;
> > > > var a : [ProblemDomain] float;
> > > > var b : [ProblemDomain] float;
> > > > var t : Timer;
> > > >
> > > > fillRandom(a, 0);
> > > > fillRandom(b, 42);
> > > >
> > > > t.start();
> > > > addNoDomain(c, a, b);
> > > > output(t, n, "addNoDomain");
> > > >
> > > > t.start();
> > > > addZip(c, a, b);
> > > > output(t, n, "addZip");
> > > >
> > > > t.start();
> > > > addForall(c, a, b);
> > > > output(t, n, "addForall");
> > > >
> > > > t.start();
> > > > addCollective(c, a, b);
> > > > output(t, n, "addCollective");
> > > > }
> > > > -----
> > > >
> > > > On a single locale I get as output:
> > > >
> > > > addNoDomain n: 536870912
> > > > Time: 0.27961s
> > > > GFLOPS: 1.92007
> > > >
> > > > addZip n: 536870912
> > > > Time: 0.278657s
> > > > GFLOPS: 1.92664
> > > >
> > > > addForall n: 536870912
> > > > Time: 0.278015s
> > > > GFLOPS: 1.93109
> > > >
> > > > addCollective n: 536870912
> > > > Time: 0.278379s
> > > > GFLOPS: 1.92856
> > > >
> > > > On multi-locale (-nl 1) I get as output:
> > > >
> > > > addNoDomain n: 536870912
> > > > Time: 2.16806s
> > > > GFLOPS: 0.247627
> > > >
> > > > addZip n: 536870912
> > > > Time: 2.17024s
> > > > GFLOPS: 0.247378
> > > >
> > > > addForall n: 536870912
> > > > Time: 4.78443s
> > > > GFLOPS: 0.112212
> > > >
> > > > addCollective n: 536870912
> > > > Time: 2.19838s
> > > > GFLOPS: 0.244212
> > > >
> > > > So, indeed, your suggestion improves it by more than a
> factor two, but
> > > > it is still close to a factor 8 slower than
> single-locale.
> > > >
> > > > I also used chplvis and verified that there are no gets
> and puts when
> > > > running multi-locale with more than one node. The
> profiling information
> > > > is clear, but not very helpful (to me):
> > > >
> > > > multi-locale (-nl 1):
> > > >
> > > > | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > > | 4.8777 | wrapon_fn_chpl35 | vectoradd.chpl:26 |
> > > >
> > > > single-locale:
> > > >
> > > > | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > >
> > > >
> > > >
> > > > For stencil operations, I used the following program:
> > > >
> > > > 1d-convolution.chpl
> > > > -------------------
> > > > use Time;
> > > > use Random;
> > > > use StencilDist;
> > > >
> > > > config const n = 1024**3/2;
> > > >
> > > > const ProblemDomain : domain(1) dmapped
> Stencil(boundingBox = {0..#n},
> > > > fluff =
> (1,))
> > > > = {0..#n};
> > > > const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
> > > >
> > > > proc convolveIndices(output : [ProblemDomain] real(32),
> > > > input : [ProblemDomain] real(32)) {
> > > > forall i in InnerDomain {
> > > > output[i] = ((input[i-1] + input[i] +
> input[i+1])/3:real(32));
> > > > }
> > > > }
> > > >
> > > > proc convolveZip(output : [ProblemDomain] real(32),
> > > > input : [ProblemDomain] real(32)) {
> > > > forall (im1, i, ip1) in
> zip(InnerDomain.translate(-1),
> > > > InnerDomain,
> > > > InnerDomain.translate(1))
> {
> > > > output[i] = ((input[im1] + input[i] +
> input[ip1])/3:real(32));
> > > > }
> > > > }
> > > >
> > > > proc print(t : Timer, n, s) {
> > > > t.stop();
> > > > writeln(s, ", n: ", n);
> > > > writeln("Time: ", t.elapsed(), "s");
> > > > writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
> > > > writeln();
> > > > t.clear();
> > > > }
> > > >
> > > > proc main() {
> > > > var input : [ProblemDomain] real(32);
> > > > var output : [ProblemDomain] real(32);
> > > > var t : Timer;
> > > >
> > > > fillRandom(input, 42);
> > > >
> > > > t.start();
> > > > convolveIndices(output, input);
> > > > print(t, n, "convolveIndices");
> > > >
> > > > t.start();
> > > > convolveZip(output, input);
> > > > print(t, n, "convolveZip");
> > > > }
> > > > ------
> > > >
> > > > Interestingly, in contrast to your earlier suggestion,
> the direct
> > > > indexing works a bit better in this program than the
> zipped version:
> > > >
> > > > Multi-locale (-nl 1):
> > > >
> > > > convolveIndices, n: 536870912
> > > > Time: 4.27148s
> > > > GFLOPS: 0.377062
> > > >
> > > > convolveZip, n: 536870912
> > > > Time: 4.87291s
> > > > GFLOPS: 0.330524
> > > >
> > > > Single-locale:
> > > >
> > > > convolveIndices, n: 536870912
> > > > Time: 0.548804s
> > > > GFLOPS: 2.93477
> > > >
> > > > convolveZip, n: 536870912
> > > > Time: 0.538754s
> > > > GFLOPS: 2.98951
> > > >
> > > >
> > > > Again, the multi-locale is about a factor 8 slower than
> single-locale.
> > > > By the way, the Stencil distribution is a bit faster
> than the Block
> > > > distribution.
> > > >
> > > > Thanks in advance for your input,
> > > >
> > > > Pieter
> > > >
> > > >
> > > >
> > > > On 24/10/16 19:20, Ben Harshbarger wrote:
> > > > > Hi Pieter,
> > > > >
> > > > > Thanks for providing the example, that's very helpful.
> > > > >
> > > > > Multi-locale performance in Chapel is not yet where
> we'd like it to be, but we've done a lot of work over the past few releases
> to get cases like yours performing well. It's surprising that using Block
> results in that much of a difference, but I think you would see better
> performance by iterating over the arrays directly:
> > > > >
> > > > > ```
> > > > > // replace the loop in the 'add' function with this:
> > > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > > ci = ai + bi;
> > > > > }
> > > > > ```
> > > > >
> > > > > Block-distributed arrays can leverage the
> fast-follower optimization to perform better when all arrays being iterated
> over share the same domain. You can also write that loop in a cleaner way by
> leveraging array promotion:
> > > > >
> > > > > ```
> > > > > // This is equivalent to the first loop
> > > > > c = a + b;
> > > > > ```
> > > > >
> > > > > However, when I tried the promoted variation on my
> machine I observed worse performance than the explicit forall-loop. It seems
> to be related to the way the arguments of the 'add' function are declared. If
> you replaced "[ProblemDomain] float" with "[] float", performance seems to
> improve. That surprised a couple of us on the development team, and I'll be
> looking at that some more today.
> > > > >
> > > > > If you're still seeing significantly worse
> performance with Block compared to the default rectangular domain, and the
> programs are launched in the same way, that would be odd. You could try
> profiling using chplvis. I agree though that there shouldn't be any
> communication in this program. You can find more information on chplvis here
> in the online 1.14 release documentation:
> > > > >
> > > > >
> http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
> > > > >
> > > > > I hope that rewriting the loops solves the problem,
> but let us know if it doesn't and we can continue investigating.
> > > > >
> > > > > -Ben Harshbarger
> > > > >
> > > > > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]>
> wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > > My apologies if this has already been asked
> before. I'm new to the list
> > > > > and couldn't find it in the archives.
> > > > >
> > > > > I experience bad performance when running the
> multi-locale compiled
> > > > > version on an InfiniBand-equipped cluster
> > > > > (http://cs.vu.nl/das4/clusters.shtml, VU-site),
> even with only one node.
> > > > > Below you find a minimal example that exhibits
> the same performance
> > > > > problems as all my programs:
> > > > >
> > > > > I compiled chapel-1.14.0 with the following steps:
> > > > >
> > > > > export CHPL_TARGET_ARCH=native
> > > > > make -j
> > > > > export CHPL_COMM=gasnet
> > > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > > make clean
> > > > > make -j
> > > > >
> > > > > I compile the following Chapel code:
> > > > >
> > > > > vectoradd.chpl:
> > > > > ---------------
> > > > > use Time;
> > > > > use Random;
> > > > > use BlockDist;
> > > > >
> > > > > config const n = 1024**3;
> > > > >
> > > > > // for single-locale
> > > > > // const ProblemDomain : domain(1) = {0..#n};
> > > > > // for multi-locale
> > > > > const ProblemDomain : domain(1) dmapped
> Block(boundingBox = {0..#n}) =
> > > > > {0..#n};
> > > > >
> > > > > type float = real(32);
> > > > >
> > > > > proc add(c : [ProblemDomain] float, a :
> [ProblemDomain] float,
> > > > > b : [ProblemDomain] float) {
> > > > > forall i in ProblemDomain {
> > > > > c[i] = a[i] + b[i];
> > > > > }
> > > > > }
> > > > >
> > > > > proc main() {
> > > > > var c : [ProblemDomain] float;
> > > > > var a : [ProblemDomain] float;
> > > > > var b : [ProblemDomain] float;
> > > > > var t : Timer;
> > > > >
> > > > > fillRandom(a, 0);
> > > > > fillRandom(b, 42);
> > > > >
> > > > > t.start();
> > > > > add(c, a, b);
> > > > > t.stop();
> > > > >
> > > > > writeln("n: ", n);
> > > > > writeln("Time: ", t.elapsed(), "s");
> > > > > writeln("GFLOPS: ", n / t.elapsed() / 1e9,
> "s");
> > > > > }
> > > > > ----
> > > > >
> > > > > I compile this for single-locale with (using no
> domain maps, see the
> > > > > comment above in the source):
> > > > >
> > > > > chpl -o vectoradd --fast vectoradd.chpl
> > > > >
> > > > > I run it with (dual quad core with 2 hardware
> threads):
> > > > >
> > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > > ./vectoradd
> > > > >
> > > > > And get as output:
> > > > >
> > > > > n: 1073741824
> > > > > Time: 0.558806s
> > > > > GFLOPS: 1.92149s
> > > > >
> > > > > However, the performance for multi-locale is much
> worse:
> > > > >
> > > > > I compile this for multi-locale with domain maps,
> see the comment in the
> > > > > source):
> > > > >
> > > > > CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o
> vectoradd --fast \
> > > > > vectoradd.chpl
> > > > >
> > > > > I run it on the same type of node with:
> > > > >
> > > > > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
> > > > >
> > > > > export GASNET_PHYSMEM_MAX=1G
> > > > > export GASNET_IBV_SPAWNER=ssh
> > > > > export GASNET_SSH_SERVERS="$SSH_SERVERS"
> > > > >
> > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > > export CHPL_LAUNCHER=gasnetrun_ibv
> > > > > export CHPL_COMM=gasnet
> > > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > >
> > > > > ./vectoradd -nl 1
> > > > >
> > > > > And get as output:
> > > > >
> > > > > n: 1073741824
> > > > > Time: 8.65082s
> > > > > GFLOPS: 0.12412s
> > > > >
> > > > > I would understand a performance difference of
> say 10% because of
> > > > > multi-locale execution, but not factors. Is this
> to be expected from
> > > > > the current state of Chapel? This performance
> difference is typical
> > > > > for basically all my programs that also are more
> realistic and use
> > > > > larger inputs. The performance is strange as
> there is no communication
> > > > > necessary (only one node) and the program is
> using the same amount of
> > > > > threads.
> > > > >
> > > > > Is there any way for me to investigate this using
> profiling for example?
> > > > >
> > > > > By the way, the program does scale well to
> multiple nodes (which is not
> > > > > difficult given the baseline):
> > > > >
> > > > > 1 | 8.65s
> > > > > 2 | 2.67s
> > > > > 4 | 1.69s
> > > > > 8 | 0.87s
> > > > > 16 | 0.41s
> > > > >
> > > > > Thanks in advance for your input.
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Pieter Hijma
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
------------------------------------------------------------------------------
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users