Hi Ben,
Good suggestion. I tested with the 1D-convolution program, using the
Chapel version compiled with GCC 6.2.0. The single-locale version is in
the directory 'datapar' and the multi-locale version is in the directory
'datapar-dist'.
The Makefiles are now the same:
$ diff datapar/Makefile datapar-dist/Makefile
This means that I compile both programs with:
$ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o 1D-convolution \
--fast 1D-convolution.chpl
The job files are also the same:
$ diff datapar/das4.job datapar-dist/das4.job
The contents of das4.job (an SGE script):
-----
#!/bin/bash
#$ -l h_rt=0:15:00
#$ -N CONVOLUTION_1D
#$ -cwd
. ~/.bashrc
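# Build the list of ssh targets for GASNet from the machine file SGE provides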
SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
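# Chapel/GASNet runtime configuration for the InfiniBand (ibv) conduit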
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_LAUNCHER=gasnetrun_ibv
export CHPL_RT_NUM_THREADS_PER_LOCALE=16
export GASNET_IBV_SPAWNER=ssh
export GASNET_PHYSMEM_MAX=1G
export GASNET_SSH_SERVERS="$SSH_SERVERS"
APP=./1D-convolution
ARGS=$*
$APP $ARGS
------
The difference between the two source files is now basically that the
single-locale version has only the default domain map, whereas the
multi-locale version adds the Block domain map (from the BlockDist
module):
$ diff datapar/1D-convolution.chpl datapar-dist/1D-convolution.chpl
2a3
> use BlockDist;
6c7
< const ProblemDomain : domain(1) = {0..#n};
---
> const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
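As an aside, the two source files could probably be collapsed into one by
selecting the domain map with a compile-time param. An untested sketch
('useBlockDist' is a name I made up; set it with '-suseBlockDist=true'):
-----
use BlockDist;
config const n = 1024**3/2;
config param useBlockDist = false;
// A param conditional is folded at compile time, so the two branches may
// produce domains with different domain-map types.
const ProblemDomain = if useBlockDist
                      then {0..#n} dmapped Block(boundingBox = {0..#n})
                      else {0..#n};
-----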
The output of 'datapar', the original single-locale version without
BlockDist:
convolveIndices, n: 536870912
Time: 0.319077s
GFLOPS: 5.04772
convolveZip, n: 536870912
Time: 0.320788s
GFLOPS: 5.0208
The output of 'datapar-dist', the original multi-locale version with
BlockDist:
convolveIndices, n: 536870912
Time: 3.1422s
GFLOPS: 0.512575
convolveZip, n: 536870912
Time: 3.54989s
GFLOPS: 0.453708
I guess we can conclude that the addition of the Block domain map to
ProblemDomain, by itself, results in a factor-of-10 slowdown.
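One experiment that might narrow this down further: wrapping the kernel in
a 'local' block, which asserts that no communication occurs (and halts at
run time if a remote access slips through, which should not happen with
-nl 1). An untested sketch against the BlockDist version:
-----
proc convolveLocal(output : [ProblemDomain] real(32),
                   input : [ProblemDomain] real(32)) {
  local {
    // With -nl 1 every element is local, so the assertion should hold.
    forall i in 1..n-2 {
      output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
    }
  }
}
-----
If that variant runs at single-locale speed, the slowdown would seem to be
wide-pointer/locality-check overhead rather than actual communication.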
Kind regards,
Pieter Hijma
On 14/11/16 20:21, Ben Harshbarger wrote:
> Hi Pieter,
>
> My next suggestion would be to try compiling and running the "single locale"
> variation with the same environment variables that you use for multilocale.
> I'm wondering if the use of IBV is impacting performance in some way. I don't
> see the performance issue on our internal ibv cluster, but it's worth
> checking.
>
> -Ben Harshbarger
>
> On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote:
>
> Hi Ben,
>
> Thanks for your help.
>
> On 07/11/16 18:59, Ben Harshbarger wrote:
> > When CHPL_COMM is set to 'none', our compiler can avoid introducing
> > some overhead that is necessary for multi-locale programs. You can
> > force this overhead when CHPL_COMM == none by compiling with the flag
> > "--no-local". If you compile your single-locale program with that
> > flag, does the performance get worse?
>
> It makes some difference, but not much:
>
> chpl -o vectoradd --fast vectoradd.chpl
>
> addNoDomain n: 1073741824
> Time: 0.57211s
> GFLOPS: 1.87681
>
> addZip n: 1073741824
> Time: 0.571799s
> GFLOPS: 1.87783
>
> addForall n: 1073741824
> Time: 0.571623s
> GFLOPS: 1.87841
>
> addCollective n: 1073741824
> Time: 0.571395s
> GFLOPS: 1.87916
>
>
> chpl -o vectoradd --fast --no-local vectoradd.chpl
>
> addNoDomain n: 1073741824
> Time: 0.62087s
> GFLOPS: 1.72941
>
> addZip n: 1073741824
> Time: 0.619997s
> GFLOPS: 1.73185
>
> addForall n: 1073741824
> Time: 0.620645s
> GFLOPS: 1.73004
>
> addCollective n: 1073741824
> Time: 0.620254s
> GFLOPS: 1.73113
>
>
> > If that's the case, I'm not entirely sure what the next step would
> > be. Do you have access to a newer version of GCC? The backend C
> > compiler can matter when it comes to optimizing the multi-locale
> > overhead.
>
> It is indeed an old one. We also have GCC 4.9.0, Intel 13.3, and I
> compiled GCC 6.2.0 to check:
>
> * intel/compiler/64/13.3/2013.3.163
>
> I basically see the same behavior:
>
> single locale:
>
> addNoDomain n: 536870912
> Time: 0.285186s
> GFLOPS: 1.88253
>
> addZip n: 536870912
> Time: 0.284819s
> GFLOPS: 1.88495
>
> addForall n: 536870912
> Time: 0.287904s
> GFLOPS: 1.86476
>
> addCollective n: 536870912
> Time: 0.284912s
> GFLOPS: 1.88434
>
> multi-locale, one node:
>
> addNoDomain n: 536870912
> Time: 3.24471s
> GFLOPS: 0.16546
>
> addZip n: 536870912
> Time: 3.01287s
> GFLOPS: 0.178192
>
> addForall n: 536870912
> Time: 7.23895s
> GFLOPS: 0.0741642
>
> addCollective n: 536870912
> Time: 2.59501s
> GFLOPS: 0.206886
>
>
> * GCC 4.9.0
>
> This is encouraging: the performance improves to within a factor of two
> of single-locale, except for the explicit indices in the forall:
>
> single locale:
>
> addNoDomain n: 536870912
> Time: 0.277222s
> GFLOPS: 1.93661
>
> addZip n: 536870912
> Time: 0.27566s
> GFLOPS: 1.94758
>
> addForall n: 536870912
> Time: 0.27609s
> GFLOPS: 1.94455
>
> addCollective n: 536870912
> Time: 0.275303s
> GFLOPS: 1.95011
>
> multi-locale, single node:
>
> addNoDomain n: 536870912
> Time: 0.492954s
> GFLOPS: 1.08909
>
> addZip n: 536870912
> Time: 0.493039s
> GFLOPS: 1.0889
>
> addForall n: 536870912
> Time: 2.85323s
> GFLOPS: 0.188162
>
> addCollective n: 536870912
> Time: 0.492135s
> GFLOPS: 1.0909
>
>
> * GCC 6.2.0
>
> The performance on multi-locale is now even better. Still very low for
> explicit indices in the forall.
>
> single locale:
>
> addNoDomain n: 536870912
> Time: 0.283272s
> GFLOPS: 1.89525
>
> addZip n: 536870912
> Time: 0.281942s
> GFLOPS: 1.90419
>
> addForall n: 536870912
> Time: 0.282291s
> GFLOPS: 1.90184
>
> addCollective n: 536870912
> Time: 0.281629s
> GFLOPS: 1.90631
>
> multi-locale, single node:
>
> addNoDomain n: 536870912
> Time: 0.358012s
> GFLOPS: 1.49959
>
> addZip n: 536870912
> Time: 0.356696s
> GFLOPS: 1.50512
>
> addForall n: 536870912
> Time: 2.92173s
> GFLOPS: 0.183751
>
> addCollective n: 536870912
> Time: 0.343808s
> GFLOPS: 1.56154
>
>
>
> Since this is encouraging, I also verified the performance of the
> 1D-stencils:
>
> * GCC 4.4.7
>
> For reference, the old compiler that I used initially:
>
> single locale:
>
> convolveIndices, n: 536870912
> Time: 0.82361s
> GFLOPS: 1.95555
>
> convolveZip, n: 536870912
> Time: 0.810028s
> GFLOPS: 1.98834
>
> multi-locale, one node:
>
> convolveIndices, n: 536870912
> Time: 4.25951s
> GFLOPS: 0.378122
>
> convolveZip, n: 536870912
> Time: 4.88046s
> GFLOPS: 0.330012
>
> * intel/compiler/64/13.3/2013.3.163
>
> With this compiler the single-locale performance is better than with
> the previous compiler. However, the multi-locale, one-node performance
> is about a factor of 3 slower than with the previous compiler.
>
> single locale:
>
> convolveIndices, n: 536870912
> Time: 0.554139s
> GFLOPS: 2.90651
>
> convolveZip, n: 536870912
> Time: 0.556653s
> GFLOPS: 2.89339
>
>
> multi-locale, one node:
>
> convolveIndices, n: 536870912
> Time: 10.5368s
> GFLOPS: 0.152856
>
> convolveZip, n: 536870912
> Time: 12.7625s
> GFLOPS: 0.126198
>
>
> * GCC 4.9.0
>
> The single-locale performance is much better than with GCC 4.4.7; the
> multi-locale, one-node configuration is still poor, although a bit
> better.
>
> single locale:
>
> convolveIndices, n: 536870912
> Time: 0.207055s
> GFLOPS: 7.77867
>
> convolveZip, n: 536870912
> Time: 0.206783s
> GFLOPS: 7.7889
>
> multi-locale, one node:
>
> convolveIndices, n: 536870912
> Time: 3.20851s
> GFLOPS: 0.501981
>
> convolveZip, n: 536870912
> Time: 3.652s
> GFLOPS: 0.441023
>
>
> * GCC 6.2.0
>
> Strangely enough, the single-locale performance is a bit lower than
> with the previous compiler, while the multi-locale, one-node
> performance is about the same.
>
> single-locale:
>
> convolveIndices, n: 536870912
> Time: 0.263151s
> GFLOPS: 6.12049
>
> convolveZip, n: 536870912
> Time: 0.262234s
> GFLOPS: 6.14189
>
> multi-locale, one node:
>
> convolveIndices, n: 536870912
> Time: 3.12716s
> GFLOPS: 0.515039
>
> convolveZip, n: 536870912
> Time: 3.58663s
> GFLOPS: 0.44906
>
>
> The conclusion is that the compiler indeed has a large impact on the
> multi-locale performance, but probably only in simple cases such as
> vector addition. With the stencil code, although it is not very
> complicated, the performance falls back into the pattern that I came
> across originally.
>
> However, perhaps this gives you an idea of which optimizations impact
> the performance? If we can't find a solution, I would at least like to
> understand the lack of performance.
>
> I also checked the performance of the stencils using BlockDist instead
> of StencilDist, and it makes no difference.
>
> > You may also want to consider setting CHPL_TARGET_ARCH to something
> > else if you're compiling on a machine architecture different from the
> > compute nodes. There's more information about CHPL_TARGET_ARCH here:
> >
> > http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
>
> The head node and compute nodes are all Intel Xeon Westmeres, so I
> don't think that makes a difference. To be absolutely sure, I also
> compiled Chapel and the applications on a compute node and indeed, the
> performance is comparable to all measurements above.
>
> Kind regards,
>
> Pieter Hijma
>
>
> > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:
> >
> > Dear Ben,
> >
> > Sorry for my late replies. Unfortunately, for some reason, these
> > emails are marked as spam even though I marked the list and your
> > address as safe. I will make sure I check my spam folders
> > meticulously from now on.
> >
> > On 28/10/16 23:34, Ben Harshbarger wrote:
> > > Hi Pieter,
> > >
> > > Sorry that you're still having issues. I think we'll need some
> > > more information before going forward:
> > >
> > > 1) Could you send us the output of "$CHPL_HOME/util/printchplenv
> > > --anonymize"? It's a script that displays the various CHPL_
> > > environment variables. "--anonymize" strips the output of
> > > information you may prefer to keep private (machine info, paths).
> >
> > This would be the setup if running single-locale programs:
> >
> > $ printchplenv --anonymize
> > CHPL_TARGET_PLATFORM: linux64
> > CHPL_TARGET_COMPILER: gnu
> > CHPL_TARGET_ARCH: native *
> > CHPL_LOCALE_MODEL: flat
> > CHPL_COMM: none
> > CHPL_TASKS: qthreads
> > CHPL_LAUNCHER: none
> > CHPL_TIMERS: generic
> > CHPL_UNWIND: none
> > CHPL_MEM: jemalloc
> > CHPL_MAKE: gmake
> > CHPL_ATOMICS: intrinsics
> > CHPL_GMP: gmp
> > CHPL_HWLOC: hwloc
> > CHPL_REGEXP: re2
> > CHPL_WIDE_POINTERS: struct
> > CHPL_AUX_FILESYS: none
> >
> > When I run multi-locale programs, I set the following environment
> > variables:
> >
> > export CHPL_COMM=gasnet
> > export CHPL_COMM_SUBSTRATE=ibv
> >
> > Then the Chapel environment would be:
> >
> > $ printchplenv --anonymize
> > CHPL_TARGET_PLATFORM: linux64
> > CHPL_TARGET_COMPILER: gnu
> > CHPL_TARGET_ARCH: native *
> > CHPL_LOCALE_MODEL: flat
> > CHPL_COMM: gasnet *
> > CHPL_COMM_SUBSTRATE: ibv *
> > CHPL_GASNET_SEGMENT: large
> > CHPL_TASKS: qthreads
> > CHPL_LAUNCHER: gasnetrun_ibv
> > CHPL_TIMERS: generic
> > CHPL_UNWIND: none
> > CHPL_MEM: jemalloc
> > CHPL_MAKE: gmake
> > CHPL_ATOMICS: intrinsics
> > CHPL_NETWORK_ATOMICS: none
> > CHPL_GMP: gmp
> > CHPL_HWLOC: hwloc
> > CHPL_REGEXP: re2
> > CHPL_WIDE_POINTERS: struct
> > CHPL_AUX_FILESYS: none
> >
> >
> > > 2) What C compiler are you using?
> >
> > $ gcc --version
> > gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
> > Copyright (C) 2010 Free Software Foundation, Inc.
> > This is free software; see the source for copying conditions. There
> > is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> > PARTICULAR PURPOSE.
> >
> > > 3) Are you sure that the programs are being launched correctly?
> > > This might seem silly, but it's worth double-checking that the
> > > programs are actually running on the same hardware (not necessarily
> > > the same node though).
> >
> > I am completely certain that the single-locale program, the
> > multi-locale program for one node, and the multi-locale program for
> > multiple nodes are all running on the compute nodes. I'm not
> > completely sure what you mean by "the same hardware". All compute
> > nodes have the same hardware, if that is what you mean.
> >
> > > I'd also like to clarify what you mean by "multi-locale compiled".
> > > Is the difference between the programs just the use of the Block
> > > domain map, or do you compile with different environment variables
> > > set?
> >
> > I compile different programs and I use different environment
> > variables:
> >
> > The single-locale version of vectoradd is located in the datapar
> > directory, whereas the multi-locale version is located in the
> > datapar-dist directory. What follows is the diff for the .chpl file:
> >
> > $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
> > 8c8
> > < const ProblemDomain : domain(1) = {0..#n};
> > ---
> > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> >
> > The diff for the Makefile:
> >
> > $ diff datapar/Makefile datapar-dist/Makefile
> > 2a3
> > > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
> > 8c9
> > < $(CHPL) -o $@ $(FLAGS) $<
> > ---
> > > $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
> > 11c12
> > < rm -f $(APP)
> > ---
> > > rm -f $(APP) $(APP)_real
> >
> > Thanks for your help, and again my apologies for the delayed
> > answers.
> >
> > Kind regards,
> >
> > Pieter Hijma
> >
> > >
> > > -Ben Harshbarger
> > >
> > > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > >
> > > Hi Ben,
> > >
> > > Thank you for your fast reply and suggestions! I did some more
> > > tests and also included stencil operations.
> > >
> > > First, the vector addition:
> > >
> > > vectoradd.chpl
> > > --------------
> > > use Time;
> > > use Random;
> > > use BlockDist;
> > > //use VisualDebug;
> > >
> > > config const n = 1024**3/2;
> > >
> > > // for multi-locale
> > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > > // for single-locale (comment out the declaration above and use this one)
> > > // const ProblemDomain : domain(1) = {0..#n};
> > >
> > > type float = real(32);
> > >
> > > proc addNoDomain(c : [] float, a : [] float, b : [] float) {
> > > forall (ci, ai, bi) in zip(c, a, b) {
> > > ci = ai + bi;
> > > }
> > > }
> > >
> > > proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > >             b : [ProblemDomain] float) {
> > > forall (ci, ai, bi) in zip(c, a, b) {
> > > ci = ai + bi;
> > > }
> > > }
> > >
> > > proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > >                b : [ProblemDomain] float) {
> > > //startVdebug("vdata");
> > > forall i in ProblemDomain {
> > > c[i] = a[i] + b[i];
> > > }
> > > //stopVdebug();
> > > }
> > >
> > > proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > >                    b : [ProblemDomain] float) {
> > > c = a + b;
> > > }
> > >
> > > proc output(t : Timer, n, testName) {
> > > t.stop();
> > > writeln(testName, " n: ", n);
> > > writeln("Time: ", t.elapsed(), "s");
> > > writeln("GFLOPS: ", n / t.elapsed() / 1e9);
> > > writeln();
> > > t.clear();
> > > }
> > >
> > > proc main() {
> > > var c : [ProblemDomain] float;
> > > var a : [ProblemDomain] float;
> > > var b : [ProblemDomain] float;
> > > var t : Timer;
> > >
> > > fillRandom(a, 0);
> > > fillRandom(b, 42);
> > >
> > > t.start();
> > > addNoDomain(c, a, b);
> > > output(t, n, "addNoDomain");
> > >
> > > t.start();
> > > addZip(c, a, b);
> > > output(t, n, "addZip");
> > >
> > > t.start();
> > > addForall(c, a, b);
> > > output(t, n, "addForall");
> > >
> > > t.start();
> > > addCollective(c, a, b);
> > > output(t, n, "addCollective");
> > > }
> > > -----
> > >
> > > On a single locale I get as output:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.27961s
> > > GFLOPS: 1.92007
> > >
> > > addZip n: 536870912
> > > Time: 0.278657s
> > > GFLOPS: 1.92664
> > >
> > > addForall n: 536870912
> > > Time: 0.278015s
> > > GFLOPS: 1.93109
> > >
> > > addCollective n: 536870912
> > > Time: 0.278379s
> > > GFLOPS: 1.92856
> > >
> > > On multi-locale (-nl 1) I get as output:
> > >
> > > addNoDomain n: 536870912
> > > Time: 2.16806s
> > > GFLOPS: 0.247627
> > >
> > > addZip n: 536870912
> > > Time: 2.17024s
> > > GFLOPS: 0.247378
> > >
> > > addForall n: 536870912
> > > Time: 4.78443s
> > > GFLOPS: 0.112212
> > >
> > > addCollective n: 536870912
> > > Time: 2.19838s
> > > GFLOPS: 0.244212
> > >
> > > So, indeed, your suggestion improves it by more than a factor of
> > > two, but it is still close to a factor of 8 slower than
> > > single-locale.
> > >
> > > I also used chplvis and verified that there are no gets and puts
> > > when running multi-locale with more than one node. The profiling
> > > information is clear, but not very helpful (to me):
> > >
> > > multi-locale (-nl 1):
> > >
> > > | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > | 4.8777 | wrapon_fn_chpl35 | vectoradd.chpl:26 |
> > >
> > > single-locale:
> > >
> > > | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > >
> > >
> > >
> > > For stencil operations, I used the following program:
> > >
> > > 1d-convolution.chpl
> > > -------------------
> > > use Time;
> > > use Random;
> > > use StencilDist;
> > >
> > > config const n = 1024**3/2;
> > >
> > > const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
> > >                                                 fluff = (1,)) = {0..#n};
> > > const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
> > >
> > > proc convolveIndices(output : [ProblemDomain] real(32),
> > > input : [ProblemDomain] real(32)) {
> > > forall i in InnerDomain {
> > > output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
> > > }
> > > }
> > >
> > > proc convolveZip(output : [ProblemDomain] real(32),
> > > input : [ProblemDomain] real(32)) {
> > > forall (im1, i, ip1) in zip(InnerDomain.translate(-1),
> > > InnerDomain,
> > > InnerDomain.translate(1)) {
> > > output[i] = ((input[im1] + input[i] + input[ip1])/3:real(32));
> > > }
> > > }
> > >
> > > proc print(t : Timer, n, s) {
> > > t.stop();
> > > writeln(s, ", n: ", n);
> > > writeln("Time: ", t.elapsed(), "s");
> > > writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
> > > writeln();
> > > t.clear();
> > > }
> > >
> > > proc main() {
> > > var input : [ProblemDomain] real(32);
> > > var output : [ProblemDomain] real(32);
> > > var t : Timer;
> > >
> > > fillRandom(input, 42);
> > >
> > > t.start();
> > > convolveIndices(output, input);
> > > print(t, n, "convolveIndices");
> > >
> > > t.start();
> > > convolveZip(output, input);
> > > print(t, n, "convolveZip");
> > > }
> > > ------
> > >
> > > Interestingly, in contrast to your earlier suggestion, the direct
> > > indexing works a bit better in this program than the zipped
> > > version:
> > >
> > > Multi-locale (-nl 1):
> > >
> > > convolveIndices, n: 536870912
> > > Time: 4.27148s
> > > GFLOPS: 0.377062
> > >
> > > convolveZip, n: 536870912
> > > Time: 4.87291s
> > > GFLOPS: 0.330524
> > >
> > > Single-locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.548804s
> > > GFLOPS: 2.93477
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.538754s
> > > GFLOPS: 2.98951
> > >
> > >
> > > Again, the multi-locale version is about a factor of 8 slower than
> > > single-locale. By the way, the Stencil distribution is a bit
> > > faster than the Block distribution.
> > >
> > > Thanks in advance for your input,
> > >
> > > Pieter
> > >
> > >
> > >
> > > On 24/10/16 19:20, Ben Harshbarger wrote:
> > > > Hi Pieter,
> > > >
> > > > Thanks for providing the example, that's very helpful.
> > > >
> > > > Multi-locale performance in Chapel is not yet where we'd like it
> > > > to be, but we've done a lot of work over the past few releases
> > > > to get cases like yours performing well. It's surprising that
> > > > using Block results in that much of a difference, but I think
> > > > you would see better performance by iterating over the arrays
> > > > directly:
> > > >
> > > > ```
> > > > // replace the loop in the 'add' function with this:
> > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > ci = ai + bi;
> > > > }
> > > > ```
> > > >
> > > > Block-distributed arrays can leverage the fast-follower
> > > > optimization to perform better when all arrays being iterated
> > > > over share the same domain. You can also write that loop in a
> > > > cleaner way by leveraging array promotion:
> > > >
> > > > ```
> > > > // This is equivalent to the first loop
> > > > c = a + b;
> > > > ```
> > > >
> > > > However, when I tried the promoted variation on my machine I
> > > > observed worse performance than the explicit forall-loop. It
> > > > seems to be related to the way the arguments of the 'add'
> > > > function are declared. If you replace "[ProblemDomain] float"
> > > > with "[] float", performance seems to improve. That surprised a
> > > > couple of us on the development team, and I'll be looking at
> > > > that some more today.
> > > >
> > > > If you're still seeing significantly worse performance with
> > > > Block compared to the default rectangular domain, and the
> > > > programs are launched in the same way, that would be odd. You
> > > > could try profiling using chplvis. I agree though that there
> > > > shouldn't be any communication in this program. You can find
> > > > more information on chplvis in the online 1.14 release
> > > > documentation:
> > > >
> > > > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
> > > >
> > > > I hope that rewriting the loops solves the problem, but let us
> > > > know if it doesn't, and we can continue investigating.
> > > >
> > > > -Ben Harshbarger
> > > >
> > > > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > > >
> > > > Dear all,
> > > >
> > > > My apologies if this has already been asked before. I'm new to
> > > > the list and couldn't find it in the archives.
> > > >
> > > > I experience bad performance when running the multi-locale
> > > > compiled version on an InfiniBand-equipped cluster
> > > > (http://cs.vu.nl/das4/clusters.shtml, VU site), even with only
> > > > one node. Below you find a minimal example that exhibits the
> > > > same performance problems as all my programs:
> > > >
> > > > I compiled chapel-1.14.0 with the following steps:
> > > >
> > > > export CHPL_TARGET_ARCH=native
> > > > make -j
> > > > export CHPL_COMM=gasnet
> > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > make clean
> > > > make -j
> > > >
> > > > I compile the following Chapel code:
> > > >
> > > > vectoradd.chpl:
> > > > ---------------
> > > > use Time;
> > > > use Random;
> > > > use BlockDist;
> > > >
> > > > config const n = 1024**3;
> > > >
> > > > // for single-locale
> > > > // const ProblemDomain : domain(1) = {0..#n};
> > > > // for multi-locale
> > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > > >
> > > > type float = real(32);
> > > >
> > > > proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > >          b : [ProblemDomain] float) {
> > > > forall i in ProblemDomain {
> > > > c[i] = a[i] + b[i];
> > > > }
> > > > }
> > > >
> > > > proc main() {
> > > > var c : [ProblemDomain] float;
> > > > var a : [ProblemDomain] float;
> > > > var b : [ProblemDomain] float;
> > > > var t : Timer;
> > > >
> > > > fillRandom(a, 0);
> > > > fillRandom(b, 42);
> > > >
> > > > t.start();
> > > > add(c, a, b);
> > > > t.stop();
> > > >
> > > > writeln("n: ", n);
> > > > writeln("Time: ", t.elapsed(), "s");
> > > > writeln("GFLOPS: ", n / t.elapsed() / 1e9);
> > > > }
> > > > ----
> > > >
> > > > I compile this for single-locale (using no domain maps; see the
> > > > comment above in the source) with:
> > > >
> > > > chpl -o vectoradd --fast vectoradd.chpl
> > > >
> > > > I run it as follows (dual quad-core node with 2 hardware threads per core):
> > > >
> > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > ./vectoradd
> > > >
> > > > And get as output:
> > > >
> > > > n: 1073741824
> > > > Time: 0.558806s
> > > > GFLOPS: 1.92149
> > > >
> > > > However, the performance for multi-locale is much worse:
> > > >
> > > > I compile this for multi-locale (with domain maps; see the
> > > > comment in the source) with:
> > > >
> > > > CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
> > > >   vectoradd.chpl
> > > >
> > > > I run it on the same type of node with:
> > > >
> > > > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
> > > >
> > > > export GASNET_PHYSMEM_MAX=1G
> > > > export GASNET_IBV_SPAWNER=ssh
> > > > export GASNET_SSH_SERVERS="$SSH_SERVERS"
> > > >
> > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > export CHPL_LAUNCHER=gasnetrun_ibv
> > > > export CHPL_COMM=gasnet
> > > > export CHPL_COMM_SUBSTRATE=ibv
> > > >
> > > > ./vectoradd -nl 1
> > > >
> > > > And get as output:
> > > >
> > > > n: 1073741824
> > > > Time: 8.65082s
> > > > GFLOPS: 0.12412
> > > >
> > > > I would understand a performance difference of, say, 10% because
> > > > of multi-locale execution, but not whole factors. Is this to be
> > > > expected from the current state of Chapel? This performance
> > > > difference is exemplary for basically all my programs, which are
> > > > also more realistic and use larger inputs. The performance is
> > > > strange, as there is no communication necessary (only one node)
> > > > and the program is using the same number of threads.
> > > >
> > > > Is there any way for me to investigate this, using profiling for
> > > > example?
> > > >
> > > > By the way, the program does scale well to multiple nodes (which
> > > > is not difficult given the baseline):
> > > >
> > > > 1 | 8.65s
> > > > 2 | 2.67s
> > > > 4 | 1.69s
> > > > 8 | 0.87s
> > > > 16 | 0.41s
> > > >
> > > > Thanks in advance for your input.
> > > >
> > > > Kind regards,
> > > >
> > > > Pieter Hijma
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
------------------------------------------------------------------------------
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users