Hi Ben,
Same setup, testing vectoradd with GCC 6.2.0: the single-locale version is
in directory 'datapar' and the multi-locale version in 'datapar-dist'.
As in the 1D-convolution case, the Makefiles are the same:
$ diff datapar/Makefile datapar-dist/Makefile
This means that I compile both programs with:
$ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
vectoradd.chpl
The job files are also the same:
$ diff datapar/das4.job datapar-dist/das4.job
The contents of das4.job (an SGE script) are basically the same as for the
1D-convolution case:
---
#!/bin/bash
#$ -l h_rt=0:15:00
#$ -N VECTORADD
#$ -cwd
. ~/.bashrc
SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_LAUNCHER=gasnetrun_ibv
export CHPL_RT_NUM_THREADS_PER_LOCALE=16
export GASNET_IBV_SPAWNER=ssh
export GASNET_PHYSMEM_MAX=1G
export GASNET_SSH_SERVERS="$SSH_SERVERS"
APP=./vectoradd
ARGS=$*
$APP $ARGS
---
The two source files differ only in their use of domain maps:
$ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
2a3
> use BlockDist;
7c8
< const ProblemDomain : domain(1) = {0..#n};
---
> const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
= {0..#n};
ERROR: 1
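
To make the change explicit: the only difference is the declaration of
ProblemDomain (plus the corresponding 'use BlockDist'). Reconstructed from
the diff above, the two variants look roughly like this (a minimal sketch,
not the full program; the complete vectoradd.chpl is quoted further down in
the thread):

---
use BlockDist;

config const n = 1024**3/2;

// datapar (single-locale): default rectangular domain
//const ProblemDomain : domain(1) = {0..#n};

// datapar-dist (multi-locale): Block-distributed domain
const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
  = {0..#n};

type float = real(32);
// arrays declared over ProblemDomain automatically follow its distribution
var a, b, c : [ProblemDomain] float;
---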
The output of 'datapar', the single-locale version without BlockDist:
addNoDomain n: 536870912
Time: 0.329722s
GFLOPS: 1.62825
addZip n: 536870912
Time: 0.328751s
GFLOPS: 1.63306
addForall n: 536870912
Time: 0.325768s
GFLOPS: 1.64802
addCollective n: 536870912
Time: 0.330918s
GFLOPS: 1.62237
The output of 'datapar-dist', the multi-locale version with BlockDist:
addNoDomain n: 536870912
Time: 0.373368s
GFLOPS: 1.43791
addZip n: 536870912
Time: 0.372561s
GFLOPS: 1.44103
addForall n: 536870912
Time: 2.66822s
GFLOPS: 0.201209
addCollective n: 536870912
Time: 0.36856s
GFLOPS: 1.45667
I guess the conclusion is that in this case too, the use of BlockDist has an
effect: a minor one overall, but a major one when indexing directly
(addForall).
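
For reference, the two loop styles behind that difference are roughly the
following (a minimal sketch modelled on addForall and addZip from the
vectoradd.chpl quoted below; the comments reflect my understanding of your
earlier explanation about per-index overhead, so please correct me if I have
that wrong):

---
use BlockDist;

config const n = 1024**3/2;

const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
  = {0..#n};

type float = real(32);
var a, b, c : [ProblemDomain] float;

// addForall-style: direct indexing. Each c[i], a[i], b[i] access goes
// through the distributed array, which (as I understand it) carries extra
// overhead because the index is not known to be local.
forall i in ProblemDomain do
  c[i] = a[i] + b[i];

// addZip-style: zippered iteration. The elements are yielded to the loop
// body directly, so no per-element index lookup is needed; this is the
// form that stays close to single-locale performance in my runs.
forall (ci, ai, bi) in zip(c, a, b) do
  ci = ai + bi;
---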
Kind regards,
Pieter Hijma
On 16/11/16 20:32, Ben Harshbarger wrote:
> Hi Pieter,
>
> Do you still see a problem with vectorAdd.chpl? I think 1D-convolution has a
> somewhat separate performance issue due to accessing with arbitrary indices
> (more overhead because we don't know if the index is local). If vectorAdd
> isn't performing well, then that could hurt 1D-convolution too.
>
> -Ben Harshbarger
>
> On 11/16/16, 3:33 AM, "Pieter Hijma" <[email protected]> wrote:
>
> Hi Ben,
>
> Good suggestion, I'm going to test with the 1D-convolution program and I
> use the Chapel version compiled with GCC 6.2.0. The single-locale
> version is in directory 'datapar' and the multi-locale version is in
> directory 'datapar-dist'.
>
> The Makefiles are now the same:
>
> $ diff datapar/Makefile datapar-dist/Makefile
>
> This means that I compile both programs with:
>
> $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o 1D-convolution \
> --fast 1D-convolution.chpl
>
> The job files are also the same:
>
> $ diff datapar/das4.job datapar-dist/das4.job
>
> The contents of das4.job (an SGE script):
>
> -----
> #!/bin/bash
> #$ -l h_rt=0:15:00
> #$ -N CONVOLUTION_1D
> #$ -cwd
>
> . ~/.bashrc
>
> SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
>
> export CHPL_COMM=gasnet
> export CHPL_COMM_SUBSTRATE=ibv
> export CHPL_LAUNCHER=gasnetrun_ibv
> export CHPL_RT_NUM_THREADS_PER_LOCALE=16
>
> export GASNET_IBV_SPAWNER=ssh
> export GASNET_PHYSMEM_MAX=1G
> export GASNET_SSH_SERVERS="$SSH_SERVERS"
>
> APP=./1D-convolution
> ARGS=$*
>
> $APP $ARGS
> ------
>
> The difference between the two source files is now basically that the
> single-locale version has only the default domain map, whereas the
> multi-locale has the BlockDist domain map:
>
> $ diff datapar/1D-convolution.chpl datapar-dist/1D-convolution.chpl
> 2a3
> > use BlockDist;
> 6c7
> < const ProblemDomain : domain(1) = {0..#n};
> ---
> > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
> = {0..#n};
> ERROR: 1
>
>
> The output of 'datapar', the original single-locale version without
> BlockDist:
>
> convolveIndices, n: 536870912
> Time: 0.319077s
> GFLOPS: 5.04772
>
> convolveZip, n: 536870912
> Time: 0.320788s
> GFLOPS: 5.0208
>
>
> The output of 'datapar-dist', the original multi-locale version with
> BlockDist:
>
> convolveIndices, n: 536870912
> Time: 3.1422s
> GFLOPS: 0.512575
>
> convolveZip, n: 536870912
> Time: 3.54989s
> GFLOPS: 0.453708
>
>
> I guess we can conclude that merely adding the BlockDist domain map to
> the ProblemDomain results in a factor of 10 slowdown.
>
> Kind regards,
>
> Pieter Hijma
>
> On 14/11/16 20:21, Ben Harshbarger wrote:
> > Hi Pieter,
> >
> > My next suggestion would be to try compiling and running the "single
> locale" variation with the same environment variables that you use for
> multilocale. I'm wondering if the use of IBV is impacting performance in some
> way. I don't see the performance issue on our internal ibv cluster, but it's
> worth checking.
> >
> > -Ben Harshbarger
> >
> > On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote:
> >
> > Hi Ben,
> >
> > Thanks for your help.
> >
> > On 07/11/16 18:59, Ben Harshbarger wrote:
> > > When CHPL_COMM is set to 'none', our compiler can avoid
> introducing some overhead that is necessary for multi-locale programs. You
> can force this overhead when CHPL_COMM == none by compiling with the flag
> "--no-local". If you compile your single-locale program with that flag, does
> the performance get worse?
> >
> > It makes some difference, but not much:
> >
> > chpl -o vectoradd --fast vectoradd.chpl
> >
> > addNoDomain n: 1073741824
> > Time: 0.57211s
> > GFLOPS: 1.87681
> >
> > addZip n: 1073741824
> > Time: 0.571799s
> > GFLOPS: 1.87783
> >
> > addForall n: 1073741824
> > Time: 0.571623s
> > GFLOPS: 1.87841
> >
> > addCollective n: 1073741824
> > Time: 0.571395s
> > GFLOPS: 1.87916
> >
> >
> > chpl -o vectoradd --fast --no-local vectoradd.chpl
> >
> > addNoDomain n: 1073741824
> > Time: 0.62087s
> > GFLOPS: 1.72941
> >
> > addZip n: 1073741824
> > Time: 0.619997s
> > GFLOPS: 1.73185
> >
> > addForall n: 1073741824
> > Time: 0.620645s
> > GFLOPS: 1.73004
> >
> > addCollective n: 1073741824
> > Time: 0.620254s
> > GFLOPS: 1.73113
> >
> >
> > > If that's the case, I'm not entirely sure what the next step
> would be. Do you have access to a newer version of GCC? The backend C
> compiler can matter when it comes to optimizing the multi-locale overhead.
> >
> > It is indeed an old one. We also have GCC 4.9.0, Intel 13.3, and I
> > compiled GCC 6.2.0 to check:
> >
> > * intel/compiler/64/13.3/2013.3.163
> >
> > I basically see the same behavior:
> >
> > single locale:
> >
> > addNoDomain n: 536870912
> > Time: 0.285186s
> > GFLOPS: 1.88253
> >
> > addZip n: 536870912
> > Time: 0.284819s
> > GFLOPS: 1.88495
> >
> > addForall n: 536870912
> > Time: 0.287904s
> > GFLOPS: 1.86476
> >
> > addCollective n: 536870912
> > Time: 0.284912s
> > GFLOPS: 1.88434
> >
> > multi-locale, one node:
> >
> > addNoDomain n: 536870912
> > Time: 3.24471s
> > GFLOPS: 0.16546
> >
> > addZip n: 536870912
> > Time: 3.01287s
> > GFLOPS: 0.178192
> >
> > addForall n: 536870912
> > Time: 7.23895s
> > GFLOPS: 0.0741642
> >
> > addCollective n: 536870912
> > Time: 2.59501s
> > GFLOPS: 0.206886
> >
> >
> > * GCC 4.9.0
> >
> > This is encouraging: the performance improves to within a factor of two
> > of the single-locale version, except for the explicit indices in the forall:
> >
> > single locale:
> >
> > addNoDomain n: 536870912
> > Time: 0.277222s
> > GFLOPS: 1.93661
> >
> > addZip n: 536870912
> > Time: 0.27566s
> > GFLOPS: 1.94758
> >
> > addForall n: 536870912
> > Time: 0.27609s
> > GFLOPS: 1.94455
> >
> > addCollective n: 536870912
> > Time: 0.275303s
> > GFLOPS: 1.95011
> >
> > multi-locale, single node:
> >
> > addNoDomain n: 536870912
> > Time: 0.492954s
> > GFLOPS: 1.08909
> >
> > addZip n: 536870912
> > Time: 0.493039s
> > GFLOPS: 1.0889
> >
> > addForall n: 536870912
> > Time: 2.85323s
> > GFLOPS: 0.188162
> >
> > addCollective n: 536870912
> > Time: 0.492135s
> > GFLOPS: 1.0909
> >
> >
> > * GCC 6.2.0
> >
> > The performance on multi-locale is now even better. Still very low for
> > explicit indices in the forall.
> >
> > single locale:
> >
> > addNoDomain n: 536870912
> > Time: 0.283272s
> > GFLOPS: 1.89525
> >
> > addZip n: 536870912
> > Time: 0.281942s
> > GFLOPS: 1.90419
> >
> > addForall n: 536870912
> > Time: 0.282291s
> > GFLOPS: 1.90184
> >
> > addCollective n: 536870912
> > Time: 0.281629s
> > GFLOPS: 1.90631
> >
> > Multi-locale, single node:
> >
> > addNoDomain n: 536870912
> > Time: 0.358012s
> > GFLOPS: 1.49959
> >
> > addZip n: 536870912
> > Time: 0.356696s
> > GFLOPS: 1.50512
> >
> > addForall n: 536870912
> > Time: 2.92173s
> > GFLOPS: 0.183751
> >
> > addCollective n: 536870912
> > Time: 0.343808s
> > GFLOPS: 1.56154
> >
> >
> >
> > Since this is encouraging, I also verified the performance of the
> > 1D-stencils:
> >
> > * GCC 4.4.7
> >
> > For reference, the old compiler that I used initially:
> >
> > single locale:
> >
> > convolveIndices, n: 536870912
> > Time: 0.82361s
> > GFLOPS: 1.95555
> >
> > convolveZip, n: 536870912
> > Time: 0.810028s
> > GFLOPS: 1.98834
> >
> > multi-locale, one node:
> >
> > convolveIndices, n: 536870912
> > Time: 4.25951s
> > GFLOPS: 0.378122
> >
> > convolveZip, n: 536870912
> > Time: 4.88046s
> > GFLOPS: 0.330012
> >
> > * intel/compiler/64/13.3/2013.3.163
> >
> > On this compiler the single-node performance is better than with the
> > previous compiler. However, the multi-locale, one node performance is about a
> > factor 3 slower than the previous compiler.
> >
> > single locale:
> >
> > convolveIndices, n: 536870912
> > Time: 0.554139s
> > GFLOPS: 2.90651
> >
> > convolveZip, n: 536870912
> > Time: 0.556653s
> > GFLOPS: 2.89339
> >
> >
> > multi-locale, one node:
> >
> > convolveIndices, n: 536870912
> > Time: 10.5368s
> > GFLOPS: 0.152856
> >
> > convolveZip, n: 536870912
> > Time: 12.7625s
> > GFLOPS: 0.126198
> >
> >
> > * GCC 4.9.0
> >
> > The performance of single locale is much better than with GCC 4.4.7;
> > however, it is still poor for the multi-locale, one node configuration,
> > although a bit better.
> >
> > single locale:
> >
> > convolveIndices, n: 536870912
> > Time: 0.207055s
> > GFLOPS: 7.77867
> >
> > convolveZip, n: 536870912
> > Time: 0.206783s
> > GFLOPS: 7.7889
> >
> > multi-locale, one node:
> >
> > convolveIndices, n: 536870912
> > Time: 3.20851s
> > GFLOPS: 0.501981
> >
> > convolveZip, n: 536870912
> > Time: 3.652s
> > GFLOPS: 0.441023
> >
> >
> > * GCC 6.2.0
> >
> > Strangely enough, the single-locale performance is a bit lower than with
> > the previous compiler, while the multi-locale, one node performance is
> > about the same as before.
> >
> > single-locale:
> >
> > convolveIndices, n: 536870912
> > Time: 0.263151s
> > GFLOPS: 6.12049
> >
> > convolveZip, n: 536870912
> > Time: 0.262234s
> > GFLOPS: 6.14189
> >
> > multi-locale, one node:
> >
> > convolveIndices, n: 536870912
> > Time: 3.12716s
> > GFLOPS: 0.515039
> >
> > convolveZip, n: 536870912
> > Time: 3.58663s
> > GFLOPS: 0.44906
> >
> >
> > The conclusion is that the compiler has indeed a large impact on the
> > multi-locale performance, but probably only in the simple cases such as
> > vector addition. With the stencil code, although it is not very
> > complicated, the performance falls back into the pattern that I came
> > across originally.
> >
> > However, perhaps this gives you an idea of the optimizations that impact
> > the performance? If we can't find a solution, I would at least like to
> > understand the lack of performance.
> >
> > I also checked the performance of the stencils by not using the
> > StencilDist but just the BlockDist and it makes no difference.
> >
> > > You may also want to consider setting CHPL_TARGET_ARCH to
> something else if you're compiling on a machine architecture different from
> the compute nodes. There's more information about CHPL_TARGET_ARCH here:
> > >
> > >
> http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
> >
> > The head-node and compute-nodes are all Intel Xeon Westmere's, so I
> > don't think that makes a difference. To be absolutely sure, I also
> > compiled Chapel and the applications on a compute node and indeed, the
> > performance is comparable to all measurements above.
> >
> > Kind regards,
> >
> > Pieter Hijma
> >
> >
> > > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:
> > >
> > > Dear Ben,
> > >
> > > Sorry for my late reactions. Unfortunately, for some reason,
> these
> > > emails are marked as spam even though I marked the list and
> your address
> > > as safe. I will make sure I check my spam folders
> meticulously from now on.
> > >
> > > On 28/10/16 23:34, Ben Harshbarger wrote:
> > > > Hi Pieter,
> > > >
> > > > Sorry that you're still having issues. I think we'll need
> some more information before going forward:
> > > >
> > > > 1) Could you send us the output of
> "$CHPL_HOME/util/printchplenv --anonymize" ? It's a script that displays the
> various CHPL_ environment variables. "--anonymize" strips the output of
> information you may prefer to keep private (machine info, paths).
> > >
> > > This would be the setup if running single-locale programs:
> > >
> > > $ printchplenv --anonymize
> > > CHPL_TARGET_PLATFORM: linux64
> > > CHPL_TARGET_COMPILER: gnu
> > > CHPL_TARGET_ARCH: native *
> > > CHPL_LOCALE_MODEL: flat
> > > CHPL_COMM: none
> > > CHPL_TASKS: qthreads
> > > CHPL_LAUNCHER: none
> > > CHPL_TIMERS: generic
> > > CHPL_UNWIND: none
> > > CHPL_MEM: jemalloc
> > > CHPL_MAKE: gmake
> > > CHPL_ATOMICS: intrinsics
> > > CHPL_GMP: gmp
> > > CHPL_HWLOC: hwloc
> > > CHPL_REGEXP: re2
> > > CHPL_WIDE_POINTERS: struct
> > > CHPL_AUX_FILESYS: none
> > >
> > > When I run multi-locale programs, I set the following
> environment variables:
> > >
> > > export CHPL_COMM=gasnet
> > > export CHPL_COMM_SUBSTRATE=ibv
> > >
> > > Then the Chapel environment would be:
> > >
> > > $ printchplenv --anonymize
> > > CHPL_TARGET_PLATFORM: linux64
> > > CHPL_TARGET_COMPILER: gnu
> > > CHPL_TARGET_ARCH: native *
> > > CHPL_LOCALE_MODEL: flat
> > > CHPL_COMM: gasnet *
> > > CHPL_COMM_SUBSTRATE: ibv *
> > > CHPL_GASNET_SEGMENT: large
> > > CHPL_TASKS: qthreads
> > > CHPL_LAUNCHER: gasnetrun_ibv
> > > CHPL_TIMERS: generic
> > > CHPL_UNWIND: none
> > > CHPL_MEM: jemalloc
> > > CHPL_MAKE: gmake
> > > CHPL_ATOMICS: intrinsics
> > > CHPL_NETWORK_ATOMICS: none
> > > CHPL_GMP: gmp
> > > CHPL_HWLOC: hwloc
> > > CHPL_REGEXP: re2
> > > CHPL_WIDE_POINTERS: struct
> > > CHPL_AUX_FILESYS: none
> > >
> > >
> > > > 2) What C compiler are you using?
> > >
> > > $ gcc --version
> > > gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
> > > Copyright (C) 2010 Free Software Foundation, Inc.
> > > This is free software; see the source for copying conditions.
> There is NO
> > > warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.
> > >
> > > > 3) Are you sure that the programs are being launched
> correctly? This might seem silly, but it's worth double-checking that the
> programs are actually running on the same hardware (not necessarily the same
> node though).
> > >
> > > I am completely certain that the single-locale program, the
> multi-locale
> > > program for one node, and the multi-locale for multiple nodes
> are
> > > running on the compute nodes. I'm not completely sure what
> you mean by
> > > "the same hardware". All compute nodes have the same
> hardware if that
> > > is what you mean.
> > >
> > > > I'd also like to clarify what you mean by "multi-locale
> compiled". Is the difference between the programs just the use of the Block
> domain map, or do you compile with different environment variables set?
> > >
> > > I compile different programs and I use different environment
> variables:
> > >
> > > The single-locale version vectoradd is located in the datapar
> directory,
> > > whereas the multi-locale version is located in the
> datapar-dist
> > > directory. What follows is the diff for the .chpl file:
> > >
> > > $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
> > > 8c8
> > > < const ProblemDomain : domain(1) = {0..#n};
> > > ---
> > > > const ProblemDomain : domain(1) dmapped Block(boundingBox
> = {0..#n})
> > > = {0..#n};
> > >
> > > The diff for the Makefile:
> > >
> > > $ diff datapar/Makefile datapar-dist/Makefile
> > > 2a3
> > > > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
> > > 8c9
> > > < $(CHPL) -o $@ $(FLAGS) $<
> > > ---
> > > > $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
> > > 11c12
> > > < rm -f $(APP)
> > > ---
> > > > rm -f $(APP) $(APP)_real
> > >
> > > Thanks for your help, and again my apologies for the delayed
> answers.
> > >
> > > Kind regards,
> > >
> > > Pieter Hijma
> > >
> > > >
> > > > -Ben Harshbarger
> > > >
> > > > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > > >
> > > > Hi Ben,
> > > >
> > > > Thank you for your fast reply and suggestions! I did
> some more tests
> > > > and also included stencil operations.
> > > >
> > > > First, the vector addition:
> > > >
> > > > vectoradd.chpl
> > > > --------------
> > > > use Time;
> > > > use Random;
> > > > use BlockDist;
> > > > //use VisualDebug;
> > > >
> > > > config const n = 1024**3/2;
> > > >
> > > > // for multi-locale
> > > > const ProblemDomain : domain(1) dmapped
> Block(boundingBox = {0..#n})
> > > > = {0..#n};
> > > > // for single-locale
> > > > const ProblemDomain : domain(1) = {0..#n};
> > > >
> > > > type float = real(32);
> > > >
> > > > proc addNoDomain(c : [] float, a : [] float, b : []
> float) {
> > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > ci = ai + bi;
> > > > }
> > > > }
> > > >
> > > > proc addZip(c : [ProblemDomain] float, a :
> [ProblemDomain] float,
> > > > b : [ProblemDomain] float) {
> > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > ci = ai + bi;
> > > > }
> > > > }
> > > >
> > > > proc addForall(c : [ProblemDomain] float, a :
> [ProblemDomain] float,
> > > > b : [ProblemDomain] float) {
> > > > //startVdebug("vdata");
> > > > forall i in ProblemDomain {
> > > > c[i] = a[i] + b[i];
> > > > }
> > > > //stopVdebug();
> > > > }
> > > >
> > > > proc addCollective(c : [ProblemDomain] float, a :
> [ProblemDomain] float,
> > > > b : [ProblemDomain] float) {
> > > > c = a + b;
> > > > }
> > > >
> > > > proc output(t : Timer, n, testName) {
> > > > t.stop();
> > > > writeln(testName, " n: ", n);
> > > > writeln("Time: ", t.elapsed(), "s");
> > > > writeln("GFLOPS: ", n / t.elapsed() / 1e9, "");
> > > > writeln();
> > > > t.clear();
> > > > }
> > > >
> > > > proc main() {
> > > > var c : [ProblemDomain] float;
> > > > var a : [ProblemDomain] float;
> > > > var b : [ProblemDomain] float;
> > > > var t : Timer;
> > > >
> > > > fillRandom(a, 0);
> > > > fillRandom(b, 42);
> > > >
> > > > t.start();
> > > > addNoDomain(c, a, b);
> > > > output(t, n, "addNoDomain");
> > > >
> > > > t.start();
> > > > addZip(c, a, b);
> > > > output(t, n, "addZip");
> > > >
> > > > t.start();
> > > > addForall(c, a, b);
> > > > output(t, n, "addForall");
> > > >
> > > > t.start();
> > > > addCollective(c, a, b);
> > > > output(t, n, "addCollective");
> > > > }
> > > > -----
> > > >
> > > > On a single locale I get as output:
> > > >
> > > > addNoDomain n: 536870912
> > > > Time: 0.27961s
> > > > GFLOPS: 1.92007
> > > >
> > > > addZip n: 536870912
> > > > Time: 0.278657s
> > > > GFLOPS: 1.92664
> > > >
> > > > addForall n: 536870912
> > > > Time: 0.278015s
> > > > GFLOPS: 1.93109
> > > >
> > > > addCollective n: 536870912
> > > > Time: 0.278379s
> > > > GFLOPS: 1.92856
> > > >
> > > > On multi-locale (-nl 1) I get as output:
> > > >
> > > > addNoDomain n: 536870912
> > > > Time: 2.16806s
> > > > GFLOPS: 0.247627
> > > >
> > > > addZip n: 536870912
> > > > Time: 2.17024s
> > > > GFLOPS: 0.247378
> > > >
> > > > addForall n: 536870912
> > > > Time: 4.78443s
> > > > GFLOPS: 0.112212
> > > >
> > > > addCollective n: 536870912
> > > > Time: 2.19838s
> > > > GFLOPS: 0.244212
> > > >
> > > > So, indeed, your suggestion improves it by more than a
> factor two, but
> > > > it is still close to a factor 8 slower than
> single-locale.
> > > >
> > > > I also used chplvis and verified that there are no gets
> and puts when
> > > > running multi-locale with more than one node. The
> profiling information
> > > > is clear, but not very helpful (to me):
> > > >
> > > > multi-locale (-nl 1):
> > > >
> > > > | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > > | 4.8777 | wrapon_fn_chpl35 | vectoradd.chpl:26 |
> > > >
> > > > single-locale:
> > > >
> > > > | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > >
> > > >
> > > >
> > > > For stencil operations, I used the following program:
> > > >
> > > > 1d-convolution.chpl
> > > > -------------------
> > > > use Time;
> > > > use Random;
> > > > use StencilDist;
> > > >
> > > > config const n = 1024**3/2;
> > > >
> > > > const ProblemDomain : domain(1) dmapped
> Stencil(boundingBox = {0..#n},
> > > > fluff =
> (1,))
> > > > = {0..#n};
> > > > const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
> > > >
> > > > proc convolveIndices(output : [ProblemDomain] real(32),
> > > > input : [ProblemDomain] real(32)) {
> > > > forall i in InnerDomain {
> > > > output[i] = ((input[i-1] + input[i] +
> input[i+1])/3:real(32));
> > > > }
> > > > }
> > > >
> > > > proc convolveZip(output : [ProblemDomain] real(32),
> > > > input : [ProblemDomain] real(32)) {
> > > > forall (im1, i, ip1) in
> zip(InnerDomain.translate(-1),
> > > > InnerDomain,
> > > > InnerDomain.translate(1))
> {
> > > > output[i] = ((input[im1] + input[i] +
> input[ip1])/3:real(32));
> > > > }
> > > > }
> > > >
> > > > proc print(t : Timer, n, s) {
> > > > t.stop();
> > > > writeln(s, ", n: ", n);
> > > > writeln("Time: ", t.elapsed(), "s");
> > > > writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
> > > > writeln();
> > > > t.clear();
> > > > }
> > > >
> > > > proc main() {
> > > > var input : [ProblemDomain] real(32);
> > > > var output : [ProblemDomain] real(32);
> > > > var t : Timer;
> > > >
> > > > fillRandom(input, 42);
> > > >
> > > > t.start();
> > > > convolveIndices(output, input);
> > > > print(t, n, "convolveIndices");
> > > >
> > > > t.start();
> > > > convolveZip(output, input);
> > > > print(t, n, "convolveZip");
> > > > }
> > > > ------
> > > >
> > > > Interestingly, in contrast to your earlier suggestion,
> the direct
> > > > indexing works a bit better in this program than the
> zipped version:
> > > >
> > > > Multi-locale (-nl 1):
> > > >
> > > > convolveIndices, n: 536870912
> > > > Time: 4.27148s
> > > > GFLOPS: 0.377062
> > > >
> > > > convolveZip, n: 536870912
> > > > Time: 4.87291s
> > > > GFLOPS: 0.330524
> > > >
> > > > Single-locale:
> > > >
> > > > convolveIndices, n: 536870912
> > > > Time: 0.548804s
> > > > GFLOPS: 2.93477
> > > >
> > > > convolveZip, n: 536870912
> > > > Time: 0.538754s
> > > > GFLOPS: 2.98951
> > > >
> > > >
> > > > Again, the multi-locale is about a factor 8 slower than
> single-locale.
> > > > By the way, the Stencil distribution is a bit faster
> than the Block
> > > > distribution.
> > > >
> > > > Thanks in advance for your input,
> > > >
> > > > Pieter
> > > >
> > > >
> > > >
> > > > On 24/10/16 19:20, Ben Harshbarger wrote:
> > > > > Hi Pieter,
> > > > >
> > > > > Thanks for providing the example, that's very helpful.
> > > > >
> > > > > Multi-locale performance in Chapel is not yet where
> we'd like it to be, but we've done a lot of work over the past few releases
> to get cases like yours performing well. It's surprising that using Block
> results in that much of a difference, but I think you would see better
> performance by iterating over the arrays directly:
> > > > >
> > > > > ```
> > > > > // replace the loop in the 'add' function with this:
> > > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > > ci = ai + bi;
> > > > > }
> > > > > ```
> > > > >
> > > > > Block-distributed arrays can leverage the
> fast-follower optimization to perform better when all arrays being iterated
> over share the same domain. You can also write that loop in a cleaner way by
> leveraging array promotion:
> > > > >
> > > > > ```
> > > > > // This is equivalent to the first loop
> > > > > c = a + b;
> > > > > ```
> > > > >
> > > > > However, when I tried the promoted variation on my
> machine I observed worse performance than the explicit forall-loop. It seems
> to be related to the way the arguments of the 'add' function are declared. If
> you replaced "[ProblemDomain] float" with "[] float", performance seems to
> improve. That surprised a couple of us on the development team, and I'll be
> looking at that some more today.
> > > > >
> > > > > If you're still seeing significantly worse
> performance with Block compared to the default rectangular domain, and the
> programs are launched in the same way, that would be odd. You could try
> profiling using chplvis. I agree though that there shouldn't be any
> communication in this program. You can find more information on chplvis here
> in the online 1.14 release documentation:
> > > > >
> > > > >
> http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
> > > > >
> > > > > I hope that rewriting the loops solves the problem,
> but let us know if it doesn't and we can continue investigating.
> > > > >
> > > > > -Ben Harshbarger
> > > > >
> > > > > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]>
> wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > > My apologies if this has already been asked
> before. I'm new to the list
> > > > > and couldn't find it in the archives.
> > > > >
> > > > > I experience bad performance when running the
> multi-locale compiled
> > > > > version on an InfiniBand-equipped cluster
> > > > > (http://cs.vu.nl/das4/clusters.shtml, VU-site),
> even with only one node.
> > > > > Below you find a minimal example that exhibits
> the same performance
> > > > > problems as all my programs:
> > > > >
> > > > > I compiled chapel-1.14.0 with the following steps:
> > > > >
> > > > > export CHPL_TARGET_ARCH=native
> > > > > make -j
> > > > > export CHPL_COMM=gasnet
> > > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > > make clean
> > > > > make -j
> > > > >
> > > > > I compile the following Chapel code:
> > > > >
> > > > > vectoradd.chpl:
> > > > > ---------------
> > > > > use Time;
> > > > > use Random;
> > > > > use BlockDist;
> > > > >
> > > > > config const n = 1024**3;
> > > > >
> > > > > // for single-locale
> > > > > // const ProblemDomain : domain(1) = {0..#n};
> > > > > // for multi-locale
> > > > > const ProblemDomain : domain(1) dmapped
> Block(boundingBox = {0..#n}) =
> > > > > {0..#n};
> > > > >
> > > > > type float = real(32);
> > > > >
> > > > > proc add(c : [ProblemDomain] float, a :
> [ProblemDomain] float,
> > > > > b : [ProblemDomain] float) {
> > > > > forall i in ProblemDomain {
> > > > > c[i] = a[i] + b[i];
> > > > > }
> > > > > }
> > > > >
> > > > > proc main() {
> > > > > var c : [ProblemDomain] float;
> > > > > var a : [ProblemDomain] float;
> > > > > var b : [ProblemDomain] float;
> > > > > var t : Timer;
> > > > >
> > > > > fillRandom(a, 0);
> > > > > fillRandom(b, 42);
> > > > >
> > > > > t.start();
> > > > > add(c, a, b);
> > > > > t.stop();
> > > > >
> > > > > writeln("n: ", n);
> > > > > writeln("Time: ", t.elapsed(), "s");
> > > > > writeln("GFLOPS: ", n / t.elapsed() / 1e9,
> "s");
> > > > > }
> > > > > ----
> > > > >
> > > > > I compile this for single-locale with (using no
> domain maps, see the
> > > > > comment above in the source):
> > > > >
> > > > > chpl -o vectoradd --fast vectoradd.chpl
> > > > >
> > > > > I run it with (dual quad core with 2 hardware
> threads):
> > > > >
> > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > > ./vectoradd
> > > > >
> > > > > And get as output:
> > > > >
> > > > > n: 1073741824
> > > > > Time: 0.558806s
> > > > > GFLOPS: 1.92149s
> > > > >
> > > > > However, the performance for multi-locale is much
> worse:
> > > > >
> > > > > I compile this for multi-locale with domain maps,
> see the comment in the
> > > > > source):
> > > > >
> > > > > CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o
> vectoradd --fast \
> > > > > vectoradd.chpl
> > > > >
> > > > > I run it on the same type of node with:
> > > > >
> > > > > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
> > > > >
> > > > > export GASNET_PHYSMEM_MAX=1G
> > > > > export GASNET_IBV_SPAWNER=ssh
> > > > > export GASNET_SSH_SERVERS="$SSH_SERVERS"
> > > > >
> > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > > export CHPL_LAUNCHER=gasnetrun_ibv
> > > > > export CHPL_COMM=gasnet
> > > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > >
> > > > > ./vectoradd -nl 1
> > > > >
> > > > > And get as output:
> > > > >
> > > > > n: 1073741824
> > > > > Time: 8.65082s
> > > > > GFLOPS: 0.12412s
> > > > >
> > > > > I would understand a performance difference of
> say 10% because of
> > > > > multi-locale execution, but not factors. Is this
> to be expected from
> > > > > the current state of Chapel? This performance
> difference is typical
> > > > > for basically all my programs that also are more
> realistic and use
> > > > > larger inputs. The performance is strange as
> there is no communication
> > > > > necessary (only one node) and the program is
> using the same amount of
> > > > > threads.
> > > > >
> > > > > Is there any way for me to investigate this using
> profiling for example?
> > > > >
> > > > > By the way, the program does scale well to
> multiple nodes (which is not
> > > > > difficult given the baseline):
> > > > >
> > > > > 1 | 8.65s
> > > > > 2 | 2.67s
> > > > > 4 | 1.69s
> > > > > 8 | 0.87s
> > > > > 16 | 0.41s
> > > > >
> > > > > Thanks in advance for your input.
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Pieter Hijma
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
------------------------------------------------------------------------------
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users