Re: Performance multi-locale

Ben Harshbarger Thu, 17 Nov 2016 09:27:05 -0800

Hi Pieter,

I think that's a fair conclusion. Thanks for being patient and providing the 
code examples and timings. On my end, I'll try and work on creating some 
mini-benchmarks to track this difference more clearly (for both the vectorAdd 
and 1D-convolution use-cases).


As far as understanding overhead for direct-indexing in BlockDist, here are 
some issues that I'm aware of:
- Overhead when the compiler is unable to infer that data is local. In such 
cases we introduce "wide pointers", which the back-end C compiler may not be 
able to optimize as effectively as a normal pointer. This is what may have been 
impacted by switching from gcc 4.4 to 6.2.
- Overhead to check which locale "owns" an index
- Possible overhead for privatized objects
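
As an illustration of the first two items, here is a minimal sketch (not 
something measured in this thread) of a direct-indexing loop that uses 
localAccess to assert that each element is stored on the locale running the 
iteration. It assumes a Chapel version in which Block-distributed arrays 
provide localAccess, and it reuses the "dmapped Block" syntax from 
vectoradd.chpl, which later releases may spell differently. The name 
addForallLocal and the small default n are hypothetical.

```chapel
use BlockDist;

// Small default so the sketch runs quickly (assumption, not the thread's n).
config const n = 1024**2;

// Same declaration style as the multi-locale vectoradd.chpl in this thread.
const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
  = {0..#n};

type float = real(32);

// Hypothetical variant of addForall. localAccess skips the "which locale
// owns index i" check, so it is only valid when the element really is local.
proc addForallLocal(ref c : [ProblemDomain] float, a : [ProblemDomain] float,
                    b : [ProblemDomain] float) {
  forall i in ProblemDomain {
    c.localAccess[i] = a.localAccess[i] + b.localAccess[i];
  }
}

proc main() {
  var c : [ProblemDomain] float;
  var a : [ProblemDomain] float;
  var b : [ProblemDomain] float;

  a = 1.0:float;
  b = 2.0:float;

  addForallLocal(c, a, b);

  // Spot-check the first and last elements.
  writeln(c[0], " ", c[n-1]);
}
```

This is only safe because c, a, and b are all declared over ProblemDomain and 
the forall iterates over that same domain, so iteration i runs on the locale 
that owns index i. Comparing its timing against addForall could hint at how 
much of the gap comes from locality checks rather than from privatization.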

Though this isn't a satisfying conclusion, hopefully it helps to get some sense 
of how different patterns perform.

-Ben Harshbarger

On 11/17/16, 1:28 AM, "Pieter Hijma" <[email protected]> wrote:

    Hi Ben,
    
    Same setup, testing with GCC 6.2.0, single-locale in directory 
    'datapar', multi-locale in directory 'datapar-dist'.
    
    Similarly to the 1D-convolution case, the Makefiles are the same:
    
    $ diff datapar/Makefile datapar/Makefile
    
    This means that I compile both programs with:
    
    $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
         vectoradd.chpl
    
    The job files are also the same:
    
    $ diff datapar/das4.job datapar/das4.job
    
    The contents of das4.job (an SGE script), basically the same as for the
    1D-convolution:
    
    ---
    #!/bin/bash
    #$ -l h_rt=0:15:00
    #$ -N VECTORADD
    #$ -cwd
    
    . ~/.bashrc
    
    SSH_SERVERS=`uniq $TMPDIR/machines  | tr '\n' ' '`
    
    export CHPL_COMM=gasnet
    export CHPL_COMM_SUBSTRATE=ibv
    export CHPL_LAUNCHER=gasnetrun_ibv
    export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    
    export GASNET_IBV_SPAWNER=ssh
    export GASNET_PHYSMEM_MAX=1G
    export GASNET_SSH_SERVERS="$SSH_SERVERS"
    
    APP=./vectoradd
    ARGS=$*
    
    $APP $ARGS
    ---
    
    The two source files differ in the use of domain-maps:
    
    $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
    2a3
     > use BlockDist;
    7c8
    < const ProblemDomain : domain(1) = {0..#n};
    ---
     > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
    ERROR: 1
    
    
    The output of 'datapar', originally the single-locale version:
    
    addNoDomain n: 536870912
    Time: 0.329722s
    GFLOPS: 1.62825
    
    addZip n: 536870912
    Time: 0.328751s
    GFLOPS: 1.63306
    
    addForall n: 536870912
    Time: 0.325768s
    GFLOPS: 1.64802
    
    addCollective n: 536870912
    Time: 0.330918s
    GFLOPS: 1.62237
    
    
    
    The output of 'datapar-dist', the multi-locale version:
    
    addNoDomain n: 536870912
    Time: 0.373368s
    GFLOPS: 1.43791
    
    addZip n: 536870912
    Time: 0.372561s
    GFLOPS: 1.44103
    
    addForall n: 536870912
    Time: 2.66822s
    GFLOPS: 0.201209
    
    addCollective n: 536870912
    Time: 0.36856s
    GFLOPS: 1.45667
    
    
    I guess the conclusion is that in this case, too, the use of BlockDist
    has an effect: minor overall (roughly 11-13% slower) and major when
    indexing directly (about 8x slower).
    
    Kind regards,
    
    Pieter Hijma
    
    On 16/11/16 20:32, Ben Harshbarger wrote:
    > Hi Pieter,
    >
    > Do you still see a problem with vectorAdd.chpl? I think 1D-convolution
    > has a somewhat separate performance issue due to accessing with arbitrary
    > indices (more overhead because we don't know if the index is local). If
    > vectorAdd isn't performing well, then that could hurt 1D-convolution too.
    >
    > -Ben Harshbarger
    >
    > On 11/16/16, 3:33 AM, "Pieter Hijma" <[email protected]> wrote:
    >
    >     Hi Ben,
    >
    >     Good suggestion. I'm going to test with the 1D-convolution program,
    >     using the Chapel version compiled with GCC 6.2.0.  The single-locale
    >     version is in directory 'datapar' and the multi-locale version is in
    >     directory 'datapar-dist'.
    >
    >     The Makefiles are now the same:
    >
    >     $ diff datapar/Makefile datapar/Makefile
    >
    >     This means that I compile both programs with:
    >
    >     $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o 1D-convolution \
    >        --fast 1D-convolution.chpl
    >
    >     The job files are also the same:
    >
    >     $ diff datapar/das4.job datapar/das4.job
    >
    >     The contents of das4.job (an SGE script):
    >
    >     -----
    >     #!/bin/bash
    >     #$ -l h_rt=0:15:00
    >     #$ -N CONVOLUTION_1D
    >     #$ -cwd
    >
    >     . ~/.bashrc
    >
    >     SSH_SERVERS=`uniq $TMPDIR/machines  | tr '\n' ' '`
    >
    >     export CHPL_COMM=gasnet
    >     export CHPL_COMM_SUBSTRATE=ibv
    >     export CHPL_LAUNCHER=gasnetrun_ibv
    >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >
    >     export GASNET_IBV_SPAWNER=ssh
    >     export GASNET_PHYSMEM_MAX=1G
    >     export GASNET_SSH_SERVERS="$SSH_SERVERS"
    >
    >     APP=./1D-convolution
    >     ARGS=$*
    >
    >     $APP $ARGS
    >     ------
    >
    >     The difference between the two source files is now basically that the
    >     single-locale version has only the default domain map, whereas the
    >     multi-locale has the BlockDist domain map:
    >
    >     $ diff datapar/1D-convolution.chpl datapar-dist/1D-convolution.chpl
    >     2a3
    >      > use BlockDist;
    >     6c7
    >     < const ProblemDomain : domain(1) = {0..#n};
    >     ---
    >      > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
    >     ERROR: 1
    >
    >
    >     The output of 'datapar', the original single-locale version without
    >     BlockDist:
    >
    >     convolveIndices, n: 536870912
    >     Time: 0.319077s
    >     GFLOPS: 5.04772
    >
    >     convolveZip, n: 536870912
    >     Time: 0.320788s
    >     GFLOPS: 5.0208
    >
    >
    >     The output of 'datapar-dist', the original multi-locale version with
    >     BlockDist:
    >
    >     convolveIndices, n: 536870912
    >     Time: 3.1422s
    >     GFLOPS: 0.512575
    >
    >     convolveZip, n: 536870912
    >     Time: 3.54989s
    >     GFLOPS: 0.453708
    >
    >
    >     I guess we can conclude that adding the BlockDist domain map to
    >     ProblemDomain is, by itself, enough to cause a factor of 10 slowdown.
    >
    >     Kind regards,
    >
    >     Pieter Hijma
    >
    >     On 14/11/16 20:21, Ben Harshbarger wrote:
    >     > Hi Pieter,
    >     >
    >     > My next suggestion would be to try compiling and running the
    >     > "single locale" variation with the same environment variables that
    >     > you use for multilocale. I'm wondering if the use of IBV is impacting
    >     > performance in some way. I don't see the performance issue on our
    >     > internal ibv cluster, but it's worth checking.
    >     >
    >     > -Ben Harshbarger
    >     >
    >     > On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote:
    >     >
    >     >     Hi Ben,
    >     >
    >     >     Thanks for your help.
    >     >
    >     >     On 07/11/16 18:59, Ben Harshbarger wrote:
    >     >     > When CHPL_COMM is set to 'none', our compiler can avoid
    >     >     > introducing some overhead that is necessary for multi-locale
    >     >     > programs. You can force this overhead when CHPL_COMM == none by
    >     >     > compiling with the flag "--no-local". If you compile your
    >     >     > single-locale program with that flag, does the performance get
    >     >     > worse?
    >     >
    >     >     It makes some difference, but not much:
    >     >
    >     >     chpl -o vectoradd --fast vectoradd.chpl
    >     >
    >     >     addNoDomain n: 1073741824
    >     >     Time: 0.57211s
    >     >     GFLOPS: 1.87681
    >     >
    >     >     addZip n: 1073741824
    >     >     Time: 0.571799s
    >     >     GFLOPS: 1.87783
    >     >
    >     >     addForall n: 1073741824
    >     >     Time: 0.571623s
    >     >     GFLOPS: 1.87841
    >     >
    >     >     addCollective n: 1073741824
    >     >     Time: 0.571395s
    >     >     GFLOPS: 1.87916
    >     >
    >     >
    >     >     chpl -o vectoradd --fast --no-local vectoradd.chpl
    >     >
    >     >     addNoDomain n: 1073741824
    >     >     Time: 0.62087s
    >     >     GFLOPS: 1.72941
    >     >
    >     >     addZip n: 1073741824
    >     >     Time: 0.619997s
    >     >     GFLOPS: 1.73185
    >     >
    >     >     addForall n: 1073741824
    >     >     Time: 0.620645s
    >     >     GFLOPS: 1.73004
    >     >
    >     >     addCollective n: 1073741824
    >     >     Time: 0.620254s
    >     >     GFLOPS: 1.73113
    >     >
    >     >
    >     >     > If that's the case, I'm not entirely sure what the next step
    >     >     > would be. Do you have access to a newer version of GCC? The
    >     >     > backend C compiler can matter when it comes to optimizing the
    >     >     > multi-locale overhead.
    >     >
    >     >     It is indeed an old one.  We also have GCC 4.9.0, Intel 13.3, and I
    >     >     compiled GCC 6.2.0 to check:
    >     >
    >     >     * intel/compiler/64/13.3/2013.3.163
    >     >
    >     >     I basically see the same behavior:
    >     >
    >     >     single locale:
    >     >
    >     >     addNoDomain n: 536870912
    >     >     Time: 0.285186s
    >     >     GFLOPS: 1.88253
    >     >
    >     >     addZip n: 536870912
    >     >     Time: 0.284819s
    >     >     GFLOPS: 1.88495
    >     >
    >     >     addForall n: 536870912
    >     >     Time: 0.287904s
    >     >     GFLOPS: 1.86476
    >     >
    >     >     addCollective n: 536870912
    >     >     Time: 0.284912s
    >     >     GFLOPS: 1.88434
    >     >
    >     >     multi-locale, one node:
    >     >
    >     >     addNoDomain n: 536870912
    >     >     Time: 3.24471s
    >     >     GFLOPS: 0.16546
    >     >
    >     >     addZip n: 536870912
    >     >     Time: 3.01287s
    >     >     GFLOPS: 0.178192
    >     >
    >     >     addForall n: 536870912
    >     >     Time: 7.23895s
    >     >     GFLOPS: 0.0741642
    >     >
    >     >     addCollective n: 536870912
    >     >     Time: 2.59501s
    >     >     GFLOPS: 0.206886
    >     >
    >     >
    >     >     * GCC 4.9.0
    >     >
    >     >     This is encouraging: multi-locale performance improves to within
    >     >     a factor of two of single-locale, except for the explicit indices
    >     >     in the forall:
    >     >
    >     >     single locale:
    >     >
    >     >     addNoDomain n: 536870912
    >     >     Time: 0.277222s
    >     >     GFLOPS: 1.93661
    >     >
    >     >     addZip n: 536870912
    >     >     Time: 0.27566s
    >     >     GFLOPS: 1.94758
    >     >
    >     >     addForall n: 536870912
    >     >     Time: 0.27609s
    >     >     GFLOPS: 1.94455
    >     >
    >     >     addCollective n: 536870912
    >     >     Time: 0.275303s
    >     >     GFLOPS: 1.95011
    >     >
    >     >     multi-locale, single node:
    >     >
    >     >     addNoDomain n: 536870912
    >     >     Time: 0.492954s
    >     >     GFLOPS: 1.08909
    >     >
    >     >     addZip n: 536870912
    >     >     Time: 0.493039s
    >     >     GFLOPS: 1.0889
    >     >
    >     >     addForall n: 536870912
    >     >     Time: 2.85323s
    >     >     GFLOPS: 0.188162
    >     >
    >     >     addCollective n: 536870912
    >     >     Time: 0.492135s
    >     >     GFLOPS: 1.0909
    >     >
    >     >
    >     >     * GCC 6.2.0
    >     >
    >     >     The performance on multi-locale is now even better.  Still very
    >     >     low for explicit indices in the forall.
    >     >
    >     >     single locale:
    >     >
    >     >     addNoDomain n: 536870912
    >     >     Time: 0.283272s
    >     >     GFLOPS: 1.89525
    >     >
    >     >     addZip n: 536870912
    >     >     Time: 0.281942s
    >     >     GFLOPS: 1.90419
    >     >
    >     >     addForall n: 536870912
    >     >     Time: 0.282291s
    >     >     GFLOPS: 1.90184
    >     >
    >     >     addCollective n: 536870912
    >     >     Time: 0.281629s
    >     >     GFLOPS: 1.90631
    >     >
    >     >     Multi-locale, single node:
    >     >
    >     >     addNoDomain n: 536870912
    >     >     Time: 0.358012s
    >     >     GFLOPS: 1.49959
    >     >
    >     >     addZip n: 536870912
    >     >     Time: 0.356696s
    >     >     GFLOPS: 1.50512
    >     >
    >     >     addForall n: 536870912
    >     >     Time: 2.92173s
    >     >     GFLOPS: 0.183751
    >     >
    >     >     addCollective n: 536870912
    >     >     Time: 0.343808s
    >     >     GFLOPS: 1.56154
    >     >
    >     >
    >     >
    >     >     Since this is encouraging, I also verified the performance of
    >     >     the 1D-stencils:
    >     >
    >     >     * GCC 4.4.7
    >     >
    >     >     For reference, the old compiler that I used initially:
    >     >
    >     >     single locale:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 0.82361s
    >     >     GFLOPS: 1.95555
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 0.810028s
    >     >     GFLOPS: 1.98834
    >     >
    >     >     multi-locale, one node:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 4.25951s
    >     >     GFLOPS: 0.378122
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 4.88046s
    >     >     GFLOPS: 0.330012
    >     >
    >     >     * intel/compiler/64/13.3/2013.3.163
    >     >
    >     >     With this compiler the single-locale performance is better than
    >     >     with the previous compiler.  However, the multi-locale, one-node
    >     >     performance is roughly a factor of 2.5 slower than with the
    >     >     previous compiler.
    >     >
    >     >     single locale:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 0.554139s
    >     >     GFLOPS: 2.90651
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 0.556653s
    >     >     GFLOPS: 2.89339
    >     >
    >     >
    >     >     multi-locale, one node:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 10.5368s
    >     >     GFLOPS: 0.152856
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 12.7625s
    >     >     GFLOPS: 0.126198
    >     >
    >     >
    >     >     * GCC 4.9.0
    >     >
    >     >     The performance of single locale is much better than GCC 4.4.7,
    >     >     however still poor for the multi-locale, one node configuration,
    >     >     although a bit better.
    >     >
    >     >     single locale:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 0.207055s
    >     >     GFLOPS: 7.77867
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 0.206783s
    >     >     GFLOPS: 7.7889
    >     >
    >     >     multi-locale, one node:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 3.20851s
    >     >     GFLOPS: 0.501981
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 3.652s
    >     >     GFLOPS: 0.441023
    >     >
    >     >
    >     >     * GCC 6.2.0
    >     >
    >     >     Strangely enough, single-locale performance is a bit lower than
    >     >     with GCC 4.9.0, while multi-locale, one-node performance is about
    >     >     the same as with GCC 4.9.0.
    >     >
    >     >     single-locale:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 0.263151s
    >     >     GFLOPS: 6.12049
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 0.262234s
    >     >     GFLOPS: 6.14189
    >     >
    >     >     multi-locale, one node:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 3.12716s
    >     >     GFLOPS: 0.515039
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 3.58663s
    >     >     GFLOPS: 0.44906
    >     >
    >     >
    >     >     The conclusion is that the compiler has indeed a large impact on
    >     >     the multi-locale performance, but probably only in the simple
    >     >     cases such as vector addition.  With the stencil code, although it
    >     >     is not very complicated, the performance falls back into the
    >     >     pattern that I came across originally.
    >     >
    >     >     However, perhaps this gives you an idea of the optimizations that
    >     >     impact the performance?  If we can't find a solution, I would at
    >     >     least like to understand the lack of performance.
    >     >
    >     >     I also checked the performance of the stencils by not using the
    >     >     StencilDist but just the BlockDist and it makes no difference.
    >     >
    >     >     > You may also want to consider setting CHPL_TARGET_ARCH to
    >     >     > something else if you're compiling on a machine architecture
    >     >     > different from the compute nodes. There's more information about
    >     >     > CHPL_TARGET_ARCH here:
    >     >     >
    >     >     > http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
    >     >
    >     >     The head node and compute nodes are all Intel Xeon Westmeres, so I
    >     >     don't think that makes a difference.  To be absolutely sure, I
    >     >     also compiled Chapel and the applications on a compute node and
    >     >     indeed, the performance is comparable to all measurements above.
    >     >
    >     >     Kind regards,
    >     >
    >     >     Pieter Hijma
    >     >
    >     >
    >     >     > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:
    >     >     >
    >     >     >     Dear Ben,
    >     >     >
    >     >     >     Sorry for my late reactions.  Unfortunately, for some reason,
    >     >     >     these emails are marked as spam even though I marked the list
    >     >     >     and your address as safe.  I will make sure I check my spam
    >     >     >     folders meticulously from now on.
    >     >     >
    >     >     >     On 28/10/16 23:34, Ben Harshbarger wrote:
    >     >     >     > Hi Pieter,
    >     >     >     >
    >     >     >     > Sorry that you're still having issues. I think we'll need
    >     >     >     > some more information before going forward:
    >     >     >     >
    >     >     >     > 1) Could you send us the output of
    >     >     >     > "$CHPL_HOME/util/printchplenv --anonymize" ? It's a script that
    >     >     >     > displays the various CHPL_ environment variables. "--anonymize"
    >     >     >     > strips the output of information you may prefer to keep private
    >     >     >     > (machine info, paths).
    >     >     >
    >     >     >     This would be the setup if running single-locale programs:
    >     >     >
    >     >     >     $ printchplenv --anonymize
    >     >     >     CHPL_TARGET_PLATFORM: linux64
    >     >     >     CHPL_TARGET_COMPILER: gnu
    >     >     >     CHPL_TARGET_ARCH: native *
    >     >     >     CHPL_LOCALE_MODEL: flat
    >     >     >     CHPL_COMM: none
    >     >     >     CHPL_TASKS: qthreads
    >     >     >     CHPL_LAUNCHER: none
    >     >     >     CHPL_TIMERS: generic
    >     >     >     CHPL_UNWIND: none
    >     >     >     CHPL_MEM: jemalloc
    >     >     >     CHPL_MAKE: gmake
    >     >     >     CHPL_ATOMICS: intrinsics
    >     >     >     CHPL_GMP: gmp
    >     >     >     CHPL_HWLOC: hwloc
    >     >     >     CHPL_REGEXP: re2
    >     >     >     CHPL_WIDE_POINTERS: struct
    >     >     >     CHPL_AUX_FILESYS: none
    >     >     >
    >     >     >     When I run multi-locale programs, I set the following
    >     >     >     environment variables:
    >     >     >
    >     >     >     export CHPL_COMM=gasnet
    >     >     >     export CHPL_COMM_SUBSTRATE=ibv
    >     >     >
    >     >     >     Then the Chapel environment would be:
    >     >     >
    >     >     >     $ printchplenv --anonymize
    >     >     >     CHPL_TARGET_PLATFORM: linux64
    >     >     >     CHPL_TARGET_COMPILER: gnu
    >     >     >     CHPL_TARGET_ARCH: native *
    >     >     >     CHPL_LOCALE_MODEL: flat
    >     >     >     CHPL_COMM: gasnet *
    >     >     >        CHPL_COMM_SUBSTRATE: ibv *
    >     >     >        CHPL_GASNET_SEGMENT: large
    >     >     >     CHPL_TASKS: qthreads
    >     >     >     CHPL_LAUNCHER: gasnetrun_ibv
    >     >     >     CHPL_TIMERS: generic
    >     >     >     CHPL_UNWIND: none
    >     >     >     CHPL_MEM: jemalloc
    >     >     >     CHPL_MAKE: gmake
    >     >     >     CHPL_ATOMICS: intrinsics
    >     >     >        CHPL_NETWORK_ATOMICS: none
    >     >     >     CHPL_GMP: gmp
    >     >     >     CHPL_HWLOC: hwloc
    >     >     >     CHPL_REGEXP: re2
    >     >     >     CHPL_WIDE_POINTERS: struct
    >     >     >     CHPL_AUX_FILESYS: none
    >     >     >
    >     >     >
    >     >     >     > 2) What C compiler are you using?
    >     >     >
    >     >     >     $ gcc --version
    >     >     >     gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
    >     >     >     Copyright (C) 2010 Free Software Foundation, Inc.
    >     >     >     This is free software; see the source for copying conditions.  There is NO
    >     >     >     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    >     >     >
    >     >     >     > 3) Are you sure that the programs are being launched
    >     >     >     > correctly? This might seem silly, but it's worth double-checking
    >     >     >     > that the programs are actually running on the same hardware (not
    >     >     >     > necessarily the same node though).
    >     >     >
    >     >     >     I am completely certain that the single-locale program, the
    >     >     >     multi-locale program for one node, and the multi-locale for
    >     >     >     multiple nodes are running on the compute nodes.  I'm not
    >     >     >     completely sure what you mean by "the same hardware".  All
    >     >     >     compute nodes have the same hardware if that is what you mean.
    >     >     >
    >     >     >     > I'd also like to clarify what you mean by "multi-locale
    >     >     >     > compiled". Is the difference between the programs just the use
    >     >     >     > of the Block domain map, or do you compile with different
    >     >     >     > environment variables set?
    >     >     >
    >     >     >     I compile different programs and I use different environment
    >     >     >     variables:
    >     >     >
    >     >     >     The single-locale version vectoradd is located in the datapar
    >     >     >     directory, whereas the multi-locale version is located in the
    >     >     >     datapar-dist directory.  What follows is the diff for the .chpl
    >     >     >     file:
    >     >     >
    >     >     >     $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
    >     >     >     8c8
    >     >     >     < const ProblemDomain : domain(1) = {0..#n};
    >     >     >     ---
    >     >     >      > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
    >     >     >
    >     >     >     The diff for the Makefile:
    >     >     >
    >     >     >     $ diff datapar/Makefile datapar-dist/Makefile
    >     >     >     2a3
    >     >     >      > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
    >     >     >     8c9
    >     >     >     <         $(CHPL) -o $@ $(FLAGS) $<
    >     >     >     ---
    >     >     >      >        $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
    >     >     >     11c12
    >     >     >     <         rm -f $(APP)
    >     >     >     ---
    >     >     >      >        rm -f $(APP) $(APP)_real
    >     >     >
    >     >     >     Thanks for your help, and again my apologies for the delayed
    >     >     >     answers.
    >     >     >
    >     >     >     Kind regards,
    >     >     >
    >     >     >     Pieter Hijma
    >     >     >
    >     >     >     >
    >     >     >     > -Ben Harshbarger
    >     >     >     >
    >     >     >     > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> 
wrote:
    >     >     >     >
    >     >     >     >     Hi Ben,
    >     >     >     >
    >     >     >     >     Thank you for your fast reply and suggestions!  I did some
    >     >     >     >     more tests and also included stencil operations.
    >     >     >     >
    >     >     >     >     First, the vector addition:
    >     >     >     >
    >     >     >     >     vectoradd.chpl
    >     >     >     >     --------------
    >     >     >     >     use Time;
    >     >     >     >     use Random;
    >     >     >     >     use BlockDist;
    >     >     >     >     //use VisualDebug;
    >     >     >     >
    >     >     >     >     config const n = 1024**3/2;
    >     >     >     >
    >     >     >     >     // for multi-locale
    >     >     >     >     const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
    >     >     >     >        = {0..#n};
    >     >     >     >     // for single-locale, use this declaration instead of the one above
    >     >     >     >     //const ProblemDomain : domain(1) = {0..#n};
    >     >     >     >
    >     >     >     >     type float = real(32);
    >     >     >     >
    >     >     >     >     proc addNoDomain(c : [] float, a : [] float, b : [] float) {
    >     >     >     >        forall (ci, ai, bi) in zip(c, a, b) {
    >     >     >     >          ci = ai + bi;
    >     >     >     >        }
    >     >     >     >     }
    >     >     >     >
    >     >     >     >     proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >     >     >             b : [ProblemDomain] float) {
    >     >     >     >        forall (ci, ai, bi) in zip(c, a, b) {
    >     >     >     >          ci = ai + bi;
    >     >     >     >        }
    >     >     >     >     }
    >     >     >     >
    >     >     >     >     proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >     >     >                b : [ProblemDomain] float) {
    >     >     >     >        //startVdebug("vdata");
    >     >     >     >        forall i in ProblemDomain {
    >     >     >     >          c[i] = a[i] + b[i];
    >     >     >     >        }
    >     >     >     >        //stopVdebug();
    >     >     >     >     }
    >     >     >     >
    >     >     >     >     proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >     >     >                    b : [ProblemDomain] float) {
    >     >     >     >        c = a + b;
    >     >     >     >     }
    >     >     >     >
    >     >     >     >     proc output(t : Timer, n, testName) {
    >     >     >     >        t.stop();
    >     >     >     >        writeln(testName, " n: ", n);
    >     >     >     >        writeln("Time: ", t.elapsed(), "s");
    >     >     >     >        writeln("GFLOPS: ", n / t.elapsed() / 1e9, "");
    >     >     >     >        writeln();
    >     >     >     >        t.clear();
    >     >     >     >     }
    >     >     >     >
    >     >     >     >     proc main() {
    >     >     >     >        var c : [ProblemDomain] float;
    >     >     >     >        var a : [ProblemDomain] float;
    >     >     >     >        var b : [ProblemDomain] float;
    >     >     >     >        var t : Timer;
    >     >     >     >
    >     >     >     >        fillRandom(a, 0);
    >     >     >     >        fillRandom(b, 42);
    >     >     >     >
    >     >     >     >        t.start();
    >     >     >     >        addNoDomain(c, a, b);
    >     >     >     >        output(t, n, "addNoDomain");
    >     >     >     >
    >     >     >     >        t.start();
    >     >     >     >        addZip(c, a, b);
    >     >     >     >        output(t, n, "addZip");
    >     >     >     >
    >     >     >     >        t.start();
    >     >     >     >        addForall(c, a, b);
    >     >     >     >        output(t, n, "addForall");
    >     >     >     >
    >     >     >     >        t.start();
    >     >     >     >        addCollective(c, a, b);
    >     >     >     >        output(t, n, "addCollective");
    >     >     >     >     }
    >     >     >     >     -----
    >     >     >     >
    >     >     >     >     On a single locale I get as output:
    >     >     >     >
    >     >     >     >     addNoDomain n: 536870912
    >     >     >     >     Time: 0.27961s
    >     >     >     >     GFLOPS: 1.92007
    >     >     >     >
    >     >     >     >     addZip n: 536870912
    >     >     >     >     Time: 0.278657s
    >     >     >     >     GFLOPS: 1.92664
    >     >     >     >
    >     >     >     >     addForall n: 536870912
    >     >     >     >     Time: 0.278015s
    >     >     >     >     GFLOPS: 1.93109
    >     >     >     >
    >     >     >     >     addCollective n: 536870912
    >     >     >     >     Time: 0.278379s
    >     >     >     >     GFLOPS: 1.92856
    >     >     >     >
    >     >     >     >     On multi-locale (-nl 1) I get as output:
    >     >     >     >
    >     >     >     >     addNoDomain n: 536870912
    >     >     >     >     Time: 2.16806s
    >     >     >     >     GFLOPS: 0.247627
    >     >     >     >
    >     >     >     >     addZip n: 536870912
    >     >     >     >     Time: 2.17024s
    >     >     >     >     GFLOPS: 0.247378
    >     >     >     >
    >     >     >     >     addForall n: 536870912
    >     >     >     >     Time: 4.78443s
    >     >     >     >     GFLOPS: 0.112212
    >     >     >     >
    >     >     >     >     addCollective n: 536870912
    >     >     >     >     Time: 2.19838s
    >     >     >     >     GFLOPS: 0.244212
    >     >     >     >
    >     >     >     >     So, indeed, your suggestion improves it by more than a
    >     >     >     >     factor two, but it is still close to a factor 8 slower than
    >     >     >     >     single-locale.
    >     >     >     >
    >     >     >     >     I also used chplvis and verified that there are no gets and
    >     >     >     >     puts when running multi-locale with more than one node.  The
    >     >     >     >     profiling information is clear, but not very helpful (to me):
    >     >     >     >
    >     >     >     >     multi-locale (-nl 1):
    >     >     >     >
    >     >     >     >     | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
    >     >     >     >     |  4.8777 | wrapon_fn_chpl35      | vectoradd.chpl:26 |
    >     >     >     >
    >     >     >     >     single-locale:
    >     >     >     >
    >     >     >     >     | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
    >     >     >     >
    >     >     >     >
    >     >     >     >
    >     >     >     >     For stencil operations, I used the following program:
    >     >     >     >
    >     >     >     >     1d-convolution.chpl
    >     >     >     >     -------------------
    >     >     >     >     use Time;
    >     >     >     >     use Random;
    >     >     >     >     use StencilDist;
    >     >     >     >
    >     >     >     >     config const n = 1024**3/2;
    >     >     >     >
    >     >     >     >     const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
    >     >     >     >                                                 fluff = (1,))
    >     >     >     >        = {0..#n};
    >     >     >     >     const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
    >     >     >     >
    >     >     >     >     proc convolveIndices(output : [ProblemDomain] real(32),
    >     >     >     >                      input : [ProblemDomain] real(32)) {
    >     >     >     >        forall i in InnerDomain {
    >     >     >     >          output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
    >     >     >     >        }
    >     >     >     >     }
    >     >     >     >
    >     >     >     >     proc convolveZip(output : [ProblemDomain] real(32),
    >     >     >     >                  input : [ProblemDomain] real(32)) {
    >     >     >     >        forall (im1, i, ip1) in 
zip(InnerDomain.translate(-1),
    >     >     >     >                               InnerDomain,
    >     >     >     >                               InnerDomain.translate(1)) 
{
    >     >     >     >          output[i] = ((input[im1] + input[i] + 
input[ip1])/3:real(32));
    >     >     >     >        }
    >     >     >     >     }
    >     >     >     >
    >     >     >     >     proc print(t : Timer, n, s) {
    >     >     >     >        t.stop();
    >     >     >     >        writeln(s, ", n: ", n);
    >     >     >     >        writeln("Time: ", t.elapsed(), "s");
    >     >     >     >        writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
    >     >     >     >        writeln();
    >     >     >     >        t.clear();
    >     >     >     >     }
    >     >     >     >
    >     >     >     >     proc main() {
    >     >     >     >        var input : [ProblemDomain] real(32);
    >     >     >     >        var output : [ProblemDomain] real(32);
    >     >     >     >        var t : Timer;
    >     >     >     >
    >     >     >     >        fillRandom(input, 42);
    >     >     >     >
    >     >     >     >        t.start();
    >     >     >     >        convolveIndices(output, input);
    >     >     >     >        print(t, n, "convolveIndices");
    >     >     >     >
    >     >     >     >        t.start();
    >     >     >     >        convolveZip(output, input);
    >     >     >     >        print(t, n, "convolveZip");
    >     >     >     >     }
    >     >     >     >     ------
    >     >     >     >
    >     >     >     >     Interestingly, in contrast to your earlier suggestion, the direct
    >     >     >     >     indexing works a bit better in this program than the zipped version:
    >     >     >     >
    >     >     >     >     Multi-locale (-nl 1):
    >     >     >     >
    >     >     >     >     convolveIndices, n: 536870912
    >     >     >     >     Time: 4.27148s
    >     >     >     >     GFLOPS: 0.377062
    >     >     >     >
    >     >     >     >     convolveZip, n: 536870912
    >     >     >     >     Time: 4.87291s
    >     >     >     >     GFLOPS: 0.330524
    >     >     >     >
    >     >     >     >     Single-locale:
    >     >     >     >
    >     >     >     >     convolveIndices, n: 536870912
    >     >     >     >     Time: 0.548804s
    >     >     >     >     GFLOPS: 2.93477
    >     >     >     >
    >     >     >     >     convolveZip, n: 536870912
    >     >     >     >     Time: 0.538754s
    >     >     >     >     GFLOPS: 2.98951
    >     >     >     >
    >     >     >     >
    >     >     >     >     Again, the multi-locale version is about a factor 8 slower than
    >     >     >     >     single-locale.  By the way, the Stencil distribution is a bit faster
    >     >     >     >     than the Block distribution.
    >     >     >     >
    >     >     >     >     Thanks in advance for your input,
    >     >     >     >
    >     >     >     >     Pieter
    >     >     >     >
    >     >     >     >
    >     >     >     >
    >     >     >     >     On 24/10/16 19:20, Ben Harshbarger wrote:
    >     >     >     >     > Hi Pieter,
    >     >     >     >     >
    >     >     >     >     > Thanks for providing the example, that's very helpful.
    >     >     >     >     >
    >     >     >     >     > Multi-locale performance in Chapel is not yet where we'd like it to be, but
    >     >     >     >     > we've done a lot of work over the past few releases to get cases like yours
    >     >     >     >     > performing well. It's surprising that using Block results in that much of a
    >     >     >     >     > difference, but I think you would see better performance by iterating over
    >     >     >     >     > the arrays directly:
    >     >     >     >     >
    >     >     >     >     > ```
    >     >     >     >     > // replace the loop in the 'add' function with this:
    >     >     >     >     > forall (ci, ai, bi) in zip(c, a, b) {
    >     >     >     >     >   ci = ai + bi;
    >     >     >     >     > }
    >     >     >     >     > ```
    >     >     >     >     >
    >     >     >     >     > Block-distributed arrays can leverage the fast-follower optimization to
    >     >     >     >     > perform better when all arrays being iterated over share the same domain.
    >     >     >     >     > You can also write that loop in a cleaner way by leveraging array promotion:
    >     >     >     >     >
    >     >     >     >     > ```
    >     >     >     >     > // This is equivalent to the first loop
    >     >     >     >     > c = a + b;
    >     >     >     >     > ```
    >     >     >     >     >
    >     >     >     >     > However, when I tried the promoted variation on my machine I observed worse
    >     >     >     >     > performance than the explicit forall-loop. It seems to be related to the way
    >     >     >     >     > the arguments of the 'add' function are declared. If you replace
    >     >     >     >     > "[ProblemDomain] float" with "[] float", performance seems to improve. That
    >     >     >     >     > surprised a couple of us on the development team, and I'll be looking at
    >     >     >     >     > that some more today.
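
As a minimal sketch of the two suggestions above (zippered iteration and generic "[] float" formals), assuming the Chapel 1.14-era syntax used in this thread. The names addZip and addPromoted, the explicit "ref" intent on c, and the small default n are illustrative assumptions, not part of the original program; newer Chapel releases may spell the Block distribution differently.

```chapel
use BlockDist;

// Small default so the sketch runs quickly; the thread uses 1024**3.
config const n = 1024**2;

type float = real(32);

const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
     {0..#n};

// Zippered iteration over the arrays themselves instead of indexing
// through ProblemDomain. The formals use a generic domain ("[] float").
// The explicit 'ref' on c is an assumption made for portability across
// Chapel versions.
proc addZip(ref c : [] float, a : [] float, b : [] float) {
  forall (ci, ai, bi) in zip(c, a, b) {
    ci = ai + bi;
  }
}

// Promoted form of the same element-wise addition.
proc addPromoted(ref c : [] float, a : [] float, b : [] float) {
  c = a + b;
}

proc main() {
  var c : [ProblemDomain] float;
  var a : [ProblemDomain] float;
  var b : [ProblemDomain] float;

  a = 1.0:float;
  b = 2.0:float;

  addZip(c, a, b);
  writeln("addZip ok: ", && reduce (c == 3.0:float));

  c = 0.0:float;
  addPromoted(c, a, b);
  writeln("addPromoted ok: ", && reduce (c == 3.0:float));
}
```

This only checks that both variants compute the same result. Whether either one closes the gap to the single-locale build has to be measured on the target system, for example with the Timer-based harness from the original program.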
    >     >     >     >     >
    >     >     >     >     > If you're still seeing significantly worse performance with Block compared
    >     >     >     >     > to the default rectangular domain, and the programs are launched in the same
    >     >     >     >     > way, that would be odd. You could try profiling using chplvis. I agree though
    >     >     >     >     > that there shouldn't be any communication in this program. You can find more
    >     >     >     >     > information on chplvis in the online 1.14 release documentation:
    >     >     >     >     >
    >     >     >     >     > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
    >     >     >     >     >
    >     >     >     >     > I hope that rewriting the loops solves the problem, but let us know if it
    >     >     >     >     > doesn't and we can continue investigating.
    >     >     >     >     >
    >     >     >     >     > -Ben Harshbarger
    >     >     >     >     >
    >     >     >     >     > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:
    >     >     >     >     >
    >     >     >     >     >     Dear all,
    >     >     >     >     >
    >     >     >     >     >     My apologies if this has already been asked before.  I'm new to the list
    >     >     >     >     >     and couldn't find it in the archives.
    >     >     >     >     >
    >     >     >     >     >     I experience bad performance when running the multi-locale compiled
    >     >     >     >     >     version on an InfiniBand-equipped cluster
    >     >     >     >     >     (http://cs.vu.nl/das4/clusters.shtml, VU-site), even with only one node.
    >     >     >     >     >     Below you find a minimal example that exhibits the same performance
    >     >     >     >     >     problems as all my programs:
    >     >     >     >     >
    >     >     >     >     >     I compiled chapel-1.14.0 with the following steps:
    >     >     >     >     >
    >     >     >     >     >     export CHPL_TARGET_ARCH=native
    >     >     >     >     >     make -j
    >     >     >     >     >     export CHPL_COMM=gasnet
    >     >     >     >     >     export CHPL_COMM_SUBSTRATE=ibv
    >     >     >     >     >     make clean
    >     >     >     >     >     make -j
    >     >     >     >     >
    >     >     >     >     >     I compile the following Chapel code:
    >     >     >     >     >
    >     >     >     >     >     vectoradd.chpl:
    >     >     >     >     >     ---------------
    >     >     >     >     >     use Time;
    >     >     >     >     >     use Random;
    >     >     >     >     >     use BlockDist;
    >     >     >     >     >
    >     >     >     >     >     config const n = 1024**3;
    >     >     >     >     >
    >     >     >     >     >     // for single-locale
    >     >     >     >     >     // const ProblemDomain : domain(1) = {0..#n};
    >     >     >     >     >     // for multi-locale
    >     >     >     >     >     const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
    >     >     >     >     >          {0..#n};
    >     >     >     >     >
    >     >     >     >     >     type float = real(32);
    >     >     >     >     >
    >     >     >     >     >     proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >     >     >     >          b : [ProblemDomain] float) {
    >     >     >     >     >        forall i in ProblemDomain {
    >     >     >     >     >          c[i] = a[i] + b[i];
    >     >     >     >     >        }
    >     >     >     >     >     }
    >     >     >     >     >
    >     >     >     >     >     proc main() {
    >     >     >     >     >        var c : [ProblemDomain] float;
    >     >     >     >     >        var a : [ProblemDomain] float;
    >     >     >     >     >        var b : [ProblemDomain] float;
    >     >     >     >     >        var t : Timer;
    >     >     >     >     >
    >     >     >     >     >        fillRandom(a, 0);
    >     >     >     >     >        fillRandom(b, 42);
    >     >     >     >     >
    >     >     >     >     >        t.start();
    >     >     >     >     >        add(c, a, b);
    >     >     >     >     >        t.stop();
    >     >     >     >     >
    >     >     >     >     >        writeln("n: ", n);
    >     >     >     >     >        writeln("Time: ", t.elapsed(), "s");
    >     >     >     >     >        writeln("GFLOPS: ", n / t.elapsed() / 1e9, "s");
    >     >     >     >     >     }
    >     >     >     >     >     ----
    >     >     >     >     >
    >     >     >     >     >     I compile this for single-locale with (using no domain maps, see the
    >     >     >     >     >     comment above in the source):
    >     >     >     >     >
    >     >     >     >     >     chpl -o vectoradd --fast vectoradd.chpl
    >     >     >     >     >
    >     >     >     >     >     I run it with (dual quad core with 2 hardware threads):
    >     >     >     >     >
    >     >     >     >     >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >     >     >     >     >     ./vectoradd
    >     >     >     >     >
    >     >     >     >     >     And get as output:
    >     >     >     >     >
    >     >     >     >     >     n: 1073741824
    >     >     >     >     >     Time: 0.558806s
    >     >     >     >     >     GFLOPS: 1.92149s
    >     >     >     >     >
    >     >     >     >     >     However, the performance for multi-locale is much worse:
    >     >     >     >     >
    >     >     >     >     >     I compile this for multi-locale (with domain maps, see the comment in
    >     >     >     >     >     the source):
    >     >     >     >     >
    >     >     >     >     >     CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
    >     >     >     >     >        vectoradd.chpl
    >     >     >     >     >
    >     >     >     >     >     I run it on the same type of node with:
    >     >     >     >     >
    >     >     >     >     >     SSH_SERVERS=`uniq $TMPDIR/machines  | tr '\n' ' '`
    >     >     >     >     >
    >     >     >     >     >     export GASNET_PHYSMEM_MAX=1G
    >     >     >     >     >     export GASNET_IBV_SPAWNER=ssh
    >     >     >     >     >     export GASNET_SSH_SERVERS="$SSH_SERVERS"
    >     >     >     >     >
    >     >     >     >     >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >     >     >     >     >     export CHPL_LAUNCHER=gasnetrun_ibv
    >     >     >     >     >     export CHPL_COMM=gasnet
    >     >     >     >     >     export CHPL_COMM_SUBSTRATE=ibv
    >     >     >     >     >
    >     >     >     >     >     ./vectoradd -nl 1
    >     >     >     >     >
    >     >     >     >     >     And get as output:
    >     >     >     >     >
    >     >     >     >     >     n: 1073741824
    >     >     >     >     >     Time: 8.65082s
    >     >     >     >     >     GFLOPS: 0.12412s
    >     >     >     >     >
    >     >     >     >     >     I would understand a performance difference of say 10% because of
    >     >     >     >     >     multi-locale execution, but not factors.  Is this to be expected from
    >     >     >     >     >     the current state of Chapel?  This performance difference is
    >     >     >     >     >     representative of basically all my programs, which are also more
    >     >     >     >     >     realistic and use larger inputs.  The performance is strange as there
    >     >     >     >     >     is no communication necessary (only one node) and the program is using
    >     >     >     >     >     the same number of threads.
    >     >     >     >     >
    >     >     >     >     >     Is there any way for me to investigate this using profiling for example?
    >     >     >     >     >
    >     >     >     >     >     By the way, the program does scale well to multiple nodes (which is not
    >     >     >     >     >     difficult given the baseline):
    >     >     >     >     >
    >     >     >     >     >     nodes | time
    >     >     >     >     >         1 | 8.65s
    >     >     >     >     >         2 | 2.67s
    >     >     >     >     >         4 | 1.69s
    >     >     >     >     >         8 | 0.87s
    >     >     >     >     >        16 | 0.41s
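
To put the scaling numbers above in perspective, the following sketch computes the speedup of each run relative to the "-nl 1" run and relative to the single-locale build (0.558806s). The timings are copied from this message; the variable names are illustrative only.

```chapel
// Timings copied from the thread, in seconds.
const singleLocale = 0.558806;   // build without domain maps
const base         = 8.65;       // multi-locale build, -nl 1

const numNodes = [1, 2, 4, 8, 16];
const times    = [8.65, 2.67, 1.69, 0.87, 0.41];

for (nl, t) in zip(numNodes, times) {
  // base / t:         speedup over the multi-locale run on one node
  // singleLocale / t: speedup over the single-locale build on one node
  writeln(nl, " nodes: ", t, "s",
          ", vs -nl 1: ", base / t,
          ", vs single-locale: ", singleLocale / t);
}
```

With these numbers, 16 nodes are roughly 21 times faster than the "-nl 1" run (8.65 / 0.41), but only about 1.4 times faster than the single-locale build on a single node (0.558806 / 0.41).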
    >     >     >     >     >
    >     >     >     >     >     Thanks in advance for your input.
    >     >     >     >     >
    >     >     >     >     >     Kind regards,
    >     >     >     >     >
    >     >     >     >     >     Pieter Hijma
    >     >     >     >     >
    >     >     >     >     >     _______________________________________________
    >     >     >     >     >     Chapel-users mailing list
    >     >     >     >     >     [email protected]
    >     >     >     >     >     https://lists.sourceforge.net/lists/listinfo/chapel-users
    >     >     >     >     >
    >     >     >     >     >
    >     >     >     >
    >     >     >     >
    >     >     >
    >     >     >
    >     >
    >     >
    >
    >
    

