Hi Pieter,

My next suggestion would be to try compiling and running the "single-locale" 
variation with the same environment variables that you use for multi-locale. I'm 
wondering whether the use of IBV is impacting performance in some way. I don't 
see the performance issue on our internal ibv cluster, but it's worth checking.
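
Roughly, something along these lines (reusing the commands from earlier in this 
thread; vectoradd.chpl here is your single-locale version, i.e. the one without 
the dmapped declaration):

```
# compile the non-distributed program, but against the gasnet/ibv runtime
CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast vectoradd.chpl

# launch it the same way as your multi-locale runs, on a single node
# (plus your existing GASNET_* settings)
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_LAUNCHER=gasnetrun_ibv
export CHPL_RT_NUM_THREADS_PER_LOCALE=16
./vectoradd -nl 1
```

If that configuration is also slow, it would point at the communication layer 
rather than at the Block distribution itself.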

-Ben Harshbarger

On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote:

    Hi Ben,
    
    Thanks for your help.
    
    On 07/11/16 18:59, Ben Harshbarger wrote:
    > When CHPL_COMM is set to 'none', our compiler can avoid introducing some 
overhead that is necessary for multi-locale programs. You can force this 
overhead when CHPL_COMM == none by compiling with the flag "--no-local". If you 
compile your single-locale program with that flag, does the performance get 
worse?
    
    It makes some difference, but not much:
    
    chpl -o vectoradd --fast vectoradd.chpl
    
    addNoDomain n: 1073741824
    Time: 0.57211s
    GFLOPS: 1.87681
    
    addZip n: 1073741824
    Time: 0.571799s
    GFLOPS: 1.87783
    
    addForall n: 1073741824
    Time: 0.571623s
    GFLOPS: 1.87841
    
    addCollective n: 1073741824
    Time: 0.571395s
    GFLOPS: 1.87916
    
    
    chpl -o vectoradd --fast --no-local vectoradd.chpl
    
    addNoDomain n: 1073741824
    Time: 0.62087s
    GFLOPS: 1.72941
    
    addZip n: 1073741824
    Time: 0.619997s
    GFLOPS: 1.73185
    
    addForall n: 1073741824
    Time: 0.620645s
    GFLOPS: 1.73004
    
    addCollective n: 1073741824
    Time: 0.620254s
    GFLOPS: 1.73113
    
    
    > If that's the case, I'm not entirely sure what the next step would be. Do 
you have access to a newer version of GCC? The backend C compiler can matter 
when it comes to optimizing the multi-locale overhead.
    
    It is indeed an old one.  We also have GCC 4.9.0 and Intel 13.3, and I 
    compiled GCC 6.2.0 to check.  The results per compiler are below.
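    
    (For reference, a rough sketch of how a different backend C compiler can be 
    selected before rebuilding Chapel; the install path is a placeholder, and 
    the Intel compiler would additionally need CHPL_TARGET_COMPILER=intel:)
    
    export PATH=/path/to/gcc-6.2.0/bin:$PATH   # put the newer gcc first on PATH
    cd $CHPL_HOME
    make clean
    make -j                                    # rebuild Chapel with the new backend
    export CHPL_COMM=gasnet
    export CHPL_COMM_SUBSTRATE=ibv
    make clean
    make -j                                    # rebuild the gasnet/ibv runtime as well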
    
    * intel/compiler/64/13.3/2013.3.163
    
    I basically see the same behavior:
    
    single locale:
    
    addNoDomain n: 536870912
    Time: 0.285186s
    GFLOPS: 1.88253
    
    addZip n: 536870912
    Time: 0.284819s
    GFLOPS: 1.88495
    
    addForall n: 536870912
    Time: 0.287904s
    GFLOPS: 1.86476
    
    addCollective n: 536870912
    Time: 0.284912s
    GFLOPS: 1.88434
    
    multi-locale, one node:
    
    addNoDomain n: 536870912
    Time: 3.24471s
    GFLOPS: 0.16546
    
    addZip n: 536870912
    Time: 3.01287s
    GFLOPS: 0.178192
    
    addForall n: 536870912
    Time: 7.23895s
    GFLOPS: 0.0741642
    
    addCollective n: 536870912
    Time: 2.59501s
    GFLOPS: 0.206886
    
    
    * GCC 4.9.0
    
    This is encouraging: the multi-locale performance improves to within a 
    factor of two of single-locale, except for the forall with explicit indices:
    
    single locale:
    
    addNoDomain n: 536870912
    Time: 0.277222s
    GFLOPS: 1.93661
    
    addZip n: 536870912
    Time: 0.27566s
    GFLOPS: 1.94758
    
    addForall n: 536870912
    Time: 0.27609s
    GFLOPS: 1.94455
    
    addCollective n: 536870912
    Time: 0.275303s
    GFLOPS: 1.95011
    
    multi-locale, single node:
    
    addNoDomain n: 536870912
    Time: 0.492954s
    GFLOPS: 1.08909
    
    addZip n: 536870912
    Time: 0.493039s
    GFLOPS: 1.0889
    
    addForall n: 536870912
    Time: 2.85323s
    GFLOPS: 0.188162
    
    addCollective n: 536870912
    Time: 0.492135s
    GFLOPS: 1.0909
    
    
    * GCC 6.2.0
    
    The multi-locale performance is now even better, but it is still very low 
    for the forall with explicit indices.
    
    single locale:
    
    addNoDomain n: 536870912
    Time: 0.283272s
    GFLOPS: 1.89525
    
    addZip n: 536870912
    Time: 0.281942s
    GFLOPS: 1.90419
    
    addForall n: 536870912
    Time: 0.282291s
    GFLOPS: 1.90184
    
    addCollective n: 536870912
    Time: 0.281629s
    GFLOPS: 1.90631
    
    multi-locale, single node:
    
    addNoDomain n: 536870912
    Time: 0.358012s
    GFLOPS: 1.49959
    
    addZip n: 536870912
    Time: 0.356696s
    GFLOPS: 1.50512
    
    addForall n: 536870912
    Time: 2.92173s
    GFLOPS: 0.183751
    
    addCollective n: 536870912
    Time: 0.343808s
    GFLOPS: 1.56154
    
    
    
    Since this is encouraging, I also verified the performance of the 
    1D-stencils:
    
    * GCC 4.4.7
    
    For reference, the old compiler that I used initially:
    
    single locale:
    
    convolveIndices, n: 536870912
    Time: 0.82361s
    GFLOPS: 1.95555
    
    convolveZip, n: 536870912
    Time: 0.810028s
    GFLOPS: 1.98834
    
    multi-locale, one node:
    
    convolveIndices, n: 536870912
    Time: 4.25951s
    GFLOPS: 0.378122
    
    convolveZip, n: 536870912
    Time: 4.88046s
    GFLOPS: 0.330012
    
    * intel/compiler/64/13.3/2013.3.163
    
    With this compiler the single-locale performance is better than with the 
    previous compiler.  However, the multi-locale, one-node performance is 
    about a factor of 3 slower than with the previous compiler.
    
    single locale:
    
    convolveIndices, n: 536870912
    Time: 0.554139s
    GFLOPS: 2.90651
    
    convolveZip, n: 536870912
    Time: 0.556653s
    GFLOPS: 2.89339
    
    
    multi-locale, one node:
    
    convolveIndices, n: 536870912
    Time: 10.5368s
    GFLOPS: 0.152856
    
    convolveZip, n: 536870912
    Time: 12.7625s
    GFLOPS: 0.126198
    
    
    * GCC 4.9.0
    
    The single-locale performance is much better than with GCC 4.4.7.  The 
    multi-locale, one-node configuration is still poor, although a bit better.
    
    single locale:
    
    convolveIndices, n: 536870912
    Time: 0.207055s
    GFLOPS: 7.77867
    
    convolveZip, n: 536870912
    Time: 0.206783s
    GFLOPS: 7.7889
    
    multi-locale, one node:
    
    convolveIndices, n: 536870912
    Time: 3.20851s
    GFLOPS: 0.501981
    
    convolveZip, n: 536870912
    Time: 3.652s
    GFLOPS: 0.441023
    
    
    * GCC 6.2.0
    
    Strangely enough, the single-locale performance is a bit lower than with 
    the previous compiler, while the multi-locale, one-node performance is 
    about the same.
    
    single-locale:
    
    convolveIndices, n: 536870912
    Time: 0.263151s
    GFLOPS: 6.12049
    
    convolveZip, n: 536870912
    Time: 0.262234s
    GFLOPS: 6.14189
    
    multi-locale, one node:
    
    convolveIndices, n: 536870912
    Time: 3.12716s
    GFLOPS: 0.515039
    
    convolveZip, n: 536870912
    Time: 3.58663s
    GFLOPS: 0.44906
    
    
    The conclusion is that the backend compiler indeed has a large impact on 
    multi-locale performance, but probably only in simple cases such as vector 
    addition.  With the stencil code, although it is not very complicated, the 
    performance falls back into the pattern that I came across originally.
    
    However, perhaps this gives you an idea of which optimizations affect the 
    performance?  If we can't find a solution, I would at least like to 
    understand where the performance is lost.
    
    I also checked the performance of the stencils using the BlockDist instead 
    of the StencilDist, and it makes no difference.
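    
    (Concretely, essentially the only change between those two runs is the 
    domain declaration; a sketch, with n as in the program:)
    
    use BlockDist;
    const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
    
    versus
    
    use StencilDist;
    const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
                                                    fluff = (1,)) = {0..#n};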
    
    > You may also want to consider setting CHPL_TARGET_ARCH to something else 
if you're compiling on a machine architecture different from the compute nodes. 
There's more information about CHPL_TARGET_ARCH here:
    >
    > 
http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
    
    The head node and compute nodes are all Intel Xeon Westmeres, so I 
    don't think that makes a difference.  To be absolutely sure, I also 
    compiled Chapel and the applications on a compute node and indeed, the 
    performance is comparable to all measurements above.
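    
    (For completeness, CHPL_TARGET_ARCH was already set explicitly in my 
    environment for all of these builds:
    
    export CHPL_TARGET_ARCH=native
    
    as also shown in the printchplenv output quoted below.)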
    
    Kind regards,
    
    Pieter Hijma
    
    
    > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:
    >
    >     Dear Ben,
    >
    >     Sorry for my late replies.  Unfortunately, for some reason, these
    >     emails are marked as spam even though I marked the list and your
    >     address as safe.  I will make sure I check my spam folder
    >     meticulously from now on.
    >
    >     On 28/10/16 23:34, Ben Harshbarger wrote:
    >     > Hi Pieter,
    >     >
    >     > Sorry that you're still having issues. I think we'll need some more 
information before going forward:
    >     >
    >     > 1) Could you send us the output of "$CHPL_HOME/util/printchplenv 
--anonymize" ? It's a script that displays the various CHPL_ environment 
variables. "--anonymize" strips the output of information you may prefer to 
keep private (machine info, paths).
    >
    >     This would be the setup if running single-locale programs:
    >
    >     $ printchplenv --anonymize
    >     CHPL_TARGET_PLATFORM: linux64
    >     CHPL_TARGET_COMPILER: gnu
    >     CHPL_TARGET_ARCH: native *
    >     CHPL_LOCALE_MODEL: flat
    >     CHPL_COMM: none
    >     CHPL_TASKS: qthreads
    >     CHPL_LAUNCHER: none
    >     CHPL_TIMERS: generic
    >     CHPL_UNWIND: none
    >     CHPL_MEM: jemalloc
    >     CHPL_MAKE: gmake
    >     CHPL_ATOMICS: intrinsics
    >     CHPL_GMP: gmp
    >     CHPL_HWLOC: hwloc
    >     CHPL_REGEXP: re2
    >     CHPL_WIDE_POINTERS: struct
    >     CHPL_AUX_FILESYS: none
    >
    >     When I run multi-locale programs, I set the following environment 
variables:
    >
    >     export CHPL_COMM=gasnet
    >     export CHPL_COMM_SUBSTRATE=ibv
    >
    >     Then the Chapel environment would be:
    >
    >     $ printchplenv --anonymize
    >     CHPL_TARGET_PLATFORM: linux64
    >     CHPL_TARGET_COMPILER: gnu
    >     CHPL_TARGET_ARCH: native *
    >     CHPL_LOCALE_MODEL: flat
    >     CHPL_COMM: gasnet *
    >        CHPL_COMM_SUBSTRATE: ibv *
    >        CHPL_GASNET_SEGMENT: large
    >     CHPL_TASKS: qthreads
    >     CHPL_LAUNCHER: gasnetrun_ibv
    >     CHPL_TIMERS: generic
    >     CHPL_UNWIND: none
    >     CHPL_MEM: jemalloc
    >     CHPL_MAKE: gmake
    >     CHPL_ATOMICS: intrinsics
    >        CHPL_NETWORK_ATOMICS: none
    >     CHPL_GMP: gmp
    >     CHPL_HWLOC: hwloc
    >     CHPL_REGEXP: re2
    >     CHPL_WIDE_POINTERS: struct
    >     CHPL_AUX_FILESYS: none
    >
    >
    >     > 2) What C compiler are you using?
    >
    >     $ gcc --version
    >     gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
    >     Copyright (C) 2010 Free Software Foundation, Inc.
    >     This is free software; see the source for copying conditions.  There 
is NO
    >     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
PURPOSE.
    >
    >     > 3) Are you sure that the programs are being launched correctly? 
This might seem silly, but it's worth double-checking that the programs are 
actually running on the same hardware (not necessarily the same node though).
    >
    >     I am completely certain that the single-locale program, the 
multi-locale
    >     program for one node, and the multi-locale for multiple nodes are
    >     running on the compute nodes.  I'm not completely sure what you mean 
by
    >     "the same hardware".  All compute nodes have the same hardware if that
    >     is what you mean.
    >
    >     > I'd also like to clarify what you mean by "multi-locale compiled". 
Is the difference between the programs just the use of the Block domain map, or 
do you compile with different environment variables set?
    >
    >     I compile different programs and I use different environment 
variables:
    >
    >     The single-locale version vectoradd is located in the datapar 
directory,
    >     whereas the multi-locale version is located in the datapar-dist
    >     directory.  What follows is the diff for the .chpl file:
    >
    >     $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
    >     8c8
    >     < const ProblemDomain : domain(1) = {0..#n};
    >     ---
    >      > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
    >
    >     The diff for the Makefile:
    >
    >     $ diff datapar/Makefile datapar-dist/Makefile
    >     2a3
    >      > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
    >     8c9
    >     <     $(CHPL) -o $@ $(FLAGS) $<
    >     ---
    >      >    $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
    >     11c12
    >     <     rm -f $(APP)
    >     ---
    >      >    rm -f $(APP) $(APP)_real
    >
    >     Thanks for your help, and again my apologies for the delayed answers.
    >
    >     Kind regards,
    >
    >     Pieter Hijma
    >
    >     >
    >     > -Ben Harshbarger
    >     >
    >     > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:
    >     >
    >     >     Hi Ben,
    >     >
    >     >     Thank you for your fast reply and suggestions!  I did some more 
tests
    >     >     and also included stencil operations.
    >     >
    >     >     First, the vector addition:
    >     >
    >     >     vectoradd.chpl
    >     >     --------------
    >     >     use Time;
    >     >     use Random;
    >     >     use BlockDist;
    >     >     //use VisualDebug;
    >     >
    >     >     config const n = 1024**3/2;
    >     >
    >     >     // use exactly one of the following two declarations:
    >     >     // for multi-locale
    >     >     const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
    >     >        = {0..#n};
    >     >     // for single-locale
    >     >     // const ProblemDomain : domain(1) = {0..#n};
    >     >
    >     >     type float = real(32);
    >     >
    >     >     proc addNoDomain(c : [] float, a : [] float, b : [] float) {
    >     >        forall (ci, ai, bi) in zip(c, a, b) {
    >     >          ci = ai + bi;
    >     >        }
    >     >     }
    >     >
    >     >     proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >                 b : [ProblemDomain] float) {
    >     >        forall (ci, ai, bi) in zip(c, a, b) {
    >     >          ci = ai + bi;
    >     >        }
    >     >     }
    >     >
    >     >     proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >                    b : [ProblemDomain] float) {
    >     >        //startVdebug("vdata");
    >     >        forall i in ProblemDomain {
    >     >          c[i] = a[i] + b[i];
    >     >        }
    >     >        //stopVdebug();
    >     >     }
    >     >
    >     >     proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >                        b : [ProblemDomain] float) {
    >     >        c = a + b;
    >     >     }
    >     >
    >     >     proc output(t : Timer, n, testName) {
    >     >        t.stop();
    >     >        writeln(testName, " n: ", n);
    >     >        writeln("Time: ", t.elapsed(), "s");
    >     >        writeln("GFLOPS: ", n / t.elapsed() / 1e9, "");
    >     >        writeln();
    >     >        t.clear();
    >     >     }
    >     >
    >     >     proc main() {
    >     >        var c : [ProblemDomain] float;
    >     >        var a : [ProblemDomain] float;
    >     >        var b : [ProblemDomain] float;
    >     >        var t : Timer;
    >     >
    >     >        fillRandom(a, 0);
    >     >        fillRandom(b, 42);
    >     >
    >     >        t.start();
    >     >        addNoDomain(c, a, b);
    >     >        output(t, n, "addNoDomain");
    >     >
    >     >        t.start();
    >     >        addZip(c, a, b);
    >     >        output(t, n, "addZip");
    >     >
    >     >        t.start();
    >     >        addForall(c, a, b);
    >     >        output(t, n, "addForall");
    >     >
    >     >        t.start();
    >     >        addCollective(c, a, b);
    >     >        output(t, n, "addCollective");
    >     >     }
    >     >     -----
    >     >
    >     >     On a single locale I get as output:
    >     >
    >     >     addNoDomain n: 536870912
    >     >     Time: 0.27961s
    >     >     GFLOPS: 1.92007
    >     >
    >     >     addZip n: 536870912
    >     >     Time: 0.278657s
    >     >     GFLOPS: 1.92664
    >     >
    >     >     addForall n: 536870912
    >     >     Time: 0.278015s
    >     >     GFLOPS: 1.93109
    >     >
    >     >     addCollective n: 536870912
    >     >     Time: 0.278379s
    >     >     GFLOPS: 1.92856
    >     >
    >     >     On multi-locale (-nl 1) I get as output:
    >     >
    >     >     addNoDomain n: 536870912
    >     >     Time: 2.16806s
    >     >     GFLOPS: 0.247627
    >     >
    >     >     addZip n: 536870912
    >     >     Time: 2.17024s
    >     >     GFLOPS: 0.247378
    >     >
    >     >     addForall n: 536870912
    >     >     Time: 4.78443s
    >     >     GFLOPS: 0.112212
    >     >
    >     >     addCollective n: 536870912
    >     >     Time: 2.19838s
    >     >     GFLOPS: 0.244212
    >     >
    >     >     So, indeed, your suggestion improves it by more than a factor 
two, but
    >     >     it is still close to a factor 8 slower than single-locale.
    >     >
    >     >     I also used chplvis and verified that there are no gets and 
puts when
    >     >     running multi-locale with more than one node.  The profiling 
information
    >     >     is clear, but not very helpful (to me):
    >     >
    >     >     multi-locale (-nl 1):
    >     >
    >     >     | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
    >     >     |  4.8777 | wrapon_fn_chpl35      | vectoradd.chpl:26 |
    >     >
    >     >     single-locale:
    >     >
    >     >     | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
    >     >
    >     >
    >     >
    >     >     For stencil operations, I used the following program:
    >     >
    >     >     1d-convolution.chpl
    >     >     -------------------
    >     >     use Time;
    >     >     use Random;
    >     >     use StencilDist;
    >     >
    >     >     config const n = 1024**3/2;
    >     >
    >     >     const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
    >     >                                                     fluff = (1,))
    >     >        = {0..#n};
    >     >     const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
    >     >
    >     >     proc convolveIndices(output : [ProblemDomain] real(32),
    >     >                          input : [ProblemDomain] real(32)) {
    >     >        forall i in InnerDomain {
    >     >          output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
    >     >        }
    >     >     }
    >     >
    >     >     proc convolveZip(output : [ProblemDomain] real(32),
    >     >                      input : [ProblemDomain] real(32)) {
    >     >        forall (im1, i, ip1) in zip(InnerDomain.translate(-1),
    >     >                                   InnerDomain,
    >     >                                   InnerDomain.translate(1)) {
    >     >          output[i] = ((input[im1] + input[i] + input[ip1])/3:real(32));
    >     >        }
    >     >     }
    >     >
    >     >     proc print(t : Timer, n, s) {
    >     >        t.stop();
    >     >        writeln(s, ", n: ", n);
    >     >        writeln("Time: ", t.elapsed(), "s");
    >     >        writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
    >     >        writeln();
    >     >        t.clear();
    >     >     }
    >     >
    >     >     proc main() {
    >     >        var input : [ProblemDomain] real(32);
    >     >        var output : [ProblemDomain] real(32);
    >     >        var t : Timer;
    >     >
    >     >        fillRandom(input, 42);
    >     >
    >     >        t.start();
    >     >        convolveIndices(output, input);
    >     >        print(t, n, "convolveIndices");
    >     >
    >     >        t.start();
    >     >        convolveZip(output, input);
    >     >        print(t, n, "convolveZip");
    >     >     }
    >     >     ------
    >     >
    >     >     Interestingly, in contrast to your earlier suggestion, the 
direct
    >     >     indexing works a bit better in this program than the zipped 
version:
    >     >
    >     >     Multi-locale (-nl 1):
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 4.27148s
    >     >     GFLOPS: 0.377062
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 4.87291s
    >     >     GFLOPS: 0.330524
    >     >
    >     >     Single-locale:
    >     >
    >     >     convolveIndices, n: 536870912
    >     >     Time: 0.548804s
    >     >     GFLOPS: 2.93477
    >     >
    >     >     convolveZip, n: 536870912
    >     >     Time: 0.538754s
    >     >     GFLOPS: 2.98951
    >     >
    >     >
    >     >     Again, the multi-locale is about a factor 8 slower than 
single-locale.
    >     >     By the way, the Stencil distribution is a bit faster than the 
Block
    >     >     distribution.
    >     >
    >     >     Thanks in advance for your input,
    >     >
    >     >     Pieter
    >     >
    >     >
    >     >
    >     >     On 24/10/16 19:20, Ben Harshbarger wrote:
    >     >     > Hi Pieter,
    >     >     >
    >     >     > Thanks for providing the example, that's very helpful.
    >     >     >
    >     >     > Multi-locale performance in Chapel is not yet where we'd like 
it to be, but we've done a lot of work over the past few releases to get cases 
like yours performing well. It's surprising that using Block results in that 
much of a difference, but I think you would see better performance by iterating 
over the arrays directly:
    >     >     >
    >     >     > ```
    >     >     > // replace the loop in the 'add' function with this:
    >     >     > forall (ci, ai, bi) in zip(c, a, b) {
    >     >     >   ci = ai + bi;
    >     >     > }
    >     >     > ```
    >     >     >
    >     >     > Block-distributed arrays can leverage the fast-follower 
optimization to perform better when all arrays being iterated over share the 
same domain. You can also write that loop in a cleaner way by leveraging array 
promotion:
    >     >     >
    >     >     > ```
    >     >     > // This is equivalent to the first loop
    >     >     > c = a + b;
    >     >     > ```
    >     >     >
    >     >     > However, when I tried the promoted variation on my machine I 
observed worse performance than the explicit forall-loop. It seems to be 
related to the way the arguments of the 'add' function are declared. If you 
replaced "[ProblemDomain] float" with "[] float", performance seems to improve. 
That surprised a couple of us on the development team, and I'll be looking at 
that some more today.
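    >     >     >
    >     >     > For reference, a rough sketch of what I mean (a hypothetical 'add'
    >     >     > written with anonymous array formals instead of "[ProblemDomain] float"):
    >     >     >
    >     >     > ```
    >     >     > // formals declared as generic arrays; the domain comes from the actuals
    >     >     > proc add(c : [] float, a : [] float, b : [] float) {
    >     >     >   c = a + b;  // promoted whole-array addition
    >     >     > }
    >     >     > ```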
    >     >     >
    >     >     > If you're still seeing significantly worse performance with 
Block compared to the default rectangular domain, and the programs are launched 
in the same way, that would be odd. You could try profiling using chplvis. I 
agree though that there shouldn't be any communication in this program. You can 
find more information on chplvis here in the online 1.14 release documentation:
    >     >     >
    >     >     > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
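    >     >     >
    >     >     > A minimal sketch of instrumenting a run for chplvis (the "vdata"
    >     >     > directory name is arbitrary):
    >     >     >
    >     >     > ```
    >     >     > use VisualDebug;
    >     >     >
    >     >     > startVdebug("vdata");  // start recording task/communication events
    >     >     > add(c, a, b);          // region of interest
    >     >     > stopVdebug();          // stop recording; then run: chplvis vdata
    >     >     > ```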
    >     >     >
    >     >     > I hope that rewriting the loops solves the problem, but let 
us know if it doesn't and we can continue investigating.
    >     >     >
    >     >     > -Ben Harshbarger
    >     >     >
    >     >     > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:
    >     >     >
    >     >     >     Dear all,
    >     >     >
    >     >     >     My apologies if this has already been asked before.  I'm 
new to the list
    >     >     >     and couldn't find it in the archives.
    >     >     >
    >     >     >     I experience bad performance when running the 
multi-locale compiled
    >     >     >     version on an InfiniBand equipped cluster
    >     >     >     (http://cs.vu.nl/das4/clusters.shtml, VU-site), even with 
only one node.
    >     >     >       Below you find a minimal example that exhibits the same 
performance
    >     >     >     problems as all my programs:
    >     >     >
    >     >     >     I compiled chapel-1.14.0 with the following steps:
    >     >     >
    >     >     >     export CHPL_TARGET_ARCH=native
    >     >     >     make -j
    >     >     >     export CHPL_COMM=gasnet
    >     >     >     export CHPL_COMM_SUBSTRATE=ibv
    >     >     >     make clean
    >     >     >     make -j
    >     >     >
    >     >     >     I compile the following Chapel code:
    >     >     >
    >     >     >     vectoradd.chpl:
    >     >     >     ---------------
    >     >     >     use Time;
    >     >     >     use Random;
    >     >     >     use BlockDist;
    >     >     >
    >     >     >     config const n = 1024**3;
    >     >     >
    >     >     >     // for single-locale
    >     >     >     // const ProblemDomain : domain(1) = {0..#n};
    >     >     >     // for multi-locale
    >     >     >     const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
    >     >     >          {0..#n};
    >     >     >
    >     >     >     type float = real(32);
    >     >     >
    >     >     >     proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >     >          b : [ProblemDomain] float) {
    >     >     >        forall i in ProblemDomain {
    >     >     >          c[i] = a[i] + b[i];
    >     >     >        }
    >     >     >     }
    >     >     >
    >     >     >     proc main() {
    >     >     >        var c : [ProblemDomain] float;
    >     >     >        var a : [ProblemDomain] float;
    >     >     >        var b : [ProblemDomain] float;
    >     >     >        var t : Timer;
    >     >     >
    >     >     >        fillRandom(a, 0);
    >     >     >        fillRandom(b, 42);
    >     >     >
    >     >     >        t.start();
    >     >     >        add(c, a, b);
    >     >     >        t.stop();
    >     >     >
    >     >     >        writeln("n: ", n);
    >     >     >        writeln("Time: ", t.elapsed(), "s");
    >     >     >        writeln("GFLOPS: ", n / t.elapsed() / 1e9, "s");
    >     >     >     }
    >     >     >     ----
    >     >     >
    >     >     >     I compile this for single-locale with (using no domain 
maps, see the
    >     >     >     comment above in the source):
    >     >     >
    >     >     >     chpl -o vectoradd --fast vectoradd.chpl
    >     >     >
    >     >     >     I run it with (dual quad core with 2 hardware threads):
    >     >     >
    >     >     >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >     >     >     ./vectoradd
    >     >     >
    >     >     >     And get as output:
    >     >     >
    >     >     >     n: 1073741824
    >     >     >     Time: 0.558806s
    >     >     >     GFLOPS: 1.92149s
    >     >     >
    >     >     >     However, the performance for multi-locale is much worse:
    >     >     >
    >     >     >     I compile this for multi-locale (with domain maps, see the 
comment in the
    >     >     >     source):
    >     >     >
    >     >     >     CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
    >     >     >        vectoradd.chpl
    >     >     >
    >     >     >     I run it on the same type of node with:
    >     >     >
    >     >     >     SSH_SERVERS=`uniq $TMPDIR/machines  | tr '\n' ' '`
    >     >     >
    >     >     >     export GASNET_PHYSMEM_MAX=1G
    >     >     >     export GASNET_IBV_SPAWNER=ssh
    >     >     >     export GASNET_SSH_SERVERS="$SSH_SERVERS"
    >     >     >
    >     >     >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >     >     >     export CHPL_LAUNCHER=gasnetrun_ibv
    >     >     >     export CHPL_COMM=gasnet
    >     >     >     export CHPL_COMM_SUBSTRATE=ibv
    >     >     >
    >     >     >     ./vectoradd -nl 1
    >     >     >
    >     >     >     And get as output:
    >     >     >
    >     >     >     n: 1073741824
    >     >     >     Time: 8.65082s
    >     >     >     GFLOPS: 0.12412s
    >     >     >
    >     >     >     I would understand a performance difference of say 10% 
because of
    >     >     >     multi-locale execution, but not factors.  Is this to be 
expected from
    >     >     >     the current state of Chapel?  This performance difference 
is typical
    >     >     >     for basically all my programs that also are more 
realistic and use
    >     >     >     larger inputs.  The performance is strange as there is no 
communication
    >     >     >     necessary (only one node) and the program is using the 
same number of
    >     >     >     threads.
    >     >     >
    >     >     >     Is there any way for me to investigate this using 
profiling for example?
    >     >     >
    >     >     >     By the way, the program does scale well to multiple nodes 
(which is not
    >     >     >     difficult given the baseline):
    >     >     >
    >     >     >       1 | 8.65s
    >     >     >       2 | 2.67s
    >     >     >       4 | 1.69s
    >     >     >       8 | 0.87s
    >     >     >     16 | 0.41s
    >     >     >
    >     >     >     Thanks in advance for your input.
    >     >     >
    >     >     >     Kind regards,
    >     >     >
    >     >     >     Pieter Hijma
    >     >     >
    >     >     >
    >     >     >
    >     >
    >     >
    >
    >
    
