Hi Pieter,

When CHPL_COMM is set to 'none', our compiler can avoid introducing some 
overhead that is only necessary for multi-locale programs. You can force that 
overhead back in, even with CHPL_COMM=none, by compiling with the "--no-local" 
flag. If you compile your single-locale program with that flag, does the 
performance get worse?
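
For example, reusing the compile line from your earlier message, just with the 
flag added:

    chpl -o vectoradd --fast --no-local vectoradd.chpl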

If that's the case, I'm not entirely sure what the next step would be. Do you 
have access to a newer version of GCC? The backend C compiler can matter when 
it comes to optimizing the multi-locale overhead.
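
If a newer GCC is available as an environment module on the cluster, rebuilding 
Chapel against it might be worth a try; a rough sketch (the module name below is 
just a placeholder for whatever your system provides):

    module load gcc/6.2.0        # hypothetical module name
    export CHPL_TARGET_COMPILER=gnu
    cd $CHPL_HOME && make clean && make -j
    # then recompile your Chapel programs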

You may also want to consider setting CHPL_TARGET_ARCH to something other than 
'native' if the machine you compile on has a different architecture than the 
compute nodes. There's more information about CHPL_TARGET_ARCH here:

http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
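
For example (the value below is only a placeholder; use whatever matches the 
compute nodes' CPUs), and then rebuild Chapel and recompile:

    export CHPL_TARGET_ARCH=sandybridge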

-Ben Harshbarger

On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:

    Dear Ben,
    
    Sorry for my late replies.  Unfortunately, for some reason, these 
    emails are marked as spam even though I marked the list and your address 
    as safe.  I will make sure I check my spam folders meticulously from now on.
    
    On 28/10/16 23:34, Ben Harshbarger wrote:
    > Hi Pieter,
    >
    > Sorry that you're still having issues. I think we'll need some more
    > information before going forward:
    >
    > 1) Could you send us the output of "$CHPL_HOME/util/printchplenv --anonymize"?
    > It's a script that displays the various CHPL_ environment variables.
    > "--anonymize" strips the output of information you may prefer to keep
    > private (machine info, paths).
    
    This would be the setup if running single-locale programs:
    
    $ printchplenv --anonymize
    CHPL_TARGET_PLATFORM: linux64
    CHPL_TARGET_COMPILER: gnu
    CHPL_TARGET_ARCH: native *
    CHPL_LOCALE_MODEL: flat
    CHPL_COMM: none
    CHPL_TASKS: qthreads
    CHPL_LAUNCHER: none
    CHPL_TIMERS: generic
    CHPL_UNWIND: none
    CHPL_MEM: jemalloc
    CHPL_MAKE: gmake
    CHPL_ATOMICS: intrinsics
    CHPL_GMP: gmp
    CHPL_HWLOC: hwloc
    CHPL_REGEXP: re2
    CHPL_WIDE_POINTERS: struct
    CHPL_AUX_FILESYS: none
    
    When I run multi-locale programs, I set the following environment variables:
    
    export CHPL_COMM=gasnet
    export CHPL_COMM_SUBSTRATE=ibv
    
    Then the Chapel environment would be:
    
    $ printchplenv --anonymize
    CHPL_TARGET_PLATFORM: linux64
    CHPL_TARGET_COMPILER: gnu
    CHPL_TARGET_ARCH: native *
    CHPL_LOCALE_MODEL: flat
    CHPL_COMM: gasnet *
       CHPL_COMM_SUBSTRATE: ibv *
       CHPL_GASNET_SEGMENT: large
    CHPL_TASKS: qthreads
    CHPL_LAUNCHER: gasnetrun_ibv
    CHPL_TIMERS: generic
    CHPL_UNWIND: none
    CHPL_MEM: jemalloc
    CHPL_MAKE: gmake
    CHPL_ATOMICS: intrinsics
       CHPL_NETWORK_ATOMICS: none
    CHPL_GMP: gmp
    CHPL_HWLOC: hwloc
    CHPL_REGEXP: re2
    CHPL_WIDE_POINTERS: struct
    CHPL_AUX_FILESYS: none
    
    
    > 2) What C compiler are you using?
    
    $ gcc --version
    gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
    Copyright (C) 2010 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    > 3) Are you sure that the programs are being launched correctly? This might
    > seem silly, but it's worth double-checking that the programs are actually
    > running on the same hardware (not necessarily the same node though).
    
    I am completely certain that the single-locale program, the multi-locale 
    program for one node, and the multi-locale program for multiple nodes are 
    running on the compute nodes.  I'm not completely sure what you mean by 
    "the same hardware".  All compute nodes have the same hardware, if that 
    is what you mean.
    
    > I'd also like to clarify what you mean by "multi-locale compiled". Is the
    > difference between the programs just the use of the Block domain map, or do
    > you compile with different environment variables set?
    
    I compile different programs and I use different environment variables:
    
    The single-locale version of vectoradd is located in the datapar directory, 
    whereas the multi-locale version is located in the datapar-dist 
    directory.  What follows is the diff for the .chpl file:
    
    $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
    8c8
    < const ProblemDomain : domain(1) = {0..#n};
    ---
     > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
    
    The diff for the Makefile:
    
    $ diff datapar/Makefile datapar-dist/Makefile
    2a3
     > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
    8c9
    <   $(CHPL) -o $@ $(FLAGS) $<
    ---
     >  $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
    11c12
    <   rm -f $(APP)
    ---
     >  rm -f $(APP) $(APP)_real
    
    Thanks for your help, and again my apologies for the delayed answers.
    
    Kind regards,
    
    Pieter Hijma
    
    >
    > -Ben Harshbarger
    >
    > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:
    >
    >     Hi Ben,
    >
    >     Thank you for your fast reply and suggestions!  I did some more tests
    >     and also included stencil operations.
    >
    >     First, the vector addition:
    >
    >     vectoradd.chpl
    >     --------------
    >     use Time;
    >     use Random;
    >     use BlockDist;
    >     //use VisualDebug;
    >
    >     config const n = 1024**3/2;
    >
    >     // for multi-locale
    >     const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
    >        = {0..#n};
    >     // for single-locale
    >     // const ProblemDomain : domain(1) = {0..#n};
    >
    >     type float = real(32);
    >
    >     proc addNoDomain(c : [] float, a : [] float, b : [] float) {
    >        forall (ci, ai, bi) in zip(c, a, b) {
    >          ci = ai + bi;
    >        }
    >     }
    >
    >     proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >               b : [ProblemDomain] float) {
    >        forall (ci, ai, bi) in zip(c, a, b) {
    >          ci = ai + bi;
    >        }
    >     }
    >
    >     proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >                  b : [ProblemDomain] float) {
    >        //startVdebug("vdata");
    >        forall i in ProblemDomain {
    >          c[i] = a[i] + b[i];
    >        }
    >        //stopVdebug();
    >     }
    >
    >     proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >                      b : [ProblemDomain] float) {
    >        c = a + b;
    >     }
    >
    >     proc output(t : Timer, n, testName) {
    >        t.stop();
    >        writeln(testName, " n: ", n);
    >        writeln("Time: ", t.elapsed(), "s");
    >        writeln("GFLOPS: ", n / t.elapsed() / 1e9, "");
    >        writeln();
    >        t.clear();
    >     }
    >
    >     proc main() {
    >        var c : [ProblemDomain] float;
    >        var a : [ProblemDomain] float;
    >        var b : [ProblemDomain] float;
    >        var t : Timer;
    >
    >        fillRandom(a, 0);
    >        fillRandom(b, 42);
    >
    >        t.start();
    >        addNoDomain(c, a, b);
    >        output(t, n, "addNoDomain");
    >
    >        t.start();
    >        addZip(c, a, b);
    >        output(t, n, "addZip");
    >
    >        t.start();
    >        addForall(c, a, b);
    >        output(t, n, "addForall");
    >
    >        t.start();
    >        addCollective(c, a, b);
    >        output(t, n, "addCollective");
    >     }
    >     -----
    >
    >     On a single locale I get as output:
    >
    >     addNoDomain n: 536870912
    >     Time: 0.27961s
    >     GFLOPS: 1.92007
    >
    >     addZip n: 536870912
    >     Time: 0.278657s
    >     GFLOPS: 1.92664
    >
    >     addForall n: 536870912
    >     Time: 0.278015s
    >     GFLOPS: 1.93109
    >
    >     addCollective n: 536870912
    >     Time: 0.278379s
    >     GFLOPS: 1.92856
    >
    >     On multi-locale (-nl 1) I get as output:
    >
    >     addNoDomain n: 536870912
    >     Time: 2.16806s
    >     GFLOPS: 0.247627
    >
    >     addZip n: 536870912
    >     Time: 2.17024s
    >     GFLOPS: 0.247378
    >
    >     addForall n: 536870912
    >     Time: 4.78443s
    >     GFLOPS: 0.112212
    >
    >     addCollective n: 536870912
    >     Time: 2.19838s
    >     GFLOPS: 0.244212
    >
    >     So, indeed, your suggestion improves it by more than a factor of two, but
    >     it is still close to a factor of 8 slower than single-locale.
    >
    >     I also used chplvis and verified that there are no gets and puts when
    >     running multi-locale with more than one node.  The profiling information
    >     is clear, but not very helpful (to me):
    >
    >     multi-locale (-nl 1):
    >
    >     | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
    >     |  4.8777 | wrapon_fn_chpl35      | vectoradd.chpl:26 |
    >
    >     single-locale:
    >
    >     | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
    >
    >
    >
    >     For stencil operations, I used the following program:
    >
    >     1d-convolution.chpl
    >     -------------------
    >     use Time;
    >     use Random;
    >     use StencilDist;
    >
    >     config const n = 1024**3/2;
    >
    >     const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
    >                                                   fluff = (1,))
    >        = {0..#n};
    >     const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
    >
    >     proc convolveIndices(output : [ProblemDomain] real(32),
    >                        input : [ProblemDomain] real(32)) {
    >        forall i in InnerDomain {
    >          output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
    >        }
    >     }
    >
    >     proc convolveZip(output : [ProblemDomain] real(32),
    >                    input : [ProblemDomain] real(32)) {
    >        forall (im1, i, ip1) in zip(InnerDomain.translate(-1),
    >                                 InnerDomain,
    >                                 InnerDomain.translate(1)) {
    >          output[i] = ((input[im1] + input[i] + input[ip1])/3:real(32));
    >        }
    >     }
    >
    >     proc print(t : Timer, n, s) {
    >        t.stop();
    >        writeln(s, ", n: ", n);
    >        writeln("Time: ", t.elapsed(), "s");
    >        writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
    >        writeln();
    >        t.clear();
    >     }
    >
    >     proc main() {
    >        var input : [ProblemDomain] real(32);
    >        var output : [ProblemDomain] real(32);
    >        var t : Timer;
    >
    >        fillRandom(input, 42);
    >
    >        t.start();
    >        convolveIndices(output, input);
    >        print(t, n, "convolveIndices");
    >
    >        t.start();
    >        convolveZip(output, input);
    >        print(t, n, "convolveZip");
    >     }
    >     ------
    >
    >     Interestingly, in contrast to your earlier suggestion, the direct
    >     indexing works a bit better in this program than the zipped version:
    >
    >     Multi-locale (-nl 1):
    >
    >     convolveIndices, n: 536870912
    >     Time: 4.27148s
    >     GFLOPS: 0.377062
    >
    >     convolveZip, n: 536870912
    >     Time: 4.87291s
    >     GFLOPS: 0.330524
    >
    >     Single-locale:
    >
    >     convolveIndices, n: 536870912
    >     Time: 0.548804s
    >     GFLOPS: 2.93477
    >
    >     convolveZip, n: 536870912
    >     Time: 0.538754s
    >     GFLOPS: 2.98951
    >
    >
    >     Again, the multi-locale is about a factor of 8 slower than single-locale.
    >     By the way, the Stencil distribution is a bit faster than the Block
    >     distribution.
    >
    >     Thanks in advance for your input,
    >
    >     Pieter
    >
    >
    >
    >     On 24/10/16 19:20, Ben Harshbarger wrote:
    >     > Hi Pieter,
    >     >
    >     > Thanks for providing the example, that's very helpful.
    >     >
    >     > Multi-locale performance in Chapel is not yet where we'd like it to be,
    >     > but we've done a lot of work over the past few releases to get cases
    >     > like yours performing well. It's surprising that using Block results in
    >     > that much of a difference, but I think you would see better performance
    >     > by iterating over the arrays directly:
    >     >
    >     > ```
    >     > // replace the loop in the 'add' function with this:
    >     > forall (ci, ai, bi) in zip(c, a, b) {
    >     >   ci = ai + bi;
    >     > }
    >     > ```
    >     >
    >     > Block-distributed arrays can leverage the fast-follower optimization to
    >     > perform better when all arrays being iterated over share the same
    >     > domain. You can also write that loop in a cleaner way by leveraging
    >     > array promotion:
    >     >
    >     > ```
    >     > // This is equivalent to the first loop
    >     > c = a + b;
    >     > ```
    >     >
    >     > However, when I tried the promoted variation on my machine I observed
    >     > worse performance than the explicit forall-loop. It seems to be related
    >     > to the way the arguments of the 'add' function are declared. If you
    >     > replace "[ProblemDomain] float" with "[] float", performance seems to
    >     > improve. That surprised a couple of us on the development team, and
    >     > I'll be looking at that some more today.
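    >     >
    >     > For example, a minimal sketch of that change (same loop body as before,
    >     > only the formal array types left unconstrained):
    >     >
    >     > ```
    >     > proc add(c : [] float, a : [] float, b : [] float) {
    >     >   forall (ci, ai, bi) in zip(c, a, b) {
    >     >     ci = ai + bi;
    >     >   }
    >     > }
    >     > ```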
    >     >
    >     > If you're still seeing significantly worse performance with Block
    >     > compared to the default rectangular domain, and the programs are
    >     > launched in the same way, that would be odd. You could try profiling
    >     > using chplvis. I agree though that there shouldn't be any communication
    >     > in this program. You can find more information on chplvis here in the
    >     > online 1.14 release documentation:
    >     >
    >     > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
    >     >
    >     > I hope that rewriting the loops solves the problem, but let us know if
    >     > it doesn't and we can continue investigating.
    >     >
    >     > -Ben Harshbarger
    >     >
    >     > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:
    >     >
    >     >     Dear all,
    >     >
    >     >     My apologies if this has already been asked before.  I'm new to the
    >     >     list and couldn't find it in the archives.
    >     >
    >     >     I experience bad performance when running the multi-locale compiled
    >     >     version on an InfiniBand-equipped cluster
    >     >     (http://cs.vu.nl/das4/clusters.shtml, VU site), even with only one
    >     >     node.  Below you find a minimal example that exhibits the same
    >     >     performance problems as all my programs:
    >     >
    >     >     I compiled chapel-1.14.0 with the following steps:
    >     >
    >     >     export CHPL_TARGET_ARCH=native
    >     >     make -j
    >     >     export CHPL_COMM=gasnet
    >     >     export CHPL_COMM_SUBSTRATE=ibv
    >     >     make clean
    >     >     make -j
    >     >
    >     >     I compile the following Chapel code:
    >     >
    >     >     vectoradd.chpl:
    >     >     ---------------
    >     >     use Time;
    >     >     use Random;
    >     >     use BlockDist;
    >     >
    >     >     config const n = 1024**3;
    >     >
    >     >     // for single-locale
    >     >     // const ProblemDomain : domain(1) = {0..#n};
    >     >     // for multi-locale
    >     >     const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
    >     >          {0..#n};
    >     >
    >     >     type float = real(32);
    >     >
    >     >     proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >     >          b : [ProblemDomain] float) {
    >     >        forall i in ProblemDomain {
    >     >          c[i] = a[i] + b[i];
    >     >        }
    >     >     }
    >     >
    >     >     proc main() {
    >     >        var c : [ProblemDomain] float;
    >     >        var a : [ProblemDomain] float;
    >     >        var b : [ProblemDomain] float;
    >     >        var t : Timer;
    >     >
    >     >        fillRandom(a, 0);
    >     >        fillRandom(b, 42);
    >     >
    >     >        t.start();
    >     >        add(c, a, b);
    >     >        t.stop();
    >     >
    >     >        writeln("n: ", n);
    >     >        writeln("Time: ", t.elapsed(), "s");
    >     >        writeln("GFLOPS: ", n / t.elapsed() / 1e9, "s");
    >     >     }
    >     >     ----
    >     >
    >     >     I compile this for single-locale (using no domain maps, see the
    >     >     comment above in the source):
    >     >
    >     >     chpl -o vectoradd --fast vectoradd.chpl
    >     >
    >     >     I run it with (dual quad core with 2 hardware threads):
    >     >
    >     >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >     >     ./vectoradd
    >     >
    >     >     And get as output:
    >     >
    >     >     n: 1073741824
    >     >     Time: 0.558806s
    >     >     GFLOPS: 1.92149s
    >     >
    >     >     However, the performance for multi-locale is much worse:
    >     >
    >     >     I compile this for multi-locale (with domain maps, see the comment
    >     >     in the source):
    >     >
    >     >     CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
    >     >        vectoradd.chpl
    >     >
    >     >     I run it on the same type of node with:
    >     >
    >     >     SSH_SERVERS=`uniq $TMPDIR/machines  | tr '\n' ' '`
    >     >
    >     >     export GASNET_PHYSMEM_MAX=1G
    >     >     export GASNET_IBV_SPAWNER=ssh
    >     >     export GASNET_SSH_SERVERS="$SSH_SERVERS"
    >     >
    >     >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >     >     export CHPL_LAUNCHER=gasnetrun_ibv
    >     >     export CHPL_COMM=gasnet
    >     >     export CHPL_COMM_SUBSTRATE=ibv
    >     >
    >     >     ./vectoradd -nl 1
    >     >
    >     >     And get as output:
    >     >
    >     >     n: 1073741824
    >     >     Time: 8.65082s
    >     >     GFLOPS: 0.12412s
    >     >
    >     >     I would understand a performance difference of, say, 10% because of
    >     >     multi-locale execution, but not factors.  Is this to be expected
    >     >     from the current state of Chapel?  This performance difference is
    >     >     representative of basically all my programs, which are more
    >     >     realistic and use larger inputs.  The performance is strange as
    >     >     there is no communication necessary (only one node) and the program
    >     >     is using the same number of threads.
    >     >
    >     >     Is there any way for me to investigate this using profiling for example?
    >     >
    >     >     By the way, the program does scale well to multiple nodes (which is
    >     >     not difficult given the baseline):
    >     >
    >     >       1 | 8.65s
    >     >       2 | 2.67s
    >     >       4 | 1.69s
    >     >       8 | 0.87s
    >     >     16 | 0.41s
    >     >
    >     >     Thanks in advance for your input.
    >     >
    >     >     Kind regards,
    >     >
    >     >     Pieter Hijma
    >     >
    >     >
    >     >
    >
    >
    
