Hi Pieter,

Sorry that you're still having issues. I think we'll need some more information 
before going forward:

1) Could you send us the output of "$CHPL_HOME/util/printchplenv --anonymize" ? 
It's a script that displays the various CHPL_ environment variables. 
"--anonymize" strips the output of information you may prefer to keep private 
(machine info, paths).

2) What C compiler are you using?

3) Are you sure that the programs are being launched correctly? This might seem 
silly, but it's worth double-checking that the programs are actually running on 
the same hardware (not necessarily the same node though).

I'd also like to clarify what you mean by "multi-locale compiled". Is the 
difference between the programs just the use of the Block domain map, or do you 
compile with different environment variables set?
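To make sure we're talking about the same thing, these are the two
declarations from your vectoradd.chpl I have in mind, with everything else
(compilation, launch) kept identical:

```
// multi-locale: Block-distributed domain
const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};

// single-locale: default rectangular domain
// const ProblemDomain : domain(1) = {0..#n};
```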

-Ben Harshbarger

On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:

    Hi Ben,
    
    Thank you for your fast reply and suggestions!  I did some more tests 
    and also included stencil operations.
    
    First, the vector addition:
    
    vectoradd.chpl
    --------------
    use Time;
    use Random;
    use BlockDist;
    //use VisualDebug;
    
    config const n = 1024**3/2;
    
    // for multi-locale
    const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n})
       = {0..#n};
    // for single-locale (use this instead of the Block declaration above)
    // const ProblemDomain : domain(1) = {0..#n};
    
    type float = real(32);
    
    proc addNoDomain(c : [] float, a : [] float, b : [] float) {
       forall (ci, ai, bi) in zip(c, a, b) {
         ci = ai + bi;
       }
    }
    
    proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float,
            b : [ProblemDomain] float) {
       forall (ci, ai, bi) in zip(c, a, b) {
         ci = ai + bi;
       }
    }
    
    proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float,
               b : [ProblemDomain] float) {
       //startVdebug("vdata");
       forall i in ProblemDomain {
         c[i] = a[i] + b[i];
       }
       //stopVdebug();
    }
    
    proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float,
                   b : [ProblemDomain] float) {
       c = a + b;
    }
    
    proc output(t : Timer, n, testName) {
       t.stop();
       writeln(testName, " n: ", n);
       writeln("Time: ", t.elapsed(), "s");
       writeln("GFLOPS: ", n / t.elapsed() / 1e9, "");
       writeln();
       t.clear();
    }
    
    proc main() {
       var c : [ProblemDomain] float;
       var a : [ProblemDomain] float;
       var b : [ProblemDomain] float;
       var t : Timer;
    
       fillRandom(a, 0);
       fillRandom(b, 42);
    
       t.start();
       addNoDomain(c, a, b);
       output(t, n, "addNoDomain");
    
       t.start();
       addZip(c, a, b);
       output(t, n, "addZip");
    
       t.start();
       addForall(c, a, b);
       output(t, n, "addForall");
    
       t.start();
       addCollective(c, a, b);
       output(t, n, "addCollective");
    }
    -----
    
    On a single locale I get as output:
    
    addNoDomain n: 536870912
    Time: 0.27961s
    GFLOPS: 1.92007
    
    addZip n: 536870912
    Time: 0.278657s
    GFLOPS: 1.92664
    
    addForall n: 536870912
    Time: 0.278015s
    GFLOPS: 1.93109
    
    addCollective n: 536870912
    Time: 0.278379s
    GFLOPS: 1.92856
    
    On multi-locale (-nl 1) I get as output:
    
    addNoDomain n: 536870912
    Time: 2.16806s
    GFLOPS: 0.247627
    
    addZip n: 536870912
    Time: 2.17024s
    GFLOPS: 0.247378
    
    addForall n: 536870912
    Time: 4.78443s
    GFLOPS: 0.112212
    
    addCollective n: 536870912
    Time: 2.19838s
    GFLOPS: 0.244212
    
    So, indeed, your suggestion improves it by more than a factor two, but 
    it is still close to a factor 8 slower than single-locale.
    
    I also used chplvis and verified that there are no gets and puts when 
    running multi-locale with more than one node.  The profiling information 
    is clear, but not very helpful (to me):
    
    multi-locale (-nl 1):
    
    | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
    |  4.8777 | wrapon_fn_chpl35      | vectoradd.chpl:26 |
    
    single-locale:
    
    | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
    
    
    
    For stencil operations, I used the following program:
    
    1d-convolution.chpl
    -------------------
    use Time;
    use Random;
    use StencilDist;
    
    config const n = 1024**3/2;
    
    const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
                                                fluff = (1,))
       = {0..#n};
    const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
    
    proc convolveIndices(output : [ProblemDomain] real(32),
                     input : [ProblemDomain] real(32)) {
       forall i in InnerDomain {
         output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
       }
    }
    
    proc convolveZip(output : [ProblemDomain] real(32),
                 input : [ProblemDomain] real(32)) {
       forall (im1, i, ip1) in zip(InnerDomain.translate(-1),
                              InnerDomain,
                              InnerDomain.translate(1)) {
         output[i] = ((input[im1] + input[i] + input[ip1])/3:real(32));
       }
    }
    
    proc print(t : Timer, n, s) {
       t.stop();
       writeln(s, ", n: ", n);
       writeln("Time: ", t.elapsed(), "s");
       writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
       writeln();
       t.clear();
    }
    
    proc main() {
       var input : [ProblemDomain] real(32);
       var output : [ProblemDomain] real(32);
       var t : Timer;
    
       fillRandom(input, 42);
    
       t.start();
       convolveIndices(output, input);
       print(t, n, "convolveIndices");
    
       t.start();
       convolveZip(output, input);
       print(t, n, "convolveZip");
    }
    ------
    
    Interestingly, in contrast to your earlier suggestion, the direct 
    indexing works a bit better in this program than the zipped version:
    
    Multi-locale (-nl 1):
    
    convolveIndices, n: 536870912
    Time: 4.27148s
    GFLOPS: 0.377062
    
    convolveZip, n: 536870912
    Time: 4.87291s
    GFLOPS: 0.330524
    
    Single-locale:
    
    convolveIndices, n: 536870912
    Time: 0.548804s
    GFLOPS: 2.93477
    
    convolveZip, n: 536870912
    Time: 0.538754s
    GFLOPS: 2.98951
    
    
    Again, the multi-locale is about a factor 8 slower than single-locale. 
    By the way, the Stencil distribution is a bit faster than the Block 
    distribution.
    
    Thanks in advance for your input,
    
    Pieter
    
    
    
    On 24/10/16 19:20, Ben Harshbarger wrote:
    > Hi Pieter,
    >
    > Thanks for providing the example, that's very helpful.
    >
    > Multi-locale performance in Chapel is not yet where we'd like it to be,
    > but we've done a lot of work over the past few releases to get cases like
    > yours performing well. It's surprising that using Block results in that
    > much of a difference, but I think you would see better performance by
    > iterating over the arrays directly:
    >
    > ```
    > // replace the loop in the 'add' function with this:
    > forall (ci, ai, bi) in zip(c, a, b) {
    >   ci = ai + bi;
    > }
    > ```
    >
    > Block-distributed arrays can leverage the fast-follower optimization to
    > perform better when all arrays being iterated over share the same domain.
    > You can also write that loop in a cleaner way by leveraging array
    > promotion:
    >
    > ```
    > // This is equivalent to the first loop
    > c = a + b;
    > ```
    >
    > However, when I tried the promoted variation on my machine I observed
    > worse performance than the explicit forall-loop. It seems to be related
    > to the way the arguments of the 'add' function are declared. If you
    > replace "[ProblemDomain] float" with "[] float", performance seems to
    > improve. That surprised a couple of us on the development team, and I'll
    > be looking at that some more today.
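    >
    > For example, declaring the formals of your 'add' function with generic
    > array types, a sketch of what I mean:
    >
    > ```
    > // '[] float' formals instead of '[ProblemDomain] float'
    > proc add(c : [] float, a : [] float, b : [] float) {
    >   forall (ci, ai, bi) in zip(c, a, b) {
    >     ci = ai + bi;
    >   }
    > }
    > ```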
    >
    > If you're still seeing significantly worse performance with Block
    > compared to the default rectangular domain, and the programs are launched
    > in the same way, that would be odd. You could try profiling using
    > chplvis. I agree though that there shouldn't be any communication in this
    > program. You can find more information on chplvis here in the online 1.14
    > release documentation:
    >
    > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
    >
    > I hope that rewriting the loops solves the problem, but let us know if it
    > doesn't and we can continue investigating.
    >
    > -Ben Harshbarger
    >
    > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:
    >
    >     Dear all,
    >
    >     My apologies if this has already been asked before.  I'm new to the
    >     list and couldn't find it in the archives.
    >
    >     I experience bad performance when running the multi-locale compiled
    >     version on an InfiniBand equipped cluster
    >     (http://cs.vu.nl/das4/clusters.shtml, VU-site), even with only one
    >     node.
    >       Below you find a minimal example that exhibits the same performance
    >     problems as all my programs:
    >
    >     I compiled chapel-1.14.0 with the following steps:
    >
    >     export CHPL_TARGET_ARCH=native
    >     make -j
    >     export CHPL_COMM=gasnet
    >     export CHPL_COMM_SUBSTRATE=ibv
    >     make clean
    >     make -j
    >
    >     I compile the following Chapel code:
    >
    >     vectoradd.chpl:
    >     ---------------
    >     use Time;
    >     use Random;
    >     use BlockDist;
    >
    >     config const n = 1024**3;
    >
    >     // for single-locale
    >     // const ProblemDomain : domain(1) = {0..#n};
    >     // for multi-locale
    >     const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
    >          {0..#n};
    >
    >     type float = real(32);
    >
    >     proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
    >          b : [ProblemDomain] float) {
    >        forall i in ProblemDomain {
    >          c[i] = a[i] + b[i];
    >        }
    >     }
    >
    >     proc main() {
    >        var c : [ProblemDomain] float;
    >        var a : [ProblemDomain] float;
    >        var b : [ProblemDomain] float;
    >        var t : Timer;
    >
    >        fillRandom(a, 0);
    >        fillRandom(b, 42);
    >
    >        t.start();
    >        add(c, a, b);
    >        t.stop();
    >
    >        writeln("n: ", n);
    >        writeln("Time: ", t.elapsed(), "s");
    >        writeln("GFLOPS: ", n / t.elapsed() / 1e9, "s");
    >     }
    >     ----
    >
    >     I compile this for single-locale (using no domain maps, see the
    >     comment above in the source):
    >
    >     chpl -o vectoradd --fast vectoradd.chpl
    >
    >     I run it with (dual quad core with 2 hardware threads):
    >
    >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >     ./vectoradd
    >
    >     And get as output:
    >
    >     n: 1073741824
    >     Time: 0.558806s
    >     GFLOPS: 1.92149s
    >
    >     However, the performance for multi-locale is much worse:
    >
    >     I compile this for multi-locale (using domain maps, see the comment
    >     in the source):
    >
    >     CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
    >        vectoradd.chpl
    >
    >     I run it on the same type of node with:
    >
    >     SSH_SERVERS=`uniq $TMPDIR/machines  | tr '\n' ' '`
    >
    >     export GASNET_PHYSMEM_MAX=1G
    >     export GASNET_IBV_SPAWNER=ssh
    >     export GASNET_SSH_SERVERS="$SSH_SERVERS"
    >
    >     export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    >     export CHPL_LAUNCHER=gasnetrun_ibv
    >     export CHPL_COMM=gasnet
    >     export CHPL_COMM_SUBSTRATE=ibv
    >
    >     ./vectoradd -nl 1
    >
    >     And get as output:
    >
    >     n: 1073741824
    >     Time: 8.65082s
    >     GFLOPS: 0.12412s
    >
    >     I would understand a performance difference of say 10% because of
    >     multi-locale execution, but not factors.  Is this to be expected from
    >     the current state of Chapel?  This performance difference is
    >     representative of basically all my programs, which are also more
    >     realistic and use larger inputs.  The performance is strange as there
    >     is no communication necessary (only one node) and the program is
    >     using the same number of threads.
    >
    >     Is there any way for me to investigate this using profiling, for
    >     example?
    >
    >     By the way, the program does scale well to multiple nodes (which is
    >     not difficult given the baseline):
    >
    >     nodes | time
    >         1 | 8.65s
    >         2 | 2.67s
    >         4 | 1.69s
    >         8 | 0.87s
    >        16 | 0.41s
    >
    >     Thanks in advance for your input.
    >
    >     Kind regards,
    >
    >     Pieter Hijma
    >

