Re: Performance multi-locale

Ben Harshbarger Mon, 24 Oct 2016 10:22:03 -0700

Hi Pieter,

Thanks for providing the example, that's very helpful.


Multi-locale performance in Chapel is not yet where we'd like it to be, but 
we've done a lot of work over the past few releases to get cases like yours 
performing well. It's surprising that using Block results in that much of a 
difference, but I think you would see better performance by iterating over the 
arrays directly:

```
// replace the loop in the 'add' function with this:
forall (ci, ai, bi) in zip(c, a, b) {
  ci = ai + bi;
}
```

Block-distributed arrays can leverage the fast-follower optimization to perform 
better when all arrays being iterated over share the same domain. You can also 
write that loop in a cleaner way by leveraging array promotion:

```
// This is equivalent to the first loop
c = a + b;
```
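
To convince yourself that the two forms agree, here is a minimal sketch on a 
small input. The names 'c1' and 'c2' and the tiny 'n' are made up for 
illustration, and it uses a plain (non-distributed) domain so it should run in 
any configuration; the same loops apply to a Block-distributed domain:

```chapel
config const n = 8;
const D = {0..#n};

// Fill the inputs with small values that are exact in real(32).
var a : [D] real(32) = [i in D] i:real(32);
var b : [D] real(32) = [i in D] (2*i):real(32);
var c1, c2 : [D] real(32);

// Zippered forall: iterates over the arrays directly.
forall (ci, ai, bi) in zip(c1, a, b) {
  ci = ai + bi;
}

// Promoted form of the same element-wise addition.
c2 = a + b;

// Reports whether every element of the two results matches.
writeln(&& reduce (c1 == c2));
```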

However, when I tried the promoted variation on my machine I observed worse 
performance than the explicit forall-loop. It seems to be related to the way 
the arguments of the 'add' function are declared. If you replace 
"[ProblemDomain] float" with "[] float", performance seems to improve. That 
surprised a couple of us on the development team, and I'll be looking at that 
some more today.
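
For reference, a sketch of what that change to 'add' could look like. The 
small driver code around it is only there to make the snippet self-contained, 
and the explicit 'ref' intent on 'c' is my addition (your original used the 
default intent), since newer compilers may want it for a formal that is 
modified:

```chapel
type float = real(32);

// '[] float' leaves the domain of each formal unspecified, so any array
// of float is accepted regardless of its domain or distribution.
// Assumption: 'ref' on 'c' is not part of the original program.
proc add(ref c : [] float, a : [] float, b : [] float) {
  c = a + b;
}

// Hypothetical driver with a small, non-distributed domain.
config const n = 8;
const D = {0..#n};
var a : [D] float = 1.0:float;
var b : [D] float = 2.0:float;
var c : [D] float;

add(c, a, b);
writeln(c);
```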

If you're still seeing significantly worse performance with Block compared to 
the default rectangular domain, and the programs are launched in the same way, 
that would be odd. You could try profiling using chplvis. I agree though that 
there shouldn't be any communication in this program. You can find more 
information on chplvis here in the online 1.14 release documentation:

http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
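
As a rough sketch of how the timed region could be instrumented, assuming the 
VisualDebug module and its startVdebug/stopVdebug routines as described on that 
page (the directory name "vectoradd-vis" and the small arrays are placeholders 
I made up):

```chapel
use VisualDebug;

config const n = 8;
const D = {0..#n};
var a : [D] real(32) = 1.0:real(32);
var b : [D] real(32) = 2.0:real(32);
var c : [D] real(32);

// Start recording task and communication events; the argument names
// the directory the data is written to (placeholder name).
startVdebug("vectoradd-vis");

c = a + b;

// Stop recording and finish writing the data files.
stopVdebug();
```

The recorded data can then be opened with the chplvis tool as described in the 
documentation.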

I hope that rewriting the loops solves the problem, but let us know if it 
doesn't and we can continue investigating.

-Ben Harshbarger

On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:

    Dear all,
    
    My apologies if this has already been asked before.  I'm new to the list 
    and couldn't find it in the archives.
    
    I experience bad performance when running the multi-locale compiled 
    version on an InfiniBand-equipped cluster 
    (http://cs.vu.nl/das4/clusters.shtml, VU-site), even with only one node. 
    Below you find a minimal example that exhibits the same performance 
    problems as all my programs:
    
    I compiled chapel-1.14.0 with the following steps:
    
    export CHPL_TARGET_ARCH=native
    make -j
    export CHPL_COMM=gasnet
    export CHPL_COMM_SUBSTRATE=ibv
    make clean
    make -j
    
    I compile the following Chapel code:
    
    vectoradd.chpl:
    ---------------
    use Time;
    use Random;
    use BlockDist;
    
    config const n = 1024**3;
    
    // for single-locale
    // const ProblemDomain : domain(1) = {0..#n};
    // for multi-locale
    const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
         {0..#n};
    
    type float = real(32);
    
    proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
         b : [ProblemDomain] float) {
       forall i in ProblemDomain {
         c[i] = a[i] + b[i];
       }
    }
    
    proc main() {
       var c : [ProblemDomain] float;
       var a : [ProblemDomain] float;
       var b : [ProblemDomain] float;
       var t : Timer;
    
       fillRandom(a, 0);
       fillRandom(b, 42);
    
       t.start();
       add(c, a, b);
       t.stop();
    
       writeln("n: ", n);
       writeln("Time: ", t.elapsed(), "s");
       writeln("GFLOPS: ", n / t.elapsed() / 1e9, "s");
    }
    ----
    
    I compile this for single-locale with (using no domain maps, see the 
    comment above in the source):
    
    chpl -o vectoradd --fast vectoradd.chpl
    
    I run it with (dual quad core with 2 hardware threads):
    
    export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    ./vectoradd
    
    And get as output:
    
    n: 1073741824
    Time: 0.558806s
    GFLOPS: 1.92149s
    
    However, the performance for multi-locale is much worse:
    
    I compile this for multi-locale with (using domain maps, see the comment 
    in the source):
    
    CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
       vectoradd.chpl
    
    I run it on the same type of node with:
    
    SSH_SERVERS=`uniq $TMPDIR/machines  | tr '\n' ' '`
    
    export GASNET_PHYSMEM_MAX=1G
    export GASNET_IBV_SPAWNER=ssh
    export GASNET_SSH_SERVERS="$SSH_SERVERS"
    
    export CHPL_RT_NUM_THREADS_PER_LOCALE=16
    export CHPL_LAUNCHER=gasnetrun_ibv
    export CHPL_COMM=gasnet
    export CHPL_COMM_SUBSTRATE=ibv
    
    ./vectoradd -nl 1
    
    And get as output:
    
    n: 1073741824
    Time: 8.65082s
    GFLOPS: 0.12412s
    
    I would understand a performance difference of say 10% because of 
    multi-locale execution, but not factors.  Is this to be expected from 
    the current state of Chapel?  This performance difference is exemplary 
    for basically all my programs that also are more realistic and use 
    larger inputs.  The performance is strange as there is no communication 
    necessary (only one node) and the program is using the same number of 
    threads.
    
    Is there any way for me to investigate this using profiling for example?
    
    By the way, the program does scale well to multiple nodes (which is not 
    difficult given the baseline):
    
    nodes | time
        1 | 8.65s
        2 | 2.67s
        4 | 1.69s
        8 | 0.87s
       16 | 0.41s
    
    Thanks in advance for your input.
    
    Kind regards,
    
    Pieter Hijma
    
    
------------------------------------------------------------------------------
    Check out the vibrant tech community on one of the world's most 
    engaging tech sites, SlashDot.org! http://sdm.link/slashdot
    _______________________________________________
    Chapel-users mailing list
    [email protected]
    https://lists.sourceforge.net/lists/listinfo/chapel-users
    
