Hi Richard —

OK, now back to the original question. First, I have a style suggestion that _might_ help performance, but I'm not certain it will address the main problem that you're running into. Specifically, where you're writing:

  forall i in b.dom with ( + reduce bnrm2 ) {
     bnrm2 += b.val[i] * b.val[i];
  }

I think you can get away with a much simpler expression of the computation by writing:

        var bnrm2 = + reduce [i in b.dom] (b.val[i] * b.val[i]);

or better:

        var bnrm2 = + reduce [i in b.dom] (b.val[i]**2);

or better still:

        var bnrm2 = + reduce b.val**2;

Reduce expressions iterate over the expression that follows them, applying the reduction operator and returning the result. The first case uses a forall expression to iterate over the domain and index into the vector.

The second replaces the redundant array indexing with an exponentiation, which will be strength-reduced to a multiplication at compile time, so the array is only indexed once per element.

The third replaces the explicit loop with an instance of promotion: the exponentiation operator is applied to each element of the array, and the reduction sums the results.
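
A minimal, self-contained sketch of the forms above, for anyone who wants to check that they agree. The record name RVec, the size n, and the fill value are hypothetical stand-ins, and the Block distribution is assumed to be declared as in the original record:

        use BlockDist;

        config const n = 8;            // hypothetical problem size

        record RVec {                  // hypothetical stand-in for RVectorRB
          var dom = {1..n} dmapped Block(boundingBox={1..n});
          var val: [dom] real;
        }

        var b: RVec;
        b.val = 2.0;                   // arbitrary fill value

        // Original form: forall loop with a reduce intent.
        var s0 = 0.0;
        forall i in b.dom with (+ reduce s0) {
          s0 += b.val[i] * b.val[i];
        }

        // Reduction over a forall expression.
        const s1 = + reduce [i in b.dom] (b.val[i]**2);

        // Reduction over a promoted expression.
        const s2 = + reduce b.val**2;

        writeln(s0, " ", s1, " ", s2);

The three printed values should be identical, so the choice among the forms comes down to readability and performance.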

There is _some_ chance that this third form will address your performance problem, but now that you've pointed it out, I'm worried that other idioms you write will run back into it, so let me explain what I believe is going on. While I think there's a reasonable explanation for the behavior, this is also a known issue that we believe we need to address (because it's a trap waiting for people to step into), but we haven't had the chance to do so yet.

OK, so let me introduce the issue using a simple example.  Imagine that
I had the following code:

        record R {
          var x: int;
        }

        var myR: R;

        on Locales[1] {
          for i in 1..n {
            writeln(myR.x);
            myR.x += 1;
          }
        }

The instance of R named 'myR' is stored on locale 0 because that's where the task was running when its declaration was encountered. But the big computational loop that's accessing it is running on locale 1. What this means is that each time myR.x is read or written, communication back to locale 0 needs to take place. (We might imagine that the compiler could analyze the code to determine that nobody else is modifying myR.x at the same time, and so could cache the value, compute all the updates, and then push the result back to the original variable, but that doesn't happen today.)

Hopefully this makes sense and seems reasonable: myR.x is on a remote locale, so reads/writes of it need to communicate with that locale.
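
The caching described in the parenthetical above can also be done by hand. A minimal sketch, assuming that nothing else reads or writes myR.x while the loop runs; the names localX and n are hypothetical, and Locales[numLocales-1] is used so that the program also runs on a single locale:

        record R {
          var x: int;
        }

        config const n = 5;            // hypothetical trip count

        var myR: R;                    // stored on locale 0

        on Locales[numLocales-1] {
          var localX = myR.x;          // one remote read
          for i in 1..n {
            writeln(localX);           // purely local from here on
            localX += 1;
          }
          myR.x = localX;              // one remote write
        }

        writeln(myR.x);

This trades n remote reads and n remote writes for one of each, at the cost of making the code responsible for the "nobody else is modifying it" guarantee.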

OK, now let's change R to be _slightly_ more like your example:

        use BlockDist;   // provides newBlockArr()

        record R {
          var A = newBlockArr({1..n}, real);
        }

        var myR: R;

        on Locales[1] {
          for i in n/2+1..n {   // the indices owned by Locale 1 (for even n)
            writeln(myR.A[i]);
            myR.A[i] += 1;
          }
        }

This example is a bit weirder because R contains a distributed array A, so if we're running on two locales, half of myR.A's elements will be on Locale 0 and half on Locale 1.

Intuitively, it seems like all the references to myR.A[i] within the loop ought to be local and not require communication. However, the myR record itself, as well as the original field A that defines the distributed array, lives on locale 0. So each time the compiler sees the expression myR.A, the first thing it does is say "let me go talk to locale 0 to find out about this A field... oh, it's a distributed array" before finding out that the specific element i that it's looking for happens to be local.

Arguably, the compiler should be doing more to help here: For example, maybe when a record containing a distributed array spans an on-clause, it should pull the array descriptor out of the record and pass a copy of it to the remote locale so that it can be accessed locally. The following GitHub issue captures our desire to do better here as well as some kicking around of various approaches (for those truly interested):

        https://github.com/chapel-lang/chapel/issues/10160

Even though our compiler isn't helping much with this pattern at present, a runtime optimization that's been implemented may. Once you get your program working with Chapel 1.21, try compiling your original program (whether or not my proposed rewrite of the reduction helps) with --cache-remote and see if it helps. (--cache-remote turns on an execution-time optimization in which remote puts and gets are cached and avoided when the memory consistency model permits it, essentially striving for an effect similar to what I suggested a more aggressive compiler could achieve in my first example.)
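
To see whether the flag makes a difference, one option is to count communication events with the standard CommDiagnostics module. A minimal sketch, reusing the record from the second example; the file name comm.chpl and the size n are hypothetical, and the reported counts will depend on the Chapel version and communication layer:

        // Build twice and compare the reported counts:
        //   chpl comm.chpl
        //   chpl --cache-remote comm.chpl
        // then run each with:  ./comm -nl 2

        use BlockDist, CommDiagnostics;

        config const n = 8;            // hypothetical problem size

        record R {
          var A = newBlockArr({1..n}, real);
        }

        var myR: R;                    // record lives on locale 0

        resetCommDiagnostics();
        startCommDiagnostics();

        on Locales[numLocales-1] {
          // Touch only the elements that this locale owns.
          for i in myR.A.localSubdomain() {
            myR.A[i] += 1;
          }
        }

        stopCommDiagnostics();

        // One record of counters per locale.
        writeln(getCommDiagnostics());

Every element touched in the loop is owned by the locale running it, so a communication count that grows with n on that locale would be consistent with the explanation above: the traffic comes from looking up the A field through the remote record, not from the element accesses.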

If this doesn't help, there are some workarounds that can be applied
within the code itself to reduce communication but... they can get
pretty ugly in the extreme cases, so let's see where this gets us.

(I'll mention that we're working towards getting --cache-remote to the point that it can be enabled by default, but didn't get there in time for Chapel 1.21...).

Hope this helps, and definitely let us know what you find,
-Brad

PS — And sorry again for how long it took us to get to this.


On Mon, 13 Apr 2020, Barrett, Richard F via Chapel-users wrote:

Folks,

I have some questions regarding the use and subsequent performance of dmap. I’m 
doing some simple vector computations (add, scale, reduce), with the vectors 
defined in a record:

record RVectorRB { // Real valued vector.

   var dom = {1..num_vertices} dmapped Block ( boundingBox={1..num_vertices} );

   var val : [dom] real;
}

Instantiated as this:    var v1 : RVectorRB; , with num_vertices config const.

For RVectorRB vectors b, p, and z, reductions look like this:

  forall i in b.dom with ( + reduce bnrm2 ) {
     bnrm2 += b.val[i] * b.val[i];
  }

Vector updates : p.val = z.val + beta*p.val;   // beta real scalar.

Performance on 2 locales (2 nodes of a Cray XC30, 2x16 Haswells) is 44x slower 
than 1 locale/node. Further, when I comment out the dmap, single node runs 3x 
faster than with it. Vector lengths 1k up to 100M elements.

This is with v1.20. When I tried to compile with v1.21, I get these errors:

%  chpl main.chpl
main.chpl:12: In function 'main':
main.chpl:49: error: Attempt to 'new' a function or undefined symbol
main.chpl:59: error: 'timings' undeclared (first use this function)
main.chpl:61: error: 't' undeclared (first use this function)
main.chpl:63: error: 'TimeUnits' undeclared (first use this function)
main.chpl:77: error: 'alg' undeclared (first use this function)

'new' is for instantiating a record, alg is my enumeration, and I'm "using" 
Chapel's Time module within a module that's used by my main module which is used by main. 
Checked the release notes but not seeing what's changed. I'm seeing this on the XC as 
well as my Mac (not the homebrew version).

I'd greatly appreciate any help or information regarding any of the above.

Richard
--
Richard Barrett
PO Box 5800, MS-0620
Sandia National Laboratories
Albuquerque, NM 87185
Phone: 505-845-7655
Pager: 505-951-8087



_______________________________________________
Chapel-users mailing list
[email protected]
https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge.net_lists_listinfo_chapel-2Dusers&d=DwIGaQ&c=C5b8zRQO1miGmBeVZ2LFWg&r=QUQW-BniEL_d2a7btR4rP5TPiNmpm1pG-Qa_xXzGVKc&m=U-WL-8YdV3j-zOKUf2QYYqRb5SnBUsgaG03nnhB0ER8&s=WyrpdSULX-IJKEjZ7FFiUXjNk9i6gqxL4E1YABd1rYA&e=