Hi Barry --
My response falls into three parts, where maybe only the third one really
matters to you -- so depending on your level of interest, you may want to
skip ahead to part 3...
Part 1: The opening regret
--------------------------
I have to admit that the phrase "matrix-vector product" makes me cringe in
slight fear in the Chapel context, because I believe we're lacking some
key features that would make it far better/more scalable than it is today.
If you're interested in diving into some increasingly ancient history, we
studied (sparse) matrix-vector multiplication at length in our previous
work on ZPL and did reasonably well there compared to MPI. When we
started on Chapel, we made some core design decisions with the goal of
closing the remaining scalability gaps we had with ZPL; yet we put off
some other features that I believe are crucial for mat-vect mults (this
was primarily because Chapel's early years focused more on irregular
computations than on matrix-based ones). We've actually been returning to
these features recently (motivated by some users working on similar
operations), but have not quite landed them yet.
Just to put a name to them, I think the main concepts we're missing are:
* partial reductions -- the ability to reduce a multi-dimensional array
expression along a subset of its dimensions rather than all of them.
* replicated singleton dimensions -- i.e., {*, 1..n} might represent a
replicated row that spans columns 1..n. (These were called flood and
grid dimensions in ZPL, where the former were kept coherent by the
language semantics and the latter were not.)
Without these features, matrix-vector products can still be expressed,
simply not as efficiently (and/or elegantly) as they could otherwise...
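To make the first of these concrete, here's roughly what the workaround
looks like today: lacking a partial-reduction expression, reducing each
row of a 2D array has to be written as an explicit loop over the rows
(this is just a sketch; 'n', 'M', and 'rowSums' are illustrative names):
------
// reduce each row of a 2D array 'M' down to a 1D array of row sums by
// looping over the rows explicitly:
const n = 8;
var M : [0..#n, 0..#n] int;
M = 1;
var rowSums : [0..#n] int;
forall i in 0..#n do
  rowSums[i] = + reduce M[i, ..];
writeln(rowSums);
------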
Part 2: Your declarations
-------------------------
In the end, I don't think there are any fundamental problems with your
declarations, but in thinking through them myself, these are the notes I
put together:
> - A is a square matrix (dimension x dimension), where each row is
> cyclically distributed to the locales
You implemented this as:
------
const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx=MatrixSpace.low,
                                                  blocksize = (1, dimension))
                            = MatrixSpace;
var A : [MatrixDomain] int;
------
If you only want to distribute the first dimension cyclically (i.e.,
you'll always use a block size of 1), my suggestion would be to use the
Cyclic distribution. The typical trick to get it distributed in just one
dimension is to pass in a 2D targetLocales array that is degenerate in one
dimension (you could do this with BlockCyclic as well, but you'd just be
paying unnecessary blocking overhead). So here's how I coded up A such
that the rows are distributed cyclically and the columns not at all (along
with some debugging code):
------
use CyclicDist;   // provides the Cyclic distribution

//
// create a numLocales x 1 view of the Locales array:
//
const grid = reshape(Locales, {0..#numLocales, 0..0});
writeln(grid, "\n");

//
// Distribute MatrixSpace in a Cyclic manner.  Because 'grid' is
// degenerate in the second dimension, only the rows will be cyclic.
//
const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
                                                targetLocales = grid);
var A : [MatrixDomain] int;

//
// Make sure that ownership is as expected:
//
forall a in A do
  a = here.id;
writeln(A, "\n");
------
As a stylistic note, if you're comfortable leaving type declarations off,
note that I've swung the 'dmapped' clause from the type position to the
value position, which I've come to find much simpler to read over time.
(In this way, you could even eliminate MatrixSpace if you wanted by
inlining it here, and avoid problems with accidentally using one rather
than the other; see the sketch just below.)
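For example, the inlined form might look something like this (just a
sketch, assuming MatrixSpace is {0..#dimension, 0..#dimension} and 'grid'
is declared as above):
------
// a sketch of the inlined form: the index set and the 'dmapped' clause
// both appear at the value position, so MatrixSpace isn't needed at all
const MatrixDomain = {0..#dimension, 0..#dimension}
                       dmapped Cyclic(startIdx=(0,0), targetLocales=grid);
var A : [MatrixDomain] int;
------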
> - B is a vector (dimension), replicated on each locale
Here, you did:
------
const VectorSpace = {0..#dimension};
const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
                             = VectorSpace;
------
which I think is reasonable. I must admit that I don't use the
ReplicatedDist often enough to be familiar with it without experimenting
and refreshing my memory. But a quick check makes this look like it's
going to work as you want.
[Returning to the theme of part 1, I'll mention that my hope is that we'll
soon be able to support this as a row-vector '{*, 0..#dimension}' that
shares the distribution of A. E.g., imagine expressing this long-form as:
const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;
or more succinctly, as:
const RowVect = MatrixDomain[*, ..];
(that is, as a slice of MatrixDomain)]
> - C is a vector (dimension), each value is cyclically distributed to match
> the row of A
Here, my suggestion would be to use the Cyclic distribution again and, if
it were me, I'd also be inclined to have it share the same distribution as
A rather than repeating all that distribution information. This can be
written most simply as:
----
const VectorCyclicDomain = MatrixDomain[.., 0];
var C : [VectorCyclicDomain] int;
----
which essentially says: "Perform a rank-change slice (intersection) of
MatrixDomain taking all of its indices in the first dimension and
collapsing its second dimension down to index 0."
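If you want to double-check that this gives the alignment you're after,
the same ownership-debugging trick from above works here too (a sketch):
------
// verify that each element of C lives on the same locale as the
// corresponding row of A:
forall c in C do
  c = here.id;
writeln(C, "\n");
------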
Part 3: The real problem
------------------------
> The problem is when I use 2 locales, the second one is mainly doing
> communication
The problem turns out to be the way you've written the loop, though it's a
subtle issue and a common trap for new users to fall into:
-----
forall (i, c) in zip({0..#dimension}, C(..)) {
  forall (j, a) in zip({0..#dimension}, A(i, ..)) with ( + reduce c ) {
    c += a * B(j);
  }
}
-----
The implementation of a forall loop (how many tasks to create and where
should they run?) is controlled by the thing that's being iterated over.
Whenever a forall loop iterates over a zip() expression, the first thing
in the list (the "leader") controls the implementation. So these loops
are iterating over the anonymous domain {0..#dimension} which is not
distributed (because nothing has said it should be). For that reason, all
tasks created by the forall loop will be local to the original locale
(Locale #0). That's why you're seeing other locales not getting much work
(and probably a lot of communication between them and locale #0 for all
the non-local accesses...).
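A quick way to see this for yourself (just a sketch): since the anonymous
domain is the leader of the zip, every iteration of your outer loop will
report that it ran on locale 0:
------
forall (i, c) in zip({0..#dimension}, C) do
  writeln("iteration ", i, " ran on locale ", here.id);  // always 0
------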
In order to distribute the tasks across locales using a forall loop,
you'll want to iterate over one of your distributed domains or arrays.
This will cause the distribution of that domain/array to create the tasks,
and it'll do so by creating a task per core on each locale to which the
domain/array is distributed.
(You could also use an on-clause within the body of your loop to steer
each iteration to the appropriate locale, but this will result in creating
an active message per iteration, which will kill your performance... better
to let the forall create the on-clauses for you only as necessary.)
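To make that concrete, the pattern to avoid looks something like this
(a sketch of the anti-pattern, using the declarations above):
------
// anti-pattern: a fine-grained on-clause per iteration creates an
// active message for every element of C, swamping any parallel benefit
forall i in VectorSpace do
  on C(i) do
    for j in VectorSpace do
      C(i) += A(i, j) * B(j);
------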
So you should see your program using more computational resources on
locales > 0 if you rewrite this as:
-----
forall (i, c) in zip(VectorCyclicDomain, C) {
  forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
    c += a * B(j);
  }
}
-----
A few notes on the above:
* 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
currently optimize away the identity slice), so you'll get better
performance if you just write it as 'C'.
* similarly, for the purposes of these loops, iterating over
'{0..#dimension}' is equivalent to iterating over '0..#dimension'
which is equivalent to iterating over 'VectorSpace' since none of them
are distributed. So, it can be better to iterate over the named,
pre-created thing rather than (logically) creating a new range (cheaper)
or domain (more expensive) per inner loop iteration. (this _may_ get
hoisted by the compiler as loop-invariant code, but why take the chance
when you've already named it so nicely...?)
But, before stopping there, note that the use of the nested forall is
likely to add overhead without much benefit. Specifically, the outer loop
will already create a task per processor core on each locale that
'VectorCyclicDomain' is distributed to (i.e., all of them), so your code
will be fully parallel without the additional parallel inner loop. For
this reason, I'd probably
write the loop as:
-----
forall (i, c) in zip(VectorCyclicDomain, C) {
  for (j, a) in zip(VectorSpace, A(i, ..)) {
    c += a * B(j);
  }
}
-----
This also avoids the need for the 'with' clause, since the inner for loop
is serial and therefore won't race on its accesses to 'c'.
> (used chplvis, nice tool by the way).
Great, glad you like it! I'll let its author (Phil Nelson, WWU) know.
I haven't pre-checked my loop rewrites to verify that they result in less
communication, so I'll be curious what you find.
Hope this helps, and let us know if it leads to follow-up questions,
-Brad