Brad,
Would it be possible/useful for zip to throw a warning (ideally at compile
time) if the distributions of the zipped vectors are not the same? In fact,
one might even extend that to cases like C = A + B.
One might imagine this being an opt-in warning selected by a flag, much like
the Intel Compiler's opt-report, and adding more "hints" for the user that
flag when certain optimizations are applied or thwarted, or when the
compiler thinks certain operations might generate excessive communication
or non-local accesses.
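
For instance, here's the sort of mismatch such a warning might flag (a
hypothetical sketch I made up for illustration, not code from the
discussion below):

------
use BlockDist, CyclicDist;

config const n = 8;
const D = {0..#n};

// Two vectors over the same index set, but with different distributions:
var A : [D dmapped Block(boundingBox=D)] real;
var B : [D dmapped Cyclic(startIdx=D.low)] real;

// Legal today, but every element-wise pairing crosses distributions,
// which is exactly the case a warning/hint could point out:
forall (a, b) in zip(A, B) do
  a += b;
------
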
What do you think?
-- Nikhil
---------------------------------
Nikhil Padmanabhan
[email protected]
On Thu, Mar 30, 2017 at 8:57 AM, Barry Moore <[email protected]> wrote:
> Brad,
>
> Thanks for the detailed response! I am going to dive in and I will try to
> put together some data for comparison.
>
> - Barry
>
> On Wed, Mar 29, 2017 at 7:51 PM, Brad Chamberlain <[email protected]> wrote:
>
>>
>> Hi Barry --
>>
>> My response falls into three parts, where maybe only the third one really
>> matters to you -- so depending on your level of interest, you may want to
>> skip ahead to part 3...
>>
>> Part 1: The opening regret
>> --------------------------
>>
>> I have to admit that the phrase "matrix-vector product" makes me cringe
>> in slight fear in the Chapel context, because I believe we're lacking some
>> key features that would make it far better/more scalable than it is today.
>>
>> If you're interested in diving into some increasingly ancient history, we
>> studied (sparse) matrix-vector multiplication at length in our previous
>> work on ZPL and did reasonably well there compared to MPI. When we started
>> on Chapel, we made some core design decisions with the goal of closing the
>> remaining scalability gaps we had with ZPL; yet we put off some other
>> features that I believe are crucial for mat-vect mults (this was primarily
>> because Chapel's early years focused more on irregular computations than on
>> matrix-based ones). We've actually been returning to these features
>> recently (motivated by some users working on similar operations), but have
>> not quite landed them yet.
>>
>> Just to put a name to them, I think the main concepts we're missing are:
>>
>> * partial reductions -- the ability to reduce a multi-dimensional array
>> expression along a subset of its dimensions rather than all of them.
>>
>> * replicated singleton dimensions -- i.e., {*, 1..n} might represent a
>> replicated row that spans columns 1..n. (These were called flood and
>> grid dimensions in ZPL, where the former were kept coherent by the
>> language semantics and the latter were not.)
>>
>> Without these features, matrix-vector products can still be expressed,
>> simply not as efficiently (and/or elegantly) as they could otherwise...
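>>
>> To make the first of those concrete: today, reducing a matrix along just
>> one of its dimensions has to be written as an explicit loop over the other
>> dimension (a small sketch of the concept, assuming A's rows are indexed by
>> 0..#dimension -- not proposed syntax):
>>
>> ------
>> // Row sums of A, i.e., A reduced along its second (column) dimension:
>> var RowSums : [0..#dimension] int;
>> forall i in 0..#dimension do
>>   RowSums[i] = + reduce A[i, ..];
>> ------
>>
>> A partial reduction would let this be written as a single whole-array
>> operation without the explicit loop.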
>>
>>
>> Part 2: Your declarations
>> -------------------------
>>
>> In the end, I don't think there are any fundamental problems with your
>> declarations, but in thinking through them myself, these are the notes I
>> put together:
>>
>>
>> - A is a square matrix (dimension x dimension), where each row is
>>> cyclically distributed to the locales
>>>
>>
>> You implemented this as:
>>
>> ------
>> const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx=MatrixSpace.low,
>>                                                   blocksize=(1, dimension))
>>                   = MatrixSpace;
>>
>> var A : [MatrixDomain] int;
>> ------
>>
>> If you only want to distribute the first dimension cyclically (i.e.,
>> you'll always use a block size of 1), my suggestion would be to use the
>> Cyclic distribution. The typical trick to get it distributed in just one
>> dimension is to pass in a 2D targetLocales array that is degenerate in one
>> dimension (you could do this with BlockCyclic as well, you'd just be paying
>> unnecessary blocking overhead). So here's how I coded up A such that the
>> rows are distributed cyclically and the columns not at all (along with some
>> debugging code):
>>
>> ------
>> // create a numLocales x 1 view of the Locales array:
>> //
>> const grid = reshape(Locales, {0..#numLocales, 0..0});
>> //
>> writeln(grid, "\n");
>>
>> // Distribute MatrixSpace in a Cyclic manner. Because 'grid' is
>> // degenerate in the second dimension, only the rows will be cyclic.
>> //
>> const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
>>                                                 targetLocales = grid);
>>
>> var A : [MatrixDomain] int;
>>
>> //
>> // Make sure that ownership is as expected:
>> //
>> forall a in A do
>>   a = here.id;
>> writeln(A, "\n");
>>
>> ------
>>
>> As a stylistic note, if you're comfortable leaving type declarations off,
>> note that I've swung the 'dmapped' clause from the type to the value
>> position, which I've come to find much simpler to read over time.  In this
>> way, you could even eliminate MatrixSpace entirely by inlining it here,
>> avoiding the risk of accidentally using one rather than the other.
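>>
>> For example (a small sketch, assuming MatrixSpace was simply
>> {0..#dimension, 0..#dimension}):
>>
>> ------
>> // Inline the index set directly; no separate MatrixSpace declaration:
>> const MatrixDomain = {0..#dimension, 0..#dimension}
>>                        dmapped Cyclic(startIdx = (0, 0),
>>                                       targetLocales = grid);
>> ------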
>>
>> - B is a vector (dimension), replicated on each locale
>>>
>>
>> Here, you did:
>>
>> ------
>> const VectorSpace = {0..#dimension};
>> const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
>>                              = VectorSpace;
>> ------
>>
>> which I think is reasonable. I must admit that I don't use the
>> ReplicatedDist often enough to be familiar with it without experimenting
>> and refreshing my memory. But a quick check makes this look like it's
>> going to work as you want.
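>>
>> If you want to double-check the replication yourself, one quick experiment
>> is to look at B from every locale (a small sketch that declares B over
>> VectorReplicatedDomain, as I assume you're doing):
>>
>> ------
>> var B : [VectorReplicatedDomain] int;
>>
>> // Print B as seen from each locale to observe the replication behavior:
>> coforall loc in Locales do on loc do
>>   writeln(here.id, ": ", B);
>> ------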
>>
>> [Returning to the theme of part 1, I'll mention that my hope is that
>> we'll soon be able to support this as a row-vector '{*, 0..#dimension}'
>> that shares the distribution of A. E.g., imagine expressing this long-form
>> as:
>>
>> const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;
>>
>> or more succinctly, as:
>>
>> const RowVect = MatrixDomain[*, ..];
>>
>> (that is, as a slice of MatrixDomain)]
>>
>>
>> - C is a vector (dimension), each value is cyclically distributed to match
>>> the row of A
>>>
>>
>> Here, my suggestion would be to use the Cyclic distribution again and, if
>> it were me, I'd also be inclined to have it share the same distribution as
>> A rather than repeating all that distribution information. This can be
>> written most simply as:
>>
>> ----
>> const VectorCyclicDomain = MatrixDomain[.., 0];
>>
>> var C : [VectorCyclicDomain] int;
>> ----
>>
>> which essentially says: "Perform a rank-change slice (intersection) of
>> MatrixDomain taking all of its indices in the first dimension and
>> collapsing its second dimension down to index 0."
>>
>>
>> Part 3: The real problem
>> ------------------------
>>
>> The problem is when I use 2 locales, the second one is mainly doing
>>> communication
>>>
>>
>> The problem turns out to be the way you've written the loop, though it's
>> a subtle issue and not an uncommon trap to fall into among new users:
>>
>> -----
>> forall (i, c) in zip({0..#dimension}, C(..)) {
>>   forall (j, a) in zip({0..#dimension}, A(i, ..)) with (+ reduce c) {
>>     c += a * B(j);
>>   }
>> }
>> -----
>>
>> The implementation of a forall loop (how many tasks to create and where
>> should they run?) is controlled by the thing that's being iterated over.
>> Whenever a forall loop iterates over a zip() expression, the first thing in
>> the list (the "leader") controls the implementation.  So these loops are
>> iterating over the anonymous domain {0..#dimension}, which is not
>> distributed (because nothing has said it should be).  For that reason, all
>> tasks created by the forall loop will be local to the original locale
>> (Locale #0). That's why you're seeing other locales not getting much work
>> (and probably a lot of communication between them and locale #0 for all the
>> non-local accesses...).
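>>
>> One way to see this for yourself is to record which locale executes each
>> iteration, in the spirit of the earlier here.id check (a small sketch using
>> the declarations above):
>>
>> ------
>> var owner : [VectorCyclicDomain] int;
>>
>> // Leader is a non-distributed domain, so every task runs on Locale 0:
>> forall (i, o) in zip({0..#dimension}, owner) do
>>   o = here.id;
>> writeln(owner);     // all zeros
>>
>> // Leader is the distributed array itself, so tasks follow its distribution:
>> forall o in owner do
>>   o = here.id;
>> writeln(owner);     // a mix of locale ids
>> ------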
>>
>> In order to distribute the tasks across locales using a forall loop,
>> you'll want to iterate over one of your distributed domains or arrays.  This
>> will cause the distribution of that domain/array to create the tasks, and
>> it'll do so using a task per core on each locale to which the domain/array
>> is distributed.
>>
>> (You could also use an on-clause within the body of your loop to steer
>> each iteration to the appropriate locale, but this would result in creating
>> an active message per iteration, which will kill your performance... better
>> to let the forall create the on-clauses for you only as necessary.)
>>
>> So you should see your program using more computational resources on
>> locales > 0 if you rewrite this as:
>>
>> forall (i, c) in zip(VectorCyclicDomain, C) {
>>   forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
>>     c += a * B(j);
>>   }
>> }
>>
>> A few notes on the above:
>>
>> * 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
>> strength optimize the identity slice), so you'll get better performance
>> if you just write it as 'C'.
>>
>> * similarly, for the purposes of these loops, iterating over
>> '{0..#dimension}' is equivalent to iterating over '0..#dimension'
>> which is equivalent to iterating over 'VectorSpace' since none of them
>> are distributed. So, it can be better to iterate over the named,
>> pre-created thing rather than (logically) creating a new range (cheaper)
>> or domain (more expensive) per inner loop iteration.  (This _may_ get
>> hoisted by the compiler as loop-invariant code, but why take the chance
>> when you've already named it so nicely...?)
>>
>> But, before stopping there, note that the use of the nested forall is
>> likely to add overhead without much benefit. Specifically, the outer loop
>> will already create a task per processor core that 'VectorCyclicDomain' is
>> distributed to (i.e., all of them), so your code will be fully parallel
>> without the additional parallel inner loop.  For this reason, I'd probably
>> write the loop as:
>>
>> forall (i, c) in zip(VectorCyclicDomain, C) {
>>   for (j, a) in zip(VectorSpace, A(i, ..)) {
>>     c += a * B(j);
>>   }
>> }
>>
>> This also avoids the need for the 'with' clause, since the for loop is
>> serial and so won't race on its accesses to 'c'.
>>
>> (used chplvis, nice tool by the way).
>>>
>>
>> Great, glad you like it! I'll let its author (Phil Nelson, WWU) know.
>>
>> I haven't pre-checked my loop rewrites to verify that the loop now
>> results in less communication, so I'll be curious what you find.
>>
>> Hope this helps, and let us know if it leads to follow-up questions,
>> -Brad
>>
>>
>
>
> --
> Barry E Moore II, PhD
> E-mail: [email protected]
>
> Assistant Research Professor
> Center for Simulation and Modeling
> University of Pittsburgh
> Pittsburgh, PA 15260
>