Hi Nikhil --

I agree that this could be useful as an opt-in warning, and it seems that
it would not be too hard to implement.  I'd worry about it being an always-on
warning because I think there are many cases where it's legitimate to zip 
distributed and non-distributed things.  Maybe with experience we could 
tune the warning to distinguish between "more likely to be legitimate" and 
"more likely to be a mistake" cases...?

-Brad


On Thu, 30 Mar 2017, Nikhil Padmanabhan wrote:

> Brad,
>
> Would it be possible/useful for zip to throw a warning (ideally at compile
> time) if the distributions of the vectors are not the same.....? In fact,
> one might even extend that to cases like C = A + B.....
>
> One might imagine this being an opt-in warning selected by a flag like the
> Intel Compiler's opt-report....  and add in more "hints" to the user which
> flag when certain optimizations are done, or being thwarted, or when the
> compiler thinks certain operations might generate excessive communication
> or non-local accesses....
>
> What do you think?
> -- Nikhil
>
> ---------------------------------
> Nikhil Padmanabhan
> [email protected]
>
> On Wed, Mar 29, 2017 at 7:51 PM, Brad Chamberlain <[email protected]> wrote:
>
>>
>> Hi Barry --
>>
>> My response falls into three parts, where maybe only the third one really
>> matters to you -- so depending on your level of interest, you may want to
>> skip ahead to part 3...
>>
>> Part 1: The opening regret
>> --------------------------
>>
>> I have to admit that the phrase "matrix-vector product" makes me cringe in
>> slight fear in the Chapel context, because I believe we're lacking some
>> key features that would make it far better/more scalable than it is today.
>>
>> If you're interested in diving into some increasingly ancient history, we
>> studied (sparse) matrix-vector multiplication at length in our previous
>> work on ZPL and did reasonably well there compared to MPI.  When we
>> started on Chapel, we made some core design decisions with the goal of
>> closing the remaining scalability gaps we had with ZPL; yet we put off
>> some other features that I believe are crucial for mat-vect mults (this
>> was primarily because Chapel's early years focused more on irregular
>> computations than on matrix-based ones).  We've actually been returning to
>> these features recently (motivated by some users working on similar
>> operations), but have not quite landed them yet.
>>
>> Just to put a name to them, I think the main concepts we're missing are:
>>
>> * partial reductions -- the ability to reduce a multi-dimensional array
>>    expression along a subset of its dimensions rather than all of them.
>>
>> * replicated singleton dimensions -- i.e., {*, 1..n} might represent a
>>    replicated row that spans columns 1..n.  (these were called flood and
>>    grid dimensions in ZPL where the former were kept coherent by the
>>    language semantics and the latter were not).
>>
>> Without these features, matrix-vector products can still be expressed,
>> simply not as efficiently (and/or elegantly) as they could otherwise...
>>
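>> For instance, without a partial reduction, a dense mat-vect today has to
>> spell out each row's reduction by hand -- something like this sketch
>> (with 'n', 'A', 'x', and 'y' as placeholder names, not code from this
>> thread):
>>
>> ------
>> config const n = 8;
>> var A: [1..n, 1..n] real;   // the matrix
>> var x, y: [1..n] real;      // input and result vectors
>>
>> // each row's dot product is a separate whole-slice reduction; a true
>> // partial reduction would collapse the second dimension directly
>> forall i in 1..n do
>>   y[i] = + reduce (A[i, ..] * x);
>> ------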
>>
>> Part 2: Your declarations
>> -------------------------
>>
>> In the end, I don't think there are any fundamental problems with your
>> declarations, but in thinking through them myself, these are the notes I
>> put together:
>>
>>
>>> - A is a square matrix (dimension x dimension), where each row is
>>> cyclically distributed to the locales
>>
>> You implemented this as:
>>
>> ------
>> const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx = MatrixSpace.low,
>>                                                    blocksize = (1, dimension))
>>                     = MatrixSpace;
>>
>> var A : [MatrixDomain] int;
>> ------
>>
>> If you only want to distribute the first dimension cyclically (i.e.,
>> you'll always use a block size of 1), my suggestion would be to use the
>> Cyclic distribution.  The typical trick to get it distributed in just one
>> dimension is to pass in a 2D targetLocales array that is degenerate in one
>> dimension (you could do this with BlockCyclic as well, you'd just be
>> paying unnecessary blocking overhead).  So here's how I coded up A such
>> that the rows are distributed cyclically and the columns not at all (along
>> with some debugging code):
>>
>> ------
>> // create a numLocales x 1 view of the Locales array:
>> //
>> const grid = reshape(Locales, {0..#numLocales, 0..0});
>> //
>> writeln(grid, "\n");
>>
>> // Distribute MatrixSpace in a Cyclic manner.  Because 'grid' is
>> // degenerate in the second dimension, only the rows will be cyclic.
>> //
>> const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
>>                                                  targetLocales = grid);
>>
>> var A : [MatrixDomain] int;
>>
>> //
>> // Make sure that ownership is as expected:
>> //
>> forall a in A do
>>    a = here.id;
>> writeln(A, "\n");
>>
>> ------
>>
>> As a stylistic note, if you're comfortable leaving type declarations off,
>> note that I've swung the 'dmapped' clause from the type to the value
>> position, which I've come to find much simpler to read over time.  (Written
>> this way, you could even eliminate MatrixSpace entirely by inlining it here,
>> which also avoids the risk of accidentally using one rather than the other.)
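>>
>> For instance (assuming MatrixSpace was declared as
>> {0..#dimension, 0..#dimension}), the fully inlined form might look like:
>>
>> ------
>> const MatrixDomain = {0..#dimension, 0..#dimension}
>>                        dmapped Cyclic(startIdx = (0, 0),
>>                                       targetLocales = grid);
>> ------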
>>
>>> - B is a vector (dimension), replicated on each locale
>>
>> Here, you did:
>>
>> ------
>> const VectorSpace = {0..#dimension};
>> const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
>>                               = VectorSpace;
>> ------
>>
>> which I think is reasonable.  I must admit that I don't use the
>> ReplicatedDist often enough to be familiar with it without experimenting
>> and refreshing my memory.  But a quick check makes this look like it's
>> going to work as you want.
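>>
>> For completeness, here's a rough sketch of declaring B over that domain
>> and filling each locale's copy (just one way to do it; not from your
>> original code):
>>
>> ------
>> var B : [VectorReplicatedDomain] int;
>>
>> // each locale gets its own replicand of B, so initialize it on every
>> // locale (indexing B refers to the local copy):
>> coforall loc in Locales do on loc do
>>   [i in VectorSpace] B(i) = i;
>> ------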
>>
>> [Returning to the theme of part 1, I'll mention that my hope is that we'll
>> soon be able to support this as a row-vector '{*, 0..#dimension}' that
>> shares the distribution of A.  E.g., imagine expressing this long-form as:
>>
>>      const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;
>>
>> or more succinctly, as:
>>
>>      const RowVect = MatrixDomain[*, ..];
>>
>> (that is, as a slice of MatrixDomain)]
>>
>>
>>> - C is a vector (dimension), each value is cyclically distributed to
>>> match the row of A
>>
>> Here, my suggestion would be to use the Cyclic distribution again and, if
>> it were me, I'd also be inclined to have it share the same distribution as
>> A rather than repeating all that distribution information.  This can be
>> written most simply as:
>>
>> ----
>> const VectorCyclicDomain = MatrixDomain[.., 0];
>>
>> var C : [VectorCyclicDomain] int;
>> ----
>>
>> which essentially says: "Perform a rank-change slice (intersection) of
>> MatrixDomain taking all of its indices in the first dimension and
>> collapsing its second dimension down to index 0."
>>
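>> If you want to double-check that C's elements land on the same locales as
>> the corresponding rows of A, the same debugging trick from above works:
>>
>> ------
>> forall c in C do
>>   c = here.id;
>> writeln(C, "\n");
>> ------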
>>
>> Part 3: The real problem
>> ------------------------
>>
>>> The problem is when I use 2 locales, the second one is mainly doing
>>> communication
>>
>> The problem turns out to be the way you've written the loop, though it's a
>> subtle issue and not an uncommon trap to fall into among new users:
>>
>> -----
>> forall (i, c) in zip({0..#dimension}, C(..)) {
>>      forall (j, a) in zip({0..#dimension}, A(i, ..)) with ( + reduce c ) {
>>          c += a * B(j);
>>      }
>> }
>> -----
>>
>> The implementation of a forall loop (how many tasks to create and where
>> should they run?) is controlled by the thing that's being iterated over.
>> Whenever a forall loop iterates over a zip() expression, the first thing
>> in the list (the "leader") controls the implementation.  So these loops
>> are iterating over the anonymous domain {0..#dimension} which is not
>> distributed (because nothing has said it should be).  For that reason, all
>> tasks created by the forall loop will be local to the original locale
>> (Locale #0).  That's why you're seeing other locales not getting much work
>> (and probably a lot of communication between them and locale #0 for all
>> the non-local accesses...).
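>>
>> A tiny experiment makes the difference visible (a sketch reusing the
>> names above plus a scratch array):
>>
>> ------
>> var X : [VectorCyclicDomain] int;
>>
>> // leader is distributed => tasks run on the locales that own X:
>> forall (x, i) in zip(X, VectorSpace) do x = here.id;
>> writeln(X, "\n");
>>
>> // leader is a local domain => every task runs on the invoking locale:
>> forall (i, x) in zip(VectorSpace, X) do x = here.id;
>> writeln(X, "\n");
>> ------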
>>
>> In order to distribute the tasks across locales using a forall loop,
>> you'll want to iterate over one of your distributed domains or arrays.
>> This will cause the distribution of that domain/array to create the tasks,
>> and it'll do so using a task-per-core per locale to which the domain/array
>> is distributed.
>>
>> (You could also use an on-clause within the body of your loop to steer
>> each iteration to the appropriate locale, but this will result in creating
>> an active message per iteration, which will kill your performance... better
>> to let the forall create the on-clauses for you only as necessary).
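>>
>> For concreteness, that per-iteration on-clause version would look
>> something like the following -- shown only as the pattern to avoid:
>>
>> ------
>> forall i in VectorSpace do        // tasks all start on locale 0...
>>   on A(i, 0) do                   // ...then each iteration ships an
>>     for j in VectorSpace do       //    active message to row i's locale
>>       C(i) += A(i, j) * B(j);
>> ------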
>>
>> So you should see your program using more computational resources on
>> locales > 0 if you rewrite this as:
>>
>>         forall (i, c) in zip(VectorCyclicDomain, C) {
>>           forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
>>             c += a * B(j);
>>           }
>>         }
>>
>> A few notes on the above:
>>
>> * 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
>>    strength optimize the identity slice), so you'll get better performance
>>    if you just write it as 'C'.
>>
>> * similarly, for the purposes of these loops, iterating over
>>    '{0..#dimension}' is equivalent to iterating over '0..#dimension'
>>    which is equivalent to iterating over 'VectorSpace' since none of them
>>    are distributed.  So, it can be better to iterate over the named,
>>    pre-created thing rather than (logically) creating a new range (cheaper)
>>    or domain (more expensive) per inner loop iteration.  (this _may_ get
>>    hoisted by the compiler as loop-invariant code, but why take the chance
>>    when you've already named it so nicely...?)
>>
>> But, before stopping there, note that the use of the nested forall is
>> likely to add overhead without much benefit.  Specifically, the outer loop
>> will already create a task per processor core that 'VectorCyclicDomain' is
>> distributed to (i.e., all of them), so your code will be fully parallel
>> without the additional parallel inner loop.  For this reason, I'd probably
>> write the loop as:
>>
>>         forall (i, c) in zip(VectorCyclicDomain, C) {
>>           for (j, a) in zip(VectorSpace, A(i, ..)) {
>>             c += a * B(j);
>>           }
>>         }
>>
>> This also avoids the need for the 'with' clause since the for loop is
>> serial, so won't race on its accesses to 'c'.
>>
>>> (used chplvis, nice tool by the way).
>>
>> Great, glad you like it!  I'll let its author (Phil Nelson, WWU) know.
>>
>> I haven't pre-checked my loop rewrites to verify that the loop now results
>> in less communication, so I'll be curious what you find.
>>
>> Hope this helps, and let us know if it leads to follow-up questions,
>> -Brad
>>
>>
>>
>
