Brad,

I agree this should be opt-in....

The more I think about it, an opt-in flag that produces a set of
warnings/comments about the code sounds quite useful. Most people are
probably not used to programming in a PGAS model, and Chapel has its own
idioms as well, so something like this would be helpful when a user
finds the code not performing as they expect.

One could also use this to point out places where the code/communication
is not as optimized as it might be; e.g., when writing a new Distribution,
if there are known slow steps that are works in progress, the developer
could warn the user.

There is, of course, the question of what the set of warnings/informational
messages should be, whether there should be levels of detail, etc.

Cheers,
-- Nikhil



---------------------------------
Nikhil Padmanabhan
[email protected]

On Fri, Mar 31, 2017 at 11:58 AM, Brad Chamberlain <[email protected]> wrote:

>
> Hi Nikhil --
>
> I agree that this could be useful as an opt-in warning and seems that it
> would not be too hard to implement.  I'd worry about it being an always-on
> warning because I think there are many cases where it's legitimate to zip
> distributed and non-distributed things.  Maybe with experience we could
> tune the warning to distinguish between "more likely to be legitimate" and
> "more likely to be a mistake" cases...?
>
> -Brad
>
>
>
> On Thu, 30 Mar 2017, Nikhil Padmanabhan wrote:
>
>> Brad,
>>
>> Would it be possible/useful for zip to emit a warning (ideally at compile
>> time) if the distributions of the zipped arrays are not the same...? In
>> fact, one might even extend that to cases like C = A + B...
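>>
>> For instance, the kind of mismatch I have in mind (a minimal sketch;
>> the array names and 'n' are just placeholders):
>>
>> ------
>> use BlockDist, CyclicDist;
>>
>> config const n = 8;
>> const D = {1..n};
>> const BlockD = D dmapped Block(boundingBox=D);    // block-distributed
>> const CycD   = D dmapped Cyclic(startIdx=D.low);  // cyclically distributed
>>
>> var A: [BlockD] real,
>>     B: [CycD] real;
>>
>> // corresponding elements of A and B may live on different locales,
>> // so this zippering is a candidate for the proposed warning:
>> forall (a, b) in zip(A, B) do
>>   a += b;
>> ------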
>>
>> One might imagine this being an opt-in warning selected by a flag like the
>> Intel Compiler's opt-report... and adding more "hints" to the user that
>> flag when certain optimizations are done or thwarted, or when the
>> compiler thinks certain operations might generate excessive communication
>> or non-local accesses...
>>
>> What do you think?
>> -- Nikhil
>>
>> ---------------------------------
>> Nikhil Padmanabhan
>> [email protected]
>>
>> On Wed, Mar 29, 2017 at 7:51 PM, Brad Chamberlain <[email protected]> wrote:
>>
>>
>>> Hi Barry --
>>>
>>> My response falls into three parts, where maybe only the third one really
>>> matters to you -- so depending on your level of interest, you may want to
>>> skip ahead to part 3...
>>>
>>> Part 1: The opening regret
>>> --------------------------
>>>
>>> I have to admit that the phrase "matrix-vector product" makes me cringe
>>> in slight fear in the Chapel context, because I believe we're lacking
>>> some key features that would make it far better/more scalable than it
>>> is today.
>>>
>>> If you're interested in diving into some increasingly ancient history, we
>>> studied (sparse) matrix-vector multiplication at length in our previous
>>> work on ZPL and did reasonably well there compared to MPI.  When we
>>> started on Chapel, we made some core design decisions with the goal of
>>> closing the remaining scalability gaps we had with ZPL; yet we put off
>>> some other features that I believe are crucial for mat-vect mults (this
>>> was primarily because Chapel's early years focused more on irregular
>>> computations than on matrix-based ones).  We've actually been returning
>>> to these features recently (motivated by some users working on similar
>>> operations), but have not quite landed them yet.
>>>
>>> Just to put a name to them, I think the main concepts we're missing are:
>>>
>>> * partial reductions -- the ability to reduce a multi-dimensional array
>>>    expression along a subset of its dimensions rather than all of them.
>>>
>>> * replicated singleton dimensions -- i.e., {*, 1..n} might represent a
>>>    replicated row that spans columns 1..n.  (these were called flood and
>>>    grid dimensions in ZPL where the former were kept coherent by the
>>>    language semantics and the latter were not).
>>>
>>> Without these features, matrix-vector products can still be expressed,
>>> simply not as efficiently (and/or elegantly) as they could otherwise...
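>>>
>>> In the meantime, the partial-reduction pattern can be expressed as a
>>> per-row full reduction.  Here's a minimal sketch, where 'A' and 'C'
>>> stand in for a 2D array and a conforming 1D result:
>>>
>>> -----
>>> const m = 4, n = 4;
>>> var A: [1..m, 1..n] real,
>>>     C: [1..m] real;
>>>
>>> // reduce each row of A along its columns, i.e., a partial reduction
>>> // over the second dimension written one row at a time:
>>> forall i in 1..m do
>>>   C[i] = + reduce A[i, ..];
>>> -----
>>>
>>> though this doesn't give the compiler the whole-operation view that a
>>> true partial reduction would.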
>>>
>>>
>>> Part 2: Your declarations
>>> -------------------------
>>>
>>> In the end, I don't think there are any fundamental problems with your
>>> declarations, but in thinking through them myself, these are the notes I
>>> put together:
>>>
>>>
>>> - A is a square matrix (dimension x dimension), where each row is
>>>> cyclically distributed to the locales
>>>>
>>>
>>> You implemented this as:
>>>
>>> ------
>>> use BlockCyclicDist;   // provides the BlockCyclic distribution
>>>
>>> const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx = MatrixSpace.low,
>>>                                                    blocksize = (1, dimension))
>>>                    = MatrixSpace;
>>>
>>> var A : [MatrixDomain] int;
>>> ------
>>>
>>> If you only want to distribute the first dimension cyclically (i.e.,
>>> you'll always use a block size of 1), my suggestion would be to use the
>>> Cyclic distribution.  The typical trick to get it distributed in just one
>>> dimension is to pass in a 2D targetLocales array that is degenerate in
>>> one dimension (you could do this with BlockCyclic as well, you'd just be
>>> paying unnecessary blocking overhead).  So here's how I coded up A such
>>> that the rows are distributed cyclically and the columns not at all
>>> (along with some debugging code):
>>>
>>> ------
>>> use CyclicDist;   // provides the Cyclic distribution
>>>
>>> // create a numLocales x 1 view of the Locales array:
>>> //
>>> const grid = reshape(Locales, {0..#numLocales, 0..0});
>>> //
>>> writeln(grid, "\n");
>>>
>>> // Distribute MatrixSpace in a Cyclic manner.  Because 'grid' is
>>> // degenerate in the second dimension, only the rows will be cyclic.
>>> //
>>> const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
>>>                                                  targetLocales = grid);
>>>
>>> var A : [MatrixDomain] int;
>>>
>>> //
>>> // Make sure that ownership is as expected:
>>> //
>>> forall a in A do
>>>    a = here.id;
>>> writeln(A, "\n");
>>>
>>> ------
>>>
>>> As a stylistic note, if you're comfortable leaving type declarations off,
>>> note that I've swung the 'dmapped' clause from the type to the value
>>> position, which I've come to find much simpler to read over time (in this
>>> way, you could even eliminate MatrixSpace if you wanted by inlining it
>>> here, avoiding problems with accidentally using one rather than the
>>> other).
>>>
>>> - B is a vector (dimension), replicated on each locale
>>>>
>>>
>>> Here, you did:
>>>
>>> ------
>>> use ReplicatedDist;   // provides the replicated distribution
>>>
>>> const VectorSpace = {0..#dimension};
>>> const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
>>>                               = VectorSpace;
>>> ------
>>>
>>> which I think is reasonable.  I must admit that I don't use the
>>> ReplicatedDist often enough to be familiar with it without experimenting
>>> and refreshing my memory.  But a quick check makes this look like it's
>>> going to work as you want.
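>>>
>>> For what it's worth, here's the kind of quick check I mean (a sketch
>>> that assumes, as ReplicatedDist intends, that indexed accesses refer
>>> to the locale-local replicand):
>>>
>>> -----
>>> var B : [VectorReplicatedDomain] int;
>>>
>>> // fill each locale's own copy of B:
>>> coforall loc in Locales do on loc do
>>>   for i in VectorSpace do
>>>     B[i] = here.id;   // writes this locale's replicand
>>> -----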
>>>
>>> [Returning to the theme of part 1, I'll mention that my hope is that
>>> we'll soon be able to support this as a row-vector '{*, 0..#dimension}'
>>> that shares the distribution of A.  E.g., imagine expressing this
>>> long-form as:
>>>
>>>      const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;
>>>
>>> or more succinctly, as:
>>>
>>>      const RowVect = MatrixDomain[*, ..];
>>>
>>> (that is, as a slice of MatrixDomain)]
>>>
>>>
>>> - C is a vector (dimension), each value is cyclically distributed to
>>>> match the row of A
>>>>
>>>
>>> Here, my suggestion would be to use the Cyclic distribution again and, if
>>> it were me, I'd also be inclined to have it share the same distribution
>>> as A rather than repeating all that distribution information.  This can
>>> be written most simply as:
>>>
>>> ----
>>> const VectorCyclicDomain = MatrixDomain[.., 0];
>>>
>>> var C : [VectorCyclicDomain] int;
>>> ----
>>>
>>> which essentially says: "Perform a rank-change slice (intersection) of
>>> MatrixDomain taking all of its indices in the first dimension and
>>> collapsing its second dimension down to index 0."
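>>>
>>> As with A, a quick ownership check can confirm that C is distributed
>>> as expected (mirroring the earlier debugging code):
>>>
>>> ----
>>> forall c in C do
>>>    c = here.id;
>>> writeln(C, "\n");
>>> ----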
>>>
>>>
>>> Part 3: The real problem
>>> ------------------------
>>>
>>> The problem is when I use 2 locales, the second one is mainly doing
>>>> communication
>>>>
>>>
>>> The problem turns out to be the way you've written the loop, though it's
>>> a subtle issue and not an uncommon trap to fall into among new users:
>>>
>>> -----
>>> forall (i, c) in zip({0..#dimension}, C(..)) {
>>>     forall (j, a) in zip({0..#dimension}, A(i, ..)) with (+ reduce c) {
>>>         c += a * B(j);
>>>     }
>>> }
>>> -----
>>>
>>> The implementation of a forall loop (how many tasks to create and where
>>> should they run?) is controlled by the thing that's being iterated over.
>>> Whenever a forall loop iterates over a zip() expression, the first thing
>>> in the list (the "leader") controls the implementation.  So these loops
>>> are iterating over the anonymous domain {0..#dimension}, which is not
>>> distributed (because nothing has said it should be).  For that reason,
>>> all tasks created by the forall loop will be local to the original locale
>>> (Locale #0).  That's why you're seeing other locales not getting much
>>> work (and probably a lot of communication between them and locale #0 for
>>> all the non-local accesses...).
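>>>
>>> You can see the leader's effect directly by printing where the
>>> iterations run (a quick sketch using the 'VectorCyclicDomain'
>>> suggested above):
>>>
>>> -----
>>> forall i in {0..#dimension} do
>>>   write(here.id, " ");    // prints all 0s: the leader is local
>>> writeln();
>>>
>>> forall i in VectorCyclicDomain do
>>>   write(here.id, " ");    // prints a mix of locale ids
>>> writeln();
>>> -----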
>>>
>>> In order to distribute the tasks across locales using a forall loop,
>>> you'll want to iterate over one of your distributed domains or arrays.
>>> This will cause the distribution of that domain/array to create the
>>> tasks, and it'll do so using a task-per-core per locale to which the
>>> domain/array is distributed.
>>>
>>> (You could also use an on-clause within the body of your loop to steer
>>> each iteration to the appropriate locale, but this will result in
>>> creating an active message per iteration, which will kill your
>>> performance... better to let the forall create the on-clauses for you
>>> only as necessary).
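>>>
>>> For concreteness, that anti-pattern would look something like this
>>> (a sketch shown only to illustrate what to avoid):
>>>
>>> -----
>>> forall i in {0..#dimension} {
>>>   on C(i) {            // an active message per iteration: expensive!
>>>     for j in VectorSpace do
>>>       C(i) += A(i, j) * B(j);
>>>   }
>>> }
>>> -----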
>>>
>>> So you should see your program using more computational resources on
>>> locales > 0 if you rewrite this as:
>>>
>>>         forall (i, c) in zip(VectorCyclicDomain, C) {
>>>           forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
>>>             c += a * B(j);
>>>           }
>>>         }
>>>
>>> A few notes on the above:
>>>
>>> * 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
>>>    strength optimize the identity slice), so you'll get better
>>>    performance if you just write it as 'C'.
>>>
>>> * similarly, for the purposes of these loops, iterating over
>>>    '{0..#dimension}' is equivalent to iterating over '0..#dimension',
>>>    which is equivalent to iterating over 'VectorSpace' since none of them
>>>    are distributed.  So, it can be better to iterate over the named,
>>>    pre-created thing rather than (logically) creating a new range
>>>    (cheaper) or domain (more expensive) per inner loop iteration.  (This
>>>    _may_ get hoisted by the compiler as loop-invariant code, but why take
>>>    the chance when you've already named it so nicely...?)
>>>
>>> But, before stopping there, note that the use of the nested forall is
>>> likely to add overhead without much benefit.  Specifically, the outer
>>> loop will already create a task per processor core that
>>> 'VectorCyclicDomain' is distributed to (i.e., all of them), so your code
>>> will be fully parallel without the additional parallel inner loop.  For
>>> this reason, I'd probably write the loop as:
>>>
>>>         forall (i, c) in zip(VectorCyclicDomain, C) {
>>>           for (j, a) in zip(VectorSpace, A(i, ..)) {
>>>             c += a * B(j);
>>>           }
>>>         }
>>>
>>> This also avoids the need for the 'with' clause since the for loop is
>>> serial, so won't race on its accesses to 'c'.
>>>
>>> (used chplvis, nice tool by the way).
>>>>
>>>
>>> Great, glad you like it!  I'll let its author (Phil Nelson, WWU) know.
>>>
>>> I haven't pre-checked my loop rewrites to verify that the loop now
>>> results in less communication, so I'll be curious what you find.
>>>
>>> Hope this helps, and let us know if it leads to follow-up questions,
>>> -Brad
>>>
>>>
>>> ------------------------------------------------------------
>>> ------------------
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>> _______________________________________________
>>> Chapel-users mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/chapel-users
>>>
>>>
>>