Yeah, this is something I've fantasized about for a while (essentially "let
us point out what might be some obvious performance-related things you might
consider..." -- along the lines of automating some of the long email I sent
out this morning).  I just haven't wanted to (or at least haven't found the
time to) implement it. :)  Would be fair game for a GitHub feature request
issue...
-Brad

On Fri, 31 Mar 2017, Nikhil Padmanabhan wrote:

Brad,

I agree this should be opt-in...

The more I think about it, an opt-in flag that produces a set of
warnings/comments about the code sounds quite useful.  Most people are
probably not used to programming in a PGAS model, and Chapel has its own
idioms as well, so something like this would be useful when a user finds
the code not performing quite as they expect.

One could also use this to communicate places where the code/communication
is not as optimized as it might be -- e.g., when writing a new distribution,
if there are known slow steps that are works in progress, the developer
could warn the user...

There is, of course, the question of what the set of warnings/informational
messages should be, whether there should be levels of them, etc...

Cheers,
-- Nikhil

---------------------------------
Nikhil Padmanabhan
[email protected]

On Fri, Mar 31, 2017 at 11:58 AM, Brad Chamberlain <[email protected]> wrote:

Hi Nikhil --

I agree that this could be useful as an opt-in warning, and it seems that
it would not be too hard to implement.  I'd worry about it being an
always-on warning because I think there are many cases where it's
legitimate to zip distributed and non-distributed things.  Maybe with
experience we could tune the warning to distinguish between "more likely to
be legitimate" and "more likely to be a mistake" cases...?

-Brad

On Thu, 30 Mar 2017, Nikhil Padmanabhan wrote:

Brad,

Would it be possible/useful for zip to throw a warning (ideally at compile
time) if the distributions of the vectors are not the same...?  In fact,
one might even extend that to cases like C = A + B...

One might imagine this being an opt-in warning selected by a flag like the
Intel compiler's opt-report... and adding more "hints" to the user that
flag when certain optimizations are done or are being thwarted, or when the
compiler thinks certain operations might generate excessive communication
or non-local accesses...

What do you think?
-- Nikhil

---------------------------------
Nikhil Padmanabhan
[email protected]

On Wed, Mar 29, 2017 at 7:51 PM, Brad Chamberlain <[email protected]> wrote:

Hi Barry --

My response falls into three parts, where maybe only the third one really
matters to you -- so depending on your level of interest, you may want to
skip ahead to part 3...

Part 1: The opening regret
--------------------------

I have to admit that the phrase "matrix-vector product" makes me cringe in
slight fear in the Chapel context, because I believe we're lacking some key
features that would make it far better/more scalable than it is today.

If you're interested in diving into some increasingly ancient history, we
studied (sparse) matrix-vector multiplication at length in our previous
work on ZPL and did reasonably well there compared to MPI.  When we started
on Chapel, we made some core design decisions with the goal of closing the
remaining scalability gaps we had with ZPL; yet we put off some other
features that I believe are crucial for mat-vect mults (this was primarily
because Chapel's early years focused more on irregular computations than on
matrix-based ones).  We've actually been returning to these features
recently (motivated by some users working on similar operations), but have
not quite landed them yet.

Just to put a name to them, I think the main concepts we're missing are:

* partial reductions -- the ability to reduce a multi-dimensional array
  expression along a subset of its dimensions rather than all of them.

* replicated singleton dimensions -- i.e., {*, 1..n} might represent a
  replicated row that spans columns 1..n.  (These were called flood and
  grid dimensions in ZPL, where the former were kept coherent by the
  language semantics and the latter were not.)

Without these features, matrix-vector products can still be expressed,
simply not as efficiently (and/or elegantly) as they could otherwise...
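(To make the first of these concrete: today, reducing along just one
dimension has to be spelled out as a loop over the dimension you keep.  A
minimal sketch, using a small local array rather than anything distributed:)

------
config const n = 8;

var A : [0..#n, 0..#n] int;
A = 1;

// "partial reduction" by hand: collapse only the second (column) dimension
var rowSums : [0..#n] int;
forall i in 0..#n do
  rowSums[i] = + reduce A[i, ..];

writeln(rowSums);  // each entry is n
------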
Part 2: Your declarations
-------------------------

In the end, I don't think there are any fundamental problems with your
declarations, but in thinking through them myself, these are the notes I
put together:

> - A is a square matrix (dimension x dimension), where each row is
>   cyclically distributed to the locales

You implemented this as:

------
const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx = MatrixSpace.low,
                                                  blocksize = (1, dimension))
                  = MatrixSpace;

var A : [MatrixDomain] int;
------

If you only want to distribute the first dimension cyclically (i.e., you'll
always use a block size of 1), my suggestion would be to use the Cyclic
distribution.  The typical trick to get it distributed in just one
dimension is to pass in a 2D targetLocales array that is degenerate in one
dimension (you could do this with BlockCyclic as well; you'd just be paying
unnecessary blocking overhead).  So here's how I coded up A such that the
rows are distributed cyclically and the columns not at all (along with some
debugging code):

------
//
// create a numLocales x 1 view of the Locales array:
//
const grid = reshape(Locales, {0..#numLocales, 0..0});

writeln(grid, "\n");

//
// Distribute MatrixSpace in a Cyclic manner.  Because 'grid' is degenerate
// in the second dimension, only the rows will be cyclic.
//
const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
                                                targetLocales = grid);

var A : [MatrixDomain] int;

//
// Make sure that ownership is as expected:
//
forall a in A do
  a = here.id;
writeln(A, "\n");
------

As a stylistic note, if you're comfortable leaving type declarations off,
note that I've swung the 'dmapped' clause from the type to the value
position, which I've come to find much simpler to read over time.  (In this
way, you could even eliminate MatrixSpace if you wanted by inlining it
here, and avoid problems with accidentally using one rather than the
other.)

> - B is a vector (dimension), replicated on each locale

Here, you did:

------
const VectorSpace = {0..#dimension};
const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
                             = VectorSpace;
------

which I think is reasonable.  I must admit that I don't use the
ReplicatedDist often enough to be familiar with it without experimenting
and refreshing my memory, but a quick check makes this look like it's going
to work as you want.
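(In case it's useful, here's a minimal sketch of one way to fill in every
locale's copy of B.  As I understand ReplicatedDist, each locale gets its
own independent copy and the copies aren't kept coherent automatically, so
each copy needs to be written from its own locale; the 'dimension' default
here is just for illustration:)

------
use ReplicatedDist;

config const dimension = 8;

const VectorSpace = {0..#dimension};
const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
                             = VectorSpace;

var B : [VectorReplicatedDomain] int;

// fill every locale's copy of B; from a given locale, B(j) refers to
// that locale's local copy
coforall loc in Locales do
  on loc do
    for j in VectorSpace do
      B(j) = j;
------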
[Returning to the theme of part 1, I'll mention that my hope is that we'll
soon be able to support this as a row vector '{*, 0..#dimension}' that
shares the distribution of A.  E.g., imagine expressing this long-form as:

  const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;

or more succinctly, as:

  const RowVect = MatrixDomain[*, ..];

(that is, as a slice of MatrixDomain).]

> - C is a vector (dimension), each value is cyclically distributed to
>   match the row of A

Here, my suggestion would be to use the Cyclic distribution again and, if
it were me, I'd also be inclined to have it share the same distribution as
A rather than repeating all that distribution information.  This can be
written most simply as:

----
const VectorCyclicDomain = MatrixDomain[.., 0];

var C : [VectorCyclicDomain] int;
----

which essentially says: "Perform a rank-change slice (intersection) of
MatrixDomain, taking all of its indices in the first dimension and
collapsing its second dimension down to index 0."

Part 3: The real problem
------------------------

> The problem is when I use 2 locales, the second one is mainly doing
> communication

The problem turns out to be the way you've written the loop, though it's a
subtle issue and not an uncommon trap for new users to fall into:

-----
forall (i, c) in zip({0..#dimension}, C(..)) {
  forall (j, a) in zip({0..#dimension}, A(i, ..)) with (+ reduce c) {
    c += a * B(j);
  }
}
-----

The implementation of a forall loop (how many tasks to create and where
should they run?) is controlled by the thing that's being iterated over.
Whenever a forall loop iterates over a zip() expression, the first thing in
the list (the "leader") controls the implementation.  So these loops are
iterating over the anonymous domain {0..#dimension}, which is not
distributed (because nothing has said it should be).  For that reason, all
tasks created by the forall loop will be local to the original locale
(locale #0).  That's why you're seeing other locales not getting much work
(and probably a lot of communication between them and locale #0 for all the
non-local accesses...).

In order to distribute the tasks across locales using a forall loop, you'll
want to iterate over one of your distributed domains or arrays.  This will
cause the distribution of that domain/array to create the tasks, and it'll
do so using a task per core on each locale to which the domain/array is
distributed.

(You could also use an on-clause within the body of your loop to steer each
iteration to the appropriate locale, but this would result in creating an
active message per iteration, which will kill your performance... better to
let the forall create the on-clauses for you only as necessary.)
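(One way to see the leader's effect directly is to flip the order of a zip
in a small experiment -- a sketch, reusing the ownership-debugging idiom
from above on a 1-D Cyclic domain:)

------
use CyclicDist;

config const n = 8;

const S = {0..#n};
const D = S dmapped Cyclic(startIdx = S.low);
var owner : [D] int;

// local range leads: all tasks run on locale 0 (remote elements are
// written via communication)
forall (i, o) in zip(0..#n, owner) do
  o = here.id;
writeln(owner);   // all zeros

// distributed array leads: tasks run on the locales that own the elements
forall (o, i) in zip(owner, 0..#n) do
  o = here.id;
writeln(owner);   // cyclic pattern of locale ids
------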
So you should see your program using more computational resources on
locales > 0 if you rewrite this as:

forall (i, c) in zip(VectorCyclicDomain, C) {
  forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
    c += a * B(j);
  }
}

A few notes on the above:

* 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
  optimize the identity slice away), so you'll get better performance if
  you just write it as 'C'.

* similarly, for the purposes of these loops, iterating over
  '{0..#dimension}' is equivalent to iterating over '0..#dimension', which
  is equivalent to iterating over 'VectorSpace', since none of them are
  distributed.  So it can be better to iterate over the named, pre-created
  thing rather than (logically) creating a new range (cheaper) or domain
  (more expensive) per inner-loop iteration.  (This _may_ get hoisted by
  the compiler as loop-invariant code, but why take the chance when you've
  already named it so nicely...?)

But, before stopping there, note that the use of the nested forall is
likely to add overhead without much benefit.  Specifically, the outer loop
will already create a task per processor core that 'VectorCyclicDomain' is
distributed to (i.e., all of them), so your code will be fully parallel
without the additional parallel inner loop.  For this reason, I'd probably
write the loop as:

forall (i, c) in zip(VectorCyclicDomain, C) {
  for (j, a) in zip(VectorSpace, A(i, ..)) {
    c += a * B(j);
  }
}

This also avoids the need for the 'with' clause since the for loop is
serial, so it won't race on its accesses to 'c'.

> (used chplvis, nice tool by the way).

Great, glad you like it!  I'll let its author (Phil Nelson, WWU) know.

I haven't pre-checked my loop rewrites to verify that the loop now results
in less communication, so I'll be curious what you find.

Hope this helps, and let us know if it leads to follow-up questions,
-Brad
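(For reference, here are the pieces discussed above assembled into one
small, self-contained program.  This is only a sketch: the 'use'
statements, the 'dimension' default, and the test values for A and B are
assumptions, and like the loop rewrites above it hasn't been
performance-checked.)

------
use CyclicDist, ReplicatedDist;

config const dimension = 8;

const MatrixSpace = {0..#dimension, 0..#dimension};
const VectorSpace = {0..#dimension};

// numLocales x 1 view of Locales, so only the rows get distributed
const grid = reshape(Locales, {0..#numLocales, 0..0});

// A's rows cyclically distributed; C shares A's distribution via a
// rank-change slice; B replicated on every locale
const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
                                                targetLocales = grid);
const VectorCyclicDomain = MatrixDomain[.., 0];
const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
                             = VectorSpace;

var A : [MatrixDomain] int;
var B : [VectorReplicatedDomain] int;
var C : [VectorCyclicDomain] int;

// test values: A is all ones; every locale's copy of B holds 0..dimension-1
forall a in A do
  a = 1;
coforall loc in Locales do
  on loc do
    for j in VectorSpace do
      B(j) = j;

// outer forall distributes tasks via C's cyclic distribution; the inner
// loop is serial, so no reduce intent is needed on 'c'
forall (i, c) in zip(VectorCyclicDomain, C) {
  for (j, a) in zip(VectorSpace, A(i, ..)) {
    c += a * B(j);
  }
}

writeln(C);  // each entry should be dimension*(dimension-1)/2
------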
