Brad,
Thanks for the detailed response! I am going to dive in and I will try to
put together some data for comparison.
- Barry
On Wed, Mar 29, 2017 at 7:51 PM, Brad Chamberlain <[email protected]> wrote:
>
> Hi Barry --
>
> My response falls into three parts, where maybe only the third one really
> matters to you -- so depending on your level of interest, you may want to
> skip ahead to part 3...
>
> Part 1: The opening regret
> --------------------------
>
> I have to admit that the phrase "matrix-vector product" makes me cringe a
> bit in the Chapel context, because I believe we're still lacking some key
> features that would make it far better/more scalable than it is today.
>
> If you're interested in diving into some increasingly ancient history, we
> studied (sparse) matrix-vector multiplication at length in our previous
> work on ZPL and did reasonably well there compared to MPI. When we started
> on Chapel, we made some core design decisions with the goal of closing the
> remaining scalability gaps we had with ZPL; yet we put off some other
> features that I believe are crucial for mat-vect mults (this was primarily
> because Chapel's early years focused more on irregular computations than on
> matrix-based ones). We've actually been returning to these features
> recently (motivated by some users working on similar operations), but have
> not quite landed them yet.
>
> Just to put a name to them, I think the main concepts we're missing are:
>
> * partial reductions -- the ability to reduce a multi-dimensional array
> expression along a subset of its dimensions rather than all of them.
>
> * replicated singleton dimensions -- i.e., {*, 1..n} might represent a
> replicated row that spans columns 1..n. (these were called flood and
> grid dimensions in ZPL where the former were kept coherent by the
> language semantics and the latter were not).
>
> Without these features, matrix-vector products can still be expressed,
> simply not as efficiently (and/or elegantly) as they could otherwise...
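>
> As a point of reference, without a partial reduction the reduced dimension
> has to be spelled out with an explicit loop. Here's a minimal sketch, using
> a hypothetical local n x n array 'M' purely for illustration:
>
> ------
> config const n = 8;
>
> var M : [0..#n, 0..#n] int = 1;
> var rowSums : [0..#n] int;
>
> // reduce M along its second dimension, one row at a time
> forall i in 0..#n do
>   rowSums[i] = + reduce M[i, ..];
> ------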
>
>
> Part 2: Your declarations
> -------------------------
>
> In the end, I don't think there are any fundamental problems with your
> declarations, but in thinking through them myself, these are the notes I
> put together:
>
>
> - A is a square matrix (dimension x dimension), where each row is
>> cyclically distributed to the locales
>>
>
> You implemented this as:
>
> ------
> const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx = MatrixSpace.low,
>                                                   blocksize = (1, dimension))
>                   = MatrixSpace;
>
> var A : [MatrixDomain] int;
> ------
>
> If you only want to distribute the first dimension cyclically (i.e.,
> you'll always use a block size of 1), my suggestion would be to use the
> Cyclic distribution. The typical trick to get it distributed in just one
> dimension is to pass in a 2D targetLocales array that is degenerate in one
> dimension (you could do this with BlockCyclic as well, you'd just be paying
> unnecessary blocking overhead). So here's how I coded up A such that the
> rows are distributed cyclically and the columns not at all (along with some
> debugging code):
>
> ------
> use CyclicDist;   // for the Cyclic distribution
>
> // create a numLocales x 1 view of the Locales array:
> //
> const grid = reshape(Locales, {0..#numLocales, 0..0});
> //
> writeln(grid, "\n");
>
> // Distribute MatrixSpace in a Cyclic manner. Because 'grid' is
> // degenerate in the second dimension, only the rows will be cyclic.
> //
> const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
> targetLocales = grid);
>
> var A : [MatrixDomain] int;
>
> //
> // Make sure that ownership is as expected:
> //
> forall a in A do
> a = here.id;
> writeln(A, "\n");
>
> ------
>
> As a stylistic note, if you're comfortable leaving type declarations off,
> note that I've swung the 'dmapped' clause from the type to the value
> position, which I've come to find much simpler to read over time (written
> this way, you could even eliminate MatrixSpace entirely by inlining it
> here, and avoid problems with accidentally using one rather than the other).
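>
> For example, the inlined form might look something like this (just a
> sketch, assuming MatrixSpace was declared as {0..#dimension, 0..#dimension}):
>
> ------
> const MatrixDomain = {0..#dimension, 0..#dimension}
>                        dmapped Cyclic(startIdx = (0, 0),
>                                       targetLocales = grid);
> ------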
>
> - B is a vector (dimension), replicated on each locale
>>
>
> Here, you did:
>
> ------
> const VectorSpace = {0..#dimension};
> const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
> = VectorSpace;
> ------
>
> which I think is reasonable. I must admit that I don't use the
> ReplicatedDist often enough to be familiar with it without experimenting
> and refreshing my memory. But a quick check makes this look like it's
> going to work as you want.
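>
> For completeness, the array itself would then be declared over that domain,
> and each locale gets its own copy of it that you'd want to initialize
> locally. A rough, untested sketch (my assumption being that indexed
> accesses to a replicated array refer to the current locale's copy):
>
> ------
> var B : [VectorReplicatedDomain] int;
>
> // fill in the local copy of B on each locale
> coforall loc in Locales do on loc do
>   forall j in VectorSpace do
>     B(j) = 1;
> ------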
>
> [Returning to the theme of part 1, I'll mention that my hope is that we'll
> soon be able to support this as a row-vector '{*, 0..#dimension}' that
> shares the distribution of A. E.g., imagine expressing this long-form as:
>
> const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;
>
> or more succinctly, as:
>
> const RowVect = MatrixDomain[*, ..];
>
> (that is, as a slice of MatrixDomain)]
>
>
> - C is a vector (dimension), each value is cyclically distributed to match
>> the row of A
>>
>
> Here, my suggestion would be to use the Cyclic distribution again and, if
> it were me, I'd also be inclined to have it share the same distribution as
> A rather than repeating all that distribution information. This can be
> written most simply as:
>
> ----
> const VectorCyclicDomain = MatrixDomain[.., 0];
>
> var C : [VectorCyclicDomain] int;
> ----
>
> which essentially says: "Perform a rank-change slice (intersection) of
> MatrixDomain taking all of its indices in the first dimension and
> collapsing its second dimension down to index 0."
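>
> As with A, you can sanity-check C's ownership with the same trick:
>
> ----
> forall c in C do
>   c = here.id;
> writeln(C, "\n");
> ----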
>
>
> Part 3: The real problem
> ------------------------
>
> The problem is when I use 2 locales, the second one is mainly doing
>> communication
>>
>
> The problem turns out to be the way you've written the loop, though it's a
> subtle issue and not an uncommon trap for new users to fall into:
>
> -----
> forall (i, c) in zip({0..#dimension}, C(..)) {
>   forall (j, a) in zip({0..#dimension}, A(i, ..)) with ( + reduce c ) {
>     c += a * B(j);
>   }
> }
> -----
>
> The implementation of a forall loop (how many tasks to create and where
> they should run) is controlled by the thing that's being iterated over.
> Whenever a forall loop iterates over a zip() expression, the first thing in
> the list (the "leader") controls the implementation. So these loops are
> iterating over the anonymous domain {0..#dimension}, which is not
> distributed (because nothing has said it should be). For that reason, all
> tasks created by the forall loop will be local to the original locale
> (Locale #0). That's why you're seeing other locales not getting much work
> (and probably a lot of communication between them and locale #0 for all the
> non-local accesses...).
>
> In order to distribute the tasks across locales using a forall loop,
> you'll want to iterate over one of your distributed domains or arrays. This
> will cause the distribution of that domain/array to create the tasks, and
> it'll do so using a task per core on each locale to which the domain/array
> is distributed.
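>
> You can see the difference directly with the same here.id trick as before
> (a small sketch; 'whoRan' is just a scratch array for illustration):
>
> ------
> var whoRan : [VectorCyclicDomain] int;
>
> // the anonymous domain leads, so every task runs on Locale 0:
> forall (i, w) in zip({0..#dimension}, whoRan) do
>   w = here.id;
> writeln(whoRan, "\n");
>
> // the distributed domain leads, so tasks run where the elements live:
> forall (i, w) in zip(VectorCyclicDomain, whoRan) do
>   w = here.id;
> writeln(whoRan, "\n");
> ------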
>
> (You could also use an on-clause within the body of your loop to steer
> each iteration to the appropriate locale, but this will result in creating
> an active message per iteration, which will kill your performance... better
> to let the forall create the on-clauses for you only as necessary).
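>
> To make that concrete, the per-iteration on-clause version would look
> roughly like this (a sketch only -- not recommended):
>
> ------
> forall i in VectorSpace do
>   on A(i, 0) do                    // one active message per iteration
>     for j in VectorSpace do
>       C(i) += A(i, j) * B(j);
> ------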
>
> So you should see your program using more computational resources on
> locales > 0 if you rewrite this as:
>
> forall (i, c) in zip(VectorCyclicDomain, C) {
>   forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
>     c += a * B(j);
>   }
> }
>
> A few notes on the above:
>
> * 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
> strength optimize the identity slice), so you'll get better performance
> if you just write it as 'C'.
>
> * similarly, for the purposes of these loops, iterating over
> '{0..#dimension}' is equivalent to iterating over '0..#dimension'
> which is equivalent to iterating over 'VectorSpace' since none of them
> are distributed. So, it can be better to iterate over the named,
> pre-created thing rather than (logically) creating a new range (cheaper)
> or domain (more expensive) per inner loop iteration. (this _may_ get
> hoisted by the compiler as loop-invariant code, but why take the chance
> when you've already named it so nicely...?)
>
> But, before stopping there, note that the use of the nested forall is
> likely to add overhead without much benefit. Specifically, the outer loop
> will already create a task per processor core that 'VectorCyclicDomain' is
> distributed to (i.e., all of them), so your code will be fully parallel
> without the additional parallel inner loop. For this reason, I'd probably
> write the loop as:
>
> forall (i, c) in zip(VectorCyclicDomain, C) {
>   for (j, a) in zip(VectorSpace, A(i, ..)) {
>     c += a * B(j);
>   }
> }
>
> This also avoids the need for the 'with' clause since the for loop is
> serial, so it won't race on its accesses to 'c'.
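>
> In case it helps to see all the pieces in one place, here's a rough,
> untested sketch of the whole thing assembled into a single program (the
> module names in the 'use' statement and the initialization of A and B are
> my assumptions -- adjust to taste):
>
> ------
> use CyclicDist, ReplicatedDist;
>
> config const dimension = 8;
>
> const MatrixSpace = {0..#dimension, 0..#dimension};
> const VectorSpace = {0..#dimension};
>
> // numLocales x 1 view of Locales so that only the rows get distributed
> const grid = reshape(Locales, {0..#numLocales, 0..0});
>
> const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
>                                                 targetLocales = grid);
> const VectorCyclicDomain = MatrixDomain[.., 0];
> const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
>                              = VectorSpace;
>
> var A : [MatrixDomain] int;
> var B : [VectorReplicatedDomain] int;
> var C : [VectorCyclicDomain] int;
>
> // (fill in A and B here; remember that B has one copy per locale)
>
> forall (i, c) in zip(VectorCyclicDomain, C) {
>   for (j, a) in zip(VectorSpace, A(i, ..)) {
>     c += a * B(j);
>   }
> }
>
> writeln(C);
> ------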
>
> (used chplvis, nice tool by the way).
>>
>
> Great, glad you like it! I'll let its author (Phil Nelson, WWU) know.
>
> I haven't pre-checked my loop rewrites to verify that they actually result
> in less communication, so I'll be curious what you find.
>
> Hope this helps, and let us know if it leads to follow-up questions,
> -Brad
>
>
--
Barry E Moore II, PhD
E-mail: [email protected]
Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users