Hey Guys,

I had a little time to cook up Brad's suggestion as a full program (pasted in
[code][/code] below). It is much better with respect to communication
(unsurprisingly, given the comments above). Also, it looks a LOT cleaner.

Some questions:

1. Stylistically, should I use camelCase everywhere? I noticed that for
domains, spaces, and locales you sometimes use UpperCamelCase (i.e.,
PascalCase). Is there a convention you folks recommend?

One issue:

1. In this line `for (j, a) in zip({0..#dimension}, A(i(1), ..)) {`: I use
0-based indexing everywhere, but the index yielded by the rank-change domain
appears to be a 1-tuple, so I have to write `i(1)` to get at the row number.
It feels and looks awkward to me. Maybe it is unavoidable, but it will be
hard for me to remember.
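
One workaround I can imagine (an untested sketch): declare the vector domain
directly as a standalone 1D Cyclic domain instead of as a rank-change slice,
so that its indices are plain ints. This gives up sharing A's distribution
object, and I am assuming the rows still line up only because the locale grid
is degenerate in its second dimension:

[code]
// standalone 1D cyclic domain: its indices are ints, not 1-tuples
// (assumes the default targetLocales cycles over locales in the same
// order that localeGrid cycles the rows of A)
const vectorCyclicAlt = {0..#dimension} dmapped Cyclic(startIdx = 0);
var C2 : [vectorCyclicAlt] int;

forall (i, c) in zip(vectorCyclicAlt, C2) {
    for (j, a) in zip({0..#dimension}, A(i, ..)) {  // plain 'i' works here
        c += a * B(j);
    }
}
[/code]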

I still want to produce some timing results, but it will take me a little
bit of time. Then, on to the matrix-matrix product :)

[code]
use CyclicDist;
use ReplicatedDist;
use VisualDebug;

config const tasksPerLocale : int = 1;  // note: not yet used below

// The problem dimension (the matrix is dimension x dimension)
config const dimension : int = 10;

// Create a numLocales x 1 grid view of the Locales; makes the Cyclic dmap
// distribute the rows only
const localeGrid = reshape(Locales, {0..#numLocales, 0..0});

// Matrix, rows distributed to locales using cyclic distribution
const matrixCyclic = {0..#dimension, 0..#dimension}
                       dmapped Cyclic(startIdx = (0, 0),
                                      targetLocales = localeGrid);
var A : [matrixCyclic] int;

// Vector domains:
// -> B is replicated across locales
// -> C is reduced into, with a cyclic distribution matching matrixCyclic
const vectorReplicated = {0..#dimension} dmapped ReplicatedDist();
const vectorCyclic = matrixCyclic[.., 0];
var B : [vectorReplicated] int;
var C : [vectorCyclic] int;

// Initialize the matrix A and the replicated vector B
for a in A do a = 1;
for b in B do b = 1;

// Start chplvis data collection
startVdebug("chplvis");

// The main loop
forall (i, c) in zip(vectorCyclic, C) {
    for (j, a) in zip({0..#dimension}, A(i(1), ..)) {
        c += a * B(j);
    }
}

// Stop chplvis data collection
stopVdebug();

// Print the results
writeln(C);
[/code]
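
For reference, I compile and run it like this (the file name is just a
placeholder for whatever you save it as, and this assumes a
multilocale-enabled build of Chapel):

[code]
chpl matVec.chpl -o matVec
./matVec -nl 2 --dimension=100
[/code]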

Thanks,

Barry

On Thu, Mar 30, 2017 at 12:03 PM, Nikhil Padmanabhan
<[email protected]> wrote:

> Brad,
>
> Would it be possible/useful for zip to throw a warning (ideally at compile
> time) if the distributions of the vectors are not the same.....? In fact,
> one might even extend that to cases like C = A + B.....
>
> One might imagine this being an opt-in warning selected by a flag like the
> Intel Compiler's opt-report....  and add in more "hints" to the user which
> flag when certain optimizations are done, or being thwarted, or when the
> compiler thinks certain operations might generate excessive communication
> or non-local accesses....
>
> What do you think?
> -- Nikhil
>
>
>
> ---------------------------------
> Nikhil Padmanabhan
> [email protected]
>
> On Thu, Mar 30, 2017 at 8:57 AM, Barry Moore <[email protected]> wrote:
>
>> Brad,
>>
>> Thanks for the detailed response! I am going to dive in and I will try to
>> put together some data for comparison.
>>
>> - Barry
>>
>> On Wed, Mar 29, 2017 at 7:51 PM, Brad Chamberlain <[email protected]> wrote:
>>
>>>
>>> Hi Barry --
>>>
>>> My response falls into three parts, where maybe only the third one
>>> really matters to you -- so depending on your level of interest, you may
>>> want to skip ahead to part 3...
>>>
>>> Part 1: The opening regret
>>> --------------------------
>>>
>>> I have to admit that the phrase "matrix-vector product" makes me cringe
>>> in slight fear in the Chapel context, because I believe we're lacking some
>>> key features that would make it far better/more scalable than it is today.
>>>
>>> If you're interested in diving into some increasingly ancient history,
>>> we studied (sparse) matrix-vector multiplication at length in our previous
>>> work on ZPL and did reasonably well there compared to MPI.  When we started
>>> on Chapel, we made some core design decisions with the goal of closing the
>>> remaining scalability gaps we had with ZPL; yet we put off some other
>>> features that I believe are crucial for mat-vect mults (this was primarily
>>> because Chapel's early years focused more on irregular computations than on
>>> matrix-based ones).  We've actually been returning to these features
>>> recently (motivated by some users working on similar operations), but have
>>> not quite landed them yet.
>>>
>>> Just to put a name to them, I think the main concepts we're missing are:
>>>
>>> * partial reductions -- the ability to reduce a multi-dimensional array
>>>   expression along a subset of its dimensions rather than all of them.
>>>
>>> * replicated singleton dimensions -- i.e., {*, 1..n} might represent a
>>>   replicated row that spans columns 1..n.  (These were called flood and
>>>   grid dimensions in ZPL, where the former were kept coherent by the
>>>   language semantics and the latter were not.)
>>>
>>> Without these features, matrix-vector products can still be expressed,
>>> simply not as efficiently (and/or elegantly) as they could otherwise...
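>>>
>>> For a flavor of the partial-reduction idea, one might hope someday to
>>> write the whole product in a single line.  To be clear, the
>>> 'reduce(dim=...)' form below is hypothetical syntax, not legal Chapel
>>> today:
>>>
>>> ------
>>> // HYPOTHETICAL: reduce along dimension 2 only, yielding a vector
>>> C = + reduce(dim=2) [(i, j) in MatrixDomain] A(i, j) * B(j);
>>> ------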
>>>
>>>
>>> Part 2: Your declarations
>>> -------------------------
>>>
>>> In the end, I don't think there are any fundamental problems with your
>>> declarations, but in thinking through them myself, these are the notes I
>>> put together:
>>>
>>>
>>> - A is a square matrix (dimension x dimension), where each row is
>>>> cyclically distributed to the locales
>>>>
>>>
>>> You implemented this as:
>>>
>>> ------
>>> const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx = MatrixSpace.low,
>>>                                                   blocksize = (1, dimension))
>>>                   = MatrixSpace;
>>>
>>> var A : [MatrixDomain] int;
>>> ------
>>>
>>> If you only want to distribute the first dimension cyclically (i.e.,
>>> you'll always use a block size of 1), my suggestion would be to use the
>>> Cyclic distribution.  The typical trick to get it distributed in just one
>>> dimension is to pass in a 2D targetLocales array that is degenerate in one
>>> dimension (you could do this with BlockCyclic as well, you'd just be paying
>>> unnecessary blocking overhead).  So here's how I coded up A such that the
>>> rows are distributed cyclically and the columns not at all (along with some
>>> debugging code):
>>>
>>> ------
>>> // create a numLocales x 1 view of the Locales array:
>>> //
>>> const grid = reshape(Locales, {0..#numLocales, 0..0});
>>> //
>>> writeln(grid, "\n");
>>>
>>> // Distribute MatrixSpace in a Cyclic manner.  Because 'grid' is
>>> // degenerate in the second dimension, only the rows will be cyclic.
>>> //
>>> const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
>>>                                                 targetLocales = grid);
>>>
>>> var A : [MatrixDomain] int;
>>>
>>> //
>>> // Make sure that ownership is as expected:
>>> //
>>> forall a in A do
>>>   a = here.id;
>>> writeln(A, "\n");
>>>
>>> ------
>>>
>>> As a stylistic note, if you're comfortable leaving type declarations
>>> off, note that I've swung the 'dmapped' clause from the type to the value
>>> position, which I've come to find much simpler to read over time.  (In
>>> this way, you could even eliminate MatrixSpace by inlining it here, and
>>> avoid problems with accidentally using one rather than the other.)
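>>>
>>> For instance, the inlined form might look like this (a sketch mirroring
>>> the declarations above):
>>>
>>> ------
>>> // no separate MatrixSpace; the domain literal is distributed directly
>>> const MatrixDomain = {0..#dimension, 0..#dimension}
>>>                        dmapped Cyclic(startIdx = (0, 0),
>>>                                       targetLocales = grid);
>>> ------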
>>>
>>> - B is a vector (dimension), replicated on each locale
>>>>
>>>
>>> Here, you did:
>>>
>>> ------
>>> const VectorSpace = {0..#dimension};
>>> const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
>>>                              = VectorSpace;
>>> ------
>>>
>>> which I think is reasonable.  I must admit that I don't use the
>>> ReplicatedDist often enough to be familiar with it without experimenting
>>> and refreshing my memory.  But a quick check makes this look like it's
>>> going to work as you want.
>>>
>>> [Returning to the theme of part 1, I'll mention that my hope is that
>>> we'll soon be able to support this as a row-vector '{*, 0..#dimension}'
>>> that shares the distribution of A.  E.g., imagine expressing this long-form
>>> as:
>>>
>>>     const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;
>>>
>>> or more succinctly, as:
>>>
>>>     const RowVect = MatrixDomain[*, ..];
>>>
>>> (that is, as a slice of MatrixDomain)]
>>>
>>>
>>> - C is a vector (dimension), each value is cyclically distributed to
>>>> match
>>>> the row of A
>>>>
>>>
>>> Here, my suggestion would be to use the Cyclic distribution again and,
>>> if it were me, I'd also be inclined to have it share the same distribution
>>> as A rather than repeating all that distribution information.  This can be
>>> written most simply as:
>>>
>>> ----
>>> const VectorCyclicDomain = MatrixDomain[.., 0];
>>>
>>> var C : [VectorCyclicDomain] int;
>>> ----
>>>
>>> which essentially says: "Perform a rank-change slice (intersection) of
>>> MatrixDomain taking all of its indices in the first dimension and
>>> collapsing its second dimension down to index 0."
>>>
>>>
>>> Part 3: The real problem
>>> ------------------------
>>>
>>> The problem is when I use 2 locales, the second one is mainly doing
>>>> communication
>>>>
>>>
>>> The problem turns out to be the way you've written the loop, though it's
>>> a subtle issue and not an uncommon trap to fall into among new users:
>>>
>>> -----
>>> forall (i, c) in zip({0..#dimension}, C(..)) {
>>>     forall (j, a) in zip({0..#dimension}, A(i, ..)) with ( + reduce c ) {
>>>         c += a * B(j);
>>>     }
>>> }
>>> -----
>>>
>>> The implementation of a forall loop (how many tasks to create and where
>>> should they run?) is controlled by the thing that's being iterated over.
>>> Whenever a forall loop iterates over a zip() expression, the first thing in
>>> the list (the "leader") controls the implementation.  So these loops are
>>> iterating over the anonymous domain {0..#dimension} which is not
>>> distributed (because nothing has said it should be).  For that reason, all
>>> tasks created by the forall loop will be local to the original locale
>>> (Locale #0).  That's why you're seeing other locales not getting much work
>>> (and probably a lot of communication between them and locale #0 for all the
>>> non-local accesses...).
>>>
>>> In order to distribute the tasks across locales using a forall loop,
>>> you'll want to iterate over one of your distributed domains or arrays. This
>>> will cause the distribution of that domain/array to create the tasks, and
>>> it'll do so using a task-per-core per locale to which the domain/array is
>>> distributed.
>>>
>>> (You could also use an on-clause within the body of your loop to steer
>>> each iteration to the appropriate locale, but this will result in creating
>>> an active message per iteration, which will kill your performance... better
>>> to let the forall create the on-clauses for you only as necessary.)
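>>>
>>> Concretely, that anti-pattern would look something like this (a sketch,
>>> just to show what to avoid):
>>>
>>> ------
>>> // BAD: all the tasks start on locale 0, and every iteration then fires
>>> // an active message to hop to the locale owning C(i)
>>> forall i in {0..#dimension} do
>>>   on C(i) do
>>>     for j in 0..#dimension do
>>>       C(i) += A(i, j) * B(j);
>>> ------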
>>>
>>> So you should see your program using more computational resources on
>>> locales > 0 if you rewrite this as:
>>>
>>>         forall (i, c) in zip(VectorCyclicDomain, C) {
>>>           forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
>>>             c += a * B(j);
>>>           }
>>>         }
>>>
>>> A few notes on the above:
>>>
>>> * 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
>>>   strength optimize the identity slice), so you'll get better performance
>>>   if you just write it as 'C'.
>>>
>>> * similarly, for the purposes of these loops, iterating over
>>>   '{0..#dimension}' is equivalent to iterating over '0..#dimension',
>>>   which is equivalent to iterating over 'VectorSpace', since none of them
>>>   are distributed.  So, it can be better to iterate over the named,
>>>   pre-created thing rather than (logically) creating a new range (cheaper)
>>>   or domain (more expensive) per inner loop iteration.  (This _may_ get
>>>   hoisted by the compiler as loop-invariant code, but why take the chance
>>>   when you've already named it so nicely...?)
>>>
>>> But, before stopping there, note that the use of the nested forall is
>>> likely to add overhead without much benefit.  Specifically, the outer loop
>>> will already create a task per processor core that 'VectorCyclicDomain' is
>>> distributed to (i.e., all of them), so your code will be fully parallel
>>> without the additional parallel inner loop.  For this reason, I'd probably
>>> write the loop as:
>>>
>>>         forall (i, c) in zip(VectorCyclicDomain, C) {
>>>           for (j, a) in zip(VectorSpace, A(i, ..)) {
>>>             c += a * B(j);
>>>           }
>>>         }
>>>
>>> This also avoids the need for the 'with' clause since the for loop is
>>> serial, so won't race on its accesses to 'c'.
>>>
>>> (used chplvis, nice tool by the way).
>>>>
>>>
>>> Great, glad you like it!  I'll let its author (Phil Nelson, WWU) know.
>>>
>>> I haven't pre-checked my loop rewrites to verify that the loop now
>>> results in less communication, so I'll be curious what you find.
>>>
>>> Hope this helps, and let us know if it leads to follow-up questions,
>>> -Brad
>>>
>>>
>>
>>
>> --
>> Barry E Moore II, PhD
>> E-mail: [email protected]
>>
>> Assistant Research Professor
>> Center for Simulation and Modeling
>> University of Pittsburgh
>> Pittsburgh, PA 15260
>>
>


-- 
Barry E Moore II, PhD
E-mail: [email protected]

Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260