Hi Barry --
My response falls into three parts, where maybe only the third one really
matters to you -- so depending on your level of interest, you may want to
skip ahead to part 3...
Part 1: The opening regret
--------------------------
I have to admit that the phrase "matrix-vector product" makes me cringe in
slight fear in the Chapel context, because I believe we're lacking some
key features that would make it far better/more scalable than it is today.
If you're interested in diving into some increasingly ancient history, we
studied (sparse) matrix-vector multiplication at length in our previous
work on ZPL and did reasonably well there compared to MPI. When we
started on Chapel, we made some core design decisions with the goal of
closing the remaining scalability gaps we had with ZPL; yet we put off
some other features that I believe are crucial for mat-vect mults (this
was primarily because Chapel's early years focused more on irregular
computations than on matrix-based ones). We've actually been returning to
these features recently (motivated by some users working on similar
operations), but have not quite landed them yet.
Just to put a name to them, I think the main concepts we're missing are:
* partial reductions -- the ability to reduce a multi-dimensional array
expression along a subset of its dimensions rather than all of them.
* replicated singleton dimensions -- i.e., {*, 1..n} might represent a
replicated row that spans columns 1..n. (These were called flood and
grid dimensions in ZPL, where the former were kept coherent by the
language semantics and the latter were not.)
Without these features, matrix-vector products can still be expressed,
simply not as efficiently (and/or elegantly) as they could otherwise...
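To make the first of these concrete, here's roughly what the workaround
looks like today: lacking a partial-reduction expression, reducing each
row of a 2D array has to be written as an explicit loop over the rows
(this is just a sketch; 'n', 'M', and 'rowSums' are illustrative names):
------
// reduce each row of a 2D array 'M' down to a 1D array of row sums by
// looping over the rows explicitly:
const n = 8;
var M : [0..#n, 0..#n] int;
M = 1;
var rowSums : [0..#n] int;
forall i in 0..#n do
  rowSums[i] = + reduce M[i, ..];
writeln(rowSums);
------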
Part 2: Your declarations
-------------------------
In the end, I don't think there are any fundamental problems with your
declarations, but in thinking through them myself, these are the notes I
put together:
> - A is a square matrix (dimension x dimension), where each row is
> cyclically distributed to the locales
You implemented this as:
------
const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx=MatrixSpace.low,
                                                  blocksize = (1, dimension))
                            = MatrixSpace;
var A : [MatrixDomain] int;
------
If you only want to distribute the first dimension cyclically (i.e.,
you'll always use a block size of 1), my suggestion would be to use the
Cyclic distribution. The typical trick to get it distributed in just one
dimension is to pass in a 2D targetLocales array that is degenerate in one
dimension (you could do this with BlockCyclic as well, but you'd just be
paying unnecessary blocking overhead). So here's how I coded up A such
that the rows are distributed cyclically and the columns not at all (along
with some debugging code):
------
use CyclicDist;   // provides the Cyclic distribution

//
// create a numLocales x 1 view of the Locales array:
//
const grid = reshape(Locales, {0..#numLocales, 0..0});
writeln(grid, "\n");

//
// Distribute MatrixSpace in a Cyclic manner.  Because 'grid' is
// degenerate in the second dimension, only the rows will be cyclic.
//
const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
                                                targetLocales = grid);
var A : [MatrixDomain] int;

//
// Make sure that ownership is as expected:
//
forall a in A do
  a = here.id;
writeln(A, "\n");
------
As a stylistic note, if you're comfortable leaving type declarations off,
note that I've swung the 'dmapped' clause from the type position to the
value position, which I've come to find much simpler to read over time.
(In this way, you could even eliminate MatrixSpace if you wanted by
inlining it here, and avoid problems with accidentally using one rather
than the other; see the sketch just below.)
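For example, the inlined form might look something like this (just a
sketch, assuming MatrixSpace is {0..#dimension, 0..#dimension} and 'grid'
is declared as above):
------
// a sketch of the inlined form: the index set and the 'dmapped' clause
// both appear at the value position, so MatrixSpace isn't needed at all
const MatrixDomain = {0..#dimension, 0..#dimension}
                       dmapped Cyclic(startIdx=(0,0), targetLocales=grid);
var A : [MatrixDomain] int;
------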
> - B is a vector (dimension), replicated on each locale
Here, you did:
------
const VectorSpace = {0..#dimension};
const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
                             = VectorSpace;
------
which I think is reasonable. I must admit that I don't use the
ReplicatedDist often enough to be familiar with it without experimenting
and refreshing my memory. But a quick check makes this look like it's
going to work as you want.
[Returning to the theme of part 1, I'll mention that my hope is that we'll
soon be able to support this as a row-vector '{*, 0..#dimension}' that
shares the distribution of A. E.g., imagine expressing this long-form as:
const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;
or more succinctly, as:
const RowVect = MatrixDomain[*, ..];
(that is, as a slice of MatrixDomain)]
> - C is a vector (dimension), each value is cyclically distributed to match
> the row of A
Here, my suggestion would be to use the Cyclic distribution again and, if
it were me, I'd also be inclined to have it share the same distribution as
A rather than repeating all that distribution information. This can be
written most simply as:
----
const VectorCyclicDomain = MatrixDomain[.., 0];
var C : [VectorCyclicDomain] int;
----
which essentially says: "Perform a rank-change slice (intersection) of
MatrixDomain taking all of its indices in the first dimension and
collapsing its second dimension down to index 0."
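If you want to double-check that this gives the alignment you're after,
the same ownership-debugging trick from above works here too (a sketch):
------
// verify that each element of C lives on the same locale as the
// corresponding row of A:
forall c in C do
  c = here.id;
writeln(C, "\n");
------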
Part 3: The real problem
------------------------
> The problem is when I use 2 locales, the second one is mainly doing
> communication
The problem turns out to be the way you've written the loop, though it's a
subtle issue and a common trap for new users to fall into:
-----
forall (i, c) in zip({0..#dimension}, C(..)) {
  forall (j, a) in zip({0..#dimension}, A(i, ..)) with ( + reduce c ) {
    c += a * B(j);
  }
}
-----
The implementation of a forall loop (how many tasks to create and where
should they run?) is controlled by the thing that's being iterated over.
Whenever a forall loop iterates over a zip() expression, the first thing
in the list (the "leader") controls the implementation. So these loops
are iterating over the anonymous domain {0..#dimension} which is not
distributed (because nothing has said it should be). For that reason, all
tasks created by the forall loop will be local to the original locale
(Locale #0). That's why you're seeing other locales not getting much work
(and probably a lot of communication between them and locale #0 for all
the non-local accesses...).
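A quick way to see this for yourself (just a sketch): since the anonymous
domain is the leader of the zip, every iteration of your outer loop will
report that it ran on locale 0:
------
forall (i, c) in zip({0..#dimension}, C) do
  writeln("iteration ", i, " ran on locale ", here.id);  // always 0
------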
In order to distribute the tasks across locales using a forall loop,
you'll want to iterate over one of your distributed domains or arrays.
This will cause the distribution of that domain/array to create the tasks,
and it'll do so by creating a task per core on each locale to which the
domain/array is distributed.
(You could also use an on-clause within the body of your loop to steer
each iteration to the appropriate locale, but this will result in creating
an active message per iteration, which will kill your performance... better
to let the forall create the on-clauses for you only as necessary.)
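To make that concrete, the pattern to avoid looks something like this
(a sketch of the anti-pattern, using the declarations above):
------
// anti-pattern: a fine-grained on-clause per iteration creates an
// active message for every element of C, swamping any parallel benefit
forall i in VectorSpace do
  on C(i) do
    for j in VectorSpace do
      C(i) += A(i, j) * B(j);
------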
So you should see your program using more computational resources on
locales > 0 if you rewrite this as:
-----
forall (i, c) in zip(VectorCyclicDomain, C) {
  forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
    c += a * B(j);
  }
}
-----
A few notes on the above:
* 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
currently optimize away the identity slice), so you'll get better
performance if you just write it as 'C'.
* similarly, for the purposes of these loops, iterating over
'{0..#dimension}' is equivalent to iterating over '0..#dimension'
which is equivalent to iterating over 'VectorSpace' since none of them
are distributed. So, it can be better to iterate over the named,
pre-created thing rather than (logically) creating a new range (cheaper)
or domain (more expensive) per inner loop iteration. (this _may_ get
hoisted by the compiler as loop-invariant code, but why take the chance
when you've already named it so nicely...?)
But, before stopping there, note that the use of the nested forall is
likely to add overhead without much benefit. Specifically, the outer loop
will already create a task per processor core on each locale that
'VectorCyclicDomain' is distributed to (i.e., all of them), so your code
will be fully parallel without the additional parallel inner loop. For
this reason, I'd probably
write the loop as:
-----
forall (i, c) in zip(VectorCyclicDomain, C) {
  for (j, a) in zip(VectorSpace, A(i, ..)) {
    c += a * B(j);
  }
}
-----
This also avoids the need for the 'with' clause, since the inner for loop
is serial and therefore won't race on its accesses to 'c'.
> (used chplvis, nice tool by the way).
Great, glad you like it! I'll let its author (Phil Nelson, WWU) know.
I haven't pre-checked my loop rewrites to verify that they result in less
communication, so I'll be curious what you find.
Hope this helps, and let us know if it leads to follow-up questions,
-Brad