Hi Nikhil --

I agree that this could be useful as an opt-in warning, and it seems that it would not be too hard to implement.  I'd worry about it being an always-on warning because I think there are many cases where it's legitimate to zip distributed and non-distributed things.  Maybe with experience we could tune the warning to distinguish between "more likely to be legitimate" and "more likely to be a mistake" cases...?
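For instance, here's the sort of mixed zippering I'd hate to warn about by default -- a minimal, untested sketch where 'n', 'D', 'A', and 'localBuf' are just placeholder names I made up for illustration:

------
use BlockDist;

config const n = 16;

const D = {1..n} dmapped Block(boundingBox={1..n});
var A : [D] real;            // Block-distributed
var localBuf : [1..n] real;  // not distributed

// The distributions differ, but this is a perfectly reasonable thing to
// do: A leads, so iterations run on the locales that own A's elements,
// and each one reads the corresponding element of the local buffer.
forall (a, b) in zip(A, localBuf) do
  a = 2 * b;
------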
-Brad

On Thu, 30 Mar 2017, Nikhil Padmanabhan wrote:

> Brad,
>
> Would it be possible/useful for zip to throw a warning (ideally at compile time) if the distributions of the vectors are not the same...?  In fact, one might even extend that to cases like C = A + B...
>
> One might imagine this being an opt-in warning selected by a flag like the Intel Compiler's opt-report..., and adding in more "hints" to the user that flag when certain optimizations are done or thwarted, or when the compiler thinks certain operations might generate excessive communication or non-local accesses...
>
> What do you think?
> -- Nikhil
>
> ---------------------------------
> Nikhil Padmanabhan
> [email protected]
>
> On Wed, Mar 29, 2017 at 7:51 PM, Brad Chamberlain <[email protected]> wrote:
>
>> Hi Barry --
>>
>> My response falls into three parts, where maybe only the third one really matters to you -- so depending on your level of interest, you may want to skip ahead to part 3...
>>
>> Part 1: The opening regret
>> --------------------------
>>
>> I have to admit that the phrase "matrix-vector product" makes me cringe in slight fear in the Chapel context, because I believe we're lacking some key features that would make it far better/more scalable than it is today.
>>
>> If you're interested in diving into some increasingly ancient history, we studied (sparse) matrix-vector multiplication at length in our previous work on ZPL and did reasonably well there compared to MPI.  When we started on Chapel, we made some core design decisions with the goal of closing the remaining scalability gaps we had with ZPL, yet we put off some other features that I believe are crucial for mat-vect mults (this was primarily because Chapel's early years focused more on irregular computations than on matrix-based ones).  We've actually been returning to these features recently (motivated by some users working on similar operations), but have not quite landed them yet.
>>
>> Just to put a name to them, I think the main concepts we're missing are:
>>
>> * partial reductions -- the ability to reduce a multi-dimensional array expression along a subset of its dimensions rather than all of them.
>>
>> * replicated singleton dimensions -- i.e., {*, 1..n} might represent a replicated row that spans columns 1..n.  (These were called flood and grid dimensions in ZPL, where the former were kept coherent by the language semantics and the latter were not.)
>>
>> Without these features, matrix-vector products can still be expressed, simply not as efficiently (and/or elegantly) as they could otherwise...
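>>
>> [For concreteness, here's a rough, purely local sketch of the kind of workaround I mean -- each row of the product is reduced "by hand" with a whole-row reduction, where a partial reduction would collapse that dimension in a single expression.  The names and sizes are just made up for illustration, and I haven't measured this:
>>
>> ------
>> config const n = 8;
>>
>> var A : [1..n, 1..n] real = 1.0;   // dense matrix
>> var x : [1..n] real = 2.0;         // input vector
>> var y : [1..n] real;               // result vector
>>
>> // today: one explicit whole-row reduction per row of A
>> forall i in 1..n do
>>   y[i] = + reduce (A[i, ..] * x);
>> ------
>> ]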
>>
>> Part 2: Your declarations
>> -------------------------
>>
>> In the end, I don't think there are any fundamental problems with your declarations, but in thinking through them myself, these are the notes I put together:
>>
>>> - A is a square matrix (dimension x dimension), where each row is cyclically distributed to the locales
>>
>> You implemented this as:
>>
>> ------
>> const MatrixDomain : domain(2) dmapped BlockCyclic(startIdx = MatrixSpace.low,
>>                                                    blocksize = (1, dimension))
>>                    = MatrixSpace;
>>
>> var A : [MatrixDomain] int;
>> ------
>>
>> If you only want to distribute the first dimension cyclically (i.e., you'll always use a block size of 1), my suggestion would be to use the Cyclic distribution.  The typical trick to get it distributed in just one dimension is to pass in a 2D targetLocales array that is degenerate in one dimension (you could do this with BlockCyclic as well; you'd just be paying unnecessary blocking overhead).  So here's how I coded up A such that the rows are distributed cyclically and the columns not at all (along with some debugging code):
>>
>> ------
>> //
>> // create a numLocales x 1 view of the Locales array:
>> //
>> const grid = reshape(Locales, {0..#numLocales, 0..0});
>> writeln(grid, "\n");
>>
>> //
>> // Distribute MatrixSpace in a Cyclic manner.  Because 'grid' is
>> // degenerate in the second dimension, only the rows will be cyclic.
>> //
>> const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
>>                                                 targetLocales = grid);
>>
>> var A : [MatrixDomain] int;
>>
>> //
>> // Make sure that ownership is as expected:
>> //
>> forall a in A do
>>   a = here.id;
>> writeln(A, "\n");
>> ------
>>
>> As a stylistic note, if you're comfortable leaving type declarations off, note that I've swung the 'dmapped' clause from the type to the value location, which I've come to find much simpler to read over time.  (In this way, you could even eliminate MatrixSpace if you wanted by inlining it here, and avoid problems with accidentally using one rather than the other.)
>>
>>> - B is a vector (dimension), replicated on each locale
>>
>> Here, you did:
>>
>> ------
>> const VectorSpace = {0..#dimension};
>> const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
>>                              = VectorSpace;
>> ------
>>
>> which I think is reasonable.  I must admit that I don't use the ReplicatedDist often enough to be familiar with it without experimenting and refreshing my memory, but a quick check makes this look like it's going to work as you want.
>>
>> [Returning to the theme of part 1, I'll mention that my hope is that we'll soon be able to support this as a row-vector '{*, 0..#dimension}' that shares the distribution of A.  E.g., imagine expressing this long-form as:
>>
>>   const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;
>>
>> or more succinctly, as:
>>
>>   const RowVect = MatrixDomain[*, ..];
>>
>> (that is, as a slice of MatrixDomain).]
>>
>>> - C is a vector (dimension), each value is cyclically distributed to match the row of A
>>
>> Here, my suggestion would be to use the Cyclic distribution again and, if it were me, I'd also be inclined to have it share the same distribution as A rather than repeating all that distribution information.  This can be written most simply as:
>>
>> ------
>> const VectorCyclicDomain = MatrixDomain[.., 0];
>>
>> var C : [VectorCyclicDomain] int;
>> ------
>>
>> which essentially says: "Perform a rank-change slice (intersection) of MatrixDomain, taking all of its indices in the first dimension and collapsing its second dimension down to index 0."
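>>
>> [If it helps to see those suggestions in one place, here's an untested sketch of the declarations assembled together.  I'm assuming 'dimension' and 'MatrixSpace' are defined roughly as in your original code, and the module names are from memory, so double-check them against your Chapel version:
>>
>> ------
>> use CyclicDist, ReplicatedDist;   // module names may vary by version
>>
>> config const dimension = 8;
>>
>> const MatrixSpace = {0..#dimension, 0..#dimension};
>> const VectorSpace = {0..#dimension};
>>
>> // numLocales x 1 grid of locales so that only the rows get distributed
>> const grid = reshape(Locales, {0..#numLocales, 0..0});
>>
>> const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
>>                                                 targetLocales = grid);
>> const VectorReplicatedDomain = VectorSpace dmapped ReplicatedDist();
>> const VectorCyclicDomain = MatrixDomain[.., 0];
>>
>> var A : [MatrixDomain] int;            // rows cyclic across locales
>> var B : [VectorReplicatedDomain] int;  // one copy per locale
>> var C : [VectorCyclicDomain] int;      // distributed to match A's rows
>> ------
>> ]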
>>
>> Part 3: The real problem
>> ------------------------
>>
>>> The problem is when I use 2 locales, the second one is mainly doing communication
>>
>> The problem turns out to be the way you've written the loop, though it's a subtle issue and not an uncommon trap to fall into among new users:
>>
>> ------
>> forall (i, c) in zip({0..#dimension}, C(..)) {
>>   forall (j, a) in zip({0..#dimension}, A(i, ..)) with (+ reduce c) {
>>     c += a * B(j);
>>   }
>> }
>> ------
>>
>> The implementation of a forall loop (how many tasks to create, and where should they run?) is controlled by the thing that's being iterated over.  Whenever a forall loop iterates over a zip() expression, the first thing in the list (the "leader") controls the implementation.  So these loops are iterating over the anonymous domain {0..#dimension}, which is not distributed (because nothing has said it should be).  For that reason, all tasks created by the forall loop will be local to the original locale (Locale #0).  That's why you're seeing other locales not getting much work (and probably a lot of communication between them and Locale #0 for all the non-local accesses...).
>>
>> In order to distribute the tasks across locales using a forall loop, you'll want to iterate over one of your distributed domains or arrays.  This will cause the distribution of that domain/array to create the tasks, and it'll do so using a task per core on each locale to which the domain/array is distributed.
>>
>> (You could also use an on-clause within the body of your loop to steer each iteration to the appropriate locale, but this would result in creating an active message per iteration, which will kill your performance... better to let the forall create the on-clauses for you only as necessary.)
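>>
>> [A quick way to see the leader rule in action is to label each element with the locale that computed it -- an untested sketch, where 'n', 'D', and 'A' are just placeholder names:
>>
>> ------
>> use CyclicDist;
>>
>> config const n = 16;
>> const D = {0..#n} dmapped Cyclic(startIdx = 0);
>> var A : [D] int;
>>
>> // distributed leader: iterations run on the locales that own A's elements
>> forall (a, i) in zip(A, 0..#n) do
>>   a = here.id;
>> writeln(A);
>>
>> // non-distributed leader: every iteration runs on Locale #0
>> forall (i, a) in zip(0..#n, A) do
>>   a = here.id;
>> writeln(A);
>> ------
>> ]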
>>
>> So you should see your program using more computational resources on locales > 0 if you rewrite this as:
>>
>> forall (i, c) in zip(VectorCyclicDomain, C) {
>>   forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
>>     c += a * B(j);
>>   }
>> }
>>
>> A few notes on the above:
>>
>> * 'C(..)' is equivalent to 'C' but more expensive to compute (we don't strength-optimize the identity slice), so you'll get better performance if you just write it as 'C'.
>>
>> * similarly, for the purposes of these loops, iterating over '{0..#dimension}' is equivalent to iterating over '0..#dimension', which is equivalent to iterating over 'VectorSpace', since none of them are distributed.  So, it can be better to iterate over the named, pre-created thing rather than (logically) creating a new range (cheaper) or domain (more expensive) per inner loop iteration.  (This _may_ get hoisted by the compiler as loop-invariant code, but why take the chance when you've already named it so nicely...?)
>>
>> But, before stopping there, note that the use of the nested forall is likely to add overhead without much benefit.  Specifically, the outer loop will already create a task per processor core that 'VectorCyclicDomain' is distributed to (i.e., all of them), so your code will be fully parallel without the additional parallel inner loop.  For this reason, I'd probably write the loop as:
>>
>> forall (i, c) in zip(VectorCyclicDomain, C) {
>>   for (j, a) in zip(VectorSpace, A(i, ..)) {
>>     c += a * B(j);
>>   }
>> }
>>
>> This also avoids the need for the 'with' clause since the for loop is serial, so it won't race on its accesses to 'c'.
>>
>>> (used chplvis, nice tool by the way).
>>
>> Great, glad you like it!  I'll let its author (Phil Nelson, WWU) know.
>>
>> I haven't pre-checked my loop rewrites to verify that the loop now results in less communication, so I'll be curious what you find.
>>
>> Hope this helps, and let us know if it leads to follow-up questions,
>> -Brad
