Yeah, this is something I've fantasized about for a while (essentially "let
us point out what might be some obvious performance-related things you might
consider..." -- along the lines of automating some of the long email I sent
out this morning).  I just haven't wanted to (or at least haven't found the
time to) implement it. :)  Would be fair game for a GitHub feature request
issue...
-Brad

On Fri, 31 Mar 2017, Nikhil Padmanabhan wrote:

Brad,

I agree this should be opt-in...

The more I think about it, an opt-in flag that produces a set of
warnings/comments about the code sounds quite useful.  Most people are
probably not used to programming in a PGAS model, and Chapel has its own
idioms as well, so something like this would be useful when a user finds
the code not performing quite as they expect.

One could also use this to communicate places where the code/communication
is not as optimized as it might be -- e.g., when writing a new distribution,
if there are known slow steps that are works in progress, the developer
could warn the user...

There is, of course, the question of what the set of warnings/informational
messages should be, whether there should be levels of them, etc...

Cheers,
-- Nikhil

---------------------------------
Nikhil Padmanabhan
[email protected]

On Fri, Mar 31, 2017 at 11:58 AM, Brad Chamberlain <[email protected]> wrote:

Hi Nikhil --

I agree that this could be useful as an opt-in warning, and it seems that
it would not be too hard to implement.  I'd worry about it being an
always-on warning because I think there are many cases where it's
legitimate to zip distributed and non-distributed things.  Maybe with
experience we could tune the warning to distinguish between "more likely to
be legitimate" and "more likely to be a mistake" cases...?

-Brad

On Thu, 30 Mar 2017, Nikhil Padmanabhan wrote:

Brad,

Would it be possible/useful for zip to throw a warning (ideally at compile
time) if the distributions of the vectors are not the same...?  In fact,
one might even extend that to cases like C = A + B...

One might imagine this being an opt-in warning selected by a flag like the
Intel compiler's opt-report... and adding more "hints" to the user that
flag when certain optimizations are done or are being thwarted, or when the
compiler thinks certain operations might generate excessive communication
or non-local accesses...

What do you think?
-- Nikhil

---------------------------------
Nikhil Padmanabhan
[email protected]

On Wed, Mar 29, 2017 at 7:51 PM, Brad Chamberlain <[email protected]> wrote:

Hi Barry --

My response falls into three parts, where maybe only the third one really
matters to you -- so depending on your level of interest, you may want to
skip ahead to part 3...

Part 1: The opening regret
--------------------------

I have to admit that the phrase "matrix-vector product" makes me cringe in
slight fear in the Chapel context, because I believe we're lacking some key
features that would make it far better/more scalable than it is today.

If you're interested in diving into some increasingly ancient history, we
studied (sparse) matrix-vector multiplication at length in our previous
work on ZPL and did reasonably well there compared to MPI.  When we started
on Chapel, we made some core design decisions with the goal of closing the
remaining scalability gaps we had with ZPL; yet we put off some other
features that I believe are crucial for mat-vect mults (this was primarily
because Chapel's early years focused more on irregular computations than on
matrix-based ones).  We've actually been returning to these features
recently (motivated by some users working on similar operations), but have
not quite landed them yet.

Just to put a name to them, I think the main concepts we're missing are:

* partial reductions -- the ability to reduce a multi-dimensional array
  expression along a subset of its dimensions rather than all of them.

* replicated singleton dimensions -- i.e., {*, 1..n} might represent a
  replicated row that spans columns 1..n.  (These were called flood and
  grid dimensions in ZPL, where the former were kept coherent by the
  language semantics and the latter were not.)

Without these features, matrix-vector products can still be expressed,
simply not as efficiently (and/or elegantly) as they could otherwise...
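(To make the first of these concrete: today, reducing along just one
dimension has to be spelled out as a loop over the dimension you keep.  A
minimal sketch, using a small local array rather than anything distributed:)

------
config const n = 8;

var A : [0..#n, 0..#n] int;
A = 1;

// "partial reduction" by hand: collapse only the second (column) dimension
var rowSums : [0..#n] int;
forall i in 0..#n do
  rowSums[i] = + reduce A[i, ..];

writeln(rowSums);  // each entry is n
------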
Part 2: Your declarations
-------------------------

In the end, I don't think there are any fundamental problems with your
declarations, but in thinking through them myself, these are the notes I
put together:

> - A is a square matrix (dimension x dimension), where each row is
>   cyclically distributed to the locales

You implemented this as:

------
const MatrixDomain: domain(2) dmapped BlockCyclic(startIdx = MatrixSpace.low,
                                                  blocksize = (1, dimension))
                  = MatrixSpace;

var A : [MatrixDomain] int;
------

If you only want to distribute the first dimension cyclically (i.e., you'll
always use a block size of 1), my suggestion would be to use the Cyclic
distribution.  The typical trick to get it distributed in just one
dimension is to pass in a 2D targetLocales array that is degenerate in one
dimension (you could do this with BlockCyclic as well; you'd just be paying
unnecessary blocking overhead).  So here's how I coded up A such that the
rows are distributed cyclically and the columns not at all (along with some
debugging code):

------
//
// create a numLocales x 1 view of the Locales array:
//
const grid = reshape(Locales, {0..#numLocales, 0..0});

writeln(grid, "\n");

//
// Distribute MatrixSpace in a Cyclic manner.  Because 'grid' is degenerate
// in the second dimension, only the rows will be cyclic.
//
const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
                                                targetLocales = grid);

var A : [MatrixDomain] int;

//
// Make sure that ownership is as expected:
//
forall a in A do
  a = here.id;
writeln(A, "\n");
------

As a stylistic note, if you're comfortable leaving type declarations off,
note that I've swung the 'dmapped' clause from the type to the value
position, which I've come to find much simpler to read over time.  (In this
way, you could even eliminate MatrixSpace if you wanted by inlining it
here, and avoid problems with accidentally using one rather than the
other.)

> - B is a vector (dimension), replicated on each locale

Here, you did:

------
const VectorSpace = {0..#dimension};
const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
                             = VectorSpace;
------

which I think is reasonable.  I must admit that I don't use the
ReplicatedDist often enough to be familiar with it without experimenting
and refreshing my memory, but a quick check makes this look like it's going
to work as you want.
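(In case it's useful, here's a minimal sketch of one way to fill in every
locale's copy of B.  As I understand ReplicatedDist, each locale gets its
own independent copy and the copies aren't kept coherent automatically, so
each copy needs to be written from its own locale; the 'dimension' default
here is just for illustration:)

------
use ReplicatedDist;

config const dimension = 8;

const VectorSpace = {0..#dimension};
const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
                             = VectorSpace;

var B : [VectorReplicatedDomain] int;

// fill every locale's copy of B; from a given locale, B(j) refers to
// that locale's local copy
coforall loc in Locales do
  on loc do
    for j in VectorSpace do
      B(j) = j;
------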
[Returning to the theme of part 1, I'll mention that my hope is that we'll
soon be able to support this as a row vector '{*, 0..#dimension}' that
shares the distribution of A.  E.g., imagine expressing this long-form as:

  const RowVect = {*, 0..#dimension} dmapped MatrixDomain.dist;

or more succinctly, as:

  const RowVect = MatrixDomain[*, ..];

(that is, as a slice of MatrixDomain).]

> - C is a vector (dimension), each value is cyclically distributed to
>   match the row of A

Here, my suggestion would be to use the Cyclic distribution again and, if
it were me, I'd also be inclined to have it share the same distribution as
A rather than repeating all that distribution information.  This can be
written most simply as:

----
const VectorCyclicDomain = MatrixDomain[.., 0];

var C : [VectorCyclicDomain] int;
----

which essentially says: "Perform a rank-change slice (intersection) of
MatrixDomain, taking all of its indices in the first dimension and
collapsing its second dimension down to index 0."

Part 3: The real problem
------------------------

> The problem is when I use 2 locales, the second one is mainly doing
> communication

The problem turns out to be the way you've written the loop, though it's a
subtle issue and not an uncommon trap for new users to fall into:

-----
forall (i, c) in zip({0..#dimension}, C(..)) {
  forall (j, a) in zip({0..#dimension}, A(i, ..)) with (+ reduce c) {
    c += a * B(j);
  }
}
-----

The implementation of a forall loop (how many tasks to create and where
should they run?) is controlled by the thing that's being iterated over.
Whenever a forall loop iterates over a zip() expression, the first thing in
the list (the "leader") controls the implementation.  So these loops are
iterating over the anonymous domain {0..#dimension}, which is not
distributed (because nothing has said it should be).  For that reason, all
tasks created by the forall loop will be local to the original locale
(locale #0).  That's why you're seeing other locales not getting much work
(and probably a lot of communication between them and locale #0 for all the
non-local accesses...).

In order to distribute the tasks across locales using a forall loop, you'll
want to iterate over one of your distributed domains or arrays.  This will
cause the distribution of that domain/array to create the tasks, and it'll
do so using a task per core on each locale to which the domain/array is
distributed.

(You could also use an on-clause within the body of your loop to steer each
iteration to the appropriate locale, but this would result in creating an
active message per iteration, which will kill your performance... better to
let the forall create the on-clauses for you only as necessary.)
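(One way to see the leader's effect directly is to flip the order of a zip
in a small experiment -- a sketch, reusing the ownership-debugging idiom
from above on a 1-D Cyclic domain:)

------
use CyclicDist;

config const n = 8;

const S = {0..#n};
const D = S dmapped Cyclic(startIdx = S.low);
var owner : [D] int;

// local range leads: all tasks run on locale 0 (remote elements are
// written via communication)
forall (i, o) in zip(0..#n, owner) do
  o = here.id;
writeln(owner);   // all zeros

// distributed array leads: tasks run on the locales that own the elements
forall (o, i) in zip(owner, 0..#n) do
  o = here.id;
writeln(owner);   // cyclic pattern of locale ids
------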
So you should see your program using more computational resources on
locales > 0 if you rewrite this as:

forall (i, c) in zip(VectorCyclicDomain, C) {
  forall (j, a) in zip(VectorSpace, A(i, ..)) with (+ reduce c) {
    c += a * B(j);
  }
}

A few notes on the above:

* 'C(..)' is equivalent to 'C' but more expensive to compute (we don't
  optimize the identity slice away), so you'll get better performance if
  you just write it as 'C'.

* similarly, for the purposes of these loops, iterating over
  '{0..#dimension}' is equivalent to iterating over '0..#dimension', which
  is equivalent to iterating over 'VectorSpace', since none of them are
  distributed.  So it can be better to iterate over the named, pre-created
  thing rather than (logically) creating a new range (cheaper) or domain
  (more expensive) per inner-loop iteration.  (This _may_ get hoisted by
  the compiler as loop-invariant code, but why take the chance when you've
  already named it so nicely...?)

But, before stopping there, note that the use of the nested forall is
likely to add overhead without much benefit.  Specifically, the outer loop
will already create a task per processor core that 'VectorCyclicDomain' is
distributed to (i.e., all of them), so your code will be fully parallel
without the additional parallel inner loop.  For this reason, I'd probably
write the loop as:

forall (i, c) in zip(VectorCyclicDomain, C) {
  for (j, a) in zip(VectorSpace, A(i, ..)) {
    c += a * B(j);
  }
}

This also avoids the need for the 'with' clause since the for loop is
serial, so it won't race on its accesses to 'c'.

> (used chplvis, nice tool by the way).

Great, glad you like it!  I'll let its author (Phil Nelson, WWU) know.

I haven't pre-checked my loop rewrites to verify that the loop now results
in less communication, so I'll be curious what you find.

Hope this helps, and let us know if it leads to follow-up questions,
-Brad
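(For reference, here are the pieces discussed above assembled into one
small, self-contained program.  This is only a sketch: the 'use'
statements, the 'dimension' default, and the test values for A and B are
assumptions, and like the loop rewrites above it hasn't been
performance-checked.)

------
use CyclicDist, ReplicatedDist;

config const dimension = 8;

const MatrixSpace = {0..#dimension, 0..#dimension};
const VectorSpace = {0..#dimension};

// numLocales x 1 view of Locales, so only the rows get distributed
const grid = reshape(Locales, {0..#numLocales, 0..0});

// A's rows cyclically distributed; C shares A's distribution via a
// rank-change slice; B replicated on every locale
const MatrixDomain = MatrixSpace dmapped Cyclic(startIdx = MatrixSpace.low,
                                                targetLocales = grid);
const VectorCyclicDomain = MatrixDomain[.., 0];
const VectorReplicatedDomain : domain(1) dmapped ReplicatedDist()
                             = VectorSpace;

var A : [MatrixDomain] int;
var B : [VectorReplicatedDomain] int;
var C : [VectorCyclicDomain] int;

// test values: A is all ones; every locale's copy of B holds 0..dimension-1
forall a in A do
  a = 1;
coforall loc in Locales do
  on loc do
    for j in VectorSpace do
      B(j) = j;

// outer forall distributes tasks via C's cyclic distribution; the inner
// loop is serial, so no reduce intent is needed on 'c'
forall (i, c) in zip(VectorCyclicDomain, C) {
  for (j, a) in zip(VectorSpace, A(i, ..)) {
    c += a * B(j);
  }
}

writeln(C);  // each entry should be dimension*(dimension-1)/2
------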
