Hey Brad,

Elliot reminded me that I haven't been working on Chapel recently. I can't seem to reproduce your data. I tried an up-to-date git installation as well as 1.15.0 (both with my addition for gasnetrun_psm). I attached my version. If I can verify that you get decent performance with my code, then I'll know something is wrong with the way the job is being launched.
Thanks again!

- Barry

On Mon, Apr 3, 2017 at 10:54 AM, Barry Moore <[email protected]> wrote:

> Brad,
>
> Going to spend some time with this sometime this week. Definitely a lot to think about.
>
> - Barry
>
> On Fri, Mar 31, 2017 at 1:11 PM, Brad Chamberlain <[email protected]> wrote:
>
>> Hi Barry --
>>
>> I realized some things last night while thinking about this a bit more, and experimented with some other things quickly this morning which reduced the communication significantly, though there are still some aspects that I can't explain off the top of my head. Below, I'll walk through the series of changes I made and the impact on the max tasks / max comms counts.
>>
>> I'll also mention here at the outset that we've long had an intention / need to write a Chapel performance guide, but have never kicked off that effort. This discussion is a good example of the kinds of things that we developers are likely to be more aware of than users, where some of the issues are "that's the way we think Chapel has to be" and others are "that's the way things are today, but they should get better over time." If you or anyone else is interested in starting to construct a Chapel performance programmer's guide, I think that would be of great value to the community.
>>
>> I started with your version 3, and ran the size-10 problem on 3 locales. For the body of this mail, I'm just going to quote the max tasks / max comms counts, which I was using as a quick check to see whether I was helping or hurting:
>>
>> 205 / 1050
>>
>> Next, I wondered what the impact of --fast would be. Recall that I said to use --fast when doing performance timings; it has the effect of turning off a bunch of runtime checks that we do for safety by default, which aren't particularly optimized for anything. This took a bite out of communications, though my task counts went up slightly:
>>
>> 207 / 748
>>
>> Next, I tried changing the domain in your zippered iteration to a range (dropped the curly brackets). Generally speaking, ranges are lighter-weight than 1D domains, so are preferable when either could be used. This had only a modest impact (as you'd hope... I'm a little surprised that it had any impact at all, anticipating that it'd probably reduce scalar communications but not tasking... though I also wonder if there might be any nondeterminism in these numbers from run to run... I didn't explore that at all). In any case, I think it's stylistically preferable to use ranges over local 1D domains if there isn't any deep reason to prefer the domain, so I suggest it anyway:
>>
>> 206 / 743
>>
>> Next, I tried changing your rank-change slice into a 2D slice. Rank-changes are generally more expensive than full slices because they require creating a dummy 1D domain and distribution to support all of the expression's possible operations. One could imagine a Chapel compiler that could analyze which operations are used and cheapen the implementation based on what's needed, but today we don't do that. This had a notable impact:
>>
>> 160 / 453
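(For concreteness, a rough sketch of what the loop looks like at this point in the walkthrough -- a range in the zip instead of the 1D domain, and a rank-preserving 2D slice A[i..i, ..] in place of the rank-change A(i, ..). The names come from the code elsewhere in the thread; the exact expressions are approximate:)

    // loop after the range + 2D-slice changes, before introducing localSlice()
    forall (i, c) in zip(vectorCyclic, C) {
      // 0..#dimension is a range rather than the domain {0..#dimension};
      // A[i..i, ..] keeps both dimensions instead of rank-changing down to 1D
      for (a, j) in zip(A[i..i, ..], 0..#dimension) {
        c += a * B(here.id, j);
      }
    }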
>> Next, I used the feature that occurred to me last night: when we make a (rank-change or rank-preserving) slice of a distributed domain or array, we create a distributed data structure such that each locale knows what portion of that slice it owns. This makes slicing distributed things reasonably expensive, particularly if the result is only going to be used locally. So in cases where a slice is completely local to a single locale (as in your code), that's overkill. There's currently a localSlice() method that can be used to assert that a slice is local to the current locale and doesn't involve any remote elements. This generates only local descriptors to describe the slice expression, and therefore saves a lot of multi-locale overhead:
>>
>> 5 / 61
>>
>> (It's worth noting that we've done virtually nothing to optimize the creation of distributed slices, and it's quite likely that there's low-hanging fruit there that could reduce the gap between the previous two sets of numbers. But it'd take a lot more work to get the compiler to the point where it could automatically reason about slices and determine that a vanilla slice is guaranteed to be local to a locale and therefore could be strength-reduced to a localSlice()... So in the meantime we fall back on .localSlice() as a programmer-guided aid and are also exploring more elegant ways of expressing such patterns...)
>>
>> From there I tried a few other things, but didn't manage to reduce it further. So here's my final loop nest (noting that the fix for the 1-tuple bug you made us aware of yesterday is now on master and I'm using it here):
>>
>>   forall (i, c) in zip(vectorCyclic, C) {
>>     for (a, j) in zip(A.localSlice(i..i, ..), 0..#dimension) {
>>       c += a * B(here.id, j);
>>     }
>>   }
>>
>> I'm attaching a screenshot of the chplvis output similar to what you sent yesterday. As you can see, we're now in a situation where locales 1 and 2 only do a single on-call back to locale 0 (presumably to tell it that they're done with their portion of the forall loop?) and no gets or puts. So that's a good sign. Locale 0 does an "on" for each of locales 1 and 2, which is required to kick off the forall loop. It also does 60 gets which I can't explain yet... That would require more investigation (but I'm going to move on to the rest of my email for now).
>>
>> As I mentioned a mail or two ago, we can't really expect this code to ever be communication-free because the forall loop itself is going to need to fire up communication across locales. But it seems plausible that we should still be able to tighten things up further (where I'm not sure if the "we" in question is "we the users" or "we the developers"). If I were to continue with this, I'd probably start picking stuff out of the bracketed portion of the code to simplify it ("What do just the loops cost with no computation in the body?" "What does just the outer loop cost?" "What if I iterate over one thing rather than zippering?") to see if that would shed light on what is causing the gets. Alternatively, one could dive down to a lower level and start investigating the communication in the generated code more manually...
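(One rough way to run that kind of experiment, using the standard CommDiagnostics module -- the loop and names are lifted from the thread, and exactly which pieces to strip out is a judgment call:)

    use CommDiagnostics;

    // count communication for just the loop structure, with no work in the body
    startCommDiagnostics();
    forall (i, c) in zip(vectorCyclic, C) {
      // empty body: any gets/puts/ons counted here come from the loop machinery itself
    }
    stopCommDiagnostics();
    writeln(getCommDiagnostics());   // per-locale counts of gets, puts, on-calls, ...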
>> I also wanted to pop back to my very first response on this thread, where I asserted that I think we really want partial reductions to express this idiom both cleanly and efficiently. This kept coming back to me while looking at this idiom because, generally speaking, for a computation like this, we'd like to see a single forall loop over the entire problem space (Matrix) rather than all the per-dimension looping and slicing that you've had to resort to here. The partial reduction would enable that, where my current mental draft of it would end up looking something like this:
>>
>>   C = + reduce(resultShape=vectorCyclic)
>>         forall (i,j) in matrixCyclic do
>>           A(i,j) * B(here.id, j);
>>
>> or, using the shorthand for forall loops:
>>
>>   C = + reduce(resultShape=vectorCyclic)
>>         [(i,j) in matrixCyclic] A(i,j) * B(here.id, j);
>>
>> which says "add up the expressions generated by this forall loop and reduce them such that the result's shape is vectorCyclic." The implementation would then compare the dimensions of matrixCyclic and vectorCyclic and see that it should add values along the rows to create a column result.
>>
>> Then, if we were to introduce the replicated dimensions that I mentioned and used them to declare B:
>>
>>   var B: [matrixCyclic[.., *]] int;
>>
>> we could write the loop even more simply, since both A and B would accept 2D indices (and B's second dimension would automatically select the localized value):
>>
>>   C = + reduce(resultShape=vectorCyclic)
>>         [ij in matrixCyclic] A(ij) * B(ij);
>>
>> or, even better:
>>
>>   C = + reduce(resultShape=vectorCyclic) (A * B);
>>
>> Hopefully this illustrates why I think those features are so crucial to expressing this computation cleanly (and you'll have to trust me that it should only help the performance as well, if the reasons aren't clear).
>>
>> Hope this is helpful/interesting,
>> -Brad
>>
>> On Fri, 31 Mar 2017, Barry Moore wrote:
>>
>>> Brad,
>>>
>>> I have some versions on github (https://github.com/barrymoo/chapel-playground/tree/master/matrix-vector-multiply/versions):
>>>
>>> 1. version2.chpl - uses the Replicated distribution for B; I tried both for and forall inner loops
>>> 2. version3.chpl - uses a 3 x dimension domain for B distributed Cyclically, similarly to A
>>>
>>> The github repo is a little messy because I was running a bunch of variants one directory up, but version3.chpl was used to generate the image. I suspected (1) from some of your other emails, which is why I tried this version too. It's nice to know I am thinking about this correctly, even if it is only a simple problem.
>>>
>>> https://github.com/chapel-lang/chapel/pull/5912
>>>
>>> I will grab the request and test it on my code sometime this weekend.
>>>
>>> A side note: I did notice that using the ReplicatedDist had *slightly* more overhead (data transferred) than the by-hand replicated version. Probably some additional data carried by the structure to describe what locales it is replicated over? I don't think it is an issue, but I thought it was interesting.
>>>
>>> Thanks again!
>>>
>>> Barry
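(To make that comparison concrete, hedged sketches of the two B layouts being compared -- the constructor spellings follow the 1.15-era CyclicDist/ReplicatedDist documentation as best I can tell and may need adjusting for other releases, and the Cyclic parameters here are an assumption:)

    use CyclicDist, ReplicatedDist;

    config const dimension = 10;

    // version3-style, by-hand replication: one row of B per locale, read as B(here.id, j)
    const byHandSpace = {0..#numLocales, 0..#dimension} dmapped Cyclic(startIdx=(0,0));
    var BbyHand: [byHandSpace] int;

    // version2-style: the Replicated distribution keeps one private copy per locale,
    // read simply as Brepl(j) on whichever locale is executing
    // (the class has also been spelled Replicated() in some releases)
    const replSpace = {0..#dimension} dmapped ReplicatedDist();
    var Brepl: [replSpace] int;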
>>>
>>> On Thu, Mar 30, 2017 at 8:21 PM, Brad Chamberlain <[email protected]> wrote:
>>>
>>>> I'm timing out today without digging into this, but will hope to look at it more tomorrow. If your profiling calls are still outside of the forall loop, we shouldn't expect it to be communication-free because the loop logically starts on locale 0 and has to create tasks on the other locales; but at a sniff, the image you sent does seem too high, at least for a single loop. My two guesses would be: (1) maybe the Replicated distribution is not working as well as it should (as I said, I'm not all that familiar with it, and I don't think it sees much action in practice); (2) maybe I led you astray in some way and we're hauling some global data structure across the network from locale 0 more than we should be...
>>>>
>>>> Would you send me the final version of the code from which you created this graph (if it changed from the last version modulo my suggested edits) as well as the problem size at which you ran it?
>>>>
>>>> BTW, thanks for pointing out that tuple indexing bug... I'm embarrassed that it slipped by us (me), and we've got a fix that we'll hope to merge tonight once it passes some more testing:
>>>>
>>>> https://github.com/chapel-lang/chapel/pull/5912
>>>>
>>>> Thanks again,
>>>> -Brad
>>>>
>>>> On Thu, 30 Mar 2017, Barry Moore wrote:
>>>>
>>>>> Hey Guys,
>>>>>
>>>>> Definitely better now, but there seems to be a lot of communication going on. Maybe I am misinterpreting this plot. I would have guessed that by forcing everything to be local on the node I wouldn't be doing any communication inside the work loop.
>>>>>
>>>>> Any thoughts? The rabbit hole gets deep fast, ha.
>>>>>
>>>>> - Barry
>>>>>
>>>>> On Thu, Mar 30, 2017 at 4:31 PM, Brad Chamberlain <[email protected]> wrote:
>>>>>
>>>>>>> Had a little time to cook up Brad's suggestion as a full program (pasted in [code][/code]). It is much better with respect to communication (not surprisingly, based on the comments above). Also, it looks a LOT cleaner.
>>>>>>
>>>>>> Good!
>>>>>>
>>>>>>> 1. Stylistically, camelCase everywhere? I noticed sometimes with domains, spaces, and locales you use CamelCase (there is probably a name for this, but I don't want to look it up). Is there a suggestion from you guys?
>>>>>>
>>>>>> Most of us on the project use camelCase, but not exclusively -- underscores_still_appear at times. I'm probably an outlier, but I tend to use InitialCapsCamelCase for distributed domains and arrays because, in my mind, they are the big-ticket items from a cost/weight perspective and deserve being distinguished in some way. Sometimes that bleeds over to using initial caps for my "main" domains/arrays that define the overall problem size/space in non-distributed programs as well. But again, I'd say use personal preference -- I'm not a big one for imposing naming/stylistic conventions on others (for better or worse), and to date there is no Chapel style guide.
>>>>>>
>>>>>>> 2. In this line `for (j, a) in zip({0..#dimension}, A(i(1), ..)) {`: I use 0 indexing everywhere, but I have to use i(1) in the for loop. It feels and looks awkward to me. Maybe it is unavoidable, but it will be hard for me to remember.
>>>>>>
>>>>>> We tried to make Chapel as index-neutral as possible (i.e., arrays can start at 0 or 1 or -3 or lo or whatever), but in a few places we had to make a choice, where tuple indexing (what you're doing here) is one of the main ones.
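(Concretely, the choice in question is that tuple components are numbered from 1 -- a tiny example of my own, reflecting the 1.15-era behavior:)

    const ij = (3, 7);              // a 2-tuple
    writeln(ij(1), " ", ij(2));     // prints "3 7": tuple indexing starts at 1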
>>>>>> I've tried to concoct ways of making tuple indexing either index-neutral or at least user-configurable, but haven't yet come up with anything appealing -- if you have ideas, send them in.
>>>>>>
>>>>>> In the meantime, one way to avoid relying on tuple indexing is to detuple, so:
>>>>>>
>>>>>>   const D = {1..n, 1..n};
>>>>>>
>>>>>>   forall (i,j) in D ...   // i and j are integers
>>>>>>
>>>>>> rather than:
>>>>>>
>>>>>>   forall ij in D ...      // ij is a 2-tuple
>>>>>>
>>>>>> This stacks with zippering as well:
>>>>>>
>>>>>>   forall ((i,j), a) in zip(D, A) ...
>>>>>>
>>>>>> But shoot, in your code, vectorCyclic should really be yielding integers, not tuples, since it's 1D. This looks to me like a regression that has been introduced since 1.14 (I assume you're working on master?) and is likely something that I broke... 1D arrays can be indexed using either 1-tuples or integers, so I'm guessing that all of our existing tests just feed the indices yielded by the loop back into array accesses and are none the wiser... :( I'm going to see whether we can get a fix in before next week's release to avoid this unintended change in behavior...
>>>>>>
>>>>>> Bottom line, you _ought_ to be able to write the inner loop as:
>>>>>>
>>>>>>   for (j, a) in zip({0..#dimension}, A(i, ..)) {
>>>>>>
>>>>>> and as a means of working around the regression and avoiding the tuple indexing, for the time being, you can write the outer loop as:
>>>>>>
>>>>>>   forall ((i,), c) in zip(vectorCyclic, C) {
>>>>>>
>>>>>> and once the regression is fixed you can change back to simply:
>>>>>>
>>>>>>   forall (i, c) in zip(vectorCyclic, C) {
>>>>>>     for (j, a) in zip({0..#dimension}, A(i, ..)) {
>>>>>>
>>>>>>> I still want to produce some timing results, but it will take me a little bit of time. Then, on to the matrix-matrix product :)
>>>>>>
>>>>>> Be sure to use the --fast flag before doing any timing experiments...
>>>>>>
>>>>>> -Brad
>>>>>
>>>>> --
>>>>> Barry E Moore II, PhD
>>>>> E-mail: [email protected]
>>>>> Assistant Research Professor, Center for Simulation and Modeling
>>>>> University of Pittsburgh, Pittsburgh, PA 15260
>>>
>>> --
>>> Barry E Moore II, PhD
>>> E-mail: [email protected]
>>> Assistant Research Professor, Center for Simulation and Modeling
>>> University of Pittsburgh, Pittsburgh, PA 15260
>
> --
> Barry E Moore II, PhD
> E-mail: [email protected]
> Assistant Research Professor, Center for Simulation and Modeling
> University of Pittsburgh, Pittsburgh, PA 15260

--
Barry E Moore II, PhD
E-mail: [email protected]
Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260
[Attachment: matrix-vector-multiply.chpl]
