Brad,
Going to spend some time with this sometime this week. Definitely a lot to
think about.
- Barry
On Fri, Mar 31, 2017 at 1:11 PM, Brad Chamberlain <[email protected]> wrote:
>
> Hi Barry --
>
> I realized some things last night while thinking about this a bit more,
> and experimented with some other things quickly this morning which reduced
> the communication significantly, though there are still some aspects that I
> can't explain off the top of my head. Below, I'll walk through the series
> of changes I made and their impact on the max tasks / max comms counts.
>
> I'll also mention here at the outset that we've long had an intention /
> need to write a Chapel performance guide, but have never kicked off that
> effort. This discussion is a good example of the kinds of things that we
> developers are likely to be more aware of than users, where some of the
> issues are "that's the way we think Chapel has to be" and others are
> "that's the way things are today, but they should get better over time." If
> you or anyone else is interested in starting to construct a Chapel
> performance programmer's guide, I think that would be of great value to the
> community.
>
> I started with your version 3, and ran the size-10 problem on 3 locales.
> For the body of this mail, I'm just going to quote the max tasks / max
> comms counts, which I was using as a quick check to see whether I was
> helping or hurting:
>
> 205 / 1050
>
> Next, I wondered what the impact of --fast would be. Recall that I said to
> use --fast when doing performance timings; it has the effect of turning off
> a bunch of runtime checks that we do by default for safety, and which
> aren't particularly optimized. This took a bite out of communications,
> though my task counts went up slightly:
>
> 207 / 748
>
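> (Concretely, that just means adding the flag at compile time, e.g.:
>
>   chpl --fast mvm.chpl
>
> where "mvm.chpl" is a stand-in for whatever your source file is named.)
>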
> Next, I tried changing the domain in your zippered iteration to a range
> (dropped the curly brackets). Generally speaking, ranges are lighter-weight
> than 1D domains, so are preferable when either could be used. This had only
> a modest impact (as you'd hope... I'm a little surprised that it had any
> impact at all, anticipating that it'd probably reduce scalar communications
> but not tasking... though I also wonder if there might be any
> nondeterminism in these numbers from run to run... I didn't explore that at
> all). In any case, I think it's stylistically preferable to use ranges
> over local 1D domains if there isn't any deep reason to prefer the domain,
> so I suggest it anyway:
>
> 206 / 743
>
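> Concretely, that's the difference between the zippered iterand as you had
> it and the same thing with the braces dropped:
>
>   zip({0..#dimension}, A(i, ..))   // 1D domain
>   zip(0..#dimension, A(i, ..))     // range
>
> Same iteration either way; the range version just skips building a domain
> descriptor.
>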
> Next, I tried changing your rank-change slice into a 2D slice.
> Rank-changes are generally more expensive than full slices because they
> require creating a dummy 1D domain and distribution to support all of the
> expression's possible operations. One could imagine a Chapel compiler that
> could analyze which operations are actually used and cheapen the
> implementation based on what's needed, but today we don't do that. This
> had a notable impact:
>
> 160 / 453
>
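> In code, that's the difference between these two ways of referring to row
> i of A, where the second keeps the expression rank-2:
>
>   A(i, ..)      // rank-change slice: yields a 1D view of the row
>   A(i..i, ..)   // 2D slice: same elements, but stays two-dimensional
>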
> Next, I used the feature that occurred to me last night: when we make a
> (rank-change or rank-preserving) slice of a distributed domain or array, we
> create a distributed data structure such that each locale knows what
> portion of that slice it owns. This makes slicing distributed things
> reasonably expensive, particularly if the result is only going to be used
> locally, so in cases where a slice is completely local to a single locale
> (as in your code), it's overkill. There's currently a 'localSlice()'
> method that can be used to assert that a slice is local to the current
> locale and doesn't involve any remote elements. This generates only local
> descriptors to describe the slice expression, and therefore saves a lot of
> multi-locale overhead:
>
> 5 / 61
>
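> So the change here was just:
>
>   A(i..i, ..)             // builds distributed descriptors for the slice
>   A.localSlice(i..i, ..)  // asserts locality; builds local descriptors only
>
> (with the usual caveat that localSlice() is an assertion on the
> programmer's part -- it's only legal when the slice really is entirely
> local to the current locale).
>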
> (It's worth noting that we've done virtually nothing to optimize the
> creation of distributed slices, and it's quite likely that there's
> low-hanging fruit there that could reduce the gap between the previous two
> sets of numbers. But it'd take a lot more work to get the compiler to the
> point where it could automatically reason about slices and determine that a
> vanilla slice is guaranteed to be local to a locale and therefore could be
> strength-reduced to a localSlice()... So in the meantime we fall back on
> .localSlice() as a programmer-guided aid and are also exploring more
> elegant ways of expressing such patterns...).
>
> From there I tried a few other things, but didn't manage to reduce it
> further. So here's my final loop nest (noting that the fix for the 1-tuple
> bug you made us aware of yesterday is now on master and I'm using it here):
>
>   forall (i, c) in zip(vectorCyclic, C) {
>     for (a, j) in zip(A.localSlice(i..i, ..), 0..#dimension) {
>       c += a * B(here.id, j);
>     }
>   }
>
> I'm attaching a screenshot of the chplvis output similar to what you sent
> yesterday. As you can see, we're now in a situation where locales 1 and 2
> only do a single on-call back to locale 0 (presumably to tell it that
> they're done with their portion of the forall loop?) and no gets or puts.
> So that's a good sign. Locale 0 does an "on" for each of locales 1 and 2,
> which is required to kick off the forall loop. It also does 60 gets which
> I can't explain yet... That would require more investigation (but I'm
> going to move on to the rest of my email for now).
>
> As I mentioned a mail or two ago, we can't really expect this code to ever
> be communication-free because the forall loop itself is going to need to
> fire up communication across locales. But it seems plausible that we
> should still be able to tighten things up further (where I'm not sure if
> the "we" in question is "we the users" or "we the developers"). If I were
> to continue with this, I'd probably start picking pieces out of the
> bracketed portion of the code to simplify it ("What do just the loops cost
> with no computation in the body?", "What does just the outer loop cost?",
> "What if I iterate over one thing rather than zippering?") to see if that
> would shed light on what is causing the gets. Alternatively, one could
> dive down to a lower level and start investigating the communication in the
> generated code more manually...
>
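> For example, the first of those experiments might just be the final loop
> nest above with its body emptied out, so that whatever communication
> remains is attributable to the loops and slice alone:
>
>   forall (i, c) in zip(vectorCyclic, C) {
>     for (a, j) in zip(A.localSlice(i..i, ..), 0..#dimension) {
>       // empty body: remaining comms come from the loop/slice machinery
>     }
>   }
>
> and so on, peeling one piece away at a time.
>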
> I also wanted to pop back to my very first response on this thread where I
> asserted that I think we really want partial reductions to express this
> idiom both cleanly and efficiently. This kept coming back to me while
> looking at this idiom because, generally speaking, for a computation like
> this, we'd like to see a single forall loop over the entire problem space
> (Matrix) rather than all the per-dimension looping and slicing that you've
> had to resort to here. The partial reduction would enable that, where my
> current mental draft of it would end up looking something like this:
>
>   C = + reduce(resultShape=vectorCyclic)
>         forall (i,j) in matrixCyclic do
>           A(i,j) * B(here.id, j);
>
> or, using the shorthand for forall loops:
>
>   C = + reduce(resultShape=vectorCyclic)
>         [(i,j) in matrixCyclic] A(i,j) * B(here.id, j);
>
>
> which says "add up the expressions generated by this forall loop and
> reduce them such that the result's shape is vectorCyclic." The
> implementation would then compare the dimensions of matrixCyclic and
> vectorCyclic and see that it should add values along the rows to create a
> column result.
>
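> (For what it's worth, until partial reductions exist, the closest I can
> get today is a per-row reduction inside the forall -- a sketch I haven't
> timed, and the direct A(i, j) accesses may communicate differently than
> the slice-based loop did:
>
>   forall (i, c) in zip(vectorCyclic, C) do
>     c = + reduce [j in 0..#dimension] (A(i, j) * B(here.id, j));
>
> Note that this still works a row at a time rather than expressing the
> whole computation as one global operation, which is the point of the
> feature.)
>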
> Then, if we were to introduce the replicated dimensions that I mentioned
> and used them to declare B:
>
> var B: [matrixCyclic[.., *]] int;
>
> we could write the loop even more simply since both A and B would accept
> 2D indices (and B's second dimension would automatically select the
> localized value):
>
>   C = + reduce(resultShape=vectorCyclic)
>         [ij in matrixCyclic] A(ij) * B(ij);
>
> or, even better:
>
> C = + reduce(resultShape=vectorCyclic) (A * B);
>
>
> Hopefully this illustrates why I think those features are so crucial to
> expressing this computation cleanly (and you'll have to trust me that it
> should only help the performance as well, if the reasons aren't clear).
>
> Hope this is helpful/interesting,
> -Brad
>
>
>
> On Fri, 31 Mar 2017, Barry Moore wrote:
>
>> Brad,
>>
>> I have some versions on github (https://github.com/barrymoo/chapel-playground/tree/master/matrix-vector-multiply/versions)
>>
>> 1. version2.chpl - uses the Replicated distribution for B; I tried both
>>    for and forall inner loops
>> 2. version3.chpl - uses a 3 x dimension domain for B, distributed
>>    Cyclically similarly to A
>>
>> The github repo is a little messy because I was running a bunch of
>> variants one directory up, but version3.chpl was used to generate the
>> image. I suspected (1) from some of your other emails, which is why I
>> tried this version too. It's nice to know I am thinking about this
>> correctly, even if it is only a simple problem.
>>
>> https://github.com/chapel-lang/chapel/pull/5912
>>
>> I will grab the pull request and test it on my code sometime this weekend.
>>
>> A side note: I did notice that using the ReplicatedDist had *slightly*
>> more overhead (data transferred) than the by-hand replicated version.
>> Probably some additional data carried by the structure to describe what
>> locales it is replicated over? I don't think it is an issue, but I
>> thought it was interesting.
>>
>> Thanks again!
>>
>> Barry
>>
>> On Thu, Mar 30, 2017 at 8:21 PM, Brad Chamberlain <[email protected]> wrote:
>>
>>
>>> I'm timing out today without digging into this, but will hope to look at
>>> it more tomorrow. If your profiling calls are still outside of the forall
>>> loop, we shouldn't expect it to be communication-free because the loop
>>> logically starts on locale 0 and has to create tasks on the other
>>> locales; but at a sniff, the image you sent does seem too high, at least
>>> for a single loop. My two guesses would be: (1) maybe the Replicated
>>> distribution is not working as well as it should (as I said, I'm not all
>>> that familiar with it, and I don't think it sees much action in
>>> practice); (2) maybe I led you astray in some way and we're hauling some
>>> global data structure across the network from locale 0 more than we
>>> should be...
>>>
>>> Would you send me the final version of the code from which you created
>>> this graph (if it changed from the last version modulo my suggested
>>> edits), as well as the problem size at which you ran it?
>>>
>>> BTW, thanks for pointing out that tuple indexing bug... I'm embarrassed
>>> that it slipped by us (me), and we've got a fix that we'll hope to merge
>>> tonight once it passes some more testing:
>>>
>>> https://github.com/chapel-lang/chapel/pull/5912
>>>
>>> Thanks again,
>>> -Brad
>>>
>>>
>>> On Thu, 30 Mar 2017, Barry Moore wrote:
>>>
>>>> Hey Guys,
>>>>
>>>> Definitely better now, but there seems to be a lot of communication
>>>> going on. Maybe I am misinterpreting this plot. I would have guessed
>>>> that by forcing everything to be local on the node I wouldn't be doing
>>>> any communication inside the work loop.
>>>>
>>>> Any thoughts? The rabbit hole gets deep fast ha.
>>>>
>>>> - Barry
>>>>
>>>> On Thu, Mar 30, 2017 at 4:31 PM, Brad Chamberlain <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>>>> Had a little time to cook up Brad's suggestion as a full program
>>>>>> (pasted in [code][/code]). It is much better with respect to
>>>>>> communication (unsurprisingly, based on the comments above). Also, it
>>>>>> looks a LOT cleaner.
>>>>>>
>>>>>>
>>>>>> Good!
>>>>>
>>>>>
>>>>>> 1. Stylistically, camelCase everywhere? I noticed sometimes with
>>>>>> domains, spaces, and locales you use CamelCase (there is probably a
>>>>>> name for this, but I don't want to look it up). Is there a suggestion
>>>>>> from you guys?
>>>>>>
>>>>> Most of us on the project use camelCase, but not exclusively --
>>>>> underscores_still_appear at times. I'm probably an outlier, but I tend
>>>>> to use InitialCapsCamelCase for distributed domains and arrays because,
>>>>> in my mind, they are the big-ticket items from a cost/weight
>>>>> perspective and deserve being distinguished in some way. Sometimes that
>>>>> bleeds over to using initial caps for my "main" domains/arrays that
>>>>> define the overall problem size/space in non-distributed programs as
>>>>> well. But again, I'd say use personal preference -- I'm not a big one
>>>>> for imposing naming/stylistic conventions on others (for better or
>>>>> worse), and to date there is no Chapel style guide.
>>>>>
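>>>>> To make that concrete, here's a made-up sketch of my own habit (the
>>>>> names are illustrative, not from your code):
>>>>>
>>>>>   use CyclicDist;
>>>>>
>>>>>   config const n = 10;
>>>>>
>>>>>   // big-ticket distributed domain/array: InitialCaps
>>>>>   const ProblemSpace = {1..n, 1..n} dmapped Cyclic(startIdx=(1, 1));
>>>>>   var Matrix: [ProblemSpace] real;
>>>>>
>>>>>   // lightweight local values: camelCase
>>>>>   var rowSum: real;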
>>>>>
>>>>>> 1. In this line `for (j, a) in zip({0..#dimension}, A(i(1), ..)) {`:
>>>>>> I use 0-indexing everywhere, but I have to use i(1) in the for loop.
>>>>>> It feels and looks awkward to me. Maybe it is unavoidable, but it
>>>>>> will be hard for me to remember.
>>>>>>
>>>>>>
>>>>> We tried to make Chapel as index-neutral as possible (i.e., arrays can
>>>>> start at 0 or 1 or -3 or lo or whatever), but in a few places we had to
>>>>> make a choice, where tuple indexing (what you're doing here) is one of
>>>>> the main ones. I've tried to concoct ways of making tuple indexing
>>>>> either index-neutral or at least user-configurable, but haven't yet
>>>>> come up with anything appealing -- if you have ideas, send them in.
>>>>>
>>>>> In the meantime, one way to avoid relying on tuple indexing is to
>>>>> detuple, so:
>>>>>
>>>>> const D = {1..n, 1..n};
>>>>>
>>>>> forall (i,j) in D ... // i and j are integers
>>>>>
>>>>> rather than:
>>>>>
>>>>> forall ij in D ... // ij is a 2-tuple
>>>>>
>>>>> This stacks with zippering as well:
>>>>>
>>>>> forall ((i,j), a) in zip(D, A) ...
>>>>>
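>>>>> As a self-contained toy (the names here are made up for illustration):
>>>>>
>>>>>   config const n = 4;
>>>>>   const D = {1..n, 1..n};
>>>>>   var A: [D] int;
>>>>>
>>>>>   // detupling plus zippering: i and j are ints, a is the array element
>>>>>   forall ((i, j), a) in zip(D, A) do
>>>>>     a = i + j;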
>>>>>
>>>>> But shoot, in your code, vectorCyclic should really be yielding
>>>>> integers, not tuples, since it's 1D. This looks to me like a regression
>>>>> introduced since 1.14 (I assume you're working on master?) and is
>>>>> likely something that I broke... 1D arrays can be indexed using either
>>>>> 1-tuples or integers, so I'm guessing that all of our existing tests
>>>>> just feed the indices yielded by the loop back into array accesses and
>>>>> are none the wiser... :( I'm going to see whether we can get a fix in
>>>>> before next week's release to avoid this unintended change in
>>>>> behavior...
>>>>>
>>>>> Bottom line, you _ought_ to be able to write the inner loop as:
>>>>>
>>>>> for (j, a) in zip({0..#dimension}, A(i, ..)) {
>>>>>
>>>>> and as a means of working around the regression and avoiding the tuple
>>>>> indexing, for the time being, you can write the outer loop as:
>>>>>
>>>>> forall ((i,), c) in zip(vectorCyclic, C) {
>>>>>
>>>>> and once the regression is fixed can change back to simply:
>>>>>
>>>>>   forall (i, c) in zip(vectorCyclic, C) {
>>>>>     for (j, a) in zip({0..#dimension}, A(i, ..)) {
>>>>>
>>>>>
>>>>>> I still want to produce some timing results, but it will take me a
>>>>>> little bit of time. Then, on to the matrix-matrix product :)
>>>>>>
>>>>>> Be sure to use the --fast flag before doing any timing experiments...
>>>>>
>>>>> -Brad
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>
--
Barry E Moore II, PhD
E-mail: [email protected]
Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260