Brad,
Ignore me! I needed to `clobber` (great name, by the way) my updated GitHub
repo, and now I am getting 6/61 compared to 5/51 from the previous numbers
(probably close to sufficient). Time to think about matrix-matrix.
- Barry
On Wed, Apr 26, 2017 at 8:55 PM, Barry Moore <[email protected]> wrote:
> Hey Brad,
>
> Elliot reminded me I haven't been working with Chapel recently. I can't seem
> to reproduce your data. I tried an up-to-date git installation as well as
> 1.15.0 (both with my addition for gasnetrun_psm). I attached my version. If
> I can verify you get decent performance with my code, then I'll know
> something is wrong with the way the job is being launched.
>
> Thanks again!
>
> - Barry
>
> On Mon, Apr 3, 2017 at 10:54 AM, Barry Moore <[email protected]> wrote:
>
>> Brad,
>>
>> Going to spend some time with this sometime this week. Definitely a lot
>> to think about.
>>
>> - Barry
>>
>> On Fri, Mar 31, 2017 at 1:11 PM, Brad Chamberlain <[email protected]> wrote:
>>
>>>
>>> Hi Barry --
>>>
>>> I realized some things last night while thinking about this a bit more,
>>> and experimented with some other things quickly this morning which reduced
>>> the communication significantly, though there are still some aspects that I
>>> can't explain off the top of my head. Below, I'll walk through the series
>>> of changes I made and the impact on the max tasks / max comms counts.
>>>
>>> I'll also mention here at the outset that we've long had an intention /
>>> need to write a Chapel performance guide, but have never kicked off that
>>> effort. This discussion is a good example of the kinds of things that we
>>> developers are likely to be more aware of than users, where some of the
>>> issues are "that's the way we think Chapel has to be" and others are
>>> "that's the way things are today, but they should get better over time." If
>>> you or anyone else is interested in starting to construct a Chapel
>>> performance programmer's guide, I think that would be of great value to the
>>> community.
>>>
>>> I started with your version 3, and ran the size-10 problem on 3 locales.
>>> For the body of this mail, I'm just going to quote the max tasks / max
>>> comms counts, which I was using as a quick check to see whether I was
>>> helping or hurting:
>>>
>>> 205 / 1050
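>>>
>>> For anyone following along who wants to gather counts like these
>>> themselves, one way is the chplvis instrumentation via the VisualDebug
>>> module (which is what chplvis reads from). Here's a minimal sketch of the
>>> bracketing -- the loop shown is just a stand-in, not the actual code being
>>> measured:
>>>
>>>   use VisualDebug;
>>>
>>>   startVdebug("mvmult");      // start logging task/comm events for chplvis
>>>   forall loc in Locales do    // stand-in for the loop being measured
>>>     on loc do writeln("hello from locale ", here.id);
>>>   stopVdebug();               // stop logging; then view with: chplvis mvmult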
>>>
>>> Next I wondered what the impact of --fast would be. Recall that I said
>>> to use --fast when doing performance timings; it has the effect of turning
>>> off a bunch of runtime checks that we do for safety by default, which
>>> aren't particularly optimized for anything. This took a bite out of
>>> communications, though my task counts went up slightly:
>>>
>>> 207 / 748
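>>>
>>> (Concretely, that just means adding --fast to the compile line -- e.g.,
>>> something like `chpl --fast mvmult.chpl -o mvmult`, where mvmult.chpl is a
>>> stand-in for whatever your source file is actually called.)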
>>>
>>> Next, I tried changing the domain in your zippered iteration to a range
>>> (dropped the curly brackets). Generally speaking, ranges are lighter-weight
>>> than 1D domains, so are preferable when either could be used. This had only
>>> a modest impact (as you'd hope... I'm a little surprised that it had any
>>> impact at all, anticipating that it'd probably reduce scalar communications
>>> but not tasking... though I also wonder if there might be any
>>> nondeterminism in these numbers from run to run... I didn't explore that at
>>> all). In any case, I think it's stylistically preferable to use ranges
>>> over local 1D domains if there isn't any deep reason to prefer the domain,
>>> so I suggest it anyway:
>>>
>>> 206 / 743
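>>>
>>> To make that change concrete, the before/after for the inner loop would be
>>> something along these lines (a sketch based on the loop from your earlier
>>> mail, not your exact code):
>>>
>>>   // before: zippering an anonymous 1D domain
>>>   for (j, a) in zip({0..#dimension}, A(i, ..)) { c += a * B(here.id, j); }
>>>
>>>   // after: the same thing expressed with a range (drop the curly brackets)
>>>   for (j, a) in zip(0..#dimension, A(i, ..)) { c += a * B(here.id, j); }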
>>>
>>> Next, I tried changing your rank-change slice into a 2D slice.
>>> Rank-changes are generally more expensive than full slices because they
>>> require creating a dummy 1D domain and distribution to support all of the
>>> expression's possible operations. One could imagine a Chapel compiler that
>>> could analyze which operations are used and cheapen the
>>> implementation based on what's needed, but today we don't do that. This
>>> had a notable impact:
>>>
>>> 160 / 453
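>>>
>>> Again, to make it concrete, the change is just in how the row of A is
>>> sliced (another sketch, same caveats as above):
>>>
>>>   // before: rank-change slice -- yields a 1D view of row i
>>>   for (j, a) in zip(0..#dimension, A(i, ..)) { c += a * B(here.id, j); }
>>>
>>>   // after: rank-preserving 2D slice of the same elements -- cheaper today
>>>   for (j, a) in zip(0..#dimension, A(i..i, ..)) { c += a * B(here.id, j); }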
>>>
>>> Next, I used the feature that occurred to me last night: When
>>> we make a (rank-change or rank-preserving) slice of a distributed domain or
>>> array, we create a distributed data structure such that each locale knows
>>> what portion of that slice it owns. This makes slicing distributed things
>>> reasonably expensive, particularly if the result is only going to be used
>>> locally. So in cases where a slice is completely local to a single locale
>>> (as in your code), that's overkill. For such cases, there's currently a 'localSlice()'
>>> method that can be used to assert that a slice is local to the current
>>> locale and doesn't involve any remote elements. This generates only local
>>> descriptors to describe the slice expression, and therefore saves a lot of
>>> multi-locale overhead:
>>>
>>> 5 / 61
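>>>
>>> In code, that last step is just a change to the slice expression in the
>>> inner loop (the full loop nest follows below):
>>>
>>>   A(i..i, ..)             // distributed slice: builds per-locale descriptors
>>>   A.localSlice(i..i, ..)  // local slice: asserts no remote elements; cheap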
>>>
>>> (It's worth noting that we've done virtually nothing to optimize the
>>> creation of distributed slices, and it's quite likely that there's
>>> low-hanging fruit there that could reduce the gap between the previous two
>>> sets of numbers. But it'd take a lot more work to get the compiler to the
>>> point where it could automatically reason about slices and determine that a
>>> vanilla slice is guaranteed to be local to a locale and therefore could be
>>> strength-reduced to a localSlice()... So in the meantime we fall back on
>>> .localSlice() as a programmer-guided aid and are also exploring more
>>> elegant ways of expressing such patterns...).
>>>
>>> From there I tried a few other things, but didn't manage to reduce it
>>> further. So here's my final loop nest (noting that the fix for the 1-tuple
>>> bug you made us aware of yesterday is now on master and I'm using it here):
>>>
>>>   forall (i, c) in zip(vectorCyclic, C) {
>>>     for (a, j) in zip(A.localSlice(i..i, ..), 0..#dimension) {
>>>       c += a * B(here.id, j);
>>>     }
>>>   }
>>>
>>> I'm attaching a screenshot of the chplvis output similar to what you
>>> sent yesterday. As you can see, we're now in a situation where locales 1
>>> and 2 only do a single on-call back to locale 0 (presumably to tell it that
>>> they're done with their portion of the forall loop?) and no gets or puts.
>>> So that's a good sign. Locale 0 does one "on" for each of locales 1 and 2,
>>> which is required to kick off the forall loop. It also does 60 gets, which
>>> I can't explain yet... That would require more investigation (but I'm
>>> going to move on to the rest of my email for now).
>>>
>>> As I mentioned a mail or two ago, we can't really expect this code to
>>> ever be communication-free because the forall loop itself is going to need
>>> to fire up communication across locales. But it seems plausible that we
>>> should still be able to tighten things up further (where I'm not sure if
>>> the "we" in question is "we the users" or "we the developers.") If I were
>>> to continue with this, I'd probably start picking stuff out of the
>>> bracketed portion of the code to simplify it ("What do just the loops cost
>>> with no computation in the body?" "What does just the outer loop cost?"
>>> "What if I iterate over one thing rather than zippering?" to see if that
>>> would shed light on what is causing the gets. Alternatively, one could
>>> dive down to a lower level and start investigating the communication in the
>>> generated code more manually...
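>>>
>>> As a first experiment along those lines, one could keep the loop structure
>>> but empty out the body (just a sketch), so that whatever communication
>>> remains is attributable to the loops/iterators themselves:
>>>
>>>   forall (i, c) in zip(vectorCyclic, C) {
>>>     for (a, j) in zip(A.localSlice(i..i, ..), 0..#dimension) {
>>>       // no computation -- any remaining gets come from the loops themselves
>>>     }
>>>   }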
>>>
>>> I also wanted to pop back to my very first response on this thread where
>>> I asserted that I think we really want partial reductions to express this
>>> idiom both cleanly and efficiently. This kept coming back to me while
>>> looking at this idiom because, generally speaking, for a computation like
>>> this, we'd like to see a single forall loop over the entire problem space
>>> (Matrix) rather than all the per-dimension looping and slicing that you've
>>> had to resort to here. The partial reduction would enable that, where my
>>> current mental draft of it would end up looking something like this:
>>>
>>>   C = + reduce(resultShape=vectorCyclic)
>>>         forall (i,j) in matrixCyclic do
>>>           A(i,j) * B(here.id, j);
>>>
>>> or, using the shorthand for forall loops:
>>>
>>>   C = + reduce(resultShape=vectorCyclic)
>>>         [(i,j) in matrixCyclic] A(i,j) * B(here.id, j);
>>>
>>>
>>> which says "add up the expressions generated by this forall loop and
>>> reduce them such that the result's shape is vectorCyclic." The
>>> implementation would then compare the dimensions of matrixCyclic and
>>> vectorCyclic and see that it should add values along the rows to create a
>>> column result.
>>>
>>> Then, if we were to introduce the replicated dimensions that I mentioned
>>> and use them to declare B:
>>>
>>> var B: [matrixCyclic[.., *]] int;
>>>
>>> we could write the loop even more simply since both A and B would accept
>>> 2D indices (and B's second dimension would automatically select the
>>> localized value):
>>>
>>>   C = + reduce(resultShape=vectorCyclic)
>>>         [ij in matrixCyclic] A(ij) * B(ij);
>>>
>>> or, even better:
>>>
>>> C = + reduce(resultShape=vectorCyclic) (A * B);
>>>
>>>
>>> Hopefully this illustrates why I think those features are so crucial to
>>> expressing this computation cleanly (and you'll have to trust me that it
>>> should only help the performance as well, if the reasons aren't clear).
>>>
>>> Hope this is helpful/interesting,
>>> -Brad
>>>
>>>
>>>
>>> On Fri, 31 Mar 2017, Barry Moore wrote:
>>>
>>>> Brad,
>>>>
>>>> I have some versions on github (https://github.com/barrymoo/
>>>> chapel-playground/tree/master/matrix-vector-multiply/versions)
>>>>
>>>> 1. version2.chpl - uses the Replicated distribution for B; I tried both
>>>>    for and forall inner loops
>>>> 2. version3.chpl - uses a 3 x dimension domain for B, distributed
>>>>    Cyclically similarly to A
>>>>
>>>> The GitHub repo is a little messy because I was running a bunch of
>>>> variants one directory up, but version3.chpl was used to generate the
>>>> image. I suspected (1) from some of your other emails, which is why I
>>>> tried this version too. It's nice to know I am thinking about this
>>>> correctly, even if it is only a simple problem.
>>>>
>>>> https://github.com/chapel-lang/chapel/pull/5912
>>>>
>>>>
>>>> I will grab the request and test it on my code sometime this weekend.
>>>>
>>>> A side note: I did notice that using the ReplicatedDist had *slightly*
>>>> more overhead (data transferred) than the by-hand replicated version.
>>>> Probably some additional data carried by the structure to describe what
>>>> locales it is replicated over? I don't think it is an issue, but I
>>>> thought it was interesting.
>>>>
>>>> Thanks again!
>>>>
>>>> Barry
>>>>
>>>> On Thu, Mar 30, 2017 at 8:21 PM, Brad Chamberlain <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>>> I'm timing out today without digging into this, but will hope to look at
>>>>> it more tomorrow. If your profiling calls are still outside of the forall
>>>>> loop, we shouldn't expect it to be communication-free because the loop
>>>>> logically starts on locale 0 and has to create tasks on the other
>>>>> locales; but at a sniff, the image you sent does seem too high, at least
>>>>> for a single loop. My two guesses would be: (1) maybe the Replicated
>>>>> distribution is not working as well as it should (as I said, I'm not all
>>>>> that familiar with it, and I don't think it sees much action in
>>>>> practice); (2) maybe I led you astray in some way and we're hauling some
>>>>> global data structure across the network from locale 0 more than we
>>>>> should be...
>>>>>
>>>>> Would you send me the final version of the code from which you created
>>>>> this graph (if it changed from the last version modulo my suggested
>>>>> edits) as well as the problem size at which you ran it?
>>>>>
>>>>> BTW, thanks for pointing out that tuple indexing bug... I'm embarrassed
>>>>> that it slipped by us (me), and we've got a fix that we hope to merge
>>>>> tonight once it passes some more testing:
>>>>>
>>>>> https://github.com/chapel-lang/chapel/pull/5912
>>>>>
>>>>> Thanks again,
>>>>> -Brad
>>>>>
>>>>>
>>>>> On Thu, 30 Mar 2017, Barry Moore wrote:
>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> Definitely better now, but there seems to be a lot of communication
>>>>>> going on. Maybe I am misinterpreting this plot. I would have guessed
>>>>>> that by forcing everything to be local on the node, I wouldn't be doing
>>>>>> any communication inside the work loop.
>>>>>>
>>>>>> Any thoughts? The rabbit hole gets deep fast ha.
>>>>>>
>>>>>> - Barry
>>>>>>
>>>>>> On Thu, Mar 30, 2017 at 4:31 PM, Brad Chamberlain <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>>> Had a little time to cook up Brad's suggestion as a full program
>>>>>>>> (pasted in [code][/code]). It is much better with respect to
>>>>>>>> communication (not surprisingly, based on the comments above). Also,
>>>>>>>> it looks a LOT cleaner.
>>>>>>>
>>>>>>> Good!
>>>>>>>
>>>>>>>
>>>>>>>> 1. Stylistically, camelCase everywhere? I noticed sometimes with
>>>>>>>> domains, spaces, and locales you use CamelCase (there is probably a
>>>>>>>> name for this, but I don't want to look it up). Is there a suggestion
>>>>>>>> from you guys?
>>>>>>>>
>>>>>>> Most of us on the project use camelCase, but not exclusively --
>>>>>>> underscores_still_appear at times. I'm probably an outlier, but I tend
>>>>>>> to use InitialCapsCamelCase for distributed domains and arrays because,
>>>>>>> in my mind, they are the big-ticket items from a cost/weight
>>>>>>> perspective and deserve being distinguished in some way. Sometimes that
>>>>>>> bleeds over to using initial caps for my "main" domains/arrays that
>>>>>>> define the overall problem size/space as well in non-distributed
>>>>>>> programs. But again, I'd say use personal preference -- I'm not a big
>>>>>>> one for imposing naming/stylistic conventions on others (for better or
>>>>>>> worse) and to date there is no Chapel style guide.
>>>>>>>
>>>>>>>
>>>>>>>> 2. In this line `for (j, a) in zip({0..#dimension}, A(i(1), ..)) {`: I
>>>>>>>> use 0 indexing everywhere, but I have to use i(1) in the for loop. It
>>>>>>>> feels and looks awkward to me. Maybe it is unavoidable, but it will be
>>>>>>>> hard for me to remember.
>>>>>>>>
>>>>>>> We tried to make Chapel as index-neutral as possible (i.e., arrays can
>>>>>>> start at 0 or 1 or -3 or lo or whatever), but in a few places we had to
>>>>>>> make a choice, where tuple indexing (what you're doing here) is one of
>>>>>>> the main ones. I've tried to concoct ways of making tuple indexing
>>>>>>> either index-neutral or at least user-configurable, but haven't yet
>>>>>>> come up with anything appealing -- if you have ideas, send them in.
>>>>>>>
>>>>>>> In the meantime, one way to avoid relying on tuple indexing is to
>>>>>>> detuple, so:
>>>>>>>
>>>>>>> const D = {1..n, 1..n};
>>>>>>>
>>>>>>> forall (i,j) in D ... // i and j are integers
>>>>>>>
>>>>>>> rather than:
>>>>>>>
>>>>>>> forall ij in D ... // ij is a 2-tuple
>>>>>>>
>>>>>>> This stacks with zippering as well:
>>>>>>>
>>>>>>> forall ((i,j), a) in zip(D, A) ...
>>>>>>>
>>>>>>>
>>>>>>> But shoot, in your code, vectorCyclic should really be yielding
>>>>>>> integers, not tuples, since it's 1D. This looks to me like a regression
>>>>>>> that has been introduced since 1.14 (I assume you're working on
>>>>>>> master?) and is likely something that I broke... 1D arrays can be
>>>>>>> indexed using either 1-tuples or integers, so I'm guessing that all of
>>>>>>> our existing tests just feed the indices yielded by the loop back into
>>>>>>> array accesses and are none the wiser... :( I'm going to see whether we
>>>>>>> can get a fix in before next week's release to avoid this unintended
>>>>>>> change in behavior...
>>>>>>>
>>>>>>> Bottom line, you _ought_ to be able to write the inner loop as:
>>>>>>>
>>>>>>> for (j, a) in zip({0..#dimension}, A(i, ..)) {
>>>>>>>
>>>>>>> and as a means of working around the regression and avoiding the tuple
>>>>>>> indexing, for the time being, you can write the outer loop as:
>>>>>>>
>>>>>>> forall ((i,), c) in zip(vectorCyclic, C) {
>>>>>>>
>>>>>>> and once the regression is fixed can change back to simply:
>>>>>>>
>>>>>>>   forall (i, c) in zip(vectorCyclic, C) {
>>>>>>>     for (j, a) in zip({0..#dimension}, A(i, ..)) {
>>>>>>>
>>>>>>>
>>>>>>>> I still want to produce some timing results, but it will take me a
>>>>>>>> little bit of time. Then, on to the matrix-matrix product :)
>>>>>>>
>>>>>>> Be sure to use the --fast flag before doing any timing experiments...
>>>>>>>
>>>>>>> -Brad
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>
>
>
--
Barry E Moore II, PhD
E-mail: [email protected]
Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260