Hey Brad,

Elliot reminded me that I haven't been working on Chapel recently. I can't seem to reproduce your data. I tried an up-to-date git installation as well as 1.15.0 (both with my addition for gasnetrun_psm). I attached my version. If I can verify that you get decent performance with my code, then I'll know something is wrong with the way the job is being launched.
Thanks again!

- Barry

On Mon, Apr 3, 2017 at 10:54 AM, Barry Moore <[email protected]> wrote:

> Brad,
>
> Going to spend some time with this sometime this week. Definitely a lot to think about.
>
> - Barry
>
> On Fri, Mar 31, 2017 at 1:11 PM, Brad Chamberlain <[email protected]> wrote:
>
>> Hi Barry --
>>
>> I realized some things last night while thinking about this a bit more, and experimented with some other things quickly this morning which reduced the communication significantly, though there are still some aspects that I can't explain off the top of my head. Below, I'll walk through the series of changes I made and the impact on the max tasks / max comms counts.
>>
>> I'll also mention here at the outset that we've long had an intention / need to write a Chapel performance guide, but have never kicked off that effort. This discussion is a good example of the kinds of things that we developers are likely to be more aware of than users, where some of the issues are "that's the way we think Chapel has to be" and others are "that's the way things are today, but they should get better over time." If you or anyone else is interested in starting to construct a Chapel performance programmer's guide, I think that would be of great value to the community.
>>
>> I started with your version 3, and ran the size-10 problem on 3 locales. For the body of this mail, I'm just going to quote the max tasks / max comms counts, which I was using as a quick check to see whether I was helping or hurting:
>>
>> 205 / 1050
>>
>> Next, I wondered what the impact of --fast would be. Recall that I said to use --fast when doing performance timings; it has the effect of turning off a bunch of runtime checks that we do for safety by default, which aren't particularly optimized for anything. This took a bite out of communications, though my task counts went up slightly:
>>
>> 207 / 748
>>
>> Next, I tried changing the domain in your zippered iteration to a range (dropped the curly brackets). Generally speaking, ranges are lighter-weight than 1D domains, so are preferable when either could be used. This had only a modest impact (as you'd hope... I'm a little surprised that it had any impact at all, anticipating that it'd probably reduce scalar communications but not tasking... though I also wonder if there might be any nondeterminism in these numbers from run to run... I didn't explore that at all). In any case, I think it's stylistically preferable to use ranges over local 1D domains if there isn't any deep reason to prefer the domain, so I suggest it anyway:
>>
>> 206 / 743
>>
>> Next, I tried changing your rank-change slice into a 2D slice. Rank-changes are generally more expensive than full slices because they require creating a dummy 1D domain and distribution to support all of the expression's possible operations. One could imagine a Chapel compiler that could analyze which operations are used and cheapen the implementation based on what's needed, but today we don't do that. This had a notable impact:
>>
>> 160 / 453
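(For concreteness, a rough sketch of what the loop looks like at this point in the walkthrough -- a range in the zip instead of the 1D domain, and a rank-preserving 2D slice A[i..i, ..] in place of the rank-change A(i, ..). The names come from the code elsewhere in the thread; the exact expressions are approximate:)

    // loop after the range + 2D-slice changes, before introducing localSlice()
    forall (i, c) in zip(vectorCyclic, C) {
      // 0..#dimension is a range rather than the domain {0..#dimension};
      // A[i..i, ..] keeps both dimensions instead of rank-changing down to 1D
      for (a, j) in zip(A[i..i, ..], 0..#dimension) {
        c += a * B(here.id, j);
      }
    }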
>> Next, I used the feature that occurred to me last night: when we make a (rank-change or rank-preserving) slice of a distributed domain or array, we create a distributed data structure such that each locale knows what portion of that slice it owns. This makes slicing distributed things reasonably expensive, particularly if the result is only going to be used locally. So in cases where a slice is completely local to a single locale (as in your code), that's overkill. There's currently a localSlice() method that can be used to assert that a slice is local to the current locale and doesn't involve any remote elements. This generates only local descriptors to describe the slice expression, and therefore saves a lot of multi-locale overhead:
>>
>> 5 / 61
>>
>> (It's worth noting that we've done virtually nothing to optimize the creation of distributed slices, and it's quite likely that there's low-hanging fruit there that could reduce the gap between the previous two sets of numbers. But it'd take a lot more work to get the compiler to the point where it could automatically reason about slices and determine that a vanilla slice is guaranteed to be local to a locale and therefore could be strength-reduced to a localSlice()... So in the meantime we fall back on .localSlice() as a programmer-guided aid and are also exploring more elegant ways of expressing such patterns...)
>>
>> From there I tried a few other things, but didn't manage to reduce it further. So here's my final loop nest (noting that the fix for the 1-tuple bug you made us aware of yesterday is now on master and I'm using it here):
>>
>>   forall (i, c) in zip(vectorCyclic, C) {
>>     for (a, j) in zip(A.localSlice(i..i, ..), 0..#dimension) {
>>       c += a * B(here.id, j);
>>     }
>>   }
>>
>> I'm attaching a screenshot of the chplvis output similar to what you sent yesterday. As you can see, we're now in a situation where locales 1 and 2 only do a single on-call back to locale 0 (presumably to tell it that they're done with their portion of the forall loop?) and no gets or puts. So that's a good sign. Locale 0 does an "on" for each of locales 1 and 2, which is required to kick off the forall loop. It also does 60 gets which I can't explain yet... That would require more investigation (but I'm going to move on to the rest of my email for now).
>>
>> As I mentioned a mail or two ago, we can't really expect this code to ever be communication-free because the forall loop itself is going to need to fire up communication across locales. But it seems plausible that we should still be able to tighten things up further (where I'm not sure if the "we" in question is "we the users" or "we the developers"). If I were to continue with this, I'd probably start picking stuff out of the bracketed portion of the code to simplify it ("What do just the loops cost with no computation in the body?" "What does just the outer loop cost?" "What if I iterate over one thing rather than zippering?") to see if that would shed light on what is causing the gets. Alternatively, one could dive down to a lower level and start investigating the communication in the generated code more manually...
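(One rough way to run that kind of experiment, using the standard CommDiagnostics module -- the loop and names are lifted from the thread, and exactly which pieces to strip out is a judgment call:)

    use CommDiagnostics;

    // count communication for just the loop structure, with no work in the body
    startCommDiagnostics();
    forall (i, c) in zip(vectorCyclic, C) {
      // empty body: any gets/puts/ons counted here come from the loop machinery itself
    }
    stopCommDiagnostics();
    writeln(getCommDiagnostics());   // per-locale counts of gets, puts, on-calls, ...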
>> I also wanted to pop back to my very first response on this thread, where I asserted that I think we really want partial reductions to express this idiom both cleanly and efficiently. This kept coming back to me while looking at this idiom because, generally speaking, for a computation like this, we'd like to see a single forall loop over the entire problem space (Matrix) rather than all the per-dimension looping and slicing that you've had to resort to here. The partial reduction would enable that, where my current mental draft of it would end up looking something like this:
>>
>>   C = + reduce(resultShape=vectorCyclic)
>>         forall (i,j) in matrixCyclic do
>>           A(i,j) * B(here.id, j);
>>
>> or, using the shorthand for forall loops:
>>
>>   C = + reduce(resultShape=vectorCyclic)
>>         [(i,j) in matrixCyclic] A(i,j) * B(here.id, j);
>>
>> which says "add up the expressions generated by this forall loop and reduce them such that the result's shape is vectorCyclic." The implementation would then compare the dimensions of matrixCyclic and vectorCyclic and see that it should add values along the rows to create a column result.
>>
>> Then, if we were to introduce the replicated dimensions that I mentioned and used them to declare B:
>>
>>   var B: [matrixCyclic[.., *]] int;
>>
>> we could write the loop even more simply, since both A and B would accept 2D indices (and B's second dimension would automatically select the localized value):
>>
>>   C = + reduce(resultShape=vectorCyclic)
>>         [ij in matrixCyclic] A(ij) * B(ij);
>>
>> or, even better:
>>
>>   C = + reduce(resultShape=vectorCyclic) (A * B);
>>
>> Hopefully this illustrates why I think those features are so crucial to expressing this computation cleanly (and you'll have to trust me that it should only help the performance as well, if the reasons aren't clear).
>>
>> Hope this is helpful/interesting,
>> -Brad
>>
>> On Fri, 31 Mar 2017, Barry Moore wrote:
>>
>>> Brad,
>>>
>>> I have some versions on github (https://github.com/barrymoo/chapel-playground/tree/master/matrix-vector-multiply/versions):
>>>
>>> 1. version2.chpl - uses the Replicated distribution for B; I tried both for and forall inner loops
>>> 2. version3.chpl - uses a 3 x dimension domain for B distributed Cyclically, similarly to A
>>>
>>> The github repo is a little messy because I was running a bunch of variants one directory up, but version3.chpl was used to generate the image. I suspected (1) from some of your other emails, which is why I tried this version too. It's nice to know I am thinking about this correctly, even if it is only a simple problem.
>>>
>>> https://github.com/chapel-lang/chapel/pull/5912
>>>
>>> I will grab the request and test it on my code sometime this weekend.
>>>
>>> A side note: I did notice that using the ReplicatedDist had *slightly* more overhead (data transferred) than the by-hand replicated version. Probably some additional data carried by the structure to describe what locales it is replicated over? I don't think it is an issue, but I thought it was interesting.
>>>
>>> Thanks again!
>>>
>>> Barry
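(To make that comparison concrete, hedged sketches of the two B layouts being compared -- the constructor spellings follow the 1.15-era CyclicDist/ReplicatedDist documentation as best I can tell and may need adjusting for other releases, and the Cyclic parameters here are an assumption:)

    use CyclicDist, ReplicatedDist;

    config const dimension = 10;

    // version3-style, by-hand replication: one row of B per locale, read as B(here.id, j)
    const byHandSpace = {0..#numLocales, 0..#dimension} dmapped Cyclic(startIdx=(0,0));
    var BbyHand: [byHandSpace] int;

    // version2-style: the Replicated distribution keeps one private copy per locale,
    // read simply as Brepl(j) on whichever locale is executing
    // (the class has also been spelled Replicated() in some releases)
    const replSpace = {0..#dimension} dmapped ReplicatedDist();
    var Brepl: [replSpace] int;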
>>>
>>> On Thu, Mar 30, 2017 at 8:21 PM, Brad Chamberlain <[email protected]> wrote:
>>>
>>>> I'm timing out today without digging into this, but will hope to look at it more tomorrow. If your profiling calls are still outside of the forall loop, we shouldn't expect it to be communication-free because the loop logically starts on locale 0 and has to create tasks on the other locales; but at a sniff, the image you sent does seem too high, at least for a single loop. My two guesses would be: (1) maybe the Replicated distribution is not working as well as it should (as I said, I'm not all that familiar with it, and I don't think it sees much action in practice); (2) maybe I led you astray in some way and we're hauling some global data structure across the network from locale 0 more than we should be...
>>>>
>>>> Would you send me the final version of the code from which you created this graph (if it changed from the last version modulo my suggested edits) as well as the problem size at which you ran it?
>>>>
>>>> BTW, thanks for pointing out that tuple indexing bug... I'm embarrassed that it slipped by us (me), and we've got a fix that we'll hope to merge tonight once it passes some more testing:
>>>>
>>>> https://github.com/chapel-lang/chapel/pull/5912
>>>>
>>>> Thanks again,
>>>> -Brad
>>>>
>>>> On Thu, 30 Mar 2017, Barry Moore wrote:
>>>>
>>>>> Hey Guys,
>>>>>
>>>>> Definitely better now, but there seems to be a lot of communication going on. Maybe I am misinterpreting this plot. I would have guessed that by forcing everything to be local on the node I wouldn't be doing any communication inside the work loop.
>>>>>
>>>>> Any thoughts? The rabbit hole gets deep fast, ha.
>>>>>
>>>>> - Barry
>>>>>
>>>>> On Thu, Mar 30, 2017 at 4:31 PM, Brad Chamberlain <[email protected]> wrote:
>>>>>
>>>>>>> Had a little time to cook up Brad's suggestion as a full program (pasted in [code][/code]). It is much better with respect to communication (not surprisingly, based on the comments above). Also, it looks a LOT cleaner.
>>>>>>
>>>>>> Good!
>>>>>>
>>>>>>> 1. Stylistically, camelCase everywhere? I noticed sometimes with domains, spaces, and locales you use CamelCase (there is probably a name for this, but I don't want to look it up). Is there a suggestion from you guys?
>>>>>>
>>>>>> Most of us on the project use camelCase, but not exclusively -- underscores_still_appear at times. I'm probably an outlier, but I tend to use InitialCapsCamelCase for distributed domains and arrays because, in my mind, they are the big-ticket items from a cost/weight perspective and deserve being distinguished in some way. Sometimes that bleeds over to using initial caps for my "main" domains/arrays that define the overall problem size/space in non-distributed programs as well. But again, I'd say use personal preference -- I'm not a big one for imposing naming/stylistic conventions on others (for better or worse), and to date there is no Chapel style guide.
>>>>>>
>>>>>>> 2. In this line `for (j, a) in zip({0..#dimension}, A(i(1), ..)) {`: I use 0 indexing everywhere, but I have to use i(1) in the for loop. It feels and looks awkward to me. Maybe it is unavoidable, but it will be hard for me to remember.
>>>>>>
>>>>>> We tried to make Chapel as index-neutral as possible (i.e., arrays can start at 0 or 1 or -3 or lo or whatever), but in a few places we had to make a choice, where tuple indexing (what you're doing here) is one of the main ones.
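(Concretely, the choice in question is that tuple components are numbered from 1 -- a tiny example of my own, reflecting the 1.15-era behavior:)

    const ij = (3, 7);              // a 2-tuple
    writeln(ij(1), " ", ij(2));     // prints "3 7": tuple indexing starts at 1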
>>>>>> I've tried to concoct ways of making tuple indexing either index-neutral or at least user-configurable, but haven't yet come up with anything appealing -- if you have ideas, send them in.
>>>>>>
>>>>>> In the meantime, one way to avoid relying on tuple indexing is to detuple, so:
>>>>>>
>>>>>>   const D = {1..n, 1..n};
>>>>>>
>>>>>>   forall (i,j) in D ...   // i and j are integers
>>>>>>
>>>>>> rather than:
>>>>>>
>>>>>>   forall ij in D ...      // ij is a 2-tuple
>>>>>>
>>>>>> This stacks with zippering as well:
>>>>>>
>>>>>>   forall ((i,j), a) in zip(D, A) ...
>>>>>>
>>>>>> But shoot, in your code, vectorCyclic should really be yielding integers, not tuples, since it's 1D. This looks to me like a regression that has been introduced since 1.14 (I assume you're working on master?) and is likely something that I broke... 1D arrays can be indexed using either 1-tuples or integers, so I'm guessing that all of our existing tests just feed the indices yielded by the loop back into array accesses and are none the wiser... :( I'm going to see whether we can get a fix in before next week's release to avoid this unintended change in behavior...
>>>>>>
>>>>>> Bottom line, you _ought_ to be able to write the inner loop as:
>>>>>>
>>>>>>   for (j, a) in zip({0..#dimension}, A(i, ..)) {
>>>>>>
>>>>>> and as a means of working around the regression and avoiding the tuple indexing, for the time being, you can write the outer loop as:
>>>>>>
>>>>>>   forall ((i,), c) in zip(vectorCyclic, C) {
>>>>>>
>>>>>> and once the regression is fixed you can change back to simply:
>>>>>>
>>>>>>   forall (i, c) in zip(vectorCyclic, C) {
>>>>>>     for (j, a) in zip({0..#dimension}, A(i, ..)) {
>>>>>>
>>>>>>> I still want to produce some timing results, but it will take me a little bit of time. Then, on to the matrix-matrix product :)
>>>>>>
>>>>>> Be sure to use the --fast flag before doing any timing experiments...
>>>>>>
>>>>>> -Brad
>>>>>
>>>>> --
>>>>> Barry E Moore II, PhD
>>>>> E-mail: [email protected]
>>>>> Assistant Research Professor, Center for Simulation and Modeling
>>>>> University of Pittsburgh, Pittsburgh, PA 15260
>>>
>>> --
>>> Barry E Moore II, PhD
>>> E-mail: [email protected]
>>> Assistant Research Professor, Center for Simulation and Modeling
>>> University of Pittsburgh, Pittsburgh, PA 15260
>
> --
> Barry E Moore II, PhD
> E-mail: [email protected]
> Assistant Research Professor, Center for Simulation and Modeling
> University of Pittsburgh, Pittsburgh, PA 15260

--
Barry E Moore II, PhD
E-mail: [email protected]
Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260
[Attachment: matrix-vector-multiply.chpl]
