Hi John --
Thanks for the response. I'm going to try and feed you just enough
information to get you started, so that if you don't get back to this
project, I haven't wasted too much time. But if you do get to it, you
should try to ask follow-up questions as necessary (and at that point,
_may_ want to slosh this thread over to chapel-developers -- it's
somewhere on the cusp between a user and developer topic...).
There haven't been significant changes to the distribution interface since
we last talked. We've started the discussion I alluded to previously
regarding different interpretations of slicing and reindexing, but I
think that's orthogonal to this discussion for the most part (though it
may ultimately simplify the interface required to implement a domain map).
There are two different approaches we've discussed, both of which I would
personally approach by making modifications to a copy of the Block
distribution in modules/dists/BlockDist.chpl:
-----
Approach 1: Specialize the Block distribution to support blocking in one
dimension only, but with varying block sizes across locales (as computed
heuristically, or by passing in "cut" values or ...).
Approach 2: Create a variation of the Block distribution that gives a user
the ability to pass in a "policy" object that implements the simplified
interface you proposed:
- "how big should this locale's local block be?"
- "which locale owns this index?"
- "please access the local array using this index."
-----
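To make the shape of that policy object concrete, here is a minimal
sketch in Python rather than Chapel (all names here are illustrative,
not part of any existing interface); the framework distribution of
Approach 2 would consult an object like this for its decisions:

```python
class EvenPolicy:
    """Hypothetical policy object answering the three questions above
    for an even 1D blocking of lo..hi across num_locales locales."""

    def __init__(self, lo, hi, num_locales):
        self.lo, self.hi, self.p = lo, hi, num_locales
        self.n = hi - lo + 1

    def chunk(self, loc_id):
        # Inclusive subrange owned by loc_id (even blocking).
        ceil = lambda a, b: -(-a // b)
        return (self.lo + ceil(loc_id * self.n, self.p),
                self.lo + ceil((loc_id + 1) * self.n, self.p) - 1)

    def local_block_size(self, loc_id):
        # "how big should this locale's local block be?"
        blo, bhi = self.chunk(loc_id)
        return bhi - blo + 1

    def index_to_locale(self, idx):
        # "which locale owns this index?"
        return min(self.p - 1, ((idx - self.lo) * self.p) // self.n)

    def local_index(self, idx):
        # "please access the local array using this index"
        # (zero-based offset within the owner's block)
        return idx - self.chunk(self.index_to_locale(idx))[0]

pol = EvenPolicy(1, 12, 2)
print(pol.local_block_size(0), pol.index_to_locale(7), pol.local_index(7))  # 6 1 0
```

A cut- or weight-based policy would implement the same three methods
with different arithmetic, leaving the framework distribution unchanged.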
Either one of these approaches will involve similar components of
BlockDist, so let me try and highlight the salient bits. I apologize in
advance for the lack of commenting and ample opportunity for cleanup in
BlockDist.chpl. It was written during the hectic HPCS years of Chapel and
hasn't had an opportunity to be freshened up since then. Just putting
these notes together, I came across a number of things that look like they
could be trivially cleaned up/simplified...
* I can't recall whether you've read the relevant papers or slides on
Chapel's distribution framework, but there are essentially six classes
involved that you can think of as being in a 3 x 2 grid: The 3-ary
dimension is "domain map", "domain", and "array". The other dimension
is global classes (those that represent the concept as a whole) and
local classes (those that represent a single locale's view of the
concept). These classes are named:
Block BlockDom BlockArr
LocBlock LocBlockDom LocBlockArr
respectively. If you need more background on their roles and
organization, please refer to the second and third papers (and/or their
slides) in the "Advanced Chapel Features" section of:
http://chapel.cray.com/papers.html
"Authoring User-Defined Domain Maps in Chapel"
"User-Defined Distributions and Layouts in Chapel: Philosophy and Framework"
* The LocBlock class is the crucial bit, I believe. This is the class
that specifies what indices the current locale owns for the Block
distribution. In the class, 'myChunk' is the crucial field, as it
describes the indices owned by the current locale (itself stored as
a non-distributed domain). This field is computed in the class's
constructor, LocBlock.LocBlock() (on line 497 of my current repo, but
your mileage may vary) using a utility function that we use to
compute blockings, _computeBlock(), defined in DSIUtil.chpl. My
suspicion is that if you redefine the logic to define 'myChunk' in
this constructor, much of the rest of the implementation will just
fall out. Specifically, when declaring domains and arrays over a
domain map, the determination of what each locale owns is typically
computed by looking at LocBlock.myChunk.
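For intuition, the even blocking that _computeBlock performs can be
sketched in Python (the function name and signature here are mine, not
Chapel's):

```python
def ceildiv(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def compute_block(lo, hi, num_locales, loc_id):
    """Return the inclusive subrange of lo..hi owned by loc_id under
    an even blocking, in the spirit of _computeBlock."""
    n = hi - lo + 1
    my_lo = lo + ceildiv(loc_id * n, num_locales)
    my_hi = lo + ceildiv((loc_id + 1) * n, num_locales) - 1
    return my_lo, my_hi

# Distributing 1..12 across 2 locales:
print(compute_block(1, 12, 2, 0))  # (1, 6)
print(compute_block(1, 12, 2, 1))  # (7, 12)
```

Redefining this logic (e.g., to honor cut points or per-locale weights)
is exactly the change that would flow into 'myChunk'.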
* The other crucial bit is the Block.dsiIndexToLocale() routine which
maps an index in rank-space to a locale in the domain map's
'targetLocales' array (indicating which locales the array should be
distributed between). As you'll see, this is implemented for Block
via a helper routine Block.targetLocsIdx(), which, if we've done our
math correctly, should be the inverse of _computeBlock() above
(raising the question of why these are in different files, but see the
"hectic years" note above...).
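That inverse relationship can be checked directly. Here is a Python
sketch (names are again mine, and the blocking is repeated so the check
is self-contained) verifying that the index-to-locale map lands every
index back in its own block:

```python
def ceildiv(a, b):
    return -(-a // b)

def compute_block(lo, hi, p, i):
    """Even blocking, as in the previous sketch."""
    n = hi - lo + 1
    return (lo + ceildiv(i * n, p), lo + ceildiv((i + 1) * n, p) - 1)

def index_to_locale(idx, lo, hi, p):
    """Plays the role of Block.targetLocsIdx / dsiIndexToLocale."""
    n = hi - lo + 1
    return min(p - 1, ((idx - lo) * p) // n)

# Round trip: every index maps to the locale whose block contains it.
lo, hi, p = 1, 12, 3
for idx in range(lo, hi + 1):
    owner = index_to_locale(idx, lo, hi, p)
    blo, bhi = compute_block(lo, hi, p, owner)
    assert blo <= idx <= bhi
print("inverse check passed")  # inverse check passed
```

If the blocking logic is changed, this is the pair of functions that
must be kept in sync.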
My hypothesis is that changing the above two things may be sufficient to
get simple programs that declare, access, and loop over distributed arrays
working, though it's also very possible that I'm missing some other key
pieces. Specifically...
* When we declare a domain over the Block distribution, the
BlockDom.setup() routine computes the subdomain owned by each locale
using the Block.getChunk() helper routine which returns the locale's
subdomain by intersecting the whole domain with the domain stored by
the corresponding LocBlock class. This is then stored in the
LocBlockDom class's 'myBlock' field.
* From there, the LocBlockArr class stores the local array of elements
in the 'myElems' array which in turn is declared over the locDom.myBlock
domain (a reference to that same locale's LocBlockDom object from the
previous bullet).
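The intersection step in the first bullet can be sketched as follows in
Python (ranges modeled as inclusive (lo, hi) pairs; names are mine, not
BlockDist's):

```python
def get_chunk(whole, my_chunk):
    """Intersect the declared domain with the locale's myChunk, which
    is what fills LocBlockDom.myBlock for each locale."""
    lo = max(whole[0], my_chunk[0])
    hi = min(whole[1], my_chunk[1])
    return (lo, hi) if lo <= hi else None  # None: empty intersection

# A domain {3..10} declared over a distribution whose locale 0 owns
# 1..6 of the bounding box:
print(get_chunk((3, 10), (1, 6)))  # (3, 6)
```

Note that this works unchanged for any redefinition of 'myChunk', which
is part of why redefining the blocking logic should "just fall out".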
Life being what it is, I'm probably missing some other key part... But I'm
pretty sure I'm not missing a dozen key parts, so I'll stop there for now
and hope that this is enough to get you going. What I'd suggest doing is
writing a simple program that declares a 1D block distributed domain and
array and initializes the array using the owning locale's ID, similar to
what's done in [test/release/]examples/primers/distributions.chpl. Then
I'd try replacing the computation of 'myChunk' and 'dsiIndexToLocale()' as
above, and re-compiling the program to see if it works. If it does, I'd
gradually move to more complicated programs to gain further confidence
that I didn't miss anything. If it doesn't, I'd follow up with more
questions.
I'll be very curious what you're able to accomplish. And if you create a
generalization of the Block() distribution that would be more broadly of
interest to people and can contribute it back, that'd be great!
Hope this is helpful,
-Brad
On Thu, 22 Jan 2015, John MacFrenz wrote:
Hi,
I have been busy with a lot of other projects lately, so I forgot to
reply to this...
However, I am still interested in this issue. At the moment I don't
have much time to dedicate to this, so my progress would be rather
slow, though.
So, I'd appreciate it if you could give me some hints on where to
start. Also, have there been any significant changes since we last
discussed this?
John
21.12.2014, 01:54, "Brad Chamberlain" <[email protected]>:
Hi John --
I'm trying to clean my inbox to wrap up the year, and am realizing that I
never replied to this last message on the thread for which I apologize
(you ran into the 1-2-punch that is SC and Thanksgiving back-to-back, and
then I never got back to older mails in December).
Just to put in a quick note, I think that we could support the approach
that you're suggesting for a simpler distribution specification in a way
that doesn't throw away the generality that the current scheme has (and
that I think it must have). Specifically, the approach would be to write
a distribution that, itself could be parameterized by a class providing
the simpler interface that you want to specify. This would put all the
complexity into a framework distribution and you could just plug in the
details in the characterizing class. I'm 99% certain that this could be
done with no changes to the compiler/runtime/etc. -- it would just require
someone writing that framework code.
Though not directly related to your question, I'll mention that we've also
started a conversation recently about a way to remove some of the more
complex routines from the distribution interface -- particularly the ones
related to slicing and reindexing. We'll see if that comes about in 2015
or not.
Anyway, I invite you to kick this thread back to life in early 2015 if
it's of interest to you still; and if not, I do apologize again for
letting it lapse this long.
-Brad
On Sat, 8 Nov 2014, John MacFrenz wrote:
Hi,
If I decided to go with Chapel, how would you recommend me to do load
balancing with existing tools when using rectangular grids?
I'm afraid I'm not familiar enough with your application area to know
precisely what to suggest. One simple model that we supported in ZPL but
have not yet taken to Chapel (that I recall) is the notion of a "cut"
distribution which is almost identical to the block distribution except
that rather than evenly dividing a bounding box across the locales, it
takes per-dimension hyperplanes and makes the cuts there instead of
evenly. In a 1D example, distributing 1..12 across 2 locales, you'd get
1..6 and 7..12 naively. With a cut distribution, you might say "my first
locale is 2x as capable as my second, so let's make the cut at 8 (or
7.5? :)", which would result in an allocation of 1..8 and 9..12,
respectively.
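In Python terms (purely illustrative; this mirrors no existing Chapel
code), the cut-based blocking is just a matter of splitting the range
at the given cut points:

```python
def cut_chunks(lo, hi, cuts):
    """Given cut points, return each locale's inclusive subrange of
    lo..hi.  cuts = [8] over 1..12 models the 2x-capable example."""
    bounds = [lo - 1] + list(cuts) + [hi]
    return [(bounds[i] + 1, bounds[i + 1])
            for i in range(len(bounds) - 1)]

print(cut_chunks(1, 12, [8]))  # [(1, 8), (9, 12)]
```

In the multidimensional case, the same splitting would be applied
per dimension using that dimension's hyperplane positions.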
If this type of statically skewed load balancing was sufficient for your
work, I think the approach would probably be to take the Block
distribution, modify its constructor to take the cut hyperplanes (rather
than a bounding box), change the logic used by the locales to determine
what they own, change the logic used to determine who owns index 'i', and
it very well may be that the rest would fall out. If you're interested in
pursuing this, let me know and I can provide a few more pointers in the
code specifically (though I should also warn that, with SC14 coming up,
I'm not likely to be the best at responding this month).
Would copying Block dist and making such changes be very difficult?
Basically, the kind of distribution I was thinking of would be very
similar to Block dist. Differences would be that the data would be
partitioned only along one axis, the longest axis. That would make it
easier to assign different sized blocks to locales. The constructor
would take a map as an argument which describes (relative) number of
elements assigned to each locale. Later I'd also like to add
functionality to change distribution during execution by passing a
similar map. That way I could do load balancing at some other place,
assuming I have a way to time the executions on each locale there.
I've been responding to this as I read, so I apologize for not having the
complete picture in my notes above. Yes, I think that this is essentially
a similar concept to the cut distribution I mentioned above and I think
that modifying Block to do this ought to be fairly tractable (at least,
I can't think of any major problems and believe it should be a modest
amount of work, albeit in a big chunk of code that's not very
user-friendly.
But I'd be happy to help guide you through the work and point to the main
places I can think of that would require changes).
Well, actually looking at the code of block dist almost managed to scare me
away from this task... But if you would be willing to help me with getting
started, then sure I could give it a try :). I didn't quite understand how the
cut distribution works, do you have any papers that would describe this?
Especially the hyperplanes confused me; my understanding of geometry is pretty
much limited to three dimensions...
In an earlier
post you said that could be done implicitly by distributions of arrays,
which would require custom domain map, or explicitly via on clauses.
Wouldn't doing that with on clauses cause a lot of unnecessary data
transfers between machines, or is there some clever mechanism that
avoids that?
On-clauses do require active messages to be created between locales, so
you do want to try to minimize their use when possible. But when creating
something that load balances between locales, it usually isn't possible to
do it without communicating with those other locales somehow...
I wasn't very clear about what I meant. Let's say that we have two locales, A
and B, and a distributed array. Machine A is significantly faster than B. If we
did load balancing using on-clauses, would the following happen every
time A used data distributed to B: the data is transferred from B to A,
A performs the computation, and the data is transferred back to B? So
is there some mechanism by which A can cache the data owned by B, and
only send it back if B actually needs it?
On the other hand, if these techniques work for you, it's easier than
writing new code. It's just not how I want Chapel to be used eventually
once we've beefed up our suite of distributions.
I don't have enough knowledge to say anything about the efficiency
issue. However, as an end-user I think that implementing a custom domain
map seems way too difficult. I think that either writing custom domain
maps should be easier,
That's a fine opinion to hold, but personally I have no idea how to make
it easier without requiring O(n) storage (where n is the number of indices
in your domain) and/or making performance terrible (and we already get
enough complaints about that). The current approach is our answer to what
is arguably an open research problem. The papers might be interesting
reading if you haven't seen them (see the "Advanced Chapel Features"
section at http://chapel.cray.com/papers.html). There obviously may be
other better solutions, but I have no idea what they would be after
working on this problem for the past 20 years. I may just have worked
myself too far into a specific mental rut, though.
One thing I'd find easy to use would be for the distribution's
constructor to take an object with at least the following methods as an
argument:
- One that takes a locale as an argument and returns the size of the
(zero-based, densely indexed) array that should be allocated to that
locale
- One that takes an index as an argument and returns which locale it belongs
to.
- One that takes an index as an argument and returns its position in the local
array.
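A sketch of what such an object might look like, written in Python for
illustration (all names here are hypothetical), partitioning n
zero-based indices by relative locale weights:

```python
import bisect

class WeightedPolicy:
    """Sketch of the three-method interface described above, for a 1D
    array of n elements partitioned by relative locale weights."""

    def __init__(self, n, weights):
        total = sum(weights)
        # starts[i] = first global index owned by locale i;
        # a trailing sentinel of n closes the last block.
        acc, self.starts = 0, []
        for w in weights:
            self.starts.append(acc * n // total)
            acc += w
        self.starts.append(n)

    def local_size(self, loc_id):
        # Size of the zero-based local array on loc_id.
        return self.starts[loc_id + 1] - self.starts[loc_id]

    def owner(self, idx):
        # Which locale owns global index idx?
        return bisect.bisect_right(self.starts, idx) - 1

    def local_index(self, idx):
        # Position of idx within its owner's local array.
        return idx - self.starts[self.owner(idx)]

# Locale 0 is twice as capable as locale 1; 12 elements total:
p = WeightedPolicy(12, [2, 1])
print([p.local_size(i) for i in range(2)])  # [8, 4]
```

Swapping in a new weights list is the hook that a later redistribution
mechanism could use.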
Also, the ability to constrain an implementation to a specific dimension
would make things easier. Another similar solution would be to write an
abstract class describing distributions of rectangular, densely indexed
grids, with those aforementioned methods defined as abstract. Apart from
those, generic default implementations could be provided for the
methods, or in some cases a compile-time error could be generated. I
don't know if these would be viable options in Chapel, though. However,
at least when it comes to user-defined distributions, I don't think a
one-size-fits-all approach would work very well. Instead, providing
specialized tools for the most common situations would be better, in my
opinion, like the ones I suggested above for rectangular, densely
indexed grids. As far as I can see, this solution would provide a lot of
flexibility, while still being rather easy to implement and not being too
inefficient.