Hi John --
Thanks for the response. I'm going to try and feed you just enough
information to get you started, so that if you don't get back to this
project, I haven't wasted too much time. But if you do get to it, you
should try to ask follow-up questions as necessary (and at that point,
_may_ want to slosh this thread over to chapel-developers -- it's
somewhere on the cusp between a user and developer topic...).
There haven't been significant changes to the distribution interface since
we last talked. We've started the discussion I alluded to previously
regarding different interpretations of slicing and reindexing, but I
think that's orthogonal to this discussion for the most part (though it
may ultimately simplify the interface required to implement a domain map).
There are two different approaches we've discussed, both of which I would
personally approach by making modifications to a copy of the Block
distribution in modules/dists/BlockDist.chpl:
-----
Approach 1: Specialize the Block distribution to support blocking in one
dimension only, but with varying block sizes across locales (as computed
heuristically, or by passing in "cut" values or ...).
Approach 2: Create a variation of the Block distribution that gives a user
the ability to pass in a "policy" object that implements the simplified
interface you proposed:
- "how big should this locale's local block be?"
- "which locale owns this index?"
- "please access the local array using this index."
-----
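To make the shape of that policy object concrete, here is a minimal
sketch in Python rather than Chapel (all names here are illustrative,
not part of any existing interface); the framework distribution of
Approach 2 would consult an object like this for its decisions:

```python
class EvenPolicy:
    """Hypothetical policy object answering the three questions above
    for an even 1D blocking of lo..hi across num_locales locales."""

    def __init__(self, lo, hi, num_locales):
        self.lo, self.hi, self.p = lo, hi, num_locales
        self.n = hi - lo + 1

    def chunk(self, loc_id):
        # Inclusive subrange owned by loc_id (even blocking).
        ceil = lambda a, b: -(-a // b)
        return (self.lo + ceil(loc_id * self.n, self.p),
                self.lo + ceil((loc_id + 1) * self.n, self.p) - 1)

    def local_block_size(self, loc_id):
        # "how big should this locale's local block be?"
        blo, bhi = self.chunk(loc_id)
        return bhi - blo + 1

    def index_to_locale(self, idx):
        # "which locale owns this index?"
        return min(self.p - 1, ((idx - self.lo) * self.p) // self.n)

    def local_index(self, idx):
        # "please access the local array using this index"
        # (zero-based offset within the owner's block)
        return idx - self.chunk(self.index_to_locale(idx))[0]

pol = EvenPolicy(1, 12, 2)
print(pol.local_block_size(0), pol.index_to_locale(7), pol.local_index(7))  # 6 1 0
```

A cut- or weight-based policy would implement the same three methods
with different arithmetic, leaving the framework distribution unchanged.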
Either one of these approaches will involve similar components of
BlockDist, so let me try and highlight the salient bits. I apologize in
advance for the lack of commenting and ample opportunity for cleanup in
BlockDist.chpl. It was written during the hectic HPCS years of Chapel and
hasn't had an opportunity to be freshened up since then. Just putting
these notes together, I came across a number of things that look like they
could be trivially cleaned up/simplified...
* I can't recall whether you've read the relevant papers or slides on
Chapel's distribution framework, but there are essentially six classes
involved that you can think of as being in a 3 x 2 grid: The 3-ary
dimension is "domain map", "domain", and "array". The other dimension
is global classes (those that represent the concept as a whole) and
local classes (those that represent a single locale's view of the
concept). These classes are named:
Block BlockDom BlockArr
LocBlock LocBlockDom LocBlockArr
respectively. If you need more background on their roles and
organization, please refer to the second and third papers (and/or their
slides) in the "Advanced Chapel Features" section of:
http://chapel.cray.com/papers.html
"Authoring User-Defined Domain Maps in Chapel"
"User-Defined Distributions and Layouts in Chapel: Philosophy and Framework"
* The LocBlock class is the crucial bit, I believe. This is the class
that specifies what indices the current locale owns for the Block
distribution. In the class, 'myChunk' is the crucial field, as it
describes the indices owned by the current locale (itself stored as
a non-distributed domain). This field is computed in the class's
constructor, LocBlock.LocBlock() (on line 497 of my current repo, but
your mileage may vary) using a utility function that we use to
compute blockings, _computeBlock(), defined in DSIUtil.chpl. My
suspicion is that if you redefine the logic to define 'myChunk' in
this constructor, much of the rest of the implementation will just
fall out. Specifically, when declaring domains and arrays over a
domain map, the determination of what each locale owns is typically
computed by looking at LocBlock.myChunk.
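For intuition, the even blocking that _computeBlock performs can be
sketched in Python (the function name and signature here are mine, not
Chapel's):

```python
def ceildiv(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def compute_block(lo, hi, num_locales, loc_id):
    """Return the inclusive subrange of lo..hi owned by loc_id under
    an even blocking, in the spirit of _computeBlock."""
    n = hi - lo + 1
    my_lo = lo + ceildiv(loc_id * n, num_locales)
    my_hi = lo + ceildiv((loc_id + 1) * n, num_locales) - 1
    return my_lo, my_hi

# Distributing 1..12 across 2 locales:
print(compute_block(1, 12, 2, 0))  # (1, 6)
print(compute_block(1, 12, 2, 1))  # (7, 12)
```

Redefining this logic (e.g., to honor cut points or per-locale weights)
is exactly the change that would flow into 'myChunk'.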
* The other crucial bit is the Block.dsiIndexToLocale() routine which
maps an index in rank-space to a locale in the domain map's
'targetLocales' array (indicating which locales the array should be
distributed between). As you'll see, this is implemented for Block
via a helper routine Block.targetLocsIdx(), which, if we've done our
math correctly, should be the inverse of _computeBlock() above
(raising the question of why these are in different files, but see the
"hectic years" note above...).
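That inverse relationship can be checked directly. Here is a Python
sketch (names are again mine, and the blocking is repeated so the check
is self-contained) verifying that the index-to-locale map lands every
index back in its own block:

```python
def ceildiv(a, b):
    return -(-a // b)

def compute_block(lo, hi, p, i):
    """Even blocking, as in the previous sketch."""
    n = hi - lo + 1
    return (lo + ceildiv(i * n, p), lo + ceildiv((i + 1) * n, p) - 1)

def index_to_locale(idx, lo, hi, p):
    """Plays the role of Block.targetLocsIdx / dsiIndexToLocale."""
    n = hi - lo + 1
    return min(p - 1, ((idx - lo) * p) // n)

# Round trip: every index maps to the locale whose block contains it.
lo, hi, p = 1, 12, 3
for idx in range(lo, hi + 1):
    owner = index_to_locale(idx, lo, hi, p)
    blo, bhi = compute_block(lo, hi, p, owner)
    assert blo <= idx <= bhi
print("inverse check passed")  # inverse check passed
```

If the blocking logic is changed, this is the pair of functions that
must be kept in sync.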
My hypothesis is that changing the above two things may be sufficient to
get simple programs that declare, access, and loop over distributed arrays
working, though it's also very possible that I'm missing some other key
pieces. Specifically...
* When we declare a domain over the Block distribution, the
BlockDom.setup() routine computes the subdomain owned by each locale
using the Block.getChunk() helper routine which returns the locale's
subdomain by intersecting the whole domain with the domain stored by
the corresponding LocBlock class. This is then stored in the
LocBlockDom class's 'myBlock' field.
* From there, the LocBlockArr class stores the local array of elements
in the 'myElems' array which in turn is declared over the locDom.myBlock
domain (a reference to that same locale's LocBlockDom object from the
previous bullet).
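The intersection step in the first bullet can be sketched as follows in
Python (ranges modeled as inclusive (lo, hi) pairs; names are mine, not
BlockDist's):

```python
def get_chunk(whole, my_chunk):
    """Intersect the declared domain with the locale's myChunk, which
    is what fills LocBlockDom.myBlock for each locale."""
    lo = max(whole[0], my_chunk[0])
    hi = min(whole[1], my_chunk[1])
    return (lo, hi) if lo <= hi else None  # None: empty intersection

# A domain {3..10} declared over a distribution whose locale 0 owns
# 1..6 of the bounding box:
print(get_chunk((3, 10), (1, 6)))  # (3, 6)
```

Note that this works unchanged for any redefinition of 'myChunk', which
is part of why redefining the blocking logic should "just fall out".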
Life being what it is, I'm probably missing some other key part... But I'm
pretty sure I'm not missing a dozen key parts, so I'll stop there for now
and hope that this is enough to get you going. What I'd suggest doing is
writing a simple program that declares a 1D block distributed domain and
array and initializes the array using the owning locale's ID, similar to
what's done in [test/release/]examples/primers/distributions.chpl. Then
I'd try replacing the computation of 'myChunk' and 'dsiIndexToLocale()' as
above, and re-compiling the program to see if it works. If it does, I'd
gradually move to more complicated programs to gain further confidence
that I didn't miss anything. If it doesn't, I'd follow up with more
questions.
I'll be very curious what you're able to accomplish. And if you create a
generalization of the Block() distribution that would be more broadly of
interest to people and can contribute it back, that'd be great!
Hope this is helpful,
-Brad
On Thu, 22 Jan 2015, John MacFrenz wrote:
Hi,
I have been busy with a lot of other projects lately, so I forgot to
reply to this...
However, I am still interested in this issue. At the moment I don't
have much time to dedicate to this, so my progress would be rather
slow, though.
So, I'd appreciate it if you could give me some hints on where to
start. Also, have there been any significant changes since we last
discussed this?
John
21.12.2014, 01:54, "Brad Chamberlain" <[email protected]>:
Hi John --
I'm trying to clean my inbox to wrap up the year, and am realizing that I
never replied to this last message on the thread for which I apologize
(you ran into the 1-2-punch that is SC and Thanksgiving back-to-back, and
then I never got back to older mails in December).
Just to put in a quick note, I think that we could support the approach
that you're suggesting for a simpler distribution specification in a way
that doesn't throw away the generality that the current scheme has (and
that I think it must have). Specifically, the approach would be to write
a distribution that, itself could be parameterized by a class providing
the simpler interface that you want to specify. This would put all the
complexity into a framework distribution and you could just plug in the
details in the characterizing class. I'm 99% certain that this could be
done with no changes to the compiler/runtime/etc. -- it would just require
someone writing that framework code.
Though not directly related to your question, I'll mention that we've also
started a conversation recently about a way to remove some of the more
complex routines from the distribution interface -- particularly the ones
related to slicing and reindexing. We'll see if that comes about in 2015
or not.
Anyway, I invite you to kick this thread back to life in early 2015 if
it's of interest to you still; and if not, I do apologize again for
letting it lapse this long.
-Brad
On Sat, 8 Nov 2014, John MacFrenz wrote:
Hi,
If I decided to go with Chapel, how would you recommend me to do load
balancing with existing tools when using rectangular grids?
I'm afraid I'm not familiar enough with your application area to know
precisely what to suggest. One simple model that we supported in ZPL but
have not yet taken to Chapel (that I recall) is the notion of a "cut"
distribution which is almost identical to the block distribution except
that rather than evenly dividing a bounding box across the locales, it
takes per-dimension hyperplanes and makes the cuts there instead of
evenly. In a 1D example, distributing 1..12 across 2 locales, you'd get
1..6 and 7..12 naively. With a cut distribution, you might say "my first
locale is 2x as capable as my second, so let's make the cut at 8 (or
7.5? :)", which would result in an allocation of 1..8 and 9..12,
respectively.
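In Python terms (purely illustrative; this mirrors no existing Chapel
code), the cut-based blocking is just a matter of splitting the range
at the given cut points:

```python
def cut_chunks(lo, hi, cuts):
    """Given cut points, return each locale's inclusive subrange of
    lo..hi.  cuts = [8] over 1..12 models the 2x-capable example."""
    bounds = [lo - 1] + list(cuts) + [hi]
    return [(bounds[i] + 1, bounds[i + 1])
            for i in range(len(bounds) - 1)]

print(cut_chunks(1, 12, [8]))  # [(1, 8), (9, 12)]
```

In the multidimensional case, the same splitting would be applied
per dimension using that dimension's hyperplane positions.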
If this type of statically skewed load balancing was sufficient for your
work, I think the approach would probably be to take the Block
distribution, modify its constructor to take the cut hyperplanes (rather
than a bounding box), change the logic used by the locales to determine
what they own, change the logic used to determine who owns index 'i', and
it very well may be that the rest would fall out. If you're interested in
pursuing this, let me know and I can provide a few more pointers in the
code specifically (though I should also warn that, with SC14 coming up,
I'm not likely to be the best at responding this month).
Would copying Block dist and making such changes be very difficult?
Basically, the kind of distribution I was thinking of would be very
similar to Block dist. Differences would be that the data would be
partitioned only along one axis, the longest axis. That would make it
easier to assign different sized blocks to locales. The constructor
would take a map as an argument which describes (relative) number of
elements assigned to each locale. Later I'd also like to add
functionality to change distribution during execution by passing a
similar map. That way I could do load balancing at some other place,
assuming I have a way to time the executions on each locale there.
I've been responding to this as I read, so I apologize for not having the
complete picture in my notes above. Yes, I think that this is essentially
a similar concept to the cut distribution I mentioned above and I think
that modifying Block to do this ought to be fairly tractable (at least,
I can't think of any major problems and believe it should be a modest
amount of work, albeit in a big chunk of code that's not very
user-friendly.
But I'd be happy to help guide you through the work and point to the main
places I can think of that would require changes).
Well, actually looking at the code of block dist almost managed to scare me
away from this task... But if you would be willing to help me with getting
started, then sure I could give it a try :). I didn't quite understand how the
cut distribution works, do you have any papers that would describe this?
Especially the hyperplanes confused me; my understanding of geometry is pretty
much limited to three dimensions...
In an earlier
post you said that could be done implicitly by distributions of arrays,
which would require custom domain map, or explicitly via on clauses.
Wouldn't doing that with on clauses cause a lot of unnecessary data
transfers between machines, or is there some clever mechanism that
avoids that?
On-clauses do require active messages to be created between locales, so
you do want to try to minimize their use when possible. But when creating
something that load balances between locales, it usually isn't possible to
do it without communicating with those other locales somehow...
I wasn't very clear about what I meant. Let's say that we have two locales, A
and B, and a distributed array. Machine A is significantly faster than B. If we
did load balancing using on-clauses, would the following happen every
time A used data distributed to B: the data is transferred from B to A,
A performs the computation, and the data is transferred back to B? So
is there some mechanism by which A can cache the data owned by B, and
only send it back if B actually needs it?
On the other hand, if these techniques work for you, it's easier than
writing new code. It's just not how I want Chapel to be used eventually
once we've beefed up our suite of distributions.
I don't have enough knowledge to say anything about the efficiency
issue. However, as an end-user I think that implementing a custom domain
map seems way too difficult. I think that either writing custom domain
maps should be easier,
That's a fine opinion to hold, but personally I have no idea how to make
it easier without requiring O(n) storage (where n is the number of indices
in your domain) and/or making performance terrible (and we already get
enough complaints about that). The current approach is our answer to what
is arguably an open research problem. The papers might be interesting
reading if you haven't seen them (see the "Advanced Chapel Features"
section at http://chapel.cray.com/papers.html). There obviously may be
other better solutions, but I have no idea what they would be after
working on this problem for the past 20 years. I may just have worked
myself too far into a specific mental rut, though.
One thing I'd find easy to use would be for the distribution's
constructor to take an object with at least the following methods as an
argument:
- One that takes a locale as an argument and returns the size of the
(zero-based, densely indexed) array that should be allocated to that
locale
- One that takes an index as an argument and returns which locale it belongs
to.
- One that takes an index as an argument and returns its position in the local
array.
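A sketch of what such an object might look like, written in Python for
illustration (all names here are hypothetical), partitioning n
zero-based indices by relative locale weights:

```python
import bisect

class WeightedPolicy:
    """Sketch of the three-method interface described above, for a 1D
    array of n elements partitioned by relative locale weights."""

    def __init__(self, n, weights):
        total = sum(weights)
        # starts[i] = first global index owned by locale i;
        # a trailing sentinel of n closes the last block.
        acc, self.starts = 0, []
        for w in weights:
            self.starts.append(acc * n // total)
            acc += w
        self.starts.append(n)

    def local_size(self, loc_id):
        # Size of the zero-based local array on loc_id.
        return self.starts[loc_id + 1] - self.starts[loc_id]

    def owner(self, idx):
        # Which locale owns global index idx?
        return bisect.bisect_right(self.starts, idx) - 1

    def local_index(self, idx):
        # Position of idx within its owner's local array.
        return idx - self.starts[self.owner(idx)]

# Locale 0 is twice as capable as locale 1; 12 elements total:
p = WeightedPolicy(12, [2, 1])
print([p.local_size(i) for i in range(2)])  # [8, 4]
```

Swapping in a new weights list is the hook that a later redistribution
mechanism could use.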
Also, the ability to constrain an implementation to a specific dimension
would make things easier. Another similar solution would be to write an
abstract class describing distributions of rectangular, densely indexed
grids, with those aforementioned methods defined as abstract. Apart from
those, generic default implementations could be provided for the
methods, or in some cases a compile-time error could be generated. I
don't know if these would be viable options in Chapel, though. However,
at least when it comes to user-defined distributions, I don't think a
one-size-fits-all approach would work very well. Instead, providing
specialized tools for the most common situations would be better, in my
opinion, like the ones I suggested above for rectangular, densely
indexed grids. As far as I can see, this solution would provide a lot of
flexibility, while still being rather easy to implement and not being too
inefficient.