On Fri, Jul 14, 2017 at 10:45 AM, Brad Chamberlain <[email protected]> wrote:

>
> Hi Jeff --
>
> Thanks for the notes.  After sending these mails this week, I'd been
> musing over whether I believed "CMO and RMO are sufficient for most cases"
> vs. "any dimension ordering should be permitted", so it's good to hear
> about a context that would benefit from more general orderings.
>
> I don't think it would be terribly difficult to write and use a domain map
> that supported arbitrary dimension memory orderings in Chapel -- that's
> certainly the type of thing that domain maps were designed to support.  I
> think the biggest challenge would probably be making it (a)
> rank-independent with (b) the dimension ordering known at compile-time. It
> seems this would want a combination of varargs and param (statically known)
> values that I don't think we support well today.  Creating a
> specialized-to-4D case would be pretty simple, but would of course be less
> satisfying from a generality perspective.
>
>
I don't know any good uses for arrays with more than 8 dimensions in this
context, and 4/6/8 are more useful than 5/7.  Obviously, 3 is quite useful,
perhaps so much that specializing makes sense.


> Thinking about my exchange with Michael the other day, I'd be inclined to
> make this a new domain map ('e.g., dmapped PermuteDims(3, 4, 1, 2)') rather
> than extending the DefaultRectangular domain map to support this ability
> (as I've argued we could/should do for CMO).  My rationale is that I think
> the code would be different enough (and just epsilon-more-complex enough)
> from the current DefaultRectangular implementation that the two wouldn't
> live within a single implementation as well.  But I could be wrong about
> that.
>
>
It would not bother me at all if I needed a domain map name for tensors.


> The challenge w.r.t. rank-independence and vararg params I mention above
> would be simpler if only the most memory-consecutive dimension mattered, as
> Apan characterizes in his mail:
>
>
Contiguous access is the most important issue, but cache affinity and TLB
misses matter quite a bit at second-order.


> we only care about which dimension is the innermost
>>
>
> But it sounds like for your quantum chem context, you care about the
> ordering of all dimensions... true?  I ask only from a perspective of
> wishful thinking because specifying one "tightest" dimension is a much
> easier interface to support in a rank-independent manner.  :)
>
>
I care about the ordering of all dimensions to the extent that I care about
performance ;-)

Once arrays become distributed, having control over the ordering of all
dimensions may mean the difference between a contiguous get to one node and
a one-sided gather across dozens of nodes.

Supporting only the tightest dimension seems equivalent to having the user
flatten all but the tightest dimension into a single dimension, which can
be implemented manually without much overhead, since the offset computation
won't happen in the inner loop in this case.

Best,

Jeff


> Thanks,
> -Brad
>
>
>
> On Thu, 13 Jul 2017, Jeff Hammond wrote:
>
> I think it would be incredibly useful to allow a user to specify the layout
>> ordering of array dimensions, not just column-major (i.e. left-to-right in
>> N-d) and row-major (i.e. right-to-left in N-d).
>>
>> For example, quantum chemists do a lot with 4D arrays and use all 4!=24
>> access patterns at one point or another.  It would be neat if one could
>> define a 4D array with e.g. 1234 dimension ordering (this is either row or
>> column major depending on what one means by "first dimension") and another
>> with e.g. 3412 ordering (which is a column-major matrix of row-major
>> matrices in one convention).
>>
>> From this, one would expect array assignment to implement the out-of-place
>> array transposition internally, which allows the runtime to use an optimal
>> implementation.  This matters, because the optimal implementation is >2x
>> faster than the naive one a user writes (e.g.
>> https://arxiv.org/abs/1704.04374, https://arxiv.org/abs/1607.01249).
>>
>> Implementing any ordering of dimensions is pretty easy under the hood, and
>> doing so is often much more efficient because compilers appear to be
>> better
>> at indexing multi-dimensional arrays than they are at optimizing integer
>> expressions used to compute explicit offsets into 1D arrays.  I saw this
>> based upon my experience with writing tensor code in Fortran and C99 for
>> more than a decade.
>>
>> In any case, Chapel should do all that it can to ensure that nobody every
>> writes code that looks like this:
>>
>> t3[h2+h2u*(h3+h3u*(h1+h1u*(p4+p4u*(p6+p6u*p5))))] +=
>> t2[h7+h7u*(p4+p4u*(h1+h1u*h2))] * v2[h7+h7u*(h3+h3u*(p6+p6u*p5))];
>>
>> (
>> https://github.com/jeffhammond/nwchem-tce-triples-kernels/
>> blob/master/src/ccsd_t_kernels_1d.c#L696
>> )
>>
>> One can also think about not-strictly-linear orderings like Morton/Z-order
>> and space-filling curves but I'm not going to attempt to do so.
>>
>> Jeff
>>
>>
>> On Wed, Jul 12, 2017 at 5:42 AM, Michael Ferguson <[email protected]>
>> wrote:
>>
>>>
>>> Hi -
>>>
>>> It may well be the right choice to have DefaultRectangular
>>> be configurable in row-major vs column-major layout, but
>>> I have some reservations about this approach.
>>>
>>> First, DefaultRectangular has to support arbitrary dimension.
>>> Is it important to support arbitrary dimension "column-major"
>>> layouts? What about a 3D layout, where only the middle dimension
>>> is stored differently from how it works now? As far as I know,
>>> people interested in "column major" arrays are talking
>>> about 2D arrays / matrices.
>>>
>>> Second, I'm not so sure it's a good precedent to put it in
>>> to DefaultRectangular. Won't we prefer to make any new layout a
>>> configurable option in DefaultRectangular for to "minimize software
>>> layers and complexity"? I think it would be preferably to allow
>>> other layouts to exist separately and to solve whatever software
>>> engineering/complexity problems come up. (E.g. giving Block
>>> the ability to select the layout used).
>>>
>>> Couldn't we view a Column-major layout as an array that is
>>> similar to an array view? That is, it is not really storing
>>> the array as much as adjusting the mapping from indices
>>> to elements. Isn't that similar to a rank-change array view,
>>> in particular? Didn't we just decide that it was better
>>> to make a rank-change array view a different type, rather
>>> than folding it in to DefaultRectangular? Why is this different?
>>>
>>> Cheers,
>>>
>>> -michael
>>>
>>>
>>>> One belated follow-up that I meant to include in the original mail:
>>>>
>>>> An advantage of having 'DefaultRectangular' able to choose between row-
>>>> and column-major order (or potentially, eventually, other layouts --
>>>> tiling?  Morton order?) is that most other domain maps are built in
>>>> terms
>>>> of DefaultRectangular.  So (I believe) it would probably simplify our
>>>> ability to have other array implementations change their layout.  E.g.,
>>>> you could imagine having 'Block' take a similar 'param' and thread it
>>>> down
>>>> into the DefaultRectangulars that it uses to represent each local's
>>>> block
>>>> of data rather than having it switch to a completely different domain
>>>>
>>> map.
>>
>>>
>>>> -Brad
>>>>
>>>>
>>>> On Tue, 11 Jul 2017, Brad Chamberlain wrote:
>>>>
>>>>
>>>>> Hi Apan --
>>>>>
>>>>> Thanks for continuing to wrestle with these questions of memory layout
>>>>> which,
>>>>> I agree, are very important (and increasingly so).
>>>>>
>>>>> A few people have taken stabs at doing column-major domain maps in
>>>>> Chapel
>>>>> over time (e.g., test/arrays/marybeth/CMO_array.chpl was an early
>>>>> proof-of-concept that has arguably long outlived its utility and I
>>>>> suspect
>>>>> isn't really worth poking into beyond noting that it exists). But, as
>>>>> you've
>>>>> probably observed, none have made it into the standard modules or
>>>>> release.  I
>>>>> think it'd be great to change that.
>>>>>
>>>>> My thought about how to approach this would be to have the
>>>>> DefaultRectangular
>>>>> layout take a 'param' argument indicating whether to use row- or
>>>>> column-major
>>>>> order in order to minimize software layers and complexity while keeping
>>>>> that
>>>>> decision as much in the core array type as possible. I've long
>>>>> hypothesized
>>>>> that the changes to the domain map ought to be as simple as adding the
>>>>> param
>>>>> and reversing a few of the loops that iterate over the dimensions in
>>>>> order to
>>>>> calculate offsets and such.  But I could be overlooking something.
>>>>>
>>>>> Then the remaining challenge would be around how to expose
>>>>> 'DefaultRectangular' to the user so that they could invoke it directly
>>>>> to
>>>>> override the default value of that param, say. This would likely
>>>>> involve a
>>>>> conversation about naming (DefaultRectangular has always been intended
>>>>> as an
>>>>> internal working name rather than something a user would call, but with
>>>>> an
>>>>> appropriate choice of name, I think we could/perhaps should expose it
>>>>> to the
>>>>> user.  I think of this as some wrestling with naming (which is always
>>>>> surprisingly hard) and minor code shuffling.
>>>>>
>>>>> (Or maybe we'd just have two instances of DefaultRectangular available
>>>>> to the
>>>>> user to use in a 'dmapped' clause if they wanted to be explicit, one
>>>>> set up
>>>>> for row-major order and one for column-major?).
>>>>>
>>>>> Anyway, those are my immediate thoughts -- others' may vary,
>>>>> -Brad
>>>>>
>>>>>
>>>>> On Tue, 11 Jul 2017, Qasem, Apan M wrote:
>>>>>
>>>>> Hello,
>>>>>>
>>>>>> I am investigating data layout transformations in Chapel that are
>>>>>> suitable
>>>>>> for heterogeneous memory hierarchies. Organization of data - layout
>>>>>> and
>>>>>> placement - can have a huge impact on the performance of heterogeneous
>>>>>> applications. For example, an AoS data structure allocated in host
>>>>>> (CPU)
>>>>>> memory, should generally be converted to SoA when being mapped to
>>>>>> device
>>>>>> (GPU/accelerator) memory to improve memory coalescing behavior. Below
>>>>>> is a
>>>>>> simple example of two alternate layouts for an array of records for a
>>>>>> GPU
>>>>>> offloaded task (AMD is currently working on providing GPU support in
>>>>>> Chapel). The discrete array implementation yields almost a factor of
>>>>>> two
>>>>>> speedup over the array-of-records implementation on an AMD APU
>>>>>> (integrated
>>>>>> GPU) platform.
>>>>>>
>>>>>> record img {
>>>>>>  var r: real;
>>>>>>  var g: real;
>>>>>>  var b: real;
>>>>>>  var x: real;
>>>>>> }
>>>>>>
>>>>>> var dom: domain(1) = 1 .. N;
>>>>>>
>>>>>> var src: [dom] img;
>>>>>> var dst: [dom] img;
>>>>>>
>>>>>> on (Locales[0]:LocaleModel).GPU do {
>>>>>>  forall i in 1 .. N {
>>>>>>    dst[i].r = src[i].r * v0 – src[i].g * v1;
>>>>>>    dst[i].x = v1;
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> var src_r: [dom] real;
>>>>>> var src_g: [dom] real;
>>>>>> var src_b: [dom] real;
>>>>>> var src_x: [dom] real;
>>>>>>
>>>>>> var dst_r: [dom] real;
>>>>>> var dst_g: [dom] real;
>>>>>> var dst_b: [dom] real;
>>>>>> var dst_x: [dom] real;
>>>>>>
>>>>>> on (Locales[0]:LocaleModel).GPU do {
>>>>>>  forall i in 1 .. N {
>>>>>>    dst_r[i] = src_r[i] * v0 - src_g[i] * v1;
>>>>>>    dst_x[i] = v1;
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> As part of this work, I implemented a domain map module for
>>>>>> column-major
>>>>>> ordering of multi-dimensional arrays. Depending on the data access
>>>>>> patterns, a column-major ordering may be appropriate for device-mapped
>>>>>> data
>>>>>> structures because it can reduce memory divergence in the kernel. It
>>>>>> should
>>>>>> be noted, however, that column major ordering can have a major impact
>>>>>> on
>>>>>> CPU performance as well. For instance, when the innermost  loop index
>>>>>> strides through contiguous elements of a column.
>>>>>>
>>>>>> Below is an example, of how the column-major layout would be invoked
>>>>>> in a
>>>>>> Chapel program
>>>>>>
>>>>>>
>>>>>>
>>>>>> var domRowMajor : domain(2) = {rows, cols};
>>>>>> var domColMajor : domain(2) dmapped ColMajor() = {rows, cols};
>>>>>>
>>>>>> var A : [domRowMajor] real;
>>>>>> var B : [domColMajor] real;
>>>>>>
>>>>>> for i in rows {
>>>>>>  for j in cols {
>>>>>>    A[i,j] = 17.0; // accessed as row-major
>>>>>>    B[i,j] = 0.0;  // accessed as col-major
>>>>>>  }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Implementation of ColMajor() draws from the implementation of the
>>>>>> DefaultRectangular() module. Column major ordering is enforced by
>>>>>> re-mapping the indices in the domain. For a 2D array, this is
>>>>>> essentially a
>>>>>> swap of the two dimensions. For higher order arrays, the dimensions
>>>>>> are
>>>>>> reversed such that the nth dimension becomes the innermost (i.e.,
>>>>>> contiguous). The entire code is contained in the file
>>>>>> modules/layouts/ColMajor.chpl.
>>>>>>
>>>>>> I would like to make a pull request to integrate the ColMajor() module
>>>>>> into
>>>>>> Chapel. But I wanted to get feedback from the dev community first.
>>>>>>
>>>>>> - Apan
>>>>>>
>>>>>
>>>
>>> ------------------------------------------------------------
>> ------------------
>>
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>> _______________________________________________
>>> Chapel-developers mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
>>>
>>
>>
>>
>>
>> --
>> Jeff Hammond
>> [email protected]
>> http://jeffhammond.github.io/
>>
>


-- 
Jeff Hammond
[email protected]
http://jeffhammond.github.io/
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to