Rafael,
Thanks. This clarification helps a lot.
Aroon Sharma
University of Maryland, Class of 2015
M.S. Computer Engineering
(301) 908-9528
On Friday, December 19, 2014 11:15 AM, Rafael Asenjo Plaza <[email protected]>
wrote:
Hi Aroon and all,
On 19/12/2014, at 03:57, Aroon Sharma <[email protected]> wrote:
With respect to the work that was done by the University of Malaga, our work
applies bulk transfers to more generic zippered loops that zip Cyclic and Block
Cyclic array slices. From what I remember, their work was restricted to whole
array assignments between Block and Cyclic arrays (i.e., A = B where B is Block
and A is Cyclic). Since whole array assignment and zippered iteration are
fundamentally related, I think there is a lot of overlap between both works. In
fact, our implementation uses a strided communication primitive that they
developed.
As stated in this paper:
http://www.ac.uma.es/~compilacion/publicaciones/UMA-DAC-12-02.pdf
array assignments do not necessarily need to assign whole arrays to benefit from
the bulk transfer optimization. More precisely, we aggregate data for
assignments of the form:
A[Da] = B[Db] where,
- A is a Block or Cyclic array,
- B is a Block or Cyclic array,
- Da is of the form {xa1..ya1 by za1, xa2..ya2 by za2, …, xan..yan by zan}, and
- Db is of the form {xb1..yb1 by zb1, xb2..yb2 by zb2, …, xbn..ybn by zbn}.
That way, this optimization covers block-to-block, cyclic-to-cyclic,
block-to-cyclic, and cyclic-to-block assignments. It is not enabled by
default, so -s useBulkTransferStride has to be specified to turn it on.
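To make the idea of aggregation concrete, here is a minimal Python sketch (a single-process model, not Chapel and not the actual implementation). It groups the element pairs of A[Da] = B[Db] by (source locale, destination locale), so each group could in principle travel as one bulk message instead of one message per element. The names block_owner, cyclic_owner, and plan_transfers, the 4-locale setup, and the simplified ownership formulas are all assumptions made for illustration.

```python
from collections import defaultdict

NUM_LOCALES = 4  # assumed locale count, for illustration only


def block_owner(i, lo=1, hi=16, num_locales=NUM_LOCALES):
    """Simplified Block mapping: split lo..hi into contiguous chunks."""
    n = hi - lo + 1
    return ((i - lo) * num_locales) // n


def cyclic_owner(i, start=1, num_locales=NUM_LOCALES):
    """Simplified Cyclic mapping: deal indices round-robin from start."""
    return (i - start) % num_locales


def plan_transfers(da, db, owner_a, owner_b):
    """Group the pairs of A[da] = B[db] by (source locale, dest locale)."""
    if len(da) != len(db):
        raise ValueError("Da and Db must have the same number of indices")
    groups = defaultdict(list)
    for ia, ib in zip(da, db):
        groups[(owner_b(ib), owner_a(ia))].append((ib, ia))
    return dict(groups)


# Hypothetical example: A is Cyclic, B is Block, both declared over 1..16.
# Da = {1..15 by 2}, Db = {2..9}
plan = plan_transfers(range(1, 16, 2), range(2, 10),
                      owner_a=cyclic_owner, owner_b=block_owner)

for (src, dst), pairs in sorted(plan.items()):
    print(f"locale {src} -> locale {dst}: {len(pairs)} element(s)")
```

In this toy case the number of groups is smaller than the number of elements, which is the saving that bulk transfer is after; the real runtime would additionally express each group as a strided transfer.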
Our work, for example, can aggregate something like:
forall (a, b, c) in zip(A[1..100], B[2..101], C[3..102]) { a = b + c;}
where A, B, and C are all Cyclic. Because different array slices are referenced
in the zippering, a, b, and c will be from different locales on all iterations
of the loop. I don't believe that the work by the University of Malaga could be
applied to situations like this.
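The locality claim above can be checked with a small Python model (again a sketch, not Chapel). It assumes 4 locales and a Cyclic distribution that starts at index 1; both are assumptions, as is the helper name cyclic_owner. It lists which locale owns a, b, and c on each iteration of the zippered loop.

```python
NUM_LOCALES = 4  # assumed locale count, for illustration only


def cyclic_owner(i, start=1, num_locales=NUM_LOCALES):
    """Simplified Cyclic mapping: deal indices round-robin from start."""
    return (i - start) % num_locales


# Owners of (a, b, c) for zip(A[1..100], B[2..101], C[3..102])
owners = [
    (cyclic_owner(ia), cyclic_owner(ib), cyclic_owner(ic))
    for ia, ib, ic in zip(range(1, 101), range(2, 102), range(3, 103))
]

# True when a, b and c live on three different locales in every iteration
all_distinct = all(len(set(t)) == 3 for t in owners)

# If each iteration runs on the locale that owns a, every b and c whose
# owner differs from a's owner is a fine-grained remote read
remote_reads = sum((b != a) + (c != a) for a, b, c in owners)

print(all_distinct, remote_reads)
```

Under these assumptions every iteration touches three different locales, so without aggregation the loop performs two remote reads per iteration. With only 2 locales, a and c would share a locale, so the exact count depends on the locale count.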
We have certainly not tackled the problem you describe above. However,
thinking offhand, with our optimization the required slices of B and C (in your
example) could be moved to temporary arrays on the locales owning the
corresponding slice of A, and the computation could then be done locally. If
you implement further optimizations, such as overlapping communication and
computation or minimizing data movement or memory footprint, then our work is
not directly applicable to the situation you describe.
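The two-stage idea (copy slices into temporaries, then compute locally) can be sketched as a single-process Python model. Everything here is hypothetical: the 4-locale Cyclic layout, the dictionaries standing in for distributed arrays, and the counter bulk_messages, which only counts how many slice copies the scheme would issue.

```python
NUM_LOCALES = 4  # assumed locale count, for illustration only


def owner(i, start=1):
    """Simplified Cyclic mapping shared by A, B and C in this model."""
    return (i - start) % NUM_LOCALES


# Dictionaries stand in for the distributed arrays (index -> value)
B = {i: 10 * i for i in range(1, 103)}
C = {i: i for i in range(1, 103)}
A = {}

bulk_messages = 0
for loc in range(NUM_LOCALES):
    # Indices of A[1..100] that this locale owns
    local_a = [i for i in range(1, 101) if owner(i) == loc]

    # Stage 1: copy the matching elements of B[2..101] and C[3..102] into
    # temporaries on this locale. With the same Cyclic layout, each slice
    # comes from a single remote locale, so one strided copy per array.
    tmp_b = [B[i + 1] for i in local_a]
    tmp_c = [C[i + 2] for i in local_a]
    bulk_messages += 2

    # Stage 2: purely local computation on the temporaries
    for i, b, c in zip(local_a, tmp_b, tmp_c):
        A[i] = b + c

print(len(A), bulk_messages)
```

In this model the loop needs two slice copies per locale instead of two remote reads per iteration, at the cost of the extra temporary storage mentioned above.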
Regards,
Rafa.
Rafael Asenjo Plaza
Dept. Arquitectura de Computadores
Complejo Tecnologico, Campus de Teatinos
E-29071 MALAGA (SPAIN)
Tel: +34 95 213 27 91
Fax: +34 95 213 27 90
http://www.ac.uma.es/~asenjo
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers