Rafael,

Thanks. This clarification helps a lot.

Aroon Sharma
University of Maryland, Class of 2015
M.S. Computer Engineering
(301) 908-9528 

On Friday, December 19, 2014 11:15 AM, Rafael Asenjo Plaza <[email protected]> wrote:
   

 Hi Aroon and all,
On 19/12/2014, at 03:57, Aroon Sharma <[email protected]> wrote:

With respect to the work that was done by the University of Malaga, our work 
applies bulk transfers to more generic zippered loops that zip Cyclic and Block 
Cyclic array slices. From what I remember, their work was restricted to whole 
array assignments between Block and Cyclic arrays (i.e., A = B where B is Block 
and A is Cyclic). Since whole array assignment and zippered iteration are 
fundamentally related, I think there is a lot of overlap between both works. In 
fact, our implementation uses a strided communication primitive that they 
developed. 

As stated in this paper:
http://www.ac.uma.es/~compilacion/publicaciones/UMA-DAC-12-02.pdf
array assignments do not necessarily need to assign whole arrays to benefit from 
the bulk transfer optimization. More precisely, we aggregate data for 
assignments of the form: 
A[Da] = B[Db]  where
 - A is a Block or Cyclic array,
 - B is a Block or Cyclic array,
 - Da is of the form {xa1..ya1 by za1, xa2..ya2 by za2, …, xan..yan by zan} and
 - Db is of the form {xb1..yb1 by zb1, xb2..yb2 by zb2, …, xbn..ybn by zbn}.
That way, this optimization covers block-to-block, cyclic-to-cyclic, 
block-to-cyclic and cyclic-to-block assignments. It is not enabled by 
default, so -s useBulkTransferStride has to be specified to turn it on.
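
As a rough illustration of why assignments of the form A[Da] = B[Db] can be aggregated, here is a minimal sketch in plain Python (a model, not Chapel and not the actual implementation). It groups the element pairs of a 1-D strided assignment by (locale owning the A element, locale owning the B element); each group is a candidate for one bulk transfer. The owner formulas are simplified assumptions about 1-D Block and Cyclic layouts, and the names cyclic_owner, block_owner and group_by_owner_pair are hypothetical.

```python
from collections import defaultdict


def cyclic_owner(i, start, num_locales):
    """Assumed 1-D Cyclic layout: indices are dealt round-robin from `start`."""
    return (i - start) % num_locales


def block_owner(i, lo, hi, num_locales):
    """Assumed 1-D Block layout: lo..hi is cut into contiguous chunks."""
    idx = (i - lo) * num_locales // (hi - lo + 1)
    return max(0, min(num_locales - 1, idx))


def group_by_owner_pair(da, db, owner_a, owner_b):
    """Group (ia, ib) pairs of A[da] = B[db] by (owner of A[ia], owner of B[ib])."""
    if len(da) != len(db):
        raise ValueError("Da and Db must contain the same number of indices")
    groups = defaultdict(list)
    for ia, ib in zip(da, db):
        groups[(owner_a(ia), owner_b(ib))].append((ia, ib))
    return dict(groups)


NUM_LOCALES = 2
da = range(1, 17, 2)  # models {1..16 by 2}
db = range(1, 9)      # models {1..8}

# A is modeled as Block over 1..16, B as Cyclic starting at index 1.
groups = group_by_owner_pair(
    da,
    db,
    lambda i: block_owner(i, 1, 16, NUM_LOCALES),
    lambda i: cyclic_owner(i, 1, NUM_LOCALES),
)

for (dst, src), pairs in sorted(groups.items()):
    print(f"locale {src} -> locale {dst}: {pairs}")
```

In this model the eight element-wise copies collapse into four groups, and within each group the source indices are themselves strided, which is the kind of pattern a strided communication primitive can move in one operation.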

Our work, for example, can aggregate something like:
    forall (a, b, c) in zip(A[1..100], B[2..101], C[3..102]) {
        a = b + c;
    }
where A, B, and C are all Cyclic. Because different array slices are referenced 
in the zippering, a, b, and c will be from different locales on all iterations 
of the loop. I don't believe that the work by the University of Malaga could be 
applied to situations like this.
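
The claim that a, b, and c live on different locales can be checked with a small plain-Python model (again a sketch, not Chapel). It assumes a 1-D Cyclic layout that starts at index 1 and uses 4 locales; both are assumptions, as are the names cyclic_owner and zipped_owners. It also counts how many element-wise remote reads the loop would need, versus how many distinct (array, destination locale, source locale) combinations exist, which is an idealized lower bound on the number of aggregated transfers rather than a measurement.

```python
def cyclic_owner(i, start, num_locales):
    """Assumed 1-D Cyclic layout: indices are dealt round-robin from `start`."""
    return (i - start) % num_locales


def zipped_owners(slices, num_locales, start=1):
    """For each zippered iteration, return the owning locale of every operand."""
    if len({len(s) for s in slices}) != 1:
        raise ValueError("zippered slices must have equal length")
    return [
        tuple(cyclic_owner(i, start, num_locales) for i in idxs)
        for idxs in zip(*slices)
    ]


NUM_LOCALES = 4
# Models zip(A[1..100], B[2..101], C[3..102]).
slices = [range(1, 101), range(2, 102), range(3, 103)]
owners = zipped_owners(slices, NUM_LOCALES)

# Element-wise cost: every b or c that is not on a's locale is one remote read.
remote_reads = sum(
    1 for (a, b, c) in owners for src in (b, c) if src != a
)

# Idealized aggregated cost: one transfer per (array, destination, source).
bulk_transfers = {
    (name, a, src)
    for (a, b, c) in owners
    for name, src in (("B", b), ("C", c))
    if src != a
}

print(remote_reads, len(bulk_transfers))
```

Under these assumptions every iteration touches three distinct locales, and note that with only two locales a and c would share a locale, so the statement depends on the locale count.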

We have certainly not tackled the problem you describe above. However, 
thinking offhand, with our optimization the required slices of B and C (in your 
example) could be moved to temporary arrays on the locales owning the 
corresponding slice of A, and the computation could then be done locally. If you 
implement further optimizations such as overlapping communication and 
computation, or minimizing data movement or memory footprint, then our work is 
not directly applicable to the situation you describe.
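
The staging idea described above can be sketched as follows, in plain Python rather than Chapel. Arrays are modeled as dictionaries, locales as integer ids, and the "bulk transfer" is just a list copy, so this only shows the data flow and that the result matches the element-wise loop. The Cyclic owner formula, the 4-locale setup, and the names cyclic_owner and staged_sum are assumptions made for the illustration.

```python
def cyclic_owner(i, start, num_locales):
    """Assumed 1-D Cyclic layout: indices are dealt round-robin from `start`."""
    return (i - start) % num_locales


def staged_sum(B, C, a_range, b_range, c_range, num_locales, start=1):
    """Model A[a] = B[b] + C[c] with operands staged on the locale owning A[a]."""
    # Step 1: assign each iteration to the locale that owns the A element.
    work = {loc: [] for loc in range(num_locales)}
    for ia, ib, ic in zip(a_range, b_range, c_range):
        work[cyclic_owner(ia, start, num_locales)].append((ia, ib, ic))

    A = {}
    for loc, triples in work.items():
        # Step 2: stand-in for a bulk transfer into locale-local temporaries.
        tmp_b = [B[ib] for _, ib, _ in triples]
        tmp_c = [C[ic] for _, _, ic in triples]
        # Step 3: the computation now reads only the local temporaries.
        for (ia, _, _), b, c in zip(triples, tmp_b, tmp_c):
            A[ia] = b + c
    return A


B = {i: 10 * i for i in range(1, 103)}
C = {i: i for i in range(1, 103)}
A = staged_sum(B, C, range(1, 101), range(2, 102), range(3, 103), num_locales=4)

print(A[1], A[100])
```

The temporaries cost extra memory on each locale, which is one of the trade-offs the paragraph above mentions.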
Regards,
Rafa.
Rafael Asenjo Plaza
Dept. Arquitectura de Computadores
Complejo Tecnologico Campus de Teatinos
E-29071 MALAGA (SPAIN)
Tel: +34 95 213 27 91
Fax: +34 95 213 27 90
http://www.ac.uma.es/~asenjo


   
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
