Hi Aroon and all,

On 19/12/2014, at 03:57, Aroon Sharma <[email protected]> wrote:

> With respect to the work that was done by the University of Malaga, our work 
> applies bulk transfers to more generic zippered loops that zip Cyclic and 
> Block Cyclic array slices. From what I remember, their work was restricted to 
> whole array assignments between Block and Cyclic arrays (i.e A = B where B is 
> Block and A is Cyclic). Since whole array assignment and zippered iteration 
> are fundamentally related, I think there is a lot of overlap between both 
> works. In fact, our implementation uses a strided communication primitive 
> that they developed. 

As stated in this paper:

http://www.ac.uma.es/~compilacion/publicaciones/UMA-DAC-12-02.pdf

array assignments do not necessarily need to assign whole arrays to benefit 
from the bulk transfer optimization. More precisely, we aggregate data for 
assignments of the form: 

A[Da] = B[Db], where

        - A is a Block or Cyclic array,
        - B is a Block or Cyclic array,
        - Da is of the form
          {xa1..ya1 by za1, xa2..ya2 by za2, …, xan..yan by zan}, and
        - Db is of the form
          {xb1..yb1 by zb1, xb2..yb2 by zb2, …, xbn..ybn by zbn}.

That way, this optimization covers block-to-block, cyclic-to-cyclic, 
block-to-cyclic and cyclic-to-block assignments. It is not enabled by 
default, so -s useBulkTransferStride has to be specified to turn it on.
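
As a minimal sketch of an assignment of this form (not taken from the paper): 
the names n, Space, BlockSpace and CyclicSpace, the file name in the comment 
and the problem size are all hypothetical. The distribution syntax is the one 
used by Chapel releases contemporary with this thread.

```chapel
// Hypothetical build line, using the setting described in this message:
//   chpl -s useBulkTransferStride slices.chpl
use BlockDist, CyclicDist;

config const n = 16;   // assumed problem size, must be even here

const Space = {1..n};
// Syntax of the time; newer releases spell these
// `new blockDist(...)` and `new cyclicDist(...)`.
const BlockSpace  = Space dmapped Block(boundingBox=Space);
const CyclicSpace = Space dmapped Cyclic(startIdx=Space.low);

var A: [CyclicSpace] int;   // A is Cyclic
var B: [BlockSpace] int;    // B is Block

forall i in BlockSpace do B[i] = i;

// Block-to-cyclic assignment between strided slices:
// Da = {1..n-1 by 2}, Db = {2..n by 2}, both with n/2 elements.
A[1..n-1 by 2] = B[2..n by 2];

// Each odd index of A now holds the following even value from B.
for i in 1..n/2 do assert(A[2*i-1] == 2*i);
```

The assignment gives the same result with or without the setting. What may 
change is how the elements are communicated between locales.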

> Our work, for example, can aggregate something like:
> 
> forall (a, b, c) in zip(A[1..100], B[2..101], C[3..102]) {
>         a = b + c;
> }
> 
> where A, B, and C are all Cyclic. Because different array slices are 
> referenced in the zippering, a, b, and c will be from different locales on 
> all iterations of the loop. I don't believe that the work by the University 
> of Malaga could be applied to situations like this.


We have certainly not tackled the problem you describe above. However, 
thinking offhand, with our optimization the required slices of B and C (in 
your example) could be moved to temporary arrays on the locales owning the 
corresponding slice of A, after which the computation would be local. If you 
implement further optimizations such as overlapping communication and 
computation, or minimizing data movement or memory footprint, then our work 
is not directly applicable to the situation you describe.
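
A rough sketch of that offhand idea follows. It is not something either work 
implements. The temporaries TB and TC, their domain DT and the initial values 
of B and C are hypothetical, and the distribution syntax is again that of the 
time.

```chapel
use CyclicDist;

config const n = 100;

// A, B and C are all Cyclic, as in the quoted example.
const D = {1..n+2} dmapped Cyclic(startIdx=1);
var A, B, C: [D] real;

forall i in D {
  B[i] = i;
  C[i] = 2*i;
}

// Temporaries indexed 1..n with the same Cyclic layout, so element i
// of TB and TC lives on the locale that owns A[i].
const DT = {1..n} dmapped Cyclic(startIdx=1);
var TB, TC: [DT] real;

// Cyclic-to-cyclic slice assignments: the only steps that communicate.
TB = B[2..n+1];
TC = C[3..n+2];

// Every iteration now reads and writes data owned by one locale.
forall (a, b, c) in zip(A[1..n], TB, TC) do
  a = b + c;

// A[i] = B[i+1] + C[i+2] = (i+1) + 2*(i+2) = 3*i + 5
for i in 1..n do assert(A[i] == 3*i + 5);
```

The cost is two extra arrays of n elements, which is the memory footprint 
trade-off mentioned above.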

Regards,

Rafa.

__
Rafael Asenjo Plaza
Dept. Arquitectura de Computadores      
Complejo Tecnologico Campus de Teatinos
E-29071 MALAGA (SPAIN)
Tel: +34 95 213 27 91
Fax: +34 95 213 27 90        
http://www.ac.uma.es/~asenjo

_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
