Re: One Task per Locale

Vassily Litvinov Sat, 29 Nov 2014 19:46:43 -0800

Michael,

My understanding of your code is, assuming just 3 locales:

* Let  a, b, c  be the portions of TupelBlockArray
allocated on Locales(0), Locales(1), Locales(2), respectively.

* Then perform the following as parallel as possible:

 on a.locale:  compute(a,a); compute(a,b); compute(a,c);
 on b.locale:  compute(b,a); compute(b,b); compute(b,c);
 on c.locale:  compute(c,a); compute(c,b); compute(c,c);

Is this summary adequate?

The way you wrote your code, the remote accesses, such as the
accesses to b and c on a.locale, are performed within the timed
portion of your code.

Would it make sense to move those remote accesses out from the timed code?

If not, I would suggest overlapping them with computation.
For example, in pseudo-code:

 on a.locale:
   cobegin { compute(a,a); fetch(b); }
   cobegin { compute(a,b); fetch(c); }
   compute(a,c);

analogously for b.locale and c.locale. Ideally Chapel would perform
this optimization for you; currently it doesn't.
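
The overlap in the pseudo-code above can be sketched in runnable form.
This is a minimal sketch, not code from the program under discussion.
It assumes a recent Chapel (2.x), where the Block distribution is
spelled blockDist. The array A, the body of compute() and the
two-buffer scheme (cur/next) are hypothetical stand-ins. On each
locale, one cobegin task computes on the block already held locally
while the other fetches the next remote block:

```chapel
use BlockDist;

config const n = 12;

// Block-distributed data; a stand-in for TupelBlockArray.
const D = blockDist.createDomain({0..#n});
var A: [D] real;
forall i in D do A[i] = i;

// Hypothetical stand-in for the real pairwise work.
proc compute(x: [] real, y: [] real): real {
  var s = 0.0;
  for xi in x do
    for yj in y do
      s += xi * yj;
  return s;
}

var totals: [0..#numLocales] real;

coforall loc in Locales with (ref totals) do on loc {
  // Local copy of this locale's own block.
  const myDom = A.localSubdomain();
  const mine: [myDom] real = A[myDom];

  // Two buffers: 'cur' is computed on while 'next' is being fetched.
  var curDom = myDom, nextDom = myDom;
  var cur: [curDom] real = mine;
  var next: [nextDom] real;
  var total = 0.0;

  for step in 1..numLocales {
    const more = step < numLocales;
    const nextLoc = Locales[(loc.id + step) % numLocales];
    cobegin with (ref total, ref nextDom, ref next) {
      // Task 1: compute on the block that is already local.
      total += compute(mine, cur);
      // Task 2: fetch the next block with a whole-slice assignment.
      if more {
        nextDom = A.localSubdomain(nextLoc);
        next = A[nextDom];
      }
    }
    // Make the freshly fetched block the current one.
    if more { curDom = nextDom; cur = next; }
  }
  totals[loc.id] = total;
}

writeln(totals);
```

The extra local copy from next into cur keeps the sketch short; a real
implementation would more likely alternate between the two buffers.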

One tool that might help is our Replicated distribution.
For documentation and examples please consult these files
in the Chapel distribution:

  modules/dists/ReplicatedDist.chpl
  modules/standard/UtilReplicatedVar.chpl

Also - assignments between whole arrays or array slices, e.g.
  MyFirstArray = MySecondArray;
  MyFirstArray(some range) = MySecondArray(another range);
  MyFirstArray(some range) = MySecondArray;
  etc.
will work faster in many cases than semantically equivalent 'for' or
'forall' loops, due to the "bulk transfer" optimization.
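
As an illustration of the assignment forms listed above, here is a
small sketch under the same assumptions (Chapel 2.x, blockDist). The
array names and the size n are hypothetical. It copies a
Block-distributed array into local buffers once with an element-wise
loop and once with whole-array and slice assignments, then checks that
the results agree. Whether the bulk transfer actually applies depends
on the Chapel version and communication layer, so treat any speedup as
something to measure:

```chapel
use BlockDist;

config const n = 1000;

// Block-distributed source array (hypothetical example data).
const D = blockDist.createDomain({0..#n});
var Src: [D] real;
forall i in D do Src[i] = i;

on Locales[numLocales-1] {
  // Non-distributed buffers that live on this locale.
  var viaLoop:   [0..#n] real;
  var viaAssign: [0..#n] real;
  var viaSlices: [0..#n] real;

  // Element-wise copy: remote elements may be fetched one at a time.
  for i in 0..#n do viaLoop[i] = Src[i];

  // Whole-array assignment: a candidate for bulk transfer.
  viaAssign = Src;

  // Slice-to-slice assignments: also candidates for bulk transfer.
  const half = n / 2;
  viaSlices[0..#half] = Src[0..#half];
  viaSlices[half..n-1] = Src[half..n-1];

  // All three copies hold the same values.
  assert(&& reduce (viaLoop == viaAssign));
  assert(&& reduce (viaLoop == viaSlices));
  writeln("copies agree");
}
```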

As to why your optimized version takes longer:

(a) Both "unoptimized" and "optimized" portions perform the same amount
of remote accesses. Indeed, if I understand correctly, both portions
fetch each remote element of TupelBlockArray once.

(b) The "optimized" portion performs an inner 'forall'. Since the outer
'forall' exhausts the available parallelism, the inner 'forall' does
not add any parallelism. It does, unfortunately, add the overhead of
trying to parallelize. I am not sure this overhead is significant,
though, so other factors may be in play, e.g. cache effects(?).

Aside: I noticed that you use "on Locales(bla_i.locale.id)".
You can simplify that:

* The following three forms are equivalent:
 on Locales(bla_i.locale.id)
 on bla_i.locale
 on bla_i

  The third form is the preferred style, and the first form
  might not work with hierarchical locales.

* When you run a 'forall' over an array that is distributed with most
distributions (all? at least Block and Replicated), each iteration
is performed on the locale where the corresponding array element resides.
This makes the above 'on' clause unnecessary in your code, although
it adds only a small performance overhead when present.
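
Both points in this aside can be checked with a short sketch, again
assuming Chapel 2.x with blockDist; the array A and the size n are
hypothetical. The 'forall' has no 'on' clause, yet each element ends up
holding the id of the locale that owns it, and the three 'on' forms all
select that same locale:

```chapel
use BlockDist;

config const n = 8;

const D = blockDist.createDomain({0..#n});
var A: [D] int;

// No 'on' clause: the forall over the Block-distributed array already
// runs each iteration on the locale that owns the element.
forall a in A do
  a = here.id;

for i in D {
  // Each element recorded the id of the locale that owns it.
  assert(A[i] == A[i].locale.id);

  // The three 'on' forms all select the same locale.
  on Locales(A[i].locale.id) do assert(here.id == A[i]);
  on A[i].locale             do assert(here.id == A[i]);
  on A[i]                    do assert(here.id == A[i]);
}

writeln(A);
```

Running it with several locales (e.g. -nl 4) makes the per-element
locale ids visible in the printed array.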

Vassily

On Thu, Nov 27, 2014 at 03:42:45PM +0100, Michael Dietrich wrote:
> Hi Vassily,
>
> thank you for your answer.
>
> Okay, so I tried to write a program that applies remote access.
> This program is similar to a project I'm doing at the moment. It's
> intended to do a lot of accesses.
> It is clear that it runs slower due to this if I don't optimize. So I 
> thought about a temporary array that holds the needed values from every 
> Locale. This array should be created once per locale, so every of them can 
> get the values locally after they are got remotely only once. Unfortunately 
> I'm having some problems to implement this.
>
> Could you have a view on my code [1] and give some suggestions?
> The program iterates over every (distributed) array element where it does 
> the same iteration in an inner loop. Within this there are some 
> calculations which need the array values.
> The algorithm is done locally, then distributed and then distributed with 
> my suggested optimization (which is actually no good idea).
> The console output includes the time measures and calculation results. The 
> results are not meant to make any sense, they just show consistency.
>
> One example:
> ./test -nl 16 --N=1000
> Time on one Locale: 0.05074 seconds
> Time on 16 Locales: 3.38145 seconds
> Optimized time on 16 Locales: 3.74074 seconds
>
> bye
>
> [1] https://www-user.tu-chemnitz.de/~michd/distExample.chpl
>
> ...

_______________________________________________
Chapel-users mailing list
https://lists.sourceforge.net/lists/listinfo/chapel-users
