Quoting Vassily Litvinov <[email protected]>:
> Michael,
>
Hi Vassily,
sorry for my late answer. I'm currently having some problems with the
new Chapel release which I wasn't able to solve, so for now I'm staying
with 1.9.0.
> My understanding of your code is, assuming just 3 locales:
>
> * Let a, b, c be the portions of TupelBlockArray
> allocated on Locales(0), Locales(1), Locales(2), respectively.
>
> * Then perform the following as parallel as possible:
>
> on a.locale: compute(a,a); compute(a,b); compute(a,c);
> on b.locale: compute(b,a); compute(b,b); compute(b,c);
> on c.locale: compute(c,a); compute(c,b); compute(c,c);
>
> Is this summary adequate?
>
You are right.
All calculations except compute(x,x) require remote accesses. Their
number grows as O(N^2), so the distributed version easily becomes much
slower than the serial computation (already at values like N=1000,
which is much smaller than what I want to use).
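To make the access pattern concrete, here is a stripped-down sketch of
the kind of loop I am timing (N, Space, A and Result are placeholder
names, not the actual code around TupelBlockArray):

use BlockDist;

config const N = 1000;
const Space = {1..N};
const BlockSpace = Space dmapped Block(boundingBox=Space);
var A, Result: [BlockSpace] real;

// every iteration reads all N elements of A; of the roughly N^2 reads
// in total, most are remote once A is spread over several locales
forall i in BlockSpace {
  var acc = 0.0;
  for j in Space do
    acc += A(i) * A(j);  // A(j) is remote whenever j lives on another locale
  Result(i) = acc;
}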
>
> The way you wrote your code is that the remote accesses, such as
> accesses to b and c on a.locale, are performed within the timed portion
> of your code.
>
> Would it make sense to move those remote accesses out from the timed code?
>
Unfortunately it wouldn't, since I need to measure whether the
distributed execution is faster than the serial one in the first place.
> If not, I would suggest overlapping them with computation.
> For example, in pseudo-code:
>
> on a.locale:
> cobegin { compute(a,a); fetch(b); }
> cobegin { compute(a,b); fetch(c); }
> compute(a,c);
>
> analogously for b.locale and c.locale. Ideally Chapel would perform
> this optimization for you; currently it doesn't.
>
That's exactly the point.
For this I need to know the index ranges of the portions a, b and c
within the distributed array.
Otherwise the program knows neither how much work has to be done on
e.g. a.locale, nor how many and which elements have to be fetched.
That's where I'm stuck.
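If localSubdomain() is available for Block-distributed arrays in my
Chapel version, something like the following sketch might be what I
need (again with placeholder names, and not yet verified against
1.9.0):

use BlockDist;

config const N = 1000;
const Space = {1..N};
const BlockSpace = Space dmapped Block(boundingBox=Space);
var A: [BlockSpace] real;

coforall loc in Locales do on loc {
  // the index range of the portion of A that lives on this locale
  const myPart = A.localSubdomain();

  // a private, non-distributed copy of A on this locale; the remote
  // elements are pulled over exactly once, and the slice assignments
  // should be eligible for the bulk-transfer optimization
  var localCopy: [Space] real;
  localCopy(myPart) = A(myPart);
  if myPart.low > Space.low then
    localCopy(Space.low..myPart.low-1) = A(Space.low..myPart.low-1);
  if myPart.high < Space.high then
    localCopy(myPart.high+1..Space.high) = A(myPart.high+1..Space.high);

  // compute(a,a), compute(a,b), compute(a,c) can now read localCopy;
  // the two remote fetches above could also be overlapped with
  // compute(a,a) using cobegin, as you suggest
}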
>
> One tool that might help is our Replicated distribution.
> For documentation and examples please consult these files
> in the Chapel distribution:
>
> modules/dists/ReplicatedDist.chpl
> modules/standard/UtilReplicatedVar.chpl
>
> Also - assignments between whole arrays or array slices, e.g.
> MyFirstArray = MySecondArray;
> MyFirstArray(some range) = MySecondArray(another range);
> MyFirstArray(some range) = MySecondArray;
> etc.
> will work faster in many cases than semantically-equivalent for or
> forall loops, due to the "bulk transfer" optimization.
>
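Thanks for the hint about slice assignment; to make sure I understand
it, the difference would be, for example (Dist and Buf are made-up
names):

use BlockDist;

const D = {1..1000} dmapped Block(boundingBox={1..1000});
var Dist: [D] real;
var Buf: [1..100] real;

// element-wise: potentially one remote get per element
forall i in 1..100 do
  Buf(i) = Dist(i + 500);

// slice assignment: the runtime can fetch the whole block in one transfer
Buf(1..100) = Dist(501..600);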
I already tried to work with the Replicated distribution. Even without
doing anything with an array distributed this way, it takes a lot of
time to run and even to compile.
As far as I remember, the complete code was something like this:
use ReplicatedDist;
const Space = {1..25};
const RepSpace = Space dmapped ReplicatedDist();
var RepFeld: [RepSpace] int;
// zip the domain with the array so each element gets its own index
forall (i, bla) in zip(RepSpace, RepFeld) do
  bla = i;
write("ReplicatedDist: ");
writeln(RepFeld);
Or didn't I use it properly? I will have another look at it.
>
> As to why your optimized version takes longer time:
>
> (a) Both "unoptimized" and "optimized" portions perform the same amount
> of remote accesses. Indeed, if I understand correctly, both portions
> fetch each remote element of TupelBlockArray once.
>
Yes, this was just an attempt to start with something rather than
nothing. :)
> (b) The "optimized" portion performs an inner 'forall'. Since the outer
> 'forall' exhausts the available parallelism, the inner 'forall' does
> not add any parallelism. It does unfortunately add overhead for trying
> to parallelize. Although I am not sure this overhead is significant,
> so there may be other factors in play, e.g. cache effects(?).
>
You're right. I removed the inner forall loop.
>
> Aside: I noticed that you use "on Locales(bla_i.locale.id)".
> You can simplify that:
>
> * given that the following three forms are equivalent:
> on Locales(bla_i.locale.id)
> on bla_i.locale
> on bla_i
>
> where the third form is the preferred style
> and the first form might not work with hierarchical locales
>
> * When you run a 'forall' over an array that is distributed with most
> (or all? at least Block and Replicated) distributions, each iteration
> is performed on the locale where the corresponding array element resides.
> This makes the above 'on' clause unnecessary in your code, although
> it adds only small performance overhead when present.
>
That is also correct. As a beginner I will switch it to "on bla_i".
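Or, given your second point, I could drop the 'on' clause altogether;
a minimal sketch (element type simplified to real here, not the actual
tuple type of TupelBlockArray):

use BlockDist;

const D = {1..100} dmapped Block(boundingBox={1..100});
var TupelBlockArray: [D] real;

// no 'on' clause needed: each iteration already executes on the locale
// that owns the corresponding element
forall bla_i in TupelBlockArray do
  bla_i = here.id;  // e.g. record which locale touched each element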
> Vassily
>
Bye
Michael
>
> On Thu, Nov 27, 2014 at 03:42:45PM +0100, Michael Dietrich wrote:
>> Hi Vassily,
>>
>> thank you for your answer.
>>
>> Okay, so I tried to write a program that relies on remote access.
>> This program is similar to a project I'm doing at the moment; it's
>> intended to do a lot of such accesses.
>> It is clear that it runs slower because of this if I don't optimize. So I
>> thought about a temporary array that holds the needed values from every
>> locale. This array should be created once per locale, so each locale can
>> read the values locally after they have been fetched remotely only once.
>> Unfortunately I'm having some problems implementing this.
>>
>> Could you have a look at my code [1] and give some suggestions?
>> The program iterates over every (distributed) array element and, for each,
>> performs the same iteration again in an inner loop. Within this inner loop
>> there are some calculations which need the array values.
>> The algorithm is run locally, then distributed, and then distributed with
>> my suggested optimization (which is actually not a good idea).
>> The console output includes the time measurements and calculation results.
>> The results are not meant to make any sense; they just demonstrate consistency.
>>
>> One example:
>> ./test -nl 16 --N=1000
>> Time on one Locale: 0.05074 seconds
>> Time on 16 Locales: 3.38145 seconds
>> Optimized time on 16 Locales: 3.74074 seconds
>>
>> bye
>>
>> [1] https://www-user.tu-chemnitz.de/~michd/distExample.chpl
>>
>> ...