I will respond to Michael directly, to reduce the volume of this list. If anybody has follow-up questions, please post them or email me directly.
Ways to query which indices of a given distributed array are owned by a particular locale are described in the following README in the 1.10 Chapel distribution:

  $CHPL_HOME/doc/technotes/README.subquery

Vassily

On 12/10/14 13:32, Michael Dietrich wrote:
> Quoting Vassily Litvinov <[email protected]>:
>
>> Michael,
>
> Hi Vassily,
>
> sorry for my late answer. I'm currently having some problems with the
> new Chapel release that I wasn't able to solve, so for now I am
> staying with 1.9.0.
>
>> My understanding of your code is, assuming just 3 locales:
>>
>> * Let a, b, c be the portions of TupelBlockArray
>>   allocated on Locales(0), Locales(1), Locales(2), respectively.
>>
>> * Then perform the following as parallel as possible:
>>
>>     on a.locale: compute(a,a); compute(a,b); compute(a,c);
>>     on b.locale: compute(b,a); compute(b,b); compute(b,c);
>>     on c.locale: compute(c,a); compute(c,b); compute(c,c);
>>
>> Is this summary adequate?
>
> You are right.
> All of the calculations except compute(x,x) require remote accesses.
> Their number grows as O(N^2), so the program easily becomes much
> slower than the serial computation (already at values like N=1000,
> which is much smaller than what I want to use).
>
>> The way you wrote your code, the remote accesses, such as the
>> accesses to b and c on a.locale, are performed within the timed
>> portion of your code.
>>
>> Would it make sense to move those remote accesses out of the timed
>> code?
>
> Unfortunately it wouldn't, since I need to measure whether the
> distributed execution as a whole is faster than the serial one.
>
>> If not, I would suggest overlapping them with computation.
>> For example, in pseudo-code:
>>
>>   on a.locale:
>>     cobegin { compute(a,a); fetch(b); }
>>     cobegin { compute(a,b); fetch(c); }
>>     compute(a,c);
>>
>> analogously for b.locale and c.locale. Ideally Chapel would perform
>> this optimization for you; currently it doesn't.
>
> That's the point.
> For this I need to know the ranges of the portions a, b and c within
> the distributed array. Otherwise the program knows neither how much
> work has to be done on, e.g., a.locale, nor how many and which
> elements have to be fetched. That's where I'm stuck.
>
>> One tool that might help is our Replicated distribution.
>> For documentation and examples please consult these files
>> in the Chapel distribution:
>>
>>   modules/dists/ReplicatedDist.chpl
>>   modules/standard/UtilReplicatedVar.chpl
>>
>> Also, assignments between whole arrays or array slices, e.g.
>>
>>   MyFirstArray = MySecondArray;
>>   MyFirstArray(some range) = MySecondArray(another range);
>>   MyFirstArray(some range) = MySecondArray;
>>
>> etc., will in many cases work faster than semantically equivalent
>> for or forall loops, due to the "bulk transfer" optimization.
>
> I already tried to work with this distribution. Even without doing
> anything to an array distributed this way, it takes a lot of time to
> run and even to compile.
> As far as I remember, the complete code was something like this:
>
>   use ReplicatedDist;
>
>   const Space = {1..25};
>   const RepSpace = Space dmapped ReplicatedDist();
>
>   var RepFeld: [RepSpace] int;
>
>   forall bla in RepFeld do
>     bla = bla.index();
>
>   write("ReplicatedDist: ");
>   writeln(RepFeld);
>
> Or didn't I use it properly? I will have one more look at it.
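[A note on that snippet: it does not compile because an array element of type int has no index() method. Below is a minimal corrected sketch, assuming the 1.10-era ReplicatedDist interface the snippet already uses; it fills each locale's replica explicitly, since an access to a replicated array on a given locale refers to that locale's own copy.]

  use ReplicatedDist;

  const Space = {1..25};
  const RepSpace = Space dmapped ReplicatedDist();

  var RepFeld: [RepSpace] int;

  // Fill every locale's replica; on each locale, RepFeld(i)
  // refers to that locale's local copy.
  coforall loc in Locales do
    on loc do
      for i in Space do
        RepFeld(i) = i;

  write("ReplicatedDist: ");
  writeln(RepFeld);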
>> As to why your optimized version takes a longer time:
>>
>> (a) Both the "unoptimized" and the "optimized" portions perform the
>> same number of remote accesses. Indeed, if I understand correctly,
>> both portions fetch each remote element of TupelBlockArray once.
>
> Yes, this was only an attempt to start with more than no idea. :)
>
>> (b) The "optimized" portion performs an inner 'forall'. Since the
>> outer 'forall' exhausts the available parallelism, the inner 'forall'
>> does not add any parallelism. It does, unfortunately, add overhead
>> for trying to parallelize. I am not sure this overhead is
>> significant, though, so there may be other factors in play, e.g.
>> cache effects(?).
>
> You're right. I removed the inner forall loop.
>
>> Aside: I noticed that you use "on Locales(bla_i.locale.id)".
>> You can simplify that:
>>
>> * The following three forms are equivalent:
>>
>>     on Locales(bla_i.locale.id)
>>     on bla_i.locale
>>     on bla_i
>>
>>   The third form is the preferred style, and the first form might
>>   not work with hierarchical locales.
>>
>> * When you run a 'forall' over an array that is distributed with most
>>   (or all? at least Block and Replicated) distributions, each
>>   iteration is performed on the locale where the corresponding array
>>   element resides. This makes the above 'on' clause unnecessary in
>>   your code, although it adds only a small performance overhead when
>>   present.
>
> That is also correct. As a beginner I will switch it to "on bla_i".
>
>> Vassily
>
> Bye
> Michael
>
>> On Thu, Nov 27, 2014 at 03:42:45PM +0100, Michael Dietrich wrote:
>>> Hi Vassily,
>>>
>>> thank you for your answer.
>>>
>>> Okay, so I tried to write a program that applies remote access.
>>> This program is similar to a project I'm doing at the moment; it's
>>> intended to do a lot of accesses.
>>> It is clear that it runs slower because of this if I don't optimize.
>>> So I thought about a temporary array that holds the needed values
>>> from every locale. This array should be created once per locale, so
>>> each of them can get the values locally after they have been fetched
>>> remotely only once. Unfortunately I'm having some problems
>>> implementing this.
>>>
>>> Could you have a look at my code [1] and give some suggestions?
>>> The program iterates over every (distributed) array element and does
>>> the same iteration in an inner loop. Within this there are some
>>> calculations which need the array values.
>>> The algorithm is run locally, then distributed, and then distributed
>>> with my suggested optimization (which is actually not a good idea).
>>> The console output includes the time measurements and calculation
>>> results. The results are not meant to make any sense; they just show
>>> consistency.
>>>
>>> One example:
>>>   ./test -nl 16 --N=1000
>>>   Time on one Locale: 0.05074 seconds
>>>   Time on 16 Locales: 3.38145 seconds
>>>   Optimized time on 16 Locales: 3.74074 seconds
>>>
>>> bye
>>>
>>> [1] https://www-user.tu-chemnitz.de/~michd/distExample.chpl
>>>
>>> ...
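[To make the "ranges of the portions" question above concrete: the README.subquery document mentioned at the top describes a query for the indices of a distributed array that a given locale owns. Below is a minimal sketch of the resulting pattern, under these assumptions: localSubdomain() is the query from that README, compute() is a hypothetical stand-in for the real per-portion work, and the slice assignments rely on the bulk-transfer optimization mentioned in the thread.]

  use BlockDist;

  config const n = 1000;

  const D = {1..n} dmapped Block({1..n});
  var A: [D] real;

  // Hypothetical stand-in for the real per-portion computation:
  proc compute(mine: [] real, theirs: [] real) { }

  coforall loc in Locales do
    on loc {
      // The indices of A owned by this locale (see README.subquery):
      const myInds = A.localSubdomain();

      // A private copy of this locale's own portion; purely local:
      const myPortion: [myInds] real = A[myInds];

      for other in Locales {
        if other == here {
          compute(myPortion, myPortion);
        } else {
          // Ask 'other' which indices it owns ...
          var otherInds: domain(1);
          on other do otherInds = A.localSubdomain();

          // ... and fetch that portion with one bulk slice assignment:
          const theirPortion: [otherInds] real = A[otherInds];

          compute(myPortion, theirPortion);
        }
      }
    }

Once the portions are known this way, overlapping each fetch with the previous compute() via cobegin, as in the pseudo-code earlier in the thread, is the natural next refinement.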

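[Finally, a tiny sketch of the locality point in the aside above: a forall over a Block-distributed array already runs each iteration on the locale owning the corresponding element, so no 'on' clause is needed; here.id names the locale executing the current iteration.]

  use BlockDist;

  const D = {1..8} dmapped Block({1..8});
  var A: [D] int;

  // Each iteration executes on the locale that owns its element:
  forall a in A do
    a = here.id;

  writeln(A);  // with 2 locales, prints: 0 0 0 0 1 1 1 1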