I will respond to Michael directly, to reduce the volume of this list. If anybody has follow-up questions, please post them or email me directly.
Ways to query which indices of a given distributed array are owned by a particular locale are described in the following README in the 1.10 Chapel distribution:

  $CHPL_HOME/doc/technotes/README.subquery

Vassily

On 12/10/14 13:32, Michael Dietrich wrote:
> Quoting Vassily Litvinov <[email protected]>:
>
>> Michael,
>
> Hi Vassily,
>
> sorry for my late answer. I'm currently having some problems with the
> new Chapel release that I wasn't able to solve, so for now I am
> staying with 1.9.0.
>
>> My understanding of your code is, assuming just 3 locales:
>>
>> * Let a, b, c be the portions of TupelBlockArray
>>   allocated on Locales(0), Locales(1), Locales(2), respectively.
>>
>> * Then perform the following as parallel as possible:
>>
>>     on a.locale: compute(a,a); compute(a,b); compute(a,c);
>>     on b.locale: compute(b,a); compute(b,b); compute(b,c);
>>     on c.locale: compute(c,a); compute(c,b); compute(c,c);
>>
>> Is this summary adequate?
>
> You are right.
> All of the calculations except compute(x,x) require remote accesses.
> Their number grows as O(N^2), so the program easily becomes much
> slower than the serial computation (already at values like N=1000,
> which is much smaller than what I want to use).
>
>> The way you wrote your code, the remote accesses, such as the
>> accesses to b and c on a.locale, are performed within the timed
>> portion of your code.
>>
>> Would it make sense to move those remote accesses out of the timed
>> code?
>
> Unfortunately it wouldn't, since I need to measure whether the
> distributed execution as a whole is faster than the serial one.
>
>> If not, I would suggest overlapping them with computation.
>> For example, in pseudo-code:
>>
>>   on a.locale:
>>     cobegin { compute(a,a); fetch(b); }
>>     cobegin { compute(a,b); fetch(c); }
>>     compute(a,c);
>>
>> analogously for b.locale and c.locale. Ideally Chapel would perform
>> this optimization for you; currently it doesn't.
>
> That's the point.
> For this I need to know the ranges of the portions a, b and c within
> the distributed array. Otherwise the program knows neither how much
> work has to be done on, e.g., a.locale, nor how many and which
> elements have to be fetched. That's where I'm stuck.
>
>> One tool that might help is our Replicated distribution.
>> For documentation and examples please consult these files
>> in the Chapel distribution:
>>
>>   modules/dists/ReplicatedDist.chpl
>>   modules/standard/UtilReplicatedVar.chpl
>>
>> Also, assignments between whole arrays or array slices, e.g.
>>
>>   MyFirstArray = MySecondArray;
>>   MyFirstArray(some range) = MySecondArray(another range);
>>   MyFirstArray(some range) = MySecondArray;
>>
>> etc., will in many cases work faster than semantically equivalent
>> for or forall loops, due to the "bulk transfer" optimization.
>
> I already tried to work with this distribution. Even without doing
> anything to an array distributed this way, it takes a lot of time to
> run and even to compile.
> As far as I remember, the complete code was something like this:
>
>   use ReplicatedDist;
>
>   const Space = {1..25};
>   const RepSpace = Space dmapped ReplicatedDist();
>
>   var RepFeld: [RepSpace] int;
>
>   forall bla in RepFeld do
>     bla = bla.index();
>
>   write("ReplicatedDist: ");
>   writeln(RepFeld);
>
> Or didn't I use it properly? I will have one more look at it.
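[A note on that snippet: it does not compile because an array element of type int has no index() method. Below is a minimal corrected sketch, assuming the 1.10-era ReplicatedDist interface the snippet already uses; it fills each locale's replica explicitly, since an access to a replicated array on a given locale refers to that locale's own copy.]

  use ReplicatedDist;

  const Space = {1..25};
  const RepSpace = Space dmapped ReplicatedDist();

  var RepFeld: [RepSpace] int;

  // Fill every locale's replica; on each locale, RepFeld(i)
  // refers to that locale's local copy.
  coforall loc in Locales do
    on loc do
      for i in Space do
        RepFeld(i) = i;

  write("ReplicatedDist: ");
  writeln(RepFeld);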
>> As to why your optimized version takes a longer time:
>>
>> (a) Both the "unoptimized" and the "optimized" portions perform the
>> same number of remote accesses. Indeed, if I understand correctly,
>> both portions fetch each remote element of TupelBlockArray once.
>
> Yes, this was only an attempt to start with more than no idea. :)
>
>> (b) The "optimized" portion performs an inner 'forall'. Since the
>> outer 'forall' exhausts the available parallelism, the inner 'forall'
>> does not add any parallelism. It does, unfortunately, add overhead
>> for trying to parallelize. I am not sure this overhead is
>> significant, though, so there may be other factors in play, e.g.
>> cache effects(?).
>
> You're right. I removed the inner forall loop.
>
>> Aside: I noticed that you use "on Locales(bla_i.locale.id)".
>> You can simplify that:
>>
>> * The following three forms are equivalent:
>>
>>     on Locales(bla_i.locale.id)
>>     on bla_i.locale
>>     on bla_i
>>
>>   The third form is the preferred style, and the first form might
>>   not work with hierarchical locales.
>>
>> * When you run a 'forall' over an array that is distributed with most
>>   (or all? at least Block and Replicated) distributions, each
>>   iteration is performed on the locale where the corresponding array
>>   element resides. This makes the above 'on' clause unnecessary in
>>   your code, although it adds only a small performance overhead when
>>   present.
>
> That is also correct. As a beginner I will switch it to "on bla_i".
>
>> Vassily
>
> Bye
> Michael
>
>> On Thu, Nov 27, 2014 at 03:42:45PM +0100, Michael Dietrich wrote:
>>> Hi Vassily,
>>>
>>> thank you for your answer.
>>>
>>> Okay, so I tried to write a program that applies remote access.
>>> This program is similar to a project I'm doing at the moment; it's
>>> intended to do a lot of accesses.
>>> It is clear that it runs slower because of this if I don't optimize.
>>> So I thought about a temporary array that holds the needed values
>>> from every locale. This array should be created once per locale, so
>>> each of them can get the values locally after they have been fetched
>>> remotely only once. Unfortunately I'm having some problems
>>> implementing this.
>>>
>>> Could you have a look at my code [1] and give some suggestions?
>>> The program iterates over every (distributed) array element and does
>>> the same iteration in an inner loop. Within this there are some
>>> calculations which need the array values.
>>> The algorithm is run locally, then distributed, and then distributed
>>> with my suggested optimization (which is actually not a good idea).
>>> The console output includes the time measurements and calculation
>>> results. The results are not meant to make any sense; they just show
>>> consistency.
>>>
>>> One example:
>>>   ./test -nl 16 --N=1000
>>>   Time on one Locale: 0.05074 seconds
>>>   Time on 16 Locales: 3.38145 seconds
>>>   Optimized time on 16 Locales: 3.74074 seconds
>>>
>>> bye
>>>
>>> [1] https://www-user.tu-chemnitz.de/~michd/distExample.chpl
>>>
>>> ...
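[To make the "ranges of the portions" question above concrete: the README.subquery document mentioned at the top describes a query for the indices of a distributed array that a given locale owns. Below is a minimal sketch of the resulting pattern, under these assumptions: localSubdomain() is the query from that README, compute() is a hypothetical stand-in for the real per-portion work, and the slice assignments rely on the bulk-transfer optimization mentioned in the thread.]

  use BlockDist;

  config const n = 1000;

  const D = {1..n} dmapped Block({1..n});
  var A: [D] real;

  // Hypothetical stand-in for the real per-portion computation:
  proc compute(mine: [] real, theirs: [] real) { }

  coforall loc in Locales do
    on loc {
      // The indices of A owned by this locale (see README.subquery):
      const myInds = A.localSubdomain();

      // A private copy of this locale's own portion; purely local:
      const myPortion: [myInds] real = A[myInds];

      for other in Locales {
        if other == here {
          compute(myPortion, myPortion);
        } else {
          // Ask 'other' which indices it owns ...
          var otherInds: domain(1);
          on other do otherInds = A.localSubdomain();

          // ... and fetch that portion with one bulk slice assignment:
          const theirPortion: [otherInds] real = A[otherInds];

          compute(myPortion, theirPortion);
        }
      }
    }

Once the portions are known this way, overlapping each fetch with the previous compute() via cobegin, as in the pseudo-code earlier in the thread, is the natural next refinement.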

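[Finally, a tiny sketch of the locality point in the aside above: a forall over a Block-distributed array already runs each iteration on the locale owning the corresponding element, so no 'on' clause is needed; here.id names the locale executing the current iteration.]

  use BlockDist;

  const D = {1..8} dmapped Block({1..8});
  var A: [D] int;

  // Each iteration executes on the locale that owns its element:
  forall a in A do
    a = here.id;

  writeln(A);  // with 2 locales, prints: 0 0 0 0 1 1 1 1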