On Fri, 6 Dec 2013, Peter Kjellström wrote:

On Thursday 05 December 2013 17:17:54 Brad Chamberlain wrote:
Hi Peter --

Attempting to fix this, I went looking for an example forall with proper
synchronization. The first one I found was:
 http://chapel.cray.com/tutorials.html ->
 http://faculty.knox.edu/dbunde/teaching/chapel/ ->
...
Which by the looks of it contains the same mistake I did...

If you read the text just below this example, it points out that there is a
race condition, with a forward reference for how to fix it (though the fix
is not quite as elegant as the one you used below -- I'll check with them
to see whether that was intentional).
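
For readers following along, the racy pattern in question looks roughly like this sketch (my reconstruction, not the tutorial's exact code):

  var sum: real = 0.0;
  // RACE: multiple tasks may read-modify-write sum concurrently,
  // so updates can be lost and the result is nondeterministic
  forall i in 1..N do sum += 1.0/(i*i);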

Doh, now I feel really silly..

On the other hand, maybe making the sole forall example an incorrect "do not
do this" example, without a warning before it or in the code, wasn't such a
great idea.

Reading the Forall part of the otherwise great "Productive Programming in
Chapel[...]
didn't yield any clues or
references to atomic or sync variables.

atomic and sync variables are considered part of the task parallel subset of
the language (though they can also be used in any other parallel or serial
part of the language), so are discussed in that subset of the tutorial
slides.

I see; I haven't yet looked into the task parallel parts. Maybe a
reference/link from the data parallel part would be a good idea?

I don't have any ownership over that tutorial, but the author has been following this thread, so may take up these suggestions.


...
 var sum$ : sync real = 0.0;
 forall i in 1..N by -1 do sum$ += 1.0/(i*i);

This is a reasonable approach and has been the traditional way to do this in
Chapel; atomic types are a newer addition to the language and are the more
efficient way to express this.
...
You should be able to write this as:

var sum: atomic real = 0.0;
forall i in 1..N by -1 do sum.add(1.0/(i*i));

Almost; direct assignment is not allowed, but this works:

var sum : atomic real;
sum.write(0.0); // not needed? atomic vars seem to be 0-initialized...
forall i in 1..N do sum.add(1.0/(i*i));
writeln(sqrt(6*sum.read()));

My apologies, I forgot that we can't initialize atomics directly (I thought we'd made a special case for that), and obviously didn't test my code. But you're right that it's not necessary anyway -- all variables in Chapel are default-initialized.


AFAICT there's no documentation of the atomic type methods in the language spec
(but I did find the one page in the task parallel section you pointed at
earlier).

You're right, sorry not to mention it -- atomics haven't made it into the spec yet, for no good reason that I can recall. They are currently documented in doc/technotes/README.atomics (or:

http://svn.code.sf.net/p/chapel/code/trunk/doc/release/technotes/README.atomics )
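
As a quick taste of what's covered there, here's an illustrative sketch of a few of the commonly used atomic methods (my own example, not taken verbatim from the technote):

  var a: atomic real;            // default-initialized to 0.0
  a.write(1.5);                  // atomically store a value
  a.add(2.5);                    // atomically add to it
  const prev = a.fetchAdd(1.0);  // add and return the previous value (4.0)
  writeln(a.read());             // atomically load the current value (5.0)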


For completeness, the above atomic example ran in 38s optimized (vs the 130s
for sync).

OK, thanks. And can you remind me: Do you have any other parallel implementation timings to compare against?

Returning to your original question about performance:

It makes sense that the reduction-based implementation would do better than the atomic- or sync-based implementations: in the reduction version, the number of synchronization events between the two tasks implementing the forall loop is O(1), whereas in the atomic and sync implementations it is O(N).
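
For concreteness, a reduction-based version of this computation could look something like the following sketch (assuming the same N as in your examples):

  const sum = + reduce [i in 1..N] 1.0/(i*i);
  writeln(sqrt(6*sum));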

It's not entirely surprising that the forall versions on one core are slower than the serial version since there is additional overhead required to create (or consider creating) the parallelism at runtime, implement the coordination, etc. Presumably, the longer-running/more computationally intensive the bodies of your loops were, the more you could amortize these overheads away.

There are potential optimization opportunities for the forall-loop-with-one-task case -- imagine, for example, runtime specializations for cases when only 1 task is used to implement the forall loop. But it's not obvious to me whether those would be worthwhile given the resulting code size inflation and the argument that forall loops are perhaps most often executed with multiple tasks (otherwise, why use a forall at all?).

It's also worth mentioning that most Chapel constructs still have overheads that we could work harder to squeeze out -- while the reduction worked well here (and will probably always be the best-performing choice for this code compared to the alternatives we've discussed), our reductions are known to have scalability bottlenecks that need to be addressed going forward (one of our priorities from a scalability perspective).

Finally, I'll mention that the one timing in your chart that surprises me the most is that the race-y forall loop is so much slower on two cores than on one. My only conjecture is that with the two tasks battling over the shared location, there's some sort of bad cache conflict behavior, but that's just a guess.

Hope this is useful. If you feel sufficiently motivated, it would be interesting to get these variations into the Chapel test system's performance testing framework and add them to the suite of performance tests that we run on a regular basis. Let me know if you're interested and I'll find some instructions for you. No pressure, though.

-Brad