On Fri, 6 Dec 2013, Peter Kjellström wrote:

On Thursday 05 December 2013 17:17:54 Brad Chamberlain wrote:
Hi Peter --

Attempting to fix this, I went looking for an example forall with proper
synchronization. The first one I found was:
 http://chapel.cray.com/tutorials.html ->
 http://faculty.knox.edu/dbunde/teaching/chapel/ ->
...
Which by the looks of it contains the same mistake I did...

If you read the text just below this example, it points out that there is a
race condition, with a forward reference for how to fix it (though the fix
is not quite as elegant as the one you used below -- I'll check with them
to see whether that was intentional).
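
For readers following along, the racy pattern in question looks roughly like this sketch (my reconstruction, not the tutorial's exact code):

  var sum: real = 0.0;
  // RACE: multiple tasks may read-modify-write sum concurrently,
  // so updates can be lost and the result is nondeterministic
  forall i in 1..N do sum += 1.0/(i*i);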

Doh, now I feel really silly..

On the other hand, maybe making the sole forall example an incorrect "do not
do this" example, without a warning before it or in the code, wasn't such a
great idea.

Reading the Forall part of the otherwise great "Productive Programming in
Chapel[...]
didn't yield any clues or
references to atomic or sync variables.

atomic and sync variables are considered part of the task parallel subset of
the language (though they can also be used in any other parallel or serial
part of the language), so are discussed in that subset of the tutorial
slides.

I see; I haven't yet looked into the task parallel parts. Maybe a
reference/link from the data parallel part would be a good idea?

I don't have any ownership over that tutorial, but the author has been following this thread, so may take up these suggestions.


...
 var sum$ : sync real = 0.0;
 forall i in 1..N by -1 do sum$ += 1.0/(i*i);

This is a reasonable approach and has been the traditional way to do this in
Chapel; atomic types are a newer addition to the language and are the more
efficient way to express this.
...
You should be able to write this as:

var sum: atomic real = 0.0;
forall i in 1..N by -1 do sum.add(1.0/(i*i));

Almost; direct assignment is not allowed, but this works:

var sum : atomic real;
sum.write(0.0); // not needed? atomic vars seem to be 0-initialized...
forall i in 1..N do sum.add(1.0/(i*i));
writeln(sqrt(6*sum.read()));

My apologies, I forgot that we can't initialize atomics directly (I thought we'd made a special case for that), and obviously didn't test my code. But you're right that it's not necessary anyway -- all variables in Chapel are default-initialized.


AFAICT there's no documentation of the atomic type methods in the language spec
(but I did find the one page in the task parallel section you pointed at
earlier).

You're right, sorry not to mention it -- atomics haven't made it into the spec yet, for no good reason that I can recall. They are currently documented in doc/technotes/README.atomics (or:

http://svn.code.sf.net/p/chapel/code/trunk/doc/release/technotes/README.atomics )
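
As a quick taste of what's covered there, here's an illustrative sketch of a few of the commonly used atomic methods (my own example, not taken verbatim from the technote):

  var a: atomic real;            // default-initialized to 0.0
  a.write(1.5);                  // atomically store a value
  a.add(2.5);                    // atomically add to it
  const prev = a.fetchAdd(1.0);  // add and return the previous value (4.0)
  writeln(a.read());             // atomically load the current value (5.0)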


For completeness, the above atomic example ran in 38s optimized (vs the 130s
for sync).

OK, thanks. And can you remind me: Do you have any other parallel implementation timings to compare against?

Returning to your original question about performance:

It makes sense that the reduction-based implementation would do better than the atomic- or sync-based implementations: in the reduction version, the number of synchronization events between the two tasks implementing the forall loop is O(1), whereas in the atomic and sync implementations it is O(N).
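
For concreteness, a reduction-based version of this computation could look something like the following sketch (assuming the same N as in your examples):

  const sum = + reduce [i in 1..N] 1.0/(i*i);
  writeln(sqrt(6*sum));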

It's not entirely surprising that the forall versions on one core are slower than the serial version since there is additional overhead required to create (or consider creating) the parallelism at runtime, implement the coordination, etc. Presumably, the longer-running/more computationally intensive the bodies of your loops were, the more you could amortize these overheads away.

There are potential optimization opportunities for the forall-loop-with-one-task case -- imagine, for example, runtime specializations for cases when only 1 task is used to implement the forall loop. But it's not obvious to me whether those would be worthwhile given the resulting code size inflation and the argument that forall loops are perhaps most often executed with multiple tasks (otherwise, why use a forall at all?).

It's also worth mentioning that most Chapel constructs still have overheads that we could work harder to squeeze out -- while the reduction worked well here (and will probably always be the best-performing choice for this code compared to the alternatives we've discussed), our reductions are known to have scalability bottlenecks that need to be addressed going forward (one of our priorities from a scalability perspective).

Finally, I'll mention that the one timing in your chart that surprises me the most is that the race-y forall loop is so much slower on two cores than on one. My only conjecture is that with the two tasks battling over the shared location, there's some sort of bad cache conflict behavior, but that's just a guess.

Hope this is useful. If you feel sufficiently motivated, it would be interesting to get these variations into the Chapel test system's performance testing framework and add them to the suite of performance tests that we run on a regular basis. Let me know if you're interested and I'll find some instructions for you. No pressure, though.

-Brad