On Fri, 6 Dec 2013, Peter Kjellström wrote:
> On Thursday 05 December 2013 17:17:54 Brad Chamberlain wrote:
>> Hi Peter --
>>
>>> Attempting to fix this I went looking for an example forall with
>>> proper synchronization. The first one I found was:
>>> http://chapel.cray.com/tutorials.html ->
>>> http://faculty.knox.edu/dbunde/teaching/chapel/ -> ...
>>> Which by the looks of it contains the same mistake I did...
>>
>> If you read the text just below this example, it points out that
>> there is a race condition, with a forward reference for how to fix it
>> (though the fix is not quite as elegant as the one you used below --
>> I'll check with them to see whether that was intentional).
>
> Doh, now I feel really silly.. On the other hand, maybe making the
> sole forall example an incorrect "do not do this" example, without a
> warning before or in the code, wasn't such a great idea.
>
>>> Reading the Forall part of the otherwise great "Productive
>>> Programming in Chapel [...]" didn't yield any clues or references to
>>> atomic or sync variables.
>>
>> atomic and sync variables are considered part of the task parallel
>> subset of the language (though they can also be used in any other
>> parallel or serial part of the language), so are discussed in that
>> subset of the tutorial slides.
>
> I see, I've not yet interested myself in the task parallel parts.
> Maybe a reference/link from the data parallel part would be a good
> idea?
I don't have any ownership over that tutorial, but the author has been following this thread, so may take up these suggestions.
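(For anyone else reading along, the race in question has roughly the
following shape -- a sketch, not the tutorial's exact code:

    var sum: real;          // default-initialized to 0.0
    forall i in 1..N do
      sum += 1.0/(i*i);     // RACE: unsynchronized read-modify-write

Multiple tasks can read and write 'sum' simultaneously here, so updates
can be lost.)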
>>> ...
>>> var sum$ : sync real = 0.0;
>>> forall i in 1..N by -1 do
>>>   sum$ += 1.0/(i*i);
>>
>> This is a reasonable approach and has been the traditional way to do
>> this in Chapel; atomic types are a newer addition to the language and
>> are the more optimal way to express this...  You should be able to
>> write this as:
>>
>>   var sum: atomic real = 0.0;
>>   forall i in 1..N by -1 do
>>     sum.add(1.0/(i*i));
>
> Almost, direct assignment is not allowed but this works:
>
>   var sum : atomic real;
>   sum.write(0.0);   // not needed? atomic vars seem to be 0-initialized...
>   forall i in 1..N do
>     sum.add(1.0/(i*i));
>   writeln(sqrt(6*sum.read()));
My apologies, I forgot that we can't initialize atomics directly (I thought we'd made a special case for that), and obviously didn't test my code. But you're right that it's not necessary anyway: all variables in Chapel are default-initialized.
> AFAICT there's no documentation of the atomic type methods in the
> language spec (but I did find the one page in the task parallel
> section you pointed at earlier).
You're right, sorry not to mention it -- atomics haven't made it into the spec yet for no good reason that I can recall. They are currently documented in doc/technotes/README.atomics (or:
http://svn.code.sf.net/p/chapel/code/trunk/doc/release/technotes/README.atomics ).
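Until they make it into the spec, here's a quick sketch of the most
commonly used methods (illustrative, not exhaustive; the variable names
are just for the example):

    var x: atomic int;                    // default-initialized to 0
    x.write(5);                           // atomic store
    const v = x.read();                   // atomic load
    x.add(2);                             // atomic x += 2
    const old = x.fetchAdd(3);            // add, returning the previous value
    const prev = x.exchange(42);          // swap in a new value, return the old
    const ok = x.compareExchange(42, 0);  // set to 0 iff currently 42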
> For completeness, the above atomic example ran in 38s optimized (vs
> the 130s for sync).
OK, thanks. And can you remind me: Do you have any other parallel implementation timings to compare against?
Returning to your original question about performance: It makes sense that the reduction-based implementation would do better than the atomic- or sync-based implementations because in that version, the number of synchronization events between the two tasks implementing the forall loop is O(1), whereas in the atomic and sync implementations it is O(N) -- one per loop iteration.
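For reference, the reduction-based version can be written this
compactly (a sketch, assuming the same N and series as your other
variants):

    const sum = + reduce [i in 1..N] 1.0/(i*i);
    writeln(sqrt(6*sum));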
It's not entirely surprising that the forall versions on one core are slower than the serial version since there is additional overhead required to create (or consider creating) the parallelism at runtime, implement the coordination, etc. Presumably, the longer-running/more computationally intensive the bodies of your loops were, the more you could amortize these overheads away.
There are potential optimization opportunities in the one-task forall case -- imagine, for example, runtime specializations for when only one task is used to implement a forall loop. But it's not obvious to me whether those would be worthwhile given the resulting code size inflation and the argument that forall loops are perhaps most often executed with multiple tasks (otherwise, why use a forall at all?).
It's also worth mentioning that most Chapel constructs still have overheads that we could work harder to squeeze out -- while the reduction worked well here (and will probably always be the best-performing choice for this code compared to the alternatives we've discussed), our reductions are known to have scalability bottlenecks that need to be addressed going forward (one of our priorities from a scalability perspective).
Finally, I'll mention that the one timing in your chart that surprises me the most is that the race-y forall loop is so much slower on two cores than on one. My only conjecture is that with the two tasks battling over the shared memory location, there's some sort of bad cache-conflict behavior (e.g., the cache line holding 'sum' ping-ponging between the two cores), but that's just a guess.
Hope this is useful. If you feel sufficiently motivated, it would be interesting to get these variations into the Chapel test system's performance testing framework and add them to the suite of performance tests that we run on a regular basis. Let me know if you're interested and I'll find some instructions for you. No pressure, though.
-Brad
