Hi Praveen --
I think the MPI code also has a lot of overhead since it has to transfer
data between processes, which the Chapel code does not have to do.
Yes, sorry, I didn't mean to imply that all the disadvantages were
Chapel's, but merely to point out some differences between the two codes
that would cause them not to be equivalent in their approaches.
I also have the same halo cells in MPI code as in the Chapel code. In
the MPI code, each process copies data from a global vector to a local
vector, then does the actual computations, which the Chapel code doesn't
do. Hence I expected the MPI code to do worse.
The trouble is that the StencilDist adds overheads to the shared-memory
case because it is written in a way that assumes it's going to be run in
distributed memory mode (example: When randomly accessing an array, it
does a check to see "is this element remote or local?" that would not be
necessary in a single-locale environment or an MPI program). With
additional effort, the StencilDist distribution could be optimized to
reduce or eliminate overhead in single-locale runs, but that isn't an
effort we've made since it's not a common case. Hence approaches like the
one I pointed to in lulesh, which optimize it out manually for the
single-locale case.
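For concreteness, here's a rough sketch of that pattern (simplified, not the
actual lulesh declarations, and the exact StencilDist spelling may vary
across Chapel versions): a compile-time 'param' flag chooses between a plain
local domain and a StencilDist-mapped one, so a single-locale build pays no
distribution overhead at all.

  use StencilDist;

  config param useStencilDist = false;  // set with: chpl -suseStencilDist=true
  config const n = 100;

  const Space = {1..n, 1..n};

  // With useStencilDist == false this folds at compile time to a plain,
  // non-distributed domain, so array accesses need no local/remote checks.
  const D = if useStencilDist
              then Space dmapped Stencil(boundingBox=Space, fluff=(1,1))
              else Space;

  var res: [D] real;

Compiling with '-suseStencilDist=true' then turns the distribution back on
for distributed-memory runs without touching the rest of the code.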
I was wondering if my Chapel code is not well written. E.g., there are
loops like this:

  forall (i,j) in Dx {
    // do some computation
    res[i-1,j] += flux * dy;
    res[i,j]   -= flux * dy;
  }
Do I have to worry about different threads writing into the same location
of the "res" variable?
Yes, you do (assuming that Dx contains adjacent elements in dimension 1).
Specifically, the use of the forall loop says that the distinct loop
iterations are safe to run in parallel with one another, but if one
iteration were doing the += line on a given element while an adjacent
iteration were doing the -= line for the same element, that could lead to
a race condition.
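One common way around this -- just a rough sketch, not your actual code, and
'faceFlux' is a hypothetical stand-in for whatever your flux computation does
-- is to restructure the loop so that each iteration writes only its own
cell, gathering the fluxes on the faces that bound cell (i,j) rather than
scattering into neighboring cells:

  config const n = 8;
  const Dres = {1..n, 1..n};
  const dy = 1.0;
  var res: [Dres] real;

  // hypothetical per-face flux; stands in for the real computation
  proc faceFlux(i: int, j: int): real {
    return (i + j): real;
  }

  forall (i,j) in Dres do
    // each iteration writes only res[i,j], so no two iterations collide
    res[i,j] = (faceFlux(i+1,j) - faceFlux(i,j)) * dy;

Alternatively, 'res' could be declared with 'atomic real' elements and
updated via .add(), though the atomic operations add overhead in the inner
loop.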
How can I check how much time is spent in different parts of the Chapel
code?
Take a look at this primer for a way to do it by inserting timers into the
code:
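Roughly, the pattern looks like this (a minimal sketch; the Time module in
2016-era Chapel provides 'Timer', which more recent releases rename to
'stopwatch'):

  use Time;

  var t: Timer;            // 'stopwatch' in newer Chapel versions
  t.start();
  // ... the phase you want to measure, e.g. the residual loop ...
  t.stop();
  writeln("residual loop: ", t.elapsed(), " seconds");
  t.clear();               // reset before timing the next phase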
Another option would be to use chplvis:
On 10-Oct-2016, at 10:49 PM, Brad Chamberlain <br...@cray.com> wrote:
Hi Praveen --
In addition to Jeff's good advice on timing the computation you care
about, I wanted to point out a difference between the MPI and Chapel
executions.
As you know, MPI is designed to be a distributed memory execution
model, so to take advantage of the four cores on your Mac, you use
mpirun -np 4.
Chapel supports both shared- and distributed-memory parallelism, so the
way you're running on this 4-core Mac is reasonable, yet different from
the MPI version. Specifically, we will create a single process that will use
multiple threads to implement your forall loops (typically 4). So
there will be no inter-process communication in the Chapel
implementation as there is in the MPI version, so comparing against an
OpenMP implementation would be a fairer comparison.
Related: The use of the 'StencilDist' domain map has no positive impact
for a shared-memory execution like this, and will likely add overhead.
It is designed for use on distributed-memory executions that do
stencil-based computations in order to enable caching of values owned
by neighboring processes. But when you've only got one process like
this, there's no remote data to cache. So for a shared-memory
execution like this, it'd be interesting to see how much faster the
code would be if the 'dmapped StencilDist' clause was commented out (in
practice, we often write codes that can be compiled with or without
distributed data using a 'param' conditional -- for example, see the
declarations of 'Elems' and 'Nodes' in the lulesh example).
Running on a distributed memory system using the 'StencilDist'
distribution against MPI (or better, vs. an MPI + OpenMP code) would
also be more of an apples-to-apples comparison, though I suspect you'll
see Chapel fall further behind in terms of performance at that point...