Hello Brad,

I will do better timing and also try larger problems.
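For the timing I am planning to wrap individual sections of the code like this (a rough sketch using the Timer from Chapel's Time module; please tell me if there is a better way, or a proper profiling tool I should use instead):

  use Time;

  var t: Timer;
  t.start();
  // ... the section I want to time, e.g. the residual computation ...
  t.stop();
  writeln("section took ", t.elapsed(), " seconds");
  t.clear();   // reset before timing the next section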
I think the MPI code also has a lot of overhead, since it has to transfer data between processes, which the Chapel code does not have to do. I also have the same halo cells in the MPI code as in the Chapel code. In the MPI code, each process copies data from a global vector to a local vector and then does the actual computation, which the Chapel code doesn't do. Hence I expected the MPI code to do worse, and I was wondering whether my Chapel code is simply not well written. For example, there are loops like this:

  forall (i,j) in Dx {
    // do some computation
    res[i-1,j] += flux * dy;
    res[i,j]   -= flux * dy;
  }

Do I have to worry about different threads writing into the same location of the "res" variable? (I have put a sketch of a possible workaround below, after the quoted message.) And how can I check how much time is spent in different parts of the Chapel code -- is wrapping each section in a Timer, as sketched above, the right approach?

Best
praveen

> On 10-Oct-2016, at 10:49 PM, Brad Chamberlain <br...@cray.com> wrote:
>
> Hi Praveen --
>
> In addition to Jeff's good advice on timing the computation you care about,
> I wanted to point out a difference between the MPI and the Chapel code:
>
> As you know, MPI is designed to be a distributed memory execution model, so
> to take advantage of the four cores on your Mac, you use mpirun -np 4.
>
> Chapel supports both shared- and distributed-memory parallelism, so the way
> you're running on this 4-core Mac is reasonable, yet different than the MPI.
> Specifically, we will create a single process that will use multiple threads
> to implement your forall loops (typically 4). So there will be no
> inter-process communication in the Chapel implementation as there is in the
> MPI version and comparing against an OpenMP implementation would be a more
> fair comparison.
>
> Related: The use of the 'StencilDist' domain map has no positive impact for a
> shared-memory execution like this, and will likely add overhead. It is
> designed for use on distributed-memory executions that do stencil-based
> computations in order to enable caching of values owned by neighboring
> processes. But when you've only got one process like this, there's no remote
> data to cache. So for a shared-memory execution like this, it'd be
> interesting to see how much faster the code would be if the 'dmapped
> StencilDist' clause was commented out (in practice, we often write codes that
> can be compiled with or without distributed data using a 'param' conditional
> -- for example, see the declarations of 'Elems' and 'Nodes' in
> examples/benchmarks/lulesh.chpl).
>
> Running on a distributed memory system using the 'StencilDist' distribution
> against MPI (or better, vs. an MPI + OpenMP code) would also be more of an
> apples-to-apples comparison, though I suspect you'll see Chapel fall further
> behind in terms of performance at that point...
>
> -Brad
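PS: Here is the sketch I mentioned above for the possible conflict on "res". If the += / -= updates from neighbouring iterations really can land on the same cell, I imagine declaring the array with atomic elements would make the updates safe, at some cost. This is only a sketch: n, D, Dx, dy and the constant flux are placeholders standing in for my actual declarations.

  config const n = 100;          // placeholder problem size
  const D  = {0..n, 1..n},       // placeholder domains; D holds res,
        Dx = {1..n, 1..n};       //   Dx is the domain the forall runs over
  const dy = 1.0 / n;

  var res: [D] atomic real;      // atomic elements, so concurrent updates don't race

  forall (i,j) in Dx {
    const flux = 1.0;            // stand-in for the actual flux computation
    res[i-1,j].add(flux * dy);   // iterations (i,j) and (i-1,j) may both
    res[i,j].sub(flux * dy);     //   touch res[i-1,j]; atomics make that safe
  }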
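PPS: Regarding the 'param' conditional you mention: if I understand the idea correctly, a declaration along the following lines would let the same code compile with or without the stencil distribution. This is just my guess at the idiom before looking at lulesh.chpl, and Space, n and the fluff width are placeholders for my actual set-up.

  use StencilDist;

  config param useStencilDist = false;   // set with: chpl -suseStencilDist=true
  config const n = 100;                  // placeholder problem size

  const Space = {1..n, 1..n};
  const D = if useStencilDist
              then Space dmapped Stencil(boundingBox=Space, fluff=(1,1))
              else Space;

  var u: [D] real;   // arrays over D pick up the distribution (or not) automatically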