Hello Brad

I will do better timing and also try larger problems.

I think the MPI code also has a lot of overhead, since it has to transfer data 
between processes, which the Chapel code does not have to do. I also have the 
same halo cells in the MPI code as in the Chapel code. In the MPI code, each 
process copies data from a global vector to a local vector and then does the 
actual computations, which the Chapel code doesn't do. Hence I expected the 
MPI code to do worse. 

I was wondering if my Chapel code is not well written. E.g., there are loops 
like this:

    forall (i,j) in Dx {
      // do some computation
      res[i-1,j] += flux * dy;
      res[i,j] -= flux * dy;
    }

Do I have to worry about different threads writing into the same location of 
the "res" variable?
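
If that is indeed a race, one fix I can think of is declaring "res" with 
atomic elements. A rough sketch, not my actual code; 'n', 'flux', and 'dy' 
here are placeholders:

```chapel
config const n = 8;
const Dx = {1..n, 1..n};
var res: [0..n, 1..n] atomic real;  // atomic elements make concurrent updates safe
const dy = 1.0;

forall (i,j) in Dx {
  const flux = 1.0;            // placeholder for the real flux computation
  res[i-1,j].add(flux * dy);   // atomic add instead of +=
  res[i,j].sub(flux * dy);     // atomic subtract instead of -=
}
```

Though I suppose atomics would add their own overhead in the inner loop.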

How can I check how much time is spent in different parts of the Chapel code? 
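
I assume I can wrap individual sections with a Timer from the Time module, 
something like this sketch:

```chapel
use Time;

var t: Timer;
t.start();
// ... one section of the computation ...
t.stop();
writeln("section took ", t.elapsed(), " seconds");
t.clear();   // reset before timing the next section
```

But is there a better, built-in way to profile the whole program?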


> On 10-Oct-2016, at 10:49 PM, Brad Chamberlain <br...@cray.com> wrote:
> Hi Praveen --
> In addition to Jeff's good advice on timing the computation you care about, I 
> wanted to point out a difference between the MPI and the Chapel code:
> As you know, MPI is designed to be a distributed memory execution model, so 
> to take advantage of the four cores on your Mac, you use mpirun -np 4.
> Chapel supports both shared- and distributed-memory parallelism, so the way 
> you're running on this 4-core Mac is reasonable, yet different than the MPI.  
> Specifically, we will create a single process that will use multiple threads 
> to implement your forall loops (typically 4).  So there will be no 
> inter-process communication in the Chapel implementation as there is in the 
> MPI version, and comparing against an OpenMP implementation would be a fairer 
> comparison.
> Related: The use of the 'StencilDist' domain map has no positive impact for a 
> shared-memory execution like this, and will likely add overhead. It is 
> designed for use on distributed-memory executions that do stencil-based 
> computations in order to enable caching of values owned by neighboring 
> processes.  But when you've only got one process like this, there's no remote 
> data to cache.  So for a shared-memory execution like this, it'd be 
> interesting to see how much faster the code would be if the 'dmapped 
> StencilDist' clause was commented out (in practice, we often write codes that 
> can be compiled with or without distributed data using a 'param' conditional 
> -- for example, see the declarations of 'Elems' and 'Nodes' in 
> examples/benchmarks/lulesh.chpl).
> Running on a distributed memory system using the 'StencilDist' distribution 
> against MPI (or better, vs. an MPI + OpenMP code) would also be more of an 
> apples-to-apples comparison, though I suspect you'll see Chapel fall further 
> behind in terms of performance at that point...
> -Brad
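
P.S. If I understand the 'param' conditional Brad mentions correctly, the 
declaration in my code could look roughly like this (a sketch; 'useStencil' 
and the sizes are names I made up, not taken from lulesh.chpl):

```chapel
use StencilDist;

config param useStencil = false;   // compile with -suseStencil=true for distributed runs
config const n = 100;

const Space = {1..n, 1..n};

// Since useStencil is a param, this conditional is resolved at compile time,
// so the non-distributed build carries no StencilDist overhead at all.
const Dx = if useStencil
             then Space dmapped Stencil(boundingBox=Space, fluff=(1,1))
             else Space;

var res: [Dx] real;   // the same declaration works in either configuration
```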

Chapel-users mailing list
