Hi Julien and Herve --

I'll second Tom's note about not having seen either the original mail, and even the second forward of it somehow escaped my notice as well (I'm wondering whether the lack of a subject line caused it to get filtered out somewhere in my email path). Tom pointed this mail out to me last night on my way out of the office, so I've only had a chance to look at it in any depth now. All of the following suggestions come with the caveat that I may easily have misunderstood some of what you've told us, either due to insufficient details, or insufficient time invested in understanding this thread on my part.

* Overall, I'd be interested to see the full version of your code to
  understand some of the details that have been compressed out in this
  mail and make sure nothing else is slipping through the cracks.  For
  example, your profile runs imply that you're doing 3D array
  computations, but your included code appears to be 2D?  That
  said, it may take me a few days to open it up as we're working on a
  deadline this week.  And hopefully I'll give you enough here to chew
  on in the meantime.

* Since you're compiling Chapel programs for single-locale execution,
  avoiding the use of domain maps will go a long way toward reducing
  overhead.  Domain maps like "Block" are, at present, not particularly
  optimized for the degenerate case of the entire space being mapped
  to a single locale.  In many of our codes (for example,
  examples/benchmarks/lulesh/lulesh.chpl), we use a compile-time switch
  that lets us turn the use of a Block distribution on and off (and have
  its default depend on whether or not a CHPL_COMM setting is used).  See
  'useBlockDist' in the lulesh example, for instance.  Ultimately, of
  course, one would like the Block distribution itself to do such
  specialization under the covers, but that's where we are today.  Note
  that domains and arrays that are not dmapped, as well as ranges, will
  use all of a locale's resources and parallelism by default.  That is,
  simply declaring const D = {1..n, 1..n} and running parallel
  computations over it will make full use of your locale's resources;
  there's no need to dmap it to turn on parallelism.
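
  To sketch the pattern I have in mind (this isn't lulesh's exact code; 'n'
  and 'space' are placeholder names of mine):

      use BlockDist;

      config param useBlockDist = false;  // flip via: chpl ... -suseBlockDist=true
      config const n = 32768;

      const space = {1..n, 1..n};

      // Because useBlockDist is a param, this conditional is folded at compile
      // time, so only the selected domain type is ever instantiated.
      const physicalDomain = if useBlockDist
                               then space dmapped Block(boundingBox=space)
                               else space;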

* I'm worried about the inconsistency in using the --fast flag in the
  compiler lines that you sent.  For example, your "profiling" compile
  lines don't use the --fast flag.  By default, this means that runtime
  checks for array bounds and null pointers are embedded in the generated
  code, and these checks result in *significant* overhead compared to
  compiling with --fast.  If you're
  avoiding --fast for some reason that I'm not anticipating (e.g., to
  turn off C-level optimizations?), you should at least throw the
  --no-checks flag to disable these runtime checks (--fast is a meta-flag
  that throws --no-checks, -O, and one or two others -- see the man page
  for details).  Another option is to throw --fast and then disable
  aspects of it (e.g., --fast --no-optimize would counteract fast's
  use of the -O flag which is called --optimize in its long form).
  The presence of "dsiMember" high in your profiling output suggests to
  me that bounds checking is being performed at runtime.
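
  For example, to keep your profiling flags but drop the checking overhead,
  something along these lines ought to work (adapting your existing command
  line):

      chpl -o heat2D --no-checks -O --ccflags="-pg" --ldflags="-pg" --savec=codeC main.chpl

  or, if the rest of what --fast implies is acceptable for your experiment:

      chpl -o heat2D --fast --ccflags="-pg" --ldflags="-pg" --savec=codeC main.chpl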

* All that said, note that multidimensional arrays are a case where Chapel
  idioms do tend to underperform compared to C.  It might be telling to
  write a microbenchmark that does nothing but a simple 3D array computation
  (removing the stencil and science) in Chapel and C (using whatever
  idioms you're using there) to get a sense of what this gap is.  I
  don't happen to know offhand (and it depends a lot on the C idioms
  you're using), though I'd expect it to be less than what you're seeing
  here and am hoping that the apparent runtime bounds checks are part of
  the problem.
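
  As a strawman, I'm imagining something along these lines on the Chapel side
  (the array names and problem size are arbitrary placeholders):

      use Time;

      config const n = 512;

      const D = {1..n, 1..n, 1..n};
      var A, B: [D] real;

      var t: Timer;
      t.start();
      forall ijk in D do
        A(ijk) = 2.0 * B(ijk) + 1.0;   // trivial 3D sweep, no stencil/science
      t.stop();
      writeln("3D array sweep: ", t.elapsed(), " seconds");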

* I'm curious how you're limiting the number of threads to 8 (there are
  a few ways), and also what your motivation for doing so was:  Did you
  see worse performance if you let Chapel follow its defaults?  (which,
  for a data parallel code like this, I'd expect to be 8 threads assuming
  that's best for the architecture).
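
  (For what it's worth, one common knob is the built-in dataParTasksPerLocale
  config constant, e.g. './heat2D --dataParTasksPerLocale=8', which bounds the
  number of tasks used by data-parallel loops; if you used a different
  mechanism, that would be good to know.)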

* I doubt that this is currently affecting your performance, but your
  use of domain maps for the x_domain and y_domain cases seems somewhat
  confused to me.  In particular, let's say that you are working with
  an 8 x 8 physicalDomain with 4 locales.  The physicalDomain domain map
  will give a 4 x 4 block to each locale.  But mapping 8-element x_domain
  and y_domain via their own domain maps will give a 2-element block to
  each locale with no alignment or correspondence to the physicalDomain.
  Better would probably be to define the boundary conditions as 2D slices
  of completeDomain (or physicalDomain?  I didn't look carefully at how
  the boundaries were defined) in order to align the boundaries with the
  overall space.  While this will have the downside of using fewer
  resources (e.g., the leftmost column will only be operated on by 2
  locales by default rather than all 4 in my example), given that the
  amount of time spent on the boundaries is asymptotically much lower,
  the alignment is probably the better thing to strive for.

  These boundary subdomains could be created either directly via slicing or
  via the interior/exterior domain operations, as sketched below.
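
  A rough sketch of the slicing approach (a self-contained toy rather than
  your code; 'nx'/'ny' stand in for pb.nb_cell_x/pb.nb_cell_y, and the
  boundary expression is a placeholder):

      use BlockDist;

      config const nx = 8, ny = 8;

      const physicalDomain = {1..nx, 1..ny} dmapped Block({1..nx, 1..ny});
      const completeDomain = physicalDomain.expand(1);

      var temp_old: [completeDomain] real;

      // 2D boundary slices that stay aligned with physicalDomain's distribution
      const leftGhost  = completeDomain[0..0, 1..ny];
      const rightGhost = completeDomain[nx+1..nx+1, 1..ny];
      // or, equivalently, via the exterior operation:
      //   physicalDomain.exterior((-1, 0)) and physicalDomain.exterior((1, 0))

      forall (i, j) in leftGhost do
        temp_old(i, j) = temp_old(i+1, j);   // placeholder for your boundary expression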

  Of course, since my first recommendation was to remove the domain maps,
  this comment is aimed more at the day when you're trying to run this
  code on multiple locales.

* Stylistically, note that you can use the swap operator to exchange
  temp and temp_old:  temp <=> temp_old.  I'm assuming that the comment
  is correct and that these are classes wrapping Chapel arrays.  If they
  were arrays themselves, this would result in a deep copy at present
  (probably more expensive than you'd want).
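
  That is, instead of the three-assignment rotation:

      temp_temp = temp_old;
      temp_old = temp;
      temp = temp_temp;

  you could simply write:

      temp <=> temp_old;

  (again, assuming these are class references, so the swap exchanges the
  references rather than copying array data).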

* You asked whether we'd looked at codes like this (which I'd characterize
  as "stencil-based computations" if that's a fair characterization).
  In Chapel, we haven't yet done nearly as much with stencils as we did
  in our previous work on the ZPL language; the main case that we
  have looked at a bit is miniMD (in a sibling directory to the lulesh
  example above) which is a stencil-based molecular dynamics problem.
  But most of our work there was more from a language design perspective
  than performance measurement, tuning, and optimization.  In particular,
  we were looking at what it would take to extend the Block distribution
  to store a notion of halos/ghost cells/fluff which (as you surely know)
  tends to be crucial for good performance.  miniMD also has the downside
  of being a fairly large computation, whereas to really focus on
  stencil performance, starting with something more compact/simpler is
  probably more useful.  If you look at the SourceForge directory, we
  did some sketches of 9-point 2D stencils in Chapel as a warm-up exercise
  for the miniMD work, though these also have not received much attention
  from a tuning/optimization perspective:

        http://svn.code.sf.net/p/chapel/code/trunk/test/studies/stencil9

  I hope/expect we will return to doing more with stencils in miniMD and
  standalone in the year to come, though we haven't yet done a
  prioritization exercise for the second half of 2014, so I can't say for
  sure.

* Looping back around to my original point, if you end up creating
  performance benchmarks that you think are useful for us to track and
  improve upon over time (including your code as a whole) and would like
  to contribute such code back to the Chapel project for use in our
  performance testing suite (which Tom pointed you to below), we'd be
  happy to receive such contributions.


Thanks,
-Brad


On Tue, 13 May 2014, Tom MacDonald wrote:


Hello Julien and Hervé

My name is Tom MacDonald and I am the manager of the Chapel team at Cray.
This is the first time we have seen this question from you, and we are not sure
why we did not see this email the first time you sent it.  I'm sorry you had to
send this message a second time.

The first question I have is which version of Chapel are you using?
The most recent sources are the 1.9.0 version and are currently available
for download from the SourceForge web site:

 https://sourceforge.net/projects/chapel/

If you are not using the latest Chapel version, can you please compile and run
your program again using the latest version and let us know your results?

I see you already are using the --fast flag when you compile.

We are currently working to improve Chapel performance with each
release and are making significant strides.  To track our progress
over time, refer to:

       http://chapel.sourceforge.net/perf/

Also please read the file named PERFORMANCE that comes with the download.

If you are using the 1.9.0 version, please let us know that too.

Thanks for contacting us and I look forward to hearing from you.

Tom MacDonald

From: Julien Bodart [mailto:[email protected]]
Sent: Tuesday, May 13, 2014 11:17 AM
To: [email protected]
Subject: Fwd:

This message, originally posted by Herve Prost, has not been answered and does
not appear in the archive.  Did it get lost somehow?

Thanks,

------------------------------------------------

Hello,

We are students at ISAE-Supaero, a French engineering school, and we are
working with Chapel for one of our projects.
Our goal is to benchmark Chapel on the calculation of the heat equation on a 2D
domain with the finite volume method.  The performance is compared to our
implementation in C-MPI.

Our implementation of the problem in Chapel is the following (short version):

// Domain
const physicalDomain: domain(2) dmapped Block({1..pb.nb_cell_x, 1..pb.nb_cell_y})
                    = {1..pb.nb_cell_x, 1..pb.nb_cell_y};
// Expanded domain to impose boundary conditions with ghost cells
const completeDomain = physicalDomain.expand(1);

// 1D domain for ghost cell update
const x_domain: domain(1) dmapped Block({1..pb.nb_cell_x}) = {1..pb.nb_cell_x};
const y_domain: domain(1) dmapped Block({1..pb.nb_cell_y}) = {1..pb.nb_cell_y};

/* Main Loop: time iterations */
for k in 1..pb.nb_timestep {

   // Boundary condition: ghost cell update
   forall j in y_domain {
       temp_old.arr(0, j) =
           pb.bnd_type(BND_LEFT) * (2 * pb.bnd_value(BND_LEFT) - temp_old.arr(1, j))
         + (1 - pb.bnd_type(BND_LEFT)) * (temp_old.arr(1, j) - pb.dy * pb.bnd_value(BND_LEFT));
       temp_old.arr(pb.nb_cell_x+1, j) =
           pb.bnd_type(BND_RIGHT) * (2 * pb.bnd_value(BND_RIGHT) - temp_old.arr(pb.nb_cell_x, j))
         + (1 - pb.bnd_type(BND_RIGHT)) * (temp_old.arr(pb.nb_cell_x, j) - pb.dy * pb.bnd_value(BND_RIGHT));
   }
   forall i in x_domain {
       temp_old.arr(i, 0) =
           pb.bnd_type(BND_BOTTOM) * (2 * pb.bnd_value(BND_BOTTOM) - temp_old.arr(i, 1))
         + (1 - pb.bnd_type(BND_BOTTOM)) * (temp_old.arr(i, 1) - pb.dx * pb.bnd_value(BND_BOTTOM));
       temp_old.arr(i, pb.nb_cell_y+1) =
           pb.bnd_type(BND_TOP) * (2 * pb.bnd_value(BND_TOP) - temp_old.arr(i, pb.nb_cell_y))
         + (1 - pb.bnd_type(BND_TOP)) * (temp_old.arr(i, pb.nb_cell_y) - pb.dx * pb.bnd_value(BND_TOP));
   }

   // Parallel calculation of the temperature
   forall cell in physicalDomain {
       // reducing memory access
       var temp_cell = temp_old.arr(cell);

       temp.arr(cell) = temp_cell + coeff_flux *
           ( dy_dx * (temp_old.arr(cell+(1,0)) - temp_cell)
           + dx_dy * (temp_old.arr(cell+(0,1)) - temp_cell)
           - dy_dx * (temp_cell - temp_old.arr(cell-(1,0)))
           - dx_dy * (temp_cell - temp_old.arr(cell-(0,1))) );
   }

   // Data storage for next iteration (switching Class references)
   temp_temp = temp_old;
   temp_old = temp;
   temp = temp_temp;
}

In this implementation, temp_old and temp represent the matrices where the cell
temperatures are stored.  Only the calculation over the 2D domain is
parallelized with a forall loop.
We use a similar MPI code, where the 2D domain is subdivided into blocks and
each processor is in charge of a block; the communication between processors
only concerns block interfaces.

** We have been doing some tests and it appears that the Chapel version is 10
times slower than the MPI version (the Time module is used for timing with Chapel).
For example, with a 32768*32768 mesh (about 1 billion cells) on 8 processors (a
single node of an SGI Altix ICE 8200 server, with enough RAM), Chapel takes 10s
per time iteration whereas MPI takes only 0.8s.

The Chapel configuration is the following (no communication protocol is used in
this example since a single node is used, so the calculation is done with a
single locale):
export CHPL_HOST_PLATFORM=linux64
export CHPL_HOST_COMPILER=intel
export CHPL_COMM=none

The Chapel code is compiled with the --fast option and the number of threads is
limited to 8 (the number of processors on the node); the command used is:
./heat2D -v > $PBS_JOBNAME.out


** In order to locate the issue, we tried to profile the C code generated by
the Chapel compiler.  We compiled with the following command:
chpl -o heat2D -g --ccflags="-pg"  --ldflags="-pg" --savec=codeC main.chpl
and executed the code (no optimization, single thread on a personal
computer, different domain: 2000*2000).

Gprof was then used to profile, and the following functions are time-consuming:
Flat profile:
Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total
time   seconds   seconds    calls   s/call   s/call  name
32.85      5.67     5.67 80064000     0.00     0.00  dsiAccess3
15.87      8.41     2.74 80064000     0.00     0.00  this17
12.75     10.61     2.20 196200043     0.00     0.00  dsiMember3
 9.47     12.25     1.64 392528165     0.00     0.00  member4
 8.46     13.71     1.46        4     0.37     4.13  coforall_fn22
 5.76     14.70     1.00 16032000     0.00     0.00  dsiAccess2
 4.08     15.41     0.71 96096000     0.00     0.00  member5
 2.49     15.84     0.43 16032000     0.00     0.00  this16
 2.26     16.23     0.39 96096000     0.00     0.00  member
 1.42     16.47     0.25 128000000     0.00     0.00  chpl__tupleRestHelper

It appears that dsiAccess3 is the function used to access the value of a cell
of an array; it is called 80M times:
- 5 times per cell in "forall cell in physicalDomain", with 4M cells and 4 time
  iterations: 5*4M*4 = 80,000,000
- 2 times per ghost cell update, with 8000 ghost cells (4*2000) and 4 time
  iterations: 2*8000*4 = 64,000
which adds up to exactly the 80,064,000 calls reported.

We have done the same test with C code optimization enabled (-O).  We compiled
with the following command:
chpl -o heat2D -g -O --ccflags="-pg"  --ldflags="-pg" --savec=codeC main.chpl
and executed the code under the same conditions (single thread on a personal
computer, 2000*2000 domain).

This time with C code optimization, the profiling gives the following results:
Flat profile:
Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
30.36      0.34     0.34 100232282     0.00     0.00  chpl_localeID_to_locale.constprop.643
21.43      0.58     0.24 80064000     0.00     0.00  this17.constprop.559
15.18      0.75     0.17 64000000     0.00     0.00  dsiAccess3.constprop.550
 9.82      0.86     0.11 16000000     0.00     0.00  dsiAccess3.constprop.549
 9.82      0.97     0.11        4    27.50   258.84  coforall_fn22
 4.46      1.02     0.05 16032000     0.00     0.00  this16
 2.68      1.05     0.03 16000000     0.00     0.00  dsiAccess2.constprop.456
 1.79      1.07     0.02        1    20.00    20.00  chpl_startTrackingMemory.constprop.445
 1.79      1.09     0.02                             chpl___ASSIGN_7
 1.79      1.11     0.02                             dsiAccess2


For comparison: without C code optimization, a time iteration takes 9s,
whereas it takes 1.4s with optimization.

If the code is not profiled:
chpl -o heat2D -O --savec=codeC main.chpl

a single time iteration takes 0.4s with Chapel, whereas it takes 0.02s with
MPI on a single thread.

** Based on these results, we would like to know if we missed something that
could improve the performance of the calculation with Chapel.  We find it
strange that the results are so different between Chapel and MPI (when the
code is compiled and executed normally).
Did you try to use Chapel for this kind of problem?

Thanks,
Hervé Prost
[email protected]
---------------------------------------
