Hi Julien and Herve --
I'll second Tom's note about not having seen the original mail; even the
second forward of it somehow escaped my notice as well (I'm
wondering whether the lack of a subject line caused it to get filtered out
somewhere in my email path). Tom pointed this mail out to me last night
on my way out of the office, so I've only had a chance to look at it in
any depth now. All of the following suggestions come with the caveat that
I may easily have misunderstood some of what you've told us, either due to
insufficient details, or insufficient time invested in understanding this
thread on my part.
* Overall, I'd be interested to see the full version of your code to
understand some of the details that have been compressed out in this
mail and make sure nothing else is slipping through the cracks. For
example, your profile runs imply that you're doing 3D array
computations, but your included code appears to be 2D? That
said, it may take me a few days to open it up as we're working on a
deadline this week. And hopefully I'll give you enough here to chew
on in the meantime.
* Since you're compiling Chapel programs for single-locale execution,
avoiding the use of domain maps will go a long way toward reducing
overhead. Domain maps like "Block" are, at present, not particularly
optimized for the degenerate case of the entire space being mapped
to a single locale. In many of our codes (for example,
examples/benchmarks/lulesh/lulesh.chpl), we use a compile-time switch
that lets us turn the use of a Block distribution on and off (and have
its default depend on whether or not a CHPL_COMM setting is used). See
'useBlockDist' in the lulesh example, for instance. Ultimately, of
course, one would like the Block distribution itself to do such
specialization under the covers, but that's where we are today. Note
that non-dmapped domains and arrays, as well as ranges, will still use
all of a locale's resources and parallelism. That is, simply
declaring const D = {1..n, 1..n} and running
parallel computations over it will make full use of your locale's
resources; there's no requirement to dmap it to turn on parallelism.
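  For concreteness, here's a minimal sketch of that compile-time-switch
  pattern (this isn't the actual lulesh source; 'useBlockDist', 'n', and
  the array are just illustrative names):

    use BlockDist;

    config param useBlockDist = false;  // e.g., enable with: chpl -suseBlockDist=true ...
    config const n = 1000;

    const space = {1..n, 1..n};

    // 'useBlockDist' is a param, so this conditional is folded at compile time
    const D = if useBlockDist then space dmapped Block(boundingBox=space)
                              else space;

    var A: [D] real;

    // runs in parallel on all of the locale's cores whether or not D is dmapped
    forall (i, j) in D do
      A[i, j] = i + j;

  With CHPL_COMM unset, 'useBlockDist' stays false and D is an ordinary local
  domain; compiling with -suseBlockDist=true brings the Block distribution back
  for multi-locale runs.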
* I'm worried about the inconsistency in using the --fast flag in the
compiler lines that you sent. For example, your "profiling" compile
lines don't use the --fast flag. By default, this would mean that
runtime checks for array bounds and null pointers
would be embedded into the generated code, and these result in
*significant* overhead compared to not using --fast. If you're
avoiding --fast for some reason that I'm not anticipating (e.g., to
turn off C-level optimizations?), you should at least throw the
--no-checks flag to disable these runtime checks (--fast is a meta-flag
that throws --no-checks, -O, and one or two others -- see the man page
for details). Another option is to throw --fast and then disable
aspects of it (e.g., --fast --no-optimize would counteract fast's
use of the -O flag which is called --optimize in its long form).
The presence of "dsiMember" high in your profiling output suggests to
me that bounds checking is being performed at runtime.
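  For example, for your profiling builds, something along these lines should
  keep the profiling hooks while avoiding the checking overhead (these just
  combine the flags you were already using with the ones mentioned above):

    chpl -o heat2D --fast --ccflags="-pg" --ldflags="-pg" --savec=codeC main.chpl

  or, keeping your current flags and just adding --no-checks:

    chpl -o heat2D -g -O --no-checks --ccflags="-pg" --ldflags="-pg" --savec=codeC main.chpl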
* All that said, note that multidimensional arrays are a case where Chapel
idioms do tend to underperform compared to C. It might be telling to
write a microbenchmark that does a simple 3D array computation
(removing the stencil and science) in Chapel and C (using whatever
idioms you're using there) to get a sense of what this gap is. I
don't happen to know offhand (and it depends a lot on the C idioms
you're using), though I'd expect it to be less than what you're seeing
here and am hoping that the apparent runtime bounds checks are part of
the problem.
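  On the Chapel side, something as simple as the following would probably
  suffice (a 2D version shown here to mirror your posted code; the names and
  sizes are arbitrary):

    use Time;

    config const n = 2000;

    const D = {1..n, 1..n};
    var A, B: [D] real;

    var t: Timer;
    t.start();
    forall ij in D do           // pure array-access loop, no stencil
      A(ij) = B(ij) + 1.0;
    t.stop();
    writeln("array-access loop took ", t.elapsed(), " seconds");

  Comparing that against an equivalent C loop nest (with whichever indexing
  idiom you use there) would isolate the array-access cost from the rest of
  the stencil.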
* I'm curious how you're limiting the number of threads to 8 (there are
a few ways), and also what your motivation for doing so was: Did you
see worse performance if you let Chapel follow its defaults? (which,
for a data parallel code like this, I'd expect to be 8 threads assuming
that's best for the architecture).
* I doubt that this is currently affecting your performance, but your
use of domain maps for the x_domain and y_domain cases seems somewhat
confused to me. In particular, let's say that you are working with
an 8 x 8 physicalDomain with 4 locales. The physicalDomain domain map
will give a 4 x 4 block to each locale. But mapping 8-element x_domain
and y_domain via their own domain maps will give a 2-element block to
each locale with no alignment or correspondence to the physicalDomain.
Better would probably be to define the boundary conditions as 2D slices
of completeDomain (or physicalDomain? I didn't look carefully at how
the boundaries were defined) in order to align the boundaries with the
overall space. While this will have the downside of using fewer
resources (e.g., the leftmost column will only be operated on by 2
locales by default rather than all 4 in my example), given that the
amount of time spent on the boundaries is asymptotically much lower,
the alignment is probably the better thing to strive for.
These slices could be taken either directly via slicing or via the
interior/exterior domain operations.
Of course, since my first recommendation was to remove the domain maps,
this comment is designed more for a day that you're trying to run this
code on multiple locales.
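  In case it's useful, here's a rough sketch (untested, borrowing the names
  from your code) of what those boundary regions might look like as 2D slabs
  via the exterior operation:

    // one-cell-wide slabs just outside physicalDomain, aligned with its distribution
    const leftGhosts   = physicalDomain.exterior((-1, 0)),
          rightGhosts  = physicalDomain.exterior(( 1, 0)),
          bottomGhosts = physicalDomain.exterior(( 0,-1)),
          topGhosts    = physicalDomain.exterior(( 0, 1));

    // e.g., the left-boundary update becomes a 2D forall over a degenerate slab
    forall (i, j) in leftGhosts do
      temp_old.arr(i, j) = pb.bnd_type(BND_LEFT) * (2 * pb.bnd_value(BND_LEFT)
                             - temp_old.arr(i+1, j))
                         + (1 - pb.bnd_type(BND_LEFT)) * (temp_old.arr(i+1, j)
                             - pb.dy * pb.bnd_value(BND_LEFT));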
* Stylistically, note that you can use the swap operator to exchange
temp and temp_old: temp <=> temp_old. I'm assuming that the comment
is correct and that these are classes wrapping Chapel arrays. If they
were arrays themselves, this would result in a deep copy at present
(probably more expensive than you'd want).
* You asked whether we'd looked at codes like this (which I'd characterize
as "stencil-based computations" if that's a fair characterization).
Under Chapel, we haven't yet done nearly as much with stencils as we did
in our previous work on the ZPL language; the main case that we
have looked at a bit is miniMD (in a sibling directory to the lulesh
example above) which is a stencil-based molecular dynamics problem.
But most of our work there was more from a language design perspective
than performance measurement, tuning, and optimization. In particular,
we were looking at what it would take to extend the Block distribution
to store a notion of halos/ghost cells/fluff which (as you surely know)
tends to be crucial for good performance. miniMD also has the downside
of being a fairly large computation, whereas to really focus on
stencil performance, starting with something more compact/simpler is
probably more useful. If you look at the SourceForge directory, we
did some sketches of 9-point 2D stencils in Chapel as a warm-up exercise
for the miniMD work, though these also have not received much attention
from a tuning/optimization perspective:
http://svn.code.sf.net/p/chapel/code/trunk/test/studies/stencil9
I hope/expect we will return to doing more with stencils in miniMD and
standalone in the year to come, though we haven't yet done a
prioritization exercise for the second half of 2014, so I can't say for
sure.
* Looping back around to my original point, if you end up creating
performance benchmarks that you think are useful for us to track and
improve upon over time (including your code as a whole) and would like
to contribute such code back to the Chapel project for use in our
performance testing suite (which Tom pointed you to below), we'd be
happy to receive such contributions.
Thanks,
-Brad
On Tue, 13 May 2014, Tom MacDonald wrote:
Hello Julien and Hervé
My name is Tom MacDonald and I am the manager of the Chapel team at Cray.
This is the first time we have seen this question from you, and we are not sure
why we did not see this email the first time you sent it. I'm sorry you had to
send this message a second time.
The first question I have is which version of Chapel are you using?
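(If you are not sure, running "chpl --version" will print the version of the
compiler you have installed.)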
The most recent sources are the 1.9.0 version and are currently available
for download from the SourceForge web site:
https://sourceforge.net/projects/chapel/
If you are not using the latest Chapel version, can you please compile and run
your program again using the latest version and let us know your results?
I see you are already using the --fast flag when you compile.
We are currently working to improve Chapel performance with each
release and are making significant strides. To track our progress
over time, refer to:
http://chapel.sourceforge.net/perf/
Also please read the file named PERFORMANCE that comes with the download.
If you are using the 1.9.0 version, please let us know that too.
Thanks for contacting us and I look forward to hearing from you.
Tom MacDonald
From: Julien Bodart [mailto:[email protected]]
Sent: Tuesday, May 13, 2014 11:17 AM
To: [email protected]
Subject: Fwd:
This message, originally posted by Herve Prost, has not been answered and does
not appear in the archive. Did it get lost somehow?
Thanks,
------------------------------------------------
Hello,
We are students at ISAE-Supaero, a French engineering school, and we are
working with Chapel for one of our projects.
Our goal is to benchmark Chapel on the calculation of the heat equation on a 2D
domain with the finite volume method. The performance is compared to our
implementation in C + MPI.
Our implementation of the problem in Chapel is the following (short version):
// Domain
const physicalDomain: domain(2) dmapped Block({1..pb.nb_cell_x, 1..pb.nb_cell_y})
                               = {1..pb.nb_cell_x, 1..pb.nb_cell_y};

// Expanded domain to impose boundary conditions with ghost cells
const completeDomain = physicalDomain.expand(1);

// 1D domains for ghost cell updates
const x_domain: domain(1) dmapped Block({1..pb.nb_cell_x}) = {1..pb.nb_cell_x};
const y_domain: domain(1) dmapped Block({1..pb.nb_cell_y}) = {1..pb.nb_cell_y};

/* Main loop: time iterations */
for k in 1..pb.nb_timestep {

  // Boundary condition: ghost cell update
  forall j in y_domain {
    temp_old.arr(0, j) =
      pb.bnd_type(BND_LEFT) * (2 * pb.bnd_value(BND_LEFT) - temp_old.arr(1, j))
      + (1 - pb.bnd_type(BND_LEFT)) * (temp_old.arr(1, j) - pb.dy * pb.bnd_value(BND_LEFT));
    temp_old.arr(pb.nb_cell_x+1, j) =
      pb.bnd_type(BND_RIGHT) * (2 * pb.bnd_value(BND_RIGHT) - temp_old.arr(pb.nb_cell_x, j))
      + (1 - pb.bnd_type(BND_RIGHT)) * (temp_old.arr(pb.nb_cell_x, j) - pb.dy * pb.bnd_value(BND_RIGHT));
  }

  forall i in x_domain {
    temp_old.arr(i, 0) =
      pb.bnd_type(BND_BOTTOM) * (2 * pb.bnd_value(BND_BOTTOM) - temp_old.arr(i, 1))
      + (1 - pb.bnd_type(BND_BOTTOM)) * (temp_old.arr(i, 1) - pb.dx * pb.bnd_value(BND_BOTTOM));
    temp_old.arr(i, pb.nb_cell_y+1) =
      pb.bnd_type(BND_TOP) * (2 * pb.bnd_value(BND_TOP) - temp_old.arr(i, pb.nb_cell_y))
      + (1 - pb.bnd_type(BND_TOP)) * (temp_old.arr(i, pb.nb_cell_y) - pb.dx * pb.bnd_value(BND_TOP));
  }

  // Parallel calculation of the temperature
  forall cell in physicalDomain {
    // reducing memory accesses
    var temp_cell = temp_old.arr(cell);
    temp.arr(cell) = temp_cell
      + coeff_flux * (  dy_dx * (temp_old.arr(cell+(1,0)) - temp_cell)
                      + dx_dy * (temp_old.arr(cell+(0,1)) - temp_cell)
                      - dy_dx * (temp_cell - temp_old.arr(cell-(1,0)))
                      - dx_dy * (temp_cell - temp_old.arr(cell-(0,1))));
  }

  // Data storage for the next iteration (switching class references)
  temp_temp = temp_old;
  temp_old  = temp;
  temp      = temp_temp;
}
In this implementation, temp_old and temp represent the matrices where the cell
temperatures are stored. Only the calculation over the 2D domain is
parallelized with a forall loop.
We compare against a similar MPI code, where the 2D domain is subdivided into
blocks and each processor is in charge of a block; the communication between
processors only concerns the block interfaces.
** We've been doing some tests and it appears that the Chapel version is 10
times slower than the MPI version (the Time module is used for timing in Chapel).
For example, with a 32768*32768 mesh (about 1 billion cells) on 8 processors (a
single node of an SGI Altix ICE 8200 server, with enough RAM), Chapel
takes 10s per time iteration whereas MPI takes only 0.8s.
The Chapel configuration is the following (no communication protocol is used in
this example since a single node is used; therefore the calculation is done
on a single locale):
export CHPL_HOST_PLATFORM=linux64
export CHPL_HOST_COMPILER=intel
export CHPL_COMM=none
The Chapel code is compiled with the --fast option and the number of threads is
limited to 8 (the number of processors on the node); the command used is:
./heat2D -v > $PBS_JOBNAME.out
** In order to locate the issue, we tried to profile the C code generated by
the Chapel compiler. We compiled with the following command:
chpl -o heat2D -g --ccflags="-pg" --ldflags="-pg" --savec=codeC main.chpl
and executed the code with no optimization, a single thread, on a personal
computer, and a different domain (2000*2000).
Gprof was then used to profile, and the following functions are time consuming:
Flat profile:
Each sample counts as 0.01 seconds.
  %    cumulative    self               self    total
 time    seconds    seconds     calls  s/call  s/call  name
 32.85      5.67      5.67    80064000   0.00    0.00  dsiAccess3
 15.87      8.41      2.74    80064000   0.00    0.00  this17
 12.75     10.61      2.20   196200043   0.00    0.00  dsiMember3
  9.47     12.25      1.64   392528165   0.00    0.00  member4
  8.46     13.71      1.46           4   0.37    4.13  coforall_fn22
  5.76     14.70      1.00    16032000   0.00    0.00  dsiAccess2
  4.08     15.41      0.71    96096000   0.00    0.00  member5
  2.49     15.84      0.43    16032000   0.00    0.00  this16
  2.26     16.23      0.39    96096000   0.00    0.00  member
  1.42     16.47      0.25   128000000   0.00    0.00  chpl__tupleRestHelper
It appears that dsiAccess3 is the function used to access the value of a cell
of an array, and it is called 80M times:
- 5 times per "forall cell in physicalDomain", with 4M cells and 4 time
  iterations: 5 * 4,000,000 * 4 = 80,000,000
- 2 times per ghost cell update, with 8000 ghost cells (4*2000) and 4 time
  iterations: 2 * 8000 * 4 = 64,000
which is exactly 80,064,000 calls in total.
We have done the same test using C code optimization (-O), compiling with the
following command:
chpl -o heat2D -g -O --ccflags="-pg" --ldflags="-pg" --savec=codeC main.chpl
and executing under the same conditions (a single thread on a personal
computer, 2000*2000 domain).
This time, with C code optimization, the profiling gives the following results:
Flat profile:
Each sample counts as 0.01 seconds.
  %    cumulative    self                self     total
 time    seconds    seconds      calls  ms/call  ms/call  name
 30.36      0.34      0.34   100232282     0.00     0.00  chpl_localeID_to_locale.constprop.643
 21.43      0.58      0.24    80064000     0.00     0.00  this17.constprop.559
 15.18      0.75      0.17    64000000     0.00     0.00  dsiAccess3.constprop.550
  9.82      0.86      0.11    16000000     0.00     0.00  dsiAccess3.constprop.549
  9.82      0.97      0.11           4    27.50   258.84  coforall_fn22
  4.46      1.02      0.05    16032000     0.00     0.00  this16
  2.68      1.05      0.03    16000000     0.00     0.00  dsiAccess2.constprop.456
  1.79      1.07      0.02           1    20.00    20.00  chpl_startTrackingMemory.constprop.445
  1.79      1.09      0.02                                chpl___ASSIGN_7
  1.79      1.11      0.02                                dsiAccess2
For comparison, a time iteration takes 9s without C code optimization, whereas
it takes 1.4s with optimization.
If the code is not profiled:
chpl -o heat2D -O --savec=codeC main.chpl
a single time iteration takes 0.4s with Chapel, whereas it takes 0.02s with MPI
on a single thread.
** Based on these results, we would like to know if we missed something that
could enhance the performance of the calculation with Chapel. We find it
strange that the results are so different between Chapel and MPI (when the code
is compiled and executed normally).
Have you tried to use Chapel for this kind of problem?
Thanks,
Hervé Prost
[email protected]<mailto:[email protected]>
---------------------------------------