Hello Julien and Hervé

My name is Tom MacDonald and I am the manager of the Chapel team at Cray.
This is the first time we have seen this question from you, and we are not sure
why we did not see this email the first time you sent it. I'm sorry you had to
send this message a second time.

The first question I have is: which version of Chapel are you using?
The most recent release is version 1.9.0, and its sources are currently available
for download from the SourceForge web site:

  https://sourceforge.net/projects/chapel/

If you are not using the latest Chapel version, can you please compile and run
your program again using the latest version and let us know your results?

I see you are already using the --fast flag when you compile.

We are currently working to improve Chapel performance with each
release and are making significant strides.  To track our progress
over time, refer to:

        http://chapel.sourceforge.net/perf/

Also please read the file named PERFORMANCE that comes with the download.

If you are using the 1.9.0 version, please let us know that too.

Thanks for contacting us and I look forward to hearing from you.

Tom MacDonald

From: Julien Bodart [mailto:[email protected]]
Sent: Tuesday, May 13, 2014 11:17 AM
To: [email protected]
Subject: Fwd:

This message, originally posted by Hervé Prost, has not been answered and does
not appear in the archive. Did it get lost somehow?

Thanks,

------------------------------------------------

Hello,

We are students at ISAE-Supaero, a French engineering school, and we are working
with Chapel for one of our projects.
Our goal is to benchmark Chapel on the solution of the heat equation on a 2D
domain with the finite volume method. The performance is compared with that of
our C-MPI implementation.
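
For reference, the explicit update applied to each interior cell, written out from the stencil in the code excerpt below (assuming dy_dx = dy/dx, dx_dy = dx/dy, and coeff_flux a precomputed time-step coefficient), is:

    T_{i,j}^{n+1} = T_{i,j}^{n} + coeff_flux * ( dy_dx * (T_{i+1,j}^{n} - 2 T_{i,j}^{n} + T_{i-1,j}^{n})
                                               + dx_dy * (T_{i,j+1}^{n} - 2 T_{i,j}^{n} + T_{i,j-1}^{n}) )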

Our implementation of the problem in Chapel is the following (short version):
// Block-distributed domains require the standard BlockDist module
use BlockDist;

// Domain
const physicalDomain: domain(2) dmapped Block({1..pb.nb_cell_x, 1..pb.nb_cell_y})
                    = {1..pb.nb_cell_x, 1..pb.nb_cell_y};
// Expanded domain to impose boundary conditions with ghost cells
const completeDomain = physicalDomain.expand(1);

// 1D domains for ghost cell update
const x_domain: domain(1) dmapped Block({1..pb.nb_cell_x}) = {1..pb.nb_cell_x};
const y_domain: domain(1) dmapped Block({1..pb.nb_cell_y}) = {1..pb.nb_cell_y};

/* Main Loop: time iterations */
for k in 1..pb.nb_timestep {

    // Boundary condition: ghost cell update
    forall j in y_domain {
        temp_old.arr(0, j) =
            pb.bnd_type(BND_LEFT) * (2 * pb.bnd_value(BND_LEFT) - temp_old.arr(1, j))
            + (1 - pb.bnd_type(BND_LEFT)) * (temp_old.arr(1, j) - pb.dy * pb.bnd_value(BND_LEFT));
        temp_old.arr(pb.nb_cell_x+1, j) =
            pb.bnd_type(BND_RIGHT) * (2 * pb.bnd_value(BND_RIGHT) - temp_old.arr(pb.nb_cell_x, j))
            + (1 - pb.bnd_type(BND_RIGHT)) * (temp_old.arr(pb.nb_cell_x, j) - pb.dy * pb.bnd_value(BND_RIGHT));
    }
    forall i in x_domain {
        temp_old.arr(i, 0) =
            pb.bnd_type(BND_BOTTOM) * (2 * pb.bnd_value(BND_BOTTOM) - temp_old.arr(i, 1))
            + (1 - pb.bnd_type(BND_BOTTOM)) * (temp_old.arr(i, 1) - pb.dx * pb.bnd_value(BND_BOTTOM));
        temp_old.arr(i, pb.nb_cell_y+1) =
            pb.bnd_type(BND_TOP) * (2 * pb.bnd_value(BND_TOP) - temp_old.arr(i, pb.nb_cell_y))
            + (1 - pb.bnd_type(BND_TOP)) * (temp_old.arr(i, pb.nb_cell_y) - pb.dx * pb.bnd_value(BND_TOP));
    }

    // Parallel calculation of the temperature
    forall cell in physicalDomain {
        // reducing memory access
        var temp_cell = temp_old.arr(cell);

        temp.arr(cell) = temp_cell + coeff_flux *
            (  dy_dx * (temp_old.arr(cell+(1,0)) - temp_cell)
             + dx_dy * (temp_old.arr(cell+(0,1)) - temp_cell)
             - dy_dx * (temp_cell - temp_old.arr(cell-(1,0)))
             - dx_dy * (temp_cell - temp_old.arr(cell-(0,1))));
    }

    // Data storage for next iteration (switching Class references)
    temp_temp = temp_old;
    temp_old  = temp;
    temp      = temp_temp;
}

In this implementation, temp_old and temp hold the matrices in which the cell
temperatures are stored. Only the calculation over the 2D domain is
parallelized with a forall loop.
We compare against a similar MPI code, in which the 2D domain is subdivided into
blocks and each processor is in charge of one block; communication between
processors only concerns the block interfaces.
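
For completeness, the temperature fields are held in a small wrapper class, so that the three assignments at the end of the time loop exchange class references instead of copying the arrays. The actual declarations are omitted from the short version above; a minimal sketch of what they look like (the name TempField is only illustrative):

// Minimal sketch of the presumed declarations; TempField is an illustrative name
class TempField {
    var arr: [completeDomain] real;   // cell temperatures, including ghost cells
}

var temp_old  = new TempField();      // temperatures at the current time level
var temp      = new TempField();      // temperatures being computed
var temp_temp : TempField;            // spare reference used for the end-of-step swap

Since Chapel class variables are references, assigning them does not copy the underlying arrays, which keeps the end-of-step swap cheap even on very large meshes.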

** We have been running some tests and it appears that the Chapel version is 10
times slower than the MPI version (the Time module is used for timing in Chapel).
For example, on a 32768*32768 mesh (about 1 billion cells) with 8 processors (a
single node of an SGI Altix ICE 8200 server, with enough RAM), Chapel takes 10 s
per time iteration where MPI takes only 0.8 s.
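
The per-iteration times on the Chapel side are measured with a Timer from the Time module, roughly as in the minimal sketch below (the exact placement of the timer in our code is omitted here):

use Time;

var t: Timer;
t.start();
// ... one time iteration of the solver ...
t.stop();
writeln("time per iteration: ", t.elapsed(), " s");
t.clear();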

The Chapel configuration is the following (no communication layer is used in
this example since a single node is used; the calculation is therefore done with
a single locale):
export CHPL_HOST_PLATFORM=linux64
export CHPL_HOST_COMPILER=intel
export CHPL_COMM=none
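
As a quick sanity check of this single-locale setup, the runtime can be queried from the program itself; a minimal sketch, assuming the numLocales and here.numCores queries available in this Chapel version:

// check that the run sees a single locale and report its core count
writeln("number of locales: ", numLocales);
writeln("cores on locale 0: ", here.numCores);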

The Chapel code is compiled with the --fast option and the number of threads is
limited to 8 (the number of processors on the node). The command used to run it is:
./heat2D -v > $PBS_JOBNAME.out


** In order to locate the issue, we tried to profile the C code generated by the
Chapel compiler. We compiled with the following command:
chpl -o heat2D -g --ccflags="-pg"  --ldflags="-pg" --savec=codeC main.chpl
and executed the code without C optimization, on a single thread of a personal
computer, with a smaller domain: 2000*2000.

Gprof was then used to profile the execution, and the following functions are the most time-consuming:
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 32.85      5.67     5.67 80064000     0.00     0.00  dsiAccess3
 15.87      8.41     2.74 80064000     0.00     0.00  this17
 12.75     10.61     2.20 196200043     0.00     0.00  dsiMember3
  9.47     12.25     1.64 392528165     0.00     0.00  member4
  8.46     13.71     1.46        4     0.37     4.13  coforall_fn22
  5.76     14.70     1.00 16032000     0.00     0.00  dsiAccess2
  4.08     15.41     0.71 96096000     0.00     0.00  member5
  2.49     15.84     0.43 16032000     0.00     0.00  this16
  2.26     16.23     0.39 96096000     0.00     0.00  member
  1.42     16.47     0.25 128000000     0.00     0.00  chpl__tupleRestHelper

It appears that dsiAccess3 is the function used to access the value of an array
cell, and it is called 80M times:
- 5 accesses per cell in the "forall cell in physicalDomain" loop, with 4M cells
  and 4 time iterations: 5*4M*4 = 80,000,000
- 2 accesses per ghost cell update, with 8,000 ghost cells (4*2000) and 4 time
  iterations: 2*8000*4 = 64,000
which is exactly the 80,064,000 calls reported in total.

We have done the same test with C code optimization enabled (-O). We again
profiled the C code generated by the Chapel compiler, this time compiling with:
chpl -o heat2D -g -O --ccflags="-pg"  --ldflags="-pg" --savec=codeC main.chpl
and executed the code under the same conditions (single thread on a personal
computer, 2000*2000 domain).

This time with C code optimization, the profiling gives the following results:
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 30.36      0.34     0.34 100232282     0.00     0.00  chpl_localeID_to_locale.constprop.643
 21.43      0.58     0.24 80064000     0.00     0.00  this17.constprop.559
 15.18      0.75     0.17 64000000     0.00     0.00  dsiAccess3.constprop.550
  9.82      0.86     0.11 16000000     0.00     0.00  dsiAccess3.constprop.549
  9.82      0.97     0.11        4    27.50   258.84  coforall_fn22
  4.46      1.02     0.05 16032000     0.00     0.00  this16
  2.68      1.05     0.03 16000000     0.00     0.00  dsiAccess2.constprop.456
  1.79      1.07     0.02        1    20.00    20.00  chpl_startTrackingMemory.constprop.445
  1.79      1.09     0.02                             chpl___ASSIGN_7
  1.79      1.11     0.02                             dsiAccess2


For comparison, without C code optimization a time iteration takes 9 s, whereas
it takes 1.4 s with optimization.

If the code is compiled without profiling instrumentation:
chpl -o heat2D -O --savec=codeC main.chpl

A single time iteration takes 0.4 s with Chapel, whereas it takes 0.02 s with MPI
on a single thread.

** Based on these results, we would like to know whether we missed something that
could improve the performance of the calculation with Chapel. We find it strange
that the results differ so much between Chapel and MPI (when the code is compiled
and executed normally).
Have you tried using Chapel for this kind of problem?

Thanks,
Hervé Prost
[email protected]<mailto:[email protected]>
---------------------------------------
