Hi,

Please find attached the two output files from the 64 MPI/1 OMP and 64 MPI/2 OMP runs
(SLURM job IDs 23321 and 23325).

Best,
Damian


In the message dated 19 June 2017 (15:39:53), it was written:


On Mon, Jun 19, 2017 at 7:32 AM, Damian Kaliszan <[email protected]> wrote:
Hi,
Thank you for the answer and the article.
I use SLURM (srun) for job submission, running the command
'srun script.py script_parameters' inside a batch script, so this is the SPMD
model.
What I noticed is that the problems I'm having now didn't happen
before on CPU E5-2697 v3 nodes (28 cores; the best performance I had
was with 14 MPI processes/2 OMP threads per node). The problems started to
appear when I moved to KNLs.
The funny thing is that switching OMP off (by setting
OMP_NUM_THREADS to 1) doesn't help for all #nodes/#MPI/#OMP
combinations. For example, for 2 nodes and 16 MPI processes, the
timings are huge for OMP=1 and 2 but OK for OMP=4.

Let's narrow this down to MPI_Barrier(). What memory mode is the KNL in? Did you
require the KNL to use only MCDRAM? Please show the MPI_Barrier()/MPI_Send()
numbers for the different configurations.
This measures just latency. We could also look at VecScale() to see the memory
bandwidth achieved.
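One way to collect such per-configuration latency numbers from a Python app is a small averaging timer. This is only a minimal sketch: the helper name `average_time` and its parameters are made up for illustration, and it is not how PETSc itself produces the "Average time for MPI_Barrier()" line in its log.

```python
import time


def average_time(fn, repeats=100, warmup=5):
    """Return the mean wall-clock time in seconds of one call to fn()."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # Stand-in workload (a no-op), so the sketch runs without MPI installed.
    print(f"{average_time(lambda: None):.3e} s per call")
```

Assuming the app uses mpi4py (the thread does not say which MPI binding is used), passing `MPI.COMM_WORLD.Barrier` as `fn` on every rank would time the barrier for a given MPI/OMP configuration.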

  Thanks,

    Matt
 
Playing with affinity didn't help so far.
In other words, at first glance the results look completely random (I can
provide more such examples).
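When experimenting with affinity, it can help to print what each process actually received. The following is a minimal sketch under stated assumptions: the function name `affinity_report` and the dictionary keys are illustrative, it relies on the Linux-only `os.sched_getaffinity`, and it reads the `SLURM_PROCID` and `OMP_NUM_THREADS` environment variables. It only flags the case where more OpenMP threads are requested than logical CPUs are available to the process; it does not diagnose the timings reported in this thread.

```python
import os


def affinity_report():
    """Compare this process's CPU mask with its requested OpenMP thread count."""
    # os.sched_getaffinity exists only on Linux; otherwise assume all CPUs.
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
    else:
        cpus = list(range(os.cpu_count() or 1))
    # OMP_NUM_THREADS may be a comma-separated list (nested parallelism);
    # only the outermost level is considered here.
    raw = os.environ.get("OMP_NUM_THREADS", "1").split(",")[0].strip()
    threads = int(raw) if raw else 1
    return {
        "rank": os.environ.get("SLURM_PROCID", "?"),
        "cpus": cpus,
        "omp_threads": threads,
        "oversubscribed": threads > len(cpus),
    }


if __name__ == "__main__":
    print(affinity_report())
```

Called at the start of the script launched by srun, this prints one line per MPI process, which makes it easy to see whether all processes were given the CPU mask that was intended.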



Best,
Damian

In the message dated 19 June 2017 (14:50:25), it was written:


On Mon, Jun 19, 2017 at 6:42 AM, Damian Kaliszan <[email protected]> wrote:
Hi,

Regarding my previous post:
I looked into both logs, 64 MPI/1 OMP vs. 64 MPI/2 OMP.


What caught my attention is the huge difference in MPI timings in the following
places:

Average time to get PetscTime(): 2.14577e-07
Average time for MPI_Barrier(): 3.9196e-05
Average time for zero size MPI_Send(): 5.45382e-06

vs.

Average time to get PetscTime(): 4.05312e-07
Average time for MPI_Barrier(): 0.348399
Average time for zero size MPI_Send(): 0.029937
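
To put the two sets of numbers side by side, a minimal sketch in plain Python computes the slowdown factors. The values are copied from the two logs above; the variable and function names are illustrative only, and the first block is assumed to be the 1-OMP run, following the order in the message.

```python
# Timings in seconds, copied from the two logs quoted above.
# Assumption: the first block is the 64 MPI/1 OMP run, the second 64 MPI/2 OMP.
run_1_omp = {
    "PetscTime": 2.14577e-07,
    "MPI_Barrier": 3.9196e-05,
    "MPI_Send": 5.45382e-06,
}
run_2_omp = {
    "PetscTime": 4.05312e-07,
    "MPI_Barrier": 0.348399,
    "MPI_Send": 0.029937,
}


def slowdowns(fast, slow):
    """Return slow/fast for every timing present in both runs."""
    return {name: slow[name] / fast[name] for name in fast if name in slow}


for name, factor in slowdowns(run_1_omp, run_2_omp).items():
    print(f"{name:12s} {factor:10.1f}x")
```

The barrier and the zero-size send come out slower by factors in the thousands, while PetscTime() is slower by less than a factor of two.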

Isn't something wrong with the PETSc library itself?

I don't think so. This is a bad interaction between MPI and your threading
mechanism. MPI_Barrier() and MPI_Send() are lower level than PETSc. What
threading mode did you choose for MPI? This can have a performance impact.

Also, the justifications for threading in this context are weak (or non-existent):
http://www.orau.gov/hpcor2015/whitepapers/Exascale_Computing_without_Threads-Barry_Smith.pdf

  Thanks,

    Matt


Best,
Damian

Forwarded message
From: Damian Kaliszan <[email protected]>
To: PETSc users list <[email protected]>
Date: 16 June 2017, 14:57:10
Subject: [petsc-users] strange PETSc/KSP GMRES timings for MPI+OMP configuration on KNLs

===8<===============Original message content===============
Hi,

For several days I've been trying to figure out what is going wrong
with the timings of my Python app, which solves Ax=b with the KSP (GMRES)
solver, when running on Intel's KNL 7210/7230.

I downsized the problem to a 1000x1000 matrix A and a single node and
observed the following:


I'm attaching 2 extreme timings where the configurations differ only by 1 OMP
thread (64 MPI/1 OMP vs 64 MPI/2 OMP),
SLURM job IDs 23321 vs 23325.

Any help will be appreciated.

Best,
Damian

===8<===========End of original message content===========



-------------------------------------------------------
Damian Kaliszan

Poznan Supercomputing and Networking Center
HPC and Data Centres Technologies
ul. Jana Pawła II 10
61-139 Poznan
POLAND

phone (+48 61) 858 5109
e-mail [email protected]
www - http://www.man.poznan.pl/
-------------------------------------------------------



--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

http://www.caam.rice.edu/~mk51/




Attachment: slurm-23321.out
Description: Binary data

Attachment: slurm-23325.out
Description: Binary data
