Hi,

Please find attached two output files from the 64 MPI/1 OMP vs. 64 MPI/2 OMP examples (SLURM task IDs 23321 vs. 23325).
Best,
Damian

In the message dated 19 June 2017 (15:39:53), the following was written:

> On Mon, Jun 19, 2017 at 7:32 AM, Damian Kaliszan <[email protected]> wrote:
>
> > Hi,
> >
> > Thank you for the answer and the article.
> > I use SLURM (srun) for job submission by running the 'srun script.py script_parameters' command inside a batch script, so this is the SPMD model.
> > What I noticed is that the problems I'm having now didn't happen before on CPU E5-2697 v3 nodes (28 cores - the best performance I had was using 14 MPIs/2 OMP per node). Problems started to appear when I moved to KNLs.
> > The funny thing is that switching OMP on/off (by setting OMP_NUM_THREADS to 1) doesn't help for all #NODES/#MPI/#OMP combinations. For example, for 2 nodes and 16 MPIs, the timings are huge for OMP=1 and 2, and OK for 4.
>
> Let's narrow this down to MPI_Barrier(). What memory mode is KNL in? Did you require KNL to use only MCDRAM? Please show the MPI_Barrier()/MPI_Send() numbers for the different configurations. This measures just latency. We could also look at VecScale() to look at memory bandwidth achieved.
>
> Thanks,
>
> Matt
>
> > Playing with affinity didn't help so far. In other words, at first glance the results look completely random (I can provide more such examples).
> >
> > Best,
> > Damian
> >
> > In the message dated 19 June 2017 (14:50:25), the following was written:
> >
> > > On Mon, Jun 19, 2017 at 6:42 AM, Damian Kaliszan <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Regarding my previous post, I looked into both logs of 64 MPI/1 OMP vs. 64 MPI/2 OMP.
> > > > What attracted my attention is the huge difference in MPI timings in the following places:
> > > >
> > > > Average time to get PetscTime(): 2.14577e-07
> > > > Average time for MPI_Barrier(): 3.9196e-05
> > > > Average time for zero size MPI_Send(): 5.45382e-06
> > > >
> > > > vs.
> > > >
> > > > Average time to get PetscTime(): 4.05312e-07
> > > > Average time for MPI_Barrier(): 0.348399
> > > > Average time for zero size MPI_Send(): 0.029937
> > > >
> > > > Isn't something wrong with the PETSc library itself?...
> > >
> > > I don't think so. This is a bad interaction of MPI and your threading mechanism. MPI_Barrier() and MPI_Send() are lower level than PETSc. What threading mode did you choose for MPI? This can have a performance impact. Also, the justifications for threading in this context are weak (or non-existent):
> > >
> > > http://www.orau.gov/hpcor2015/whitepapers/Exascale_Computing_without_Threads-Barry_Smith.pdf
> > >
> > > Thanks,
> > >
> > > Matt
> > >
> > > > Best,
> > > > Damian
> > > >
> > > > Forwarded message
> > > > From: Damian Kaliszan <[email protected]>
> > > > To: PETSc users list <[email protected]>
> > > > Date: 16 June 2017, 14:57:10
> > > > Subject: [petsc-users] strange PETSc/KSP GMRES timings for MPI+OMP configuration on KNLs
> > > >
> > > > ===8<===============Original message content===============
> > > > Hi,
> > > >
> > > > For several days I've been trying to figure out what is going wrong with my Python app timings when solving Ax=b with the KSP (GMRES) solver on Intel's KNL 7210/7230.
> > > > I downsized the problem to a 1000x1000 A matrix and a single node and observed the following:
> > > >
> > > > I'm attaching 2 extreme timings where the configurations differ only by 1 OMP thread (64 MPI/1 OMP vs. 64 MPI/2 OMP), 23321 vs. 23325 SLURM task IDs.
> > > > Any help will be appreciated....
> > > >
> > > > Best,
> > > > Damian
> > > > ===8<===========End of original message content===========
> > > >
> > > > -------------------------------------------------------
> > > > Damian Kaliszan
> > > > Poznan Supercomputing and Networking Center
> > > > HPC and Data Centres Technologies
> > > > ul. Jana Pawła II 10
> > > > 61-139 Poznan
> > > > POLAND
> > > > phone (+48 61) 858 5109
> > > > e-mail [email protected]
> > > > www - http://www.man.poznan.pl/
> > > > -------------------------------------------------------
> > >
> > > --
> > > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> > > -- Norbert Wiener
> > >
> > > http://www.caam.rice.edu/~mk51/
slurm-23321.out
Description: Binary data
slurm-23325.out
Description: Binary data
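
To put the timings quoted in the thread side by side, here is a minimal Python sketch that computes how much slower each measurement is in the second set than in the first. It assumes the first set of numbers belongs to the 64 MPI/1 OMP run and the second to the 64 MPI/2 OMP run; the thread lists them in that order but does not label them. The names `first_run`, `second_run` and `slowdown_factors` are hypothetical and only serve this illustration.

```python
# Timings in seconds, copied from the two PETSc log excerpts quoted in the thread.
# Assumption: the first set is the 64 MPI/1 OMP run, the second the 64 MPI/2 OMP run.
first_run = {
    "PetscTime()": 2.14577e-07,
    "MPI_Barrier()": 3.9196e-05,
    "zero size MPI_Send()": 5.45382e-06,
}
second_run = {
    "PetscTime()": 4.05312e-07,
    "MPI_Barrier()": 0.348399,
    "zero size MPI_Send()": 0.029937,
}


def slowdown_factors(baseline, other):
    """Return other/baseline for every timing present in both dictionaries."""
    return {name: other[name] / baseline[name] for name in baseline if name in other}


factors = slowdown_factors(first_run, second_run)
for name, factor in factors.items():
    print(f"{name:>22}: {factor:10.1f}x slower")
```

With these numbers, PetscTime() changes by roughly a factor of two, while MPI_Barrier() and the zero-size MPI_Send() are several thousand times slower. That fits the reply in the thread that the problem lies below PETSc, in the interaction between MPI and the threading setup.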
