Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

Justin Lemkul Mon, 29 Jul 2019 06:15:55 -0700


On 7/29/19 8:46 AM, Carlos Navarro wrote:

Hi Mark,
I tried that before, but unfortunately in that case (removing —gres=gpu:1
and including in each line the -gpu_id flag) for some reason the jobs are
run one at a time (one after the other), so I can’t use properly the whole
node.


You need to run all but the last mdrun process in the background (&).

-Justin

——————
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl

On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abra...@gmail.com)
wrote:

Hi,

When you use

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "

then the environment seems to make sure only one GPU is visible. (The log
files report only finding one GPU.) But it's probably the same GPU in each
case, with three remaining idle. I would suggest not using --gres unless
you can specify *which* of the four available GPUs each run can use.

Otherwise, don't use --gres and use the facilities built into GROMACS, e.g.

$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
-ntomp 20 -gpu_id 0
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
-ntomp 20 -gpu_id 1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20
-ntomp 20 -gpu_id 2
etc.

Mark

On Mon, 29 Jul 2019 at 11:34, Carlos Navarro <carlos.navarr...@gmail.com>
wrote:

Hi Szilárd,
To answer your questions:
**are you trying to run multiple simulations concurrently on the same
node or are you trying to strong-scale?
I'm trying to run multiple simulations on the same node at the same time.

** what are you simulating?
Regular and CompEl simulations

** can you provide log files of the runs?
In the following link are some logs files:
https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
In short, alone.log -> single run in the node (using 1 gpu).
multi1/2/3/4.log ->4 independent simulations ran at the same time in a
single node. In all cases, 20 cpus are used.
Best regards,
Carlos

El jue., 25 jul. 2019 a las 10:59, Szilárd Páll (<pall.szil...@gmail.com>)
escribió:

Hi,

It is not clear to me how are you trying to set up your runs, so
please provide some details:
- are you trying to run multiple simulations concurrently on the same
node or are you trying to strong-scale?
- what are you simulating?
- can you provide log files of the runs?

Cheers,

--
Szilárd

On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
<carlos.navarr...@gmail.com> wrote:

No one can give me an idea of what can be happening? Or how I can

solve

it?

Best regards,
Carlos

——————
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl

On July 19, 2019 at 2:20:41 PM, Carlos Navarro (

carlos.navarr...@gmail.com)

wrote:

Dear gmx-users,
I’m currently working in a server where each node posses 40 physical

cores

(40 threads) and 4 Nvidia-V100.
When I launch a single job (1 simulation using a single gpu card) I

get a

performance of about ~35ns/day in a system of about 300k atoms.

Looking

into the usage of the video card during the simulation I notice that

the

card is being used about and ~80%.
The problems arise when I increase the number of jobs running at the

same

time. If for instance 2 jobs are running at the same time, the

performance

drops to ~25ns/day each and the usage of the video cards also drops

during

the simulation to about a ~30-40% (and sometimes dropping to less than

5%).

Clearly there is a communication problem between the gpu cards and the

cpu

during the simulations, but I don’t know how to solve this.
Here is the script I use to run the simulations:

#!/bin/bash -x
#SBATCH --job-name=testAtTPC1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --account=hdd22
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --output=sout.%j
#SBATCH --error=s4err.%j
#SBATCH --time=00:10:00
#SBATCH --partition=develgpus
#SBATCH --gres=gpu:4

module use /gpfs/software/juwels/otherstages
module load Stages/2018b
module load Intel/2019.0.117-GCC-7.3.0
module load IntelMPI/2019.0.117
module load GROMACS/2018.3

WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
EXE=" gmx mdrun "

cd $WORKDIR1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset

-ntomp 20 &>log &
cd $WORKDIR2
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset

-ntomp 20 &>log &
cd $WORKDIR3
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset

-ntomp 20 &>log &
cd $WORKDIR4
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset

-ntomp 20 &>log &


Regarding to pinoffset, I first tried using 20 cores for each job but

then

also tried with 8 cores (so pinoffset 0 for job 1, pinoffset 4 for job

2,

pinoffset 8 for job 3 and pinoffset 12 for job) but at the end the

problem

persist.

Currently in this machine I’m not able to use more than 1 gpu per job,

so

this is my only choice to use properly the whole node.
If you need more information please just let me know.
Best regards.
Carlos

——————
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl
--
Gromacs Users mailing list

* Please search the archive at

http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or

send a mail to gmx-users-requ...@gromacs.org.
--
Gromacs Users mailing list

* Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
send a mail to gmx-users-requ...@gromacs.org.



--

----------

Carlos Navarro Retamal

Bioinformatic Engineering. PhD

Postdoctoral Researcher in Center for Bioinformatics and Molecular
Simulations

Universidad de Talca

Av. Lircay S/N, Talca, Chile

T: (+56) 712201 <//T:%20(+56)%20712201> 798

E: carlos.navarr...@gmail.com or cnava...@utalca.cl
--
Gromacs Users mailing list

* Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
send a mail to gmx-users-requ...@gromacs.org.


--
==================================================

Justin A. Lemkul, Ph.D.
Assistant Professor
Office: 301 Fralin Hall
Lab: 303 Engel Hall

Virginia Tech Department of Biochemistry
340 West Campus Dr.
Blacksburg, VA 24061

jalem...@vt.edu | (540) 231-3129
http://www.thelemkullab.com

==================================================

--
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

Reply via email to