Dear gmx-users,

I'm currently working on a server where each node has 40 physical cores (40 threads) and 4 NVIDIA V100 GPUs. When I launch a single job (1 simulation using a single GPU), I get a performance of about 35 ns/day for a system of about 300k atoms. Looking at the GPU usage during the simulation, I see that the card is at about 80% utilization.

The problems arise when I increase the number of jobs running at the same time. If, for instance, 2 jobs are running at the same time, the performance drops to about 25 ns/day each, and the GPU utilization also drops during the simulation to about 30-40% (sometimes to less than 5%).

This looks like a communication problem between the GPUs and the CPU during the simulations, but I don't know how to solve it. Here is the script I use to run the simulations:

#!/bin/bash -x
#SBATCH --job-name=testAtTPC1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --account=hdd22
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --output=sout.%j
#SBATCH --error=s4err.%j
#SBATCH --time=00:10:00
#SBATCH --partition=develgpus
#SBATCH --gres=gpu:4

module use /gpfs/software/juwels/otherstages
module load Stages/2018b
module load Intel/2019.0.117-GCC-7.3.0
module load IntelMPI/2019.0.117
module load GROMACS/2018.3

WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
EXE=" gmx mdrun "

cd $WORKDIR1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 20 &>log &
cd $WORKDIR2
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 20 &>log &
cd $WORKDIR3
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 20 &>log &
cd $WORKDIR4
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30 -ntomp 20 &>log &

Regarding pinoffset, I first tried using 20 cores for each job, but then also tried with 8 cores (so pinoffset 0 for job 1, pinoffset 4 for job 2, pinoffset 8 for job 3 and pinoffset 12 for job 4), but in the end the problem persists.

Currently on this machine I'm not able to use more than 1 GPU per job, so this is my only option for making proper use of the whole node.

If you need more information, please just let me know.

Best regards,
Carlos

------
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular Simulations
Universidad de Talca, Av.
Lircay S/N, Talca, Chile
E: [email protected] or [email protected]
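
As a side note on the pinning flags used in the script above, here is a minimal sketch (not part of the original runs) that tabulates which hardware threads each job's -pinoffset/-ntomp pair would cover and counts how many job pairs share threads. It assumes a pin stride of 1, so that thread i of a job lands on hardware thread offset + i. The actual placement depends on the stride mdrun chooses and on any CPU restrictions imposed by the scheduler, so treat the output as an approximation. The helper names pin_range and count_overlaps are hypothetical and are not part of GROMACS or Slurm.

```bash
#!/bin/bash
# Sketch only: ASSUMES pin stride 1 (thread i of a job -> hardware thread offset + i).
# pin_range and count_overlaps are hypothetical helpers, not GROMACS or Slurm commands.

# pin_range OFFSET NTOMP: print the first-last hardware threads covered by one job
pin_range() {
    local offset=$1
    local ntomp=$2
    echo "$offset-$((offset + ntomp - 1))"
}

# count_overlaps NTOMP OFFSET...: print the number of job pairs whose ranges intersect
count_overlaps() {
    local ntomp=$1
    shift
    local offsets=("$@")
    local n=0
    local i j d
    for ((i = 0; i < ${#offsets[@]}; i++)); do
        for ((j = i + 1; j < ${#offsets[@]}; j++)); do
            # Two equal-length ranges intersect when their offsets differ by less than ntomp
            d=$((offsets[j] - offsets[i]))
            if ((d < 0)); then
                d=$((-d))
            fi
            if ((d < ntomp)); then
                n=$((n + 1))
            fi
        done
    done
    echo "$n"
}

# Values taken from the four mdrun lines in the script above
for off in 0 10 20 30; do
    echo "pinoffset $off, ntomp 20 -> hardware threads $(pin_range "$off" 20)"
done
echo "overlapping job pairs: $(count_overlaps 20 0 10 20 30)"
```

The same check can be repeated for the 8-thread attempt with `count_overlaps 8 0 4 8 12`. Under the stride-1 assumption, any non-zero count means at least two jobs would be pinned to some of the same hardware threads.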
