We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an IBM POWER9/V100 node than on an Intel Broadwell/P100 node. Both run RHEL 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel nodes; everything below is about the POWER9/V100 node.
We ran the RNASE benchmark (ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) with GROMACS 2019.6, using PME and the cubic box, on 1 CPU-core and 1 GPU. The Broadwell/P100 gives 144 ns/day while the POWER9/V100 gives 102 ns/day. The gap is roughly the same for the larger ADH benchmark and for other CPU-core counts: GROMACS consistently underperforms on our POWER9/V100 nodes. We have pinning turned on (see the Slurm script at the bottom).

Below is our build procedure on the POWER9/V100 node:

version_gmx=2019.6
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
tar zxvf gromacs-${version_gmx}.tar.gz
cd gromacs-${version_gmx}
mkdir build && cd build

module purge
module load rh/devtoolset/7
module load cudatoolkit/10.2

OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"

cmake3 .. -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
  -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
  -DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
  -DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
  -DGMX_BUILD_OWN_FFTW=ON \
  -DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
  -DGMX_OPENMP_MAX_THREADS=128 \
  -DCMAKE_INSTALL_PREFIX=$HOME/.local \
  -DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON

make -j 10
make check
make install

45 of the 46 tests pass; the only failure is HardwareUnitTests. There are several posts about this failure and it apparently is not a concern. The full build log is here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log

Here is more info about our POWER9/V100 node:

$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        6
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000

As you can see, we have 4 hardware threads per physical core. If we use 4 hardware threads on the RNASE benchmark instead of 1, performance rises to 119 ns/day, which is still about 20% below the Broadwell/P100 value. When using multiple CPU-cores on the POWER9/V100 there is significant run-to-run variation in execution time.

There are four GPUs per POWER9/V100 node:

$ nvidia-smi -q
Driver Version   : 440.33.01
CUDA Version     : 10.2
GPU 00000004:04:00.0
    Product Name : Tesla V100-SXM2-32GB

The GPUs have been shown to perform as expected in other applications.

The following lines are found in md.log for the POWER9/V100 run:

Overriding thread affinity set outside gmx mdrun
Pinning threads with an auto-selected logical core stride of 128
NOTE: Thread affinity was not set.

The full md.log is available here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
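Those three md.log lines make us suspect that the threads may not actually be getting pinned: Slurm sets an affinity mask, mdrun reports that it is overriding it, yet the auto-selected logical core stride of 128 looks wrong for a node with 32 physical cores and 128 hardware threads, and mdrun then notes that no affinity was set at all. One thing we plan to try is taking the placement decision away from the auto-detection. The command below is only a sketch based on our assumption that logical CPUs 0-3 are the four hardware threads of the first physical core; the offset/stride values are guesses, not settings we have validated:

# sketch: run 4 OpenMP threads on the 4 hardware threads of one core
# -pinoffset 0   start at logical CPU 0
# -pinstride 1   place threads on consecutive hardware threads;
#                -pinstride 4 would instead use one thread per physical core
gmx mdrun -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 -pinstride 1 -s bench.tpr

We also intend to check which logical CPUs are local to the GPU we are allocated (e.g. with nvidia-smi topo -m), since lscpu reports 6 NUMA nodes.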
Below are the MegaFlops Accounting tables for the POWER9/V100 versus the Broadwell/P100:

================ IBM POWER9 WITH NVIDIA V100 ================

 Computing:                          M-Number         M-Flops   % Flops
-----------------------------------------------------------------------------
 Pair Search distance check        297.763872        2679.875       0.0
 NxN Ewald Elec. + LJ [F]       244214.215808    16118138.243      98.0
 NxN Ewald Elec. + LJ [V&F]       2483.565760      265741.536       1.6
 1,4 nonbonded interactions         53.415341        4807.381       0.0
 Shift-X                             3.029040          18.174       0.0
 Angles                             37.043704        6223.342       0.0
 Propers                            55.825582       12784.058       0.1
 Impropers                           4.220422         877.848       0.0
 Virial                              2.432585          43.787       0.0
 Stop-CM                             2.452080          24.521       0.0
 Calc-Ekin                          48.128080        1299.458       0.0
 Lincs                              20.536159        1232.170       0.0
 Lincs-Mat                         444.613344        1778.453       0.0
 Constraint-V                      261.192228        2089.538       0.0
 Constraint-Vir                      2.430161          58.324       0.0
 Settle                             73.382008       23702.389       0.1
-----------------------------------------------------------------------------
 Total                                           16441499.096     100.0
-----------------------------------------------------------------------------

================ INTEL BROADWELL WITH NVIDIA P100 ================

 Computing:                          M-Number         M-Flops   % Flops
-----------------------------------------------------------------------------
 Pair Search distance check        271.334272        2442.008       0.0
 NxN Ewald Elec. + LJ [F]       191599.850112    12645590.107      98.0
 NxN Ewald Elec. + LJ [V&F]       1946.866432      208314.708       1.6
 1,4 nonbonded interactions         53.415341        4807.381       0.0
 Shift-X                             3.029040          18.174       0.0
 Bonds                              10.541054         621.922       0.0
 Angles                             37.043704        6223.342       0.0
 Propers                            55.825582       12784.058       0.1
 Impropers                           4.220422         877.848       0.0
 Virial                              2.432585          43.787       0.0
 Stop-CM                             2.452080          24.521       0.0
 Calc-Ekin                          48.128080        1299.458       0.0
 Lincs                               9.992997         599.580       0.0
 Lincs-Mat                          50.775228         203.101       0.0
 Constraint-V                      240.108012        1920.864       0.0
 Constraint-Vir                      2.323707          55.769       0.0
 Settle                             73.382008       23702.389       0.2
-----------------------------------------------------------------------------
 Total                                           12909529.017     100.0
-----------------------------------------------------------------------------

Some of the rows are identical between the two tables; the largest difference is in the "NxN Ewald Elec. + LJ [F]" row.

Here is our Slurm script:

#!/bin/bash
#SBATCH --job-name=gmx          # create a short name for your job
#SBATCH --nodes=1               # node count
#SBATCH --ntasks=1              # total number of tasks across all nodes
#SBATCH --cpus-per-task=1       # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                # memory per node (4G per cpu-core is default)
#SBATCH --time=00:10:00         # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1            # number of gpus per node

module purge
module load cudatoolkit/10.2

BCH=../rnase_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr

How do we get optimal performance out of GROMACS on our POWER9/V100 nodes?

Jon
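P.S. In case it is useful context for suggestions, the direction we are currently leaning with the Slurm script is to give the single task several cores and keep -pin on so that mdrun, not Slurm, sets the thread affinity. The script below is only a sketch of that idea; the --cpus-per-task value, the memory request, and the explicit -nb gpu are our assumptions and have not been validated on this node:

#!/bin/bash
#SBATCH --job-name=gmx
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16      # assumption: 4 physical cores x 4 hardware threads
#SBATCH --mem=8G
#SBATCH --time=00:10:00
#SBATCH --gres=gpu:1

module purge
module load cudatoolkit/10.2

# match the OpenMP thread count to the Slurm allocation
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

BCH=../rnase_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
# keep -pin on so mdrun sets thread affinity itself;
# -nb gpu keeps the short-range non-bonded work on the V100
gmx mdrun -pin on -ntmpi 1 -ntomp $SLURM_CPUS_PER_TASK -nb gpu -s bench.tpr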