Hi Kevin,
md.log for the Intel run is here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100
Thanks for the info on constraints with 2020. I'll try some runs with
different values of -pinoffset for 2019.6.
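Something along these lines, I think (the offsets are arbitrary picks on our
128-hardware-thread node, not values I expect to be optimal):

for offset in 0 4 8 16 32 64; do
    # -deffnm keeps the output files of each trial run separate
    gmx mdrun -pin on -pinoffset $offset -ntmpi 1 -ntomp 1 \
              -s bench.tpr -deffnm pinoffset_$offset
done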
I know a group at NIST is having the same or similar problems with
POWER9/V100.
Jon
________________________________
From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se <
gromacs.org_gmx-users-boun...@maillist.sys.kth.se> on behalf of Kevin
Boyd <kevin.b...@uconn.edu>
Sent: Thursday, April 23, 2020 9:08 PM
To: gmx-us...@gromacs.org <gmx-us...@gromacs.org>
Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
Hi,
Can you post the full log for the Intel system? I typically find the real
cycle and time accounting section a better place to start debugging
performance issues.
A couple of quick notes. I'd need a side-by-side comparison for a more useful
analysis, and these points may apply to both systems, so they may not be your
root cause:
* At first glance, your Power system spends 1/3 of its time in constraint
calculation, which is unusual. This can be reduced in two ways: first, by
adding more CPU cores. It doesn't make a ton of sense to benchmark on one
core if your applications will use more. Second, if you upgrade to Gromacs
2020 you can probably put the constraint calculation on the GPU with
-update gpu (see the sketch after these notes).
* The Power system log has this line:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
indicating that the threads were perhaps not actually pinned. Try adding
-pinoffset 0 (or some other core) to specify where you want the process
pinned.
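To make that concrete, here is a minimal sketch of the kind of command line I
mean (Gromacs 2020 syntax; the thread count is a placeholder, not a
recommendation):

gmx mdrun -s bench.tpr -pin on -pinoffset 0 -ntmpi 1 -ntomp 8 \
          -nb gpu -pme gpu -update gpu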
Kevin
On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
halver...@princeton.edu> wrote:
We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on an
IBM POWER9/V100 node than on an Intel Broadwell/P100 node. Both run RHEL 7.7
and Slurm 19.05.5. We have no concerns about GROMACS on our Intel nodes;
everything below is about the POWER9/V100 node.

We ran the RNASE benchmark with 2019.6 (PME, cubic box) using 1 CPU-core and
1 GPU (ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
found that the Broadwell/P100 gives 144 ns/day while the POWER9/V100 gives
102 ns/day. The difference in performance is roughly the same for the larger
ADH benchmark and when different numbers of CPU-cores are used. GROMACS
always underperforms on our POWER9/V100 nodes. We have pinning turned on
(see the Slurm script at the bottom).
Below is our build procedure on the POWER9/V100 node:
version_gmx=2019.6
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
tar zxvf gromacs-${version_gmx}.tar.gz
cd gromacs-${version_gmx}
mkdir build && cd build
module purge
module load rh/devtoolset/7
module load cudatoolkit/10.2
OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
cmake3 .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
-DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
-DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
-DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
-DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
-DGMX_OPENMP_MAX_THREADS=128 \
-DCMAKE_INSTALL_PREFIX=$HOME/.local \
-DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
make -j 10
make check
make install
45 of the 46 tests pass, with the only failure being HardwareUnitTests. There
are several posts about this test, and apparently the failure is not a
concern. The full build log is here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
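In case the build configuration matters, one way to confirm what the installed
binary was actually built with is the standard version report, e.g.:

# print the build configuration and keep the SIMD/GPU-related lines
$HOME/.local/bin/gmx --version | grep -iE "simd|gpu|cuda"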
Here is more info about our POWER9/V100 node:
$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    4
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          6
Model:                 2.3 (pvr 004e 1203)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
You can see that we have 4 hardware threads per physical core. If we use 4
hardware threads on the RNASE benchmark instead of 1, the performance goes up
to 119 ns/day, which is still about 20% less than the Broadwell/P100 value.
When using multiple CPU-cores on the POWER9/V100 there is also significant
variation in the execution time of the code.
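One option for the multi-core runs would be to pin one OpenMP thread per
physical core, e.g. (the stride of 4 is an assumption based on the SMT-4
layout above, and 16 threads is just an example, not a tuned value):

gmx mdrun -s bench.tpr -pin on -pinoffset 0 -pinstride 4 -ntmpi 1 -ntomp 16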
There are four GPUs per POWER9/V100 node:
$ nvidia-smi -q
Driver Version : 440.33.01
CUDA Version : 10.2
GPU 00000004:04:00.0
Product Name : Tesla V100-SXM2-32GB
The GPUs have been shown to perform as expected on other applications.
The following lines are found in md.log for the POWER9/V100 run:
Overriding thread affinity set outside gmx mdrun
Pinning threads with an auto-selected logical core stride of 128
NOTE: Thread affinity was not set.
The full md.log is available here:
https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
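For what it's worth, one way to see where the mdrun threads actually end up is
standard Linux tooling rather than anything GROMACS-specific, e.g.:

# list the logical CPU (psr) that each mdrun thread is currently running on
ps -T -o spid,psr,comm -p $(pgrep -d, -f "gmx mdrun")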
Below are the MegaFlops Accounting tables for the POWER9/V100 versus the
Broadwell/P100:

================ IBM POWER9 WITH NVIDIA V100 ================
 Computing:                            M-Number        M-Flops   % Flops
-----------------------------------------------------------------------------
 Pair Search distance check          297.763872       2679.875       0.0
 NxN Ewald Elec. + LJ [F]         244214.215808   16118138.243      98.0
 NxN Ewald Elec. + LJ [V&F]         2483.565760     265741.536       1.6
 1,4 nonbonded interactions           53.415341       4807.381       0.0
 Shift-X                               3.029040         18.174       0.0
 Angles                               37.043704       6223.342       0.0
 Propers                              55.825582      12784.058       0.1
 Impropers                             4.220422        877.848       0.0
 Virial                                2.432585         43.787       0.0
 Stop-CM                               2.452080         24.521       0.0
 Calc-Ekin                            48.128080       1299.458       0.0
 Lincs                                20.536159       1232.170       0.0
 Lincs-Mat                           444.613344       1778.453       0.0
 Constraint-V                        261.192228       2089.538       0.0
 Constraint-Vir                        2.430161         58.324       0.0
 Settle                               73.382008      23702.389       0.1
-----------------------------------------------------------------------------
 Total                                            16441499.096     100.0
-----------------------------------------------------------------------------
================ INTEL BROADWELL WITH NVIDIA P100 ================
 Computing:                            M-Number        M-Flops   % Flops
-----------------------------------------------------------------------------
 Pair Search distance check          271.334272       2442.008       0.0
 NxN Ewald Elec. + LJ [F]         191599.850112   12645590.107      98.0
 NxN Ewald Elec. + LJ [V&F]         1946.866432     208314.708       1.6
 1,4 nonbonded interactions           53.415341       4807.381       0.0
 Shift-X                               3.029040         18.174       0.0
 Bonds                                10.541054        621.922       0.0
 Angles                               37.043704       6223.342       0.0
 Propers                              55.825582      12784.058       0.1
 Impropers                             4.220422        877.848       0.0
 Virial                                2.432585         43.787       0.0
 Stop-CM                               2.452080         24.521       0.0
 Calc-Ekin                            48.128080       1299.458       0.0
 Lincs                                 9.992997        599.580       0.0
 Lincs-Mat                            50.775228        203.101       0.0
 Constraint-V                        240.108012       1920.864       0.0
 Constraint-Vir                        2.323707         55.769       0.0
 Settle                               73.382008      23702.389       0.2
-----------------------------------------------------------------------------
 Total                                            12909529.017     100.0
-----------------------------------------------------------------------------
Some of the rows are identical between the two tables above. The largest
difference is observed for the "NxN Ewald Elec. + LJ [F]" row.
Here is our Slurm script:

#!/bin/bash
#SBATCH --job-name=gmx          # create a short name for your job
#SBATCH --nodes=1               # node count
#SBATCH --ntasks=1              # total number of tasks across all nodes
#SBATCH --cpus-per-task=1       # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                # memory per node (4G per cpu-core is default)
#SBATCH --time=00:10:00         # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1            # number of gpus per node

module purge
module load cudatoolkit/10.2

BCH=../rnase_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
How do we get optimal performance out of GROMACS on our POWER9/V100 nodes?
Jon
--
Gromacs Users mailing list
* Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
posting!
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
send a mail to gmx-users-requ...@gromacs.org.