Hi, I use gromacs-2019.4.
Sent from my iPhone

> On 25 Apr 2020, at 6:54 am, gromacs.org_gmx-users-requ...@maillist.sys.kth.se wrote:
>
> Send gromacs.org_gmx-users mailing list submissions to
>     gromacs.org_gmx-users@maillist.sys.kth.se
>
> To subscribe or unsubscribe via the World Wide Web, visit
>     https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> or, via email, send a message with subject or body 'help' to
>     gromacs.org_gmx-users-requ...@maillist.sys.kth.se
>
> You can reach the person managing the list at
>     gromacs.org_gmx-users-ow...@maillist.sys.kth.se
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of gromacs.org_gmx-users digest..."
>
> Today's Topics:
>
>    1. Re: GROMACS performance issues on POWER9/V100 node (Szilárd Páll)
>    2. Re: GROMACS performance issues on POWER9/V100 node (Szilárd Páll)
>    3. Re: GROMACS performance issues on POWER9/V100 node (Szilárd Páll)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 24 Apr 2020 22:31:11 +0200
> From: Szilárd Páll <pall.szil...@gmail.com>
> To: Discussion list for GROMACS users <gmx-us...@gromacs.org>
> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
> Message-ID: <CANnYEw410kwAD9ivgCayUC_nU4i6eJ+KtK-o0ztc8W+voL=x...@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
>> On Fri, Apr 24, 2020 at 5:55 AM Alex <nedoma...@gmail.com> wrote:
>>
>> Hi Kevin,
>>
>> We've been having issues with POWER9/V100 very similar to what Jon
>> described and basically settled on what I believe is sub-par
>> performance. We tested it on systems with ~30-50K particles, and threads
>> simply cannot be pinned.
>
> What does that mean, how did you verify that?
> The Linux kernel can in general set affinities on ppc64el, whether that's
> requested by mdrun or some other tool, so if you have observed that the
> affinity mask is not respected (or does not change), that's more likely an
> OS / setup issue, I'd think.
>
> What is different compared to x86 is that the hardware thread layout is
> different on POWER9 (with default Linux kernel configs): hardware threads
> are exposed as consecutive "CPUs" by the OS rather than strided by #cores.
>
> I could try to sum up some details on how to set affinities (with mdrun or
> external tools), if that is of interest. However, it really should be
> something that's possible to do even through the job scheduler (along with
> a reasonable system configuration).
>
>> As far as GROMACS is concerned, our brand-new
>> POWER9 nodes operate as if they were based on Intel CPUs (two threads
>> per core)
>
> Unless the hardware thread layout has been changed, that's perhaps not the
> case; see above.
>
>> and zero advantage of IBM parallelization is being taken.
>
> You mean the SMT4?
>
>> Other users of the same nodes have reported similar issues with other
>> software, which suggests to me that our sysadmins don't really know how
>> to set these nodes up.
>>
>> At this point, if someone could put together a clear set of build
>> instructions in combination with Slurm/mdrun inputs, it would be very
>> much appreciated.
>
> Have you checked the public documentation on ORNL's sites? GROMACS has
> been used successfully on Summit. What about IBM support?
>
> --
> Szilárd
>
>> Alex
>>
>>> On 4/23/2020 9:37 PM, Kevin Boyd wrote:
>>> I'm not entirely sure how thread pinning plays with Slurm allocations on
>>> partial nodes. I always reserve the entire node when I use thread pinning,
>>> and run a bunch of simulations by pinning to different cores manually,
>>> rather than relying on Slurm to divvy up resources for multiple jobs.
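[To make Kevin's manual-pinning approach concrete for a POWER9 node: since, per Szilárd's point, SMT4 hardware threads are numbered consecutively (logical CPUs 0-3 on core 0, 4-7 on core 1, ...), something like the sketch below might work. The mdrun flags (-pin, -pinoffset, -pinstride, -ntomp, -ntmpi, -gpu_id, -deffnm) are standard; the four-runs-per-node layout and the run$i input names are assumptions for illustration, not a tested recipe.]

```shell
# Hypothetical layout: 4 independent runs on one 2x16-core POWER9 node.
# SMT4 numbering: logical CPUs 0-3 sit on core 0, 4-7 on core 1, ...
# -pinstride 4 places one thread per physical core; -pinoffset picks the
# first logical CPU of each run's block of cores.
NTOMP=4     # OpenMP threads per run
STRIDE=4    # SMT4 => stride 4 = one thread per physical core
for i in 0 1 2 3; do
  OFFSET=$((i * NTOMP * STRIDE))   # runs start at logical CPUs 0, 16, 32, 48
  # echoed as a dry run; remove "echo" (and background each run with &) to launch
  echo gmx mdrun -deffnm run$i -gpu_id $i -ntmpi 1 -ntomp $NTOMP \
       -pin on -pinstride $STRIDE -pinoffset $OFFSET
done
```

[With -pin on, mdrun then sets the affinity masks itself, so this should be visible in md.log instead of the "Thread affinity was not set" note.]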
>>>
>>> Looking at both logs now, a few more points:
>>>
>>> * Your benchmarks are short enough that little things like cores spinning
>>> up frequencies can matter. I suggest running longer (increase nsteps in the
>>> mdp or at the command line), and throwing away your initial benchmark data
>>> (see -resetstep and -resethway) to avoid artifacts.
>>> * Your benchmark system is quite small for such a powerful GPU. I might
>>> expect better performance running multiple simulations per GPU if your
>>> workflows can rely on replicates, and a larger system would probably scale
>>> better to the V100.
>>> * The P100/Intel system appears to have pinned cores properly; it's
>>> unclear whether that had a real impact on these benchmarks.
>>> * It looks like the CPU-based computations were the primary contributors
>>> to the observed difference in performance. That should decrease or go away
>>> with increased core counts and with shifting the update phase to the GPU.
>>> It may be (I have no prior experience to indicate either way) that the
>>> Intel cores are simply better on a one-to-one basis than the Power cores.
>>> If you have 4-8 cores per simulation (try -ntomp 4 and increase the
>>> allocation of your Slurm job), the individual core performance shouldn't
>>> matter too much; with only one CPU core per GPU you are almost certainly
>>> CPU-bound, which can emphasize per-core performance differences.
>>>
>>> Kevin
>>>
>>> On Thu, Apr 23, 2020 at 6:43 PM Jonathan D. Halverson <
>>> halver...@princeton.edu> wrote:
>>>
>>>> *Message sent from a system outside of UConn.*
>>>>
>>>> Hi Kevin,
>>>>
>>>> md.log for the Intel run is here:
>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100
>>>>
>>>> Thanks for the info on constraints with 2020. I'll try some runs with
>>>> different values of -pinoffset for 2019.6.
>>>>
>>>> I know a group at NIST is having the same or similar problems with
>>>> POWER9/V100.
>>>> Jon
>>>> ________________________________
>>>> From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se <
>>>> gromacs.org_gmx-users-boun...@maillist.sys.kth.se> on behalf of Kevin
>>>> Boyd <kevin.b...@uconn.edu>
>>>> Sent: Thursday, April 23, 2020 9:08 PM
>>>> To: gmx-us...@gromacs.org <gmx-us...@gromacs.org>
>>>> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
>>>>
>>>> Hi,
>>>>
>>>> Can you post the full log for the Intel system? I typically find the
>>>> "Real cycle and time accounting" section a better place to start
>>>> debugging performance issues.
>>>>
>>>> A couple of quick notes; I'd need a side-by-side comparison for more
>>>> useful analysis, and these points may apply to both systems, so they may
>>>> not be your root cause:
>>>> * At first glance, your Power system spends 1/3 of its time in constraint
>>>> calculation, which is unusual. This can be reduced in 2 ways: first, by
>>>> adding more CPU cores. It doesn't make a ton of sense to benchmark on one
>>>> core if your applications will use more. Second, if you upgrade to GROMACS
>>>> 2020 you can probably put the constraint calculation on the GPU with
>>>> -update gpu.
>>>> * The Power system log has this line:
>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304
>>>> indicating that threads perhaps were not actually pinned. Try adding
>>>> -pinoffset 0 (or some other core) to specify where you want the process
>>>> pinned.
>>>>
>>>> Kevin
>>>>
>>>> On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
>>>> halver...@princeton.edu> wrote:
>>>>
>>>>> *Message sent from a system outside of UConn.*
>>>>>
>>>>> We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on
>>>>> an IBM POWER9/V100 node than on an Intel Broadwell/P100 node. Both are
>>>>> running RHEL 7.7 and Slurm 19.05.5. We have no concerns about GROMACS on
>>>>> our Intel nodes. Everything below is about the POWER9/V100 node.
>>>>>
>>>>> We ran the RNASE benchmark with 2019.6, with PME and a cubic box, using
>>>>> 1 CPU-core and 1 GPU
>>>>> (ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
>>>>> found that the Broadwell/P100 gives 144 ns/day while the POWER9/V100
>>>>> gives 102 ns/day. The difference in performance is roughly the same for
>>>>> the larger ADH benchmark and when different numbers of CPU-cores are
>>>>> used. GROMACS is always underperforming on our POWER9/V100 nodes. We
>>>>> have pinning turned on (see the Slurm script at the bottom).
>>>>>
>>>>> Below is our build procedure on the POWER9/V100 node:
>>>>>
>>>>> version_gmx=2019.6
>>>>> wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
>>>>> tar zxvf gromacs-${version_gmx}.tar.gz
>>>>> cd gromacs-${version_gmx}
>>>>> mkdir build && cd build
>>>>>
>>>>> module purge
>>>>> module load rh/devtoolset/7
>>>>> module load cudatoolkit/10.2
>>>>>
>>>>> OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"
>>>>>
>>>>> cmake3 .. -DCMAKE_BUILD_TYPE=Release \
>>>>>   -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
>>>>>   -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
>>>>>   -DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
>>>>>   -DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
>>>>>   -DGMX_BUILD_OWN_FFTW=ON \
>>>>>   -DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
>>>>>   -DGMX_OPENMP_MAX_THREADS=128 \
>>>>>   -DCMAKE_INSTALL_PREFIX=$HOME/.local \
>>>>>   -DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON
>>>>>
>>>>> make -j 10
>>>>> make check
>>>>> make install
>>>>>
>>>>> 45 of the 46 tests pass, with the exception being HardwareUnitTests.
>>>>> There are several posts about this, and apparently it is not a concern.
>>>>> The full build log is here:
>>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
>>>>>
>>>>> Here is more info about our POWER9/V100 node:
>>>>>
>>>>> $ lscpu
>>>>> Architecture:        ppc64le
>>>>> Byte Order:          Little Endian
>>>>> CPU(s):              128
>>>>> On-line CPU(s) list: 0-127
>>>>> Thread(s) per core:  4
>>>>> Core(s) per socket:  16
>>>>> Socket(s):           2
>>>>> NUMA node(s):        6
>>>>> Model:               2.3 (pvr 004e 1203)
>>>>> Model name:          POWER9, altivec supported
>>>>> CPU max MHz:         3800.0000
>>>>> CPU min MHz:         2300.0000
>>>>>
>>>>> You see that we have 4 hardware threads per physical core. If we use 4
>>>>> hardware threads on the RNASE benchmark instead of 1, the performance
>>>>> goes to 119 ns/day, which is still about 20% less than the
>>>>> Broadwell/P100 value. When using multiple CPU-cores on the POWER9/V100
>>>>> there is significant variation in the execution time of the code.
>>>>>
>>>>> There are four GPUs per POWER9/V100 node:
>>>>>
>>>>> $ nvidia-smi -q
>>>>> Driver Version     : 440.33.01
>>>>> CUDA Version       : 10.2
>>>>> GPU 00000004:04:00.0
>>>>>     Product Name   : Tesla V100-SXM2-32GB
>>>>>
>>>>> The GPUs have been shown to perform as expected on other applications.
>>>>>
>>>>> The following lines are found in md.log for the POWER9/V100 run:
>>>>>
>>>>> Overriding thread affinity set outside gmx mdrun
>>>>> Pinning threads with an auto-selected logical core stride of 128
>>>>> NOTE: Thread affinity was not set.
>>>>>
>>>>> The full md.log is available here:
>>>>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
>>>>>
>>>>> Below are the MegaFlops Accounting tables for the POWER9/V100 versus
>>>>> the Broadwell/P100:
>>>>>
>>>>> ================ IBM POWER9 WITH NVIDIA V100 ================
>>>>> Computing:                        M-Number       M-Flops  % Flops
>>>>> -----------------------------------------------------------------------------
>>>>> Pair Search distance check      297.763872      2679.875      0.0
>>>>> NxN Ewald Elec. + LJ [F]     244214.215808  16118138.243     98.0
>>>>> NxN Ewald Elec. + LJ [V&F]     2483.565760    265741.536      1.6
>>>>> 1,4 nonbonded interactions       53.415341      4807.381      0.0
>>>>> Shift-X                           3.029040        18.174      0.0
>>>>> Angles                           37.043704      6223.342      0.0
>>>>> Propers                          55.825582     12784.058      0.1
>>>>> Impropers                         4.220422       877.848      0.0
>>>>> Virial                            2.432585        43.787      0.0
>>>>> Stop-CM                           2.452080        24.521      0.0
>>>>> Calc-Ekin                        48.128080      1299.458      0.0
>>>>> Lincs                            20.536159      1232.170      0.0
>>>>> Lincs-Mat                       444.613344      1778.453      0.0
>>>>> Constraint-V                    261.192228      2089.538      0.0
>>>>> Constraint-Vir                    2.430161        58.324      0.0
>>>>> Settle                           73.382008     23702.389      0.1
>>>>> -----------------------------------------------------------------------------
>>>>> Total                                       16441499.096    100.0
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> ================ INTEL BROADWELL WITH NVIDIA P100 ================
>>>>> Computing:                        M-Number       M-Flops  % Flops
>>>>> -----------------------------------------------------------------------------
>>>>> Pair Search distance check      271.334272      2442.008      0.0
>>>>> NxN Ewald Elec. + LJ [F]     191599.850112  12645590.107     98.0
>>>>> NxN Ewald Elec. + LJ [V&F]     1946.866432    208314.708      1.6
>>>>> 1,4 nonbonded interactions       53.415341      4807.381      0.0
>>>>> Shift-X                           3.029040        18.174      0.0
>>>>> Bonds                            10.541054       621.922      0.0
>>>>> Angles                           37.043704      6223.342      0.0
>>>>> Propers                          55.825582     12784.058      0.1
>>>>> Impropers                         4.220422       877.848      0.0
>>>>> Virial                            2.432585        43.787      0.0
>>>>> Stop-CM                           2.452080        24.521      0.0
>>>>> Calc-Ekin                        48.128080      1299.458      0.0
>>>>> Lincs                             9.992997       599.580      0.0
>>>>> Lincs-Mat                        50.775228       203.101      0.0
>>>>> Constraint-V                    240.108012      1920.864      0.0
>>>>> Constraint-Vir                    2.323707        55.769      0.0
>>>>> Settle                           73.382008     23702.389      0.2
>>>>> -----------------------------------------------------------------------------
>>>>> Total                                       12909529.017    100.0
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> Some of the rows are identical between the two tables above. The
>>>>> largest difference is observed for the "NxN Ewald Elec. + LJ [F]" row.
>>>>>
>>>>> Here is our Slurm script:
>>>>>
>>>>> #!/bin/bash
>>>>> #SBATCH --job-name=gmx          # create a short name for your job
>>>>> #SBATCH --nodes=1               # node count
>>>>> #SBATCH --ntasks=1              # total number of tasks across all nodes
>>>>> #SBATCH --cpus-per-task=1       # cpu-cores per task (>1 if multi-threaded tasks)
>>>>> #SBATCH --mem=4G                # memory per node (4G per cpu-core is default)
>>>>> #SBATCH --time=00:10:00         # total run time limit (HH:MM:SS)
>>>>> #SBATCH --gres=gpu:1            # number of gpus per node
>>>>>
>>>>> module purge
>>>>> module load cudatoolkit/10.2
>>>>>
>>>>> BCH=../rnase_cubic
>>>>> gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
>>>>> gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr
>>>>>
>>>>> How do we get optimal performance out of GROMACS on our POWER9/V100
>>>>> nodes?
>>>>> Jon
>>>>> --
>>>>> Gromacs Users mailing list
>>>>>
>>>>> * Please search the archive at
>>>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
>>>>> posting!
>>>>>
>>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>>
>>>>> * For (un)subscribe requests visit
>>>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>>>>> send a mail to gmx-users-requ...@gromacs.org.
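[As an aside for anyone replicating this: Kevin's benchmarking advice above (run longer, discard the warm-up half of the run, use a few OpenMP threads, pin explicitly) can be combined into a single mdrun invocation. A sketch, assuming a bench.tpr built as in Jon's Slurm script and four allocated cores; all flags are standard gmx mdrun options, but the exact thread/stride values are only an illustration.]

```shell
# Benchmark invocation following the suggestions in Message 1:
# longer run (-nsteps), timing counters reset halfway (-resethway),
# 4 OpenMP threads, explicit pinning, and no final-coordinate output
# (-noconfout) to keep the timing clean. Built as a string so it can
# be inspected first; execute it with: eval "$MDRUN_CMD"
MDRUN_CMD="gmx mdrun -s bench.tpr -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 -pinstride 4 -nsteps 50000 -resethway -noconfout"
echo "$MDRUN_CMD"
```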
------------------------------

> Message: 2
> Date: Fri, 24 Apr 2020 22:52:48 +0200
> From: Szilárd Páll <pall.szil...@gmail.com>
> To: Discussion list for GROMACS users <gmx-us...@gromacs.org>
> Cc: "gromacs.org_gmx-users@maillist.sys.kth.se"
>     <gromacs.org_gmx-users@maillist.sys.kth.se>
> Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node
> Message-ID: <cannyew6j7b5fsjlrkdi7z2pahko_rbvft173kfuzk6+c7gu...@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
>> The following lines are found in md.log for the POWER9/V100 run:
>>
>> Overriding thread affinity set outside gmx mdrun
>> Pinning threads with an auto-selected logical core stride of 128
>> NOTE: Thread affinity was not set.
>>
>> The full md.log is available here:
>> https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log
>
> I glanced over that at first; I will see if I can reproduce it, though I
> only have access to a Raptor Talos, not an IBM machine with Ubuntu.
>
> What OS are you using?
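[Regarding the "Thread affinity was not set" note: one way to check from outside whether mdrun's threads actually ended up pinned is to read each thread's allowed-CPU list from /proc (Linux-specific). This sketch inspects the current shell as a stand-in so it is self-contained; in practice substitute the PID of gmx mdrun.]

```shell
# Print the allowed-CPU list of every thread of a process (Linux /proc).
# A pinned thread shows a single CPU (e.g. "0"); an unpinned one shows
# the full range (e.g. "0-127").
PID=$$   # stand-in for demonstration; use the PID of gmx mdrun here
for tid in /proc/"$PID"/task/*; do
  printf '%s: %s\n' "${tid##*/}" "$(grep Cpus_allowed_list "$tid"/status)"
done
```

[On a POWER9 node as described in this thread, an unpinned run would show 0-127 on every thread, which is consistent with the md.log note above.]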
> ------------------------------
>
> End of gromacs.org_gmx-users Digest, Vol 192, Issue 89
> ******************************************************