Hi Matt,
Here are a few points that might help you; maybe other experts can
contribute more.
If you're running on a single machine, using thread-mpi over mpi is a good
choice.
"-pin on" might help you.
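As a rough sketch of the two points above, this small Python helper assembles an
mdrun command line with a single thread-MPI rank, an explicit OpenMP thread count
and pinning switched on. The helper name mdrun_command and the run name "bench"
are made up for illustration; -deffnm, -ntmpi, -ntomp and -pin are regular mdrun
options, but please check "gmx mdrun -h" for your build before relying on them.

```python
def mdrun_command(deffnm, ntomp, ntmpi=1, pin="on"):
    """Build a `gmx mdrun` argument list (hypothetical helper, not part of GROMACS)."""
    return [
        "gmx", "mdrun",
        "-deffnm", deffnm,      # base name for the run's input/output files
        "-ntmpi", str(ntmpi),   # number of thread-MPI ranks
        "-ntomp", str(ntomp),   # OpenMP threads per rank
        "-pin", pin,            # thread pinning: auto, on or off
    ]

# Compare the 24 threads mdrun chose by itself with all 48 logical cores.
for n in (24, 48):
    print(" ".join(mdrun_command("bench", ntomp=n)))
```

Running both variants for a short benchmark should show whether the extra
hyperthreads help or hurt on your machine.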
60k atoms is not very large. Here are some other systems ready to benchmark:
https://www.mpibpc.mpg.de/grubmueller/bench
They should be able to tell you more about your performance on a range of systems.
It is normal that the GPU is not fully utilized. The newest GROMACS release should
be able to make more use of the GPU, so you might want to try out the beta-3
version to get an idea. Please don't use it for production, though; wait until
January, when GROMACS-2020 is released.
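To put a rough number on this, the rows of the accounting table in your mail
(quoted below) can be summed. This is only a sketch in Python, with the wall times
copied by hand from your log. Note that the "Wait ..." rows measure how long the
CPU waits for results from the GPU, which is not the same thing as GPU idle time,
so treat the result as an indicator rather than a measurement of GPU utilization.

```python
# Wall times in seconds, copied by hand from the accounting table in the log.
wall_time = {
    "Neighbor search": 32.590,
    "Launch GPU ops.": 105.169,
    "Force": 140.283,
    "Wait PME GPU gather": 79.714,
    "Reduce GPU PME F": 25.159,
    "Wait GPU NB local": 264.961,
    "NB X/F buffer ops.": 177.862,
    "Write traj.": 5.748,
    "Update": 81.151,
    "Constraints": 70.231,
    "Rest": 47.521,
}
total = 1030.389  # "Total" row of the same table

# Sum every row whose label starts with "Wait" (CPU waiting for the GPU).
waiting = sum(t for name, t in wall_time.items() if name.startswith("Wait"))
wait_share = 100.0 * waiting / total

print(f"CPU waiting for GPU: {waiting:.3f} s ({wait_share:.1f} % of wall time)")
```

So roughly a third of the wall time on the CPU side goes into waiting for the GPU
in this run.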
If you want to maximise sampling, include running multiple simulations
simultaneously in your benchmark set (mdrun -multidir makes this easy). Most often
this is what you actually want, and it can give you a drastic increase in output
from your hardware (as a long shot, you might get 4 * 150 ns/day).
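To make the arithmetic explicit, here is a small Python sketch. The 150 ns/day per
simulation is only my guess from above, the directory names sim1 to sim4 are
placeholders, and as far as I know -multidir needs an MPI-enabled build (usually
installed as gmx_mpi) started through mpirun, so the launcher line is an
assumption about your setup rather than something I have tested on it.

```python
def multidir_command(dirs, launcher="mpirun", binary="gmx_mpi"):
    """Build a launcher line with one simulation per directory (hypothetical helper)."""
    return [launcher, "-np", str(len(dirs)), binary, "mdrun", "-multidir", *dirs]

def aggregate_ns_per_day(n_sims, ns_per_day_each):
    """Total sampling per day when n_sims independent runs share the node."""
    return n_sims * ns_per_day_each

dirs = [f"sim{i}" for i in range(1, 5)]      # placeholder directory names
print(" ".join(multidir_command(dirs)))

single = 209.630                             # measured, from the log
combined = aggregate_ns_per_day(4, 150.0)    # guessed per-run rate
print(f"{combined:.0f} ns/day combined, {combined / single:.1f}x the single run")
```

Each directory needs its own .tpr file; the guessed rate should of course be
replaced by whatever your benchmark actually gives.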
I assume you have already had a look at these, but for reference:
http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html
http://manual.gromacs.org/documentation/current/onlinehelp/gmx-mdrun.html
http://manual.gromacs.org/documentation/current/user-guide/mdrun-features.html
https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.26011
Best,
Christian
On 2019-12-04 17:53, Matthew Fisher wrote:
Dear all,
We're currently running some experiments with a new hardware configuration and
attempting to maximise performance from it. Our system contains 1x V100 and 2x
12-core (24 logical) Xeon Silver 4214 CPUs. After optimisation of CUDA drivers
& kernels etc., we've been able to get a performance of 210 ns/day for 60k
atoms with GROMACS 2019.3 (allowing mdrun to select threads, which has surprised us
as it only creates 24 OpenMP threads for our 48 logical core system). Furthermore,
we have a surprising amount of wasted GPU time. Therefore, we were wondering if
anyone had any advice on how we could maximise our hardware output? We've enclosed
the real cycle and time accounting display below.
Any help will be massively appreciated.
Thanks,
Matt
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 24 OpenMP threads

 Computing:           Num  Num       Call   Wall time    Giga-Cycles
                    Ranks Threads   Count         (s)      total sum     %
-----------------------------------------------------------------------------
 Neighbor search        1   24      12501      32.590       1716.686   3.2
 Launch GPU ops.        1   24    2500002     105.169       5539.764  10.2
 Force                  1   24    1250001     140.283       7389.414  13.6
 Wait PME GPU gather    1   24    1250001      79.714       4198.902   7.7
 Reduce GPU PME F       1   24    1250001      25.159       1325.260   2.4
 Wait GPU NB local      1   24    1250001     264.961      13956.769  25.7
 NB X/F buffer ops.     1   24    2487501     177.862       9368.871  17.3
 Write traj.            1   24        252       5.748        302.799   0.6
 Update                 1   24    1250001      81.151       4274.601   7.9
 Constraints            1   24    1250001      70.231       3699.389   6.8
 Rest                                          47.521       2503.167   4.6
-----------------------------------------------------------------------------
 Total                                       1030.389      54275.623 100.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    24729.331     1030.389     2400.0
                 (ns/day)    (hour/ns)
Performance:      209.630        0.114
--
Gromacs Users mailing list