Hi Kevin,

Thanks for your very useful post. Could you give a few command-line examples of how to start multiple runs at different times (e.g., allocate a subset of the CPUs/GPUs to one run, then start another run later using a subset of the yet-unallocated CPUs/GPUs)? Also, could you elaborate on the drawbacks of the MPI compilation that you hinted at?

Gregory
From: Kevin Boyd<mailto:kevin.b...@uconn.edu>
Sent: Thursday, July 25, 2019 10:31 PM
To: gmx-us...@gromacs.org<mailto:gmx-us...@gromacs.org>
Subject: Re: [gmx-users] simulation on 2 gpus

Hi,

I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has questions about the essay to follow, feel free to email me personally, and I'll link the answer to this thread if it ends up being pertinent.

First, there are some more internet resources to check out. See Mark's talk at
https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
Gromacs development moves fast, but a lot of it is still relevant.

I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast, so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.

The most important consideration is that to get maximum total throughput, you should be running not one but multiple simulations simultaneously. You can do this through the -multidir option, but I don't recommend it in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to launch background subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.
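For instance, here is a minimal sketch of that run-script pattern for a 2-GPU box - the "run1"/"run2" -deffnm names are placeholders for your own .tpr files, the core counts assume 20 cores, and GMX defaults to echo so the script only prints the commands (a dry run); set GMX="gmx mdrun" to actually launch:

```shell
#!/bin/sh
# Sketch only: two independent runs sharing a 20-core, 2-GPU machine.
# "run1"/"run2" are placeholder -deffnm names (hypothetical .tpr files).
# GMX defaults to a dry-run echo; set GMX="gmx mdrun" to really launch.
GMX="${GMX:-echo gmx mdrun}"

PIN1=0    # first run pinned starting at core 0
PIN2=10   # second run pinned starting at core 10 (no overlap with run 1)

# First run: 1 thread-MPI rank, 10 OpenMP threads on cores 0-9, GPU 0.
$GMX -deffnm run1 -ntmpi 1 -ntomp 10 -gputasks 0 -pin on -pinoffset $PIN1 &

# Second run: can be started at any later time, as long as its -pinoffset
# avoids the cores the first run is pinned to. Cores 10-19, GPU 1.
$GMX -deffnm run2 -ntmpi 1 -ntomp 10 -gputasks 1 -pin on -pinoffset $PIN2 &

wait   # keep the script alive until both backgrounded runs finish
```

The key detail is -pin on with non-overlapping -pinoffset values, so the two runs never compete for the same cores.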
Another important point is that you can run more simulations than you have GPUs. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it by up to 1.5x. This involves targeting the same GPU with -gputasks.

Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-MPI ranks and OpenMP threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine (64 hardware threads here) for one simulation (not recommended for maximal efficiency), you have a lot of decomposition options (ignoring PME, which is important - see below):

-ntmpi 2  -ntomp 32 -gputasks 01
-ntmpi 4  -ntomp 16 -gputasks 0011
-ntmpi 8  -ntomp 8  -gputasks 00001111
-ntmpi 16 -ntomp 4  -gputasks 0000000011111111

(and a few others - note that ntmpi * ntomp = total threads available, and -gputasks takes one digit per rank). In my experience, you need to scan the options in a benchmarking script for each system size/composition you want to simulate, and the difference between the best and the worst can be up to a factor of 2-4 in performance. If you're splitting your machine among multiple simulations, I suggest running 1 thread-MPI rank per simulation (-ntmpi 1), unless your benchmarking suggests that the optimum lies elsewhere.

Things get more complicated when you start putting PME on the GPUs. For the machines I work on, putting PME on GPUs absolutely improves performance, but I'm not fully confident in that assessment without testing your specific machine - you have a lot of cores with that Threadripper, and this is another area where I expect Gromacs 2020 might shift the optimal GPU-CPU balance. The issue with PME on GPUs is that we can (currently) have only one rank doing GPU PME work. So, on a machine with, say, 20 cores and 2 GPUs, if I run the following:

gmx mdrun ....
-ntomp 10 -ntmpi 2 -pme gpu -npme 1 -gputasks 01

then two ranks will be started: one, with cores 0-9, works on the short-range interactions, offloading what it can to GPU 0, and the PME rank (cores 10-19) offloads to GPU 1.

There is one significant problem (and one minor problem) with this setup. First, it is massively inefficient in terms of load balance. In a typical system (there are exceptions), PME takes up ~1/3 of the computation that the short-range interactions take. So we are offloading 1/4 of our interactions to one GPU and 3/4 to the other, which leads to imbalance. In this specific case (2 GPUs and sufficient cores), the optimal solution is often (but not always) to run with -ntmpi 4 (in this example, then -ntomp 5), as the PME rank then gets 1/4 of the GPU instructions, proportional to the computation needed.

The second problem (less critical - don't worry about this unless you're CPU-limited) is that a PME-GPU rank only uses 1 CPU core in its calculations. So, on a node with 20 cores and 2 GPUs, if I run a simulation with -ntmpi 4 -ntomp 5 -pme gpu -npme 1, each of those ranks will have 5 CPU cores, but the PME rank will only use one of them. You can specify the number of PME threads per rank with -ntomp_pme, which is useful in restricted cases. For example, given the above architecture (20 cores, 2 GPUs), I could maximally exploit my CPUs with the following commands:

gmx mdrun .... -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks 0000 -pin on -pinoffset 0 &
gmx mdrun .... -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks 1111 -pin on -pinoffset 10

where the first 10 cores are laid out as (0-2: PP) (3-5: PP) (6-8: PP) (9: PME), and similarly for the other 10 cores.

There are a few other parameters to scan for minor improvements - for example nstlist, which I typically scan in the range 80-140 for GPU simulations, with an effect of 2-5% on performance.

I'm happy to expand the discussion with anyone who's interested.
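To make the scanning concrete, here is the skeleton of the kind of benchmarking script I mean - a sketch, not a recipe: "bench" is a placeholder -deffnm, the thread total assumes the 64-thread Threadripper, and GMX defaults to a dry-run echo so you can inspect the generated commands first. -nsteps caps each trial, and -resethway resets the cycle counters halfway through so the ns/day numbers in the .log files exclude startup and initial load balancing:

```shell
#!/bin/sh
# Sketch: scan thread-MPI / OpenMP splits on a 64-thread, 2-GPU machine.
# "bench" is a placeholder -deffnm; GMX defaults to echo (dry run).
# Afterwards, compare the ns/day reported in each bench_*.log.
GMX="${GMX:-echo gmx mdrun}"
TOTAL=64                         # total hardware threads

for NTMPI in 2 4 8 16; do
    NTOMP=$((TOTAL / NTMPI))
    # Build a -gputasks string, one digit per rank: first half of the
    # ranks on GPU 0, second half on GPU 1.
    GPUTASKS=""
    i=0
    while [ "$i" -lt "$NTMPI" ]; do
        if [ "$i" -lt "$((NTMPI / 2))" ]; then
            GPUTASKS="${GPUTASKS}0"
        else
            GPUTASKS="${GPUTASKS}1"
        fi
        i=$((i + 1))
    done
    $GMX -deffnm bench_ntmpi${NTMPI} -ntmpi "$NTMPI" -ntomp "$NTOMP" \
         -gputasks "$GPUTASKS" -nsteps 10000 -resethway
done

# The nstlist scan mentioned above follows the same pattern
# (-nstlist on the command line overrides the .mdp value):
for NSTLIST in 80 100 120 140; do
    $GMX -deffnm bench_nl${NSTLIST} -nstlist "$NSTLIST" \
         -ntmpi 4 -ntomp 16 -gputasks 0011 -nsteps 10000 -resethway
done
```

Run once as a dry run, sanity-check the printed commands, then set GMX="gmx mdrun" and add -pin/-pinoffset/-pme flags as appropriate for the configuration you're testing.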
Kevin

On Thu, Jul 25, 2019 at 1:47 PM Stefano Guglielmo <stefano.guglie...@unito.it> wrote:
> Dear all,
> I am trying to run simulations with Gromacs 2019.2 on a workstation with an
> AMD Threadripper CPU (32 cores, 64 threads), 128 GB RAM, and two RTX 2080 Ti
> GPUs with an NVLink bridge. I read the user's guide section on performance
> and I am exploring some possible combinations of CPU/GPU work to run as
> fast as possible. I was wondering if any of you have experience running
> on more than one GPU with several cores and can give some hints as a
> starting point.
> Thanks
> Stefano
>
> --
> Stefano GUGLIELMO PhD
> Assistant Professor of Medicinal Chemistry
> Department of Drug Science and Technology
> Via P. Giuria 9
> 10125 Turin, ITALY
> ph. +39 (0)11 6707178
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> or send a mail to gmx-users-requ...@gromacs.org.