Hi Dan,

On Fri, Feb 9, 2018 at 4:56 PM, Daniel Kozuch <dan.koz...@gmail.com> wrote:
> Szilárd,
>
> If I may jump in on this conversation,

Let's fork the thread so the topics stay clear and discoverable by others, please.

> I am having the reverse problem (which I assume others may encounter
> also) where I am attempting a large REMD run (84 replicas) and I have
> access to, say, 12 GPUs and 84 CPUs.

OK, that's a useful case to clarify. If you have exactly 84 CPU cores for your 84 runs, and that number is divisible by the number of GPUs too, it should be as simple as running

mpirun -np 84 gmx_mpi mdrun -multi 84

and the automatic mapping of ranks/threads to hardware should work just fine.

> Basically I have fewer GPUs than simulations. Is there a logical approach
> to using gputasks and other new options in GROMACS 2018 for this setup?
> I read through the available documentation, but as you mentioned it seems
> to be targeted at single-GPU runs.

TL;DR Correction: PME GPU offload is tuned for single-GPU (more precisely, no domain-decomposition) simulations. The -gpu_id/-gputasks options have been fairly well thought through to be useful for all current and most future use-cases ;) Therefore, the use of -gputasks is valid and useful in multi-rank and multi-sim/replica runs as well: it will still map GPUs to tasks/ranks within a node.

You'll find a somewhat more detailed writeup below which might refresh & clarify the details of mdrun's internal workings.

----

First, a brief recap of the basics (for those inexperienced or needing a reminder): GROMACS uses heterogeneous offload parallelization, i.e. the CPU & GPU are both used. This has a number of benefits and drawbacks; what is relevant in this context is that for a highly tuned MD engine it is generally difficult to achieve perfect load balance and 100% utilization of both CPU and GPU (in fact, the same applies to the ranks of an MPI-parallel run). Consequently, part of the runtime will be spent waiting for the GPU on the CPU, or vice versa.
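As a quick pre-launch sanity check, the arithmetic for the case above can be sketched in plain shell (the numbers are the ones from this thread; this is just arithmetic, not a gmx invocation):

```shell
# Assumed numbers from this thread: 84 replicas, 12 GPUs, 84 CPU cores.
NSIM=84; NGPUS=12; NCORES=84
if (( NSIM % NGPUS == 0 && NCORES >= NSIM )); then
    # 84 / 12 = 7 replicas share each GPU, and every replica gets a core
    echo "ok: $((NSIM / NGPUS)) replicas per GPU, >=1 core per replica"
else
    echo "warning: uneven replica/GPU split or too few cores"
fi
```

If the division is uneven, the automatic mapping can leave GPUs unevenly loaded, which is when an explicit -gputasks string becomes useful.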
GPUs are easy to share among multiple dependent or independent jobs (unlike CPUs/cores), and this can help make use of otherwise idle GPU time. By setting up threads/processes to share GPUs (e.g. ranks in a multi-sim run, or even independent program executions), these can fill the GPU utilization "gaps", generally resulting in better overall efficiency, e.g. higher aggregate ns/day. GPU sharing is of moderate importance in multi-GPU/node parallel runs (i.e. it is useful to run a few ranks per GPU), but it is even more important in throughput-type use with a few to many independent runs -- as a reminder, we've talked about this in our paper and most of that information is still useful and valid; see Fig. 5 and the related discussion of https://goo.gl/FvkGC7

Back to the use of -gpu_id / -gputasks. The slight change is that the roles of -gpu_id have been separated: previously it specified *both* the IDs of the devices to use as well as the mapping of the devices to the tasks/ranks in a simulation. In the 2018 release, -gpu_id *only specifies which devices to use*, while -gputasks provides the mapping (and it is in most cases optional). E.g. these three are equivalent, but the first is not valid in v2018:

mdrun -ntmpi 4 -gpu_id 0011 -nb gpu
  # map two GPUs to four ranks in v2016

mdrun -ntmpi 4 [-gpu_id 01] -gputasks 0011 -nb gpu [-pme cpu]
  # use GPUs 0 and 1 and map them two-by-two to the four PP ranks -- in v2018

mdrun -ntmpi 4
  # lazy default for both v2016 and v2018 -- assuming there are only two
  # GPUs, and given that PME runs on the CPU by default with DD

Now to PME-GPU: before v2018 there was only a single task type offloaded, so -gpu_id mapped GPUs to PP tasks within a node (combined PP+PME or separate PP ranks); for multiple ranks using the same GPU, the ID was simply repeated. With the additional PME task to offload (which can "reside" either in the same rank as a PP task or in a separate rank), the mapping has to account for PME too.
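The digit-string mapping above can also be generated mechanically. A minimal sketch, assuming single-digit GPU IDs and the block-wise assignment of consecutive ranks to GPUs used in the 0011 example (the variable names are mine, not mdrun options):

```shell
# Build a -gputasks string assigning NRANKS PP tasks block-wise to NGPUS GPUs.
# Assumes single-digit GPU IDs and NRANKS divisible by NGPUS.
NRANKS=4
NGPUS=2
GPUTASKS=""
for ((i = 0; i < NRANKS; i++)); do
    # integer division gives 0,0,1,1 for i=0..3 with these values
    GPUTASKS+=$((i * NGPUS / NRANKS))
done
echo "$GPUTASKS"    # with NRANKS=4, NGPUS=2 this prints: 0011
```

The same loop produces e.g. 000111222333 for 12 ranks over 4 GPUs; a separate PME task would simply add one more digit at the position of the PME rank.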
What may not be crystal clear from the docs is that the order of tasks is generally PP first, PME next (both within a rank and across ranks); also, PME ranks are by default interleaved (unless changed with -ddorder), so in an 8-rank, 6 PP / 2 PME setup the rank order is 3 PP / 1 PME / 3 PP / 1 PME (this is, however, not supported with GPUs!).

A few examples leading up to the multi-replica runs:

* a single GPU / single rank run:
gmx mdrun -ntmpi 1 -gputasks 00 -nb gpu -pme gpu  # verbose command line
gmx mdrun -ntmpi 1 -gpu_id 0  # will do the same as above
gmx mdrun -ntmpi 1  # will also do the same as above ;)

* single node, 2 GPUs, with a separate PME rank -- note the limitations of this offload mode discussed previously (in the mail I've just forked!):
gmx mdrun [-pme gpu -nb gpu] -ntmpi 8 -ntomp 6 -npme 1 -gputasks 00000001  # assuming 24 cores / 48 threads / 2 GPUs
gmx mdrun [-pme gpu -nb gpu] -ntmpi 12 -ntomp 4 -npme 1 -gputasks 000000000011  # could be more efficient than the above

* single node, 2 GPUs, 4 replicas, 1 rank each, 2-way GPU sharing:
mpirun -np 4 gmx mdrun -multi 4 -pme gpu -nb gpu -gputasks 0011
mpirun -np 4 gmx mdrun -multi 4  # equivalent to the above, assuming 2 GPUs

It can be worth trying at least 2-4 sims per GPU (especially if there are enough replicas and individual run performance is less important). What you *need* to make sure of, for performance reasons, is that you have at least 1 core per GPU (cores, not hardware threads).

Cheers,
--
Szilárd

> Thanks so much,
> Dan
>
> On Fri, Feb 9, 2018 at 10:27 AM, Szilárd Páll <pall.szil...@gmail.com> wrote:
>
>> On Fri, Feb 9, 2018 at 4:25 PM, Szilárd Páll <pall.szil...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> First of all, have you read the docs (admittedly somewhat brief)?
>>> http://manual.gromacs.org/documentation/2018/user-guide/mdrun-performance.html#types-of-gpu-tasks
>>>
>>> The current PME GPU was optimized for single-GPU runs.
>>> Using multiple GPUs with PME offloaded works, but this mode hasn't been
>>> an optimization target and it will often not give very good performance.
>>> Using multiple GPUs requires a separate PME rank (as you have realized),
>>> only one can be used (as we don't support PME decomposition on GPUs),
>>> and it comes with some inherent scaling drawbacks. For this reason,
>>> unless you _need_ your single run to be as fast as possible, you'll be
>>> better off running multiple simulations side by side.
>>
>> PS: You can of course also run on two GPUs and run two simulations
>> side-by-side (on half of the cores each) to improve the overall aggregate
>> throughput you get out of the hardware.
>>
>>> A few tips for tuning the performance of a multi-GPU run with PME offload:
>>> * expect to get at best 1.5x scaling to 2 GPUs (rarely to 3, if the tasks allow)
>>> * generally it's best to use about the same decomposition that you'd use
>>>   with nonbonded-only offload, e.g. in your case 6-8 ranks
>>> * map the GPU task alone, or at most together with 1 PP rank, to a GPU,
>>>   i.e. use the new -gputasks option
>>>
>>> e.g. for your case I'd expect the following to work ~best:
>>> gmx mdrun -v -deffnm md -pme gpu -nb gpu -ntmpi 8 -ntomp 6 -npme 1 -gputasks 00000001
>>> or
>>> gmx mdrun -v -deffnm md -pme gpu -nb gpu -ntmpi 8 -ntomp 6 -npme 1 -gputasks 00000011
>>>
>>> Let me know if that gave some improvement.
>>>
>>> Cheers,
>>>
>>> --
>>> Szilárd
>>>
>>> On Fri, Feb 9, 2018 at 8:51 AM, Gmx QA <gmxquesti...@gmail.com> wrote:
>>>
>>>> Hi list,
>>>>
>>>> I am trying out the new gromacs 2018 (really nice so far), but have a
>>>> few questions about what command line options I should specify,
>>>> specifically with the new GPU PME implementation.
>>>> My computer has two CPUs (with 12 cores each, 24 with hyper-threading)
>>>> and two GPUs, and I currently (with 2018) start simulations like this:
>>>>
>>>> $ gmx mdrun -v -deffnm md -pme gpu -nb gpu -ntmpi 2 -npme 1 -ntomp 24 -gpu_id 01
>>>>
>>>> This works, but gromacs prints the message that 24 OpenMP threads per
>>>> MPI rank is likely inefficient. However, when I try to reduce the number
>>>> of OpenMP threads I see a reduction in performance. Is this message no
>>>> longer relevant with GPU PME, or am I overlooking something?
>>>>
>>>> Thanks
>>>> /PK
>>>> --
>>>> Gromacs Users mailing list
>>>>
>>>> * Please search the archive at
>>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>>>>
>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>
>>>> * For (un)subscribe requests visit
>>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>>>> send a mail to gmx-users-requ...@gromacs.org.