Re: [gmx-users] Tests with Threadripper and dual gpu setup
BS”D

> Did you build with or without hwloc?

I did use hwloc.

> > —
> > Gromacs 2018 rc1 (using gcc 4.8.5)
> > —
> >
> > Using AVX_256
>
> You should be using AVX2_128 or AVX2_256 on Zen! The former will be
> fastest in CPU-only runs, the latter can often be (a bit) faster in
> GPU-accelerated runs.

Once I saw that AVX2_128 was faster, I did not think there would be an
advantage to AVX2_256 on GPU runs. So, is there any suggestion for
overcoming the problem of gcc 5.5 not recognising what CPU hardware I have
(not that 5.5 gave much of an advantage in Gromacs 2016)?

> > Now force Dynamic Load Balancing
> >
> > gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> > -npme 1 -gputasks 0011 -nb gpu -dlb yes
>
> I would recommend *against* doing that unless you have concrete cases
> where this is better than "-dlb auto" -- and if you have such cases,
> please share them, as it is not expected behavior. (Note: DLB can now
> detect when turning it on leads to a performance drop, and it switches
> itself off automatically in such cases!)

I did see that in some cases it was turning DLB off for a while, or for the
rest of the run. In my case, however, I did get better results by forcing
it to be on. I can send the .tpr file to you, off-list, if you want…

Thanks

Harry

Harry M. Greenblatt
Associate Staff Scientist
Dept of Structural Biology
harry.greenbl...@weizmann.ac.il
Weizmann Institute of Science     Phone:     972-8-934-6340
234 Herzl St.                     Facsimile: 972-8-934-3361
Rehovot, 7610001 Israel
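On the gcc 5.5 CPU-detection question above: the SIMD kernel flavour does
not have to rely on the build system's auto-detection and can be pinned at
configure time instead. A minimal configure sketch, assuming a CUDA-enabled
thread-MPI build; the compiler choice, the AVX2_128 vs AVX2_256 setting and
the hwloc switch are the points being discussed in this thread, not fixed
recommendations:

  cmake .. \
      -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ \
      -DGMX_GPU=ON \
      -DGMX_SIMD=AVX2_128 \
      -DGMX_HWLOC=ON

Setting -DGMX_SIMD explicitly selects the kernel flavour regardless of what
CPU the compiler or CMake detects, so a compiler that does not recognise
Zen can still build AVX2 kernels; whether AVX2_128 or AVX2_256 ends up
faster for a given run is exactly the comparison reported above.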
Re: [gmx-users] Tests with Threadripper and dual gpu setup
Hi,

Thanks for the report!

Did you build with or without hwloc? There is a known issue with the
automatic pin stride when not using hwloc which will lead to a "compact"
pinning (using half of the cores with 2 threads/core) when <= half of the
threads are launched (instead of using all cores with 1 thread/core, which
is the default on Intel).

When it comes to running "wide" ranks (i.e. many OpenMP threads per rank)
on Zen/Ryzen, things are not straightforward, so the default 16/32 threads
on 16 cores + 1 GPU is not great. If already running domain decomposition,
4-8 threads/rank is generally best, but unfortunately this will often not
be better than just using no DD and taking the hit of threading
inefficiency.

A few more comments in-line.

On Wed, Jan 24, 2018 at 10:14 AM, Harry Mark Greenblatt <
harry.greenbl...@weizmann.ac.il> wrote:

> BS”D
>
> In case anybody is interested, we have tested Gromacs on a Threadripper
> machine with two GPU’s.
>
> Hardware:
>
> Ryzen Threadripper 1950X 16-core CPU (multithreading on), with Corsair
> H100i V2 liquid cooling
> Asus Prime X399-A M/B
> 2 x Geforce GTX 1080 GPU’s
> 32 GB of 3200 MHz memory
> Samsung 850 Pro 512 GB SSD
>
> OS, software:
>
> Centos 7.4, with 4.14 kernel from ElRepo
> gcc 4.8.5 and gcc 5.5.0
> fftw 3.3.7 (AVX2 enabled)
> Cuda 8
> Gromacs 2016.4
> Gromacs 2018-rc1 and final 2018
> Using thread-MPI
>
> I managed to compile gcc 5.5.0, but when I went to use it to compile
> Gromacs, the compiler could not recognise the hardware, although the
> native gcc 4.8.5 had no problem. In 2016.4, I was able to specify which
> SIMD set to use, so this was not an issue. In any case there was very
> little difference between gcc 5.5.0 and 4.8.5, so I used 4.8.5 for 2018.
> Any ideas how to overcome this problem with 5.5.0?
>
> Gromacs 2016.4
>
> System: Protein/DNA complex, with 438,397 atoms (including waters/ions),
> 100 ps NPT equilibration.
>
> Allowing Gromacs to choose how it wanted to allocate the hardware gave
>
> 8 tMPI ranks, 4 threads per rank, both GPU’s
>
> 12.4 ns/day
>
> When I told it to use 4 tMPI ranks, 8 threads per rank, both GPU’s
>
> 12.2 ns/day
>
> Running on “real” cores only
>
> 4 tMPI ranks, 4 threads per rank, 2 GPU’s
>
> 10.2 ns/day
>
> 1 tMPI rank, 16 threads per rank, *one* GPU (“half” the machine; pin on,
> but pinstride and pinoffset automatic)
>
> 10.6 ns/day
>
> 1 tMPI rank, 16 threads per rank, one GPU, and manually set all pinning
> options:
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1
> -gpu_id 0 -pinoffset 0 -pinstride 2
>
> 12.3 ns/day
>
> Presumably, the gain here is because “pinstride 2” caused the job to run
> on the “real” (0,1,2…15) cores, and not on virtual cores. The automatic
> pinstride above used cores [0,16], [1,17], [2,18]…[7,23], half of which
> are virtual and so gave only 10.6 ns/day.
>
> ** So there was very little gain from the second GPU, and very little
> gain from multithreading. **
>
> Using AVX_256 and not AVX2_256 with the above command gave a small speed
> up (although using AVX instead of AVX2 for FFTW made things worse).
>
> 12.5 ns/day
>
> To compare with an Intel Xeon Silver system:
>
> 2 x Xeon Silver 4116 (2.1 GHz base clock, 12 cores each, no
> Hyperthreading), 64 GB memory
> 2 x Geforce 1080’s (as used in the above tests)
>
> gcc 4.8.5
> Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and
> not by me).
>

AVX2_256 should give some benefit, but not a lot.
(BTW, on Silver do not use AVX_512; even on the Gold / 2-FMA Skylake-X,
AVX2 tends to be better when running with GPUs.)

> 2 MPI ranks, 12 threads each rank, 2 GPU’s
>
> 11.7 ns/day
>
> 4 MPI ranks, 6 threads each rank, 2 GPU’s
>
> 13.0 ns/day
>
> 6 MPI ranks, 4 threads each rank, 2 GPU’s
>
> 14.0 ns/day
>

Similar effect as noted wrt Ryzen.

> To compare with the AMD machine, same number of cores
>
> 1 MPI rank, 16 threads, 1 GPU
>
> 11.2 ns/day
>

(Side-note: a bit of an apples-and-oranges comparison, isn't it?)

> —
> Gromacs 2018 rc1 (using gcc 4.8.5)
> —
>
> Using AVX_256
>

You should be using AVX2_128 or AVX2_256 on Zen! The former will be fastest
in CPU-only runs, the latter can often be (a bit) faster in GPU-accelerated
runs.

> In ‘classic’ mode, not using the GPU for PME
>
> 8 tMPI ranks, 4 threads per rank, 2 GPU’s
>
> 12.7 ns/day (modest speed up from 12.4 ns/day with 2016.4)
>
> Now use a GPU for PME
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on
>
> used 1 tMPI rank, 32 OpenMP threads, 1 GPU
>
> 14.9 ns/day
>
> Forcing the program to use both GPU’s
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> -npme 1 -gputasks 0011 -nb gpu
>
> 18.5 ns/day
>
> Now with AVX2_128
>
> 19.0 ns/day
>
> Now force Dynamic Load Balancing
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> -npme 1 -gputasks 0011 -nb gpu -dlb yes
>

I would recommend *against* doing that unless you have concrete cases
where this is better than "-dlb auto" -- and if you have such cases,
please share them, as it is not expected behavior. (Note: DLB can now
detect when turning it on leads to a performance drop, and it switches
itself off automatically in such cases!)
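A quick way to check the "compact" pinning issue described at the top of
this reply is to look at which logical CPUs are hardware-thread siblings,
i.e. share a physical core. A small sketch using standard Linux tools
(util-linux lscpu and sysfs); the exact numbering is machine-specific:

  # one line per logical CPU, with the physical core it belongs to
  lscpu -e=CPU,CORE,SOCKET

  # hardware-thread siblings of logical CPU 0
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

If a half-loaded run ends up pinned to sibling pairs while other physical
cores sit idle (as in the [0,16], [1,17], … mapping reported above),
forcing -pinstride 2, as in the 2016.4 test earlier in the thread, spreads
the threads over distinct physical cores instead.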
[gmx-users] Tests with Threadripper and dual gpu setup
BS”D

In case anybody is interested, we have tested Gromacs on a Threadripper
machine with two GPU’s.

Hardware:

Ryzen Threadripper 1950X 16-core CPU (multithreading on), with Corsair
H100i V2 liquid cooling
Asus Prime X399-A M/B
2 x Geforce GTX 1080 GPU’s
32 GB of 3200 MHz memory
Samsung 850 Pro 512 GB SSD

OS, software:

Centos 7.4, with 4.14 kernel from ElRepo
gcc 4.8.5 and gcc 5.5.0
fftw 3.3.7 (AVX2 enabled)
Cuda 8
Gromacs 2016.4
Gromacs 2018-rc1 and final 2018
Using thread-MPI

I managed to compile gcc 5.5.0, but when I went to use it to compile
Gromacs, the compiler could not recognise the hardware, although the native
gcc 4.8.5 had no problem. In 2016.4, I was able to specify which SIMD set
to use, so this was not an issue. In any case there was very little
difference between gcc 5.5.0 and 4.8.5, so I used 4.8.5 for 2018. Any ideas
how to overcome this problem with 5.5.0?

Gromacs 2016.4

System: Protein/DNA complex, with 438,397 atoms (including waters/ions),
100 ps NPT equilibration.

Allowing Gromacs to choose how it wanted to allocate the hardware gave

8 tMPI ranks, 4 threads per rank, both GPU’s

12.4 ns/day

When I told it to use 4 tMPI ranks, 8 threads per rank, both GPU’s

12.2 ns/day

Running on “real” cores only

4 tMPI ranks, 4 threads per rank, 2 GPU’s

10.2 ns/day

1 tMPI rank, 16 threads per rank, *one* GPU (“half” the machine; pin on,
but pinstride and pinoffset automatic)

10.6 ns/day

1 tMPI rank, 16 threads per rank, one GPU, and manually set all pinning
options:

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1
-gpu_id 0 -pinoffset 0 -pinstride 2

12.3 ns/day

Presumably, the gain here is because “pinstride 2” caused the job to run on
the “real” (0,1,2…15) cores, and not on virtual cores. The automatic
pinstride above used cores [0,16], [1,17], [2,18]…[7,23], half of which are
virtual and so gave only 10.6 ns/day.

** So there was very little gain from the second GPU, and very little gain
from multithreading. **

Using AVX_256 and not AVX2_256 with the above command gave a small speed up
(although using AVX instead of AVX2 for FFTW made things worse).

12.5 ns/day

To compare with an Intel Xeon Silver system:

2 x Xeon Silver 4116 (2.1 GHz base clock, 12 cores each, no
Hyperthreading), 64 GB memory
2 x Geforce 1080’s (as used in the above tests)

gcc 4.8.5
Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and
not by me).

2 MPI ranks, 12 threads each rank, 2 GPU’s

11.7 ns/day

4 MPI ranks, 6 threads each rank, 2 GPU’s

13.0 ns/day

6 MPI ranks, 4 threads each rank, 2 GPU’s

14.0 ns/day

To compare with the AMD machine, same number of cores

1 MPI rank, 16 threads, 1 GPU

11.2 ns/day

—
Gromacs 2018 rc1 (using gcc 4.8.5)
—

Using AVX_256

In ‘classic’ mode, not using the GPU for PME

8 tMPI ranks, 4 threads per rank, 2 GPU’s

12.7 ns/day (modest speed up from 12.4 ns/day with 2016.4)

Now use a GPU for PME

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on

used 1 tMPI rank, 32 OpenMP threads, 1 GPU

14.9 ns/day

Forcing the program to use both GPU’s

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
-npme 1 -gputasks 0011 -nb gpu

18.5 ns/day

Now with AVX2_128

19.0 ns/day

Now force Dynamic Load Balancing

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
-npme 1 -gputasks 0011 -nb gpu -dlb yes

20.1 ns/day

Now use more (8) tMPI ranks

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 8
-npme 1 -gputasks -nb gpu -dlb yes

20.7 ns/day

And finally, using 2018 (AVX2_128) with the above command line

20.9 ns/day

Here are the final lines from the log file:

Dynamic load balancing report:
 DLB was permanently on during the run per user request.
 Average load imbalance: 7.7%.
 The balanceable part of the MD step is 51%, load imbalance is computed
 from this.
 Part of the total run time spent waiting due to load imbalance: 3.9%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds:
 X 0 %
 Average PME mesh/force load: 1.275
 Part of the total run time spent waiting due to PP/PME imbalance: 9.4 %

NOTE: 9.4 % performance was lost because the PME ranks had more work to do
      than the PP ranks. You might want to increase the number of PME ranks
      or increase the cut-off and the grid spacing.

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 7 MPI ranks doing PP, each using 4 OpenMP threads, and
on 1 MPI rank doing PME, using 4 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         7    4        500      13.721       1306.196   2.9
 DD comm. load          7    4        500       0.366         34.875   0.1
 DD comm. bounds        7    4
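For anyone repeating this kind of scan, the rank/thread/GPU-task layouts
compared throughout the thread can be benchmarked with short runs from the
same .tpr. A sketch, assuming the test.npt.tpr file from the thread; the
step count is illustrative, -resethway discards the first half of the run
(start-up and PME tuning) from the timings, and the 8-rank -gputasks string
is one possible mapping since the value used above is not shown:

  i=0
  for opts in \
      "-ntmpi 1 -ntomp 32" \
      "-ntmpi 4 -ntomp 8 -npme 1 -gputasks 0011" \
      "-ntmpi 8 -ntomp 4 -npme 1 -gputasks 00001111"
  do
      i=$((i+1))
      gmx mdrun -s test.npt.tpr -deffnm bench$i -nb gpu -pme gpu -pin on \
                -nsteps 20000 -resethway -noconfout $opts
  done
  grep -H "Performance:" bench*.log

The "Performance:" line in each log file gives the ns/day figure quoted in
the comparisons above.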