Mark, I may have misread the ppt on optimization, but I did experiment with variations of -ntomp and -ntmpi, so the run using fewer than six threads was a 2 x 3 combination. Tonight I will put both cards in the i7-7700.
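For reference, the sweep amounts to something like this (a sketch only; it just prints one candidate mdrun command per rank/thread split of the 12 logical cores, assuming the two GPUs with ids 0 and 1 as on this box - pipe the output to sh to actually run them):

```shell
#!/bin/sh
# Sketch: enumerate every -ntmpi x -ntomp split of 12 logical cores.
# Commands are only printed, not executed; each run would leave its own
# .log file with the performance table at the end for comparison.
total=12
for ntmpi in 1 2 3 4 6 12; do
  ntomp=$((total / ntmpi))   # threads per rank so that ntmpi * ntomp = 12
  cmd="gmx mdrun -deffnm SR.sys.nvt -ntmpi $ntmpi -ntomp $ntomp -gpu_id 01 -pin on"
  echo "$cmd"
done
```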
======================== this is the last part of the log from a 2 GPU setup ========================
using: gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 6 -gpu_id 1 -pin on
Run on the i7-970 CPU

NOTE: DLB can now turn on, when beneficial

	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 2401 steps using 25 frames

   Energies (kJ/mol)
           Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
     9.21440e+05    1.96052e+04    6.53857e+04    2.23128e+02    8.65164e+04
      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    -2.84582e+07   -1.44895e+05   -2.04658e+03    1.34455e+07    5.03949e+04
  Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
     3.44645e+01   -1.40160e+07    1.91196e+05   -1.38249e+07    3.04725e+02
  Pres. DC (bar) Pressure (bar)   Constr. rmsd
    -2.88685e+00    3.64550e+02    0.00000e+00

   Total Virial (kJ/mol)
    -8.80572e+04   -5.06693e+03    6.90580e+02
    -5.06777e+03   -6.31180e+04   -5.32400e+03
     6.90136e+02   -5.32396e+03   -5.27950e+04

   Pressure (bar)
     4.14166e+02    1.39915e+01   -1.79346e+00
     1.39938e+01    3.54006e+02    1.44453e+01
    -1.79223e+00    1.44452e+01    3.25476e+02

          T-PDMS         T-VMOS
     2.98272e+02    6.83205e+02

       P P   -   P M E   L O A D   B A L A N C I N G

 NOTE: The PP/PME load balancing was limited by the maximum allowed grid scaling,
       you might not have reached a good load balance.
 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                   PME
            rcoulomb  rlist           grid     spacing   1/beta
   initial  1.000 nm  1.000 nm   160 160 128   0.156 nm  0.320 nm
   final    1.628 nm  1.628 nm    96  96  80   0.260 nm  0.521 nm
 cost-ratio           4.31            0.23
 (note that these numbers concern only part of the total PP and PME load)

	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check             225.527520        2029.748     0.0
 NxN Ewald Elec. + LJ [F]            255071.893824    16834744.992    91.2
 NxN Ewald Elec. + LJ [V&F]            2710.128064      289983.703     1.6
 1,4 nonbonded interactions             432.540150       38928.613     0.2
 Calc Weights                           543.250260       19557.009     0.1
 Spread Q Bspline                     11589.338880       23178.678     0.1
 Gather F Bspline                     11589.338880       69536.033     0.4
 3D-FFT                              129115.579906     1032924.639     5.6
 Solve PME                               31.785216        2034.254     0.0
 Reset In Box                             1.885500           5.656     0.0
 CG-CoM                                   1.960920           5.883     0.0
 Angles                                 342.430620       57528.344     0.3
 Propers                                 72.102030       16511.365     0.1
 Impropers                                0.432180          89.893     0.0
 Pos. Restr.                              3.457440         172.872     0.0
 Virial                                   1.887750          33.979     0.0
 Update                                 181.083420        5613.586     0.0
 Stop-CM                                  1.960920          19.609     0.0
 Calc-Ekin                                3.771000         101.817     0.0
 Lincs                                  375.988360       22559.302     0.1
 Lincs-Mat                             8530.590144       34122.361     0.2
 Constraint-V                           751.820250        6014.562     0.0
 Constraint-Vir                           1.956622          46.959     0.0
-----------------------------------------------------------------------------
 Total                                             18455743.858   100.0
-----------------------------------------------------------------------------

    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 6018.1
 av. #atoms communicated per step for LINCS:  2 x 3015.7

 Dynamic load balancing report:
 DLB was off during the run due to low measured imbalance.
 Average load imbalance: 0.9%.
 The balanceable part of the MD step is 47%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 0.4%.

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 On 2 MPI ranks, each using 6 OpenMP threads

 Computing:             Num   Num     Call    Wall time      Giga-Cycles
                        Ranks Threads Count     (s)        total sum    %
-----------------------------------------------------------------------------
 Domain decomp.           2     6       25      0.627         24.367   0.8
 DD comm. load            2     6        2      0.000          0.004   0.0
 Neighbor search          2     6       25      0.160          6.206   0.2
 Launch GPU ops.          2     6     4802      0.516         20.048   0.7
 Comm. coord.             2     6     2376      0.272         10.563   0.4
 Force                    2     6     2401      3.714        144.331   4.9
 Wait + Comm. F           2     6     2401      0.210          8.173   0.3
 PME mesh                 2     6     2401     49.851       1937.315  66.2
 Wait GPU NB nonloc.      2     6     2401      0.056          2.157   0.1
 Wait GPU NB local        2     6     2401      0.033          1.285   0.0
 NB X/F buffer ops.       2     6     9554      0.641         24.920   0.9
 Write traj.              2     6        2      0.040          1.559   0.1
 Update                   2     6     4802      1.690         65.662   2.2
 Constraints              2     6     4802     10.001        388.661  13.3
 Comm. energies           2     6       25      0.003          0.107   0.0
 Rest                                           7.511        291.885  10.0
-----------------------------------------------------------------------------
 Total                                         75.323       2927.243 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F          2     6     4802      2.694        104.683   3.6
 PME spread               2     6     2401     10.619        412.680  14.1
 PME gather               2     6     2401      9.157        355.857  12.2
 PME 3D-FFT               2     6     4802     21.805        847.398  28.9
 PME 3D-FFT Comm.         2     6     4802      4.471        173.761   5.9
 PME solve Elec           2     6     2401      1.067         41.480   1.4
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:      903.878       75.323     1200.0
                 (ns/day)    (hour/ns)
Performance:        2.754        8.714
Finished mdrun on rank 0 Sun Dec 9 20:36:30 2018

===============================================================================
============ end of log from a 1 GPU setup ============
using: gmx mdrun -deffnm SR.sys.nvt -ntmpi 1 -ntomp 12 -gpu_id 01 -pin on
Intel i7-970

step 1200: timed with pme grid 112 108 96, coulomb cutoff 1.395: 3000.4 M-cycles
           Step           Time
           1200        1.20000

Writing checkpoint, step 1200 at Sun Dec 9 20:27:47 2018

   Energies (kJ/mol)
           Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
     9.21561e+05    1.42782e+04    6.60879e+04    2.04484e+02    8.39065e+04
      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    -2.84398e+07   -1.44481e+05   -2.04658e+03    1.34476e+07    3.82740e+04
  Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
     3.92568e+01   -1.40143e+07    1.86727e+05   -1.38276e+07    2.97602e+02
  Pres. DC (bar) Pressure (bar)   Constr. rmsd
    -2.88685e+00    1.92481e+01    0.00000e+00

	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 1201 steps using 13 frames

   Energies (kJ/mol)
           Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
     9.24025e+05    2.25759e+04    6.46951e+04    2.25055e+02    8.86630e+04
      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    -2.84705e+07   -1.45696e+05   -2.04658e+03    1.34231e+07    7.81266e+04
  Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
     2.47925e+01   -1.40168e+07    1.93813e+05   -1.38230e+07    3.08896e+02
  Pres. DC (bar) Pressure (bar)   Constr. rmsd
    -2.88685e+00    6.63095e+02    0.00000e+00

   Total Virial (kJ/mol)
    -2.04748e+05   -1.20971e+04    1.35853e+02
    -1.20969e+04   -1.60243e+05   -1.17082e+04
     1.35807e+02   -1.17081e+04   -1.59982e+05

   Pressure (bar)
     7.39235e+02    3.34709e+01    3.22280e-02
     3.34703e+01    6.25486e+02    3.18788e+01
     3.23543e-02    3.18787e+01    6.24566e+02

          T-PDMS         T-VMOS
     2.96678e+02    1.02554e+03

       P P   -   P M E   L O A D   B A L A N C I N G

 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                   PME
            rcoulomb  rlist           grid     spacing   1/beta
   initial  1.000 nm  1.000 nm   160 160 128   0.156 nm  0.320 nm
   final    1.389 nm  1.389 nm   120 108  96   0.222 nm  0.445 nm
 cost-ratio           2.68            0.38
 (note that these numbers concern only part of the total PP and PME load)

	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check             113.300752        1019.707     0.0
 NxN Ewald Elec. + LJ [F]            100343.174976     6622649.548    96.9
 NxN Ewald Elec. + LJ [V&F]            1114.688448      119271.664     1.7
 1,4 nonbonded interactions             216.360150       19472.413     0.3
 Shift-X                                  0.980460           5.883     0.0
 Angles                                 171.286620       28776.152     0.4
 Propers                                 36.066030        8259.121     0.1
 Impropers                                0.216180          44.965     0.0
 Pos. Restr.                              1.729440          86.472     0.0
 Virial                                   0.981045          17.659     0.0
 Update                                  90.579420        2807.962     0.0
 Stop-CM                                  1.055880          10.559     0.0
 Calc-Ekin                                1.960920          52.945     0.0
 Lincs                                  181.093320       10865.599     0.2
 Lincs-Mat                             4114.301760       16457.207     0.2
 Constraint-V                           362.035980        2896.288     0.0
 Constraint-Vir                           0.979290          23.503     0.0
-----------------------------------------------------------------------------
 Total                                              6832717.647   100.0
-----------------------------------------------------------------------------

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 On 1 MPI rank, each using 12 OpenMP threads

 Computing:             Num   Num     Call    Wall time      Giga-Cycles
                        Ranks Threads Count     (s)        total sum    %
-----------------------------------------------------------------------------
 Neighbor search          1    12       13      0.163          6.350   1.0
 Launch GPU ops.          1    12     2402      0.326         12.683   1.9
 Force                    1    12     1201      1.813         70.465  10.6
 Wait PME GPU gather      1    12     1201      0.936         36.381   5.5
 Reduce GPU PME F         1    12     1201      0.300         11.659   1.8
 Wait GPU NB local        1    12     1201      2.156         83.786  12.6
 NB X/F buffer ops.       1    12     2389      0.462         17.965   2.7
 Write traj.              1    12        2      0.076          2.952   0.4
 Update                   1    12     2402      0.822         31.959   4.8
 Constraints              1    12     2402      3.896        151.425  22.8
 Rest                                           6.140        238.626  35.9
-----------------------------------------------------------------------------
 Total                                         17.092        664.251 100.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:      205.106       17.092     1200.0
                 (ns/day)    (hour/ns)
Performance:        6.071        3.953
Finished mdrun on rank 0 Sun Dec 9 20:27:48 2018
========================================================================

I'll put the two cards in an i7-7700 and report later tonight.
Paul

On Dec 10 2018, at 3:53 pm, Mark Abraham <mark.j.abra...@gmail.com> wrote:
> Hi,
>
> One of your reported runs only used six threads, by the way.
> Something sensible can be said when the performance report at the end of
> the log file can be seen.
>
> Mark
>
> On Tue., 11 Dec. 2018, 01:25 p buscemi, <pbusc...@q.com> wrote:
> > Thank you, Mark, for the prompt response.
> > I realize the limitations of the system (it's over 8 years old), but I
> > did not expect the speed to decrease by 50% with 12 available threads!
> > No combination of -ntomp, -ntmpi could raise ns/day above 4 with two
> > GPUs, vs 6 with one GPU.
> >
> > This is actually a learning/practice run for a new build - an AMD 4.2 GHz
> > 32-core TR, 64 GB RAM. In this case I am trying to decide on either an
> > RTX 2080 Ti or two GTX 1080 Tis. I'd prefer the two 1080s for their 7000
> > cores vs the 4500 cores of the 2080. The model systems will have ~ a
> > million particles and need the speed. But this is a major expense, so I
> > need to get it right.
> > I'll do as you suggest and report the results for both systems, and I
> > really appreciate the assist.
> > Paul
> > UMN, BICB
> >
> > On Dec 9 2018, at 4:32 pm, paul buscemi <pbusc...@q.com> wrote:
> > >
> > > Dear Users,
> > > I have good luck using a single GPU with the basic setup. However, in
> > > going from one GTX 1060 to a system with two - 50,000 atoms - the rate
> > > decreased from 10 ns/day to 5 or worse. The system models a ligand,
> > > solvent (water), and a lipid membrane.
> > > The CPU is a 6-core Intel i7-970 (12 threads), 750 W PSU, 16 GB RAM.
> > > With the basic "mdrun" command I get:
> > >
> > > Back Off! I just backed up sys.nvt.log to ./#.sys.nvt.log.10#
> > > Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
> > > Changing nstlist from 10 to 100, rlist from 1 to 1
> > >
> > > Using 2 MPI threads
> > > Using 6 OpenMP threads per tMPI thread
> > >
> > > On host I7 2 GPUs auto-selected for this run.
> > > Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
> > > PP:0,PP:1
> > >
> > > Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.10#
> > > Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.10#
> > > NOTE: DLB will not turn on during the first phase of PME tuning
> > > starting mdrun 'SR-TA'
> > > 100000 steps, 100.0 ps.
> > > and ending with ^C
> > >
> > > Received the INT signal, stopping within 200 steps
> > >
> > > Dynamic load balancing report:
> > > DLB was locked at the end of the run due to unfinished PP-PME balancing.
> > > Average load imbalance: 0.7%.
> > > The balanceable part of the MD step is 46%, load imbalance is computed
> > > from this.
> > > Part of the total run time spent waiting due to load imbalance: 0.3%.
> > >
> > >                Core t (s)   Wall t (s)        (%)
> > >        Time:      543.475       45.290     1200.0
> > >                  (ns/day)    (hour/ns)
> > > Performance:        1.719       13.963      (before DLB is turned on)
> > >
> > > Very poor performance. I have been following - or trying to follow -
> > > "Performance Tuning and Optimization for GROMACS" by M. Abraham and
> > > R. Apostolov (2016), but have not yet cracked the code.
> > > ----------------
> > > gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 3 -gpu_id 01 -pin on
> > >
> > > Back Off! I just backed up SR.sys.nvt.log to ./#SR.sys.nvt.log.13#
> > > Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
> > > Changing nstlist from 10 to 100, rlist from 1 to 1
> > >
> > > Using 2 MPI threads
> > > Using 3 OpenMP threads per tMPI thread
> > >
> > > On host I7 2 GPUs auto-selected for this run.
> > > Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
> > > PP:0,PP:1
> > >
> > > Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.13#
> > > Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.13#
> > > NOTE: DLB will not turn on during the first phase of PME tuning
> > > starting mdrun 'SR-TA'
> > > 100000 steps, 100.0 ps.
> > >
> > > NOTE: DLB can now turn on, when beneficial
> > > ^C
> > >
> > > Received the INT signal, stopping within 200 steps
> > >
> > > Dynamic load balancing report:
> > > DLB was off during the run due to low measured imbalance.
> > > Average load imbalance: 0.7%.
> > > The balanceable part of the MD step is 46%, load imbalance is computed
> > > from this.
> > > Part of the total run time spent waiting due to load imbalance: 0.3%.
> > >
> > >                Core t (s)   Wall t (s)        (%)
> > >        Time:      953.837      158.973      600.0
> > >                  (ns/day)    (hour/ns)
> > > Performance:        2.935        8.176
> > >
> > > ====================
> > > The beginning of the log file is:
> > > GROMACS version:    2018.3
> > > Precision:          single
> > > Memory model:       64 bit
> > > MPI library:        thread_mpi
> > > OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> > > GPU support:        CUDA
> > > SIMD instructions:  SSE4.1
> > > FFT library:        fftw-3.3.8-sse2
> > > RDTSCP usage:       enabled
> > > TNG support:        enabled
> > > Hwloc support:      disabled
> > > Tracing support:    disabled
> > > Built on:           2018-10-19 21:26:38
> > > Built by:           pb@Q4 [CMAKE]
> > > Build OS/arch:      Linux 4.15.0-20-generic x86_64
> > > Build CPU vendor:   Intel
> > > Build CPU brand:    Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz
> > > Build CPU family:   6  Model: 44  Stepping: 2
> > > Build CPU features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
> > > nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
> > > sse4.1 sse4.2 ssse3
> > > C compiler:         /usr/bin/gcc-6 GNU 6.4.0
> > > C compiler flags:   -msse4.1 -O3 -DNDEBUG -funroll-all-loops
> > > -fexcess-precision=fast
> > > C++ compiler:       /usr/bin/g++-6 GNU 6.4.0
> > > C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG -funroll-all-loops
> > > -fexcess-precision=fast
> > > CUDA compiler:      /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;
> > > Copyright (c) 2005-2017 NVIDIA Corporation;
> > > Built on Fri_Nov__3_21:07:56_CDT_2017;
> > > Cuda compilation tools, release 9.1, V9.1.85
> > > CUDA compiler flags:
> > > -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;;
> > > ;-msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> > > CUDA driver:        9.10
> > > CUDA runtime:       9.10
> > >
> > > Running on 1 node with total 12 cores, 12 logical cores, 2 compatible GPUs
> > > Hardware detected:
> > >   CPU info:
> > >     Vendor: Intel
> > >     Brand:  Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz
> > >     Family: 6  Model: 44  Stepping: 2
> > >     Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
> > >     nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
> > >     sse4.1 sse4.2 ssse3
> > >   Hardware topology: Only logical processor count
> > >   GPU info:
> > >     Number of GPUs detected: 2
> > >     #0: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat: compatible
> > >     #1: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat: compatible
> > >
> > > There were no errors encountered during the runs. Suggestions would be
> > > appreciated.
> > > Regards
> > > Paul
> >
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-requ...@gromacs.org.