Carsten, thanks for the response.

My mistake: it was the GTX 980 from Fig. 3; I was recalling from memory.
I assume that similar results would be achieved with the 1060s.

No, I did not reset the counters. My results were a compilation of 4-5 runs,
each under slightly different conditions, on two computers, all with the same
disappointing outcome. Mark had asked for the log outputs, indicating that
some useful conclusions could be drawn from them.
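
For reference, a rerun with counter resetting could be set up roughly as in the sketch below. It only assembles the command lines and launches nothing. The flags used (-s, -ntmpi, -ntomp, -nsteps, -resethway, -noconfout) are gmx mdrun options; the file name topol.tpr, the step count of 20000, and the helper name build_mdrun_cmd are placeholders chosen for illustration, not values taken from the runs below.

```python
def build_mdrun_cmd(tpr, ntmpi, ntomp, nsteps=20000):
    """Assemble a gmx mdrun benchmark command line (hypothetical helper).

    -resethway resets the cycle counters halfway through the run, which is
    meant to keep start-up cost and early load balancing out of the
    reported ns/day. -noconfout skips writing the final configuration.
    """
    return [
        "gmx", "mdrun",
        "-s", tpr,                # placeholder run input file
        "-ntmpi", str(ntmpi),     # number of thread-MPI ranks
        "-ntomp", str(ntomp),     # OpenMP threads per rank
        "-nsteps", str(nsteps),   # assumed benchmark length
        "-resethway",
        "-noconfout",
    ]


# The three rank/thread splits from the logs (ranks x threads = 16).
for ntmpi, ntomp in [(1, 16), (2, 8), (4, 4)]:
    print(" ".join(build_mdrun_cmd("topol.tpr", ntmpi, ntomp)))
```

Running each split for the same number of steps would also make the ns/day figures easier to compare than the runs of different length shown below.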


>> Dear users (one more try),
>> I am trying to use 2 GPU cards to improve modeling speed. The computer
>> described in the log files is used to iron out models, and I am using it to
>> learn how to use two GPU cards before purchasing two new RTX 2080 Ti's. The
>> CPU is an 8-core, 16-thread AMD and the GPUs are two GTX 1060s; there are
>> 50000 atoms in the model.
>> With ntmpi:ntomp settings of 1:16, the performance is approx. 12-16 ns/day;
>> for any other setting, i.e. auto (4:4), 2:8, or any other combination
>> factoring to 16, it is ~6-8 ns/day. In other words, adding a card cuts
>> efficiency by half. The average load imbalance is at most 3.4% for the
>> multicard setup.
>> I am not at this point trying to maximize efficiency, but only to show some
>> improvement going from one to two cards. According to a 2015 paper from
>> the Gromacs group, “Best bang for your buck: GPU nodes for GROMACS
>> biomolecular simulations”, I should expect maybe (at best) 50%
>> improvement for 90k atoms (with 2x GTX 970).
> We did not benchmark GTX 970 in that publication.
> But from Table 6 you can see that we also had quite a few cases with our 80k
> benchmark where going from 1 to 2 GPUs, simulation speed did not increase
> much: e.g. for the E5-2670v2, going from one to two GTX 980 GPUs led to an
> increase of 10 percent.
> Did you use counter resetting for the benchmarks?
> Carsten
>> What bothers me in my initial attempts is that my simulations became slower 
>> by adding the second GPU - it is frustrating to say the least. It's like 
>> swimming backwards.
>> I know I am missing, at a minimum, the correct setup for mdrun, and
>> suggestions would be welcome.
>> The output from the last section of the log files is included below.
>> =========================== ntmpi 1 : ntomp 16 ==============================
>>      <======  ###############  ==>
>>      <====  A V E R A G E S  ====>
>>      <==  ###############  ======>
>>      Statistics over 29301 steps using 294 frames
>>  Energies (kJ/mol)
>>         Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
>>   9.17533e+05    2.27874e+04    6.64128e+04    2.31214e+02    8.34971e+04
>>    Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>>  -2.84567e+07   -1.43385e+05   -2.04658e+03    1.33320e+07    1.59914e+05
>> Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
>>   7.79893e+01   -1.40196e+07    1.88467e+05   -1.38312e+07    3.00376e+02
>> Pres. DC (bar) Pressure (bar)   Constr. rmsd
>>  -2.88685e+00    3.75436e+01    0.00000e+00
>>  Total Virial (kJ/mol)
>>   5.27555e+04   -4.87626e+02    1.86144e+02
>>  -4.87648e+02    4.04479e+04   -1.91959e+02
>>   1.86177e+02   -1.91957e+02    5.45671e+04
>>  Pressure (bar)
>>   2.22202e+01    1.27887e+00   -4.71738e-01
>>   1.27893e+00    6.48135e+01    5.12638e-01
>>  -4.71830e-01    5.12632e-01    2.55971e+01
>>        T-PDMS         T-VMOS
>>   2.99822e+02    3.32834e+02
>>      M E G A - F L O P S   A C C O U N T I N G
>> NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>> RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>> W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>> V&F=Potential and force  V=Potential only  F=Force only
>> Computing:                               M-Number         M-Flops  % Flops
>> -----------------------------------------------------------------------------
>> Pair Search distance check            2349.753264       21147.779     0.0
>> NxN Ewald Elec. + LJ [F]           1771584.591744   116924583.055    96.6
>> NxN Ewald Elec. + LJ [V&F]           17953.091840     1920980.827     1.6
>> 1,4 nonbonded interactions            5278.575150      475071.763     0.4
>> Shift-X                                 22.173480         133.041     0.0
>> Angles                                4178.908620      702056.648     0.6
>> Propers                                879.909030      201499.168     0.2
>> Impropers                                5.274180        1097.029     0.0
>> Pos. Restr.                             42.193440        2109.672     0.0
>> Virial                                  22.186710         399.361     0.0
>> Update                                2209.881420       68506.324     0.1
>> Stop-CM                                 22.248900         222.489     0.0
>> Calc-Ekin                               44.346960        1197.368     0.0
>> Lincs                                 4414.639320      264878.359     0.2
>> Lincs-Mat                           100297.229760      401188.919     0.3
>> Constraint-V                          8829.127980       70633.024     0.1
>> Constraint-Vir                          22.147020         531.528     0.0
>> -----------------------------------------------------------------------------
>> Total                                               121056236.355   100.0
>> -----------------------------------------------------------------------------
>>    R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>> On 1 MPI rank, each using 16 OpenMP threads
>> Computing:          Num   Num      Call    Wall time         Giga-Cycles
>>                    Ranks Threads  Count      (s)         total sum    %
>> -----------------------------------------------------------------------------
>> Neighbor search        1   16        294       2.191        129.485   1.0
>> Launch GPU ops.        1   16      58602       4.257        251.544   2.0
>> Force                  1   16      29301      23.769       1404.510  11.3
>> Wait PME GPU gather    1   16      29301      33.740       1993.695  16.0
>> Reduce GPU PME F       1   16      29301       7.244        428.079   3.4
>> Wait GPU NB local      1   16      29301      60.054       3548.612  28.5
>> NB X/F buffer ops.     1   16      58308       9.823        580.459   4.7
>> Write traj.            1   16          7       0.119          7.048   0.1
>> Update                 1   16      58602      11.089        655.275   5.3
>> Constraints            1   16      58602      40.378       2385.992  19.2
>> Rest                                          17.743       1048.462   8.4
>> -----------------------------------------------------------------------------
>> Total                                        210.408      12433.160 100.0
>> -----------------------------------------------------------------------------
>>              Core t (s)   Wall t (s)        (%)
>>      Time:     3366.529      210.408     1600.0
>>                (ns/day)    (hour/ns)
>> Performance:       12.032        1.995
>> Finished mdrun on rank 0 Mon Dec 10 17:17:04 2018
>> =========================== ntmpi and ntomp auto (4:4) ===========================
>>      <======  ###############  ==>
>>      <====  A V E R A G E S  ====>
>>      <==  ###############  ======>
>>      Statistics over 3301 steps using 34 frames
>>  Energies (kJ/mol)
>>         Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
>>   9.20586e+05    1.95534e+04    6.56058e+04    2.21093e+02    8.56673e+04
>>    Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>>  -2.84553e+07   -1.44595e+05   -2.04658e+03    1.34518e+07    4.26167e+04
>> Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
>>   3.83653e+01   -1.40159e+07    1.90353e+05   -1.38255e+07    3.03381e+02
>> Pres. DC (bar) Pressure (bar)   Constr. rmsd
>>  -2.88685e+00    2.72913e+02    0.00000e+00
>>  Total Virial (kJ/mol)
>>  -5.05948e+04   -3.29107e+03    4.84786e+02
>>  -3.29135e+03   -3.42006e+04   -3.32392e+03
>>   4.84606e+02   -3.32403e+03   -2.06849e+04
>>  Pressure (bar)
>>   3.09713e+02    8.98192e+00   -1.19828e+00
>>   8.98270e+00    2.73248e+02    8.99543e+00
>>  -1.19778e+00    8.99573e+00    2.35776e+02
>>        T-PDMS         T-VMOS
>>   2.98623e+02    5.82467e+02
>>      P P   -   P M E   L O A D   B A L A N C I N G
>> NOTE: The PP/PME load balancing was limited by the maximum allowed grid 
>> scaling,
>>      you might not have reached a good load balance.
>> PP/PME load balancing changed the cut-off and PME settings:
>>          particle-particle                    PME
>>           rcoulomb  rlist            grid      spacing   1/beta
>>  initial  1.000 nm  1.000 nm     160 160 128   0.156 nm  0.320 nm
>>  final    1.628 nm  1.628 nm      96  96  80   0.260 nm  0.521 nm
>> cost-ratio           4.31             0.23
>> (note that these numbers concern only part of the total PP and PME load)
>>      M E G A - F L O P S   A C C O U N T I N G
>> NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>> RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>> W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>> V&F=Potential and force  V=Potential only  F=Force only
>> Computing:                               M-Number         M-Flops  % Flops
>> -----------------------------------------------------------------------------
>> Pair Search distance check             285.793872        2572.145     0.0
>> NxN Ewald Elec. + LJ [F]            367351.034688    24245168.289    92.1
>> NxN Ewald Elec. + LJ [V&F]            3841.181056      411006.373     1.6
>> 1,4 nonbonded interactions             594.675150       53520.763     0.2
>> Calc Weights                           746.884260       26887.833     0.1
>> Spread Q Bspline                     15933.530880       31867.062     0.1
>> Gather F Bspline                     15933.530880       95601.185     0.4
>> 3D-FFT                              154983.295306     1239866.362     4.7
>> Solve PME                               40.079616        2565.095     0.0
>> Reset In Box                             2.564280           7.693     0.0
>> CG-CoM                                   2.639700           7.919     0.0
>> Angles                                 470.788620       79092.488     0.3
>> Propers                                 99.129030       22700.548     0.1
>> Impropers                                0.594180         123.589     0.0
>> Pos. Restr.                              4.753440         237.672     0.0
>> Virial                                   2.570400          46.267     0.0
>> Update                                 248.961420        7717.804     0.0
>> Stop-CM                                  2.639700          26.397     0.0
>> Calc-Ekin                                5.128560         138.471     0.0
>> Lincs                                  557.713246       33462.795     0.1
>> Lincs-Mat                            12624.363456       50497.454     0.2
>> Constraint-V                          1115.257670        8922.061     0.0
>> Constraint-Vir                           2.871389          68.913     0.0
>> -----------------------------------------------------------------------------
>> Total                                                26312105.181   100.0
>> -----------------------------------------------------------------------------
>>   D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>> av. #atoms communicated per step for force:  2 x 16748.9
>> av. #atoms communicated per step for LINCS:  2 x 9361.6
>> Dynamic load balancing report:
>> DLB was off during the run due to low measured imbalance.
>> Average load imbalance: 3.4%.
>> The balanceable part of the MD step is 46%, load imbalance is computed from 
>> this.
>> Part of the total run time spent waiting due to load imbalance: 1.6%.
>>    R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>> On 4 MPI ranks, each using 4 OpenMP threads
>> Computing:          Num   Num      Call    Wall time         Giga-Cycles
>>                    Ranks Threads  Count      (s)         total sum    %
>> -----------------------------------------------------------------------------
>> Domain decomp.         4    4         34       0.457         26.976   1.0
>> DD comm. load          4    4          2       0.000          0.008   0.0
>> Neighbor search        4    4         34       0.138          8.160   0.3
>> Launch GPU ops.        4    4       6602       0.441         26.070   0.9
>> Comm. coord.           4    4       3267       0.577         34.081   1.2
>> Force                  4    4       3301       2.298        135.761   4.9
>> Wait + Comm. F         4    4       3301       0.276         16.330   0.6
>> PME mesh               4    4       3301      25.822       1525.817  54.8
>> Wait GPU NB nonloc.    4    4       3301       0.132          7.819   0.3
>> Wait GPU NB local      4    4       3301       0.012          0.724   0.0
>> NB X/F buffer ops.     4    4      13136       0.471         27.822   1.0
>> Write traj.            4    4          2       0.014          0.839   0.0
>> Update                 4    4       6602       1.006         59.442   2.1
>> Constraints            4    4       6602       6.926        409.290  14.7
>> Comm. energies         4    4         34       0.009          0.524   0.0
>> Rest                                           8.548        505.108  18.1
>> -----------------------------------------------------------------------------
>> Total                                         47.127       2784.772 100.0
>> -----------------------------------------------------------------------------
>> Breakdown of PME mesh computation
>> -----------------------------------------------------------------------------
>> PME redist. X/F        4    4       6602       2.538        149.998   5.4
>> PME spread             4    4       3301       6.055        357.770  12.8
>> PME gather             4    4       3301       3.432        202.814   7.3
>> PME 3D-FFT             4    4       6602      10.559        623.925  22.4
>> PME 3D-FFT Comm.       4    4       6602       2.691        158.993   5.7
>> PME solve Elec         4    4       3301       0.521         30.805   1.1
>> -----------------------------------------------------------------------------
>>              Core t (s)   Wall t (s)        (%)
>>      Time:      754.033       47.127     1600.0
>>                (ns/day)    (hour/ns)
>> Performance:        6.052        3.966
>> Finished mdrun on rank 0 Mon Dec 10 17:10:34 2018
>> =========================== ntmpi 2 : ntomp 8 ===============================
>>      <======  ###############  ==>
>>      <====  A V E R A G E S  ====>
>>      <==  ###############  ======>
>>      Statistics over 11201 steps using 113 frames
>>  Energies (kJ/mol)
>>         Angle       G96Angle    Proper Dih.  Improper Dih.          LJ-14
>>   9.16403e+05    2.12953e+04    6.61725e+04    2.26296e+02    8.35215e+04
>>    Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>>  -2.84508e+07   -1.43740e+05   -2.04658e+03    1.34647e+07    2.76232e+04
>> Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
>>   5.93627e+01   -1.40166e+07    1.88847e+05   -1.38277e+07    3.00981e+02
>> Pres. DC (bar) Pressure (bar)   Constr. rmsd
>>  -2.88685e+00    8.53077e+01    0.00000e+00
>>  Total Virial (kJ/mol)
>>   3.15233e+04   -6.80636e+02    9.80007e+01
>>  -6.81075e+02    2.45640e+04   -1.40642e+03
>>   9.81033e+01   -1.40643e+03    4.02877e+04
>>  Pressure (bar)
>>   8.11163e+01    1.87348e+00   -2.03329e-01
>>   1.87469e+00    1.09211e+02    3.83468e+00
>>  -2.03613e-01    3.83470e+00    6.55961e+01
>>        T-PDMS         T-VMOS
>>   2.99551e+02    3.84895e+02
>>      P P   -   P M E   L O A D   B A L A N C I N G
>> NOTE: The PP/PME load balancing was limited by the maximum allowed grid 
>> scaling,
>>      you might not have reached a good load balance.
>> PP/PME load balancing changed the cut-off and PME settings:
>>          particle-particle                    PME
>>           rcoulomb  rlist            grid      spacing   1/beta
>>  initial  1.000 nm  1.000 nm     160 160 128   0.156 nm  0.320 nm
>>  final    1.628 nm  1.628 nm      96  96  80   0.260 nm  0.521 nm
>> cost-ratio           4.31             0.23
>> (note that these numbers concern only part of the total PP and PME load)
>>      M E G A - F L O P S   A C C O U N T I N G
>> NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>> RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>> W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>> V&F=Potential and force  V=Potential only  F=Force only
>> Computing:                               M-Number         M-Flops  % Flops
>> -----------------------------------------------------------------------------
>> Pair Search distance check            1057.319360        9515.874     0.0
>> NxN Ewald Elec. + LJ [F]           1410325.411968    93081477.190    93.9
>> NxN Ewald Elec. + LJ [V&F]           14378.367616     1538485.335     1.6
>> 1,4 nonbonded interactions            2017.860150      181607.413     0.2
>> Calc Weights                          2534.338260       91236.177     0.1
>> Spread Q Bspline                     54065.882880      108131.766     0.1
>> Gather F Bspline                     54065.882880      324395.297     0.3
>> 3D-FFT                              383450.341906     3067602.735     3.1
>> Solve PME                              113.199616        7244.775     0.0
>> Reset In Box                             8.522460          25.567     0.0
>> CG-CoM                                   8.597880          25.794     0.0
>> Angles                                1597.486620      268377.752     0.3
>> Propers                                336.366030       77027.821     0.1
>> Impropers                                2.016180         419.365     0.0
>> Pos. Restr.                             16.129440         806.472     0.0
>> Virial                                   8.532630         153.587     0.0
>> Update                                 844.779420       26188.162     0.0
>> Stop-CM                                  8.597880          85.979     0.0
>> Calc-Ekin                               17.044920         460.213     0.0
>> Lincs                                 1753.732822      105223.969     0.1
>> Lincs-Mat                            39788.083512      159152.334     0.2
>> Constraint-V                          3507.309174       28058.473     0.0
>> Constraint-Vir                           8.845375         212.289     0.0
>> -----------------------------------------------------------------------------
>> Total                                                99075914.342   100.0
>> -----------------------------------------------------------------------------
>>   D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>> av. #atoms communicated per step for force:  2 x 6810.8
>> av. #atoms communicated per step for LINCS:  2 x 3029.3
>> Dynamic load balancing report:
>> DLB was off during the run due to low measured imbalance.
>> Average load imbalance: 0.8%.
>> The balanceable part of the MD step is 46%, load imbalance is computed from 
>> this.
>> Part of the total run time spent waiting due to load imbalance: 0.4%.
>>    R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>> On 2 MPI ranks, each using 8 OpenMP threads
>> Computing:          Num   Num      Call    Wall time         Giga-Cycles
>>                    Ranks Threads  Count      (s)         total sum    %
>> -----------------------------------------------------------------------------
>> Domain decomp.         2    8        113       1.532         90.505   1.4
>> DD comm. load          2    8          4       0.000          0.027   0.0
>> Neighbor search        2    8        113       0.442         26.107   0.4
>> Launch GPU ops.        2    8      22402       1.230         72.668   1.1
>> Comm. coord.           2    8      11088       0.894         52.844   0.8
>> Force                  2    8      11201       8.166        482.534   7.5
>> Wait + Comm. F         2    8      11201       0.672         39.720   0.6
>> PME mesh               2    8      11201      61.637       3642.183  56.6
>> Wait GPU NB nonloc.    2    8      11201       0.342         20.205   0.3
>> Wait GPU NB local      2    8      11201       0.031          1.847   0.0
>> NB X/F buffer ops.     2    8      44578       1.793        105.947   1.6
>> Write traj.            2    8          4       0.040          2.386   0.0
>> Update                 2    8      22402       4.148        245.121   3.8
>> Constraints            2    8      22402      19.207       1134.940  17.6
>> Comm. energies         2    8        113       0.006          0.354   0.0
>> Rest                                           8.801        520.065   8.1
>> -----------------------------------------------------------------------------
>> Total                                        108.942       6437.452 100.0
>> -----------------------------------------------------------------------------
>> Breakdown of PME mesh computation
>> -----------------------------------------------------------------------------
>> PME redist. X/F        2    8      22402       4.992        294.991   4.6
>> PME spread             2    8      11201      16.979       1003.299  15.6
>> PME gather             2    8      11201      11.687        690.563  10.7
>> PME 3D-FFT             2    8      22402      21.648       1279.195  19.9
>> PME 3D-FFT Comm.       2    8      22402       4.985        294.567   4.6
>> PME solve Elec         2    8      11201       1.241         73.332   1.1
>> -----------------------------------------------------------------------------
>>              Core t (s)   Wall t (s)        (%)
>>      Time:     1743.073      108.942     1600.0
>>                (ns/day)    (hour/ns)
>> Performance:        8.883        2.702
>> Finished mdrun on rank 0 Mon Dec 10 17:01:45 2018
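
To put the three logs side by side, the ns/day figures can be pulled from the "Performance:" lines and expressed relative to the single-rank run. The sketch below is only an illustration: the function names and the run labels (1x16, 4x4, 2x8) are made up here, and the numbers are copied from the log excerpts above. The runs differ in length (29301, 3301 and 11201 steps) and were not counter-reset, so the ratios are a rough guide at best.

```python
import re

# Matches the "Performance:" line of an md.log, with or without
# e-mail quote markers (">>") in front of it.
PERF_RE = re.compile(r"^[>\s]*Performance:\s+([\d.]+)\s+([\d.]+)", re.MULTILINE)


def ns_per_day(log_text):
    """Return the ns/day value from the last Performance line, or None."""
    matches = PERF_RE.findall(log_text)
    return float(matches[-1][0]) if matches else None


def relative_speed(runs, baseline):
    """Map each run label to its ns/day divided by the baseline's ns/day."""
    ref = runs[baseline]
    return {label: value / ref for label, value in runs.items()}


# ns/day values copied from the three log excerpts above.
runs = {"1x16": 12.032, "4x4": 6.052, "2x8": 8.883}
for label, ratio in relative_speed(runs, "1x16").items():
    print(f"{label}: {ratio:.2f}x of the single-rank run")
```

One thing the time accounting tables suggest: the single-rank log lists "Wait PME GPU gather", whereas both multi-rank logs show a "PME mesh" row taking 54.8% and 56.6% of the wall time. This may mean that PME ran on the CPU in the multi-rank runs, which would be worth checking against the hardware and task assignment reported at the top of each log.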