On 17/07/2012 7:06 PM, DeChang Li wrote:
Date: Tue, 17 Jul 2012 18:40:05 +1000
From: Mark Abraham <[email protected]>
Subject: Re: [gmx-users] why Blue Gene/Q is so slow?
To: Discussion list for GROMACS users <[email protected]>

On 17/07/2012 5:00 PM, DeChang Li wrote:
Dear all,

       I am running a 9000-atom system with GBSA (GROMACS 4.5.5) on a
Blue Gene/Q cluster. I get 1.002 ns/day with 8 cores.
However, on my own workstation with 8 cores the same system reaches
nearly 10 ns/day (Intel(R) Xeon(R) CPU E5620  @ 2.40GHz). Can anyone
tell me what's wrong with my simulation? Any suggestions will be
appreciated.
Your workstation is running highly optimized SSE loops.
BlueGene/Q is not using its multiple FPUs because that code hasn't been
written (for explicit or implicit solvation), and BlueGene's processors
are probably slower too.

Mark
Does that mean the code itself is the reason BlueGene/Q reaches only
about 10% of the speed of the Intel workstation?

You'd see a comparable decrease if you turned off the SSE optimization on your workstation, though perhaps not as severe. There's art and skill in making code run fast, and it's very rare that you don't need to target a specific architecture to achieve it.

Is there any way to improve the speed on BG/Q?

Write the optimized code ;-) Also, use more of the machine - you can probably get down to 500 atoms/core or below. There will be a limit beyond which it's impossible to go (or be effective). You can try simulating without cut-offs (see parts of manual 7.3 and mailing list discussions) which uses different all-vs-all inner loops, but your system might be too large for that to be useful.

Mark
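As a rough illustration of the "500 atoms/core" suggestion above, the short Python sketch below works out how many cores that target implies for the 9000-atom system. The function name `cores_for_target` is made up for this example, and the result is only an arithmetic estimate: it does not check whether mdrun can actually decompose the system over that many cores, or whether it still scales there.

```python
import math


def cores_for_target(n_atoms: int, atoms_per_core: int) -> int:
    """Smallest core count that puts at most atoms_per_core atoms on each core."""
    return math.ceil(n_atoms / atoms_per_core)


n_atoms = 9000       # system size from the original post
current_cores = 8    # cores used in the reported run

# Atoms per core in the reported run.
print(n_atoms / current_cores)

# Cores implied by the ~500 atoms/core rule of thumb (hypothetical target).
print(cores_for_target(n_atoms, 500))
```

With 8 cores the run sits at over 1000 atoms per core, so there is room to try roughly twice as many cores before reaching the suggested density.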



Dechang




Following is my md.mdp file:

constraints            = hbonds
constraint_algorithm   = LINCS
lincs_order            = 4
comm_mode              = Angular
comm_grps              = system
integrator             = sd
;annealing           = single single
;annealing_npoints   = 2 2
;annealing_time      = 0 500 0 500
;annealing_temp      = 200 300 200 300
dt                     = 0.002 ; ps !
nsteps                 = 5000000 ; total 10000 ps (10 ns) at dt = 0.002 ps
nstcomm                = 10
nstcalcenergy           = 10
nstxout                = 10000 ; collect data every 20 ps
nstenergy              = 10000
nstvout                = 10000
nstlog                 = 1000
;nstxtcout              = 50000
;xtc_grps               = system
nstfout                = 0
nstlist                = 10
ns_type                = grid
pbc                    = no
rlist                  = 1.2
coulombtype            = cut-off
rcoulomb               = 1.2
rvdw                   = 1.2
fourierspacing         = 0.12
fourier_nx             = 0
fourier_ny             = 0
fourier_nz             = 0
pme_order              = 4
ewald_rtol             = 1e-5
optimize_fft           = yes
;energygrps             = alpha1 alpha2 alpha3 beta1 beta2 beta3 gamma
;DispCorr               = EnerPres
; Tcoupl is left empty: the sd integrator controls the temperature itself, using tau_t and ref_t for one group
Tcoupl                 =
tau_t                  = 0.5
tc-grps                = system
ref_t                  = 300
; Pressure coupling is off
Pcoupl                 = no ;berendsen
tau_p                  = 1.0
compressibility        = 4.5e-5
ref_p                  = 1.0
; Generate velocities is on at 300 K.
gen_vel                = yes
gen_temp               = 300
gen_seed               = -1

implicit_solvent       = GBSA
gb_algorithm           = OBC
rgbradii               = 1.2
sa_surface_tension     = 2.25936



Here is the performance info:

          M E G A - F L O P S   A C C O U N T I N G

     RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
     T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
     NF=No Forces

   Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
   Generalized Born Coulomb                61.482892        2951.179     0.4
   GB Coulomb + LJ                       2565.481100      156494.347    19.4
   Outer nonbonded loop                   152.268546        1522.685     0.2
   1,4 nonbonded interactions             116.143224       10452.890     1.3
   Born radii (HCT/OBC)                  2868.222234      524884.669    64.9
   Born force chain rule                 2868.222234       43023.334     5.3
   NS-Pairs                               516.814696       10853.109     1.3
   Reset In Box                             4.464788          13.394     0.0
   CG-CoM                                   4.482576          13.448     0.0
   Bonds                                   22.174434        1308.292     0.2
   Angles                                  80.586114       13538.467     1.7
   Propers                                160.742142       36809.951     4.6
   Virial                                   4.636254          83.453     0.0
   Update                                  44.478894        1378.846     0.2
   Stop-CM                                  4.455894          44.559     0.0
   Calc-Ekin                               44.487788        1201.170     0.1
   Lincs                                   44.951630        2697.098     0.3
   Lincs-Mat                              261.822552        1047.290     0.1
   Constraint-V                            44.951630         359.613     0.0
   Constraint-Vir                           2.251163          54.028     0.0
-----------------------------------------------------------------------------
   Total                                                  808731.820   100.0
-----------------------------------------------------------------------------
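To see where the floating-point work goes, the sketch below adds up the Generalized Born rows of the table above and divides by the reported total. The numbers are copied from the table; grouping these four rows as "GB-related" is an assumption made for this illustration, not something the log states.

```python
# M-Flops values copied from the accounting table above.
gb_mflops = {
    "Generalized Born Coulomb": 2951.179,
    "GB Coulomb + LJ": 156494.347,
    "Born radii (HCT/OBC)": 524884.669,
    "Born force chain rule": 43023.334,
}
total_mflops = 808731.820  # "Total" row of the same table


def share(part: float, total: float) -> float:
    """Fraction of the total flop count taken by one contribution."""
    return part / total


born_radii_share = share(gb_mflops["Born radii (HCT/OBC)"], total_mflops)
gb_share = share(sum(gb_mflops.values()), total_mflops)

print(f"Born radii alone: {born_radii_share:.1%}")
print(f"All GB-related rows: {gb_share:.1%}")
```

The Born radii calculation alone accounts for roughly two thirds of the flops, and the GB-related rows together for about 90%, so these are the loops whose lack of architecture-specific optimization matters most here.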


      D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

   av. #atoms communicated per step for force:  2 x 660.5
   av. #atoms communicated per step for LINCS:  2 x 34.3

   Average load imbalance: 1.7 %
   Part of the total run time spent waiting due to load imbalance: 1.4 %


       R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

   Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
   Domain decomp.         8        502       59.421       37.1     0.5
   DD comm. load          8          8        0.004        0.0     0.0
   Comm. coord.           8       5001       16.575       10.4     0.2
   Neighbor search        8        502      136.093       85.1     1.2
   Force                  8       5001     9744.582     6090.7    88.3
   Wait + Comm. F         8       5001       90.905       56.8     0.8
   Write traj.            8          2        0.954        0.6     0.0
   Update                 8       5001       72.936       45.6     0.7
   Constraints            8      10002      171.445      107.2     1.6
   Comm. energies         8        502       10.427        6.5     0.1
   Rest                   8                 732.742      458.0     6.6
-----------------------------------------------------------------------
   Total                  8               11036.086     6897.9   100.0
-----------------------------------------------------------------------

          Parallel run - timing based on wallclock.

                 NODE (s)   Real (s)      (%)
         Time:    862.243    862.243    100.0
                         14:22
                 (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:      3.047    937.940      1.002     23.946
Finished mdrun on node 0 Tue Jul 17 16:06:48 2012
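As a sanity check on the numbers above, the reported performance can be reproduced from the step count, the time step and the wall-clock time. This is a minimal sketch: the step count of 5001 is taken from the "Force" row of the cycle accounting, the function names are invented for the example, and the workstation figure is the "nearly 10 ns/day" quoted in the original post.

```python
def ns_per_day(n_steps: int, dt_ps: float, wall_seconds: float) -> float:
    """Simulated nanoseconds per day of wall-clock time."""
    simulated_ns = n_steps * dt_ps / 1000.0
    return simulated_ns * 86400.0 / wall_seconds


def hours_per_ns(n_steps: int, dt_ps: float, wall_seconds: float) -> float:
    """Wall-clock hours needed to simulate one nanosecond."""
    return 24.0 / ns_per_day(n_steps, dt_ps, wall_seconds)


steps = 5001            # "Force" row of the cycle accounting (assumed step count)
dt = 0.002              # ps, from md.mdp
wall = 862.243          # s, "Real (s)" in the log

bgq = ns_per_day(steps, dt, wall)
workstation = 10.0      # ns/day, approximate figure from the original post

print(round(bgq, 3), round(hours_per_ns(steps, dt, wall), 3))
print(round(bgq / workstation, 2))  # BG/Q speed relative to the workstation
```

The result agrees with the 1.002 ns/day and 23.946 hour/ns in the log, and the ratio to the workstation is about one tenth, which is the "10%" discussed in the thread.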


