On 17/07/2012 7:06 PM, DeChang Li wrote:
Date: Tue, 17 Jul 2012 18:40:05 +1000
From: Mark Abraham <[email protected]>
Subject: Re: [gmx-users] why Blue Gene/Q is so slow?
To: Discussion list for GROMACS users <[email protected]>
On 17/07/2012 5:00 PM, DeChang Li wrote:
Dear all,
I am running a 9000-atom system with GBSA (Gromacs 4.5.5) on a
Blue Gene/Q cluster. I get 1.002 ns/day with 8 cores.
However, on my own workstation with 8 cores the same system reaches
nearly 10 ns/day (Intel(R) Xeon(R) CPU E5620 @ 2.40GHz). Can anyone
tell me what's wrong with my simulation? Any suggestions will be
appreciated.
Your workstation is running highly optimized SSE loops.
BlueGene/Q is not using its multiple FPUs because that code hasn't been
written (for explicit or implicit solvation), and BlueGene's processors
are probably slower too.
Mark
Does that mean the code itself is the reason BlueGene/Q reaches only
10% of the speed of the Intel workstation?
You'd see a comparable decrease if you turned off the SSE
optimization on your workstation, though perhaps not as severe. There's
art and skill in making code run fast, and it's very rare that you don't
need to target a specific architecture to achieve it.
Is there any way to improve
the speed on BG/Q?
Write the optimized code ;-) Also, use more of the machine - you can
probably get down to 500 atoms/core or below. There will be a limit
beyond which adding cores is impossible (or no longer effective). You
can try simulating without cut-offs (see parts of manual section 7.3 and
mailing list discussions), which uses different all-vs-all inner loops,
but your system might be too large for that to be useful.
Mark
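To put the "500 atoms/core" suggestion in numbers, here is a minimal sketch using the 9000 atoms and 8 cores from the original post. The function names are made up for illustration and are not part of GROMACS, and the arithmetic says nothing about how well the run would actually scale.

```python
import math


def atoms_per_core(n_atoms: int, n_cores: int) -> float:
    """Average number of atoms each core is responsible for."""
    return n_atoms / n_cores


def cores_for_target(n_atoms: int, target_atoms_per_core: int) -> int:
    """Smallest core count that brings the load down to the target."""
    return math.ceil(n_atoms / target_atoms_per_core)


# Values taken from the post: 9000 atoms on 8 cores.
current_load = atoms_per_core(9000, 8)
# Core count implied by the suggested 500 atoms/core.
suggested_cores = cores_for_target(9000, 500)

print(f"current load: {current_load:.0f} atoms/core")
print(f"cores needed for 500 atoms/core: {suggested_cores}")
```

So the 8-core run sits at roughly twice the suggested load per core. Whether the extra cores pay off would still have to be measured on the machine.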
Dechang
Following is my md.mdp file:
constraints = hbonds
constraint_algorithm = LINCS
lincs_order = 4
comm_mode = Angular
comm_grps = system
integrator = sd
;annealing = single single
;annealing_npoints = 2 2
;annealing_time = 0 500 0 500
;annealing_temp = 200 300 200 300
dt = 0.002 ; ps !
nsteps = 5000000 ; total 10000 ps (5000000 steps x 0.002 ps)
nstcomm = 10
nstcalcenergy = 10
nstxout = 10000 ; collect data every 20 ps (10000 steps x 0.002 ps)
nstenergy = 10000
nstvout = 10000
nstlog = 1000
;nstxtcout = 50000
;xtc_grps = system
nstfout = 0
nstlist = 10
ns_type = grid
pbc = no
rlist = 1.2
coulombtype = cut-off
rcoulomb = 1.2
rvdw = 1.2
fourierspacing = 0.12
fourier_nx = 0
fourier_ny = 0
fourier_nz = 0
pme_order = 4
ewald_rtol = 1e-5
optimize_fft = yes
;energygrps = alpha1 alpha2 alpha3 beta1 beta2 beta3 gamma
;DispCorr = EnerPres
; Temperature coupling settings, one group (Tcoupl left unset; integrator = sd)
Tcoupl =
tau_t = 0.5
tc-grps = system
ref_t = 300
; Pressure coupling is off
Pcoupl = no ;berendsen
tau_p = 1.0
compressibility = 4.5e-5
ref_p = 1.0
; Generate velocities at 300 K.
gen_vel = yes
gen_temp = 300
gen_seed = -1
implicit_solvent = GBSA
gb_algorithm = OBC
rgbradii = 1.2
sa_surface_tension = 2.25936
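The time-related settings above can be cross-checked with a few lines of arithmetic. This is a rough sketch, not a GROMACS tool: `parse_mdp` is a hypothetical helper that only handles simple `key = value ; comment` lines, and the key normalisation (lower case, `-` to `_`) is a convenience for this example.

```python
def parse_mdp(text: str) -> dict:
    """Parse simple 'key = value ; comment' lines into a dict of strings."""
    params = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue  # blank line or comment-only line
        key, value = line.split("=", 1)
        params[key.strip().replace("-", "_").lower()] = value.strip()
    return params


# Three lines copied from the md.mdp file in this post.
SNIPPET = """
dt = 0.002 ; ps !
nsteps = 5000000
nstxout = 10000
"""

p = parse_mdp(SNIPPET)
dt_ps = float(p["dt"])
total_ps = int(p["nsteps"]) * dt_ps           # simulated time of the full run
output_every_ps = int(p["nstxout"]) * dt_ps   # spacing between saved frames

print(f"total simulated time: {total_ps:.0f} ps")
print(f"coordinates written every {output_every_ps:.0f} ps")
```

With dt = 0.002 ps, 5000000 steps correspond to 10000 ps, and nstxout = 10000 gives one frame every 20 ps.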
Here is the performance info:
M E G A - F L O P S   A C C O U N T I N G
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
NF=No Forces
 Computing:                          M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Generalized Born Coulomb           61.482892        2951.179      0.4
 GB Coulomb + LJ                  2565.481100      156494.347     19.4
 Outer nonbonded loop              152.268546        1522.685      0.2
 1,4 nonbonded interactions        116.143224       10452.890      1.3
 Born radii (HCT/OBC)             2868.222234      524884.669     64.9
 Born force chain rule            2868.222234       43023.334      5.3
 NS-Pairs                          516.814696       10853.109      1.3
 Reset In Box                        4.464788          13.394      0.0
 CG-CoM                              4.482576          13.448      0.0
 Bonds                              22.174434        1308.292      0.2
 Angles                             80.586114       13538.467      1.7
 Propers                           160.742142       36809.951      4.6
 Virial                              4.636254          83.453      0.0
 Update                             44.478894        1378.846      0.2
 Stop-CM                             4.455894          44.559      0.0
 Calc-Ekin                          44.487788        1201.170      0.1
 Lincs                              44.951630        2697.098      0.3
 Lincs-Mat                         261.822552        1047.290      0.1
 Constraint-V                       44.951630         359.613      0.0
 Constraint-Vir                      2.251163          54.028      0.0
-----------------------------------------------------------------------------
 Total                                             808731.820    100.0
-----------------------------------------------------------------------------
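The flop table shows where the arithmetic goes. The sketch below sums the four Generalized Born rows (values copied from the table) to see what fraction of the counted flops they account for. Grouping these four rows as "GB-related" is my own reading of the row names, and flop counts are only a rough proxy for run time.

```python
# M-Flops per kernel, copied from the flop accounting table in this post.
GB_MFLOPS = {
    "Generalized Born Coulomb": 2951.179,
    "GB Coulomb + LJ": 156494.347,
    "Born radii (HCT/OBC)": 524884.669,
    "Born force chain rule": 43023.334,
}
TOTAL_MFLOPS = 808731.820  # "Total" row of the same table


def share(mflops: float, total: float = TOTAL_MFLOPS) -> float:
    """Fraction of all counted flops spent in the given amount."""
    return mflops / total


gb_share = share(sum(GB_MFLOPS.values()))
born_radii_share = share(GB_MFLOPS["Born radii (HCT/OBC)"])

print(f"GB-related kernels: {gb_share:.1%} of counted flops")
print(f"Born radii alone:   {born_radii_share:.1%} of counted flops")
```

Roughly nine tenths of the counted flops are in the GB kernels, which fits the point made earlier in the thread that the missing architecture-specific inner loops are what hurts.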
D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 660.5
av. #atoms communicated per step for LINCS: 2 x 34.3
Average load imbalance: 1.7 %
Part of the total run time spent waiting due to load imbalance: 1.4 %
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
 Computing:           Nodes    Number     G-Cycles    Seconds        %
-----------------------------------------------------------------------
 Domain decomp.           8       502       59.421       37.1      0.5
 DD comm. load            8         8        0.004        0.0      0.0
 Comm. coord.             8      5001       16.575       10.4      0.2
 Neighbor search          8       502      136.093       85.1      1.2
 Force                    8      5001     9744.582     6090.7     88.3
 Wait + Comm. F           8      5001       90.905       56.8      0.8
 Write traj.              8         2        0.954        0.6      0.0
 Update                   8      5001       72.936       45.6      0.7
 Constraints              8     10002      171.445      107.2      1.6
 Comm. energies           8       502       10.427        6.5      0.1
 Rest                     8                732.742      458.0      6.6
-----------------------------------------------------------------------
 Total                    8              11036.086     6897.9    100.0
-----------------------------------------------------------------------
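A few sanity checks can be read straight off the cycle table. This sketch uses only numbers printed in this post. The interpretation that the Seconds column is summed over the 8 ranks is an inference from the arithmetic (6897.9 s divided by 8 is close to the 862.243 s wallclock that mdrun reports), not something the log states.

```python
# Values copied from the cycle and time accounting table in this post.
N_RANKS = 8
FORCE_GCYCLES = 9744.582
TOTAL_GCYCLES = 11036.086
TOTAL_SECONDS = 6897.9  # appears to be summed over all ranks (assumption)

# Share of all cycles spent in the force calculation.
force_share = FORCE_GCYCLES / TOTAL_GCYCLES

# Giga-cycles per second gives the implied clock rate in GHz.
implied_clock_ghz = TOTAL_GCYCLES / TOTAL_SECONDS

# Dividing the summed seconds by the rank count should be close to
# the wallclock time of the run.
seconds_per_rank = TOTAL_SECONDS / N_RANKS

print(f"force share:       {force_share:.1%}")
print(f"implied clock:     {implied_clock_ghz:.2f} GHz")
print(f"seconds per rank:  {seconds_per_rank:.1f} s")
```

The force share matches the 88.3 % in the table, and the implied clock comes out near 1.6 GHz, which is the BlueGene/Q core clock. Communication and load imbalance are small, so the run is compute-bound in the force kernels rather than limited by parallel overhead.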
Parallel run - timing based on wallclock.
               NODE (s)   Real (s)        (%)
       Time:    862.243    862.243      100.0
                       14:22
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:      3.047    937.940      1.002     23.946
Finished mdrun on node 0 Tue Jul 17 16:06:48 2012
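The ns/day and hour/ns figures can be reproduced from the wallclock time. A minimal sketch, assuming that each of the 5001 Force calls in the cycle table is one MD step of dt = 0.002 ps (the value in the md.mdp file); the variable names are just for illustration.

```python
N_STEPS = 5001          # Force calls reported in the cycle table (assumed = MD steps)
DT_PS = 0.002           # time step from the md.mdp file, in ps
WALLCLOCK_S = 862.243   # "Real (s)" from the timing summary

simulated_ns = N_STEPS * DT_PS / 1000.0            # ps -> ns
ns_per_day = simulated_ns * 86400.0 / WALLCLOCK_S  # 86400 s in a day
hours_per_ns = (WALLCLOCK_S / 3600.0) / simulated_ns

print(f"simulated time: {simulated_ns * 1000:.3f} ps")
print(f"performance:    {ns_per_day:.3f} ns/day, {hours_per_ns:.3f} hour/ns")
```

Under that assumption both numbers agree with the log to three decimals, so the 1.002 ns/day figure is an honest measurement of a roughly 10 ps test run.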