Re: [gmx-users] Best performance with 0 core for PME calculation

Mark Abraham Fri, 09 Jan 2009 17:47:30 -0800

Nicolas wrote:

Hello,
I'm trying to benchmark Gromacs 4 on our cluster, but I don't completely understand the results I obtain. The system is a bilayer of 128 DOPC lipids hydrated by ~18800 SPC waters, for a total of ~70200 atoms. The box is 9.6 x 9.6 x 10.1 nm^3. I'm using the following parameters:
       * nstlist = 10
       * rlist = 1
       * Coulombtype = PME
       * rcoulomb = 1
       * fourier spacing = 0.12
       * vdwtype = Cutoff
       * rvdw = 1
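To get a feel for the reciprocal-space load these settings imply, here is a rough Python sketch of the PME grid size for this box. It assumes each grid dimension is ceil(box length / fourier spacing), rounded up to the next integer whose only prime factors are 2, 3, 5 and 7 (FFT-friendly sizes). That rounding rule is an approximation of my own, and grompp's actual choice of grid may differ.

```python
import math


def is_smooth(n, primes=(2, 3, 5, 7)):
    """True if n > 0 has no prime factors other than those in `primes`."""
    if n < 1:
        return False
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def pme_grid(box_nm, spacing_nm):
    """Estimate PME grid dimensions for a rectangular box.

    Assumption: each dimension is ceil(box/spacing), then bumped up to the
    next 2-3-5-7-smooth integer. grompp's real choice may differ.
    """
    dims = []
    for length in box_nm:
        # small tolerance so that e.g. 9.6/0.12 does not round up to 81
        n = math.ceil(length / spacing_nm - 1e-9)
        while not is_smooth(n):
            n += 1
        dims.append(n)
    return tuple(dims)


box = (9.6, 9.6, 10.1)  # nm, box size from the post
grid = pme_grid(box, 0.12)
print(grid, math.prod(grid))
```

Under that rounding assumption the 0.12 nm spacing gives an 80 x 80 x 90 grid, about 5.8e5 grid points.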

The cluster itself has 2 processors per node, connected by 100 MB/s Ethernet.

Ethernet and Gigabit Ethernet are not fast enough to get reasonable scaling. There've been quite a few posts on this topic in the last six months.

Hmm, I see you've corrected your post to refer to InfiniBand with four cores/node. That should be reasonable, I understand (but search the archive).

You should also check that your benchmark run is long enough that the time you measure isn't dominated by setup costs. Some years ago a non-MD sysadmin complained of poor scaling when he was testing over 10 or so MD steps!
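A back-of-the-envelope Python sketch of why very short benchmarks mislead: a fixed setup cost gets amortized over only a handful of steps. The time step, per-step cost and setup time below are hypothetical numbers chosen for illustration, not measurements from this thread.

```python
def ns_per_day(n_steps, dt_ps, wall_s):
    """Simulated nanoseconds per wall-clock day."""
    simulated_ns = n_steps * dt_ps / 1000.0
    return simulated_ns * 86400.0 / wall_s


# Hypothetical values, for illustration only:
DT_PS = 0.002    # 2 fs time step
STEP_S = 0.05    # wall-clock seconds per MD step once running
SETUP_S = 30.0   # fixed startup cost (reading input, decomposition setup, ...)

steady = ns_per_day(1, DT_PS, STEP_S)
for n_steps in (10, 1000, 10000, 100000):
    wall = SETUP_S + n_steps * STEP_S
    measured = ns_per_day(n_steps, DT_PS, wall)
    print(f"{n_steps:>7} steps: {measured:6.2f} ns/day (steady state {steady:.2f})")
```

With these made-up numbers a 10-step run reports under 0.1 ns/day, although the steady-state rate is about 3.5 ns/day; only the long runs approach the true throughput.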

I'm using mpiexec to run Gromacs. When I use -npme 2 -ddorder interleave, I get:
ncore   Perf (ns/day)   PME (%)
    1            0.00         0
    2            0.00         0
    3            0.00         0
    4            1.35        28
    5            1.84        31
    6            2.08        27
    8            2.09        21
   10            2.25        17
   12            2.02        15
   14            2.20        13
   16            2.04        11
   18            2.18        10
   20            2.29         9
So, above 6-8 cores, the PP nodes spend too much time waiting for the PME nodes, and performance reaches a plateau.

That's not surprising: the heuristic is that about a third to a quarter of the cores should be PME-only nodes. Of course, that depends on the relative size of the real- and reciprocal-space parts of the calculation.
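The quarter-to-a-third rule of thumb can be turned into a small Python helper that lists candidate -npme values for a given core count. The function name and the fallback for core counts where no integer fits the range are my own choices for illustration; the result is only a starting point for benchmarking. Note that a fixed -npme 2, as in the first table, is half of the cores at 4 cores but only a tenth at 20.

```python
def npme_candidates(ncores):
    """Candidate PME-only node counts k with ncores/4 <= k <= ncores/3.

    Integer comparisons avoid float rounding at the range ends. If no
    integer fits (e.g. 5 cores), return the two nearest neighbours.
    This encodes a rule of thumb, not a GROMACS rule.
    """
    if ncores < 2:
        raise ValueError("need at least one PP and one PME core")
    ks = [k for k in range(1, ncores) if 4 * k >= ncores and 3 * k <= ncores]
    if not ks:
        ks = sorted({max(1, ncores // 4), -(-ncores // 3)})  # floor(n/4), ceil(n/3)
    return ks


for n in (4, 5, 6, 8, 10, 12, 16, 20):
    print(n, npme_candidates(n))
```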

When I use -npme 0, I get:

ncore   Perf (ns/day)   PME (%)
    1            0.43        33
    2            0.92        34
    3            1.34        35
    4            1.69        36
    5            2.17        33
    6            2.56        32
    8            3.24        33
   10            3.84        34
   12            4.34        35
   14            5.05        32
   16            5.47        34
   18            5.54        37
   20            6.13        36
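To quantify the -npme 0 scaling, here is a Python sketch that computes speedup and parallel efficiency relative to the single-core run, using the ns/day values from the -npme 0 table (decimal commas written as points).

```python
# ns/day with -npme 0, from the table in the post
PERF_NPME0 = {1: 0.43, 2: 0.92, 3: 1.34, 4: 1.69, 5: 2.17, 6: 2.56, 8: 3.24,
              10: 3.84, 12: 4.34, 14: 5.05, 16: 5.47, 18: 5.54, 20: 6.13}


def speedup(n):
    """Throughput on n cores relative to 1 core."""
    return PERF_NPME0[n] / PERF_NPME0[1]


def efficiency(n):
    """Parallel efficiency: speedup divided by core count."""
    return speedup(n) / n


for n in PERF_NPME0:
    print(f"{n:>2} cores: speedup {speedup(n):5.2f}  efficiency {efficiency(n):6.1%}")
```

Efficiency stays above 70% all the way to 20 cores (6.13 / 0.43 / 20 is about 0.71), so the shared-duty runs have not yet hit a plateau in this range.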
I obtain much better performance when there are no PME-only nodes, while I was expecting the opposite. Does someone have an explanation for that? Does that mean domain decomposition is useless below a certain real-space cutoff? I'm quite confused.

The relevant observations are for 4, 5, 6 and 8 cores, for which shared-duty nodes are out-performing -npme 2. I think your observations support the conclusion that your network hardware is more limiting for PME-only nodes than for shared-duty nodes. They don't support the conclusion that DD is useless, since you haven't compared with particle decomposition (PD).
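That comparison can be made explicit with a short Python sketch: for the core counts singled out here, divide the -npme 0 throughput by the -npme 2 throughput, and note how many cores are left for real-space (PP) work when two are reserved for PME. The numbers are copied from the two tables in the post.

```python
# ns/day from the post, for the core counts where both runs exist
NPME0 = {4: 1.69, 5: 2.17, 6: 2.56, 8: 3.24}  # shared-duty nodes
NPME2 = {4: 1.35, 5: 1.84, 6: 2.08, 8: 2.09}  # -npme 2 -ddorder interleave


def throughput_ratio(n):
    """How many times faster -npme 0 was than -npme 2 on n cores."""
    return NPME0[n] / NPME2[n]


for n in sorted(NPME0):
    pp_cores = n - 2  # cores left for real-space (PP) work with -npme 2
    print(f"{n} cores ({pp_cores} PP + 2 PME): ratio {throughput_ratio(n):.2f}")
```

The ratios come out at roughly 1.25, 1.18, 1.23 and 1.55; the jump at 8 cores coincides with the -npme 2 runs stalling at about 2.1 ns/day.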

You can play with the PME parameters to shift more load into the real-space part; IIRC Carsten suggested a heuristic a few months back.
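One way to shift load into real space is to scale rcoulomb and the Fourier spacing by the same factor. The Python sketch below illustrates this under stated assumptions. It assumes the Ewald coefficient beta is chosen so that erfc(beta * rc) = ewald_rtol (default 1e-5). Scaling both parameters then keeps beta*rc and beta*spacing constant, so the accuracy balance between the two parts stays roughly the same. This is not necessarily the heuristic Carsten posted, and the helper names are hypothetical. In a real run, rlist and any other mdp constraints grompp enforces would also have to be adjusted consistently.

```python
import math


def ewald_beta(rc_nm, rtol=1e-5):
    """Ewald coefficient beta (1/nm) with erfc(beta * rc) = rtol (assumed criterion).

    Solved by bisection; erfc is monotonically decreasing for positive arguments.
    """
    lo, hi = 0.0, 1.0
    while math.erfc(hi * rc_nm) > rtol:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if math.erfc(mid * rc_nm) > rtol:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def scale_pme(rc_nm, spacing_nm, factor):
    """Hypothetical helper: scale cut-off and grid spacing by the same factor."""
    return rc_nm * factor, spacing_nm * factor


for f in (1.0, 1.1, 1.2):
    rc, sp = scale_pme(1.0, 0.12, f)
    # at constant density, real-space pairs scale ~ rc**3, grid points ~ 1/spacing**3
    print(f"rc={rc:.2f} nm  spacing={sp:.3f} nm  beta={ewald_beta(rc):.3f}/nm  "
          f"pairs x{f**3:.2f}  grid points x{1 / f**3:.2f}")
```

For rc = 1.0 nm this gives beta of about 3.12 nm^-1. Scaling by 1.2 raises the real-space pair count by about 1.73x and cuts the number of grid points to about 0.58x. Whether that helps depends on how the PP and PME timings in the log respond on your network.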


Mark
_______________________________________________
gmx-users mailing list    [email protected]
http://www.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/search before posting!

Please don't post (un)subscribe requests to the list. Use the www interface or send it to [email protected].

Can't post? Read http://www.gromacs.org/mailing_lists/users.php
