Nicolas wrote:
Hello,

I'm trying to run a benchmark with Gromacs 4 on our cluster, but I don't completely understand the results I obtain. The system is a bilayer of 128 DOPC lipids hydrated by ~18800 SPC water molecules, for a total of ~70200 atoms. The size of the system is 9.6x9.6x10.1 nm^3. I'm using the following parameters:

       * nstlist = 10
       * rlist = 1
       * coulombtype = PME
       * rcoulomb = 1
       * fourierspacing = 0.12
       * vdwtype = Cutoff
       * rvdw = 1

The cluster itself has 2 procs/node, connected by 100 MB/s Ethernet.

Ethernet and Gigabit Ethernet are not fast enough to get reasonable scaling. There have been quite a few posts on this topic in the last six months.

Hmm, I see you've corrected your post to refer to Infiniband with four cores/node. As I understand it, that should scale reasonably (but do search the archive).

You should also check that your benchmark calculation is long enough that you are measuring a simulation time that isn't being dominated by setup costs. Some years ago a non-MD sysadmin complained of poor scaling when he was testing over 10 or so MD steps!
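To make the setup-cost point concrete, here is a minimal sketch of how a fixed startup overhead distorts the apparent performance of a very short run. All numbers are hypothetical (a 2 fs time step, 0.05 s per MD step, 10 s of setup); none of them come from the benchmark in this thread.

```python
def apparent_performance(nsteps, step_time_s, setup_time_s, dt_ps=0.002):
    """Return ns/day as seen by a wall-clock measurement that includes setup.

    dt_ps is the MD time step in ps (0.002 ps = 2 fs is an assumed value).
    """
    simulated_ns = nsteps * dt_ps / 1000.0
    wall_days = (setup_time_s + nsteps * step_time_s) / 86400.0
    return simulated_ns / wall_days


if __name__ == "__main__":
    # Hypothetical timings, chosen only to illustrate the effect.
    for nsteps in (10, 1000, 50000):
        perf = apparent_performance(nsteps, step_time_s=0.05, setup_time_s=10.0)
        print(f"{nsteps:6d} steps: {perf:.2f} ns/day")
```

With these assumed timings, the 10-step run reports only a small fraction of the performance that the 50000-step run reports, even though the cost per MD step is identical.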

I'm using mpiexec to run Gromacs. When I use -npme 2 -ddorder interleave, I get:

ncore   Perf (ns/day)   PME (%)
    1            0.00         0
    2            0.00         0
    3            0.00         0
    4            1.35        28
    5            1.84        31
    6            2.08        27
    8            2.09        21
   10            2.25        17
   12            2.02        15
   14            2.20        13
   16            2.04        11
   18            2.18        10
   20            2.29         9

So, above 6-8 cores, the PP nodes are spending too much time waiting for the PME nodes and the performance reaches a plateau.

That's not surprising - the heuristic is that about a quarter to a third of the cores should be PME-only nodes. With -npme 2, you drop below a quarter as soon as you go past 8 cores. Of course, the right share depends on the relative size of the real- and reciprocal-space parts of the calculation.
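As a rough check of that heuristic against the run above, the sketch below computes what share of the cores the two PME-only nodes represent at each core count, and how many PME-only nodes a quarter to a third would imply. The core counts are taken from the table; the rounding (floor of a quarter, ceiling of a third) is my own assumption, not a Gromacs rule.

```python
import math

# Core counts from the -npme 2 table in this thread.
CORE_COUNTS = [4, 5, 6, 8, 10, 12, 14, 16, 18, 20]
NPME = 2  # fixed number of PME-only nodes used in that run


def pme_fraction(ncore, npme=NPME):
    """Fraction of all cores that are PME-only."""
    return npme / ncore


def heuristic_npme_range(ncore):
    """PME-only node counts spanning a quarter to a third of the cores.

    Rounding outward (floor, ceil) is an assumption made for illustration.
    """
    return math.floor(ncore / 4), math.ceil(ncore / 3)


if __name__ == "__main__":
    for n in CORE_COUNTS:
        low, high = heuristic_npme_range(n)
        print(f"{n:3d} cores: -npme 2 is {pme_fraction(n):.0%} of the cores; "
              f"heuristic suggests {low} to {high}")
```

At 20 cores, two PME-only nodes are only 10% of the machine, so it is plausible that the PP nodes end up waiting for them.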

When I use -npme 0, I get:

ncore   Perf (ns/day)   PME (%)
    1            0.43        33
    2            0.92        34
    3            1.34        35
    4            1.69        36
    5            2.17        33
    6            2.56        32
    8            3.24        33
   10            3.84        34
   12            4.34        35
   14            5.05        32
   16            5.47        34
   18            5.54        37
   20            6.13        36

I obtain much better performance when there are no PME nodes, while I was expecting the opposite. Does someone have an explanation for that? Does that mean domain decomposition is useless below a certain real-space cutoff? I'm quite confused.

The relevant observations are those for 4, 5, 6 and 8 cores, for which shared-duty nodes are out-performing -npme 2. I think your observations support the conclusion that your network hardware is more limiting for PME-only nodes than for shared-duty nodes. They don't support the conclusion that domain decomposition (DD) is useless, since you haven't compared with particle decomposition (PD).
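For anyone who wants to put numbers on that comparison, here is a small sketch that uses only the ns/day values from the two tables in this thread. It computes the ratio of shared-duty to -npme 2 performance at each common core count, and the parallel efficiency relative to the single-core shared-duty run. The function names are mine, chosen for illustration.

```python
# Performance in ns/day, copied from the two tables in this thread.
PERF_NPME2 = {4: 1.35, 5: 1.84, 6: 2.08, 8: 2.09, 10: 2.25, 12: 2.02,
              14: 2.20, 16: 2.04, 18: 2.18, 20: 2.29}
PERF_SHARED = {1: 0.43, 2: 0.92, 3: 1.34, 4: 1.69, 5: 2.17, 6: 2.56,
               8: 3.24, 10: 3.84, 12: 4.34, 14: 5.05, 16: 5.47,
               18: 5.54, 20: 6.13}


def shared_over_dedicated(ncore):
    """Ratio of shared-duty (-npme 0) to -npme 2 performance."""
    return PERF_SHARED[ncore] / PERF_NPME2[ncore]


def parallel_efficiency(perf, ncore, baseline=PERF_SHARED[1]):
    """Measured performance divided by ideal linear scaling from one core."""
    return perf / (ncore * baseline)


if __name__ == "__main__":
    for n in sorted(PERF_NPME2):
        print(f"{n:3d} cores: shared/npme2 = {shared_over_dedicated(n):.2f}, "
              f"efficiency shared = {parallel_efficiency(PERF_SHARED[n], n):.2f}, "
              f"npme2 = {parallel_efficiency(PERF_NPME2[n], n):.2f}")
```

The ratio is already about 1.25 at 4 cores and grows with the core count, while the shared-duty runs still hold roughly 71% of ideal scaling at 20 cores.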

You can play with the PME parameters to shift more load into the real-space part (a longer rcoulomb together with a coarser fourierspacing) - IIRC Carsten suggested a heuristic a few months back.

Mark