I wonder whether my finding that -np 108 and -ntomp 2 works best comes from using -multi 6 with 8-CPU nodes. That level of parallelism may be necessary to trigger the automatic separation of PP and PME ranks. I'm not sure whether I tried -np 54 and -ntomp 4, which would probably also do it. I compared mostly at 196 CPUs, then found that going up to 216 was better than 196 with -ntomp 2, and that pure MPI (-ntomp 1) was considerably worse in both cases. Would people recommend going back to 196, which allows 4 whole nodes per replica, and playing with -npme and -ntomp_pme?
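As a side note, any of the -np/-ntomp combinations discussed here is only consistent when ranks times threads equals the CPU count you requested. A minimal shell sanity check (the `check_decomposition` helper is ours, not a GROMACS or MPI tool):

```shell
#!/bin/sh
# Verify that MPI ranks x OpenMP threads per rank covers the requested CPUs.
check_decomposition() {
    total=$1; np=$2; ntomp=$3
    if [ $((np * ntomp)) -eq "$total" ]; then
        echo "OK: -np $np -ntomp $ntomp covers $total CPUs"
    else
        echo "MISMATCH: -np $np -ntomp $ntomp gives $((np * ntomp)), not $total"
    fi
}

check_decomposition 216 108 2   # the setting that worked best above
check_decomposition 216 54 4    # the untried alternative
```

Running this prints an OK line for both decompositions, since 108 x 2 and 54 x 4 both equal 216.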
> Hi Thanh Le,
>
> Assuming all the nodes are the same (9 nodes with 12 CPUs), then you could
> try the following:
>
> mpirun -np 9 --map-by node mdrun -ntomp 12 ...
> mpirun -np 18 mdrun -ntomp 6 ...
> mpirun -np 54 mdrun -ntomp 2 ...
>
> Which of these works best will depend on your setup.
>
> Using the whole cluster for one job may not be the most efficient way. I
> found on our cluster that once I reach 216 CPUs (equivalent settings from
> the queuing system to -np 108 and -ntomp 2), I can't do better by adding
> more nodes (presumably because communication becomes an issue). In addition
> to running -multi or -multidir jobs, which takes the load off
> communication a bit, it may also be worth having separate jobs and using
> -pin on and -pinoffset.
>
> Best wishes
> James
>
>> Hi everyone,
>> I have a question concerning running GROMACS in parallel. I have read
>> http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-performance.html
>> but I still don't quite understand how to run it efficiently.
>> My GROMACS version is 4.5.4.
>> The cluster I am using has 108 CPUs total and 4 hosts up.
>> The node I am using:
>> Architecture:          x86_64
>> CPU op-mode(s):        32-bit, 64-bit
>> Byte Order:            Little Endian
>> CPU(s):                12
>> On-line CPU(s) list:   0-11
>> Thread(s) per core:    2
>> Core(s) per socket:    6
>> Socket(s):             1
>> NUMA node(s):          1
>> Vendor ID:             AuthenticAMD
>> CPU family:            21
>> Model:                 2
>> Stepping:              0
>> CPU MHz:               1400.000
>> BogoMIPS:              5200.57
>> Virtualization:        AMD-V
>> L1d cache:             16K
>> L1i cache:             64K
>> L2 cache:              2048K
>> L3 cache:              6144K
>> NUMA node0 CPU(s):     0-11
>> MPI is already installed. I also have permission to use the cluster as
>> much as I can.
>> My question is: how should I write my mdrun command to utilize all the
>> possible cores and nodes?
>> Thanks,
>> Thanh Le
>> --
>> Gromacs Users mailing list
>>
>> * Please search the archive at
>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>>
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send
>> a mail to gmx-users-requ...@gromacs.org.
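For the -pin on / -pinoffset suggestion in James's reply, the point is that jobs sharing a node must pin to disjoint CPU ranges. An illustrative sketch of picking non-overlapping offsets for equal-sized jobs on one of Thanh Le's 12-CPU nodes (the `pin_offsets` helper is ours, and the printed flags are only the pinning part of a real mdrun command line):

```shell
#!/bin/sh
# Print non-overlapping -pinoffset values for N equal jobs on one node.
pin_offsets() {
    cpus_per_node=$1; jobs=$2
    cpus_per_job=$((cpus_per_node / jobs))
    i=0
    while [ "$i" -lt "$jobs" ]; do
        offset=$((i * cpus_per_job))
        echo "job $((i + 1)): -pin on -pinoffset $offset  # CPUs $offset-$((offset + cpus_per_job - 1))"
        i=$((i + 1))
    done
}

# Two independent jobs sharing one 12-CPU node:
pin_offsets 12 2
```

This prints an offset of 0 for the first job and 6 for the second, so each job gets its own half of the node instead of both landing on the same cores.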