On Wed, Sep 21, 2016 at 9:55 PM, <jkrie...@mrc-lmb.cam.ac.uk> wrote:
> Thanks Sz.
>
> Do you think going up from version 5.0.4 to 5.1.4 would really make
> such a big difference?

Note that I was recommending using a modern compiler + the latest
release (which is called 2016, not 5.1.4!). It's hard to guess the
improvements, but going from 5.0 to 2016 you should see double-digit
percentage improvements, and going from gcc 4.4 to 5.x or 6.0 will
also bring a significant improvement.

> Here is a log file from a single MD run (that has finished, unlike
> the metadynamics) with the number of OpenMP threads matching how many
> threads there are on each node. This has been restarted a number of
> times with different launch configurations, varying mostly the number
> of nodes and the node type (either 8 CPUs or 24 CPUs).
> https://www.dropbox.com/s/uxzsj3pm31n66nz/md.log?dl=0

You seem to be using a single MPI rank per node in these runs. That
will almost never be optimal, especially when DD is not a limiting
factor.
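As an illustration, on the 8-thread nodes some starting points worth
timing could look like the lines below. This is only a sketch: the
node and rank counts are illustrative, and "gmx_mpi" stands for
whatever your MPI-enabled mdrun binary is actually called.

    # 4 nodes x 8 hardware threads = 32 threads in total
    mpirun -np 32 gmx_mpi mdrun -ntomp 1 ...  # pure MPI, 8 ranks/node
    mpirun -np 16 gmx_mpi mdrun -ntomp 2 ...  # 4 ranks/node, 2 threads each
    mpirun -np 8 gmx_mpi mdrun -ntomp 4 ...   # 2 ranks/node, one per socket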
> From the timesteps at which checkpoints were written I can see that
> these configurations make quite a difference and, per CPU, having 8
> OpenMP threads per MPI process becomes a much worse idea stepping
> from 4 nodes to 6 nodes, i.e. having more CPUs makes mixed
> parallelism less favourable, as suggested in figure 8. Yes, the best
> may not lie at 1 OpenMP thread per MPI rank and may vary depending on
> the number of CPUs as well.

Sure, but 8 threads spanning two sockets will definitely be
suboptimal. Start by trying fewer, and consider using separate PME
ranks, especially if you have ethernet.
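A rough sketch of the separate-PME variant on the same 4 x 8-thread
nodes follows; note that the 3:1 PP:PME rank split here is only a
common starting point, not a tuned value, and gmx tune_pme can search
for a better one.

    # 12 PP ranks + 4 PME ranks, 2 OpenMP threads per rank
    mpirun -np 16 gmx_mpi mdrun -ntomp 2 -npme 4 ...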
> Also, I can see that for the same number of CPUs, the 24-thread nodes
> are better than the 8-thread nodes, but I can't get so many of them
> as they are also more popular with RELION users.

FYI those are 2x6-core CPUs with Hyper-Threading, so 2x12 hardware
threads. They are also two generations newer, so it's not surprising
that they are much faster. Still, 24 threads/node is too many; use
fewer.

> What can I infer from the information at the end?

Before starting to interpret that, it's worth fixing the above issues
;) Otherwise, what's clear is that PME is taking a considerable amount
of time, especially given the long cut-off.

Cheers,
--
Szilárd

> Best wishes
> James
>
>> Hi,
>>
>> On Wed, Sep 21, 2016 at 5:44 PM, <jkrie...@mrc-lmb.cam.ac.uk> wrote:
>>> Hi Szilárd,
>>>
>>> Yes I had looked at it but not with our cluster in mind. I now have
>>> a couple of GPU systems (both have an 8-core i7-4790K CPU, with one
>>> Titan X GPU on one system and two Titan X GPUs on the other), and
>>> have been thinking about getting the most out of them. I listened
>>> to Carsten's BioExcel webinar this morning and it got me thinking
>>> about the cluster as well. I've just had a quick look now and it
>>> suggests Nrank = Nc and Nth = 1 for high core counts, which I think
>>> worked slightly less well for me, but I can't find the details so I
>>> may be remembering wrong.
>>
>> That's not unexpected; the reported values are specific to the
>> hardware and benchmark systems and only give a rough idea of where
>> the ranks/threads balance should be.
>>
>>> I don't have log files from a systematic benchmark of our cluster
>>> as it isn't really available enough for doing that.
>>
>> That's not really necessary; even logs from a single production run
>> can hint at possible improvements.
>>
>>> I haven't tried gmx tune_pme on there either. I do have
>>> node-specific installations of gromacs-5.0.4 but I think they were
>>> done with gcc-4.4.7, so there's room for improvement there.
>>
>> If that's the case, I'd simply recommend using a modern compiler
>> and, if you can, a recent GROMACS version; you'll gain more
>> performance than from most launch config tuning.
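For reference, a minimal build sketch along those lines might look as
follows; the compiler names, versions and install path are
placeholders for whatever is available on your system.

    cmake .. -DCMAKE_C_COMPILER=gcc-5 -DCMAKE_CXX_COMPILER=g++-5 \
             -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON \
             -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2016
    make -j 8 && make install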
>>> The cluster nodes I have been using have the following CPU specs
>>> and 10Gb networking. It could be that using 2 OpenMP threads per
>>> MPI rank works nicely because it matches the CPU configuration and
>>> makes better use of hyperthreading.
>>
>> Or because of the network. Or for some other reason. Again,
>> comparing the runs' log files could tell more :)
>>
>>> Architecture:          x86_64
>>> CPU op-mode(s):        32-bit, 64-bit
>>> Byte Order:            Little Endian
>>> CPU(s):                8
>>> On-line CPU(s) list:   0-7
>>> Thread(s) per core:    2
>>> Core(s) per socket:    2
>>> Socket(s):             2
>>> NUMA node(s):          2
>>> Vendor ID:             GenuineIntel
>>> CPU family:            6
>>> Model:                 26
>>> Model name:            Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
>>> Stepping:              5
>>> CPU MHz:               2393.791
>>> BogoMIPS:              4787.24
>>> Virtualization:        VT-x
>>> L1d cache:             32K
>>> L1i cache:             32K
>>> L2 cache:              256K
>>> L3 cache:              8192K
>>> NUMA node0 CPU(s):     0,2,4,6
>>> NUMA node1 CPU(s):     1,3,5,7
>>>
>>> I appreciate that a lot is system-dependent and that I can't really
>>> help you help me very much. It should also be noted that my multi
>>> runs are multiple-walker metadynamics runs and are slowing down
>>> because there are large bias potentials in memory that need to be
>>> communicated around too. As I said, I haven't had a chance to make
>>> separate benchmark runs but have just made observations based upon
>>> existing runs.
>>
>> Understandable, I was just giving tips and hints.
>>
>> Cheers,
>> --
>> Sz.
>>
>>> Best wishes
>>> James
>>>
>>>> Performance tuning is highly dependent on the simulation system
>>>> and the hardware you're running on. Questions like the ones you
>>>> pose are impossible to answer meaningfully without *full* log
>>>> files (and hardware specs including network).
>>>>
>>>> Have you checked the performance checklist I linked above?
>>>> --
>>>> Szilárd
>>>>
>>>> On Wed, Sep 21, 2016 at 11:36 AM, <jkrie...@mrc-lmb.cam.ac.uk> wrote:
>>>>> I wonder whether the finding that -np 108 and -ntomp 2 is best
>>>>> comes from using -multi 6 with 8-CPU nodes. That level of
>>>>> parallelism may then be necessary to trigger automatic
>>>>> segregation of PP and PME ranks. I'm not sure if I tried -np 54
>>>>> and -ntomp 4, which would probably also do it. I compared mostly
>>>>> on 196 CPUs, then found going up to 216 was better than 196 with
>>>>> -ntomp 2, and pure MPI (-ntomp 1) was considerably worse for
>>>>> both. Would people recommend going back to 196, which allows 4
>>>>> whole nodes per replica, and playing with -npme and -ntomp_pme?
>>>>>
>>>>>> Hi Thanh Le,
>>>>>>
>>>>>> Assuming all the nodes are the same (9 nodes with 12 CPUs each),
>>>>>> you could try the following:
>>>>>>
>>>>>> mpirun -np 9 --map-by node mdrun -ntomp 12 ...
>>>>>> mpirun -np 18 mdrun -ntomp 6 ...
>>>>>> mpirun -np 54 mdrun -ntomp 2 ...
>>>>>>
>>>>>> Which of these works best will depend on your setup.
>>>>>>
>>>>>> Using the whole cluster for one job may not be the most
>>>>>> efficient way. I found on our cluster that once I reach 216 CPUs
>>>>>> (equivalent settings from the queuing system to -np 108 and
>>>>>> -ntomp 2), I can't do better by adding more nodes (where
>>>>>> presumably communication becomes an issue). In addition to
>>>>>> running -multi or -multidir jobs, which takes the load off
>>>>>> communication a bit, it may also be worth having separate jobs
>>>>>> and using -pin on and -pinoffset.
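For instance, two independent runs sharing one 24-thread node could be
pinned side by side roughly as below (using the 5.x-style gmx
wrapper). This is only a sketch: the offsets are illustrative and
assume a compact hardware-thread numbering on that node.

    # each run gets 12 of the node's 24 hardware threads
    gmx mdrun -nt 12 -pin on -pinoffset 0 -deffnm job1 &
    gmx mdrun -nt 12 -pin on -pinoffset 12 -deffnm job2 &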
>>>>>> Best wishes
>>>>>> James
>>>>>>
>>>>>>> Hi everyone,
>>>>>>> I have a question concerning running gromacs in parallel. I
>>>>>>> have read over
>>>>>>> http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-performance.html
>>>>>>> but I still don't quite understand how to run it efficiently.
>>>>>>> My gromacs version is 4.5.4.
>>>>>>> The cluster I am using has 108 CPUs in total and 4 hosts up.
>>>>>>> The node I am using:
>>>>>>> Architecture:          x86_64
>>>>>>> CPU op-mode(s):        32-bit, 64-bit
>>>>>>> Byte Order:            Little Endian
>>>>>>> CPU(s):                12
>>>>>>> On-line CPU(s) list:   0-11
>>>>>>> Thread(s) per core:    2
>>>>>>> Core(s) per socket:    6
>>>>>>> Socket(s):             1
>>>>>>> NUMA node(s):          1
>>>>>>> Vendor ID:             AuthenticAMD
>>>>>>> CPU family:            21
>>>>>>> Model:                 2
>>>>>>> Stepping:              0
>>>>>>> CPU MHz:               1400.000
>>>>>>> BogoMIPS:              5200.57
>>>>>>> Virtualization:        AMD-V
>>>>>>> L1d cache:             16K
>>>>>>> L1i cache:             64K
>>>>>>> L2 cache:              2048K
>>>>>>> L3 cache:              6144K
>>>>>>> NUMA node0 CPU(s):     0-11
>>>>>>> MPI is already installed. I also have permission to use the
>>>>>>> cluster as much as I can.
>>>>>>> My question is: how should I write my mdrun command to utilize
>>>>>>> all the possible cores and nodes?
>>>>>>> Thanks,
>>>>>>> Thanh Le
--
Gromacs Users mailing list

* Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
send a mail to gmx-users-requ...@gromacs.org.