FWIW: we see varying reports about the scalability of Slurm, especially at large cluster sizes. Last I saw/tested, there was a quadratic term that began to dominate above 2k nodes. Others swear it is better <shrug>. I'd be cautious and definitely test things before investing in a move - I'm not convinced.
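
For anyone who does run those tests, one rough way to check for a quadratic term is to fit measured launch times to t(n) = b*n + c*n^2 and look at where the two terms are equal (n = b/c). The sketch below does that in plain Python. The function names are hypothetical and the timing numbers are synthetic, generated so the crossover lands near 2k nodes; they are not measurements. The no-constant-term model is also just an assumption for illustration. Substitute your own wall-clock launch times per node count.

```python
def fit_linear_plus_quadratic(nodes, times):
    """Least-squares fit of t(n) = b*n + c*n**2 (assumed model, no constant term).

    nodes: list of integer node counts
    times: list of launch times in seconds, same length as nodes
    Returns (b, c).
    """
    # Normal equations for the two basis functions n and n**2.
    s11 = sum(n**2 for n in nodes)
    s12 = sum(n**3 for n in nodes)
    s22 = sum(n**4 for n in nodes)
    t1 = sum(n * t for n, t in zip(nodes, times))
    t2 = sum(n**2 * t for n, t in zip(nodes, times))

    det = s11 * s22 - s12**2
    if det == 0:
        raise ValueError("need at least two distinct node counts")

    b = (t1 * s22 - t2 * s12) / det
    c = (s11 * t2 - s12 * t1) / det
    return b, c


def crossover_nodes(b, c):
    """Node count where the quadratic term equals the linear term, or None."""
    if c <= 0:
        return None  # no positive quadratic term in the fit
    return b / c


# Synthetic example data (NOT real Slurm measurements).
example_nodes = [128, 256, 512, 1024, 2048, 4096, 8192]
example_times = [0.002 * n + 1e-6 * n * n for n in example_nodes]

b, c = fit_linear_plus_quadratic(example_nodes, example_times)
print("linear coeff:", b, "quadratic coeff:", c)
print("quadratic term overtakes linear term near n =", crossover_nodes(b, c))
```

If the fitted c comes out at or below zero, the data shows no quadratic growth over the range you measured, which says nothing about larger scales.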

On May 6, 2014, at 8:37 PM, Moody, Adam T. <mood...@llnl.gov> wrote:

> Hi Chris,
> I'm interested in SLURM / OpenMPI startup numbers, but I haven't done this
> testing myself. We're stuck with an older version of SLURM for various
> internal reasons, and I'm wondering whether it's worth the effort to back
> port the PMI2 support. Can you share some of the differences in times at
> different scales?
> Thanks,
> -Adam
> ________________________________________
> From: devel [devel-boun...@open-mpi.org] on behalf of Christopher Samuel
> [sam...@unimelb.edu.au]
> Sent: Tuesday, May 06, 2014 8:32 PM
> To: de...@open-mpi.org
> Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is
> specifically requested
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 07/05/14 12:53, Ralph Castain wrote:
>
>> We have been seeing a lot of problems with the Slurm PMI-2 support
>> (not in OMPI - it's the code in Slurm that is having problems). At
>> this time, I'm unaware of any advantage in using PMI-2 over PMI-1
>> in Slurm - the scaling is equally poor, and PMI-2 does not support
>> any additional functionality.
>>
>> I know that Cray PMI-2 has a definite advantage, so I'm proposing
>> that we turn PMI-2 "off" when under Slurm unless the user
>> specifically requests we use it.
>
> Our local testing has shown that PMI-2 in 1.7.x gives a massive
> improvement in scaling when starting jobs with srun over using srun
> with OMPI 1.6.x and now that OMPI 1.8.x is out we're planning on
> moving to using PMI2 with OMPI and srun.
>
> Using mpirun gives good performance with OMPI 1.6.x but Slurm then
> gets all its memory stats wrong and if you run with CR_Core_Memory in
> Slurm you have a very high risk your job will get killed incorrectly.
>
> All the best,
> Chris
> - --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.14 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iEYEARECAAYFAlNpqUwACgkQO2KABBYQAh/igwCfQSB/v3tI37Rq4z5z/0xT/BYU
> 6ToAn3Qt6tOt46LQD25eHhlx+3z/sjnQ
> =LEHf
> -----END PGP SIGNATURE-----
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14691.php
> _______________________________________________
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14692.php