Not to the 1.6 series, but it is in the about-to-be-released 1.7.3 and will be there from that point onwards. Still waiting to see if it resolves the difference.
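For anyone who wants to poke at this at small scale, the PMI1 exchange Joshua describes below boils down to roughly the pattern sketched here. This is only a minimal sketch against SLURM's pmi.h; the "ep-%d" key names and the endpoint payload are made up for illustration and are not what OMPI actually publishes.

/* Sketch of the PMI1 endpoint exchange (put / commit / barrier / N gets).
 * Build against SLURM's libpmi and launch under srun.
 * Keys and values below are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <pmi.h>

int main(void)
{
    int spawned, rank, size, max_kvs, max_key, max_val;
    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_name_length_max(&max_kvs);
    PMI_KVS_Get_key_length_max(&max_key);
    PMI_KVS_Get_value_length_max(&max_val);

    char *kvs = malloc(max_kvs);
    char *key = malloc(max_key);
    char *val = malloc(max_val);
    PMI_KVS_Get_my_name(kvs, max_kvs);

    /* Step 1: each rank publishes its own endpoint data -- O(1) puts. */
    snprintf(key, max_key, "ep-%d", rank);
    snprintf(val, max_val, "fake-endpoint-of-rank-%d", rank);
    PMI_KVS_Put(kvs, key, val);

    /* Step 2: commit, then barrier so everyone's data is visible. */
    PMI_KVS_Commit(kvs);
    PMI_Barrier();

    /* Step 3: every rank fetches every rank's endpoint -- O(N) gets
     * per rank, O(N^2) total across the job. */
    for (int peer = 0; peer < size; peer++) {
        snprintf(key, max_key, "ep-%d", peer);
        PMI_KVS_Get(kvs, key, val, max_val);
    }

    PMI_Finalize();
    free(kvs); free(key); free(val);
    return 0;
}

The quadratic term comes from step 3: each of the N ranks issues N gets, so the aggregate work grows as N^2 even though every rank only published a handful of keys.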
On Jul 23, 2013, at 4:28 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> On 23/07/13 19:34, Joshua Ladd wrote:
>
>> Hi, Chris
>
> Hi Joshua,
>
> I've quoted you in full as I don't think your message made it through
> to the slurm-dev list (at least I've not received it from there yet).
>
>> Funny you should mention this now. We identified and diagnosed the
>> issue some time ago as a combination of SLURM's PMI1 implementation
>> and some of, what I'll call, OMPI's topology requirements (probably
>> not the right word). Here's what happens, in a nutshell, when you
>> launch with srun:
>>
>> 1. Each process pushes its endpoint data up to the PMI "cloud" via
>>    PMI put (about five or six puts per process, i.e. O(1)).
>> 2. It then executes a PMI commit and a PMI barrier to ensure all
>>    other processes have finished committing their data to the
>>    "cloud".
>> 3. Subsequent to this, each process executes O(N) PMI gets (where N
>>    is the number of processes in the job) to fetch the endpoint data
>>    of every process, regardless of whether it ever communicates with
>>    that endpoint.
>>
>> "We" (MLNX et al.) undertook an in-depth scaling study of this and
>> identified several poorly scaling pieces, the worst offenders being:
>>
>> 1. The PMI barrier scales worse than linearly.
>> 2. At scale, the PMI get phase starts to look quadratic.
>>
>> The proposed solution that "we" (OMPI + SLURM) have come up with is
>> to modify OMPI to support PMI2 and to use SLURM 2.6, which has
>> support for PMI2 and is (allegedly) much more scalable than PMI1.
>> Several folks in the combined communities are working hard, as we
>> speak, trying to get this functional to see if it indeed makes a
>> difference. Stay tuned, Chris. Hopefully we will have some data by
>> the end of the week.
>
> Wonderful, great to know that what we're seeing is actually real and
> not just pilot error on our part! We're happy enough to tell users to
> keep on using mpirun, as they will be used to from our other Intel
> systems, and to only use srun if the code requires it (one or two
> commercial apps that use Intel MPI).
>
> Can I ask, if the PMI2 ideas work out, is that likely to get
> backported to OMPI 1.6.x?
>
> All the best,
> Chris
>
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/        http://twitter.com/vlsci