Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 09/05/14 00:16, Joshua Ladd wrote:

> The necessary packages will be supported and available in community
> OFED.

We're constrained to what is in RHEL6, I'm afraid. This is because we have to run GPFS over IB to BG/Q from the same NSDs that talk GPFS to all our Intel clusters.

We did try MOFED 2.x (in connected mode) on a new Intel cluster during its bring-up last year, which worked for MPI but stopped it talking to the NSDs. Reverting to vanilla RHEL6 fixed it. Not your problem though. :-)

As Ralph has said, there is work on an alternative solution that we will be able to use.

Thanks!
Chris
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 08/05/14 23:45, Ralph Castain wrote:

> Artem and I are working on a new PMIx plugin that will resolve it
> for non-Mellanox cases.

Ah yes, of course - sorry, my bad!

Chris
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Chris,

The necessary packages will be supported and available in community OFED.

Josh

On Thu, May 8, 2014 at 9:23 AM, Chris Samuel wrote:

> On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:
>
>> We (MLNX) are working on a new SLURM PMI2 plugin that we plan to
>> eventually push upstream. However, to use it, it will require linking in
>> a proprietary Mellanox library that accelerates the collective operations
>> (available in MOFED versions 2.1 and higher.)
>
> What about those of us who cannot run Mellanox OFED?
>
> All the best,
> Chris
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On May 8, 2014, at 6:23 AM, Chris Samuel wrote:

> On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:
>
>> We (MLNX) are working on a new SLURM PMI2 plugin that we plan to
>> eventually push upstream. However, to use it, it will require linking in
>> a proprietary Mellanox library that accelerates the collective operations
>> (available in MOFED versions 2.1 and higher.)
>
> What about those of us who cannot run Mellanox OFED?

Artem and I are working on a new PMIx plugin that will resolve it for non-Mellanox cases.
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:

> We (MLNX) are working on a new SLURM PMI2 plugin that we plan to
> eventually push upstream. However, to use it, it will require linking in
> a proprietary Mellanox library that accelerates the collective operations
> (available in MOFED versions 2.1 and higher.)

What about those of us who cannot run Mellanox OFED?

All the best,
Chris
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hi, Adam,

We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually push upstream. However, to use it, it will require linking in a proprietary Mellanox library that accelerates the collective operations (available in MOFED versions 2.1 and higher) - similar in spirit to the MXM MTL or FCA COLL components in OMPI.

Best,
Josh

On Wed, May 7, 2014 at 11:45 AM, Moody, Adam T. <mood...@llnl.gov> wrote:

> Hi Josh,
> Are your changes to OMPI or SLURM's PMI2 implementation? Do you plan to
> push those changes back upstream?
> -Adam
>
> From: devel [devel-boun...@open-mpi.org] on behalf of Joshua Ladd
> [jladd.m...@gmail.com]
> Sent: Wednesday, May 07, 2014 7:56 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is
> specifically requested
>
> Ah, I see. Sorry for the reactionary comment - but this feature falls
> squarely within my "jurisdiction", and we've invested a lot in improving
> OMPI jobstart under srun.
>
> That being said (now that I've taken some deep breaths and carefully read
> your original email :)), what you're proposing isn't a bad idea. I think it
> would be good to maybe add a "--with-pmi2" flag to configure, since
> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
> hack the installation.
>
> Josh
>
> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Okay, then we'll just have to develop a workaround for all those Slurm
>> releases where PMI-2 is borked :-(
>>
>> FWIW: I think people misunderstood my statement. I specifically did
>> *not* propose to *lose* PMI-2 support. I suggested that we change it to
>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>> stabilized, then we could reverse that policy.
>>
>> However, given that both you and Chris appear to prefer to keep it
>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>> broken and then fall back to PMI-1.
>>
>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>> Just saw this thread, and I second Chris' observations: at scale we are
>> seeing huge gains in jobstart performance with PMI2 over PMI1. We *CANNOT*
>> lose this functionality. For competitive reasons, I cannot provide exact
>> numbers, but let's say the difference is in the ballpark of a full
>> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
>> but there is no contest between PMI1 and PMI2. We (MLNX) are actively
>> working to resolve some of the scalability issues in PMI2.
>>
>> Josh
>>
>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Interesting - how many nodes were involved? As I said, the bad scaling
>>> becomes more evident at a fairly high node count.
>>>
>>> On May 7, 2014, at 12:07 AM, Christopher Samuel <sam...@unimelb.edu.au>
>>> wrote:
>>>
>>> Hiya Ralph,
>>>
>>> On 07/05/14 14:49, Ralph Castain wrote:
>>>
>>>> I should have looked closer to see the numbers you posted, Chris -
>>>> those include time for MPI wireup. So what you are seeing is that
>>>> mpirun is much more efficient at exchanging the MPI endpoint info
>>>> than PMI. I suspect that PMI2 is not much better, as the primary
>>>> reason for the difference is that mpirun sends blobs, while PMI
>>>> requires that everything be encoded into strings and sent in little
>>>> pieces.
>>>>
>>>> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>>>> operation) much faster, and MPI_Init completes faster. The rest of the
>>>> computation should be the same, so long compute apps will see the
>>>> difference narrow considerably.
>>>
>>> Unfortunately it looks like I had an enthusiastic cleanup at some point
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 08/05/14 12:54, Ralph Castain wrote:

> I think there was one 2.6.x that was borked, and definitely
> problems in the 14.03.x line. Can't pinpoint it for you, though.

No worries, thanks.

> Sounds good. I'm going to have to dig deeper into those numbers,
> though, as they don't entirely add up to me. Once the job gets
> launched, the launch method itself should have no bearing on
> computational speed - IF all things are equal. In other words, if
> the process layout is the same, and the binding pattern is the
> same, then computational speed should be roughly equivalent
> regardless of how the procs were started.

Not sure if it's significant, but when mpirun was launching processes it was using srun to start orted, which then started the MPI ranks, whereas with PMI/PMI2 it appeared to start the ranks directly.

> My guess is that your data might indicate a difference in the
> layout and/or binding pattern as opposed to PMI2 vs mpirun. At the
> scale you mention later in the thread (only 70 nodes x 16 ppn), the
> difference in launch timing would be zilch. So I'm betting you
> would find (upon further exploration) that (a) you might not have
> been binding processes when launching by mpirun, since we didn't
> bind by default until the 1.8 series, but were binding under direct
> srun launch, and (b) your process mapping would quite likely be
> different as we default to byslot mapping, and I believe srun
> defaults to bynode?

FWIW, all our environment modules that do OMPI have:

setenv OMPI_MCA_orte_process_binding core

> Might be worth another comparison run when someone has time.

Yeah, I'll try and queue up some more tests - unfortunately the cluster we tested on then is flat out at the moment, but I'll try and sneak in a 64-core job using identical configs and compare mpirun, srun on its own, and srun with PMI2.

All the best,
Chris
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
2014-05-08 9:54 GMT+07:00 Ralph Castain:

> On May 7, 2014, at 6:15 PM, Christopher Samuel wrote:
>
>> Do you know what these releases are? Are we talking 2.6.x or 14.03?
>> The 14.03 series has had a fair few rapid point releases and doesn't
>> appear to be anywhere near as stable as 2.6 was when it came out. :-(
>
> Yeah :-(
>
> I think there was one 2.6.x that was borked, and definitely problems in
> the 14.03.x line. Can't pinpoint it for you, though.

The bug I experienced with abnormal OMPI termination persists from 2.6.3 through the latest SLURM release; it may appear earlier - I didn't check. However, the SLURM guys haven't confirmed that it's actually a bug. Things will get clearer in about two weeks, when the person who maintains the code reviews the patch, but I am pretty sure it's a bug. Refer to this thread: http://thread.gmane.org/gmane.comp.distributed.slurm.devel/5213

Best regards,
Artem Y. Polyakov
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On May 7, 2014, at 6:15 PM, Christopher Samuel wrote:

> Hi all,
>
> Apologies for having dropped out of the thread, night intervened here. ;-)
>
> On 08/05/14 00:45, Ralph Castain wrote:
>
>> Okay, then we'll just have to develop a workaround for all those
>> Slurm releases where PMI-2 is borked :-(
>
> Do you know what these releases are? Are we talking 2.6.x or 14.03?
> The 14.03 series has had a fair few rapid point releases and doesn't
> appear to be anywhere near as stable as 2.6 was when it came out. :-(

Yeah :-(

I think there was one 2.6.x that was borked, and definitely problems in the 14.03.x line. Can't pinpoint it for you, though.

>> FWIW: I think people misunderstood my statement. I specifically
>> did *not* propose to *lose* PMI-2 support. I suggested that we
>> change it to "on-by-request" instead of the current "on-by-default"
>> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
>> the Slurm implementation stabilized, then we could reverse that
>> policy.
>>
>> However, given that both you and Chris appear to prefer to keep it
>> "on-by-default", we'll see if we can find a way to detect that
>> PMI-2 is broken and then fall back to PMI-1.
>
> My intention was to provide the data that led us to want PMI2, but if
> configure had an option to enable PMI2 by default so that only those
> who requested it got it then I'd be more than happy - we'd just add it
> to our script to build it.

Sounds good. I'm going to have to dig deeper into those numbers, though, as they don't entirely add up to me. Once the job gets launched, the launch method itself should have no bearing on computational speed - IF all things are equal. In other words, if the process layout is the same, and the binding pattern is the same, then computational speed should be roughly equivalent regardless of how the procs were started.

My guess is that your data might indicate a difference in the layout and/or binding pattern as opposed to PMI2 vs mpirun. At the scale you mention later in the thread (only 70 nodes x 16 ppn), the difference in launch timing would be zilch. So I'm betting you would find (upon further exploration) that (a) you might not have been binding processes when launching by mpirun, since we didn't bind by default until the 1.8 series, but were binding under direct srun launch, and (b) your process mapping would quite likely be different, as we default to byslot mapping and I believe srun defaults to bynode?

Might be worth another comparison run when someone has time.
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On May 7, 2014, at 6:51 PM, Christopher Samuel wrote:

> On 07/05/14 18:00, Ralph Castain wrote:
>
>> Interesting - how many nodes were involved? As I said, the bad
>> scaling becomes more evident at a fairly high node count.
>
> Our x86-64 systems are low node counts (we've got BG/Q for capacity);
> the cluster that those tests were run on has 70 nodes, each with 16
> cores, so I suspect we're a long long way away from that pain point.

At least 25x, my friend :-)
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
That is interesting. I think I will reconstruct your experiments on my system when I am testing the PMI selection logic; according to your resource count numbers, I can do that. I will publish my results to the list.

2014-05-08 8:51 GMT+07:00 Christopher Samuel:

> On 07/05/14 18:00, Ralph Castain wrote:
>
>> Interesting - how many nodes were involved? As I said, the bad
>> scaling becomes more evident at a fairly high node count.
>
> Our x86-64 systems are low node counts (we've got BG/Q for capacity);
> the cluster that those tests were run on has 70 nodes, each with 16
> cores, so I suspect we're a long long way away from that pain point.
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hi Chris. Current disign is to provide the runtime parameter for PMI version selection. It would be even more flexible that configuration-time selection and (with my current understanding) not very hard to acheive. 2014-05-08 8:15 GMT+07:00 Christopher Samuel: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Hi all, > > Apologies for having dropped out of the thread, night intervened here. ;-) > > On 08/05/14 00:45, Ralph Castain wrote: > > > Okay, then we'll just have to develop a workaround for all those > > Slurm releases where PMI-2 is borked :-( > > Do you know what these releases are? Are we talking 2.6.x or 14.03? > The 14.03 series has had a fair few rapid point releases and doesn't > appear to be anywhere as near as stable as 2.6 was when it came out. :-( > > > FWIW: I think people misunderstood my statement. I specifically > > did *not* propose to *lose* PMI-2 support. I suggested that we > > change it to "on-by-request" instead of the current "on-by-default" > > so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once > > the Slurm implementation stabilized, then we could reverse that > > policy. > > > > However, given that both you and Chris appear to prefer to keep it > > "on-by-default", we'll see if we can find a way to detect that > > PMI-2 is broken and then fall back to PMI-1. > > My intention was to provide the data that led us to want PMI2, but if > configure had an option to enable PMI2 by default so that only those > who requested it got it then I'd be more than happy - we'd just add it > to our script to build it. > > All the best! > Chris > - -- > Christopher SamuelSenior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > -BEGIN PGP SIGNATURE- > Version: GnuPG v1.4.14 (GNU/Linux) > Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ > > iEYEARECAAYFAlNq2poACgkQO2KABBYQAh+7DwCfeahirvoQ9Wom4VNhJIIdufeP > 7uIAnAruTnXZBn6HXhuMAlzzSsoKkXlt > =OvH4 > -END PGP SIGNATURE- > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/05/14733.php > -- С Уважением, Поляков Артем Юрьевич Best regards, Artem Y. Polyakov
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 07/05/14 18:00, Ralph Castain wrote:

> Interesting - how many nodes were involved? As I said, the bad
> scaling becomes more evident at a fairly high node count.

Our x86-64 systems are low node counts (we've got BG/Q for capacity); the cluster that those tests were run on has 70 nodes, each with 16 cores, so I suspect we're a long long way away from that pain point.

All the best!
Chris
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hi all,

Apologies for having dropped out of the thread, night intervened here. ;-)

On 08/05/14 00:45, Ralph Castain wrote:

> Okay, then we'll just have to develop a workaround for all those
> Slurm releases where PMI-2 is borked :-(

Do you know what these releases are? Are we talking 2.6.x or 14.03? The 14.03 series has had a fair few rapid point releases and doesn't appear to be anywhere near as stable as 2.6 was when it came out. :-(

> FWIW: I think people misunderstood my statement. I specifically
> did *not* propose to *lose* PMI-2 support. I suggested that we
> change it to "on-by-request" instead of the current "on-by-default"
> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> the Slurm implementation stabilized, then we could reverse that
> policy.
>
> However, given that both you and Chris appear to prefer to keep it
> "on-by-default", we'll see if we can find a way to detect that
> PMI-2 is broken and then fall back to PMI-1.

My intention was to provide the data that led us to want PMI2, but if configure had an option to enable PMI2 by default so that only those who requested it got it then I'd be more than happy - we'd just add it to our script to build it.

All the best!
Chris
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
2014-05-08 7:15 GMT+07:00 Ralph Castain:

> Take a look in opal/mca/common/pmi - we already do a bunch of #if PMI2
> stuff in there. All we are talking about doing here is:
>
> * making those selections be runtime based on an MCA param, compiling if
> PMI2 is available but selecting it at runtime
>
> * moving some additional functions into that code area and out of the
> individual components

OK, that is pretty clear now, and I will do exactly #2. Thank you.
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Take a look in opal/mca/common/pmi - we already do a bunch of #if PMI2 stuff in there. All we are talking about doing here is:

* making those selections be runtime based on an MCA param, compiling if PMI2 is available but selecting it at runtime

* moving some additional functions into that code area and out of the individual components

On May 7, 2014, at 5:08 PM, Artem Polyakov wrote:

> I like #2 too.
> But my question was slightly different. Can we encapsulate the PMI logic
> that OMPI uses in common/pmi, as #2 suggests, but have two different
> implementations of this component, say common/pmi and common/pmi2? I am
> asking because I have concerns that this kind of component is not supposed
> to be duplicated.
> In this case we could have one common MCA parameter and two components, as
> Jeff suggested.
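To make the first bullet above concrete, here is a minimal sketch of what such a runtime switch in common/pmi might look like. This is an illustration, not OMPI code: the function and flag names are hypothetical, getenv() stands in for a proper MCA parameter registration (MCA params can also be set as OMPI_MCA_<name> environment variables), and WANT_PMI2_SUPPORT stands in for whatever macro configure defines when the PMI-2 headers and library are found. The default shown (PMI-1 unless PMI-2 is requested) follows the RFC title; the thread also discusses defaulting the other way.

/*
 * Sketch only: compile PMI-2 support whenever configure found it, but
 * choose between PMI-1 and PMI-2 at runtime from a parameter.
 */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

bool opal_common_pmi_use_pmi2 = false;   /* hypothetical selection flag */

void opal_common_pmi_select(void)
{
#if WANT_PMI2_SUPPORT
    /* hypothetical parameter name; e.g. OMPI_MCA_pmi_version=2 asks for PMI-2 */
    const char *version = getenv("OMPI_MCA_pmi_version");
    opal_common_pmi_use_pmi2 = (NULL != version && 0 == strcmp(version, "2"));
#else
    /* PMI-2 was not found at configure time, so PMI-1 is the only option */
    opal_common_pmi_use_pmi2 = false;
#endif
}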
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
I like #2 too.

But my question was slightly different. Can we encapsulate the PMI logic that OMPI uses in common/pmi, as #2 suggests, but have two different implementations of this component, say common/pmi and common/pmi2? I am asking because I have concerns that this kind of component is not supposed to be duplicated. In this case we could have one common MCA parameter and two components, as Jeff suggested.

2014-05-08 7:01 GMT+07:00 Ralph Castain:

> The desired solution is to have the ability to select PMI-1 vs PMI-2 at
> runtime. This can be done in two ways:
>
> 1. You could have separate pmi1 and pmi2 components in each framework.
> You'd want to define only one common MCA param to direct the selection,
> however.
>
> 2. You could have a single pmi component in each framework, calling code
> in the appropriate common/pmi location. You would then need a runtime MCA
> param to select whether PMI-1 or PMI-2 was going to be used, and have the
> common code check before making the desired calls.
>
> The choice of method is left up to you. They each have their negatives. If
> it were me, I'd probably try #2 first, assuming the codes are mostly common
> in the individual frameworks.
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
The desired solution is to have the ability to select PMI-1 vs PMI-2 at runtime. This can be done in two ways:

1. You could have separate pmi1 and pmi2 components in each framework. You'd want to define only one common MCA param to direct the selection, however.

2. You could have a single pmi component in each framework, calling code in the appropriate common/pmi location. You would then need a runtime MCA param to select whether PMI-1 or PMI-2 was going to be used, and have the common code check before making the desired calls.

The choice of method is left up to you. They each have their negatives. If it were me, I'd probably try #2 first, assuming the codes are mostly common in the individual frameworks.

On May 7, 2014, at 4:51 PM, Artem Polyakov wrote:

> Just reread your suggestions in our out-of-list discussion and found that
> I misunderstood them. So no parallel PMI! Take all possible code into
> opal/mca/common/pmi.
> To additionally clarify, what is the preferred way:
> 1. to create one joined PMI module with switches to decide which
> functionality to use, or
> 2. to have two separate common modules, one for PMI1 and one for PMI2 -
> and does this fit the opal/mca/common/ ideology at all?
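As a rough illustration of option #2's "have the common code check before making the desired calls", a wrapper in common/pmi might dispatch each PMI operation on the runtime selection. This is only a sketch: the wrapper name and the opal_common_pmi_use_pmi2 flag (set by the selection sketch shown earlier in this thread) are assumptions, while PMI_KVS_Put and PMI2_KVS_Put follow the standard SLURM/MPICH PMI headers.

/* Sketch only: one dispatch point shared by the framework components. */
#include <stdbool.h>
#include <pmi.h>
#if WANT_PMI2_SUPPORT
#include <pmi2.h>
#endif

extern bool opal_common_pmi_use_pmi2;   /* set once at selection time */

int opal_common_pmi_put(const char *kvsname, const char *key,
                        const char *value)
{
#if WANT_PMI2_SUPPORT
    if (opal_common_pmi_use_pmi2) {
        /* PMI-2 uses a job-scoped KVS, so there is no kvsname argument */
        return PMI2_KVS_Put(key, value);
    }
#endif
    return PMI_KVS_Put(kvsname, key, value);
}

The same pattern would presumably cover the other calls (barrier, get, commit), which is why keeping the dispatch in one place looks more attractive than duplicating it in every framework component.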
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Just reread your suggestions in our out-of-list discussion and found that I misunderstood them. So no parallel PMI! Take all possible code into opal/mca/common/pmi.

To additionally clarify, what is the preferred way:
1. to create one joined PMI module with switches to decide which functionality to use, or
2. to have two separate common modules, one for PMI1 and one for PMI2 - and does this fit the opal/mca/common/ ideology at all?

2014-05-08 6:44 GMT+07:00 Artem Polyakov:

> 2014-05-08 5:54 GMT+07:00 Ralph Castain:
>
>> Umm... no, I don't think that's right. I believe we decided to instead
>> create the separate components, default to PMI-2 if available, print a
>> nice error message if not, otherwise use PMI-1.
>>
>> I don't want to initialize both PMIs in parallel as most installations
>> won't support it.
>
> OK, I agree. Besides the lack of support, there can be a performance hit
> caused by PMI1 initialization at scale. This is not an issue for SLURM's
> PMI1, since it is quite simple and local, but I didn't consider other
> implementations.
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
2014-05-08 5:54 GMT+07:00 Ralph Castain:

> Umm... no, I don't think that's right. I believe we decided to instead
> create the separate components, default to PMI-2 if available, print a
> nice error message if not, otherwise use PMI-1.
>
> I don't want to initialize both PMIs in parallel as most installations
> won't support it.

OK, I agree. Besides the lack of support, there can be a performance hit caused by PMI1 initialization at scale. This is not an issue for SLURM's PMI1, since it is quite simple and local, but I didn't consider other implementations.
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Umm... no, I don't think that's right. I believe we decided to instead create the separate components, default to PMI-2 if available, print a nice error message if not, and otherwise use PMI-1.

I don't want to initialize both PMIs in parallel, as most installations won't support it.

On May 7, 2014, at 3:49 PM, Artem Polyakov wrote:

> We discussed Joshua's concerns with Ralph and decided to first try the
> automatic PMI2 correctness check, as initially intended. Here is my idea:
> the universal way to decide whether PMI2 is correct is to compare the size
> and rank reported by PMI_Init and PMI2_Init. Size and rank should be equal.
> In that case we proceed with PMI2, finalizing PMI1; otherwise we finalize
> PMI2 and proceed with PMI1.
> I need to clarify with the SLURM guys whether parallel initialization of
> both PMIs is legal. If not, we'll do it sequentially.
> In other places we'll just use a flag saying which PMI version to use.
> Does that sound reasonable?
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Ralph and I discussed Joshua's concerns and decided to try automatic PMI2 correctness checking first, as was initially intended. Here is my idea. The universal way to decide if PMI2 is correct is to compare PMI_Init(.., <size>, <rank>, ...) and PMI2_Init(.., <size>, <rank>, ...). Size and rank should be equal. In this case we proceed with PMI2, finalizing PMI1. Otherwise we finalize PMI2 and proceed with PMI1. I need to clarify with the SLURM guys whether parallel initialization of both PMIs is legal. If not - we'll do that sequentially. In other places we'll just use a flag saying which PMI version to use. Does that sound reasonable?

2014-05-07 23:10 GMT+07:00 Artem Polyakov:

> That's a good point. There is actually a bunch of modules in ompi, opal and orte that have to be duplicated.
>
> On Wednesday, 7 May 2014, Joshua Ladd wrote:
>
>> +1 Sounds like a good idea - but decoupling the two and adding all the right selection mojo might be a bit of a pain. There are several places in OMPI where the distinction between PMI1 and PMI2 is made, not only in grpcomm. DB and ESS frameworks off the top of my head.
>>
>> Josh
>>
>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov wrote:
>>
>>> Good idea :)!
>>>
>>> On Wednesday, 7 May 2014, Ralph Castain wrote:
>>>
>>> Jeff actually had a useful suggestion (gasp!). He proposed that we separate the PMI-1 and PMI-2 codes into separate components so you could select them at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 component failed, we would emit a show_help indicating that they probably have a broken PMI-2 version and should try PMI-1.
>>>
>>> Make sense?
>>> Ralph
>>>
>>> On May 7, 2014, at 8:00 AM, Ralph Castain wrote:
>>>
>>> On May 7, 2014, at 7:56 AM, Joshua Ladd wrote:
>>>
>>> Ah, I see. Sorry for the reactionary comment - but this feature falls squarely within my "jurisdiction", and we've invested a lot in improving OMPI jobstart under srun.
>>>
>>> That being said (now that I've taken some deep breaths and carefully read your original email :)), what you're proposing isn't a bad idea. I think it would be good to maybe add a "--with-pmi2" flag to configure since "--with-pmi" automagically uses PMI2 if it finds the header and lib. This way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or hack the installation.
>>>
>>> That would be a much simpler solution than what Artem proposed (off-list) where we would try PMI2 and then if it didn't work try to figure out how to fall back to PMI1. I'll add this for now, and if Artem wants to try his more automagic solution and can make it work, then we can reconsider that option.
>>>
>>> Thanks
>>> Ralph
>>>
>>> Josh
>>>
>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain wrote:
>>>
>>> Okay, then we'll just have to develop a workaround for all those Slurm releases where PMI-2 is borked :-(
>>>
>>> FWIW: I think people misunderstood my statement. I specifically did *not* propose to *lose* PMI-2 support. I suggested that we change it to "on-by-request" instead of the current "on-by-default" so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation stabilized, then we could reverse that policy.
>>> >>> However, given that both you and Chris appear to prefer to keep it >>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is >>> broken and then fall back to PMI-1. >>> >>> >>> On May 7, 2014, at 7:39 AM, Joshua Ladd wrote: >>> >>> Just saw this thread, and I second Chris' observations: at scale we are >>> seeing huge gains in jobstart performance with PMI2 over PMI1. We >>> *CANNOT* loose this functionality. For competitive reasons, I cannot >>> provide exact numbers, but let's say the difference is in the ballpark of a >>> full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely >>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, >>> but there is no contest between PMI1 and PMI2. We (MLNX) are actively >>> working to resolve some of the scalability issues in PMI2. >>> >>> Josh >>> >>> Joshua S. Ladd >>> Staff Engineer, HPC Software >>> Mellanox Technologies >>> >>> Email: josh...@mellanox.com >>> >>> >>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote: >>> >>> Interesting - how many nodes were involved? As I said, the bad scaling >>> becomes more evident at a fairly high node count. >>> >>> On May 7, 2014, at 12:07 AM, Christopher Samuel >>> wrote: >>> >>> > -BEGIN PGP SIGNED MESSAGE- >>> > Hash: SHA1 >>> > >>> > Hiya Ralph, >>> > >>> > On 07/05/14 14:49, Ralph Castain wrote: >>> > >>> >> I should have looked closer to see the numbers you
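For readers who have not looked at the two PMI headers, the following is a minimal, self-contained sketch of the probe Artem describes above. It assumes - as he notes still has to be confirmed with the Slurm developers - that initializing both PMI generations in the same process is legal, and it only uses the standard Slurm/MPICH PMI entry points (PMI_Init, PMI_Get_size, PMI_Get_rank, PMI2_Init and the matching finalize calls). It is not Open MPI code, just an illustration of the comparison.

  /* pmi_probe.c - hedged sketch of the PMI1-vs-PMI2 sanity check.
   * Build against Slurm's libpmi and libpmi2 and launch one copy per rank
   * under srun; it will not do anything useful outside a Slurm step. */
  #include <stdio.h>
  #include <pmi.h>
  #include <pmi2.h>

  /* Returns 1 if PMI2 reports the same size/rank as PMI1 (proceed with PMI2),
   * 0 if PMI2 is absent or disagrees (fall back to PMI1). */
  static int pmi2_looks_usable(void)
  {
      int spawned1 = 0, size1 = -1, rank1 = -1;
      int spawned2 = 0, size2 = -2, rank2 = -2, appnum = 0;

      if (PMI_Init(&spawned1) != PMI_SUCCESS ||
          PMI_Get_size(&size1) != PMI_SUCCESS ||
          PMI_Get_rank(&rank1) != PMI_SUCCESS) {
          return 0;               /* no PMI1 baseline to compare against */
      }
      if (PMI2_Init(&spawned2, &size2, &rank2, &appnum) != PMI2_SUCCESS) {
          return 0;               /* PMI2 refused to start: stay on PMI1 */
      }
      if (size1 == size2 && rank1 == rank2) {
          PMI_Finalize();         /* agreement: proceed with PMI2 */
          return 1;
      }
      PMI2_Finalize();            /* disagreement: PMI2 looks broken */
      return 0;
  }

  int main(void)
  {
      printf("PMI2 usable: %d\n", pmi2_looks_usable());
      return 0;
  }

Roughly: compile with gcc pmi_probe.c -lpmi -lpmi2 (plus whatever -I/-L paths your Slurm install needs) and launch it with srun --mpi=pmi2.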
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Yeah, we'll want to move some of it into common - but a lot of that was already done, so I think it won't be that hard. Will explore On May 7, 2014, at 9:00 AM, Joshua Laddwrote: > +1 Sounds like a good idea - but decoupling the two and adding all the right > selection mojo might be a bit of a pain. There are several places in OMPI > where the distinction between PMI1and PMI2 is made, not only in grpcomm. DB > and ESS frameworks off the top of my head. > > Josh > > > On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov wrote: > Good idea :)! > > среда, 7 мая 2014 г. пользователь Ralph Castain написал: > > Jeff actually had a useful suggestion (gasp!).He proposed that we separate > the PMI-1 and PMI-2 codes into separate components so you could select them > at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are > found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 > component failed, we would emit a show_help indicating that they probably > have a broken PMI-2 version and should try PMI-1. > > Make sense? > Ralph > > On May 7, 2014, at 8:00 AM, Ralph Castain wrote: > >> >> On May 7, 2014, at 7:56 AM, Joshua Ladd wrote: >> >>> Ah, I see. Sorry for the reactionary comment - but this feature falls >>> squarely within my "jurisdiction", and we've invested a lot in improving >>> OMPI jobstart under srun. >>> >>> That being said (now that I've taken some deep breaths and carefully read >>> your original email :)), what you're proposing isn't a bad idea. I think it >>> would be good to maybe add a "--with-pmi2" flag to configure since >>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This >>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or >>> hack the installation. >> >> That would be a much simpler solution than what Artem proposed (off-list) >> where we would try PMI2 and then if it didn't work try to figure out how to >> fall back to PMI1. I'll add this for now, and if Artem wants to try his more >> automagic solution and can make it work, then we can reconsider that option. >> >> Thanks >> Ralph >> >>> >>> Josh >>> >>> >>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain wrote: >>> Okay, then we'll just have to develop a workaround for all those Slurm >>> releases where PMI-2 is borked :-( >>> >>> FWIW: I think people misunderstood my statement. I specifically did *not* >>> propose to *lose* PMI-2 support. I suggested that we change it to >>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep >>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation >>> stabilized, then we could reverse that policy. >>> >>> However, given that both you and Chris appear to prefer to keep it >>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is >>> broken and then fall back to PMI-1. >>> >>> >>> On May 7, 2014, at 7:39 AM, Joshua Ladd wrote: >>> Just saw this thread, and I second Chris' observations: at scale we are seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT loose this functionality. For competitive reasons, I cannot provide exact numbers, but let's say the difference is in the ballpark of a full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, but there is no contest between PMI1 and PMI2. We (MLNX) are actively working to resolve some of the scalability issues in PMI2. Josh Joshua S. 
Ladd Staff Engineer, HPC Software Mellanox Technologies Email: josh...@mellanox.com On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote: Interesting - how many nodes were involved? As I said, the bad scaling becomes more evident at a fairly high node count. On May 7, 2014, at 12:07 AM, Christopher Samuel wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Hiya Ralph, > > On 07/05/14 14:49, Ralph Castain wrote: > >> I should have looked closer to see the numbers you posted, Chris - >> those include time for MPI wireup. So what you are seeing is that >> mpirun is much more efficient at exchanging the MPI endpoint info >> than PMI. I suspect that PMI2 is not much better as the primary >> reason for the difference is that mpriun sends blobs, while PMI >> requires that everything b > > > -- > С Уважением, Поляков Артем Юрьевич > Best regards, Artem Y. Polyakov > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: >
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
+1 Sounds like a good idea - but decoupling the two and adding all the right selection mojo might be a bit of a pain. There are several places in OMPI where the distinction between PMI1and PMI2 is made, not only in grpcomm. DB and ESS frameworks off the top of my head. Josh On Wed, May 7, 2014 at 11:48 AM, Artem Polyakovwrote: > Good idea :)! > > среда, 7 мая 2014 г. пользователь Ralph Castain написал: > > Jeff actually had a useful suggestion (gasp!).He proposed that we separate >> the PMI-1 and PMI-2 codes into separate components so you could select them >> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are >> found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 >> component failed, we would emit a show_help indicating that they probably >> have a broken PMI-2 version and should try PMI-1. >> >> Make sense? >> Ralph >> >> On May 7, 2014, at 8:00 AM, Ralph Castain wrote: >> >> >> On May 7, 2014, at 7:56 AM, Joshua Ladd wrote: >> >> Ah, I see. Sorry for the reactionary comment - but this feature falls >> squarely within my "jurisdiction", and we've invested a lot in improving >> OMPI jobstart under srun. >> >> That being said (now that I've taken some deep breaths and carefully read >> your original email :)), what you're proposing isn't a bad idea. I think it >> would be good to maybe add a "--with-pmi2" flag to configure since >> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This >> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or >> hack the installation. >> >> >> That would be a much simpler solution than what Artem proposed (off-list) >> where we would try PMI2 and then if it didn't work try to figure out how to >> fall back to PMI1. I'll add this for now, and if Artem wants to try his >> more automagic solution and can make it work, then we can reconsider that >> option. >> >> Thanks >> Ralph >> >> >> Josh >> >> >> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain wrote: >> >> Okay, then we'll just have to develop a workaround for all those Slurm >> releases where PMI-2 is borked :-( >> >> FWIW: I think people misunderstood my statement. I specifically did *not* >> propose to *lose* PMI-2 support. I suggested that we change it to >> "on-by-request" instead of the current "on-by-default" so we wouldn't keep >> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation >> stabilized, then we could reverse that policy. >> >> However, given that both you and Chris appear to prefer to keep it >> "on-by-default", we'll see if we can find a way to detect that PMI-2 is >> broken and then fall back to PMI-1. >> >> >> On May 7, 2014, at 7:39 AM, Joshua Ladd wrote: >> >> Just saw this thread, and I second Chris' observations: at scale we are >> seeing huge gains in jobstart performance with PMI2 over PMI1. We >> *CANNOT* loose this functionality. For competitive reasons, I cannot >> provide exact numbers, but let's say the difference is in the ballpark of a >> full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely >> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, >> but there is no contest between PMI1 and PMI2. We (MLNX) are actively >> working to resolve some of the scalability issues in PMI2. >> >> Josh >> >> Joshua S. Ladd >> Staff Engineer, HPC Software >> Mellanox Technologies >> >> Email: josh...@mellanox.com >> >> >> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote: >> >> Interesting - how many nodes were involved? 
As I said, the bad scaling >> becomes more evident at a fairly high node count. >> >> On May 7, 2014, at 12:07 AM, Christopher Samuel >> wrote: >> >> > -BEGIN PGP SIGNED MESSAGE- >> > Hash: SHA1 >> > >> > Hiya Ralph, >> > >> > On 07/05/14 14:49, Ralph Castain wrote: >> > >> >> I should have looked closer to see the numbers you posted, Chris - >> >> those include time for MPI wireup. So what you are seeing is that >> >> mpirun is much more efficient at exchanging the MPI endpoint info >> >> than PMI. I suspect that PMI2 is not much better as the primary >> >> reason for the difference is that mpriun sends blobs, while PMI >> >> requires that everything b >> >> > > -- > С Уважением, Поляков Артем Юрьевич > Best regards, Artem Y. Polyakov > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/05/14716.php >
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Good idea :)! среда, 7 мая 2014 г. пользователь Ralph Castain написал: > Jeff actually had a useful suggestion (gasp!).He proposed that we separate > the PMI-1 and PMI-2 codes into separate components so you could select them > at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are > found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 > component failed, we would emit a show_help indicating that they probably > have a broken PMI-2 version and should try PMI-1. > > Make sense? > Ralph > > On May 7, 2014, at 8:00 AM, Ralph Castainwrote: > > > On May 7, 2014, at 7:56 AM, Joshua Ladd wrote: > > Ah, I see. Sorry for the reactionary comment - but this feature falls > squarely within my "jurisdiction", and we've invested a lot in improving > OMPI jobstart under srun. > > That being said (now that I've taken some deep breaths and carefully read > your original email :)), what you're proposing isn't a bad idea. I think it > would be good to maybe add a "--with-pmi2" flag to configure since > "--with-pmi" automagically uses PMI2 if it finds the header and lib. This > way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or > hack the installation. > > > That would be a much simpler solution than what Artem proposed (off-list) > where we would try PMI2 and then if it didn't work try to figure out how to > fall back to PMI1. I'll add this for now, and if Artem wants to try his > more automagic solution and can make it work, then we can reconsider that > option. > > Thanks > Ralph > > > Josh > > > On Wed, May 7, 2014 at 10:45 AM, Ralph Castain wrote: > > Okay, then we'll just have to develop a workaround for all those Slurm > releases where PMI-2 is borked :-( > > FWIW: I think people misunderstood my statement. I specifically did *not* > propose to *lose* PMI-2 support. I suggested that we change it to > "on-by-request" instead of the current "on-by-default" so we wouldn't keep > getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation > stabilized, then we could reverse that policy. > > However, given that both you and Chris appear to prefer to keep it > "on-by-default", we'll see if we can find a way to detect that PMI-2 is > broken and then fall back to PMI-1. > > > On May 7, 2014, at 7:39 AM, Joshua Ladd wrote: > > Just saw this thread, and I second Chris' observations: at scale we are > seeing huge gains in jobstart performance with PMI2 over PMI1. We > *CANNOT*loose this functionality. For competitive reasons, I cannot provide > exact > numbers, but let's say the difference is in the ballpark of a full > order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely > unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, > but there is no contest between PMI1 and PMI2. We (MLNX) are actively > working to resolve some of the scalability issues in PMI2. > > Josh > > Joshua S. Ladd > Staff Engineer, HPC Software > Mellanox Technologies > > Email: josh...@mellanox.com > > > On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote: > > Interesting - how many nodes were involved? As I said, the bad scaling > becomes more evident at a fairly high node count. > > On May 7, 2014, at 12:07 AM, Christopher Samuel > wrote: > > > -BEGIN PGP SIGNED MESSAGE- > > Hash: SHA1 > > > > Hiya Ralph, > > > > On 07/05/14 14:49, Ralph Castain wrote: > > > >> I should have looked closer to see the numbers you posted, Chris - > >> those include time for MPI wireup. 
So what you are seeing is that > >> mpirun is much more efficient at exchanging the MPI endpoint info > >> than PMI. I suspect that PMI2 is not much better as the primary > >> reason for the difference is that mpriun sends blobs, while PMI > >> requires that everything b > > -- С Уважением, Поляков Артем Юрьевич Best regards, Artem Y. Polyakov
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Jeff actually had a useful suggestion (gasp!).He proposed that we separate the PMI-1 and PMI-2 codes into separate components so you could select them at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 component failed, we would emit a show_help indicating that they probably have a broken PMI-2 version and should try PMI-1. Make sense? Ralph On May 7, 2014, at 8:00 AM, Ralph Castainwrote: > > On May 7, 2014, at 7:56 AM, Joshua Ladd wrote: > >> Ah, I see. Sorry for the reactionary comment - but this feature falls >> squarely within my "jurisdiction", and we've invested a lot in improving >> OMPI jobstart under srun. >> >> That being said (now that I've taken some deep breaths and carefully read >> your original email :)), what you're proposing isn't a bad idea. I think it >> would be good to maybe add a "--with-pmi2" flag to configure since >> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This >> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or >> hack the installation. > > That would be a much simpler solution than what Artem proposed (off-list) > where we would try PMI2 and then if it didn't work try to figure out how to > fall back to PMI1. I'll add this for now, and if Artem wants to try his more > automagic solution and can make it work, then we can reconsider that option. > > Thanks > Ralph > >> >> Josh >> >> >> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain wrote: >> Okay, then we'll just have to develop a workaround for all those Slurm >> releases where PMI-2 is borked :-( >> >> FWIW: I think people misunderstood my statement. I specifically did *not* >> propose to *lose* PMI-2 support. I suggested that we change it to >> "on-by-request" instead of the current "on-by-default" so we wouldn't keep >> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation >> stabilized, then we could reverse that policy. >> >> However, given that both you and Chris appear to prefer to keep it >> "on-by-default", we'll see if we can find a way to detect that PMI-2 is >> broken and then fall back to PMI-1. >> >> >> On May 7, 2014, at 7:39 AM, Joshua Ladd wrote: >> >>> Just saw this thread, and I second Chris' observations: at scale we are >>> seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT >>> loose this functionality. For competitive reasons, I cannot provide exact >>> numbers, but let's say the difference is in the ballpark of a full >>> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely >>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, >>> but there is no contest between PMI1 and PMI2. We (MLNX) are actively >>> working to resolve some of the scalability issues in PMI2. >>> >>> Josh >>> >>> Joshua S. Ladd >>> Staff Engineer, HPC Software >>> Mellanox Technologies >>> >>> Email: josh...@mellanox.com >>> >>> >>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote: >>> Interesting - how many nodes were involved? As I said, the bad scaling >>> becomes more evident at a fairly high node count. >>> >>> On May 7, 2014, at 12:07 AM, Christopher Samuel >>> wrote: >>> >>> > -BEGIN PGP SIGNED MESSAGE- >>> > Hash: SHA1 >>> > >>> > Hiya Ralph, >>> > >>> > On 07/05/14 14:49, Ralph Castain wrote: >>> > >>> >> I should have looked closer to see the numbers you posted, Chris - >>> >> those include time for MPI wireup. 
So what you are seeing is that >>> >> mpirun is much more efficient at exchanging the MPI endpoint info >>> >> than PMI. I suspect that PMI2 is not much better as the primary >>> >> reason for the difference is that mpriun sends blobs, while PMI >>> >> requires that everything be encoded into strings and sent in little >>> >> pieces. >>> >> >>> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex" >>> >> operation) much faster, and MPI_Init completes faster. Rest of the >>> >> computation should be the same, so long compute apps will see the >>> >> difference narrow considerably. >>> > >>> > Unfortunately it looks like I had an enthusiastic cleanup at some point >>> > and so I cannot find the out files from those runs at the moment, but >>> > I did find some comparisons from around that time. >>> > >>> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103 >>> > run with mpirun and srun successively from inside the same Slurm job. >>> > >>> > mpirun namd2 macpf.conf >>> > srun --mpi=pmi2 namd2 macpf.conf >>> > >>> > Firstly the mpirun output (grep'ing the interesting bits): >>> > >>> > Charm++> Running on MPI version: 2.1 >>> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 >>> > MB memory >>> > Info: Benchmark time: 512 CPUs 0.0929002 s/step
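To make the selection policy in Ralph's message concrete, here is a small, purely illustrative sketch of that flow: build both components, default to PMI-1, let the user ask for PMI-2, and fall back with a helpful message when the PMI-2 component cannot initialize. None of the function names below are real Open MPI/ORTE symbols - they stand in for the component entry points - and the real code would route the message through the show_help machinery rather than fprintf.

  #include <stdbool.h>
  #include <stdio.h>

  /* Stand-ins for the PMI component init hooks; NOT the actual ORTE API.
   * Here PMI-1 "works" and PMI-2 "fails" so the fallback path is exercised. */
  static int pmi1_component_init(void) { return 0; }
  static int pmi2_component_init(void) { return -1; }

  /* want_pmi2 would come from a user-set runtime (MCA-style) parameter. */
  static const char *select_pmi(bool want_pmi2)
  {
      if (want_pmi2) {
          if (0 == pmi2_component_init()) {
              return "pmi2";
          }
          /* Real code: emit a show_help message instead of fprintf. */
          fprintf(stderr, "PMI-2 was requested but failed to initialize; "
                          "your Slurm PMI-2 support may be broken. "
                          "Falling back to PMI-1.\n");
      }
      return (0 == pmi1_component_init()) ? "pmi1" : NULL;
  }

  int main(void)
  {
      const char *chosen = select_pmi(true);
      printf("selected component: %s\n", chosen ? chosen : "none");
      return 0;
  }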
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hi Josh, Are your changes to OMPI or SLURM's PMI2 implementation? Do you plan to push those changes back upstream? -Adam

From: devel [devel-boun...@open-mpi.org] on behalf of Joshua Ladd [jladd.m...@gmail.com] Sent: Wednesday, May 07, 2014 7:56 AM To: Open MPI Developers Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

Ah, I see. Sorry for the reactionary comment - but this feature falls squarely within my "jurisdiction", and we've invested a lot in improving OMPI jobstart under srun. That being said (now that I've taken some deep breaths and carefully read your original email :)), what you're proposing isn't a bad idea. I think it would be good to maybe add a "--with-pmi2" flag to configure since "--with-pmi" automagically uses PMI2 if it finds the header and lib. This way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or hack the installation. Josh

On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <r...@open-mpi.org> wrote: Okay, then we'll just have to develop a workaround for all those Slurm releases where PMI-2 is borked :-( FWIW: I think people misunderstood my statement. I specifically did *not* propose to *lose* PMI-2 support. I suggested that we change it to "on-by-request" instead of the current "on-by-default" so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation stabilized, then we could reverse that policy. However, given that both you and Chris appear to prefer to keep it "on-by-default", we'll see if we can find a way to detect that PMI-2 is broken and then fall back to PMI-1.

On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.m...@gmail.com> wrote: Just saw this thread, and I second Chris' observations: at scale we are seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT lose this functionality. For competitive reasons, I cannot provide exact numbers, but let's say the difference is in the ballpark of a full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, but there is no contest between PMI1 and PMI2. We (MLNX) are actively working to resolve some of the scalability issues in PMI2. Josh Joshua S. Ladd Staff Engineer, HPC Software Mellanox Technologies Email: josh...@mellanox.com

On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <r...@open-mpi.org> wrote: Interesting - how many nodes were involved? As I said, the bad scaling becomes more evident at a fairly high node count.

On May 7, 2014, at 12:07 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Hiya Ralph, > > On 07/05/14 14:49, Ralph Castain wrote: > >> I should have looked closer to see the numbers you posted, Chris - >> those include time for MPI wireup. So what you are seeing is that >> mpirun is much more efficient at exchanging the MPI endpoint info >> than PMI. I suspect that PMI2 is not much better as the primary >> reason for the difference is that mpirun sends blobs, while PMI >> requires that everything be encoded into strings and sent in little >> pieces. >> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex" >> operation) much faster, and MPI_Init completes faster. Rest of the >> computation should be the same, so long-running compute apps will see the >> difference narrow considerably.
> > Unfortunately it looks like I had an enthusiastic cleanup at some point > and so I cannot find the out files from those runs at the moment, but > I did find some comparisons from around that time. > > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103 > run with mpirun and srun successively from inside the same Slurm job. > > mpirun namd2 macpf.conf > srun --mpi=pmi2 namd2 macpf.conf > > Firstly the mpirun output (grep'ing the interesting bits): > > Charm++> Running on MPI version: 2.1 > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB > memory > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB > memory > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB > memory > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB > memory > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB > memory > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB > > Now the srun output: > > Charm++> Running o
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Thanks, Chris. -Adam From: devel [devel-boun...@open-mpi.org] on behalf of Christopher Samuel [sam...@unimelb.edu.au] Sent: Wednesday, May 07, 2014 12:07 AM To: de...@open-mpi.org Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hiya Ralph, On 07/05/14 14:49, Ralph Castain wrote: > I should have looked closer to see the numbers you posted, Chris - > those include time for MPI wireup. So what you are seeing is that > mpirun is much more efficient at exchanging the MPI endpoint info > than PMI. I suspect that PMI2 is not much better as the primary > reason for the difference is that mpriun sends blobs, while PMI > requires that everything be encoded into strings and sent in little > pieces. > > Hence, mpirun can exchange the endpoint info (the dreaded "modex" > operation) much faster, and MPI_Init completes faster. Rest of the > computation should be the same, so long compute apps will see the > difference narrow considerably. Unfortunately it looks like I had an enthusiastic cleanup at some point and so I cannot find the out files from those runs at the moment, but I did find some comparisons from around that time. This first pair are comparing running NAMD with OMPI 1.7.3a1r29103 run with mpirun and srun successively from inside the same Slurm job. mpirun namd2 macpf.conf srun --mpi=pmi2 namd2 macpf.conf Firstly the mpirun output (grep'ing the interesting bits): Charm++> Running on MPI version: 2.1 Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB memory Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB memory Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB memory Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB memory Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB memory WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB Now the srun output: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB memory Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB memory Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB memory Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB memory Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB memory WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB The next two pairs are first launched using mpirun from 1.6.x and then with srun from 1.7.3a1r29103. Again each pair inside the same Slurm job with the same inputs. 
First pair mpirun: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB First pair srun: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory WallClock: 7476.643555 CPUTime: 7476.643555 Memory: 968.867188 MB Second pair mpirun: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory WallClock: 7842.831543 CPUTime: 7842.831543 Memory: 1004.050781 MB Second pair srun: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory WallClock: 7522.677246 CPUTime: 7522.677246 Memory: 969.433594 MB So to me it looks like (for NAMD on our system at least) that
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On May 7, 2014, at 7:56 AM, Joshua Laddwrote: > Ah, I see. Sorry for the reactionary comment - but this feature falls > squarely within my "jurisdiction", and we've invested a lot in improving OMPI > jobstart under srun. > > That being said (now that I've taken some deep breaths and carefully read > your original email :)), what you're proposing isn't a bad idea. I think it > would be good to maybe add a "--with-pmi2" flag to configure since > "--with-pmi" automagically uses PMI2 if it finds the header and lib. This > way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or > hack the installation. That would be a much simpler solution than what Artem proposed (off-list) where we would try PMI2 and then if it didn't work try to figure out how to fall back to PMI1. I'll add this for now, and if Artem wants to try his more automagic solution and can make it work, then we can reconsider that option. Thanks Ralph > > Josh > > > On Wed, May 7, 2014 at 10:45 AM, Ralph Castain wrote: > Okay, then we'll just have to develop a workaround for all those Slurm > releases where PMI-2 is borked :-( > > FWIW: I think people misunderstood my statement. I specifically did *not* > propose to *lose* PMI-2 support. I suggested that we change it to > "on-by-request" instead of the current "on-by-default" so we wouldn't keep > getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation > stabilized, then we could reverse that policy. > > However, given that both you and Chris appear to prefer to keep it > "on-by-default", we'll see if we can find a way to detect that PMI-2 is > broken and then fall back to PMI-1. > > > On May 7, 2014, at 7:39 AM, Joshua Ladd wrote: > >> Just saw this thread, and I second Chris' observations: at scale we are >> seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT >> loose this functionality. For competitive reasons, I cannot provide exact >> numbers, but let's say the difference is in the ballpark of a full >> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely >> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, but >> there is no contest between PMI1 and PMI2. We (MLNX) are actively working >> to resolve some of the scalability issues in PMI2. >> >> Josh >> >> Joshua S. Ladd >> Staff Engineer, HPC Software >> Mellanox Technologies >> >> Email: josh...@mellanox.com >> >> >> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote: >> Interesting - how many nodes were involved? As I said, the bad scaling >> becomes more evident at a fairly high node count. >> >> On May 7, 2014, at 12:07 AM, Christopher Samuel >> wrote: >> >> > -BEGIN PGP SIGNED MESSAGE- >> > Hash: SHA1 >> > >> > Hiya Ralph, >> > >> > On 07/05/14 14:49, Ralph Castain wrote: >> > >> >> I should have looked closer to see the numbers you posted, Chris - >> >> those include time for MPI wireup. So what you are seeing is that >> >> mpirun is much more efficient at exchanging the MPI endpoint info >> >> than PMI. I suspect that PMI2 is not much better as the primary >> >> reason for the difference is that mpriun sends blobs, while PMI >> >> requires that everything be encoded into strings and sent in little >> >> pieces. >> >> >> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex" >> >> operation) much faster, and MPI_Init completes faster. Rest of the >> >> computation should be the same, so long compute apps will see the >> >> difference narrow considerably. 
>> > >> > Unfortunately it looks like I had an enthusiastic cleanup at some point >> > and so I cannot find the out files from those runs at the moment, but >> > I did find some comparisons from around that time. >> > >> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103 >> > run with mpirun and srun successively from inside the same Slurm job. >> > >> > mpirun namd2 macpf.conf >> > srun --mpi=pmi2 namd2 macpf.conf >> > >> > Firstly the mpirun output (grep'ing the interesting bits): >> > >> > Charm++> Running on MPI version: 2.1 >> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 >> > MB memory >> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 >> > MB memory >> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 >> > MB memory >> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 >> > MB memory >> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 >> > MB memory >> > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB >> > >> > Now the srun output: >> > >> > Charm++> Running on MPI version: 2.1 >> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 >> > MB memory >> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Ah, I see. Sorry for the reactionary comment - but this feature falls squarely within my "jurisdiction", and we've invested a lot in improving OMPI jobstart under srun. That being said (now that I've taken some deep breaths and carefully read your original email :)), what you're proposing isn't a bad idea. I think it would be good to maybe add a "--with-pmi2" flag to configure since "--with-pmi" automagically uses PMI2 if it finds the header and lib. This way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or hack the installation. Josh On Wed, May 7, 2014 at 10:45 AM, Ralph Castainwrote: > Okay, then we'll just have to develop a workaround for all those Slurm > releases where PMI-2 is borked :-( > > FWIW: I think people misunderstood my statement. I specifically did *not* > propose to *lose* PMI-2 support. I suggested that we change it to > "on-by-request" instead of the current "on-by-default" so we wouldn't keep > getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation > stabilized, then we could reverse that policy. > > However, given that both you and Chris appear to prefer to keep it > "on-by-default", we'll see if we can find a way to detect that PMI-2 is > broken and then fall back to PMI-1. > > > On May 7, 2014, at 7:39 AM, Joshua Ladd wrote: > > Just saw this thread, and I second Chris' observations: at scale we are > seeing huge gains in jobstart performance with PMI2 over PMI1. We > *CANNOT*loose this functionality. For competitive reasons, I cannot provide > exact > numbers, but let's say the difference is in the ballpark of a full > order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely > unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, > but there is no contest between PMI1 and PMI2. We (MLNX) are actively > working to resolve some of the scalability issues in PMI2. > > Josh > > Joshua S. Ladd > Staff Engineer, HPC Software > Mellanox Technologies > > Email: josh...@mellanox.com > > > On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote: > >> Interesting - how many nodes were involved? As I said, the bad scaling >> becomes more evident at a fairly high node count. >> >> On May 7, 2014, at 12:07 AM, Christopher Samuel >> wrote: >> >> > -BEGIN PGP SIGNED MESSAGE- >> > Hash: SHA1 >> > >> > Hiya Ralph, >> > >> > On 07/05/14 14:49, Ralph Castain wrote: >> > >> >> I should have looked closer to see the numbers you posted, Chris - >> >> those include time for MPI wireup. So what you are seeing is that >> >> mpirun is much more efficient at exchanging the MPI endpoint info >> >> than PMI. I suspect that PMI2 is not much better as the primary >> >> reason for the difference is that mpriun sends blobs, while PMI >> >> requires that everything be encoded into strings and sent in little >> >> pieces. >> >> >> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex" >> >> operation) much faster, and MPI_Init completes faster. Rest of the >> >> computation should be the same, so long compute apps will see the >> >> difference narrow considerably. >> > >> > Unfortunately it looks like I had an enthusiastic cleanup at some point >> > and so I cannot find the out files from those runs at the moment, but >> > I did find some comparisons from around that time. >> > >> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103 >> > run with mpirun and srun successively from inside the same Slurm job. 
>> > >> > mpirun namd2 macpf.conf >> > srun --mpi=pmi2 namd2 macpf.conf >> > >> > Firstly the mpirun output (grep'ing the interesting bits): >> > >> > Charm++> Running on MPI version: 2.1 >> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns >> 1055.19 MB memory >> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns >> 1055.19 MB memory >> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns >> 1055.19 MB memory >> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns >> 1055.19 MB memory >> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns >> 1055.19 MB memory >> > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB >> > >> > Now the srun output: >> > >> > Charm++> Running on MPI version: 2.1 >> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns >> 1036.75 MB memory >> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns >> 1036.75 MB memory >> > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns >> 1036.75 MB memory >> > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns >> 1036.75 MB memory >> > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns >> 1036.75 MB memory >> > WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB >> > >> > >> > The next two pairs are first launched using mpirun from 1.6.x and then >> with srun >> > from 1.7.3a1r29103.
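To make the configure knob Josh is proposing concrete, here is a hedged sketch of the two build lines an admin might use, assuming a Slurm tree under /opt/slurm (the path is made up; --with-slurm and --with-pmi are existing configure options, while --with-pmi2 is only the proposed spelling, not something configure understands today):

  # current behaviour: --with-pmi silently prefers PMI2 whenever pmi2.h
  # and libpmi2 are found under the given prefix
  ./configure --with-slurm --with-pmi=/opt/slurm

  # proposed: stay on PMI1 unless PMI2 is explicitly requested
  ./configure --with-slurm --with-pmi=/opt/slurm --with-pmi2

With a switch like that, flipping between PMI1 and PMI2 only means rebuilding Open MPI, not rebuilding SLURM or hacking its installation, which is exactly the experiment Josh wants to run.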
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Okay, then we'll just have to develop a workaround for all those Slurm releases where PMI-2 is borked :-( FWIW: I think people misunderstood my statement. I specifically did *not* propose to *lose* PMI-2 support. I suggested that we change it to "on-by-request" instead of the current "on-by-default" so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation stabilized, then we could reverse that policy. However, given that both you and Chris appear to prefer to keep it "on-by-default", we'll see if we can find a way to detect that PMI-2 is broken and then fall back to PMI-1. On May 7, 2014, at 7:39 AM, Joshua Laddwrote: > Just saw this thread, and I second Chris' observations: at scale we are > seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT > loose this functionality. For competitive reasons, I cannot provide exact > numbers, but let's say the difference is in the ballpark of a full > order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely > unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, but > there is no contest between PMI1 and PMI2. We (MLNX) are actively working to > resolve some of the scalability issues in PMI2. > > Josh > > Joshua S. Ladd > Staff Engineer, HPC Software > Mellanox Technologies > > Email: josh...@mellanox.com > > > On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote: > Interesting - how many nodes were involved? As I said, the bad scaling > becomes more evident at a fairly high node count. > > On May 7, 2014, at 12:07 AM, Christopher Samuel wrote: > > > -BEGIN PGP SIGNED MESSAGE- > > Hash: SHA1 > > > > Hiya Ralph, > > > > On 07/05/14 14:49, Ralph Castain wrote: > > > >> I should have looked closer to see the numbers you posted, Chris - > >> those include time for MPI wireup. So what you are seeing is that > >> mpirun is much more efficient at exchanging the MPI endpoint info > >> than PMI. I suspect that PMI2 is not much better as the primary > >> reason for the difference is that mpriun sends blobs, while PMI > >> requires that everything be encoded into strings and sent in little > >> pieces. > >> > >> Hence, mpirun can exchange the endpoint info (the dreaded "modex" > >> operation) much faster, and MPI_Init completes faster. Rest of the > >> computation should be the same, so long compute apps will see the > >> difference narrow considerably. > > > > Unfortunately it looks like I had an enthusiastic cleanup at some point > > and so I cannot find the out files from those runs at the moment, but > > I did find some comparisons from around that time. > > > > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103 > > run with mpirun and srun successively from inside the same Slurm job. 
> > > > mpirun namd2 macpf.conf > > srun --mpi=pmi2 namd2 macpf.conf > > > > Firstly the mpirun output (grep'ing the interesting bits): > > > > Charm++> Running on MPI version: 2.1 > > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB > > memory > > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB > > memory > > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB > > memory > > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB > > memory > > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB > > memory > > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB > > > > Now the srun output: > > > > Charm++> Running on MPI version: 2.1 > > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB > > memory > > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB > > memory > > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB > > memory > > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB > > memory > > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB > > memory > > WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB > > > > > > The next two pairs are first launched using mpirun from 1.6.x and then with > > srun > > from 1.7.3a1r29103. Again each pair inside the same Slurm job with the > > same inputs. > > > > First pair mpirun: > > > > Charm++> Running on MPI version: 2.1 > > Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB > > memory > > Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB > > memory > > Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB > > memory > > Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB > > memory > > Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB > > memory > > WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB > > > > First pair srun: > > > >
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Just saw this thread, and I second Chris' observations: at scale we are seeing huge gains in jobstart performance with PMI2 over PMI1. We *CANNOT*loose this functionality. For competitive reasons, I cannot provide exact numbers, but let's say the difference is in the ballpark of a full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, but there is no contest between PMI1 and PMI2. We (MLNX) are actively working to resolve some of the scalability issues in PMI2. Josh Joshua S. Ladd Staff Engineer, HPC Software Mellanox Technologies Email: josh...@mellanox.com On Wed, May 7, 2014 at 4:00 AM, Ralph Castainwrote: > Interesting - how many nodes were involved? As I said, the bad scaling > becomes more evident at a fairly high node count. > > On May 7, 2014, at 12:07 AM, Christopher Samuel > wrote: > > > -BEGIN PGP SIGNED MESSAGE- > > Hash: SHA1 > > > > Hiya Ralph, > > > > On 07/05/14 14:49, Ralph Castain wrote: > > > >> I should have looked closer to see the numbers you posted, Chris - > >> those include time for MPI wireup. So what you are seeing is that > >> mpirun is much more efficient at exchanging the MPI endpoint info > >> than PMI. I suspect that PMI2 is not much better as the primary > >> reason for the difference is that mpriun sends blobs, while PMI > >> requires that everything be encoded into strings and sent in little > >> pieces. > >> > >> Hence, mpirun can exchange the endpoint info (the dreaded "modex" > >> operation) much faster, and MPI_Init completes faster. Rest of the > >> computation should be the same, so long compute apps will see the > >> difference narrow considerably. > > > > Unfortunately it looks like I had an enthusiastic cleanup at some point > > and so I cannot find the out files from those runs at the moment, but > > I did find some comparisons from around that time. > > > > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103 > > run with mpirun and srun successively from inside the same Slurm job. > > > > mpirun namd2 macpf.conf > > srun --mpi=pmi2 namd2 macpf.conf > > > > Firstly the mpirun output (grep'ing the interesting bits): > > > > Charm++> Running on MPI version: 2.1 > > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 > MB memory > > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 > MB memory > > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 > MB memory > > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 > MB memory > > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 > MB memory > > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB > > > > Now the srun output: > > > > Charm++> Running on MPI version: 2.1 > > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 > MB memory > > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 > MB memory > > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 > MB memory > > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 > MB memory > > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 > MB memory > > WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB > > > > > > The next two pairs are first launched using mpirun from 1.6.x and then > with srun > > from 1.7.3a1r29103. Again each pair inside the same Slurm job with the > same inputs. 
> > > > First pair mpirun: > > > > Charm++> Running on MPI version: 2.1 > > Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB > memory > > Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB > memory > > Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB > memory > > Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB > memory > > Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB > memory > > WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB > > > > First pair srun: > > > > Charm++> Running on MPI version: 2.1 > > Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB > memory > > Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB > memory > > Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB > memory > > Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB > memory > > Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB > memory > > WallClock: 7476.643555 CPUTime: 7476.643555 Memory: 968.867188 MB > > > > > > Second pair mpirun: > > > > Charm++> Running on MPI version: 2.1 > > Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB > memory > > Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB > memory > > Info:
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Interesting - how many nodes were involved? As I said, the bad scaling becomes more evident at a fairly high node count. On May 7, 2014, at 12:07 AM, Christopher Samuelwrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Hiya Ralph, > > On 07/05/14 14:49, Ralph Castain wrote: > >> I should have looked closer to see the numbers you posted, Chris - >> those include time for MPI wireup. So what you are seeing is that >> mpirun is much more efficient at exchanging the MPI endpoint info >> than PMI. I suspect that PMI2 is not much better as the primary >> reason for the difference is that mpriun sends blobs, while PMI >> requires that everything be encoded into strings and sent in little >> pieces. >> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex" >> operation) much faster, and MPI_Init completes faster. Rest of the >> computation should be the same, so long compute apps will see the >> difference narrow considerably. > > Unfortunately it looks like I had an enthusiastic cleanup at some point > and so I cannot find the out files from those runs at the moment, but > I did find some comparisons from around that time. > > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103 > run with mpirun and srun successively from inside the same Slurm job. > > mpirun namd2 macpf.conf > srun --mpi=pmi2 namd2 macpf.conf > > Firstly the mpirun output (grep'ing the interesting bits): > > Charm++> Running on MPI version: 2.1 > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB > memory > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB > memory > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB > memory > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB > memory > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB > memory > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB > > Now the srun output: > > Charm++> Running on MPI version: 2.1 > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB > memory > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB > memory > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB > memory > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB > memory > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB > memory > WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB > > > The next two pairs are first launched using mpirun from 1.6.x and then with > srun > from 1.7.3a1r29103. Again each pair inside the same Slurm job with the same > inputs. 
> > First pair mpirun: > > Charm++> Running on MPI version: 2.1 > Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory > Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory > Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory > Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory > Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory > WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB > > First pair srun: > > Charm++> Running on MPI version: 2.1 > Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB > memory > Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB > memory > Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB > memory > Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory > Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB > memory > WallClock: 7476.643555 CPUTime: 7476.643555 Memory: 968.867188 MB > > > Second pair mpirun: > > Charm++> Running on MPI version: 2.1 > Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB > memory > Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory > Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB > memory > Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB > memory > Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB > memory > WallClock: 7842.831543 CPUTime: 7842.831543 Memory: 1004.050781 MB > > Second pair srun: > > Charm++> Running on MPI version: 2.1 > Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory > Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory > Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory > Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory > Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory > WallClock: 7522.677246 CPUTime: 7522.677246 Memory: 969.433594 MB > > > So to me it
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hiya Ralph, On 07/05/14 14:49, Ralph Castain wrote: > I should have looked closer to see the numbers you posted, Chris - > those include time for MPI wireup. So what you are seeing is that > mpirun is much more efficient at exchanging the MPI endpoint info > than PMI. I suspect that PMI2 is not much better as the primary > reason for the difference is that mpirun sends blobs, while PMI > requires that everything be encoded into strings and sent in little > pieces. > > Hence, mpirun can exchange the endpoint info (the dreaded "modex" > operation) much faster, and MPI_Init completes faster. Rest of the > computation should be the same, so long-running compute apps will see the > difference narrow considerably. Unfortunately it looks like I had an enthusiastic cleanup at some point and so I cannot find the out files from those runs at the moment, but I did find some comparisons from around that time. This first pair compares running NAMD with OMPI 1.7.3a1r29103 run with mpirun and srun successively from inside the same Slurm job. mpirun namd2 macpf.conf srun --mpi=pmi2 namd2 macpf.conf Firstly the mpirun output (grep'ing the interesting bits): Charm++> Running on MPI version: 2.1 Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB memory Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB memory Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB memory Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB memory Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB memory WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB Now the srun output: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB memory Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB memory Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB memory Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB memory Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB memory WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB The next two pairs are first launched using mpirun from 1.6.x and then with srun from 1.7.3a1r29103. Again each pair inside the same Slurm job with the same inputs. 
First pair mpirun: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB First pair srun: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory WallClock: 7476.643555 CPUTime: 7476.643555 Memory: 968.867188 MB Second pair mpirun: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory WallClock: 7842.831543 CPUTime: 7842.831543 Memory: 1004.050781 MB Second pair srun: Charm++> Running on MPI version: 2.1 Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory WallClock: 7522.677246 CPUTime: 7522.677246 Memory: 969.433594 MB So to me it looks like (for NAMD on our system at least) that PMI2 does seem to give better scalability. All the best! Chris - -- Christopher Samuel  Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci
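For anyone wanting to reproduce this sort of comparison, a minimal job script along the lines below should do it. The resource requests, module name and output file names are placeholders for whatever the local site uses; only the two launch commands and the grep'd fields are taken from the runs above.

#!/bin/bash
#SBATCH --ntasks=512
#SBATCH --time=04:00:00
# Assumes NAMD and Open MPI are provided by an environment module locally.
module load namd

# Same binary and input, launched once via Open MPI's mpirun...
mpirun namd2 macpf.conf > namd-mpirun.out

# ...and once directly with srun using Slurm's PMI2 plugin.
srun --mpi=pmi2 namd2 macpf.conf > namd-srun.out

# Pull out the interesting bits afterwards.
grep -E 'Benchmark time|WallClock' namd-mpirun.out namd-srun.out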
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
I should have looked closer to see the numbers you posted, Chris - those include time for MPI wireup. So what you are seeing is that mpirun is much more efficient at exchanging the MPI endpoint info than PMI. I suspect that PMI2 is not much better as the primary reason for the difference is that mpirun sends blobs, while PMI requires that everything be encoded into strings and sent in little pieces. Hence, mpirun can exchange the endpoint info (the dreaded "modex" operation) much faster, and MPI_Init completes faster. Rest of the computation should be the same, so long-running compute apps will see the difference narrow considerably. HTH Ralph On May 6, 2014, at 9:45 PM, Ralph Castain wrote: > Ah, interesting - my comments were in respect to startup time (specifically, > MPI wireup) > > On May 6, 2014, at 8:49 PM, Christopher Samuel wrote: > >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA1 >> >> On 07/05/14 13:37, Moody, Adam T. wrote: >> >>> Hi Chris, >> >> Hi Adam, >> >>> I'm interested in SLURM / OpenMPI startup numbers, but I haven't >>> done this testing myself. We're stuck with an older version of >>> SLURM for various internal reasons, and I'm wondering whether it's >>> worth the effort to back port the PMI2 support. Can you share some >>> of the differences in times at different scales? >> >> We've not looked at startup times I'm afraid, this was time to >> solution. We noticed it with Slurm when we first started using it on >> x86-64 for our NAMD tests (this from a posting to the list last year >> when I raised the issue and were told PMI2 would be the solution): >> >>> Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB. >>> >>> Here are some timings as reported as the WallClock time by NAMD >>> itself (so not including startup/tear down overhead from Slurm). >>> >>> srun: >>> >>> run1/slurm-93744.out:WallClock: 695.079773 CPUTime: 695.079773 >>> run4/slurm-94011.out:WallClock: 723.907959 CPUTime: 723.907959 >>> run5/slurm-94013.out:WallClock: 726.156799 CPUTime: 726.156799 >>> run6/slurm-94017.out:WallClock: 724.828918 CPUTime: 724.828918 >>> >>> Average of 692 seconds >>> >>> mpirun: >>> >>> run2/slurm-93746.out:WallClock: 559.311035 CPUTime: 559.311035 >>> run3/slurm-93910.out:WallClock: 544.116333 CPUTime: 544.116333 >>> run7/slurm-94019.out:WallClock: 586.072693 CPUTime: 586.072693 >>> >>> Average of 563 seconds. >>> >>> So that's about 23% slower. >>> >>> Everything is identical (they're all symlinks to the same golden >>> master) *except* for the srun / mpirun which is modified by >>> copying the batch script and substituting mpirun for srun. >> >> >> >> - -- >> Christopher Samuel  Senior Systems Administrator >> VLSCI - Victorian Life Sciences Computation Initiative >> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 >> http://www.vlsci.org.au/ http://twitter.com/vlsci >> >> -BEGIN PGP SIGNATURE- >> Version: GnuPG v1.4.14 (GNU/Linux) >> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ >> >> iEYEARECAAYFAlNprUUACgkQO2KABBYQAh9rLACfcZc4HR/u6G0bJejM3C/my7Nw >> 8b4AnRasOMvKZjpjpyKkbplc6/Iq9qBK >> =pqH9 >> -END PGP SIGNATURE- >> ___ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2014/05/14694.php >
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Ah, interesting - my comments were in respect to startup time (specifically, MPI wireup) On May 6, 2014, at 8:49 PM, Christopher Samuel wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 07/05/14 13:37, Moody, Adam T. wrote: > >> Hi Chris, > > Hi Adam, > >> I'm interested in SLURM / OpenMPI startup numbers, but I haven't >> done this testing myself. We're stuck with an older version of >> SLURM for various internal reasons, and I'm wondering whether it's >> worth the effort to back port the PMI2 support. Can you share some >> of the differences in times at different scales? > > We've not looked at startup times I'm afraid, this was time to > solution. We noticed it with Slurm when we first started using it on > x86-64 for our NAMD tests (this from a posting to the list last year > when I raised the issue and were told PMI2 would be the solution): > >> Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB. >> >> Here are some timings as reported as the WallClock time by NAMD >> itself (so not including startup/tear down overhead from Slurm). >> >> srun: >> >> run1/slurm-93744.out:WallClock: 695.079773 CPUTime: 695.079773 >> run4/slurm-94011.out:WallClock: 723.907959 CPUTime: 723.907959 >> run5/slurm-94013.out:WallClock: 726.156799 CPUTime: 726.156799 >> run6/slurm-94017.out:WallClock: 724.828918 CPUTime: 724.828918 >> >> Average of 692 seconds >> >> mpirun: >> >> run2/slurm-93746.out:WallClock: 559.311035 CPUTime: 559.311035 >> run3/slurm-93910.out:WallClock: 544.116333 CPUTime: 544.116333 >> run7/slurm-94019.out:WallClock: 586.072693 CPUTime: 586.072693 >> >> Average of 563 seconds. >> >> So that's about 23% slower. >> >> Everything is identical (they're all symlinks to the same golden >> master) *except* for the srun / mpirun which is modified by >> copying the batch script and substituting mpirun for srun. > > > > - -- > Christopher Samuel  Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > -BEGIN PGP SIGNATURE- > Version: GnuPG v1.4.14 (GNU/Linux) > Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ > > iEYEARECAAYFAlNprUUACgkQO2KABBYQAh9rLACfcZc4HR/u6G0bJejM3C/my7Nw > 8b4AnRasOMvKZjpjpyKkbplc6/Iq9qBK > =pqH9 > -END PGP SIGNATURE- > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/05/14694.php
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 07/05/14 13:37, Moody, Adam T. wrote: > Hi Chris, Hi Adam, > I'm interested in SLURM / OpenMPI startup numbers, but I haven't > done this testing myself. We're stuck with an older version of > SLURM for various internal reasons, and I'm wondering whether it's > worth the effort to back port the PMI2 support. Can you share some > of the differences in times at different scales? We've not looked at startup times I'm afraid, this was time to solution. We noticed it with Slurm when we first started using it on x86-64 for our NAMD tests (this from a posting to the list last year when I raised the issue and were told PMI2 would be the solution): > Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB. > > Here are some timings as reported as the WallClock time by NAMD > itself (so not including startup/tear down overhead from Slurm). > > srun: > > run1/slurm-93744.out:WallClock: 695.079773 CPUTime: 695.079773 > run4/slurm-94011.out:WallClock: 723.907959 CPUTime: 723.907959 > run5/slurm-94013.out:WallClock: 726.156799 CPUTime: 726.156799 > run6/slurm-94017.out:WallClock: 724.828918 CPUTime: 724.828918 > > Average of 692 seconds > > mpirun: > > run2/slurm-93746.out:WallClock: 559.311035 CPUTime: 559.311035 > run3/slurm-93910.out:WallClock: 544.116333 CPUTime: 544.116333 > run7/slurm-94019.out:WallClock: 586.072693 CPUTime: 586.072693 > > Average of 563 seconds. > > So that's about 23% slower. > > Everything is identical (they're all symlinks to the same golden > master) *except* for the srun / mpirun which is modified by > copying the batch script and substituting mpirun for srun. - -- Christopher Samuel  Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.14 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlNprUUACgkQO2KABBYQAh9rLACfcZc4HR/u6G0bJejM3C/my7Nw 8b4AnRasOMvKZjpjpyKkbplc6/Iq9qBK =pqH9 -END PGP SIGNATURE-
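The per-run figures quoted above look like the output of a straight grep over the job output files, so something along these lines would collect and average them; the run*/slurm-*.out layout is assumed to match the names shown.

# Pull the NAMD WallClock figure out of each job's output and average them.
grep -H 'WallClock:' run*/slurm-*.out \
  | awk '{sum += $2; n++} END {if (n) printf "%d runs, mean WallClock %.1f s\n", n, sum/n}'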
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
FWIW: we see varying reports about the scalability of Slurm, especially at large cluster sizes. Last I saw/tested, there is a quadratic term that begins to dominate above 2k nodes. Others swear it is better. Guess I'd be cautious and definitely test things before investing in a move - I'm not convinced. On May 6, 2014, at 8:37 PM, Moody, Adam T. <mood...@llnl.gov> wrote: > Hi Chris, > I'm interested in SLURM / OpenMPI startup numbers, but I haven't done this > testing myself. We're stuck with an older version of SLURM for various > internal reasons, and I'm wondering whether it's worth the effort to back > port the PMI2 support. Can you share some of the differences in times at > different scales? > Thanks, > -Adam > > From: devel [devel-boun...@open-mpi.org] on behalf of Christopher Samuel > [sam...@unimelb.edu.au] > Sent: Tuesday, May 06, 2014 8:32 PM > To: de...@open-mpi.org > Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is > specifically requested > > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 07/05/14 12:53, Ralph Castain wrote: > >> We have been seeing a lot of problems with the Slurm PMI-2 support >> (not in OMPI - it's the code in Slurm that is having problems). At >> this time, I'm unaware of any advantage in using PMI-2 over PMI-1 >> in Slurm - the scaling is equally poor, and PMI-2 does not support >> any additional functionality. >> >> I know that Cray PMI-2 has a definite advantage, so I'm proposing >> that we turn PMI-2 "off" when under Slurm unless the user >> specifically requests we use it. > > Our local testing has shown that PMI-2 in 1.7.x gives a massive > improvement in scaling when starting jobs with srun over using srun > with OMPI 1.6.x, and now that OMPI 1.8.x is out we're planning on > moving to using PMI2 with OMPI and srun. > > Using mpirun gives good performance with OMPI 1.6.x, but Slurm then > gets all its memory stats wrong, and if you run with CR_Core_Memory in > Slurm you have a very high risk your job will get killed incorrectly. > > All the best, > Chris > - -- > Christopher Samuel  Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > -BEGIN PGP SIGNATURE- > Version: GnuPG v1.4.14 (GNU/Linux) > Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ > > iEYEARECAAYFAlNpqUwACgkQO2KABBYQAh/igwCfQSB/v3tI37Rq4z5z/0xT/BYU > 6ToAn3Qt6tOt46LQD25eHhlx+3z/sjnQ > =LEHf > -END PGP SIGNATURE- > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/05/14691.php > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/05/14692.php
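For sites weighing the same decision, two quick checks are worth running before committing either way. Both use standard Slurm tools, though the exact output (and whether a pmi2 plugin is present at all) depends on how the local Slurm was built, so treat these as a sketch rather than a recipe.

# List the PMI flavours this Slurm build can offer srun (look for pmi2).
srun --mpi=list

# See whether memory-based enforcement such as CR_Core_Memory is active,
# which is where the mpirun memory-accounting problem described above bites.
scontrol show config | grep -i SelectTypeParameters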