2014-05-08 7:15 GMT+07:00 Ralph Castain <r...@open-mpi.org>:

> Take a look in opal/mca/common/pmi - we already do a bunch of #if PMI2
> stuff in there. All we are talking about doing here is:
>
> * making those selections be runtime based on an MCA param: compiling if
>   PMI2 is available, but selecting it at runtime
>
> * moving some additional functions into that code area and out of the
>   individual components
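As a rough sketch of the runtime selection being described - assuming a
hypothetical opal_common_pmi_init() wrapper, a use_pmi2 flag fed from an MCA
param, and a configure-set WANT_PMI2_SUPPORT guard (none of these names are
the actual Open MPI code; only the PMI_*/PMI2_* calls are the real client
API) - the common code could dispatch like this:

/* Illustrative only: a hypothetical common/pmi wrapper that picks PMI-1 or
 * PMI-2 at runtime instead of via #if at compile time.  The wrapper name,
 * the use_pmi2 flag, and the WANT_PMI2_SUPPORT guard are made up for this
 * sketch; the PMI calls themselves are the standard client API. */
#include <stdbool.h>
#include <pmi.h>
#if WANT_PMI2_SUPPORT            /* assumed: set by configure when pmi2.h was found */
#include <pmi2.h>
#endif

static bool use_pmi2 = false;    /* would be set from an MCA param at startup */

int opal_common_pmi_init(int *rank, int *size)
{
    int spawned = 0, appnum = 0;

#if WANT_PMI2_SUPPORT
    if (use_pmi2) {
        /* PMI-2 hands back rank and size directly from init */
        if (PMI2_SUCCESS != PMI2_Init(&spawned, size, rank, &appnum)) {
            return -1;
        }
        return 0;
    }
#endif
    /* PMI-1 path: init, then query rank and size separately */
    if (PMI_SUCCESS != PMI_Init(&spawned) ||
        PMI_SUCCESS != PMI_Get_rank(rank) ||
        PMI_SUCCESS != PMI_Get_size(size)) {
        return -1;
    }
    return 0;
}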
Ok, that is pretty clear now. And I will do exactly #2. Thank you.

> On May 7, 2014, at 5:08 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> I like #2 too.
> But my question was slightly different. Can we encapsulate the PMI logic that
> OMPI uses in common/pmi as #2 suggests, but have 2 different
> implementations of this component, say common/pmi and common/pmi2? I am
> asking because I have concerns that this kind of component is not supposed
> to be duplicated.
> In this case we could have one common MCA parameter and 2 components, as
> was suggested by Jeff.
>
>
> 2014-05-08 7:01 GMT+07:00 Ralph Castain <r...@open-mpi.org>:
>
>> The desired solution is to have the ability to select pmi-1 vs pmi-2 at
>> runtime. This can be done in two ways:
>>
>> 1. you could have separate pmi1 and pmi2 components in each framework.
>>    You'd want to define only one common MCA param to direct the selection,
>>    however.
>>
>> 2. you could have a single pmi component in each framework, calling code
>>    in the appropriate common/pmi location. You would then need a runtime MCA
>>    param to select whether pmi-1 or pmi-2 was going to be used, and have the
>>    common code check before making the desired calls.
>>
>> The choice of method is left up to you. They each have their negatives.
>> If it were me, I'd probably try #2 first, assuming the codes are mostly
>> common in the individual frameworks.
>>
>>
>> On May 7, 2014, at 4:51 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> Just reread your suggestions from our out-of-list discussion and found
>> that I had misunderstood them. So no parallel PMI! Take all possible code into
>> opal/mca/common/pmi.
>> To additionally clarify, which is the preferred way:
>> 1. to create one joint PMI module with switches to decide which
>>    functionality to use, or
>> 2. to have 2 separate common modules, one for PMI1 and one for PMI2 - and
>>    does this fit the opal/mca/common/ ideology at all?
>>
>>
>> 2014-05-08 6:44 GMT+07:00 Artem Polyakov <artpo...@gmail.com>:
>>
>>> 2014-05-08 5:54 GMT+07:00 Ralph Castain <r...@open-mpi.org>:
>>>
>>>> Ummm....no, I don't think that's right. I believe we decided to instead
>>>> create the separate components, default to PMI-2 if available, print a nice
>>>> error message if not, and otherwise use PMI-1.
>>>>
>>>> I don't want to initialize both PMIs in parallel as most installations
>>>> won't support it.
>>>
>>> Ok, I agree. Besides the lack of support, there can be a performance hit
>>> caused by PMI1 initialization at scale. This is not the case for SLURM's
>>> PMI1, since it is quite simple and local. But I didn't consider other
>>> implementations.
>>>
>>> On May 7, 2014, at 3:49 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>>
>>>> We discussed Joshua's concerns with Ralph and decided to try the automatic
>>>> PMI2 correctness check first, as was initially intended. Here is my idea. The
>>>> universal way to decide if PMI2 is correct is to compare PMI_Init(..,
>>>> &rank, &size, ...) and PMI2_Init(.., &rank, &size, ...). Size and rank
>>>> should be equal. In that case we proceed with PMI2, finalizing PMI1.
>>>> Otherwise we finalize PMI2 and proceed with PMI1.
>>>> I need to clarify with the SLURM guys whether parallel initialization of
>>>> both PMIs is legal. If not, we'll do that sequentially.
>>>> In other places we'll just use the flag saying which PMI version to use.
>>>> Does that sound reasonable?
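A rough sketch of what that sanity check could look like, using the actual
client calls (note that PMI-1 reports rank and size through
PMI_Get_rank/PMI_Get_size rather than through PMI_Init itself); the
probe_pmi2() name is made up for illustration, and whether both PMIs may be
initialized in the same process is exactly the open question for the SLURM
folks:

/* Illustrative sketch of the proposed check: initialize both PMIs, compare
 * the rank/size each one reports, and keep PMI-2 only if they agree. */
#include <stdbool.h>
#include <pmi.h>
#include <pmi2.h>

/* returns true if PMI-2 looks sane and should be used */
static bool probe_pmi2(int *rank, int *size)
{
    int spawned = 0, appnum = 0;
    int rank1, size1, rank2, size2;

    /* PMI-1: init, then query rank and size */
    if (PMI_SUCCESS != PMI_Init(&spawned) ||
        PMI_SUCCESS != PMI_Get_rank(&rank1) ||
        PMI_SUCCESS != PMI_Get_size(&size1)) {
        return false;                    /* nothing to compare against (corner case) */
    }

    /* PMI-2: init returns rank and size directly */
    if (PMI2_SUCCESS != PMI2_Init(&spawned, &size2, &rank2, &appnum)) {
        *rank = rank1; *size = size1;
        return false;                    /* no usable PMI-2: keep PMI-1 */
    }
    if (rank1 != rank2 || size1 != size2) {
        PMI2_Finalize();                 /* PMI-2 reports bogus values: fall back */
        *rank = rank1; *size = size1;
        return false;
    }

    PMI_Finalize();                      /* PMI-2 agrees: drop PMI-1, proceed with PMI-2 */
    *rank = rank2; *size = size2;
    return true;
}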
>>>> 2014-05-07 23:10 GMT+07:00 Artem Polyakov <artpo...@gmail.com>:
>>>>
>>>>> That's a good point. There are actually a bunch of modules in ompi,
>>>>> opal and orte that would have to be duplicated.
>>>>>
>>>>> On Wednesday, May 7, 2014, Joshua Ladd wrote:
>>>>>
>>>>>> +1 Sounds like a good idea - but decoupling the two and adding all
>>>>>> the right selection mojo might be a bit of a pain. There are several places
>>>>>> in OMPI where the distinction between PMI1 and PMI2 is made, not only in
>>>>>> grpcomm. The DB and ESS frameworks, off the top of my head.
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>>>>
>>>>>>> Good idea :)!
>>>>>>>
>>>>>>> On Wednesday, May 7, 2014, Ralph Castain wrote:
>>>>>>>
>>>>>>> Jeff actually had a useful suggestion (gasp!). He proposed that we
>>>>>>> separate the PMI-1 and PMI-2 code into separate components so you could
>>>>>>> select them at runtime. Thus, we would build both (assuming both PMI-1 and
>>>>>>> PMI-2 libs are found), default to PMI-1, but users could select to try PMI-2.
>>>>>>> If the PMI-2 component failed, we would emit a show_help indicating that
>>>>>>> they probably have a broken PMI-2 version and should try PMI-1.
>>>>>>>
>>>>>>> Make sense?
>>>>>>> Ralph
>>>>>>>
>>>>>>> On May 7, 2014, at 8:00 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> On May 7, 2014, at 7:56 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>>>
>>>>>>> Ah, I see. Sorry for the reactionary comment - but this feature
>>>>>>> falls squarely within my "jurisdiction", and we've invested a lot in
>>>>>>> improving OMPI jobstart under srun.
>>>>>>>
>>>>>>> That being said (now that I've taken some deep breaths and carefully
>>>>>>> read your original email :)), what you're proposing isn't a bad idea. I
>>>>>>> think it would be good to add a "--with-pmi2" flag to configure, since
>>>>>>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
>>>>>>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
>>>>>>> hack the installation.
>>>>>>>
>>>>>>> That would be a much simpler solution than what Artem proposed
>>>>>>> (off-list), where we would try PMI2 and then, if it didn't work, try to
>>>>>>> figure out how to fall back to PMI1. I'll add this for now, and if Artem
>>>>>>> wants to try his more automagic solution and can make it work, then we can
>>>>>>> reconsider that option.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> Okay, then we'll just have to develop a workaround for all those
>>>>>>> Slurm releases where PMI-2 is borked :-(
>>>>>>>
>>>>>>> FWIW: I think people misunderstood my statement. I specifically did
>>>>>>> *not* propose to *lose* PMI-2 support. I suggested that we change it to
>>>>>>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>>>>>>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>>>>>>> stabilized, we could reverse that policy.
>>>>>>>
>>>>>>> However, given that both you and Chris appear to prefer to keep it
>>>>>>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>>>>>>> broken and then fall back to PMI-1.
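A minimal sketch of what the "detect broken PMI-2 and point the user back at
PMI-1" path could look like inside a PMI-2 component; the
pmi2_component_init() name is hypothetical and a plain fprintf stands in for
the real show_help machinery:

/* Illustrative only: a hypothetical PMI-2 component init path that bails out
 * with a user-facing message when PMI-2 is missing or returns nonsense, so
 * the caller can fail over to the pmi1 component. */
#include <stdio.h>
#include <pmi2.h>

static int pmi2_component_init(int *rank, int *size)
{
    int spawned = 0, appnum = 0;

    if (PMI2_SUCCESS != PMI2_Init(&spawned, size, rank, &appnum)) {
        fprintf(stderr,
                "PMI-2 initialization failed.\n"
                "Your resource manager's PMI-2 support is probably broken;\n"
                "please select the PMI-1 component instead.\n");
        return -1;                       /* caller falls back to the pmi1 component */
    }
    if (*size <= 0 || *rank < 0 || *rank >= *size) {
        PMI2_Finalize();                 /* initialized, but reporting inconsistent values */
        fprintf(stderr,
                "PMI-2 returned inconsistent rank/size values;\n"
                "please select the PMI-1 component instead.\n");
        return -1;
    }
    return 0;
}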
>>>>>>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>>>
>>>>>>> Just saw this thread, and I second Chris' observations: at scale we
>>>>>>> are seeing huge gains in jobstart performance with PMI2 over PMI1. We
>>>>>>> *CANNOT* lose this functionality. For competitive reasons, I
>>>>>>> cannot provide exact numbers, but let's say the difference is in the
>>>>>>> ballpark of a full order of magnitude at 20K ranks versus PMI1. PMI1 is
>>>>>>> completely unacceptable/unusable at scale. Certainly PMI2 still has scaling
>>>>>>> issues, but there is no contest between PMI1 and PMI2. We (MLNX) are
>>>>>>> actively working to resolve some of the scalability issues in PMI2.
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> Joshua S. Ladd
>>>>>>> Staff Engineer, HPC Software
>>>>>>> Mellanox Technologies
>>>>>>> Email: josh...@mellanox.com
>>>>>>>
>>>>>>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> Interesting - how many nodes were involved? As I said, the bad
>>>>>>> scaling becomes more evident at a fairly high node count.
>>>>>>>
>>>>>>> On May 7, 2014, at 12:07 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
>>>>>>>
>>>>>>> > Hiya Ralph,
>>>>>>> >
>>>>>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>>>>>> >
>>>>>>> >> I should have looked closer to see the numbers you posted, Chris -
>>>>>>> >> those include time for MPI wireup. So what you are seeing is that
>>>>>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>>>>>> >> than PMI. I suspect that PMI2 is not much better, as the primary
>>>>>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>>>>>> >> requires that everything b
--
Best regards, Artem Y. Polyakov