It does look similar - question is: why didn’t this fix the problem? Will have to investigate.
Thanks

> On Dec 2, 2014, at 3:17 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> 2014-12-02 17:13 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
> Hmmm…if that is true, then it didn’t fix this problem as it is being reported
> in the master.
>
> I had this problem on my laptop installation. You can check my report, it was
> detailed enough, and see if you are hitting the same issue. My fix was also
> included in the 1.8 branch. I am not sure that this is the same issue but they
> look similar.
>
>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> I think this might be related to the configuration problem I was fixing with
>> Jeff a few months ago. Refer here:
>> https://github.com/open-mpi/ompi/pull/240
>>
>> 2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>> If it isn’t too much trouble, it would be good to confirm that it remains
>> broken. I strongly suspect it is, based on Moe’s comments.
>>
>> Obviously, other people are making this work. For Intel MPI, all you do is
>> point it at libpmi and they can run. However, they do explicitly dlopen it
>> in their code, and I don’t know what flags they might pass when they do so.
>>
>> If necessary, I suppose we could follow that pattern. In other words, rather
>> than specifically linking the “s1” component to libpmi, instead require that
>> the user point us to a pmi library via an MCA param, then explicitly dlopen
>> that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but
>> resolves the pmi linkage problem.
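A rough sketch of the pattern Ralph describes above: take the path of the PMI library from a user-supplied setting and open it with RTLD_GLOBAL, so that its symbols become available when resolving symbols of libraries loaded afterwards. Python's ctypes is used here only because it wraps dlopen() compactly; the component itself would do this in C. The environment variable name OMPI_MCA_pmix_s1_libpmi is hypothetical and is not an MCA parameter confirmed anywhere in this thread.

```python
import ctypes
import ctypes.util
import os

# Hypothetical setting name, chosen for illustration only.
PMI_LIB_ENV = "OMPI_MCA_pmix_s1_libpmi"


def load_pmi_library():
    """Open the user-specified PMI library with RTLD_GLOBAL.

    RTLD_GLOBAL makes the library's symbols available for resolving
    symbols of libraries that are loaded later (for example plugins
    that were not linked against their own dependencies).
    """
    path = os.environ.get(PMI_LIB_ENV)
    if path is None:
        raise RuntimeError(
            f"{PMI_LIB_ENV} is not set; point it at the PMI library to use"
        )
    # ctypes.CDLL calls dlopen() underneath; an unloadable path raises OSError.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

This only helps when the missing symbols are exported by a shared library that ends up loaded this way; it cannot supply symbols that no shared library exports.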
>>
>>
>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>> $ srun --version
>>> slurm 2.6.6-VENDOR_PROVIDED
>>>
>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>> I am 0 / 1
>>>
>>> $ srun -n 1 ~/hw
>>> /csc/home1/gouaillardet/hw: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or received
>>> srun: error: soleil: task 0: Exited with exit code 127
>>>
>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>> linux-vdso.so.1 => (0x00007fff54478000)
>>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>>> libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>>> /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>>>
>>>
>>> now, if i relink auth_munge.so so it depends on libslurm :
>>>
>>> $ srun -n 1 ~/hw
>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc
>>>
>>>
>>> i can give a try to the latest slurm if needed
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On 2014/12/02 12:56, Ralph Castain wrote:
>>>> Out of curiosity - how are you testing these? I have more current versions
>>>> of Slurm and would like to test the observations there.
>>>>
>>>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet
>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>
>>>>> I'd like to take a step back ...
>>>>>
>>>>> i previously tested with slurm 2.6.0, and it complained about the
>>>>> slurm_verbose symbol that is defined in libslurm.so
>>>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>>>
>>>>> now i tested with slurm 2.6.6 and it complains about the
>>>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>>>> defined in any dynamic library. it is internally defined in the static
>>>>> libcommon.a library, which is used to build the slurm binaries.
>>>>>
>>>>> as far as i understand, auth_munge.so can only be invoked from a slurm
>>>>> binary, which means it cannot be invoked from an mpi application
>>>>> even if it is linked with libslurm, libpmi, ...
>>>>>
>>>>> that looks like a slurm design issue that the slurm folks will take care
>>>>> of.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1
>>>>>> component as this is the only place that requires it, and it won’t hurt
>>>>>> anything to do so.
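The question Gilles raises above (is the missing symbol exported by a shared library at all?) can be approximated with a small script. This is a sketch only: it uses Python's ctypes, which calls dlopen() and dlsym() underneath, and the libslurm path mentioned in the trailing comment is an assumed location, not one taken from the thread.

```python
import ctypes
import ctypes.util
from typing import Optional


def exports_symbol(library: Optional[str], symbol: str) -> bool:
    """Return True if `symbol` can be resolved through `library`.

    The lookup goes through dlsym() on the library handle, so symbols
    provided by the library's own shared dependencies are found as well.
    A symbol that lives only in a static archive linked into some other
    executable is not visible here.
    """
    # Raises OSError if the library itself cannot be loaded.
    handle = ctypes.CDLL(library)
    # ctypes raises AttributeError for an unknown symbol; hasattr maps
    # that to False.
    return hasattr(handle, symbol)


# Example call (path is an assumption, adjust to the local installation):
#   exports_symbol("/usr/lib64/libslurm.so", "slurm_auth_get_arg_desc")
```

If such a check returns False for every shared library Slurm installs, neither extra link flags nor RTLD_GLOBAL can make the symbol resolvable from an MPI application.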
>>>>>>
>>>>>>
>>>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet
>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>
>>>>>>> Jeff,
>>>>>>>
>>>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>>>>>
>>>>>>> bottom line, i agree this is a slurm issue (slurm plugins should depend
>>>>>>> on libslurm, but they do not, yet)
>>>>>>>
>>>>>>> a possible workaround would be to make the pmi component a "proxy" that
>>>>>>> dlopens with RTLD_GLOBAL the "real" component in which the job is done.
>>>>>>> that being said, the impact is quite limited (no direct launch in slurm
>>>>>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>>>>>> someone else's problem.
>>>>>>> and that being said, configure could detect this broken pmi1 and not
>>>>>>> build pmi1 support or print a user friendly error message if pmi1 is
>>>>>>> used.
>>>>>>>
>>>>>>> any thoughts ?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>>>> Ok, if the problem is moot, great.
>>>>>>>>
>>>>>>>> (sidenote: this is moot, so ignore this if you want: with this
>>>>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component.
>>>>>>>>> This library is missing the linkage to libslurm that contains the
>>>>>>>>> linkage to libauth where munge resides. So when we call a PMI
>>>>>>>>> function, libpmi references a call to munge for authentication and
>>>>>>>>> hits an “unresolved symbol” error.
>>>>>>>>>
>>>>>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so
>>>>>>>>> this problem goes away
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)
>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly
>>>>>>>>>>> against its dependencies (the pmi-2 one is correct). Moe is aware
>>>>>>>>>>> of the problem and fixing it on their side. This won’t help
>>>>>>>>>>> existing installations until they upgrade, but I tend to agree with
>>>>>>>>>>> Jeff about not fixing other people’s problems.
>>>>>>>>>> Can you explain what is happening?
>>>>>>>>>>
>>>>>>>>>> I ask because I'm not sure I understand the problem such that using
>>>>>>>>>> RTLD_GLOBAL would fix it.
>>>>>>>>>> I.e., even if libpmi1.so isn't linked
>>>>>>>>>> against its dependencies properly, that shouldn't cause a problem if
>>>>>>>>>> OMPI components A and B are both linked against libpmi1.so, and then
>>>>>>>>>> A is loaded, and then B is loaded.
>>>>>>>>>>
>>>>>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jeff Squyres
>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> devel mailing list
>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>> Link to this post:
>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
> --
> Best regards, Artem Y. Polyakov