Oy -- I thought we fixed that. :-( Are you saying that configure output says that ltdladvise is not found?
On Dec 2, 2014, at 9:59 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:

> didn't want to interfere with this thread, although I have a similar issue,
> since I have the solution nearly fully cooked up. But anyway, this last email
> gave the hint on why we suddenly have the problem in ompio:
>
> it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set
> anymore, so the entire section is being skipped. I double-checked: with
> the 1.8 branch, it goes through the section, but not with master.
>
> Thanks
> Edgar
>
> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>> Looks like I was totally lying in
>> http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I
>> said we should not use RTLD_GLOBAL). We *do* use RTLD_GLOBAL:
>>
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>>
>> This ltdl advice object is passed to lt_dlopen() for all components. My
>> mistake; sorry.
>>
>> So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
>>
>> I believe someone said earlier in the thread that adding the right -llibs to
>> the configure line will solve the issue, and that sounds correct to me. If
>> there's a missing symbol because the SLURM libraries are not automatically
>> pulling in the right dependent libraries, then *if* we put a workaround in
>> OMPI to fix this issue, then the right workaround is to add the relevant
>> -llibs when that component is linked.
>>
>> *If* you add that workaround (which is a whole separate discussion), I would
>> suggest adding a configure.m4 test to see if adding the additional -llibs
>> is necessary. Perhaps AC_LINK_IFELSE looking for a symbol, and then if
>> that fails, AC_LINK_IFELSE again with the additional -llibs to see if that
>> works.
>>
>> Or something like that.
>>
>> On Dec 2, 2014, at 6:38 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>>> Agree.
>>> The first thing you should check is what value OPAL_HAVE_LTDL_ADVISE is
>>> set to. If it is zero, this is very probably the same bug as mine.
>>>
>>> 2014-12-02 17:33 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>> It does look similar - question is: why didn’t this fix the problem? Will
>>> have to investigate.
>>>
>>> Thanks
>>>
>>>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>>
>>>> 2014-12-02 17:13 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>> Hmmm…if that is true, then it didn’t fix this problem as it is being
>>>> reported in the master.
>>>>
>>>> I had this problem on my laptop installation. You can check my report (it
>>>> was detailed enough) and see if you are hitting the same issue. My fix was
>>>> also included in the 1.8 branch. I am not sure that this is the same issue,
>>>> but they look similar.
>>>>
>>>>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>>>
>>>>> I think this might be related to the configuration problem I was fixing
>>>>> with Jeff a few months ago. Refer here:
>>>>> https://github.com/open-mpi/ompi/pull/240
>>>>>
>>>>> 2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>>> If it isn’t too much trouble, it would be good to confirm that it remains
>>>>> broken. I strongly suspect it is, based on Moe’s comments.
>>>>>
>>>>> Obviously, other people are making this work. For Intel MPI, all you do
>>>>> is point it at libpmi and they can run. However, they do explicitly
>>>>> dlopen it in their code, and I don’t know what flags they might pass when
>>>>> they do so.
>>>>>
>>>>> If necessary, I suppose we could follow that pattern. In other words,
>>>>> rather than specifically linking the “s1” component to libpmi, instead
>>>>> require that the user point us to a pmi library via an MCA param, then
>>>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
>>>>> cited by Jeff, but resolves the pmi linkage problem.
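A minimal sketch of the "point us at a pmi library, then dlopen it with RTLD_GLOBAL" pattern described just above. It uses Python's ctypes as a stand-in for the C dlopen() call. The environment variable name HYPOTHETICAL_PMI_LIB and both function names are made up for illustration; they are not an actual MCA parameter or OMPI API.

```python
import ctypes
import os


def open_global(path):
    """dlopen() `path` so its symbols become visible to libraries loaded later."""
    # ctypes.CDLL forwards `mode` to dlopen(); RTLD_GLOBAL makes the library's
    # symbols available for resolving references in subsequently loaded objects.
    return ctypes.CDLL(path, mode=os.RTLD_NOW | os.RTLD_GLOBAL)


def load_user_pmi(env_var="HYPOTHETICAL_PMI_LIB"):
    """Load the library named by `env_var`; return None if unset or unloadable."""
    path = os.environ.get(env_var)
    if not path:
        return None  # the user did not point us at a library
    try:
        return open_global(path)
    except OSError as exc:
        # Missing file or unresolved dependency: report it and carry on.
        print(f"could not dlopen {path}: {exc}")
        return None
```

Whether this would actually help depends on which library provides the missing symbols; the sketch only shows the loading step, not the component wiring.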
>>>>>
>>>>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet
>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>
>>>>>> $ srun --version
>>>>>> slurm 2.6.6-VENDOR_PROVIDED
>>>>>>
>>>>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>>>>> I am 0 / 1
>>>>>>
>>>>>> $ srun -n 1 ~/hw
>>>>>> /csc/home1/gouaillardet/hw: symbol lookup error:
>>>>>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>>>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>>>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted
>>>>>> or received
>>>>>> srun: error: soleil: task 0: Exited with exit code 127
>>>>>>
>>>>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>>>>> linux-vdso.so.1 => (0x00007fff54478000)
>>>>>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>>>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>>>>>> libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>>>>>> /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>>>>>>
>>>>>> now, if i relink auth_munge.so so it depends on libslurm:
>>>>>>
>>>>>> $ srun -n 1 ~/hw
>>>>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
>>>>>> symbol: slurm_auth_get_arg_desc
>>>>>>
>>>>>> i can give the latest slurm a try if needed
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 2014/12/02 12:56, Ralph Castain wrote:
>>>>>>> Out of curiosity - how are you testing these? I have more current
>>>>>>> versions of Slurm and would like to test the observations there.
>>>>>>>
>>>>>>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet
>>>>>>>> <gilles.gouaillar...@iferc.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I'd like to take a step back ...
>>>>>>>>
>>>>>>>> i previously tested with slurm 2.6.0, and it complained about the
>>>>>>>> slurm_verbose symbol that is defined in libslurm.so
>>>>>>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>>>>>>
>>>>>>>> now i tested with slurm 2.6.6 and it complains about the
>>>>>>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>>>>>>> defined in any dynamic library. it is internally defined in the static
>>>>>>>> libcommon.a library, which is used to build the slurm binaries.
>>>>>>>>
>>>>>>>> as far as i understand, auth_munge.so can only be invoked from a slurm
>>>>>>>> binary, which means it cannot be invoked from an mpi application
>>>>>>>> even if it is linked with libslurm, libpmi, ...
>>>>>>>>
>>>>>>>> that looks like a slurm design issue that the slurm folks will take
>>>>>>>> care of.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Another option is to simply add the -lslurm -lauth flags to the
>>>>>>>>> pmix/s1 component as this is the only place that requires it, and it
>>>>>>>>> won’t hurt anything to do so.
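One way to reproduce the kind of "undefined symbol" failure quoted above outside of srun, sketched under assumptions: dlopen() the plugin with RTLD_NOW, which asks the loader to resolve function references at load time, so a library whose undefined references are not satisfied by anything already loaded should fail immediately with the loader's message. Python's ctypes stands in for C here, and `try_load` is a hypothetical helper name, not part of Slurm or Open MPI.

```python
import ctypes
import os


def try_load(path):
    """Return (True, "") if `path` loads with RTLD_NOW, else (False, message)."""
    try:
        # RTLD_NOW: resolve function symbols now instead of at first call.
        ctypes.CDLL(path, mode=os.RTLD_NOW)
    except OSError as exc:
        # The exception text is the dynamic loader's own error string.
        return False, str(exc)
    return True, ""


# Usage with the plugin path from the ldd output in this thread:
#   ok, msg = try_load("/usr/lib64/slurm/auth_munge.so")
#   if not ok:
#       print(msg)
```

This only probes load-time resolution in the calling process; it says nothing about whether the plugin is meant to be loaded outside a slurm binary at all.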
>>>>>>>>>
>>>>>>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet
>>>>>>>>>> <gilles.gouaillar...@iferc.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Jeff,
>>>>>>>>>>
>>>>>>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>>>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>>>>>>>>
>>>>>>>>>> bottom line, i agree this is a slurm issue (slurm plugins should
>>>>>>>>>> depend on libslurm, but they do not, yet)
>>>>>>>>>>
>>>>>>>>>> a possible workaround would be to make the pmi component a "proxy"
>>>>>>>>>> that dlopens with RTLD_GLOBAL the "real" component in which the job
>>>>>>>>>> is done.
>>>>>>>>>> that being said, the impact is quite limited (no direct launch in
>>>>>>>>>> slurm with pmi1, but pmi2 works fine) so it makes sense not to work
>>>>>>>>>> around someone else's problem.
>>>>>>>>>> and that being said, configure could detect this broken pmi1 and not
>>>>>>>>>> build pmi1 support or print a user-friendly error message if pmi1 is
>>>>>>>>>> used.
>>>>>>>>>>
>>>>>>>>>> any thoughts ?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Gilles
>>>>>>>>>>
>>>>>>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, if the problem is moot, great.
>>>>>>>>>>>
>>>>>>>>>>> (sidenote: this is moot, so ignore this if you want: with this
>>>>>>>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>>>>>>>
>>>>>>>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain
>>>>>>>>>>> <r...@open-mpi.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component.
>>>>>>>>>>>> This library is missing the linkage to libslurm that contains the
>>>>>>>>>>>> linkage to libauth where munge resides. So when we call a PMI
>>>>>>>>>>>> function, libpmi references a call to munge for authentication and
>>>>>>>>>>>> hits an “unresolved symbol” error.
>>>>>>>>>>>>
>>>>>>>>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages
>>>>>>>>>>>> so this problem goes away
>>>>>>>>>>>>
>>>>>>>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)
>>>>>>>>>>>>> <jsquy...@cisco.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain
>>>>>>>>>>>>> <r...@open-mpi.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly
>>>>>>>>>>>>>> against its dependencies (the pmi-2 one is correct). Moe is
>>>>>>>>>>>>>> aware of the problem and fixing it on their side. This won’t
>>>>>>>>>>>>>> help existing installations until they upgrade, but I tend to
>>>>>>>>>>>>>> agree with Jeff about not fixing other people’s problems.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you explain what is happening?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ask because I'm not sure I understand the problem such that
>>>>>>>>>>>>> using RTLD_GLOBAL would fix it.
>>>>>>>>>>>>> I.e., even if libpmi1.so isn't
>>>>>>>>>>>>> linked against its dependencies properly, that shouldn't cause a
>>>>>>>>>>>>> problem if OMPI components A and B are both linked against
>>>>>>>>>>>>> libpmi1.so, and then A is loaded, and then B is loaded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>>> Link to this post:
>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
>>>
>>> --
>>> Best regards, Artem Y. Polyakov
>
> --
> Edgar Gabriel
> Associate Professor
> Parallel Software Technologies Lab http://pstl.cs.uh.edu
> Department of Computer Science University of Houston
> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
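The two-step configure check suggested earlier in this thread (try the link as-is, and only if that fails, retry with the additional -llibs) can be sketched as plain decision logic. This is Python rather than m4, and `try_link` is a hypothetical callback standing in for whatever an AC_LINK_IFELSE test would do; nothing here is actual Open MPI configure code.

```python
def extra_libs_needed(try_link, base_libs, extra_libs):
    """Decide whether `extra_libs` must be added for the link to succeed.

    Returns [] if linking works with `base_libs` alone, a copy of
    `extra_libs` if it only works with them added, and None if neither
    attempt links.
    """
    if try_link(base_libs):
        return []  # first attempt succeeded: no workaround required
    if try_link(base_libs + extra_libs):
        return list(extra_libs)  # second attempt: the extra libs fix it
    return None  # broken either way; the caller decides how to report it


# Fake linker for illustration only: succeeds when -lslurm is on the line.
def fake_link(libs):
    return "-lslurm" in libs


print(extra_libs_needed(fake_link, ["-lpmi"], ["-lslurm"]))  # ['-lslurm']
```

Returning None for the "neither works" case keeps the option open of skipping the component or printing an error, which is a separate decision from the probe itself.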