I think this might be related to the configuration problem I was fixing with Jeff a few months ago. See: https://github.com/open-mpi/ompi/pull/240
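For reference, the dlopen approach Ralph describes in the quoted message could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `open_pmi_library` and `find_pmi_init` are hypothetical, the library path is assumed to come from the user (for example via an MCA param, whose name is not defined here), and `PMI_Init` is the PMI-1 entry point. Per Gilles's tests in the thread, RTLD_GLOBAL helped with slurm 2.6.0 but not with 2.6.6, so this is not a guaranteed fix.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not Open MPI code): open a user-supplied PMI
 * library with RTLD_GLOBAL so that its symbols are visible to objects
 * loaded afterwards, such as Slurm plugins. */
void *open_pmi_library(const char *path)
{
    if (path == NULL) {
        fprintf(stderr, "no PMI library path given\n");
        return NULL;
    }

    dlerror(); /* clear any stale error state */
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n", path,
                err != NULL ? err : "unknown error");
    }
    return handle;
}

/* PMI-1 entry point signature: int PMI_Init(int *spawned). */
typedef int (*pmi_init_fn)(int *spawned);

/* Hypothetical helper: look up PMI_Init in an opened library.
 * Returns NULL if the handle is NULL or the symbol is missing. */
pmi_init_fn find_pmi_init(void *handle)
{
    pmi_init_fn fn = NULL;

    if (handle == NULL) {
        return NULL;
    }
    dlerror();
    /* POSIX-recommended idiom for converting the dlsym() result
     * to a function pointer. */
    *(void **)(&fn) = dlsym(handle, "PMI_Init");
    if (fn == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlsym(PMI_Init) failed: %s\n",
                err != NULL ? err : "symbol not found");
    }
    return fn;
}
```

A caller would pass the configured path to `open_pmi_library()`, fetch `PMI_Init` with `find_pmi_init()`, and fall back to another PMI mechanism if either step returns NULL. On older glibc versions the program needs to be linked with `-ldl`.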
2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:

> If it isn’t too much trouble, it would be good to confirm that it remains
> broken. I strongly suspect it is based on Moe’s comments.
>
> Obviously, other people are making this work. For Intel MPI, all you do is
> point it at libpmi and they can run. However, they do explicitly dlopen it
> in their code, and I don’t know what flags they might pass when they do so.
>
> If necessary, I suppose we could follow that pattern. In other words,
> rather than specifically linking the “s1” component to libpmi, instead
> require that the user point us to a pmi library via an MCA param, then
> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
> cited by Jeff, but resolves the pmi linkage problem.
>
> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
> $ srun --version
> slurm 2.6.6-VENDOR_PROVIDED
>
> $ srun --mpi=pmi2 -n 1 ~/hw
> I am 0 / 1
>
> $ srun -n 1 ~/hw
> /csc/home1/gouaillardet/hw: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or received
> srun: error: soleil: task 0: Exited with exit code 127
>
> $ ldd /usr/lib64/slurm/auth_munge.so
>     linux-vdso.so.1 => (0x00007fff54478000)
>     libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>     libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>     /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>
> now, if i relink auth_munge.so so it depends on libslurm :
>
> $ srun -n 1 ~/hw
> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc
>
> i can give a try to the latest slurm if needed
>
> Cheers,
>
> Gilles
>
> On 2014/12/02 12:56, Ralph Castain wrote:
>
> Out of curiosity - how are you testing these? I have more current versions of
> Slurm and would like to test the observations there.
>
> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
> I'd like to take a step back ...
>
> i previously tested with slurm 2.6.0, and it complained about the
> slurm_verbose symbol that is defined in libslurm.so
> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>
> now i tested with slurm 2.6.6 and it complains about the
> slurm_auth_get_arg_desc symbol, and this symbol is not
> defined in any dynamic library. it is internally defined in the static
> libcommon.a library, which is used to build the slurm binaries.
>
> as far as i understand, auth_munge.so can only be invoked from a slurm
> binary, which means it cannot be invoked from an mpi application
> even if it is linked with libslurm, libpmi, ...
>
> that looks like a slurm design issue that the slurm folks will take care of.
>
> Cheers,
>
> Gilles
>
> On 2014/12/02 12:33, Ralph Castain wrote:
>
> Another option is to simply add the -lslurm -lauth flags to the pmix/s1
> component as this is the only place that requires it, and it won’t hurt
> anything to do so.
>
> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
> Jeff,
>
> FWIW, you can read my analysis of what is going wrong at
> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>
> bottom line, i agree this is a slurm issue (slurm plugin should depend
> on libslurm, but they do not, yet)
>
> a possible workaround would be to make the pmi component a "proxy" that
> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
> that being said, the impact is quite limited (no direct launch in slurm
> with pmi1, but pmi2 works fine) so it makes sense not to work around
> someone else's problem.
> and that being said, configure could detect this broken pmi1 and not
> build pmi1 support or print a user friendly error message if pmi1 is used.
>
> any thoughts ?
>
> Cheers,
>
> Gilles
>
> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>
> Ok, if the problem is moot, great.
>
> (sidenote: this is moot, so ignore this if you want: with this explanation,
> I'm still not sure how RTLD_GLOBAL fixes the issue)
>
> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Easy enough to explain. We link libpmi into the pmix/s1 component. This
> library is missing the linkage to libslurm that contains the linkage to
> libauth where munge resides. So when we call a PMI function, libpmi
> references a call to munge for authentication and hits an “unresolved symbol”
> error.
>
> Moe acknowledges the error is in Slurm and is fixing the linkages so this
> problem goes away
>
> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its
> dependencies (the pmi-2 one is correct). Moe is aware of the problem and
> fixing it on their side. This won’t help existing installations until they
> upgrade, but I tend to agree with Jeff about not fixing other people’s
> problems.
>
> Can you explain what is happening?
>
> I ask because I'm not sure I understand the problem such that using
> RTLD_GLOBAL would fix it. I.e., even if libpmi1.so isn't linked against its
> dependencies properly, that shouldn't cause a problem if OMPI components A
> and B are both linked against libpmi1.so, and then A is loaded, and then B is
> loaded.
>
> ...or perhaps we can just discuss this on the call tomorrow?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php

--
Best regards, Artem Y. Polyakov