I think this might be related to the configuration problem I was fixing
with Jeff a few months ago. See:
https://github.com/open-mpi/ompi/pull/240

2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:

> If it isn’t too much trouble, it would be good to confirm that it remains
> broken. I strongly suspect it is based on Moe’s comments.
>
> Obviously, other people are making this work. For Intel MPI, all you do is
> point it at libpmi and they can run. However, they do explicitly dlopen it
> in their code, and I don’t know what flags they might pass when they do so.
>
> If necessary, I suppose we could follow that pattern. In other words,
> rather than specifically linking the “s1” component to libpmi, instead
> require that the user point us to a pmi library via an MCA param, then
> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
> cited by Jeff, but resolves the pmi linkage problem.
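
A minimal sketch of the pattern described above, not Open MPI code. Python's ctypes is used only because it exposes the dlopen flags compactly; a real component would call dlopen() from C. The environment variable EXAMPLE_PMI_LIB stands in for the MCA parameter and is hypothetical, as is the function name.

```python
import ctypes
import os


def load_pmi_library(path=None, env_var="EXAMPLE_PMI_LIB"):
    """Load a user-specified PMI library with RTLD_GLOBAL.

    Returns the ctypes handle, or None if no path was given or the
    library could not be loaded.
    """
    # Hypothetical stand-in for an MCA parameter: an explicit argument
    # wins, otherwise fall back to an environment variable.
    path = path or os.environ.get(env_var)
    if not path:
        return None
    try:
        # RTLD_GLOBAL makes the library's symbols available for symbol
        # resolution in objects that are loaded afterwards (e.g. plugins).
        return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        # Missing file, wrong architecture, unresolved dependencies, ...
        return None
```

With EXAMPLE_PMI_LIB unset and no path given, the function returns None, so a caller could simply disable the component instead of failing at load time.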
>
>
> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>  $ srun --version
> slurm 2.6.6-VENDOR_PROVIDED
>
> $ srun --mpi=pmi2 -n 1 ~/hw
> I am 0 / 1
>
> $ srun -n 1 ~/hw
> /csc/home1/gouaillardet/hw: symbol lookup error:
> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or
> received
> srun: error: soleil: task 0: Exited with exit code 127
>
> $ ldd /usr/lib64/slurm/auth_munge.so
>     linux-vdso.so.1 =>  (0x00007fff54478000)
>     libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>     libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>     /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
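
The ldd output above lists no libslurm dependency for the plugin. As a rough sketch, the same check can be scripted by parsing ldd output; the helper names below are hypothetical and the sample text is copied from the output above.

```python
def ldd_dependencies(ldd_output):
    """Return the first token (soname or path) of each line of ldd output."""
    names = []
    for line in ldd_output.splitlines():
        line = line.strip()
        if line:
            names.append(line.split()[0])
    return names


def links_against(ldd_output, prefix):
    """True if any listed dependency's file name starts with `prefix`."""
    return any(
        name.rsplit("/", 1)[-1].startswith(prefix)
        for name in ldd_dependencies(ldd_output)
    )


# Output of `ldd /usr/lib64/slurm/auth_munge.so` as quoted in this thread.
SAMPLE = """\
linux-vdso.so.1 =>  (0x00007fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
"""

print(links_against(SAMPLE, "libmunge"))  # True
print(links_against(SAMPLE, "libslurm"))  # False
```

The text could equally come from running ldd through subprocess; only the parsing is shown here.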
>
>
> now, if i relink auth_munge.so so it depends on libslurm:
>
> $ srun -n 1 ~/hw
> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
> symbol: slurm_auth_get_arg_desc
>
>
> i can try the latest slurm if needed
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/02 12:56, Ralph Castain wrote:
>
> Out of curiosity - how are you testing these? I have more current versions of 
> Slurm and would like to test the observations there.
>
>
>  On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
> I'd like to take a step back ...
>
> i previously tested with slurm 2.6.0, and it complained about the 
> slurm_verbose symbol that is defined in libslurm.so
> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>
> now i tested with slurm 2.6.6 and it complains about the 
> slurm_auth_get_arg_desc symbol, and this symbol is not
> defined in any dynamic library. it is internally defined in the static 
> libcommon.a library, which is used to build the slurm binaries.
>
> as far as i understand, auth_munge.so can only be invoked from a slurm 
> binary, which means it cannot be invoked from an mpi application
> even if it is linked with libslurm, libpmi, ...
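
The two cases described above can be illustrated with a toy model of symbol lookup. This is a deliberate simplification, not how ld.so is implemented: a plugin's undefined symbols are searched in the global scope plus the plugin's own dependency chain. The Lib class and helper names are hypothetical; the only facts taken from the thread are that slurm_verbose is defined in libslurm.so and that slurm_auth_get_arg_desc is not defined in any shared library.

```python
from dataclasses import dataclass, field


@dataclass
class Lib:
    name: str
    defines: set = field(default_factory=set)    # exported symbols
    undefined: set = field(default_factory=set)  # symbols it needs
    needed: list = field(default_factory=list)   # libraries it is linked against


def unresolved(lib_name, libs, global_scope=()):
    """Symbols of `lib_name` that no visible library defines."""
    visible = set()
    # Libraries loaded with RTLD_GLOBAL are visible to everyone.
    for name in global_scope:
        visible |= libs[name].defines
    # The library itself and its (transitive) dependencies are always visible.
    stack, seen = [lib_name], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        visible |= libs[name].defines
        stack.extend(libs[name].needed)
    return libs[lib_name].undefined - visible


def make_libs(plugin_undefined, plugin_needed=()):
    """Build the two-library scenario from the thread."""
    return {
        "libslurm.so": Lib("libslurm.so", defines={"slurm_verbose"}),
        "auth_munge.so": Lib(
            "auth_munge.so",
            undefined=set(plugin_undefined),
            needed=list(plugin_needed),
        ),
    }


# slurm 2.6.0 case: the plugin is not linked against libslurm.
libs = make_libs({"slurm_verbose"})
print(unresolved("auth_munge.so", libs))                   # {'slurm_verbose'}
print(unresolved("auth_munge.so", libs, ["libslurm.so"]))  # set()
```

In this model, either putting libslurm.so in the global scope or adding it to the plugin's dependencies resolves slurm_verbose, while a symbol that no shared library exports stays unresolved in both cases, which matches the 2.6.6 observation.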
>
> that looks like a slurm design issue that the slurm folks will take care of.
>
> Cheers,
>
> Gilles
>
> On 2014/12/02 12:33, Ralph Castain wrote:
>
>  Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
> component as this is the only place that requires it, and it won’t hurt 
> anything to do so.
>
>
>
>  On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
> Jeff,
>
> FWIW, you can read my analysis of what is going wrong at
> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>
> bottom line, i agree this is a slurm issue (slurm plugins should depend
> on libslurm, but they do not yet)
>
> a possible workaround would be to make the pmi component a "proxy" that
> dlopens, with RTLD_GLOBAL, the "real" component in which the job is done.
> that being said, the impact is quite limited (no direct launch in slurm
> with pmi1, but pmi2 works fine) so it makes sense not to work around
> someone else's problem.
> still, configure could detect this broken pmi1 and either not build
> pmi1 support or print a user-friendly error message if pmi1 is used.
>
> any thoughts ?
>
> Cheers,
>
> Gilles
>
> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>
>  Ok, if the problem is moot, great.
>
> (sidenote: this is moot, so ignore this if you want: with this explanation, 
> I'm still not sure how RTLD_GLOBAL fixes the issue)
>
>
> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>
>  Easy enough to explain. We link libpmi into the pmix/s1 component. This 
> library is missing the linkage to libslurm that contains the linkage to 
> libauth where munge resides. So when we call a PMI function, libpmi 
> references a call to munge for authentication and hits an “unresolved symbol” 
> error.
>
> Moe acknowledges the error is in Slurm and is fixing the linkages so this 
> problem goes away
>
>
>
>  On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
>
> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>
>  FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
> fixing it on their side. This won’t help existing installations until they 
> upgrade, but I tend to agree with Jeff about not fixing other people’s 
> problems.
>
>  Can you explain what is happening?
>
> I ask because I'm not sure I understand the problem such that using 
> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against its 
> dependencies properly, that shouldn't cause a problem if OMPI components A 
> and B are both linked against libpmi1.so, and then A is loaded, and then B is 
> loaded.
>
> ...or perhaps we can just discuss this on the call tomorrow?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
