Oy -- I thought we fixed that.  :-(

Are you saying that configure output says that ltdladvise is not found?


On Dec 2, 2014, at 9:59 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:

> I didn't want to interfere with this thread, although I have a similar issue, 
> since I have the solution nearly fully cooked up. But anyway, this last email 
> gave a hint as to why we suddenly have the problem in ompio:
> 
> It looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is no longer 
> set, so the entire section is being skipped. I double-checked with the 1.8 
> branch: it goes through the section there, but not with master.
> 
> Thanks
> Edgar
> 
> 
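A minimal sketch of why an unset OPAL_HAVE_LTDL_ADVISE would make a section silently disappear, assuming the section is guarded by a preprocessor conditional. The helper name describe_advise_path is hypothetical, and the macro is forced to 0 here only to mimic the reported behavior; in a real build configure would define it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: configure normally defines this macro to 0 or 1 in a generated
 * header. It is forced to 0 here only to mimic the behavior reported above. */
#ifndef OPAL_HAVE_LTDL_ADVISE
#define OPAL_HAVE_LTDL_ADVISE 0
#endif

/* Hypothetical helper: records which code path was compiled in and returns
 * 1 if the advise-based section is present, 0 if it was skipped. */
int describe_advise_path(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
#if OPAL_HAVE_LTDL_ADVISE
    /* The real component-loading code would set up its advise object here. */
    snprintf(buf, len, "advise section compiled in");
    return 1;
#else
    /* With the macro at 0 the preprocessor drops the section entirely, so
     * nothing fails at build time; only the run-time behavior changes. */
    snprintf(buf, len, "advise section skipped");
    return 0;
#endif
}
```

Because the guarded code is simply not compiled, the build succeeds either way, which is why checking the macro's value in the configure output is a reasonable first step.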
> 
> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>> Looks like I was totally lying in 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I 
>> said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
>> 
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>> 
>> This ltdl advice object is passed to lt_dlopen() for all components.  My 
>> mistake; sorry.
>> 
>> So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
>> 
>> I believe someone said earlier in the thread that adding the right -llibs to 
>> the configure line will solve the issue, and that sounds correct to me.  If 
>> there's a missing symbol because the SLURM libraries are not automatically 
>> pulling in the right dependent libraries, then *if* we put a workaround in 
>> OMPI to fix this issue, then the right workaround is to add the relevant 
>> -llibs when that component is linked.
>> 
>> *If* you add that workaround (which is a whole separate discussion), I would 
>> suggest adding a configure.m4 test to see if the additional -llibs are 
>> necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if 
>> that fails, AC_LINK_IFELSE again with the additional -llibs to see if that 
>> works.
>> 
>> Or something like that.
>> 
>> 
>> 
>> On Dec 2, 2014, at 6:38 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>> 
>>> Agreed. The first thing to check is what value OPAL_HAVE_LTDL_ADVISE is 
>>> set to. If it is zero, this is very probably the same bug as mine.
>>> 
>>> 2014-12-02 17:33 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>> It does look similar - question is: why didn’t this fix the problem? Will 
>>> have to investigate.
>>> 
>>> Thanks
>>> 
>>> 
>>>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> 2014-12-02 17:13 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>> Hmmm…if that is true, then it didn’t fix this problem as it is being 
>>>> reported in the master.
>>>> 
>>>> I had this problem on my laptop installation. You can check my report (it 
>>>> was detailed enough) and see whether you are hitting the same issue. My fix 
>>>> was also included in the 1.8 branch. I am not sure that this is the same 
>>>> issue, but they look similar.
>>>> 
>>>> 
>>>> 
>>>>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>>> 
>>>>> I think this might be related to the configuration problem I was fixing 
>>>>> with Jeff a few months ago. Refer here:
>>>>> https://github.com/open-mpi/ompi/pull/240
>>>>> 
>>>>> 2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>>> If it isn’t too much trouble, it would be good to confirm that it remains 
>>>>> broken. I strongly suspect it is based on Moe’s comments.
>>>>> 
>>>>> Obviously, other people are making this work. For Intel MPI, all you do 
>>>>> is point it at libpmi and they can run. However, they do explicitly 
>>>>> dlopen it in their code, and I don’t know what flags they might pass when 
>>>>> they do so.
>>>>> 
>>>>> If necessary, I suppose we could follow that pattern. In other words, 
>>>>> rather than specifically linking the “s1” component to libpmi, instead 
>>>>> require that the user point us to a pmi library via an MCA param, then 
>>>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues 
>>>>> cited by Jeff, but resolves the pmi linkage problem.
>>>>> 
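A minimal sketch of the pattern described above: take a library path from the user and load it explicitly with RTLD_GLOBAL. The helper name open_global_and_find is hypothetical, and how the path would actually reach the component (for example through an MCA parameter) is an assumption not shown here.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: lib_path would come from a user-supplied setting
 * (e.g. an MCA parameter); symbol is the entry point to look up. */
void *open_global_and_find(const char *lib_path, const char *symbol)
{
    assert(lib_path != NULL && symbol != NULL);

    /* RTLD_GLOBAL makes the library's symbols available for resolving
     * references in libraries that are loaded afterwards. */
    void *handle = dlopen(lib_path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", lib_path, dlerror());
        return NULL;
    }

    void *addr = dlsym(handle, symbol);
    if (addr == NULL) {
        fprintf(stderr, "dlsym(%s) failed: %s\n", symbol, dlerror());
        dlclose(handle);
        return NULL;
    }

    /* The handle is intentionally left open so the symbol stays valid. */
    return addr;
}
```

A caller might pass the configured PMI library path and a symbol such as PMI_Init, then cast the returned address to the matching function pointer type. On older glibc versions the program may need to be linked with -ldl.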
>>>>> 
>>>>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>> 
>>>>>> $ srun --version
>>>>>> slurm 2.6.6-VENDOR_PROVIDED
>>>>>> 
>>>>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>>>>> I am 0 / 1
>>>>>> 
>>>>>> $ srun -n 1 ~/hw
>>>>>> /csc/home1/gouaillardet/hw: symbol lookup error: 
>>>>>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>>>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>>>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted 
>>>>>> or received
>>>>>> srun: error: soleil: task 0: Exited with exit code 127
>>>>>> 
>>>>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>>>>>     linux-vdso.so.1 =>  (0x00007fff54478000)
>>>>>>     libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>>>>>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>>>>>>     libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>>>>>>     /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>>>>>> 
>>>>>> 
>>>>>> now, if i relink auth_munge.so so it depends on libslurm :
>>>>>> 
>>>>>> $ srun -n 1 ~/hw
>>>>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined 
>>>>>> symbol: slurm_auth_get_arg_desc
>>>>>> 
>>>>>> 
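The "symbol lookup error" above appears only when the missing symbol is first needed. A minimal sketch of how a loader could surface the same problem up front, assuming the plugin is opened with RTLD_NOW so that all undefined symbols must resolve at load time. The helper name check_plugin_loads is hypothetical.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 0 if the plugin loads with every symbol
 * resolved, -1 otherwise (err then holds the loader's message). */
int check_plugin_loads(const char *path, char *err, size_t errlen)
{
    assert(path != NULL && err != NULL && errlen > 0);
    err[0] = '\0';

    /* RTLD_NOW resolves all undefined symbols immediately, so a plugin
     * with an unsatisfied reference fails here rather than later. */
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *msg = dlerror();
        snprintf(err, errlen, "%s", msg != NULL ? msg : "unknown dlopen error");
        return -1;
    }

    dlclose(handle);
    return 0;
}
```

Run against a plugin like the one in the transcript, the error buffer would be expected to carry the loader's "undefined symbol" text, which could then be turned into a friendlier message.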
>>>>>> i can give a try to the latest slurm if needed
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> 
>>>>>> On 2014/12/02 12:56, Ralph Castain wrote:
>>>>>>> Out of curiosity - how are you testing these? I have more current 
>>>>>>> versions of Slurm and would like to test the observations there.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>> 
>>>>>>>> I'd like to take a step back ...
>>>>>>>> 
>>>>>>>> i previously tested with slurm 2.6.0, and it complained about the 
>>>>>>>> slurm_verbose symbol that is defined in libslurm.so
>>>>>>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>>>>>> 
>>>>>>>> now i tested with slurm 2.6.6 and it complains about the 
>>>>>>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>>>>>>> defined in any dynamic library. it is internally defined in the static 
>>>>>>>> libcommon.a library, which is used to build the slurm binaries.
>>>>>>>> 
>>>>>>>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>>>>>>>> binary, which means it cannot be invoked from an mpi application
>>>>>>>> even if it is linked with libslurm, libpmi, ...
>>>>>>>> 
>>>>>>>> that looks like a slurm design issue that the slurm folks will take 
>>>>>>>> care of.
>>>>>>>> 
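Whether a plugin's undefined symbol can be resolved depends on what the hosting process makes visible. A minimal sketch of checking that from inside a process, assuming a POSIX dlopen/dlsym; the helper name symbol_visible_in_process is hypothetical. Note that a symbol linked into an executable from a static library is only found this way if the executable exports it in its dynamic symbol table.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if name resolves in the running program's
 * global scope, 0 otherwise. */
int symbol_visible_in_process(const char *name)
{
    assert(name != NULL);

    /* dlopen(NULL, ...) returns a handle for the main program; lookups on
     * it search the executable, its dependencies, and RTLD_GLOBAL loads. */
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL) {
        return 0;
    }

    (void)dlerror();                 /* clear any stale error state */
    void *addr = dlsym(self, name);
    int found = (addr != NULL);

    dlclose(self);
    return found;
}
```

In an MPI application, a call such as symbol_visible_in_process("slurm_auth_get_arg_desc") would be expected to return 0, which matches the observation that the plugin can only be used from a slurm binary.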
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> Gilles
>>>>>>>> 
>>>>>>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>>>>>> 
>>>>>>>>> Another option is to simply add the -lslurm -lauth flags to the 
>>>>>>>>> pmix/s1 component as this is the only place that requires it, and it 
>>>>>>>>> won’t hurt anything to do so.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>>>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> Jeff,
>>>>>>>>>> 
>>>>>>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>>>>>>> 
>>>>>>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> bottom line, i agree this is a slurm issue (slurm plugins should 
>>>>>>>>>> depend on libslurm, but they do not yet)
>>>>>>>>>> 
>>>>>>>>>> a possible workaround would be to make the pmi component a "proxy" 
>>>>>>>>>> that dlopens, with RTLD_GLOBAL, the "real" component in which the 
>>>>>>>>>> job is done.
>>>>>>>>>> that being said, the impact is quite limited (no direct launch in 
>>>>>>>>>> slurm with pmi1, but pmi2 works fine) so it makes sense not to work 
>>>>>>>>>> around someone else's problem.
>>>>>>>>>> and that being said, configure could detect this broken pmi1 and 
>>>>>>>>>> either not build pmi1 support or print a user friendly error message 
>>>>>>>>>> if pmi1 is used.
>>>>>>>>>> 
>>>>>>>>>> any thoughts ?
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> 
>>>>>>>>>> Gilles
>>>>>>>>>> 
>>>>>>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>> 
>>>>>>>>>>> Ok, if the problem is moot, great.
>>>>>>>>>>> 
>>>>>>>>>>> (sidenote: this is moot, so ignore this if you want: with this 
>>>>>>>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component. 
>>>>>>>>>>>> This library is missing the linkage to libslurm that contains the 
>>>>>>>>>>>> linkage to libauth where munge resides. So when we call a PMI 
>>>>>>>>>>>> function, libpmi references a call to munge for authentication and 
>>>>>>>>>>>> hits an “unresolved symbol” error.
>>>>>>>>>>>> 
>>>>>>>>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages 
>>>>>>>>>>>> so this problem goes away
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) 
>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly 
>>>>>>>>>>>>>> against its dependencies (the pmi-2 one is correct).  Moe is 
>>>>>>>>>>>>>> aware of the problem and fixing it on their side. This won’t 
>>>>>>>>>>>>>> help existing installations until they upgrade, but I tend to 
>>>>>>>>>>>>>> agree with Jeff about not fixing other people’s problems.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> Can you explain what is happening?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I ask because I'm not sure I understand the problem such that 
>>>>>>>>>>>>> using RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't 
>>>>>>>>>>>>> linked against its dependencies properly, that shouldn't cause a 
>>>>>>>>>>>>> problem if OMPI components A and B are both linked against 
>>>>>>>>>>>>> libpmi1.so, and then A is loaded, and then B is loaded.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>> 
>>>>>>>>>>>>> jsquy...@cisco.com <mailto:jsquy...@cisco.com>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/ 
>>>>>>>>>>>>> <http://www.cisco.com/web/about/doing_business/legal/cri/>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>> 
>>>>>>>>>>>>> de...@open-mpi.org <mailto:de...@open-mpi.org>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Subscription:
>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>>>>>>>>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/devel>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Link to this post:
>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php 
>>>>>>>>>>>>> <http://www.open-mpi.org/community/lists/devel/2014/12/16383.php>
>>>>> --
>>>>> С Уважением, Поляков Артем Юрьевич
>>>>> Best regards, Artem Y. Polyakov
> 
> -- 
> Edgar Gabriel
> Associate Professor
> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
> Department of Computer Science          University of Houston
> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
