I fixed the specific problem of setting the LD_LIBRARY_PATH (and PATH,
though that wasn't mentioned) for the case of procs spawned locally by
mpirun - see r15516. Please confirm that the problem is gone and/or let me
know if it persists for you.
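
In case it helps anyone verify the behavior, below is a minimal standalone sketch of the kind of fixup involved - plain libc only, so it is *not* the actual r15516 change, and the prefix path used in main() is just the one from Gleb's output taken as an example:

    /* Sketch: prepend the install's lib/ and bin/ dirs to LD_LIBRARY_PATH
     * and PATH before a locally spawned proc is exec'd. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>

    static void prepend_env(const char *name, const char *dir)
    {
        const char *old = getenv(name);
        char *merged = NULL;

        if (NULL == old || '\0' == *old) {
            setenv(name, dir, 1);     /* nothing set yet: just export dir */
            return;
        }
        if (asprintf(&merged, "%s:%s", dir, old) >= 0) {
            setenv(name, merged, 1);  /* keep whatever the user already had */
            free(merged);
        }
    }

    static void setup_local_env(const char *prefix_dir)
    {
        char *libdir = NULL;
        char *bindir = NULL;

        if (asprintf(&libdir, "%s/lib", prefix_dir) >= 0) {
            prepend_env("LD_LIBRARY_PATH", libdir);
            free(libdir);
        }
        if (asprintf(&bindir, "%s/bin", prefix_dir) >= 0) {
            prepend_env("PATH", bindir);
            free(bindir);
        }
    }

    int main(void)
    {
        setup_local_env("/home/glebn/openmpi");   /* example prefix from this thread */
        printf("LD_LIBRARY_PATH=%s\nPATH=%s\n",
               getenv("LD_LIBRARY_PATH"), getenv("PATH"));
        return 0;
    }

In the pls:rsh code quoted further down this thread the same idea is applied to the env array handed to the launched proc rather than to mpirun's own environment.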

The issue of name resolution is a more general problem that will take some
discussion, to occur separately from this thread. So some of the behavior
you cited continues for the moment.
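
For reference, the general shape of that problem is deciding whether a given -host value names the local node by comparing the addresses it resolves to against the local interfaces, instead of comparing hostname strings. Here is a rough IPv4-only illustration in plain libc - just a sketch of the idea, not the resolver Open MPI actually uses:

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <ifaddrs.h>
    #include <netinet/in.h>

    /* Return true if "name" resolves to an address owned by a local interface. */
    static bool host_is_local(const char *name)
    {
        struct addrinfo hints, *res = NULL, *r;
        struct ifaddrs *ifs = NULL, *ifa;
        bool local = false;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;   /* keep the sketch IPv4-only */
        if (getaddrinfo(name, NULL, &hints, &res) != 0 || getifaddrs(&ifs) != 0) {
            goto out;
        }
        for (r = res; r != NULL && !local; r = r->ai_next) {
            struct in_addr target = ((struct sockaddr_in *)r->ai_addr)->sin_addr;
            for (ifa = ifs; ifa != NULL; ifa = ifa->ifa_next) {
                if (ifa->ifa_addr != NULL &&
                    ifa->ifa_addr->sa_family == AF_INET &&
                    ((struct sockaddr_in *)ifa->ifa_addr)->sin_addr.s_addr ==
                        target.s_addr) {
                    local = true;
                    break;
                }
            }
        }
    out:
        if (res) freeaddrinfo(res);
        if (ifs) freeifaddrs(ifs);
        return local;
    }

    int main(int argc, char **argv)
    {
        const char *name = (argc > 1) ? argv[1] : "localhost";
        printf("%s is %s\n", name, host_is_local(name) ? "LOCAL" : "remote");
        return 0;
    }

With a check like that, `hostname`, localhost, and elfit1.voltaire.com would all be treated as LOCAL whenever they resolve to one of the machine's own addresses.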

Thanks
Ralph



On 7/19/07 9:39 AM, "Ralph H Castain" <r...@lanl.gov> wrote:

> I talked with Brian and we have identified the problem and a fix - it will
> come in later today.
> 
> Thanks
> Ralph
> 
> 
> 
> On 7/19/07 9:24 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
> 
>> You are correct - I misread the note. My bad.
>> 
>> I'll look at how we might ensure the LD_LIBRARY_PATH shows up correctly -
>> shouldn't be a big deal.
>> 
>> 
>> On 7/19/07 9:12 AM, "George Bosilca" <bosi...@cs.utk.edu> wrote:
>> 
>>> The second execution (the one that you refer to) is the one that
>>> works fine. The failing one is the first one, where LD_LIBRARY_PATH
>>> is not provided. As Gleb indicated, using localhost makes the problem
>>> vanish.
>>> 
>>>    george.
>>> 
>>> On Jul 19, 2007, at 10:57 AM, Ralph H Castain wrote:
>>> 
>>>> But it *does* provide an LD_LIBRARY_PATH that is pointing to your
>>>> openmpi installation - it says it did it right here in your debug output:
>>>> 
>>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>>> 
>>>> I suspect that the problem isn't in the launcher, but rather in the iof
>>>> again. Why don't we wait until those fixes come into the trunk before
>>>> chasing our tails any further?
>>>> 
>>>> 
>>>> On 7/19/07 8:18 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
>>>> 
>>>>> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
>>>>>> Interesting. Apparently, it is getting a NULL back when it tries to
>>>>>> access the LD_LIBRARY_PATH in your environment. Here is the code
>>>>>> involved:
>>>>>> 
>>>>>>      newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
>>>>>>      oldenv = getenv("LD_LIBRARY_PATH");
>>>>>>      if (NULL != oldenv) {
>>>>>>           char* temp;
>>>>>>           asprintf(&temp, "%s:%s", newenv, oldenv);
>>>>>>           free(newenv);
>>>>>>           newenv = temp;
>>>>>>      }
>>>>>>      opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
>>>>>>      if (mca_pls_rsh_component.debug) {
>>>>>>           opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
>>>>>>      }
>>>>>>      free(newenv);
>>>>>> 
>>>>>> So you can see that the only way we can get your debugging output is
>>>>>> for the LD_LIBRARY_PATH in your starting environment to be NULL. Note
>>>>>> that this comes after we fork, so we are talking about the child
>>>>>> process - not sure that matters, but may as well point it out.
>>>>>> 
>>>>>> So the question is: why do you not have LD_LIBRARY_PATH set in your
>>>>>> environment when you provide a different hostname?
>>>>> Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect
>>>>> that mpirun will provide a working environment for all ranks, not just
>>>>> remote ones. This is how it worked before. Perhaps that was a bug, but
>>>>> it was a useful bug :)
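
For reference, here is a tiny standalone version of just that branch - plain getenv/printf stand-ins for the opal calls - which shows why the debug line ends up as only the prefix lib dir when LD_LIBRARY_PATH is unset in the calling environment:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *prefix_lib = "/home/glebn/openmpi/lib";  /* from the debug output */
        const char *oldenv = getenv("LD_LIBRARY_PATH");
        char *newenv = NULL;

        if (NULL != oldenv) {
            /* parent had a value: prefix dir goes first, old value is kept */
            if (asprintf(&newenv, "%s:%s", prefix_lib, oldenv) < 0) return 1;
        } else {
            /* parent had nothing: the child sees only the prefix dir */
            newenv = strdup(prefix_lib);
            if (NULL == newenv) return 1;
        }
        printf("pls:rsh: reset LD_LIBRARY_PATH: %s\n", newenv);
        free(newenv);
        return 0;
    }

Run with LD_LIBRARY_PATH unset, it prints exactly the line shown in the debug output above; run with it set, the old value is appended after a colon.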
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 7/19/07 7:45 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
>>>>>> 
>>>>>>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
>>>>>>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
>>>>>>>>> But this will lock up:
>>>>>>>>> 
>>>>>>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>>>>>>>>> 
>>>>>>>>> The reason is that the hostname in this last command doesn't match
>>>>>>>>> the hostname I get when I query my interfaces, so mpirun thinks it
>>>>>>>>> must be a remote host - and so we get stuck in ssh until that times
>>>>>>>>> out, which could be quick on your machine but takes a while for me.
>>>>>>>>> 
>>>>>>>> This is not my case. mpirun resolves the hostname and runs env, but
>>>>>>>> LD_LIBRARY_PATH is not there. If I use the full name, like this:
>>>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
>>>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>>>>> 
>>>>>>>> everything is OK.
>>>>>>>> 
>>>>>>> More info: if I provide mpirun with the hostname returned by the
>>>>>>> "hostname" command, LD_LIBRARY_PATH is not set:
>>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname`  env | grep LD
>>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>>>>> 
>>>>>>> If I provide any other name that resolves to the same IP, then
>>>>>>> LD_LIBRARY_PATH is set:
>>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost  env | grep LD
>>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>>>> 
>>>>>>> Here is the debug output of the "bad" run:
>>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
>>>>>>> [elfit1:14730] pls:rsh: launching job 1
>>>>>>> [elfit1:14730] pls:rsh: no new daemons to launch
>>>>>>> 
>>>>>>> Here is the good one:
>>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
>>>>>>> [elfit1:14752] pls:rsh: launching job 1
>>>>>>> [elfit1:14752] pls:rsh: local csh: 0, local sh: 1
>>>>>>> [elfit1:14752] pls:rsh: assuming same remote shell as local shell
>>>>>>> [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
>>>>>>> [elfit1:14752] pls:rsh: final template argv:
>>>>>>> [elfit1:14752] pls:rsh:     /usr/bin/ssh <template> orted --name <template> --num_procs 1 --vpid_start 0 --nodename <template> --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
>>>>>>> [elfit1:14752] pls:rsh: launching on node localhost
>>>>>>> [elfit1:14752] pls:rsh: localhost is a LOCAL node
>>>>>>> [elfit1:14752] pls:rsh: reset PATH: /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
>>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>>>>>> [elfit1:14752] pls:rsh: changing to directory /root
>>>>>>> [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid]
>>>>>>> 
>>>>>>> --
>>>>>>> Gleb.
>>>>> 
>>>>> --
>>>>> Gleb.