Talked with Brian and we have identified the problem and a fix - will come in later today.
Thanks
Ralph


On 7/19/07 9:24 AM, "Ralph H Castain" <r...@lanl.gov> wrote:

> You are correct - I misread the note. My bad.
>
> I'll look at how we might ensure the LD_LIBRARY_PATH shows up correctly -
> shouldn't be a big deal.
>
>
> On 7/19/07 9:12 AM, "George Bosilca" <bosi...@cs.utk.edu> wrote:
>
>> The second execution (the one that you make reference to) is the one
>> that works fine. The failing one is the first one, where
>> LD_LIBRARY_PATH is not provided. As Gleb indicated, using localhost
>> makes the problem vanish.
>>
>>   george.
>>
>> On Jul 19, 2007, at 10:57 AM, Ralph H Castain wrote:
>>
>>> But it *does* provide an LD_LIBRARY_PATH that is pointing to your
>>> openmpi installation - it says it did it right here in your debug
>>> output:
>>>
>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>>
>>> I suspect that the problem isn't in the launcher, but rather in the
>>> iof again. Why don't we wait until those fixes come into the trunk
>>> before chasing our tails any further?
>>>
>>>
>>> On 7/19/07 8:18 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
>>>
>>>> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
>>>>> Interesting. Apparently, it is getting a NULL back when it tries
>>>>> to access the LD_LIBRARY_PATH in your environment. Here is the
>>>>> code involved:
>>>>>
>>>>> newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
>>>>> oldenv = getenv("LD_LIBRARY_PATH");
>>>>> if (NULL != oldenv) {
>>>>>     char* temp;
>>>>>     asprintf(&temp, "%s:%s", newenv, oldenv);
>>>>>     free(newenv);
>>>>>     newenv = temp;
>>>>> }
>>>>> opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
>>>>> if (mca_pls_rsh_component.debug) {
>>>>>     opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s",
>>>>>                 newenv);
>>>>> }
>>>>> free(newenv);
>>>>>
>>>>> So you can see that the only way we can get your debugging output
>>>>> is for the LD_LIBRARY_PATH in your starting environment to be NULL.
>>>>> Note that this comes after we fork, so we are talking about the
>>>>> child process - not sure that matters, but may as well point it out.
>>>>>
>>>>> So the question is: why do you not have LD_LIBRARY_PATH set in your
>>>>> environment when you provide a different hostname?
>>>> Right, I don't have LD_LIBRARY_PATH set in my environment, but I
>>>> expect that mpirun will provide a working environment for all ranks,
>>>> not just remote ones. This is how it worked before. Perhaps that was
>>>> a bug, but this was a useful bug :)
>>>>
>>>>>
>>>>>
>>>>> On 7/19/07 7:45 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
>>>>>
>>>>>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
>>>>>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
>>>>>>>> But this will lock up:
>>>>>>>>
>>>>>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>>>>>>>>
>>>>>>>> The reason is that the hostname in this last command doesn't
>>>>>>>> match the hostname I get when I query my interfaces, so mpirun
>>>>>>>> thinks it must be a remote host - and so we stick in ssh until
>>>>>>>> that times out. Which could be quick on your machine, but takes
>>>>>>>> a while for me.
>>>>>>>>
>>>>>>> This is not my case. mpirun resolves hostname and runs env but
>>>>>>> LD_LIBRARY_PATH is not there. If I use full name like this
>>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
>>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>>>>
>>>>>>> everything is OK.
>>>>>>>
>>>>>> More info. If I provide hostname to mpirun as returned by command
>>>>>> "hostname" the LD_LIBRARY_PATH is not set:
>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD
>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>>>>
>>>>>> if I provide any other name that resolves to the same IP then
>>>>>> LD_LIBRARY_PATH is set.
>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD
>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>>>
>>>>>> Here is debug output of "bad" run:
>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
>>>>>> [elfit1:14730] pls:rsh: launching job 1
>>>>>> [elfit1:14730] pls:rsh: no new daemons to launch
>>>>>>
>>>>>> Here is good one:
>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
>>>>>> [elfit1:14752] pls:rsh: launching job 1
>>>>>> [elfit1:14752] pls:rsh: local csh: 0, local sh: 1
>>>>>> [elfit1:14752] pls:rsh: assuming same remote shell as local shell
>>>>>> [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
>>>>>> [elfit1:14752] pls:rsh: final template argv:
>>>>>> [elfit1:14752] pls:rsh: /usr/bin/ssh <template> orted --name <template>
>>>>>> --num_procs 1 --vpid_start 0 --nodename <template> --universe
>>>>>> root@elfit1:default-universe-14752 --nsreplica
>>>>>> "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica
>>>>>> "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca
>>>>>> mca_base_param_file_path
>>>>>> /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd
>>>>>> -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
>>>>>> [elfit1:14752] pls:rsh: launching on node localhost
>>>>>> [elfit1:14752] pls:rsh: localhost is a LOCAL node
>>>>>> [elfit1:14752] pls:rsh: reset PATH:
>>>>>> /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>>>>> [elfit1:14752] pls:rsh: changing to directory /root
>>>>>> [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted
>>>>>> --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe
>>>>>> root@elfit1:default-universe-14752 --nsreplica
>>>>>> "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica
>>>>>> "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca
>>>>>> mca_base_param_file_path
>>>>>> /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd
>>>>>> -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
>>>>>> --set-sid]
>>>>>>
>>>>>> --
>>>>>>             Gleb.
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>> --
>>>>             Gleb.
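
For reference, the snippet quoted in the thread prepends the installation's lib directory to LD_LIBRARY_PATH, and uses that directory alone when the variable is unset. Here is a minimal sketch of the same idea in standard C, under stated assumptions. It is not the Open MPI code: build_ld_library_path is a hypothetical helper, it uses malloc/snprintf instead of asprintf, and it also treats an empty value as unset, which the quoted code does not do.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not an Open MPI
 * function). Returns a newly allocated string holding prefix_lib,
 * followed by ":" and oldenv when oldenv is set. The caller frees the
 * result. Returns NULL if allocation fails. */
char *build_ld_library_path(const char *prefix_lib, const char *oldenv)
{
    assert(prefix_lib != NULL);

    /* Unset (NULL) or empty: the result is just the new directory.
     * Treating "" as unset is this sketch's choice, not the original's. */
    if (oldenv == NULL || oldenv[0] == '\0') {
        size_t len = strlen(prefix_lib) + 1;
        char *out = malloc(len);
        if (out != NULL) {
            memcpy(out, prefix_lib, len);
        }
        return out;
    }

    /* Otherwise prepend, so the new directory is searched first. */
    size_t len = strlen(prefix_lib) + 1 + strlen(oldenv) + 1;
    char *out = malloc(len);
    if (out != NULL) {
        snprintf(out, len, "%s:%s", prefix_lib, oldenv);
    }
    return out;
}
```

A caller would pass getenv("LD_LIBRARY_PATH") as oldenv, hand the returned string to setenv (or to opal_setenv inside Open MPI, as in the quoted code), and then free it.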