Could you please give the nightly 1.7.5 tarball a try using the same cmd line 
options and send me the output? I see the problem, but am trying to understand 
how it happens. I've added a bunch of diagnostic statements that should help me 
track it down.

Thanks
Ralph

On Feb 12, 2014, at 1:26 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> As said, the change in behaviour is new in 1.7.4 - all previous versions 
> worked. Moreover, setting "-mca oob_tcp_if_include ib0" is a workaround 
> for older versions of Open MPI, which otherwise hit a 60-second timeout when 
> starting the same command (which still succeeds), or wait forever in some 
> cases.
> 
> 
> 
> Attached are logs of the commands:
> $ export | grep OMPI | tee export_OMPI-linuxbmc0008.txt
> 
> $ $MPI_BINDIR/mpiexec  -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100  
> -H linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-linuxbmc0008-173.txt
> 
> (the -173 and -174 suffixes denote versions 1.7.3 and 1.7.4, respectively)
> 
> 
> $ ifconfig 2>&1 | tee ifconfig-linuxbmc0008.txt
> 
> (and -linuxscc004 for the two nodes; linuxscc004 is in (h) fabric and 
> 'mpiexec' was called from node linuxbmc0008 which is in the (b) fabric where 
> the 'ib0' is configured to be the main interface)
> 
> and the OMPI environment on linuxbmc0008. Maybe you can see something from 
> this.
> 
> Best
> Paul
> 
> 
> On 02/11/14 20:29, Ralph Castain wrote:
>> I've added better error messages in the trunk, scheduled to move over to 
>> 1.7.5. I don't see anything in the code that would explain why we don't 
>> pick up and use ib0 if it is present and specified in if_include - we should 
>> be doing it.
>> 
>> For now, can you run this with "-mca oob_base_verbose 100" on your cmd line 
>> and send me the output? Might help debug the behavior.
>> 
>> Thanks
>> Ralph
>> 
>> On Feb 11, 2014, at 1:22 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:
>> 
>>> Dear Open MPI developer,
>>> 
>>> I.
>>> we see peculiar behaviour in the new 1.7.4 version of Open MPI which is a 
>>> change to previous versions:
>>> - when calling "mpiexec", it returns "1" and exits silently.
>>> 
>>> The behaviour is reproducible, though not that easily.
>>> 
>>> We have multiple InfiniBand islands in our cluster. All nodes can reach 
>>> each other without a password in some way: some via IPoIB, while for some 
>>> routes you also have to use ethernet cards and IB/TCP gateways.
>>> 
>>> One island (b) is configured to use the IB card as the main TCP interface. 
>>> In this island, the variable OMPI_MCA_oob_tcp_if_include is set to "ib0" (*)
>>> 
>>> Another island (h) is configured in a convenient way: IB cards are also 
>>> present here and may be used for IPoIB within the island, but the "main 
>>> interface" used for DNS and hostname binds is eth0.
>>> 
>>> When calling 'mpiexec' from (b) to start a process on (h), and the Open MPI 
>>> version is 1.7.4, and OMPI_MCA_oob_tcp_if_include is set to "ib0", mpiexec 
>>> just exits with return value "1" and no error/warning.
>>> 
>>> When OMPI_MCA_oob_tcp_if_include is unset, it works fine.
>>> 
>>> No previous version of Open MPI (1.6.x, 1.7.3) had this behaviour, so it 
>>> is specific to v1.7.4. See the log below.
>>> 
>>> You may ask why on earth we start MPI processes on another IB island. 
>>> Because our front-end nodes are in island (b), but we sometimes need to 
>>> start something on island (h) as well, which worked perfectly until 1.7.4.
>>> 
>>> 
>>> (*) This is another Spaghetti Western long story. In short, we set 
>>> OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card is 
>>> configured to be the main network interface, in order to stop Open MPI 
>>> from trying to connect via (possibly unconfigured) ethernet cards, which 
>>> sometimes led to endless waiting.
>>> Cf. http://www.open-mpi.org/community/lists/users/2011/11/17824.php
>>> 
>>> ------------------------------------------------------------------------------
>>> pk224850@cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3
>>> Unloading openmpi 1.7.3                         [ OK ]
>>> Loading openmpi 1.7.3 for intel compiler                         [ OK ]
>>> pk224850@cluster:~[524]$ $MPI_BINDIR/mpiexec  -H linuxscc004 -np 1 hostname ; echo $?
>>> linuxscc004.rz.RWTH-Aachen.DE
>>> 0
>>> pk224850@cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4
>>> Unloading openmpi 1.7.3                         [ OK ]
>>> Loading openmpi 1.7.4 for intel compiler                         [ OK ]
>>> pk224850@cluster:~[526]$ $MPI_BINDIR/mpiexec  -H linuxscc004 -np 1 hostname ; echo $?
>>> 1
>>> pk224850@cluster:~[527]$
>>> ------------------------------------------------------------------------------
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> II.
>>> During some experiments with envvars and v1.7.4, I got the messages below.
>>> 
>>> --------------------------------------------------------------------------
>>> Sorry!  You were supposed to get help about:
>>>    no-included-found
>>> But I couldn't open the help file:
>>>    /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No 
>>> such file or directory.  Sorry!
>>> --------------------------------------------------------------------------
>>> [linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not 
>>> available in file ess_hnp_module.c at line 314
>>> --------------------------------------------------------------------------
>>> 
>>> Reproducing:
>>> $MPI_BINDIR/mpiexec  -mca oob_tcp_if_include ib0   -H linuxscc004 -np 1 
>>> hostname
>>> 
>>> *from a node with no 'ib0' card*, and also without InfiniBand. Yes, this 
>>> is a bad idea, and 1.7.3 said "you are doing the wrong thing" in a more 
>>> understandable way:
>>> --------------------------------------------------------------------------
>>> None of the networks specified to be included for out-of-band communications
>>> could be found:
>>> 
>>>  Value given: ib0
>>> 
>>> Please revise the specification and try again.
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> No idea why the file share/openmpi/help-oob-tcp.txt was not installed in 
>>> 1.7.4, as we compile this version in pretty much the same way as previous 
>>> versions.
>>> 
>>> 
>>> 
>>> 
>>> Best,
>>> Paul Kapinos
>>> 
>>> --
>>> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
>>> RWTH Aachen University, IT Center
>>> Seffenter Weg 23,  D 52074  Aachen (Germany)
>>> Tel: +49 241/80-24915
>>> 
>> 
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, IT Center
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
> <oob_base_verbose-linuxbmc0008-165.txt><oob_base_verbose-linuxbmc0008-173.txt><oob_base_verbose-linuxbmc0008-174.txt><export_OMPI-linuxbmc0008.txt><ifconfig-linuxbmc0008.txt><ifconfig-linuxscc004.txt>
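
Part II of the quoted report names an interface in oob_tcp_if_include (ib0) that does not exist on the launching node. A wrapper could check for that before mpiexec runs. The following is only a minimal sketch under stated assumptions: check_oob_ifaces and NET_SYSFS are hypothetical names, not part of Open MPI. It assumes a Linux node that lists its interfaces under /sys/class/net. It handles plain interface names only, not any other form the parameter may accept.

```shell
#!/bin/sh
# Hypothetical helper, not part of Open MPI: check that every interface
# named in a comma-separated list exists on the local node.
# NET_SYSFS is an assumed override (useful for testing); on Linux each
# network interface appears as an entry under /sys/class/net.
NET_SYSFS="${NET_SYSFS:-/sys/class/net}"

check_oob_ifaces() {
    # $1: comma-separated interface names, e.g. "ib0" or "ib0,eth0"
    missing=""
    old_ifs=$IFS
    IFS=,
    for ifname in $1; do
        # Collect every name that has no entry under NET_SYSFS
        [ -e "$NET_SYSFS/$ifname" ] || missing="$missing $ifname"
    done
    IFS=$old_ifs
    if [ -n "$missing" ]; then
        echo "interfaces not found on $(uname -n):$missing" >&2
        return 1
    fi
    return 0
}

# Possible use before launching (host name taken from the report above):
#   check_oob_ifaces "${OMPI_MCA_oob_tcp_if_include:-}" &&
#       "$MPI_BINDIR/mpiexec" -H linuxscc004 -np 1 hostname
```

An empty or unset list passes the check, which matches the observation in the report that the launch works when OMPI_MCA_oob_tcp_if_include is unset.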
