Okay, this exposed the problem. The issue is that "ib0" on the two machines is 
defined on two completely different IP subnets:

linuxbmc0008:  134.61.202.7
linuxscc004:   192.168.222.4

The OOB doesn't consider those two directly reachable from each other, since 
the IP/netmask combinations don't match. We obviously need a better 
reachability test - or maybe we should just default to attempting the 
connection and failing if we can't make it. Let me ponder that one a bit.

Thanks!

On Feb 13, 2014, at 3:05 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> Attached is the output from openmpi/1.7.5a1r30708
> 
> $ $MPI_BINDIR/mpiexec  -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100  
> -H linuxscc004 -np 1 hostname 2>&1 | tee 
> oob_base_verbose-linuxbmc0008-175a1r29587.txt
> 
> Well, some 5 lines of output were added.
> (The ib0 on linuxscc004 is not reachable from linuxbmc0008 - did this lead to 
> the TCP shutdown? cf. lines 36-37)
> 
> 
> On 02/13/14 01:28, Ralph Castain wrote:
>> Could you please give the nightly 1.7.5 tarball a try using the same cmd 
>> line options and send me the output? I see the problem, but am trying to 
>> understand how it happens. I've added a bunch of diagnostic statements that 
>> should help me track it down.
>> 
>> Thanks
>> Ralph
>> 
>> On Feb 12, 2014, at 1:26 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:
>> 
>>> As said, the change in behaviour is new in 1.7.4 - all previous versions 
>>> worked. Moreover, setting "-mca oob_tcp_if_include ib0" is a workaround in 
>>> older versions of Open MPI for a roughly 60-second timeout when starting 
>>> the same command (which still eventually succeeds), or for infinite 
>>> waiting in some cases.
>>> 
>>> 
>>> 
>>> Attached are logs of the commands:
>>> $ export | grep OMPI | tee export_OMPI-linuxbmc0008.txt
>>> 
>>> $ $MPI_BINDIR/mpiexec  -mca oob_tcp_if_include ib0 -mca oob_base_verbose 
>>> 100  -H linuxscc004 -np 1 hostname 2>&1 | tee 
>>> oob_base_verbose-linuxbmc0008-173.txt
>>> 
>>> (and -174 for the corresponding versions 1.7.3 and 1.7.4)
>>> 
>>> 
>>> $ ifconfig 2>&1 | tee ifconfig-linuxbmc0008.txt
>>> 
>>> (and -linuxscc004 for the two nodes; linuxscc004 is in the (h) fabric, and 
>>> 'mpiexec' was called from node linuxbmc0008, which is in the (b) fabric, 
>>> where 'ib0' is configured to be the main interface)
>>> 
>>> and the OMPI environment on linuxbmc0008. Maybe you can see something in 
>>> this.
>>> 
>>> Best
>>> Paul
>>> 
>>> 
>>> On 02/11/14 20:29, Ralph Castain wrote:
>>>> I've added better error messages in the trunk, scheduled to move over to 
>>>> 1.7.5. I don't see anything in the code that would explain why we don't 
>>>> pick up and use ib0 if it is present and specified in if_include - we 
>>>> should be doing it.
>>>> 
>>>> For now, can you run this with "-mca oob_base_verbose 100" on your cmd 
>>>> line and send me the output? Might help debug the behavior.
>>>> 
>>>> Thanks
>>>> Ralph
>>>> 
>>>> On Feb 11, 2014, at 1:22 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> 
>>>> wrote:
>>>> 
>>>>> Dear Open MPI developer,
>>>>> 
>>>>> I.
>>>>> we see peculiar behaviour in the new 1.7.4 version of Open MPI, which is 
>>>>> a change from previous versions:
>>>>> - when calling "mpiexec", it returns "1" and exits silently.
>>>>> 
>>>>> The behaviour is reproducible - well, not that easily reproducible.
>>>>> 
>>>>> We have multiple InfiniBand islands in our cluster. All nodes are 
>>>>> reachable from each other without a password in one way or another: some 
>>>>> via IPoIB, while some routes also involve ethernet cards and IB/TCP 
>>>>> gateways.
>>>>> 
>>>>> One island (b) is configured to use the IB card as the main TCP 
>>>>> interface. In this island, the variable OMPI_MCA_oob_tcp_if_include is 
>>>>> set to "ib0" (*)
>>>>> 
>>>>> Another island (h) is configured in the conventional way: IB cards are 
>>>>> also present and may be used for IPoIB within the island, but the "main 
>>>>> interface" used for DNS and hostname bindings is eth0.
>>>>> 
>>>>> When calling 'mpiexec' from (b) to start a process on (h) with Open MPI 
>>>>> version 1.7.4 and OMPI_MCA_oob_tcp_if_include set to "ib0", mpiexec just 
>>>>> exits with return value "1" and no error/warning.
>>>>> 
>>>>> When OMPI_MCA_oob_tcp_if_include is unset, it works just fine.
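>>>>> 
>>>>> For example (a sketch; the hostname output matches the log below):
>>>>> $ unset OMPI_MCA_oob_tcp_if_include
>>>>> $ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname
>>>>> linuxscc004.rz.RWTH-Aachen.DE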
>>>>> 
>>>>> All previous versions of Open MPI (1.6.x, 1.7.3) did not have this 
>>>>> behaviour, so this is specific to v1.7.4. See log below.
>>>>> 
>>>>> You may ask why the hell we start MPI processes on another IB island: 
>>>>> because our front-end nodes are in island (b), but we sometimes also need 
>>>>> to start something on island (h) - which worked perfectly until 1.7.4.
>>>>> 
>>>>> 
>>>>> (*) This is another Spaghetti Western-length story. In short, we set 
>>>>> OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card 
>>>>> is configured to be the main network interface, in order to stop Open MPI 
>>>>> from trying to connect via (possibly unconfigured) ethernet cards - which 
>>>>> sometimes led to endless waiting.
>>>>> Cf. http://www.open-mpi.org/community/lists/users/2011/11/17824.php
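>>>>> 
>>>>> In practice this amounts to the following in the subcluster's default 
>>>>> environment (a sketch - exactly where the variable gets set is 
>>>>> site-specific):
>>>>> $ export OMPI_MCA_oob_tcp_if_include=ib0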
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> pk224850@cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3
>>>>> Unloading openmpi 1.7.3                         [ OK ]
>>>>> Loading openmpi 1.7.3 for intel compiler                         [ OK ]
>>>>> pk224850@cluster:~[524]$ $MPI_BINDIR/mpiexec  -H linuxscc004 -np 1 
>>>>> hostname ; echo $?
>>>>> linuxscc004.rz.RWTH-Aachen.DE
>>>>> 0
>>>>> pk224850@cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4
>>>>> Unloading openmpi 1.7.3                         [ OK ]
>>>>> Loading openmpi 1.7.4 for intel compiler                         [ OK ]
>>>>> pk224850@cluster:~[526]$ $MPI_BINDIR/mpiexec  -H linuxscc004 -np 1 
>>>>> hostname ; echo $?
>>>>> 1
>>>>> pk224850@cluster:~[527]$
>>>>> ------------------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> II.
>>>>> During some experiments with envvars and v1.7.4, I got the messages below.
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> Sorry!  You were supposed to get help about:
>>>>>    no-included-found
>>>>> But I couldn't open the help file:
>>>>>    /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No 
>>>>> such file or directory.  Sorry!
>>>>> --------------------------------------------------------------------------
>>>>> [linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not 
>>>>> available in file ess_hnp_module.c at line 314
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> Reproducing:
>>>>> $MPI_BINDIR/mpiexec  -mca oob_tcp_if_include ib0   -H linuxscc004 -np 1 
>>>>> hostname
>>>>> 
>>>>> *from one node with no 'ib0' card*, also without InfiniBand. Yessir, this 
>>>>> is a bad idea, and 1.7.3 stated more understandably that "you are doing 
>>>>> the wrong thing":
>>>>> --------------------------------------------------------------------------
>>>>> None of the networks specified to be included for out-of-band 
>>>>> communications
>>>>> could be found:
>>>>> 
>>>>>  Value given: ib0
>>>>> 
>>>>> Please revise the specification and try again.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> No idea why the file share/openmpi/help-oob-tcp.txt has not been 
>>>>> installed in 1.7.4, as we compile this version in pretty much the same 
>>>>> way as previous versions.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Best,
>>>>> Paul Kapinos
>>>>> 
>>>>> --
>>>>> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
>>>>> RWTH Aachen University, IT Center
>>>>> Seffenter Weg 23,  D 52074  Aachen (Germany)
>>>>> Tel: +49 241/80-24915
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
>>> RWTH Aachen University, IT Center
>>> Seffenter Weg 23,  D 52074  Aachen (Germany)
>>> Tel: +49 241/80-24915
>>> <oob_base_verbose-linuxbmc0008-165.txt><oob_base_verbose-linuxbmc0008-173.txt><oob_base_verbose-linuxbmc0008-174.txt><export_OMPI-linuxbmc0008.txt><ifconfig-linuxbmc0008.txt><ifconfig-linuxscc004.txt>
>> 
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, IT Center
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
> <oob_base_verbose-linuxbmc0008-175a1r29587.txt>
