Okay, this exposed the problem. The issue is that "ib0" on the two machines is defined on two completely different IP subnets:

linuxbmc0008: 134.61.202.7
linuxscc004:  192.168.222.4

The OOB doesn't think those two are directly reachable by each other, as the IP/subnet masks don't match - we obviously require a better testing method, or maybe just default to trying the connection and fail if we can't make it. Let me ponder that one a bit.

Thanks!
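For illustration, a minimal sketch of the same-subnet test being described above - the hostnames and addresses are taken from this thread, but the /24 netmask is an assumption (the real masks are not shown here), and this is not the actual OOB source:

#include <arpa/inet.h>
#include <stdbool.h>
#include <stdio.h>

/* Two peers count as directly reachable only if the network parts of
 * their addresses (address AND netmask) are identical. */
static bool same_subnet(const char *ip_a, const char *ip_b, const char *mask)
{
    struct in_addr a, b, m;
    if (inet_pton(AF_INET, ip_a, &a) != 1 ||
        inet_pton(AF_INET, ip_b, &b) != 1 ||
        inet_pton(AF_INET, mask, &m) != 1) {
        return false;  /* unparseable address: treat as unreachable */
    }
    return (a.s_addr & m.s_addr) == (b.s_addr & m.s_addr);
}

int main(void)
{
    /* the two ib0 addresses quoted above; 255.255.255.0 is assumed */
    if (same_subnet("134.61.202.7", "192.168.222.4", "255.255.255.0"))
        puts("directly reachable");
    else
        puts("not directly reachable");  /* this branch is taken */
    return 0;
}

Under any plausible mask the network parts of 134.61.202.7 and 192.168.222.4 differ, so a check of this shape declares the pair unreachable - even though, as described below, routing between the islands does exist.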
On Feb 13, 2014, at 3:05 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> Attached is the output from openmpi/1.7.5a1r30708:
>
> $ $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100 -H linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-linuxbmc0008-175a1r29587.txt
>
> Well, some 5 lines were added.
> (The ib0 on linuxscc004 is not reachable from linuxbmc0008 - did this lead to the TCP shutdown? Cf. lines 36-37.)
>
> On 02/13/14 01:28, Ralph Castain wrote:
>> Could you please give the nightly 1.7.5 tarball a try using the same cmd line options and send me the output? I see the problem, but am trying to understand how it happens. I've added a bunch of diagnostic statements that should help me track it down.
>>
>> Thanks
>> Ralph
>>
>> On Feb 12, 2014, at 1:26 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:
>>
>>> As said, the change in behaviour is new in 1.7.4 - all previous versions worked. Moreover, setting "-mca oob_tcp_if_include ib0" is a workaround in older versions of Open MPI for a roughly 60-second timeout when starting the same command (which still succeeds), or in some cases for an infinite wait.
>>>
>>> Attached are logs of the commands:
>>>
>>> $ export | grep OMPI | tee export_OMPI-linuxbmc0008.txt
>>>
>>> $ $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100 -H linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-linuxbmc0008-173.txt
>>>
>>> (and -174 for the corresponding versions 1.7.3 and 1.7.4)
>>>
>>> $ ifconfig 2>&1 | tee ifconfig-linuxbmc0008.txt
>>>
>>> (and -linuxscc004 for the two nodes; linuxscc004 is in the (h) fabric, and 'mpiexec' was called from node linuxbmc0008, which is in the (b) fabric where 'ib0' is configured to be the main interface)
>>>
>>> and the OMPI environment on linuxbmc0008. Maybe you can see something from this.
>>>
>>> Best
>>> Paul
>>>
>>> On 02/11/14 20:29, Ralph Castain wrote:
>>>> I've added better error messages in the trunk, scheduled to move over to 1.7.5. I don't see anything in the code that would explain why we don't pick up and use ib0 if it is present and specified in if_include - we should be doing it.
>>>>
>>>> For now, can you run this with "-mca oob_base_verbose 100" on your cmd line and send me the output? It might help debug the behavior.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Feb 11, 2014, at 1:22 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:
>>>>
>>>>> Dear Open MPI developers,
>>>>>
>>>>> I.
>>>>> We see peculiar behaviour in the new 1.7.4 version of Open MPI, which is a change from previous versions: when calling "mpiexec", it returns "1" and exits silently.
>>>>>
>>>>> The behaviour is reproducible; well, not that easily reproducible.
>>>>>
>>>>> We have multiple InfiniBand islands in our cluster. All nodes are passwordlessly reachable from each other in some way; some via IPoIB, and for some routes you also have to use Ethernet cards and IB/TCP gateways.
>>>>>
>>>>> One island (b) is configured to use the IB card as the main TCP interface. In this island, the variable OMPI_MCA_oob_tcp_if_include is set to "ib0" (*).
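As an illustration of what an interface-include filter of this kind asks the runtime to do, here is a hypothetical getifaddrs(3)-based sketch - not Open MPI's actual implementation:

#include <ifaddrs.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *include = "ib0";  /* stand-in for oob_tcp_if_include */
    struct ifaddrs *ifap, *ifa;
    int matches = 0;

    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }
    /* keep only interfaces whose name matches the include filter;
     * note an interface appears once per address family */
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (strcmp(ifa->ifa_name, include) == 0) {
            printf("would use interface %s\n", ifa->ifa_name);
            matches++;
        }
    }
    freeifaddrs(ifap);

    if (matches == 0) {
        /* on a host without ib0, nothing survives the filter - the
         * situation the error messages in part II below complain about */
        fprintf(stderr, "no interface matches '%s'\n", include);
        return 1;
    }
    return 0;
}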
>>>>> Another island (h) is configured in the conventional way: IB cards are also present and may be used for IPoIB within the island, but the "main interface" used for DNS and hostname bindings is eth0.
>>>>>
>>>>> When calling 'mpiexec' from (b) to start a process on (h), with Open MPI version 1.7.4 and OMPI_MCA_oob_tcp_if_include set to "ib0", mpiexec just exits with return value "1" and no error/warning.
>>>>>
>>>>> When OMPI_MCA_oob_tcp_if_include is unset, it works just fine.
>>>>>
>>>>> All previous versions of Open MPI (1.6.x, 1.7.3) did not show this behaviour, so this is specific to v1.7.4. See the log below.
>>>>>
>>>>> You may ask why on earth we start MPI processes on another IB island: our front-end nodes are in island (b), but we sometimes need to start something on island (h) as well, which worked perfectly until 1.7.4.
>>>>>
>>>>> (*) This is another long spaghetti-western story. In short, we set OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card is configured to be the main network interface, in order to stop Open MPI from trying to connect via (possibly unconfigured) Ethernet cards - which sometimes led to endless waiting.
>>>>> Cf. http://www.open-mpi.org/community/lists/users/2011/11/17824.php
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> pk224850@cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3
>>>>> Unloading openmpi 1.7.3 [ OK ]
>>>>> Loading openmpi 1.7.3 for intel compiler [ OK ]
>>>>> pk224850@cluster:~[524]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
>>>>> linuxscc004.rz.RWTH-Aachen.DE
>>>>> 0
>>>>> pk224850@cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4
>>>>> Unloading openmpi 1.7.3 [ OK ]
>>>>> Loading openmpi 1.7.4 for intel compiler [ OK ]
>>>>> pk224850@cluster:~[526]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
>>>>> 1
>>>>> pk224850@cluster:~[527]$
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>> II.
>>>>> During some experiments with environment variables and v1.7.4, we got the messages below.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Sorry! You were supposed to get help about:
>>>>>     no-included-found
>>>>> But I couldn't open the help file:
>>>>>     /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No such file or directory. Sorry!
>>>>> --------------------------------------------------------------------------
>>>>> [linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not available in file ess_hnp_module.c at line 314
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Reproduced with:
>>>>> $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -H linuxscc004 -np 1 hostname
>>>>>
>>>>> *from one node with no 'ib0' card*, also without InfiniBand. Yessir, this is a bad idea, and 1.7.3 said more understandably "you are doing the wrong thing":
>>>>> --------------------------------------------------------------------------
>>>>> None of the networks specified to be included for out-of-band communications
>>>>> could be found:
>>>>>
>>>>>   Value given: ib0
>>>>>
>>>>> Please revise the specification and try again.
>>>>> --------------------------------------------------------------------------
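The "Sorry!" message in part II above is the classic fallback of a show_help-style lookup when the help file itself cannot be opened. A hypothetical sketch of that fallback path (the real opal_show_help machinery is considerably more involved than this):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not the real opal_show_help(): look up a help
 * topic in an installed text file and fall back to a terse apology
 * when the file is missing - the symptom seen with help-oob-tcp.txt. */
static void show_help(const char *path, const char *topic)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        printf("Sorry!  You were supposed to get help about:\n"
               "    %s\n"
               "But I couldn't open the help file:\n"
               "    %s: %s.  Sorry!\n", topic, path, strerror(errno));
        return;
    }
    /* ... otherwise locate the [topic] section and print it ... */
    fclose(fp);
}

int main(void)
{
    show_help("/opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt",
              "no-included-found");
    return 0;
}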
>>>>> No idea why the file share/openmpi/help-oob-tcp.txt has not been installed in 1.7.4, as we compile this version in pretty much the same way as previous versions.
>>>>>
>>>>> Best,
>>>>> Paul Kapinos
>>>>>
>>>>> --
>>>>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>>>> RWTH Aachen University, IT Center
>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>> Tel: +49 241/80-24915
>>>
>>> --
>>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>> RWTH Aachen University, IT Center
>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>> Tel: +49 241/80-24915
>>> <oob_base_verbose-linuxbmc0008-165.txt><oob_base_verbose-linuxbmc0008-173.txt><oob_base_verbose-linuxbmc0008-174.txt><export_OMPI-linuxbmc0008.txt><ifconfig-linuxbmc0008.txt><ifconfig-linuxscc004.txt>
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, IT Center
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
> <oob_base_verbose-linuxbmc0008-175a1r29587.txt>