Could you please give the nightly 1.7.5 tarball a try using the same cmd line options and send me the output? I see the problem, but am trying to understand how it happens. I've added a bunch of diagnostic statements that should help me track it down.
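Something along these lines should do it (the snapshot name and install prefix below are only placeholders - use whatever the current nightly snapshot is actually called and whatever location suits you):

$ tar xjf openmpi-1.7.5rcX.tar.bz2        # placeholder name for the current nightly snapshot
$ cd openmpi-1.7.5rcX
$ ./configure --prefix=$HOME/ompi-1.7.5-nightly && make install
$ $HOME/ompi-1.7.5-nightly/bin/mpiexec -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100 -H linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-175.txt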
Thanks
Ralph

On Feb 12, 2014, at 1:26 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> As said, the change in behaviour is new in 1.7.4 - all previous versions worked. Moreover, setting "-mca oob_tcp_if_include ib0" is a workaround in older versions of Open MPI for a 60-second timeout when starting the same command (which still succeeds), or for infinite waiting in some cases.
>
> Attached are logs of the commands:
>
> $ export | grep OMPI | tee export_OMPI-linuxbmc0008.txt
>
> $ $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100 -H linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-linuxbmc0008-173.txt
> (and -174 accordingly, for versions 1.7.3 and 1.7.4)
>
> $ ifconfig 2>&1 | tee ifconfig-linuxbmc0008.txt
> (and -linuxscc004 for the two nodes; linuxscc004 is in the (h) fabric, and 'mpiexec' was called from node linuxbmc0008, which is in the (b) fabric where 'ib0' is configured to be the main interface)
>
> ... plus the OMPI environment on linuxbmc0008. Maybe you can see something from this.
>
> Best
> Paul
>
>
> On 02/11/14 20:29, Ralph Castain wrote:
>> I've added better error messages in the trunk, scheduled to move over to 1.7.5. I don't see anything in the code that would explain why we don't pick up and use ib0 if it is present and specified in if_include - we should be doing it.
>>
>> For now, can you run this with "-mca oob_base_verbose 100" on your cmd line and send me the output? It might help debug the behavior.
>>
>> Thanks
>> Ralph
>>
>> On Feb 11, 2014, at 1:22 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:
>>
>>> Dear Open MPI developer,
>>>
>>> I.
>>> We see peculiar behaviour in the new 1.7.4 version of Open MPI, which is a change from previous versions:
>>> - when calling "mpiexec", it returns "1" and exits silently.
>>>
>>> The behaviour is reproducible, though not all that easily.
>>>
>>> We have multiple InfiniBand islands in our cluster. All nodes are reachable from each other without a password, in one way or another; some via IPoIB, while for some routes you also have to use Ethernet cards and IB/TCP gateways.
>>>
>>> One island (b) is configured to use the IB card as the main TCP interface. In this island, the variable OMPI_MCA_oob_tcp_if_include is set to "ib0" (*).
>>>
>>> Another island (h) is configured in the usual way: IB cards are also present and may be used for IPoIB within the island, but the "main interface" used for DNS and hostname binding is eth0.
>>>
>>> When calling 'mpiexec' from (b) to start a process on (h), with Open MPI 1.7.4 and OMPI_MCA_oob_tcp_if_include set to "ib0", mpiexec just exits with return value "1" and no error/warning.
>>>
>>> When OMPI_MCA_oob_tcp_if_include is unset, it works fine.
>>>
>>> No previous version of Open MPI (1.6.x, 1.7.3) showed this behaviour; it is specific to v1.7.4. See log below.
>>>
>>> You may ask why the hell we start MPI processes on another IB island at all: our front-end nodes are in island (b), but we sometimes need to start something on island (h) as well, which worked perfectly until 1.7.4.
>>>
>>>
>>> (*) This is another Spaghetti Western of a long story. In short, we set OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card is configured to be the main network interface, in order to stop Open MPI from trying to connect via (possibly unconfigured) Ethernet cards - which sometimes led to endless waiting. Cf. http://www.open-mpi.org/community/lists/users/2011/11/17824.php
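>>> In practice this just means that the environment setup seen only by island (b) nodes does something like the following (an illustrative sketch, not our exact configuration):
>>>
>>> # on island (b), where ib0 is the primary interface, pin the out-of-band TCP traffic to it:
>>> export OMPI_MCA_oob_tcp_if_include=ib0
>>> # on island (h) the variable is simply left unset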
>>>
>>> ------------------------------------------------------------------------------
>>> pk224850@cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3
>>> Unloading openmpi 1.7.3 [ OK ]
>>> Loading openmpi 1.7.3 for intel compiler [ OK ]
>>> pk224850@cluster:~[524]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
>>> linuxscc004.rz.RWTH-Aachen.DE
>>> 0
>>> pk224850@cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4
>>> Unloading openmpi 1.7.3 [ OK ]
>>> Loading openmpi 1.7.4 for intel compiler [ OK ]
>>> pk224850@cluster:~[526]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
>>> 1
>>> pk224850@cluster:~[527]$
>>> ------------------------------------------------------------------------------
>>>
>>>
>>> II.
>>> During some experiments with environment variables and v1.7.4, we got the messages below.
>>>
>>> --------------------------------------------------------------------------
>>> Sorry! You were supposed to get help about:
>>> no-included-found
>>> But I couldn't open the help file:
>>> /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No such file or directory. Sorry!
>>> --------------------------------------------------------------------------
>>> [linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not available in file ess_hnp_module.c at line 314
>>> --------------------------------------------------------------------------
>>>
>>> Reproduced with:
>>> $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -H linuxscc004 -np 1 hostname
>>>
>>> *from a node with no 'ib0' card*, i.e. without InfiniBand at all. Yes, this is a bad idea, but 1.7.3 said so in a much more understandable way ("you are doing the wrong thing"):
>>> --------------------------------------------------------------------------
>>> None of the networks specified to be included for out-of-band communications
>>> could be found:
>>>
>>> Value given: ib0
>>>
>>> Please revise the specification and try again.
>>> --------------------------------------------------------------------------
>>>
>>> We have no idea why the file share/openmpi/help-oob-tcp.txt was not installed for 1.7.4, as we compile this version in pretty much the same way as previous versions.
>>>
>>>
>>> Best,
>>> Paul Kapinos
>>>
>>> --
>>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>> RWTH Aachen University, IT Center
>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>> Tel: +49 241/80-24915
>>>
>>
>
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, IT Center
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
> <oob_base_verbose-linuxbmc0008-165.txt><oob_base_verbose-linuxbmc0008-173.txt><oob_base_verbose-linuxbmc0008-174.txt><export_OMPI-linuxbmc0008.txt><ifconfig-linuxbmc0008.txt><ifconfig-linuxscc004.txt>