Hi Micha! Sorry for the late reply; I was on holiday.
Your description sounds reasonable, but I have no way to run tests of my own at the moment. I have CC'ed Jeff (upstream); maybe he can comment on the issue. BTW, did you also try the 1.3 series of Open MPI?

Best regards
Manuel

On Saturday, 18.04.2009, 01:49 +0300, Micha Feigin wrote:
> Package: openmpi-bin
> Version: 1.2.8-3
> Severity: important
>
> As far as I understand the error, mpiexec resolves name -> address on the
> server it is run on instead of on each host separately. This works in an
> environment where each hostname resolves to the same address on every host
> (cluster connected via a switch), but fails where it resolves to different
> addresses (ring/star setups, for example, where each computer is connected
> directly to all/some of the others).
>
> I'm not 100% sure that this is the problem, as I'm seeing success in a
> single case where this should probably fail, but it is my best guess from
> the error message.
>
> Version 1.2.8 worked fine for the same simple program (a simple hello
> world that just communicated the computer name for each process).
>
> An example output:
>
> mpiexec is run on the master node hubert and is set to run the processes
> on two nodes, fry and leela. As I understand the error messages, leela
> tries to connect to fry on address 192.168.1.2, which is its address on
> hubert but not on leela (where it is 192.168.4.1).
>
> This is a four-node cluster, all interconnected:
>
>    192.168.1.1                192.168.1.2
>  hubert ------------------------ fry
>    |  \                      /  |  192.168.4.1
>    |    \                  /    |
>    |      \              /      |
>    |        \          /        |
>    |        /          \        |
>    |      /              \      |
>    |    /                  \    |
>    |  /                      \  |  192.168.4.2
>  hermes ----------------------- leela
>
> =================================================================
> mpiexec -np 8 -H fry,leela test_mpi
> Hello MPI from the server process of 8 on fry!
> [[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
> from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154:
> Network is unreachable
>
> [[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
> from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154:
> Network is unreachable
>
> [[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
> from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154:
> Network is unreachable
>
> [leela:4436] *** An error occurred in MPI_Send
> [leela:4436] *** on communicator MPI_COMM_WORLD
> [leela:4436] *** MPI_ERR_INTERN: internal error
> [leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
> from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154:
> Network is unreachable
>
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1 with PID 4433 on
> node leela exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> [hubert:11312] 3 more processes have sent help message help-mpi-errors.txt /
> mpi_errors_are_fatal
> [hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> =================================================================
>
> This seems to be a directional issue, as running the program with
> -H fry,leela fails where -H leela,fry works. The same behaviour holds for
> all scenarios except those that include the master node (hubert), where it
> resolves the external IP (from an external DNS) instead of the internal IP
> (from the hosts file). Thus one direction fails (there is no external
> connection at the moment for any node but the master) and the other causes
> a lockup.
>
> I hope that the explanation is not too convoluted.
>
> -- System Information:
> Debian Release: squeeze/sid
>   APT prefers unstable
>   APT policy: (500, 'unstable'), (1, 'experimental')
> Architecture: amd64 (x86_64)
>
> Kernel: Linux 2.6.28.8 (SMP w/4 CPU cores)
> Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/bash
>
> Versions of packages openmpi-bin depends on:
> ii  libc6           2.9-7      GNU C Library: Shared libraries
> ii  libgcc1         1:4.3.3-7  GCC support library
> ii  libopenmpi1     1.2.8-3    high performance message passing l
> ii  libstdc++6      4.3.3-7    The GNU Standard C++ Library v3
> ii  openmpi-common  1.2.8-3    high performance message passing l
>
> openmpi-bin recommends no packages.
>
> Versions of packages openmpi-bin suggests:
> ii  gfortran  4:4.3.3-2  The GNU Fortran 95 compiler
>
> -- no debconf information
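For reference, the failure mode described above comes down to the same hostname mapping to a different address in each node's hosts file, so the address that rank 0's node resolves is unreachable from the other nodes. A sketch of the situation and a possible workaround follows; the /etc/hosts entries are illustrative (addresses taken from the diagram, not copied from the actual machines), and the interface names are placeholders for whatever NICs actually carry MPI traffic:

```
# Illustrative per-host /etc/hosts entries -- the same name resolves
# differently on each node:
#
#   on hubert:   192.168.1.2  fry
#   on leela:    192.168.4.1  fry

# Possible workaround on multi-homed nodes: pin Open MPI's TCP transport
# to specific interfaces, either on the command line
# ("mpiexec --mca btl_tcp_if_include eth0 ...") or permanently in
# /etc/openmpi/openmpi-mca-params.conf. Interface names here are
# placeholders:
btl_tcp_if_include = eth0,eth1
# or, alternatively, exclude the problem interfaces instead:
# btl_tcp_if_exclude = lo,eth2
```

Whether this helps in a fully meshed ring/star topology like the one above depends on whether a common subnet exists across all nodes; it does not fix the underlying resolve-on-the-wrong-host behaviour.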