Hi, I'm currently testing the new IPv6 code in a lot of different setups.
It's doing fine with Linux and Solaris, both on x86. There are also no
problems between multiple amd64 machines, but I wasn't able to communicate
between x86 and amd64. The oob connection is up, but the BTL hangs.

gdb (remote) shows:

#0  0xb7d3bac9 in sigprocmask () from /lib/tls/libc.so.6
#1  0xb7eb956c in opal_evsignal_recalc () from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#2  0xb7eba033 in poll_dispatch () from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#3  0xb7eb8d5d in opal_event_loop () from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#4  0xb7eb2f58 in opal_progress () from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#5  0xb7c72505 in mca_pml_ob1_recv () from /home/racl/adi/ompi/trunk/Linux-i686//lib/openmpi/mca_pml_ob1.so
#6  0xb7fa8c10 in PMPI_Recv () from /home/racl/adi/ompi/trunk/Linux-i686/lib/libmpi.so.0
#7  0x080488cd in main ()

and the local gdb:

#0  0x00002aaaab4b4d99 in __libc_sigaction () from /lib/libpthread.so.0
#1  0x00002aaaaaee4c26 in opal_evsignal_recalc () from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#2  0x00002aaaaaee44b1 in opal_event_loop () from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#3  0x00002aaaaaedfc10 in opal_progress () from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#4  0x00002aaaad6a0c8c in mca_pml_ob1_recv () from /home/adi/trunk/Linux-x86_64//lib/openmpi/mca_pml_ob1.so
#5  0x00002aaaaac429f9 in PMPI_Recv () from /home/adi//trunk/Linux-x86_64/lib/libmpi.so.0
#6  0x0000000000400b39 in main ()

The ompi-1.1.2 release also shows this problem, so I'm not sure whether
it's my fault. I've added some debug output to my ringtest (see below)
and got the following result:

1: waiting for message
0: sending message (0) to 1
0: sent message

Here's the code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank;
    int size;
    int message = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (!rank) {
        /* rank 0 starts the ring and waits for the message to come back */
        printf("%i: sending message (%i) to %i\n", rank, message, 1);
        MPI_Send(&message, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("%i: sent message\n", rank);
        MPI_Recv(&message, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("%i: got message (%i) from %i\n", rank, message, size - 1);
    } else {
        /* all other ranks receive, increment and forward the message */
        printf("%i: waiting for message\n", rank);
        MPI_Recv(&message, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        message += 1;
        MPI_Send(&message, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        printf("%i: got message (%i) from %i, sending to %i\n",
               rank, message, rank - 1, (rank + 1) % size);
    }

    MPI_Finalize();
    return 0;
}

Nothing special, but as the gdb output and the debug lines show, both
processes are sitting in PMPI_Recv(), each expecting a message to arrive.

Is this a known problem? What's wrong? User code? ompi? As far as I can
tell (tcpdump and strace), all TCP connections are up, so the message
seems to get stuck between rank 0 and rank 1.

-- 
mail: a...@thur.de    http://adi.thur.de    PGP: v2-key via keyserver
Windows not found - Abort/Retry/Smile
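P.S.: In case it helps with narrowing this down, here is a sketch of the
receive side rewritten to poll with MPI_Iprobe instead of blocking in
MPI_Recv (the recv_with_timeout name and the 30-second timeout are just my
choices, nothing from the tree). The idea is that a lost message then shows
up as a timeout on rank 1 instead of an indefinite hang, and the probe
status tells us which source/tag actually got matched:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Poll for an incoming message instead of blocking in MPI_Recv; MPI_Iprobe
 * still drives the progress engine, so the BTL gets a chance to deliver. */
int recv_with_timeout(int* message, int seconds)
{
    MPI_Status status;
    int flag = 0;
    int waited;

    for (waited = 0; waited < seconds; ++waited) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, &status);
        if (flag) {
            /* a matchable message is there; receive exactly that one */
            MPI_Recv(message, 1, MPI_INT, status.MPI_SOURCE,
                     status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("got message (%i) from %i, tag %i\n",
                   *message, status.MPI_SOURCE, status.MPI_TAG);
            return 0;
        }
        sleep(1);
    }
    printf("timeout: no matchable message after %i seconds\n", seconds);
    return -1;
}

Called in place of the blocking MPI_Recv in the ringtest above (e.g.
recv_with_timeout(&message, 30)). Given the hang I'm seeing, I'd expect
this to report the timeout on rank 1, which would confirm that the message
never reaches the matching logic at all.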