Hi,
I'm currently testing the new IPv6 code in a number of
different setups.
It works fine with Linux and Solaris, both on x86, and there
are also no problems between multiple amd64 machines, but I
wasn't able to communicate between x86 and amd64.
The oob connection comes up, but the BTL hangs. gdb (on the
remote node) shows:
#0 0xb7d3bac9 in sigprocmask () from /lib/tls/libc.so.6
#1 0xb7eb956c in opal_evsignal_recalc ()
from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#2 0xb7eba033 in poll_dispatch ()
from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#3 0xb7eb8d5d in opal_event_loop ()
from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#4 0xb7eb2f58 in opal_progress ()
from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#5 0xb7c72505 in mca_pml_ob1_recv ()
from /home/racl/adi/ompi/trunk/Linux-i686//lib/openmpi/mca_pml_ob1.so
#6 0xb7fa8c10 in PMPI_Recv ()
from /home/racl/adi/ompi/trunk/Linux-i686/lib/libmpi.so.0
#7 0x080488cd in main ()
and the local gdb:
#0 0x00002aaaab4b4d99 in __libc_sigaction () from /lib/libpthread.so.0
#1 0x00002aaaaaee4c26 in opal_evsignal_recalc ()
from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#2 0x00002aaaaaee44b1 in opal_event_loop ()
from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#3 0x00002aaaaaedfc10 in opal_progress ()
from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#4 0x00002aaaad6a0c8c in mca_pml_ob1_recv ()
from /home/adi/trunk/Linux-x86_64//lib/openmpi/mca_pml_ob1.so
#5 0x00002aaaaac429f9 in PMPI_Recv ()
from /home/adi//trunk/Linux-x86_64/lib/libmpi.so.0
#6 0x0000000000400b39 in main ()
The ompi-1.1.2 release shows this problem as well, so I'm
not sure whether it's my fault.
I've added some debug output to my ringtest (see below) and
got the following result:
1: waiting for message
0: sending message (0) to 1
0: sent message
Here's the code:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank;
    int size;
    int message = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (!rank) {
        /* rank 0 starts the ring and waits for the message
         * to come back around */
        printf("%i: sending message (%i) to %i\n", rank, message, 1);
        MPI_Send(&message, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("%i: sent message\n", rank);
        MPI_Recv(&message, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("%i: got message (%i) from %i\n", rank, message, size - 1);
    } else {
        /* all other ranks receive, increment and pass on */
        printf("%i: waiting for message\n", rank);
        MPI_Recv(&message, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        message += 1;
        MPI_Send(&message, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        printf("%i: got message (%i) from %i, sending to %i\n", rank,
               message, rank - 1, (rank + 1) % size);
    }

    MPI_Finalize();
    return 0;
}
Nothing special, but as seen in the gdb output and in the
debug lines, both processes are waiting in PMPI_Recv(), each
expecting a message to arrive.
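To make the hang visible at the application level, the
blocking receive could be replaced by MPI_Irecv() plus a
polling loop. A minimal sketch of such a helper (the function
name and the one-second polling interval are my own, not part
of the original ringtest):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* drop-in replacement for the blocking MPI_Recv() on the
 * receiving rank: posts the receive, then polls and reports
 * once per second until the message actually arrives */
static void recv_with_progress(int *message, int rank)
{
    MPI_Request req;
    int done = 0, waited = 0;

    MPI_Irecv(message, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
              MPI_COMM_WORLD, &req);
    while (!done) {
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        if (!done) {
            printf("%i: still waiting after %i s\n", rank, ++waited);
            fflush(stdout);
            sleep(1);
        }
    }
}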
Is this a known problem? What's wrong here: my user code or ompi?
As far as I can see (with tcpdump and strace), all TCP
connections are up, so the message might have gotten stuck
between rank 0 and rank 1.
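To rule out the network itself, the same path could be
exercised with a plain IPv6 TCP connection outside of Open
MPI. A minimal sketch, assuming an echo service (e.g. the
classic inetd echo on port 7) is listening on the peer --
that service is my assumption, not part of the original
setup:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>

/* usage: ./v6check <host> <port> -- connects over IPv6 only,
 * sends a small payload and waits for it to come back */
int main(int argc, char **argv)
{
    struct addrinfo hints, *res;
    char buf[64];
    int sock, rc;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <host> <port>\n", argv[0]);
        return 1;
    }

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET6;     /* IPv6 only, like the BTL under test */
    hints.ai_socktype = SOCK_STREAM;

    rc = getaddrinfo(argv[1], argv[2], &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }
    sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sock < 0) {
        perror("socket");
        return 1;
    }
    if (connect(sock, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    strcpy(buf, "ping\n");
    if (write(sock, buf, strlen(buf)) < 0) {
        perror("write");
        return 1;
    }
    rc = read(sock, buf, sizeof(buf) - 1);
    if (rc > 0) {
        buf[rc] = '\0';
        printf("echoed back: %s", buf);
    } else {
        perror("read");
    }

    freeaddrinfo(res);
    close(sock);
    return 0;
}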
--
mail: [email protected] http://adi.thur.de PGP: v2-key via keyserver
Windows not found - Abort/Retry/Smile