Hi,

I'm currently testing the new IPv6 code in a lot of
different setups.

It's doing fine with Linux and Solaris, both on x86.
There are also no problems between multiple amd64s,
but I wasn't able to communicate between x86 and amd64.

The OOB connection is up, but the TCP BTL hangs. gdb (remote) shows:

#0  0xb7d3bac9 in sigprocmask () from /lib/tls/libc.so.6
#1  0xb7eb956c in opal_evsignal_recalc ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#2  0xb7eba033 in poll_dispatch ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#3  0xb7eb8d5d in opal_event_loop ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#4  0xb7eb2f58 in opal_progress ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#5  0xb7c72505 in mca_pml_ob1_recv ()
   from /home/racl/adi/ompi/trunk/Linux-i686//lib/openmpi/mca_pml_ob1.so
#6  0xb7fa8c10 in PMPI_Recv ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libmpi.so.0
#7  0x080488cd in main ()


and the local gdb:

#0  0x00002aaaab4b4d99 in __libc_sigaction () from /lib/libpthread.so.0
#1  0x00002aaaaaee4c26 in opal_evsignal_recalc ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#2  0x00002aaaaaee44b1 in opal_event_loop ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#3  0x00002aaaaaedfc10 in opal_progress ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#4  0x00002aaaad6a0c8c in mca_pml_ob1_recv ()
   from /home/adi/trunk/Linux-x86_64//lib/openmpi/mca_pml_ob1.so
#5  0x00002aaaaac429f9 in PMPI_Recv ()
   from /home/adi//trunk/Linux-x86_64/lib/libmpi.so.0
#6  0x0000000000400b39 in main ()


The ompi-1.1.2 release also shows this problem, so I'm not
sure whether it's my fault.

I've added some debug output to my ringtest (see below) and
got the following result:

1: waiting for message
0: sending message (0) to 1
0: sent message

Here's the code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank;
    int size;
    int message = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (!rank) {
        printf("%i: sending message (%i) to %i\n", rank, message, 1);
        MPI_Send(&message, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("%i: sent message\n", rank);
        MPI_Recv(&message, 1, MPI_INT, size-1, 0, MPI_COMM_WORLD, 
                MPI_STATUS_IGNORE);
        printf("%i: got message (%i) from %i\n", rank, message, size-1);
    } else {
        printf("%i: waiting for message\n", rank);
        MPI_Recv(&message, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%i: got message (%i) from %i, sending to %i\n", rank,
               message, rank-1, (rank+1)%size);
        message += 1;
        MPI_Send(&message, 1, MPI_INT, (rank+1)%size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
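
For anyone who wants to reproduce this: a launch roughly like the
following pins the run to the TCP BTL across the two architectures
(the hostnames are placeholders, not my actual machines):

```shell
# Hypothetical launch line; hostnames are illustrative.
# --mca btl tcp,self forces the TCP BTL so the hang is not
# masked by shared memory or other transports.
mpirun -np 2 --host x86-node,amd64-node --mca btl tcp,self ./ringtest
```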

Nothing special, but as both the gdb output and the debug lines
show, each process is waiting in PMPI_Recv(), expecting a message
to arrive.

Is this a known problem? What's wrong -- user code or ompi?
As far as I can see (tcpdump and strace), all TCP connections
are up, so the message may have gotten stuck between rank 0 and rank 1.


-- 
mail: a...@thur.de      http://adi.thur.de      PGP: v2-key via keyserver

Windows not found - Abort/Retry/Smile
