I think this question was already raised a few weeks ago. The problem comes from the BTL headers, where the fields do not have the same alignment on all architectures. The original question was asked by Nysal Jan in an email with the subject "SEGV in EM64T <--> PPC64 communication" on Oct. 11, 2006. Unfortunately, we still have the same problem.
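
Just to illustrate what "different alignment" means here (this is a made-up
struct, not the actual BTL header definition): the same two fields can get
different padding on a 32-bit and a 64-bit ABI.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical header, NOT the real BTL header: a 32-bit field
   followed by a 64-bit field */
struct example_hdr {
    uint32_t tag;
    uint64_t length;
};

int main(void)
{
    /* On i386, uint64_t only requires 4-byte alignment, so offsetof(length)
       is 4 and sizeof is 12.  On x86_64 it requires 8-byte alignment, so 4
       padding bytes are inserted after 'tag', offsetof(length) is 8 and
       sizeof is 16.  Shipping such a header as raw bytes between the two
       architectures makes the receiver misparse it. */
    printf("offsetof(length) = %zu, sizeof = %zu\n",
           offsetof(struct example_hdr, length),
           sizeof(struct example_hdr));
    return 0;
}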

Can you check whether your problem is similar?

  Thanks,
    george.

On Nov 1, 2006, at 6:20 PM, Adrian Knoth wrote:

Hi,

I'm currently testing the new IPv6 code in a lot of
different setups.

It's doing fine with Linux and Solaris, both on x86.
There are also no problems between multiple amd64s,
but I wasn't able to communicate between x86 and amd64.

The oob connection is up, but the BTL hangs. gdb (remote) shows:

#0  0xb7d3bac9 in sigprocmask () from /lib/tls/libc.so.6
#1  0xb7eb956c in opal_evsignal_recalc ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#2  0xb7eba033 in poll_dispatch ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#3  0xb7eb8d5d in opal_event_loop ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#4  0xb7eb2f58 in opal_progress ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#5  0xb7c72505 in mca_pml_ob1_recv ()
   from /home/racl/adi/ompi/trunk/Linux-i686//lib/openmpi/mca_pml_ob1.so
#6  0xb7fa8c10 in PMPI_Recv ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libmpi.so.0
#7  0x080488cd in main ()


and the local gdb:

#0  0x00002aaaab4b4d99 in __libc_sigaction () from /lib/libpthread.so.0
#1  0x00002aaaaaee4c26 in opal_evsignal_recalc ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#2  0x00002aaaaaee44b1 in opal_event_loop ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#3  0x00002aaaaaedfc10 in opal_progress ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#4  0x00002aaaad6a0c8c in mca_pml_ob1_recv ()
   from /home/adi/trunk/Linux-x86_64//lib/openmpi/mca_pml_ob1.so
#5  0x00002aaaaac429f9 in PMPI_Recv ()
   from /home/adi//trunk/Linux-x86_64/lib/libmpi.so.0
#6  0x0000000000400b39 in main ()


The ompi-1.1.2-release also shows this problem, so I'm not
sure if it's my fault.

I've added some debug output to my ringtest (see below) and
got the following result:

1: waiting for message
0: sending message (0) to 1
0: sent message

Here's the code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank;
    int size;
    int message = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (!rank) {
        printf("%i: sending message (%i) to %i\n", rank, message, 1);
        MPI_Send(&message, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("%i: sent message\n", rank);
        MPI_Recv(&message, 1, MPI_INT, size-1, 0, MPI_COMM_WORLD,
                MPI_STATUS_IGNORE);
printf("%i: got message (%i) from %i\n", rank, message, size-1);
    } else {
        printf("%i: waiting for message\n");
        MPI_Recv(&message, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        message += 1;
MPI_Send(&message, 1, MPI_INT, (rank+1)%size, 0, MPI_COMM_WORLD); printf("%i: got message (%i) from %i, sending to %i\n", rank, message,
               rank-1, (rank+1)%size);
    }

    MPI_Finalize();
    return 0;
}
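
A heterogeneous run of the test looks roughly like this (the hostnames are
placeholders for the x86 and the amd64 machine, and the binary is assumed
to be called ringtest):

mpirun -np 2 --host x86-node,amd64-node ./ringtest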

Nothing special, but as seen in the gdb output and also in the
debug lines above, both processes are waiting in PMPI_Recv(),
expecting a message to arrive.

Is this a known problem? What's wrong? User code? ompi?
As far as I can see (tcpdump and strace), all TCP connections
are up, so the message seems to get stuck between rank 0 and rank 1.


--
mail: a...@thur.de      http://adi.thur.de      PGP: v2-key via keyserver

Windows not found - Abort/Retry/Smile