Hello, list.

I have a strange segmentation fault on an x86_64 machine running together with an x86 machine. I am running the attached program, which sends some bytes from process 0 to process 1. My configuration is:

Machine #1 (process 0):
  arch: x86
  hostname: magomedov-desktop
  linux distro: Ubuntu 9.10
  Open MPI: v1.4, configured with --enable-heterogeneous --enable-debug

Machine #2 (process 1):
  arch: x86_64
  hostname: linuxtche
  linux distro: Fedora 12
  Open MPI: v1.4, configured with --enable-heterogeneous --prefix=/home/magomedov/openmpi/ --enable-debug

They are connected by Ethernet. My user environment on the second (x86_64) machine is set up to use Open MPI from /home/magomedov/openmpi/.

I compile the attached program on both machines (at the same path) and run it. Process 0 on the x86 machine should send data to process 1 on the x86_64 machine.

First, let's send 65530 bytes:

mpirun -host timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65530
magomedov@linuxtche's password:
*** processor magomedov-desktop, comm size is 2, my rank is 0, pid 21875 ***
*** processor linuxtche, comm size is 2, my rank is 1, pid 11357 ***
Received 65530 bytes

That works. Now let's send 65537 bytes:

magomedov@magomedov-desktop:~/workspace/mpi-test$ mpirun -host timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65537
magomedov@linuxtche's password:
*** processor magomedov-desktop, comm size is 2, my rank is 0, pid 9205 ***
*** processor linuxtche, comm size is 2, my rank is 1, pid 28858 ***
[linuxtche:28858] *** Process received signal ***
[linuxtche:28858] Signal: Segmentation fault (11)
[linuxtche:28858] Signal code: Address not mapped (1)
[linuxtche:28858] Failing at address: 0x201143bf8
[linuxtche:28858] [ 0] /lib64/libpthread.so.0() [0x3600c0f0f0]
[linuxtche:28858] [ 1] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0xfc27) [0x7f5e94076c27]
[linuxtche:28858] [ 2] /home/magomedov/openmpi/lib/openmpi/mca_btl_tcp.so(+0xadac) [0x7f5e935c3dac]
[linuxtche:28858] [ 3] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27611) [0x7f5e96575611]
[linuxtche:28858] [ 4] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27c57) [0x7f5e96575c57]
[linuxtche:28858] [ 5] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f5e96575848]
[linuxtche:28858] [ 6] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f5e965648dd]
[linuxtche:28858] [ 7] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x762f) [0x7f5e9406e62f]
[linuxtche:28858] [ 8] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x777d) [0x7f5e9406e77d]
[linuxtche:28858] [ 9] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8246) [0x7f5e9406f246]
[linuxtche:28858] [10] /home/magomedov/openmpi/lib/libmpi.so.0(MPI_Recv+0x2d2) [0x7f5e96af832c]
[linuxtche:28858] [11] /home/magomedov/workspace/mpi-test/mpi-send-test(main+0x1e4) [0x400ee8]
[linuxtche:28858] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x360001eb1d]
[linuxtche:28858] [13] /home/magomedov/workspace/mpi-test/mpi-send-test() [0x400c49]
[linuxtche:28858] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 28858 on node linuxtche exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Whenever I try to send >= 65537 bytes from x86, I get a segfault on x86_64.

I investigated a little and found that the "bad" pointer always has what looks like a valid pointer in its lower 32-bit word and "2" or "1" in its upper word. The program segfaults in pml_ob1_recvfrag.c, in the function mca_pml_ob1_recv_frag_callback_fin(), where the rdma pointer is broken. I inserted the line

rdma = (mca_btl_base_descriptor_t*)((unsigned long)rdma & 0xFFFFFFFF);

which I believe truncates the 64-bit pointer to 32 bits, and the segfaults disappeared. However, this is not a real solution. After some investigation with gdb, it seems to me that this pointer was sent to the x86 machine and came back from it broken, but I don't understand what is going on well enough to fix it.

Can anyone reproduce this? I get the same results with openmpi-1.4.2rc1 too.

It looks like the same problem was described on the ompi-users list here:
http://www.open-mpi.org/community/lists/users/2010/02/12182.php

--
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int size;
    int rank;
    int name_len;
    char name[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    int sender = 0;
    int receiver = 1;
    uint8_t *val;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(name, &name_len);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("*** processor %s, comm size is %d, my rank is %d, pid %d ***\n",
           name, size, rank, (int)getpid());

    if (argc != 2) {
        printf("Usage: %s message_length\n", argv[0]);
        MPI_Finalize();
        return EXIT_FAILURE;
    }

    /* Rank 0 parses the message length and broadcasts it to all ranks. */
    if (rank == 0) {
        len = atoi(argv[1]);
    }
    MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* The payload contents are irrelevant here; only the size matters. */
    val = malloc(len);
    if (NULL == val) {
        puts("Memory allocation failed");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    if (size == 2) {
        if (rank == sender) {
            MPI_Send(val, len, MPI_BYTE, receiver, 0, MPI_COMM_WORLD);
        } else {
            int count = 0;

            MPI_Recv(val, len, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &stat);
            /* Report how many bytes actually arrived. */
            MPI_Get_count(&stat, MPI_BYTE, &count);
            printf("Received %d bytes\n", count);
        }
    }

    free(val);
    MPI_Finalize();
    return EXIT_SUCCESS;
}