Timur,

Thanks for the very detailed analysis of the problem. Based on your observations, I was able to track down the issue pretty quickly. In short, the 64-bit machine sent a pointer to the 32-bit one and expected to get it back untouched. Unfortunately, on the 32-bit machine this pointer was translated into a void* and the upper 32 bits were lost.
I don't have a heterogeneous environment available right away to test my patch. I would really appreciate it if you could test it and let us know if it solves the problem.

Thanks,
  george.

PS: To apply it, please go into the ompi/mca/pml/ob1 directory and run "patch -p0" from there.
heterogeneous.patch
Description: Binary data
On Apr 22, 2010, at 09:08 , Timur Magomedov wrote:

> Hello, list.
>
> I have a strange segmentation fault on x86_64 machine running together
> with x86.
> I am running attached program that sends some bytes from process 0 to
> process 1. My configuration is:
> Machine #1: (process 0)
> arch: x86
> hostname: magomedov-desktop
> linux distro: Ubuntu 9.10
> Open MPI: v1.4 configured with --enable-heterogeneous --enable-debug
> Machine #2: (process 1)
> arch: x86_64
> hostname: linuxtche
> linux distro: Fedora 12
> Open MPI: v1.4 configured with --enable-heterogeneous
> --prefix=/home/magomedov/openmpi/ --enable-debug
>
> They are connected by ethernet.
> My user environment on second (x86_64) machine is set up to use Open MPI
> from /home/magomedov/openmpi/.
>
> Then I compile attached program on both machines (at the same path) and
> run it. Process 0 from x86 machine should send data to process 1 on
> x86_64 machine.
>
> First, let's send 65530 bytes:
>
> mpirun -host timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65530
> magomedov@linuxtche's password:
> *** processor magomedov-desktop, comm size is 2, my rank is 0, pid 21875 ***
> *** processor linuxtche, comm size is 2, my rank is 1, pid 11357 ***
> Received 65530 bytes
>
> It's OK. Then let's send 65537 bytes:
>
> magomedov@magomedov-desktop:~/workspace/mpi-test$ mpirun -host timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65537
> magomedov@linuxtche's password:
> *** processor magomedov-desktop, comm size is 2, my rank is 0, pid 9205 ***
> *** processor linuxtche, comm size is 2, my rank is 1, pid 28858 ***
> [linuxtche:28858] *** Process received signal ***
> [linuxtche:28858] Signal: Segmentation fault (11)
> [linuxtche:28858] Signal code: Address not mapped (1)
> [linuxtche:28858] Failing at address: 0x201143bf8
> [linuxtche:28858] [ 0] /lib64/libpthread.so.0() [0x3600c0f0f0]
> [linuxtche:28858] [ 1] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0xfc27) [0x7f5e94076c27]
> [linuxtche:28858] [ 2] /home/magomedov/openmpi/lib/openmpi/mca_btl_tcp.so(+0xadac) [0x7f5e935c3dac]
> [linuxtche:28858] [ 3] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27611) [0x7f5e96575611]
> [linuxtche:28858] [ 4] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27c57) [0x7f5e96575c57]
> [linuxtche:28858] [ 5] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f5e96575848]
> [linuxtche:28858] [ 6] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f5e965648dd]
> [linuxtche:28858] [ 7] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x762f) [0x7f5e9406e62f]
> [linuxtche:28858] [ 8] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x777d) [0x7f5e9406e77d]
> [linuxtche:28858] [ 9] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8246) [0x7f5e9406f246]
> [linuxtche:28858] [10] /home/magomedov/openmpi/lib/libmpi.so.0(MPI_Recv+0x2d2) [0x7f5e96af832c]
> [linuxtche:28858] [11] /home/magomedov/workspace/mpi-test/mpi-send-test(main+0x1e4) [0x400ee8]
> [linuxtche:28858] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x360001eb1d]
> [linuxtche:28858] [13] /home/magomedov/workspace/mpi-test/mpi-send-test() [0x400c49]
> [linuxtche:28858] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 28858 on node linuxtche
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> If I am trying to send >= 65537 bytes from x86 I always get segfault on
> x86_64.
>
> I made some investigations and found that "bad" pointer always has a
> valid pointer actually in it's lower 32-bit word and "2" or "1" in it's
> upper word. Program segfaults in pml_ob1_recvfrag.c, in function
> mca_pml_ob1_recv_frag_callback_fin(), rdma pointer is broken. I inserted
> rdma = (mca_btl_base_descriptor_t*)((unsigned long)rdma & 0xFFFFFFFF);
> line which I believe truncates 64-bit pointer to 32 bits and segfaults
> disappeared. However, this is not the solution.
>
> After some investigations with gdb it seems to me like this pointer was
> sent to x86 machine and was received from it broken but I don't realize
> what is going on enough to fix it...
>
> Can anyone reproduce it?
> I got the same results on openmpi-1.4.2rc1 too.
>
> It looks like the same problem was described here
> http://www.open-mpi.org/community/lists/users/2010/02/12182.php in
> ompi-users list.
>
> --
> Kind regards,
> Timur Magomedov
> Senior C++ Developer
> DevelopOnBox LLC / Zodiac Interactive
> http://www.zodiac.tv/
> <mpi-send-test.c>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel