Thanks, I will apply the patch to all branches later today. Thanks for your help!
George

"All the books in the world contain no more information than is broadcast as
video in a single large American city in a single year. Not all bits have
equal value." -- Carl Sagan

On Apr 23, 2010, at 3:43, Timur Magomedov <timur.magome...@developonbox.ru> wrote:

> Thank you, George!
> I checked out trunk version 1.7a1r23028 and got the same errors as on
> 1.4.*. Then I applied your patch and fixed one more file:
>
> Index: pml_ob1_recvreq.c
> ===================================================================
> --- pml_ob1_recvreq.c       (revision 23028)
> +++ pml_ob1_recvreq.c       (working copy)
> @@ -331,7 +331,7 @@
>
>          mca_pml_ob1_send_fin(recvreq->req_recv.req_base.req_proc,
>                               bml_btl,
> -                             frag->rdma_hdr.hdr_rget.hdr_des.pval,
> +                             frag->rdma_hdr.hdr_rget.hdr_des,
>                               des->order, 0);
>
>          /* is receive request complete */
>
> and the problem disappeared.
>
> On Fri, 23 Apr 2010 at 01:38 -0400, George Bosilca wrote:
>> Timur,
>>
>> Thanks for the very detailed analysis of the problem. Based on your
>> observations, I was able to track down the issue pretty quickly. In a
>> few words, the 64-bit machine sent a pointer to the 32-bit one and
>> expected to get it back untouched. Unfortunately, on the 32-bit machine
>> this pointer was translated into a void* and the upper 32 bits were
>> lost.
>>
>> I don't have a heterogeneous environment available right away to test
>> my patch. I would really appreciate it if you could test it and let us
>> know if it solves the problem.
>>
>> Thanks,
>>   george.
>>
>> PS: In order to apply it, please go into the ompi/mca/pml/ob1 directory
>> and run "patch -p0" from there.
>>
>> On Apr 22, 2010, at 09:08, Timur Magomedov wrote:
>>
>>> Hello, list.
>>>
>>> I have a strange segmentation fault on an x86_64 machine running
>>> together with an x86 one.
>>> I am running the attached program, which sends some bytes from
>>> process 0 to process 1.
>>> My configuration is:
>>>
>>> Machine #1 (process 0):
>>>   arch: x86
>>>   hostname: magomedov-desktop
>>>   linux distro: Ubuntu 9.10
>>>   Open MPI: v1.4 configured with --enable-heterogeneous --enable-debug
>>>
>>> Machine #2 (process 1):
>>>   arch: x86_64
>>>   hostname: linuxtche
>>>   linux distro: Fedora 12
>>>   Open MPI: v1.4 configured with --enable-heterogeneous
>>>   --prefix=/home/magomedov/openmpi/ --enable-debug
>>>
>>> They are connected by Ethernet.
>>> My user environment on the second (x86_64) machine is set up to use
>>> Open MPI from /home/magomedov/openmpi/.
>>>
>>> I then compile the attached program on both machines (at the same
>>> path) and run it. Process 0 on the x86 machine should send data to
>>> process 1 on the x86_64 machine.
>>>
>>> First, let's send 65530 bytes:
>>>
>>> mpirun -host timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65530
>>> magomedov@linuxtche's password:
>>> *** processor magomedov-desktop, comm size is 2, my rank is 0, pid 21875 ***
>>> *** processor linuxtche, comm size is 2, my rank is 1, pid 11357 ***
>>> Received 65530 bytes
>>>
>>> It's OK.
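[Editor's note] An assumption worth checking, not something stated in the thread: the break between 65530 working and 65537 failing lines up with the TCP BTL's default eager limit of 64 KiB, above which ob1 switches to the rendezvous protocol that exchanges descriptor handles between peers. If that is right, raising the limit past the message size should sidestep the crash, purely as a diagnostic, not a fix:

```shell
# Assumption: btl_tcp_eager_limit defaults to 65536 bytes; raising it keeps
# a 65537-byte send on the eager path, avoiding the rendezvous handle
# exchange that corrupts the 64-bit pointer.
mpirun --mca btl_tcp_eager_limit 131072 \
       -host timur,linuxtche -np 2 \
       /home/magomedov/workspace/mpi-test/mpi-send-test 65537
```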
>>> Then let's send 65537 bytes:
>>>
>>> magomedov@magomedov-desktop:~/workspace/mpi-test$ mpirun -host timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65537
>>> magomedov@linuxtche's password:
>>> *** processor magomedov-desktop, comm size is 2, my rank is 0, pid 9205 ***
>>> *** processor linuxtche, comm size is 2, my rank is 1, pid 28858 ***
>>> [linuxtche:28858] *** Process received signal ***
>>> [linuxtche:28858] Signal: Segmentation fault (11)
>>> [linuxtche:28858] Signal code: Address not mapped (1)
>>> [linuxtche:28858] Failing at address: 0x201143bf8
>>> [linuxtche:28858] [ 0] /lib64/libpthread.so.0() [0x3600c0f0f0]
>>> [linuxtche:28858] [ 1] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0xfc27) [0x7f5e94076c27]
>>> [linuxtche:28858] [ 2] /home/magomedov/openmpi/lib/openmpi/mca_btl_tcp.so(+0xadac) [0x7f5e935c3dac]
>>> [linuxtche:28858] [ 3] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27611) [0x7f5e96575611]
>>> [linuxtche:28858] [ 4] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27c57) [0x7f5e96575c57]
>>> [linuxtche:28858] [ 5] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f5e96575848]
>>> [linuxtche:28858] [ 6] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f5e965648dd]
>>> [linuxtche:28858] [ 7] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x762f) [0x7f5e9406e62f]
>>> [linuxtche:28858] [ 8] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x777d) [0x7f5e9406e77d]
>>> [linuxtche:28858] [ 9] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8246) [0x7f5e9406f246]
>>> [linuxtche:28858] [10] /home/magomedov/openmpi/lib/libmpi.so.0(MPI_Recv+0x2d2) [0x7f5e96af832c]
>>> [linuxtche:28858] [11] /home/magomedov/workspace/mpi-test/mpi-send-test(main+0x1e4) [0x400ee8]
>>> [linuxtche:28858] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x360001eb1d]
>>> [linuxtche:28858] [13] /home/magomedov/workspace/mpi-test/mpi-send-test() [0x400c49]
>>> [linuxtche:28858] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 28858 on node linuxtche
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Whenever I try to send >= 65537 bytes from x86, I get a segfault on
>>> x86_64.
>>>
>>> I did some investigation and found that the "bad" pointer always has a
>>> valid pointer in its lower 32-bit word and "2" or "1" in its upper
>>> word. The program segfaults in pml_ob1_recvfrag.c, in
>>> mca_pml_ob1_recv_frag_callback_fin(); the rdma pointer is broken. I
>>> inserted the line
>>>
>>>   rdma = (mca_btl_base_descriptor_t*)((unsigned long)rdma & 0xFFFFFFFF);
>>>
>>> which I believe truncates the 64-bit pointer to 32 bits, and the
>>> segfaults disappeared. However, this is not a real solution.
>>>
>>> After some investigation with gdb, it looks to me as if this pointer
>>> was sent to the x86 machine and came back from it broken, but I don't
>>> understand what is going on well enough to fix it...
>>>
>>> Can anyone reproduce this?
>>> I got the same results on openmpi-1.4.2rc1 too.
>>>
>>> It looks like the same problem was described here in the ompi-users
>>> list:
>>> http://www.open-mpi.org/community/lists/users/2010/02/12182.php
>>>
>>> --
>>> Kind regards,
>>> Timur Magomedov
>>> Senior C++ Developer
>>> DevelopOnBox LLC / Zodiac Interactive
>>> http://www.zodiac.tv/
>>> <mpi-send-test.c>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Kind regards,
> Timur Magomedov
> Senior C++ Developer
> DevelopOnBox LLC / Zodiac Interactive
> http://www.zodiac.tv/