Timur,

Thanks for the very detailed analysis of the problem. Based on your 
observations, I was able to track down the issue pretty quickly. In short, 
the 64-bit machine sent a pointer to the 32-bit one and expected to get it 
back untouched. Unfortunately, on the 32-bit machine this pointer was 
converted into a void* and the upper 32 bits were lost.
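
To illustrate the failure mode described above, here is a minimal sketch in C. It is not the code from the patch: the type wire_hdr_t and both function names are hypothetical, and the 32-bit peer is only modeled by a cast to uint32_t. It contrasts echoing a 64-bit value through a 32-bit pointer-sized variable with echoing it through a field that is 64 bits wide on every architecture.

```c
#include <stdint.h>

/* Hypothetical wire header: the field is 64 bits wide on every
 * architecture, so a peer can copy it back without interpreting it. */
typedef struct {
    uint64_t opaque;   /* sender's pointer, carried as a plain integer */
} wire_hdr_t;

/* Models a 32-bit peer that stores the received value in its own
 * void* (4 bytes wide there) before sending it back. */
static uint64_t echo_through_32bit_pointer(uint64_t sent)
{
    uint32_t as_pointer = (uint32_t)sent;  /* upper 32 bits are dropped */
    return (uint64_t)as_pointer;
}

/* Models a peer that copies the fixed-width field untouched. */
static uint64_t echo_opaque(uint64_t sent)
{
    wire_hdr_t hdr;
    hdr.opaque = sent;
    return hdr.opaque;
}
```

With an illustrative 64-bit address such as 0x7f5e94076c27, the first function returns 0x94076c27 while the second returns the value unchanged.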

I don't have a heterogeneous environment available right away to test my patch. 
I would really appreciate it if you could test it and let us know if it solves 
this problem.

  Thanks,
    george.

PS: To apply it, please go to the ompi/mca/pml/ob1 directory and run 
"patch -p0" from there.

Attachment: heterogeneous.patch
Description: Binary data


On Apr 22, 2010, at 09:08 , Timur Magomedov wrote:

> Hello, list.
> 
> I have a strange segmentation fault on x86_64 machine running together
> with x86.
> I am running attached program that sends some bytes from process 0 to
> process 1. My configuration is:
> Machine #1: (process 0)
>  arch: x86
>  hostname: magomedov-desktop
>  linux distro: Ubuntu 9.10
>  Open MPI: v1.4 configured with --enable-heterogeneous --enable-debug
> Machine #2: (process 1)
>  arch: x86_64
>  hostname: linuxtche
>  linux distro: Fedora 12
>  Open MPI: v1.4 configured with --enable-heterogeneous
> --prefix=/home/magomedov/openmpi/ --enable-debug
> 
> They are connected by ethernet.
> My user environment on second (x86_64) machine is set up to use Open MPI
> from /home/magomedov/openmpi/.
> 
> Then I compile attached program on both machines (at the same path) and
> run it. Process 0 from x86 machine should send data to process 1 on
> x86_64 machine.
> 
> First, let's send 65530 bytes:
> 
> mpirun -host timur,linuxtche -np
> 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65530
> magomedov@linuxtche's password: 
> *** processor magomedov-desktop, comm size is 2, my rank is 0, pid 21875
> ***
> *** processor linuxtche, comm size is 2, my rank is 1, pid 11357 ***
> Received 65530 bytes
> 
> It's OK. Then let's send 65537 bytes:
> 
> magomedov@magomedov-desktop:~/workspace/mpi-test$ mpirun -host
> timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test
> 65537
> magomedov@linuxtche's password: 
> *** processor magomedov-desktop, comm size is 2, my rank is 0, pid 9205
> ***
> *** processor linuxtche, comm size is 2, my rank is 1, pid 28858 ***
> [linuxtche:28858] *** Process received signal ***
> [linuxtche:28858] Signal: Segmentation fault (11)
> [linuxtche:28858] Signal code: Address not mapped (1)
> [linuxtche:28858] Failing at address: 0x201143bf8
> [linuxtche:28858] [ 0] /lib64/libpthread.so.0() [0x3600c0f0f0]
> [linuxtche:28858]
> [ 1] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0xfc27)
> [0x7f5e94076c27]
> [linuxtche:28858]
> [ 2] /home/magomedov/openmpi/lib/openmpi/mca_btl_tcp.so(+0xadac)
> [0x7f5e935c3dac]
> [linuxtche:28858]
> [ 3] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27611)
> [0x7f5e96575611]
> [linuxtche:28858]
> [ 4] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27c57)
> [0x7f5e96575c57]
> [linuxtche:28858]
> [ 5] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f)
> [0x7f5e96575848]
> [linuxtche:28858]
> [ 6] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_progress+0x89)
> [0x7f5e965648dd]
> [linuxtche:28858]
> [ 7] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x762f)
> [0x7f5e9406e62f]
> [linuxtche:28858]
> [ 8] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x777d)
> [0x7f5e9406e77d]
> [linuxtche:28858]
> [ 9] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8246)
> [0x7f5e9406f246]
> [linuxtche:28858] [10] /home/magomedov/openmpi/lib/libmpi.so.0(MPI_Recv
> +0x2d2) [0x7f5e96af832c]
> [linuxtche:28858]
> [11] /home/magomedov/workspace/mpi-test/mpi-send-test(main+0x1e4)
> [0x400ee8]
> [linuxtche:28858] [12] /lib64/libc.so.6(__libc_start_main+0xfd)
> [0x360001eb1d]
> [linuxtche:28858]
> [13] /home/magomedov/workspace/mpi-test/mpi-send-test() [0x400c49]
> [linuxtche:28858] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 28858 on node linuxtche
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> Whenever I try to send >= 65537 bytes from x86, I get a segfault on
> x86_64.
> 
> I investigated and found that the "bad" pointer always has a valid
> pointer in its lower 32-bit word and "2" or "1" in its upper word. The
> program segfaults in pml_ob1_recvfrag.c, in function
> mca_pml_ob1_recv_frag_callback_fin(), where the rdma pointer is broken.
> I inserted the line
> rdma = (mca_btl_base_descriptor_t*)((unsigned long)rdma & 0xFFFFFFFF);
> which I believe truncates the 64-bit pointer to 32 bits, and the
> segfaults disappeared. However, this is not a real solution.
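
A small sketch in C of why the mask hides the crash without being a general fix. The helper names high_word and low_word are hypothetical, and the addresses are taken from the log above purely as sample values. Splitting the failing address 0x201143bf8 gives an upper word of 0x2 and a lower word of 0x01143bf8, matching the pattern described. The same mask applied to a genuine 64-bit address would destroy it.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit address. */
static uint32_t high_word(uint64_t addr)
{
    return (uint32_t)(addr >> 32);
}

/* Lower 32 bits, i.e. what the "& 0xFFFFFFFF" workaround keeps. */
static uint64_t low_word(uint64_t addr)
{
    return addr & UINT64_C(0xFFFFFFFF);
}
```

For 0x201143bf8, high_word gives 0x2 and low_word gives 0x01143bf8. For an address such as 0x7f5e94076c27, low_word gives 0x94076c27, which is no longer the original address, so the mask only helps when the real pointer happens to fit in 32 bits.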
> 
> After some debugging with gdb, it seems to me that this pointer was
> sent to the x86 machine and came back from it broken, but I don't
> understand what is going on well enough to fix it...
> 
> Can anyone reproduce it?
> I got the same results on openmpi-1.4.2rc1 too.
> 
> It looks like the same problem was described here
> http://www.open-mpi.org/community/lists/users/2010/02/12182.php in
> ompi-users list.
> 
> -- 
> Kind regards,
> Timur Magomedov
> Senior C++ Developer
> DevelopOnBox LLC / Zodiac Interactive
> http://www.zodiac.tv/
> <mpi-send-test.c>_______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
