Hello,

I cannot run a program with MPI when I compile Open MPI myself.
On some nodes I get the following error:
================================================================================
[mimi012:17730] *** Process received signal ***
[mimi012:17730] Signal: Segmentation fault (11)
[mimi012:17730] Signal code: Address not mapped (1)
[mimi012:17730] Failing at address: 0xf8
[mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
[mimi012:17730] [ 1]
/home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
[mimi012:17730] [ 2]
/home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
[mimi012:17730] [ 3]
/home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
[mimi012:17730] [ 4]
/home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
[mimi012:17730] [ 5]
/home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
[mimi012:17730] [ 6]
/home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
[mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
[mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
[mimi012:17730] *** End of error message ***
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  mimi012

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
================================================================================

The error does not appear with the official Open MPI installation on
the platform. I asked the admins about their compilation options, but
there is nothing unusual about them.
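For reference, my own build was configured roughly along these lines
(the exact options may not be identical; --enable-debug is my reading
of the "openmpi-debug" prefix that appears in the backtrace):

  ./configure --prefix=$HOME/modules/openmpi/openmpi-debug --enable-debug
  make -j8 && make install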

Moreover, the error appears only for some node lists. Still, the nodes
themselves seem to be fine, since everything works with the platform's
official Open MPI installation.

To be sure it is not a network problem, I tried "-mca btl tcp,sm,self"
and "-mca btl openib,sm,self", with no change.
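For completeness, the runs were launched roughly like this (the process
count, hostfile, and program name are placeholders, not my exact
command):

  mpirun -np 16 --hostfile nodes.txt -mca btl tcp,sm,self ./my_program
  mpirun -np 16 --hostfile nodes.txt -mca btl openib,sm,self ./my_program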

Do you have any idea where this error may come from?

Thank you.


Cyril Bordage.
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
