What version of Open MPI are you running?

The error indicates that Open MPI is trying to start a user-level helper
daemon (the ORTE daemon, orted) on the remote node, and that daemon is
segfaulting (which is unusual).
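
Backtraces like this often mean the remote node is picking up a mismatched
or stale Open MPI install (e.g., PATH / LD_LIBRARY_PATH resolving to
different builds on different nodes). A quick sanity check you could run
(the node name below is just taken from your output; adjust as needed):

     # On the launch node:
     which mpirun
     mpirun --version

     # On a failing node, see what a non-interactive SSH shell finds:
     ssh mimi012 which orted
     ssh mimi012 'echo $PATH; echo $LD_LIBRARY_PATH'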

One thing to be aware of:

     https://www.open-mpi.org/faq/?category=building#install-overwrite
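
If you have ever installed one Open MPI version over another into the same
prefix, stale plugins left behind can cause exactly this kind of crash. A
minimal sketch of a clean rebuild (the prefix and the --enable-debug flag
are assumptions based on the paths in your backtrace; substitute your real
configure options):

     # Remove the old installation tree first (assumed prefix)
     rm -rf /home/bordage/modules/openmpi/openmpi-debug

     # Then, from the Open MPI source directory, rebuild and reinstall
     ./configure --prefix=/home/bordage/modules/openmpi/openmpi-debug --enable-debug
     make -j8 all install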



> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> wrote:
> 
> Hello,
> 
> I cannot run a program with MPI when I compile Open MPI myself.
> On some nodes I have the following error:
> ================================================================================
> [mimi012:17730] *** Process received signal ***
> [mimi012:17730] Signal: Segmentation fault (11)
> [mimi012:17730] Signal code: Address not mapped (1)
> [mimi012:17730] Failing at address: 0xf8
> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
> [mimi012:17730] [ 1]
> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
> [mimi012:17730] [ 2]
> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
> [mimi012:17730] [ 3]
> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
> [mimi012:17730] [ 4]
> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
> [mimi012:17730] [ 5]
> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
> [mimi012:17730] [ 6]
> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
> [mimi012:17730] *** End of error message ***
> --------------------------------------------------------------------------
> ORTE has lost communication with its daemon located on node:
> 
>  hostname:  mimi012
> 
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
> ================================================================================
> 
> The error does not appear with the official MPI installed on the
> platform. I asked the admins about their compilation options, but there
> is nothing unusual about them.
> 
> Moreover, it appears only for some node lists. Still, the nodes seem to
> be fine, since they work with the platform's official version of MPI.
> 
> To make sure it is not a network problem, I tried "-mca btl
> tcp,sm,self" and "-mca btl openib,sm,self", with no change.
> 
> Do you have any idea where this error may come from?
> 
> Thank you.
> 
> 
> Cyril Bordage.
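
One note on the "-mca btl ..." runs above: the backtrace is in the
mca_oob_tcp component inside the ORTE daemon, i.e., the runtime's
out-of-band TCP channel, not the MPI byte transfer layer, so changing the
btl selection would not be expected to affect this crash. If some of the
failing nodes have extra or unusual network interfaces, one thing that
might be worth trying (eth0 below is just a placeholder; use an interface
that is actually routable between your nodes):

     mpirun --mca oob_tcp_if_include eth0 ...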


-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
