To close this thread: the problem is now solved. Some .so files were lingering
around from a previous installation, causing problems at startup.
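
For anyone hitting the same thing, here is a minimal sketch of how one might spot such leftovers. The paths are illustrative only (they mimic the prefix from the backtrace, not my actual setup): it builds a throwaway install prefix containing one stale plugin, then lists any plugin .so older than the core library, which is a hint it survived from a previous "make install".

```shell
# Sketch with hypothetical paths: simulate an install prefix holding
# one stale plugin (old mtime) and one fresh plugin alongside the
# freshly installed core library.
PREFIX=$(mktemp -d)
mkdir -p "$PREFIX/lib/openmpi"
touch -t 202001010000 "$PREFIX/lib/openmpi/mca_oob_tcp.so"   # stale leftover
touch -t 202306010000 "$PREFIX/lib/libopen-pal.so.0"         # core library
touch "$PREFIX/lib/openmpi/mca_btl_tcp.so"                   # fresh plugin

# Plugins older than the core library are likely leftovers from a
# previous installation; remove them (or wipe the prefix) before
# reinstalling.
find "$PREFIX/lib/openmpi" -name '*.so' ! -newer "$PREFIX/lib/libopen-pal.so.0"
```

Wiping the whole install prefix before "make install", as the FAQ page Jeff linked suggests, avoids the problem entirely.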

  George.


> On Feb 10, 2017, at 05:38 , Cyril Bordage <cyril.bord...@inria.fr> wrote:
> 
> Thank you for your answer.
> I am running the git master version (last tested was cad4c03).
> 
> FYI, Clément Foyer is talking with George Bosilca about this problem.
> 
> 
> Cyril.
> 
> On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
>> What version of Open MPI are you running?
>> 
>> The error is indicating that Open MPI is trying to start a user-level helper 
>> daemon on the remote node, and the daemon is seg faulting (which is unusual).
>> 
>> One thing to be aware of:
>> 
>>     https://www.open-mpi.org/faq/?category=building#install-overwrite
>> 
>> 
>> 
>>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>> 
>>> Hello,
>>> 
>>> I cannot run a program with MPI when I compile it myself.
>>> On some nodes I have the following error:
>>> ================================================================================
>>> [mimi012:17730] *** Process received signal ***
>>> [mimi012:17730] Signal: Segmentation fault (11)
>>> [mimi012:17730] Signal code: Address not mapped (1)
>>> [mimi012:17730] Failing at address: 0xf8
>>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
>>> [mimi012:17730] [ 1]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
>>> [mimi012:17730] [ 2]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
>>> [mimi012:17730] [ 3]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
>>> [mimi012:17730] [ 4]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
>>> [mimi012:17730] [ 5]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
>>> [mimi012:17730] [ 6]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
>>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
>>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
>>> [mimi012:17730] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> ORTE has lost communication with its daemon located on node:
>>> 
>>> hostname:  mimi012
>>> 
>>> This is usually due to either a failure of the TCP network
>>> connection to the node, or possibly an internal failure of
>>> the daemon itself. We cannot recover from this failure, and
>>> therefore will terminate the job.
>>> --------------------------------------------------------------------------
>>> ================================================================================
>>> 
>>> The error does not appear with the official MPI installed in the
>>> platform. I asked the admins about their compilation options but there
>>> is nothing particular.
>>> 
>>> Moreover, it appears only for some node lists. Still, the nodes seem to
>>> be fine, since they work with the platform's official MPI version.
>>> 
>>> To rule out a network problem I tried "-mca btl tcp,sm,self" and
>>> "-mca btl openib,sm,self", with no change.
>>> 
>>> Do you have any idea where this error may come from?
>>> 
>>> Thank you.
>>> 
>>> 
>>> Cyril Bordage.
>>> _______________________________________________
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> 
>> 
