Cyril,

Are you running your jobs via a batch manager?
If yes, was support for it built correctly?

If you were able to get a core dump, can you post the gdb stack trace?
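
If it helps, getting the backtrace out of the core is roughly this (a minimal sketch; <prefix> stands for your Open MPI install prefix and <corefile> for whatever core file the crash left behind, so adjust both):
gdb <prefix>/bin/orted <corefile>   # the crashing process here is the ORTE helper daemon
(gdb) bt full                       # full backtrace of the crashing thread, with locals
(gdb) thread apply all bt           # backtraces of all threads, since the crash is in a progress thread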

I guess your nodes have several IP interfaces; you might want to try
mpirun --mca oob_tcp_if_include eth0 ...
(replace eth0 with whatever interface is appropriate on your nodes)
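
If you are not sure which interface name to use, listing the interfaces on one of the nodes with, e.g.
ip addr show
(or /sbin/ifconfig on older systems) should tell you which one carries the inter-node network.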

Cheers,

Gilles

Cyril Bordage <cyril.bord...@inria.fr> wrote:
>Unfortunately this does not close the thread: the problem is not
>solved! It is not an installation problem. There is no leftover
>previous installation, since I use separate directories for each build.
>I have nothing MPI-specific in my environment; I just use the full
>paths to mpicc and mpirun.
>
>The error depends on which nodes I run on. For example, I can run on
>node1 and node2, on node1 and node3, or on node2 and node3, but not on
>node1, node2 and node3 together. With the platform's official version
>(1.8.1) it works like a charm.
>
>George, maybe you could see it for yourself by connecting to our
>platform (plafrim), since you have an account. That should make it
>easier to understand and observe our problem.
>
>
>Cyril.
>
>On 10/02/2017 at 18:15, George Bosilca wrote:
>> To complete this thread, the problem is now solved. Some .so files were
>> lingering around from a previous installation, causing startup problems.
>> 
>>   George.
>> 
>> 
>>> On Feb 10, 2017, at 05:38, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>>
>>> Thank you for your answer.
>>> I am running the git master version (last tested was cad4c03).
>>>
>>> FYI, Clément Foyer is talking with George Bosilca about this problem.
>>>
>>>
>>> Cyril.
>>>
>>> On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
>>>> What version of Open MPI are you running?
>>>>
>>>> The error indicates that Open MPI is trying to start a user-level
>>>> helper daemon on the remote node, and that the daemon is segfaulting
>>>> (which is unusual).
>>>>
>>>> One thing to be aware of:
>>>>
>>>>     https://www.open-mpi.org/faq/?category=building#install-overwrite
>>>>
>>>>
>>>>
>>>>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I cannot run a program with MPI when I compile Open MPI myself.
>>>>> On some nodes I get the following error:
>>>>> ================================================================================
>>>>> [mimi012:17730] *** Process received signal ***
>>>>> [mimi012:17730] Signal: Segmentation fault (11)
>>>>> [mimi012:17730] Signal code: Address not mapped (1)
>>>>> [mimi012:17730] Failing at address: 0xf8
>>>>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
>>>>> [mimi012:17730] [ 1]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
>>>>> [mimi012:17730] [ 2]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
>>>>> [mimi012:17730] [ 3]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
>>>>> [mimi012:17730] [ 4]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
>>>>> [mimi012:17730] [ 5]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
>>>>> [mimi012:17730] [ 6]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
>>>>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
>>>>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
>>>>> [mimi012:17730] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> ORTE has lost communication with its daemon located on node:
>>>>>
>>>>> hostname:  mimi012
>>>>>
>>>>> This is usually due to either a failure of the TCP network
>>>>> connection to the node, or possibly an internal failure of
>>>>> the daemon itself. We cannot recover from this failure, and
>>>>> therefore will terminate the job.
>>>>> --------------------------------------------------------------------------
>>>>> ================================================================================
>>>>>
>>>>> The error does not appear with the official MPI installation on the
>>>>> platform. I asked the admins about their compilation options, but
>>>>> there is nothing special about them.
>>>>>
>>>>> Moreover, it appears only for some node lists. Still, the nodes seem to
>>>>> be fine, since everything works with the platform's official MPI version.
>>>>>
>>>>> To make sure it is not a network problem, I tried "-mca btl
>>>>> tcp,sm,self" and "-mca btl openib,sm,self", with no change.
>>>>>
>>>>> Do you have any idea where this error may come from?
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> Cyril Bordage.
>>>>
>>>>
>> 
>> 
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
