I have no MPI installation in my environment. And if there were one, would
it actually cause an error, given that I use the full path to mpirun?
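
As a sanity check for stray libraries, something like the following should
show whether a binary picks up another installation (a.out stands for my
test program):

  # which Open MPI libraries does the binary actually resolve to?
  ldd ./a.out | grep -iE 'mpi|open-'
  # does anything in the environment point at another installation?
  echo $LD_LIBRARY_PATH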

I finally managed to get a backtrace:
#0  0x00007ffff7533f18 in _exit () from /lib64/libc.so.6
#1  0x00007ffff5169d68 in rte_abort (status=-51, report=true) at
../../../../../src/orte/mca/ess/pmi/ess_pmi_module.c:494
#2  0x00007ffff7b4fb9d in ompi_rte_abort (error_code=-51, fmt=0x0) at
../../../../../src/ompi/mca/rte/orte/rte_orte_module.c:85
#3  0x00007ffff7a927a3 in ompi_mpi_abort (comm=0x601280
<ompi_mpi_comm_world>, errcode=-51) at
../../src/ompi/runtime/ompi_mpi_abort.c:206
#4  0x00007ffff7a77c6b in ompi_errhandler_callback (status=-51,
source=0x7fffe8003494, info=0x7fffe8003570, results=0x7fffe80034c8,
cbfunc=0x7ffff4058ee8 <return_local_event_hdlr>, cbdata=0x7fffe80033d0)
    at ../../src/ompi/errhandler/errhandler.c:250
#5  0x00007ffff40594f7 in _event_hdlr (sd=-1, args=4,
cbdata=0x7fffe80033d0) at
../../../../../src/opal/mca/pmix/pmix2x/pmix2x.c:216
#6  0x00007ffff6ed2bdc in event_process_active_single_queue
(activeq=0x667cb0, base=0x668410) at
../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1370
#7  event_process_active (base=<optimized out>) at
../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1440
#8  opal_libevent2022_event_base_loop (base=0x668410, flags=1) at
../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1644
#9  0x00007ffff6e78263 in progress_engine (obj=0x667c68) at
../../src/opal/runtime/opal_progress_threads.c:105
#10 0x00007ffff7821851 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff756f94d in clone () from /lib64/libc.so.6
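
In case it is useful, this is roughly how the core file and backtrace were
obtained (paths and host names are illustrative):

  # allow core dumps, then reproduce the crash
  ulimit -c unlimited
  /path/to/openmpi-debug/bin/mpirun -np 3 -host node1,node2,node3 ./a.out
  # load the resulting core file and dump all thread backtraces
  gdb ./a.out core
  (gdb) thread apply all bt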


Cyril.

On 14/02/2017 at 13:25, Jeff Squyres (jsquyres) wrote:
> You should also check your paths for non-interactive remote logins and ensure 
> that you are not accidentally mixing versions of Open MPI (e.g., the new 
> version on your local machine, and some other version on the remote 
> machines). 
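> 
> A quick way to inspect the environment of a non-interactive remote login is 
> something like this (the node name is an example):
> 
>     ssh node2 'which mpirun orted; mpirun --version'
> 
> If that reports a different location or version than your local build, the 
> installations are being mixed.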
> 
> Sent from my phone. No type good. 
> 
>> On Feb 13, 2017, at 8:14 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>> Cyril,
>>
>> Are you running your jobs via a batch manager?
>> If yes, was support for it built correctly?
>>
>> If you were able to get a core dump, can you post the gdb stack trace?
>>
>> I guess your nodes have several IP interfaces; you might want to try
>> mpirun --mca oob_tcp_if_include eth0 ...
>> (replace eth0 with something appropriate if needed)
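>>
>> For example, a full command line might look like this (host names are
>> placeholders):
>>
>>     /path/to/openmpi-debug/bin/mpirun --mca oob_tcp_if_include eth0 \
>>         -np 3 -host node1,node2,node3 ./a.out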
>>
>> Cheers,
>>
>> Gilles
>>
>> Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>> Unfortunately this does not close this thread. The problem is not
>>> solved! It is not an installation problem: I have no previous
>>> installation, since I use separate directories.
>>> I have nothing MPI-specific in my environment; I just use the full
>>> paths to mpicc and mpirun.
>>>
>>> The error depends on which nodes I run on. For example, I can run on
>>> node1 and node2, or node1 and node3, or node2 and node3, but not on
>>> node1, node2, and node3 together. With the official version installed
>>> on the platform (1.8.1), it works like a charm.
>>>
>>> George, maybe you could see it for yourself by connecting to our
>>> platform (plafrim), since you have an account. That should make our
>>> problem easier to understand and observe.
>>>
>>>
>>> Cyril.
>>>
>>>> On 10/02/2017 at 18:15, George Bosilca wrote:
>>>> To close this thread: the problem is now solved. Some .so files were 
>>>> lingering around from a previous installation, causing startup problems.
>>>>
>>>>  George.
>>>>
>>>>
>>>>> On Feb 10, 2017, at 05:38, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>>>>
>>>>> Thank you for your answer.
>>>>> I am running the git master version (last tested was cad4c03).
>>>>>
>>>>> FYI, Clément Foyer is talking with George Bosilca about this problem.
>>>>>
>>>>>
>>>>> Cyril.
>>>>>
>>>>>> On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
>>>>>> What version of Open MPI are you running?
>>>>>>
>>>>>> The error indicates that Open MPI is trying to start a user-level 
>>>>>> helper daemon on the remote node, and that the daemon is segfaulting 
>>>>>> (which is unusual).
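>>>>>>
>>>>>> To watch what the daemon does at startup, something along these lines
>>>>>> can help (--debug-daemons keeps the daemon output visible, and
>>>>>> plm_base_verbose traces the daemon launch):
>>>>>>
>>>>>>     mpirun --debug-daemons --mca plm_base_verbose 10 -np 2 ./a.out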
>>>>>>
>>>>>> One thing to be aware of:
>>>>>>
>>>>>>    https://www.open-mpi.org/faq/?category=building#install-overwrite
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I cannot run a program with MPI when I compile Open MPI myself.
>>>>>>> On some nodes I have the following error:
>>>>>>> ================================================================================
>>>>>>> [mimi012:17730] *** Process received signal ***
>>>>>>> [mimi012:17730] Signal: Segmentation fault (11)
>>>>>>> [mimi012:17730] Signal code: Address not mapped (1)
>>>>>>> [mimi012:17730] Failing at address: 0xf8
>>>>>>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
>>>>>>> [mimi012:17730] [ 1]
>>>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
>>>>>>> [mimi012:17730] [ 2]
>>>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
>>>>>>> [mimi012:17730] [ 3]
>>>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
>>>>>>> [mimi012:17730] [ 4]
>>>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
>>>>>>> [mimi012:17730] [ 5]
>>>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
>>>>>>> [mimi012:17730] [ 6]
>>>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
>>>>>>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
>>>>>>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
>>>>>>> [mimi012:17730] *** End of error message ***
>>>>>>> --------------------------------------------------------------------------
>>>>>>> ORTE has lost communication with its daemon located on node:
>>>>>>>
>>>>>>> hostname:  mimi012
>>>>>>>
>>>>>>> This is usually due to either a failure of the TCP network
>>>>>>> connection to the node, or possibly an internal failure of
>>>>>>> the daemon itself. We cannot recover from this failure, and
>>>>>>> therefore will terminate the job.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> ================================================================================
>>>>>>>
>>>>>>> The error does not appear with the official MPI installed on the
>>>>>>> platform. I asked the admins about their build options, but there
>>>>>>> is nothing unusual.
>>>>>>>
>>>>>>> Moreover, it appears only for some node lists. Still, the nodes seem
>>>>>>> to be fine, since they work with the platform's official MPI version.
>>>>>>>
>>>>>>> To rule out a network problem, I tried "-mca btl tcp,sm,self" and
>>>>>>> "-mca btl openib,sm,self", with no change.
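>>>>>>>
>>>>>>> Concretely, the runs looked something like this (host names are
>>>>>>> examples):
>>>>>>>
>>>>>>>     mpirun -mca btl tcp,sm,self -np 3 -host node1,node2,node3 ./a.out
>>>>>>>     mpirun -mca btl openib,sm,self -np 3 -host node1,node2,node3 ./a.out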
>>>>>>>
>>>>>>> Do you have any idea where this error may come from?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>>
>>>>>>> Cyril Bordage.