Cyril,

Your first post mentions a crash in orted, but
the stack trace is from an MPI task.

I would expect orted to generate a core file, and then you can use gdb post
mortem to get the stack trace.
There should be several threads, so you can run
  info threads
  bt
and you might have to switch to another thread.
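
For example, something along these lines (just a sketch; the ulimit step, the
core file name, and the orted path are placeholders you will need to adjust
for your system):

  # allow core dumps on the compute node before reproducing the crash
  ulimit -c unlimited
  # after the crash, open the core post mortem
  gdb /path/to/openmpi/bin/orted core.<pid>
  (gdb) info threads
  (gdb) thread apply all bt   # backtraces for every thread at once
  (gdb) thread 2              # or switch to a specific thread
  (gdb) bt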

Cheers,

Gilles

On Tuesday, February 14, 2017, Cyril Bordage <cyril.bord...@inria.fr> wrote:

> I have no MPI installation in my environment.
> If that were the case, would I get an error, since I use the complete path
> for mpirun?
>
> I finally managed to get a backtrace:
> #0  0x00007ffff7533f18 in _exit () from /lib64/libc.so.6
> #1  0x00007ffff5169d68 in rte_abort (status=-51, report=true) at
> ../../../../../src/orte/mca/ess/pmi/ess_pmi_module.c:494
> #2  0x00007ffff7b4fb9d in ompi_rte_abort (error_code=-51, fmt=0x0) at
> ../../../../../src/ompi/mca/rte/orte/rte_orte_module.c:85
> #3  0x00007ffff7a927a3 in ompi_mpi_abort (comm=0x601280
> <ompi_mpi_comm_world>, errcode=-51) at
> ../../src/ompi/runtime/ompi_mpi_abort.c:206
> #4  0x00007ffff7a77c6b in ompi_errhandler_callback (status=-51,
> source=0x7fffe8003494, info=0x7fffe8003570, results=0x7fffe80034c8,
> cbfunc=0x7ffff4058ee8 <return_local_event_hdlr>, cbdata=0x7fffe80033d0)
>     at ../../src/ompi/errhandler/errhandler.c:250
> #5  0x00007ffff40594f7 in _event_hdlr (sd=-1, args=4,
> cbdata=0x7fffe80033d0) at
> ../../../../../src/opal/mca/pmix/pmix2x/pmix2x.c:216
> #6  0x00007ffff6ed2bdc in event_process_active_single_queue
> (activeq=0x667cb0, base=0x668410) at
> ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1370
> #7  event_process_active (base=<optimized out>) at
> ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1440
> #8  opal_libevent2022_event_base_loop (base=0x668410, flags=1) at
> ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1644
> #9  0x00007ffff6e78263 in progress_engine (obj=0x667c68) at
> ../../src/opal/runtime/opal_progress_threads.c:105
> #10 0x00007ffff7821851 in start_thread () from /lib64/libpthread.so.0
> #11 0x00007ffff756f94d in clone () from /lib64/libc.so.6
>
>
> Cyril.
>
> On 14/02/2017 at 13:25, Jeff Squyres (jsquyres) wrote:
> > You should also check your paths for non-interactive remote logins and
> > ensure that you are not accidentally mixing versions of Open MPI (e.g.,
> > the new version on your local machine and some other version on the
> > remote machines).
> >
> > Sent from my phone. No type good.
> >
> >> On Feb 13, 2017, at 8:14 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> >>
> >> Cyril,
> >>
> >> Are you running your jobs via a batch manager?
> >> If yes, was support for it correctly built?
> >>
> >> If you were able to get a core dump, can you post the gdb stacktrace?
> >>
> >> I guess your nodes have several IP interfaces; you might want to try
> >> mpirun --mca oob_tcp_if_include eth0 ...
> >> (replace eth0 with something appropriate if needed)
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> Cyril Bordage <cyril.bord...@inria.fr> wrote:
> >>> Unfortunately this does not complete this thread. The problem is not
> >>> solved! It is not an installation problem. I have no previous
> >>> installation since I use separate directories.
> >>> There is nothing MPI-specific in the paths in my env; I just use the
> >>> complete path to mpicc and mpirun.
> >>>
> >>> The error depends on which node I run on. For example, I can run on
> >>> node1 and node2, or node1 and node3, or node2 and node3, but not on node1,
> >>> node2 and node3. With the official version of the platform (1.8.1) it
> >>> works like a charm.
> >>>
> >>> George, maybe you could see it for yourself by connecting to our
> >>> platform (plafrim), since you have an account. That should make it
> >>> easier to understand and see our problem.
> >>>
> >>>
> >>> Cyril.
> >>>
> >>>> On 10/02/2017 at 18:15, George Bosilca wrote:
> >>>>> To complete this thread, the problem is now solved. Some .so files were
> >>>>> lingering around from a previous installation, causing startup problems.
> >>>>
> >>>>  George.
> >>>>
> >>>>
> >>>>> On Feb 10, 2017, at 05:38, Cyril Bordage <cyril.bord...@inria.fr> wrote:
> >>>>>
> >>>>> Thank you for your answer.
> >>>>> I am running the git master version (last tested was cad4c03).
> >>>>>
> >>>>> FYI, Clément Foyer is talking with George Bosilca about this problem.
> >>>>>
> >>>>>
> >>>>> Cyril.
> >>>>>
> >>>>>> On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
> >>>>>> What version of Open MPI are you running?
> >>>>>>
> >>>>>> The error indicates that Open MPI is trying to start a user-level
> >>>>>> helper daemon on the remote node, and the daemon is segfaulting
> >>>>>> (which is unusual).
> >>>>>>
> >>>>>> One thing to be aware of:
> >>>>>>
> >>>>>>    https://www.open-mpi.org/faq/?category=building#install-overwrite
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> I cannot run a program with MPI when I compile it myself.
> >>>>>>> On some nodes I have the following error:
> >>>>>>> ================================================================================
> >>>>>>> [mimi012:17730] *** Process received signal ***
> >>>>>>> [mimi012:17730] Signal: Segmentation fault (11)
> >>>>>>> [mimi012:17730] Signal code: Address not mapped (1)
> >>>>>>> [mimi012:17730] Failing at address: 0xf8
> >>>>>>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
> >>>>>>> [mimi012:17730] [ 1] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
> >>>>>>> [mimi012:17730] [ 2] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
> >>>>>>> [mimi012:17730] [ 3] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
> >>>>>>> [mimi012:17730] [ 4] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
> >>>>>>> [mimi012:17730] [ 5] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
> >>>>>>> [mimi012:17730] [ 6] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
> >>>>>>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
> >>>>>>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
> >>>>>>> [mimi012:17730] *** End of error message ***
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> ORTE has lost communication with its daemon located on node:
> >>>>>>>
> >>>>>>> hostname:  mimi012
> >>>>>>>
> >>>>>>> This is usually due to either a failure of the TCP network
> >>>>>>> connection to the node, or possibly an internal failure of
> >>>>>>> the daemon itself. We cannot recover from this failure, and
> >>>>>>> therefore will terminate the job.
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> ================================================================================
> >>>>>>>
> >>>>>>> The error does not appear with the official MPI installed on the
> >>>>>>> platform. I asked the admins about their compilation options, but
> >>>>>>> there is nothing particular.
> >>>>>>>
> >>>>>>> Moreover, it appears only for some node lists. Still, the nodes seem
> >>>>>>> to be fine since it works with the official version of MPI on the
> >>>>>>> platform.
> >>>>>>>
> >>>>>>> To be sure it is not a network problem I tried to use "-mca btl
> >>>>>>> tcp,sm,self" or "-mca btl openib,sm,self" with no change.
> >>>>>>>
> >>>>>>> Do you have any idea where this error may come from?
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>>
> >>>>>>> Cyril Bordage.
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
