I have no MPI installation in my environment. Even if there were one, would I get an error, given that I use the complete path to mpirun?
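A minimal sketch of how to double-check what a non-interactive remote login actually sees, assuming passwordless ssh to the compute nodes (the node name and install prefix are simply the ones that appear later in this thread, and may differ):

  # Environment picked up by a non-interactive remote shell (placeholder node name).
  ssh mimi012 'echo $PATH; echo $LD_LIBRARY_PATH; which -a mpirun orted'

  # Libraries the remote orted daemon would actually load.
  ssh mimi012 'ldd /home/bordage/modules/openmpi/openmpi-debug/bin/orted'

If either command turns up a different Open MPI prefix than the one used to build, versions are being mixed as described below.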
I finally managed to get a backtrace:

#0  0x00007ffff7533f18 in _exit () from /lib64/libc.so.6
#1  0x00007ffff5169d68 in rte_abort (status=-51, report=true) at ../../../../../src/orte/mca/ess/pmi/ess_pmi_module.c:494
#2  0x00007ffff7b4fb9d in ompi_rte_abort (error_code=-51, fmt=0x0) at ../../../../../src/ompi/mca/rte/orte/rte_orte_module.c:85
#3  0x00007ffff7a927a3 in ompi_mpi_abort (comm=0x601280 <ompi_mpi_comm_world>, errcode=-51) at ../../src/ompi/runtime/ompi_mpi_abort.c:206
#4  0x00007ffff7a77c6b in ompi_errhandler_callback (status=-51, source=0x7fffe8003494, info=0x7fffe8003570, results=0x7fffe80034c8, cbfunc=0x7ffff4058ee8 <return_local_event_hdlr>, cbdata=0x7fffe80033d0) at ../../src/ompi/errhandler/errhandler.c:250
#5  0x00007ffff40594f7 in _event_hdlr (sd=-1, args=4, cbdata=0x7fffe80033d0) at ../../../../../src/opal/mca/pmix/pmix2x/pmix2x.c:216
#6  0x00007ffff6ed2bdc in event_process_active_single_queue (activeq=0x667cb0, base=0x668410) at ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1370
#7  event_process_active (base=<optimized out>) at ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1440
#8  opal_libevent2022_event_base_loop (base=0x668410, flags=1) at ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1644
#9  0x00007ffff6e78263 in progress_engine (obj=0x667c68) at ../../src/opal/runtime/opal_progress_threads.c:105
#10 0x00007ffff7821851 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff756f94d in clone () from /lib64/libc.so.6

Cyril.

On 14/02/2017 at 13:25, Jeff Squyres (jsquyres) wrote:
> You should also check your paths for non-interactive remote logins and ensure
> that you are not accidentally mixing versions of Open MPI (e.g., the new
> version on your local machine and some other version on the remote
> machines).
>
> Sent from my phone. No type good.
>
>> On Feb 13, 2017, at 8:14 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>> Cyril,
>>
>> Are you running your jobs via a batch manager?
>> If yes, was support for it correctly built?
>>
>> If you were able to get a core dump, can you post the gdb stacktrace?
>>
>> I guess your nodes have several IP interfaces; you might want to try
>> mpirun --mca oob_tcp_if_include eth0 ...
>> (replace eth0 with something appropriate if needed)
>>
>> Cheers,
>>
>> Gilles
>>
>> Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>> Unfortunately this does not complete this thread. The problem is not
>>> solved! It is not an installation problem. I have no previous
>>> installation, since I use separate directories.
>>> I have nothing MPI-specific in my environment paths; I just use the
>>> complete path to mpicc and mpirun.
>>>
>>> The error depends on which nodes I run on. For example, I can run on node1
>>> and node2, or node1 and node3, or node2 and node3, but not on node1,
>>> node2 and node3. With the platform's official version (1.8.1) it
>>> works like a charm.
>>>
>>> George, maybe you could see it for yourself by connecting to our
>>> platform (plafrim), since you have an account. It should be easier to
>>> understand and see our problem.
>>>
>>>
>>> Cyril.
>>>
>>>> On 10/02/2017 at 18:15, George Bosilca wrote:
>>>> To complete this thread, the problem is now solved. Some .so files were
>>>> lingering around from a previous installation, causing startup problems.
>>>>
>>>> George.
>>>>
>>>>
>>>>> On Feb 10, 2017, at 05:38, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>>>>
>>>>> Thank you for your answer.
>>>>> I am running the git master version (last tested: cad4c03).
>>>>>
>>>>> FYI, Clément Foyer is talking with George Bosilca about this problem.
>>>>>
>>>>>
>>>>> Cyril.
>>>>>
>>>>>> On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
>>>>>> What version of Open MPI are you running?
>>>>>>
>>>>>> The error indicates that Open MPI is trying to start a user-level
>>>>>> helper daemon on the remote node, and the daemon is segfaulting (which
>>>>>> is unusual).
>>>>>>
>>>>>> One thing to be aware of:
>>>>>>
>>>>>> https://www.open-mpi.org/faq/?category=building#install-overwrite
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I cannot run a program with MPI when I compile it myself.
>>>>>>> On some nodes I get the following error:
>>>>>>> ================================================================================
>>>>>>> [mimi012:17730] *** Process received signal ***
>>>>>>> [mimi012:17730] Signal: Segmentation fault (11)
>>>>>>> [mimi012:17730] Signal code: Address not mapped (1)
>>>>>>> [mimi012:17730] Failing at address: 0xf8
>>>>>>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
>>>>>>> [mimi012:17730] [ 1] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
>>>>>>> [mimi012:17730] [ 2] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
>>>>>>> [mimi012:17730] [ 3] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
>>>>>>> [mimi012:17730] [ 4] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
>>>>>>> [mimi012:17730] [ 5] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
>>>>>>> [mimi012:17730] [ 6] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
>>>>>>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
>>>>>>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
>>>>>>> [mimi012:17730] *** End of error message ***
>>>>>>> --------------------------------------------------------------------------
>>>>>>> ORTE has lost communication with its daemon located on node:
>>>>>>>
>>>>>>> hostname: mimi012
>>>>>>>
>>>>>>> This is usually due to either a failure of the TCP network
>>>>>>> connection to the node, or possibly an internal failure of
>>>>>>> the daemon itself. We cannot recover from this failure, and
>>>>>>> therefore will terminate the job.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> ================================================================================
>>>>>>>
>>>>>>> The error does not appear with the official MPI installed on the
>>>>>>> platform. I asked the admins about their compilation options, but there
>>>>>>> is nothing unusual.
>>>>>>>
>>>>>>> Moreover, it appears only for some node lists. Still, the nodes seem to
>>>>>>> be fine, since everything works with the platform's official MPI version.
>>>>>>>
>>>>>>> To rule out a network problem, I tried "-mca btl tcp,sm,self" and
>>>>>>> "-mca btl openib,sm,self", with no change.
>>>>>>>
>>>>>>> Do you have any idea where this error may come from?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>>
>>>>>>> Cyril Bordage.
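For anyone hitting the same startup crash, a rough sketch of the checks suggested in this thread. The install prefix is the one from the messages above; node1/node2/node3, eth0 and ./a.out are placeholders, and the name and location of the core file depend on the system configuration:

  # 1. Rule out leftovers from a previous build in the same prefix
  #    (see the install-overwrite FAQ link quoted above).
  ls -lt /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/ | head

  # 2. Allow core dumps so a crashing orted daemon can be inspected.
  ulimit -c unlimited

  # 3. Reproduce on a failing node combination, restricting the OOB
  #    interface and the BTLs as suggested above.
  /home/bordage/modules/openmpi/openmpi-debug/bin/mpirun \
      --host node1,node2,node3 \
      --mca oob_tcp_if_include eth0 \
      --mca btl tcp,sm,self \
      ./a.out

  # 4. Get a backtrace from the daemon's core file.
  gdb -batch -ex bt \
      /home/bordage/modules/openmpi/openmpi-debug/bin/orted core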