Cyril,

Are you running your jobs via a batch manager? If yes, was support for it
correctly built?

If you were able to get a core dump, can you post the gdb stack trace?

I guess your nodes have several IP interfaces; you might want to try

    mpirun --mca oob_tcp_if_include eth0 ...

(replace eth0 with something appropriate if needed).
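For instance, something along these lines (hypothetical paths: adjust the
install prefix and core-file name for your system; Slurm is assumed as the
batch manager):

    # check that batch-manager support was built into this install
    ompi_info | grep -i slurm

    # get a backtrace from the core file left by the crashing orted daemon
    gdb /home/bordage/modules/openmpi/openmpi-debug/bin/orted /path/to/core
    (gdb) bt

The frames printed by bt should show where inside
opal_libevent2022_event_priority_set the daemon is dereferencing the bad
address.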
Cheers,

Gilles

Cyril Bordage <cyril.bord...@inria.fr> wrote:
> Unfortunately this does not complete this thread. The problem is not
> solved! It is not an installation problem: I have no previous
> installation, since I use separate directories. I have nothing specific
> to an MPI path in my env; I just use the complete path to mpicc and
> mpirun.
>
> The error depends on which nodes I run on. For example, I can run on
> node1 and node2, or node1 and node3, or node2 and node3, but not on
> node1, node2, and node3. With the official version installed on the
> platform (1.8.1) it works like a charm.
>
> George, maybe you could see it for yourself by connecting to our
> platform (plafrim), since you have an account. It should be easier to
> understand and see our problem.
>
>
> Cyril.
>
> On 10/02/2017 18:15, George Bosilca wrote:
>> To complete this thread, the problem is now solved. Some .so files were
>> lingering around from a previous installation, causing startup problems.
>>
>> George.
>>
>>
>>> On Feb 10, 2017, at 05:38, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>>
>>> Thank you for your answer.
>>> I am running the git master version (last tested was cad4c03).
>>>
>>> FYI, Clément Foyer is talking with George Bosilca about this problem.
>>>
>>>
>>> Cyril.
>>>
>>> On 08/02/2017 16:46, Jeff Squyres (jsquyres) wrote:
>>>> What version of Open MPI are you running?
>>>>
>>>> The error indicates that Open MPI is trying to start a user-level
>>>> helper daemon on the remote node, and the daemon is segfaulting
>>>> (which is unusual).
>>>>
>>>> One thing to be aware of:
>>>>
>>>> https://www.open-mpi.org/faq/?category=building#install-overwrite
>>>>
>>>>
>>>>
>>>>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I cannot run a program with MPI when I compile it myself.
>>>>> On some nodes I have the following error:
>>>>> ================================================================================
>>>>> [mimi012:17730] *** Process received signal ***
>>>>> [mimi012:17730] Signal: Segmentation fault (11)
>>>>> [mimi012:17730] Signal code: Address not mapped (1)
>>>>> [mimi012:17730] Failing at address: 0xf8
>>>>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
>>>>> [mimi012:17730] [ 1] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
>>>>> [mimi012:17730] [ 2] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
>>>>> [mimi012:17730] [ 3] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
>>>>> [mimi012:17730] [ 4] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
>>>>> [mimi012:17730] [ 5] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
>>>>> [mimi012:17730] [ 6] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
>>>>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
>>>>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
>>>>> [mimi012:17730] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> ORTE has lost communication with its daemon located on node:
>>>>>
>>>>>   hostname: mimi012
>>>>>
>>>>> This is usually due to either a failure of the TCP network
>>>>> connection to the node, or possibly an internal failure of
>>>>> the daemon itself. We cannot recover from this failure, and
>>>>> therefore will terminate the job.
>>>>> --------------------------------------------------------------------------
>>>>> ================================================================================
>>>>>
>>>>> The error does not appear with the official MPI installed on the
>>>>> platform. I asked the admins about their compilation options, but
>>>>> there is nothing particular.
>>>>>
>>>>> Moreover, it appears only for some node lists. Still, the nodes seem
>>>>> to be fine, since it works with the official version of MPI on the
>>>>> platform.
>>>>>
>>>>> To be sure it is not a network problem, I tried to use
>>>>> "-mca btl tcp,sm,self" or "-mca btl openib,sm,self", with no change.
>>>>>
>>>>> Do you have any idea where this error may come from?
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> Cyril Bordage.

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel