Nah, something isn't right here. The singleton doesn't go through that code path, or at least it isn't supposed to. I think the problem lies in the way the singleton in 2.x is starting up. Let me take a look at how singletons work over there.
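For readers following along, here is a self-contained sketch of the branch in question (it is quoted verbatim in Gilles' message below). This is an editor's model, not the actual Open MPI sources: demo_hash_str() is a stand-in for OPAL_HASH_STR, DEMO_MCA_PREFIX stands in for OPAL_MCA_PREFIX, the nspace string is illustrative, and Gilles' proposed putenv() is shown at the call site only to illustrate its effect; the actual proposal is to set it in ess/singleton before PMIx is initialized.

    /* Editor's sketch only, not Open MPI source: a self-contained model of the
     * jobid-selection branch in pmix1_client_init().  demo_hash_str() is a
     * stand-in for OPAL_HASH_STR, DEMO_MCA_PREFIX stands in for
     * OPAL_MCA_PREFIX, and the nspace string below is illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define DEMO_MCA_PREFIX "OMPI_MCA_"

    /* stand-in 32-bit string hash (djb2-style); the real macro may differ */
    static uint32_t demo_hash_str(const char *s)
    {
        uint32_t h = 5381;
        while (*s) {
            h = (h << 5) + h + (unsigned char)*s++;
        }
        return h;
    }

    static uint32_t demo_pick_jobid(const char *nspace)
    {
        uint32_t jobid;

        if (NULL != getenv(DEMO_MCA_PREFIX "orte_launch")) {
            /* launched by the OMPI RTE: the nspace carries the jobid in
             * string form, so convert it losslessly
             * (opal_convert_string_to_jobid() in the real code) */
            jobid = (uint32_t)strtoul(nspace, NULL, 10);
        } else {
            /* launched by someone else: fall back to hashing the nspace,
             * which is where two unrelated jobs can end up with the same id */
            jobid = demo_hash_str(nspace);
            jobid &= ~(0x8000);
        }
        return jobid;
    }

    int main(void)
    {
        /* Gilles' proposed fix, modelled here at the call site: set the flag
         * before PMIx init so the convert path is taken instead of the hash
         * path (the actual proposal puts this in ess/singleton) */
        putenv(DEMO_MCA_PREFIX "orte_launch=1");

        printf("jobid = %u\n", (unsigned)demo_pick_jobid("611123201"));
        return 0;
    }

With the putenv() in place the getenv() branch is taken and the jobid is recovered exactly from the nspace; without it, two different nspaces can hash to the same jobid, which matches the duplicate-identifier error reported below.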
> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
> i think i just found the root cause :-)
>
> from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c
>
>     /* store our jobid and rank */
>     if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
>         /* if we were launched by the OMPI RTE, then
>          * the jobid is in a special format - so get it */
>         mca_pmix_pmix112_component.native_launch = true;
>         opal_convert_string_to_jobid(&pname.jobid, my_proc.nspace);
>     } else {
>         /* we were launched by someone else, so make the
>          * jobid just be the hash of the nspace */
>         OPAL_HASH_STR(my_proc.nspace, pname.jobid);
>         /* keep it from being negative */
>         pname.jobid &= ~(0x8000);
>     }
>
> in this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment,
> so we end up using a jobid that is a hash of the namespace
>
> the initial error message was
>
>     [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
>     unexpected process identifier [[9325,0],0] from [[5590,0],0]
>
> and with little surprise, ((9325 << 16) + 1) and ((5590 << 16) + 1) have the
> same hash as calculated by OPAL_HASH_STR
>
> i am now thinking the right fix is to simply
>
>     putenv(OPAL_MCA_PREFIX"orte_launch=1");
>
> in ess/singleton
>
> makes sense ?
>
> Cheers,
>
> Gilles
>
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>> Ok,
>>
>> one test segfaulted *but* I can't tell if it is the *same* bug, because
>> there has been a segfault:
>>
>> stderr:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>
>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>> ...
>> ...
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>
>> stdout:
>>
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.
>> This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_ess_init failed
>>   --> Returned value Unable to start a daemon on the local node (-127)
>>       instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Unable to start a daemon on the local node" (-127)
>>       instead of "Success" (0)
>> --------------------------------------------------------------------------
>>
>> openmpi content of $TMP:
>>
>> /tmp/tmp.GoQXICeyJl> ls -la
>> total 1500
>> drwx------    3 cmpbib bib     250 Sep 14 13:34 .
>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>> ...
>> drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>> srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552
>>
>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>> ./53310/contact.txt
>>
>> cat 53310/contact.txt
>> 3493724160.0;usock;tcp://132.203.7.36:54605
>> 190552
>>
>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>
>> (this is the faulty test)
>>
>> full egrep:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>
>> config.log:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>
>> ompi_info:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>
>> Maybe it aborted (instead of giving the other message) while reporting the
>> error because of export OMPI_MCA_plm_base_verbose=5 ?
>>
>> Thanks,
>>
>> Eric
>>
>>
>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>> Eric,
>>>
>>> do you mean you have a unique $TMP per a.out ?
>>> or a unique $TMP per "batch" of runs ?
>>>
>>> in the first case, my understanding is that conflicts cannot happen ...
>>>
>>> once you hit the bug, can you please post the output of the failed a.out,
>>> and run
>>>
>>>     egrep 'jobfam|stop'
>>>
>>> on all your logs, so we might spot a conflict
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, September 14, 2016, Eric Chamberland
>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>
>>>     Lucky!
>>>
>>>     Since each run has a specific TMP, I still have it on disc.
>>>
>>>     For the faulty run, the TMP variable was:
>>>
>>>     TMP=/tmp/tmp.wOv5dkNaSI
>>>
>>>     and into $TMP I have:
>>>
>>>     openmpi-sessions-40031@lorien_0
>>>
>>>     and into this subdirectory I have a bunch of empty dirs:
>>>
>>>     cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>     1841
>>>
>>>     cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>     total 68
>>>     drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>     drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>     drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>     drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>     drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>     drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>     ...
>>>
>>>     If I do:
>>>
>>>     lsof | grep "openmpi-sessions-40031"
>>>     lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>           Output information may be incomplete.
>>>     lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>           Output information may be incomplete.
>>>
>>>     nothing...
>>>
>>>     What else may I check?
>>>
>>>     Eric
>>>
>>>
>>>     On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>
>>>         Hi, Eric
>>>
>>>         I **think** this might be related to the following:
>>>
>>>         https://github.com/pmix/master/pull/145
>>>
>>>         I'm wondering if you can look into the /tmp directory and see if
>>>         you have a bunch of stale usock files.
>>>
>>>         Best,
>>>
>>>         Josh
>>>
>>>
>>>         On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>>         <gil...@rist.or.jp> wrote:
>>>
>>>             Eric,
>>>
>>>             can you please provide more information on how your tests are launched ?
>>>
>>>             do you
>>>
>>>                 mpirun -np 1 ./a.out
>>>
>>>             or do you simply
>>>
>>>                 ./a.out
>>>
>>>             do you use a batch manager ? if yes, which one ?
>>>
>>>             do you run one test per job ? or multiple tests per job ?
>>>
>>>             how are these tests launched ?
>>>
>>>             does the test that crashes use MPI_Comm_spawn ?
>>>
>>>             i am surprised by the process name [[9325,5754],0], which suggests
>>>             MPI_Comm_spawn was called 5753 times (!)
>>>
>>>             can you also run
>>>
>>>                 hostname
>>>
>>>             on the 'lorien' host ?
>>>
>>>             if you configure'd Open MPI with --enable-debug, can you
>>>
>>>                 export OMPI_MCA_plm_base_verbose=5
>>>
>>>             then run one test and post the logs ?
>>>
>>>             from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>>             produce job family 5576 (but you get 9325)
>>>
>>>             the discrepancy could be explained by the use of a batch manager
>>>             and/or a full hostname i am unaware of.
>>>
>>>             orte_plm_base_set_hnp_name() generates a 16 bit job family from the
>>>             (32 bit hash of the) hostname and the mpirun (32 bit ?) pid.
>>>
>>>             so strictly speaking, it is possible that two jobs launched on the
>>>             same node are assigned the same 16 bit job family.
>>>
>>>             the easiest way to detect this could be to
>>>
>>>             - edit orte/mca/plm/base/plm_base_jobid.c and replace
>>>
>>>                   OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>                                        "plm:base:set_hnp_name: final jobfam %lu",
>>>                                        (unsigned long)jobfam));
>>>
>>>               with
>>>
>>>                   OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>                                        "plm:base:set_hnp_name: final jobfam %lu",
>>>                                        (unsigned long)jobfam));
>>>
>>>             - configure Open MPI with --enable-debug and rebuild
>>>
>>>             - then
>>>
>>>                   export OMPI_MCA_plm_base_verbose=4
>>>
>>>               and run your tests.
>>>
>>>             when the problem occurs, you will be able to check which pids
>>>             produced the faulty jobfam, and that could hint at a conflict.
>>>
>>>             Cheers,
>>>
>>>             Gilles
>>>
>>>
>>>             On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>
>>>                 Hi,
>>>
>>>                 It is the third time this has happened in the last 10 days.
>>>
>>>                 While running nightly tests (~2200), we have one or two tests
>>>                 that fail at the very beginning with this strange error:
>>>
>>>                 [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>                 received unexpected process identifier [[9325,0],0] from
>>>                 [[5590,0],0]
>>>
>>>                 But I can't reproduce the problem right now... i.e. if I launch
>>>                 this test alone "by hand", it is successful... the same test
>>>                 was successful yesterday...
>>>
>>>                 Is there some kind of "race condition" that can happen on the
>>>                 creation of "tmp" files if many tests run together on the same
>>>                 node? (we are oversubscribing even sequential runs...)
>>>
>>>                 Here are the build logs:
>>>
>>>                 http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>
>>>                 http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>
>>>                 Thanks,
>>>
>>>                 Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
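A closing note on the jobfam side of the discussion above: the sketch below is only an illustrative model of the collision Gilles describes (the hash and the fold are stand-ins, not the real orte_plm_base_set_hnp_name() arithmetic), but it shows why truncating a hostname-hash/pid combination to 16 bits leaves room for two jobs on the same node to draw the same job family.

    /* Editor's sketch only: NOT the exact orte_plm_base_set_hnp_name()
     * arithmetic, just a model of the failure mode discussed above.  A 16 bit
     * job family derived from a hostname hash and a pid has only 65536
     * possible values, so concurrent jobs on one node can collide. */
    #include <stdio.h>
    #include <stdint.h>

    /* stand-in 32-bit string hash (djb2-style) */
    static uint32_t demo_hash_str(const char *s)
    {
        uint32_t h = 5381;
        while (*s) {
            h = (h << 5) + h + (unsigned char)*s++;
        }
        return h;
    }

    /* model: fold the hostname hash and the pid together, keep the low 16 bits */
    static uint16_t demo_jobfam(const char *hostname, uint32_t pid)
    {
        return (uint16_t)((demo_hash_str(hostname) ^ pid) & 0xffff);
    }

    int main(void)
    {
        const char *host = "lorien";
        uint32_t pid_a = 142766;           /* pid seen in the logs above */
        uint32_t pid_b = pid_a + 0x10000;  /* differs only above bit 15 */

        /* both calls print the same job family in this model */
        printf("jobfam(%s, %u) = %u\n", host, (unsigned)pid_a,
               (unsigned)demo_jobfam(host, pid_a));
        printf("jobfam(%s, %u) = %u\n", host, (unsigned)pid_b,
               (unsigned)demo_jobfam(host, pid_b));
        return 0;
    }

This is also why the suggested OMPI_MCA_plm_base_verbose=4 run is useful: logging every final jobfam together with the pid that produced it makes it straightforward to spot two processes that drew the same 16-bit value.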