I don’t understand this fascination with PMIx. PMIx didn’t calculate this jobid - OMPI did. Yes, it is in the opal/pmix layer, but it had -nothing- to do with PMIx.
So why do you want to continue to blame PMIx for this problem??

> On Sep 15, 2016, at 4:29 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
> Great catch, Gilles! Not much of a surprise, though.
>
> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the
> jobid, which, in this case, results in hash collisions. ;-P
>
> Josh

> On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Eric,
>
> a bug has been identified, and a patch is available at
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>
> The bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out),
> so if applying the patch does not fit your test workflow, it might be easier
> to update the workflow and run mpirun -np 1 ./a.out instead of ./a.out.
>
> Basically, increasing verbosity runs some extra code, which includes sprintf.
> So yes, it is possible to crash an app by increasing verbosity and thereby
> running into a bug that stays hidden under normal operation. My intuition
> suggests this is quite unlikely ... if you can get a core file and a
> backtrace, we will soon find out.
>
> Cheers,
>
> Gilles
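To make the point about verbosity-gated code concrete, here is a minimal, purely hypothetical C sketch (none of these names or buffer sizes come from the OMPI sources) of how a sprintf that only runs at high verbosity can surface a latent heap overflow as exactly the kind of realloc() abort seen in Eric's orted log below:

    /* Hypothetical sketch (not the actual OMPI code) of how a verbosity-gated
     * sprintf can expose a latent heap overflow: the undersized buffer is a
     * bug on *both* paths, but only the verbose path writes enough bytes to
     * clobber the allocator's bookkeeping, so a later realloc() aborts with
     * "realloc(): invalid next size". */
    #include <stdio.h>
    #include <stdlib.h>

    #define VERBOSE_THRESHOLD 5

    static int verbosity = 0;   /* e.g. from OMPI_MCA_plm_base_verbose */

    static void log_peer(char *buf, const char *host, int pid)
    {
        if (verbosity >= VERBOSE_THRESHOLD) {
            /* Verbose-only path: the formatted string easily exceeds the
             * caller's 16-byte buffer and silently corrupts the heap. */
            sprintf(buf, "plm:base: contact %s:%d established", host, pid);
        } else {
            /* Normal path: short string, the bug never fires. */
            snprintf(buf, 16, "%d", pid);
        }
    }

    int main(int argc, char **argv)
    {
        verbosity = (argc > 1) ? atoi(argv[1]) : 0;

        char *buf = malloc(16);            /* undersized: the latent bug */
        log_peer(buf, "lorien", 190552);

        /* With verbosity >= 5 the chunk header after buf is trashed and
         * glibc aborts here, much like the orted crash in the logs. */
        buf = realloc(buf, 64);
        printf("%s\n", buf);
        free(buf);
        return 0;
    }

Run it with argument 5 (analogous to exporting OMPI_MCA_plm_base_verbose=5) and glibc aborts inside realloc(); run it with no argument and it prints the pid without incident.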
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>
> Ok,
>
> one test segfaulted *but* I can't tell if it is the *same* bug, because there
> was a segfault:
>
> stderr:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>
> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:190552] [[53310,0],0] plm:base:receive start comm
> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
> ...
> ...
> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon
> on the local node in file ess_singleton_module.c at line 573
> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon
> on the local node in file ess_singleton_module.c at line 163
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [lorien:190306] Local abort before MPI_INIT completed completed successfully,
> but am not able to aggregate error messages, and not able to guarantee that
> all other processes were killed!
>
> stdout:
>
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_init failed
>   --> Returned value Unable to start a daemon on the local node (-127)
>       instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Unable to start a daemon on the local node" (-127) instead of
>       "Success" (0)
> --------------------------------------------------------------------------
>
> openmpi content of $TMP:
>
> /tmp/tmp.GoQXICeyJl> ls -la
> total 1500
> drwx------    3 cmpbib bib     250 Sep 14 13:34 .
> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
> ...
> drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
> srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552
>
> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
> ./53310/contact.txt
>
> cat 53310/contact.txt
> 3493724160.0;usock;tcp://132.203.7.36:54605
> 190552
>
> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>
> (this is the faulty test)
>
> full egrep:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>
> config.log:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>
> ompi_info:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>
> Maybe it aborted (instead of giving the other message) while reporting the
> error, because of export OMPI_MCA_plm_base_verbose=5?
>
> Thanks,
>
> Eric

> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>
> Eric,
>
> do you mean you have a unique $TMP per a.out?
> or a unique $TMP per "batch" of runs?
>
> In the first case, my understanding is that conflicts cannot happen ...
>
> Once you hit the bug, can you please post the output of the failed a.out,
> and run
>
>     egrep 'jobfam|stop'
>
> on all your logs, so we might spot a conflict.
>
> Cheers,
>
> Gilles
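As an aside on reading the values being grepped here: the first field of Eric's contact.txt (3493724160) and the logged jobfam (53310) are consistent, assuming the ORTE convention that a 32-bit jobid carries the 16-bit job family in its upper half. A small sketch, with helper names of my own invention:

    /* Sketch: decode an ORTE-style 32-bit jobid into (job family, local jobid),
     * assuming the job family occupies the upper 16 bits. The helper names
     * are illustrative, not taken from the OMPI sources. */
    #include <stdint.h>
    #include <stdio.h>

    static unsigned job_family(uint32_t jobid) { return (unsigned)(jobid >> 16); }
    static unsigned local_job(uint32_t jobid)  { return (unsigned)(jobid & 0xffff); }

    int main(void)
    {
        uint32_t jobid = 3493724160u;   /* first field of 53310/contact.txt */

        /* Prints "jobid 3493724160 = [[53310,0]]", matching the logged
         * "plm:base:set_hnp_name: final jobfam 53310". */
        printf("jobid %u = [[%u,%u]]\n", jobid, job_family(jobid), local_job(jobid));
        return 0;
    }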
> On Wednesday, September 14, 2016, Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
>
> Lucky!
>
> Since each run has a specific TMP, I still have it on disk.
>
> For the faulty run, the TMP variable was:
>
> TMP=/tmp/tmp.wOv5dkNaSI
>
> and in $TMP I have:
>
> openmpi-sessions-40031@lorien_0
>
> and in this subdirectory I have a bunch of empty dirs:
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
> 1841
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
> total 68
> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
> ...
>
> If I do:
>
> lsof | grep "openmpi-sessions-40031"
> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>       Output information may be incomplete.
> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>       Output information may be incomplete.
>
> nothing...
>
> What else may I check?
>
> Eric

> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>
> Hi, Eric
>
> I **think** this might be related to the following:
>
> https://github.com/pmix/master/pull/145
>
> I'm wondering if you can look into the /tmp directory and see if you
> have a bunch of stale usock files.
>
> Best,
>
> Josh

> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Eric,
>
> can you please provide more information on how your tests are launched?
>
> Do you
>
>     mpirun -np 1 ./a.out
>
> or do you simply
>
>     ./a.out
>
> Do you use a batch manager? If yes, which one?
> Do you run one test per job, or multiple tests per job?
> How are these tests launched?
>
> Does the test that crashes use MPI_Comm_spawn?
> I am surprised by the process name [[9325,5754],0], which suggests
> MPI_Comm_spawn was called 5753 times (!)
>
> Can you also run
>
>     hostname
>
> on the 'lorien' host?
>
> If you configured Open MPI with --enable-debug, can you
>
>     export OMPI_MCA_plm_base_verbose=5
>
> then run one test and post the logs?
>
> From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
> produce job family 5576 (but you get 9325).
> The discrepancy could be explained by the use of a batch manager
> and/or a full hostname I am unaware of.
>
> orte_plm_base_set_hnp_name() generates a 16-bit job family from the
> (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
> So, strictly speaking, it is possible for two jobs launched on the same
> node to be assigned the same 16-bit job family.
>
> The easiest way to detect this could be to edit
> orte/mca/plm/base/plm_base_jobid.c and replace
>
>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>                          "plm:base:set_hnp_name: final jobfam %lu",
>                          (unsigned long)jobfam));
>
> with
>
>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>                          "plm:base:set_hnp_name: final jobfam %lu",
>                          (unsigned long)jobfam));
>
> then configure Open MPI with --enable-debug, rebuild, and
>
>     export OMPI_MCA_plm_base_verbose=4
>
> and run your tests.
>
> When the problem occurs, you will be able to check which pids
> produced the faulty jobfam, and that could hint at a conflict.
>
> Cheers,
>
> Gilles
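To quantify how plausible such a conflict is: whatever the exact mixing, a 16-bit job family means only 65536 possible values per node, so a nightly batch of roughly 2200 launches should be expected to repeat some of them. A hedged sketch (the hash below is a toy, not the real orte_plm_base_set_hnp_name() algorithm):

    /* Sketch: why a 16-bit job family derived from hash(hostname) and pid
     * collides. The mixing below is illustrative only -- it is NOT the real
     * orte_plm_base_set_hnp_name() algorithm -- but any function that folds
     * its inputs into 16 bits shows the same birthday-paradox behavior. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash32(const char *s)      /* toy FNV-1a style hash */
    {
        uint32_t h = 2166136261u;
        for (; *s; s++) { h ^= (uint8_t)*s; h *= 16777619u; }
        return h;
    }

    static unsigned jobfam(const char *host, uint32_t pid)
    {
        return (unsigned)((hash32(host) + pid) & 0xffff);  /* 65536 buckets */
    }

    int main(void)
    {
        /* Two different pids on the same host, 65536 apart, collide trivially. */
        printf("%u %u\n", jobfam("lorien", 190552), jobfam("lorien", 190552 + 65536));

        /* Expected number of colliding pairs among n launches (birthday bound):
         * n(n-1)/2 / 65536. For n = 2200 nightly tests that is ~37 pairs. */
        double n = 2200.0;
        printf("expected colliding pairs: %.1f\n", n * (n - 1.0) / 2.0 / 65536.0);
        return 0;
    }

By this bound, dozens of job-family repeats per night are expected; a visible failure would additionally require the two colliding jobs to overlap in time and session directory, which fits Eric seeing only one or two bad tests per run.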
> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>
> Hi,
>
> It is the third time this has happened in the last 10 days.
>
> While running nightly tests (~2200), we have one or two tests
> that fail at the very beginning with this strange error:
>
> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
> received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>
> But I can't reproduce the problem right now ... i.e., if I launch
> this test alone "by hand", it is successful ... and the same test was
> successful yesterday ...
>
> Is there some kind of "race condition" that can happen on the
> creation of "tmp" files if many tests run together on the same
> node? (We are oversubscribing even sequential runs ...)
>
> Here are the build logs:
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>
> Thanks,
>
> Eric
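For readers decoding the error above: process names have the form [[jobfam,jobid],rank], and the mismatch suggests the connecting process reached a rendezvous socket left behind by a run from a different job family. A speculative sketch of the shape of such a connect-ack check (the struct layout and message text are illustrative, not the actual OOB/usock code):

    /* Sketch: the general shape of a connect-ack sanity check like the one
     * that produced Eric's error. A process compares the identifier in the
     * handshake against the one it expected; when a stale usock rendezvous
     * file from an earlier run (with a colliding job family) is picked up,
     * the check fails. All names here are illustrative, not OMPI's. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    typedef struct { uint32_t jobfam; uint32_t jobid; uint32_t vpid; } proc_name_t;

    static bool recv_connect_ack(const proc_name_t *expected,
                                 const proc_name_t *received)
    {
        if (expected->jobfam != received->jobfam ||
            expected->jobid  != received->jobid  ||
            expected->vpid   != received->vpid) {
            fprintf(stderr,
                    "connect_ack: received unexpected process identifier "
                    "[[%u,%u],%u] (expected [[%u,%u],%u])\n",
                    received->jobfam, received->jobid, received->vpid,
                    expected->jobfam, expected->jobid, expected->vpid);
            return false;
        }
        return true;
    }

    int main(void)
    {
        proc_name_t expected = { 9325, 0, 0 };  /* my own HNP, jobfam 9325 */
        proc_name_t stale    = { 5590, 0, 0 };  /* daemon left over from an old run */
        recv_connect_ack(&expected, &stale);    /* mismatch -> connection refused */
        return 0;
    }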
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel