Great catch, Gilles! Not much of a surprise though. Indeed, this issue has EVERYTHING to do with how PMIx is calculating the jobid, which, in this case, results in hash collisions. ;-P
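
For anyone who wants to see how little it takes: below is a minimal sketch of the kind of collision involved. It is NOT the actual orte_plm_base_set_hnp_name() code (the real hash and mixing differ); hash_name() and job_family() are invented stand-ins, and only the host "lorien" and pid 190552 are taken from Eric's log. The point is simply that once a hostname hash and a pid are folded into a 16-bit job family, two singletons started on the same node can end up with the same jobfam:

    /* Illustrative sketch only, NOT the real orte_plm_base_set_hnp_name()
     * implementation: hash_name() and the mixing below are invented for the
     * example.  It just shows that folding a 32-bit hostname hash and a pid
     * into a 16-bit job family leaves plenty of room for collisions. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash_name(const char *s)   /* hypothetical string hash */
    {
        uint32_t h = 5381;
        while (*s) h = h * 33u + (unsigned char)*s++;
        return h;
    }

    static uint16_t job_family(const char *nodename, unsigned pid)
    {
        /* mix the hostname hash with the pid, keep only 16 bits */
        return (uint16_t)((hash_name(nodename) ^ pid) & 0xffffu);
    }

    int main(void)
    {
        /* look for another pid on the same host that maps to the same jobfam
         * as Eric's failing singleton (pid 190552 on "lorien") */
        const uint16_t target = job_family("lorien", 190552);
        for (unsigned pid = 1; pid < 200000; pid++) {
            if (pid != 190552 && job_family("lorien", pid) == target) {
                printf("pid %u collides with pid 190552 (jobfam %u)\n",
                       pid, (unsigned)target);
                break;
            }
        }
        return 0;
    }

If job families behaved like uniform random 16-bit values, a nightly batch of ~2200 singleton runs drawing from only 65536 possibilities would repeat at least one value with probability well above 99% (birthday bound), so an occasional repeat is expected rather than surprising; a repeat alone is not fatal, but it sets up exactly the kind of mix-up Eric is seeing.
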
Josh

On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> Eric,
>
> a bug has been identified, and a patch is available at
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>
> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out),
> so if applying a patch does not fit your test workflow, it might be easier
> for you to update it and run mpirun -np 1 ./a.out instead of ./a.out
>
> basically, increasing verbosity runs some extra code, which includes sprintf.
> so yes, increasing verbosity can crash an app by running into a bug that is
> hidden under normal operation.
> my intuition suggests this is quite unlikely ... if you can get a core file
> and a backtrace, we will soon find out
>
> Cheers,
>
> Gilles
>
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>
>> Ok,
>>
>> one test segfaulted, *but* I can't tell if it is the *same* bug, because
>> this time the failure was a segfault:
>>
>> stderr:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>
>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>> ...
>> ...
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>
>> stdout:
>>
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_ess_init failed
>>   --> Returned value Unable to start a daemon on the local node (-127)
>>       instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment problems.
>> This failure appears to be an internal failure; here's some additional
>> information (which may only be relevant to an Open MPI developer):
>>
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Unable to start a daemon on the local node" (-127)
>>       instead of "Success" (0)
>> --------------------------------------------------------------------------
>>
>> openmpi content of $TMP:
>>
>> /tmp/tmp.GoQXICeyJl> ls -la
>> total 1500
>> drwx------ 3 cmpbib bib 250 Sep 14 13:34 .
>> drwxrwxrwt 356 root root 61440 Sep 14 13:45 ..
>> ...
>> drwx------ 1848 cmpbib bib 45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>> srw-rw-r-- 1 cmpbib bib 0 Sep 14 12:24 pmix-190552
>>
>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>> ./53310/contact.txt
>>
>> cat 53310/contact.txt
>> 3493724160.0;usock;tcp://132.203.7.36:54605
>> 190552
>>
>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>
>> (this is the faulty test)
>>
>> full egrep:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>
>> config.log:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>
>> ompi_info:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>
>> Maybe it aborted (instead of printing the other message) while reporting
>> the error, because of export OMPI_MCA_plm_base_verbose=5 ?
>>
>> Thanks,
>>
>> Eric
>>
>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>
>>> Eric,
>>>
>>> do you mean you have a unique $TMP per a.out ?
>>> or a unique $TMP per "batch" of runs ?
>>>
>>> in the first case, my understanding is that conflicts cannot happen ...
>>>
>>> once you hit the bug, can you please post the output of the failed a.out,
>>> and run
>>>     egrep 'jobfam|stop'
>>> on all your logs, so we might spot a conflict
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, September 14, 2016, Eric Chamberland
>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>
>>> Lucky!
>>>
>>> Since each run has a specific TMP, I still have it on disk.
>>>
>>> For the faulty run, the TMP variable was:
>>>
>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>
>>> and in $TMP I have:
>>>
>>> openmpi-sessions-40031@lorien_0
>>>
>>> and in this subdirectory I have a bunch of empty dirs:
>>>
>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
>>> 1841
>>>
>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
>>> total 68
>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>> ...
>>>
>>> If I do:
>>>
>>> lsof |grep "openmpi-sessions-40031"
>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>       Output information may be incomplete.
>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>       Output information may be incomplete.
>>>
>>> nothing...
>>>
>>> What else may I check?
>>>
>>> Eric
>>>
>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>
>>> Hi, Eric
>>>
>>> I **think** this might be related to the following:
>>>
>>> https://github.com/pmix/master/pull/145
>>>
>>> I'm wondering if you can look into the /tmp directory and see if you
>>> have a bunch of stale usock files.
>>>
>>> Best,
>>>
>>> Josh
>>>
>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>> <gil...@rist.or.jp> wrote:
>>>
>>> Eric,
>>>
>>> can you please provide more information on how your tests are launched ?
>>>
>>> do you
>>>     mpirun -np 1 ./a.out
>>> or do you simply
>>>     ./a.out
>>>
>>> do you use a batch manager ? if yes, which one ?
>>> do you run one test per job ? or multiple tests per job ?
>>> how are these tests launched ?
>>>
>>> does the test that crashes use MPI_Comm_spawn ?
>>> I am surprised by the process name [[9325,5754],0], which suggests
>>> MPI_Comm_spawn was called 5753 times (!)
>>>
>>> can you also run
>>>     hostname
>>> on the 'lorien' host ?
>>>
>>> if you configured Open MPI with --enable-debug, can you
>>>     export OMPI_MCA_plm_base_verbose=5
>>> then run one test and post the logs ?
>>>
>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>> produce job family 5576 (but you get 9325)
>>> the discrepancy could be explained by the use of a batch manager
>>> and/or a full hostname I am unaware of.
>>>
>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>>> (32-bit hash of the) hostname and the mpirun (32-bit ?) pid.
>>> so strictly speaking, it is possible two jobs launched on the same
>>> node are assigned the same 16-bit job family.
>>>
>>> the easiest way to detect this could be to
>>>
>>> - edit orte/mca/plm/base/plm_base_jobid.c and replace
>>>
>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>                          (unsigned long)jobfam));
>>>
>>>   with
>>>
>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>                          (unsigned long)jobfam));
>>>
>>> - configure Open MPI with --enable-debug and rebuild
>>>
>>> - then
>>>     export OMPI_MCA_plm_base_verbose=4
>>>   and run your tests.
>>>
>>> when the problem occurs, you will be able to check which pids
>>> produced the faulty jobfam, and that could hint at a conflict.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>
>>> Hi,
>>>
>>> This is the third time this has happened in the last 10 days.
>>>
>>> While running nightly tests (~2200), we have one or two tests that
>>> fail at the very beginning with this strange error:
>>>
>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>> received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>
>>> But I can't reproduce the problem right now... i.e. if I launch this
>>> test alone "by hand", it is successful... the same test was
>>> successful yesterday...
>>>
>>> Is there some kind of "race condition" that can happen on the
>>> creation of "tmp" files if many tests run together on the same node?
>>> (we are oversubscribing even sequential runs...)
>>>
>>> Here are the build logs:
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>
>>> Thanks,
>>>
>>> Eric
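
P.S. In case it helps with combing a night's worth of logs once the verbose output Gilles describes above is in place: here is a small, hypothetical helper (call it dupjobfam.c; the name and the approach are mine, nothing shipped with Open MPI) that reads the Cerr logs on stdin and flags any 16-bit job family that was assigned more than once. A duplicate alone does not prove a clash, but it narrows down which pids to inspect.

    // Hypothetical helper (not part of Open MPI, just a sketch): feed the
    // nightly Cerr logs on stdin and it prints every "final jobfam" value
    // that shows up in more than one run, e.g.
    //     cat dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt | ./dupjobfam
    // (in practice, cat all the logs of one nightly batch together).
    // Pids sharing a jobfam are the candidates for the conflict above.
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        static unsigned count[65536];   // job families fit in 16 bits
        char line[4096];

        while (fgets(line, sizeof line, stdin)) {
            const char *tag = "plm:base:set_hnp_name: final jobfam ";
            const char *p = strstr(line, tag);
            unsigned long fam;
            if (p && sscanf(p + strlen(tag), "%lu", &fam) == 1 && fam < 65536) {
                if (++count[fam] == 2)  // report each duplicated value once
                    printf("jobfam %lu was assigned more than once\n", fam);
            }
        }
        return 0;
    }
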
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel