Ralph, is there any reason to use a session directory based on the jobid (or job family)? I mean, could we use mkdtemp() to generate a unique directory, and then propagate the path via orted comm or the environment?
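For instance, something along these lines (just a sketch of the idea; ORTE_SESSION_DIR below is only a placeholder name, not an existing environment variable):

    #include <stdio.h>
    #include <stdlib.h>

    /* sketch only: create a unique session directory with mkdtemp()
     * and hand its path to the local procs via the environment */
    int create_and_export_session_dir(void)
    {
        char template[] = "/tmp/ompi-session-XXXXXX";
        char *dir = mkdtemp(template);   /* unique, mode 0700, no jobid needed */
        if (NULL == dir) {
            perror("mkdtemp");
            return -1;
        }
        /* ORTE_SESSION_DIR is a placeholder name for this illustration */
        setenv("ORTE_SESSION_DIR", dir, 1);
        return 0;
    }

the orted could just as well forward the resulting path over its own comm instead of (or in addition to) the environment.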
Cheers,

Gilles

On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:

> This has nothing to do with PMIx, Josh - the error is coming out of the
> usock OOB component.
>
> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
> Eric,
>
> We are looking into the PMIx code path that sets up the jobid. The session
> directories are created based on the jobid. It might be the case that the
> jobids (generated with rand) happen to be the same for different jobs,
> resulting in multiple jobs sharing the same session directory, but we need
> to check. We will update.
>
> Josh
>
> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland
> <Eric.Chamberland@giref.ulaval.ca> wrote:
>
>> Lucky!
>>
>> Since each run has a specific TMP, I still have it on disk.
>>
>> For the faulty run, the TMP variable was:
>>
>> TMP=/tmp/tmp.wOv5dkNaSI
>>
>> and in $TMP I have:
>>
>> openmpi-sessions-40031@lorien_0
>>
>> and in this subdirectory I have a bunch of empty dirs:
>>
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>> 1841
>>
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>> total 68
>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>> ...
>>
>> If I do:
>>
>> lsof | grep "openmpi-sessions-40031"
>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>       Output information may be incomplete.
>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>       Output information may be incomplete.
>>
>> nothing...
>>
>> What else may I check?
>>
>> Eric
>>
>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>
>>> Hi, Eric
>>>
>>> I **think** this might be related to the following:
>>>
>>> https://github.com/pmix/master/pull/145
>>>
>>> I'm wondering if you can look into the /tmp directory and see if you
>>> have a bunch of stale usock files.
>>>
>>> Best,
>>>
>>> Josh
>>>
>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Eric,
>>>
>>> can you please provide more information on how your tests are launched?
>>>
>>> do you
>>>
>>>     mpirun -np 1 ./a.out
>>>
>>> or do you simply
>>>
>>>     ./a.out
>>>
>>> do you use a batch manager? if yes, which one?
>>>
>>> do you run one test per job, or multiple tests per job?
>>>
>>> how are these tests launched?
>>>
>>> does the test that crashes use MPI_Comm_spawn?
>>>
>>> i am surprised by the process name [[9325,5754],0], which suggests
>>> MPI_Comm_spawn was called 5753 times (!)
>>>
>>> can you also run
>>>
>>>     hostname
>>>
>>> on the 'lorien' host?
>>>
>>> if you configured Open MPI with --enable-debug, can you
>>>
>>>     export OMPI_MCA_plm_base_verbose=5
>>>
>>> then run one test and post the logs?
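>>> (about the process name: a test that kept respawning itself, roughly like
>>> the sketch below - an illustration only, not your code - would end up with
>>> process names like [[9325,N],0] where N keeps growing with each spawn)
>>>
>>>     #include <mpi.h>
>>>
>>>     int main(int argc, char *argv[])
>>>     {
>>>         MPI_Comm parent, child;
>>>         int errcode;
>>>
>>>         MPI_Init(&argc, &argv);
>>>         MPI_Comm_get_parent(&parent);
>>>         if (MPI_COMM_NULL == parent) {
>>>             /* every MPI_Comm_spawn starts a new job, so the local jobid
>>>              * part of the process name grows with each iteration */
>>>             for (int i = 0; i < 10; i++) {
>>>                 MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>>                                0, MPI_COMM_SELF, &child, &errcode);
>>>                 MPI_Comm_disconnect(&child);
>>>             }
>>>         } else {
>>>             /* spawned child: just disconnect and exit */
>>>             MPI_Comm_disconnect(&parent);
>>>         }
>>>         MPI_Finalize();
>>>         return 0;
>>>     }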
>>>
>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>> produce job family 5576 (but you get 9325)
>>>
>>> the discrepancy could be explained by the use of a batch manager
>>> and/or a full hostname i am unaware of.
>>>
>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>>> (32-bit hash of the) hostname and the mpirun (32 bits?) pid.
>>>
>>> so strictly speaking, it is possible two jobs launched on the same
>>> node are assigned the same 16-bit job family.
>>>
>>> the easiest way to detect this could be to:
>>>
>>> - edit orte/mca/plm/base/plm_base_jobid.c and replace
>>>
>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>                          (unsigned long)jobfam));
>>>
>>>   with
>>>
>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>                          (unsigned long)jobfam));
>>>
>>> - configure Open MPI with --enable-debug and rebuild
>>>
>>> - and then
>>>
>>>     export OMPI_MCA_plm_base_verbose=4
>>>
>>>   and run your tests.
>>>
>>> when the problem occurs, you will be able to check which pids
>>> produced the faulty jobfam, and that could hint to a conflict.
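>>> to make the possible collision concrete, the jobfam generation is
>>> conceptually something like the sketch below (a simplified stand-in,
>>> not the actual plm_base_jobid.c code; hash_str is just a placeholder
>>> for the real string hash):
>>>
>>>     #include <stdint.h>
>>>     #include <sys/types.h>
>>>
>>>     /* placeholder string hash, standing in for the real 32-bit hash */
>>>     static uint32_t hash_str(const char *s)
>>>     {
>>>         uint32_t h = 5381;
>>>         while (*s) h = (h << 5) + h + (unsigned char)*s++;
>>>         return h;
>>>     }
>>>
>>>     static uint16_t make_jobfam(const char *hostname, pid_t mpirun_pid)
>>>     {
>>>         /* fold a 32-bit value down to 16 bits: only 65536 possible
>>>          * job families per node, so two mpirun pids can collide */
>>>         uint32_t h = hash_str(hostname) ^ (uint32_t)mpirun_pid;
>>>         return (uint16_t)((h >> 16) ^ (h & 0xffff));
>>>     }
>>>
>>> (side note: if each test is its own mpirun, ~2200 jobs per night drawing
>>> from only 65536 possible job families makes an occasional collision
>>> statistically likely, not just theoretically possible)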
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>
>>> Hi,
>>>
>>> It is the third time this has happened in the last 10 days.
>>>
>>> While running nightly tests (~2200), we have one or two tests
>>> that fail at the very beginning with this strange error:
>>>
>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>> received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>
>>> But I can't reproduce the problem right now... i.e. if I launch
>>> this test alone "by hand", it is successful... the same test was
>>> successful yesterday...
>>>
>>> Is there some kind of "race condition" that can happen on the
>>> creation of "tmp" files if many tests run together on the same
>>> node? (we are oversubscribing even sequential runs...)
>>>
>>> Here are the build logs:
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>
>>> Thanks,
>>>
>>> Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel