Ralph, that looks good to me.
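just so we are picturing the same flow for the mpirun/orted case, here is a rough sketch of what i have in mind (mkdtemp() rather than mkstemp(), since we want a directory and not a file; the OMPI_SESSION_DIR name is only a placeholder i made up for the example, not an actual Open MPI variable):

    #define _XOPEN_SOURCE 700   /* for mkdtemp() and setenv() */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char template[] = "/tmp/ompi-session-XXXXXX";
        char *dir = mkdtemp(template);   /* creates a unique directory, mode 0700 */

        if (NULL == dir) {
            perror("mkdtemp");
            return 1;
        }
        /* mpirun/orted would export this to the tasks it forks, so all local
         * tasks rendezvous in the same place - no jobid hash involved.
         * OMPI_SESSION_DIR is a made-up placeholder, not a real OMPI knob. */
        setenv("OMPI_SESSION_DIR", dir, 1);
        printf("session directory: %s\n", dir);
        return 0;
    }

the tasks would then simply read that variable back instead of rebuilding a path from the (possibly colliding) 16 bit jobfam.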
can you please remind me how to test if an app was launched by mpirun/orted or direct launched by the RM? right now, which direct launch methods are supported? i am aware of srun (SLURM) and aprun (Cray), are there any others?
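fwiw, the quick check i have in mind is something along these lines - i am only assuming the usual environment markers here (OMPI_COMM_WORLD_RANK set by mpirun/orted for its children, SLURM_PROCID set by srun, ALPS_APP_PE set by aprun), and the list is surely incomplete, hence my question:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* check the OMPI marker first: under mpirun inside a SLURM allocation,
         * some SLURM_* variables can be inherited as well */
        if (NULL != getenv("OMPI_COMM_WORLD_RANK")) {
            printf("launched by mpirun/orted\n");
        } else if (NULL != getenv("SLURM_PROCID")) {
            printf("direct launched by srun (SLURM)\n");
        } else if (NULL != getenv("ALPS_APP_PE")) {
            printf("direct launched by aprun (Cray ALPS)\n");
        } else {
            printf("singleton, or a launcher i did not think of\n");
        }
        return 0;
    }

inside Open MPI itself the real answer is presumably whichever ess component gets selected, this is just the shell-level view.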
Cheers,

Gilles

On Thu, Sep 15, 2016 at 7:10 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>
> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
> my reply is in the text
>
> On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
>
> If we are going to make a change, then let’s do it only once. Since we introduced PMIx and the concept of the string namespace, the plan has been to switch away from a numerical jobid and to the namespace. This eliminates the issue of the hash altogether. If we are going to make a disruptive change, then let’s do that one. Either way, this isn’t something that could go into the 2.x series. It is far too invasive, and would have to be delayed until a 3.x at the earliest.
>
> got it !
>
> Note that I am not yet convinced that is the issue here. We’ve had this hash for 12 years, and this is the first time someone has claimed to see a problem. That makes me very suspicious that the root cause isn’t what you are pursuing. This is only being reported for _singletons_, and that is a very unique code path. The only reason for launching the orted is to support PMIx operations such as notification and comm_spawn. If those aren’t being used, then we could use the “isolated” mode where the usock OOB isn’t even activated, thus eliminating the problem. This would be a much smaller “fix” and could potentially fit into 2.x
>
> a bug has been identified and fixed, let's wait and see how things go
>
> how can i use the isolated mode ?
> shall i simply
> export OMPI_MCA_pmix=isolated
> export OMPI_MCA_plm=isolated
> ?
>
> out of curiosity, does "isolated" mean we would not even need to fork the HNP ?
>
> Yes - that’s the idea. Simplify and make things faster. All you have to do is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that code is in 2.x as well
>
> FWIW: every organization I’ve worked with has an epilog script that blows away temp dirs. It isn’t the RM-based environment that is of concern - it’s the non-RM one where epilog scripts don’t exist that is the problem.
>
> well, i was looking at this the other way around.
> if mpirun/orted creates the session directory with mkstemp(), then there is no more need to do any cleanup (as long as you do not run out of disk space)
> but with direct run, there is always a little risk that a previous session directory is used, hence the requirement for an epilogue.
> also, if the RM is configured to run one job at a time per a given node, the epilog can be quite trivial.
> but if several jobs can run on a given node at the same time, the epilog becomes less trivial
>
> Yeah, this session directory thing has always been problematic. We’ve had litter problems since day one, and tried multiple solutions over the years. Obviously, none of those has proven fully successful :-(
>
> Artem came up with a good solution using PMIx that allows the RM to control the session directory location for both direct launch and mpirun launch, thus ensuring the RM can clean up the correct place upon session termination. As we get better adoption of that method out there, then the RM-based solution (even for multiple jobs sharing a node) should be resolved.
>
> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem. Your proposal would resolve that one as (a) we always have orted’s in that scenario, and (b) the orted’s pass the session directory down to the apps. So maybe the right approach is to use mkstemp() in the scenario where we are launched via orted and the RM has not specified a session directory.
>
> I’m not sure we can resolve the direct launch without PMIx problem - I think that’s best left as another incentive for RMs to get on board the PMIx bus.
>
> Make sense?
>
> Cheers,
>
> Gilles
>
> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
> On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
>
> Many things are possible, given infinite time :-)
>
> i could not agree more :-D
>
> The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so the apps must be able to generate a name that they all can know. Otherwise, we lose shared memory support because they can’t rendezvous.
>
> thanks for the explanation,
> now let me rephrase that:
> "an MPI task must be able to rebuild the path to the session directory, based on the information it has when launched.
> if mpirun is used, we have several options to pass this information to the MPI tasks.
> in the case of direct run, this info is unlikely to be passed by the batch manager (PMIx is not universal (yet)), so we have to use what is available"
>
> my concern is that, to keep things simple, the session directory is based on the Open MPI jobid, and since the stepid is zero most of the time, the jobid really means the job family, which is stored on 16 bits.
>
> in the case of mpirun, jobfam is a 16 bit hash of the hostname (a reasonably sized string) and the mpirun pid (32 bits on Linux).
> if several mpirun are invoked on the same host at a given time, there is a risk two distinct jobs are assigned the same jobfam (since we hash from 32 bits down to 16 bits).
> also, there is a risk the session directory already exists from a previous job, with some/all files and unix sockets from a previous job, leading to undefined behavior (an early crash if we are lucky, odd things otherwise).
>
> in the case of direct run, i guess jobfam is a 16 bit hash of the jobid passed by the RM, and once again, there is a risk of a conflict and/or the re-use of a previous session directory.
>
> to me, the issue here is that we are using the Open MPI jobfam in order to build the session directory path.
> instead, what if we
> 1) when using mpirun, use a session directory created by mkstemp(), and pass it to the MPI tasks via the environment, or retrieve it from orted/mpirun right after the communication has been established.
> 2) for direct run, use a session directory based on the full jobid (which might be a string or a number) as passed by the RM
>
> in case of 1), there is no more risk of a hash conflict, or of re-using a previous session directory.
> in case of 2), there is no more risk of a hash conflict, but there is still a risk of re-using a session directory from a previous (e.g. terminated) job.
> that being said, once we document how the session directory is built from the jobid, sysadmins will be able to write a RM epilog that removes the session directory.
>
> does that make sense ?
>
> However, that doesn’t seem to be the root problem here.
> I suspect there is a bug in the code that spawns the orted from the singleton, and subsequently parses the returned connection info. If you look at the error, you’ll see that both jobid’s have “zero” for their local jobid. This means that the two procs attempting to communicate both think they are daemons, which is impossible in this scenario.
>
> So something garbled the string that the orted returns on startup to the singleton, and/or the singleton is parsing it incorrectly. IIRC, the singleton gets its name from that string, and so I expect it is getting the wrong name - and hence the error.
>
> i will investigate that.
>
> As you may recall, you made a change a little while back where we modified the code in ess/singleton to be a little less strict in its checking of that returned string. I wonder if that is biting us here? It wouldn’t fix the problem, but might generate a different error at a more obvious place.
>
> do you mean
> https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f
> ?
> this has not been backported to v2.x, and the issue was reported on v2.x
>
> Cheers,
>
> Gilles
>
> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> is there any reason to use a session directory based on the jobid (or job family) ?
> I mean, could we use mkstemp to generate a unique directory, and then propagate the path via orted comm or the environment ?
>
> Cheers,
>
> Gilles
>
> On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>
>> This has nothing to do with PMIx, Josh - the error is coming out of the usock OOB component.
>>
>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>> Eric,
>>
>> We are looking into the PMIx code path that sets up the jobid. The session directories are created based on the jobid. It might be the case that the jobids (generated with rand) happen to be the same for different jobs, resulting in multiple jobs sharing the same session directory, but we need to check. We will update.
>>
>> Josh
>>
>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>
>>> Lucky!
>>>
>>> Since each run has a specific TMP, I still have it on disc.
>>>
>>> for the faulty run, the TMP variable was:
>>>
>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>
>>> and into $TMP I have:
>>>
>>> openmpi-sessions-40031@lorien_0
>>>
>>> and into this subdirectory I have a bunch of empty dirs:
>>>
>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
>>> 1841
>>>
>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
>>> total 68
>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>> ...
>>>
>>> If I do:
>>>
>>> lsof |grep "openmpi-sessions-40031"
>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>> Output information may be incomplete.
>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>> Output information may be incomplete.
>>>
>>> nothing...
>>>
>>> What else may I check?
>>>
>>> Eric
>>>
>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>
>>>> Hi, Eric
>>>>
>>>> I **think** this might be related to the following:
>>>>
>>>> https://github.com/pmix/master/pull/145
>>>>
>>>> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
>>>>
>>>> Best,
>>>>
>>>> Josh
>>>>
>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>
>>>> Eric,
>>>>
>>>> can you please provide more information on how your tests are launched ?
>>>>
>>>> do you
>>>> mpirun -np 1 ./a.out
>>>> or do you simply
>>>> ./a.out
>>>>
>>>> do you use a batch manager ? if yes, which one ?
>>>> do you run one test per job ? or multiple tests per job ?
>>>> how are these tests launched ?
>>>>
>>>> does the test that crashes use MPI_Comm_spawn ?
>>>> i am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)
>>>>
>>>> can you also run
>>>> hostname
>>>> on the 'lorien' host ?
>>>>
>>>> if you configured Open MPI with --enable-debug, can you
>>>> export OMPI_MCA_plm_base_verbose=5
>>>> then run one test and post the logs ?
>>>>
>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325)
>>>> the discrepancy could be explained by the use of a batch manager and/or a full hostname i am unaware of.
>>>>
>>>> orte_plm_base_set_hnp_name() generates a 16 bits job family from the (32 bits hash of the) hostname and the mpirun (32 bits ?) pid.
>>>> so strictly speaking, it is possible two jobs launched on the same node are assigned the same 16 bits job family.
>>>>
>>>> the easiest way to detect this could be to
>>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>> and replace
>>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>                          (unsigned long)jobfam));
>>>> with
>>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>                          (unsigned long)jobfam));
>>>> - configure Open MPI with --enable-debug and rebuild
>>>> - and then
>>>> export OMPI_MCA_plm_base_verbose=4
>>>> and run your tests.
>>>>
>>>> when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint to a conflict.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>
>>>> Hi,
>>>>
>>>> It is the third time this has happened in the last 10 days.
>>>>
>>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>>
>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>
>>>> But I can't reproduce the problem right now... ie: If I launch this test alone "by hand", it is successful... the same test was successful yesterday...
>>>>
>>>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (we are oversubscribing even sequential runs...)
>>>>
>>>> Here are the build logs:
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>
>>>> Thanks,
>>>>
>>>> Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel