Hi Gilles,

At what point in the job launch do you need to determine whether or not the job was direct launched?
Howard

--
Howard Pritchard
HPC-DES
Los Alamos National Laboratory

On 9/15/16, 7:38 AM, "devel on behalf of Gilles Gouaillardet" <devel-boun...@lists.open-mpi.org on behalf of gilles.gouaillar...@gmail.com> wrote:

>Ralph,
>
>that looks good to me.
>
>can you please remind me how to test whether an app was launched by mpirun/orted or direct launched by the RM?
>
>right now, which direct launch methods are supported?
>I am aware of srun (SLURM) and aprun (CRAY), are there any others?
>
>Cheers,
>
>Gilles
>
>On Thu, Sep 15, 2016 at 7:10 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>
>> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>> Ralph,
>>
>> my reply is in the text
>>
>> On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
>>
>> If we are going to make a change, then let's do it only once. Since we introduced PMIx and the concept of the string namespace, the plan has been to switch away from a numerical jobid and to the namespace. This eliminates the issue of the hash altogether. If we are going to make a disruptive change, then let's do that one. Either way, this isn't something that could go into the 2.x series. It is far too invasive, and would have to be delayed until a 3.x at the earliest.
>>
>> got it!
>>
>> Note that I am not yet convinced that is the issue here. We've had this hash for 12 years, and this is the first time someone has claimed to see a problem. That makes me very suspicious that the root cause isn't what you are pursuing. This is only being reported for _singletons_, and that is a very unusual code path. The only reason for launching the orted is to support PMIx operations such as notification and comm_spawn. If those aren't being used, then we could use the "isolated" mode where the usock OOB isn't even activated, thus eliminating the problem. This would be a much smaller "fix" and could potentially fit into 2.x
>>
>> a bug has been identified and fixed, let's wait and see how things go
>>
>> how can I use the isolated mode?
>> shall I simply
>> export OMPI_MCA_pmix=isolated
>> export OMPI_MCA_plm=isolated
>> ?
>>
>> out of curiosity, does "isolated" mean we would not even need to fork the HNP?
>>
>> Yes - that's the idea. Simplify and make things faster. All you have to do is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that code is in 2.x as well
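Coming back to Gilles's question near the top of the thread (how to tell whether an app was started by mpirun/orted or direct launched by the RM): one minimal user-level sketch is to look at the environment. This is only a heuristic from outside the library, not the official test; inside Open MPI the ess framework makes this determination. The variables below are ones the respective launchers are known to export: mpirun/orted set OMPI_COMM_WORLD_RANK for the procs they launch, srun sets SLURM_PROCID, and aprun sets ALPS_APP_PE.

    #include <stdio.h>
    #include <stdlib.h>

    /* Heuristic launch-mode detection. A proc that sees an RM rank
     * variable but no OMPI_* launch variables was most likely direct
     * launched; a proc with neither is a singleton (./a.out). */
    int main(void)
    {
        if (NULL != getenv("OMPI_COMM_WORLD_RANK")) {
            printf("launched via mpirun/orted\n");
        } else if (NULL != getenv("SLURM_PROCID")) {
            printf("direct launched by SLURM (srun)\n");
        } else if (NULL != getenv("ALPS_APP_PE")) {
            printf("direct launched by ALPS (aprun)\n");
        } else {
            printf("singleton or unknown launcher\n");
        }
        return 0;
    }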
>>
>> FWIW: every organization I've worked with has an epilog script that blows away temp dirs. It isn't the RM-based environment that is of concern - it's the non-RM one where epilog scripts don't exist that is the problem.
>>
>> well, I was looking at this the other way around.
>> if mpirun/orted creates the session directory with mkstemp(), then there is no longer any need for cleanup (as long as you do not run out of disk space)
>> but with direct run, there is always a small risk that a previous session directory gets reused, hence the requirement for an epilog.
>> also, if the RM is configured to run one job at a time on a given node, the epilog can be quite trivial.
>> but if several jobs can run on a given node at the same time, the epilog becomes less trivial
>>
>> Yeah, this session directory thing has always been problematic. We've had litter problems since day one, and tried multiple solutions over the years. Obviously, none of those has proven fully successful :-(
>>
>> Artem came up with a good solution using PMIx that allows the RM to control the session directory location for both direct launch and mpirun launch, thus ensuring the RM can clean up the correct place upon session termination. As we get better adoption of that method out there, the RM-based solution (even for multiple jobs sharing a node) should be resolved.
>>
>> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem. Your proposal would resolve that one as (a) we always have orteds in that scenario, and (b) the orteds pass the session directory down to the apps. So maybe the right approach is to use mkstemp() in the scenario where we are launched via orted and the RM has not specified a session directory.
>>
>> I'm not sure we can resolve the direct launch without PMIx problem - I think that's best left as another incentive for RMs to get on board the PMIx bus.
>>
>> Make sense?
>>
>> Cheers,
>>
>> Gilles
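As a sketch of the approach Ralph describes above: the thread says mkstemp(), but for creating a directory the analogous libc call is mkdtemp(3), used below. mpirun/orted would create a guaranteed-unique session root and hand the path to its children via the environment. The OMPI_SESSION_DIR variable name is made up purely for illustration; it is not an actual Open MPI variable.

    #define _XOPEN_SOURCE 700
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* mkdtemp() replaces the XXXXXX suffix and creates the directory
         * atomically with mode 0700, so no other job can be assigned the
         * same path or have left stale files/sockets inside it. */
        char template[] = "/tmp/openmpi-session-XXXXXX";
        char *session_dir = mkdtemp(template);
        if (NULL == session_dir) {
            perror("mkdtemp");
            return 1;
        }

        /* Hand the path down to the MPI tasks (hypothetical variable name). */
        setenv("OMPI_SESSION_DIR", session_dir, 1);
        printf("session directory: %s\n", session_dir);

        /* ... fork/exec the procs, which would read OMPI_SESSION_DIR ... */
        return 0;
    }

Because the name is fresh and creation is atomic, neither a hash collision nor leftovers from an earlier job can occur, which is the property this proposal relies on.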
>> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>> Ralph,
>>
>> On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
>>
>> Many things are possible, given infinite time :-)
>>
>> I could not agree more :-D
>>
>> The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so the apps must be able to generate a name that they all can know. Otherwise, we lose shared memory support because they can't rendezvous.
>>
>> thanks for the explanation,
>> now let me rephrase that:
>> "an MPI task must be able to rebuild the path to the session directory, based on the information it has when launched.
>> if mpirun is used, we have several options to pass this information to the MPI tasks.
>> in the case of direct run, this info is unlikely to be passed by the batch manager (PMIx is not universal (yet)), so we have to use what is available"
>>
>> my concern is that, to keep things simple, the session directory is based on the Open MPI jobid, and since the stepid is zero most of the time, the jobid really means the job family, which is stored on 16 bits.
>>
>> in the case of mpirun, the jobfam is a 16-bit hash of the hostname (a reasonably sized string) and the mpirun pid (32 bits on Linux). if several mpiruns are invoked on the same host at a given time, there is a risk two distinct jobs are assigned the same jobfam (since we hash from 32 bits down to 16 bits). also, there is a risk the session directory already exists from a previous job, with some/all files and unix sockets from a previous job, leading to undefined behavior (an early crash if we are lucky, odd things otherwise).
>>
>> in the case of direct run, I guess the jobfam is a 16-bit hash of the jobid passed by the RM, and once again, there is a risk of conflict and/or the reuse of a previous session directory.
>>
>> to me, the issue here is that we are using the Open MPI jobfam in order to build the session directory path.
>> instead, what if we
>> 1) when using mpirun, use a session directory created by mkstemp(), and pass it to the MPI tasks via the environment, or retrieve it from orted/mpirun right after the communication has been established.
>> 2) for direct run, use a session directory based on the full jobid (which might be a string or a number) as passed by the RM
>>
>> in case of 1), there is no more risk of a hash conflict, or of reusing a previous session directory.
>> in case of 2), there is no more risk of a hash conflict, but there is still a risk of reusing a session directory from a previous (e.g. terminated) job.
>> that being said, once we document how the session directory is built from the jobid, sysadmins will be able to write an RM epilog that removes the session directory.
>>
>> does that make sense?
>>
>> However, that doesn't seem to be the root problem here. I suspect there is a bug in the code that spawns the orted from the singleton, and subsequently parses the returned connection info. If you look at the error, you'll see that both jobids have "zero" for their local jobid. This means that the two procs attempting to communicate both think they are daemons, which is impossible in this scenario.
>>
>> So something garbled the string that the orted returns on startup to the singleton, and/or the singleton is parsing it incorrectly. IIRC, the singleton gets its name from that string, and so I expect it is getting the wrong name - and hence the error.
>>
>> I will investigate that.
>>
>> As you may recall, you made a change a little while back where we modified the code in ess/singleton to be a little less strict in its checking of that returned string. I wonder if that is biting us here? It wouldn't fix the problem, but might generate a different error at a more obvious place.
>>
>> do you mean
>> https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f
>> ?
>> this has not been backported to v2.x, and the issue was reported on v2.x
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>> Ralph,
>>
>> is there any reason to use a session directory based on the jobid (or job family)?
>> I mean, could we use mkstemp to generate a unique directory, and then propagate the path via orted comm or the environment?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>
>>> This has nothing to do with PMIx, Josh - the error is coming out of the usock OOB component.
>>>
>>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>
>>> Eric,
>>>
>>> We are looking into the PMIx code path that sets up the jobid. The session directories are created based on the jobid. It might be the case that the jobids (generated with rand) happen to be the same for different jobs, resulting in multiple jobs sharing the same session directory, but we need to check. We will update.
>>>
>>> Josh
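To make Ralph's "zero local jobid" observation concrete: as I understand it, an ORTE jobid is a 32-bit value with the job family in the upper 16 bits and the local jobid in the lower 16, and names print as [[jobfam,local],vpid]; a local jobid of 0 denotes the daemon job (HNP/orteds). The toy macros below mimic what I believe ORTE's ORTE_JOB_FAMILY/ORTE_LOCAL_JOBID macros do; they are not taken from the real headers.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy jobid decomposition: upper 16 bits = job family,
     * lower 16 bits = local jobid (0 means HNP/orted daemons). */
    #define JOB_FAMILY(j)   ((uint16_t)(((j) >> 16) & 0xffff))
    #define LOCAL_JOBID(j)  ((uint16_t)((j) & 0xffff))

    static void decode(uint32_t jobid)
    {
        printf("[[%u,%u]] -> %s\n", JOB_FAMILY(jobid), LOCAL_JOBID(jobid),
               0 == LOCAL_JOBID(jobid) ? "daemon (HNP/orted) job" : "app job");
    }

    int main(void)
    {
        decode(((uint32_t)9325 << 16) | 0);    /* [[9325,0],...] from the error   */
        decode(((uint32_t)5590 << 16) | 0);    /* [[5590,0],...] from the error   */
        decode(((uint32_t)9325 << 16) | 5754); /* [[9325,5754],...] the app proc  */
        return 0;
    }

So the two names in the reported error, [[9325,0],0] and [[5590,0],0], both decode as daemon-level names from two different job families, which is why Ralph concludes the singleton ended up with the wrong name rather than the hash simply colliding.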
>>>
>>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>
>>>> Lucky!
>>>>
>>>> Since each run has a specific TMP, I still have it on disc.
>>>>
>>>> for the faulty run, the TMP variable was:
>>>>
>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>>
>>>> and in $TMP I have:
>>>>
>>>> openmpi-sessions-40031@lorien_0
>>>>
>>>> and in this subdirectory I have a bunch of empty dirs:
>>>>
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>> 1841
>>>>
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>> total 68
>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>> ...
>>>>
>>>> If I do:
>>>>
>>>> lsof | grep "openmpi-sessions-40031"
>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>>   Output information may be incomplete.
>>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>>   Output information may be incomplete.
>>>>
>>>> nothing...
>>>>
>>>> What else may I check?
>>>>
>>>> Eric
>>>>
>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>
>>>>> Hi, Eric
>>>>>
>>>>> I **think** this might be related to the following:
>>>>>
>>>>> https://github.com/pmix/master/pull/145
>>>>>
>>>>> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
>>>>>
>>>>> Best,
>>>>>
>>>>> Josh
>>>>>
>>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>
>>>>> Eric,
>>>>>
>>>>> can you please provide more information on how your tests are launched?
>>>>>
>>>>> do you
>>>>>
>>>>> mpirun -np 1 ./a.out
>>>>>
>>>>> or do you simply
>>>>>
>>>>> ./a.out
>>>>>
>>>>> do you use a batch manager? if yes, which one?
>>>>>
>>>>> do you run one test per job? or multiple tests per job?
>>>>>
>>>>> how are these tests launched?
>>>>>
>>>>> does the test that crashes use MPI_Comm_spawn?
>>>>>
>>>>> I am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)
>>>>>
>>>>> can you also run
>>>>>
>>>>> hostname
>>>>>
>>>>> on the 'lorien' host?
>>>>>
>>>>> if you configured Open MPI with --enable-debug, can you
>>>>>
>>>>> export OMPI_MCA_plm_base_verbose=5
>>>>>
>>>>> then run one test and post the logs?
>>>>>
>>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325)
>>>>>
>>>>> the discrepancy could be explained by the use of a batch manager and/or a full hostname I am unaware of.
>>>>>
>>>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32 bits?) pid.
>>>>>
>>>>> so strictly speaking, it is possible two jobs launched on the same node are assigned the same 16-bit job family.
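To illustrate why folding a hostname+pid hash down to 16 bits can collide, as Gilles describes: the sketch below uses djb2 rather than Open MPI's actual hash, so its numbers will not match orte_plm_base_set_hnp_name(), but the pigeonhole/birthday effect is identical since only 65536 distinct jobfam values exist.

    #include <stdio.h>
    #include <stdint.h>

    /* djb2 string hash: a stand-in for Open MPI's hash, chosen only to
     * demonstrate the 32-bit -> 16-bit folding, not to reproduce
     * orte_plm_base_set_hnp_name() exactly. */
    static uint32_t djb2(const char *s)
    {
        uint32_t h = 5381;
        while (*s) h = h * 33 + (uint8_t)*s++;
        return h;
    }

    static uint16_t jobfam(const char *host, uint32_t pid)
    {
        uint32_t h = djb2(host) + pid;     /* mix hostname hash and pid */
        return (uint16_t)((h >> 16) ^ h);  /* fold 32 bits down to 16   */
    }

    int main(void)
    {
        const char *host = "lorien";
        static int32_t seen[65536];
        for (int i = 0; i < 65536; i++) seen[i] = -1;

        /* By pigeonhole, at most 65537 consecutive pids are needed
         * before two of them fold to the same 16-bit jobfam. */
        for (int32_t pid = 1000; ; pid++) {
            uint16_t jf = jobfam(host, (uint32_t)pid);
            if (seen[jf] >= 0) {
                printf("pids %d and %d both map to jobfam %u\n",
                       seen[jf], pid, jf);
                return 0;
            }
            seen[jf] = pid;
        }
    }

With 65536 buckets, n concurrent or recent mpirun invocations on one host collide with probability roughly n^2 / (2 * 65536), so a farm running ~2200 tests per night has a realistic chance of an occasional clash.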
>>>>>
>>>>> the easiest way to detect this could be to
>>>>>
>>>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>>>
>>>>> and replace
>>>>>
>>>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                          (unsigned long)jobfam));
>>>>>
>>>>> with
>>>>>
>>>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                          (unsigned long)jobfam));
>>>>>
>>>>> - configure Open MPI with --enable-debug and rebuild
>>>>>
>>>>> and then
>>>>>
>>>>> export OMPI_MCA_plm_base_verbose=4
>>>>>
>>>>> and run your tests.
>>>>>
>>>>> when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
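Why both the 5-to-4 edit and OMPI_MCA_plm_base_verbose=4 are needed: OPAL_OUTPUT_VERBOSE emits a message only when the framework's configured verbosity is at least the level given as its first argument (and only in --enable-debug builds). Below is a toy stand-in for that gate, not the actual OPAL implementation.

    #include <stdio.h>
    #include <stdarg.h>

    /* Stands in for what OMPI_MCA_plm_base_verbose=4 configures. */
    static int framework_verbosity = 4;

    /* Conceptual sketch of the gate behind OPAL_OUTPUT_VERBOSE:
     * the message prints only if its level is within the configured
     * verbosity, so a level-5 message stays silent at verbosity 4. */
    static void output_verbose(int msg_level, const char *fmt, ...)
    {
        if (framework_verbosity < msg_level) {
            return;
        }
        va_list ap;
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }

    int main(void)
    {
        output_verbose(5, "plm:base:set_hnp_name: final jobfam %lu\n", 5576UL); /* suppressed */
        output_verbose(4, "plm:base:set_hnp_name: final jobfam %lu\n", 5576UL); /* printed    */
        return 0;
    }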
>>>>>
>>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> It is the third time this has happened in the last 10 days.
>>>>>
>>>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>>>
>>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>>
>>>>> But I can't reproduce the problem right now... i.e., if I launch this test alone "by hand", it is successful... the same test was successful yesterday...
>>>>>
>>>>> Is there some kind of "race condition" that can happen in the creation of "tmp" files if many tests run together on the same node? (we are oversubscribing even sequential runs...)
>>>>>
>>>>> Here are the build logs:
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel