Hi Gilles,

At what point in the job launch do you need to determine whether
or not the job was direct launched?

Howard

-- 
Howard Pritchard

HPC-DES
Los Alamos National Laboratory





On 9/15/16, 7:38 AM, "devel on behalf of Gilles Gouaillardet"
<devel-boun...@lists.open-mpi.org on behalf of
gilles.gouaillar...@gmail.com> wrote:

>Ralph,
>
>that looks good to me.
>
>can you please remind me how to test whether an app was launched by
>mpirun/orted or direct launched by the RM ?
>
>right now, which direct launch methods are supported ?
>i am aware of srun (SLURM) and aprun (Cray), are there any others ?
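>
>the only way i can think of is an environment based heuristic such as
>the sketch below - the variable choice is an assumption on my side
>(OMPI_COMM_WORLD_RANK being set by mpirun/orted, SLURM_PROCID by srun
>and ALPS_APP_PE by aprun), so i am not sure this is the supported way :
>
>/* rough heuristic only: assumes OMPI_COMM_WORLD_RANK is set by
> * mpirun/orted, while SLURM_PROCID / ALPS_APP_PE come from srun / aprun
> * direct launch */
>#include <stdio.h>
>#include <stdlib.h>
>
>int main(void)
>{
>    if (NULL != getenv("OMPI_COMM_WORLD_RANK")) {
>        printf("launched via mpirun/orted\n");
>    } else if (NULL != getenv("SLURM_PROCID") ||
>               NULL != getenv("ALPS_APP_PE")) {
>        printf("direct launched by the RM (srun or aprun)\n");
>    } else {
>        printf("singleton or unknown launcher\n");
>    }
>    return 0;
>}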
>
>Cheers,
>
>Gilles
>
>On Thu, Sep 15, 2016 at 7:10 PM, r...@open-mpi.org <r...@open-mpi.org>
>wrote:
>>
>> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet <gil...@rist.or.jp>
>>wrote:
>>
>> Ralph,
>>
>>
>> my replies are inline below
>>
>>
>> On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
>>
>> If we are going to make a change, then let’s do it only once. Since we
>> introduced PMIx and the concept of the string namespace, the plan has
>> been to switch away from a numerical jobid and to the namespace. This
>> eliminates the issue of the hash altogether. If we are going to make a
>> disruptive change, then let’s do that one. Either way, this isn’t
>> something that could go into the 2.x series. It is far too invasive,
>> and would have to be delayed until a 3.x at the earliest.
>>
>> got it !
>>
>> Note that I am not yet convinced that is the issue here. We’ve had this
>> hash for 12 years, and this is the first time someone has claimed to
>> see a problem. That makes me very suspicious that the root cause isn’t
>> what you are pursuing. This is only being reported for _singletons_,
>> and that is a very unique code path. The only reason for launching the
>> orted is to support PMIx operations such as notification and
>> comm_spawn. If those aren’t being used, then we could use the
>> “isolated” mode where the usock OOB isn’t even activated, thus
>> eliminating the problem. This would be a much smaller “fix” and could
>> potentially fit into 2.x
>>
>> a bug has been identified and fixed, let's wait and see how things go
>>
>> how can i use the isolated mode ?
>> shall i simply
>> export OMPI_MCA_pmix=isolated
>> export OMPI_MCA_plm=isolated
>> ?
>>
>> out of curiosity, does "isolated" mean we would not even need to fork
>> the HNP ?
>>
>>
>> Yes - that’s the idea. Simplify and make things faster. All you have
>> to do is set OMPI_MCA_ess_singleton_isolated=1 on master, and I
>> believe that code is in 2.x as well
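>>
>> e.g. (untested sketch - the usual way is simply to export the variable
>> in the shell before running ./a.out, but setting it from inside the
>> singleton before MPI_Init should also work, since MCA params are read
>> from the environment at init time):
>>
>> /* untested sketch: equivalent to
>>  * "export OMPI_MCA_ess_singleton_isolated=1" in the shell */
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     setenv("OMPI_MCA_ess_singleton_isolated", "1", 1);
>>     MPI_Init(&argc, &argv);
>>     /* no comm_spawn / PMIx notification support in this mode */
>>     MPI_Finalize();
>>     return 0;
>> }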
>>
>>
>>
>> FWIW: every organization I’ve worked with has an epilog script that
>> blows away temp dirs. It isn’t the RM-based environment that is of
>> concern - it’s the non-RM one where epilog scripts don’t exist that is
>> the problem.
>>
>> well, i was looking at this the other way around.
>> if mpirun/orted creates the session directory with mkstemp(), then
>> there is no more need to do any cleanup
>> (as long as you do not run out of disk space)
>> but with direct run, there is always a little risk that a previous
>> session directory is re-used, hence the requirement for an epilog.
>> also, if the RM is configured to run one job at a time per node, the
>> epilog can be quite trivial.
>> but if several jobs can run on a given node at the same time, the
>> epilog becomes less trivial
>>
>>
>> Yeah, this session directory thing has always been problematic. We’ve
>> had litter problems since day one, and tried multiple solutions over
>> the years. Obviously, none of those has proven fully successful :-(
>>
>> Artem came up with a good solution using PMIx that allows the RM to
>> control the session directory location for both direct launch and
>> mpirun launch, thus ensuring the RM can clean up the correct place
>> upon session termination. As we get better adoption of that method out
>> there, then the RM-based solution (even for multiple jobs sharing a
>> node) should be resolved.
>>
>> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem.
>> Your proposal would resolve that one as (a) we always have orted’s in
>> that scenario, and (b) the orted’s pass the session directory down to
>> the apps. So maybe the right approach is to use mkstemp() in the
>> scenario where we are launched via orted and the RM has not specified
>> a session directory.
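>>
>> In pseudo-C, the selection order I have in mind is roughly the sketch
>> below - only the ordering matters, the helpers and the environment
>> variable names are just made-up stand-ins for illustration:
>>
>> /* rough sketch of the ordering only - the helpers are stand-ins */
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> /* stand-in: an RM/PMIx-provided location (variable name is made up) */
>> static const char *rm_provided_tmpdir(void) {
>>     return getenv("OMPI_RM_TMPDIR");
>> }
>>
>> /* stand-in: were we started by mpirun/orted ? */
>> static int launched_via_orted(void) {
>>     return NULL != getenv("OMPI_COMM_WORLD_RANK");
>> }
>>
>> static const char *select_session_base(void)
>> {
>>     const char *dir;
>>
>>     /* 1. the RM told us where to put it - always honor that */
>>     if (NULL != (dir = rm_provided_tmpdir())) {
>>         return dir;
>>     }
>>     /* 2. launched via mpirun/orted: the orted can create a unique dir
>>      *    (mkstemp/mkdtemp style) and pass it down to the apps */
>>     if (launched_via_orted()) {
>>         return getenv("OMPI_ORTED_SESSION_DIR");  /* made-up name */
>>     }
>>     /* 3. direct launch without PMIx: fall back to a name derived from
>>      *    whatever id the RM gave us (collision/cleanup caveats apply) */
>>     return getenv("SLURM_JOB_ID");
>> }
>>
>> int main(void)
>> {
>>     const char *base = select_session_base();
>>     printf("session base: %s\n", base ? base : "(none)");
>>     return 0;
>> }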
>>
>> I’m not sure we can resolve the direct launch without PMIx problem - I
>> think that’s best left as another incentive for RMs to get on-board
>> the PMIx bus.
>>
>> Make sense?
>>
>>
>>
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp>
>>wrote:
>>
>> Ralph,
>>
>>
>> On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
>>
>> Many things are possible, given infinite time :-)
>>
>> i could not agree more :-D
>>
>> The issue with this notion lies in direct launch scenarios - i.e., when
>> procs are launched directly by the RM and not via mpirun. In this case,
>> there is nobody who can give us the session directory (well, until PMIx
>> becomes universal), and so the apps must be able to generate a name that
>> they all can know. Otherwise, we lose shared memory support because they
>> can’t rendezvous.
>>
>> thanks for the explanation,
>> now let me rephrase that:
>> "an MPI task must be able to rebuild the path to the session directory,
>> based on the information it has when launched.
>> if mpirun is used, we have several options to pass this information to
>> the MPI tasks.
>> in case of direct run, this info is unlikely to be passed by the batch
>> manager (PMIx is not universal (yet)), so we have to use what is
>> available"
>>
>> my concern is that, to keep things simple, the session directory is
>> based on the Open MPI jobid, and since the stepid is zero most of the
>> time, the jobid really means the job family, which is stored in 16
>> bits.
>>
>> in the case of mpirun, the jobfam is a 16-bit hash of the hostname (a
>> reasonably sized string) and the mpirun pid (32 bits on Linux).
>> if several mpirun instances are invoked on the same host at a given
>> time, there is a risk two distinct jobs are assigned the same jobfam
>> (since we hash from 32 bits down to 16 bits).
>> also, there is a risk the session directory already exists from a
>> previous job, with some/all files and unix sockets from a previous
>> job, leading to undefined behavior
>> (an early crash if we are lucky, odd things otherwise).
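>>
>> to illustrate the collision risk, here is a toy example (this is not
>> the actual orte_plm_base_set_hnp_name() algorithm, just the general
>> shape of hashing a hostname and a pid down to 16 bits):
>>
>> /* illustration only - not the real Open MPI hash */
>> #include <stdint.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> static uint16_t toy_jobfam(const char *hostname, pid_t pid)
>> {
>>     uint32_t hash = 0;
>>     size_t i;
>>
>>     /* simple 32-bit string hash of the hostname */
>>     for (i = 0; i < strlen(hostname); i++) {
>>         hash = hash * 31 + (unsigned char)hostname[i];
>>     }
>>     /* mix in the pid, then fold 32 bits down to 16: two different
>>      * (hostname, pid) pairs can easily end up with the same value */
>>     hash += (uint32_t)pid;
>>     return (uint16_t)((hash >> 16) ^ (hash & 0xffff));
>> }
>>
>> int main(void)
>> {
>>     printf("jobfam = %u\n", (unsigned)toy_jobfam("lorien", getpid()));
>>     return 0;
>> }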
>>
>> in the case of direct run, i guess jobfam is a 16 bit hash of the jobid
>> passed by the RM, and once again, there is a risk of conflict and/or the
>> re-use of a previous session directory.
>>
>> to me, the issue here is we are using the Open MPI jobfam in order to
>> build the session directory path.
>> instead, what if we
>> 1) with mpirun, use a session directory created by mkstemp(), and pass
>> it to the MPI tasks via the environment, or retrieve it from
>> orted/mpirun right after the communication has been established.
>> 2) for direct run, use a session directory based on the full jobid
>> (which might be a string or a number) as passed by the RM
>>
>> in case of 1), there is no longer a risk of a hash conflict, or of
>> re-using a previous session directory.
>> in case of 2), there is no longer a risk of a hash conflict, but there
>> is still a risk of re-using a session directory from a previous (e.g.
>> terminated) job.
>> that being said, once we document how the session directory is built
>> from the jobid, sysadmins will be able to write an RM epilog that
>> removes the session directory.
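>>
>> for 1), i am thinking of something along these lines (simplified
>> sketch; mkdtemp(3) is the directory flavour of mkstemp(), and the
>> OMPI_SESSION_DIR variable name is just made up for the example):
>>
>> /* simplified sketch of 1): mpirun/orted creates a unique session
>>  * directory and passes it to the MPI tasks via their environment */
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(void)
>> {
>>     char template[] = "/tmp/ompi.session.XXXXXX";
>>     char *session_dir = mkdtemp(template);  /* unique dir, mode 0700 */
>>
>>     if (NULL == session_dir) {
>>         perror("mkdtemp");
>>         return 1;
>>     }
>>     /* inherited by the children forked/launched by mpirun/orted */
>>     setenv("OMPI_SESSION_DIR", session_dir, 1);
>>     printf("session directory: %s\n", session_dir);
>>     return 0;
>> }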
>>
>>  does that make sense ?
>>
>>
>> However, that doesn’t seem to be the root problem here. I suspect
>> there is a bug in the code that spawns the orted from the singleton,
>> and subsequently parses the returned connection info. If you look at
>> the error, you’ll see that both jobids have “zero” for their local
>> jobid. This means that the two procs attempting to communicate both
>> think they are daemons, which is impossible in this scenario.
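>>
>> (For reference on the name format: a printed name like [[9325,0],0] is
>> [[job family, local jobid], vpid]; as I recall, the 32-bit jobid
>> carries the job family in its upper 16 bits and the local jobid in the
>> lower 16, with local jobid 0 being the daemon job. A quick sketch of
>> that decoding:)
>>
>> /* sketch of how the printed name maps onto the 32-bit jobid */
>> #include <stdint.h>
>> #include <stdio.h>
>>
>> int main(void)
>> {
>>     uint16_t jobfam = 9325, local = 0;
>>     uint32_t vpid = 0;
>>     uint32_t jobid = ((uint32_t)jobfam << 16) | local;
>>
>>     printf("[[%u,%u],%u] -> jobid 0x%08x (local jobid 0 = daemons)\n",
>>            jobfam, local, vpid, jobid);
>>     return 0;
>> }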
>>
>> So something garbled the string that the orted returns on startup to
>> the singleton, and/or the singleton is parsing it incorrectly. IIRC,
>> the singleton gets its name from that string, and so I expect it is
>> getting the wrong name - and hence the error.
>>
>> i will investigate that.
>>
>> As you may recall, you made a change a little while back where we
>> modified the code in ess/singleton to be a little less strict in its
>> checking of that returned string. I wonder if that is biting us here?
>> It wouldn’t fix the problem, but might generate a different error at a
>> more obvious place.
>>
>> do you mean
>> https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f
>> ?
>> this has not been backported to v2.x, and the issue was reported on v2.x
>>
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>> Ralph,
>>
>> is there any reason to use a session directory based on the jobid (or
>> job family) ?
>> I mean, could we use mkstemp to generate a unique directory, and then
>> propagate the path via orted comm or the environment ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org>
>>wrote:
>>>
>>> This has nothing to do with PMIx, Josh - the error is coming out of the
>>> usock OOB component.
>>>
>>>
>>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>
>>> Eric,
>>>
>>> We are looking into the PMIx code path that sets up the jobid. The
>>> session directories are created based on the jobid. It might be the
>>> case that the jobids (generated with rand) happen to be the same for
>>> different jobs, resulting in multiple jobs sharing the same session
>>> directory, but we need to check. We will update.
>>>
>>> Josh
>>>
>>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland
>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>
>>>> Lucky!
>>>>
>>>> Since each run has a specific TMP, I still have it on disk.
>>>>
>>>> for the faulty run, the TMP variable was:
>>>>
>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>>
>>>> and in $TMP I have:
>>>>
>>>> openmpi-sessions-40031@lorien_0
>>>>
>>>> and in this subdirectory I have a bunch of empty dirs:
>>>>
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
>>>> 1841
>>>>
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
>>>> total 68
>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>> ...
>>>>
>>>> If I do:
>>>>
>>>> lsof |grep "openmpi-sessions-40031"
>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
>>>> /run/user/1000/gvfs
>>>>       Output information may be incomplete.
>>>> lsof: WARNING: can't stat() tracefs file system
>>>>/sys/kernel/debug/tracing
>>>>       Output information may be incomplete.
>>>>
>>>> nothing...
>>>>
>>>> What else may I check?
>>>>
>>>> Eric
>>>>
>>>>
>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>
>>>>> Hi, Eric
>>>>>
>>>>> I **think** this might be related to the following:
>>>>>
>>>>> https://github.com/pmix/master/pull/145
>>>>>
>>>>> I'm wondering if you can look into the /tmp directory and see if you
>>>>> have a bunch of stale usock files.
>>>>>
>>>>> Best,
>>>>>
>>>>> Josh
>>>>>
>>>>>
>>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>>>><gil...@rist.or.jp
>>>>> <mailto:gil...@rist.or.jp>> wrote:
>>>>>
>>>>>     Eric,
>>>>>
>>>>>
>>>>>     can you please provide more information on how your tests are
>>>>> launched ?
>>>>>
>>>>>     do you
>>>>>
>>>>>     mpirun -np 1 ./a.out
>>>>>
>>>>>     or do you simply
>>>>>
>>>>>     ./a.out
>>>>>
>>>>>
>>>>>     do you use a batch manager ? if yes, which one ?
>>>>>
>>>>>     do you run one test per job ? or multiple tests per job ?
>>>>>
>>>>>     how are these tests launched ?
>>>>>
>>>>>
>>>>>     does the test that crashes use MPI_Comm_spawn ?
>>>>>
>>>>>     i am surprised by the process name [[9325,5754],0], which
>>>>>     suggests MPI_Comm_spawn was called 5753 times (!)
>>>>>
>>>>>
>>>>>     can you also run
>>>>>
>>>>>     hostname
>>>>>
>>>>>     on the 'lorien' host ?
>>>>>
>>>>>     if you configure'd Open MPI with --enable-debug, can you
>>>>>
>>>>>     export OMPI_MCA_plm_base_verbose=5
>>>>>
>>>>>     then run one test and post the logs ?
>>>>>
>>>>>
>>>>>     from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>>>>     produce job family 5576 (but you get 9325)
>>>>>
>>>>>     the discrepancy could be explained by the use of a batch manager
>>>>>     and/or a full hostname i am unaware of.
>>>>>
>>>>>
>>>>>     orte_plm_base_set_hnp_name() generates a 16-bit job family from
>>>>>     the (32-bit hash of the) hostname and the mpirun pid (32 bits ?).
>>>>>
>>>>>     so strictly speaking, it is possible two jobs launched on the
>>>>>     same node are assigned the same 16-bit job family.
>>>>>
>>>>>
>>>>>     the easiest way to detect this could be to
>>>>>
>>>>>     - edit orte/mca/plm/base/plm_base_jobid.c
>>>>>
>>>>>     and replace
>>>>>
>>>>>         OPAL_OUTPUT_VERBOSE((5,
>>>>> orte_plm_base_framework.framework_output,
>>>>>                              "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                              (unsigned long)jobfam));
>>>>>
>>>>>     with
>>>>>
>>>>>         OPAL_OUTPUT_VERBOSE((4,
>>>>> orte_plm_base_framework.framework_output,
>>>>>                              "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                              (unsigned long)jobfam));
>>>>>
>>>>>     configure Open MPI with --enable-debug and rebuild
>>>>>
>>>>>     and then
>>>>>
>>>>>     export OMPI_MCA_plm_base_verbose=4
>>>>>
>>>>>     and run your tests.
>>>>>
>>>>>
>>>>>     when the problem occurs, you will be able to check which pids
>>>>>     produced the faulty jobfam, and that could hint at a conflict.
>>>>>
>>>>>
>>>>>     Cheers,
>>>>>
>>>>>
>>>>>     Gilles
>>>>>
>>>>>
>>>>>
>>>>>     On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>
>>>>>         Hi,
>>>>>
>>>>>         It is the third time this has happened in the last 10 days.
>>>>>
>>>>>         While running nightly tests (~2200), we have one or two tests
>>>>>         that fail at the very beginning with this strange error:
>>>>>
>>>>>         [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>>>         received unexpected process identifier [[9325,0],0] from
>>>>>         [[5590,0],0]
>>>>>
>>>>>         But I can't reproduce the problem right now... i.e., if I
>>>>>         launch this test alone "by hand", it is successful... the
>>>>>         same test was successful yesterday...
>>>>>
>>>>>         Is there some kind of "race condition" that can happen on
>>>>>         the creation of "tmp" files if many tests run together on
>>>>>         the same node? (we are oversubscribing even sequential
>>>>>         runs...)
>>>>>
>>>>>         Here are the build logs:
>>>>>
>>>>>
>>>>> 
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>
>>>>>
>>>>>         Thanks,
>>>>>
>>>>>         Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
