Ralph, that looks good to me.
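just so we are picturing the same flow for the mpirun/orted case, here is a rough sketch of what i have in mind (mkdtemp() rather than mkstemp(), since we want a directory and not a file; the OMPI_SESSION_DIR name is only a placeholder i made up for the example, not an actual Open MPI variable):

    #define _XOPEN_SOURCE 700   /* for mkdtemp() and setenv() */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char template[] = "/tmp/ompi-session-XXXXXX";
        char *dir = mkdtemp(template);   /* creates a unique directory, mode 0700 */

        if (NULL == dir) {
            perror("mkdtemp");
            return 1;
        }
        /* mpirun/orted would export this to the tasks it forks, so all local
         * tasks rendezvous in the same place - no jobid hash involved.
         * OMPI_SESSION_DIR is a made-up placeholder, not a real OMPI knob. */
        setenv("OMPI_SESSION_DIR", dir, 1);
        printf("session directory: %s\n", dir);
        return 0;
    }

the tasks would then simply read that variable back instead of rebuilding a path from the (possibly colliding) 16 bit jobfam.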
can you please remind me how to test if an app was launched by mpirun/orted or direct launched by the RM? right now, which direct launch methods are supported? i am aware of srun (SLURM) and aprun (Cray), are there any others?
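fwiw, the quick check i have in mind is something along these lines - i am only assuming the usual environment markers here (OMPI_COMM_WORLD_RANK set by mpirun/orted for its children, SLURM_PROCID set by srun, ALPS_APP_PE set by aprun), and the list is surely incomplete, hence my question:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* check the OMPI marker first: under mpirun inside a SLURM allocation,
         * some SLURM_* variables can be inherited as well */
        if (NULL != getenv("OMPI_COMM_WORLD_RANK")) {
            printf("launched by mpirun/orted\n");
        } else if (NULL != getenv("SLURM_PROCID")) {
            printf("direct launched by srun (SLURM)\n");
        } else if (NULL != getenv("ALPS_APP_PE")) {
            printf("direct launched by aprun (Cray ALPS)\n");
        } else {
            printf("singleton, or a launcher i did not think of\n");
        }
        return 0;
    }

inside Open MPI itself the real answer is presumably whichever ess component gets selected, this is just the shell-level view.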
Cheers,

Gilles

On Thu, Sep 15, 2016 at 7:10 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>
> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
> my reply is in the text
>
> On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
>
> If we are going to make a change, then let’s do it only once. Since we introduced PMIx and the concept of the string namespace, the plan has been to switch away from a numerical jobid and to the namespace. This eliminates the issue of the hash altogether. If we are going to make a disruptive change, then let’s do that one. Either way, this isn’t something that could go into the 2.x series. It is far too invasive, and would have to be delayed until a 3.x at the earliest.
>
> got it !
>
> Note that I am not yet convinced that is the issue here. We’ve had this hash for 12 years, and this is the first time someone has claimed to see a problem. That makes me very suspicious that the root cause isn’t what you are pursuing. This is only being reported for _singletons_, and that is a very unique code path. The only reason for launching the orted is to support PMIx operations such as notification and comm_spawn. If those aren’t being used, then we could use the “isolated” mode where the usock OOB isn’t even activated, thus eliminating the problem. This would be a much smaller “fix” and could potentially fit into 2.x
>
> a bug has been identified and fixed, let's wait and see how things go
>
> how can i use the isolated mode ?
> shall i simply
> export OMPI_MCA_pmix=isolated
> export OMPI_MCA_plm=isolated
> ?
>
> out of curiosity, does "isolated" mean we would not even need to fork the HNP ?
>
> Yes - that’s the idea. Simplify and make things faster. All you have to do is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that code is in 2.x as well
>
> FWIW: every organization I’ve worked with has an epilog script that blows away temp dirs. It isn’t the RM-based environment that is of concern - it’s the non-RM one where epilog scripts don’t exist that is the problem.
>
> well, i was looking at this the other way around.
> if mpirun/orted creates the session directory with mkstemp(), then there is no more need to do any cleanup (as long as you do not run out of disk space)
> but with direct run, there is always a little risk that a previous session directory is used, hence the requirement for an epilogue.
> also, if the RM is configured to run one job at a time per a given node, the epilog can be quite trivial.
> but if several jobs can run on a given node at the same time, the epilog becomes less trivial
>
> Yeah, this session directory thing has always been problematic. We’ve had litter problems since day one, and tried multiple solutions over the years. Obviously, none of those has proven fully successful :-(
>
> Artem came up with a good solution using PMIx that allows the RM to control the session directory location for both direct launch and mpirun launch, thus ensuring the RM can clean up the correct place upon session termination. As we get better adoption of that method out there, then the RM-based solution (even for multiple jobs sharing a node) should be resolved.
>
> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem. Your proposal would resolve that one as (a) we always have orted’s in that scenario, and (b) the orted’s pass the session directory down to the apps. So maybe the right approach is to use mkstemp() in the scenario where we are launched via orted and the RM has not specified a session directory.
>
> I’m not sure we can resolve the direct launch without PMIx problem - I think that’s best left as another incentive for RMs to get on board the PMIx bus.
>
> Make sense?
>
> Cheers,
>
> Gilles
>
> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
> On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
>
> Many things are possible, given infinite time :-)
>
> i could not agree more :-D
>
> The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so the apps must be able to generate a name that they all can know. Otherwise, we lose shared memory support because they can’t rendezvous.
>
> thanks for the explanation,
> now let me rephrase that:
> "an MPI task must be able to rebuild the path to the session directory, based on the information it has when launched.
> if mpirun is used, we have several options to pass this information to the MPI tasks.
> in the case of direct run, this info is unlikely to be passed by the batch manager (PMIx is not universal (yet)), so we have to use what is available"
>
> my concern is that, to keep things simple, the session directory is based on the Open MPI jobid, and since the stepid is zero most of the time, the jobid really means the job family, which is stored on 16 bits.
>
> in the case of mpirun, jobfam is a 16 bit hash of the hostname (a reasonably sized string) and the mpirun pid (32 bits on Linux).
> if several mpirun are invoked on the same host at a given time, there is a risk two distinct jobs are assigned the same jobfam (since we hash from 32 bits down to 16 bits).
> also, there is a risk the session directory already exists from a previous job, with some/all files and unix sockets from a previous job, leading to undefined behavior (an early crash if we are lucky, odd things otherwise).
>
> in the case of direct run, i guess jobfam is a 16 bit hash of the jobid passed by the RM, and once again, there is a risk of a conflict and/or the re-use of a previous session directory.
>
> to me, the issue here is that we are using the Open MPI jobfam in order to build the session directory path.
> instead, what if we
> 1) when using mpirun, use a session directory created by mkstemp(), and pass it to the MPI tasks via the environment, or retrieve it from orted/mpirun right after the communication has been established.
> 2) for direct run, use a session directory based on the full jobid (which might be a string or a number) as passed by the RM
>
> in case of 1), there is no more risk of a hash conflict, or of re-using a previous session directory.
> in case of 2), there is no more risk of a hash conflict, but there is still a risk of re-using a session directory from a previous (e.g. terminated) job.
> that being said, once we document how the session directory is built from the jobid, sysadmins will be able to write a RM epilog that removes the session directory.
>
> does that make sense ?
>
> However, that doesn’t seem to be the root problem here.
> I suspect there is a bug in the code that spawns the orted from the singleton, and subsequently parses the returned connection info. If you look at the error, you’ll see that both jobid’s have “zero” for their local jobid. This means that the two procs attempting to communicate both think they are daemons, which is impossible in this scenario.
>
> So something garbled the string that the orted returns on startup to the singleton, and/or the singleton is parsing it incorrectly. IIRC, the singleton gets its name from that string, and so I expect it is getting the wrong name - and hence the error.
>
> i will investigate that.
>
> As you may recall, you made a change a little while back where we modified the code in ess/singleton to be a little less strict in its checking of that returned string. I wonder if that is biting us here? It wouldn’t fix the problem, but might generate a different error at a more obvious place.
>
> do you mean
> https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f
> ?
> this has not been backported to v2.x, and the issue was reported on v2.x
>
> Cheers,
>
> Gilles
>
> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> is there any reason to use a session directory based on the jobid (or job family) ?
> I mean, could we use mkstemp to generate a unique directory, and then propagate the path via orted comm or the environment ?
>
> Cheers,
>
> Gilles
>
> On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>
>> This has nothing to do with PMIx, Josh - the error is coming out of the usock OOB component.
>>
>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>> Eric,
>>
>> We are looking into the PMIx code path that sets up the jobid. The session directories are created based on the jobid. It might be the case that the jobids (generated with rand) happen to be the same for different jobs, resulting in multiple jobs sharing the same session directory, but we need to check. We will update.
>>
>> Josh
>>
>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>
>>> Lucky!
>>>
>>> Since each run has a specific TMP, I still have it on disc.
>>>
>>> for the faulty run, the TMP variable was:
>>>
>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>
>>> and into $TMP I have:
>>>
>>> openmpi-sessions-40031@lorien_0
>>>
>>> and into this subdirectory I have a bunch of empty dirs:
>>>
>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
>>> 1841
>>>
>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
>>> total 68
>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>> ...
>>>
>>> If I do:
>>>
>>> lsof |grep "openmpi-sessions-40031"
>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>> Output information may be incomplete.
>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>> Output information may be incomplete.
>>>
>>> nothing...
>>>
>>> What else may I check?
>>>
>>> Eric
>>>
>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>
>>>> Hi, Eric
>>>>
>>>> I **think** this might be related to the following:
>>>>
>>>> https://github.com/pmix/master/pull/145
>>>>
>>>> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
>>>>
>>>> Best,
>>>>
>>>> Josh
>>>>
>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>
>>>> Eric,
>>>>
>>>> can you please provide more information on how your tests are launched ?
>>>>
>>>> do you
>>>> mpirun -np 1 ./a.out
>>>> or do you simply
>>>> ./a.out
>>>>
>>>> do you use a batch manager ? if yes, which one ?
>>>> do you run one test per job ? or multiple tests per job ?
>>>> how are these tests launched ?
>>>>
>>>> does the test that crashes use MPI_Comm_spawn ?
>>>> i am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)
>>>>
>>>> can you also run
>>>> hostname
>>>> on the 'lorien' host ?
>>>>
>>>> if you configured Open MPI with --enable-debug, can you
>>>> export OMPI_MCA_plm_base_verbose=5
>>>> then run one test and post the logs ?
>>>>
>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325)
>>>> the discrepancy could be explained by the use of a batch manager and/or a full hostname i am unaware of.
>>>>
>>>> orte_plm_base_set_hnp_name() generates a 16 bits job family from the (32 bits hash of the) hostname and the mpirun (32 bits ?) pid.
>>>> so strictly speaking, it is possible two jobs launched on the same node are assigned the same 16 bits job family.
>>>>
>>>> the easiest way to detect this could be to
>>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>> and replace
>>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>                          (unsigned long)jobfam));
>>>> with
>>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>                          (unsigned long)jobfam));
>>>> - configure Open MPI with --enable-debug and rebuild
>>>> - and then
>>>> export OMPI_MCA_plm_base_verbose=4
>>>> and run your tests.
>>>>
>>>> when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint to a conflict.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>
>>>> Hi,
>>>>
>>>> It is the third time this has happened in the last 10 days.
>>>>
>>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>>
>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>
>>>> But I can't reproduce the problem right now... ie: If I launch this test alone "by hand", it is successful... the same test was successful yesterday...
>>>>
>>>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (we are oversubscribing even sequential runs...)
>>>>
>>>> Here are the build logs:
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>
>>>> Thanks,
>>>>
>>>> Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel