If we are going to make a change, then let’s do it only once. Since we 
introduced PMIx and the concept of the string namespace, the plan has been to 
switch away from a numerical jobid and to the namespace. This eliminates the 
issue of the hash altogether. If we are going to make a disruptive change, then 
let’s do that one. Either way, this isn’t something that could go into the 2.x 
series. It is far too invasive, and would have to be delayed until a 3.x at the 
earliest.

Note that I am not yet convinced that is the issue here. We’ve had this hash 
for 12 years, and this is the first time someone has claimed to see a problem. 
That makes me very suspicious that the root cause isn’t what you are pursuing. 
This is only being reported for _singletons_, and that is a very unusual code 
path. The only reason for launching the orted is to support PMIx operations 
such as notification and comm_spawn. If those aren’t being used, then we could 
use the “isolated” mode where the usock OOB isn’t even activated, thus 
eliminating the problem. This would be a much smaller “fix” and could 
potentially fit into 2.x.

FWIW: every organization I’ve worked with has an epilog script that blows away 
temp dirs. It isn’t the RM-based environment that is of concern - it’s the 
non-RM one where epilog scripts don’t exist that is the problem.


> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Ralph,
> 
> On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
>> Many things are possible, given infinite time :-)
>> 
> I could not agree more :-D
>> The issue with this notion lies in direct launch scenarios - i.e., when 
>> procs are launched directly by the RM and not via mpirun. In this case, 
>> there is nobody who can give us the session directory (well, until PMIx 
>> becomes universal), and so the apps must be able to generate a name that 
>> they all can know. Otherwise, we lose shared memory support because they 
>> can’t rendezvous.
> thanks for the explanation,
> now let me rephrase that
> "a MPI task must be able to rebuild the path to the session directory, based 
> on the information it has when launched.
> if mpirun is used, we have several options to pass this option to the MPI 
> tasks.
> in case of direct run, this info is unlikely (PMIx is not universal (yet)) 
> passed by the batch manager, so we have to use what is available"
> 
> my concern is that, to keep things simple, the session directory is based on 
> the Open MPI jobid, and since the stepid is zero most of the time, the jobid 
> really means the job family, which is stored in 16 bits.
> 
> in the case of mpirun, the jobfam is a 16-bit hash of the hostname (a 
> reasonably sized string) and the mpirun pid (32 bits on Linux).
> if several mpiruns are invoked on the same host at a given time, there is a 
> risk that two distinct jobs are assigned the same jobfam (since we hash from 
> 32 bits down to 16 bits).
> also, there is a risk that the session directory already exists from a 
> previous job, with some/all files and unix sockets from that previous job, 
> leading to undefined behavior
> (an early crash if we are lucky, odd things otherwise).
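> 
> just to illustrate the collision risk, here is a rough, self-contained sketch 
> (the hash16() helper below is made up for the example; it is not the actual 
> hash from plm_base_jobid.c):
> 
>     #include <stdio.h>
>     #include <stdint.h>
> 
>     /* made-up 16-bit hash of (hostname, pid), for illustration only */
>     static uint16_t hash16(const char *hostname, uint32_t pid)
>     {
>         uint32_t h = 5381;
>         for (const char *p = hostname; *p != '\0'; p++) {
>             h = h * 33u + (unsigned char)*p;    /* djb2 over the hostname */
>         }
>         h ^= pid;                               /* fold in the pid */
>         return (uint16_t)(h ^ (h >> 16));       /* squeeze 32 bits into 16 */
>     }
> 
>     int main(void)
>     {
>         /* with only 65536 possible jobfam values, some other pid on the
>          * same host is bound to map to the same one */
>         uint16_t target = hash16("lorien", 142766);
>         for (uint32_t pid = 1; pid < (1u << 20); pid++) {
>             if (pid != 142766 && hash16("lorien", pid) == target) {
>                 printf("pid %u collides with pid 142766 (jobfam %u)\n",
>                        (unsigned)pid, (unsigned)target);
>                 return 0;
>             }
>         }
>         return 0;
>     }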
> 
> in the case of a direct run, I guess the jobfam is a 16-bit hash of the jobid 
> passed by the RM, and once again, there is a risk of a conflict and/or the 
> re-use of a previous session directory.
> 
> to me, the issue here is that we are using the Open MPI jobfam to build the 
> session directory path.
> instead, what if we
> 1) when mpirun is used, use a session directory created by mkstemp(), and 
> pass it to the MPI tasks via the environment, or retrieve it from 
> orted/mpirun right after the communication has been established.
> 2) for a direct run, use a session directory based on the full jobid (which 
> might be a string or a number) as passed by the RM
> 
> in case of 1), there is no longer any risk of a hash conflict or of re-using 
> a previous session directory (see the sketch below).
> in case of 2), there is no longer any risk of a hash conflict, but there is 
> still a risk of re-using a session directory from a previous (e.g. 
> terminated) job.
> that being said, once we document how the session directory is built from the 
> jobid, sysadmins will be able to write an RM epilog that removes the session 
> directory.
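> 
> as a rough sketch of option 1) (note that mkdtemp() is the directory-creating 
> cousin of mkstemp(); the OMPI_SESSION_DIR variable name is made up for the 
> example, and the real propagation would go through the launch environment or 
> the orted/mpirun startup handshake):
> 
>     #define _XOPEN_SOURCE 700    /* for mkdtemp() and setenv() */
>     #include <stdio.h>
>     #include <stdlib.h>
> 
>     int main(void)
>     {
>         /* the template must end in XXXXXX; mkdtemp() replaces it and
>          * creates a unique directory with mode 0700 */
>         char template[] = "/tmp/ompi-session-XXXXXX";
>         char *sessdir = mkdtemp(template);
>         if (NULL == sessdir) {
>             perror("mkdtemp");
>             return 1;
>         }
>         /* made-up variable name: export it so the MPI tasks can
>          * rendezvous in the same directory */
>         setenv("OMPI_SESSION_DIR", sessdir, 1);
>         printf("session directory: %s\n", sessdir);
>         return 0;
>     }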
> 
> does that make sense?
>> 
>> However, that doesn’t seem to be the root problem here. I suspect there is a 
>> bug in the code that spawns the orted from the singleton, and subsequently 
>> parses the returned connection info. If you look at the error, you’ll see 
>> that both jobids have “zero” for their local jobid. This means that the two 
>> procs attempting to communicate both think they are daemons, which is 
>> impossible in this scenario.
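>> 
>> (For reference, a rough sketch of how such a name decomposes: the job family 
>> sits in the upper 16 bits of the jobid and the local jobid in the lower 16, 
>> with local jobid 0 used for the daemon job. Illustrative only, not the 
>> actual ORTE macros.)
>> 
>>     #include <stdio.h>
>>     #include <stdint.h>
>> 
>>     /* decode the "[[family,local],vpid]" jobid part of an ORTE name */
>>     static void decode(uint16_t family, uint16_t local)
>>     {
>>         uint32_t jobid = ((uint32_t)family << 16) | local;
>>         printf("jobid 0x%08x -> family %u, local job %u (%s)\n",
>>                (unsigned)jobid, (unsigned)family, (unsigned)local,
>>                local == 0 ? "daemon job" : "application job");
>>     }
>> 
>>     int main(void)
>>     {
>>         decode(9325, 0);    /* [[9325,0],0] from the error message */
>>         decode(5590, 0);    /* [[5590,0],0] from the error message */
>>         decode(9325, 5754); /* [[9325,5754],0], the reporting process */
>>         return 0;
>>     }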
>> 
>> So something garbled the string that the orted returns on startup to the 
>> singleton, and/or the singleton is parsing it incorrectly. IIRC, the 
>> singleton gets its name from that string, and so I expect it is getting the 
>> wrong name - and hence the error.
>> 
> I will investigate that.
>> As you may recall, you made a change a little while back where we modified 
>> the code in ess/singleton to be a little less strict in its checking of that 
>> returned string. I wonder if that is biting us here? It wouldn’t fix the 
>> problem, but might generate a different error at a more obvious place.
>> 
> do you mean 
> https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f ?
> this has not been backported to v2.x, and the issue was reported on v2.x
> 
> 
> Cheers,
> 
> Gilles
>> 
>>> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>> 
>>> Ralph,
>>> 
>>> is there any reason to use a session directory based on the jobid (or job 
>>> family)?
>>> I mean, could we use mkstemp to generate a unique directory, and then 
>>> propagate the path via orted comm or the environment?
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On Wednesday, September 14, 2016, r...@open-mpi.org wrote:
>>> This has nothing to do with PMIx, Josh - the error is coming out of the 
>>> usock OOB component.
>>> 
>>> 
>>>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>> 
>>>> Eric,
>>>> 
>>>> We are looking into the PMIx code path that sets up the jobid. The session 
>>>> directories are created based on the jobid. It might be the case that the 
>>>> jobids (generated with rand) happen to be the same for different jobs 
>>>> resulting in multiple jobs sharing the same session directory, but we need 
>>>> to check. We will update.
>>>> 
>>>> Josh
>>>> 
>>>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>> Lucky!
>>>> 
>>>> Since each run has a specific TMP, I still have it on disk.
>>>> 
>>>> for the faulty run, the TMP variable was:
>>>> 
>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>> 
>>>> and in $TMP I have:
>>>> 
>>>> openmpi-sessions-40031@lorien_0
>>>> 
>>>> and in this subdirectory I have a bunch of empty dirs:
>>>> 
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
>>>> 1841
>>>> 
>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
>>>> total 68
>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>> ...
>>>> 
>>>> If I do:
>>>> 
>>>> lsof |grep "openmpi-sessions-40031"
>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>>       Output information may be incomplete.
>>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>>       Output information may be incomplete.
>>>> 
>>>> nothing...
>>>> 
>>>> What else may I check?
>>>> 
>>>> Eric
>>>> 
>>>> 
>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>> Hi, Eric
>>>> 
>>>> I **think** this might be related to the following:
>>>> 
>>>> https://github.com/pmix/master/pull/145
>>>> 
>>>> I'm wondering if you can look into the /tmp directory and see if you
>>>> have a bunch of stale usock files.
>>>> 
>>>> Best,
>>>> 
>>>> Josh
>>>> 
>>>> 
>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>> 
>>>>     Eric,
>>>> 
>>>> 
>>>>     can you please provide more information on how your tests are launched?
>>>> 
>>>>     do you
>>>> 
>>>>     mpirun -np 1 ./a.out
>>>> 
>>>>     or do you simply
>>>> 
>>>>     ./a.out
>>>> 
>>>> 
>>>>     do you use a batch manager ? if yes, which one ?
>>>> 
>>>>     do you run one test per job ? or multiple tests per job ?
>>>> 
>>>>     how are these tests launched ?
>>>> 
>>>> 
>>>>     do the test that crashes use MPI_Comm_spawn ?
>>>> 
>>>>     I am surprised by the process name [[9325,5754],0], which suggests
>>>>     MPI_Comm_spawn was called 5753 times (!)
>>>> 
>>>> 
>>>>     can you also run
>>>> 
>>>>     hostname
>>>> 
>>>>     on the 'lorien' host ?
>>>> 
>>>>     if you configure'd Open MPI with --enable-debug, can you
>>>> 
>>>>     export OMPI_MCA_plm_base_verbose=5
>>>> 
>>>>     then run one test and post the logs ?
>>>> 
>>>> 
>>>>     from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>>>     produce job family 5576 (but you get 9325)
>>>> 
>>>>     the discrepancy could be explained by the use of a batch manager
>>>>     and/or a full hostname I am unaware of.
>>>> 
>>>> 
>>>>     orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>>>>     (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>>> 
>>>>     so strictly speaking, it is possible that two jobs launched on the
>>>>     same node are assigned the same 16-bit job family.
>>>> 
>>>> 
>>>>     the easiest way to detect this could be to
>>>> 
>>>>     - edit orte/mca/plm/base/plm_base_jobid.c
>>>> 
>>>>     and replace
>>>> 
>>>>         OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>                              "plm:base:set_hnp_name: final jobfam %lu",
>>>>                              (unsigned long)jobfam));
>>>> 
>>>>     with
>>>> 
>>>>         OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>                              "plm:base:set_hnp_name: final jobfam %lu",
>>>>                              (unsigned long)jobfam));
>>>> 
>>>>     configure Open MPI with --enable-debug and rebuild
>>>> 
>>>>     and then
>>>> 
>>>>     export OMPI_MCA_plm_base_verbose=4
>>>> 
>>>>     and run your tests.
>>>> 
>>>> 
>>>>     when the problem occurs, you will be able to check which pids
>>>>     produced the faulty jobfam, and that could hint at a conflict.
>>>> 
>>>> 
>>>>     Cheers,
>>>> 
>>>> 
>>>>     Gilles
>>>> 
>>>> 
>>>> 
>>>>     On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>> 
>>>>         Hi,
>>>> 
>>>>         It is the third time this has happened in the last 10 days.
>>>> 
>>>>         While running nightly tests (~2200), we have one or two tests
>>>>         that fail at the very beginning with this strange error:
>>>> 
>>>>         [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>>         received unexpected process identifier [[9325,0],0] from
>>>>         [[5590,0],0]
>>>> 
>>>>         But I can't reproduce the problem right now... ie: If I launch
>>>>         this test alone "by hand", it is successful... the same test was
>>>>         successful yesterday...
>>>> 
>>>>         Is there some kind of "race condition" that can happen on the
>>>>         creation of "tmp" files if many tests run together on the same
>>>>         node? (we are oversubscribing even sequential runs...)
>>>> 
>>>>         Here are the build logs:
>>>> 
>>>>         http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>> 
>>>>         http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>> 
>>>> 
>>>>         Thanks,
>>>> 
>>>>         Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
