Ralph,
my reply is inline below.

On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
If we are going to make a change, then let’s do it only once. Since we introduced PMIx and the concept of the string namespace, the plan has been to switch away from a numerical jobid and to the namespace. This eliminates the issue of the hash altogether. If we are going to make a disruptive change, then let’s do that one. Either way, this isn’t something that could go into the 2.x series. It is far too invasive, and would have to be delayed until a 3.x at the earliest.
Got it!
Note that I am not yet convinced that is the issue here. We’ve had this hash for 12 years, and this is the first time someone has claimed to see a problem. That makes me very suspicious that the root cause isn’t what you are pursuing. This is only being reported for _singletons_, and that is a very unique code path. The only reason for launching the orted is to support PMIx operations such as notification and comm_spawn. If those aren’t being used, then we could use the “isolated” mode where the usock OOB isn’t even activated, thus eliminating the problem. This would be a much smaller “fix” and could potentially fit into 2.x
A bug has been identified and fixed; let's wait and see how things go.

How can I use the isolated mode? Shall I simply
export OMPI_MCA_pmix=isolated
export OMPI_MCA_plm=isolated
?

Out of curiosity, does "isolated" mean we would not even need to fork the HNP?
FWIW: every organization I’ve worked with has an epilog script that blows away temp dirs. It isn’t the RM-based environment that is of concern - it’s the non-RM one where epilog scripts don’t exist that is the problem.
Well, I was looking at this the other way around. If mpirun/orted creates the session directory with mkstemp(), then there is no more need for any cleanup (as long as you do not run out of disk space). With direct run, however, there is always a small risk that a previous session directory gets reused, hence the requirement for an epilog. Also, if the RM is configured to run one job at a time on a given node, the epilog can be quite trivial; but if several jobs can run on a given node at the same time, the epilog becomes less trivial.
Cheers, Gilles
On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Ralph,

On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:

Many things are possible, given infinite time :-)

I could not agree more :-D

The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so the apps must be able to generate a name that they all can know. Otherwise, we lose shared memory support because they can't rendezvous.

Thanks for the explanation, now let me rephrase that: "an MPI task must be able to rebuild the path to the session directory based on the information it has when launched. If mpirun is used, we have several options to pass this information to the MPI tasks. In the case of direct run, this info is unlikely to be passed by the batch manager (PMIx is not universal yet), so we have to use what is available."

My concern is that, to keep things simple, the session directory is based on the Open MPI jobid, and since the stepid is zero most of the time, the jobid really means the job family, which is stored on 16 bits.

In the case of mpirun, the jobfam is a 16-bit hash of the hostname (a reasonably sized string) and the mpirun pid (32 bits on Linux). If several mpirun are invoked on the same host at the same time, there is a risk that two distinct jobs are assigned the same jobfam (since we hash from 32 bits down to 16 bits).
Also, there is a risk that the session directory already exists from a previous job, with some or all files and unix sockets left over from that job, leading to undefined behavior (an early crash if we are lucky, odd things otherwise). In the case of direct run, I guess the jobfam is a 16-bit hash of the jobid passed by the RM, and once again there is a risk of a conflict and/or the reuse of a previous session directory.

To me, the issue here is that we are using the Open MPI jobfam in order to build the session directory path. Instead, what if we
1) with mpirun, use a session directory created by mkstemp(), and pass it to the MPI tasks via the environment, or retrieve it from orted/mpirun right after the communication has been established;
2) for direct run, use a session directory based on the full jobid (which might be a string or a number) as passed by the RM.

In case 1), there is no more risk of a hash conflict or of reusing a previous session directory. In case 2), there is no more risk of a hash conflict, but there is still a risk of reusing a session directory from a previous (e.g. terminated) job. That being said, once we document how the session directory is built from the jobid, sysadmins will be able to write an RM epilog that removes the session directory.

Does that make sense?

However, that doesn't seem to be the root problem here. I suspect there is a bug in the code that spawns the orted from the singleton, and subsequently parses the returned connection info. If you look at the error, you'll see that both jobids have "zero" for their local jobid. This means that the two procs attempting to communicate both think they are daemons, which is impossible in this scenario. So something garbled the string that the orted returns on startup to the singleton, and/or the singleton is parsing it incorrectly.
IIRC, the singleton gets its name from that string, and so I expect it is getting the wrong name - and hence the error.

I will investigate that.

As you may recall, you made a change a little while back where we modified the code in ess/singleton to be a little less strict in its checking of that returned string. I wonder if that is biting us here? It wouldn't fix the problem, but might generate a different error at a more obvious place.

Do you mean https://github.com/open-mpi/ompi/commit/93e73841f9ec3739dec209365979e2badec0740f ? This has not been backported to v2.x, and the issue was reported on v2.x.

Cheers, Gilles

On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Ralph,

Is there any reason to use a session directory based on the jobid (or job family)? I mean, could we use mkstemp to generate a unique directory, and then propagate the path via orted comm or the environment?

Cheers, Gilles

On Wednesday, September 14, 2016, r...@open-mpi.org wrote:

This has nothing to do with PMIx, Josh - the error is coming out of the usock OOB component.

On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:

Eric,

We are looking into the PMIx code path that sets up the jobid. The session directories are created based on the jobid. It might be the case that the jobids (generated with rand) happen to be the same for different jobs, resulting in multiple jobs sharing the same session directory, but we need to check. We will update.

Josh

On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

Lucky! Since each run has a specific TMP, I still have it on disk.
For the faulty run, the TMP variable was TMP=/tmp/tmp.wOv5dkNaSI, and in $TMP I have openmpi-sessions-40031@lorien_0; in this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
1841
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
total 68
drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
...

If I do lsof | grep "openmpi-sessions-40031":

lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
  Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
  Output information may be incomplete.

nothing...

What else may I check?

Eric

On 14/09/16 08:47 AM, Joshua Ladd wrote:

Hi, Eric

I **think** this might be related to the following: https://github.com/pmix/master/pull/145

I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.

Best, Josh

On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Eric,

Can you please provide more information on how your tests are launched? Do you "mpirun -np 1 ./a.out", or do you simply "./a.out"? Do you use a batch manager? If yes, which one? Do you run one test per job, or multiple tests per job, and how are these tests launched? Does the test that crashes use MPI_Comm_spawn? I am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)
Can you also run hostname on the 'lorien' host? If you configured Open MPI with --enable-debug, can you export OMPI_MCA_plm_base_verbose=5, then run one test and post the logs?

From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325). The discrepancy could be explained by the use of a batch manager and/or a full hostname I am unaware of.

orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid, so strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family. The easiest way to detect this could be to edit orte/mca/plm/base/plm_base_jobid.c and replace

  OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                       "plm:base:set_hnp_name: final jobfam %lu",
                       (unsigned long)jobfam));

with

  OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                       "plm:base:set_hnp_name: final jobfam %lu",
                       (unsigned long)jobfam));

then configure Open MPI with --enable-debug, rebuild, export OMPI_MCA_plm_base_verbose=4, and run your tests. When the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.

Cheers, Gilles

On 9/14/2016 12:35 AM, Eric Chamberland wrote:

Hi,

It is the third time this has happened in the last 10 days. While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:

[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

But I can't reproduce the problem right now... i.e. if I launch this test alone "by hand", it is successful... and the same test was successful yesterday... Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (we are oversubscribing even sequential runs...)
Here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt

Thanks,
Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel