Ralph,

is there any reason to use a session directory based on the jobid (or job
family)?
I mean, could we use mkdtemp() to generate a unique directory, and then
propagate the path via orted communication or the environment?

Cheers,

Gilles

On Wednesday, September 14, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:

> This has nothing to do with PMIx, Josh - the error is coming out of the
> usock OOB component.
>
>
> On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
> Eric,
>
> We are looking into the PMIx code path that sets up the jobid. The session
> directories are created based on the jobid. It might be the case that the
> jobids (generated with rand) happen to be the same for different jobs
> resulting in multiple jobs sharing the same session directory, but we need
> to check. We will update.
>
> Josh
>
> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <Eric.Chamberland@giref.ulaval.ca> wrote:
>
>> Lucky!
>>
>> Since each run has its own TMP, I still have it on disk.
>>
>> for the faulty run, the TMP variable was:
>>
>> TMP=/tmp/tmp.wOv5dkNaSI
>>
>> and in $TMP I have:
>>
>> openmpi-sessions-40031@lorien_0
>>
>> and in this subdirectory I have a bunch of empty dirs:
>>
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>> 1841
>>
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>> total 68
>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>> ...
>>
>> If I do:
>>
>> lsof |grep "openmpi-sessions-40031"
>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
>> /run/user/1000/gvfs
>>       Output information may be incomplete.
>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>       Output information may be incomplete.
>>
>> nothing...
>>
>> What else may I check?
>>
>> Eric
>>
>>
>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>
>>> Hi, Eric
>>>
>>> I **think** this might be related to the following:
>>>
>>> https://github.com/pmix/master/pull/145
>>>
>>> I'm wondering if you can look into the /tmp directory and see if you
>>> have a bunch of stale usock files.
>>>
>>> Best,
>>>
>>> Josh
>>>
>>>
>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>>     Eric,
>>>
>>>
>>>     can you please provide more information on how your tests are
>>>     launched?
>>>
>>>     do you run
>>>
>>>     mpirun -np 1 ./a.out
>>>
>>>     or do you simply run
>>>
>>>     ./a.out
>>>
>>>
>>>     do you use a batch manager? if yes, which one?
>>>
>>>     do you run one test per job, or multiple tests per job?
>>>
>>>     how are these tests launched?
>>>
>>>
>>>     does the test that crashes use MPI_Comm_spawn?
>>>
>>>     I am surprised by the process name [[9325,5754],0], which suggests
>>>     MPI_Comm_spawn was called 5753 times (!)
>>>
>>>
>>>     can you also run
>>>
>>>     hostname
>>>
>>>     on the 'lorien' host?
>>>
>>>     if you configure'd Open MPI with --enable-debug, can you
>>>
>>>     export OMPI_MCA_plm_base_verbose=5
>>>
>>>     then run one test and post the logs?
>>>
>>>
>>>     from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>>     produce job family 5576 (but you get 9325)
>>>
>>>     the discrepancy could be explained by the use of a batch manager
>>>     and/or a full hostname I am unaware of.
>>>
>>>
>>>     orte_plm_base_set_hnp_name() generates a 16-bit job family from the
>>>     (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>>
>>>     so strictly speaking, it is possible for two jobs launched on the
>>>     same node to be assigned the same 16-bit job family.
>>>
>>>
>>>     the easiest way to detect this could be to
>>>
>>>     - edit orte/mca/plm/base/plm_base_jobid.c
>>>
>>>     and replace
>>>
>>>         OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>                              "plm:base:set_hnp_name: final jobfam %lu",
>>>                              (unsigned long)jobfam));
>>>
>>>     with
>>>
>>>         OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>                              "plm:base:set_hnp_name: final jobfam %lu",
>>>                              (unsigned long)jobfam));
>>>
>>>     configure Open MPI with --enable-debug and rebuild
>>>
>>>     and then
>>>
>>>     export OMPI_MCA_plm_base_verbose=4
>>>
>>>     and run your tests.
>>>
>>>
>>>     when the problem occurs, you will be able to check which pids
>>>     produced the faulty jobfam, and that could hint at a conflict.
>>>
>>>
>>>     Cheers,
>>>
>>>
>>>     Gilles
>>>
>>>
>>>
>>>     On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>
>>>         Hi,
>>>
>>>         It is the third time this has happened in the last 10 days.
>>>
>>>         While running nightly tests (~2200), we have one or two tests
>>>         that fail at the very beginning with this strange error:
>>>
>>>         [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>         received unexpected process identifier [[9325,0],0] from
>>>         [[5590,0],0]
>>>
>>>         But I can't reproduce the problem right now... i.e., if I launch
>>>         this test alone "by hand", it succeeds... and the same test was
>>>         successful yesterday...
>>>
>>>         Is there some kind of "race condition" that can happen during the
>>>         creation of "tmp" files if many tests run together on the same
>>>         node? (we are oversubscribing even sequential runs...)
>>>
>>>         Here are the build logs:
>>>
>>>         http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>
>>>         http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>
>>>
>>>         Thanks,
>>>
>>>         Eric
>>>         _______________________________________________
>>>         devel mailing list
>>>         devel@lists.open-mpi.org
>>>         https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>>
>