On Sep 14, 2016, at 7:17 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
Eric,
We are looking into the PMIx code path that sets up the jobid.
The session directories are created based on the jobid. It might
be the case that the jobids (generated with rand) happen to be
the same for different jobs resulting in multiple jobs sharing
the same session directory, but we need to check. We will update.
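To get a feel for how likely such a clash is, here is a rough sketch (not the PMIx code path itself). It assumes ids are drawn uniformly at random from a 16-bit space, which matches the job-family width mentioned elsewhere in this thread but is only an assumption here; collision_probability and its parameters are hypothetical names.

```python
def collision_probability(n_jobs: int, id_space: int = 2**16) -> float:
    """Probability that at least two of n_jobs uniformly random ids coincide.

    Assumption: ids are independent and uniform over id_space values.
    This is the classic birthday bound, not the real jobid generator.
    """
    if n_jobs > id_space:
        return 1.0  # pigeonhole: more jobs than available ids
    p_all_distinct = 1.0
    for i in range(n_jobs):
        # The (i+1)-th job must avoid the i ids already taken.
        p_all_distinct *= (id_space - i) / id_space
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (2, 16, 64, 302):
        p = collision_probability(n)
        print(f"{n:4d} jobs alive at once -> P(collision) = {p:.4f}")
```

Only jobs whose session directories exist at the same time count toward n_jobs, so the relevant number is the degree of concurrency (and possibly any stale directories left behind), not the total number of tests in a night.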
Josh
On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
Lucky!
Since each run has its own TMP, I still have it on disk.
For the faulty run, the TMP variable was:
TMP=/tmp/tmp.wOv5dkNaSI
and in $TMP I have:
openmpi-sessions-40031@lorien_0
and in this subdirectory I have a bunch of empty dirs:
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
1841
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
total 68
drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
...
If I do:
lsof |grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
Output information may be incomplete.
Apart from these warnings, I get nothing...
What else may I check?
Eric
On 14/09/16 08:47 AM, Joshua Ladd wrote:
Hi, Eric
I **think** this might be related to the following:
https://github.com/pmix/master/pull/145
I'm wondering if you can look into the /tmp directory and see if you
have a bunch of stale usock files.
Best,
Josh
On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
Eric,
can you please provide more information on how your tests are launched?
do you run
mpirun -np 1 ./a.out
or do you simply run
./a.out
do you use a batch manager? if yes, which one?
do you run one test per job, or multiple tests per job?
does the test that crashes use MPI_Comm_spawn?
i am surprised by the process name [[9325,5754],0], which suggests
that MPI_Comm_spawn was called 5753 times (!)
can you also run
hostname
on the 'lorien' host ?
if you configured Open MPI with --enable-debug, can you
export OMPI_MCA_plm_base_verbose=5
then run one test and post the logs?
from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
produce job family 5576 (but you get 9325).
the discrepancy could be explained by the use of a batch manager
and/or a full hostname i am unaware of.
orte_plm_base_set_hnp_name() generates a 16-bit job family from the
(32-bit hash of the) hostname and the mpirun (32-bit?) pid.
so strictly speaking, it is possible that two jobs launched on the same
node are assigned the same 16-bit job family.
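To illustrate the pigeonhole argument only: the sketch below folds a 32-bit hostname hash and a pid into 16 bits and lists other pids that land on the same value. The hash (zlib.crc32) and the fold (xor, then keep the low 16 bits) are placeholders chosen for the example and are NOT what orte_plm_base_set_hnp_name() does; toy_jobfam and colliding_pids are hypothetical names, and the values it produces will not match real job families.

```python
import zlib


def toy_jobfam(hostname: str, pid: int) -> int:
    """Hypothetical 16-bit fold of (hostname hash, pid).

    Placeholder only: crc32 and xor-truncate stand in for whatever
    Open MPI really uses. The point is just that the result has 16 bits
    while a pid such as 142766 needs more than 16 bits.
    """
    host_hash = zlib.crc32(hostname.encode()) & 0xFFFFFFFF
    return (host_hash ^ pid) & 0xFFFF


def colliding_pids(hostname: str, pid: int, pid_max: int = 2**18) -> list:
    """Other pids below pid_max (an arbitrary bound) with the same toy value."""
    target = toy_jobfam(hostname, pid)
    return [p for p in range(1, pid_max)
            if p != pid and toy_jobfam(hostname, p) == target]


if __name__ == "__main__":
    print(colliding_pids("lorien", 142766))
```

Whatever the real mixing function is, any mapping from a pid range wider than 16 bits onto 16 bits must send some distinct pids on the same host to the same job family; only which pids collide depends on the function.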
the easiest way to detect this could be to
- edit orte/mca/plm/base/plm_base_jobid.c and replace
OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
"plm:base:set_hnp_name: final jobfam %lu",
(unsigned long)jobfam));
with
OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
"plm:base:set_hnp_name: final jobfam %lu",
(unsigned long)jobfam));
- configure Open MPI with --enable-debug and rebuild
- export OMPI_MCA_plm_base_verbose=4
- run your tests.
when the problem occurs, you will be able to check which pids
produced the faulty jobfam, and that could hint at a conflict.
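Once the logs are collected, the check could be automated along these lines. This is a sketch under an assumption: each relevant line is taken to look like "[host:pid] plm:base:set_hnp_name: final jobfam N", i.e. the usual "[host:pid]" prefix followed by the message from the snippet above. Adjust the pattern if your output differs; duplicated_jobfams is a hypothetical helper, and the sample lines are made up.

```python
import re
from collections import defaultdict

# Assumed line shape: "[host:pid] plm:base:set_hnp_name: final jobfam N".
LINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\].*"
    r"plm:base:set_hnp_name: final jobfam (?P<jobfam>\d+)"
)


def duplicated_jobfams(lines):
    """Map each jobfam reported by more than one (host, pid) to its owners."""
    owners = defaultdict(set)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            owners[int(m["jobfam"])].add((m["host"], int(m["pid"])))
    return {jf: sorted(who) for jf, who in owners.items() if len(who) > 1}


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "[lorien:1001] plm:base:set_hnp_name: final jobfam 42",
        "[lorien:2002] plm:base:set_hnp_name: final jobfam 42",
        "[lorien:3003] plm:base:set_hnp_name: final jobfam 7",
    ]
    for jobfam, who in duplicated_jobfams(sample).items():
        print(jobfam, who)
```

Feeding it the concatenated logs of one nightly run would list every job family that was produced by more than one mpirun pid; whether two such jobs actually overlapped in time still has to be checked separately.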
Cheers,
Gilles
On 9/14/2016 12:35 AM, Eric Chamberland wrote:
Hi,
This is the third time this has happened in the last 10 days.
While running nightly tests (~2200), we have one or two tests
that fail at the very beginning with this strange error:
[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
received unexpected process identifier [[9325,0],0] from [[5590,0],0]
But I can't reproduce the problem right now... i.e., if I launch
this test alone "by hand", it is successful... and the same test was
successful yesterday...
Is there some kind of "race condition" that can happen on the
creation of "tmp" files if many tests run together on the same
node? (we are oversubscribing even sequential runs...)
Here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
Thanks,
Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel