Hi, Eric

I **think** this might be related to the following:
https://github.com/pmix/master/pull/145

I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.

Best,

Josh

On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> Eric,
>
> can you please provide more information on how your tests are launched?
>
> do you
>
> mpirun -np 1 ./a.out
>
> or do you simply
>
> ./a.out
>
> do you use a batch manager? if yes, which one?
>
> do you run one test per job? or multiple tests per job?
>
> how are these tests launched?
>
> does the test that crashes use MPI_Comm_spawn?
>
> I am surprised by the process name [[9325,5754],0], which suggests
> MPI_Comm_spawn was called 5753 times (!)
>
> can you also run
>
> hostname
>
> on the 'lorien' host?
>
> if you configure'd Open MPI with --enable-debug, can you
>
> export OMPI_MCA_plm_base_verbose=5
>
> then run one test and post the logs?
>
> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce
> job family 5576 (but you get 9325).
>
> the discrepancy could be explained by the use of a batch manager and/or a
> full hostname I am unaware of.
>
> orte_plm_base_set_hnp_name() generates a 16-bit job family from the
> (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>
> so strictly speaking, it is possible that two jobs launched on the same
> node are assigned the same 16-bit job family.
>
> the easiest way to detect this could be to
>
> - edit orte/mca/plm/base/plm_base_jobid.c
>
> and replace
>
> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>                      "plm:base:set_hnp_name: final jobfam %lu",
>                      (unsigned long)jobfam));
>
> with
>
> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>                      "plm:base:set_hnp_name: final jobfam %lu",
>                      (unsigned long)jobfam));
>
> - configure Open MPI with --enable-debug and rebuild
>
> and then
>
> export OMPI_MCA_plm_base_verbose=4
>
> and run your tests.
>
> when the problem occurs, you will be able to check which pids produced the
> faulty jobfam, and that could hint at a conflict.
>
> Cheers,
>
> Gilles
>
>
> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>
>> Hi,
>>
>> It is the third time this has happened in the last 10 days.
>>
>> While running nightly tests (~2200), we have one or two tests that fail
>> at the very beginning with this strange error:
>>
>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
>> unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>
>> But I can't reproduce the problem right now... i.e. if I launch this test
>> alone "by hand", it is successful... and the same test was successful
>> yesterday...
>>
>> Is there some kind of "race condition" that can happen on the creation of
>> "tmp" files if many tests run together on the same node? (we are
>> oversubscribing even sequential runs...)
>>
>> Here are the build logs:
>>
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>
>> Thanks,
>>
>> Eric
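To illustrate Gilles' point above that two jobs launched on the same node can end up with the same 16-bit job family, here is a rough standalone sketch. It is not the actual Open MPI code: the string hash and the 32-to-16-bit fold below are illustrative assumptions standing in for OPAL_HASH_STR and the logic in orte/mca/plm/base/plm_base_jobid.c.

/* Sketch only: shows how folding a 32-bit (hostname hash ^ pid) value
 * into 16 bits lets two different mpirun pids on the same host end up
 * with the same job family.  The hash and the fold are assumptions,
 * not the real plm_base_jobid.c implementation. */
#include <stdint.h>
#include <stdio.h>

static uint32_t hash_str(const char *s)   /* placeholder hash, not OPAL_HASH_STR */
{
    uint32_t h = 0;
    while (*s) {
        h = h * 31u + (unsigned char)*s++;
    }
    return h;
}

static uint16_t jobfam(const char *nodename, uint32_t pid)
{
    uint32_t h = hash_str(nodename) ^ pid;           /* mix hostname and pid */
    return (uint16_t)((h >> 16) ^ (h & 0xffffu));    /* fold 32 bits into 16 */
}

int main(void)
{
    const char *node = "lorien";
    uint32_t pid_a = 142766;
    /* Flipping the same bit in both 16-bit halves cancels out in the fold,
     * so this second (hypothetical) pid collides with the first one. */
    uint32_t pid_b = pid_a ^ 0x00010001u;
    printf("pid %u -> jobfam %u\n", (unsigned)pid_a, (unsigned)jobfam(node, pid_a));
    printf("pid %u -> jobfam %u\n", (unsigned)pid_b, (unsigned)jobfam(node, pid_b));
    return 0;
}

With only 65536 possible job families, any two mpirun instances whose folded values coincide (as the two pids above do by construction) collide, which is exactly the kind of conflict the verbose "final jobfam" logging suggested above would expose.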