Hi, Eric

I **think** this might be related to the following:

https://github.com/pmix/master/pull/145

I'm wondering if you can look into the /tmp directory and see if you have a
bunch of stale usock files.
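
If it helps to check quickly, here is a rough sketch (my own illustration, not
an Open MPI tool) that lists anything under /tmp whose name contains "usock"
or looks like an Open MPI session directory. The name patterns are assumptions
on my part, since the exact rendezvous-file layout depends on the OMPI/PMIx
version:

    /* Illustrative sketch only: scan /tmp for names that look like stale
     * Open MPI/PMIx rendezvous artifacts. The "usock" and "openmpi-sessions"
     * patterns are assumptions, not a definitive naming scheme. */
    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>

    int main(void)
    {
        DIR *d = opendir("/tmp");
        if (!d) {
            perror("opendir(/tmp)");
            return 1;
        }
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (strstr(e->d_name, "usock") != NULL ||
                strstr(e->d_name, "openmpi-sessions") != NULL) {
                printf("/tmp/%s\n", e->d_name);
            }
        }
        closedir(d);
        return 0;
    }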

Best,

Josh


On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp>
wrote:

> Eric,
>
>
> can you please provide more information on how your tests are launched?
>
> do you
>
> mpirun -np 1 ./a.out
>
> or do you simply
>
> ./a.out
>
>
> do you use a batch manager? if yes, which one?
>
> do you run one test per job, or multiple tests per job?
>
> how are these tests launched?
>
>
> does the test that crashes use MPI_Comm_spawn?
>
> I am surprised by the process name [[9325,5754],0], which suggests that
> MPI_Comm_spawn was called 5753 times (!)
>
>
> can you also run
>
> hostname
>
> on the 'lorien' host?
>
> if you configured Open MPI with --enable-debug, can you
>
> export OMPI_MCA_plm_base_verbose=5
>
> then run one test and post the logs?
>
>
> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce
> job family 5576 (but you get 9325)
>
> the discrepancy could be explained by the use of a batch manager and/or a
> full hostname I am unaware of.
>
>
> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit
> hash of the) hostname and the mpirun (32-bit?) pid.
>
> so strictly speaking, it is possible that two jobs launched on the same node
> are assigned the same 16-bit job family.
>
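> for illustration, here is a minimal, self-contained sketch (not the actual
> Open MPI code, and using a placeholder hash instead of OPAL's) of how folding
> a 32-bit hostname hash and a pid into a 16-bit job family can collide: two
> pids that differ only above bit 15 map to the same job family.
>
>     /* sketch only -- placeholder hash, not the real orte/opal code */
>     #include <stdio.h>
>     #include <stdint.h>
>
>     static uint32_t hash_hostname(const char *s)
>     {
>         uint32_t h = 5381;            /* simple string hash, for illustration */
>         while (*s) {
>             h = (h * 33) ^ (uint32_t)*s++;
>         }
>         return h;
>     }
>
>     static uint16_t jobfam(const char *hostname, uint32_t pid)
>     {
>         /* assumption: hash and pid are folded, then truncated to 16 bits */
>         return (uint16_t)((hash_hostname(hostname) ^ pid) & 0xffffu);
>     }
>
>     int main(void)
>     {
>         const char *host = "lorien";
>         uint32_t pid_a = 142766;
>         uint32_t pid_b = pid_a + 0x10000;  /* same low 16 bits, different pid */
>
>         printf("jobfam(%s, %u) = %u\n", host,
>                (unsigned)pid_a, (unsigned)jobfam(host, pid_a));
>         printf("jobfam(%s, %u) = %u\n", host,
>                (unsigned)pid_b, (unsigned)jobfam(host, pid_b));
>         return 0;
>     }
>
> with any scheme of this shape, two mpirun instances on the same node can end
> up with the same job family, which could plausibly let one job pick up the
> other job's rendezvous files.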
>
> the easiest way to detect this could be to
>
> - edit orte/mca/plm/base/plm_base_jobid.c
>
> and replace
>
>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>                          "plm:base:set_hnp_name: final jobfam %lu",
>                          (unsigned long)jobfam));
>
> with
>
>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>                          "plm:base:set_hnp_name: final jobfam %lu",
>                          (unsigned long)jobfam));
>
> configure Open MPI with --enable-debug and rebuild
>
> and then
>
> export OMPI_MCA_plm_base_verbose=4
>
> and run your tests.
>
>
> when the problem occurs, you will be able to check which pids produced the
> faulty jobfam, and that could hint at a conflict.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>
>> Hi,
>>
>> It is the third time this has happened in the last 10 days.
>>
>> While running nightly tests (~2200), we have one or two tests that fail
>> at the very beginning with this strange error:
>>
>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
>> unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>
>> But I can't reproduce the problem right now... i.e., if I launch this test
>> alone "by hand", it succeeds... the same test was successful
>> yesterday...
>>
>> Is there some kind of "race condition" that can happen on the creation of
>> "tmp" files if many tests run together on the same node? (we are
>> oversubscribing even sequential runs...)
>>
>> Here are the build logs:
>>
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>
>> Thanks,
>>
>> Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
