Hi,

It looks like there is a problem in trunk which reproduces with the
simple_spawn test (orte/test/mpi/simple_spawn.c). It seems to be an issue
with PMIx. It doesn't reproduce with the default set of BTLs, but it does
reproduce when specific BTLs are requested explicitly on the command line.
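For context: simple_spawn exercises dynamic process management, i.e. a
parent job calling MPI_Comm_spawn and then synchronizing with the spawned
children over the resulting intercommunicator. A minimal sketch of that
pattern (my own simplification for illustration, not the actual test
source) looks like this:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, intercomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (MPI_COMM_NULL == parent) {
            /* Parent job: spawn 3 copies of this same binary. */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                           0, MPI_COMM_WORLD, &intercomm,
                           MPI_ERRCODES_IGNORE);
            /* Cross-job traffic over the intercommunicator is what
             * breaks in the runs below. */
            MPI_Barrier(intercomm);
            MPI_Comm_disconnect(&intercomm);
        } else {
            /* Child job: match the parent's barrier, then disconnect. */
            MPI_Barrier(parent);
            MPI_Comm_disconnect(&parent);
        }

        MPI_Finalize();
        return 0;
    }

For example, the following run: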

salloc -N5 $OMPI_HOME/install/bin/mpirun -np 33 --map-by node -mca coll ^ml
-display-map -mca orte_debug_daemons true --leave-session-attached
--debug-daemons -mca pml ob1 -mca btl tcp,self
./orte/test/mpi/simple_spawn

aborts with:

simple_spawn: ../../ompi/group/group_init.c:215:
ompi_group_increment_proc_count: Assertion `((0xdeafbeedULL << 32) +
0xdeafbeedULL) == ((opal_object_t *) (proc_pointer))->obj_magic_id' failed.
[sputnik3.vbench.com:28888] [[41877,0],3] orted_cmd: exit cmd, but proc
[[41877,1],2] is alive
[sputnik5][[41877,1],29][../../../../../opal/mca/btl/tcp/btl_tcp_endpoint.c:675:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.1.42 failed: Connection refused (111)
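
A note on that assertion, in case it helps whoever looks at this: it is
OPAL's debug-build object sanity check. Every opal_object_t is stamped
with a magic value on construction, and the value is cleared on
destruction, so OBJ_RETAIN asserting on obj_magic_id means
ompi_group_increment_proc_count was handed a proc pointer that was never
constructed or was already destructed/freed. A hypothetical,
much-simplified version of the mechanism (the real machinery lives in
opal/class/opal_object.h; this is not the actual source):

    #include <assert.h>
    #include <stdint.h>

    #define OBJ_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

    typedef struct {
        uint64_t obj_magic_id;        /* stamped on construct */
        int      obj_reference_count;
    } object_t;

    static void obj_construct(object_t *o)
    {
        o->obj_magic_id = OBJ_MAGIC_ID;
        o->obj_reference_count = 1;
    }

    static void obj_retain(object_t *o)
    {
        /* This is the kind of check that fires at group_init.c:215. */
        assert(OBJ_MAGIC_ID == o->obj_magic_id);
        o->obj_reference_count++;
    }

    int main(void)
    {
        object_t o;
        obj_construct(&o);
        obj_retain(&o);    /* OK: magic matches */
        return 0;
    }

So it looks like the spawned job's proc objects are not being set up (or
are being torn down) correctly, which would be consistent with the PMIx
suspicion above.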

salloc -N1 $OMPI_HOME/install/bin/mpirun -np 3 --map-by node -mca coll ^ml
-display-map -mca orte_debug_daemons true --leave-session-attached
--debug-daemons -mca pml ob1 -mca btl sm,self ./orte/test/mpi/simple_spawn

fails with

At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[59481,2],0]) is on host: sputnik1
  Process 2 ([[59481,1],0]) is on host: sputnik1
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[sputnik1.vbench.com:22156] [[59481,1],2] ORTE_ERROR_LOG: Unreachable in
file ../../../../../ompi/mca/dpm/orte/dpm_orte.c at line 485


salloc -N1 $OMPI_HOME/install/bin/mpirun -np 3 --map-by node -mca coll ^ml
-display-map -mca orte_debug_daemons true --leave-session-attached
--debug-daemons -mca pml ob1 -mca btl openib,self
./orte/test/mpi/simple_spawn

also doesn't work:
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[60046,1],13]) is on host: sputnik4
  Process 2 ([[60046,2],1]) is on host: sputnik4
  BTLs attempted: openib self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[sputnik4.vbench.com:25476] [[60046,1],3] ORTE_ERROR_LOG: Unreachable in
file ../../../../../ompi/mca/dpm/orte/dpm_orte.c at line 485


*But* the exclusion form, -mca btl ^sm,openib (i.e., all BTLs except sm
and openib), does seem to work.

I tried several revisions going back to the beginning of October, and the
problem reproduces on all of them.

Best regards,
Elena
