Ralph,
I updated the MODEX flag to PMIX_GLOBAL:
https://github.com/open-mpi/ompi/commit/d542c9ff2dc57ca5d260d0578fd5c1c556c598c7
Elena,
I was able to reproduce the issue (salloc -N 5 mpirun -np 2 is enough).
I was "lucky" to reproduce it: it happened because one of the nodes was
misconfigured with two interfaces in the same subnet (!).
Could you please give the attached patch a try?
I did not commit it because I do not know whether this is the right fix or
just a workaround.
/* for example, should the opal_proc_t be OBJ_RETAINed before invoking
add_procs, and then OBJ_RELEASEd by the btl add_procs if the peer is
unreachable? */
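To make the question concrete, here is a small standalone sketch of the
ownership convention I have in mind (a toy refcount, not the actual OPAL
OBJ_RETAIN/OBJ_RELEASE machinery; all names and values below are made up
for illustration):

/* toy illustration: "caller retains, callee releases on failure" */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int refcount;
    const char *name;
} toy_proc_t;

static toy_proc_t *toy_retain(toy_proc_t *p)
{
    p->refcount++;
    return p;
}

static void toy_release(toy_proc_t *p)
{
    if (--p->refcount == 0) {
        printf("freeing %s\n", p->name);
        free(p);
    }
}

/* stand-in for a btl add_procs: drops the reference it was handed
 * whenever the peer turns out to be unreachable */
static int toy_btl_add_proc(toy_proc_t *p, int reachable)
{
    if (!reachable) {
        toy_release(p);   /* hand the reference back on failure */
        return -1;
    }
    /* keep the reference for the endpoint that would have been created */
    return 0;
}

int main(void)
{
    toy_proc_t *p = malloc(sizeof(*p));
    p->refcount = 1;
    p->name = "peer";

    /* the caller retains before handing the proc to the btl ... */
    if (toy_btl_add_proc(toy_retain(p), 0) != 0) {
        printf("peer unreachable, the btl dropped its reference\n");
    }

    toy_release(p);   /* the caller's own reference */
    return 0;
}
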
Cheers,
Gilles
On 2014/11/06 12:46, Ralph Castain wrote:
>> On Nov 5, 2014, at 6:11 PM, Gilles Gouaillardet
>> <[email protected]> wrote:
>>
>> Elena,
>>
>> The first case (-mca btl tcp,self) crashing is a bug, and I will have a look
>> at it.
>>
>> The second case (-mca btl sm,self) is a feature: the sm btl cannot be used
>> between tasks with different jobids (which is the case after a spawn), and
>> self obviously cannot be used either, so the behaviour and the error message
>> are correct.
>> /* I am not aware of any plans to make the sm btl work with tasks from
>> different jobids */
> That is correct - I'm also unaware of any plans to extend it at this point,
> though IIRC Nathan at one time mentioned perhaps extending vader for that
> purpose
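For reference, below is a minimal standalone sketch of the restriction
described above (toy structs and made-up jobid values, not the real sm btl
code): a shared-memory-style btl only marks a peer reachable when it is on
the same node and in the same jobid, so tasks created by MPI_Comm_spawn
(which get a new jobid) are never matched by it.

/* toy sketch of the same-node + same-jobid reachability check */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t jobid;      /* changes across MPI_Comm_spawn */
    uint32_t vpid;
    bool     same_node;  /* true if the peer shares the node with us */
} toy_peer_t;

static bool sm_like_reachable(const toy_peer_t *me, const toy_peer_t *peer)
{
    return peer->same_node && peer->jobid == me->jobid;
}

int main(void)
{
    /* jobid values are made up for illustration */
    toy_peer_t me      = { .jobid = 100, .vpid = 0, .same_node = true };
    toy_peer_t sibling = { .jobid = 100, .vpid = 1, .same_node = true };
    toy_peer_t spawned = { .jobid = 101, .vpid = 0, .same_node = true };

    printf("sibling reachable: %d\n", sm_like_reachable(&me, &sibling)); /* 1 */
    printf("spawned reachable: %d\n", sm_like_reachable(&me, &spawned)); /* 0 */
    return 0;
}
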
>
>> The third case (-mca btl openib,self) is more controversial ...
>> I previously posted
>> http://www.open-mpi.org/community/lists/devel/2014/10/16136.php
>> What happens in your case (simple_spawn) is that the openib modex is sent with
>> PMIX_REMOTE, which means the openib btl cannot be used between tasks on the
>> same node.
>> I am still waiting for some feedback, since I cannot figure out whether this
>> is a feature or an undesired side effect / bug.
> I believe it is a bug - I provided some initial values for the modex scope
> with the expectation (and request when we committed it) that people would
> review and modify them as appropriate. I recall setting the openib scope as
> "remote" only because I wasn't aware of anyone using it for local comm. Since
> Mellanox obviously is testing for that case, a scope of PMIX_GLOBAL would be
> more appropriate
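To illustrate the scope semantics being discussed, here is a hypothetical
sketch (a toy enum and filter, not the opal pmix API): modex data published
with a REMOTE scope is only handed to peers on other nodes, so an on-node
peer never learns the openib endpoint, while a GLOBAL scope makes it visible
to everyone.

/* toy sketch of modex scope filtering -- not the opal pmix API */
#include <stdbool.h>
#include <stdio.h>

typedef enum { TOY_SCOPE_LOCAL, TOY_SCOPE_REMOTE, TOY_SCOPE_GLOBAL } toy_scope_t;

static bool peer_sees_modex(toy_scope_t scope, bool peer_on_same_node)
{
    switch (scope) {
    case TOY_SCOPE_LOCAL:  return peer_on_same_node;
    case TOY_SCOPE_REMOTE: return !peer_on_same_node;
    case TOY_SCOPE_GLOBAL: return true;
    }
    return false;
}

int main(void)
{
    /* an on-node peer never sees data published with a REMOTE scope, which
     * is why the openib btl was declared unreachable on the same node */
    printf("REMOTE scope, on-node peer: %d\n", peer_sees_modex(TOY_SCOPE_REMOTE, true)); /* 0 */
    printf("GLOBAL scope, on-node peer: %d\n", peer_sees_modex(TOY_SCOPE_GLOBAL, true)); /* 1 */
    return 0;
}
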
>
>> The last case (-mca btl ^sm,openib) does make sense to me:
>> the tcp and self btls are used and they work just as they should.
>>
>> Bottom line: I will investigate the first crash and wait for feedback about
>> the openib btl.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/11/06 1:08, Elena Elkina wrote:
>>> Hi,
>>>
>>> It looks like there is a problem in trunk which reproduces with the
>>> simple_spawn test (orte/test/mpi/simple_spawn.c). It seems to be an issue
>>> with pmix. It doesn't reproduce with the default set of btls, but it does
>>> reproduce when several btls are specified explicitly. For example,
>>>
>>> salloc -N5 $OMPI_HOME/install/bin/mpirun -np 33 --map-by node -mca coll ^ml
>>> -display-map -mca orte_debug_daemons true --leave-session-attached
>>> --debug-daemons -mca pml ob1 -mca btl tcp,self
>>> ./orte/test/mpi/simple_spawn
>>>
>>> gets
>>>
>>> simple_spawn: ../../ompi/group/group_init.c:215:
>>> ompi_group_increment_proc_count: Assertion `((0xdeafbeedULL << 32) +
>>> 0xdeafbeedULL) == ((opal_object_t *) (proc_pointer))->obj_magic_id' failed.
>>> [sputnik3.vbench.com:28888] [[41877,0],3] orted_cmd: exit cmd, but proc
>>> [[41877,1],2] is alive
>>> [sputnik5][[41877,1],29][../../../../../opal/mca/btl/tcp/btl_tcp_endpoint.c:675:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.1.42 failed: Connection refused (111)
>>>
>>> salloc -N1 $OMPI_HOME/install/bin/mpirun -np 3 --map-by node -mca coll ^ml
>>> -display-map -mca orte_debug_daemons true --leave-session-attached
>>> --debug-daemons -mca pml ob1 -mca btl sm,self ./orte/test/mpi/simple_spawn
>>>
>>> fails with
>>>
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>> Process 1 ([[59481,2],0]) is on host: sputnik1
>>> Process 2 ([[59481,1],0]) is on host: sputnik1
>>> BTLs attempted: self sm
>>>
>>> Your MPI job is now going to abort; sorry.
>>> --------------------------------------------------------------------------
>>> [sputnik1.vbench.com:22156] [[59481,1],2] ORTE_ERROR_LOG: Unreachable in
>>> file ../../../../../ompi/mca/dpm/orte/dpm_orte.c at line 485
>>>
>>>
>>> salloc -N1 $OMPI_HOME/install/bin/mpirun -np 3 --map-by node -mca coll ^ml
>>> -display-map -mca orte_debug_daemons true --leave-session-attached
>>> --debug-daemons -mca pml ob1 -mca btl openib,self
>>> ./orte/test/mpi/simple_spawn
>>>
>>> also doesn't work:
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>> Process 1 ([[60046,1],13]) is on host: sputnik4
>>> Process 2 ([[60046,2],1]) is on host: sputnik4
>>> BTLs attempted: openib self
>>>
>>> Your MPI job is now going to abort; sorry.
>>> --------------------------------------------------------------------------
>>> [sputnik4.vbench.com:25476] [[60046,1],3] ORTE_ERROR_LOG: Unreachable in
>>> file ../../../../../ompi/mca/dpm/orte/dpm_orte.c at line 485
>>>
>>>
>>> But the combination ^sm,openib seems to work.
>>>
>>> I tried different revisions going back to the beginning of October; the
>>> issue reproduces on all of them.
>>>
>>> Best regards,
>>> Elena
>>>
>>>
>>>
diff --git a/opal/mca/btl/tcp/btl_tcp.c b/opal/mca/btl/tcp/btl_tcp.c
index 6e7e2f4..076656c 100644
--- a/opal/mca/btl/tcp/btl_tcp.c
+++ b/opal/mca/btl/tcp/btl_tcp.c
@@ -12,6 +12,8 @@
  * All rights reserved.
  * Copyright (c) 2006-2014 Los Alamos National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  *
  * $COPYRIGHT$
  *
@@ -106,7 +108,6 @@ int mca_btl_tcp_add_procs( struct mca_btl_base_module_t* btl,
         rc = mca_btl_tcp_proc_insert(tcp_proc, tcp_endpoint);
         if(rc != OPAL_SUCCESS) {
             OPAL_THREAD_UNLOCK(&tcp_proc->proc_lock);
-            OBJ_RELEASE(opal_proc);
             OBJ_RELEASE(tcp_endpoint);
             continue;
         }