Hi all

From my tests, it is not possible to use "btl:tcp" together with "grpcomm:hier". The "grpcomm:hier" module is important because the "srun" launch protocol cannot use any other "grpcomm" module. You can reproduce this bug by using "btl:tcp" with "grpcomm:hier" and creating a ring (for example, IMB Sendrecv):

$>salloc -N 2 -n 4 mpirun --mca grpcomm hier --mca btl self,sm,tcp ./IMB-MPI1 Sendrecv
salloc: Granted job allocation 2979
[cuzco95][[59536,1],2][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[59536,1],0]
[cuzco92][[59536,1],0][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[59536,1],2]
^C
$>

This error message shows that "btl:tcp" has created a connection to a peer, but not to the expected one (the peer identity is checked with the "ack").
To create a connection between two peers, "btl:tcp" proceeds as follows:
- Each peer broadcasts its IP parameters with ompi_modex_send().
- The IP parameters of the selected peer are retrieved with ompi_modex_recv().

In fact, the modex uses "orte_grpcomm.set_proc_attr()" and "orte_grpcomm.get_proc_attr()" to exchange data. The problem is that "grpcomm:hier" does not distinguish between two peers on the same node. In my tests, the IP parameters of the first rank on the selected node are always returned.
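
To make the symptom concrete, here is a toy model in C. It is not Open MPI code: the table, the function names, the addresses and the rank layout (ranks 0-1 on the first node, ranks 2-3 on the second) are all assumptions for illustration. It only shows what happens when a lookup matches on the node and ignores the rank.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only, NOT Open MPI code. Every name and address is made up. */
struct toy_entry {
    int node;          /* node hosting the process */
    int rank;          /* rank of the process in the job */
    const char *addr;  /* "IP parameters" published by that process */
};

/* Assumed layout: 4 ranks on 2 nodes, ranks 0-1 on node 0, ranks 2-3 on node 1. */
static const struct toy_entry toy_table[] = {
    {0, 0, "10.0.0.1:5000"},
    {0, 1, "10.0.0.1:5001"},
    {1, 2, "10.0.0.2:5000"},
    {1, 3, "10.0.0.2:5001"},
};
static const size_t toy_count = sizeof toy_table / sizeof toy_table[0];

/* Lookup that only compares the node: the first rank on the node always wins. */
static const char *toy_get_by_node(int node)
{
    assert(node >= 0);
    for (size_t i = 0; i < toy_count; i++) {
        if (toy_table[i].node == node)
            return toy_table[i].addr;
    }
    return NULL;
}

/* Lookup that also compares the rank, so peers on one node stay distinct. */
static const char *toy_get_by_proc(int node, int rank)
{
    assert(node >= 0 && rank >= 0);
    for (size_t i = 0; i < toy_count; i++) {
        if (toy_table[i].node == node && toy_table[i].rank == rank)
            return toy_table[i].addr;
    }
    return NULL;
}
```

In this model, a process that wants to reach rank 3 and calls toy_get_by_node(1) gets the entry published by rank 2, so it would connect to rank 2 and the identity check would fail, while toy_get_by_proc(1, 3) returns the entry of rank 3.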


Is "grpcomm:hier" restricted to "btl:sm" and "btl:openib"?


--------

One easy way to fix this problem is to add rank information to the "name" variable in
-    ompi/runtime/ompi_module_exchange.c:ompi_modex_send()
-    ompi/runtime/ompi_module_exchange.c:ompi_modex_recv()
but I do not like it.
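
For illustration only, tagging the key with the rank could look like the sketch below. The helper toy_make_key(), the "name.rank" format and the example name "btl.tcp" are hypothetical. This is not the actual code of ompi_modex_send() or ompi_modex_recv(), whose signatures are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, NOT part of Open MPI: builds "<name>.<rank>" in buf.
 * Returns the key length on success, or -1 if the key does not fit in buf.
 */
static int toy_make_key(char *buf, size_t len, const char *name, int rank)
{
    assert(buf != NULL && name != NULL && rank >= 0);
    int n = snprintf(buf, len, "%s.%d", name, rank);
    if (n < 0 || (size_t)n >= len)
        return -1;  /* encoding error or truncated key */
    return n;
}
```

Both sides would have to build the key in the same way: the sender with its own rank, the receiver with the rank of the peer it wants to reach.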

Does someone have a better solution?


Thank you,
Damien
