Update:

From what I've tracked, while initializing the osc_rdma module, a btl is selected whose endpoint cannot be found afterwards when calling ompi_osc_rdma_peer_btl_endpoint().

It seems that for even peers, the only available btl endpoints are tcp, even though ompi_osc_rdma_btl_names contains only openib and ugni.

(gdb) p ompi_osc_rdma_btl_names
$29 = 0x7981e0 "openib,ugni"

(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$39 = "self", '\000' <repeats 59 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next)).btl_module
$40 = (mca_btl_base_module_t *) 0x7fffed102100 <mca_btl_self>

(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$41 = "openib", '\000' <repeats 57 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next)).btl_module
$42 = (mca_btl_base_module_t *) 0x759e30

(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$43 = "sm", '\000' <repeats 61 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next)).btl_module
$44 = (mca_btl_base_module_t *) 0x7fffec89e200 <mca_btl_sm>

(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$46 = "tcp", '\000' <repeats 60 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$47 = (mca_btl_base_module_t *) 0x76cab0

(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$48 = "tcp", '\000' <repeats 60 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$49 = (mca_btl_base_module_t *) 0x72a680

(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$50 = "vader", '\000' <repeats 58 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$51 = (mca_btl_base_module_t *) 0x7fffec275200 <mca_btl_vader>

(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$52 = (mca_btl_base_module_t *) 0x0
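
For reference, the list walk above can be checked against ompi_osc_rdma_btl_names with a small standalone sketch. It assumes the variable is treated as a comma-separated list of component names. name_in_list and count_allowed are hypothetical helpers written for this illustration, not Open MPI functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): returns true when `name`
 * is exactly one of the comma-separated entries of `list`,
 * e.g. list = "openib,ugni". */
bool name_in_list(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);

    size_t name_len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        const char *comma = strchr(p, ',');
        size_t len = (comma != NULL) ? (size_t)(comma - p) : strlen(p);

        /* Compare whole entries only, so "open" does not match "openib". */
        if (len == name_len && strncmp(p, name, len) == 0) {
            return true;
        }
        if (comma == NULL) {
            break;
        }
        p = comma + 1;
    }
    return false;
}

/* Counts how many of the `count` component names pass the filter. */
size_t count_allowed(const char *list, const char *const *names, size_t count)
{
    size_t allowed = 0;

    for (size_t i = 0; i < count; ++i) {
        if (name_in_list(list, names[i])) {
            ++allowed;
        }
    }
    return allowed;
}
```

With the six component names printed above (self, openib, sm, tcp, tcp, vader), only openib passes the "openib,ugni" filter, which matches the selected btl at 0x759e30.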

Sorry for the noise in your mailboxes. I thought it might be useful to know which btl modules these pointers refer to.

Clement FOYER

On 02/02/2017 11:17 AM, Clement FOYER wrote:
Hi everyone,

I've been facing issues with the creation of windows (MPI_Win_create). Maybe it's an already known issue, or maybe you will be able to tell me where to look to find the problem.

I've been developing a benchmark to evaluate the overhead of a monitoring module. Everything works fine for PML-based operations (coll and point-to-point), but I get errors while creating windows, even without the monitoring component: I launch my applications with --mca pml ^monitoring --mca osc ^monitoring --mca coll ^monitoring, so my components shouldn't be loaded.

From what I've tracked, while initializing the osc_rdma module, a btl is selected whose endpoint cannot be found afterwards when calling ompi_osc_rdma_peer_btl_endpoint().

Here are the traces of a problematic example (with 4 processes; the current process is 1). All processes are on one node:

Breakpoint 11, ompi_osc_rdma_query_btls (comm=0x85f8b0, btl=0x85e850)
     at ../../../../../../ompi/mca/osc/rdma/osc_rdma_component.c:735
735             *btl = selected_btl;
(gdb) p selected_btl
$1 = (struct mca_btl_base_module_t *) 0x759e30
Breakpoint 10, ompi_osc_rdma_peer_btl_endpoint (module=0x85e360, peer_id=0)
     at ../../../../../../ompi/mca/osc/rdma/osc_rdma_peer.c:54
54          return NULL;
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma)
$17 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma.bml_btls[0].btl
$18 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 0))->btl_rdma.bml_btls[1].btl
$19 = (struct mca_btl_base_module_t *) 0x72a680
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma)
$20 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma.bml_btls[0].btl
$21 = (struct mca_btl_base_module_t *) 0x759e30
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 1))->btl_rdma.bml_btls[1].btl
$22 = (struct mca_btl_base_module_t *) 0x7fffec275200 <mca_btl_vader>
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma)
$23 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma.bml_btls[0].btl
$24 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 2))->btl_rdma.bml_btls[1].btl
$25 = (struct mca_btl_base_module_t *) 0x72a680
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 3))->btl_rdma)
$26 = 1
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 3))->btl_rdma.bml_btls[0].btl
$27 = (struct mca_btl_base_module_t *) 0x759e30

It seems that for odd peer_ids (1 and 3), the selected btl (0x759e30) can be retrieved from the peer's btl_rdma array, but not for the even ones (0 and 2), whose arrays only contain 0x76cab0 and 0x72a680. I haven't yet looked deeply into the library to explain this behavior. Do you have any idea where to look?

Thank you in advance,
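
To make the failing lookup concrete, here is a minimal standalone model that replays the pointer values from the trace above. It assumes that ompi_osc_rdma_peer_btl_endpoint() searches the peer's btl_rdma array for an entry whose btl pointer equals the selected btl and returns NULL when there is none. The model_* types and model_find_endpoint are simplified, hypothetical stand-ins, not the Open MPI definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one entry of a peer's btl_rdma array. */
typedef struct {
    uint64_t btl; /* address of the btl module, as printed by gdb */
} model_bml_btl_t;

/* Simplified stand-in for the btl_rdma array of one peer. */
typedef struct {
    const model_bml_btl_t *bml_btls;
    size_t size;
} model_btl_array_t;

/* Returns the entry whose btl matches `selected`, or NULL if the
 * selected btl is not among the peer's RDMA btls. */
const model_bml_btl_t *model_find_endpoint(const model_btl_array_t *rdma,
                                           uint64_t selected)
{
    assert(rdma != NULL);

    for (size_t i = 0; i < rdma->size; ++i) {
        if (rdma->bml_btls[i].btl == selected) {
            return &rdma->bml_btls[i];
        }
    }
    return NULL;
}

/* Pointer values copied from the gdb trace ($18 to $27). */
static const model_bml_btl_t peer0[] = {{UINT64_C(0x76cab0)}, {UINT64_C(0x72a680)}};
static const model_bml_btl_t peer1[] = {{UINT64_C(0x759e30)}, {UINT64_C(0x7fffec275200)}};
static const model_bml_btl_t peer2[] = {{UINT64_C(0x76cab0)}, {UINT64_C(0x72a680)}};
static const model_bml_btl_t peer3[] = {{UINT64_C(0x759e30)}};

const model_btl_array_t traced_peers[4] = {
    {peer0, 2}, {peer1, 2}, {peer2, 2}, {peer3, 1},
};

/* Value of selected_btl printed at the first breakpoint ($1). */
#define TRACED_SELECTED_BTL UINT64_C(0x759e30)
```

With the traced values, model_find_endpoint(&traced_peers[i], TRACED_SELECTED_BTL) returns NULL for peers 0 and 2 and a valid entry for peers 1 and 3.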

Clément FOYER


_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
