Update:
From what I've tracked, while initializing the osc_rdma module, a btl
is selected whose endpoint cannot be found afterwards when calling
ompi_osc_rdma_peer_btl_endpoint().
It seems that for even peers the only available btl endpoints are tcp,
even though ompi_osc_rdma_btl_names only contains openib and ugni.
(gdb) p ompi_osc_rdma_btl_names
$29 = 0x7981e0 "openib,ugni"
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$39 = "self", '\000' <repeats 59 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next)).btl_module
$40 = (mca_btl_base_module_t *) 0x7fffed102100 <mca_btl_self>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$41 = "openib", '\000' <repeats 57 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next)).btl_module
$42 = (mca_btl_base_module_t *) 0x759e30
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$43 = "sm", '\000' <repeats 61 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next)).btl_module
$44 = (mca_btl_base_module_t *) 0x7fffec89e200 <mca_btl_sm>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$46 = "tcp", '\000' <repeats 60 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$47 = (mca_btl_base_module_t *) 0x76cab0
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$48 = "tcp", '\000' <repeats 60 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$49 = (mca_btl_base_module_t *) 0x72a680
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module->btl_component->btl_version.mca_component_name
$50 = "vader", '\000' <repeats 58 times>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$51 = (mca_btl_base_module_t *) 0x7fffec275200 <mca_btl_vader>
(gdb) p ((mca_btl_base_selected_module_t*)(mca_btl_base_modules_initialized.opal_list_sentinel.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next.opal_list_next)).btl_module
$52 = (mca_btl_base_module_t *) 0x0
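To make the raw addresses easier to cross-reference, the gdb output above
can be condensed into a small lookup table. This is only a sketch: the
btl_entry type and btl_name_for() are hypothetical helpers, not part of
Open MPI, and the addresses are the ones printed in this particular run.

```c
#include <stddef.h>

/* Hypothetical helper type (not Open MPI API): pairs a module address
 * printed by gdb with the component name reported for it. */
struct btl_entry {
    unsigned long long addr;
    const char        *name;
};

/* Values copied from the gdb session above; valid for this run only. */
static const struct btl_entry initialized_btls[] = {
    { 0x7fffed102100ULL, "self"   },
    { 0x759e30ULL,       "openib" },
    { 0x7fffec89e200ULL, "sm"     },
    { 0x76cab0ULL,       "tcp"    },
    { 0x72a680ULL,       "tcp"    },
    { 0x7fffec275200ULL, "vader"  },
};

/* Return the component name for a module address, or NULL if the
 * address is not one of the initialized modules listed above. */
static const char *btl_name_for(unsigned long long addr)
{
    size_t n = sizeof(initialized_btls) / sizeof(initialized_btls[0]);
    for (size_t i = 0; i < n; ++i) {
        if (initialized_btls[i].addr == addr) {
            return initialized_btls[i].name;
        }
    }
    return NULL;
}
```

With this table, 0x759e30 resolves to openib, while 0x76cab0 and 0x72a680
both resolve to tcp.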
Sorry for the noise in your mailboxes. I thought it might be valuable
to know where these pointers point to.
Clement FOYER
On 02/02/2017 11:17 AM, Clement FOYER wrote:
Hi everyone,
I've been facing issues with the creation of windows
(MPI_Win_create). Maybe it's an already known issue, or maybe you will
be able to tell me where to look to find the problem.
I've been developing some benchmarks to evaluate the overhead of a
monitoring module. Everything works fine for PML-based operations
(coll and point-to-point), but I get errors while creating
windows, even without the monitoring component: I launch my
applications with --mca pml ^monitoring --mca osc ^monitoring --mca
coll ^monitoring, so my components shouldn't be loaded.
From what I've tracked, while initializing the osc_rdma module, a btl
is selected whose endpoint cannot be found afterwards when calling
ompi_osc_rdma_peer_btl_endpoint().
Here are the traces of a problematic example (with 4 processes, current
process is 1). All processes are on one node:
Breakpoint 11, ompi_osc_rdma_query_btls (comm=0x85f8b0, btl=0x85e850)
at ../../../../../../ompi/mca/osc/rdma/osc_rdma_component.c:735
735 *btl = selected_btl;
(gdb) p selected_btl
$1 = (struct mca_btl_base_module_t *) 0x759e30
Breakpoint 10, ompi_osc_rdma_peer_btl_endpoint (module=0x85e360, peer_id=0)
at ../../../../../../ompi/mca/osc/rdma/osc_rdma_peer.c:54
54 return NULL;
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint (
ompi_comm_peer_lookup (module->comm, 0))->btl_rdma)
$17 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
0))->btl_rdma.bml_btls[0].btl
$18 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
0))->btl_rdma.bml_btls[1].btl
$19 = (struct mca_btl_base_module_t *) 0x72a680
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint (
ompi_comm_peer_lookup (module->comm, 1))->btl_rdma)
$20 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
1))->btl_rdma.bml_btls[0].btl
$21 = (struct mca_btl_base_module_t *) 0x759e30
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
1))->btl_rdma.bml_btls[1].btl
$22 = (struct mca_btl_base_module_t *) 0x7fffec275200 <mca_btl_vader>
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint (
ompi_comm_peer_lookup (module->comm, 2))->btl_rdma)
$23 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
2))->btl_rdma.bml_btls[0].btl
$24 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
2))->btl_rdma.bml_btls[1].btl
$25 = (struct mca_btl_base_module_t *) 0x72a680
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint (
ompi_comm_peer_lookup (module->comm, 3))->btl_rdma)
$26 = 1
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
3))->btl_rdma.bml_btls[0].btl
$27 = (struct mca_btl_base_module_t *) 0x759e30
It seems that for odd peer_ids (1 and 3) the selected btl (0x759e30)
can be retrieved, but not for the even ones (0 and 2), whose btl_rdma
arrays only contain 0x76cab0 and 0x72a680. I haven't looked deeply
into the library to explain this behavior yet. Do you have any idea of
where to look this up?
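The observation can be checked offline from the addresses printed above.
The sketch below assumes, based only on these traces and not on the actual
osc_rdma source, that the endpoint lookup scans the peer's btl_rdma array
for the selected btl and fails when it is absent. The peer_rdma_btls type
and peer_has_selected_btl() are hypothetical names.

```c
#include <stddef.h>

#define MAX_RDMA_BTLS 2

/* Hypothetical stand-in for a peer's btl_rdma array: only the module
 * addresses printed by gdb are kept. */
struct peer_rdma_btls {
    size_t             count;
    unsigned long long btl[MAX_RDMA_BTLS];
};

/* Value of selected_btl at the ompi_osc_rdma_query_btls breakpoint. */
static const unsigned long long selected_btl = 0x759e30ULL;

/* btl_rdma contents for peers 0..3 as seen from process 1. */
static const struct peer_rdma_btls peers[4] = {
    { 2, { 0x76cab0ULL, 0x72a680ULL       } },
    { 2, { 0x759e30ULL, 0x7fffec275200ULL } },
    { 2, { 0x76cab0ULL, 0x72a680ULL       } },
    { 1, { 0x759e30ULL, 0                 } },
};

/* Return 1 if the selected btl appears in the peer's array, else 0.
 * peer_id must be in the range 0..3. */
static int peer_has_selected_btl(int peer_id)
{
    const struct peer_rdma_btls *p = &peers[peer_id];
    for (size_t i = 0; i < p->count; ++i) {
        if (p->btl[i] == selected_btl) {
            return 1;
        }
    }
    return 0;
}
```

With these values the check succeeds for peers 1 and 3 and fails for
peers 0 and 2, which matches the NULL return seen for peer_id=0.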
Thank you in advance,
Clément FOYER
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel