Hi everyone,
I've been facing issues with the creation of windows (MPI_Win_create).
Maybe it's an already known issue, or maybe you will be able to tell me
where to check to find the problem.
I've been developing a benchmark to evaluate the overhead of a
monitoring module. Everything works fine for PML-based operations (coll
and point-to-point), but I get errors while creating windows, even
without the monitoring component: I launch my applications with --mca
pml ^monitoring --mca osc ^monitoring --mca coll ^monitoring, so my
components shouldn't be loaded.
From what I've tracked, a btl is selected while initializing the
osc_rdma module, but that btl cannot be found again when calling
ompi_osc_rdma_peer_btl_endpoint().
Here are the traces of a problematic example (with 4 processes; the
current process is 1). All processes are on one node:
Breakpoint 11, ompi_osc_rdma_query_btls (comm=0x85f8b0, btl=0x85e850)
at ../../../../../../ompi/mca/osc/rdma/osc_rdma_component.c:735
735 *btl = selected_btl;
(gdb) p selected_btl
$1 = (struct mca_btl_base_module_t *) 0x759e30
Breakpoint 10, ompi_osc_rdma_peer_btl_endpoint (module=0x85e360, peer_id=0)
at ../../../../../../ompi/mca/osc/rdma/osc_rdma_peer.c:54
54 return NULL;
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint (
ompi_comm_peer_lookup (module->comm, 0))->btl_rdma)
$17 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
0))->btl_rdma.bml_btls[0].btl
$18 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
0))->btl_rdma.bml_btls[1].btl
$19 = (struct mca_btl_base_module_t *) 0x72a680
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint (
ompi_comm_peer_lookup (module->comm, 1))->btl_rdma)
$20 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
1))->btl_rdma.bml_btls[0].btl
$21 = (struct mca_btl_base_module_t *) 0x759e30
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
1))->btl_rdma.bml_btls[1].btl
$22 = (struct mca_btl_base_module_t *) 0x7fffec275200 <mca_btl_vader>
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint (
ompi_comm_peer_lookup (module->comm, 2))->btl_rdma)
$23 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
2))->btl_rdma.bml_btls[0].btl
$24 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
2))->btl_rdma.bml_btls[1].btl
$25 = (struct mca_btl_base_module_t *) 0x72a680
(gdb) p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint (
ompi_comm_peer_lookup (module->comm, 3))->btl_rdma)
$26 = 1
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm,
3))->btl_rdma.bml_btls[0].btl
$27 = (struct mca_btl_base_module_t *) 0x759e30
It seems that for odd proc_ids (1 and 3), the selected btl (0x759e30)
can be retrieved, but not for the even ones (0 and 2). I haven't looked
deeply into the library to explain this behavior yet. Do you have any
idea where to look?
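
To make the observation concrete, here is a small standalone C model of
the lookup, using only the pointer values from the gdb trace above. It
is a sketch and not Open MPI code: struct fake_peer, the peers table and
find_selected_btl() are hypothetical names, and I am assuming that the
lookup simply compares each entry of the peer's btl_rdma array with the
selected btl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PEERS 4
#define MAX_BTLS  2

/* Hypothetical stand-in for one peer's btl_rdma array (not an Open MPI type). */
struct fake_peer {
    size_t   num_btls;
    uint64_t btls[MAX_BTLS];  /* btl module addresses copied from the gdb trace */
};

/* Address printed by "p selected_btl" in ompi_osc_rdma_query_btls(). */
static const uint64_t selected_btl = 0x759e30;

/* btl_rdma contents for peers 0..3 as seen from process 1. */
static const struct fake_peer peers[NUM_PEERS] = {
    { 2, { 0x76cab0, 0x72a680 } },
    { 2, { 0x759e30, 0x7fffec275200 } },
    { 2, { 0x76cab0, 0x72a680 } },
    { 1, { 0x759e30, 0 } },
};

/* Return the index of the selected btl in the peer's array, or -1 if absent. */
static int find_selected_btl(int peer_id)
{
    assert(peer_id >= 0 && peer_id < NUM_PEERS);
    const struct fake_peer *peer = &peers[peer_id];

    for (size_t i = 0; i < peer->num_btls; ++i) {
        if (peer->btls[i] == selected_btl) {
            return (int) i;
        }
    }
    return -1;  /* models the "return NULL" hit at osc_rdma_peer.c:54 */
}
```

Calling find_selected_btl() for peers 0 to 3 returns -1, 0, -1, 0: with
the addresses from the trace, the selected btl is only present in the
arrays of peers 1 and 3.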
Thank you in advance,
Clément FOYER