Hi everyone,

I've been having issues with window creation (MPI_Win_create). Maybe it's a known issue, or maybe you can tell me where to look to find the problem.

I've been developing a benchmark to evaluate the overhead of a monitoring module. Everything works fine for PML-based operations (collective and point-to-point), but I get errors when creating windows, even without the monitoring component: I launch my applications with --mca pml ^monitoring --mca osc ^monitoring --mca coll ^monitoring, so my components shouldn't be loaded.

From what I've tracked, a btl is selected while the osc_rdma module is initialized, but it cannot be found again when ompi_osc_rdma_peer_btl_endpoint() is called.

Here are the traces of a problematic example (with 4 processes; the current process is 1). All processes are on one node:

Breakpoint 11, ompi_osc_rdma_query_btls (comm=0x85f8b0, btl=0x85e850)
    at ../../../../../../ompi/mca/osc/rdma/osc_rdma_component.c:735
735             *btl = selected_btl;
(gdb) p selected_btl
$1 = (struct mca_btl_base_module_t *) 0x759e30

Breakpoint 10, ompi_osc_rdma_peer_btl_endpoint (module=0x85e360, peer_id=0)
    at ../../../../../../ompi/mca/osc/rdma/osc_rdma_peer.c:54
54          return NULL;
(gdb)  p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( 
ompi_comm_peer_lookup (module->comm, 0))->btl_rdma)
$17 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 
0))->btl_rdma.bml_btls[0].btl
$18 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 
0))->btl_rdma.bml_btls[1].btl
$19 = (struct mca_btl_base_module_t *) 0x72a680

(gdb)  p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( 
ompi_comm_peer_lookup (module->comm, 1))->btl_rdma)
$20 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 
1))->btl_rdma.bml_btls[0].btl
$21 = (struct mca_btl_base_module_t *) 0x759e30
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 
1))->btl_rdma.bml_btls[1].btl
$22 = (struct mca_btl_base_module_t *) 0x7fffec275200 <mca_btl_vader>

(gdb)  p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( 
ompi_comm_peer_lookup (module->comm, 2))->btl_rdma)
$23 = 2
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 
2))->btl_rdma.bml_btls[0].btl
$24 = (struct mca_btl_base_module_t *) 0x76cab0
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 
2))->btl_rdma.bml_btls[1].btl
$25 = (struct mca_btl_base_module_t *) 0x72a680

(gdb)  p mca_bml_base_btl_array_get_size (&mca_bml_base_get_endpoint ( 
ompi_comm_peer_lookup (module->comm, 3))->btl_rdma)
$26 = 1
(gdb) p mca_bml_base_get_endpoint ( ompi_comm_peer_lookup (module->comm, 
3))->btl_rdma.bml_btls[0].btl
$27 = (struct mca_btl_base_module_t *) 0x759e30

It seems that for odd proc_ids (1 and 3), the selected btl (0x759e30) can be retrieved, but not for the even ones (0 and 2). I haven't looked deeply into the library to explain this behavior yet. Do you have any idea where to look?

Thank you in advance,

Clément FOYER
