Hi, Josh
>https://svn.open-mpi.org/trac/ompi/ticket/2397
Thank you very much for filing my questions in the ticket system.
I now have three new questions, and I will post them.
Regards,
Takayuki Seki
The 12th question is as follows:
(12) Checkpointing of an MPI job which uses two (or more?) openib btl modules
fails.
To reproduce, please build Open MPI with the "--enable-debug" configure option.
An assertion fails in mca_btl_openib_ft_event.
Framework : bml
Component : r2
The source file : ompi/mca/bml/r2/bml_r2_ft.c
The function name : mca_bml_r2_ft_event
Framework : btl
Component : openib
The source file : ompi/mca/btl/openib/btl_openib.c
The function name : mca_btl_openib_ft_event
* The following message is printed in mca_btl_openib_ft_event.
a.out: ../../../../../ompi/mca/btl/openib/btl_openib.c:1603:
mca_btl_openib_ft_event: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
((opal_object_t *) (&mca_btl_openib_component.ib_procs))->obj_magic_id' failed.
* Hardware/System requirement.
There are two active openib ports.
Here's the output of ifconfig.
ib0 Link encap:InfiniBand HWaddr
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
ib2 Link encap:InfiniBand HWaddr
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
Here's the output of ibv_devinfo.
hca_id: mlx4_0
port: 1
state: PORT_ACTIVE (4)
port: 2
state: PORT_DOWN (1)
hca_id: mlx4_1
port: 1
state: PORT_ACTIVE (4)
port: 2
state: PORT_DOWN (1)
* Debugging output.
mpiexec -n 2 -mca btl self,openib -am ft-enable-cr ...
DEBUG: mca_bml_r2_ft_event 0 num_btl_modules=33
DEBUG: r2 call btl ft 2aaaade70213 0 self
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
The number of processes is 2.
The specified btl is self,openib.
The total btl module count is 33, and the openib module count is 32.
* The r2 ft_event function calls the btl ft_event function of each module.
Therefore, it calls openib's ft_event function (mca_btl_openib_ft_event) 32
times.
/*
* Call ft_event in:
* - BTL modules
* - MPool modules
*
* These should be cleaning out stale state, and memory references in
* preparation for being shut down.
*/
for(btl_idx = 0; btl_idx < mca_bml_r2.num_btl_modules; btl_idx++) {
* A single call to mca_btl_openib_ft_event seems to release the resources of all openib modules at once.
for (i = 0; i < mca_btl_openib_component.ib_num_btls; ++i ) {
mca_btl_openib_finalize_resources(
&(mca_btl_openib_component.openib_btls[i])->super);
}
/* closing all openib modules at once. */
mca_btl_openib_component.devices_count = 0;
mca_btl_openib_component.ib_num_btls = 0;
OBJ_DESTRUCT(&mca_btl_openib_component.ib_procs);
/* When mca_btl_openib_ft_event is called for the second time,
an error occurs at this point. */
ompi_btl_openib_connect_base_finalize();
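To illustrate why the second call fails, here is a minimal standalone sketch. It is not Open MPI code: the struct, the function names, and the "released" guard flag are all hypothetical stand-ins. It only models a component-wide teardown that is invoked once per module, and shows one possible way (an early return) to make the repeated calls harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as the magic id in the assertion message above. */
#define SKETCH_MAGIC ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical stand-in for the component-wide openib state. */
struct sketch_component {
    uint64_t procs_magic;   /* models obj_magic_id of ib_procs */
    int      num_btls;      /* models ib_num_btls */
    bool     released;      /* hypothetical guard flag (not in Open MPI) */
    int      release_count; /* how many times the teardown really ran */
};

/* Models a debug-build destructor: it asserts the object is still valid,
 * then invalidates it. Destructing twice would trip the assertion. */
static void sketch_destruct(struct sketch_component *c)
{
    assert(c->procs_magic == SKETCH_MAGIC);
    c->procs_magic = 0;
}

/* Models a per-module ft_event that tears down the whole component.
 * The guard turns every call after the first into a no-op. */
static int sketch_ft_event(struct sketch_component *c)
{
    if (c->released) {
        return 0;           /* already torn down by an earlier call */
    }
    c->num_btls = 0;
    sketch_destruct(c);
    c->released = true;
    c->release_count++;
    return 0;
}
```

With this guard, calling sketch_ft_event 32 times on the same component runs the teardown once. Without the "released" check, the second call would hit the assert in sketch_destruct, which is the same pattern as the failure reported above.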
* Case using tcp instead of openib (for reference).
mpiexec -n 2 -mca btl self,tcp -am ft-enable-cr ...
DEBUG: mca_bml_r2_ft_event 0 num_btl_modules=4
DEBUG: r2 call btl ft 2aaaad89d213 0 self
DEBUG: r2 call btl ft 2aaaadaad590 0 tcp
DEBUG: r2 call btl ft 2aaaadaad590 0 tcp
DEBUG: r2 call btl ft 2aaaadaad590 0 tcp
The tcp module count is 3.
The r2 ft_event function calls tcp's ft_event function (mca_btl_tcp_ft_event) 3
times.
But mca_btl_tcp_ft_event takes no action.
(In effect, it is a NOP executed 3 times.)
* Should the r2 ft_event function call the btl ft_event function only once for each btl component?
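As a rough illustration of what I mean by "only once for each btl component", here is a standalone sketch. It does not use the real r2 or btl data structures: the struct, its "component" identity pointer, and all function names are hypothetical. It walks a module array and calls ft_event only for the first module of each component.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a btl module. */
struct sketch_module {
    const void *component;      /* identity of the owning component */
    int (*ft_event)(int state); /* per-module callback, may be NULL */
};

/* Demo callback that only counts how often it is invoked. */
static int sketch_event_calls = 0;

static int sketch_count_event(int state)
{
    (void)state;
    sketch_event_calls++;
    return 0;
}

/* Calls ft_event once per distinct component.
 * Returns the number of calls made, or -1 if a callback fails. */
static int sketch_ft_event_once_per_component(struct sketch_module *mods,
                                              size_t n, int state)
{
    int calls = 0;

    assert(mods != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int seen = 0;

        /* Skip this module if an earlier one has the same component. */
        for (size_t j = 0; j < i; j++) {
            if (mods[j].component == mods[i].component) {
                seen = 1;
                break;
            }
        }
        if (seen || NULL == mods[i].ft_event) {
            continue;
        }
        if (0 != mods[i].ft_event(state)) {
            return -1;
        }
        calls++;
    }
    return calls;
}
```

For one self module plus three modules of a second component, this makes two calls instead of four. The quadratic scan is only for brevity. Whether the fix belongs in r2 (as sketched here) or inside mca_btl_openib_ft_event itself is exactly my question.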