Hi, Josh

> https://svn.open-mpi.org/trac/ompi/ticket/2397
Thank you very much for filing my questions in the ticket system.
I now have 3 new questions, and I will post them.

Regards,
Takayuki Seki

The 12th question is as follows:

(12) Checkpointing an MPI job that uses two (or more?) openib btl modules fails.
     Please build Open MPI with the "--enable-debug" configure option.
     An assertion fails in mca_btl_openib_ft_event.

Framework         : bml
Component         : r2
The source file   : ompi/mca/bml/r2/bml_r2_ft.c
The function name : mca_bml_r2_ft_event

Framework         : btl
Component         : openib
The source file   : ompi/mca/btl/openib/btl_openib.c
The function name : mca_btl_openib_ft_event

* The following message is printed in mca_btl_openib_ft_event.

  a.out: ../../../../../ompi/mca/btl/openib/btl_openib.c:1603: mca_btl_openib_ft_event: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&mca_btl_openib_component.ib_procs))->obj_magic_id' failed.

* Hardware/System requirement.
  There are two active openib ports.

  Here's the output of ifconfig.

  ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
  ib2       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00

  Here's the output of ibv_devinfo.

  hca_id: mlx4_0
          port:   1
                  state:  PORT_ACTIVE (4)
          port:   2
                  state:  PORT_DOWN (1)
  hca_id: mlx4_1
          port:   1
                  state:  PORT_ACTIVE (4)
          port:   2
                  state:  PORT_DOWN (1)

* Debugging output.

  mpiexec -n 2 -mca btl self,openib -am ft-enable-cr ...
  DEBUG: mca_bml_r2_ft_event 0 num_btl_modules=33
  DEBUG: r2 call btl ft 2aaaade70213 0 self
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
  DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib

  Number of processes is 2.
  Specified btl is self,openib.
  Total btl module count is 33 and openib module count is 32.

* The r2 ft_event function calls the btl ft_event function once for each module.
  Therefore, it calls openib's ft_event function (mca_btl_openib_ft_event) 32 times.

    /*
     * Call ft_event in:
     * - BTL modules
     * - MPool modules
     *
     * These should be cleaning out stale state, and memory references in
     * preparation for being shut down.
     */
    for(btl_idx = 0; btl_idx < mca_bml_r2.num_btl_modules; btl_idx++) {

* mca_btl_openib_ft_event seems to release all openib resources at once.

    for (i = 0; i < mca_btl_openib_component.ib_num_btls; ++i ) {
        mca_btl_openib_finalize_resources( &(mca_btl_openib_component.openib_btls[i])->super);
    }
    /* closing all openib modules at a time. */
    mca_btl_openib_component.devices_count = 0;
    mca_btl_openib_component.ib_num_btls   = 0;
    OBJ_DESTRUCT(&mca_btl_openib_component.ib_procs);
    /* When mca_btl_openib_ft_event is called for the second time, an error occurs at this point. */
    ompi_btl_openib_connect_base_finalize();

* Case using tcp instead of openib (for reference).

  mpiexec -n 2 -mca btl self,tcp -am ft-enable-cr ...

  DEBUG: mca_bml_r2_ft_event 0 num_btl_modules=4
  DEBUG: r2 call btl ft 2aaaad89d213 0 self
  DEBUG: r2 call btl ft 2aaaadaad590 0 tcp
  DEBUG: r2 call btl ft 2aaaadaad590 0 tcp
  DEBUG: r2 call btl ft 2aaaadaad590 0 tcp

  The tcp module count is 3.
  The r2 ft_event function calls tcp's ft_event function (mca_btl_tcp_ft_event) 3 times.
  But mca_btl_tcp_ft_event takes no action, so the result is a NOP executed 3 times.

* Should the r2 ft_event function call the btl ft_event function only once for each btl component?
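
To make the last question concrete, here is a minimal sketch of what "once per component" could look like. It is not the actual r2 code and not a proposed patch. The types sketch_component_t and sketch_module_t and every function below are hypothetical stand-ins I made up for illustration. The only assumption taken from the report above is that each module can be mapped back to the component that owns it, and that the component's ft_event tears down all of its modules in one call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real Open MPI structures (illustration only). */
typedef struct sketch_component {
    const char *name;
    int ft_event_calls;             /* how many times ft_event ran for this component */
} sketch_component_t;

typedef struct sketch_module {
    sketch_component_t *component;  /* the component that owns this module */
} sketch_module_t;

/* Hypothetical component-wide ft_event: like the openib case described above,
 * it is assumed to release resources for ALL modules of the component at once,
 * so calling it twice would touch already-destroyed state. */
static int sketch_ft_event(sketch_component_t *component, int state)
{
    (void)state;
    component->ft_event_calls++;
    return 0;
}

/* Return 1 if the component of modules[idx] already appeared at a lower index. */
static int sketch_already_seen(sketch_module_t **modules, size_t idx)
{
    size_t j;
    for (j = 0; j < idx; j++) {
        if (modules[j]->component == modules[idx]->component) {
            return 1;
        }
    }
    return 0;
}

/* Walk the module list, but invoke ft_event only for the first module of
 * each component. The quadratic scan is acceptable for a few dozen modules. */
int sketch_ft_event_once_per_component(sketch_module_t **modules,
                                       size_t num_modules, int state)
{
    size_t i;
    assert(modules != NULL || num_modules == 0);
    for (i = 0; i < num_modules; i++) {
        int rc;
        if (sketch_already_seen(modules, i)) {
            continue;               /* this component was already handled */
        }
        rc = sketch_ft_event(modules[i]->component, state);
        if (rc != 0) {
            return rc;              /* stop at the first failure */
        }
    }
    return 0;
}
```

With the module list from the debugging output (1 self module and 32 openib modules), this loop would call the openib function once instead of 32 times. An alternative would be to leave the r2 loop unchanged and make mca_btl_openib_ft_event itself safe to call repeatedly. I do not know which of the two fits the intended design better.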