Hi, I'm from the same lab as Sayantan. Thanks for your suggestion. So far we have not been able to reproduce that problem, but we have run into another one. When I tear down a connection between two nodes, I often get messages like this:
[ 0] 005e0406 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 05f90000 [14] 00000000 [18] 00000008 [1c] fe100000

The program can run and exit though. After using the debug option as you suggested I got the following log. It starts from the point where I start to free the resources and disconnect the nodes:

dapl_lmr_free (0x76f3b0)
dapl_lmr_free (0x76f4e0)
dapl_lmr_free (0x76f650)
dapli_cq_event_cb(0x5c40c0)
dapli_cm_event()
dapli_cm_event: EVENT=0x7 ID=0x76fa70 CTX=0x76fb00
passive_cb: conn 0x76fb00 id 7797360 event 7
dapli_async_event_cb(0x5c40c0)
dapl_lmr_free (0x76fee0)
dapl_lmr_free (0x7a9150)
dapl_lmr_free (0x7a9280)
dapl_lmr_free (0x7a93b0)
dapl_lmr_free (0x7a94e0)
dapl_lmr_free (0x7a9610)
dapl_lmr_free (0x7a9740)
dapl_lmr_free (0x7a9870)
dapl_lmr_free (0x7a99a0)
dapl_lmr_free (0x7a9ad0)
dapl_ep_disconnect (0x69b070, 1)
disconnect(ep 0x69b070, conn 0x76f7a0, id 7797184 flags 1)
dapl_ep_disconnect () returns 0x0
dapli_cq_event_cb(0x5c4410)
dapli_cm_event()
dapli_cm_event: EVENT=0x8 ID=0x76f9c0 CTX=0x76f7a0
active_cb: conn 0x76f7a0 id 7797184 event 8
dapli_async_event_cb(0x5c4410)
dapl_evd_wait (0x5c89b0, -1, 1, 0x7fffffebf7c0, 0x7fffffebf7bc)
dapl_evd_wait: EVD 0x5c89b0, CQ (nil)
dapl_evd_wait (0x5c89b0, -1, 1, 0x7fffff9a9b50, 0x7fffff9a9b4c)
dapl_evd_wait: EVD 0x5c89b0, CQ (nil)
dapli_cq_event_cb(0x5c4410)
dapli_cm_event()
dapli_cm_event: EVENT=0x9 ID=0x76f9c0 CTX=0x76f7a0
active_cb: conn 0x76f7a0 id 7797184 event 9
--> dapl_evd_connection_callback: ctxt: 0x69b070 event: 1 cm_handle 0x76f7a0
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
disconnect(ep 0x69b070, conn 0x76f7a0, id 7797184 flags 0)
destroy_cm_id: conn 0x76f7a0 id 7797184
modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406
dapli_evd_post_event: Called with event # 4005
dapl_evd_connection_callback () returns
active_cb: DESTROY conn 0x76f7a0 id 7797184
dapli_async_event_cb(0x5c4410)
dapl_evd_wait () returns 0x0
dapli_cq_event_cb(0x5c40c0)
dapli_cm_event()
dapli_cm_event: EVENT=0x9 ID=0x76fa70 CTX=0x76fb00
passive_cb: conn 0x76fb00 id 7797360 event 9
--> dapl_cr_callback! context: 0x5c8b20 event: 1 cm_handle 0x76fb00
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
disconnect(ep 0x69b070, conn 0x76fb00, id 7797360 flags 0)
destroy_cm_id: conn 0x76fb00 id 7797360
modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406
dapli_evd_post_event: Called with event # 4005
dapl_evd_wait () returns 0x0
dapli_async_event_cb(0x5c40c0)
dapli_cq_event_cb(0x5c40c0)
dapli_cm_event()
dapli_cm_event: EVENT=0x7 ID=0x76f910 CTX=0x7a9120
passive_cb: conn 0x7a9120 id 7797008 event 7
dapli_async_event_cb(0x5c40c0)
dapl_ep_disconnect (0x69bd20, 1)
disconnect(ep 0x69bd20, conn 0x76fa50, id 7797872 flags 1)
dapl_ep_disconnect () returns 0x0
dapl_evd_wait (0x5ccb00, -1, 1, 0x7fffffebf7c0, 0x7fffffebf7b8)
dapl_evd_wait: EVD 0x5ccb00, CQ (nil)
dapli_cq_event_cb(0x5c4410)
dapli_cm_event()
dapli_cm_event: EVENT=0x8 ID=0x76fc70 CTX=0x76fa50
active_cb: conn 0x76fa50 id 7797872 event 8
dapli_async_event_cb(0x5c4410)
dapl_evd_wait (0x5ccb00, -1, 1, 0x7fffff9a9b50, 0x7fffff9a9b48)
dapl_evd_wait: EVD 0x5ccb00, CQ (nil)
dapli_cq_event_cb(0x5c4410)
dapli_cm_event()
dapli_cm_event: EVENT=0x9 ID=0x76fc70 CTX=0x76fa50
active_cb: conn 0x76fa50 id 7797872 event 9
--> dapl_evd_connection_callback: ctxt: 0x69bd20 event: 1 cm_handle 0x76fa50
dapli_cq_event_cb(0x5c40c0)
dapli_cm_event()
dapli_cm_event: EVENT=0x9 ID=0x76f910 CTX=0x7a9120
passive_cb: conn 0x7a9120 id 7797008 event 9
--> dapl_cr_callback! context: 0x5ccc70 event: 1 cm_handle 0x7a9120
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
disconnect(ep 0x69bd20, conn 0x7a9120, id 7797008 flags 0)
destroy_cm_id: conn 0x7a9120 id 7797008
modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407
dapli_evd_post_event: Called with event # 4005
dapl_evd_wait () returns 0x0
dapl_ep_free (0x69b070)
dapl_ep_disconnect (0x69b070, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69b070 qp_state 1 qp_handle 69b3a0
qp_free: ep_ptr 0x69b070 qp 0x69b3a0
modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406
dapli_async_event_cb(0x5c40c0)
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
disconnect(ep 0x69bd20, conn 0x76fa50, id 7797872 flags 0)
destroy_cm_id: conn 0x76fa50 id 7797872
modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407
dapli_evd_post_event: Called with event # 4005
dapl_evd_connection_callback () returns
active_cb: DESTROY conn 0x76fa50 id 7797872
dapli_async_event_cb(0x5c4410)
dapl_evd_wait () returns 0x0
dapl_ep_free (0x69b070)
dapl_ep_disconnect (0x69b070, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69b070 qp_state 1 qp_handle 69b3a0
qp_free: ep_ptr 0x69b070 qp 0x69b3a0
modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406
>>> dapl_psp_free 0x5c8b20
>>> dapl_psp_free: state 1 cr_list_count 0
remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5c8b20 cm_ptr 0x5c8be0)
destroy_cm_id: conn 0x5c8be0 id 6065664
dapl_evd_free (0x5c89b0)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c8840)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c85e0)
[ 0] 002c0406 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 05f90000 [14] 00000000 [18] 00000008 [1c] fe100000
>>> dapl_psp_free 0x5c8b20
>>> dapl_psp_free: state 1 cr_list_count 0
remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5c8b20 cm_ptr 0x5c8be0)
destroy_cm_id: conn 0x5c8be0 id 6065664
dapl_evd_free (0x5c89b0)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c8840)
dapl_evd_free () returns 0x0
cq_object_destroy: wait_obj=0x5c8750
dapl_evd_free () returns 0x0
dapl_ep_free (0x69bd20)
dapl_ep_disconnect (0x69bd20, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69bd20 qp_state 1 qp_handle 76f220
qp_free: ep_ptr 0x69bd20 qp 0x76f220
modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407
>>> dapl_psp_free 0x5ccc70
>>> dapl_psp_free: state 1 cr_list_count 0
remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5ccc70 cm_ptr 0x5ccd30)
destroy_cm_id: conn 0x5ccd30 id 6082384
dapl_evd_free (0x5ccb00)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc990)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc730)
cq_object_destroy: wait_obj=0x5cc8a0
dapl_evd_free () returns 0x0
dapl_pz_free (0x5c8510)
dapl_ia_query (0x5c8000, (nil), 0x0, (nil), 0x3ffffff, 0x7fffff9a9900)
dapl_ia_query () returns 0x0
dapl_ia_close (0x5c8000, 1)
setup_async_cb: ia 0x5c8000 type 0 hdl (nil) cb (nil) ctx (nil)
setup_async_cb: ia 0x5c8000 type 1 hdl (nil) cb (nil) ctx (nil)
setup_async_cb: ia 0x5c8000 type 3 hdl (nil) cb (nil) ctx (nil)
dapl_evd_free (0x5c80f0)
dapl_evd_free () returns 0x0
close_hca: 0x5c4390->0x5ca3b0
ib_thread_destroy: wait on hca 0x2 destroy
dapl_evd_free (0x5c85e0)
cq_object_destroy: wait_obj=0x5c8750
dapl_evd_free () returns 0x0
dapl_ep_free (0x69bd20)
dapl_ep_disconnect (0x69bd20, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69bd20 qp_state 1 qp_handle 76f220
qp_free: ep_ptr 0x69bd20 qp 0x76f220
modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407
>>> dapl_psp_free 0x5ccc70
>>> dapl_psp_free: state 1 cr_list_count 0
remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5ccc70 cm_ptr 0x5ccd30)
destroy_cm_id: conn 0x5ccd30 id 6082384
dapl_evd_free (0x5ccb00)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc990)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc730)
cq_object_destroy: wait_obj=0x5cc8a0
dapl_evd_free () returns 0x0
dapl_pz_free (0x5c8510)
dapl_ia_query (0x5c8000, (nil), 0x0, (nil), 0x3ffffff, 0x7fffffebf570)
dapl_ia_query () returns 0x0
dapl_ia_close (0x5c8000, 1)
setup_async_cb: ia 0x5c8000 type 0 hdl (nil) cb (nil) ctx (nil)
setup_async_cb: ia 0x5c8000 type 1 hdl (nil) cb (nil) ctx (nil)
setup_async_cb: ia 0x5c8000 type 3 hdl (nil) cb (nil) ctx (nil)
dapl_evd_free (0x5c80f0)
dapl_evd_free () returns 0x0
close_hca: 0x5c4040->0x5ca3b0
DAPL: Stopped (dapl_fini)
dapl_ib_release:
ib_thread_destroy(8512)
ib_thread_destroy: waiting for ib_thread
ib_thread(8512) EXIT
DAPL: Stopped (dapl_fini)
dapl_ib_release:
ib_thread_destroy(8081)
ib_thread_destroy: waiting for ib_thread
ib_thread(8081) EXIT
ib_thread_destroy(8512) exit
ib_thread_destroy(8081) exit

Any suggestions would be highly appreciated.

Thanks.
Lei

----- Original Message -----
From: Arlin Davis <[EMAIL PROTECTED]>
Date: Friday, October 21, 2005 2:59 pm
Subject: Re: [openib-general] uDAPL open HCA problem

> Sayantan Sur wrote:
>
> >Hello,
> >
> >I have udapl over Gen2 setup on our cluster and am able to run udapl
> >programs. However, sometimes I get this error (after a few runs of the
> >same program):
> >
> > open_hca: ERR ib_at_ips_by_gid for mthca0
> >dapls_ib_open_hca failed 40000
> >
> >
>
> uDAPL uses uAT to get the IP address using the GID (ATS records via SA)
> of the local device/port. The SA query for this record is failing for
> some reason. Did your SM bounce during this time? Did you bounce or
> reconfigure the IPoIB network device?
>
> You can set "env DAPL_DBG_TYPE=0xffff" for more information.
>
> -arlin
>
> >The machine is a AMD Opteron (Tyan S2895), with Mellanox MemFree cards
> >(fw ver 5.1.0).
> >
> >lsmod on my machine shows this:
> >
> >[EMAIL PROTECTED]:~] lsmod | grep ^ib
> >ib_ipoib    48008  0
> >ib_uat      14840  0
> >ib_at       25696  1 ib_uat
> >ib_sa       17804  2 ib_ipoib,ib_at
> >ib_ucm      22280  0
> >ib_cm       37744  1 ib_ucm
> >ib_uverbs   35992  0
> >ib_umad     18208  0
> >ib_mthca   122656  0
> >ib_mad      44072  4 ib_sa,ib_cm,ib_umad,ib_mthca
> >ib_core     56192  8
> >ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad
> >
> >My infiniband devices are (created by hand):
> >
> >[EMAIL PROTECTED]:~] ls -l /dev/infiniband/
> >total 0
> >crw-rw-rw- 1 root root 231, 191 2005-10-20 21:13 uat
> >crw-rw-rw- 1 root root 231, 224 2005-10-20 21:12 ucm0
> >crwxrwxrwx 1 root root 231, 192 2005-09-21 04:37 umad0
> >crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs0
> >crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs1
> >
> >
> >I'd really appreciate if someone could help me understand what might be
> >going wrong.
> >
> >Thanks,
> >Sayantan.
> >
>
> _______________________________________________
> openib-general mailing list
> [email protected]
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
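
One way to cross-check the debug log in this message is to compare the first word of each eight-word hex dump against the qp_num values printed by the modify_qp lines. The Python sketch below does only that. It assumes, without confirmation from the log itself, that the first dump word may be a QP number; the function names and regular expressions are illustrative and not part of uDAPL. The sample lines are copied from the log.

```python
import re

# Sample lines copied from the debug log: four modify_qp entries
# and the two hex dumps that appear in the message.
LOG = """\
modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406
modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406
modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407
modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407
[ 0] 005e0406 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 05f90000 [14] 00000000 [18] 00000008 [1c] fe100000
[ 0] 002c0406 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 05f90000 [14] 00000000 [18] 00000008 [1c] fe100000
"""

# Matches "modify_qp: qp <ptr>, state <n> qp_num <hex>".
QP_RE = re.compile(r"modify_qp: qp 0x[0-9a-f]+, state \d+ qp_num (0x[0-9a-f]+)")
# Matches one "[offset] value" pair of a dump line.
DUMP_RE = re.compile(r"\[\s*([0-9a-f]+)\]\s+([0-9a-f]{8})")


def qp_numbers(text):
    """Collect every qp_num value printed by modify_qp lines."""
    return {int(m.group(1), 16) for m in QP_RE.finditer(text)}


def dump_words(line):
    """Parse one dump line into {byte offset: 32-bit value}."""
    return {int(off, 16): int(val, 16) for off, val in DUMP_RE.findall(line)}


def first_words_vs_qps(text):
    """For each dump line, report (first word, whether it equals a logged qp_num)."""
    qps = qp_numbers(text)
    hits = []
    for line in text.splitlines():
        words = dump_words(line)
        if 0 in words:  # only dump lines carry an offset-0 word
            hits.append((words[0], words[0] in qps))
    return hits


if __name__ == "__main__":
    for word, known in first_words_vs_qps(LOG):
        print(f"dump word 0x{word:06x}: matches a logged qp_num = {known}")
```

On these sample lines, the dump beginning 002c0406 matches a qp_num that appears in the log, while the dump beginning 005e0406 matches none of the four logged values. The match is only numerical and does not by itself show what the dump represents.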
