Aniruddha Bohra wrote:

cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) Success
        >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
        dapl_evd_dto_callback : CQE
                work_req_id 134771572
                status 12
        >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
DTO completion ERROR: 12: op 0xff
disconnect(ep 0x8087110, conn 0x808a008, id 134774528 flags 0)
destroy_cm_id: conn 0x808a008 id 134774528
dapli_evd_post_event: Called with event # 4006


Any ideas on how to even start debugging this?


Are you using the uDAPL provider with the socket CM (VERBS=openib_scm), or the default one that uses uCM and uAT? For the socket CM version, the timeout is set to 14 (~67 ms) and the retries are set to 7, so the receiving node would have to be delayed beyond ~469 ms to cause this failure. For the default uCM/uAT version, the retries are set to 7 and the timeout is set to pktlifetime+1, so you would have to look at the path record to find the timeout value for the connection.
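
For reference, the ~67 ms and ~469 ms figures follow from the InfiniBand local ACK timeout encoding, where a timeout value of N means 4.096 us * 2^N. A minimal sketch of that arithmetic is below; the function names are made up for illustration, and the total is simply retries times the per-attempt timeout, as in the estimate above (a real HCA may wait somewhat longer).

```python
# Sketch of the local ACK timeout arithmetic quoted above.
# Assumption: total budget = retries * per-attempt timeout, matching the
# estimate in this message; actual hardware behaviour may differ.

IB_TIMEOUT_BASE_US = 4.096  # InfiniBand local ACK timeout unit, in microseconds


def ack_timeout_ms(exponent: int) -> float:
    """Per-attempt local ACK timeout in ms: 4.096 us * 2**exponent."""
    return IB_TIMEOUT_BASE_US * (2 ** exponent) / 1000.0


def retry_budget_ms(exponent: int, retries: int) -> float:
    """Rough total delay before the retry count is exhausted, in ms."""
    return ack_timeout_ms(exponent) * retries


if __name__ == "__main__":
    # Socket CM values from the message: timeout 14, retries 7.
    print(f"{ack_timeout_ms(14):.1f} ms per attempt")
    print(f"{retry_budget_ms(14, 7):.1f} ms total")
```

For the uCM/uAT case, the same functions would apply with the exponent taken from the path record (pktlifetime+1) instead of 14.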

Can you successfully run the IB verbs ibv_rc_pingpong test? Is there anything special about your fabric configuration that could induce this kind of latency? Something on the fabric or in your remote system is delaying ACKs beyond your total timeout/retry time.

If you had no buffers posted, or had attempted to send to unregistered memory, you would get different errors.
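
To make the distinction concrete, here is a small lookup sketch. It assumes the "status 12" printed by dapl_evd_dto_callback is the raw ibv_wc_status value from libibverbs; the table copies the first entries of that enum, and the helper name is hypothetical. Under that assumption, 12 is the transport retry counter being exceeded, while missing receive buffers would typically show up as 13 (RNR retry exceeded) and unregistered memory as a protection or access error such as 4 or 10.

```python
# Sketch: decode a completion status, assuming it is a raw ibv_wc_status value.
# The names mirror the first entries of enum ibv_wc_status in libibverbs.

WC_STATUS_NAMES = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def describe_wc_status(status: int) -> str:
    """Return the enum name for a status code, or a placeholder if unlisted."""
    return WC_STATUS_NAMES.get(status, f"unknown ({status})")


if __name__ == "__main__":
    print(describe_wc_status(12))
```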

-arlin


Thanks
Aniruddha


_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
