This error means the combination of timeout/retry configured for the QP has been exceeded. It could mean:
1. The destination does not exist on the IB fabric.
2. The destination address was not properly resolved, and the wrong IB address is being used.
3. The fabric path between this host and the remote host is not stable or has a high symbol error rate.

If this message occurs at the start of the job, it is likely #1 or #2. If it occurs later in the job, after some traffic has been successfully sent between those nodes, #3 is more likely.

Since you indicate it is affecting performance, I assume it occurs mid-job, so #3 is more likely. In that case, you should use tools like ibping and ibdiagnet to analyze the errors in the fabric, or better yet, use the Intel True Scale IB Fabric Suite, which includes a rich set of fabric analysis and diagnostic tools (contact your HW supplier or Intel if you do not have it).

Once you resolve this connectivity issue, for the best MPI performance on QLogic InfiniBand cards it is recommended to use Open MPI's psm mtl as opposed to the verbs btl.

If the problem is #3, you should work with the distributor you purchased your hardware from for further debug of the faulty component.

Todd Rimmer
DCSG Architecture
Voice: 610-312-2152 Fax: 610-312-2233
[email protected]

From: tac [mailto:[email protected]] On Behalf Of Woodruff, Robert J
Sent: Monday, August 15, 2016 12:28 PM
To: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Cc: Marciniszyn, Mike
Subject: Re: [tac] [ofiwg] OFED compatibility issue with Qlogic Infiniband card

+ Mike from the Intel InfiniBand driver team.

From: ofiwg [mailto:[email protected]] On Behalf Of [email protected]
Sent: Monday, August 15, 2016 7:50 AM
To: [email protected]; [email protected]; [email protected]; [email protected]
Subject: [ofiwg] OFED compatibility issue with Qlogic Infiniband card

Hi,

I am trying to work with CentOS 6.8 and a QLogic Corp. IBA6110 InfiniBand HCA (rev 3). While testing, I receive the following error message:

[[23581,1],4][btl_openib_component.c:3369:handle_wc] from n003 to: 192.168.2.5 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 7057fe8 opcode 2 vendor error 0 qp_idx 3

Increasing btl_openib_ib_timeout had no effect. I am not sure what the error means or how it can be resolved, and it is affecting my performance. I am using the InfiniBand support package from CentOS 6.8 to configure the drivers.

Thanking you,
Regards,
Nikhil
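
A note on the "combination of timeout/retry" mentioned at the top of this thread: the InfiniBand local ACK timeout is nominally 4.096 microseconds * 2^timeout, and the QP gives up with RETRY EXCEEDED (status 12, IBV_WC_RETRY_EXC_ERR) once its retries are used up. The Python sketch below gives a rough estimate of that budget. It rests on assumptions not stated in the thread: that each attempt waits one nominal timeout, that the total is about (retry_count + 1) timeouts, and that the values 20 and 7 used in the usage note are the Open MPI defaults for btl_openib_ib_timeout and btl_openib_ib_retry_count (please confirm with ompi_info on your system). The function names are made up for illustration.

```python
# Rough estimate only: real HCAs may wait longer than the nominal timeout.

BASE_ACK_TIMEOUT_S = 4.096e-6  # InfiniBand local ACK timeout unit, in seconds


def local_ack_timeout_s(ib_timeout: int) -> float:
    """Nominal local ACK timeout in seconds: 4.096 us * 2**ib_timeout."""
    # The field is 5 bits wide; 0 is a special value and is not modeled here.
    if not 1 <= ib_timeout <= 31:
        raise ValueError("ib_timeout must be in 1..31 for this estimate")
    return BASE_ACK_TIMEOUT_S * (2 ** ib_timeout)


def retry_budget_s(ib_timeout: int, retry_count: int) -> float:
    """Approximate time before RETRY EXCEEDED is reported.

    Assumption: one initial attempt plus retry_count retries, each
    waiting one nominal local ACK timeout.
    """
    if not 0 <= retry_count <= 7:
        raise ValueError("retry_count is a 3-bit field (0..7)")
    return local_ack_timeout_s(ib_timeout) * (retry_count + 1)
```

Under these assumptions, retry_budget_s(20, 7) comes to roughly 34 seconds per failed send, and each increment of the timeout doubles it. If the fabric path itself is faulty (case #3), raising btl_openib_ib_timeout only lengthens the wait before the error appears, which would be consistent with the observation that increasing it had no effect.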

_______________________________________________
ofiwg mailing list
[email protected]
http://lists.openfabrics.org/mailman/listinfo/ofiwg
