This error means the timeout and retry limits configured for the QP (queue 
pair) have been exceeded without an acknowledgment from the remote side. It 
could mean:

1. The destination does not exist on the IB fabric.

2. The destination address was not properly resolved and the wrong IB address 
is being used.

3. The fabric path between this host and the remote host is not stable or has 
a high symbol error rate.

If this message occurs at the start of the job, it is likely #1 or #2.  If it 
occurs later in the job after some traffic has been successfully sent between 
those nodes, #3 is more likely.
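
The rule of thumb above can be written down as a small triage helper. The 
following is a minimal Python sketch, not part of Open MPI or any IB tooling: 
the names parse_wc_error and likely_causes are made up for illustration, and 
the regular expression only assumes the message layout of the error line 
quoted later in this thread.

```python
import re

# Layout assumed from the openib BTL error line quoted in this thread;
# other Open MPI versions may format the message differently.
WC_ERROR = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling (?P<cq>\w+) CQ "
    r"with status (?P<status>.+?) status number (?P<number>\d+)"
)


def parse_wc_error(line):
    """Return the fields of a completion-error line, or None if no match."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    text = " ".join(line.split())
    match = WC_ERROR.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    return fields


def likely_causes(traffic_seen_before_error):
    """Map the timing of the error to the numbered causes listed above."""
    if traffic_seen_before_error:
        return [3]  # unstable path or high symbol error rate
    return [1, 2]  # destination missing or wrongly resolved


sample = """[[23581,1],4][btl_openib_component.c:3369:handle_wc] from n003 to: 192.168.2.5
error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
7057fe8 opcode 2  vendor error 0 qp_idx 3"""

info = parse_wc_error(sample)
print(info["src"], "->", info["dst"], info["status"], likely_causes(True))
```

The parsed source and destination tell you which pair of nodes to check 
first; whether traffic had already flowed between them is something you have 
to supply from your own job logs.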

Since you indicate it is affecting performance, I assume it occurs mid-job, so 
#3 is more likely. In that case, use tools like ibping and ibdiagnet to 
analyze the errors in the fabric. Better yet, use the Intel True Scale IB 
Fabric Suite, which includes a rich set of fabric analysis and diagnostic 
tools (contact your HW supplier or Intel if you do not have it).

Once you resolve this connectivity issue, for the best MPI performance on 
QLogic InfiniBand cards it is recommended to use Open MPI’s psm mtl rather 
than the verbs btl.

If the problem is #3, you should work with the distributor you purchased your 
hardware from to further debug the faulty component.

Todd Rimmer
DCSG Architecture
Voice: 610-312-2152     Fax: 610-312-2233
[email protected]

From: tac [mailto:[email protected]] On Behalf Of Woodruff, 
Robert J
Sent: Monday, August 15, 2016 12:28 PM
To: [email protected]; [email protected]; 
[email protected]; [email protected]; 
[email protected]
Cc: Marciniszyn, Mike
Subject: Re: [tac] [ofiwg] OFED compatibility issue with Qlogic Infiniband card

+ Mike from the Intel InfiniBand driver team.

From: ofiwg [mailto:[email protected]] On Behalf Of [email protected]
Sent: Monday, August 15, 2016 7:50 AM
To: [email protected]; [email protected]; 
[email protected]; [email protected]
Subject: [ofiwg] OFED compatibility issue with Qlogic Infiniband card

Hi,

I am trying to work with CentOS 6.8 and a QLogic Corp. IBA6110 InfiniBand 
HCA (rev 3). While testing, I receive the following error message:

[[23581,1],4][btl_openib_component.c:3369:handle_wc] from n003 to: 192.168.2.5 
error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 
7057fe8 opcode 2  vendor error 0 qp_idx 3

and increasing btl_openib_ib_timeout had no effect.
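
For reference, the InfiniBand local ACK timeout is set as an exponent rather 
than a duration: the nominal wait per attempt is 4.096 microseconds times 2 
to that power. The Python sketch below works through the arithmetic. The 
values 20 and 7 are the defaults commonly documented for 
btl_openib_ib_timeout and btl_openib_ib_retry_count and should be treated as 
assumptions for any given build; the helper names are hypothetical, and the 
total is only a rough estimate because hardware may wait longer than the 
nominal interval.

```python
def ack_timeout_seconds(ib_timeout):
    """Nominal local ACK timeout: 4.096 microseconds * 2**ib_timeout."""
    return 4.096e-6 * (2 ** ib_timeout)


def time_until_retry_exceeded(ib_timeout, retry_count):
    """Rough estimate: one initial attempt plus retry_count retries."""
    return (retry_count + 1) * ack_timeout_seconds(ib_timeout)


# Assumed defaults; check your own Open MPI configuration.
for timeout in (20, 22):
    per_try = ack_timeout_seconds(timeout)
    total = time_until_retry_exceeded(timeout, retry_count=7)
    print(f"ib_timeout={timeout}: {per_try:.2f} s per attempt, ~{total:.1f} s total")
```

Each step of the exponent doubles the wait per attempt, so small increases 
already translate into many seconds before the error is reported.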

I am not sure what it means or how it can be resolved, and it is affecting my 
performance. I am using the InfiniBand support package from CentOS 6.8 to 
configure the drivers.

Thanking you,
Regards,
Nikhil
_______________________________________________
ofiwg mailing list
[email protected]
http://lists.openfabrics.org/mailman/listinfo/ofiwg
