> > > perfquery output before ib_send_bw test:
> >
> > # Port counters: Lid 2 port 1
> > PortSelect:......................1
> > CounterSelect:...................0x1400
> > SymbolErrorCounter:..............15814
> > LinkErrorRecoveryCounter:........255
> > LinkDownedCounter:...............0
> > PortRcvErrors:...................5403
> > PortRcvRemotePhysicalErrors:.....0
> > PortRcvSwitchRelayErrors:........0
> > PortXmitDiscards:................0
> > PortXmitConstraintErrors:........0
> > PortRcvConstraintErrors:.........0
> > CounterSelect2:..................0x00
> > LocalLinkIntegrityErrors:........0
> > ExcessiveBufferOverrunErrors:....0
> > VL15Dropped:.....................0
> > PortXmitData:....................2925583200
> > PortRcvData:.....................145715607
> > PortXmitPkts:....................10975597
> > PortRcvPkts:.....................8191613
> > PortXmitWait:....................7570
> >
> >
> > Run Ib_send_bw test:
> > [root@vsanqa7 ~]# ib_send_bw
> > ------------------------------------------------------------------
> >                     Send BW Test
> >  Number of qps   : 1
> >  Connection type : RC
> >  RX depth        : 600
> >  CQ Moderation   : 50
> >  Mtu             : 2048B
> >  Link type       : IB
> >  Max inline data : 0B
> >  rdma_cm QPs     : OFF
> >  Data ex. method : Ethernet
> > ------------------------------------------------------------------
> >  local address: LID 0x02 QPN 0xde1b PSN 000000
> >  remote address: LID 0x01 QPN 0x64004a PSN 000000
> > ------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
> >  65536      1000           -nan               42.71
> >
> > Which is too low
> >
> > Perfquery after ib_send_bw test:
> >
> > # Port counters: Lid 2 port 1
> > PortSelect:......................1
> > CounterSelect:...................0x1400
> > SymbolErrorCounter:..............20750
> > Are symbol errors increasing ?
> >

Yes.
From the outputs above:

Before the ib_send_bw test, the symbol error counter reads as below:
> SymbolErrorCounter:..............15814

Post test, the following is the counter value:
> SymbolErrorCounter:..............20750

> LinkErrorRecoveryCounter:........255
>
> Could it be that your link goes through error recovery as indicated by
> this counter being max'd out ?
>
> Can you clear this counter and see if it increments ?
>

I will try this the next time I hit the issue.

[...]

> I suspect the link is retraining due to minor errors over threshold or
> major errors.
>
> Can you try some other known good cable ?
>

Will do that and will report if we continue to see issues.

But the fact that the problems disappear every time I reload the modules
suggests it might be some software state that is getting messed up, though
I am only guessing.

Also, it is not just one pair of systems that is seeing this problem. We
have witnessed it between at least 3 pairs of systems, which reduces the
likelihood of this being a cable problem.

Pavan
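
P.S. To avoid comparing the two perfquery dumps by eye, here is a minimal
Python sketch that parses "Name:....value" lines and reports which counters
moved between two snapshots. The function names (parse_perfquery,
counter_deltas, saturated) are hypothetical helpers, not part of any
InfiniBand tool, and the 255 ceiling for LinkErrorRecoveryCounter is taken
from the "max'd out" reading discussed above. Treat it as an illustration
under those assumptions only.

```python
import re

# Assumption from this thread: LinkErrorRecoveryCounter stops at 255.
SATURATION_LIMITS = {"LinkErrorRecoveryCounter": 255}

# Matches lines such as "SymbolErrorCounter:..............15814".
LINE_RE = re.compile(r"^(\w+):\.*(0x[0-9a-fA-F]+|\d+)$")


def parse_perfquery(text):
    """Return {counter_name: int} for every 'Name:....value' line."""
    counters = {}
    for raw in text.splitlines():
        # Drop mail quote markers ("> > ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        match = LINE_RE.match(line)
        if not match:
            continue  # headers such as "# Port counters: ..." are skipped
        name, value = match.groups()
        base = 16 if value.lower().startswith("0x") else 10
        counters[name] = int(value, base)
    return counters


def counter_deltas(before, after):
    """Return {name: after - before} for counters that changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def saturated(counters):
    """Return names of counters sitting at their assumed maximum."""
    return [
        name
        for name, limit in SATURATION_LIMITS.items()
        if counters.get(name) == limit
    ]


if __name__ == "__main__":
    before = parse_perfquery(
        "> > SymbolErrorCounter:..............15814\n"
        "> > LinkErrorRecoveryCounter:........255\n"
    )
    after = parse_perfquery("> > SymbolErrorCounter:..............20750\n")
    print(counter_deltas(before, after))
    print(saturated(before))
```

A saturated counter cannot show further increments, which is why clearing
it before the next run is needed to tell whether the link is still
retraining.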
_______________________________________________
ewg mailing list
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
