On Thu, 2005-06-02 at 20:25, Troy Benjegerdes wrote:
> Some of my problems seem to be from intermittent cables.. 
> 
> Is there anything for OpenIB that can read error counters?

Aside from reading these from the driver via
/sys/class/infiniband/mthca0/ports/1/counters/, there is also perfquery,
which displays the port counters (which include the error counters):

Usage: perfquery [-d(ebug) -G(uid_addr) -a(ll_ports) -r(reset_after_read)
        -C ca_name -P hca_port -R(eset_only) -t timeout_ms -V(ersion)
        -h(elp)] [<lid|guid> [[port] [reset_mask]]]
        Examples:
                perfquery               # read local port's performance counters
                perfquery 32 1          # read performance counters from lid 32, port 1
                perfquery -a 32         # read performance counters from lid 32, all ports
                perfquery -r 32 1       # read performance counters and reset
                perfquery -R 32 1       # reset performance counters of port 1 only
                perfquery -R -a 32      # reset performance counters of all ports
                perfquery -R 32 2 0xf000        # reset only non-error counters of port 2

perfquery 2 1
# Port counters: Lid 0x2 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................1506
LinkRecovers:....................255
LinkDowned:......................1
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtBytes:........................2612
RcvBytes:........................2160
XmtPkts:.........................36
RcvPkts:.........................30
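
If you want to feed this output to a script, something along these lines
might do. It is only a sketch: the "Name:....value" layout is assumed from
the sample above, and parse_perfquery is a name made up for illustration,
not part of any OpenIB tool.

```python
import re

# Matches lines such as "SymbolErrors:....................1506".
# The dotted-padding layout is an assumption taken from the sample output.
_COUNTER_LINE = re.compile(r"^(\w+):\.*(\S+)$")


def parse_perfquery(text):
    """Return {counter_name: int} parsed from perfquery-style output."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# Port counters" header
        match = _COUNTER_LINE.match(line)
        if not match:
            continue
        name, value = match.groups()
        try:
            # Base 0 accepts both decimal and 0x-prefixed hex values.
            counters[name] = int(value, 0)
        except ValueError:
            continue  # ignore anything that is not a number
    return counters
```

The text passed in would be whatever perfquery prints for the port of
interest; a later line with the same name overwrites an earlier one.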

> What I'd really like to see is something that I can integrate with
> nagios ( http://www.nagios.org/about ) 

Nagios runs external plugins, so it would be possible to write one for
this. The plugin would poll the counters at some rate and cause contact
notifications to be issued when some algorithm decides that is
appropriate (e.g. the error counters on a link keep increasing, so its
cable might be intermittent).
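
A rough sketch of what such a plugin could look like follows. It assumes
the counters directory holds one integer per file and relies on the usual
Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The
function names, the state file, the watched counter names, and the
crit_delta threshold are all made up for illustration.

```python
import json
import os

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_counters(counter_dir):
    """Read every integer-valued file in a counters directory."""
    counters = {}
    for name in sorted(os.listdir(counter_dir)):
        try:
            with open(os.path.join(counter_dir, name)) as f:
                counters[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or non-numeric entries
    return counters


def check(current, previous, watched, crit_delta):
    """Compare two samples and return (exit_code, message)."""
    grown = {}
    for name in watched:
        if name in current and name in previous:
            delta = current[name] - previous[name]
            if delta > 0:  # a drop (e.g. after a reset) is ignored
                grown[name] = delta
    if not grown:
        return OK, "OK - no watched error counter increased"
    detail = ", ".join("%s +%d" % (n, d) for n, d in sorted(grown.items()))
    if max(grown.values()) >= crit_delta:
        return CRITICAL, "CRITICAL - " + detail
    return WARNING, "WARNING - " + detail


def main(counter_dir, state_file, watched, crit_delta=100):
    """Poll once, compare with the previous poll, print a status line."""
    try:
        current = read_counters(counter_dir)
    except OSError as exc:
        print("UNKNOWN - cannot read %s: %s" % (counter_dir, exc))
        return UNKNOWN
    previous = {}
    if os.path.exists(state_file):
        with open(state_file) as f:
            previous = json.load(f)
    with open(state_file, "w") as f:
        json.dump(current, f)  # remember this sample for the next poll
    code, message = check(current, previous, watched, crit_delta)
    print(message)
    return code
```

A wrapper script would call main() with the counters directory for the
port and pass the return value to sys.exit(). Since the counters stop at
their maximum rather than wrapping (LinkRecovers at 255 above looks
pegged), the poller would probably also want to reset them now and then,
e.g. with perfquery -R.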

-- Hal

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
