On Thu, 2005-06-02 at 20:25, Troy Benjegerdes wrote:
> Some of my problems seem to be from intermittent cables..
>
> Is there anything for OpenIB that can read error counters?
Aside from pulling these from the driver via
/sys/class/infiniband/mthca0/ports/1/counters/, there is also perfquery,
which displays the port counters (which include the error counters):
Usage: perfquery [-d(ebug) -G(uid_addr) -a(ll_ports) -r(reset_after_read)
       -C ca_name -P hca_port -R(eset_only) -t timeout_ms -V(ersion) -h(elp)]
       [<lid|guid> [[port] [reset_mask]]]
Examples:
  perfquery                 # read local port's performance counters
  perfquery 32 1            # read performance counters from lid 32, port 1
  perfquery -a 32           # read performance counters from lid 32, all ports
  perfquery -r 32 1         # read performance counters and reset
  perfquery -R 32 1         # reset performance counters of port 1 only
  perfquery -R -a 32        # reset performance counters of all ports
  perfquery -R 32 2 0xf000  # reset only non-error counters of port 2
perfquery 2 1
# Port counters: Lid 0x2 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................1506
LinkRecovers:....................255
LinkDowned:......................1
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtBytes:........................2612
RcvBytes:........................2160
XmtPkts:.........................36
RcvPkts:.........................30
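
For scripting against this, the output is easy to turn into name/value
pairs. What follows is only a rough sketch in Python, assuming the
"Name:....value" layout shown above stays as it is; the function names
parse_portcounters and read_portcounters are made up for illustration
and are not part of any OpenIB tool.

```python
import re
import subprocess

# Matches lines such as "SymbolErrors:....................1506".
# The "# Port counters: ..." header line does not match and is skipped.
COUNTER_LINE = re.compile(r"^\s*(\w+):\.+(0x[0-9a-fA-F]+|\d+)\s*$")


def parse_portcounters(text):
    """Return {counter_name: int} parsed from perfquery-style output."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if not match:
            continue
        name, value = match.groups()
        # CounterSelect is printed in hex (0x0000); the rest are decimal.
        base = 16 if value.lower().startswith("0x") else 10
        counters[name] = int(value, base)
    return counters


def read_portcounters(lid, port):
    """Run 'perfquery <lid> <port>' and parse what it prints.

    Assumes perfquery is on PATH and prints the layout shown above.
    """
    result = subprocess.run(
        ["perfquery", str(lid), str(port)],
        capture_output=True, text=True, check=True,
    )
    return parse_portcounters(result.stdout)
```

With that, read_portcounters(2, 1) would return a dictionary keyed by
the counter names in the listing above, e.g. "SymbolErrors".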
> What I'd really like to see is something that I can integrate with
> nagios ( http://www.nagios.org/about )
Nagios says it runs external plugins, so it should be possible to write
one for this. By polling the counters at some rate, the plugin could
trigger contact notifications based on some algorithm for deciding when
that is appropriate (e.g. the error counters on a link are increasing,
so that link is suspect and its cable might be intermittent).
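
A rough sketch of the decision part of such a plugin, in Python, might
look like the following. The exit codes 0/1/2/3 (OK, WARNING, CRITICAL,
UNKNOWN) are the usual Nagios plugin convention. Everything else is an
assumption for illustration: the function names, the warn/crit
thresholds, the choice of watched counters (names taken from the
perfquery output), and keeping the previous sample in a JSON state file.

```python
import json

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Error counter names as printed by perfquery (an illustrative subset).
WATCHED = ("SymbolErrors", "LinkRecovers", "LinkDowned", "RcvErrors",
           "LinkIntegrityErrors", "ExcBufOverrunErrors")


def counter_deltas(previous, current, watched=WATCHED):
    """Return {name: increase} for watched counters that grew since last poll."""
    deltas = {}
    for name in watched:
        if name in previous and name in current:
            delta = current[name] - previous[name]
            if delta > 0:  # a decrease means the counter was reset; ignore it
                deltas[name] = delta
    return deltas


def evaluate(deltas, warn=1, crit=100):
    """Map counter increases to a Nagios (exit_code, message) pair.

    The thresholds are placeholders, not recommended values.
    """
    if not deltas:
        return OK, "OK - no error counters increased"
    detail = ", ".join(f"{name} +{delta}" for name, delta in sorted(deltas.items()))
    worst = max(deltas.values())
    if worst >= crit:
        return CRITICAL, f"CRITICAL - {detail}"
    if worst >= warn:
        return WARNING, f"WARNING - {detail}"
    return OK, f"OK - {detail}"


def check(current, state_path):
    """Compare current counters with those saved by the previous run."""
    try:
        with open(state_path) as fh:
            previous = json.load(fh)
    except (OSError, ValueError):
        previous = None  # first run, or unreadable state file
    with open(state_path, "w") as fh:
        json.dump(current, fh)
    if previous is None:
        return UNKNOWN, "UNKNOWN - no previous sample to compare against"
    return evaluate(counter_deltas(previous, current))
```

The plugin itself would fetch the current counters for one lid/port,
call check() with a per-port state file, print the message and exit
with the returned code. Comparing against the previous sample rather
than against zero matters here because the counters are cumulative; a
counter that has already pegged at its maximum (LinkRecovers at 255
above may well be one) will show no further increase until it is reset,
e.g. with perfquery -R.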
-- Hal
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general