"Christian MATHIEU" <[EMAIL PROTECTED]> and others asked about
network performance & monitoring tools. Here are some of the things
we use for this at the UM for IFS production services.
General:
/etc/ping
is a good quick network diagnostic. If run for a while (options vary)
it will report min/avg/max network delays, and packet loss rates -
actually surprisingly sufficient for many purposes. The only
thing ping isn't good at is data-sensitive network problems, where
only packets with particular contents get lost or damaged
(which can happen).
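To make the min/avg/max and loss figures concrete, here is a minimal Python sketch that computes the same kind of summary from a list of round-trip samples. The function name summarize_rtts and the convention that None marks a lost packet are assumptions made for this example; ping itself exposes nothing of the sort.

```python
def summarize_rtts(samples):
    """Summarize round-trip times (in ms); None marks a lost packet."""
    if not samples:
        raise ValueError("no samples")
    received = [s for s in samples if s is not None]
    loss_pct = 100.0 * (len(samples) - len(received)) / len(samples)
    if not received:
        # Every packet was lost, so there are no delays to report.
        return {"min": None, "avg": None, "max": None, "loss_pct": loss_pct}
    return {
        "min": min(received),
        "avg": sum(received) / len(received),
        "max": max(received),
        "loss_pct": loss_pct,
    }
```

Feeding it the delays collected over a long run gives a rough baseline to compare against when someone reports that the network "feels slow".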
netstat
vmstat
ps
traceroute
are also good tools sometimes...
On an RS/6K (AIX):
iptrace
is occasionally helpful in tracking down weird network problems
(like data sensitivity.) A good network sniffer can also help here.
AFS specific:
rxdebug
Allows you a quick look into any rx based process - it will tell
you what connections are active. "-port" specifies which UDP
port - hence which process, "-allconns" also prints out recent
connections that aren't currently active, & "-rxstats" prints
out some statistics which may be useful (though I don't entirely
trust them).
cmdebug
Can be used to look inside a cache manager. This gives you additional
information on what's in use.
udebug
Can be used to get a snap-shot of ubik status. This can be useful
in tracking down sync failures & other problems - which can cause
long mysterious delays even though everything looks like it's up.
"xstat"
the "xstat" library includes a way to look inside the file
server and look at a bunch of performance counters it keeps,
and inside a cache manager for similar things. Beware: old
versions of the AFS file server log every call - if you write
something that uses it once a second and leave it running
for a while (as I did once in my ignorance), you can make the log
grow real big on the file server and perhaps cause problems on it.
Mysteriously, the log message doesn't include the IP address
so the people who track it down will be very unhappy by the time
they find you.
scout
You probably are already familiar with this.
UM specific:
With the exception of scout, none of the tools above
has an option for continuous operation, and none of them has
facilities to trip an alarm or otherwise identify actual problems.
So, at least here, several tools have evolved to answer these
and other specialized needs.
rtime
This is a program that just asks the file server what time it is.
I think the original version was by Michael T. Stolarchuk of CITI.
A variety of versions are floating around - the most sophisticated
versions can run in the background, watch for timeouts that are "too
large", or decide if a server is "up" or "down", and syslog these
events. This provides a fairly simple mechanism to keep a performance
and reliability log.
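The idea behind the background versions can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual rtime code: ask_time is a hypothetical callable that queries one server and raises OSError when no answer arrives, and the names probe_once, watch, and slow_after are invented for this example.

```python
import time


def probe_once(ask_time, slow_after=2.0):
    """Call ask_time() and classify the result as 'up', 'slow', or 'down'.

    ask_time is a hypothetical callable that queries one server and raises
    OSError (or a subclass such as TimeoutError) when it gets no answer.
    """
    start = time.monotonic()
    try:
        ask_time()
    except OSError:
        return "down", time.monotonic() - start
    elapsed = time.monotonic() - start
    return ("slow" if elapsed > slow_after else "up"), elapsed


def watch(ask_time, log, rounds, interval=0.0, slow_after=2.0):
    """Probe repeatedly and log only when the state changes."""
    last = None
    for _ in range(rounds):
        state, elapsed = probe_once(ask_time, slow_after)
        if state != last:
            log("server is %s (%.3fs)" % (state, elapsed))
            last = state
        time.sleep(interval)
    return last
```

On a Unix host, passing syslog.syslog as the log argument would give the kind of performance and reliability log described above. Logging only state changes keeps the log small even when the probe runs often.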
upoll
This is a program that does what "udebug" does, except it
produces a more compact report, and it can run continuously,
monitor several servers at once, and report changes on each.
Using this, it's possible to watch all of the steps as a ubik
service loses sync and recovers.
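The "report changes" part is simple to sketch in Python. This is an illustration only, not upoll's code: report_changes is a name invented here, and the status strings in the usage example are hypothetical placeholders rather than real udebug output.

```python
def report_changes(previous, current):
    """Return one line per server whose status differs between two polls.

    Both arguments map a server name to a short status string.
    """
    lines = []
    for server in sorted(set(previous) | set(current)):
        old = previous.get(server, "unknown")
        new = current.get(server, "unknown")
        if old != new:
            lines.append("%s: %s -> %s" % (server, old, new))
    return lines


# Example with made-up server names and status strings:
before = {"db1": "sync site", "db2": "in sync"}
after = {"db1": "sync site", "db2": "no quorum"}
for line in report_changes(before, after):
    print(line)
```

Printing only the differences between successive polls is what makes a continuous report compact enough to follow by eye.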
bigbro
This program is another one of those programs that has
been evolving here and various versions can be found at
different spots on campus. I believe the original version
comes from CAEN. The version that monitors IFS
has been specially modified for this use, and includes
special support for several "threaded" sub-monitor programs
that are basically specialized versions of "rtime", with
support for at least some of the various other AFS services.
bigbro can also do "bos status", "ping", and SNMP queries,
and knows all the dependencies, so it knows when a network
outage has obscured a file server, and won't classify
that as a file server outage. It's all tied in with a log
file and a pager, so an online log of any outages is kept,
an "up to the minute" report can be gotten with finger
(try "finger [EMAIL PROTECTED]"), and an outage that
lasts for more than 5 minutes results in a page to a human who can
go inspect the problem and initiate any manual recovery procedure
needed.
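The dependency logic and the five-minute rule can be sketched roughly as follows in Python. This is not how bigbro is actually written; the names classify, needs_page, and PAGE_AFTER, and the shape of the two dictionaries, are all assumptions made for illustration.

```python
PAGE_AFTER = 5 * 60  # seconds; the five-minute threshold described above


def classify(checks, depends_on):
    """Label each monitored item 'up', 'down', or 'obscured'.

    checks maps an item name to True (check passed) or False (check failed).
    depends_on maps an item to the items it must reach through; every item
    named there must also appear in checks.
    """
    labels = {}

    def label(item, seen=()):
        if item in labels:
            return labels[item]
        if item in seen:
            raise ValueError("dependency cycle at %r" % (item,))
        parents = depends_on.get(item, ())
        if checks[item]:
            result = "up"
        elif any(label(p, seen + (item,)) != "up" for p in parents):
            result = "obscured"  # a dependency is not up, so we cannot tell
        else:
            result = "down"
        labels[item] = result
        return result

    for item in checks:
        label(item)
    return labels


def needs_page(down_since, now):
    """True once an outage has lasted longer than PAGE_AFTER seconds."""
    return down_since is not None and now - down_since > PAGE_AFTER
```

With a failed router check and a failed check on a file server behind it, only the router comes out as "down"; the file server is "obscured" and so would not be logged or paged as a file server outage.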
ipd
Is a program that prints out files produced by the RS/6K AIX
utility "iptrace". It knows something about AppleTalk and RX, so
in certain cases (like netatalk and IFS) can be more useful than
the standard AIX "ipreport" utility.
If there were general interest in any of these, I think we could
scrape up the time to package them up for anonymous ftp as
"unsupported" products (i.e., if you don't know what to do
with a core dump and don't like to read C source code over lunch,
you almost certainly won't want to mess with these.)
-Marcus Watts
UM ITD RS Umich Systems Group