You need to learn how to use the available diagnostic tools.
cmdebug can tell you what the cache manager on a "stuck"
client thinks it's doing.
Sample run:
llama% cmdebug llama
** Cache entry @ 0x05c3d03c for 1.1853153849.12.1289
locks: (writer_waiting(pid:0 at:0), 1 waiters)
66061312 bytes DV 18 refcnt 1
callback 8dd3016e expires 845512652
1 opens 0 writers
normal file
states (0x1), stat'd
llama%
In the above output, the file ID is printed as cell.volume.vnode.uniquifier,
so 1853153849 is the volume ID. There is one process waiting for the lock.
The file is large (66061312 bytes), etc.
This is a much better way to narrow down problems
than trying to log into each server in turn.
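As I recall, cmdebug normally shows only the cache entries whose lock
states look interesting; if you want to see everything the cache manager
is holding, there is a "-long" flag for that. A sketch, with a made-up
client name (the default port is 7001, so you rarely need to give one):

llama% cmdebug badclient.example.com -long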
rxdebug can tell you whether rx calls are pending to or from a
server or client. Sample output (on a client):
llama% rxdebug llama 7001
Trying 141.211.168.86 (port 7001):
Free packets: 65, packet reclaims: 0, calls: 30632, used FDs: 64
debugi.version M
not waiting for packets.
Connection from host 141.211.1.109, port 7000, Cuid 81b7607c/c8643758
serial 2, natMTU 0, security index 0, client conn
call 0: # 1, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Connection from host 141.211.1.110, port 7000, Cuid 81b7607c/c864375c
serial 2, natMTU 0, security index 0, client conn
call 0: # 1, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Done.
llama%
(on a server):
Trying 141.211.1.110 (port 7000):
Free packets: 316, packet reclaims: 0, calls: 52672751, used FDs: 6
debugi.version N
not waiting for packets.
0 calls waiting for a thread
Connection from host 141.211.1.32, port 7002, Cuid 9ffce4f9/45da89fc
serial 4197424, natMTU 4324, flags pktCksum, security index 2, client conn
rxkad: level crypt, flags pktCksum
Received 88808144 bytes in 2909039 packets
Sent 229144348 bytes in 2909019 packets
call 0: # 2814818, state dally, mode: receiving, flags: receive_done
call 1: # 76266, state not initialized
call 2: # 13317, state not initialized
call 3: # 4004, state not initialized
Connection from host 141.211.168.86, port 7001, Cuid 81b7607c/39639ed4
serial 41241, natMTU 1472, flags pktCksum, security index 2, server conn
rxkad: level clear, flags authenticated pktCksum, expires in 11.2 hours
Received 21616 bytes in 903 packets
Sent 58637460 bytes in 41087 packets
call 0: # 903, state active, mode: sending, flags: window_alloc
receive_done, has_output_packets
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Done.
State "dally" means the call is done; "has_output_packets" means
the call is currently sending data to the client.
The port numbers you're most likely to see:
7000  fileserver        (file server machine)
7001  cache manager     (client)
7002  ptserver          (protection database)
7003  vlserver          (volume location database)
7005  volserver         (file server machine)
(etc.)
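A quick sanity check is to poke each port directly. For example
(server name is made up; "-version" asks the remote rx service for
its build string, and "-noconns" skips the per-connection listing so
you just get the summary counters):

llama% rxdebug afs10.example.com 7003 -version
llama% rxdebug afs10.example.com 7000 -noconns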
If rx can't contact a server at all:
llama% rxdebug homer
Trying 141.211.217.190 (port 7000):
getstats call failed with code -1
llama%
It is reasonable to suppose that the service or the server is not running.
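If the machine itself answers on the network, you can also ask its
bosserver (port 7007) what it thinks is running; "bos status" with
"-long" shows start times and restart counts for each server process
(server name made up):

llama% bos status afs10.example.com -long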
When the fileserver handles requests, it contacts the pt servers to
look up user group membership for acl rights checking. If you have
more than one pt server, the fileserver picks one at random. If that
server is up, but *very slow*, then that fileserver will be very
slow, even though file service from other servers may appear normal.
Various other ubik ills are possible, & the "udebug" command
can be of service here.
llama% udebug fear -port 7002
Host's 141.211.1.32 time is Wed Oct 16 20:09:40 1996
Local time is Wed Oct 16 20:09:41 1996 (time differential 1 secs)
Last yes vote for 141.211.1.32 was 7 secs ago (sync site);
Last vote started 7 secs ago (at Wed Oct 16 20:09:34 1996)
Local db version is 845200834.634
I am sync site until 53 secs from now (at Wed Oct 16 20:10:34 1996) (3 servers)
Recovery state 1f
Sync site's db version is 845200834.634
0 locked pages, 0 of them for write
There are read locks held
Last time a new db version was labelled was:
310147 secs ago (at Sun Oct 13 06:00:34 1996)
Server 141.211.1.34: (db 845200834.634)
last vote rcvd 7 secs ago (at Wed Oct 16 20:09:34 1996),
last beacon sent 7 secs ago (at Wed Oct 16 20:09:34 1996), last vote was
yes
dbcurrent=1, up=1 beaconSince=1
Server 141.211.1.33: (db 845200834.634)
last vote rcvd 7 secs ago (at Wed Oct 16 20:09:34 1996),
last beacon sent 7 secs ago (at Wed Oct 16 20:09:34 1996), last vote was
yes
dbcurrent=1, up=1 beaconSince=1
llama%
llama% udebug surprise -port 7002
Host's 141.211.1.33 time is Wed Oct 16 20:10:51 1996
Local time is Wed Oct 16 20:10:51 1996 (time differential 0 secs)
Last yes vote for 141.211.1.32 was 2 secs ago (sync site);
Last vote started 2 secs ago (at Wed Oct 16 20:10:49 1996)
Local db version is 845200834.634
I am not sync site
Lowest host 141.211.1.32 was set 2 secs ago
Sync host 141.211.1.32 was set 2 secs ago
Sync site's db version is 845200834.634
0 locked pages, 0 of them for write
llama%
In the above, fear is the sync site; surprise is not. 845200834 is
the db epoch, & 634 is the db version. The version should be the
same on all servers unless a transaction is in progress. 845200834
is also the number of seconds between Jan 1, 1970 & when a db quorum
was last established, in this case at Oct 13 06:00:34 1996. Recovery
state should always be 1f, except during or immediately after a restart
or network partition; anything else indicates a problem.
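When quorum looks suspect, it's worth polling all of the db servers
in one pass and comparing the db versions by eye. A throwaway Bourne
shell loop, using the two hosts above plus a made-up third:

for h in fear surprise afsdb3; do
    echo "=== $h ==="
    udebug $h -port 7002
done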
A volume can be made busy or off-line, and this can cause odd results.
The vos command can provide more information on this kind of thing.
llama% vos status afs10
Total transactions: 1
--------------------------------------
transaction: 2054166 created: Wed Oct 16 19:48:20 1996
attachFlags: offline
volume: 536878734 partition: /vicepj procedure: DeleteVolume
packetRead: 2 lastReceiveTime: 6366832 packetSend: 1 lastSendTime: 6366833
--------------------------------------
llama%
(In the above, the volumes associated with 536878734 would be marked
busy, and if a "vos e" (vos examine) were done, they might appear as
follows):
llama% vos e group.umich.bin
**** Volume 536878733 is busy ****
RWrite: 536878733 ROnly: 536878734 Backup: 536878735
number of sites -> 3
server afs10.ifs.umich.edu partition /vicepj RW Site
server afs10.ifs.umich.edu partition /vicepj RO Site
server afs11.ifs.umich.edu partition /vicepj RO Site
Volume is currently LOCKED
llama%
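If a volume entry stays LOCKED long after its transaction is gone
(say, a vos operation was killed partway through), the VLDB lock can
be cleared by hand with "vos unlock". Make sure "vos status" on the
relevant servers shows no transaction still running first; the volume
name here is the one from the example above:

llama% vos unlock group.umich.bin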
In all of the above cases, the messages I get will likely differ
slightly in format from what you'll see.
The message format changes slightly between versions,
and "cmdebug" in particular is not portable between
different revisions of AFS. You have to use exactly the
right version of cmdebug (ie, from the same AFS build that
produced the cache manager) to get useful results.
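One way to see which build a cache manager came from is to ask it over
rx, and compare that against the cmdebug binary you're about to use
(client name made up):

llama% rxdebug badclient.example.com 7001 -version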
You reported that you weren't able to log into the affected
fileserver. That suggests an OS problem that was somehow
affecting AFS service. Check the usual suspects on the fileserver.
Did the machine crash recently? Are there any interesting log
messages (in /usr/adm/messages)? Does the machine collect error
logging information anywhere else? In your case, you are running
AIX, so the answer is yes -- the utility "errpt" can display
these errors. Are there any obvious signs of distress from disk
drives (strange noises, odd light patterns)? (Pssst. If you still
have any IBM 850 M 5.25" drives (part# 95x2430 or similar,
marked "made in england") left in service, consider turning
them into paper weights as soon as possible.) Is there any
other odd event, such as a file partition being "almost full"?
(In some versions of AFS and AIX, anything over 92% full is
very bad news.) Check the salvage log -- are there any recent
messages there? Does the machine have enough swap on it?
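For the AIX case, a concrete sketch of those checks (this assumes a
stock Transarc layout, with server logs in /usr/afs/logs and
partitions mounted as /vicep*; adjust to your setup):

errpt | head -20                    # recent AIX error log entries ("errpt -a" for detail)
df -k /vicep*                       # how full are the server partitions?
tail -50 /usr/afs/logs/SalvageLog   # any recent salvager complaints?
lsps -a                             # enough paging (swap) space?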
It is very odd that a problem with the server binaries would
keep you from logging in. Surely you aren't perchance running
the cache manager on the file server?
You should try the above first, but for much more than this,
you should contact your Transarc customer representative.
Under their instructions, you may be able to cause a core dump of the
fileserver process and submit it to them for more detailed diagnostics.
The above should help you narrow down which component
is failing, and may help you find external causes which you
can best fix yourself.
-Marcus Watts
UM ITD PD&D Umich Systems Group