You need to learn how to use the available diagnostic tools.
cmdebug can tell you what the cache manager on a "stuck"
client thinks it's doing.
Sample run:
llama% cmdebug llama
** Cache entry @ 0x05c3d03c for 1.1853153849.12.1289
locks: (writer_waiting(pid:0 at:0), 1 waiters)
66061312 bytes DV 18 refcnt 1
callback 8dd3016e expires 845512652
1 opens 0 writers
normal file
states (0x1), stat'd
llama%
In the above output, the file ID is printed as cell.volume.vnode.uniquifier,
so 1853153849 is the volume ID. There is one process waiting for the lock.
The file is large (66061312 bytes), etc.
This is a much better way to narrow down problems
than trying to log into each server in turn.
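As I recall, cmdebug normally shows only the cache entries whose lock
states look interesting; if you want to see everything the cache manager
is holding, there is a "-long" flag for that. A sketch, with a made-up
client name (the default port is 7001, so you rarely need to give one):

llama% cmdebug badclient.example.com -long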
rxdebug can tell you whether rx calls are pending to or from a
server or client. Sample output (on a client):
llama% rxdebug llama 7001
Trying 141.211.168.86 (port 7001):
Free packets: 65, packet reclaims: 0, calls: 30632, used FDs: 64
debugi.version M
not waiting for packets.
Connection from host 141.211.1.109, port 7000, Cuid 81b7607c/c8643758
serial 2, natMTU 0, security index 0, client conn
call 0: # 1, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Connection from host 141.211.1.110, port 7000, Cuid 81b7607c/c864375c
serial 2, natMTU 0, security index 0, client conn
call 0: # 1, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Done.
llama%
(on a server):
Trying 141.211.1.110 (port 7000):
Free packets: 316, packet reclaims: 0, calls: 52672751, used FDs: 6
debugi.version N
not waiting for packets.
0 calls waiting for a thread
Connection from host 141.211.1.32, port 7002, Cuid 9ffce4f9/45da89fc
serial 4197424, natMTU 4324, flags pktCksum, security index 2, client conn
rxkad: level crypt, flags pktCksum
Received 88808144 bytes in 2909039 packets
Sent 229144348 bytes in 2909019 packets
call 0: # 2814818, state dally, mode: receiving, flags: receive_done
call 1: # 76266, state not initialized
call 2: # 13317, state not initialized
call 3: # 4004, state not initialized
Connection from host 141.211.168.86, port 7001, Cuid 81b7607c/39639ed4
serial 41241, natMTU 1472, flags pktCksum, security index 2, server conn
rxkad: level clear, flags authenticated pktCksum, expires in 11.2 hours
Received 21616 bytes in 903 packets
Sent 58637460 bytes in 41087 packets
call 0: # 903, state active, mode: sending, flags: window_alloc
receive_done, has_output_packets
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Done.
State "dally" means the call is done; "has_output_packets" means
the call is currently sending data to the client.
The port numbers you're most likely to see:
7000  fileserver        (file server machine)
7001  cache manager     (client)
7002  ptserver          (protection database)
7003  vlserver          (volume location database)
7005  volserver         (file server machine)
(etc.)
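A quick sanity check is to poke each port directly. For example
(server name is made up; "-version" asks the remote rx service for
its build string, and "-noconns" skips the per-connection listing so
you just get the summary counters):

llama% rxdebug afs10.example.com 7003 -version
llama% rxdebug afs10.example.com 7000 -noconns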
If rx can't contact a server at all:
llama% rxdebug homer
Trying 141.211.217.190 (port 7000):
getstats call failed with code -1
llama%
It is reasonable to suppose that the service or the server is not running.
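If the machine itself answers on the network, you can also ask its
bosserver (port 7007) what it thinks is running; "bos status" with
"-long" shows start times and restart counts for each server process
(server name made up):

llama% bos status afs10.example.com -long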
When the fileserver handles requests, it contacts the pt servers to
look up user group membership for acl rights checking. If you have
more than one pt server, the fileserver picks one at random. If that
server is up, but *very slow*, then that fileserver will be very
slow, even though file service from other servers may appear normal.
Various other ubik ills are possible, & the "udebug" command
can be of service here.
llama% udebug fear -port 7002
Host's 141.211.1.32 time is Wed Oct 16 20:09:40 1996
Local time is Wed Oct 16 20:09:41 1996 (time differential 1 secs)
Last yes vote for 141.211.1.32 was 7 secs ago (sync site);
Last vote started 7 secs ago (at Wed Oct 16 20:09:34 1996)
Local db version is 845200834.634
I am sync site until 53 secs from now (at Wed Oct 16 20:10:34 1996) (3 servers)
Recovery state 1f
Sync site's db version is 845200834.634
0 locked pages, 0 of them for write
There are read locks held
Last time a new db version was labelled was:
310147 secs ago (at Sun Oct 13 06:00:34 1996)
Server 141.211.1.34: (db 845200834.634)
last vote rcvd 7 secs ago (at Wed Oct 16 20:09:34 1996),
last beacon sent 7 secs ago (at Wed Oct 16 20:09:34 1996), last vote was
yes
dbcurrent=1, up=1 beaconSince=1
Server 141.211.1.33: (db 845200834.634)
last vote rcvd 7 secs ago (at Wed Oct 16 20:09:34 1996),
last beacon sent 7 secs ago (at Wed Oct 16 20:09:34 1996), last vote was
yes
dbcurrent=1, up=1 beaconSince=1
llama%
llama% udebug surprise -port 7002
Host's 141.211.1.33 time is Wed Oct 16 20:10:51 1996
Local time is Wed Oct 16 20:10:51 1996 (time differential 0 secs)
Last yes vote for 141.211.1.32 was 2 secs ago (sync site);
Last vote started 2 secs ago (at Wed Oct 16 20:10:49 1996)
Local db version is 845200834.634
I am not sync site
Lowest host 141.211.1.32 was set 2 secs ago
Sync host 141.211.1.32 was set 2 secs ago
Sync site's db version is 845200834.634
0 locked pages, 0 of them for write
llama%
In the above, fear is the sync site; surprise is not. 845200834 is
the db epoch, & 634 is the db version. The version should be the
same on all servers unless a transaction is in progress. 845200834
is also the number of seconds between Jan 1, 1970 & when a db quorum
was last established, in this case at Oct 13 06:00:34 1996. Recovery
state should always be 1f, except during or immediately after a restart
or network partition; anything else indicates a problem.
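When quorum looks suspect, it's worth polling all of the db servers
in one pass and comparing the db versions by eye. A throwaway Bourne
shell loop, using the two hosts above plus a made-up third:

for h in fear surprise afsdb3; do
    echo "=== $h ==="
    udebug $h -port 7002
done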
A volume can be made busy or off-line, and this can cause odd results.
The vos command can provide more information on this kind of thing.
llama% vos status afs10
Total transactions: 1
--------------------------------------
transaction: 2054166 created: Wed Oct 16 19:48:20 1996
attachFlags: offline
volume: 536878734 partition: /vicepj procedure: DeleteVolume
packetRead: 2 lastReceiveTime: 6366832 packetSend: 1 lastSendTime: 6366833
--------------------------------------
llama%
(In the above, the volumes associated with 536878734 would be marked
busy, and if a "vos e" (vos examine) were done, they might appear as
follows):
llama% vos e group.umich.bin
**** Volume 536878733 is busy ****
RWrite: 536878733 ROnly: 536878734 Backup: 536878735
number of sites -> 3
server afs10.ifs.umich.edu partition /vicepj RW Site
server afs10.ifs.umich.edu partition /vicepj RO Site
server afs11.ifs.umich.edu partition /vicepj RO Site
Volume is currently LOCKED
llama%
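If a volume entry stays LOCKED long after its transaction is gone
(say, a vos operation was killed partway through), the VLDB lock can
be cleared by hand with "vos unlock". Make sure "vos status" on the
relevant servers shows no transaction still running first; the volume
name here is the one from the example above:

llama% vos unlock group.umich.bin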
In all of the above cases, the messages I get will likely differ
slightly in format from what you'll see.
The message format changes slightly between versions,
and "cmdebug" in particular is not portable between
different revisions of AFS. You have to use exactly the
right version of cmdebug (ie, from the same AFS build that
produced the cache manager) to get useful results.
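One way to see which build a cache manager came from is to ask it over
rx, and compare that against the cmdebug binary you're about to use
(client name made up):

llama% rxdebug badclient.example.com 7001 -version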
You reported that you weren't able to log into the affected
fileserver. That suggests an OS problem that was somehow
affecting AFS service. Check the usual suspects on the fileserver.
Did the machine crash recently? Are there any interesting log
messages (in /usr/adm/messages)? Does the machine collect error
logging information anywhere else? In your case, you are running
AIX, so the answer is yes -- the utility "errpt" can display
these errors. Are there any obvious signs of distress from disk
drives (strange noises, odd light patterns)? (Pssst. If you still
have any IBM 850 M 5.25" drives (part# 95x2430 or similar,
marked "made in england") left in service, consider turning
them into paper weights as soon as possible.) Is there any
other odd event, such as a file partition being "almost full"?
(In some versions of AFS and AIX, anything over 92% full is
very bad news.) Check the salvage log -- are there any recent
messages there? Does the machine have enough swap on it?
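For the AIX case, a concrete sketch of those checks (this assumes a
stock Transarc layout, with server logs in /usr/afs/logs and
partitions mounted as /vicep*; adjust to your setup):

errpt | head -20                    # recent AIX error log entries ("errpt -a" for detail)
df -k /vicep*                       # how full are the server partitions?
tail -50 /usr/afs/logs/SalvageLog   # any recent salvager complaints?
lsps -a                             # enough paging (swap) space?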
It is very odd that a problem with the server binaries would
keep you from logging in. Surely you aren't perchance running
the cache manager on the file server?
You should try the above first, but for much more than this,
you should contact your Transarc customer representative.
Under their instructions, you may be able to cause a core dump of the
fileserver process and submit it to them for more detailed diagnostics.
The above should help you narrow down which component
is failing, and may help you find external causes which you
can best fix yourself.
-Marcus Watts
UM ITD PD&D Umich Systems Group