Todd DeSantis wrote:
Hi Rich -

I am glad that

      fs checkvolumes

was able to help you get rid of this problem.

Hopefully this was not a coincidence and the "vos release"
of the bogus root.cell.readonly also did not happen around
this time.

To help understand why your clients were in this state
I would like to ask some questions:

 - a kdump snapshot would have been able to give us some
   information on the state of the client and could have
   helped us determine if any volume and/or vcache entry
   was still pointing at this old fileserver

yes - that would be nice - I wish I used these tools more and
was more proficient with them.  But I'm no longer supposed
to do this ;-)  But as I mentioned these have been happening
for a number of years.  I've also seen very inconsistent
releases of root.cell at our site, e.g. some going offline
and a LOT of communication errors when doing the release
which happens daily at 7 A.M.


   Did you just not build kdump for the client, or does
   OpenAFS not build kdump by default ?

I don't remember - I believe that there are  "problems" getting
this to build on openafs.


 - when was this fileserver taken out of commission, was it
   within 2 hours ?

No, MUCH longer: more than 1 day.


   Normal callback timeouts on volumes would be 2 hours.
   There is a daemon on the client that will run every 2
   hours and it will clear the "volume status" flag on
   the volumes in the volume cache, if the expiration time
   has elapsed.  I think readonly volumes had a maximum
   2 hour timeout.

What happens when the 1st readonly volume is "screwed up"
as we saw yesterday due to the lack of a vos release on root.cell?
Although as mentioned - this always used to work (transparent
fileserver moves and reconfigurations) until the last couple years.

   This process also causes the vcache structures to have
   their CStatd bit cleared.  This tells the client to run
   a FetchStatus call to determine if my cached version is
   still the correct version of the file/dir.

   This is the way that the IBM Transarc clients work.  It is
   possible that the OpenAFS code has changed the callback timing
   a bit, I am not sure of this.

   But the above procedures will cause the following to happen
   the next time you tried to access a file or directory that
   had its volume status flag cleared

      - contact the vlserver and get location information for
        the volume.  If the client still thought that this file
        lived on the bad fileserver, and the VLDB information is
        correct, then it would get the new server location info.

      - it would then contact the fileserver with a FetchStatus
        call to determine if its cache is current, or if it
        needs to do a FetchData call to the fileserver for your
        directories and files.

      - and at this time, it has located the directory/file you
        are looking for
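
The sequence above can be sketched as a small model. This is only an
illustration of the steps as described in this message, not the actual
cache manager code: the class names, the dictionaries standing in for the
VLDB and the fileservers, and the server name "newserver" are all
hypothetical, and the 2 hour lifetime is the figure quoted above, which
OpenAFS may have changed.

```python
from dataclasses import dataclass

CALLBACK_LIFETIME = 2 * 60 * 60  # seconds; the 2 hour figure quoted above


@dataclass
class VolumeEntry:
    name: str
    server: str                # fileserver the client believes holds the volume
    expires_at: float          # when the volume callback runs out
    status_valid: bool = True  # stands in for the "volume status" flag


@dataclass
class CachedFile:
    volume: str
    data_version: int
    statd: bool = True         # stands in for the CStatd bit


def sweep(volumes, files, now):
    """Periodic pass: invalidate expired volumes and the files cached from them."""
    for vol in volumes.values():
        if vol.status_valid and now >= vol.expires_at:
            vol.status_valid = False
            for f in files:
                if f.volume == vol.name:
                    f.statd = False


def access(f, volumes, vldb, fileservers, now):
    """Return the calls a client would make to revalidate one cached file."""
    calls = []
    vol = volumes[f.volume]
    if not vol.status_valid:
        # Step 1: ask the vlserver where the volume lives now.
        vol.server = vldb[vol.name]
        vol.expires_at = now + CALLBACK_LIFETIME
        vol.status_valid = True
        calls.append("VL lookup")
    if not f.statd:
        # Step 2: FetchStatus tells us whether the cached copy is current.
        current = fileservers[vol.server][f.volume]
        calls.append("FetchStatus")
        if current != f.data_version:
            f.data_version = current
            calls.append("FetchData")
        f.statd = True
    return calls


# The client still thinks the volume lives on the retired server "reno".
volumes = {"root.cell.readonly": VolumeEntry("root.cell.readonly", "reno", 7200)}
files = [CachedFile("root.cell.readonly", data_version=1)]
vldb = {"root.cell.readonly": "newserver"}
fileservers = {"newserver": {"root.cell.readonly": 2}}

sweep(volumes, files, now=7200)
print(access(files[0], volumes, vldb, fileservers, now=7200))
# prints ['VL lookup', 'FetchStatus', 'FetchData']
```

In this model, a client whose flags were never cleared would go straight
to the remembered server, which matches the symptom Rich reported.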

Other ways that the volume location information can get cleared:

      - fs checkvolumes, as Kim and I suggested to Rich
      - vos move
      - vos release
      - bringing more volumes into the cache than the -volumes option
        in afsd.  This causes some volumes to cycle out of the cache
        and this can clear the status flag for the volume
      - and possibly other vos transactions on the volume

Also, as Derrick mentioned in the first email, once the client knows
about a fileserver, it will remember it until the client is rebooted.
And every once in a while the CheckServersDaemon will run and it will
see that it does not get an answer from this fileserver.  And then
every 5 minutes or so, the client will send a GetTime request to the
fileserver IP to determine if the fileserver is back up.  This could
have been the tcpdump traffic you saw going to this old fileserver IP,
the GetTime call.
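
A rough model of that probing behaviour, to show why traffic to a retired
fileserver keeps appearing. The class and method names are hypothetical,
get_time is a stand-in callable for the GetTime request, and the 5 minute
interval is the approximate figure from the paragraph above.

```python
PROBE_INTERVAL = 5 * 60  # seconds; "every 5 minutes or so" as described above


class ServerTable:
    """Servers the client has learned about; there is no way to forget one."""

    def __init__(self):
        self.servers = {}  # address -> {"up": bool, "last_probe": float}

    def learn(self, addr, now):
        self.servers.setdefault(addr, {"up": True, "last_probe": now})

    def mark_down(self, addr, now):
        self.servers[addr] = {"up": False, "last_probe": now}

    def probe_down_servers(self, now, get_time):
        """Probe each down server whose interval has elapsed; return those probed."""
        probed = []
        for addr, state in self.servers.items():
            if state["up"] or now - state["last_probe"] < PROBE_INTERVAL:
                continue
            state["last_probe"] = now
            state["up"] = get_time(addr)  # True only if the server answered
            probed.append(addr)
        return probed


table = ServerTable()
table.learn("reno.helios.nd.edu", now=0)
table.mark_down("reno.helios.nd.edu", now=0)

# The retired server never answers, so it is probed again every interval.
for now in (100, 300, 600):
    print(now, table.probe_down_servers(now, get_time=lambda addr: False))
```

Because nothing in this model removes an entry, the probes continue until
the client is rebooted.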

Sorry for chiming in on this one, but I wanted to add some information
to this issue since the "checkv" has seemed to get us out of this
problem.

NO - THANK YOU VERY MUCH!!


A kdump snapshot would have really helped.

OK


And one more thing to check is if OpenAFS changed any of the
callback timing for volumes.

OK - Thanks. I did see some very similar messages reported for the
Windows client, which mention that there were some recent server
changes to go with this - not 100% sure that these are related.

https://lists.openafs.org/pipermail/openafs-info/2005-June/018298.html

Thanks for your help Todd ;-)

Rich


Thanks

Todd DeSantis
AFS Support
IBM Pittsburgh Lab



From:    Rich Sudlow <[EMAIL PROTECTED]>
Sent by: openafs-info-admi[EMAIL PROTECTED]
To:      [EMAIL PROTECTED]
cc:      "'openafs'" <[email protected]>
Date:    08/09/2005 05:21 PM
Subject: Re: [OpenAFS] Problems on AFS Unix clients after AFS fileserver moves



Dexter 'Kim' Kimball wrote:

fs checkv will cause the client to discard what it remembers about
volumes.

Did you try that?


No - That worked!

Thanks

Rich


Kim


    -----Original Message-----
    From: [EMAIL PROTECTED]
    [mailto:[EMAIL PROTECTED] On Behalf Of Rich Sudlow
    Sent: Tuesday, August 09, 2005 9:58 AM
    To: openafs
    Subject: [OpenAFS] Problems on AFS Unix clients after AFS
    fileserver moves


    We've been having problems with our cell for the last couple
    years with AFS clients after fileservers are taken out of service.
    Before that things seemed to work ok when doing fileserver moves
    and rebuilding. All data was moved off the fileserver but the
    clients still seem to have some need to talk to it.  In the past
    the AFS admins have left the fileservers up and empty for a number
    of days to try to resolve this issue - but it doesn't resolve the
    issue.

    A recent example:

    The fileserver reno.helios.nd.edu was shut down after all data
    was moved off of it.  However the client still can't get to
    a number of AFS files.

    [EMAIL PROTECTED] root]# fs checkservers
    These servers unavailable due to network or server problems:
    reno.helios.nd.edu.
    [EMAIL PROTECTED] root]# cmdebug reno.helios.nd.edu -long
    cmdebug: error checking locks: server or network not responding
    cmdebug: failed to get cache entry 0 (server or network not responding)
    [EMAIL PROTECTED] root]# cmdebug reno.helios.nd.edu
    cmdebug: error checking locks: server or network not responding
    cmdebug: failed to get cache entry 0 (server or network not responding)
    [EMAIL PROTECTED] root]#

    [EMAIL PROTECTED] root]#  vos listvldb -server reno.helios.nd.edu
    VLDB entries for server reno.helios.nd.edu

    Total entries: 0
    [EMAIL PROTECTED] root]#

    on the client:
    rxdebug localhost 7001 -version
    Trying 127.0.0.1 (port 7001):
    AFS version:  OpenAFS 1.2.11 built  2004-01-11


    This is a linux 2.4 client and I don't have kdump - have also had
    these problems on sun4x_58 clients too.

    I should mention that we've seen some correlation to this happening
    on machines with "busy" AFS caches - which makes it even more
    frustrating as it seems to affect machines which depend on AFS the
    most. We've tried lots of fs flush* * - So far we've ended up
    rebooting which does fix the problem.

    Does anyone have any clues what the problem is or what a workaround
    might be?

    Thanks

    Rich

    --
    Rich Sudlow
    University of Notre Dame
    Office of Information Technologies
    321 Information Technologies Center
    PO Box 539
    Notre Dame, IN 46556-0539

    (574) 631-7258 office phone
    (574) 631-9283 office fax



    _______________________________________________
    OpenAFS-info mailing list
    [email protected]
    https://lists.openafs.org/mailman/listinfo/openafs-info


