Todd DeSantis wrote:
Hi Rich -
I am glad that "fs checkvolumes" was able to help you get rid of this problem.
Hopefully this was not a coincidence, i.e. the "vos release"
of the bogus root.cell.readonly did not also just happen to
occur around the same time.
To help understand why your clients were in this state
I would like to ask some questions:
- a kdump snapshot would have been able to give us some
information on the state of the client and could have
helped us determine if any volume and/or vcache entry
was still pointing at this old fileserver
Yes - that would be nice. I wish I used these tools more and
was more proficient with them, but I'm no longer supposed
to do this ;-) As I mentioned, though, these problems have been
happening for a number of years. I've also seen very inconsistent
releases of root.cell at our site, e.g. some replicas going offline
and a LOT of communication errors during the release,
which happens daily at 7 A.M.
Did you just not build kdump for the client, or does
OpenAFS not build kdump by default?
I don't remember - I believe there are "problems" getting
it to build on OpenAFS.
- when was this fileserver taken out of commission, was it
within 2 hours?
No, MUCH longer - more than a day.
Normal callback timeouts on volumes would be 2 hours.
There is a daemon on the client that runs every 2 hours
and clears the "volume status" flag on the volumes in the
volume cache if the expiration time has elapsed. I think
readonly volumes had a maximum 2-hour timeout.
What happens when the 1st readonly volume is "screwed up"
as we saw yesterday, due to the lack of a vos release on root.cell?
Although, as mentioned, this always used to work (transparent
fileserver moves and reconfigurations) until the last couple of years.
This process also causes the vcache structures to have
their CStatd bit cleared. This tells the client to run
a FetchStatus call to determine whether its cached version is
still the correct version of the file/dir.
This is the way the IBM Transarc clients work. It is
possible that the OpenAFS code has changed the callback timing
a bit; I am not sure of this.
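To make that concrete, here is a rough sketch in C of what such a
daemon does - purely an illustration with invented names (check_volumes,
VSTAT_VALID), not the actual OpenAFS source; only the CStatd bit name
comes from the discussion above:

#include <stdio.h>
#include <time.h>

#define VSTAT_VALID 0x01   /* cached volume location/status believed current */
#define CSTATD      0x02   /* vcache entry has valid status and a callback */

struct volume { char name[24]; time_t cb_expires; unsigned flags; };
struct vcache { struct volume *vol; unsigned states; };

/* The daemon described above: wakes every ~2 hours, clears the volume
 * status flag when the callback expiration has passed, and clears the
 * CStatd bit on vcache entries in those volumes so the next access
 * does a fresh FetchStatus. */
static void check_volumes(struct volume *v, int nv, struct vcache *vc, int nc)
{
    time_t now = time(NULL);
    for (int i = 0; i < nv; i++)
        if ((v[i].flags & VSTAT_VALID) && now >= v[i].cb_expires) {
            v[i].flags &= ~VSTAT_VALID;       /* force a vlserver lookup */
            printf("%s: volume status flag cleared\n", v[i].name);
        }
    for (int i = 0; i < nc; i++)
        if (!(vc[i].vol->flags & VSTAT_VALID))
            vc[i].states &= ~CSTATD;          /* force FetchStatus next use */
}

int main(void)
{
    /* a readonly volume whose 2-hour callback has already expired */
    struct volume ro = { "root.cell.readonly", 0, VSTAT_VALID };
    struct vcache vc = { &ro, CSTATD };
    check_volumes(&ro, 1, &vc, 1);
    return 0;
}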
But the above procedures will cause the following to happen
the next time you try to access a file or directory that
had its volume status flag cleared (see the sketch after this list):
- contact the vlserver and get location information for
the volume. If the client still thought that this file
lived on the bad fileserver, and the VLDB information is
correct, then it would get the new server location info.
- it would then contact the fileserver with a FetchStatus
call to determine if its cache is current, or if it
needs to do a FetchData call to the fileserver for your
directories and files.
- and at this time, it has located the directory/file you
are looking for
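In rough C terms, the access path above looks something like this -
again just a sketch with invented helper names (vlserver_lookup,
fetch_status, fetch_data), not the real cache manager code:

#include <stdio.h>

#define VSTAT_VALID 0x01

struct fid { int volume_id; int vnode; };
struct volume { unsigned flags; int server; };
struct afs_status { long data_version; };

/* Stubs standing in for the real RPCs - invented for illustration. */
static int vlserver_lookup(struct volume *v) { v->server = 42; return 0; }
static int fetch_status(int srv, struct fid *f, struct afs_status *st)
{ (void)srv; (void)f; st->data_version = 7; return 0; }
static int fetch_data(int srv, struct fid *f)
{ printf("FetchData for vnode %d from server %d\n", f->vnode, srv); return 0; }
static long cached_version(struct fid *f) { (void)f; return 6; }

static int access_file(struct volume *vol, struct fid *f)
{
    /* 1. stale location info: ask the vlserver where the volume lives
     *    now; a moved volume resolves to its new fileserver */
    if (!(vol->flags & VSTAT_VALID)) {
        if (vlserver_lookup(vol) != 0)
            return -1;
        vol->flags |= VSTAT_VALID;
    }

    /* 2. FetchStatus against that server: is the cached copy current? */
    struct afs_status st;
    if (fetch_status(vol->server, f, &st) != 0)
        return -1;

    /* 3. FetchData only if the data version has moved on */
    if (st.data_version != cached_version(f))
        return fetch_data(vol->server, f);
    return 0;   /* cache is current; serve the file locally */
}

int main(void)
{
    struct volume vol = { 0, 0 };   /* status flag cleared by the daemon */
    struct fid f = { 536870918, 1 };
    return access_file(&vol, &f);
}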
Other ways that the volume location information can get cleared are:
- fs checkvolumes, as Kim and I suggested to Rich
- vos move
- vos release
- bringing more volumes into the cache than the -volumes option
in afsd allows. This causes some volumes to cycle out of the cache,
which clears the status flag for those volumes (sketched after this list)
- and possibly other vos transactions on the volume
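The -volumes case in that list is easy to picture as a small LRU cache:
when a volume cycles out, its cached location/status info goes with it.
A toy sketch, with invented names and a 3-entry cache standing in for
the real limit:

#include <stdio.h>
#include <string.h>

#define NVOLS 3   /* stand-in for the afsd -volumes limit */

struct vol_entry { char name[32]; long last_used; int valid; };
static struct vol_entry cache[NVOLS];
static long tick;

/* Bring a volume into a full cache: the least recently used entry is
 * evicted, and its cached location/status info is lost with it, so
 * touching that volume again later means a fresh vlserver lookup. */
static struct vol_entry *get_volume(const char *name)
{
    int lru = 0;
    for (int i = 0; i < NVOLS; i++) {
        if (cache[i].valid && strcmp(cache[i].name, name) == 0) {
            cache[i].last_used = ++tick;
            return &cache[i];                 /* already cached */
        }
        if (cache[i].last_used < cache[lru].last_used)
            lru = i;
    }
    if (cache[lru].valid)
        printf("evicted %s - its status info is gone\n", cache[lru].name);
    snprintf(cache[lru].name, sizeof cache[lru].name, "%s", name);
    cache[lru].valid = 1;
    cache[lru].last_used = ++tick;
    return &cache[lru];     /* caller must refetch location from vlserver */
}

int main(void)
{
    /* four volumes through a three-slot cache evicts the first one */
    get_volume("root.afs"); get_volume("root.cell");
    get_volume("user.rich"); get_volume("proj.data");
    return 0;
}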
Also, as Derrick mentioned in the first email, once the client knows
about a fileserver, it will remember it until the client is rebooted.
Every once in a while the CheckServersDaemon will run and see that it
does not get an answer from this fileserver. Then, every 5 minutes or
so, the client will send a GetTime request to the fileserver's IP to
determine if the fileserver is back up. This could have been the
tcpdump traffic you saw going to this old fileserver IP:
the GetTime call.
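A sketch of that probe loop, again with invented names and a placeholder
IP, shows why the traffic never stops while the client still remembers
the dead server:

#include <stdio.h>
#include <unistd.h>

#define PROBE_INTERVAL (5 * 60)   /* "every 5 minutes or so" */

struct server { const char *ip; int up; };

/* Stub for the lightweight GetTime RPC - a retired fileserver never
 * answers, so in this sketch the call always times out. */
static int get_time(struct server *s) { (void)s; return -1; }

/* The client never forgets a server until reboot: it keeps the entry
 * on its list and re-probes it forever. Each probe is a packet to the
 * old fileserver's IP, which is what shows up in tcpdump. */
static void probe_down_servers(struct server *srv, int n)
{
    for (;;) {
        for (int i = 0; i < n; i++) {
            if (srv[i].up)
                continue;
            printf("GetTime probe -> %s\n", srv[i].ip);
            if (get_time(&srv[i]) == 0)
                srv[i].up = 1;    /* it answered: mark it up again */
        }
        sleep(PROBE_INTERVAL);
    }
}

int main(void)
{
    struct server list[] = { { "10.0.0.5", 0 } };   /* the retired server */
    probe_down_servers(list, 1);
    return 0;
}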
Sorry for chiming in on this one, but I wanted to add some information
to this issue, since "fs checkv" seemed to get us out of this problem.
NO - THANK YOU VERY MUCH!!
A kdump snapshot would have really helped.
OK
And one more thing to check is if OpenAFS changed any of the
callback timing for volumes.
OK - thanks. I did see some very similar messages reported
for the Windows client - with mention that there were some recent
server changes to go with them - not 100% sure that these are related:
https://lists.openafs.org/pipermail/openafs-info/2005-June/018298.html
Thanks for your help Todd ;-)
Rich
Thanks
Todd DeSantis
AFS Support
IBM Pittsburgh Lab
Rich Sudlow <[EMAIL PROTECTED]>
Sent by: openafs-info-admi... [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
cc: "'openafs'" <[email protected]>
Date: 08/09/2005 05:21 PM
Subject: Re: [OpenAFS] Problems on AFS Unix clients after AFS fileserver moves
Dexter 'Kim' Kimball wrote:
fs checkv will cause the client to discard what it remembers about
volumes.
Did you try that?
No, I hadn't - but that worked!
Thanks
Rich
Kim
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Rich Sudlow
Sent: Tuesday, August 09, 2005 9:58 AM
To: openafs
Subject: [OpenAFS] Problems on AFS Unix clients after AFS
fileserver moves
We've been having problems in our cell for the last couple
of years with AFS clients after fileservers are taken out of service.
Before that, things seemed to work OK when doing fileserver moves
and rebuilding. All data was moved off the fileserver but the clients
still seem to have some need to talk to it. In the past the AFS
admins have left the fileservers up and empty for a number of
days to try to resolve this issue - but that doesn't resolve it.
A recent example: the fileserver reno.helios.nd.edu was shut down
after all data was moved off of it. However, clients still can't
get to a number of AFS files.
[EMAIL PROTECTED] root]# fs checkservers
These servers unavailable due to network or server problems:
reno.helios.nd.edu.
[EMAIL PROTECTED] root]# cmdebug reno.helios.nd.edu -long
cmdebug: error checking locks: server or network not responding
cmdebug: failed to get cache entry 0 (server or network
not responding)
[EMAIL PROTECTED] root]# cmdebug reno.helios.nd.edu
cmdebug: error checking locks: server or network not responding
cmdebug: failed to get cache entry 0 (server or network
not responding)
[EMAIL PROTECTED] root]#
[EMAIL PROTECTED] root]# vos listvldb -server reno.helios.nd.edu
VLDB entries for server reno.helios.nd.edu
Total entries: 0
[EMAIL PROTECTED] root]#
On the client:
rxdebug localhost 7001 -version
Trying 127.0.0.1 (port 7001):
AFS version: OpenAFS 1.2.11 built 2004-01-11
This is a Linux 2.4 client and I don't have kdump - we've also had
these problems on sun4x_58 clients.
I should mention that we've seen some correlation with this happening
on machines with "busy" AFS caches - which makes it even more
frustrating, as it seems to affect the machines that depend on AFS
the most. We've tried lots of fs flush* * - so far we've ended up
rebooting, which does fix the problem.
Does anyone have any clues what the problem is or what a workaround
might be?
Thanks
Rich
--
Rich Sudlow
University of Notre Dame
Office of Information Technologies
321 Information Technologies Center
PO Box 539
Notre Dame, IN 46556-0539
(574) 631-7258 office phone
(574) 631-9283 office fax
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info