On Thu, 7 May 1998, Peter Scott wrote:

> I need a way to flush the information from my client that a particular
> server isn't responding.  I believe this is the "non-functioning mark"
> referred to in the man page for fs checkservers.
> 
> We have some directories mounted on the other side of a firewall.  We
> have some code that attempts to access them and it needs to realize
> that it can't, quickly.  The first time it tries there is a ~45 second
> pause until the console message "Lost contact with file server ..."
> comes up.  I am testing a faster timeout by setting an alarm() (and
> it's not working... grrr...) but the trouble is that the next time I
> run the code, it immediately returns the "connection timed out" message
> because the client knows that it can't reach that server.  Its memory
> appears most persistent.  
> 
> I don't wish to reboot the client every time I want to test this.  How
> can I tell the client to forget what it learned about that server? 
> I've tried fs checkservers and fs flush to no avail.

The cache manager periodically sends "dummy" requests to the servers it
cares about, and updates its status for each one based on the response.  I
believe the period is 3 minutes for servers it believes are up, and 5
minutes for those it believes are down, but that might be backwards.  What
"fs checkservers" does is query that information, _and_ force the cache
manager to re-probe the servers it thinks are down.  If the server is, in
fact, reachable, then "fs checkservers" should force the cache manager to
notice that.  But, AFAIK, there is no way to make it "forget" that a
server is down, so that you have to wait for the full timeout again.

> [Also, perhaps someone can confirm my suspicion: I am guessing there is
> an alarm (45) somewhere in the client code which is invoked when a
> client-side program calls lstat().  So if my code calls alarm(3) before
> the lstat(), it won't do any good, because the alarm is overridden?  
> My signal handler was called - on the one occasion I was able to test
> it - but after 45 seconds, not 3.]

Well, not exactly.  Nothing the AFS cache manager does will override your
alarm - the cache manager is entirely kernel code, and doesn't depend on
the SIGALRM mechanism to do its timeouts.  However, most system calls that
involve accessing objects in AFS are uninterruptible.  That means your
alarm goes off after 3 seconds, but the signal doesn't get delivered until
the uninterruptible system call returns, 42 seconds later.  To see that
this is what's happening, set your alarm to something _longer_ than the 45
second timeout, and notice that lstat() still returns after about 45
seconds, while the signal handler doesn't get called until the longer
timeout expires.
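
A minimal sketch of that experiment, in case it helps.  The names
probe_path and struct probe_result are made up for illustration, nothing
here is AFS-specific, and it assumes SIGALRM is not already blocked in the
caller.  It times lstat() and separately records when the handler actually
ran, so the two can be compared:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result type: both times are measured from the same start. */
struct probe_result {
    int    lstat_rc;       /* return value of lstat() */
    double lstat_seconds;  /* seconds until lstat() returned */
    double alarm_seconds;  /* seconds until the SIGALRM handler ran */
};

static volatile sig_atomic_t alarm_fired;
static struct timespec alarm_time;

static void on_alarm(int signo)
{
    (void)signo;
    clock_gettime(CLOCK_MONOTONIC, &alarm_time);  /* async-signal-safe */
    alarm_fired = 1;
}

static double seconds_between(const struct timespec *a,
                              const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Set an alarm, call lstat(path), then wait until the handler has run.
 * Returns 0 on success, -1 if the handler could not be installed.
 * Assumes SIGALRM is not blocked when this is called. */
int probe_path(const char *path, unsigned alarm_secs,
               struct probe_result *out)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask;
    struct timespec start, after_lstat;
    struct stat st;

    assert(path != NULL && out != NULL && alarm_secs > 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
        return -1;

    alarm_fired = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    alarm(alarm_secs);

    out->lstat_rc = lstat(path, &st);   /* may block in the kernel */
    clock_gettime(CLOCK_MONOTONIC, &after_lstat);

    /* Wait for the handler without racing against the signal. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    while (!alarm_fired)
        sigsuspend(&old_mask);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);

    sigaction(SIGALRM, &old_sa, NULL);  /* restore previous handler */
    out->lstat_seconds = seconds_between(&start, &after_lstat);
    out->alarm_seconds = seconds_between(&start, &alarm_time);
    return 0;
}
```

If the explanation above is right, calling this on a path served by the
unreachable server with alarm_secs = 3 should report both times at roughly
45 seconds, and with alarm_secs = 60 it should report lstat_seconds near 45
and alarm_seconds near 60.  On a local path lstat() returns at once and
only the alarm time is interesting.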

-- Jeffrey T. Hutzelman (N3NHS) <[EMAIL PROTECTED]>
   Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA
