Having had my nose in this area for the past few weeks, I'd risk a quick
diagnosis: if your clients see long delays and even timeouts on writes (check
whether the same happens with mkdir, rmdir, rm, chmod, etc.), then callback
breaking is the main suspect, since callback breaks are performed
synchronously with operations that alter the state of files.

Your FileLog shows ports other than 7001 or 4711, a strong hint that those
addresses are NAT gateways, perhaps simply virtual machines with non-static
assignments and short UDP port mapping intervals.
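
To make that check less of an eyeball exercise, here is a minimal sketch that scans FileLog lines for failed callback probes coming from unexpected ports. It assumes the line format seen in the excerpt quoted below; the function name and the loose address pattern (needed because the first octet is obscured in the excerpt) are my own choices for illustration, not anything shipped with OpenAFS.

```python
import re
from collections import Counter

# Ports mentioned above as the ones a directly reachable client would use.
EXPECTED_PORTS = {7001, 4711}

# Loose "address:port" pattern. It deliberately does not validate octets,
# because the quoted log has its first byte obscured (e.g. 999.x.y.z).
ENDPOINT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})")


def suspect_nat_endpoints(lines):
    """Count failure lines per (address, port) where the port is unexpected.

    Hypothetical helper: it only looks at lines containing "failed" and
    reports endpoints whose port is not in EXPECTED_PORTS.
    """
    hits = Counter()
    for line in lines:
        if "failed" not in line:
            continue
        for addr, port in ENDPOINT.findall(line):
            if int(port) not in EXPECTED_PORTS:
                hits[(addr, int(port))] += 1
    return hits


if __name__ == "__main__":
    sample = [
        "Wed Feb  9 22:14:06 2011 CB: ProbeUuid for 999.102.202.55:2841 failed -01",
        "Wed Feb  9 22:15:09 2011 CB: ProbeUuid for 999.131.13.134:7001 failed -01",
        "Wed Feb  9 22:18:12 2011 CB: ProbeUuid for 999.102.202.55:2846 failed -01",
    ]
    for (addr, port), count in sorted(suspect_nat_endpoints(sample).items()):
        print(f"{addr} port {port}: {count} failed probe(s)")
```

The same address showing up with several different source ports within a few minutes, as 999.102.202.55 does here, is the pattern I would expect from a gateway whose UDP mapping keeps expiring.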

On the client side you can help by raising the UDP port mapping timeout from a
few minutes to 1-2 hours. That is easy in VMware, for example; on personal NAT
routers you would try to assign static mappings.

Unfortunately, AFS/RX-agnostic firewalls and NAT gateways are hard to deal with
on the server side. The server can protect itself to a certain degree, and
yesterday I submitted the second in a series of patches addressing parts of the
problem, but it remains an effort to keep adverse effects down to a bearable
level. RX over TCP would solve many problems in this area, if only it got a
little more attention.

On 10 Feb 2011, at 05:44, John Tang Boyland wrote:

> Since the start of the semester, OpenAFS seems to occasionally hang
> for a few seconds (5? 10?) when trying to do things like write files.
> I finally had it happen while running a script that was doing fs calls,
> and got the message:
> fs:'path-to-directory-in-afs': server not responding promptly
> 
> The FileLog for the server (jeremiah.cs.uwm.edu) from the appropriate time has:
> ...
> Wed Feb  9 22:14:06 2011 CB: ProbeUuid for 999.102.202.55:2841 failed -01
> Wed Feb  9 22:15:07 2011 CheckHost_r: Probing all interfaces of host 999.35.48.249:56648 failed, code -01
> Wed Feb  9 22:15:09 2011 CB: ProbeUuid for 999.131.13.134:7001 failed -01
> Wed Feb  9 22:16:04 2011 CB: ProbeUuid for 999.59.5.145:63713 failed -01
> Wed Feb  9 22:16:05 2011 CB: WhoAreYou failed for host gge5870 (999.30.179.54:7001), error -01
> Wed Feb  9 22:16:12 2011 CB: ProbeUuid for 999.100.203.66:53467 failed -01
> Wed Feb  9 22:16:36 2011 CB: ProbeUuid for 999.229.195.248:49341 failed -01
> Wed Feb  9 22:18:12 2011 CB: ProbeUuid for 999.102.202.55:2846 failed -01
> Wed Feb  9 22:19:14 2011 CB: ProbeUuid for 999.131.13.134:12627 failed -01
> 
> (I have obscured the first byte in each network id.)
> 
> % rxdebug jeremiah.cs.uwm.edu -version
> Trying 129.89.143.70 (port 7000):
> AFS version:  OpenAFS 1.4.12 built  2010-03-09 
> % fs --version
> openafs 1.4.3
> % rxdebug localhost -version -port 7001
> Trying 127.0.0.1 (port 7001):
> AFS version:  OpenAFS 1.4.11 built  2009-07-13 
> 
> People notice the delays on Windows machines, MacOSX and on Solaris.
> (The machine I caught it on above was solaris 10.)
> 
> On MacOSX and Windows, the delays are particularly disturbing
> because they are long enough for the OS to time out
> and give the application an IO error.  This causes the application
> to say the files aren't there anymore, which is highly disturbing
> to my students.
> 
> I'm running the file server with all default values.
> Perhaps I need to tune the number of daemons?
> 
> My original guess is that the server is hanging while waiting to break
> callbacks from clients that are behind firewalls and not responding.  But even
> running 'fs checks' from all possible clients that are accessing the
> volume doesn't seem to work; at least it still takes a few seconds more.
> But this sort of behavior presumably would drive everyone mad and would
> have been fixed before 1.4.12, so now I'm at a loss.
> 
> Suggestions always appreciated,
> John
