On Fri, 18 Jun 2004, Charles Karney wrote:

> We have encountered problems with our clients hanging on AFS accesses.
> Yesterday I think I engineered a more-or-less reproducible set of
> circumstances to reproduce this.

I'll guess it's the same problem the person with the SuSE 9 machine
complained of, namely that NPTL and the fileserver are feuding. Standard
answers apply: try LD_ASSUME_KERNEL=2.4.1, or failing that, try the LWP
fileserver from src/viced/fileserver.

> Configuration:
>
>   RH Linux 9 (clients and servers)
>   openafs 1.2.11-rh9
>
> Symptom
>
>   During and following a full backup of large (~30GB) volumes, clients
>   hang (e.g., in 'ls').
>
>   After the backup
>
>     bos status
>     fs checkserv
>
>   both indicate that the servers are up and accessible.
>
>   Clients unfreeze when the server with the large volumes is restarted
>   (bos restart...)
>
> Details
>
> We run a small AFS cell with ~30 volumes, some of which are large (5GB to
> 30GB). We have 3 servers, several client machines, and two human AFS
> users. We do backups in the "approved" way, namely
>
>   vos backupsys ...
>
> followed by
>
>   backup dump
>
> of the volumes *.backup. During the course of the backup of the largest
> volumes I see several messages of the form
>
>   Thu Jun 17 20:05:08 2004 trans 24 on volume 536871469 is older than 1200 seconds
>
> in VolserLog, where 536871469 is the ID of the BK version of one of the
> large volumes. There are no other indications of problems in the server
> logs. During the backup there was no AFS client activity.
>
> I let the backup run to completion. At this point, some clients now
> freeze on accessing AFS.
>
> I can get the clients to unfreeze by restarting the AFS server with
> the large volumes (bos restart server -all).
>
> Notes:
>
>   The freeze was associated with accessing RW and RO volumes (not
>   necessarily the recently locked AFS volumes).
>
>   No "lost contact with file server" messages in log files on client.
>
>   fs checkserv says "All servers are running".
>
>   bos status shows all 3 AFS servers are OK.
>
>   No "connection timed out" message on the client.
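
To make the LD_ASSUME_KERNEL suggestion concrete: on RH9 the variable asks the
dynamic loader for the older LinuxThreads libraries instead of NPTL, and the
fileserver inherits its environment from bosserver, so it has to be set before
bosserver starts. A rough sketch only; the function name is made up, and the
bosserver path in the comment is an assumption about your install.

```shell
# Hypothetical helper: run one command with LD_ASSUME_KERNEL=2.4.1.
# The function body is a subshell, so the caller's environment is untouched.
run_with_linuxthreads() (
    LD_ASSUME_KERNEL=2.4.1
    export LD_ASSUME_KERNEL
    "$@"
)

# Intended use (path is an assumption, adjust to wherever bosserver lives):
#   run_with_linuxthreads /usr/afs/bin/bosserver
```

Setting it in whatever init script launches bosserver would do the same job;
the wrapper just keeps the setting away from everything else on the box.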
>
> It seems that one or more of the servers have ended up in an
> "unresponsive" mode during the backup, even though all the normal
> diagnostics claim that they are all running OK.
>
> Other information:
>
>   This isn't an easy problem to diagnose since the full backup takes ~3
>   hours and I don't like to endlessly clobber our AFS setup.
>
>   Sometimes in similar circumstances, I DO get the "lost contact with
>   file server" but I don't get the "back up" message. In this case "fs
>   checkserv" agrees that one of the servers is down, but "bos status"
>   claims that it's up. Again restarting some or all of the servers
>   appears to be necessary.
>
>   Similar circumstances = full backups, moving a large volume, "vos
>   backup" on a large volume. The common thread appears to be the
>   presence of the
>
>     trans xx on volume nnnnnn is older than yyyy seconds
>
>   messages in VolserLog.
>
>   We have iptables firewalling in effect. On the clients
>
>     [0:0] -A trust -p udp -m udp --sport afs3-fileserver
>         --dport afs3-callback -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --sport afs3-fileserver
>         --dport afs3-callback -j ACCEPT
>
>   On the servers
>
>     [0:0] -A trust -p udp -m udp --dport 88 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 88 -j ACCEPT
>     [0:0] -A trust -p udp -m udp --dport 750:751 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 750:751 -j ACCEPT
>     [0:0] -A trust -p udp -m udp --dport 7000:7009 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 7000:7009 -j ACCEPT
>     [0:0] -A trust -p udp -m udp --dport 7021 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 7021 -j ACCEPT
>     [0:0] -A trust -p udp -m udp --dport 7025:7027 -j ACCEPT
>     [0:0] -A trust -p tcp -m tcp --dport 7025:7027 -j ACCEPT
>
> Any advice on how to cure or to diagnose this problem would be
> appreciated. Thanks.
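
On the diagnosing side: since the "trans ... is older than" lines look like the
common thread, it may help to pull them out of VolserLog after each backup and
see which transactions and volumes keep showing up. A minimal sketch; the
function name stale_trans is made up, and the field positions assume exactly
the message format quoted above.

```shell
# Hypothetical helper: print "transaction volume-id age-in-seconds" for every
# stale-transaction line in the given log files (or on stdin if none given).
stale_trans() {
    # Expected line format, as quoted in the report:
    #   Thu Jun 17 20:05:08 2004 trans 24 on volume 536871469 is older than 1200 seconds
    awk '/ trans [0-9]+ on volume [0-9]+ is older than [0-9]+ seconds/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "trans") {
                # trans <id> on volume <vol> is older than <secs> seconds
                print $(i + 1), $(i + 4), $(i + 8)
                break
            }
        }
    }' "$@"
}
```

Run it as `stale_trans /usr/afs/logs/VolserLog`, assuming the logs live in the
usual place. If the same volume ID shows up with an ever-growing age, that at
least narrows down which transaction is stuck.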
>
> --
> Charles Karney              Email: [EMAIL PROTECTED]
> 201 Washington Rd           URL: http://charles.karney.info
> Sarnoff Corporation         Phone: +1 609 734 2312
> Princeton, NJ 08543-5300    Fax: +1 609 734 2323
>
> _______________________________________________
> OpenAFS-info mailing list
> [EMAIL PROTECTED]
> https://lists.openafs.org/mailman/listinfo/openafs-info
