Hello,

starting last month we have been facing "Lost contact with fileserver" situations on one of our zLinux systems (Novell SLES-9 distribution). After further investigation we found that the cause of the "Lost contact" hang seems to be our AFS client (version 1.4.5) not replying to whoareyou() calls from the fileserver. We used tcpdump to record all packets that we hope are essential for tracking down the problem. For example, in normal operation we see the whoareyou() call answered by our AFS client within about 40 to 100 µsec:

10:30:00.945453 IP fs13.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call whoareyou (32)
10:30:00.945499 IP mclinx.xxx.xx.xx.xx.afs3-callback > fs13.xxx.xx.xx.xx.afs3-fileserver: rx data cb reply whoareyou (460)
10:30:08.941373 IP fs20.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call whoareyou (32)
10:30:08.941455 IP mclinx.xxx.xx.xx.xx.afs3-callback > fs20.xxx.xx.xx.xx.afs3-fileserver: rx data cb reply whoareyou (460)
10:30:08.952207 IP fs25.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call whoareyou (32)
10:30:08.952266 IP mclinx.xxx.xx.xx.xx.afs3-callback > fs25.xxx.xx.xx.xx.afs3-fileserver: rx data cb reply whoareyou (460)
10:30:24.173003 IP fs13.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call whoareyou (32)
10:30:24.173042 IP mclinx.xxx.xx.xx.xx.afs3-callback > fs13.xxx.xx.xx.xx.afs3-fileserver: rx data cb reply whoareyou (460)
10:30:24.176168 IP fs11.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call whoareyou (32)
10:30:24.176213 IP mclinx.xxx.xx.xx.xx.afs3-callback > fs11.xxx.xx.xx.xx.afs3-fileserver: rx data cb reply whoareyou (460)

mclinx is our AFS client and fsxx are AFS fileservers.
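
The reply times quoted above can be read off a capture mechanically. The following is a minimal sketch, assuming tcpdump text output in exactly the format shown above. The function name whoareyou_latencies and the regular expression are illustrative and not part of any AFS or tcpdump tooling.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the capture above:
# HH:MM:SS.ffffff IP <src> > <dst>: rx data cb <call|reply> whoareyou (<len>)
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP (\S+) > (\S+): rx data cb (call|reply) whoareyou"
)


def whoareyou_latencies(lines):
    """Return (fileserver endpoint, latency in microseconds) per answered call."""
    pending = {}  # fileserver endpoint -> timestamp of its open call
    result = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a whoareyou call or reply
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        src, dst, kind = m.group(2), m.group(3), m.group(4)
        if kind == "call":
            pending[src] = ts  # the fileserver is the sender of a call
        elif dst in pending:
            delta = ts - pending.pop(dst)  # the fileserver receives the reply
            # Assumes the capture does not cross midnight (no date in the stamp).
            result.append((dst, delta.seconds * 1_000_000 + delta.microseconds))
    return result
```

For the first pair in the capture (call at 10:30:00.945453, reply at 10:30:00.945499) this gives 46 µs.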

We see those whoareyou() calls and replies all the time, but sometimes our client does not respond:

10:31 AM, the first whoareyou from fs15 is not answered
10:31:22.808760 IP mclinx.xxx.xx.xx.xx.afs3-callback > fs20.xxx.xx.xx.xx.afs3-fileserver: rx data fs call give-cbs (244)
10:31:22.809183 IP mclinx.xxx.xx.xx.xx.afs3-callback > fs20.xxx.xx.xx.xx.afs3-fileserver: rx data fs call give-cbs (88)
10:31:22.809368 IP fs20.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call callback fid 1802411095/5/1330193 afsuuid [|cb] (52)
10:31:22.809602 IP mclinx.xxx.xx.xx.xx.afs3-callback > fs15.xxx.xx.xx.xx.afs3-fileserver: rx data fs call give-cbs (244)
(see below at 10:35 AM)
10:31:22.810046 IP fs15.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call whoareyou (32)
10:31:23.134195 IP fs15.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call callback fid 1802410300/17405/631375 afsuuid [|cb] (52)
10:31:23.163772 IP fs20.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call callback fid 1802411095/5/1330193 afsuuid [|cb] (52)
10:31:23.166077 IP fs15.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call whoareyou (32)
10:31:23.489312 IP fs15.xxx.xx.xx.xx.afs3-fileserver > mclinx.xxx.xx.xx.xx.afs3-callback: rx data cb call callback fid 1802410300/17405/631375 afsuuid [|cb] (52)

Here, fs15 sends a whoareyou() which doesn't get a reply, and about a third of a second later another whoareyou() is sent to the AFS client on mclinx. Neither of them gets an answer. To make a long story short, the fileserver fs15 sends initcb() to the AFS client two minutes later, and another two minutes later we see the first rx abort packet sent to the AFS client, which makes the AFS client report "Lost contact" with fs15 in the system log (at least this is my interpretation).
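
The pattern above, a whoareyou call that never gets a reply, can be searched for in a longer capture. What follows is a minimal sketch under stated assumptions: the input is tcpdump text in the format shown, and a repeated call from the same fileserver is taken to mean that the earlier call went unanswered. That rule is a simplification for illustration, not actual Rx retransmission semantics, and the function name unanswered_whoareyou is hypothetical.

```python
import re

# Assumed line shape, taken from the capture above.
CALL_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP (\S+) > (\S+): rx data cb (call|reply) whoareyou"
)


def unanswered_whoareyou(lines):
    """Return sorted (timestamp, fileserver) pairs for calls that saw no reply."""
    open_calls = {}  # fileserver endpoint -> timestamp string of its open call
    unanswered = []
    for line in lines:
        m = CALL_RE.match(line.strip())
        if not m:
            continue  # other rx traffic (give-cbs, callback, ...) is ignored
        stamp, src, dst, kind = m.groups()
        if kind == "call":
            if src in open_calls:
                # Simplification: a new call while one is still open means
                # the earlier one never got its reply.
                unanswered.append((open_calls[src], src))
            open_calls[src] = stamp
        else:
            open_calls.pop(dst, None)  # a reply closes the open call
    # Calls still open at the end of the capture are unanswered as well.
    unanswered.extend((stamp, server) for server, stamp in open_calls.items())
    return sorted(unanswered)
```

Fed with the 10:31 AM excerpt, this reports both fs15 calls (10:31:22.810046 and 10:31:23.166077) as unanswered.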

Unfortunately, the AFS client does not respond to any whoareyou() from the other fileservers either until 10:45 AM in our log. This ends in "Lost contact" with all available fileservers, and all AFS activity freezes for about a quarter of an hour until the connections are reported "back up" again.

My questions are: What can block an AFS client from answering whoareyou() for several minutes? Are there any limits or restrictions that can lead an AFS client into a situation where it is internally blocked? Are there parameters one can tune in order to avoid this situation?

Lately we have had those "Lost contact" time slots about once every two days, and they are painful for users who are logged on to our system during that time. I would be happy to get rid of them somehow ...

With kind regards,

Carsten Jacobi (*120-4468)
Firmware Development in Böblingen

IBM Deutschland Entwicklung GmbH
Chairman of the Supervisory Board: Martin Jetter
Management: Herbert Kircher
Registered office: Böblingen
Registration court: Amtsgericht Stuttgart, HRB 243294
