Aside: we did have another one of these calls-waiting events at RPI today, on a different file server. This one lasted about 30 minutes, and it cleared up before we could pin down any specific culprit.
On Fri, 16 Aug 2013, Andrew Deason wrote:
>
> On Fri, 16 Aug 2013, Russ Allbery <[email protected]> wrote:
> >
> > The specific pathology that we've seen in the past is that a client
> > holds a callback on some file or directory (usually a directory)
> > that a bunch of other clients want to access. Another client tries
> > to do something that requires a callback break. The client holding
> > the callback can't be contacted for some reason. Therefore, the
> > threads trying to do something with that object all start blocking
> > on the thread trying to break the callback (or callbacks). [...]
>
> Note that in this scenario, you should likely see issues logged about
> failing to contact the relevant address at some point. Then again you
> might see a lot of those messages almost constantly, like some people
> do.

In the FileLog we do see messages such as:

    ProbeUuid for 128.113.71.61:7262 failed -01
    WhoAreYou failed for host ac423c08 (128.113.14.38:7001), error -01
    RCallBackConnectBack (host.c) failed for host 128.113.26.87:7001

Looking farther back in the logs we see these from time-to-time on all
the file servers. I haven't added up specific numbers, but it looks
like we do see a few more of these during the calls-waiting period, but
not dramatically more. So I didn't know how much to make of those
messages.

And with the debugging-level up during today's event we saw the much
more damning:

    RCallBackConnectBack failed for host ac5d9f08 (128.113.102.29:7001)
    Call back connect back failed (in break delayed) for Host 128.113.102.29:7001
    BreakDelayedCallbacks FAILED for host 128.113.102.29:7001 which IS UP. Connection from 128.113.102.29:7001. Possible network or routing failure.

The networking guy was trying to investigate everything about the
network and switches between 128.113.102.29 and the file server, but
then his machines locked up due to the AFS problems. By the time he
was able to do stuff again, the calls-waiting problem had cleared up.
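For actually adding up those numbers, something along these lines might do. It is only a rough sketch: it assumes every line of interest contains the word "fail" (in any case) and an IPv4 address followed by ":port", which is what the FileLog messages quoted above look like. The function names are made up for illustration, and the log path in the usage note is an assumption about a typical install.

```python
import re
from collections import Counter

# An IPv4 address followed by ":port", as in the FileLog messages quoted above.
ADDR_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def count_failures(lines):
    """Count failure messages per client IP.

    A line is counted once, against the first address that appears on it,
    and only if it contains "fail" in any letter case.
    """
    counts = Counter()
    for line in lines:
        if "fail" not in line.lower():
            continue
        match = ADDR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize(path, top=20):
    """Print the hosts with the most failure messages in the given log file."""
    with open(path, errors="replace") as log:
        for host, n in count_failures(log).most_common(top):
            print(f"{n:6d}  {host}")
```

Calling `summarize("/usr/afs/logs/FileLog")` (adjust the path for your site) prints one line per host, busiest first. Running it against a log from a quiet period and against one covering a calls-waiting event would show whether any single client stands out.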
> On Fri, 16 Aug 2013, [email protected] wrote:
>
> >> I did also try doing some tcpdumps and summarizing that traffic, but
> >> nothing remarkable showed up. However earlier today I learned that
> >> the way I did that might have generated misleading results (for
> >> reasons I won't bore you with right now). But based on those
> >> tcpdumps I doubt we were getting hammered with AFS traffic,
> >> especially not for such a long stretch of time in the middle of the
> >> summer.
>
> Did you see _anything_ AFS-related in captured traffic? Even small
> packets like our rx ACKs or ABORTs could give an indication as to
> what is happening, or just information like which hosts the packets
> are going to/from.

I had tcpdump capturing *only* the packets for ports 7001 to 7005, so
all the traffic I captured was AFS traffic. But I didn't look too
closely at the packets. I was doing quick traffic summaries, just so I
could see if some specific AFS client was hammering the server.

Russ Allbery also suggested "Also, if you haven't already, increasing
the number of file server threads to way more than the default number
(which is quite low) can obviously help by creating more of the
resource that is exhausted by this problem."

FWIW we are running with '/usr/afs/bin/fileserver -L', but I do hope to
get a chance to restart the file servers and increase the number of
file-server threads during this weekend.

-- 
Garance Alistair Drosehn
Senior Systems Programmer
Rensselaer Polytechnic Institute; Troy NY
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
