Nice script, Dan! I was going to suggest running tcpdump to see if one client
is accounting for most of the traffic. A misconfiguration or a hardware
problem at the client end can definitely cause a headache for a server. (I
dimly recall finding a client system that appeared to have two different AFS
clients installed and running, or trying to run, at the same time, causing a
nasty load on a server.)
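For the tcpdump approach, here is a rough Python sketch that tallies packets per client. It assumes the fileserver listens on the standard UDP port 7000 and that `tcpdump -n` prints IPv4 lines with an "IP src.port > dst.port:" prefix. The function names and the interface name are just for illustration.

```python
import re
import subprocess
from collections import Counter

# Matches the "IP src.port > dst.port:" prefix that `tcpdump -n` prints for
# IPv4 packets (assumed layout; check it against your tcpdump build).
PACKET_RE = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)

# Standard AFS fileserver UDP port.
FILESERVER_PORT = 7000


def tally_clients(lines, server_port=FILESERVER_PORT):
    """Count packets per source IP that were addressed to the fileserver port."""
    counts = Counter()
    for line in lines:
        m = PACKET_RE.search(line)
        if m and int(m.group(4)) == server_port:
            counts[m.group(1)] += 1
    return counts


def capture(interface, packets=10000):
    """Run tcpdump (usually needs root) and return its text output as lines."""
    cmd = ["tcpdump", "-n", "-l", "-i", interface, "-c", str(packets),
           "udp", "dst", "port", str(FILESERVER_PORT)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def top_talkers(interface, n=10):
    """Print the n clients that sent the most packets to the fileserver."""
    for ip, count in tally_clients(capture(interface)).most_common(n):
        print(f"{count:8d}  {ip}")
```

You would run something like top_talkers("eth0") on the busy fileserver while calls are waiting. If one or two addresses dominate the list, those are the machines to look at.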

Best regards,
Anne



________________________________
From: Dan Van Der Ster <[email protected]>
To: "<[email protected]>" <[email protected]>
Cc: "<[email protected]>" <[email protected]>
Sent: Friday, August 16, 2013 4:15 AM
Subject: Re: [OpenAFS] Investigating 'calls waiting' from rxdebug

Hi,
Whenever we get waiting calls it is ~always caused by one or two users 
hammering a fileserver from batch jobs.

To find the culprit(s) you could try debugging the fileserver by sending the 
TSTP signal:
   http://rzdocs.uni-hohenheim.de/afs_3.6/debug/fs/fileserver.html

We have a script that enables debugging for 3 seconds, then parses the output
to make a nice summary. It depends on our local Perl management API, but
perhaps you can adapt it to work for you. I copied it here:
http://pastebin.com/B6De4idS
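Here is a rough sketch of that "debug for a few seconds, then summarise" idea in Python. It is not a port of the pastebin script. It assumes three things: TSTP raises the fileserver debug level and HUP resets it to 0; the log is at the Transarc-style path /usr/afs/logs/FileLog; and per-call debug lines look like "SRXAFS_FetchData, Fid = 536870918.2.3, Host 10.1.2.3:7001, Id 1234". Check all three against your OpenAFS version. The pid, the function names and the bumps parameter are placeholders.

```python
import os
import re
import signal
import time
from collections import Counter

# Assumed debug-line shape, e.g.
#   "SRXAFS_FetchData, Fid = 536870918.2.3, Host 10.1.2.3:7001, Id 1234"
FID_RE = re.compile(r"Fid = (\d+)\.(\d+)\.(\d+)")
HOST_RE = re.compile(r"Host (\d+\.\d+\.\d+\.\d+):(\d+)")


def summarize(lines):
    """Return (calls per client IP, calls per volume ID) from debug log lines."""
    hosts, volumes = Counter(), Counter()
    for line in lines:
        h = HOST_RE.search(line)
        if h:
            hosts[h.group(1)] += 1
        f = FID_RE.search(line)
        if f:
            # The first Fid component is the volume ID.
            volumes[int(f.group(1))] += 1
    return hosts, volumes


def sample_debug_log(pid, log_path="/usr/afs/logs/FileLog",
                     seconds=3.0, bumps=1):
    """Raise the fileserver debug level briefly and return what it logged.

    Assumes each SIGTSTP raises the level one step and SIGHUP resets it to 0.
    """
    start = os.path.getsize(log_path)
    for _ in range(bumps):
        os.kill(pid, signal.SIGTSTP)
        time.sleep(0.1)  # avoid signals coalescing while still pending
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGHUP)  # always put the debug level back
    with open(log_path, "rb") as log:
        log.seek(start)  # read only what was appended during the window
        return log.read().decode("utf-8", errors="replace").splitlines()
```

Then hosts.most_common(5) names the busiest clients, and volumes.most_common(5) gives volume IDs that you can look up with vos examine.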

Cheers, Dan

On Aug 16, 2013, at 4:33 AM, [email protected] wrote:

> Hi.
> 
> In the past week we have had two frustrating periods of significant
> performance problems in our AFS cell. The first one lasted for maybe two
> hours, at which point it seemed the culprit was something odd-looking on
> two of our remote-access Linux servers. I rebooted those servers, and the
> performance problems disappeared. That sounds good, but I was so busy
> investigating various red herrings that the performance problems might
> have stopped 15-20 minutes earlier, and I just didn't notice until after
> I had done that reboot. This incident, by itself, is not too worrisome.
> 
> On Wednesday the significant (but intermittent) performance problems
> returned, and there was nothing particularly odd-looking on any machines
> I could see. Based on some Google searches, we zeroed in on the fact that
> one of our file servers was reporting rather high values for 'calls
> waiting for a thread' in the output of 'rxdebug $fileserver -rxstats'.
> The other file servers almost always reported zero calls waiting, but on
> this one file server the value tended to range between 5 and 50.
> Occasionally it got over 100. And the higher the value, the more likely
> we were to see performance problems on a wide variety of AFS clients.
> 
> Googling some more showed that many people had reported that this value
> was indeed a good indicator of performance problems. Looking in the log
> files on the file servers, we saw a few (but not many) messages that
> pointed us to problems in our network. Most of those looked like minor
> problems; one or two were more significant and were magnified by some
> heavy network traffic that happened to be going on at the time. We fixed
> all of those, and actually shut down the process that was (legitimately)
> doing a lot of network I/O. These were all good things to do, and none of
> them made a bit of difference to the values we saw for 'calls waiting' on
> that file server, or to the very frustrating hangs we were seeing on AFS
> clients.
> 
> And then at 7:07am this morning, the problem disappeared. Completely.
> The 'calls waiting' value on that server has not gone above zero for the
> entire rest of the day. So, the immediate crisis is over. Everything is
> working fine.
> 
> But my question is: if this returns, how can I track down what is
> *causing* the calls-waiting value to climb? We had over 100 workstations
> using AFS at the time, scattered all around campus. I did a variety of
> things to try and pinpoint the culprit, but didn't have much luck.
> 
> So, given a streak of high values for 'calls waiting', how can I track
> that down to a specific client (or clients), or maybe a specific AFS
> volume?
> 
> -- 
> Garance Alistair Drosehn
> Senior Systems Programmer
> RPI; Troy NY
> 
> 
> _______________________________________________
> OpenAFS-info mailing list
> [email protected]
> https://lists.openafs.org/mailman/listinfo/openafs-info
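Garance spotted the problem by comparing the 'calls waiting for a thread' value from 'rxdebug $fileserver -rxstats' across file servers. Here is a rough Python sketch that automates that check. It assumes rxdebug is on the PATH and prints a line of the form "N calls waiting for a thread". The server names, polling interval and threshold are placeholders to tune locally.

```python
import re
import subprocess
import time

# Assumed rxdebug -rxstats line, e.g. "37 calls waiting for a thread".
WAITING_RE = re.compile(r"(\d+) calls waiting for a thread")


def calls_waiting(rxdebug_output):
    """Extract the 'calls waiting for a thread' count, or None if absent."""
    m = WAITING_RE.search(rxdebug_output)
    return int(m.group(1)) if m else None


def poll(servers, interval=10.0, rounds=6, threshold=5):
    """Report any server whose waiting-call count reaches the threshold."""
    for _ in range(rounds):
        for server in servers:
            try:
                out = subprocess.run(["rxdebug", server, "-rxstats"],
                                     capture_output=True, text=True,
                                     timeout=30).stdout
            except subprocess.TimeoutExpired:
                print(f"{server}: rxdebug timed out")
                continue
            waiting = calls_waiting(out)
            if waiting is not None and waiting >= threshold:
                stamp = time.strftime("%H:%M:%S")
                print(f"{stamp} {server}: {waiting} calls waiting")
        time.sleep(interval)
```

A timestamped record of which server backs up, and when, helps you decide when to start the tcpdump or debug-log capture on that server.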
