Re: load spikes revisited

Greg Ames Thu, 17 Jan 2002 09:04:06 -0800

Aaron Bannert wrote:
> 
> On Thu, Jan 17, 2002 at 10:20:00AM -0500, Greg Ames wrote:
> [snip]
> > As a matter of fact, I typically listen on two ports while testing to disable
> > S_L_U_A, so I can easily figure out which process will get the next connection
> > in case I want to gdb it.  While trying out ktrace on my test config, I saw that
> > the fcntl() accept mutex has got a thundering herd problem on daedalus.  After
> > releasing the fcntl mutex, you see the kernel context switching to all of the
> > idle httpd processes.  The first process that wakes up gets the mutex, the rest
> > of the context switches simply burn a little CPU, then block again.  Moral:  the
> > default cure looks as bad as the disease.
> 
> Is there anywhere else that we've started using cross process locks
> since 2.0.28? If fcntl() is known to cause this behaviour, why is it
> enabled at all on this version of FreeBSD?


I didn't know about this until yesterday.  Nobody else mentioned it AFAIK.
 
> Based on your ktrace output from a couple days ago, I have a working
> theory that I have yet to reproduce: I noticed that there is a very
> high occurrence of sendfile returning with errno 35 (Resource temporarily
> unavailable). 

On FreeBSD, sendfile will almost always return -1 errno 35 for big files.  That
simply means the file is bigger than the socket buffers and the disk i/o
bandwidth is higher than the network bandwidth to the user.  But we call
sendfile twice as often as we need to on FreeBSD and probably Solaris. I
consistently see a pattern of two sendfiles then a select.  When apr sees that
sendfile sent some bytes, we change the retval from -1 and quickly forget that
it told us it would block.  We do need to exit from apr_sendfile after it sends
bytes so that the app can update the offset & length etc.  But before exiting,
apr should put a mark on the wall that tells us to issue select() first next
time apr_sendfile is called, because the kernel just told us what was likely to
happen.  

We already have this kind of logic in apr_recv.  It uses the APR_INCOMPLETE_READ
flag to predict whether the next call will return EAGAIN/EWOULDBLOCK, so we know
to try select() first on the next apr call.  Perhaps we should rename this to
APR_INCOMPLETE_IO and use it in apr_sendfile, or get really crazy and use a new
flag.
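
To make the idea concrete, here is a minimal sketch in plain C.  It is not APR
code: the sketch_sock struct, the SKETCH_INCOMPLETE_WRITE flag, and the
fake_sendfile()/fake_wait_writable() helpers are all made up for illustration.
The fake "kernel" just models a fixed-size socket buffer that a wait refills,
and it assumes the FreeBSD convention of returning -1/EAGAIN together with a
partial byte count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_INCOMPLETE_WRITE 0x1   /* hypothetical flag, not an APR name */

/* Fake socket; 'space' models free room in the kernel socket buffer. */
typedef struct {
    int flags;
    size_t space;      /* bytes the fake kernel will accept right now */
    size_t capacity;   /* room restored by each wait; assumed > 0 */
    int send_calls;    /* number of fake sendfile() calls */
    int wait_calls;    /* number of fake select() calls */
} sketch_sock;

/* Stand-in for FreeBSD sendfile(): returns 0 if all of len went out,
 * else -1 with errno = EAGAIN and *sbytes set to the partial count. */
int fake_sendfile(sketch_sock *s, size_t len, size_t *sbytes)
{
    size_t n = len < s->space ? len : s->space;
    s->send_calls++;
    s->space -= n;
    *sbytes = n;
    if (n < len) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}

/* Stand-in for select(): pretend the socket buffer drained completely. */
void fake_wait_writable(sketch_sock *s)
{
    s->wait_calls++;
    s->space = s->capacity;
}

/* One send attempt.  Returns 0 with *sent set, or an errno value. */
int sketch_sendfile(sketch_sock *s, size_t len, size_t *sent)
{
    assert(s != NULL && sent != NULL);
    *sent = 0;
    if (s->flags & SKETCH_INCOMPLETE_WRITE) {
        /* The last call was told it would block, so wait before sending. */
        s->flags &= ~SKETCH_INCOMPLETE_WRITE;
        fake_wait_writable(s);
    }
    for (;;) {
        if (fake_sendfile(s, len, sent) == 0)
            return 0;                       /* everything was sent */
        if (errno != EAGAIN)
            return errno;                   /* real error */
        if (*sent > 0) {
            /* Partial send: leave the mark on the wall, then return so
             * the caller can update its offset and length. */
            s->flags |= SKETCH_INCOMPLETE_WRITE;
            return 0;
        }
        fake_wait_writable(s);              /* nothing sent: wait, retry */
    }
}

/* Caller loop: keeps calling until 'total' bytes are sent. */
size_t sketch_send_all(sketch_sock *s, size_t total)
{
    size_t off = 0, n = 0;
    while (off < total) {
        if (sketch_sendfile(s, total - off, &n) != 0)
            break;
        off += n;
    }
    return off;
}
```

With a 4-byte buffer and a 10-byte "file", sketch_send_all() makes three
fake_sendfile() calls and two waits.  Take out the flag handling and the same
transfer needs five sendfile calls for the same two waits, which is the two
sendfiles then a select pattern described above.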
 
> Unfortunately, I don't think this will account for the short bursts of run
> queue growth we're talking about here, but it is something to look into.

Right.  The double sendfiles() have been happening for ages, so they are not the
cause of the load spike problem.  They certainly need to be addressed, along
with a number of other extra syscall problems.  However, if you look at the
change log for apr_sendfile, you'll see that we've gone round and round trying
to get it right on FreeBSD.  So this change needs to be coded, reviewed and tested
very carefully.

Greg
