On Tue, Jan 08, 2002 at 11:52:56PM -0800, Brian Pane wrote:
> Can you clarify what it is that these traces are supposed to
> show?  Was there a spike in load during the middle of the
> test case that the traces represent?  Or for the entire
> duration of the test?  Or none of it at all?  I can't find
> any clear evidence of a spike in syscalls/second (to match
> what happened on daedalus) anywhere in the trace data.

Sorry.  The spike was consistent.  From Greg Ames's post tonight,
he was saying that the load average is high whenever there is
*any* traffic.  I'm seeing this behavior.  I think we're not 
dispatching the requests as quickly as we should be.  (This
flood profile was designed to be network-intensive, not 
CPU-intensive.  So, it isn't expected to max out the CPUs, but
the network.)
 
The load average was about 12 or so during the test consistently.
I doubt that 10 concurrent requests should make the box as 
sluggish as it was (it was completely unusable interactively).  I 
remember from my earlier testing that the boxes could handle a
much higher load without breaking a sweat.  Something isn't right.
Now, it could be this specific test profile or httpd.  I don't know.

I was seeing about 50% sys time, 25% interrupt time, and 25%
user time for the duration of the test (with top).  Some of that 
time could certainly be related to NIC overhead, but I still don't 
think that would completely explain why the box was so sluggish.

(FWIW, this particular box has kernel panicked again due to threaded
testing, so I can't test again until someone in the office can 
reboot it.) 

> It looks like you used flood to hit the server with a lot
> of requests in parallel, right?  If so, can you recreate
> the problem with a test case that issues just one request
> at a time?

I have no idea if this data is of any help.  =(  Once the
box is up again, I can try more tests.  But, that won't
happen until tomorrow at the earliest...

I can try to do some ramp-up tests in the next few days.
Without a way to reproduce this problem, or even a way to
quantify what we're looking for, we're shooting in the dark.
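The ramp-up idea could be as simple as stepping the concurrency and
logging the 1-minute load average at each step to see where the spike
starts.  A rough sketch of the extraction step; the uptime line is a
canned sample and run_flood_profile is a placeholder, not the real
test setup:

```shell
# Canned uptime-style line standing in for a live reading on the test box.
sample='11:52PM  up 3 days,  4:18, 2 users, load averages: 12.04, 11.87, 9.63'
# Strip everything up to the load averages and keep the 1-minute figure.
load1=$(echo "$sample" | sed 's/.*load averages*: *//' | cut -d, -f1)
echo "1-min load: $load1"
# In a real run, something like:
#   for c in 1 2 5 10 20; do
#     run_flood_profile "$c"     # placeholder for the actual flood invocation
#     uptime >> ramp.log
#   done
```

The sed pattern matches both FreeBSD's "load averages:" and Linux's
"load average:", so the same snippet works on either box.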

> >One thing I noticed is that shutdown(2) looks very expensive.
> >
> 
> From the traces, the elapsed time for shutdown calls in general appears
> to be somewhat smaller than that of open or stat calls.  Are there
> specific slow shutdown calls that you're looking at (ones that are
> much more expensive than the average)?

I saw some shutdown calls that took about .5 seconds (see the 
shutdown call at line 236379 of the kdump output; that call 
returns at line 237745).  That area had a lot of suspicious syscall
timings - IIRC, some of them were recorded as negative times.
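One way to pull out the slow pairs mechanically, rather than eyeballing
line numbers: a rough sketch assuming kdump-style output with absolute
timestamps in the third field.  The trace lines below are fabricated
samples, not the real data:

```shell
# Fabricated kdump-ish lines: "pid command timestamp CALL/RET name ..."
cat > /tmp/trace.txt <<'EOF'
612 httpd 1010.100000 CALL  shutdown(0x5,0x1)
612 httpd 1010.100050 RET   shutdown 0
613 httpd 1010.200000 CALL  shutdown(0x7,0x1)
613 httpd 1010.703417 RET   shutdown 0
EOF
# Pair each CALL with the next RET for the same pid and flag pairs > 0.4s.
awk '$4 == "CALL" { t[$1] = $3 }
     $4 == "RET"  { d = $3 - t[$1]
                    if (d > 0.4) printf "pid %s %s took %.3fs\n", $1, $5, d }' /tmp/trace.txt
```

Against the sample above this prints the one slow pair (pid 613,
shutdown, ~0.503s); pointed at the real trace it should surface all the
suspicious shutdown calls at once, including any negative deltas.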

I was hoping that there were some specific calls that took a long
time to execute that didn't seem normal.  From a cursory examination,
shutdown() looked a bit odd.  Perhaps, I was looking at some 
abnormal spikes with shutdown.

> P.S.: Not one brk call in the entire trace...does that surprise anybody
> else?

httpd was started before I enabled the traces.  So, that probably
explains it.  I also think that brk() isn't caught by the kernel
tracing facilities on FreeBSD, unlike on Solaris (I haven't seen it
so far on FreeBSD).  -- justin
