On Tue, Jan 08, 2002 at 11:52:56PM -0800, Brian Pane wrote: > Can you clarify what it is that these traces are supposed to > show? Was there a spike in load during the middle of the > test case that the traces represent? Or for the entire > duration of the test? Or none of it at all? I can't find > any clear evidence of a spike in syscalls/second (to match > what happened on daedalus) anywhere in the trace data.
Sorry. The spike was consistent. From Greg Ames's post tonight, he was saying that the load average is high whenever there is *any* traffic. I'm seeing this behavior. I think we're not dispatching with the requests as quickly as we should be. (This flood profile was designed to be network-intensive, not CPU-intensive. So, it isn't expected to max out the CPUs, but the network.) The load average was about 12 or so during the test consistently. I doubt that 10 concurrent requests should make the box as sluggish as it was (it was completely unusable interactively). I remember from my earlier testing that the boxes could handle a much higher load without breaking a sweat. Something isn't right. Now, it could be this specific test profile or httpd. I don't know. I was seeing about ~50% sys time, ~25% interrupt time, ~25% user time for the duration of the test (with top). Some of that time could certainly be related to NIC overhead, but I still don't think that would completely explain why the box was so sluggish. (FWIW, this particular box has kernel paniced again due to threaded testing, so I can't test again until someone in the office can reboot it.) > It looks like you used flood to hit the server with a lot > of requests in parallel, right?. If so, can you recreate > the problem with a test case that issues just one request > at a time? I have no idea if this data is of any help. =( Once the box is up again, I can try more tests. But, that won't happen until tomorrow at the earliest... I can try to do some ramp-up tests in the next few days or so. Without having an idea how to reproduce this problem or even quantifying what we are looking for, we're shooting in the dark. > >One thing I noticed is that shutdown(2) looks very expensive. > > > > From the traces, the elapsed time for shutdown calls in general appears > to be somewhat smaller than that of open or stat calls. Are there > specific slow shutdown calls that you're looking at (ones that are > much more expensive than the average)? I saw some shutdown calls that took about .5 seconds (see the shutdown call at 236379 of the dump'd output and that shutdown call returns at line 237745). That area had a lot of suspicious syscall timings - IIRC, some of them were recorded as negative times. I was hoping that there were some specific calls that took a long time to execute that didn't seem normal. From a cursory examination, shutdown() looked a bit odd. Perhaps, I was looking at some abnormal spikes with shutdown. > P.S.: Not one brk call in the entire trace...does that surprise anybody > else? httpd was started before I enabled the traces. So, that probably explains it. I also think that brk() isn't caught by the kernel tracing facilities on FreeBSD unlike Solaris (I haven't seen it so far on FreeBSD). -- justin
