Now that I've had some sleep and a week to reflect on everything that happened, I wanted to revisit the discussion and report back on what we found.
On Tuesday, Oct 29, 2002, at 15:07 US/Pacific, Clifton Royston wrote:
I should have thought of that at the time, but by Friday I had had 4 hours of sleep in 48 hours, and Saturday wasn't much better. It turns out that within an hour of my loading a new configuration of config-file-based POP daemons, the battery in the RAID controller card in the server died. My peer hadn't set the "cache even without battery" option on the machine (which is mostly safe for us, since everything involved has big honkin' UPSes), and there's a bug in the RAID management software that wasn't showing us the full log file (we've logged a complaint with Sun). Plus, the log messages, once we were able to find them, were rather cryptic (making references to things we weren't doing). By Monday, my peer finally found the underlying cause, and had everything fixed inside of a few minutes.
On Sat, Oct 26, 2002 at 07:25:02PM -0700, John Rudd wrote:
I trimmed my part of the above down because the answer to your question
was in my message. To explain it a little more, if a popper starts up and
then immediately tries to open a config file, it's going to hang while
waiting for IO. If you've got 15000 users, that can build up. For example,
in the last two days, when I forgot about this issue, I accidentally enabled
config files for the poppers, and I had, at some points, 6000 popper processes
all sleeping while waiting to open their config files.
I am really a bit dubious this was the source of the contention, unless you are running some unusual operating system, or there is some other underlying bug. As Chuck Yerkes noted, if you have 6000 processes all opening the same file for read access, in most reasonable versions of UNIX they should all be transparently reading copies from the same disk buffer cached in the kernel's RAM.
So, it turns out that the IO slowdown had nothing to do with the config files; it was just an amazingly huge coincidence that both events happened so close together. I'm not sure why I saw an immediate performance boost when I reversed my change, but it was probably also coincidental, because the performance issues came back later. Any non-trivial load of users was enough to overwhelm the disk access rate when caching wasn't happening. (And it wasn't the disk with the config files, as I had originally thought, but the disk array that had the spool files.)
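For anyone who wants to sanity-check the shared-buffer-cache point on their own system, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the helper names (read_config, simulate_readers, make_sample_config) are made up, the "config file" is a throwaway temp file, and it uses threads in one process rather than thousands of separate popper processes. It only shows the shape of the experiment, not what our poppers actually do.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def read_config(path):
    """Open the file, read it fully, and return (bytes read, seconds taken)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    return len(data), time.perf_counter() - start


def simulate_readers(path, readers=200, workers=50):
    """Have many concurrent readers open the same file; return their results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_config, [path] * readers))


def make_sample_config(size=4096):
    """Create a throwaway file standing in for a shared config file."""
    fd, path = tempfile.mkstemp(suffix=".conf")
    with os.fdopen(fd, "wb") as f:
        f.write(b"#" * size)
    return path


if __name__ == "__main__":
    path = make_sample_config()
    try:
        results = simulate_readers(path)
        times = sorted(t for _, t in results)
        print(f"readers: {len(results)}")
        print(f"median open+read: {times[len(times) // 2] * 1e6:.0f} us")
        print(f"slowest open+read: {times[-1] * 1e6:.0f} us")
    finally:
        os.unlink(path)
```

Because the file has just been written, it is almost certainly already in the kernel's cache, so this measures the cached case only. If the per-reader times stay small as the reader count goes up, the contention is coming from somewhere else, which is what turned out to be true for us.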
Thanks for everyone's input and suggestions. Monday afternoon I felt rather embarrassed that I hadn't gotten my peer to triple-check the disk array earlier.
John
