How are the I/O wait states at the time of the outages? Can you try a netdump to a remote syslog?
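If the sar data does not already cover it (sar -u reports %iowait), here is a rough sketch of how the I/O wait share could be sampled directly from /proc/stat. It assumes the usual Linux field order on the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal); the function names and the 5-second interval are mine, purely for illustration.

```python
import time


def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(x) for x in parts[1:]]
    if len(values) < 5:
        raise ValueError("no iowait field present")
    return values


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    # Assumed field order: user nice system idle iowait irq softirq steal.
    # Only the first 8 fields are summed so that guest time, which newer
    # kernels also count inside user time, is not counted twice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = parse_cpu_line(read_cpu_line())
    time.sleep(5)  # sampling interval, chosen arbitrarily
    second = parse_cpu_line(read_cpu_line())
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

Run from cron or in a loop with output sent to another host, this would leave a record of iowait leading up to a hang even if the box itself stops responding.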
On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe <ja...@rampaginggeek.com> wrote:

> Hi everyone,
>
> I administer Linux servers for a university. Two of our servers have
> become unresponsive a total of three times (twice on one server) in the
> past week. These servers are general-purpose timesharing machines and
> were under a steady load of around 8. We have students running compute
> jobs for last-minute homework assignments. I know that some students are
> working on an intro to threading class. The most telling data is that
> Ganglia shows a load spike of 50 before one of the outages.
>
> The servers are Dell PowerEdge 860s with 8GB of RAM and a single
> quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.
>
> I have the following limits in place:
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 16367
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 200
> virtual memory          (kbytes, -v) 2057564
> file locks                      (-x) unlimited
>
> I'm recording sar data once per minute. The only notable things are a
> peak in context switches before the outage and that the interrupts all
> go to core 0.
>
> How can I prevent the servers from becoming unresponsive even under
> heavy load?
>
> What can I do to troubleshoot further?
>
> Thanks,
> Jason
>
> _______________________________________________
> rhelv5-list mailing list
> rhelv5-list@redhat.com
> https://www.redhat.com/mailman/listinfo/rhelv5-list