Hi everyone,

I administer Linux servers for a university. Two of our servers have
become unresponsive a total of three times (twice on one server) in the
past week. These servers are general-purpose timesharing machines and
were under a steady load of around 8. We have students running compute
jobs for last-minute homework assignments, and I know that some of them
are taking an intro-to-threading class. The most telling data point is
that Ganglia shows a load spike of 50 before one of the outages.

The servers are Dell PowerEdge 860s, each with 8GB of RAM and a single
quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.

I have the following limits in place:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 16367
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 200
virtual memory          (kbytes, -v) 2057564
file locks                      (-x) unlimited
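
For anyone who wants to compare these values against what a running
process actually sees, here is a minimal sketch using Python's standard
resource module. It assumes a reasonably recent Python 3 on Linux; the
names LIMITS, fmt and current_limits are made up for illustration. Note
that getrlimit reports memory-related limits in bytes, whereas the
listing above shows them in kbytes.

```python
import resource

# Subset of the limits listed above, keyed by their ulimit label.
# RLIMIT_AS and RLIMIT_STACK are reported in bytes, not kbytes.
LIMITS = {
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "virtual memory (-v)": resource.RLIMIT_AS,
    "open files (-n)": resource.RLIMIT_NOFILE,
    "stack size (-s)": resource.RLIMIT_STACK,
}


def fmt(value):
    """Render a limit value the way ulimit does for 'no limit'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the calling process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print("%-26s soft=%s hard=%s" % (label, fmt(soft), fmt(hard)))
```

Running it from a student login shell should show whether the soft
limits really match the listing, since limits are inherited per process.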

I'm recording sar data once per minute. The only notable things are a
peak in context switches before the outage and that all interrupts go
to core 0.
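
The interrupt distribution can be double-checked independently of sar
by summing the per-CPU columns of /proc/interrupts. The following is a
rough sketch only, assuming Python 3 and the usual /proc/interrupts
layout (a header row of CPU names followed by one row per interrupt
source); the function name per_cpu_interrupt_totals is hypothetical.

```python
import os


def per_cpu_interrupt_totals(text):
    """Sum the interrupt counts in each CPU column of /proc/interrupts text."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()  # header row, e.g. ['CPU0', 'CPU1', ...]
    totals = [0] * len(cpus)
    for line in lines[1:]:
        # Drop the "IRQ:" label, then keep one field per CPU column.
        _, _, rest = line.partition(":")
        fields = rest.split()[:len(cpus)]
        # Rows such as "ERR:" carry a single counter, not one per CPU; skip them.
        if len(fields) < len(cpus) or not all(f.isdigit() for f in fields):
            continue
        for i, count in enumerate(fields):
            totals[i] += int(count)
    return dict(zip(cpus, totals))


if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    with open("/proc/interrupts") as fh:
        totals = per_cpu_interrupt_totals(fh.read())
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero
    for cpu, count in totals.items():
        print("%s: %d (%.1f%%)" % (cpu, count, 100.0 * count / grand_total))
```

The counters are cumulative since boot, so comparing two runs a minute
apart shows where interrupts are landing now rather than historically.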

How can I prevent the servers from becoming unresponsive even under
heavy load?

What can I do to troubleshoot further?

Thanks,
Jason

_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list
