Jason Edgecombe wrote:

I administer Linux servers for a university. Two of our servers have
become unresponsive a total of three times (twice on one server) in the
past week. These servers are general-purpose timesharing machines and
were under a steady load of around 8. We have students running compute
jobs for last-minute homework assignments, and I know that some students
are working on an intro to threading class. The most telling data is that
ganglia shows a load spike of 50 before one of the outages.

Yeah, sounds like the student programs are definitely the cause.

I'm recording sar data once per minute. The only notable things are a
peak in context switches before the outage and that the interrupts all
go to core 0.

That's further data pointing to runaway programs.

How can I prevent the servers from becoming unresponsive even under heavy
load?

Is it possible to set or adjust the user limits on CPU time, memory size, and number of processes so the runaway programs can't blow up so quickly? It will be a balancing act to do this without impacting normal work; perhaps you can put the limits in place only for the students who are enrolled in the class.
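
As a rough illustration of the kind of limits I mean, here is a minimal bash sketch that runs a command under tighter CPU-time, virtual-memory, and process-count limits using the shell's ulimit builtin. The function name run_limited and all three numeric values are assumptions for illustration only, not tuned recommendations; you would have to pick numbers that fit your normal workload. Note that on Linux the process-count limit also counts threads, which matters for a threading class.

```shell
#!/bin/bash
# Sketch: run a command under tighter resource limits.
# run_limited and the default values below are illustrative assumptions.
CPU_SECS=${CPU_SECS:-600}        # CPU time per process, in seconds
VMEM_KIB=${VMEM_KIB:-4194304}    # virtual memory per process, in KiB (4 GiB)
MAX_PROCS=${MAX_PROCS:-512}      # processes (and threads) per user

run_limited() {
    (
        # The limits are set inside a subshell so the calling shell keeps
        # its own limits. Without -S or -H, ulimit sets both the soft and
        # the hard limit, so the command cannot raise them again.
        ulimit -t "$CPU_SECS"  || exit 1
        ulimit -v "$VMEM_KIB"  || exit 1
        ulimit -u "$MAX_PROCS" || exit 1
        exec "$@"
    )
}

# Show the limits a child process would actually see.
run_limited bash -c 'echo "cpu=$(ulimit -t) vmem=$(ulimit -v) nproc=$(ulimit -u)"'
```

To make limits like these stick at login for just the class, the usual place is /etc/security/limits.conf (read by pam_limits), with lines such as "@threadclass hard nproc 512", where threadclass is a hypothetical group name for the enrolled students.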

What can I do to troubleshoot further?

This is going to be a tough one to catch since it blows up so quickly. If you're grabbing data at 1-minute intervals and the box goes down in 5 minutes or so, the agent collecting the data is going to have to fight for resources itself.

My suspicion is that the programs are creating processes that each use a small amount of RAM and multiply very quickly, maybe too fast for the OOM killer to get a real handle on cleaning things up. When I've dealt with issues like this in the past, I've found that reducing the CPU time or memory limits on the processes helped us keep the box online (at least long enough to get more data).

If you had a better idea of when this might happen again, you could write a little script to capture output from the ps command every 3-10 seconds and review the output to see who or what is causing it.
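
Something along these lines might do for the capture script. It is only a sketch: the output directory, the 5-second default interval, the function names, and the choice of ps columns are all my assumptions, so adjust to taste. Each snapshot records the load average, a thread count per user, and the top processes by CPU.

```shell
#!/bin/bash
# Sketch: snapshot the process table at a short interval.
# INTERVAL, OUTDIR, and the function names are illustrative assumptions.
INTERVAL=${INTERVAL:-5}                    # seconds between snapshots
OUTDIR=${OUTDIR:-/var/tmp/ps-snapshots}    # where snapshots are written

# Write one timestamped snapshot into the given directory and print its path.
snapshot() {
    local dir=$1
    local out="$dir/ps-$(date +%Y%m%d-%H%M%S).txt"
    {
        date
        cat /proc/loadavg
        echo "== threads per user =="
        ps -eLo user= | sort | uniq -c | sort -rn | head -n 10
        echo "== top processes by CPU =="
        ps -eo user,pid,ppid,nlwp,pcpu,pmem,rss,stat,comm --sort=-pcpu | head -n 40
    } > "$out" 2>&1
    echo "$out"
}

# Take the requested number of snapshots, sleeping INTERVAL between them.
watch_ps() {
    local count=$1 i
    mkdir -p "$OUTDIR" || return 1
    for ((i = 0; i < count; i++)); do
        snapshot "$OUTDIR" > /dev/null
        sleep "$INTERVAL"
    done
}

# With a sample count as the first argument, start collecting.
if [ "$#" -gt 0 ]; then
    watch_ps "$1"
fi
```

Saved as, say, ps-watch.sh (a hypothetical name), running "./ps-watch.sh 120" would take 120 snapshots, roughly 10 minutes at the default interval. Since the collector has to compete with the runaway jobs, it may help to start it as root at a higher priority, for example with nice -n -5.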

Tom

_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list
