Usually a remote syslog can catch any last error messages sent to the console that might give some clues. I tried this before but it didn't help me. I had two servers that would hang, one of them more frequently than the other, and I never did find out what was wrong. Are there any RAID drivers or anything different about it?
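For the "send stats to a remote syslog periodically" idea, here is a rough sketch of what I had in mind, in Python. It reads /proc/loadavg and ships one line per interval over UDP syslog. The host name loghost.example.com, port 514, the 10 second interval and the logger name "hangwatch" are all placeholders of mine, and it assumes the remote syslog daemon is configured to accept network messages. It only covers periodic stats; it does not capture kernel console output.

```python
import logging
import logging.handlers
import time


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg, e.g. '8.02 7.95 7.80 3/412 20931'."""
    fields = text.split()
    running, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "running": int(running),  # currently runnable scheduling entities
        "total": int(total),      # total scheduling entities on the system
    }


def format_snapshot(stats):
    """Render one syslog line from the parsed load averages."""
    return "load1=%.2f load5=%.2f load15=%.2f running=%d total=%d" % (
        stats["load1"], stats["load5"], stats["load15"],
        stats["running"], stats["total"],
    )


def make_remote_logger(host, port=514):
    """Build a logger that sends to a remote syslog daemon over UDP."""
    logger = logging.getLogger("hangwatch")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=(host, port)))
    return logger


def main(host="loghost.example.com", interval=10):
    """Send a load snapshot every `interval` seconds until killed."""
    logger = make_remote_logger(host)
    while True:
        with open("/proc/loadavg") as f:
            logger.info(format_snapshot(parse_loadavg(f.read())))
        time.sleep(interval)
```

Calling main() starts the loop. A short interval matters here: with one-minute samples the last record before a hang can look perfectly healthy.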
On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:
>
> Are you talking about running iostat and sending it to a remote syslog
> periodically?
>
> sar shows 31% of the CPU was used for I/O for one minute, 5 minutes before
> it stopped recording, but the last I/O record shows 1.82%
>
> Here is the "sar -u" output for the time before the crash:
> Linux 2.6.18-92.1.17.el5 (xxxxxxx) 04/29/2009
>
> 12:00:01 AM  CPU   %user   %nice  %system  %iowait  %steal   %idle
> 12:01:01 AM  all   26.00    0.00     1.34     0.13    0.00   72.53
> 12:02:01 AM  all   25.93    0.00     1.19     0.00    0.00   72.88
> 12:03:01 AM  all   24.78    0.00     1.01     0.00    0.00   74.20
> 12:04:01 AM  all   24.67    0.00     0.95     0.00    0.00   74.38
> 12:05:02 AM  all   25.30    0.00     0.93     0.03    0.00   73.74
> 12:06:01 AM  all   25.51    0.00     1.06     0.04    0.00   73.39
> 12:07:01 AM  all   25.45    0.00     1.32     0.00    0.00   73.23
> 12:08:01 AM  all   26.11    0.00     1.04     0.03    0.00   72.82
> 12:09:01 AM  all   25.35    0.00     0.98     0.00    0.00   73.66
> 12:10:01 AM  all   26.89    0.00     2.63     1.09    0.00   69.39
> 12:11:01 AM  all   26.86    0.00     1.66     0.47    0.00   71.01
> 12:12:01 AM  all   26.16    0.00     1.42     0.04    0.00   72.38
> 12:13:01 AM  all   25.88    0.00     1.33     0.00    0.00   72.79
> 12:14:01 AM  all   26.52    0.00     1.97     0.40    0.00   71.12
> 12:15:01 AM  all   27.35    0.00     2.18     0.25    0.00   70.22
> 12:16:01 AM  all   25.17    0.00     1.17     0.05    0.00   73.61
> 12:17:01 AM  all   26.24    0.00     1.75     0.03    0.00   71.98
> 12:18:01 AM  all   25.37    0.00     1.43     0.13    0.00   73.07
> 12:19:01 AM  all   26.60    0.00     1.65     0.02    0.00   71.73
> 12:20:01 AM  all   26.66    0.00     1.87     0.59    0.00   70.89
> 12:21:01 AM  all   25.16    0.00     1.25     1.21    0.00   72.38
> 12:22:01 AM  all   28.26    0.00     1.26     0.42    0.00   70.07
> 12:23:01 AM  all   26.54    0.00     1.46     1.02    0.00   70.99
> 12:24:01 AM  all   25.56    0.00     1.64     0.30    0.00   72.50
> 12:25:01 AM  all   24.87    0.00     9.23    31.85    0.00   34.04
> 12:26:01 AM  all   28.32    0.00     2.84    15.70    0.00   53.14
> 12:27:01 AM  all   24.97    0.00     1.17     0.07    0.00   73.80
> 12:28:02 AM  all   26.20    0.00     1.27     0.23    0.00   72.30
> 12:29:01 AM  all   27.37    0.00     2.50     0.18    0.00   69.95
> 12:30:01 AM  all   31.04    0.00     2.65     0.15    0.00   66.16
> Average:     all   26.24    0.00     1.80     1.82    0.00   70.14
> 08:20:06 AM LINUX RESTART
>
> Ganglia shows the number of running processes spike sharply at a max of
> 30+.
>
> I had to power-cycle the boxes to recover.
>
> Thanks,
> Jason
>
> solarflow99 wrote:
> > How are the I/O wait states at the time? Can you try a netdump to a
> > remote syslog?
> >
> > On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
> > <ja...@rampaginggeek.com> wrote:
> >
> > Hi everyone,
> >
> > I administer Linux servers for a university. Two of our servers have
> > become unresponsive a total of three times (twice on one server) in
> > the past week. These servers are general-purpose timesharing machines
> > and were under a steady load of around 8. We have students running
> > compute jobs for last-minute homework assignments. I know that some
> > students are working on an intro to threading class. The most telling
> > data is that Ganglia shows a load spike of 50 before one of the
> > outages.
> >
> > The servers are Dell PowerEdge 860s with 8GB of RAM and a single
> > quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.
> >
> > I have the following limits in place:
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 16367
> > max locked memory       (kbytes, -l) 32
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 1024
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 10240
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) 200
> > virtual memory          (kbytes, -v) 2057564
> > file locks                      (-x) unlimited
> >
> > I'm recording sar data once per minute. The only notable thing is a
> > peak of context switches before the outage, and the interrupts all go
> > to core 0.
> >
> > How can I prevent the servers from becoming unresponsive even under
> > heavy load?
> >
> > What can I do to troubleshoot further?
> >
> > Thanks,
> > Jason
> >
> > _______________________________________________
> > rhelv5-list mailing list
> > rhelv5-list@redhat.com
> > https://www.redhat.com/mailman/listinfo/rhelv5-list
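The "sar -u" rows quoted above can be scanned for iowait spikes with something like the following. It's only a sketch: the function name iowait_spikes and the 10% threshold are my own choices, and it assumes the column layout shown in the quoted output (12-hour timestamps with an AM/PM field, %iowait as the seventh field).

```python
def iowait_spikes(sar_text, threshold=10.0):
    """Return (timestamp, %iowait) pairs for 'all' rows above the threshold."""
    spikes = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Data rows look like:
        # 12:25:01 AM all 24.87 0.00 9.23 31.85 0.00 34.04
        # Header, "Average:" and "LINUX RESTART" rows fail this check.
        if len(fields) != 9 or fields[1] not in ("AM", "PM") or fields[2] != "all":
            continue
        try:
            iowait = float(fields[6])
        except ValueError:
            continue
        if iowait > threshold:
            spikes.append((fields[0] + " " + fields[1], iowait))
    return spikes
```

Run against the rows above, this picks out 12:25:01 AM (31.85) and 12:26:01 AM (15.70), which is the one burst of I/O wait roughly five minutes before recording stopped.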