Thanks everyone,

I have remote syslog and sysrq enabled, but the hangs have stopped since the semester is over. The machines that are hanging are identical in make, model, and configuration. No RAID, just SATA drives. I'm exploring the mysteries of kdump and gdb so that I will be better prepared next time. I'm also going to try different ways of breaking Linux to see if I can reproduce the problem.
Thanks,
Jason

Barry Brimer wrote:
> I don't know if this will help, but hangwatch
> <http://people.redhat.com/astokes/hangwatch/> will run sysrq commands
> when load reaches a certain point so you can find out what was going
> on when you get a load spike. I would probably set it up with a
> remote syslog of some kind too, although if your system is not
> crashing, you probably wouldn't need remote syslogging.
>
> On Fri, 1 May 2009, solarflow99 wrote:
>
>> Usually remote syslog can catch any last error messages to the
>> console that might give some clues. I tried this before but it
>> didn't help me. I had 2 servers that would hang, one of them more
>> frequently than the other, and I never did find out what was wrong.
>> Are there any RAID drivers or anything different with it?
>>
>> On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:
>>>
>>> Are you talking about running iostat and sending it to a remote
>>> syslog periodically?
>>>
>>> sar shows 31% of the CPU was used for I/O for one minute, 5 minutes
>>> before it stopped recording, but the last I/O record shows 1.82%.
>>>
>>> Here is the "sar -u" output for the time before the crash:
>>> Linux 2.6.18-92.1.17.el5 (xxxxxxx)      04/29/2009
>>>
>>> 12:00:01 AM   CPU   %user   %nice %system %iowait  %steal   %idle
>>> 12:01:01 AM   all   26.00    0.00    1.34    0.13    0.00   72.53
>>> 12:02:01 AM   all   25.93    0.00    1.19    0.00    0.00   72.88
>>> 12:03:01 AM   all   24.78    0.00    1.01    0.00    0.00   74.20
>>> 12:04:01 AM   all   24.67    0.00    0.95    0.00    0.00   74.38
>>> 12:05:02 AM   all   25.30    0.00    0.93    0.03    0.00   73.74
>>> 12:06:01 AM   all   25.51    0.00    1.06    0.04    0.00   73.39
>>> 12:07:01 AM   all   25.45    0.00    1.32    0.00    0.00   73.23
>>> 12:08:01 AM   all   26.11    0.00    1.04    0.03    0.00   72.82
>>> 12:09:01 AM   all   25.35    0.00    0.98    0.00    0.00   73.66
>>> 12:10:01 AM   all   26.89    0.00    2.63    1.09    0.00   69.39
>>> 12:11:01 AM   all   26.86    0.00    1.66    0.47    0.00   71.01
>>> 12:12:01 AM   all   26.16    0.00    1.42    0.04    0.00   72.38
>>> 12:13:01 AM   all   25.88    0.00    1.33    0.00    0.00   72.79
>>> 12:14:01 AM   all   26.52    0.00    1.97    0.40    0.00   71.12
>>> 12:15:01 AM   all   27.35    0.00    2.18    0.25    0.00   70.22
>>> 12:16:01 AM   all   25.17    0.00    1.17    0.05    0.00   73.61
>>> 12:17:01 AM   all   26.24    0.00    1.75    0.03    0.00   71.98
>>> 12:18:01 AM   all   25.37    0.00    1.43    0.13    0.00   73.07
>>> 12:19:01 AM   all   26.60    0.00    1.65    0.02    0.00   71.73
>>> 12:20:01 AM   all   26.66    0.00    1.87    0.59    0.00   70.89
>>> 12:21:01 AM   all   25.16    0.00    1.25    1.21    0.00   72.38
>>> 12:22:01 AM   all   28.26    0.00    1.26    0.42    0.00   70.07
>>> 12:23:01 AM   all   26.54    0.00    1.46    1.02    0.00   70.99
>>> 12:24:01 AM   all   25.56    0.00    1.64    0.30    0.00   72.50
>>> 12:25:01 AM   all   24.87    0.00    9.23   31.85    0.00   34.04
>>> 12:26:01 AM   all   28.32    0.00    2.84   15.70    0.00   53.14
>>> 12:27:01 AM   all   24.97    0.00    1.17    0.07    0.00   73.80
>>> 12:28:02 AM   all   26.20    0.00    1.27    0.23    0.00   72.30
>>> 12:29:01 AM   all   27.37    0.00    2.50    0.18    0.00   69.95
>>> 12:30:01 AM   all   31.04    0.00    2.65    0.15    0.00   66.16
>>> Average:      all   26.24    0.00    1.80    1.82    0.00   70.14
>>> 08:20:06 AM       LINUX RESTART
>>>
>>> Ganglia shows the number of running processes spike sharply to a
>>> max of 30+.
>>>
>>> I had to power-cycle the boxes to recover.
>>>
>>> Thanks,
>>> Jason
>>>
>>> solarflow99 wrote:
>>>> How are the I/O wait states at the time? Can you try a netdump to
>>>> a remote syslog?
>>>>
>>>> On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
>>>> <ja...@rampaginggeek.com> wrote:
>>>>
>>>>     Hi everyone,
>>>>
>>>>     I administer Linux servers for a university. I have had two of
>>>>     our servers become unresponsive three times (twice on one
>>>>     server) in the past week. These servers are general purpose
>>>>     timesharing machines and were under a steady load of around 8.
>>>>     We have students running compute jobs for last-minute homework
>>>>     assignments. I know that some students are working on an intro
>>>>     to threading class. The most telling data is that Ganglia
>>>>     shows a load spike of 50 before one of the outages.
>>>>
>>>>     The servers are Dell PowerEdge 860 with 8GB of RAM and a single
>>>>     quad-core Xeon CPU. The OS is RHEL 5.2 64bit Desktop.
>>>>
>>>>     I have the following limits in place:
>>>>     core file size          (blocks, -c) 0
>>>>     data seg size           (kbytes, -d) unlimited
>>>>     scheduling priority             (-e) 0
>>>>     file size               (blocks, -f) unlimited
>>>>     pending signals                 (-i) 16367
>>>>     max locked memory       (kbytes, -l) 32
>>>>     max memory size         (kbytes, -m) unlimited
>>>>     open files                      (-n) 1024
>>>>     pipe size            (512 bytes, -p) 8
>>>>     POSIX message queues     (bytes, -q) 819200
>>>>     real-time priority              (-r) 0
>>>>     stack size              (kbytes, -s) 10240
>>>>     cpu time               (seconds, -t) unlimited
>>>>     max user processes              (-u) 200
>>>>     virtual memory          (kbytes, -v) 2057564
>>>>     file locks                      (-x) unlimited
>>>>
>>>>     I'm recording sar data once per minute. The only notable thing
>>>>     is a peak of context switches before the outage, and the
>>>>     interrupts all go to core 0.
>>>>
>>>>     How can I prevent the servers from becoming unresponsive even
>>>>     under heavy load?
>>>>
>>>>     What can I do to troubleshoot further?
>>>>
>>>>     Thanks,
>>>>     Jason
>>>>
>>>>     _______________________________________________
>>>>     rhelv5-list mailing list
>>>>     rhelv5-list@redhat.com
>>>>     https://www.redhat.com/mailman/listinfo/rhelv5-list
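For anyone reading this in the archive: the idea Barry describes (run sysrq commands once load passes a certain point) can be roughed out in a few lines of Python. This is only a sketch of the concept and is not hangwatch itself. The threshold, the poll interval, and all function names are made up for illustration. It assumes a Linux box with /proc/loadavg and /proc/sysrq-trigger, sysrq enabled, and root privileges; the sysrq output lands in the kernel log, so a remote syslog would be what carries it off the box.

```python
"""Sketch of a load watcher in the spirit of hangwatch (NOT hangwatch itself)."""
import time

LOADAVG_PATH = "/proc/loadavg"
SYSRQ_TRIGGER_PATH = "/proc/sysrq-trigger"

# Hypothetical values, not taken from the thread: pick something between
# the normal steady load (around 8 here) and the spike seen before a hang.
LOAD_THRESHOLD = 20.0
POLL_SECONDS = 10

# sysrq keys to fire: m = memory info, t = task list.
SYSRQ_KEYS = "mt"


def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def should_trigger(load, threshold=LOAD_THRESHOLD):
    """True when the load is at or above the threshold."""
    return load >= threshold


def dump_state(keys=SYSRQ_KEYS, trigger_path=SYSRQ_TRIGGER_PATH):
    """Write each sysrq key to the trigger file, one write per key."""
    for key in keys:
        with open(trigger_path, "w") as trigger:
            trigger.write(key)


def watch():
    """Poll the load average forever and dump state on a spike."""
    while True:
        with open(LOADAVG_PATH) as loadavg:
            load = parse_loadavg(loadavg.read())
        if should_trigger(load):
            dump_state()
        time.sleep(POLL_SECONDS)
```

Calling watch() as root starts the loop. Nothing runs on import, so the parsing and threshold logic can be tried out on an ordinary workstation first.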