Thanks everyone,

I have remote syslog and sysrq enabled, but the hangs have stopped now
that the semester is over. The machines that were hanging are identical
in make, model, and configuration. No RAID, just plain SATA drives. I'm
exploring the mysteries of kdump and gdb so that I'll be better prepared
next time, and I'm also going to try different ways of breaking Linux to
see if I can reproduce the problem.
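
In case it helps anyone else, here is roughly the kdump test procedure
I'm planning, based on the RHEL 5 docs. This is only a sketch: the
crashkernel reservation and the dump location depend on your grub.conf
and /etc/kdump.conf.

  # /boot/grub/grub.conf -- reserve memory for the capture kernel, e.g.:
  #   kernel /vmlinuz-2.6.18-92.1.17.el5 ro root=... crashkernel=128M@16M
  chkconfig kdump on
  service kdump start
  # with the magic SysRq key enabled, force a test panic; the vmcore
  # should land under /var/crash/ per /etc/kdump.conf
  echo c > /proc/sysrq-trigger
  # afterwards, analyze with the crash utility (needs kernel-debuginfo)
  crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/*/vmcore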

Thanks,
Jason

Barry Brimer wrote:
> I don't know if this will help, but hangwatch
> <http://people.redhat.com/astokes/hangwatch/> will run sysrq commands
> when load reaches a certain point so you can find out what was going
> on when you get a load spike.  I would probably set it up with a
> remote syslog of some kind too .. although if your system is not
> crashing, you probably wouldn't need remote syslogging.
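
For anyone setting this up on RHEL 5, the sysrq and remote syslog pieces
are roughly the following ("loghost.example.edu" stands in for your own
log server):

  # /etc/sysctl.conf -- keep the magic SysRq key enabled across reboots
  kernel.sysrq = 1
  # apply immediately without a reboot
  sysctl -w kernel.sysrq=1

  # /etc/syslog.conf -- forward kernel messages to a central loghost
  kern.*          @loghost.example.edu
  # restart syslogd so it picks up the change
  service syslog restart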
>
> On Fri, 1 May 2009, solarflow99 wrote:
>
>> Usually remote syslog can catch any last error messages to the console
>> that might give some clues. I tried this before but it didn't help me:
>> I had 2 servers that would hang, one of them more frequently than the
>> other, and I never did find out what was wrong. Are there any RAID
>> drivers or anything else different with it?
>>
>>
>>
>> On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:
>>>
>>> Are you talking about running iostat and sending it to a remote syslog
>>> periodically?
>>>
>>> sar shows 31% of the CPU in I/O wait for one minute, 5 minutes before
>>> it stopped recording, but the average over the whole interval shows
>>> only 1.82% I/O wait.
>>>
>>> Here is the "sar -u" output for the time before the crash:
>>> Linux 2.6.18-92.1.17.el5 (xxxxxxx)      04/29/2009
>>>
>>> 12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
>>> 12:01:01 AM       all     26.00      0.00      1.34      0.13      0.00     72.53
>>> 12:02:01 AM       all     25.93      0.00      1.19      0.00      0.00     72.88
>>> 12:03:01 AM       all     24.78      0.00      1.01      0.00      0.00     74.20
>>> 12:04:01 AM       all     24.67      0.00      0.95      0.00      0.00     74.38
>>> 12:05:02 AM       all     25.30      0.00      0.93      0.03      0.00     73.74
>>> 12:06:01 AM       all     25.51      0.00      1.06      0.04      0.00     73.39
>>> 12:07:01 AM       all     25.45      0.00      1.32      0.00      0.00     73.23
>>> 12:08:01 AM       all     26.11      0.00      1.04      0.03      0.00     72.82
>>> 12:09:01 AM       all     25.35      0.00      0.98      0.00      0.00     73.66
>>> 12:10:01 AM       all     26.89      0.00      2.63      1.09      0.00     69.39
>>> 12:11:01 AM       all     26.86      0.00      1.66      0.47      0.00     71.01
>>> 12:12:01 AM       all     26.16      0.00      1.42      0.04      0.00     72.38
>>> 12:13:01 AM       all     25.88      0.00      1.33      0.00      0.00     72.79
>>> 12:14:01 AM       all     26.52      0.00      1.97      0.40      0.00     71.12
>>> 12:15:01 AM       all     27.35      0.00      2.18      0.25      0.00     70.22
>>> 12:16:01 AM       all     25.17      0.00      1.17      0.05      0.00     73.61
>>> 12:17:01 AM       all     26.24      0.00      1.75      0.03      0.00     71.98
>>> 12:18:01 AM       all     25.37      0.00      1.43      0.13      0.00     73.07
>>> 12:19:01 AM       all     26.60      0.00      1.65      0.02      0.00     71.73
>>> 12:20:01 AM       all     26.66      0.00      1.87      0.59      0.00     70.89
>>> 12:21:01 AM       all     25.16      0.00      1.25      1.21      0.00     72.38
>>> 12:22:01 AM       all     28.26      0.00      1.26      0.42      0.00     70.07
>>> 12:23:01 AM       all     26.54      0.00      1.46      1.02      0.00     70.99
>>> 12:24:01 AM       all     25.56      0.00      1.64      0.30      0.00     72.50
>>> 12:25:01 AM       all     24.87      0.00      9.23     31.85      0.00     34.04
>>> 12:26:01 AM       all     28.32      0.00      2.84     15.70      0.00     53.14
>>> 12:27:01 AM       all     24.97      0.00      1.17      0.07      0.00     73.80
>>> 12:28:02 AM       all     26.20      0.00      1.27      0.23      0.00     72.30
>>> 12:29:01 AM       all     27.37      0.00      2.50      0.18      0.00     69.95
>>> 12:30:01 AM       all     31.04      0.00      2.65      0.15      0.00     66.16
>>> Average:          all     26.24      0.00      1.80      1.82      0.00     70.14
>>> 08:20:06 AM       LINUX RESTART
>>>
>>> Ganglia shows the number of running processes spiking sharply to a
>>> peak of 30+.
>>>
>>> I had to power-cycle the boxes to recover.
>>>
>>> Thanks,
>>> Jason
>>>
>>> solarflow99 wrote:
>>>> How are the I/O wait states at the time?  Can you try a netdump to a
>>>> remote syslog?
>>>>
>>>>
>>>> On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
>>>> <ja...@rampaginggeek.com> wrote:
>>>>
>>>>     Hi everyone,
>>>>
>>>>     I administer Linux servers for a university. Two of our servers
>>>>     have become unresponsive three times (twice on one server) in the
>>>>     past week. These servers are general-purpose timesharing machines
>>>>     and were under a steady load of around 8. We have students running
>>>>     compute jobs for last-minute homework assignments, and I know that
>>>>     some students are working on an intro-to-threading class. The most
>>>>     telling data is that Ganglia shows a load spike of 50 before one of
>>>>     the outages.
>>>>
>>>>     The servers are Dell PowerEdge 860s with 8GB of RAM and a single
>>>>     quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.
>>>>
>>>>     I have the following limits in place:
>>>>     core file size          (blocks, -c) 0
>>>>     data seg size           (kbytes, -d) unlimited
>>>>     scheduling priority             (-e) 0
>>>>     file size               (blocks, -f) unlimited
>>>>     pending signals                 (-i) 16367
>>>>     max locked memory       (kbytes, -l) 32
>>>>     max memory size         (kbytes, -m) unlimited
>>>>     open files                      (-n) 1024
>>>>     pipe size            (512 bytes, -p) 8
>>>>     POSIX message queues     (bytes, -q) 819200
>>>>     real-time priority              (-r) 0
>>>>     stack size              (kbytes, -s) 10240
>>>>     cpu time               (seconds, -t) unlimited
>>>>     max user processes              (-u) 200
>>>>     virtual memory          (kbytes, -v) 2057564
>>>>     file locks                      (-x) unlimited
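
(A note for the archives: the caps above come from pam_limits. A minimal
/etc/security/limits.conf sketch, assuming a hypothetical "students"
group, would look like the following; the numbers are illustrative, not
what we actually run.)

  # /etc/security/limits.conf -- per-user caps for student accounts
  @students    hard    nproc    150        # max processes per user
  @students    hard    as       1048576    # address space, in KB
  @students    hard    cpu      120        # CPU time, in minutes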
>>>>
>>>>     I'm recording sar data once per minute. The only notable things are
>>>>     a peak in context switches before the outage and the fact that all
>>>>     interrupts go to core 0.
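
(Regarding the interrupts all landing on core 0: on RHEL 5 the usual
check and fix is below. IRQ 19 is just a made-up example, and the
affinity value is a hex CPU mask.)

  # see which CPU each interrupt line is hitting
  cat /proc/interrupts
  # let the irqbalance daemon spread IRQs across cores
  chkconfig irqbalance on
  service irqbalance start
  # or pin a single IRQ by hand, e.g. IRQ 19 to CPU2 (mask 0x4)
  echo 4 > /proc/irq/19/smp_affinity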
>>>>
>>>>     How can I prevent the servers from becoming unresponsive even
>>>>     under heavy load?
>>>>
>>>>     What can I do to troubleshoot further?
>>>>
>>>>     Thanks,
>>>>     Jason
>>>>

_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list
