I don't know if this will help, but hangwatch
<http://people.redhat.com/astokes/hangwatch/> will run sysrq commands when
load reaches a certain point so you can find out what was going on when
you get a load spike. I would probably set it up with a remote syslog of
some kind too, although if your system is not crashing, you probably
wouldn't need remote syslogging.
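For anyone who wants to see the general idea before installing anything, here is a rough sketch of a load-triggered sysrq dump in Python. It is not hangwatch's actual implementation, only an illustration of the same approach. The threshold, the polling interval and the choice of sysrq keys are assumptions; /proc/loadavg and /proc/sysrq-trigger are the standard Linux interfaces, and writing to the trigger file needs root.

```python
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux path; writing requires root
LOAD_THRESHOLD = 30.0                  # hypothetical threshold, tune per machine
SYSRQ_KEYS = "mt"                      # 'm' = memory info, 't' = task list


def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg content."""
    return float(text.split()[0])


def should_trigger(load, threshold=LOAD_THRESHOLD):
    """True when the load is at or above the threshold."""
    return load >= threshold


def fire_sysrq(keys=SYSRQ_KEYS, path=SYSRQ_TRIGGER):
    """Write each sysrq command character to the trigger file, one at a time."""
    for key in keys:
        with open(path, "w") as fh:
            fh.write(key)


def watch(interval=10):
    """Poll the load average forever and dump state when it crosses the threshold."""
    while True:
        with open("/proc/loadavg") as fh:
            load = parse_loadavg(fh.read())
        if should_trigger(load):
            fire_sysrq()
        time.sleep(interval)
```

Calling watch() as root starts the loop. The sysrq output goes to the kernel log, so it only survives a hang if that log is forwarded off the box or written to disk before the machine stops responding.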
On Fri, 1 May 2009, solarflow99 wrote:
Usually remote syslog can catch any last error messages sent to the console
that might give some clues. I tried this before but it didn't help me. I had 2
servers that would hang, one of them more frequently than the other, and I
never did find out what was wrong. Are there any RAID drivers or anything
different with it?
On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:
Are you talking about running iostat and sending it to a remote syslog
periodically?
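A minimal sketch of what that could look like, assuming Python is available on the box. The logger name, the "iostat:" message prefix and the helper names are made up for illustration; SysLogHandler is the standard-library handler and sends over UDP here. The remote host would be whatever syslog server you have, and it must be configured to accept remote messages.

```python
import logging
import logging.handlers
import socket
import subprocess


def make_remote_logger(host, port=514):
    """Build a logger that sends records to a syslog server over UDP."""
    logger = logging.getLogger("iostat-remote")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_DGRAM
    )
    handler.setFormatter(logging.Formatter("iostat: %(message)s"))
    logger.addHandler(handler)
    return logger


def log_lines(logger, text):
    """Log each non-empty line of text; return how many lines were sent."""
    lines = [line for line in text.splitlines() if line.strip()]
    for line in lines:
        logger.info(line)
    return len(lines)


def run_iostat():
    """Run `iostat -x` once and return its output (needs the sysstat package)."""
    result = subprocess.run(
        ["iostat", "-x"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Something like log_lines(make_remote_logger("loghost.example.com"), run_iostat()) run from cron once a minute would do it; loghost.example.com is a placeholder. UDP syslog is fire-and-forget, so a message sent just before a hang can still be lost.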
sar shows about 31% of CPU time in I/O wait for one minute, 5 minutes before
it stopped recording, but the last I/O record (the Average row) shows 1.82%.
Here is the "sar -u" output for the time before the crash:
Linux 2.6.18-92.1.17.el5 (xxxxxxx) 04/29/2009
12:00:01 AM  CPU   %user   %nice  %system  %iowait  %steal   %idle
12:01:01 AM  all   26.00    0.00     1.34     0.13    0.00   72.53
12:02:01 AM  all   25.93    0.00     1.19     0.00    0.00   72.88
12:03:01 AM  all   24.78    0.00     1.01     0.00    0.00   74.20
12:04:01 AM  all   24.67    0.00     0.95     0.00    0.00   74.38
12:05:02 AM  all   25.30    0.00     0.93     0.03    0.00   73.74
12:06:01 AM  all   25.51    0.00     1.06     0.04    0.00   73.39
12:07:01 AM  all   25.45    0.00     1.32     0.00    0.00   73.23
12:08:01 AM  all   26.11    0.00     1.04     0.03    0.00   72.82
12:09:01 AM  all   25.35    0.00     0.98     0.00    0.00   73.66
12:10:01 AM  all   26.89    0.00     2.63     1.09    0.00   69.39
12:11:01 AM  all   26.86    0.00     1.66     0.47    0.00   71.01
12:12:01 AM  all   26.16    0.00     1.42     0.04    0.00   72.38
12:13:01 AM  all   25.88    0.00     1.33     0.00    0.00   72.79
12:14:01 AM  all   26.52    0.00     1.97     0.40    0.00   71.12
12:15:01 AM  all   27.35    0.00     2.18     0.25    0.00   70.22
12:16:01 AM  all   25.17    0.00     1.17     0.05    0.00   73.61
12:17:01 AM  all   26.24    0.00     1.75     0.03    0.00   71.98
12:18:01 AM  all   25.37    0.00     1.43     0.13    0.00   73.07
12:19:01 AM  all   26.60    0.00     1.65     0.02    0.00   71.73
12:20:01 AM  all   26.66    0.00     1.87     0.59    0.00   70.89
12:21:01 AM  all   25.16    0.00     1.25     1.21    0.00   72.38
12:22:01 AM  all   28.26    0.00     1.26     0.42    0.00   70.07
12:23:01 AM  all   26.54    0.00     1.46     1.02    0.00   70.99
12:24:01 AM  all   25.56    0.00     1.64     0.30    0.00   72.50
12:25:01 AM  all   24.87    0.00     9.23    31.85    0.00   34.04
12:26:01 AM  all   28.32    0.00     2.84    15.70    0.00   53.14
12:27:01 AM  all   24.97    0.00     1.17     0.07    0.00   73.80
12:28:02 AM  all   26.20    0.00     1.27     0.23    0.00   72.30
12:29:01 AM  all   27.37    0.00     2.50     0.18    0.00   69.95
12:30:01 AM  all   31.04    0.00     2.65     0.15    0.00   66.16
Average:     all   26.24    0.00     1.80     1.82    0.00   70.14
08:20:06 AM LINUX RESTART
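With a whole night of one-minute samples it is easy to miss the one or two rows that matter. A small sketch that picks out the high-%iowait intervals from "sar -u" text like the output above. It assumes each data row is on a single line, a 12-hour timestamp (time plus AM/PM) and the column order shown in the header; the function name and the 10% threshold are made up for illustration.

```python
def iowait_spikes(sar_text, threshold=10.0):
    """Return (timestamp, %iowait) pairs whose %iowait exceeds threshold."""
    spikes = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Data rows: HH:MM:SS AM|PM all %user %nice %system %iowait %steal %idle
        # The header row has "CPU" in the third field and the Average row has
        # only eight fields, so both are skipped by this check.
        if len(fields) != 9 or fields[2] != "all":
            continue
        try:
            iowait = float(fields[6])
        except ValueError:
            continue  # not a numeric row
        if iowait > threshold:
            spikes.append((" ".join(fields[:2]), iowait))
    return spikes
```

Fed the table above, this should flag only the 12:25:01 and 12:26:01 samples, which are the two intervals worth correlating with other logs.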
Ganglia shows the number of running processes spiking sharply to a max of
30+.
I had to power-cycle the boxes to recover.
Thanks,
Jason
solarflow99 wrote:
How are the I/O wait states at the time? Can you try a netdump to a
remote syslog?
On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
<ja...@rampaginggeek.com> wrote:
Hi everyone,
I administer Linux servers for a university. I have had two of our
servers become unresponsive three times (twice on one server) in the
past week. These servers are general purpose timesharing machines and
were under a steady load of around 8. We have students running compute
jobs for last-minute homework assignments. I know that some students are
working on an intro to threading class. The most telling data is that
ganglia shows a load spike of 50 before one of the outages.
The servers are Dell PowerEdge 860 with 8GB of RAM and a single
quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.
I have the following limits in place:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16367
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 200
virtual memory (kbytes, -v) 2057564
file locks (-x) unlimited
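For checking what a given login session or job actually gets, the same limits can be read from inside a process. A small sketch using Python's standard resource module; the labels and helper names are made up for illustration. Two things to keep in mind: RLIMIT_AS is reported in bytes while "ulimit -v" prints kbytes, and on Linux RLIMIT_NPROC counts threads as well as processes for the user, which matters for a threading class.

```python
import resource

# Limits of interest on a shared machine; labels map to `ulimit` flags.
LIMITS = {
    "nproc (-u)": resource.RLIMIT_NPROC,
    "address space (-v)": resource.RLIMIT_AS,
    "open files (-n)": resource.RLIMIT_NOFILE,
}


def fmt(value):
    """Render a limit value the way `ulimit` does for the unlimited case."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the current process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


def report():
    """Return one human-readable line per limit."""
    return [
        f"{label}: soft={fmt(soft)} hard={fmt(hard)}"
        for label, (soft, hard) in current_limits().items()
    ]
```

Printing "\n".join(report()) from a student account shows whether the limits listed above are really the ones applied to that session.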
I'm recording sar data once per minute. The only notable thing is a peak
of context switches before the outage, and the interrupts all go to
core 0.
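The interrupt distribution can be confirmed directly from /proc/interrupts. A rough sketch that totals the per-CPU counters; it assumes the usual layout (a header line of CPU names, then one row per interrupt source), and the function name is made up for illustration.

```python
def interrupts_per_cpu(text):
    """Sum interrupt counts per CPU from /proc/interrupts content."""
    lines = text.splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        counts = fields[1:1 + len(cpus)]
        # Rows with fewer counters than CPUs (e.g. ERR:) are skipped.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals
```

Called as interrupts_per_cpu(open("/proc/interrupts").read()), it returns a dict keyed by CPU name. The counters are cumulative since boot, so taking two readings a minute apart and subtracting gives the current rate per core.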
How can I prevent the servers from becoming unresponsive even under
heavy load?
What can I do to troubleshoot further?
Thanks,
Jason
_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list