I would like to thank everyone for their help. The bug reappeared on a
server that was at a text console, so I got to see the "out of swap"
errors that pointed me at the culprit. The overcommit_memory flag was
set to allow memory overcommits. This doesn't work well when large
memory applications actually try to use the memory that they ask for.
I've disabled overcommitting of memory and I'll see how well that works.
Two users were running memory-hungry jobs and the kernel was stuck
trying to run the OOM killer.
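For reference, the change was switching the kernel's overcommit policy to strict accounting. A sketch of the standard sysctl knobs (illustrative ratio, not necessarily the exact commands I ran):

```shell
# Mode 2 = strict accounting: allocations beyond swap plus
# overcommit_ratio% of RAM fail with ENOMEM up front, instead of
# succeeding and getting OOM-killed later when the pages are touched.
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=50   # kernel default; tune for your RAM/swap mix

# Persist across reboots
cat >> /etc/sysctl.conf <<'EOF'
vm.overcommit_memory = 2
vm.overcommit_ratio = 50
EOF
```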
Thanks,
Jason
Barry Brimer wrote:
I don't know if this will help, but hangwatch
<http://people.redhat.com/astokes/hangwatch/> will run sysrq commands
when load reaches a certain point so you can find out what was going
on when you get a load spike. I would probably set it up with a
remote syslog of some kind too, although if your system is not
crashing, you probably wouldn't need remote syslogging.
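For anyone unfamiliar with it, hangwatch works by writing magic-sysrq letters to /proc/sysrq-trigger; you can trigger the same dumps by hand (a sketch, assuming root on a 2.6 kernel):

```shell
echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
echo t > /proc/sysrq-trigger      # dump task list + kernel stacks to dmesg
echo m > /proc/sysrq-trigger      # dump memory usage info to dmesg
```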
On Fri, 1 May 2009, solarflow99 wrote:
Usually remote syslog can catch any last error messages to the console
that might give some clues. I tried this before but it didn't help me;
I had 2 servers that would hang, one of them more frequently than the
other, and I never did find out what was wrong. Are there any RAID
drivers or anything different with it?
On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:
Are you talking about running iostat and sending it to a remote syslog
periodically?
sar shows 31% of the CPU was used for I/O for one minute, 5 minutes
before it stopped recording, but the last I/O record shows 1.82%.
Here is the "sar -u" output for the time before the crash:
Linux 2.6.18-92.1.17.el5 (xxxxxxx)    04/29/2009
12:00:01 AM     CPU   %user   %nice   %system   %iowait   %steal   %idle
12:01:01 AM     all   26.00    0.00      1.34      0.13     0.00   72.53
12:02:01 AM     all   25.93    0.00      1.19      0.00     0.00   72.88
12:03:01 AM     all   24.78    0.00      1.01      0.00     0.00   74.20
12:04:01 AM     all   24.67    0.00      0.95      0.00     0.00   74.38
12:05:02 AM     all   25.30    0.00      0.93      0.03     0.00   73.74
12:06:01 AM     all   25.51    0.00      1.06      0.04     0.00   73.39
12:07:01 AM     all   25.45    0.00      1.32      0.00     0.00   73.23
12:08:01 AM     all   26.11    0.00      1.04      0.03     0.00   72.82
12:09:01 AM     all   25.35    0.00      0.98      0.00     0.00   73.66
12:10:01 AM     all   26.89    0.00      2.63      1.09     0.00   69.39
12:11:01 AM     all   26.86    0.00      1.66      0.47     0.00   71.01
12:12:01 AM     all   26.16    0.00      1.42      0.04     0.00   72.38
12:13:01 AM     all   25.88    0.00      1.33      0.00     0.00   72.79
12:14:01 AM     all   26.52    0.00      1.97      0.40     0.00   71.12
12:15:01 AM     all   27.35    0.00      2.18      0.25     0.00   70.22
12:16:01 AM     all   25.17    0.00      1.17      0.05     0.00   73.61
12:17:01 AM     all   26.24    0.00      1.75      0.03     0.00   71.98
12:18:01 AM     all   25.37    0.00      1.43      0.13     0.00   73.07
12:19:01 AM     all   26.60    0.00      1.65      0.02     0.00   71.73
12:20:01 AM     all   26.66    0.00      1.87      0.59     0.00   70.89
12:21:01 AM     all   25.16    0.00      1.25      1.21     0.00   72.38
12:22:01 AM     all   28.26    0.00      1.26      0.42     0.00   70.07
12:23:01 AM     all   26.54    0.00      1.46      1.02     0.00   70.99
12:24:01 AM     all   25.56    0.00      1.64      0.30     0.00   72.50
12:25:01 AM     all   24.87    0.00      9.23     31.85     0.00   34.04
12:26:01 AM     all   28.32    0.00      2.84     15.70     0.00   53.14
12:27:01 AM     all   24.97    0.00      1.17      0.07     0.00   73.80
12:28:02 AM     all   26.20    0.00      1.27      0.23     0.00   72.30
12:29:01 AM     all   27.37    0.00      2.50      0.18     0.00   69.95
12:30:01 AM     all   31.04    0.00      2.65      0.15     0.00   66.16
Average:        all   26.24    0.00      1.80      1.82     0.00   70.14
08:20:06 AM LINUX RESTART
Ganglia shows the number of running processes spiked sharply to a max
of 30+. I had to power-cycle the boxes to recover.
Thanks,
Jason
solarflow99 wrote:
How are the I/O wait states at the time? Can you try a netdump to a
remote syslog?
On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
<ja...@rampaginggeek.com> wrote:
Hi everyone,
I administer Linux servers for a university. Two of our servers have
become unresponsive three times (twice on one server) in the past
week. These servers are general-purpose timesharing machines and were
under a steady load of around 8. We have students running compute jobs
for last-minute homework assignments; I know that some students are
working on an intro-to-threading class. The most telling data is that
ganglia shows a load spike of 50 before one of the outages.
The servers are Dell PowerEdge 860s with 8GB of RAM and a single
quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.
I have the following limits in place:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16367
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 200
virtual memory (kbytes, -v) 2057564
file locks (-x) unlimited
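For reference, per-user limits like these usually come from /etc/security/limits.conf via pam_limits. A sketch with the values above (illustrative, not our actual file):

```shell
# /etc/security/limits.conf -- illustrative entries only
# <domain>  <type>  <item>    <value>
*           hard    nproc     200        # max user processes (ulimit -u)
*           hard    as        2057564    # virtual memory in KB (ulimit -v)
*           soft    core      0          # no core dumps (ulimit -c)
*           hard    nofile    1024       # open files (ulimit -n)
```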
I'm recording sar data once per minute. The only notable thing is a
peak of context switches before the outage, and the interrupts all go
to core 0.
How can I prevent the servers from becoming unresponsive even under
heavy load?
What can I do to troubleshoot further?
Thanks,
Jason
_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list