I would like to thank everyone for their help. The bug reappeared on a
server that was at a text console, so I got to see the "out of swap"
errors that pointed me at the culprit. The overcommit_memory flag was
set to allow memory overcommits. This doesn't work well when large
memory applications actually try to use the memory that they ask for.
I've disabled overcommitting of memory and I'll see how well that works.
Two users were running memory-hungry jobs and the kernel was stuck
trying to run the OOM killer.
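For reference, the change was switching the kernel's overcommit policy to strict accounting. A sketch of the standard sysctl knobs (illustrative ratio, not necessarily the exact commands I ran):

```shell
# Mode 2 = strict accounting: allocations beyond swap plus
# overcommit_ratio% of RAM fail with ENOMEM up front, instead of
# succeeding and getting OOM-killed later when the pages are touched.
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=50   # kernel default; tune for your RAM/swap mix

# Persist across reboots
cat >> /etc/sysctl.conf <<'EOF'
vm.overcommit_memory = 2
vm.overcommit_ratio = 50
EOF
```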
Thanks,
Jason
Barry Brimer wrote:
I don't know if this will help, but hangwatch
<http://people.redhat.com/astokes/hangwatch/> will run sysrq commands
when load reaches a certain point so you can find out what was going
on when you get a load spike. I would probably set it up with a
remote syslog of some kind too, although if your system is not
crashing, you probably wouldn't need remote syslogging.
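For anyone unfamiliar with it, hangwatch works by writing magic-sysrq letters to /proc/sysrq-trigger; you can trigger the same dumps by hand (a sketch, assuming root on a 2.6 kernel):

```shell
echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
echo t > /proc/sysrq-trigger      # dump task list + kernel stacks to dmesg
echo m > /proc/sysrq-trigger      # dump memory usage info to dmesg
```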
On Fri, 1 May 2009, solarflow99 wrote:
Usually remote syslog can catch any last error messages to the console
that might give some clues. I tried this before but it didn't help me;
I had 2 servers that would hang, one of them more frequently than the
other, and I never did find out what was wrong. Are there any RAID
drivers or anything different with it?
On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:
Are you talking about running iostat and sending it to a remote syslog
periodically?
sar shows 31% of the CPU was used for I/O for one minute, 5 minutes
before it stopped recording, but the last I/O record shows 1.82%.
Here is the "sar -u" output for the time before the crash:
Linux 2.6.18-92.1.17.el5 (xxxxxxx)    04/29/2009
12:00:01 AM     CPU   %user   %nice   %system   %iowait   %steal   %idle
12:01:01 AM     all   26.00    0.00      1.34      0.13     0.00   72.53
12:02:01 AM     all   25.93    0.00      1.19      0.00     0.00   72.88
12:03:01 AM     all   24.78    0.00      1.01      0.00     0.00   74.20
12:04:01 AM     all   24.67    0.00      0.95      0.00     0.00   74.38
12:05:02 AM     all   25.30    0.00      0.93      0.03     0.00   73.74
12:06:01 AM     all   25.51    0.00      1.06      0.04     0.00   73.39
12:07:01 AM     all   25.45    0.00      1.32      0.00     0.00   73.23
12:08:01 AM     all   26.11    0.00      1.04      0.03     0.00   72.82
12:09:01 AM     all   25.35    0.00      0.98      0.00     0.00   73.66
12:10:01 AM     all   26.89    0.00      2.63      1.09     0.00   69.39
12:11:01 AM     all   26.86    0.00      1.66      0.47     0.00   71.01
12:12:01 AM     all   26.16    0.00      1.42      0.04     0.00   72.38
12:13:01 AM     all   25.88    0.00      1.33      0.00     0.00   72.79
12:14:01 AM     all   26.52    0.00      1.97      0.40     0.00   71.12
12:15:01 AM     all   27.35    0.00      2.18      0.25     0.00   70.22
12:16:01 AM     all   25.17    0.00      1.17      0.05     0.00   73.61
12:17:01 AM     all   26.24    0.00      1.75      0.03     0.00   71.98
12:18:01 AM     all   25.37    0.00      1.43      0.13     0.00   73.07
12:19:01 AM     all   26.60    0.00      1.65      0.02     0.00   71.73
12:20:01 AM     all   26.66    0.00      1.87      0.59     0.00   70.89
12:21:01 AM     all   25.16    0.00      1.25      1.21     0.00   72.38
12:22:01 AM     all   28.26    0.00      1.26      0.42     0.00   70.07
12:23:01 AM     all   26.54    0.00      1.46      1.02     0.00   70.99
12:24:01 AM     all   25.56    0.00      1.64      0.30     0.00   72.50
12:25:01 AM     all   24.87    0.00      9.23     31.85     0.00   34.04
12:26:01 AM     all   28.32    0.00      2.84     15.70     0.00   53.14
12:27:01 AM     all   24.97    0.00      1.17      0.07     0.00   73.80
12:28:02 AM     all   26.20    0.00      1.27      0.23     0.00   72.30
12:29:01 AM     all   27.37    0.00      2.50      0.18     0.00   69.95
12:30:01 AM     all   31.04    0.00      2.65      0.15     0.00   66.16
Average:        all   26.24    0.00      1.80      1.82     0.00   70.14
08:20:06 AM LINUX RESTART
Ganglia shows the number of running processes spiked sharply to a max
of 30+. I had to power-cycle the boxes to recover.
Thanks,
Jason
solarflow99 wrote:
How are the I/O wait states at the time? Can you try a netdump to a
remote syslog?
On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
<ja...@rampaginggeek.com> wrote:
Hi everyone,
I administer Linux servers for a university. Two of our servers have
become unresponsive three times (twice on one server) in the past
week. These servers are general-purpose timesharing machines and were
under a steady load of around 8. We have students running compute jobs
for last-minute homework assignments; I know that some students are
working on an intro-to-threading class. The most telling data is that
ganglia shows a load spike of 50 before one of the outages.
The servers are Dell PowerEdge 860s with 8GB of RAM and a single
quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.
I have the following limits in place:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16367
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 200
virtual memory (kbytes, -v) 2057564
file locks (-x) unlimited
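For reference, per-user limits like these usually come from /etc/security/limits.conf via pam_limits. A sketch with the values above (illustrative, not our actual file):

```shell
# /etc/security/limits.conf -- illustrative entries only
# <domain>  <type>  <item>    <value>
*           hard    nproc     200        # max user processes (ulimit -u)
*           hard    as        2057564    # virtual memory in KB (ulimit -v)
*           soft    core      0          # no core dumps (ulimit -c)
*           hard    nofile    1024       # open files (ulimit -n)
```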
I'm recording sar data once per minute. The only notable thing is a
peak of context switches before the outage, and the interrupts all go
to core 0.
How can I prevent the servers from becoming unresponsive even under
heavy load?
What can I do to troubleshoot further?
Thanks,
Jason
_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list