I would like to thank everyone for their help. The bug reappeared on a server that was at a text console, so I got to see the "out of swap" errors that pointed me at the culprit. The overcommit_memory flag was set to allow memory overcommits, which doesn't work well when large-memory applications actually try to use the memory they ask for. Two users were running memory-hungry jobs and the kernel was stuck trying to run the OOM killer. I've disabled overcommitting of memory and will see how well that works.
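
For anyone who hits the same thing, strict overcommit accounting can be switched on roughly like this (the ratio of 80 below is only an example; pick what fits your RAM/swap mix):

   # takes effect immediately
   echo 2 > /proc/sys/vm/overcommit_memory

   # to persist across reboots, add to /etc/sysctl.conf, then run "sysctl -p"
   vm.overcommit_memory = 2
   vm.overcommit_ratio = 80    # example value; commit limit = swap + this % of RAM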

Thanks,
Jason

Barry Brimer wrote:
I don't know if this will help, but hangwatch <http://people.redhat.com/astokes/hangwatch/> will run sysrq commands when the load reaches a certain point, so you can find out what was going on when you get a load spike. I would probably set it up with remote syslog of some kind too, although if your system is not actually crashing, you probably don't need remote syslogging.
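
For reference, the sysrq side of that is stock kernel functionality; the manual equivalent of what hangwatch automates looks roughly like:

   # make sure sysrq is enabled (or set kernel.sysrq = 1 in /etc/sysctl.conf)
   echo 1 > /proc/sys/kernel/sysrq

   # dump all task states to the kernel log/console (same as Alt-SysRq-t)
   echo t > /proc/sysrq-trigger

   # dump memory usage (same as Alt-SysRq-m)
   echo m > /proc/sysrq-trigger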

On Fri, 1 May 2009, solarflow99 wrote:

Usually remote syslog can catch any last error messages to the console that
might give some clues. I tried this before but it didn't help me; I had 2
servers that would hang, one of them more frequently than the other, and I
never did find out what was wrong.  Are there any RAID drivers or anything
different with it?



On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:

Are you talking about running iostat and sending it to a remote syslog
periodically?

sar shows 31% of the CPU was used for I/O for one minute, 5 minutes before
it stopped recording, but the last I/O record shows 1.82%.

Here is the "sar -u" output for the time before the crash:
Linux 2.6.18-92.1.17.el5 (xxxxxxx)      04/29/2009

12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:01:01 AM       all     26.00      0.00      1.34      0.13      0.00     72.53
12:02:01 AM       all     25.93      0.00      1.19      0.00      0.00     72.88
12:03:01 AM       all     24.78      0.00      1.01      0.00      0.00     74.20
12:04:01 AM       all     24.67      0.00      0.95      0.00      0.00     74.38
12:05:02 AM       all     25.30      0.00      0.93      0.03      0.00     73.74
12:06:01 AM       all     25.51      0.00      1.06      0.04      0.00     73.39
12:07:01 AM       all     25.45      0.00      1.32      0.00      0.00     73.23
12:08:01 AM       all     26.11      0.00      1.04      0.03      0.00     72.82
12:09:01 AM       all     25.35      0.00      0.98      0.00      0.00     73.66
12:10:01 AM       all     26.89      0.00      2.63      1.09      0.00     69.39
12:11:01 AM       all     26.86      0.00      1.66      0.47      0.00     71.01
12:12:01 AM       all     26.16      0.00      1.42      0.04      0.00     72.38
12:13:01 AM       all     25.88      0.00      1.33      0.00      0.00     72.79
12:14:01 AM       all     26.52      0.00      1.97      0.40      0.00     71.12
12:15:01 AM       all     27.35      0.00      2.18      0.25      0.00     70.22
12:16:01 AM       all     25.17      0.00      1.17      0.05      0.00     73.61
12:17:01 AM       all     26.24      0.00      1.75      0.03      0.00     71.98
12:18:01 AM       all     25.37      0.00      1.43      0.13      0.00     73.07
12:19:01 AM       all     26.60      0.00      1.65      0.02      0.00     71.73
12:20:01 AM       all     26.66      0.00      1.87      0.59      0.00     70.89
12:21:01 AM       all     25.16      0.00      1.25      1.21      0.00     72.38
12:22:01 AM       all     28.26      0.00      1.26      0.42      0.00     70.07
12:23:01 AM       all     26.54      0.00      1.46      1.02      0.00     70.99
12:24:01 AM       all     25.56      0.00      1.64      0.30      0.00     72.50
12:25:01 AM       all     24.87      0.00      9.23     31.85      0.00     34.04
12:26:01 AM       all     28.32      0.00      2.84     15.70      0.00     53.14
12:27:01 AM       all     24.97      0.00      1.17      0.07      0.00     73.80
12:28:02 AM       all     26.20      0.00      1.27      0.23      0.00     72.30
12:29:01 AM       all     27.37      0.00      2.50      0.18      0.00     69.95
12:30:01 AM       all     31.04      0.00      2.65      0.15      0.00     66.16
Average:          all     26.24      0.00      1.80      1.82      0.00     70.14
08:20:06 AM       LINUX RESTART
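
For anyone wanting to pull the same view after the fact, the historical data lives in the sysstat daily files; for the 29th that would be something like:

   sar -u -f /var/log/sa/sa29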

Ganglia shows the number of running processes spiking sharply to a peak of
30+.

I had to power-cycle the boxes to recover.

Thanks,
Jason

solarflow99 wrote:
How are the I/O wait states at the time?  Can you try a netdump to a
remote syslog?
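
On RHEL 5 the remote syslog part is only a couple of lines; roughly, assuming a log host named "loghost" (placeholder name):

   # on the flaky box, add to /etc/syslog.conf, then "service syslog restart"
   kern.*          @loghost

   # on loghost, enable remote reception by adding -r to SYSLOGD_OPTIONS
   # in /etc/sysconfig/syslog and restarting syslog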


On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
<ja...@rampaginggeek.com <mailto:ja...@rampaginggeek.com>> wrote:

    Hi everyone,

    I administer Linux servers for a university. In the past week, two of
    our servers have become unresponsive three times (twice on one server).
    These servers are general-purpose timesharing machines and were under
    a steady load of around 8. We have students running compute jobs for
    last-minute homework assignments, and I know that some students are
    working on an intro-to-threading class. The most telling data point is
    that ganglia shows a load spike of 50 before one of the outages.

    The servers are Dell PowerEdge 860s with 8GB of RAM and a single
    quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.

    I have the following limits in place:
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 16367
    max locked memory       (kbytes, -l) 32
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 10240
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 200
    virtual memory          (kbytes, -v) 2057564
    file locks                      (-x) unlimited
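
    For reference, those per-user caps normally come from
    /etc/security/limits.conf; the 200-process cap, for example, would be
    a line roughly like:

        *       hard    nproc   200

    and the ~2GB virtual memory cap an "as" entry with a similar value in kB.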

    I'm recording sar data once per minute. The only notable thing is a
    peak of context switches before the outage, and the interrupts all go
    to core 0.
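
    The one-minute interval is just the sysstat cron job turned up; on
    RHEL 5 that comes down to a line in /etc/cron.d/sysstat along the
    lines of:

        */1 * * * * root /usr/lib64/sa/sa1 1 1

    (/usr/lib/sa/sa1 on 32-bit installs.)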

    How can I prevent the servers from becoming unresponsive even under
    heavy load?

    What can I do to troubleshoot further?

    Thanks,
    Jason




_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list