I don't know if this will help, but hangwatch <http://people.redhat.com/astokes/hangwatch/> runs sysrq commands when the load reaches a set threshold, so you can find out what was going on when you get a load spike. I would probably set it up with a remote syslog of some kind too, although if your system is not crashing, you probably don't need remote syslogging.
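
To make the idea concrete, here is a rough sketch of the "fire sysrq when load crosses a threshold" approach. This is not hangwatch's actual code; the threshold, the choice of sysrq keys, and all function names are my own assumptions. It relies only on /proc/loadavg and /proc/sysrq-trigger, and writing to the trigger file needs root.

```python
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux path; writing needs root
LOAD_THRESHOLD = 20.0                   # hypothetical threshold, tune per host
SYSRQ_KEYS = "mt"                       # m: memory info, t: task list


def read_load1(path="/proc/loadavg"):
    """Return the 1-minute load average from a /proc/loadavg style file."""
    with open(path) as f:
        return float(f.read().split()[0])


def should_trigger(load, threshold=LOAD_THRESHOLD):
    """True when the load is at or above the threshold."""
    return load >= threshold


def fire_sysrq(keys=SYSRQ_KEYS, trigger=SYSRQ_TRIGGER):
    """Write each sysrq key to the trigger file, one key per write.

    The resulting dumps land in the kernel log, not in this process.
    """
    for key in keys:
        with open(trigger, "w") as f:
            f.write(key)


def watch(interval=10):
    """Poll the load every `interval` seconds and fire sysrq on a spike."""
    while True:
        if should_trigger(read_load1()):
            fire_sysrq()
        time.sleep(interval)
```

Calling watch() as root starts the polling loop. A real tool would also want to rate-limit the dumps so a long spike does not flood the log.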

On Fri, 1 May 2009, solarflow99 wrote:

Usually remote syslog can catch any last error messages sent to the console that
might give some clues.  I tried this before but it didn't help me. I had 2
servers that would hang, one of them more frequently than the other, and I never
did find out what was wrong.  Are there any RAID drivers or anything
different with it?



On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:

Are you talking about running iostat and sending it to a remote syslog
periodically?
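
For what it's worth, a minimal sketch of that idea could look like the following. It assumes a syslog server reachable over UDP on port 514 and that iostat (from sysstat) is installed; the logger name, message prefix, and function names are made up for illustration.

```python
import logging
import logging.handlers
import subprocess


def make_remote_logger(host, port=514):
    """Build a logger that sends records to a remote syslog host over UDP."""
    logger = logging.getLogger("iostat-remote")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("iostat: %(message)s"))
    logger.addHandler(handler)
    return logger


def nonblank_lines(text):
    """Split command output into lines, dropping empty ones."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def log_iostat(logger, command=("iostat", "-x", "1", "2")):
    """Run iostat once and forward each output line to the logger."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    for line in nonblank_lines(result.stdout):
        logger.info(line)
```

Something like log_iostat(make_remote_logger("loghost.example.com")) could then be run from cron every minute; the host name here is a placeholder.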

sar shows about 31% of the CPU time was spent waiting on I/O for one minute,
5 minutes before it stopped recording, but the Average line shows only 1.82%

Here is the "sar -u" output for the time before the crash:
Linux 2.6.18-92.1.17.el5 (xxxxxxx)      04/29/2009

12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:01:01 AM       all     26.00      0.00      1.34      0.13      0.00     72.53
12:02:01 AM       all     25.93      0.00      1.19      0.00      0.00     72.88
12:03:01 AM       all     24.78      0.00      1.01      0.00      0.00     74.20
12:04:01 AM       all     24.67      0.00      0.95      0.00      0.00     74.38
12:05:02 AM       all     25.30      0.00      0.93      0.03      0.00     73.74
12:06:01 AM       all     25.51      0.00      1.06      0.04      0.00     73.39
12:07:01 AM       all     25.45      0.00      1.32      0.00      0.00     73.23
12:08:01 AM       all     26.11      0.00      1.04      0.03      0.00     72.82
12:09:01 AM       all     25.35      0.00      0.98      0.00      0.00     73.66
12:10:01 AM       all     26.89      0.00      2.63      1.09      0.00     69.39
12:11:01 AM       all     26.86      0.00      1.66      0.47      0.00     71.01
12:12:01 AM       all     26.16      0.00      1.42      0.04      0.00     72.38
12:13:01 AM       all     25.88      0.00      1.33      0.00      0.00     72.79
12:14:01 AM       all     26.52      0.00      1.97      0.40      0.00     71.12
12:15:01 AM       all     27.35      0.00      2.18      0.25      0.00     70.22
12:16:01 AM       all     25.17      0.00      1.17      0.05      0.00     73.61
12:17:01 AM       all     26.24      0.00      1.75      0.03      0.00     71.98
12:18:01 AM       all     25.37      0.00      1.43      0.13      0.00     73.07
12:19:01 AM       all     26.60      0.00      1.65      0.02      0.00     71.73
12:20:01 AM       all     26.66      0.00      1.87      0.59      0.00     70.89
12:21:01 AM       all     25.16      0.00      1.25      1.21      0.00     72.38
12:22:01 AM       all     28.26      0.00      1.26      0.42      0.00     70.07
12:23:01 AM       all     26.54      0.00      1.46      1.02      0.00     70.99
12:24:01 AM       all     25.56      0.00      1.64      0.30      0.00     72.50
12:25:01 AM       all     24.87      0.00      9.23     31.85      0.00     34.04
12:26:01 AM       all     28.32      0.00      2.84     15.70      0.00     53.14
12:27:01 AM       all     24.97      0.00      1.17      0.07      0.00     73.80
12:28:02 AM       all     26.20      0.00      1.27      0.23      0.00     72.30
12:29:01 AM       all     27.37      0.00      2.50      0.18      0.00     69.95
12:30:01 AM       all     31.04      0.00      2.65      0.15      0.00     66.16
Average:          all     26.24      0.00      1.80      1.82      0.00     70.14
08:20:06 AM       LINUX RESTART
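
Picking the spike out of a long sar listing by eye is tedious, so here is a small sketch that flags the minutes where %iowait crosses a threshold. It assumes each sample sits on a single line in the 12-hour "HH:MM:SS AM all ..." layout that "sar -u" prints here; the function names and the 10% threshold are arbitrary choices of mine.

```python
def parse_sar_cpu(text):
    """Parse 'sar -u' rows into (timestamp, {column: value}) pairs.

    Only per-minute 'all' rows are kept; the header, the Average line and
    the LINUX RESTART marker are skipped.
    """
    rows = []
    header = None
    for line in text.splitlines():
        parts = line.split()
        if "CPU" in parts and "%iowait" in parts:
            # Column names follow the CPU label, e.g. %user %nice ...
            header = parts[parts.index("CPU") + 1:]
            continue
        if header is None or len(parts) != len(header) + 3 or parts[2] != "all":
            continue
        timestamp = parts[0] + " " + parts[1]
        values = dict(zip(header, (float(p) for p in parts[3:])))
        rows.append((timestamp, values))
    return rows


def iowait_spikes(text, threshold=10.0):
    """Return (timestamp, %iowait) for every sample at or above threshold."""
    return [(t, v["%iowait"]) for t, v in parse_sar_cpu(text)
            if v["%iowait"] >= threshold]


sample = """\
12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:24:01 AM       all     25.56      0.00      1.64      0.30      0.00     72.50
12:25:01 AM       all     24.87      0.00      9.23     31.85      0.00     34.04
12:26:01 AM       all     28.32      0.00      2.84     15.70      0.00     53.14
12:27:01 AM       all     24.97      0.00      1.17      0.07      0.00     73.80
Average:          all     26.24      0.00      1.80      1.82      0.00     70.14
"""
print(iowait_spikes(sample))
```

On the four sample rows this reports only the 12:25 and 12:26 samples, which matches the two-minute burst visible in the listing.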

Ganglia shows the number of running processes spiking sharply to a maximum
of 30+.

I had to power-cycle the boxes to recover.

Thanks,
Jason

solarflow99 wrote:
How are the I/O wait states at the time?  Can you try a netdump to a
remote syslog?


On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
<ja...@rampaginggeek.com> wrote:

    Hi everyone,

    I administer Linux servers for a university. Two of our servers have
    become unresponsive three times (twice on one server) in the past
    week. These servers are general-purpose timesharing machines and were
    under a steady load of around 8. We have students running compute
    jobs for last-minute homework assignments. I know that some students
    are working on an intro to threading class. The most telling data is
    that Ganglia shows a load spike of 50 before one of the outages.

    The servers are Dell PowerEdge 860 with 8GB of RAM and a single
    quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.

    I have the following limits in place:
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 16367
    max locked memory       (kbytes, -l) 32
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 10240
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 200
    virtual memory          (kbytes, -v) 2057564
    file locks                      (-x) unlimited

    I'm recording sar data once per minute. The only notable thing is a
    peak of context switches before the outage, and the interrupts all go
    to core 0.
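
One way to check the "all interrupts go to core 0" observation is to total the per-CPU columns of /proc/interrupts. The sketch below is only an illustration: the function name is made up, the sample text is invented rather than taken from these servers, and it assumes the usual layout of a CPU header line followed by one row per interrupt source.

```python
def interrupts_per_cpu(text):
    """Sum the per-CPU counters from /proc/interrupts style text."""
    lines = text.splitlines()
    cpus = lines[0].split()                  # e.g. ['CPU0', 'CPU1']
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        # Skip the leading "IRQ:" label and keep one field per CPU.
        fields = line.split()[1:1 + len(cpus)]
        if len(fields) < len(cpus) or not all(f.isdigit() for f in fields):
            continue                         # rows such as "ERR: 0"
        for cpu, count in zip(cpus, fields):
            totals[cpu] += int(count)
    return totals


# Invented sample, not data from the servers in this thread.
sample = """\
           CPU0       CPU1
  0:       1000          0    IO-APIC-edge  timer
 50:       2500          3         PCI-MSI  eth0
ERR:          0
"""
print(interrupts_per_cpu(sample))
```

On a live box the input would come from open("/proc/interrupts").read(). A heavily lopsided total is what you would expect to see if nothing is balancing interrupts across the cores.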

    How can I prevent the servers from becoming unresponsive even under
    heavy load?

    What can I do to troubleshoot further?

    Thanks,
    Jason

    _______________________________________________
    rhelv5-list mailing list
    rhelv5-list@redhat.com
    https://www.redhat.com/mailman/listinfo/rhelv5-list

