Usually remote syslog can catch any last error messages written to the
console that might give some clues.  I tried this before but it didn't help
me. I had 2 servers that would hang, one of them more frequently than the
other, and I never did find out what was wrong.  Are there any RAID drivers
or anything different with it?
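
In case it helps with sifting through the sar data, here is a minimal Python sketch (not something anyone in this thread ran) that picks the %iowait spikes out of "sar -u" output. The function name iowait_spikes and the 10% threshold are hypothetical, and it assumes the 12-hour, nine-column layout shown in the quoted output below, with each sample on a single line. The sample rows are copied from that output.

```python
from typing import List, Tuple

# A few rows copied from the quoted "sar -u" output, one sample per line.
# Columns: time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle
SAMPLE = """\
12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:24:01 AM       all     25.56      0.00      1.64      0.30      0.00     72.50
12:25:01 AM       all     24.87      0.00      9.23     31.85      0.00     34.04
12:26:01 AM       all     28.32      0.00      2.84     15.70      0.00     53.14
12:27:01 AM       all     24.97      0.00      1.17      0.07      0.00     73.80
Average:          all     26.24      0.00      1.80      1.82      0.00     70.14
08:20:06 AM       LINUX RESTART
"""


def iowait_spikes(text: str, threshold: float = 10.0) -> List[Tuple[str, float]]:
    """Return (timestamp, %iowait) for every sample at or above threshold.

    The threshold default is an arbitrary illustration, not a recommendation.
    """
    spikes = []
    for line in text.splitlines():
        fields = line.split()
        # Sample rows have exactly 9 fields with AM/PM in the second one.
        # This skips the "Average:" row and "LINUX RESTART" lines.
        if len(fields) != 9 or fields[1] not in ("AM", "PM"):
            continue
        try:
            iowait = float(fields[6])
        except ValueError:
            continue  # column header row: "%iowait" is not a number
        if iowait >= threshold:
            spikes.append((f"{fields[0]} {fields[1]}", iowait))
    return spikes


if __name__ == "__main__":
    for when, pct in iowait_spikes(SAMPLE):
        print(f"{when}  iowait={pct:.2f}%")
```

On the sample rows this flags only the 12:25:01 AM and 12:26:01 AM samples, which are the two minutes of high I/O wait mentioned in the quoted message.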



On 4/30/09, Jason Edgecombe <ja...@rampaginggeek.com> wrote:
>
> Are you talking about running iostat and sending it to a remote syslog
> periodically?
>
> sar shows 31% of the CPU was used for I/O for one minute, 5 minutes before
> it stopped recording, but the last I/O record shows 1.82%
>
> Here is the "sar -u" output for the time before the crash:
> Linux 2.6.18-92.1.17.el5 (xxxxxxx)      04/29/2009
>
> 12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
> 12:01:01 AM       all     26.00      0.00      1.34      0.13      0.00     72.53
> 12:02:01 AM       all     25.93      0.00      1.19      0.00      0.00     72.88
> 12:03:01 AM       all     24.78      0.00      1.01      0.00      0.00     74.20
> 12:04:01 AM       all     24.67      0.00      0.95      0.00      0.00     74.38
> 12:05:02 AM       all     25.30      0.00      0.93      0.03      0.00     73.74
> 12:06:01 AM       all     25.51      0.00      1.06      0.04      0.00     73.39
> 12:07:01 AM       all     25.45      0.00      1.32      0.00      0.00     73.23
> 12:08:01 AM       all     26.11      0.00      1.04      0.03      0.00     72.82
> 12:09:01 AM       all     25.35      0.00      0.98      0.00      0.00     73.66
> 12:10:01 AM       all     26.89      0.00      2.63      1.09      0.00     69.39
> 12:11:01 AM       all     26.86      0.00      1.66      0.47      0.00     71.01
> 12:12:01 AM       all     26.16      0.00      1.42      0.04      0.00     72.38
> 12:13:01 AM       all     25.88      0.00      1.33      0.00      0.00     72.79
> 12:14:01 AM       all     26.52      0.00      1.97      0.40      0.00     71.12
> 12:15:01 AM       all     27.35      0.00      2.18      0.25      0.00     70.22
> 12:16:01 AM       all     25.17      0.00      1.17      0.05      0.00     73.61
> 12:17:01 AM       all     26.24      0.00      1.75      0.03      0.00     71.98
> 12:18:01 AM       all     25.37      0.00      1.43      0.13      0.00     73.07
> 12:19:01 AM       all     26.60      0.00      1.65      0.02      0.00     71.73
> 12:20:01 AM       all     26.66      0.00      1.87      0.59      0.00     70.89
> 12:21:01 AM       all     25.16      0.00      1.25      1.21      0.00     72.38
> 12:22:01 AM       all     28.26      0.00      1.26      0.42      0.00     70.07
> 12:23:01 AM       all     26.54      0.00      1.46      1.02      0.00     70.99
> 12:24:01 AM       all     25.56      0.00      1.64      0.30      0.00     72.50
> 12:25:01 AM       all     24.87      0.00      9.23     31.85      0.00     34.04
> 12:26:01 AM       all     28.32      0.00      2.84     15.70      0.00     53.14
> 12:27:01 AM       all     24.97      0.00      1.17      0.07      0.00     73.80
> 12:28:02 AM       all     26.20      0.00      1.27      0.23      0.00     72.30
> 12:29:01 AM       all     27.37      0.00      2.50      0.18      0.00     69.95
> 12:30:01 AM       all     31.04      0.00      2.65      0.15      0.00     66.16
> Average:          all     26.24      0.00      1.80      1.82      0.00     70.14
> 08:20:06 AM       LINUX RESTART
>
> Ganglia shows the number of running processes spiking sharply to a max of
> 30+.
>
> I had to power-cycle the boxes to recover.
>
> Thanks,
> Jason
>
> solarflow99 wrote:
> > How are the I/O wait states at the time?  Can you try a netdump to a
> > remote syslog?
> >
> >
> > On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
> > <ja...@rampaginggeek.com> wrote:
> >
> >     Hi everyone,
> >
> >     I administer Linux servers for a university. Two of our servers
> >     have become unresponsive three times (twice on one server) in the
> >     past week. These servers are general-purpose timesharing machines and
> >     were under a steady load of around 8. We have students running
> >     compute jobs for last-minute homework assignments. I know that some
> >     students are working on an intro to threading class. The most
> >     telling data is that ganglia shows a load spike of 50 before one of
> >     the outages.
> >
> >     The servers are Dell PowerEdge 860 with 8GB of RAM and a single
> >     quad-core Xeon CPU. The OS is RHEL 5.2 64bit Desktop.
> >
> >     I have the following limits in place:
> >     core file size          (blocks, -c) 0
> >     data seg size           (kbytes, -d) unlimited
> >     scheduling priority             (-e) 0
> >     file size               (blocks, -f) unlimited
> >     pending signals                 (-i) 16367
> >     max locked memory       (kbytes, -l) 32
> >     max memory size         (kbytes, -m) unlimited
> >     open files                      (-n) 1024
> >     pipe size            (512 bytes, -p) 8
> >     POSIX message queues     (bytes, -q) 819200
> >     real-time priority              (-r) 0
> >     stack size              (kbytes, -s) 10240
> >     cpu time               (seconds, -t) unlimited
> >     max user processes              (-u) 200
> >     virtual memory          (kbytes, -v) 2057564
> >     file locks                      (-x) unlimited
> >
> >     I'm recording sar data once per minute. The only notable thing is a
> >     peak
> >     of context switches before the outage and the interrupts all go to
> >     core 0.
> >
> >     How can I prevent the servers from becoming unresponsive even under
> >     heavy
> >     load?
> >
> >     What can I do to troubleshoot further?
> >
> >     Thanks,
> >     Jason
> >
> >     _______________________________________________
> >     rhelv5-list mailing list
> >     rhelv5-list@redhat.com
> >     https://www.redhat.com/mailman/listinfo/rhelv5-list
> >
>