Some resource is getting blocked, and that is causing the system to
run completely out of it (hence the reboot).

The resources you can run out of are:
    1) memory
    2) disk space
    3) temporary disk space
    4) network
    5) process IDs

You can use the various "*stat" programs to look at resources.
vmstat tells you about memory and virtual memory
iostat tells you about i/o activities
df will tell you about disk space
netstat -a will tell you about network connections
ps will tell you about processes. 

Write a small script to monitor the bits of the system you need
and have it run every few minutes and log the results.
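
As a rough sketch of that idea (not the script I actually use, which
follows below), something like this Python 3 could collect one
timestamped snapshot per run. The function names, the log file name
resource.log, and the exact command flags are placeholders I made up
for illustration; adjust them for your system.

```python
import subprocess
import time

# Commands from the list above. The flags are assumptions; change as needed.
COMMANDS = [
    ["vmstat"],
    ["df", "-k"],
    ["netstat", "-a"],
    ["ps", "ax"],
]

def run_command(argv):
    """Run one command and return its output, or a note if it cannot run."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True,
                                errors="replace", timeout=60)
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "(could not run: %s)\n" % exc

def snapshot(commands=COMMANDS, runner=run_command, now=None):
    """Build one timestamped block of text holding every command's output."""
    stamp = time.strftime("%m/%d/%Y %H:%M:%S", time.localtime(now))
    parts = ["==== %s ====\n" % stamp]
    for argv in commands:
        parts.append("--- %s ---\n" % " ".join(argv))
        parts.append(runner(argv))
    return "".join(parts)

def append_snapshot(path="resource.log"):
    """Append one snapshot to the log file (hypothetical file name)."""
    with open(path, "a") as log:
        log.write(snapshot())
```

To use it, call append_snapshot() at the bottom of the file and have
cron (or a sleep loop) run the file every few minutes. Appending rather
than overwriting means the log survives the reboot, so you can compare
the snapshots from just before the machine fell over with the quiet
ones from a week earlier.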

Here is a script that I use to solve this problem when I have a weird
pile of changes that I can't catch:



#!/usr/bin/perl
# LVM -  Dave Regan/John Sechrest
# creates a log, which you can then use cut+plot to see patterns
#
#  monitor and log a lot of stuff. don't know what might be important, so
#  log a lot of stuff.
#
#  some of the stuff: vmstat output. loadavg. network counters.
#
#  here are the column data:
#  1 procs r
#  2 procs b
#  3 procs w
#  4 swpd
#  5 free mem, kb
#  6 buff mem, kb
#  7 cache mem, kb
#  8 swap si
#  9 swap so
# 10 io - bi
# 11 io - bo
# 12 system in
# 13 system cs
# 14 cpu user
# 15 cpu system
# 16 cpu idle
# 17 loadavg 1min
# 18 loadavg 5min
# 19 loadavg 15min
# 20 net recv bytes
# 21 net recv packets
# 22 net recv errs
# 23 net recv drop
# 24 net recv fifo
# 25 net recv frame
# 26 net recv compressed
# 27 net recv multicast
# 28 net trans bytes
# 29 net trans packets
# 30 net trans errs
# 31 net trans drop
# 32 net trans fifo
# 33 net trans colls
# 34 net trans carrier
# 35 net trans compressed
# 36 collision since last
# 37 epoch time
# 38 human time
#
# to plot "collision since last" in gnuplot, following commands:
#
#  set xdata time
#  set timefmt "%m/%d/%Y %H:%M:%S"
#  set format x "%H:%M"
#  plot "lvm.log" using 38:36

# open me a log file and hot pipe it.
open( O, ">lvm.log" ) || die "cannot open log: $!\n";
select O; $| = 1;

# set a few "last" value variables.
$lastCollisions = 0;

# open a vmstat
open( V, "vmstat -n 1 |" ) || die "cannot open vmstat for input: $!\n";

# the first two lines of vmstat output are headers and the third is the
# "since boot" average; discard all three.
$junk = <V> for 1..3;

# ok, now lets do some work.
while( <V> ) {

        # have a vmstat line in $_
        $vm = $_;
        chomp $vm;

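        # get loadavg, keep just loadavgs
        open( L, "/proc/loadavg" ) || die "cannot open /proc/loadavg: $!\n";
        $load = join( ' ', (split(' ', <L>))[0,1,2] );
        close L;

        # get eth1 stats; $net stays empty if eth1 is not listed.
        # anchor the match so that eth10, eth11, ... do not match.
        open( E, "/proc/net/dev" ) || die "cannot open /proc/net/dev: $!\n";
        $net = '';
        while( <E> ) { if( /^\s*eth1:/ ) { $net = $_; last; } }
        chomp $net;
        close E;
        $net =~ s/^\s*eth1://;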
        $colls = (split(' ',$net))[13];
        $lastCollisions = $colls if $lastCollisions == 0;
        $diffcols = $colls - $lastCollisions;
        $lastCollisions = $colls;

        #produce some output dude
        $now = time;
        @d = localtime($now);
        $nowT = sprintf "%02d/%02d/%04d %02d:%02d:%02d",
                $d[4]+1, $d[3], $d[5]+1900,
                $d[2], $d[1], $d[0];
        $line = "$vm $load $net $diffcols $now $nowT";
        $line =~ s/\s+/ /g;
        print $line . "\n";

}
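
Once lvm.log has some data in it, a few lines of Python 3 can pull out
a single column instead of cut. This is only a sketch: the field names
are labels I invented to match the column list in the script's
comments, and it assumes the 16-column vmstat layout that list
describes. A vmstat that prints a different number of columns would
shift everything, in which case parse_line() returns None rather than
guessing.

```python
# Names for columns 1-37, in the order given in the script's comments.
COLUMNS = [
    "procs_r", "procs_b", "procs_w", "swpd", "free_kb", "buff_kb", "cache_kb",
    "swap_si", "swap_so", "io_bi", "io_bo", "sys_in", "sys_cs",
    "cpu_user", "cpu_system", "cpu_idle",
    "load_1min", "load_5min", "load_15min",
    "rx_bytes", "rx_packets", "rx_errs", "rx_drop", "rx_fifo", "rx_frame",
    "rx_compressed", "rx_multicast",
    "tx_bytes", "tx_packets", "tx_errs", "tx_drop", "tx_fifo", "tx_colls",
    "tx_carrier", "tx_compressed",
    "collisions_since_last", "epoch_time",
]

def parse_line(line):
    """Turn one log line into a dict, or None if it does not fit the layout."""
    fields = line.split()
    # Column 38 (human time) is a date and a time, so it splits into two fields.
    if len(fields) != len(COLUMNS) + 2:
        return None
    try:
        record = {name: float(value) for name, value in zip(COLUMNS, fields)}
    except ValueError:
        return None
    record["human_time"] = " ".join(fields[-2:])
    return record

def column_series(lines, name):
    """Return (epoch_time, value) pairs for one named column, skipping bad lines."""
    series = []
    for line in lines:
        record = parse_line(line)
        if record is not None:
            series.append((record["epoch_time"], record[name]))
    return series
```

For example, column_series(open("lvm.log"), "free_kb") gives the free
memory over time, which is the kind of slow slide you would expect to
see if memory is the resource that runs out.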






Matthew Jarvis <[EMAIL PROTECTED]> writes:

 % I am trying to nail down what is happening on a server and hopefully a 
 % solution.
 % 
 % It seems, as I monitor CPU usage over time, that there is a pattern 
 % to the load on an approx. 2-week cycle. Things are pretty flat for a week 
 % (after a reboot), then start ramping up over about a week until the load avg 
 % on the machine makes it unstable and requires a reboot.
 % 
 % Sometimes the load on this thing hits the high 40s, but it typically 
 % runs at around 2-4.
 % 
 % Looking at the output of ps or top doesn't seem to show, at least to me, what 
 % process or processes are being bad boys.
 % 
 % The only plan I have at the moment is to wait for a reboot (anticipate 
 % one tonight if allowed), output ps to a file, then wait for the next 
 % time and ps again to see a comparison.
 % 
 % I'd love to hear from someone how to diagnose this issue. Next SAO 
 % meeting the beer's on me if someone can help me out!  <g>
 % 
 % -- 
 % Matthew S. Jarvis
 % IT Manager
 % Bike Friday - "Performance that Packs."
 % www.bikefriday.com
 % 541/687-0487 x140
 % [EMAIL PROTECTED]
 % _______________________________________________
 % EUGLUG mailing list
 % [email protected]
 % http://www.euglug.org/mailman/listinfo/euglug

-----
John Sechrest          .         Helping people use
                        .           computers and the Internet
                          .            more effectively
                             .                      
                                 .       Internet: [EMAIL PROTECTED]
                                      .   
                                              . http://www.peak.org/~sechrest
