Something is tying up a resource and not releasing it, and eventually the
system runs completely out of that resource (hence the reboot).
The resources that can run out are:
1) memory
2) disk space
3) temporary disk space
4) network
5) process IDs
You can use the various "*stat" programs to look at resources.
vmstat tells you about memory and virtual memory
iostat tells you about i/o activities
df will tell you about disk space
netstat -a will tell you about network connections
ps will tell you about processes.
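For a quick one-shot look at all of these together, something like the following might do. This is only a sketch: the function names run_section and snapshot and the "== tool ==" header format are made up for illustration, and I have added -n to netstat so it skips name lookups. Any tool that is not installed is just reported as missing.

```shell
#!/bin/sh
# One labelled section per resource-reporting tool.
# run_section/snapshot and the header format are illustrative names only.

run_section() {
    # $1 is the tool name; any remaining arguments are its options.
    tool=$1
    echo "== $* =="
    if command -v "$tool" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($tool not installed)"
    fi
}

snapshot() {
    run_section vmstat          # memory and virtual memory
    run_section iostat          # I/O activity
    run_section df -k           # disk space, in kilobytes
    run_section netstat -an     # network connections, numeric addresses
    run_section ps -e           # every process
}
```

Running "snapshot > snapshot.txt" right after a reboot and again when the load is high gives you two files to compare with diff.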
Write a small script to monitor the bits of the system you need
and have it run every few minutes and log the results.
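A minimal sketch of that idea, assuming a POSIX shell with ps, df and awk available. The log path, the script name and the two fields I picked (process count and root filesystem usage) are placeholders; log whatever you suspect.

```shell
#!/bin/sh
# Append one timestamped sample per run.
# The default log path and the choice of fields are assumptions.
LOG=${LOG:-/tmp/resmon.log}

sample_line() {
    now=$(date '+%Y-%m-%d %H:%M:%S')
    # Count processes; subtract 1 for the ps header line.
    procs=$(( $(ps -e | wc -l) - 1 ))
    # Use% of the root filesystem: 5th column of the 2nd line of df -P.
    rootuse=$(df -P / | awk 'NR == 2 { print $5 }')
    echo "$now procs=$procs rootfs=$rootuse"
}

sample_line >> "$LOG"
```

To run it every five minutes, a crontab entry along the lines of "*/5 * * * * /usr/local/bin/resmon.sh" would work, where /usr/local/bin/resmon.sh is a hypothetical place to save the script.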
Here is a script that I use to solve this problem when I have a weird
pile of changes that I can't catch:
#!/usr/bin/perl5
# LVM - Dave Regan/John Sechrest
# creates a log, which you can then use cut+plot to see patterns
#
# monitor and log a lot of stuff. don't know what might be important, so
# log a lot of stuff.
#
# some of the stuff: vmstat output. loadavg. network counters.
#
# here are the column data:
# 1 procs r
# 2 procs b
# 3 procs w
# 4 swpd
# 5 free mem, kb
# 6 buff mem, kb
# 7 cache mem, kb
# 8 swap si
# 9 swap so
# 10 io - bi
# 11 io - bo
# 12 system in
# 13 system cs
# 14 cpu user
# 15 cpu system
# 16 cpu idle
# 17 loadavg 1min
# 18 loadavg 5min
# 19 loadavg 15min
# 20 net recv bytes
# 21 net recv packets
# 22 net recv errs
# 23 net recv drop
# 24 net recv fifo
# 25 net recv frame
# 26 net recv compressed
# 27 net recv multicast
# 28 net trans bytes
# 29 net trans packets
# 30 net trans errs
# 31 net trans drop
# 32 net trans fifo
# 33 net trans colls
# 34 net trans carrier
# 35 net trans compressed
# 36 collision since last
# 37 epoch time
# 38 human time
#
# to plot "collision since last" in gnuplot, following commands:
#
# set xdata time
# set timefmt "%m/%d/%Y %H:%M:%S"
# set format x "%H:%M"
# plot "lvm.log" using 38:36
# open me a log file and hot pipe it.
open( O, ">lvm.log" ) || die "cannot open log: $!\n";
select O; $| = 1;
# set a few "last" value variables.
$lastCollisions = 0;
# open a vmstat
open( V, "vmstat -n 1 |" ) || die "cannot open vmstat for input: $!\n";
# first two lines of vmstat output are headers, the third is "since boot";
# the fourth read also drops the first one-second sample. discard all.
$junk = <V>; $junk = <V>; $junk = <V>; $junk = <V>;
# ok, now lets do some work.
while( <V> ) {
    # have a vmstat line in $_
    $vm = $_;
    chomp $vm;
    # get loadavg, keep just loadavgs
    open( L, "/proc/loadavg" ) || die "cannot open /proc/loadavg: $!\n";
    $load = join( ' ', (split(' ', <L>))[0,1,2] );
    close L;
    # get eth1 stats; $net stays empty if there is no eth1 line
    open( E, "/proc/net/dev" ) || die "cannot open /proc/net/dev: $!\n";
    $net = '';
    while( <E> ) { if( /eth1/ ) { $net = $_; last; } }
    close E;
    chomp $net;
    $net =~ s/eth1://;
    # field 13 (counting from 0) is the transmit collision counter
    $colls = (split(' ',$net))[13];
    $lastCollisions = $colls if $lastCollisions == 0;
    $diffcols = $colls - $lastCollisions;
    $lastCollisions = $colls;
    # produce some output dude
    $now = time;
    @d = localtime($now);
    $nowT = sprintf "%02d/%02d/%04d %02d:%02d:%02d",
        $d[4]+1, $d[3], $d[5]+1900,
        $d[2], $d[1], $d[0];
    $line = "$vm $load $net $diffcols $now $nowT";
    $line =~ s/\s+/ /g;
    print $line . "\n";
}
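Once lvm.log has a few days in it, pulling out a couple of columns is usually enough to spot the trend. The sketch below assumes the column layout listed in the header comment of the script (5 = free memory, 17 = 1-minute load average, 37 = epoch time). Newer versions of vmstat print a different set of columns, so check one line by hand before trusting the numbers. The function names are made up for illustration.

```shell
#!/bin/sh
# Column numbers come from the header comment of the lvm script.
# pick_columns and free_mem_change are illustrative names only.

pick_columns() {
    # Print epoch time, free memory (KB) and 1-minute load average.
    awk '{ print $37, $5, $17 }'
}

free_mem_change() {
    # Last free-memory value minus the first one in the input.
    # A large negative number over several days hints at a leak.
    awk 'NR == 1 { first = $5 } { last = $5 } END { if (NR) print last - first }'
}
```

Usage would be "pick_columns < lvm.log > trend.txt" to get something small enough to plot, or "free_mem_change < lvm.log" for a single number.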
Matthew Jarvis <[EMAIL PROTECTED]> writes:
% I am trying to nail down what is happening on a server and hopefully a
% solution.
%
% It seems that as I monitor CPU usage over time there is a pattern
% to the load in an approx. 2-week cycle. Things are pretty flat for a week
% (after a reboot), then start ramping up over about a week until the load
% average on the machine makes it unstable and it requires a reboot.
%
% Sometimes the load on this thing hits in the high 40's, but typically
% runs around at 2-4.
%
% Looking at the output of ps or top doesn't seem to show, at least to me,
% which process or processes are being a bad boy.
%
% The only plan I have at the moment is to wait for a reboot (anticipate
% one tonight if allowed), output ps to a file, then wait for the next
% time and ps again to see a comparison.
%
% I'd love to hear from someone how to diagnose this issue. Next SAO
% meeting the beer's on me if someone can help me out! <g>
%
% --
% Matthew S. Jarvis
% IT Manager
% Bike Friday - "Performance that Packs."
% www.bikefriday.com
% 541/687-0487 x140
% [EMAIL PROTECTED]
% _______________________________________________
% EUGLUG mailing list
% [email protected]
% http://www.euglug.org/mailman/listinfo/euglug
-----
John Sechrest . Helping people use
. computers and the Internet
. more effectively
.
. Internet: [EMAIL PROTECTED]
.
. http://www.peak.org/~sechrest