On 10/31/07, Morris, Kevin J. (LNG-DAY) <[EMAIL PROTECTED]> wrote:
> Scenario:
> 1. LinuxA normally uses a steady ~30% cpu.
> 2. At 10:30am, it was noticed that LinuxA's cpu pegged 100% and memory
> consumption increased to the point of heavy swapping
> 3. At 11:10am, LinuxA's cpu went back to ~30% and memory utilization
> returned to normal.
> 4. It is now 3:00pm. How can I go back in time to find the culprit
> application/process for the 40 minutes of hiatus?
This is exactly the kind of scenario where the ESALPS performance database
is a big advantage. Basically all "interesting" data is retained in a
central database (subject to thresholds and customer-defined expiration).
Note that Performance Toolkit only keeps history data for a limited set of
system-wide metrics (so you can see that CPU usage was 230% at 02:40, but
have no idea why). It does have the option to "benchmark" a specific user
and keep detailed metrics for that user, but if you can predict which server
is going to have problems, then we have many other options anyway.
From today's live data (so not a test that was set up to show the data):
I see at 02:50 that my Linux virtual machine is using a full CPU. We can go
back in time and review usage over time to see that it started at 02:39 or so:
         UserID   <Processor>
Time     /Class   Total  Virt
-------- -------- ----- -----
02:35:00 LNEALE1   0.50  0.49
02:36:00 LNEALE1   0.66  0.65
02:37:00 LNEALE1   0.49  0.48
02:38:00 LNEALE1   0.49  0.48
02:39:00 LNEALE1  25.96 25.90
02:40:00 LNEALE1  15.43 15.08
02:41:00 LNEALE1  45.62 44.71
02:42:00 LNEALE1  96.83 96.57
02:43:00 LNEALE1  99.90 99.68
[snip]
02:50:00 LNEALE1  98.87 98.66
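
The "go back in time" step on the per-minute numbers above can be sketched
in a few lines of Python. This is not ESALPS functionality, just an
illustration: the function name find_onset and the 10% threshold are
assumptions picked for this example, and the samples are copied by hand from
the listing.

```python
# Per-minute samples copied from the listing: (time, total CPU %, virtual CPU %).
SAMPLES = [
    ("02:35:00", 0.50, 0.49),
    ("02:36:00", 0.66, 0.65),
    ("02:37:00", 0.49, 0.48),
    ("02:38:00", 0.49, 0.48),
    ("02:39:00", 25.96, 25.90),
    ("02:40:00", 15.43, 15.08),
    ("02:41:00", 45.62, 44.71),
    ("02:42:00", 96.83, 96.57),
    ("02:43:00", 99.90, 99.68),
    ("02:50:00", 98.87, 98.66),
]


def find_onset(samples, threshold=10.0):
    """Return the time of the first sample whose total CPU % exceeds
    the threshold, or None if no sample does.

    The threshold is a hypothetical value for this sketch, not an
    ESALPS setting.
    """
    for time, total, _virt in samples:
        if total > threshold:
            return time
    return None


if __name__ == "__main__":
    print(find_onset(SAMPLES))
```

With the baseline around 0.5%, any threshold comfortably above that picks
out the 02:39 interval as the start.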
Now I can see the process data in ESALPS for this server:
                  <-Process Ident->           <-----CPU Percents----->
Time     Node     Name         ID  PPID   GRP  Tot  sys user syst usrt
-------- -------- --------- ----- ----- ----- ---- ---- ---- ---- ----
02:37:00 LNEALE1  khelper       8     6     0  0.3    0    0  0.2  0.1
                  *Totals*      0     0     0  0.5  0.1  0.1  0.2  0.2
02:38:00 LNEALE1  snmpd      1063     1  1062  0.3  0.1  0.1  0.0  0.0
                  khelper       8     6     0  0.2  0.0    0  0.0  0.1
                  *Totals*      0     0     0  0.5  0.2  0.1  0.0  0.2
02:39:00 LNEALE1  nscd       1225     1  1225  0.1  0.1    0    0    0
                  sshd       1143     1  1143 24.7    0    0  1.2 23.6
                  snmpd      1063     1  1062  0.1  0.1  0.1    0  0.0
                  khelper       8     6     0  0.2    0    0  0.1  0.1
                  init          1     0     0  0.6    0    0  0.1  0.5
                  *Totals*      0     0     0 25.7  0.2  0.1  1.3 24.2
02:40:00 LNEALE1  python     2882  2881  2860  1.2  0.3  0.8    0    0
                  python     2881  2860  2860  0.3  0.0  0.3    0    0
                  bash       2860  2859  2860  0.1  0.0  0.1  0.0    0
                  sshd       2857  1143  2857  0.1  0.0  0.1    0  0.0
                  sshd       1143     1  1143 11.6    0    0  2.4  9.1
                  snmpd      1063     1  1062  0.2  0.1  0.1    0  0.0
                  khelper       8     6     0  0.2    0    0  0.0  0.1
                  *Totals*      0     0     0 13.6  0.5  1.3  2.5  9.3
[snip]
02:50:00 LNEALE1  cc1       23669 23668  2966  1.5  0.0  1.4    0    0
                  sh        23497 17653  2966  2.4  0.1  0.3  0.4  1.7
                  make      17653 17652  2966 59.3  0.0  0.1 13.9 45.3
                  sh        13522 13521  2966 34.9    0    0  5.7 29.2
                  snmpd      1063     1  1062  0.2  0.1  0.1    0  0.0
                  khelper       8     6     0  0.1    0    0  0.1  0.1
                  *Totals*      0     0     0 98.5  0.3  1.9 20.0 76.3
You may notice that the 98.5% total is pretty close to the 98.66% that VM
reported (the 10ms granularity makes things wobble a bit).
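
To make that comparison concrete, here is a small Python sketch that ranks
the 02:50 processes by their "Tot" column and computes the gap to the VM
figure. The values are copied by hand from the two listings; the helper
names top_consumers and unexplained_gap are made up for this example and
are not part of ESALPS.

```python
# 02:50:00 process rows from the listing: (name, pid, Tot CPU %).
PROCESSES_0250 = [
    ("cc1", 23669, 1.5),
    ("sh", 23497, 2.4),
    ("make", 17653, 59.3),
    ("sh", 13522, 34.9),
    ("snmpd", 1063, 0.2),
    ("khelper", 8, 0.1),
]

# The *Totals* row at 02:50:00 and the virtual CPU % VM reported
# for LNEALE1 in the same interval.
REPORTED_PROCESS_TOTAL = 98.5
VM_VIRTUAL_CPU = 98.66


def top_consumers(processes, count=2):
    """Return the `count` rows with the highest Tot CPU %, largest first."""
    return sorted(processes, key=lambda row: row[2], reverse=True)[:count]


def unexplained_gap(vm_percent, process_percent):
    """CPU % that VM charged to the guest but the process data does not show."""
    return round(vm_percent - process_percent, 2)


if __name__ == "__main__":
    for name, pid, tot in top_consumers(PROCESSES_0250):
        print(f"{name:<8} {pid:>6} {tot:>5.1f}")
    print(unexplained_gap(VM_VIRTUAL_CPU, REPORTED_PROCESS_TOTAL))
```

The two largest rows are make (17653) and sh (13522), and the remaining gap
to the VM number is 0.16 percentage points.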
The data actually shows many other details as well, but I will not bore
folks further now.
--
Rob van der Heij
Velocity Software, Inc
http://velocitysoftware.com/