On 10/31/07, Morris, Kevin J. (LNG-DAY) <[EMAIL PROTECTED]> wrote:

> Scenario:
> 1. LinuxA normally uses a steady ~30% cpu.
> 2. At 10:30am, it was noticed that LinuxA's cpu pegged 100% and memory
> consumption increased to the point of heavy swapping
> 3. At 11:10am, LinuxA's cpu went back to ~30% and memory utilization
> returned to normal.
> 4. It is now 3:00pm. How can I go back in time to find the culprit
> application/process for the 40 minutes of hiatus?

This is exactly the kind of scenario where the performance database of
ESALPS is a big advantage. Basically all "interesting" data is retained in a
central database (subject to thresholds and customer-defined expiration).

Note that Performance Toolkit only keeps history data for a limited set of
system-wide metrics (so you can see the CPU usage was 230% at 02:40, but have
no idea why). It does have the option to "benchmark" a specific user and keep
detailed metrics for that user, but if you can predict which server will
get into trouble, you have many other options anyway.

From today's live data (so not a test that was set up to show the data),
I see at 02:50 that my Linux virtual machine is using a full CPU. We can
go back in time and review the usage to see that it started at 02:39 or so:
         UserID   <Processor>
Time     /Class   Total  Virt
-------- -------- ----- -----
02:35:00 LNEALE1   0.50  0.49
02:36:00 LNEALE1   0.66  0.65
02:37:00 LNEALE1   0.49  0.48
02:38:00 LNEALE1   0.49  0.48
02:39:00 LNEALE1  25.96 25.90
02:40:00 LNEALE1  15.43 15.08
02:41:00 LNEALE1  45.62 44.71
02:42:00 LNEALE1  96.83 96.57
02:43:00 LNEALE1  99.90 99.68
[snip]
02:50:00 LNEALE1  98.87 98.66
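
Finding the onset in per-minute data like this is easy to automate. The
following is only a minimal sketch, not how ESALPS does it: the samples are
copied by hand from the report above, and the function name find_onset and
the 10% threshold are assumptions chosen for illustration.

```python
# Rows copied from the report above: (time, total CPU %, virtual CPU %).
SAMPLES = [
    ("02:35:00", 0.50, 0.49),
    ("02:36:00", 0.66, 0.65),
    ("02:37:00", 0.49, 0.48),
    ("02:38:00", 0.49, 0.48),
    ("02:39:00", 25.96, 25.90),
    ("02:40:00", 15.43, 15.08),
    ("02:41:00", 45.62, 44.71),
    ("02:42:00", 96.83, 96.57),
    ("02:43:00", 99.90, 99.68),
    ("02:50:00", 98.87, 98.66),
]


def find_onset(samples, threshold=10.0):
    """Return the time of the first sample whose total CPU exceeds threshold.

    The threshold is an arbitrary illustrative value, not an ESALPS setting.
    """
    for time, total, _virt in samples:
        if total > threshold:
            return time
    return None


print(find_onset(SAMPLES))
```

With these samples the first interval above the assumed threshold is
02:39:00, which matches what you see by eye.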

Now I can see the process data in ESALPS for this server:

                            <-Process Ident-> <-----CPU Percents----->
Time     Node     Name      ID    PPID   GRP   Tot  sys user syst usrt
-------- -------- --------- ----- ----- ----- ---- ---- ---- ---- ----
02:37:00 LNEALE1  khelper       8     6     0  0.3    0    0  0.2  0.1
                  *Totals*      0     0     0  0.5  0.1  0.1  0.2  0.2
02:38:00 LNEALE1  snmpd      1063     1  1062  0.3  0.1  0.1  0.0  0.0
                  khelper       8     6     0  0.2  0.0    0  0.0  0.1
                  *Totals*      0     0     0  0.5  0.2  0.1  0.0  0.2
02:39:00 LNEALE1  nscd       1225     1  1225  0.1  0.1    0    0    0
                  sshd       1143     1  1143 24.7    0    0  1.2 23.6
                  snmpd      1063     1  1062  0.1  0.1  0.1    0  0.0
                  khelper       8     6     0  0.2    0    0  0.1  0.1
                  init          1     0     0  0.6    0    0  0.1  0.5
                  *Totals*      0     0     0 25.7  0.2  0.1  1.3 24.2
02:40:00 LNEALE1  python     2882  2881  2860  1.2  0.3  0.8    0    0
                  python     2881  2860  2860  0.3  0.0  0.3    0    0
                  bash       2860  2859  2860  0.1  0.0  0.1  0.0    0
                  sshd       2857  1143  2857  0.1  0.0  0.1    0  0.0
                  sshd       1143     1  1143 11.6    0    0  2.4  9.1
                  snmpd      1063     1  1062  0.2  0.1  0.1    0  0.0
                  khelper       8     6     0  0.2    0    0  0.0  0.1
                  *Totals*      0     0     0 13.6  0.5  1.3  2.5  9.3
 [snip]
02:50:00 LNEALE1  cc1       23669 23668  2966  1.5  0.0  1.4    0    0
                  sh        23497 17653  2966  2.4  0.1  0.3  0.4  1.7
                  make      17653 17652  2966 59.3  0.0  0.1 13.9 45.3
                  sh        13522 13521  2966 34.9    0    0  5.7 29.2
                  snmpd      1063     1  1062  0.2  0.1  0.1    0  0.0
                  khelper       8     6     0  0.1    0    0  0.1  0.1
                  *Totals*      0     0     0 98.5  0.3  1.9 20.0 76.3

You may notice that the 98.5% total is getting pretty close to the
98.66% that VM reported (the 10ms granularity makes things wobble a
bit).
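
If you save a process report like the one above as text, picking out the
heaviest process per interval takes only a few lines. This is a rough
sketch under assumptions: it relies on the whitespace-separated layout
shown above (nine fields per process row, with time and node only on the
first row of each interval), and the function name top_consumers is made
up for this example.

```python
import re

# A few intervals copied from the process report above.
REPORT = """\
02:39:00 LNEALE1  nscd       1225     1  1225  0.1  0.1    0    0    0
                  sshd       1143     1  1143 24.7    0    0  1.2 23.6
                  snmpd      1063     1  1062  0.1  0.1  0.1    0  0.0
                  khelper       8     6     0  0.2    0    0  0.1  0.1
                  init          1     0     0  0.6    0    0  0.1  0.5
                  *Totals*      0     0     0 25.7  0.2  0.1  1.3 24.2
02:50:00 LNEALE1  cc1       23669 23668  2966  1.5  0.0  1.4    0    0
                  sh        23497 17653  2966  2.4  0.1  0.3  0.4  1.7
                  make      17653 17652  2966 59.3  0.0  0.1 13.9 45.3
                  sh        13522 13521  2966 34.9    0    0  5.7 29.2
                  snmpd      1063     1  1062  0.2  0.1  0.1    0  0.0
                  khelper       8     6     0  0.1    0    0  0.1  0.1
                  *Totals*      0     0     0 98.5  0.3  1.9 20.0 76.3
"""

TIME_RE = re.compile(r"^\d\d:\d\d:\d\d$")


def top_consumers(report):
    """Map each interval to (name, pid, total CPU %) of its busiest process."""
    best = {}
    current = None
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        if TIME_RE.match(fields[0]):
            current = fields[0]
            fields = fields[2:]  # drop the time and node columns
        # Skip anything that is not a process row, including the totals line.
        if current is None or len(fields) != 9 or fields[0] == "*Totals*":
            continue
        name, pid, tot = fields[0], int(fields[1]), float(fields[4])
        if current not in best or tot > best[current][2]:
            best[current] = (name, pid, tot)
    return best


for time, (name, pid, tot) in top_consumers(REPORT).items():
    print(time, name, pid, tot)
```

For the two intervals included here that points at sshd (pid 1143) at
02:39 and make (pid 17653) at 02:50.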

The data actually shows many other details as well, but I will not bore
folks further now.
--
Rob van der Heij
Velocity Software, Inc
http://velocitysoftware.com/

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
