Hi
I'm probably not the biggest expert in performance analysis, but some
things struck me as odd in your outputs.
Anil wrote:
We have a serious performance problem on our server. Here is some data:
sar -u 5 10:
18:06:58 %usr %sys %wio %idle
18:07:03 8 57 0 35
18:07:08 3 22 0 75
18:07:14 3 66 0 31
18:07:19 3 16 0 81
18:07:24 4 52 0 44
18:07:29 3 20 0 77
18:07:34 2 60 0 38
18:07:39 2 39 0 59
18:07:44 2 50 0 48
18:07:49 2 21 0 77
Average 3 40 0 57
A lot of system time is eating up the CPU. Using vmstat shows:
You have a peak of 31% CPU idle, so if you hadn't told us you were
having performance issues, I would just assume this machine was simply
doing I/O but still had room to grow.
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s1 s2 s3 s4 in sy cs us sy id
0 0 0 2593264 373392 182 1013 4 0 0 0 0 0 0 0 23 1483 9112 1862 3 9 88
0 0 0 2647980 425032 246 1168 2 0 0 0 0 0 0 0 23 896 23589 2229 4 10 86
0 0 0 2645524 424328 221 1091 3 0 0 0 0 0 0 0 20 872 8795 1870 3 9 88
<snip>
The CPU still has plenty of idle time and the po column is always zero,
so no serious CPU or memory issues show up here.
(this is when system bogs down)
12 0 0 2815300 406156 292 1451 66 0 0 0 0 0 0 0 282 4993 110489 4649 6 24 70
11 0 0 2596252 370944 304 1910 27 0 0 0 0 0 0 0 223 2404 57232 3445 7 48 45
12 0 0 2654676 423784 199 1016 10 0 0 0 0 0 0 0 203 1470 9183 3672 3 48 49
The run queue shoots up, and pi (page-in: the rate at which Solaris
loads pages into memory) also shoots up, but po (page-out: the rate at
which Solaris frees memory pages) stays at zero. That is somewhat
strange; I would expect po to move away from zero from time to time.
Can you tell us what interval you used with vmstat (so we can get an
idea of the sample size)?
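As a side note (my own suggestion, not something from the original post): when sampling with vmstat, remember that the very first line it prints is the average since boot, so only the subsequent lines reflect the interval you chose. Something like:

```
vmstat 5 10    # ten samples at 5-second intervals; ignore the first line
```

A 5-second interval is usually a reasonable compromise between smoothing out noise and catching short spikes.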
Notice the run queue. Is there a DTrace script (from the DTT package) that I
can use to figure out what is going on?
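Not speaking for the DTT authors, but the toolkit's hotkernel and procsystime scripts wrap essentially the logic below, and a couple of stock one-liners would already narrow down where the sys time goes (standard DTrace syntax, run as root):

```
# Which processes are making the most system calls?
dtrace -n 'syscall:::entry { @[execname] = count(); }'

# Sample on-CPU kernel stacks at 997 Hz for 10 seconds
# (arg0 != 0 means the sampled PC was in the kernel)
dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-10s { exit(0); }'
```

The first answers "who is driving the syscall rate"; the second answers "which kernel code paths are burning the sys time".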
mpstat shows:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3511 48 97 784 197 1067 32 239 365 0 5814 5 43 0 52
1 1287 28 43 429 0 901 37 215 314 0 2821 3 40 0 57
2 2954 54 155 1442 1079 1176 26 241 339 0 4927 4 42 0 54
3 1364 20 886 167 16 655 32 184 299 0 3939 4 41 0 55
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3523 14 46 486 197 1129 50 251 411 0 6895 7 52 0 41
1 1536 8 31 119 0 922 53 220 375 0 4149 4 51 0 45
2 3160 11 76 1251 1177 1058 56 239 403 0 5987 5 57 0 38
3 1592 5 38 102 2 725 50 189 363 0 3929 4 51 0 45
and when things *appear* to be good:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 355 0 14 680 202 631 5 146 67 0 2225 2 13 0 85
1 59 0 804 29 0 593 13 173 48 0 1948 2 3 0 95
2 455 0 13 648 363 675 7 179 43 0 4473 3 8 0 89
3 96 0 7 293 2 419 6 165 40 0 2434 2 9 0 89
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 379 0 12 610 202 821 7 174 62 0 1594 4 7 0 89
1 189 0 23 223 0 646 15 182 49 0 1695 3 7 0 90
2 322 0 582 565 535 695 10 169 45 0 2477 12 14 0 75
3 216 0 9 221 2 439 11 168 39 0 1845 12 5 0 83
(the idle time is much higher)
The only thing I see is a high smtx?
vmstat and mpstat have some differences in the way they measure
information. That's why it's important to use both.
By the looks of it, you have an application running with its threading
configuration a little too aggressive, and the CPUs spend a lot of time
switching context. But it's pretty hard to actually point out anything
without knowing:
- What is this machine?
- What applications is it running? When did the problems start, what
happened then, etc.?
- I hope you already did this, but I'll ask just the same: is the
hardware all checked? Is the iostat -E output the same when executed
24 hours apart? Does /var/adm/messages show any errors (retryable
errors on the disks, for example)?
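One more thought on that high smtx column: smtx counts spins on kernel mutexes, so it usually means lock contention inside the kernel, and lockstat(1M) is the tool to confirm which locks are involved. A sketch of what I'd run (my suggestion, needs root):

```
# Watch adaptive mutex spin/block events for 10 seconds
lockstat sleep 10

# Or profile where the kernel spends its time, top 20 entries,
# coalesced by function (-k), interrupt-driven sampling (-I)
lockstat -kIW -D 20 sleep 10
```

If the same lock or kernel function dominates both outputs during the bad periods, that plus the DTrace syscall counts should point straight at the misbehaving application.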
_______________________________________________
dtrace-discuss mailing list
dtrace-discuss@opensolaris.org