Hi

I'm probably not the biggest expert in performance analysis, but some things struck me as odd in your output.

Anil wrote:
We have a serious performance problem on our server. Here is some data:
sar -u 5 10:
18:06:58    %usr    %sys    %wio   %idle
18:07:03       8      57       0      35
18:07:08       3      22       0      75
18:07:14       3      66       0      31
18:07:19       3      16       0      81
18:07:24       4      52       0      44
18:07:29       3      20       0      77
18:07:34       2      60       0      38
18:07:39       2      39       0      59
18:07:44       2      50       0      48
18:07:49       2      21       0      77

Average        3      40       0      57


A lot of system time is eating up the CPU. Using vmstat shows:
Even the busiest sample still shows 31% CPU idle so, if you hadn't told us you were having performance issues, I would just assume this machine was simply doing I/O but still had room to grow.
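For what it's worth, the headroom can be checked straight from the sar samples. This is only a rough Python sketch: it assumes the ten sample lines are pasted in as text, and the name `summarize_sar` is my own, not part of any tool.

```python
# The ten "sar -u 5 10" samples quoted above: time, %usr, %sys, %wio, %idle.
SAR_SAMPLES = """\
18:07:03       8      57       0      35
18:07:08       3      22       0      75
18:07:14       3      66       0      31
18:07:19       3      16       0      81
18:07:24       4      52       0      44
18:07:29       3      20       0      77
18:07:34       2      60       0      38
18:07:39       2      39       0      59
18:07:44       2      50       0      48
18:07:49       2      21       0      77
"""


def summarize_sar(text):
    """Return (mean %sys, mean %idle, lowest %idle) over sar -u sample lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A sample line is a timestamp followed by four integer columns.
        if len(fields) == 5 and ":" in fields[0] and all(f.isdigit() for f in fields[1:]):
            rows.append([int(f) for f in fields[1:]])
    sys_col = [r[1] for r in rows]
    idle_col = [r[3] for r in rows]
    return sum(sys_col) / len(rows), sum(idle_col) / len(rows), min(idle_col)


mean_sys, mean_idle, min_idle = summarize_sar(SAR_SAMPLES)
print(f"mean %sys={mean_sys:.1f}  mean %idle={mean_idle:.1f}  lowest %idle={min_idle}")
```

The means agree with sar's own "Average" line once rounded, and the lowest idle value is the 31% mentioned above.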

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s2 s3 s4   in   sy   cs us sy id
 0 0 0 2593264 373392 182 1013 4 0 0  0  0  0  0  0 23 1483 9112 1862  3  9 88
 0 0 0 2647980 425032 246 1168 2 0 0  0  0  0  0  0 23  896 23589 2229 4 10 86
 0 0 0 2645524 424328 221 1091 3 0 0  0  0  0  0  0 20  872 8795 1870  3  9 88
<snip>
The CPU still has a lot of idle time and the po column is always zero, so no serious CPU or memory issues show up here.
(this is when system bogs down)

 12 0 0 2815300 406156 292 1451 66 0 0 0 0  0  0  0 282 4993 110489 4649 6 24 70
 11 0 0 2596252 370944 304 1910 27 0 0 0 0  0  0  0 223 2404 57232 3445 7 48 45
 12 0 0 2654676 423784 199 1016 10 0 0 0 0  0  0  0 203 1470 9183 3672 3 48 49
The run queue shoots up, and pi (page-in, the rate at which Solaris is loading pages into memory) also shoots up, but po (page-out, the rate at which Solaris writes pages out of memory) stays at zero. That is somewhat strange; I would expect po to move away from zero from time to time. Can you tell us what interval you used for vmstat (so we have an idea of the sample size)?
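To put numbers on that comparison, here is a small Python sketch that averages r, pi and po over the quiet and the bogged-down vmstat samples quoted above. It assumes the 22-column layout shown in the header; `sy_calls` and `sy_cpu` are my own labels for the two columns that vmstat both calls `sy`, and the helper names are made up for the example.

```python
# Column names follow the vmstat header quoted above; the two "sy" columns
# are renamed here (my choice) so they can live in one dictionary.
COLUMNS = ["r", "b", "w", "swap", "free", "re", "mf", "pi", "po", "fr", "de",
           "sr", "s1", "s2", "s3", "s4", "in", "sy_calls", "cs",
           "us", "sy_cpu", "id"]

QUIET = """\
 0 0 0 2593264 373392 182 1013 4 0 0  0  0  0  0  0 23 1483 9112 1862  3  9 88
 0 0 0 2647980 425032 246 1168 2 0 0  0  0  0  0  0 23  896 23589 2229 4 10 86
 0 0 0 2645524 424328 221 1091 3 0 0  0  0  0  0  0 20  872 8795 1870  3  9 88
"""

BUSY = """\
 12 0 0 2815300 406156 292 1451 66 0 0 0 0  0  0  0 282 4993 110489 4649 6 24 70
 11 0 0 2596252 370944 304 1910 27 0 0 0 0  0  0  0 223 2404 57232 3445 7 48 45
 12 0 0 2654676 423784 199 1016 10 0 0 0 0  0  0  0 203 1470 9183 3672 3 48 49
"""


def parse_vmstat(text):
    """Turn vmstat sample lines into a list of {column: value} dictionaries."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only all-numeric lines with the expected number of columns.
        if len(fields) == len(COLUMNS) and all(f.isdigit() for f in fields):
            samples.append(dict(zip(COLUMNS, map(int, fields))))
    return samples


def mean(samples, column):
    """Average one column over a list of parsed samples."""
    return sum(s[column] for s in samples) / len(samples)


quiet = parse_vmstat(QUIET)
busy = parse_vmstat(BUSY)
for col in ("r", "pi", "po"):
    print(f"{col:>2}: quiet={mean(quiet, col):5.1f}  busy={mean(busy, col):5.1f}")
```

With only three samples per state this is an illustration rather than a measurement, which is another reason the sampling interval matters.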

Notice the run queue. Is there a DTrace script (from the DTT package) that I 
can use to figure out what is going on?

mpstat shows:
<pre>
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0 3511  48   97   784  197 1067   32  239  365    0  5814    5  43   0  52
  1 1287  28   43   429    0  901   37  215  314    0  2821    3  40   0  57
  2 2954  54  155  1442 1079 1176   26  241  339    0  4927    4  42   0  54
  3 1364  20  886   167   16  655   32  184  299    0  3939    4  41   0  55
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0 3523  14   46   486  197 1129   50  251  411    0  6895    7  52   0  41
  1 1536   8   31   119    0  922   53  220  375    0  4149    4  51   0  45
  2 3160  11   76  1251 1177 1058   56  239  403    0  5987    5  57   0  38
  3 1592   5   38   102    2  725   50  189  363    0  3929    4  51   0  45
</pre>

and when things *appear* to be good:

<pre>
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  355   0   14   680  202  631    5  146   67    0  2225    2  13   0  85
  1   59   0  804    29    0  593   13  173   48    0  1948    2   3   0  95
  2  455   0   13   648  363  675    7  179   43    0  4473    3   8   0  89
  3   96   0    7   293    2  419    6  165   40    0  2434    2   9   0  89
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  379   0   12   610  202  821    7  174   62    0  1594    4   7   0  89
  1  189   0   23   223    0  646   15  182   49    0  1695    3   7   0  90
  2  322   0  582   565  535  695   10  169   45    0  2477   12  14   0  75
  3  216   0    9   221    2  439   11  168   39    0  1845   12   5   0  83
</pre>

(the idle time is much higher)

The only thing I see is a high smtx?

vmstat and mpstat differ in the way they measure things; that's why it's important to use both. By the looks of it, you have an application whose threading configuration is a little too aggressive, and the CPUs spend a lot of time switching context. But it's pretty hard to actually point at something without knowing:
- What is this machine?
- What applications is it running?
- When did the problems start, what happened then, etc.?
- I hope you already did this, but I'll ask just the same: Is the hardware all checked? Does the iostat -E output stay the same over a 24-hour period? Does /var/adm/messages show any errors (retryable errors on disks, for example)?
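
While gathering those answers, it may help to see how far the mpstat counters actually move between the two states. A rough Python sketch, using the first slow sample and the first good sample quoted above; the function name `totals` and the choice of csw, smtx and syscl as the counters to compare are my own assumptions.

```python
COLUMNS = "CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl".split()

# First mpstat sample taken while the system was slow.
SLOW = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0 3511  48   97   784  197 1067   32  239  365    0  5814    5  43   0  52
  1 1287  28   43   429    0  901   37  215  314    0  2821    3  40   0  57
  2 2954  54  155  1442 1079 1176   26  241  339    0  4927    4  42   0  54
  3 1364  20  886   167   16  655   32  184  299    0  3939    4  41   0  55
"""

# First mpstat sample taken while things appeared to be good.
GOOD = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  355   0   14   680  202  631    5  146   67    0  2225    2  13   0  85
  1   59   0  804    29    0  593   13  173   48    0  1948    2   3   0  95
  2  455   0   13   648  363  675    7  179   43    0  4473    3   8   0  89
  3   96   0    7   293    2  419    6  165   40    0  2434    2   9   0  89
"""


def totals(text, wanted=("csw", "smtx", "syscl")):
    """Sum the wanted counters across all CPU rows of one mpstat sample."""
    sums = dict.fromkeys(wanted, 0)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue  # skip header and blank lines
        row = dict(zip(COLUMNS, map(int, fields)))
        for name in wanted:
            sums[name] += row[name]
    return sums


slow, good = totals(SLOW), totals(GOOD)
for name in slow:
    ratio = slow[name] / good[name]
    print(f"{name:>5}: slow={slow[name]:6d}  good={good[name]:6d}  ratio={ratio:.1f}x")
```

One sample per state is thin evidence, so treat the ratios as a hint about where to look, not as a diagnosis.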
_______________________________________________
dtrace-discuss mailing list
dtrace-discuss@opensolaris.org
