I am studying a SCO OSR5 system that has serious performance issues. I am
interested in hearing your critiques, suggestions, and corrections about my
performance analysis, suggested courses of action, and, in general, anything
else that you feel is relevant. I also hope that this post and resulting
thread will help others in troubleshooting performance issues on a SCO or
Unix server, since it can be something of a mystic art.




I have been collecting data via sar for several weeks. This post includes
snippets of sar data for a single representative day, along with my
analysis.



I have attached the complete hardware information at the bottom of this
post. The short version: Pentium Pro, 192MB of RAM, and a single SCSI disk.



Problem: Users report very slow response times from the primary application,
a Dataflex-based database application.



Summary of findings: The sar data indicates to me that disk I/O is the sole
cause of the performance issues. The client runs a database application
(Dataflex) as the sole application, used by around 35 concurrent users
during working hours (7 AM to 5 PM). A network backup begins in the
afternoon and taxes the disk, but I am not counting it as an issue because
it runs during non-working hours and is not indicative of normal use by
users.



Initial thoughts on corrective actions:



1. Increase read buffers by very large amounts. The server does mostly read
operations on the disk, and my goal should be to keep %rcache at or above
90%.



2. Buy a faster SCSI disk (or disks). I am not positive about the speed,
but I believe that the current disk is 7200 RPM.



The database software does do caching, but only on a per-user basis. That
is, there is no centralized server process. Dataflex runs per-user and each
user has their own cache. I believe this may cause reads to be highly random
across the disk, in which case a faster-spinning disk (lower rotational
latency) could yield a major gain.



(I would like to be able to confirm whether the server tends to do more
random access reads or sequential reads in general.)



I could spread the major database files across multiple disks, but that puts
me in a risky situation, since the MTBF of the combined set decreases
drastically: the failure of any one disk takes part of the database with it.



3. Move to RAID-1, which has very good read performance, especially for
random access reads. The risk of data loss with RAID-0 rules it out.



Here is the sar data and my general thoughts:



CPU

-----------------



00:00:00    %usr    %sys    %wio   %idle (-u)

01:00:00       0       0       0     100

02:00:00       0       0       0     100

...

11:00:02       0       2       4      93

11:20:02       5      17      35      43

11:40:03      12      25      45      18

12:00:05       9      19      64       8

12:20:00       5      18      35      42

12:40:00       0       2       5      93

13:00:00       0       2       3      95



%usr and %sys stay modest, while %wio climbs as high as 64% at peak.



Action: review disk i/o
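
The reading of the -u data above can be sketched as a small check. This is a minimal illustration only: the rows are copied from the table above, and the function name and the min_busy cutoff are my own hypothetical choices, not a sar guideline.

```python
# (time, %usr, %sys, %wio, %idle) rows copied from the sar -u output above.
cpu_rows = [
    ("11:00:02", 0, 2, 4, 93),
    ("11:20:02", 5, 17, 35, 43),
    ("11:40:03", 12, 25, 45, 18),
    ("12:00:05", 9, 19, 64, 8),
    ("12:20:00", 5, 18, 35, 42),
    ("12:40:00", 0, 2, 5, 93),
    ("13:00:00", 0, 2, 3, 95),
]


def io_bound_intervals(rows, min_busy=20):
    """Return the times where the CPU is non-idle mostly because of I/O wait.

    min_busy is an arbitrary illustrative cutoff (percent non-idle time)
    used to ignore intervals where the machine is essentially idle.
    """
    flagged = []
    for time, usr, sys_, wio, _idle in rows:
        busy = usr + sys_ + wio
        if busy >= min_busy and wio > usr + sys_:
            flagged.append(time)
    return flagged


flagged_times = io_bound_intervals(cpu_rows)
print(flagged_times)
```

With the rows above, the four intervals from 11:20 through 12:20 are flagged: in each of them the time spent waiting on I/O exceeds the time spent doing work.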



00:00:00 runq-sz %runocc swpq-sz %swpocc (-q)

01:00:00     1.0       0

02:00:00

...

11:00:02     1.0       0

11:20:02     1.1       4

11:40:03     1.0       5

12:00:05     1.0       4

12:20:00     1.0       2

12:40:00     1.0       1

13:00:00     1.5       0

13:20:00     3.0       0

13:40:00

14:00:01     2.0       0



runq-sz > 2 at times, but %runocc always < 90%, so we are fine on CPU.



*****

Conclusion: CPU is not a factor.

*****



Disk - contributing factor

-----------------



00:00:00  iget/s namei/s dirbk/s (-a)

01:00:00       5       1       1

02:00:00       2       0       0

...

11:00:02      33       7      12

11:20:02    1814     214     469

11:40:03     275      61     860

12:00:05      82      22    1561

12:20:00     187      50    2677

12:40:00      31       6      27

13:00:00      48       9      34



The ratio of iget/s to namei/s is very high. Bad filesystem layout?



Action: ???



00:00:00 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s (-b)

01:00:00       0       2      93       0       0      50       0       0

02:00:00       0       1      97       0       0      26       0       0

...

11:00:02       7      87      92       3       7      63       0       0

11:20:02     362    1392      74       7     156      95       0       0

11:40:03    1330    2831      53       4      36      90       0       0

12:00:05    1015    3616      72       4      17      78       0       0

12:20:00     422    7421      94       5      32      84       0       0

12:40:00       9     111      92       1       6      75       0       0

13:00:00       5      86      94       1       4      67       0       0



The ratio of lread/s to lwrit/s is very, very high. (Eyeball average around
60:1.) This server does mostly read operations, by far.
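
The eyeballed ratio can be checked against the rows shown. This is a rough sketch: it only uses the rows pasted above and weights each one equally, even though the early rows cover hourly intervals and the later ones cover 20-minute intervals, so treat the result as approximate.

```python
# (time, lread/s, lwrit/s) copied from the sar -b rows shown above.
samples = [
    ("01:00:00", 2, 0),
    ("02:00:00", 1, 0),
    ("11:00:02", 87, 7),
    ("11:20:02", 1392, 156),
    ("11:40:03", 2831, 36),
    ("12:00:05", 3616, 17),
    ("12:20:00", 7421, 32),
    ("12:40:00", 111, 6),
    ("13:00:00", 86, 4),
]

total_reads = sum(lread for _, lread, _ in samples)
total_writes = sum(lwrit for _, _, lwrit in samples)
overall_ratio = total_reads / total_writes
print(f"overall lread:lwrit = {overall_ratio:.1f}:1")

# Per-interval ratios vary a lot, so the overall figure hides some spread.
for time, lread, lwrit in samples:
    if lwrit:  # skip rows with no logical writes to avoid dividing by zero
        print(f"{time}  {lread / lwrit:6.1f}:1")
```

Over these rows the overall figure comes out at about 60:1, while individual intervals range from roughly 9:1 to over 200:1.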



The %rcache is very low around peak times. We want to keep %rcache around
90% or higher, even during peak times, for optimal performance.
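
For reference, %rcache can be reproduced from the other columns as 1 - bread/lread. The sketch below recomputes it for the rows above and also shows how many physical reads per second a 90% target would allow at the same logical read rate. The function names are my own, for illustration only.

```python
# (time, bread/s, lread/s, reported %rcache) copied from the sar -b rows above.
rows = [
    ("11:00:02", 7, 87, 92),
    ("11:20:02", 362, 1392, 74),
    ("11:40:03", 1330, 2831, 53),
    ("12:00:05", 1015, 3616, 72),
    ("12:20:00", 422, 7421, 94),
    ("12:40:00", 9, 111, 92),
    ("13:00:00", 5, 86, 94),
]


def read_cache_hit_pct(bread, lread):
    """Percentage of logical reads satisfied from the buffer cache."""
    if lread == 0:
        return None  # no logical reads, so the ratio is undefined
    return 100.0 * (1.0 - bread / lread)


def bread_budget(lread, target_pct=90.0):
    """Physical reads/s allowed at this logical read rate to hit target_pct."""
    return lread * (1.0 - target_pct / 100.0)


for time, bread, lread, reported in rows:
    computed = read_cache_hit_pct(bread, lread)
    print(f"{time}  computed={computed:5.1f}%  reported={reported}%  "
          f"bread/s now={bread}  allowed at 90%={bread_budget(lread):.0f}")
```

At 11:40, for example, the disk is doing 1330 physical reads/s where a 90% hit rate would need it to do about 283, which gives a sense of how much more of the working set the cache would have to hold.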



The %wcache also drops below 90%, but with so few writes, is it a minor
factor compared to %rcache?



Action: Increase disk buffers by a large amount. Focus on read buffers.



00:00:00  device   %busy     avque     r+w/s    blks/s    avwait    avserv (-d)

01:00:00 Sdsk-0     0.18      2.02      0.11      0.43     17.31     16.89

02:00:00 Sdsk-0     0.07      1.96      0.06      0.18     12.37     12.95

...

11:00:02 Sdsk-0     9.17      5.48      5.71     19.84     71.92     16.04

11:20:02 Sdsk-0    70.84      1.32     72.72    739.53      3.13      9.74

11:40:03 Sdsk-0   100.00      1.09    120.22   2666.68      0.95     10.67

12:00:05 Sdsk-0   100.00      1.11    108.81   2038.57      2.07     18.99

12:20:00 Sdsk-0    83.51      1.10     59.97    854.50      1.38     13.93

12:40:00 Sdsk-0     7.43      2.85      5.80     20.99     23.61     12.80

13:00:00 Sdsk-0     5.52      3.56      3.55     13.34     39.77     15.54



This server has one disk.

%busy gets to 100% during peak times.

avque stays low.



Goal: %busy high and avque low.
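
Two rough figures can be derived from the -d rows. First, r+w/s multiplied by avserv approximates how much disk service time is being demanded per second. Second, blks/s divided by r+w/s gives the average request size, which is a hint (not proof) about whether access is random or sequential. The sketch assumes avserv is in milliseconds and blks/s counts 512-byte blocks; both are assumptions on my part about how this sar reports them.

```python
# (time, %busy, r+w/s, blks/s, avserv) copied from the sar -d rows above.
disk_rows = [
    ("11:00:02", 9.17, 5.71, 19.84, 16.04),
    ("11:20:02", 70.84, 72.72, 739.53, 9.74),
    ("11:40:03", 100.00, 120.22, 2666.68, 10.67),
    ("12:00:05", 100.00, 108.81, 2038.57, 18.99),
    ("12:20:00", 83.51, 59.97, 854.50, 13.93),
    ("12:40:00", 7.43, 5.80, 20.99, 12.80),
    ("13:00:00", 5.52, 3.55, 13.34, 15.54),
]

BLOCK_BYTES = 512  # assumption: sar -d blks/s is in 512-byte blocks


def service_demand_pct(requests_per_s, avserv_ms):
    """Service time demanded per second of wall time, as a percentage."""
    return requests_per_s * avserv_ms / 1000.0 * 100.0


def avg_request_kb(blocks_per_s, requests_per_s):
    """Average transfer size per request in KB."""
    return blocks_per_s / requests_per_s * BLOCK_BYTES / 1024.0


for time, busy, rw, blks, avserv in disk_rows:
    print(f"{time}  reported %busy={busy:6.2f}  "
          f"demand={service_demand_pct(rw, avserv):6.1f}%  "
          f"avg request={avg_request_kb(blks, rw):5.1f} KB")
```

Below saturation the computed demand tracks the reported %busy closely (about 70.8 vs 70.84 at 11:20, 83.5 vs 83.51 at 12:20). In the two intervals reported as 100% busy it comes out well above 100%, which suggests the disk is being asked for more than it can deliver. Average requests in the busy intervals are roughly 5 to 11 KB. Small transfers at around 100 requests per second would be consistent with random access, but sar does not record seek distances, so this does not settle the question.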



Action: ???



00:00:00  c_hits cmisses (hit %) (-n)

01:00:00   17187    1246 (93%)

02:00:00    7137     237 (96%)

...

11:00:02   39199    2529 (93%)

11:20:02 2026739   29218 (98%)

11:40:03  287138   10446 (96%)

12:00:05   86565    7153 (92%)

12:20:00  191094   14335 (93%)

12:40:00   36252    2730 (92%)

13:00:00   57560    4362 (92%)



Our name cache is good.



*****

Conclusion: Disk I/O is the problem.

Cause: Disk too busy.

*****



Memory

-----------------



00:00:00  vflt/s  pflt/s pgfil/s  rclm/s (-p)

01:00:00    0.22    0.53    0.00    0.00

02:00:00    0.10    0.15    0.00    0.00

...

11:00:02    1.38    3.11    0.00    0.00

11:20:02    1.71    3.21    0.00    0.00

11:40:03    1.21    2.51    0.00    0.00

12:00:05    1.64    3.41    0.00    0.00

12:20:00    1.45    3.63    0.00    0.00

12:40:00    1.52    3.06    0.00    0.00

13:00:00    2.38    5.86    0.00    0.00



00:00:00 freemem freeswp (-r)

01:00:00   41117  188272

02:00:00   41124  188272

...

11:00:02   36736  188608

11:20:02   36855  188608

11:40:03   36728  188608

12:00:05   36699  188608

12:20:00   36731  188608

12:40:00   37041  188608

13:00:00   36794  188608



00:00:00 swpin/s bswin/s swpot/s bswot/s pswch/s (-w)

01:00:00    0.01     0.0    0.00     0.0       2

02:00:00    0.01     0.0    0.00     0.0       2

...

11:00:02    0.02     0.2    0.00     0.0      15

11:20:02    0.03     0.2    0.00     0.0     394

11:40:03    0.03     0.2    0.00     0.0     237

12:00:05    0.04     0.3    0.00     0.0     143

12:20:00    0.03     0.2    0.00     0.0     146

12:40:00    0.04     0.3    0.00     0.0      12

13:00:00    0.08     0.6    0.00     0.0      11



Free memory stays high, and swapping and paging stay low. Memory is not a
factor.
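
To put the -r numbers in familiar units, here is a small conversion sketch. It assumes freemem is counted in 4 KB pages and freeswp in 512-byte blocks. The swap assumption agrees with the hardware listing at the end of this post, where nswap = 192512 corresponds to swapmem = 96256k. The page size is my assumption. It also compares free memory with the "i/o bufs = 6300k" figure from the same listing.

```python
PAGE_KB = 4          # assumption: freemem is reported in 4 KB pages
SWAP_BLOCK_KB = 0.5  # assumption: freeswp is reported in 512-byte blocks

# Values copied from the 12:00:05 row of sar -r and from the hardware listing.
freemem_pages = 36699
freeswp_blocks = 188608
io_bufs_kb = 6300    # "i/o bufs = 6300k"
nswap_blocks = 192512

free_mem_kb = freemem_pages * PAGE_KB
free_mem_mb = free_mem_kb / 1024.0
free_swap_mb = freeswp_blocks * SWAP_BLOCK_KB / 1024.0
swap_total_kb = nswap_blocks * SWAP_BLOCK_KB

# Buffer cache size relative to the memory sitting free at peak.
cache_share = io_bufs_kb / free_mem_kb

print(f"free memory : {free_mem_mb:.0f} MB")
print(f"free swap   : {free_swap_mb:.0f} MB")
print(f"total swap  : {swap_total_kb:.0f} KB")
print(f"buffer cache is {cache_share:.1%} of free memory")
```

Under these assumptions there is roughly 143 MB free at the busiest point of the day, while the buffer cache is only about 6 MB, so there appears to be plenty of room to enlarge the cache without pushing the system into paging.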



*****

Conclusion: Memory is not a factor.

*****



Hardware Information:



device    address       vector  dma     comment

----------------------------------------------------------------------------

%cpu      -             -  - unit=1 family=6 type=Pentium Pro

%cpuid    -             -  - unit=1 vend=GenuineIntel mod=5 step=1

%fpu      -             13 - unit=1 type=80387-compatible

%pci      0x0CF8-0x0CFF -  - am=1 sc=0 buses=1

%serial   0x03F8-0x03FF 4  - unit=0 type=Standard nports=1 fifo=yes

%console  -             -  - unit=vga type=0 12 screens=68k

%adapter  0xF800-0xF860 5  0 type=slha ha=0 id=7 Chip=53c875-E

%adapter  0xF400-0xF460 11 0 type=slha ha=1 id=7 Chip=53c875-E

%floppy   0x03F2-0x03F7 6  2 unit=0 type=135ds18

%kbmouse  0x0060-0x0064 12 - type=Keyboard mouse

%cd-rom   -             -  - type=S ha=0 id=5 lun=0 bus=0 ht=slha

%chey     -             -  - type=S ha=0 id=0 lun=0

%disk     -             -  - type=S ha=0 id=0 lun=0 bus=0 ht=slha

%Sdsk     -             -  -  cyls=1022 hds=138 secs=63 fts=stdb

mem: total = 65144k, kernel = 11348k, user = 53796k

swapdev = 1/41, swplo = 0, nswap = 192512, swapmem = 96256k

rootdev = 1/42, pipedev = 1/42, dumpdev = 1/41

kernel: Hz = 100, i/o bufs = 6300k



