I am studying a SCO OSR5 system that has serious performance issues. I am interested in hearing your critiques, suggestions, and corrections about my performance analysis, suggested courses of action, and, in general, anything else that you feel is relevant. I also hope that this post and resulting thread will help others in troubleshooting performance issues on a SCO or Unix server, since it can be something of a mystic art.
I have been collecting data via sar for several weeks. This post includes snippets of sar data for a single representative day, along with my analysis. The complete hardware information is at the bottom of this post. The short version: Pentium Pro, 192MB of RAM, and a single SCSI disk.

Problem: Users report very slow response times from the primary application, a Dataflex-based database application.

Summary of findings: The sar data indicates to me that disk I/O is the sole cause of the performance issues. The client runs a database application (Dataflex) as the sole application, used by around 35 concurrent users during working hours (7 AM to 5 PM). A network backup begins in the afternoon and taxes the disk, but I am not counting that as an issue because it is performed during non-working hours and is not indicative of normal use by users.

Initial thoughts on corrective actions:

1. Increase read buffers by a very large amount. The server does mostly read operations on the disk, and my goal should be to keep %rcache at or above 90%.

2. Buy a faster SCSI disk (or disks). I am not positive about the speed, but I believe that the current disk is 7200 RPM. The database software does do caching, but only on a per-user basis. That is, there is no centralized server: Dataflex runs per user, and each user has their own cache. I believe this may cause reads to be highly random across the disk, meaning that a higher rotational speed could yield a major gain. (I would like to be able to confirm whether the server tends to do more random access reads or sequential reads in general.) I could spread the major database files across multiple disks, but that puts me in a risky situation, since the MTBF of the set as a whole decreases drastically with each disk added.

3. Move to RAID-1, which has very good read performance, especially for random access reads. The danger of RAID-0 prevents its use.
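To put a rough number on item 2: average rotational latency is the time for half a revolution, so it follows directly from spindle speed. The sketch below assumes the current disk really is 7200 RPM (not confirmed) and compares it with 10000 and 15000 RPM, which are example speeds only. It ignores seek time and transfer time.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency in milliseconds (half of one revolution)."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


if __name__ == "__main__":
    # 7200 RPM is the assumed current disk; the other speeds are examples.
    for rpm in (7200, 10000, 15000):
        print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```

These are per-request averages for the rotational component only. For highly random reads, seek time adds to every request as well, so the figures are a lower bound on what a faster spindle would save.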
Here is the sar data and my general thoughts:

CPU
-----------------
00:00:00    %usr    %sys    %wio   %idle (-u)
01:00:00       0       0       0     100
02:00:00       0       0       0     100
...
11:00:02       0       2       4      93
11:20:02       5      17      35      43
11:40:03      12      25      45      18
12:00:05       9      19      64       8
12:20:00       5      18      35      42
12:40:00       0       2       5      93
13:00:00       0       2       3      95

Very low %usr and %sys along with very high %wio.
Action: review disk i/o

00:00:00 runq-sz %runocc swpq-sz %swpocc (-q)
01:00:00     1.0       0
02:00:00
...
11:00:02     1.0       0
11:20:02     1.1       4
11:40:03     1.0       5
12:00:05     1.0       4
12:20:00     1.0       2
12:40:00     1.0       1
13:00:00     1.5       0
13:20:00     3.0       0
13:40:00
14:00:01     2.0       0

runq-sz > 2 at times, but %runocc always < 90%, so we are fine on CPU.

***** Conclusion: CPU is not a factor. *****

Disk - contributing factor
-----------------
00:00:00  iget/s namei/s dirbk/s (-a)
01:00:00       5       1       1
02:00:00       2       0       0
...
11:00:02      33       7      12
11:20:02    1814     214     469
11:40:03     275      61     860
12:00:05      82      22    1561
12:20:00     187      50    2677
12:40:00      31       6      27
13:00:00      48       9      34

The ratio of iget/s to namei/s is very high. Bad filesystem layout?
Action: ???

00:00:00 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s (-b)
01:00:00       0       2      93       0       0      50       0       0
02:00:00       0       1      97       0       0      26       0       0
...
11:00:02       7      87      92       3       7      63       0       0
11:20:02     362    1392      74       7     156      95       0       0
11:40:03    1330    2831      53       4      36      90       0       0
12:00:05    1015    3616      72       4      17      78       0       0
12:20:00     422    7421      94       5      32      84       0       0
12:40:00       9     111      92       1       6      75       0       0
13:00:00       5      86      94       1       4      67       0       0

The ratio of lread/s to lwrit/s is very, very high. (Eyeball average around 60:1.) This server does mostly read operations, by far. The %rcache is very low around peak times. We want to keep %rcache around 90% or higher, even during peak times, for optimal performance. The %wcache also goes < 90%, but not a factor vs. %rcache?
Action: Increase disk buffers by a large amount. Focus on read buffers.

00:00:00 device   %busy   avque   r+w/s  blks/s  avwait  avserv (-d)
01:00:00 Sdsk-0    0.18    2.02    0.11    0.43   17.31   16.89
02:00:00 Sdsk-0    0.07    1.96    0.06    0.18   12.37   12.95
...
11:00:02 Sdsk-0    9.17    5.48    5.71   19.84   71.92   16.04
11:20:02 Sdsk-0   70.84    1.32   72.72  739.53    3.13    9.74
11:40:03 Sdsk-0  100.00    1.09  120.22 2666.68    0.95   10.67
12:00:05 Sdsk-0  100.00    1.11  108.81 2038.57    2.07   18.99
12:20:00 Sdsk-0   83.51    1.10   59.97  854.50    1.38   13.93
12:40:00 Sdsk-0    7.43    2.85    5.80   20.99   23.61   12.80
13:00:00 Sdsk-0    5.52    3.56    3.55   13.34   39.77   15.54

This server has one disk. %busy gets to 100% during peak times. avque stays low. Goal: %busy high and avque low.
Action: ???

00:00:00  c_hits cmisses (hit %) (-n)
01:00:00   17187    1246   (93%)
02:00:00    7137     237   (96%)
...
11:00:02   39199    2529   (93%)
11:20:02 2026739   29218   (98%)
11:40:03  287138   10446   (96%)
12:00:05   86565    7153   (92%)
12:20:00  191094   14335   (93%)
12:40:00   36252    2730   (92%)
13:00:00   57560    4362   (92%)

Our name cache is good.

***** Conclusion: Disk I/O is the problem. Cause: Disk too busy. *****

Memory
-----------------
00:00:00  vflt/s  pflt/s pgfil/s  rclm/s (-p)
01:00:00    0.22    0.53    0.00    0.00
02:00:00    0.10    0.15    0.00    0.00
...
11:00:02    1.38    3.11    0.00    0.00
11:20:02    1.71    3.21    0.00    0.00
11:40:03    1.21    2.51    0.00    0.00
12:00:05    1.64    3.41    0.00    0.00
12:20:00    1.45    3.63    0.00    0.00
12:40:00    1.52    3.06    0.00    0.00
13:00:00    2.38    5.86    0.00    0.00

00:00:00 freemem freeswp (-r)
01:00:00   41117  188272
02:00:00   41124  188272
...
11:00:02   36736  188608
11:20:02   36855  188608
11:40:03   36728  188608
12:00:05   36699  188608
12:20:00   36731  188608
12:40:00   37041  188608
13:00:00   36794  188608

00:00:00 swpin/s bswin/s swpot/s bswot/s pswch/s (-w)
01:00:00    0.01     0.0    0.00     0.0       2
02:00:00    0.01     0.0    0.00     0.0       2
...
11:00:02    0.02     0.2    0.00     0.0      15
11:20:02    0.03     0.2    0.00     0.0     394
11:40:03    0.03     0.2    0.00     0.0     237
12:00:05    0.04     0.3    0.00     0.0     143
12:20:00    0.03     0.2    0.00     0.0     146
12:40:00    0.04     0.3    0.00     0.0      12
13:00:00    0.08     0.6    0.00     0.0      11

Memory high, swap and paging low. Memory is not a factor.

***** Conclusion: Memory is not a factor. *****

Hardware Information:

device     address        vector  dma  comment
----------------------------------------------------------------------------
%cpu       -              -       -    unit=1 family=6 type=Pentium Pro
%cpuid     -              -       -    unit=1 vend=GenuineIntel mod=5 step=1
%fpu       -              13      -    unit=1 type=80387-compatible
%pci       0x0CF8-0x0CFF  -       -    am=1 sc=0 buses=1
%serial    0x03F8-0x03FF  4       -    unit=0 type=Standard nports=1 fifo=yes
%console   -              -       -    unit=vga type=0 12 screens=68k
%adapter   0xF800-0xF860  5       0    type=slha ha=0 id=7 Chip=53c875-E
%adapter   0xF400-0xF460  11      0    type=slha ha=1 id=7 Chip=53c875-E
%floppy    0x03F2-0x03F7  6       2    unit=0 type=135ds18
%kbmouse   0x0060-0x0064  12      -    type=Keyboard mouse
%cd-rom    -              -       -    type=S ha=0 id=5 lun=0 bus=0 ht=slha
%chey      -              -       -    type=S ha=0 id=0 lun=0
%disk      -              -       -    type=S ha=0 id=0 lun=0 bus=0 ht=slha
%Sdsk      -              -       -    cyls=1022 hds=138 secs=63 fts=stdb

mem: total = 65144k, kernel = 11348k, user = 53796k
swapdev = 1/41, swplo = 0, nswap = 192512, swapmem = 96256k
rootdev = 1/42, pipedev = 1/42, dumpdev = 1/41
kernel: Hz = 100, i/o bufs = 6300k
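Going back to the sar -b figures in the disk section: the read-to-write ratio and the cache hit rate can be computed instead of eyeballed. This is a minimal sketch that hard-codes the seven daytime rows shown in this post. The formula %rcache = 100 * (1 - bread/lread) is an assumption about how sar derives the column, and the function names are made up for illustration.

```python
# Rows copied from the sar -b output in this post:
# (time, bread/s, lread/s, reported %rcache, lwrit/s)
ROWS = [
    ("11:00:02", 7, 87, 92, 7),
    ("11:20:02", 362, 1392, 74, 156),
    ("11:40:03", 1330, 2831, 53, 36),
    ("12:00:05", 1015, 3616, 72, 17),
    ("12:20:00", 422, 7421, 94, 32),
    ("12:40:00", 9, 111, 92, 6),
    ("13:00:00", 5, 86, 94, 4),
]


def rcache_pct(bread, lread):
    """Assumed formula: share of logical reads served without a physical read."""
    return 100.0 * (1.0 - bread / lread)


def read_write_ratio(rows):
    """Sum of lread/s divided by sum of lwrit/s over the given rows."""
    total_reads = sum(row[2] for row in rows)
    total_writes = sum(row[4] for row in rows)
    return total_reads / total_writes


if __name__ == "__main__":
    for time, bread, lread, reported, _ in ROWS:
        computed = rcache_pct(bread, lread)
        print(f"{time}  computed {computed:5.1f}%  reported {reported}%")
    print(f"lread:lwrit over these rows = {read_write_ratio(ROWS):.1f}:1")
```

On these rows the assumed formula rounds to the reported %rcache in every interval, and the overall ratio comes out close to the 60:1 eyeball estimate. Summing the rates treats every interval as equally long, which is only approximately true for this sample.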
