On May 12, 2011, at 7:52 AM, Kevin Hildebrand wrote:

> One of the oddities I'm seeing, which has me grasping at write
> fragmentation and I/O sizes, may not be directly related to those things
> at all. Periodically, iostat will show that one or more of my OST disks
> will be running at 99% utilization. Reads per second is somewhere in the
> 150-200 range, while read kB/second is quite small.
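One way to catch this in the act is to watch the block device and the OST's own counters side by side. A minimal sketch, assuming the pegged device is /dev/sdb and the target is lustre-OST0000 (both placeholder names to substitute for the real ones):

    # in one window: per-device utilization, request rate and request size
    iostat -xk 5

    # in another: what the OST itself reports (llobdstat ships with Lustre)
    llobdstat lustre-OST0000 5

    # while the disk is pegged, snapshot the raw obdfilter counters
    cat /proc/fs/lustre/obdfilter/lustre-OST0000/stats

If iostat shows the device busy while the obdfilter counters barely move, the disk time is being spent below Lustre, which matches the symptom described above.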
That sounds familiar. You're probably experiencing these:

https://bugzilla.lustre.org/show_bug.cgi?id=24183
http://jira.whamcloud.com/browse/LU-15

Jason

> In addition, average request size is also very small. llobdstat output
> on the OST in question usually has zero, or very small, values for reads
> and writes, and values for stats/punches/creates/deletes in the ones and
> twos.
>
> While this is happening, Lustre starts complaining about 'slow commitrw',
> 'slow direct_io', etc. At this time, accesses from clients are usually
> hanging.
>
> Why would the disk(s) be pegged while llobdstat shows zero activity?
>
> After a few minutes in this state, the %util drops back down to single
> digit percentages and normal I/O resumes on the clients.
>
> Thanks,
> Kevin
>
> On Thu, 12 May 2011, Kevin Van Maren wrote:
>
>> Kevin Hildebrand wrote:
>>>
>>> The PERC 6 and H800 use megaraid_sas; I'm currently running
>>> 00.00.04.17-RH1.
>>>
>>> The max_sectors numbers (320) are what is being set by default- I am
>>> able to set it to something smaller than 320, but not larger.
>>
>> Right. You cannot set max_sectors_kb larger than max_hw_sectors_kb
>> (Linux normally defaults most drivers to 512, but Lustre sets them to
>> be the same). You may want to instrument your HBA driver to see what is
>> going on (i.e., why max_hw_sectors_kb is < 1024). I don't know whether
>> it is due to a driver limitation or a true hardware limit.
>>
>> Most drivers have a limit of 512KB by default; see Bug 22850 for the
>> patches that fixed the QLogic and Emulex fibre channel drivers.
>>
>> Kevin
>>
>>> Kevin
>>>
>>> On Wed, 11 May 2011, Kevin Van Maren wrote:
>>>
>>>> You didn't say, but I think they are LSI-based: are you using the
>>>> mptsas driver with the PERC cards? Which driver version?
>>>>
>>>> First, max_sectors_kb should normally be set to a power-of-2 number,
>>>> like 256, rather than an odd size like 320. This number should also
>>>> match the native RAID stripe size of the device, to avoid
>>>> read-modify-write cycles. (See Bug 22886 on why not to make it > 1024
>>>> in general.)
>>>>
>>>> See Bug 17086 for patches that raise the max_sectors_kb limit for the
>>>> mptsas driver to 1MB, or to the true hardware maximum, rather than a
>>>> driver limit; however, the hardware may still be limited to sizes
>>>> < 1MB.
>>>>
>>>> Also, to clarify the sizes: the smallest bucket >= transfer_size is
>>>> the one incremented, so a 320KB I/O increments the 512KB bucket. Since
>>>> your HW says it can only do a 320KB I/O, there will never be a 1MB I/O.
>>>>
>>>> You may want to instrument your HBA driver to see what is going on
>>>> (i.e., why max_hw_sectors_kb is < 1024).
>>>>
>>>> Kevin
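The tuning advice above can be tried directly from the shell. A minimal sketch, assuming /dev/sdb is one of the OST volumes and that 256 (KB) matches the array's full-stripe size; neither value is established in this thread, so check them against the actual RAID geometry first:

    # ceiling imposed by the HBA driver/hardware, and the current setting
    cat /sys/block/sdb/queue/max_hw_sectors_kb
    cat /sys/block/sdb/queue/max_sectors_kb

    # try a power-of-2 value no larger than max_hw_sectors_kb, ideally equal
    # to the array's full-stripe size (256 is only an example)
    echo 256 > /sys/block/sdb/queue/max_sectors_kb

    # as noted above, Lustre sets max_sectors_kb to match max_hw_sectors_kb,
    # so the value may need reapplying after the OSTs are mounted, e.g. from
    # rc.local; narrow the glob so it only touches the OST devices
    for q in /sys/block/sd*/queue/max_sectors_kb; do
        echo 256 > "$q"
    done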
>>>> Kevin Hildebrand wrote:
>>>>> Hi, I'm having some performance issues on my Lustre filesystem and it
>>>>> looks to me like it's related to I/Os getting fragmented before being
>>>>> written to disk, but I can't figure out why. This system is RHEL5,
>>>>> running Lustre 1.8.4.
>>>>>
>>>>> All of my OSTs look pretty much the same-
>>>>>
>>>>>                          read         |        write
>>>>> pages per bulk r/w    rpcs  %  cum %  |   rpcs  %  cum %
>>>>> 1:                   88811 38   38    |  46375 17   17
>>>>> 2:                    1497  0   38    |   7733  2   20
>>>>> 4:                    1161  0   39    |   1840  0   21
>>>>> 8:                    1168  0   39    |   7148  2   24
>>>>> 16:                    922  0   40    |   3297  1   25
>>>>> 32:                    979  0   40    |   7602  2   28
>>>>> 64:                   1576  0   41    |   9046  3   31
>>>>> 128:                  7063  3   44    |  16284  6   37
>>>>> 256:                129282 55  100    | 162090 62  100
>>>>>
>>>>>                          read         |        write
>>>>> disk fragmented I/Os   ios  %  cum %  |    ios  %  cum %
>>>>> 0:                   51181 22   22    |      0  0    0
>>>>> 1:                   45280 19   42    |  82206 31   31
>>>>> 2:                   16615  7   49    |  29108 11   42
>>>>> 3:                    3425  1   50    |  17392  6   49
>>>>> 4:                  110445 48   98    | 129481 49   98
>>>>> 5:                    1661  0   99    |   2702  1   99
>>>>>
>>>>>                          read         |        write
>>>>> disk I/O size           ios  %  cum % |    ios  %  cum %
>>>>> 4K:                   45889  8    8   |  56240  7    7
>>>>> 8K:                    3658  0    8   |   6416  0    8
>>>>> 16K:                   7956  1   10   |   4703  0    9
>>>>> 32K:                   4527  0   11   |  11951  1   10
>>>>> 64K:                 114369 20   31   | 134128 18   29
>>>>> 128K:                  5095  0   32   |  17229  2   31
>>>>> 256K:                  7164  1   33   |  30826  4   35
>>>>> 512K:                369512 66  100   | 465719 64  100
>>>>>
>>>>> Oddly, there's no 1024K row in the I/O size table...
>>>>>
>>>>> ...and these seem small to me as well, but I can't seem to change
>>>>> them. Writing new values to either doesn't change anything.
>>>>>
>>>>> # cat /sys/block/sdb/queue/max_hw_sectors_kb
>>>>> 320
>>>>> # cat /sys/block/sdb/queue/max_sectors_kb
>>>>> 320
>>>>>
>>>>> Hardware in question is DELL PERC 6/E and DELL PERC H800 RAID
>>>>> controllers, with MD1000 and MD1200 arrays, respectively.
>>>>>
>>>>> Any clues on where I should look next?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Kevin
>>>>>
>>>>> Kevin Hildebrand
>>>>> University of Maryland, College Park
>>>>> Office of Information Technology

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
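Putting the suggestions in this thread together, one rough way to check whether a retune actually changes the on-disk I/O pattern is to reset the brw_stats histograms, rerun a workload, and re-read them. A sketch under assumptions: lustre-OST0000 is a placeholder name, and the behaviour that writing to brw_stats clears its counters should be verified on the release in use.

    # clear the histograms, then run a known workload from a client
    echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats

    # ... large sequential write (dd, IOR, etc.) against this OST ...

    # re-read the "disk I/O size" section: with max_sectors_kb capped at 320,
    # the largest bucket that can be hit is 512K, since each I/O is counted in
    # the smallest bucket >= its size; that is why no 1024K row ever appears
    grep -A 12 "disk I/O size" /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats

    # and confirm the driver/hardware cap on every OST volume
    grep . /sys/block/sd*/queue/max_hw_sectors_kb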
