On 08/15/2012 10:26 PM, Dennis Jacobfeuerborn wrote:
> On 08/15/2012 03:54 AM, Phil Frost wrote:
>> I'd guess (and this is just a guess, I've never examined DRBD internals beyond what's in the manual) that the unusually high %util is due to the activity log [1] or perhaps some other housekeeping function.
>>
>> [1] http://www.drbd.org/users-guide/s-activity-log.html
>
> Thanks for the detailed explanation. I'm wondering though why something like this wouldn't be common knowledge and/or explained in the FAQ if this is a generic symptom of DRBD. Tomorrow I'm going to do some performance tests to see if this is a real problem or just a phantom issue.
It would probably be very educational to set up a DRBD device with external metadata, and then examine the disk activity during your benchmark with blktrace. This way you can see in great detail the requests being issued to the backing device, separate from those to the metadata device, and see how they compare when the DRBD device is presented with different loads. I'm sure plenty of people would appreciate the insight, myself included.
It is known more generally among storage professionals that many of the numbers given by iostat are more like guidelines, and they are often assumed to have a simpler meaning than they actually do. It's simply not possible for the simple metrics provided by the kernel to give a detailed or even accurate measurement of the performance of a complex storage device whose topology and properties the kernel knows nothing about. Just search for "understanding iostat", and you will get pages with titles like "When iostat Leads You Astray".
I didn't really feel I had a grasp on what all iostat's numbers mean until I read more about the kernel counters that are the source of iostat's calculations [1] and then read the sysstat source code to see just what it was doing. I've attached a draft document I wrote attempting to explain what I learned. You may find it useful.
[1] http://www.kernel.org/doc/Documentation/iostats.txt
Interesting observations of iostat
Introduction
If you have ever tried to characterize IO performance, you have probably heard of iostat. If not, it's a part of the sysstat package which monitors the block device statistics provided by Linux, among many other things. It can give an administrator great insight into what sort of IO load is being handled by a device, and how well it's keeping pace. An example run, taken from one of our nameservers, looks like this:
$ iostat -dx
Linux 2.6.32-5-xen-amd64 (ns1)  05/18/12  _x86_64_  (1 CPU)

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
xvdap2     0.00    0.17  1.24  0.37   39.70    4.34    27.34     0.09  57.30   4.22   0.68
xvdap1     1.22    3.91  3.57  0.49   38.26   35.18    18.11     0.29  72.37   1.32   0.54
The basics of this tool are well documented. There is of course the manual page. I also like this post by Domas Mituzas to explain the basics. I had read many such posts, but I was still confused by the numbers iostat was giving me. I'm going to explain the meaning of these numbers with some mathematical rigor, show how they are calculated, and explain some of the more odd ways in which they may behave.
iostat performs all its calculations from a handful of counters provided by the Linux kernel. They can be accessed either from /sys/block/*/stat or from /proc/diskstats. Let's give some names to the counters for the sake of this discussion (a short parsing sketch follows the list):
- rdios: number of read I/Os processed
- rdmerges: number of read I/Os merged with in-queue I/O
- rdsectors: number of sectors read
- rdticks: total wait time for read requests, in ms
- wrios: number of write I/Os processed
- wrmerges: number of write I/Os merged with in-queue I/O
- wrsectors: number of sectors written
- wrticks: total wait time for write requests, in ms
- iosprg: number of I/Os currently in flight
- totticks: total time this block device has been active, in ms
- rqticks: total wait time for all requests, in ms
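As a concrete illustration, here is a minimal Python sketch that reads these counters from /proc/diskstats. The field order follows Documentation/iostats.txt; newer kernels append further fields, which this sketch simply ignores:

    FIELDS = ("rdios", "rdmerges", "rdsectors", "rdticks",
              "wrios", "wrmerges", "wrsectors", "wrticks",
              "iosprg", "totticks", "rqticks")

    def read_diskstats(path="/proc/diskstats"):
        # Each line is: major minor device-name counters...
        stats = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                name = parts[2]
                values = [int(v) for v in parts[3:3 + len(FIELDS)]]
                stats[name] = dict(zip(FIELDS, values))
        return stats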
The Linux documentation explains the significance of these counters in greater detail. However, they are mostly obvious, except for the difference between totticks and rqticks. The difference there is key to the following discussion, so I'll elaborate.
totticks is incremented by 1 for every ms that passes where there is something in the queue. It doesn't matter whether there is 1 request in the queue or 100; this counter still increments by just 1 for each 1 ms of wall clock time that elapses.
rqticks differs because it's weighted by the number of requests in the queue. The aforementioned kernel documentation offers a more complicated explanation, but I find it easier to describe this field as the sum of rdticks and wrticks. Each time a request enters the queue, it begins incrementing rdticks or wrticks by 1 for each ms it spends in the queue. If there is more than 1 request in the queue, then these counters will be incremented by each waiting request. So, if in 1 ms of wall clock time, there were 2 requests in the queue, these counters would increment by 2, while totticks increments only by 1.
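To make the difference concrete, here is a toy Python model (not kernel code) that updates both counters the way just described, one 1 ms step at a time:

    def tick(queue_depth, totticks, rqticks):
        if queue_depth > 0:
            totticks += 1        # +1 per ms the queue is non-empty
        rqticks += queue_depth   # +1 per ms for each waiting request
        return totticks, rqticks

    totticks = rqticks = 0
    for depth in (2, 2, 0, 1):   # queue depth during four 1 ms steps
        totticks, rqticks = tick(depth, totticks, rqticks)
    print(totticks, rqticks)     # prints: 3 5

During the two milliseconds with 2 requests queued, rqticks grows twice as fast as totticks, just as in the example above.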
It should be obvious how most of iostat's numbers are calculated from these counters. r/s, for example, is the change in rdios over the sampling period, divided by the length of the period. The last four numbers are more interesting. Writing Δx for the change in counter x over a sampling period of Δt ms, they are:

    avgqu-sz = Δrqticks / Δt
    await    = (Δrdticks + Δwrticks) / (Δrdios + Δwrios)
    svctm    = Δtotticks / (Δrdios + Δwrios)
    %util    = 100 × Δtotticks / Δt
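Here is a minimal Python sketch of the same calculations, building on the read_diskstats() sketch above; it takes two counter snapshots for one device, taken dt_ms milliseconds apart:

    def iostat_metrics(prev, curr, dt_ms):
        d = {k: curr[k] - prev[k] for k in prev}   # counter deltas
        ios = d["rdios"] + d["wrios"] or 1         # guard against dividing by 0
        return {
            "r/s":      1000.0 * d["rdios"] / dt_ms,
            "w/s":      1000.0 * d["wrios"] / dt_ms,
            "avgqu-sz": d["rqticks"] / dt_ms,
            "await":    (d["rdticks"] + d["wrticks"]) / ios,  # ms per request
            "svctm":    d["totticks"] / ios,                  # ms per request
            "%util":    100.0 * d["totticks"] / dt_ms,
        }

    # Example use ("sda" is just a placeholder device name):
    import time
    prev = read_diskstats()
    time.sleep(5)
    curr = read_diskstats()
    print(iostat_metrics(prev["sda"], curr["sda"], 5000.0))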
It's notable that there's nothing instrumenting the difference between time spent waiting in the queue and time spent waiting for the disk to service a request. The descriptions of await and svctm might suggest there is, but look carefully at the counters provided by the kernel, and you will see there is not. svctm is simply inferred: if two requests were serviced in 10 ms of non-idle time, then on average 5 ms must have been spent servicing each one. It may make more sense to think of svctm as the inverse of the average IOPS over the time the device was not idle. If your device is never idle, Δtotticks is equal to the sample period, and then you can see svctm is exactly equal to 1/IOPS. This makes it a useful metric to monitor, since you should have some idea of the IOPS a device is capable of delivering, but you may not always be fully loading it. If a device is sometimes idle, r/s and w/s will be less than the device's capabilities, but svctm will not be.
Of course, the IOPS a device can deliver depends on the sort of load given to it. Sequential accesses will be faster than random accesses for spinning media. A more complicated case is a RAID device, which can achieve higher IOPS by servicing requests concurrently with multiple disks, but if there is only one request in line to be serviced, the advantage of concurrency is moot. Further, some IO schedulers may intentionally idle the device in anticipation of more IO. All of these mechanisms can decrease IOPS, and consequently increase svctm.
Now let's look at another example iostat run, one which confused me for a while. I noticed this after I had started monitoring iostat on all my servers, and I started receiving alerts for this server, which wasn't experiencing any application performance issues:
Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
xvdap3     0.00    1.03  1.32  0.74   10.18   14.17    11.79     0.26 126.68   2.67   0.55
Why is await so high? svctm tells me that the disk is responding at a very reasonable speed for a spinning disk. avgqu-sz, r/s, and w/s are low, and certainly within the capabilities of even the slowest disk found in modern hardware. Further, application performance is fine. Why then is await high?
To understand this, it's helpful to calculate another metric. We already have avgqu-sz, which tells us the average number of requests in the queue over the entire sampling period. However, if we divide this by the utilization (that is, %util divided by 100), we get the average queue length over the time the device was not idle. This leads to an interesting relationship between the last four iostat metrics (remember, I'm implicitly dividing %util by 100):

    await = (avgqu-sz / util) × svctm
That is, the average time a request must wait is the average length of the queue when the device is not idle, multiplied by the average time it takes the device to service one request. Rearranging a bit, we can make another statement which is useful when thinking about the IO system:

    await / svctm = avgqu-sz / util
That is, the factor by which await exceeds svctm is the average queue length over the time the device is not idle. "await should not be much greater than svctm" is advice I've heard many times. This is the reason behind that advice, but it's not good advice for all applications.
I can prove this mathematically by substituting the definitions of the metrics from above:

    (Δrdticks + Δwrticks) / (Δrdios + Δwrios) = [(Δrqticks / Δt) / (Δtotticks / Δt)] × [Δtotticks / (Δrdios + Δwrios)]
The common denominators in the complex fractions can be eliminated:

    (Δrdticks + Δwrticks) / (Δrdios + Δwrios) = Δrqticks / (Δrdios + Δwrios)

which holds because, as noted above, rqticks is the sum of rdticks and wrticks.
So why was await so high in my example? It's because the IO was very bursty. Though the device was busy only 0.55% of the time, it had 47 requests on average in the queue (avgqu-sz / util, or 0.26 / 0.0055) when it wasn't idle.
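As a quick sanity check of the relationship in Python, using the figures as printed:

    avgqu_sz, util, svctm = 0.26, 0.0055, 2.67
    busy_queue = avgqu_sz / util   # ~47.3 requests in line while not idle
    print(busy_queue * svctm)      # ~126.2 ms, matching the reported await
                                   # of 126.68 up to rounding of the inputs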
Newer versions of sysstat calculate await separately for reads and writes, but doing the same manually, I saw that await for reads was very low, while await for writes was very high. This doesn't present an application performance issue, because the writes are mostly generated by periodic writing of log files and other things which can complete asynchronously in the background. The CFQ scheduler, the default in most current Linux distributions, generally favors reads over writes for this reason.
So, what practical conclusions can we make?
- svctm should be no greater than 1 divided by the worst case IOPS you expect the device to deliver when not idle.
- await significantly higher than svctm indicates that requests are waiting in long lines, even if the device is idle most of the time. However, this may be benign when it is caused by dirty page flushes from log writes and similar background activity.
- It may be good to monitor that %util is not approaching 100%. However, in some configurations (loaded database servers with large RAIDs) %util may always be 100%. In this case, monitoring avgqu-sz may make more sense, since as it approaches the maximum concurrency of the device, the device approaches saturation. A rough sketch of such checks follows this list.
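As an illustration only, here is a rough Python sketch of such checks built on the iostat_metrics() sketch from earlier; the thresholds (an assumed worst case of 150 IOPS and a queue limit of 32) are placeholders, not recommendations:

    def check(m, worst_case_iops=150, max_busy_queue=32):
        problems = []
        if m["svctm"] > 1000.0 / worst_case_iops:      # svctm is in ms
            problems.append("svctm exceeds 1/IOPS expected of this device")
        busy = m["%util"] / 100.0
        if busy > 0 and m["avgqu-sz"] / busy > max_busy_queue:
            problems.append("long queues while busy; await will be high")
        return problems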
