Re: raw/block device disc throughput

On Fri, 25 May 2012, Edgar Fuß wrote:

> Thanks for the most insightful explanation!
>
> > Also keep in mind:
> Yes, sure. That's why I would have expected the raw device to outperform even
> at lower block sizes.

No. For small block sizes the overhead of the copyin() is more than offset by the larger buffer cache block size. And the I/O operation is asynchronous with respect to the write() system call. With the character device the I/O operation must complete before the write() returns, so the I/O operations cannot be combined and you suffer the overhead of each one.

Eduardo
Re: raw/block device disc throughput

Thanks for the most insightful explanation!

> Also keep in mind:
Yes, sure. That's why I would have expected the raw device to outperform even at lower block sizes.
Re: raw/block device disc throughput

> In this case, dd has to block after each disk write to wait for its
> buffer to be (unnecessarily, as it happens, though it can't know that)
> zeroed for the next write. This both imposes additional delay and
> enforces a lack of overlap between each write and the next.
But can't we safely assume that reading /dev/zero takes zero time? I've checked that I indeed get about 10GB/s from /dev/zero.
Re: raw/block device disc throughput

On Thu, 24 May 2012, Thor Lancelot Simon wrote:

> On Thu, May 24, 2012 at 05:31:43PM +, Eduardo Horvath wrote:
> >
> > With large transfers (larger than MAXPHYS) the writes are split up into
> > MAXPHYS chunks and the disk handles them in parallel, hence the
> > performance increase even beyond MAXPHYS.
>
> Is this actually true? For requests from userspace via the raw device,
> does physio actually issue the smaller chunks in parallel?

Depends... in this case it's true. physio() breaks the iov into chunks, allocates a buf for each chunk, and calls the strategy() routine on each buf without waiting for completion. So on a controller that does tagged queuing, they run in parallel.

Eduardo
Re: raw/block device disc throughput

On Thu, May 24, 2012 at 05:31:43PM +, Eduardo Horvath wrote:
>
> With large transfers (larger than MAXPHYS) the writes are split up into
> MAXPHYS chunks and the disk handles them in parallel, hence the
> performance increase even beyond MAXPHYS.

Is this actually true? For requests from userspace via the raw device, does physio actually issue the smaller chunks in parallel?

Thor
Re: raw/block device disc throughput

> The block device will cause readahead at the OS layer.
But I'm writing, not reading!

> The increase is again tied to less latency in the synchronous dd read-write
> loop. The kernel breaks the large request down to many MAXPHYS sized ones
> and dispatches each in turn. I can't remember whether it is really
> asynchronous or whether it waits for each request to complete before issuing
> the next; if the former, it's effectively double-buffering for you.
But would you expect an eightfold increase in throughput just by that?

> What does read performance look like? I would be particularly interested to
> know what it looks like if you use a tool like "buffer" or "ddd" that
> double-buffers the I/O for you. It should be roughly twice the single-disk
> rate or something is wrong with RAIDframe (or, at least, suboptimal).
I will test that. I can't right now because I have no physical access atm.

On an identical machine, but with the RAID's SectorsPerSU=16, I get 99MByte/s from the raw device at 16k blocks, 191 at 64k blocks, and roughly the same at 1M blocks. On the block device, I get 19MB/s independent of block size. All with dd.
Re: raw/block device disc throughput

On Thu, 24 May 2012, Edgar Fuß wrote:

> > Keep in mind mpt uses a rather inefficient communication protocol and does
> > tagged queuing.
> You mean the protocol the main CPU uses to communicate with an MPT adapter is
> inefficient? Or do you mean SAS is inefficient?

The protocol used to communicate between the CPU and the adapter is inefficient. Not well designed. They redesigned it for SAS2.

> > The former means the overhead for each command is not so good, but the
> > latter means it can keep lots of commands in the air at the same time.
> I'm sorry, I'm unable to conclude why this explains my results.

dd will send the kernel individual write operations. sd and physio() will break them up into MAXPHYS chunks. Each chunk will be queued at the HBA. The HBA will dispatch them all as fast as it can. Tagged queuing will overlap them. With smaller transfers, the setup overhead becomes significant and you see poor performance. With large transfers (larger than MAXPHYS) the writes are split up into MAXPHYS chunks and the disk handles them in parallel, hence the performance increase even beyond MAXPHYS.

Also keep in mind: when using the block device, the data is copied from the process buffer into the buffer cache and the I/O happens from the buffer cache pages. When using the raw device, the I/O happens directly from process memory; no copying is involved.

Eduardo
Re: raw/block device disc throughput

> What's "od="?
Fat fingers. of=, of course.

> Not awfully surprising given your setup.
Ah, good.

> Keep in mind mpt uses a rather inefficient communication protocol and does
> tagged queuing.
You mean the protocol the main CPU uses to communicate with an MPT adapter is inefficient? Or do you mean SAS is inefficient?

> The former means the overhead for each command is not so good, but the
> latter means it can keep lots of commands in the air at the same time.
I'm sorry, I'm unable to conclude why this explains my results.

> Now you're just complicating things 8^).
Sometimes, probably.

> Let's see, RAID 1 is striping.
Sorry, RAID 1 is mirroring. So everything needs to be sent to two discs in parallel. I can't believe the CPU-HBA communication is the bottleneck?
Re: raw/block device disc throughput

> > dd if=/dev/zero od=/dev/[r]sd0b bs=nn, count=xxx.
(I've been assuming od= should be of=)

> The block device will cause readahead at the OS layer.
I thought of that too, but didn't mention it, because it's not relevant. dd isn't reading from the disk; it's writing to it.

> I suspect that if you double-buffered at the client application layer
> this effect might disappear,
I suspect this is a significant effect. I was once using dd to copy from one disk to another, and both drives happened to have activity lights on them. Watching each drive wait for the other convinced me dd is an inefficient way to do that. I built a program that uses two processes, one reading and one writing, with a large chunk of memory shared between them for buffer space. Disk-to-disk copies (not on the same spindle) got significantly faster. :)

In this case, dd has to block after each disk write to wait for its buffer to be (unnecessarily, as it happens, though it can't know that) zeroed for the next write. This both imposes additional delay and enforces a lack of overlap between each write and the next.

I speculate that the cooked device helps because it means that dd's write finishes when the bits are in the buffer cache, rather than waiting for them to hit disk. Flushing from the buffer cache to the disk then (a) gets overlapped with zeroing memory for the next cycle and (b) allows writes to adjacent disk blocks to be collapsed as they get pushed from the buffer cache to the drive. Unless the host is unusually slow, writing to the disk will be the limiting factor here, meaning the buffer cache will have a large number of writes pending, so coalescing writes is plausible, even likely.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: raw/block device disc throughput

> I don't know your system (you don't say which port you're running on,
> for example)
Ah, sorry: amd64.
Re: raw/block device disc throughput

On Thu, May 24, 2012 at 06:26:45PM +0200, Edgar Fuß wrote:

> It seems that I have to update my understanding of raw and block devices
> for discs.
>
> Using a (non-recent) 6.0_BETA INSTALL kernel and an ST9146853SS 15k SAS disc
> behind an LSI SAS 1068E (i.e. mpt(4)), I did a
> dd if=/dev/zero od=/dev/[r]sd0b bs=nn count=xxx.
> For the raw device, the throughput dramatically increased with the block size:
>   Block size            16k  64k  256k   1M
>   Throughput (MByte/s)    4   15    49  112
> For the block device, throughput was around 81MByte/s independent of block
> size.
>
> This surprised me in two ways:
> 1. I would have expected the raw device to outperform the block device
>    with not too small block sizes.

The block device will cause readahead at the OS layer. Since you are accessing the disk sequentially, this will have a significant effect -- evidently greater than the overhead caused by memory allocation in the cache layer under the block device. I suspect that if you double-buffered at the client application layer this effect might disappear, since the drive itself will already read ahead, and if we can present it with enough requests at once, it should return the results simultaneously. Plain dd on the raw device will not do that, since it waits for every read to complete before issuing another, thus increasing latency and reducing the number of transactions the drive can effectively overlap.

> 2. I would have expected increasing the block size above MAXPHYS not
>    to improve the performance.

The increase is again tied to less latency in the synchronous dd read-write loop. The kernel breaks the large request down into many MAXPHYS-sized ones and dispatches each in turn. I can't remember whether it is really asynchronous or whether it waits for each request to complete before issuing the next; if the former, it's effectively double-buffering for you.

> I then built a RAID 1 with SectorsPerSU=128 (i.e. a 64k stripe size) on two
> of these discs, and, after the parity initialisation was complete, wrote
> to [r]raid0b.
> On the raw device, throughput ranged from 4MByte/s to 97MByte/s depending on
> bs.

What does read performance look like? I would be particularly interested to know what it looks like if you use a tool like "buffer" or "ddd" that double-buffers the I/O for you. It should be roughly twice the single-disk rate or something is wrong with RAIDframe (or, at least, suboptimal).

> On the block device, it was always 3MByte/s. Furthermore, dd's WCHAN was
> "vnode" for the whole run. Why is that so and why is throughput so low?

I would guess locking, or, somehow, extra buffering. It may be waiting on the lock for the vnode for the block device?

-- 
Thor Lancelot Simon                                  t...@panix.com
"The liberties...lose much of their value whenever those who have greater
private means are permitted to use their advantages to control the course
of public debate."                                        -John Rawls
Re: raw/block device disc throughput

On Thu, 24 May 2012, Edgar Fuß wrote:

> It seems that I have to update my understanding of raw and block devices
> for discs.
>
> Using a (non-recent) 6.0_BETA INSTALL kernel and an ST9146853SS 15k SAS disc
> behind an LSI SAS 1068E (i.e. mpt(4)), I did a
> dd if=/dev/zero od=/dev/[r]sd0b bs=nn count=xxx.

What's "od="?

> For the raw device, the throughput dramatically increased with the block size:
>   Block size            16k  64k  256k   1M
>   Throughput (MByte/s)    4   15    49  112
> For the block device, throughput was around 81MByte/s independent of block
> size.
>
> This surprised me in two ways:
> 1. I would have expected the raw device to outperform the block device
>    with not too small block sizes.
> 2. I would have expected increasing the block size above MAXPHYS not
>    to improve the performance.
>
> So obviously, my understanding is wrong.

Not awfully surprising given your setup. Keep in mind mpt uses a rather inefficient communication protocol and does tagged queuing. The former means the overhead for each command is not so good, but the latter means it can keep lots of commands in the air at the same time.

> I then built a RAID 1 with SectorsPerSU=128 (i.e. a 64k stripe size) on two
> of these discs, and, after the parity initialisation was complete, wrote
> to [r]raid0b.
> On the raw device, throughput ranged from 4MByte/s to 97MByte/s depending on
> bs.
> On the block device, it was always 3MByte/s. Furthermore, dd's WCHAN was
> "vnode" for the whole run. Why is that so and why is throughput so low?

Now you're just complicating things 8^). Let's see, RAID 1 is striping. That means all operations are broken at 64K boundaries so they can be sent to different disks. And split operations need to wait for all the devices to complete before the master operation can be completed. I expect you would probably get some rather unusual non-linear behavior in this sort of setup.

Eduardo
Re: raw/block device disc throughput

> It seems that I have to update my understanding of raw and block
> devices for discs.
[...performance oddities...]

Mostly I have nothing useful to say here. But...

> 2. I would have expected increasing the block size above MAXPHYS not
>    to improve the performance.

There is at least one aspect of performance that will not be cut off by MAXPHYS, that being syscall overhead. I don't know your system (you don't say which port you're running on, for example), but if syscall overhead for your hardware is not ignorably small compared to the cost of doing the disk transfer, then doing one syscall per 256K will be four times as costly in syscall overhead as doing one syscall per 1M, even though the total disk-transfer cost is the same.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: raw/block device disc throughput

On Thu, 24 May 2012, Edgar Fuß wrote:

> I then built a RAID 1 with SectorsPerSU=128 (i.e. a 64k stripe size) on two
> of these discs, and, after the parity initialisation was complete, wrote
> to [r]raid0b.
> On the raw device, throughput ranged from 4MByte/s to 97MByte/s depending on
> bs.
> On the block device, it was always 3MByte/s. Furthermore, dd's WCHAN was
> "vnode" for the whole run. Why is that so and why is throughput so low?

What is the partition alignment on the raid? There's a discussion of the alignment impact in my raidframe how-to on the NetBSD wiki.

-
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-