Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

Chris Webb ch...@arachsys.com writes:

> Okay. What I was driving at in describing these systems as 'already
> broken' is that they will already lose data (in this sense) if they're
> run on bare metal with normal commodity SATA disks with their 32MB write
> caches on. That configuration surely describes the vast majority of
> PC-class desktops and servers! If I understand correctly, your point here
> is that the small cache on a real SATA drive gives a relatively small
> time window for data loss, whereas the worry with cache=writeback is that
> the host page cache can be gigabytes, so the time window for unsynced
> data to be lost is potentially enormous. Isn't the fix for that just
> forcing periodic sync on the host to bound above the time window for
> unsynced data loss in the guest?

For the benefit of the archives, it turns out the simplest fix for this is
already implemented as a vm sysctl in linux. Set vm.dirty_bytes to 33554432
(i.e. 32 << 20), and the size of the dirty page cache is bounded above by
32MB, so we are simulating exactly the case of a SATA drive with a 32MB
writeback cache. Unless I'm missing something, the risk to guest OSes in
this configuration should therefore be exactly the same as the risk from
running on normal commodity hardware with such drives and no expensive
battery-backed RAM.

Cheers, Chris.

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of
a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
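For the record, the sysctl mentioned above can be made persistent with a
config fragment; the value below is just the thread's 32MB example
(32 << 20 = 33554432 bytes):

```
# /etc/sysctl.conf fragment: cap the host's dirty page cache at 32MB,
# simulating a commodity SATA drive's 32MB volatile write cache
vm.dirty_bytes = 33554432
```

Note that setting vm.dirty_bytes to a non-zero value makes the kernel use
it in place of vm.dirty_ratio.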
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On 03/22/2010 11:04 PM, Chris Webb wrote:
> Chris Webb ch...@arachsys.com writes:
>> Okay. What I was driving at in describing these systems as 'already
>> broken' is that they will already lose data (in this sense) if they're
>> run on bare metal with normal commodity SATA disks with their 32MB
>> write caches on. That configuration surely describes the vast majority
>> of PC-class desktops and servers! If I understand correctly, your point
>> here is that the small cache on a real SATA drive gives a relatively
>> small time window for data loss, whereas the worry with cache=writeback
>> is that the host page cache can be gigabytes, so the time window for
>> unsynced data to be lost is potentially enormous. Isn't the fix for
>> that just forcing periodic sync on the host to bound above the time
>> window for unsynced data loss in the guest?
>
> For the benefit of the archives, it turns out the simplest fix for this
> is already implemented as a vm sysctl in linux. Set vm.dirty_bytes to
> 33554432 (i.e. 32 << 20), and the size of the dirty page cache is
> bounded above by 32MB, so we are simulating exactly the case of a SATA
> drive with a 32MB writeback cache. Unless I'm missing something, the
> risk to guest OSes in this configuration should therefore be exactly the
> same as the risk from running on normal commodity hardware with such
> drives and no expensive battery-backed RAM.

A host crash will destroy your data. If your machine is connected to a
UPS, only a firmware crash can destroy your data.

--
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

Avi Kivity a...@redhat.com writes:
> On 03/22/2010 11:04 PM, Chris Webb wrote:
>> Unless I'm missing something, the risk to guest OSes in this
>> configuration should therefore be exactly the same as the risk from
>> running on normal commodity hardware with such drives and no expensive
>> battery-backed RAM.
>
> A host crash will destroy your data. If your machine is connected to a
> UPS, only a firmware crash can destroy your data.

Yes, that's a good point: in this configuration a host crash is equivalent
to a power failure rather than an OS crash in terms of data loss.

Cheers, Chris.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On Tue, 2010-03-16 at 11:05 +0200, Avi Kivity wrote:
>> Not really. In many cloud environments, there's a set of common images
>> that are instantiated on each node. Usually this is because you're
>> running a horizontally scalable application or because you're
>> supporting an ephemeral storage model.
>
> But will these servers actually benefit from shared cache? So the images
> are shared, they boot up, what then?
> - apache really won't like serving static files from the host pagecache
> - dynamic content (java, cgi) will be mostly in anonymous memory, not
>   pagecache
> - ditto for application servers
> - what else are people doing?

Think of an OpenVZ-style model where you're renting out a bunch of
relatively tiny VMs and they're getting used pretty sporadically. They
either have relatively little memory, or they've been ballooned down to a
pretty small footprint. The more you shrink them down, the more similar
they become. You'll end up having things like init, cron, apache, bash and
libc start to dominate the memory footprint in the VM. That's *certainly*
a case where this makes a lot of sense.

-- Dave
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote:
> If the batch size is larger than the virtio queue size, or if there are
> no flushes at all, then yes the huge write cache gives more opportunity
> for reordering. But we're already talking hundreds of requests here.

Yes. And remember those don't have to come from the same host. Also
remember that we rather limit excessive reordering of O_DIRECT requests in
the I/O scheduler, because they are synchronous-type I/O, while we don't
do that for pagecache writeback. And we don't have unlimited virtio queue
size; in fact it's quite limited.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On 03/17/2010 10:49 AM, Christoph Hellwig wrote:
> On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote:
>> If the batch size is larger than the virtio queue size, or if there are
>> no flushes at all, then yes the huge write cache gives more opportunity
>> for reordering. But we're already talking hundreds of requests here.
>
> Yes. And remember those don't have to come from the same host. Also
> remember that we rather limit excessive reordering of O_DIRECT requests
> in the I/O scheduler, because they are synchronous-type I/O, while we
> don't do that for pagecache writeback.

Maybe we should relax that for kvm. Perhaps some of the problem comes from
the fact that we call io_submit() once per request.

> And we don't have unlimited virtio queue size, in fact it's quite
> limited.

That can be extended easily if it fixes the problem.

--
error compiling committee.c: too many arguments to function
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

Anthony Liguori anth...@codemonkey.ws writes:
> This really gets down to your definition of safe behaviour. As it
> stands, if you suffer a power outage, it may lead to guest corruption.
> While we are correct in advertising a write-cache, write-caches are
> volatile and should a drive lose power, it could lead to data
> corruption. Enterprise disks tend to have battery backed write caches to
> prevent this. In the set up you're emulating, the host is acting as a
> giant write cache. Should your host fail, you can get data corruption.

Hi Anthony. I suspected my post might spark an interesting discussion!
Before considering anything like this, we did quite a bit of testing with
OSes in qemu-kvm guests running filesystem-intensive work, using an
ipmitool power off to kill the host. I didn't manage to corrupt any ext3,
ext4 or NTFS filesystems despite these efforts. Is your claim here that:

  (a) qemu doesn't emulate a disk write cache correctly; or

  (b) operating systems are inherently unsafe running on top of a disk
      with a write-cache; or

  (c) installations that are already broken and lose data with a physical
      drive with a write-cache can lose much more in this case because
      the write cache is much bigger?

Following Christoph Hellwig's patch series from last September, I'm pretty
convinced that (a) isn't true, apart from the inability to disable the
write-cache at run-time, which is something that neither recent linux nor
windows seems to want to do out of the box. Given that modern SATA drives
come with fairly substantial write-caches nowadays which operating systems
leave on without widespread disaster, I don't really believe in (b)
either, at least for the ide and scsi case. Filesystems know they have to
flush the disk cache to avoid corruption. (Virtio makes the write cache
invisible to the OS except in linux 2.6.32+, so I know virtio-blk has to
be avoided for current windows and obsolete linux when writeback caching
is on.)

I can certainly imagine (c) might be the case, although when I use strace
to watch the IO to the block device, I see pretty regular fdatasyncs being
issued by the guests, interleaved with the writes, so I'm not sure how
likely the problem would be in practice. Perhaps my test guests were
unrepresentatively well-behaved.

However, the potentially unlimited time-window for loss of incorrectly
unsynced data is also something one could imagine fixing at the qemu
level. Perhaps I should be implementing something like
cache=writeback,flushtimeout=N which, upon a write being issued to the
block device, starts an N second timer if it isn't already running. The
timer is destroyed on flush, and if it expires before it's destroyed, a
gratuitous flush is sent. Do you think this is worth doing? Just a simple
'while sleep 10; do sync; done' on the host even!

We've used cache=none and cache=writethrough, and whilst performance is
fine with a single guest accessing a disk, when we chop the disks up with
LVM and run even a small handful of guests, the constant seeking to serve
tiny synchronous IOs leads to truly abysmal throughput: we've seen less
than 700kB/s streaming write rates within guests when the backing store is
capable of 100MB/s. With cache=writeback, there's still IO contention
between guests, but the write granularity is a bit coarser, so the host's
elevator seems to get a bit more of a chance to help us out and we can at
least squeeze out 5-10MB/s from two or three concurrently running guests,
getting a total of 20-30% of the performance of the underlying block
device rather than a total of around 5%.

Cheers, Chris.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

Avi Kivity a...@redhat.com writes:
> On 03/15/2010 10:23 PM, Chris Webb wrote:
>> Wasteful duplication of page cache between guest and host
>> notwithstanding, turning on cache=writeback is a spectacular
>> performance win for our guests.
>
> Is this with qcow2, raw file, or direct volume access?

This is with direct access to logical volumes. No file systems or qcow2 in
the stack. Our typical host has a couple of SATA disks, combined in md
RAID1, chopped up into volumes with LVM2 (really just dm linear targets).
The performance measured outside qemu is excellent, inside qemu-kvm is
fine too until multiple guests are trying to access their drives at once,
but then everything starts to grind badly.

> I can understand it for qcow2, but for direct volume access this
> shouldn't happen. The guest schedules as many writes as it can, followed
> by a sync. The host (and disk) can then reschedule them whether they are
> in the writeback cache or in the block layer, and must sync in the same
> way once completed.

I don't really understand what's going on here, but I wonder if the
underlying problem might be that all the O_DIRECT/O_SYNC writes from the
guests go down into the same block device at the bottom of the device
mapper stack, and thus can't be reordered with respect to one another. For
our purposes,

  Guest A    Guest B      Guest A    Guest B      Guest A    Guest B
  write A1                write A1                           write B1
             write B1     write A2                write A1
  write A2                           write B1     write A2

are all equivalent, but the system isn't allowed to reorder in this way
because there isn't a separate request queue for each logical volume, just
the one at the bottom. (I don't know whether nested request queues would
behave remotely reasonably either, though!)

Also, if my guest kernel issues (say) three small writes, one at the start
of the disk, one in the middle, one at the end, and then does a flush, can
virtio really express this as one non-contiguous O_DIRECT write (the three
components of which can be reordered by the elevator with respect to one
another) rather than three distinct O_DIRECT writes which can't be
permuted? Can qemu issue a write like that? cache=writeback + flush allows
this to be optimised by the block layer in the normal way.

Cheers, Chris.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On 03/17/2010 10:14 AM, Chris Webb wrote:
> Anthony Liguori anth...@codemonkey.ws writes:
>> This really gets down to your definition of safe behaviour. As it
>> stands, if you suffer a power outage, it may lead to guest corruption.
>> While we are correct in advertising a write-cache, write-caches are
>> volatile and should a drive lose power, it could lead to data
>> corruption. Enterprise disks tend to have battery backed write caches
>> to prevent this. In the set up you're emulating, the host is acting as
>> a giant write cache. Should your host fail, you can get data
>> corruption.
>
> Hi Anthony. I suspected my post might spark an interesting discussion!
> Before considering anything like this, we did quite a bit of testing
> with OSes in qemu-kvm guests running filesystem-intensive work, using an
> ipmitool power off to kill the host. I didn't manage to corrupt any
> ext3, ext4 or NTFS filesystems despite these efforts. Is your claim here
> that:
>
>   (a) qemu doesn't emulate a disk write cache correctly; or
>
>   (b) operating systems are inherently unsafe running on top of a disk
>       with a write-cache; or
>
>   (c) installations that are already broken and lose data with a
>       physical drive with a write-cache can lose much more in this case
>       because the write cache is much bigger?

This is the closest to the most accurate. It basically boils down to this:
most enterprises use disks with battery backed write caches. Having the
host act as a giant write cache means that you can lose data. I agree that
a well behaved file system will not become corrupt, but my contention is
that for many types of applications, data loss == corruption, and not all
file systems are well behaved. And while it's certainly valid to argue
about whether common filesystems are broken, from a purely pragmatic
perspective, this is going to be the case.

Regards,

Anthony Liguori
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On 03/17/2010 05:24 PM, Chris Webb wrote:
> Avi Kivity a...@redhat.com writes:
>> On 03/15/2010 10:23 PM, Chris Webb wrote:
>>> Wasteful duplication of page cache between guest and host
>>> notwithstanding, turning on cache=writeback is a spectacular
>>> performance win for our guests.
>> Is this with qcow2, raw file, or direct volume access?
>
> This is with direct access to logical volumes. No file systems or qcow2
> in the stack. Our typical host has a couple of SATA disks, combined in
> md RAID1, chopped up into volumes with LVM2 (really just dm linear
> targets). The performance measured outside qemu is excellent, inside
> qemu-kvm is fine too until multiple guests are trying to access their
> drives at once, but then everything starts to grind badly.

OK.

>> I can understand it for qcow2, but for direct volume access this
>> shouldn't happen. The guest schedules as many writes as it can,
>> followed by a sync. The host (and disk) can then reschedule them
>> whether they are in the writeback cache or in the block layer, and must
>> sync in the same way once completed.
>
> I don't really understand what's going on here, but I wonder if the
> underlying problem might be that all the O_DIRECT/O_SYNC writes from the
> guests go down into the same block device at the bottom of the device
> mapper stack, and thus can't be reordered with respect to one another.

They should be reorderable. Otherwise host filesystems on several volumes
would suffer the same problems. Whether the filesystem is in the host or
guest shouldn't matter.

> For our purposes,
>
>   Guest A    Guest B      Guest A    Guest B      Guest A    Guest B
>   write A1                write A1                           write B1
>              write B1     write A2                write A1
>   write A2                           write B1     write A2
>
> are all equivalent, but the system isn't allowed to reorder in this way
> because there isn't a separate request queue for each logical volume,
> just the one at the bottom. (I don't know whether nested request queues
> would behave remotely reasonably either, though!)
>
> Also, if my guest kernel issues (say) three small writes, one at the
> start of the disk, one in the middle, one at the end, and then does a
> flush, can virtio really express this as one non-contiguous O_DIRECT
> write (the three components of which can be reordered by the elevator
> with respect to one another) rather than three distinct O_DIRECT writes
> which can't be permuted? Can qemu issue a write like that?
> cache=writeback + flush allows this to be optimised by the block layer
> in the normal way.

Guest side virtio will send this as three requests followed by a flush.
Qemu will issue these as three distinct requests and then flush. The
requests are marked, as Christoph says, in a way that limits their
reorderability, and perhaps if we fix these two problems performance will
improve.

Something that comes to mind is merging of flush requests. If N guests
issue one write and one flush each, we should issue N writes and just one
flush - a flush for the disk applies to all volumes on that disk.

--
error compiling committee.c: too many arguments to function
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

* Anthony Liguori anth...@codemonkey.ws [2010-03-17 10:55:47]:

> On 03/17/2010 10:14 AM, Chris Webb wrote:
>> Anthony Liguori anth...@codemonkey.ws writes:
>>> This really gets down to your definition of safe behaviour. As it
>>> stands, if you suffer a power outage, it may lead to guest corruption.
>>> While we are correct in advertising a write-cache, write-caches are
>>> volatile and should a drive lose power, it could lead to data
>>> corruption. Enterprise disks tend to have battery backed write caches
>>> to prevent this. In the set up you're emulating, the host is acting as
>>> a giant write cache. Should your host fail, you can get data
>>> corruption.
>>
>> Hi Anthony. I suspected my post might spark an interesting discussion!
>> Before considering anything like this, we did quite a bit of testing
>> with OSes in qemu-kvm guests running filesystem-intensive work, using
>> an ipmitool power off to kill the host. I didn't manage to corrupt any
>> ext3, ext4 or NTFS filesystems despite these efforts. Is your claim
>> here that:
>>
>>   (a) qemu doesn't emulate a disk write cache correctly; or
>>
>>   (b) operating systems are inherently unsafe running on top of a disk
>>       with a write-cache; or
>>
>>   (c) installations that are already broken and lose data with a
>>       physical drive with a write-cache can lose much more in this case
>>       because the write cache is much bigger?
>
> This is the closest to the most accurate. It basically boils down to
> this: most enterprises use disks with battery backed write caches.
> Having the host act as a giant write cache means that you can lose data.

Dirty limits can help control how much we lose, but also affect how much
we write out.

> I agree that a well behaved file system will not become corrupt, but my
> contention is that for many types of applications, data loss ==
> corruption, and not all file systems are well behaved. And it's
> certainly valid to argue about whether common filesystems are broken,
> but from a purely pragmatic perspective, this is going to be the case.

I think it is a trade-off for end users to decide on. cache=writeback does
provide performance benefits, but can cause data loss.

--
Three Cheers,
Balbir
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

Anthony Liguori anth...@codemonkey.ws writes:
> On 03/17/2010 10:14 AM, Chris Webb wrote:
>> (c) installations that are already broken and lose data with a physical
>> drive with a write-cache can lose much more in this case because the
>> write cache is much bigger?
>
> This is the closest to the most accurate. It basically boils down to
> this: most enterprises use disks with battery backed write caches.
> Having the host act as a giant write cache means that you can lose data.
> I agree that a well behaved file system will not become corrupt, but my
> contention is that for many types of applications, data loss ==
> corruption, and not all file systems are well behaved. And it's
> certainly valid to argue about whether common filesystems are broken,
> but from a purely pragmatic perspective, this is going to be the case.

Okay. What I was driving at in describing these systems as 'already
broken' is that they will already lose data (in this sense) if they're run
on bare metal with normal commodity SATA disks with their 32MB write
caches on. That configuration surely describes the vast majority of
PC-class desktops and servers!

If I understand correctly, your point here is that the small cache on a
real SATA drive gives a relatively small time window for data loss,
whereas the worry with cache=writeback is that the host page cache can be
gigabytes, so the time window for unsynced data to be lost is potentially
enormous. Isn't the fix for that just forcing periodic sync on the host to
bound above the time window for unsynced data loss in the guest?

Cheers, Chris.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On 03/17/2010 06:22 PM, Avi Kivity wrote:
>> Also, if my guest kernel issues (say) three small writes, one at the
>> start of the disk, one in the middle, one at the end, and then does a
>> flush, can virtio really express this as one non-contiguous O_DIRECT
>> write (the three components of which can be reordered by the elevator
>> with respect to one another) rather than three distinct O_DIRECT writes
>> which can't be permuted? Can qemu issue a write like that?
>> cache=writeback + flush allows this to be optimised by the block layer
>> in the normal way.
>
> Guest side virtio will send this as three requests followed by a flush.
> Qemu will issue these as three distinct requests and then flush. The
> requests are marked, as Christoph says, in a way that limits their
> reorderability, and perhaps if we fix these two problems performance
> will improve.
>
> Something that comes to mind is merging of flush requests. If N guests
> issue one write and one flush each, we should issue N writes and just
> one flush - a flush for the disk applies to all volumes on that disk.

Chris, can you carry out an experiment? Write a program that pwrite()s a
byte to a file at the same location repeatedly, with the file opened using
O_SYNC. Measure the write rate, and run blktrace on the host to see what
the disk (/dev/sda, not the volume) sees. Should be a (write, flush,
write, flush) per pwrite pattern or similar (for writing the data and a
journal block, perhaps even three writes will be needed).

Then scale this across multiple guests, measure and trace again. If we're
lucky, the flushes will be coalesced, if not, we need to work on it.

--
error compiling committee.c: too many arguments to function
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

Avi Kivity a...@redhat.com writes:
> Chris, can you carry out an experiment? Write a program that pwrite()s a
> byte to a file at the same location repeatedly, with the file opened
> using O_SYNC. Measure the write rate, and run blktrace on the host to
> see what the disk (/dev/sda, not the volume) sees. Should be a (write,
> flush, write, flush) per pwrite pattern or similar (for writing the data
> and a journal block, perhaps even three writes will be needed). Then
> scale this across multiple guests, measure and trace again. If we're
> lucky, the flushes will be coalesced, if not, we need to work on it.

Sure, sounds like an excellent plan. I don't have a test machine at the
moment as the last host I was using for this has gone into production, but
I'm due to get another one to install later today or first thing tomorrow
which would be ideal for doing this. I'll follow up with the results once
I have them.

Cheers, Chris.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote:
> They should be reorderable. Otherwise host filesystems on several
> volumes would suffer the same problems.

They are reorderable, just not as extremely as in the page cache. Remember
that the request queue really is just a relatively small queue of
outstanding I/O, and that is absolutely intentional. Large scale _caching_
is done by the VM in the pagecache, with all the usual aging, pressure,
etc algorithms applied to it. The block devices have a relatively small
fixed size request queue associated with them to facilitate request
merging and limited reordering, and to have fully set up I/O requests for
the device.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On 03/17/2010 06:47 PM, Chris Webb wrote:
> Avi Kivity a...@redhat.com writes:
>> Chris, can you carry out an experiment? Write a program that pwrite()s
>> a byte to a file at the same location repeatedly, with the file opened
>> using O_SYNC. Measure the write rate, and run blktrace on the host to
>> see what the disk (/dev/sda, not the volume) sees. Should be a (write,
>> flush, write, flush) per pwrite pattern or similar (for writing the
>> data and a journal block, perhaps even three writes will be needed).
>> Then scale this across multiple guests, measure and trace again. If
>> we're lucky, the flushes will be coalesced, if not, we need to work on
>> it.
>
> Sure, sounds like an excellent plan. I don't have a test machine at the
> moment as the last host I was using for this has gone into production,
> but I'm due to get another one to install later today or first thing
> tomorrow which would be ideal for doing this. I'll follow up with the
> results once I have them.

Meanwhile I looked at the code, and it looks bad. There is an
IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before
issuing it. In any case, qemu doesn't use it as far as I could tell, and
even if it did, device-mapper doesn't implement the needed ->aio_fsync()
operation.

So, there's a lot of plumbing needed before we can get cache flushes
merged into each other. Given cache=writeback does allow merging, I think
we explained part of the problem at least.

--
error compiling committee.c: too many arguments to function
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote:
> Chris, can you carry out an experiment? Write a program that pwrite()s a
> byte to a file at the same location repeatedly, with the file opened
> using O_SYNC. Measure the write rate, and run blktrace on the host to
> see what the disk (/dev/sda, not the volume) sees. Should be a (write,
> flush, write, flush) per pwrite pattern or similar (for writing the data
> and a journal block, perhaps even three writes will be needed). Then
> scale this across multiple guests, measure and trace again. If we're
> lucky, the flushes will be coalesced, if not, we need to work on it.

As the person who has written quite a bit of the current O_SYNC
implementation and also reviewed the rest of it, I can tell you that those
flushes won't be coalesced. If we always rewrite the same block we do the
cache flush from the fsync method, and there is nothing to coalesce there.

If you actually do modify metadata (e.g. by using the new real O_SYNC
instead of the old one that always was O_DSYNC, which I introduced in
2.6.33 but which isn't picked up by userspace yet) you might hit a very
limited transaction merging window in some filesystems, but it's generally
very small for a good reason. If it were too large we'd make one process
wait for I/O in another just because we might expect transactions to be
coalesced later. There's been some long discussion about that fsync
transaction batching tuning for ext3 a while ago.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote:
> Meanwhile I looked at the code, and it looks bad. There is an
> IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before
> issuing it. In any case, qemu doesn't use it as far as I could tell, and
> even if it did, device-mapper doesn't implement the needed ->aio_fsync()
> operation.

No one implements it, and all surrounding code is dead wood. It would
require us to do asynchronous pagecache operations, which involve major
surgery of the VM code. Patches to do this were rejected multiple times.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

On 03/17/2010 06:58 PM, Christoph Hellwig wrote:
> On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote:
>> Meanwhile I looked at the code, and it looks bad. There is an
>> IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue
>> before issuing it. In any case, qemu doesn't use it as far as I could
>> tell, and even if it did, device-mapper doesn't implement the needed
>> ->aio_fsync() operation.
>
> No one implements it, and all surrounding code is dead wood. It would
> require us to do asynchronous pagecache operations, which involve major
> surgery of the VM code. Patches to do this were rejected multiple times.

Pity. What about the O_DIRECT aio case? It's ridiculous that you can
submit async write requests but have to wait synchronously for them to
actually hit the disk if you have a write cache.

--
error compiling committee.c: too many arguments to function
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 06:52 PM, Christoph Hellwig wrote: On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote: They should be reorderable. Otherwise host filesystems on several volumes would suffer the same problems. They are reorderable, just not as extremely as the page cache. Remember that the request queue really is just a relatively small queue of outstanding I/O, and that is absolutely intentional. Large scale _caching_ is done by the VM in the pagecache, with all the usual aging, pressure, etc. algorithms applied to it. We already have the large scale caching and stuff running in the guest. We have a stream of optimized requests coming out of guests; running the same algorithm again shouldn't improve things. The host has an opportunity to do inter-guest optimization, but given each guest has its own disk area, I don't see how any reordering or merging could help here (beyond sorting guests according to disk order). The block devices have a relatively small fixed-size request queue associated with them to facilitate request merging and limited reordering, and to have fully set up I/O requests for the device. We should enlarge the queues, increase request reorderability, and merge flushes (delay flushes until after unrelated writes, then adjacent flushes can be collapsed). Collapsing flushes should get us better than linear scaling (since we collapse N writes + M flushes into N writes and 1 flush). However the writes themselves scale worse than linearly, since they now span a larger disk space and cause higher seek penalties.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 06:57 PM, Christoph Hellwig wrote: On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote: Chris, can you carry out an experiment? Write a program that pwrite()s a byte to a file at the same location repeatedly, with the file opened using O_SYNC. Measure the write rate, and run blktrace on the host to see what the disk (/dev/sda, not the volume) sees. Should be a (write, flush, write, flush) per pwrite pattern or similar (for writing the data and a journal block; perhaps even three writes will be needed). Then scale this across multiple guests, measure and trace again. If we're lucky, the flushes will be coalesced; if not, we need to work on it. As the person who has written quite a bit of the current O_SYNC implementation and also reviewed the rest of it, I can tell you that those flushes won't be coalesced. If we always rewrite the same block, we do the cache flush from the fsync method and there is nothing to coalesce it with there. If you actually do modify metadata (e.g. by using the new real O_SYNC instead of the old one that always was O_DSYNC, which I introduced in 2.6.33 but which isn't picked up by userspace yet) you might hit a very limited transaction merging window in some filesystems, but it's generally very small for a good reason: if it were too large, we'd make one process wait for I/O in another just because we might expect transactions to coalesce later. There's been some long discussion about that fsync transaction batching tuning for ext3 a while ago. I definitely don't expect flush merging for a single guest, but for multiple guests there is certainly an opportunity for merging. Most likely we don't take advantage of it, and that's one of the problems. Copying data into pagecache so that we can merge the flushes seems like a very unsatisfactory implementation.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On Wed, Mar 17, 2010 at 03:14:10PM +, Chris Webb wrote: Anthony Liguori anth...@codemonkey.ws writes: This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write-caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery backed write caches to prevent this. In the set up you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. Hi Anthony. I suspected my post might spark an interesting discussion! Before considering anything like this, we did quite a bit of testing with OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool power off to kill the host. I didn't manage to corrupt any ext3, ext4 or NTFS filesystems despite these efforts. Is your claim here that: (a) qemu doesn't emulate a disk write cache correctly; or (b) operating systems are inherently unsafe running on top of a disk with a write-cache; or (c) installations that are already broken and lose data with a physical drive with a write-cache can lose much more in this case because the write cache is much bigger? Following Christoph Hellwig's patch series from last September, I'm pretty convinced that (a) isn't true, apart from the inability to disable the write-cache at run-time, which is something that neither recent linux nor windows seem to want to do out of the box. Given that modern SATA drives come with fairly substantial write-caches nowadays which operating systems leave on without widespread disaster, I don't really believe in (b) either, at least for the ide and scsi case. Filesystems know they have to flush the disk cache to avoid corruption. (Virtio makes the write cache invisible to the OS except in linux 2.6.32+, so I know virtio-blk has to be avoided for current windows and obsolete linux when writeback caching is on.)
I can certainly imagine (c) might be the case, although when I use strace to watch the IO to the block device, I see pretty regular fdatasyncs being issued by the guests, interleaved with the writes, so I'm not sure how likely the problem would be in practice. Perhaps my test guests were unrepresentatively well-behaved. However, the potentially unlimited time-window for loss of incorrectly unsynced data is also something one could imagine fixing at the qemu level. Perhaps I should be implementing something like cache=writeback,flushtimeout=N which, upon a write being issued to the block device, starts an N second timer if it isn't already running. The timer is destroyed on flush, and if it expires before it's destroyed, a gratuitous flush is sent. Do you think this is worth doing? Just a simple 'while sleep 10; do sync; done' on the host even! We've used cache=none and cache=writethrough, and whilst performance is fine with a single guest accessing a disk, when we chop the disks up with LVM and run even a small handful of guests, the constant seeking to serve tiny synchronous IOs leads to truly abysmal throughput: we've seen less than 700kB/s streaming write rates within guests when the backing store is capable of 100MB/s. With cache=writeback, there's still IO contention between guests, but the write granularity is a bit coarser, so the host's elevator seems to get a bit more of a chance to help us out and we can at least squeeze out 5-10MB/s from two or three concurrently running guests, getting a total of 20-30% of the performance of the underlying block device rather than a total of around 5%. Hi Chris, Are you using CFQ in the host? What is the host kernel version? I am not sure what the problem is here, but you might want to play with the IO controller and put these guests in individual cgroups and see if you get better throughput even with cache=writethrough.
If the problem is that sync writes from different guests get intermixed, resulting in more seeks, the IO controller might help, as these writes will now go on different group service trees, and in CFQ we try to service requests from one service tree at a time for a period before we switch to another service tree. The issue will be that all the logic is in CFQ, and it works at the leaf nodes of the storage stack, not at the LVM nodes. So first you might want to try it with a single partitioned disk. If it helps, then it might help with the LVM configuration also (IO control working at leaf nodes). Thanks Vivek
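Chris's cache=writeback,flushtimeout=N proposal above amounts to a small state machine. Here is a hypothetical sketch of that logic; the names (flush_timer, on_guest_write, and so on) are invented for illustration and are not qemu API:

```c
#include <stdbool.h>
#include <time.h>

typedef struct {
    int timeout_secs;   /* the N in flushtimeout=N */
    bool armed;
    time_t deadline;
} flush_timer;

/* First write after a flush arms a deadline N seconds away. */
void on_guest_write(flush_timer *t, time_t now)
{
    if (!t->armed) {
        t->armed = true;
        t->deadline = now + t->timeout_secs;
    }
}

/* A guest-issued flush destroys the timer. */
void on_guest_flush(flush_timer *t)
{
    t->armed = false;
}

/* Polled periodically; returns true when a gratuitous flush
 * (an fdatasync on the backing device) should be issued now,
 * bounding unsynced data to at most N seconds old. */
bool timer_expired(flush_timer *t, time_t now)
{
    if (t->armed && now >= t->deadline) {
        t->armed = false;
        return true;
    }
    return false;
}
```

The 'while sleep 10; do sync; done' loop mentioned above achieves the same bound more crudely, by flushing everything on the host unconditionally.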
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
Vivek Goyal vgo...@redhat.com writes: Are you using CFQ in the host? What is the host kernel version? I am not sure what the problem is here, but you might want to play with the IO controller and put these guests in individual cgroups and see if you get better throughput even with cache=writethrough. Hi. We're using the deadline IO scheduler on 2.6.32.7. We got better performance from deadline than from cfq when we last tested, which was admittedly around the 2.6.30 timescale so is now a rather outdated measurement. If the problem is that sync writes from different guests get intermixed, resulting in more seeks, the IO controller might help, as these writes will now go on different group service trees, and in CFQ we try to service requests from one service tree at a time for a period before we switch to another service tree. Thanks for the suggestion: I'll have a play with this. I currently use /sys/kernel/uids/N/cpu_share with one UID per guest to divide up the CPU between guests, but this could just as easily be done with a cgroup per guest if a side-effect is to provide a hint about IO independence to CFQ. Best wishes, Chris.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On Mon, Mar 15, 2010 at 08:27:25PM -0500, Anthony Liguori wrote: Actually cache=writeback is as safe as any normal host is with a volatile disk cache, except that in this case the disk cache is actually a lot larger. With a properly implemented filesystem this will never cause corruption. Metadata corruption, not necessarily corruption of data stored in a file. Again, this will not cause metadata corruption either, provided the filesystem doesn't lose barriers, although we may lose up to the cache size of new data or metadata operations. The consistency of the filesystem is still guaranteed. Not all software uses fsync as much as they should. And often times, it's for good reason (like ext3). If an application needs data on disk it must call fsync, or there is no guarantee at all, even on ext3. And with growing disk caches these issues show up on normal disks often enough that people have realized it by now. IIUC, an O_DIRECT write using cache=writeback is not actually on the spindle when the write() completes. Rather, an explicit fsync() would be required. That will cause data corruption in many applications (like databases) regardless of whether the fs gets metadata corruption. It isn't on the spindle for plain O_DIRECT without qemu involved either. The O_DIRECT write goes through the disk cache and requires an explicit fsync or the O_SYNC open flag to make sure it goes to disk.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 08:48 PM, Anthony Liguori wrote: On 03/15/2010 04:27 AM, Avi Kivity wrote: That's only beneficial if the cache is shared. Otherwise, you could use the balloon to evict cache when memory is tight. Shared cache is mostly a desktop thing where users run similar workloads. For servers, it's much less likely. So a modified-guest doesn't help a lot here. Not really. In many cloud environments, there's a set of common images that are instantiated on each node. Usually this is because you're running a horizontally scalable application or because you're supporting an ephemeral storage model. But will these servers actually benefit from shared cache? So the images are shared, they boot up, what then? - apache really won't like serving static files from the host pagecache - dynamic content (java, cgi) will be mostly in anonymous memory, not pagecache - ditto for application servers - what else are people doing? In fact, with ephemeral storage, you typically want to use cache=writeback since you aren't providing data guarantees across shutdown/failure. Interesting point. We'd need a cache=volatile for this use case to avoid the fdatasync()s we do now. Also useful for -snapshot. In fact I have a patch for this somewhere I can dig out.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 10:23 PM, Chris Webb wrote: Avi Kivitya...@redhat.com writes: On 03/15/2010 10:07 AM, Balbir Singh wrote: Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. Hi Avi. One observation about your recommendation for cache=none: We run hosts of VMs accessing drives backed by logical volumes carved out from md RAID1. Each host has 32GB RAM and eight cores, divided between (say) twenty virtual machines, which pretty much fill the available memory on the host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback caching turned on get advertised to the guest as having a write-cache, and FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback isn't acting as cache=neverflush like it would have done a year ago. I know that comparing performance for cache=none against that unsafe behaviour would be somewhat unfair!) Wasteful duplication of page cache between guest and host notwithstanding, turning on cache=writeback is a spectacular performance win for our guests. For example, even IDE with cache=writeback easily beats virtio with cache=none in most of the guest filesystem performance tests I've tried. The anecdotal feedback from clients is also very strongly in favour of cache=writeback. Is this with qcow2, raw file, or direct volume access? I can understand it for qcow2, but for direct volume access this shouldn't happen. The guest schedules as many writes as it can, followed by a sync. The host (and disk) can then reschedule them whether they are in the writeback cache or in the block layer, and must sync in the same way once completed. Perhaps what we need is bdrv_aio_submit() which can take a number of requests. For direct volume access, this allows easier reordering (io_submit() should plug the queues before it starts processing and unplug them when done, though I don't see the code for this?). 
For qcow2, we can coalesce metadata updates for multiple requests into one RMW (for example, a sequential write split into multiple 64K-256K write requests). Christoph/Kevin? With a host full of cache=none guests, IO contention between guests is hugely problematic with non-stop seek from the disks to service tiny O_DIRECT writes (especially without virtio), many of which needn't have been synchronous if only there had been some way for the guest OS to tell qemu that. Running with cache=writeback seems to reduce the frequency of disk flush per guest to a much more manageable level, and to allow the host's elevator to optimise writing out across the guests in between these flushes. The host eventually has to turn the writes into synchronous writes, no way around that.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
Am 16.03.2010 10:17, schrieb Avi Kivity: On 03/15/2010 10:23 PM, Chris Webb wrote: Avi Kivitya...@redhat.com writes: On 03/15/2010 10:07 AM, Balbir Singh wrote: Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. Hi Avi. One observation about your recommendation for cache=none: We run hosts of VMs accessing drives backed by logical volumes carved out from md RAID1. Each host has 32GB RAM and eight cores, divided between (say) twenty virtual machines, which pretty much fill the available memory on the host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback caching turned on get advertised to the guest as having a write-cache, and FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback isn't acting as cache=neverflush like it would have done a year ago. I know that comparing performance for cache=none against that unsafe behaviour would be somewhat unfair!) Wasteful duplication of page cache between guest and host notwithstanding, turning on cache=writeback is a spectacular performance win for our guests. For example, even IDE with cache=writeback easily beats virtio with cache=none in most of the guest filesystem performance tests I've tried. The anecdotal feedback from clients is also very strongly in favour of cache=writeback. Is this with qcow2, raw file, or direct volume access? I can understand it for qcow2, but for direct volume access this shouldn't happen. The guest schedules as many writes as it can, followed by a sync. The host (and disk) can then reschedule them whether they are in the writeback cache or in the block layer, and must sync in the same way once completed. Perhaps what we need is bdrv_aio_submit() which can take a number of requests. 
For direct volume access, this allows easier reordering (io_submit() should plug the queues before it starts processing and unplug them when done, though I don't see the code for this?). For qcow2, we can coalesce metadata updates for multiple requests into one RMW (for example, a sequential write split into multiple 64K-256K write requests). We already do merge sequential writes back into one larger request. So this is in fact a case that wouldn't benefit from such changes. It may help for other cases. But even if it did, coalescing metadata writes in qcow2 sounds like a good way to mess up, so I'd stay with doing it only for the data itself. Apart from that, wouldn't your points apply to writeback as well? Kevin
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/16/2010 11:54 AM, Kevin Wolf wrote: Is this with qcow2, raw file, or direct volume access? I can understand it for qcow2, but for direct volume access this shouldn't happen. The guest schedules as many writes as it can, followed by a sync. The host (and disk) can then reschedule them whether they are in the writeback cache or in the block layer, and must sync in the same way once completed. Perhaps what we need is bdrv_aio_submit() which can take a number of requests. For direct volume access, this allows easier reordering (io_submit() should plug the queues before it starts processing and unplug them when done, though I don't see the code for this?). For qcow2, we can coalesce metadata updates for multiple requests into one RMW (for example, a sequential write split into multiple 64K-256K write requests). We already do merge sequential writes back into one larger request. So this is in fact a case that wouldn't benefit from such changes. I'm not happy with that. It increases overall latency. With qcow2 it's fine, but I'd let requests to raw volumes flow unaltered. It may help for other cases. But even if it did, coalescing metadata writes in qcow2 sounds like a good way to mess up, so I'd stay with doing it only for the data itself. I don't see why. Apart from that, wouldn't your points apply to writeback as well? They do, but for writeback the host kernel already does all the coalescing/merging/blah for us.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
Avi, cache=writeback can be faster than cache=none for the same reasons a disk cache speeds up access. As long as the I/O mix contains more asynchronous than synchronous writes, it allows the host to do much more reordering, only limited by the cache size (which can be quite huge when using the host pagecache) and the amount of cache flushes coming from the guest. If you have an fsync-heavy workload, or metadata operations with a filesystem like the current XFS, you will get lots of cache flushes that make the usefulness of the additional cache rather limited. If you don't have a lot of cache flushes, e.g. due to dumb applications that do not issue fsync, or ext3 run in its default mode which never issues cache flushes, the benefit will be enormous, but so will the potential data loss and corruption. But even for something like btrfs, which does provide data integrity and issues cache flushes fairly efficiently, cache=writeback may provide a quite nice speedup, especially with multiple guests accessing the same spindle(s). But I wouldn't be surprised if IBM's extreme differences are indeed due to the extremely unsafe ext3 default behaviour.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/16/2010 12:26 PM, Christoph Hellwig wrote: Avi, cache=writeback can be faster than cache=none for the same reasons a disk cache speeds up access. As long as the I/O mix contains more asynchronous than synchronous writes, it allows the host to do much more reordering, only limited by the cache size (which can be quite huge when using the host pagecache) and the amount of cache flushes coming from the guest. If you have an fsync-heavy workload or metadata operations with a filesystem like the current XFS, you will get lots of cache flushes that make the usefulness of the additional cache rather limited. Are you talking about direct volume access or qcow2? For direct volume access, I still don't get it. The number of barriers issued by the host must equal (or exceed, but that's pointless) the number of barriers issued by the guest. cache=writeback allows the host to reorder writes, but so does cache=none. Where does the difference come from? Put it another way: in an unvirtualized environment, if you implement a write cache in a storage driver (not device), and sync it on a barrier request, would you expect to see a performance improvement? If you don't have a lot of cache flushes, e.g. due to dumb applications that do not issue fsync, or ext3 run in its default mode which never issues cache flushes, the benefit will be enormous, but so will the potential data loss and corruption. Shouldn't the host never issue cache flushes in this case? (For direct volume access; qcow2 still needs flushes for metadata integrity.) But even for something like btrfs, which does provide data integrity and issues cache flushes fairly efficiently, cache=writeback may provide a quite nice speedup, especially with multiple guests accessing the same spindle(s). But I wouldn't be surprised if IBM's extreme differences are indeed due to the extremely unsafe ext3 default behaviour.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote: Are you talking about direct volume access or qcow2? Doesn't matter. For direct volume access, I still don't get it. The number of barriers issued by the host must equal (or exceed, but that's pointless) the number of barriers issued by the guest. cache=writeback allows the host to reorder writes, but so does cache=none. Where does the difference come from? Put it another way: in an unvirtualized environment, if you implement a write cache in a storage driver (not device), and sync it on a barrier request, would you expect to see a performance improvement? cache=none only allows very limited reordering in the host. O_DIRECT is synchronous on the host, so there's just some very limited reordering going on in the elevator if we have other I/O going on in parallel. In addition to that, the disk write cache can perform limited reordering and caching, but the disk cache has a rather limited size. The host pagecache gives a much wider opportunity to reorder, especially if the guest workload is not cache-flush heavy. If the guest workload is extremely cache-flush heavy, the usefulness of the pagecache is rather limited, as we'll only use very little of it, but pay by having to do a data copy. If the workload is not cache-flush heavy, and we have multiple guests doing I/O to the same spindles, it will allow the host to do much more efficient data writeout by being able to do better ordered (less seeky) and bigger I/O (especially if the host has real storage compared to ide for the guest). If you don't have a lot of cache flushes, e.g. due to dumb applications that do not issue fsync, or ext3 run in its default mode which never issues cache flushes, the benefit will be enormous, but so will the potential data loss and corruption. Shouldn't the host never issue cache flushes in this case? (For direct volume access; qcow2 still needs flushes for metadata integrity.)
If the guest never issues a flush, the host won't either, indeed. Data will only go to disk by background writeout or memory pressure.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/16/2010 12:44 PM, Christoph Hellwig wrote: On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote: Are you talking about direct volume access or qcow2? Doesn't matter. For direct volume access, I still don't get it. The number of barriers issued by the host must equal (or exceed, but that's pointless) the number of barriers issued by the guest. cache=writeback allows the host to reorder writes, but so does cache=none. Where does the difference come from? Put it another way: in an unvirtualized environment, if you implement a write cache in a storage driver (not device), and sync it on a barrier request, would you expect to see a performance improvement? cache=none only allows very limited reordering in the host. O_DIRECT is synchronous on the host, so there's just some very limited reordering going on in the elevator if we have other I/O going on in parallel. Presumably there is lots of I/O going on, or we wouldn't be having this conversation. In addition to that, the disk write cache can perform limited reordering and caching, but the disk cache has a rather limited size. The host pagecache gives a much wider opportunity to reorder, especially if the guest workload is not cache-flush heavy. If the guest workload is extremely cache-flush heavy, the usefulness of the pagecache is rather limited, as we'll only use very little of it, but pay by having to do a data copy. If the workload is not cache-flush heavy, and we have multiple guests doing I/O to the same spindles, it will allow the host to do much more efficient data writeout by being able to do better ordered (less seeky) and bigger I/O (especially if the host has real storage compared to ide for the guest). Let's assume the guest has virtio (I agree with IDE we need reordering on the host). The guest sends batches of I/O separated by cache flushes.
If the batches are smaller than the virtio queue length, ideally things look like: io_submit(..., batch_size_1); io_getevents(..., batch_size_1); fdatasync(); io_submit(..., batch_size_2); io_getevents(..., batch_size_2); fdatasync(); io_submit(..., batch_size_3); io_getevents(..., batch_size_3); fdatasync(); (certainly that won't happen today, but it could in principle). How does a write cache give any advantage? The host kernel sees _exactly_ the same information as it would from a bunch of threaded pwritev()s followed by fdatasync(). (wish: IO_CMD_ORDERED_FDATASYNC) If the batch size is larger than the virtio queue size, or if there are no flushes at all, then yes the huge write cache gives more opportunity for reordering. But we're already talking hundreds of requests here. Let's say the virtio queue size was unlimited. What merging/reordering opportunity are we missing on the host? Again we have exactly the same information: either the pagecache lru + radix tree that identifies all dirty pages in disk order, or the block queue with pending requests that contains exactly the same information. Something is wrong. Maybe it's my understanding, but on the other hand it may be a piece of kernel code.
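The batch-then-sync pattern above can be sketched with plain pwrite() + fdatasync() rather than the io_submit()/io_getevents() calls in the mail (so the sketch stays self-contained, without libaio): each batch of writes is followed by one fdatasync(), mirroring a virtio guest issuing a queue of writes and then a cache flush. The function name and sizes are illustrative only.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `batches` batches of `batch_size` 512-byte writes to fd,
 * syncing once per batch.  Returns 0 on success, -1 on error. */
int write_batches(int fd, int batches, int batch_size)
{
    char block[512] = { 0 };
    off_t off = 0;
    for (int b = 0; b < batches; b++) {
        for (int i = 0; i < batch_size; i++) {
            if (pwrite(fd, block, sizeof block, off) != (ssize_t)sizeof block)
                return -1;
            off += sizeof block;
        }
        if (fdatasync(fd) < 0)   /* one flush covers the whole batch */
            return -1;
    }
    return 0;
}
```

Avi's point is that whether these writes sit in the host pagecache (cache=writeback) or in the block queue (cache=none, async submission), the kernel has the same set of dirty blocks in disk order to schedule between the syncs.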
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-16 13:08:28]: On 03/16/2010 12:44 PM, Christoph Hellwig wrote: On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote: Are you talking about direct volume access or qcow2? Doesn't matter. For direct volume access, I still don't get it. The number of barriers issued by the host must equal (or exceed, but that's pointless) the number of barriers issued by the guest. cache=writeback allows the host to reorder writes, but so does cache=none. Where does the difference come from? Put it another way: in an unvirtualized environment, if you implement a write cache in a storage driver (not device), and sync it on a barrier request, would you expect to see a performance improvement? cache=none only allows very limited reordering in the host. O_DIRECT is synchronous on the host, so there's just some very limited reordering going on in the elevator if we have other I/O going on in parallel. Presumably there is lots of I/O going on, or we wouldn't be having this conversation. We are speaking of multiple VMs doing I/O in parallel. In addition to that, the disk write cache can perform limited reordering and caching, but the disk cache has a rather limited size. The host pagecache gives a much wider opportunity to reorder, especially if the guest workload is not cache-flush heavy. If the guest workload is extremely cache-flush heavy, the usefulness of the pagecache is rather limited, as we'll only use very little of it, but pay by having to do a data copy. If the workload is not cache-flush heavy, and we have multiple guests doing I/O to the same spindles, it will allow the host to do much more efficient data writeout by being able to do better ordered (less seeky) and bigger I/O (especially if the host has real storage compared to ide for the guest). Let's assume the guest has virtio (I agree with IDE we need reordering on the host). The guest sends batches of I/O separated by cache flushes.
If the batches are smaller than the virtio queue length, ideally things look like:

  io_submit(..., batch_size_1); io_getevents(..., batch_size_1); fdatasync();
  io_submit(..., batch_size_2); io_getevents(..., batch_size_2); fdatasync();
  io_submit(..., batch_size_3); io_getevents(..., batch_size_3); fdatasync();

(certainly that won't happen today, but it could in principle). How does a write cache give any advantage? The host kernel sees _exactly_ the same information as it would from a bunch of threaded pwritev()s followed by fdatasync(). Are you suggesting that the model with cache=writeback gives us the same I/O pattern as cache=none, so there are no opportunities for optimization? (wish: IO_CMD_ORDERED_FDATASYNC) If the batch size is larger than the virtio queue size, or if there are no flushes at all, then yes the huge write cache gives more opportunity for reordering. But we're already talking hundreds of requests here. Let's say the virtio queue size was unlimited. What merging/reordering opportunity are we missing on the host? Again we have exactly the same information: either the pagecache LRU + radix tree that identifies all dirty pages in disk order, or the block queue with pending requests that contains exactly the same information. Something is wrong. Maybe it's my understanding, but on the other hand it may be a piece of kernel code. I assume you are talking of dedicated disk partitions and not individual disk images residing on the same partition. -- Three Cheers, Balbir
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/16/2010 04:27 PM, Balbir Singh wrote: Let's assume the guest has virtio (I agree with IDE we need reordering on the host). The guest sends batches of I/O separated by cache flushes. If the batches are smaller than the virtio queue length, ideally things look like:

  io_submit(..., batch_size_1); io_getevents(..., batch_size_1); fdatasync();
  io_submit(..., batch_size_2); io_getevents(..., batch_size_2); fdatasync();
  io_submit(..., batch_size_3); io_getevents(..., batch_size_3); fdatasync();

(certainly that won't happen today, but it could in principle). How does a write cache give any advantage? The host kernel sees _exactly_ the same information as it would from a bunch of threaded pwritev()s followed by fdatasync(). Are you suggesting that the model with cache=writeback gives us the same I/O pattern as cache=none, so there are no opportunities for optimization? Yes. The guest also has a large cache with the same optimization algorithm. (wish: IO_CMD_ORDERED_FDATASYNC) If the batch size is larger than the virtio queue size, or if there are no flushes at all, then yes the huge write cache gives more opportunity for reordering. But we're already talking hundreds of requests here. Let's say the virtio queue size was unlimited. What merging/reordering opportunity are we missing on the host? Again we have exactly the same information: either the pagecache LRU + radix tree that identifies all dirty pages in disk order, or the block queue with pending requests that contains exactly the same information. Something is wrong. Maybe it's my understanding, but on the other hand it may be a piece of kernel code. I assume you are talking of dedicated disk partitions and not individual disk images residing on the same partition. Correct. Images in files introduce new writes which can be optimized.
-- error compiling committee.c: too many arguments to function
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 09:22 AM, Balbir Singh wrote: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache!=none, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. Well, for a guest, host page cache is a lot slower than guest page cache. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache!=none, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. Well, for a guest, host page cache is a lot slower than guest page cache. Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages -- Three Cheers, Balbir
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache!=none, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. Well, for a guest, host page cache is a lot slower than guest page cache. Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it.
-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 10:27:45]: On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache!=none, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. Well, for a guest, host page cache is a lot slower than guest page cache. Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. cache=off works for *direct I/O* supported filesystems and my concern is that one of the side-effects is that idle VMs can consume a lot of memory (assuming all the memory is available to them). As the number of VMs grows, they could cache a whole lot of memory. In my experiments I found that the total amount of memory cached far exceeded the mapped ratio by a large amount when we had idle VMs. The philosophy of this patch is to move the caching to the _host_ and let the host maintain the cache instead of the guest. One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large.
The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach; maybe dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way; as long as the guest is not writing frequently to that page, we don't need the page cache in the host. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- Three Cheers, Balbir
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 11:17 AM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-03-15 10:27:45]: On 03/15/2010 10:07 AM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-03-15 09:48:05]: On 03/15/2010 09:22 AM, Balbir Singh wrote: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache!=none, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. Well, for a guest, host page cache is a lot slower than guest page cache. Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. cache=off works for *direct I/O* supported filesystems and my concern is that one of the side-effects is that idle VMs can consume a lot of memory (assuming all the memory is available to them). As the number of VMs grows, they could cache a whole lot of memory. In my experiments I found that the total amount of memory cached far exceeded the mapped ratio by a large amount when we had idle VMs. The philosophy of this patch is to move the caching to the _host_ and let the host maintain the cache instead of the guest. That's only beneficial if the cache is shared. Otherwise, you could use the balloon to evict cache when memory is tight. Shared cache is mostly a desktop thing where users run similar workloads. For servers, it's much less likely.
So a modified guest doesn't help a lot here. One of the reasons I created a boot parameter was to deal with selective enablement for cases where memory is the most important resource being managed. I do see a hit in performance with my results (please see the data below), but the savings are quite large. The other solution mentioned in the TODOs is to have the balloon driver invoke this path. The sysctl also allows the guest to tune the amount of unmapped page cache if needed. The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach; maybe dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way; as long as the guest is not writing frequently to that page, we don't need the page cache in the host. Trimming the host page cache should happen automatically under pressure. Since the page is cached by the guest, it won't be re-read, so the host page is not frequently used, and is then dropped. -- error compiling committee.c: too many arguments to function
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
* Avi Kivity a...@redhat.com [2010-03-15 11:27:56]: The knobs are for 1. Selective enablement 2. Selective control of the % of unmapped pages An alternative path is to enable KSM for page cache. Then we have direct read-only guest access to host page cache, without any guest modifications required. That will be pretty difficult to achieve though - will need a readonly bit in the page cache radix tree, and teach all paths to honour it. Yes, it is, I've taken a quick look. I am not sure if de-duplication would be the best approach; maybe dropping the page in the page cache might be a good first step. Data consistency would be much easier to maintain that way; as long as the guest is not writing frequently to that page, we don't need the page cache in the host. Trimming the host page cache should happen automatically under pressure. Since the page is cached by the guest, it won't be re-read, so the host page is not frequently used, and is then dropped. Yes, agreed, but dropping is easier than tagging cache as read-only and getting everybody to understand read-only cached pages. -- Three Cheers, Balbir
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On Mon, 15 Mar 2010 12:52:15 +0530 Balbir Singh wrote: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache!=none, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid aggressiveness in reaping page cache ever so frequently, at the same time providing control.
The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. The patch is applied against mmotm feb-11-2010. Hi, If you go ahead with this, please add the boot parameter and its description to Documentation/kernel-parameters.txt. TODOS - 1. Balance slab cache as well 2. Invoke the balance routines from the balloon driver --- ~Randy
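For reference, an entry along these lines in Documentation/kernel-parameters.txt might work. The parameter name unmapped_page_control is taken from the patch description above; the wording here is only an illustrative sketch, not text actually merged:

```
	unmapped_page_control
			[KNL] Enable preferred reclaim of unmapped page
			cache by kswapd.  Intended for virtualized guests
			running with cache!=none, where the host page
			cache already duplicates guest-cached data; with
			this set, the guest sheds its duplicate copy under
			memory pressure.  The amount of unmapped page
			cache retained can be tuned at runtime via the
			vm.min_unmapped_ratio sysctl.
```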
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 04:27 AM, Avi Kivity wrote: That's only beneficial if the cache is shared. Otherwise, you could use the balloon to evict cache when memory is tight. Shared cache is mostly a desktop thing where users run similar workloads. For servers, it's much less likely. So a modified guest doesn't help a lot here. Not really. In many cloud environments, there's a set of common images that are instantiated on each node. Usually this is because you're running a horizontally scalable application or because you're supporting an ephemeral storage model. In fact, with ephemeral storage, you typically want to use cache=writeback since you aren't providing data guarantees across shutdown/failure. Regards, Anthony Liguori
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
Avi Kivity a...@redhat.com writes: On 03/15/2010 10:07 AM, Balbir Singh wrote: Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. Hi Avi. One observation about your recommendation for cache=none: We run hosts of VMs accessing drives backed by logical volumes carved out from md RAID1. Each host has 32GB RAM and eight cores, divided between (say) twenty virtual machines, which pretty much fill the available memory on the host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback caching turned on get advertised to the guest as having a write-cache, and FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback isn't acting as cache=neverflush like it would have done a year ago. I know that comparing performance for cache=none against that unsafe behaviour would be somewhat unfair!) Wasteful duplication of page cache between guest and host notwithstanding, turning on cache=writeback is a spectacular performance win for our guests. For example, even IDE with cache=writeback easily beats virtio with cache=none in most of the guest filesystem performance tests I've tried. The anecdotal feedback from clients is also very strongly in favour of cache=writeback. With a host full of cache=none guests, IO contention between guests is hugely problematic with non-stop seek from the disks to service tiny O_DIRECT writes (especially without virtio), many of which needn't have been synchronous if only there had been some way for the guest OS to tell qemu that. Running with cache=writeback seems to reduce the frequency of disk flush per guest to a much more manageable level, and to allow the host's elevator to optimise writing out across the guests in between these flushes. Cheers, Chris. 
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 03:23 PM, Chris Webb wrote: Avi Kivity a...@redhat.com writes: On 03/15/2010 10:07 AM, Balbir Singh wrote: Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. Hi Avi. One observation about your recommendation for cache=none: We run hosts of VMs accessing drives backed by logical volumes carved out from md RAID1. Each host has 32GB RAM and eight cores, divided between (say) twenty virtual machines, which pretty much fill the available memory on the host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback caching turned on get advertised to the guest as having a write-cache, and FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback isn't acting as cache=neverflush like it would have done a year ago. I know that comparing performance for cache=none against that unsafe behaviour would be somewhat unfair!) I knew someone would do this... This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery-backed write caches to prevent this. In the setup you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. cache=writethrough provides a much stronger data guarantee. Even in the event of a host failure, data integrity will be preserved. Regards, Anthony Liguori
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On Mon, Mar 15, 2010 at 06:43:06PM -0500, Anthony Liguori wrote: I knew someone would do this... This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery-backed write caches to prevent this. In the setup you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. cache=writethrough provides a much stronger data guarantee. Even in the event of a host failure, data integrity will be preserved. Actually cache=writeback is as safe as any normal host is with a volatile disk cache, except that in this case the disk cache is actually a lot larger. With a properly implemented filesystem this will never cause corruption. You will lose recent updates after the last sync/fsync/etc up to the size of the cache, but filesystem metadata should never be corrupted, and data that has been forced to disk using fsync/O_SYNC should never be lost either. If it is, that's a bug somewhere in the stack, but in my powerfail testing we never saw one using XFS or ext3/4 after I fixed up the fsync code in the latter two.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/15/2010 07:43 PM, Christoph Hellwig wrote: On Mon, Mar 15, 2010 at 06:43:06PM -0500, Anthony Liguori wrote: I knew someone would do this... This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery-backed write caches to prevent this. In the setup you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. cache=writethrough provides a much stronger data guarantee. Even in the event of a host failure, data integrity will be preserved. Actually cache=writeback is as safe as any normal host is with a volatile disk cache, except that in this case the disk cache is actually a lot larger. With a properly implemented filesystem this will never cause corruption. Metadata corruption, not necessarily corruption of data stored in a file. You will lose recent updates after the last sync/fsync/etc up to the size of the cache, but filesystem metadata should never be corrupted, and data that has been forced to disk using fsync/O_SYNC should never be lost either. Not all software uses fsync as much as it should. And often times, it's for good reason (like ext3). This is mitigated by the fact that there's usually a short window of time before metadata is flushed to disk. Adding another layer increases that delay. IIUC, an O_DIRECT write using cache=writeback is not actually on the spindle when the write() completes. Rather, an explicit fsync() would be required. That will cause data corruption in many applications (like databases) regardless of whether the fs gets metadata corruption. You could argue that the software should disable writeback caching on the virtual disk, but we don't currently support that so even if the application did, it's not going to help.
Regards, Anthony Liguori If it is, that's a bug somewhere in the stack, but in my powerfail testing we never saw one using XFS or ext3/4 after I fixed up the fsync code in the latter two.
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
* Chris Webb ch...@arachsys.com [2010-03-15 20:23:54]: Avi Kivity a...@redhat.com writes: On 03/15/2010 10:07 AM, Balbir Singh wrote: Yes, it is a virtio call away, but is the cost of paying twice in terms of memory acceptable? Usually, it isn't, which is why I recommend cache=off. Hi Avi. One observation about your recommendation for cache=none: We run hosts of VMs accessing drives backed by logical volumes carved out from md RAID1. Each host has 32GB RAM and eight cores, divided between (say) twenty virtual machines, which pretty much fill the available memory on the host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback caching turned on get advertised to the guest as having a write-cache, and FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback isn't acting as cache=neverflush like it would have done a year ago. I know that comparing performance for cache=none against that unsafe behaviour would be somewhat unfair!) Wasteful duplication of page cache between guest and host notwithstanding, turning on cache=writeback is a spectacular performance win for our guests. For example, even IDE with cache=writeback easily beats virtio with cache=none in most of the guest filesystem performance tests I've tried. The anecdotal feedback from clients is also very strongly in favour of cache=writeback. With a host full of cache=none guests, IO contention between guests is hugely problematic with non-stop seek from the disks to service tiny O_DIRECT writes (especially without virtio), many of which needn't have been synchronous if only there had been some way for the guest OS to tell qemu that. Running with cache=writeback seems to reduce the frequency of disk flush per guest to a much more manageable level, and to allow the host's elevator to optimise writing out across the guests in between these flushes. Thanks for the inputs above, they are extremely useful. 
The goal of these patches is that with cache != none, we allow double caching when needed and then slowly take away unmapped pages, pushing the caching to the host. There are knobs to control how much, etc., and the whole feature is enabled via a boot parameter. -- Three Cheers, Balbir
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
* Randy Dunlap randy.dun...@oracle.com [2010-03-15 08:46:31]: On Mon, 15 Mar 2010 12:52:15 +0530 Balbir Singh wrote: Hi, If you go ahead with this, please add the boot parameter and its description to Documentation/kernel-parameters.txt. I certainly will, thanks for keeping a watch. -- Three Cheers, Balbir