Re: [RFC] vhost-blk implementation
Inspired by the vhost-net implementation, I did an initial prototype of vhost-blk to see if it provides any benefits over QEMU virtio-blk. I haven't handled all the error cases, fixed naming conventions etc., but the implementation is stable to play with. I tried not to deviate from the vhost-net implementation where possible.

Can you also send the qemu side of it?

with vhost-blk:

# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
640000+0 records in
640000+0 records out
83886080000 bytes (84 GB) copied, 126.135 seconds, 665 MB/s

real    2m6.137s
user    0m0.281s
sys     0m14.725s

without vhost-blk: (virtio)
---
# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
640000+0 records in
640000+0 records out
83886080000 bytes (84 GB) copied, 275.466 seconds, 305 MB/s

real    4m35.468s
user    0m0.373s
sys     0m48.074s

Which caching mode is this? I assume cache=writeback, because otherwise you'd be doing synchronous I/O directly from the handler.

+static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
+			struct iovec *iov, int in)
+{
+	loff_t pos = sector << 8;
+	int ret = 0;
+
+	if (type & VIRTIO_BLK_T_FLUSH) {
+		ret = vfs_fsync(file, file->f_path.dentry, 1);
+	} else if (type & VIRTIO_BLK_T_OUT) {
+		ret = vfs_writev(file, iov, in, &pos);
+	} else {
+		ret = vfs_readv(file, iov, in, &pos);
+	}
+	return ret;

I have to admit I don't understand the vhost architecture at all, but where do the actual data pointers used by the iovecs reside? vfs_readv/writev expect both the iovec itself and the buffers pointed to by it to reside in userspace, so just using kernel buffers here will break badly on architectures with different user/kernel mappings. A lot of this is fixable using simple set_fs & co tricks, but for direct I/O, which uses get_user_pages, even that will fail badly.

Also it seems like you're doing all the I/O synchronously here?
For cache=writeback operations that could explain the read speedup, as you're avoiding context switches, but for actual write I/O which has to get data to disk (either directly from vfs_writev or later through vfs_fsync) this seems like a really bad idea, stealing a lot of guest time that should happen in the background.

Other than that the code seems quite nice and simple, but one huge problem is that it'll only support raw images, and thus misses out on all the nice image formats used in qemu deployments, especially qcow2. It's also missing the ioctl magic we're having in various places, both for controlling host devices like cdroms and SG passthrough.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] vhost-blk implementation
On Tue, Mar 23, 2010 at 12:03:14PM +0200, Avi Kivity wrote:

I also think it should be done at the bio layer. File I/O is going to be slower; if we do vhost-blk we should concentrate on maximum performance. The block layer also exposes more functionality we can use (asynchronous barriers for example).

The block layer is more flexible, but that limits you to stacking directly on top of a block device, which is extremely inflexible.
Re: [RFC] vhost-blk implementation
On Thu, Mar 25, 2010 at 08:29:03AM +0200, Avi Kivity wrote:

We still have a virtio implementation in userspace for file-based images. In any case, the file APIs are not asynchronous so we'll need a thread pool. That will probably minimize the difference in performance between the userspace and kernel implementations.

The kernel has real aio when doing direct I/O, although currently we can't easily use it from kernelspace. There's a few people who've been looking into making it usable for that, though.
Re: [RFC] vhost-blk implementation
On Wed, Mar 24, 2010 at 01:22:37PM -0700, Badari Pulavarty wrote:

Yes. This is with the default (writeback) cache model. As mentioned earlier, readahead is helping here and in most cases, data would be ready in the pagecache.

Ok. cache=writeback performance is something I haven't bothered looking at at all. For cache=none any streaming write or random workload with large enough record sizes got basically the same performance as native using kernel aio, and the same for write but slightly degraded for reads using the thread pool. See my attached JLS presentation for some numbers.

iovecs and buffers are user-space pointers (from the host kernel point of view). They are guest addresses. So, I don't need to do any set_fs tricks.

Right now they're not declared as such, so sparse would complain.

Yes. QEMU virtio-blk is batching up all the writes and handing off the work to another thread.

If using the thread pool. If using kernel aio it performs the io_submit system call directly.

What should I do here? I can create a bunch of kernel threads to do the IO for me. Or somehow fit and reuse the AIO io_submit() mechanism. What's the best way here? I hate to duplicate all the code VFS is doing.

The only thing you can do currently is adding a thread pool. It might be possible to use io_submit for the O_DIRECT case with some refactoring.
Re: [RFC] vhost-blk implementation
On Wed, Mar 24, 2010 at 01:22:37PM -0700, Badari Pulavarty wrote:

iovecs and buffers are user-space pointers (from the host kernel point of view). They are guest addresses. So, I don't need to do any set_fs tricks.

From verifying the code and using the sparse annotations it appears that the actual buffers are userspace pointers, but the iovecs in the virtqueue are kernel level pointers, so you would need some annotations.

While we're at it, here is a patch fixing the remaining sparse warnings in vhost-blk:

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/vhost/blk.c
===================================================================
--- linux-2.6.orig/drivers/vhost/blk.c	2010-04-05 21:15:11.638004250 +0200
+++ linux-2.6/drivers/vhost/blk.c	2010-04-05 21:16:13.238004599 +0200
@@ -86,7 +86,7 @@ static void handle_blk(struct vhost_blk
 		nvecs++;
 		BUG_ON(vq->iov[nvecs].iov_len != 1);
 
-		if (copy_to_user(vq->iov[nvecs].iov_base, &status, sizeof status) < 0) {
+		if (copy_to_user(vq->iov[nvecs].iov_base, &status, sizeof status)) {
 			printk("copy to user failed\n");
 			vhost_discard_vq_desc(vq);
 			break;
@@ -199,7 +199,7 @@ static struct miscdevice vhost_blk_misc
 	vhost_blk_fops,
 };
 
-int vhost_blk_init(void)
+static int vhost_blk_init(void)
 {
 	int r = vhost_init();
 	if (r)
@@ -216,7 +216,7 @@ err_init:
 }
 module_init(vhost_blk_init);
 
-void vhost_blk_exit(void)
+static void vhost_blk_exit(void)
 {
 	misc_deregister(&vhost_blk_misc);
 	vhost_cleanup();
[PATCH] vhost: fix sparse warnings
Index: linux-2.6/drivers/vhost/net.c
===================================================================
--- linux-2.6.orig/drivers/vhost/net.c	2010-04-05 21:13:24.196004388 +0200
+++ linux-2.6/drivers/vhost/net.c	2010-04-05 21:13:32.726004109 +0200
@@ -641,7 +641,7 @@ static struct miscdevice vhost_net_misc
 	vhost_net_fops,
 };
 
-int vhost_net_init(void)
+static int vhost_net_init(void)
 {
 	int r = vhost_init();
 	if (r)
@@ -658,7 +658,7 @@ err_init:
 }
 module_init(vhost_net_init);
 
-void vhost_net_exit(void)
+static void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
 	vhost_cleanup();

Index: linux-2.6/drivers/vhost/vhost.c
===================================================================
--- linux-2.6.orig/drivers/vhost/vhost.c	2010-04-05 21:12:58.806004249 +0200
+++ linux-2.6/drivers/vhost/vhost.c	2010-04-05 21:14:44.681010674 +0200
@@ -710,7 +710,7 @@ int vhost_log_write(struct vhost_virtque
 	return 0;
 }
 
-int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
+static int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
 		   struct iovec iov[], int iov_size)
 {
 	const struct vhost_memory_region *reg;
@@ -736,7 +736,7 @@ int translate_desc(struct vhost_dev *dev
 		_iov = iov + ret;
 		size = reg->memory_size - addr + reg->guest_phys_addr;
 		_iov->iov_len = min((u64)len, size);
-		_iov->iov_base = (void *)(unsigned long)
+		_iov->iov_base = (void __user *)(unsigned long)
 			(reg->userspace_addr + addr - reg->guest_phys_addr);
 		s += size;
 		addr += size;
@@ -990,7 +990,7 @@ void vhost_discard_vq_desc(struct vhost_
  * want to notify the guest, using eventfd. */
 int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 {
-	struct vring_used_elem *used;
+	struct vring_used_elem __user *used;
 
 	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
 	 * next entry in that used ring. */
@@ -1014,7 +1014,8 @@ int vhost_add_used(struct vhost_virtqueu
 	smp_wmb();
 	/* Log used ring entry write. */
 	log_write(vq->log_base,
-		  vq->log_addr + ((void *)used - (void *)vq->used),
+		  vq->log_addr +
+		  ((void __user *)used - (void __user *)vq->used),
 		  sizeof *used);
 	/* Log used index update. */
 	log_write(vq->log_base,
Re: [RFC] vhost-blk implementation
On Thu, Mar 25, 2010 at 04:00:56PM +0100, Asdo wrote:

Would the loop device provide the features of a block device? I recall barrier support at least has been added recently.

It does, but not in a very efficient way.

Is it recommended to run kvm on a loopback mounted file compared to on a raw file?

No.
Re: Shouldn't cache=none be the default for drives?
On Thu, Apr 08, 2010 at 10:05:09AM +0400, Michael Tokarev wrote:

LVM volumes. This is because with cache=none, the virtual disk image is opened with the O_DIRECT flag, which means all I/O bypasses the host scheduler and buffer cache.

O_DIRECT does not bypass the I/O scheduler, only the page cache.
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Tue, May 04, 2010 at 02:08:24PM +0930, Rusty Russell wrote:

On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:

I took a stab at documenting CMD and FLUSH request types in virtio block. Christoph, could you look over this please? I note that the interface seems full of warts to me, this might be a first step to cleaning them.

ISTR Christoph had withdrawn some patches in this area, and was waiting for him to resubmit?

Any patches I've withdrawn in this area are withdrawn for good. But what I really need to do is to review Michael's spec updates, sorry. I'll get back to it today.
Re: [Qemu-devel] Qemu-KVM 0.12.3 and Multipath - Assertion
On Tue, May 04, 2010 at 04:01:35PM +0200, Kevin Wolf wrote:

Great, I'm going to submit it as a proper patch then. Christoph, by now I'm pretty sure it's right, but can you have another look if this is correct, anyway?

It looks correct to me - we really shouldn't update the fields until bdrv_aio_cancel has returned. In fact we can't cancel a request more often than not, so there's a fairly high chance it will complete.

Reviewed-by: Christoph Hellwig h...@lst.de
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Fri, Feb 19, 2010 at 12:22:20AM +0200, Michael S. Tsirkin wrote:

I took a stab at documenting CMD and FLUSH request types in virtio block. Christoph, could you look over this please? I note that the interface seems full of warts to me, this might be a first step to cleaning them.

The whole virtio-blk interface is full of warts. It has been extended rather ad-hoc, so that is rather expected.

One issue I struggled with especially is how the type field mixes bits and non-bit values. I ended up simply defining all legal values, so that we have CMD = 2, CMD_OUT = 3 and so on.

It's basically a complete mess without much logic behind it.

+\change_unchanged
+the high bit
+\change_inserted 0 1266497301
+ (VIRTIO_BLK_T_BARRIER)
+\change_unchanged
+ indicates that this request acts as a barrier and that all preceding requests
+ must be complete before this one, and all following requests must not be
+ started until this is complete.
+
+\change_inserted 0 1266504385
+ Note that a barrier does not flush caches in the underlying backend device
+ in host, and thus does not serve as data consistency guarantee.
+ Driver must use FLUSH request to flush the host cache.
+\change_unchanged

I'm not sure it's even worth documenting it. I can't see any way to actually implement safe behaviour with the VIRTIO_BLK_T_BARRIER-style barriers.

Btw, did I mention that .lyx is a really horrible format to review diffs for? Plain latex would be a lot better..
Re: [Qemu-devel] [PATCH] virtio-spec: document block CMD and FLUSH
On Tue, Apr 20, 2010 at 02:46:35AM +0100, Jamie Lokier wrote:

Does this mean that virtio-blk supports all three combinations?

   1. FLUSH that isn't a barrier
   2. FLUSH that is also a barrier
   3. Barrier that is not a flush

1 is good for fsync-like operations; 2 is good for journalling-like ordered operations. 3 sounds like it doesn't mean a lot as the host cache provides no guarantees and has no ordering facility that can be used.

No. The Linux virtio_blk guest driver either supports data integrity by using FLUSH or can send down BARRIER requests, which aren't much help at all. Qemu only implements FLUSH anyway.
Re: [Qemu-devel] [RFC PATCH 1/2] close all the block drivers before the qemu process exits
On Wed, May 12, 2010 at 07:46:52PM +0900, MORITA Kazutaka wrote:

This patch calls the close handler of the block driver before the qemu process exits. This is necessary because the sheepdog block driver releases the lock of VM images in the close handler.

Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp

Looks good in principle, except that bdrv_first is gone and has been replaced with a real list in the meantime, so this won't even apply.
Re: [PATCH 1/2] [block]: Fix scsi-generic breakage in find_image_format()
On Sat, May 15, 2010 at 06:30:52AM -0700, Nicholas A. Bellinger wrote:

From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds a special BlockDriverState->sg check in block.c:find_image_format() after bdrv_file_open() -> block/raw-posix.c:hdev_open() has been called to determine if we are dealing with a Linux host scsi-generic device or not. The patch then returns the BlockDriver * from find_protocol(), skipping the subsequent bdrv_read() and rest of find_image_format().

That's not quite correct, as we don't want to expose formats directly. Returning bdrv_find_format("raw"); should fix it for now, although I really hate having these special cases in block.c.
Re: [PATCH 2/2] [block]: Skip refresh_total_sectors() for scsi-generic devices
On Sat, May 15, 2010 at 06:30:59AM -0700, Nicholas A. Bellinger wrote:

From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds a BlockDriverState->sg check in block.c:bdrv_common_open() to skip the new refresh_total_sectors() call once we know we are working with a scsi-generic device. We go ahead and skip this call for scsi-generic devices because block/raw-posix.c:raw_getlength() -> lseek() will return -ESPIPE.

How about moving that check into refresh_total_sectors?
Re: qemu-kvm hangs if multipath device is queing
On Tue, May 18, 2010 at 03:22:36PM +0200, Kevin Wolf wrote:

I think it's stuck here in an endless loop:

    while (laiocb->ret == -EINPROGRESS)
        qemu_laio_completion_cb(laiocb->ctx);

Can you verify this by single-stepping one or two loop iterations? ret and errno after the read call could be interesting, too.

Maybe the compiler is just too smart. Without some form of barrier it could just optimize the loop away, as laiocb->ret couldn't change in a normal single-threaded environment.
Re: KVM call agenda for May 18
On Tue, May 18, 2010 at 08:52:36AM -0500, Anthony Liguori wrote:

This should be filed in launchpad as a qemu bug and it should be tested against the latest git. This bug sounds like we're using an int to represent sector offset somewhere, but there's not enough info in the bug report to figure out for sure. I just audited the virtio-blk -> raw -> aio=threads path and I don't see an obvious place that we're getting it wrong.

FYI: I'm going to ignore everything that's in launchpad - even more than in the stupid SF bugtracker. While the SF one is almost unusable, launchpad is entirely unusable. If you don't have an account with the evil spacemen empire you can't even check the email addresses of the reporters, so any communication with them is entirely impossible.

It's time we get a proper bugzilla.qemu.org for both qemu and qemu-kvm that can be used sanely. If you ask nicely you might even get a virtual instance of bugzilla.kernel.org, which works quite nicely.
Re: the 1Tb block issue
On Tue, May 18, 2010 at 08:38:22PM +0300, Avi Kivity wrote:

Yes. Why would Linux post overlapping requests? Makes no sense. There may be a guest bug in here too. Christoph?

Overlapping writes are entirely fine from the guest POV, although they should be rather unusual. We can update a page and send it out again when it gets redirtied while still out on the wire.
Re: [PATCH +stable] block: don't attempt to merge overlapping requests
On Wed, May 19, 2010 at 10:23:44AM +0100, Stefan Hajnoczi wrote:

On Wed, May 19, 2010 at 10:06 AM, Avi Kivity a...@redhat.com wrote:

In the cache=writeback case the virtio-blk guest driver does: blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, ...)

I don't follow. What's the implication?

I was wondering whether the queue is incorrectly set to a mode where overlapping write requests aren't being ordered. Anyway, Christoph says overlapping write requests can be issued, so my theory is broken.

They can happen, but won't during usual pagecache based writeback. So this should not happen for the pagecache based mke2fs workload. It could happen for a workload with mkfs that uses O_DIRECT. And we need to handle it either way.

And btw, our barrier handling for devices using multiwrite (aka virtio) is utterly busted right now, patch will follow ASAP.
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
On Fri, May 21, 2010 at 09:49:56PM +0100, Stefan Hajnoczi wrote:

http://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps

Requires kernel support - not sure if enough of utrace is in mainline for this to work out-of-the-box across distros.

Nothing of utrace is in mainline, never mind the whole systemtap code, which is intentionally kept out of the kernel tree. Using this means that for every probe in userspace code you need to keep the configured source tree of the currently running kernel around, which is completely unusable for typical developer setups.
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On Tue, May 25, 2010 at 02:25:53PM +0300, Avi Kivity wrote:

Currently if someone wants to add a new block format, they have to upstream it and wait for a new qemu to be released. With a plugin API, they can add a new block format to an existing, supported qemu.

So? Unless we want a stable driver ABI, which I fundamentally oppose as it would make block driver development hell, they'd have to wait for a new release of the block layer. It's really just going to be a lot of pain for no major gain. qemu releases are frequent enough, and if users care enough they can also easily patch qemu.
Re: raw disks no longer work in latest kvm (kvm-88 was fine)
On Sat, May 29, 2010 at 04:42:59PM +0700, Antoine Martin wrote:

Can someone explain the aio options? All I can find is this:

# qemu-system-x86_64 -h | grep -i aio
       [,addr=A][,id=name][,aio=threads|native]

I assume it means that aio=threads emulates the kernel's aio with separate threads? And is therefore likely to be slower, right? Is there a reason why aio=native is not the default? Shouldn't aio=threads be the fallback?

The kernel AIO support is unfortunately not a very generic API. It only supports O_DIRECT I/O (cache=none for qemu), and if used on a filesystem it might still block if we need to perform block allocations. We could probably make it the default for block devices, but I'm not a big fan of these kinds of conditional defaults.
Re: raw disks no longer work in latest kvm (kvm-88 was fine)
On Sat, May 29, 2010 at 10:55:18AM +0100, Stefan Hajnoczi wrote:

I would expect that aio=native is faster but benchmarks show that this isn't true for all workloads.

In what benchmark do you see worse results for aio=native compared to aio=threads?
Re: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default
Err, I'll take this one back for now pending some more discussion. What we need more urgently is the writeback cache flag, which is now implemented in qemu, patch following ASAP.
Re: [PATCH 05/24] compatfd is included before, and it is compiled unconditionally
Btw, what's the state of getting compatfd upstream? It's a pretty annoying difference between qemu upstream and qemu-kvm.
Re: [PATCH 05/24] compatfd is included before, and it is compiled unconditionally
On Tue, Sep 22, 2009 at 03:25:13PM +0200, Juan Quintela wrote:

Christoph Hellwig h...@infradead.org wrote:

Btw, what's the state of getting compatfd upstream? It's a pretty annoying difference between qemu upstream and qemu-kvm.

I haven't tried. I can try to send a patch. Do you have any use case that will help the cause?

Well, the eventfd compat is used in the thread pool AIO code. I don't know what difference it makes, but I really hate this code being different in both trees. I want to see compatfd used either in both or none.
Re: [PATCH 1/1] qemu-kvm: virtio-net: Re-instate GSO code removed upstream
I might sound like a broken record, but why isn't the full GSO support for virtio-net upstream in qemu?
Re: kvm tuning guide
On Wed, Sep 30, 2009 at 08:20:35AM +0200, Avi Kivity wrote:

On 09/30/2009 07:09 AM, Nikola Ciprich wrote:

The default, IDE, is highly supported by guests but may be slow, especially with disk arrays. If your guest supports it, use the virtio interface:

Avi, what is the status of the data integrity issues Chris Hellwig summarized some time ago?

I don't know. Christoph?

On the qemu side everything is in git HEAD now, but I'm not sure about the qemu-0.11 release as I haven't really followed it. For the guest kernel the virtio cache flush support is now in mainline (post-2.6.31). For the host kernel side about 2/3 of the fixes are now in mainline (post-2.6.31) with the others hopefully getting in this merge window.

Is it safe to recommend virtio to newbies already?

I think so.

I wouldn't. At least not for people caring about their data. It will take a while to promote the guest side fixes to all the interesting guests. IDE has the major advantage that cache flush support has been around in the guest driver for a long time, so we only need to fix the host side, which is a lot easier.
Re: [PATCH 05/24] compatfd is included before, and it is compiled unconditionally
On Thu, Oct 01, 2009 at 01:58:10PM +0200, Juan Quintela wrote:

Discussed with Anthony about it. signalfd is complicated for qemu upstream (too difficult to use properly), and eventfd ... The current eventfd emulation is worse than the pipe code that it substitutes. His suggestion here was to create a new abstraction with an API like:

    push_notify()
    pop_notify()

and then you can implement it with eventfd()/pipes/whatever. What was missing for you of compatfd: qemu_eventfd/qemu_signalfd? Do push_notify()/pop_notify() work for you?

I don't desperately want to use it myself anyway. I just want to get rid of the highly annoying spurious differences in the AIO code due to the use of compatfd. I would be perfectly fine with just killing this use of eventfd in qemu-kvm.
Re: sync guest calls made async on host - SQLite performance
On Sun, Oct 11, 2009 at 11:16:42AM +0200, Avi Kivity wrote:

if scsi is used, you incur the cost of virtualization, if virtio is used, your guest's fsyncs incur less cost. So back to the question to the kvm team. It appears that with the stock KVM setup customers who need higher data integrity (through fsync) should steer away from virtio for the moment. Is that assessment correct?

Christoph, wasn't there a bug where the guest didn't wait for requests in response to a barrier request?

Can't remember anything like that. The bug was the complete lack of cache flush infrastructure for virtio, and the lack of advertising a volatile write cache on ide.
Re: sync guest calls made async on host - SQLite performance
On Wed, Oct 14, 2009 at 08:03:41PM +0900, Avi Kivity wrote:

Can't remember anything like that. The bug was the complete lack of cache flush infrastructure for virtio, and the lack of advertising a volatile write cache on ide.

By complete flush infrastructure, you mean host-side and guest-side support for a new barrier command, yes?

The cache flush command, not a barrier command. The new virtio code implements barriers the same way we do for IDE and SCSI - all barrier semantics are implemented by generic code in the block layer by draining the queues, the only thing we send over the wire are cache flush commands in strategic places.

But can't this be also implemented using QUEUE_ORDERED_DRAIN, and on the host side disabling the backing device write cache? I'm talking about cache=none, primarily.

Yes, it could. But as I found out in a long discussion with Stephen it's not actually necessary. All filesystems do the right thing for a device not claiming to support barriers if it doesn't include write caches, that is, they implement ordering internally. So there is no urge to set QUEUE_ORDERED_DRAIN for the case without a write cache.
Re: [PATCH] virtio-blk: fallback to draining the queue if barrier ops are not supported
On Wed, Oct 14, 2009 at 07:38:45PM +0400, Michael Tokarev wrote: Avi Kivity wrote: Early implementations of virtio devices did not support barrier operations, but did commit the data to disk. In such cases, drain the queue to emulate barrier operations. Are there any implementations currently that actually support barriers? As far as I remember there's no way to invoke barriers from a user-space application on linux, and this is how kvm/qemu is running on this OS. Ignore all the barrier talk. The way Linux uses the various storage transports the primitives are queue draining (done entirely in the guest block layer) and cache flushes. Fdatasync is exactly the same primitive as a WIN_FLUSH_CACHE in ATA or SYNCHRONIZE CACHE in SCSI, modulo the lack of ranges in fdatasync - but that is just a performance optimization and not actually used by Linux guests for now.
Re: sync guest calls made async on host - SQLite performance
On Thu, Oct 15, 2009 at 01:56:40AM +0900, Avi Kivity wrote: Does virtio say it has a write cache or not (and how does one say it?)? Historically it didn't and the only safe way to use virtio was in cache=writethrough mode. Since qemu git as of 4th September and Linux 2.6.32-rc there is a virtio-blk feature to communicate the existence of a volatile write cache, and the support for a cache flush command. With the combination of these two the cache=writeback and cache=none modes are safe for the first time.
Re: sync guest calls made async on host - SQLite performance
On Wed, Oct 14, 2009 at 05:54:23PM -0500, Anthony Liguori wrote: Historically it didn't and the only safe way to use virtio was in cache=writethrough mode. Which should be the default on Ubuntu's kvm that this report is concerned with so I'm a bit confused. So can we please get the detailed setup where this happens, that is:

 - filesystem used in the guest
 - any volume manager / software raid used in the guest
 - kernel version in the guest
 - image format used
 - qemu command line including caching mode, using ide/scsi/virtio, etc
 - qemu/kvm version
 - filesystem used in the host
 - any volume manager / software raid used in the host
 - kernel version in the host

Avi's patch is a performance optimization, not a correctness issue? It could actually minimally degrade performance. For the existing filesystems as upper layer it does not improve correctness either.
Re: sync guest calls made async on host - SQLite performance
On Thu, Oct 15, 2009 at 02:17:02PM +0200, Christoph Hellwig wrote: On Wed, Oct 14, 2009 at 05:54:23PM -0500, Anthony Liguori wrote: Historically it didn't and the only safe way to use virtio was in cache=writethrough mode. Which should be the default on Ubuntu's kvm that this report is concerned with so I'm a bit confused. So can we please get the detailed setup where this happens, that is:

 - filesystem used in the guest
 - any volume manager / software raid used in the guest
 - kernel version in the guest
 - image format used
 - qemu command line including caching mode, using ide/scsi/virtio, etc
 - qemu/kvm version
 - filesystem used in the host
 - any volume manager / software raid used in the host
 - kernel version in the host

And very important the mount options (/proc/self/mounts) of both host and guest.
Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation
On Wed, Oct 28, 2009 at 09:11:29AM +0100, Hannes Reinecke wrote: The problem is I don't have any documentation for the LSI parallel SCSI controller. So I don't know if and in what shape I/O is passed down, nor anything else. And as the SCSI disk emulation is really tied into the LSI parallel SCSI controller, any change in the former is likely to break the latter. scsi-disk is not tied into the lsi/symbios driver. It's also used by the esp driver and usb-msd. But it's a complete mess, and all my attempts to untangle it have so far failed with my brain hurting for a long time after.
Re: [Qemu-devel] [PATCH 3/4] scsi-disk: Factor out SCSI command emulation
On Tue, Oct 27, 2009 at 04:28:59PM +0100, Hannes Reinecke wrote: Other drivers might want to use SCSI command emulation without going through the SCSI disk abstraction, as this imposes too many limits on the emulation. Might be a good idea to move something like this first into the series and share the CDB decoder between scsi and ide (atapi) as a start. A little bit of refactoring of the CDB decoder, e.g. into one function per opcode (-family) won't hurt either.
Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation
On Wed, Oct 28, 2009 at 08:25:22PM +0100, Hannes Reinecke wrote: I don't think we really need two modes. My preferred interface here is to pass down scatter-gather lists down with every xfer; this way it'll be the responsibility of the driver to create the lists in the first place. If it has hardware scatter-gather support it can just pass them down, if not it can as easily create a scatter-gather list with just one element as a bounce buffer. Yes. If this really causes performance problems for esp we can add bounce buffering in that driver later. So something like

 - Get next request
 - Attach iovec/bounce-buffer
 - handle request (command/write/read)
 - complete request (callback)

Btw, from some previous attempts to sort out this code here are some things that I think would be beneficial:

 - try to create generic scsi device/request structures that the hba driver can access, and which contain additional private data for scsi-disk/generic. Information in the generic one would include the information about the data transfer, the tag and the command block.
 - try to get rid of the tag indexing stuff by just using a pointer to the generic scsi request in the hba drivers. That should get rid of a lot of useless list searching.
Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation
On Thu, Oct 29, 2009 at 01:57:40PM +0100, Gerd Hoffmann wrote: Trying to go forward in review+bisect friendly baby steps. Here is what I have now: http://repo.or.cz/w/qemu/kraxel.git?a=shortlog;h=refs/heads/scsi.v1 It is far from being completed, will continue tomorrow. Should give an idea of the direction I'm heading to though. Comments welcome. Nice. I had patches for the header renames, too - the current naming is really awkward. Would be great if Anthony could take those ASAP. Btw, I also split out a cdrom.h from the new scsi.h, not sure if it's worth it for the two prototypes. Just wondering, shouldn't scsi-bus.c just become scsi.c now that it's growing various non-bus glue bits?
Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation
On Thu, Oct 29, 2009 at 10:14:19AM -0500, Anthony Liguori wrote: Which patches are those? http://repo.or.cz/w/qemu/kraxel.git?a=commitdiff;h=1ee5ee08e4427c3db7e1322d30cc0e58e5ca48b9 and http://repo.or.cz/w/qemu/kraxel.git?a=commitdiff;h=a6e6178185786c582141f993272e00521d3f125a
Re: [patch 4/4] KVM-trace port to tracepoints
On Wed, Jul 23, 2008 at 01:08:41PM +0300, Avi Kivity wrote: trace_mark() is used to implement kvmtrace, which is propagated to userspace. So while trace_mark() itself is not a userspace interface, one of its users is. It's an unstable interface. But so is dmesg; that's the nature of tracing. Trace_mark is as stable as any other kernel interface, and the data you pass through it is as stable as you want it to be. In most cases like kvmtrace or my spu scheduler tracing code the trace data is directly forwarded through a userspace interface, and that is as stable as any freeform interface, e.g. like printk mentioned above.
Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
On Thu, Dec 11, 2008 at 05:49:47PM +0100, Andrea Arcangeli wrote: On Thu, Dec 11, 2008 at 05:11:08PM +0100, Gerd Hoffmann wrote: Yes. But kernel aio requires O_DIRECT, so aio users are affected nevertheless. Are you sure? It surely wasn't the case... Mainline kernel aio only implements O_DIRECT. Some RHEL version had support for buffered kernel AIO.
Re: KVM call minutes for Feb 1
On Tue, Feb 01, 2011 at 05:36:13PM +0100, Jan Kiszka wrote: kvm_cpu_exec/kvm_run, and start wondering "What needs to be done to upstream so that qemu-kvm could use that implementation?". If they differ, the reasons need to be understood and patched away, either by fixing/enhancing upstream or simplifying qemu-kvm. Once the upstream changes are merged back, a qemu-kvm patch is posted to switch to that version. While I'm not an expert in that area I really like your approach. I'd really prefer to let you finish up all the major work that way before starting massive revamping like the glib main loop. Resolving the qemu/qemu-kvm schism surely is more important for the overall project than rewriting existing functionality to look a little nicer.
Re: [PATCH] kvm tools: Use mmap for working with disk image V2
How do you plan to handle I/O errors or ENOSPC conditions? Note that shared writeable mappings are by far the feature in the VM/FS code that is most error prone, including the impossibility of doing sensible error handling. The version that accidentally used MAP_PRIVATE actually makes a lot of sense for an equivalent of qemu's snapshot mode where the image is readonly and changes are kept private as long as the amount of modified blocks is small enough to not kill the host VM, but using shared writeable mappings just seems dangerous.
Re: [Qemu-devel] [qemu-iotests][PATCH] Update rbd support
@@ -43,6 +43,10 @@
 _supported_fmt raw
 _supported_proto generic
 _supported_os Linux
+# rbd images are not growable
+if [ "$IMGPROTO" = "rbd" ]; then
+    _notrun "image protocol $IMGPROTO does not support growable images"
+fi

I suspect we only support the weird writing past size for the file protocol, so we should only run the test for it. Or does sheepdog do anything special about it?
Re: [PATCH] kvm tool: add QCOW version 1 read/write support
On Wed, Apr 13, 2011 at 08:01:58PM +0100, Prasad Joshi wrote: The patch only implements the basic read/write support for QCOW version 1 images. Many of the QCOW features are not implemented, for example What's the point? Qcow1 has been deprecated for a long time.
Re: [PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO
On Mon, Jun 14, 2010 at 02:44:31AM -0700, Nicholas A. Bellinger wrote: From: Nicholas Bellinger n...@linux-iscsi.org This patch adds posix-aio-compat.c:paio_submit_len(), which is identical to paio_submit() except that it expects nb_len instead of nb_sectors (* 512) so that it can be used by BSG AIO for write()/read() of struct sg_io_v4. Just remove the nb_sectors argument, we already get the length passed in the iovec.
[PATCH] virtio_blk: support barriers without FLUSH feature
If we want to support barriers with the cache=writethrough mode in qemu we need to tell the block layer that we only need queue drains to implement a barrier. Follow the model set by SCSI and IDE and assume that there is no volatile write cache if the host doesn't advertise it. While this might imply working barriers on old qemu versions or other hypervisors that actually have a volatile write cache this is only a cosmetic issue - these hypervisors don't guarantee any data integrity with or without this patch, but with the patch we at least provide data ordering.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2010-06-08 12:01:09.625254254 +0200
+++ linux-2.6/drivers/block/virtio_blk.c	2010-06-15 14:38:18.518273290 +0200
@@ -366,12 +366,32 @@ static int __devinit virtblk_probe(struc
 	vblk->disk->driverfs_dev = &vdev->dev;
 	index++;
 
-	/* If barriers are supported, tell block layer that queue is ordered */
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
+		/*
+		 * If the FLUSH feature is supported we do have support for
+		 * flushing a volatile write cache on the host.  Use that
+		 * to implement write barrier support.
+		 */
 		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
 				  virtblk_prepare_flush);
-	else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
+	} else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) {
+		/*
+		 * If the BARRIER feature is supported the host expects us
+		 * to order requests by tags.  This implies there is no
+		 * volatile write cache on the host, and that the host
+		 * never re-orders outstanding I/O.  This feature is not
+		 * useful for real life scenarios and is deprecated.
+		 */
 		blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL);
+	} else {
+		/*
+		 * If the FLUSH feature is not supported we must assume that
+		 * the host does not perform any kind of volatile write
+		 * caching.  We still need to drain the queue to provide
+		 * proper barrier semantics.
+		 */
+		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
+	}
 
 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
Re: KVM call minutes for June 15
On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote: KVM/qemu patches

 - patch rate is high, documentation is low, review is low
 - patches need to include better descriptions and documentation
 - will slow down patch writers
 - will make it easier for patch reviewers

What is the qemu patch review policy anyway? There are no Reviewed-by: tags included in the actual commits, and the requirement for a positive review also seems to vary a lot, up to the point that some committers commit code that has never hit a public mailing list before.
Re: [Qemu-devel] [PATCH 1/2] Add 'serial' attribute to virtio-blk devices
On Fri, Jun 18, 2010 at 01:38:02PM -0500, Ryan Harper wrote: Create a new attribute for virtio-blk devices that will fetch the serial number of the block device. This attribute can be used by udev to create disk/by-id symlinks for devices that don't have a UUID (filesystem) associated with them. ATA IDENTIFY strings are special in that they can be up to 20 chars long and aren't required to be NULL-terminated. The buffer is also zero-padded meaning that if the serial is 19 chars or less then we get a NULL-terminated string. When copying this value into a string buffer, we must be careful to copy up to the NULL (if it is present) and only 20 if it is longer, and not to attempt to NULL terminate; this isn't needed. Why is this virtio-blk specific? In a later mail you mention you want to use it for udev. So please export this from scsi/libata as well and we have one proper interface that we can use for all devices.
Re: Graphical virtualisation management system
On Thu, Jun 24, 2010 at 02:01:52PM -0500, Javier Guerra Giraldez wrote: On Thu, Jun 24, 2010 at 1:32 PM, Freddie Cash fjwc...@gmail.com wrote: * virt-manager which requires X and seems to be more desktop-oriented; don't know about the others, but virt-manager runs only on the admin station. on the VM hosts you run only libvirtd, which doesn't need X While it can connect to remote systems it seems totally unusable for that to me. For one thing working over higher latency links like DSL or even transatlantic links seems to be almost impossible. Second I still haven't figured out how to install and manage a system using the serial console with KVM, which certainly contributes to the complete lack of usability above.
Re: JFYI: ext4 bug triggerable by kvm
On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote: Also, ext4 is _very_ slow on O_SYNC writes (which is used in kvm with default cache). Yeah, we probably need to switch to sync_file_range() to avoid the journal commit on every write. No, we don't. sync_file_range does not actually provide any data integrity.
Re: JFYI: ext4 bug triggerable by kvm
On Mon, Aug 16, 2010 at 03:34:12PM -0500, Anthony Liguori wrote: On 08/16/2010 01:42 PM, Christoph Hellwig wrote: On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote: Also, ext4 is _very_ slow on O_SYNC writes (which is used in kvm with default cache). Yeah, we probably need to switch to sync_file_range() to avoid the journal commit on every write. No, we don't. sync_file_range does not actually provide any data integrity. What do you mean by data integrity? sync_file_range only does pagecache-level writeout of the file data. It never calls into the actual filesystem, which means any block allocations (for filling holes / converting preallocated space in normal filesystems, or every write in COW-based filesystems like qcow2) never get flushed to disk, and even more importantly the disk write cache is never flushed. In short it's completely worthless for any real filesystem.
Re: JFYI: ext4 bug triggerable by kvm
On Tue, Aug 17, 2010 at 12:23:01PM +0300, Avi Kivity wrote: On 08/17/2010 12:07 PM, Christoph Hellwig wrote: In short it's completely worthless for any real filesystem. The documentation should be updated then. It suggests that it is usable for data integrity. The manpage has a warning section documenting what I said above since I added it in January.
Re: JFYI: ext4 bug triggerable by kvm
On Tue, Aug 17, 2010 at 07:56:04AM -0500, Anthony Liguori wrote: But assuming that you had a preallocated disk image, it would effectively flush the page cache so it sounds like the only real issue is sparse and growable files. For preallocated as in using fallocate() we are still converting unwritten to regular extents and do have metadata updates. For preallocated as in writing zeroes into the whole image earlier we do indeed only care about the data, and will not have metadata for most filesystems. That still leaves COW based filesystems that need to allocate new blocks on every write, and from my reading NFS also needs the ->fsync callout to actually commit unstable data to disk. and even more importantly the disk write cache is never flushed. The point is that we don't want to flush the disk write cache. The intention of writethrough is not to make the disk cache writethrough but to treat the host's cache as writethrough. We need to make sure data is not in the disk write cache if we want to provide data integrity. It has nothing to do with the qemu caching mode - for cache=writeback or none it's committed as part of the fdatasync call, and for cache=writethrough it's committed as part of the O_SYNC write. Note that both these paths end up calling the filesystem's ->fsync method which is what's required to make writes stable. That's exactly what is missing out in sync_file_range, and that's why that API is not useful at all for data integrity operations. It's also what makes fsync slow on extN - but the fix to that is not to not provide data integrity but rather to make fsync fast. There's various other filesystems that can already do it, and if you insist on using those that are slow for this operation you'll have to suffer until that issue is fixed for them.
Re: JFYI: ext4 bug triggerable by kvm
On Tue, Aug 17, 2010 at 09:20:37AM -0500, Anthony Liguori wrote: On 08/17/2010 08:07 AM, Christoph Hellwig wrote: The point is that we don't want to flush the disk write cache. The intention of writethrough is not to make the disk cache writethrough but to treat the host's cache as writethrough. We need to make sure data is not in the disk write cache if we want to provide data integrity. When the guest explicitly flushes the emulated disk's write cache. Not on every single write completion. That depends on the cache= mode. For cache=none and cache=writeback we present a write-back cache to the guest, and the guest does explicit cache flushes. For cache=writethrough we present a writethrough cache to the guest, and we need to make sure data actually has hit the disk before returning I/O completion to the guest. It has nothing to do with the qemu caching mode - for cache=writeback or none it's committed as part of the fdatasync call, and for cache=writethrough it's committed as part of the O_SYNC write. Note that both these paths end up calling the filesystem's ->fsync method which is what's required to make writes stable. That's exactly what is missing out in sync_file_range, and that's why that API is not useful at all for data integrity operations. For normal writes from a guest, we don't need to follow the write with an fsync(). We should only need to issue an fsync() given an explicit flush from the guest. Define normal writes. For cache=none and cache=writeback we don't have to, and instead make explicit fsync()/fdatasync() calls when we see a cache flush from the guest. For cache=writethrough we guarantee data has made it to disk, and we implement this using O_DSYNC/O_SYNC when opening the file. That tells the operating system to not return until data has hit the disk. For Linux this is internally implemented using a range-fsync/fdatasync after the actual write. fsync() being slow is orthogonal to my point. I don't see why we need to do an fsync() on *every* write.
It should only be necessary when a guest injects an actual barrier. See above.
Re: JFYI: ext4 bug triggerable by kvm
On Tue, Aug 17, 2010 at 09:39:15AM -0500, Anthony Liguori wrote: The type of cache we present to the guest only should relate to how the hypervisor caches the storage. It should be independent of how data is cached by the disk. It is. There can be many levels of caching in a storage hierarchy and each level is cached independently of the next. If the user has a disk with a writeback cache, if we expose a writethrough cache to the guest, it's not our responsibility to make sure that we break through the writeback cache on the disk. The user doesn't know or have to care about the caching. The user uses O_SYNC/fsync to say it wants data on disk, and it's the operating system's job to make that happen. The situation with qemu is the same - if we tell the guest that we do not have a volatile write cache that needs explicit management the guest can rely on the fact that it does not have to do manual cache management.
Re: JFYI: ext4 bug triggerable by kvm
On Tue, Aug 17, 2010 at 09:44:49PM -0500, Anthony Liguori wrote: I think the real issue is we're mixing host configuration with guest visible state. The last time I proposed to decouple the two you and Avi were heavily opposed to it.. With O_SYNC, we're causing cache=writethrough to do writethrough through two layers of the storage hierarchy. I don't think that's necessary or desirable though. It's absolutely necessary if we tell the guest that we do not have a volatile write cache. Which is the only good reason to use cache=writethrough anyway - except for dealing with old guests that can't handle a volatile write cache it's an absolutely stupid mode of operation.
Re: JFYI: ext4 bug triggerable by kvm
On Tue, Aug 17, 2010 at 09:54:07AM -0500, Anthony Liguori wrote: This is simply unrealistic. O_SYNC might force data to be on a platter when using a directly attached disk but many NAS's actually do writeback caching and rely on having a UPS to preserve data integrity. There's really no way in the general case to ensure that data is actually on a platter once you've involved a complex storage setup or you assume FUA. Yes, there is. If you have an array that has battery backup it handles this internally. The normal case is to not set the WCE bit in the mode page, which tells the operating system to never send SYNCHRONIZE_CACHE commands. I have one array that sets the WCE bit nevertheless, but it also doesn't flush its non-volatile cache in SYNCHRONIZE_CACHE, but rather implements it as an effective no-op. Let me put it another way. If an admin knows the disks on a machine have battery backed cache, he's likely to leave writeback caching enabled. We are currently giving the admin two choices with QEMU, either ignore the fact that the disk is battery backed and do writethrough caching of the disk or do writeback caching in the host which Again, this is not qemu's business at all. Qemu is no different from any other application requiring data integrity. If that admin really thinks he needs to override the storage provided settings he can mount the filesystem using -o nobarrier and we will not send cache flushes. I would in general recommend against this, as an external UPS still has lots of failure modes that this doesn't account for. Arrays with internal non-volatile memory already do the right thing anyway.
Re: JFYI: ext4 bug triggerable by kvm
On Tue, Aug 17, 2010 at 05:59:07PM +0300, Avi Kivity wrote: I agree, but there's another case: tell the guest that we have a write cache, use O_DSYNC, but only flush the disk cache on guest flushes. O_DSYNC flushes the disk write cache on any filesystem that supports it. The disk cache is not an abstraction exposed to applications. I believe this can be approximated by mounting the host filesystem with barrier=0? Mounting the host filesystem with nobarrier means we will never explicitly flush the volatile write cache on the disk.
Re: [Qemu-devel] Re: [PATCH v2] virtio-blk physical block size
On Mon, Jan 04, 2010 at 01:38:51PM +1030, Rusty Russell wrote: I thought this was what I was doing, but I have shown over and over that I have no idea about block devices. Our current driver treats BLK_SIZE as the logical and physical size (see blk_queue_logical_block_size). I have no idea what logical vs. physical actually means. Anyone? Most importantly, is it some Linux-internal difference or a real I/O-visible distinction? Those should be the same for any sane interface. They are for classical disk devices with larger block sizes (MO, s390 dasd) and also for the now appearing 4k sector scsi disks. But in the ide world people are concerned about DOS/Windows legacy compatibility so they came up with a nasty hack:

 - there is a physical block size as used by the disk internally (4k initially)
 - all the interfaces to the operating system still happen in the traditional 512 byte blocks to not break any existing assumptions
 - to make sure modern operating systems can optimize for the larger physical sectors the disks expose this size, too.
 - even worse, disks can also have alignment hacks for the traditional DOS partition tables, so that the 512 byte block zero might even have an offset into the first larger physical block. This is also exposed in the ATA identify information.

All in all I don't think this mess is a good idea to replicate in virtio. Virtio by definition requires virtualization aware guests, so we should just follow the SCSI way of larger real block sizes here.
Re: [Qemu-devel] [PATCH RFC] Advertise IDE physical block size as 4K
On Tue, Dec 29, 2009 at 02:39:38PM +0100, Luca Tettamanti wrote: Linux tools put the first partition at sector 63 (512-byte) to retain compatibility with Windows; Well, some of them, and depending on the exact disks. It's all rather complicated. It has been discussed for hardware disk design with 4k sectors, and somehow there were plans to map sectors so that the Linux partition scheme results in nicely aligned filesystem blocks Ugh, I hope you're wrong ;-) AFAICS remapping will lead only to headaches... Linux does not have any problem with misaligned partitions. Linux doesn't care. Neither does Windows. But performance on misaligned partitions will suck badly - both on 4k sector drives, SSDs or probably various copy on write layers in virtualization once you hit the worst case. Fortunately the block topology information present in recent ATA and SCSI standards allows the storage hardware to tell about the required alignment, and Linux now has a topology API to expose it, which is used by the most recent versions of the partitioning tools and filesystem creation tools.
Re: [Qemu-devel] [PATCH RFC] Advertise IDE physical block size as 4K
On Tue, Dec 29, 2009 at 12:07:58PM +0200, Avi Kivity wrote: Guests use this number as a hint for alignment and I/O request sizes. Given that modern disks have 4K block sizes, and cached file-backed images also have 4K block sizes, this hint can improve guest performance. We probably need to make this configurable depending on machine type. It should be the default for -M 0.13 only as it can affect guest code paths. The information is correct per the ATA spec, but: (a) as mentioned above it should not be used for old machine types (b) we need to sort out passing through the first block alignment bits that are also in IDENTIFY word 106 if using a raw block device underneath (c) we probably need to adjust the physical block size depending on the underlying storage topology. I have had a patch in my queue for a while now dealing with (b) and parts of (c), but it's been preempted by more urgent work.
Re: [Qemu-devel] Re: [PATCH v2] virtio-blk physical block size
On Tue, Jan 05, 2010 at 08:16:15PM +, Jamie Lokier wrote: It would be good if virtio relayed the backing device's basic topology hints, so: - If the backing dev is a real disk with 512-byte sectors, virtio should indicate 512-byte blocks to the guest. - If the backing dev is a real disk with 4096-byte sectors, virtio should indicate 4096-byte blocks to the guest. With databases and filesystems, if you care about data integrity: - If the backing dev is a real disk with 4096-byte sectors, or a file whose access is through a 4096-byte-per-page cache, virtio must indicate 4096-byte blocks otherwise guest journalling is not host-powerfail safe. You get the idea. If there is only one parameter, it really should be at least as large as the smallest unit which may be corrupted by writes when errors occur. It's not that easy. IDE only supports larger sector sizes with the physical sector size attribute, not native larger sectors. While scsi does support it, it's untypical and I would not expect the guests to always get it right. So the best option is to use the transport-native way to express that we have larger sectors and expect the guest to do the right thing. I've done some work on the autodetection of larger sector sizes a while ago. But now that people brought up migration I wonder if that makes sense - if we migrate from a 512 byte sector size disk to a 4096 byte sector size disk we can't simply change guest visible attributes. Maybe we should pick one on image creation and then stick to it. For an image format we could write down this information in the image, but for raw images that's impossible.
Re: [PATCHv2 3/4] Add HYPER-V apic access MSRs.
On Sun, Jan 17, 2010 at 02:20:32PM +0200, Avi Kivity wrote:
+	addr = kmap_atomic(page, KM_USER0);
+	clear_user_page(addr, vaddr, page);
+	kunmap_atomic(addr, KM_USER0);
Surprising that clear_user_page needs kmap_atomic() (but true). There's a clear_user_highpage helper to take care of it for you.
Re: [PATCHv2 3/4] Add HYPER-V apic access MSRs.
On Sun, Jan 17, 2010 at 02:41:42PM +0200, Gleb Natapov wrote: I copied the code instead of using the helper function, for some reason unknown to me. Anyway if I can't get struct page from user virtual address I can't use it. Actually I am not sure the page should be zeroed at all. The spec only describes the first dword of the page and doesn't require zeroing the rest as far as I see. There's a clear_user() function that just takes a user virtual address and a size.
Re: [PATCH v3 06/12] Add get_user_pages() variant that fails if major fault is required.
On Tue, Jan 05, 2010 at 04:12:48PM +0200, Gleb Natapov wrote: This patch adds a get_user_pages() variant that only succeeds if getting a reference to a page doesn't require a major fault.
+int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
+			unsigned long start, int nr_pages, int write, int force,
+			struct page **pages, struct vm_area_struct **vmas)
+{
+	int flags = FOLL_TOUCH | FOLL_MINOR;
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
Wouldn't it be better to just export __get_user_pages as a proper user interface, maybe replacing get_user_pages by it entirely?
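The flag composition in the quoted patch can be exercised in isolation. The FOLL_* values below are illustrative stand-ins, not the kernel's actual definitions; only the composition logic mirrors the patch:

```c
#include <stdbool.h>

/*
 * Mirror of the flag construction in the quoted get_user_pages_noio()
 * patch.  The numeric values are stand-ins for illustration; the
 * point is the logic: FOLL_TOUCH | FOLL_MINOR unconditionally (the
 * "no major fault" request), the rest depending on the arguments.
 */
#define FOLL_WRITE  0x01
#define FOLL_TOUCH  0x02
#define FOLL_GET    0x04
#define FOLL_FORCE  0x10
#define FOLL_MINOR  0x20

static int gup_noio_flags(bool pages, bool write, bool force)
{
    int flags = FOLL_TOUCH | FOLL_MINOR;

    if (pages)
        flags |= FOLL_GET;
    if (write)
        flags |= FOLL_WRITE;
    if (force)
        flags |= FOLL_FORCE;
    return flags;
}
```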
Re: [PATCH v3 05/12] Export __get_user_pages_fast.
On Tue, Jan 05, 2010 at 04:12:47PM +0200, Gleb Natapov wrote: KVM will use it to try and find a page without falling back to slow gup. That is why get_user_pages_fast() is not enough. Btw, it seems like it is currently declared unconditionally in linux/mm.h but only implemented by x86, and your code using it needs ifdefs for that. I think you should just introduce a stub that always returns an error here.
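The suggested fallback might look like the sketch below. The guard macro and the simplified signature (a plain pointer array instead of struct page **) are mine, for illustration only:

```c
/*
 * Sketch of the suggested stub: architectures that do not provide
 * __get_user_pages_fast() get a fallback that pins nothing, so
 * callers take the slow-gup path unconditionally instead of needing
 * per-architecture ifdefs.  Signature simplified for illustration.
 */
#ifndef HAVE_GET_USER_PAGES_FAST
static int __get_user_pages_fast(unsigned long start, int nr_pages,
                                 int write, void **pages)
{
    (void)start; (void)nr_pages; (void)write; (void)pages;
    return 0;    /* "zero pages pinned": caller must use slow gup */
}
#endif
```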
Re: Seabios incompatible with Linux 2.6.26 host?
On Thu, Feb 04, 2010 at 03:34:24PM +0100, Pierre Riteau wrote: I think I traced back the issue to the switch from Bochs BIOS to Seabios. By forcing the usage of Bochs BIOS 5f08bb45861f54be478b25075b90d2406a0f8bb3 works, while it dies without the -bios override. Unfortunately, newer versions don't seem to work with Bochs BIOS. Upgrading the host kernel to 2.6.32 (Debian Squeeze) solves the issue. No problem on Fedora 12 as well. Even Linux 2.6.31 stopped working as a host for me since that commit.
Re: raw disks no longer work in latest kvm (kvm-88 was fine)
On Sun, Mar 07, 2010 at 12:32:38PM +0300, Michael Tokarev wrote: Antoine Martin wrote: [] https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2933400&group_id=180599 The initial report is almost 8 weeks old! Is data corruption and data loss somehow less important than the hundreds of patches that have been submitted since?? Or is there a fix somewhere I've missed? I can only guess that the info collected so far is not sufficient to understand what's going on: except for an I/O error writing block NNN we do not have anything at all. So it's impossible to know where the problem is. Actually it is, and the bug has been fixed long ago in: commit e2a305fb13ff0f5cf6ff80aaa90a5ed5954c Author: Christoph Hellwig h...@lst.de Date: Tue Jan 26 14:49:08 2010 +0100 block: avoid creating too large iovecs in multiwrite_merge I've asked for it to be added to the -stable series but that hasn't happened so far.
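The referenced fix amounts to refusing to merge adjacent requests once the combined iovec would exceed what the backend can handle. A sketch of that kind of guard; the names and the limit are illustrative, not qemu's actual multiwrite_merge code:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the guard the referenced commit adds: when coalescing
 * adjacent write requests into one big vectored write, stop merging
 * once the combined iovec would exceed an IOV_MAX-style limit,
 * otherwise the backend fails the oversized request (seen by the
 * guest as an I/O error writing block NNN).
 */
#define MERGE_MAX_IOV 1024

static bool can_merge(size_t cur_niov, size_t next_niov)
{
    return cur_niov + next_niov <= MERGE_MAX_IOV;
}
```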
Re: raw disks no longer work in latest kvm (kvm-88 was fine)
On Sun, Mar 07, 2010 at 07:30:06PM +0200, Avi Kivity wrote: It may also be that glibc is emulating preadv, incorrectly. I've done a quick audit of all paths leading to pread and all seem to align correctly. So either a broken glibc emulation or something else outside the block layer seems likely. Antoine, can you check this? ltrace may help, or run 'strings libc.so | grep pread'. Or just add an #undef CONFIG_PREADV just before the first #ifdef CONFIG_PREADV in posix-aio-compat.c and see if that works.
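A quick host-side probe of preadv() behaviour could look like the sketch below. This only illustrates the checking idea from the mail; it cannot tell a correct glibc emulation apart from a native syscall, and the helper name and file path are mine:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Sanity-check preadv(): write a known pattern, read part of it back
 * at an offset into two iovecs, and verify both the byte count and
 * the contents land in the right buffers.  Returns 0 on success.
 */
static int check_preadv(const char *path)
{
    char a[4], b[4];
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = sizeof(a) },
        { .iov_base = b, .iov_len = sizeof(b) },
    };
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, "0123456789ab", 12) != 12)
        goto fail;
    if (preadv(fd, iov, 2, 4) != 8)     /* should read "456789ab" */
        goto fail;
    close(fd);
    return (memcmp(a, "4567", 4) || memcmp(b, "89ab", 4)) ? -1 : 0;
fail:
    close(fd);
    return -1;
}
```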
Re: linux-aio usable?
On Mon, Mar 08, 2010 at 11:10:29AM +0200, Avi Kivity wrote: Are there any potential pitfalls? It won't work well unless running on a block device (partition or LVM). It will actually work well on pre-allocated filesystem images, at least on XFS and NFS. The real pitfall is that cache=none is required for kernel support, as it only supports O_DIRECT. Is there any reason one should not compile that feature by default? It's compiled by default if libaio and its development headers are found. Does it do anything if not explicitly run with aio=native? No.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Mon, Mar 15, 2010 at 06:43:06PM -0500, Anthony Liguori wrote: I knew someone would do this... This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write-caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery backed write caches to prevent this. In the set up you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. cache=writethrough provides a much stronger data guarantee. Even in the event of a host failure, data integrity will be preserved. Actually cache=writeback is as safe as any normal host is with a volatile disk cache, except that in this case the disk cache is actually a lot larger. With a properly implemented filesystem this will never cause corruption. You will lose recent updates after the last sync/fsync/etc up to the size of the cache, but filesystem metadata should never be corrupted, and data that has been forced to disk using fsync/O_SYNC should never be lost either. If it is, that's a bug somewhere in the stack, but in my powerfail testing we never saw that using xfs or ext3/4 after I fixed up the fsync code in the latter two.
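The contract described above, in code. This is a generic POSIX sketch, not qemu code; with cache=writeback the guest's cache flush becomes exactly this fsync() on the host:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Data is only guaranteed to survive a power failure once fsync()
 * (or an O_SYNC write) has succeeded; everything written since the
 * last such flush sits in a volatile cache and may be lost.
 * Returns 0 once the write is on stable storage, -1 otherwise.
 */
static int durable_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int ret = -1;

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) == (ssize_t)len && fsync(fd) == 0)
        ret = 0;    /* data and file size now on stable storage */
    close(fd);
    return ret;
}
```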
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Mon, Mar 15, 2010 at 08:27:25PM -0500, Anthony Liguori wrote: Actually cache=writeback is as safe as any normal host is with a volatile disk cache, except that in this case the disk cache is actually a lot larger. With a properly implemented filesystem this will never cause corruption. Metadata corruption, not necessarily corruption of data stored in a file. Again, this will not cause metadata corruption either, as long as the filesystem does not lose barriers, although we may lose up to the cache size of new (data or metadata) operations. The consistency of the filesystem is still guaranteed. Not all software uses fsync as much as they should. And often times, it's for good reason (like ext3). If an application needs data on disk it must call fsync, or there is no guarantee at all, even on ext3. And with growing disk caches these issues show up on normal disks often enough that people have realized it by now. IIUC, an O_DIRECT write using cache=writeback is not actually on the spindle when the write() completes. Rather, an explicit fsync() would be required. That will cause data corruption in many applications (like databases) regardless of whether the fs gets metadata corruption. It's not on the spindle for plain O_DIRECT without qemu involved, either. The O_DIRECT write goes through the disk cache and requires an explicit fsync or the O_SYNC open flag to make sure it goes to disk.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
Avi, cache=writeback can be faster than cache=none for the same reasons a disk cache speeds up access. As long as the I/O mix contains more asynchronous than synchronous writes it allows the host to do much more reordering, limited only by the cache size (which can be quite huge when using the host pagecache) and the amount of cache flushes coming from the guest. If you have an fsync-heavy workload or metadata operations with a filesystem like the current XFS you will get lots of cache flushes that limit the usefulness of the additional cache. If you don't have a lot of cache flushes, e.g. due to dumb applications that do not issue fsync, or because ext3 in its default mode never issues cache flushes, the benefit will be enormous, but so will the potential data loss and corruption. But even for something like btrfs, which does provide data integrity but issues cache flushes fairly efficiently, cache=writeback may provide a quite nice speedup, especially with multiple guests accessing the same spindle(s). But I wouldn't be surprised if IBM's extreme differences are indeed due to the extremely unsafe ext3 default behaviour.
Re: KVM call agenda for Mar 16
On Tue, Mar 16, 2010 at 12:38:02PM +0200, Avi Kivity wrote: On 03/16/2010 12:31 PM, Daniel P. Berrange wrote: Polling loops are an indication that something is wrong. Except when people suggest they are the right answer, qcow high watermark ;-P I liked Anthony's suggestion of an lvm2 block format driver. No polling. I have done some work on linking the new lvm library to qemu to control snapshotting. But introducing a whole new block format seems like a lot of duplication to me.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote: Are you talking about direct volume access or qcow2? Doesn't matter. For direct volume access, I still don't get it. The number of barriers issued by the host must equal (or exceed, but that's pointless) the number of barriers issued by the guest. cache=writeback allows the host to reorder writes, but so does cache=none. Where does the difference come from? Put it another way. In an unvirtualized environment, if you implement a write cache in a storage driver (not device), and sync it on a barrier request, would you expect to see a performance improvement? cache=none only allows very limited reordering in the host. O_DIRECT is synchronous on the host, so there's just some very limited reordering going on in the elevator if we have other I/O going on in parallel. In addition to that the disk writecache can perform limited reordering and caching, but the disk cache has a rather limited size. The host pagecache gives a much wider opportunity to reorder, especially if the guest workload is not cache flush heavy. If the guest workload is extremely cache flush heavy the usefulness of the pagecache is rather limited, as we'll only use very little of it, but pay by having to do a data copy. If the workload is not cache flush heavy, and we have multiple guests doing I/O to the same spindles, it will allow the host to do much more efficient data writeout by being able to do better ordered (less seeky) and bigger I/O (especially if the host has real storage compared to ide for the guest). If you don't have a lot of cache flushes, e.g. due to dumb applications that do not issue fsync, or because ext3 in its default mode never issues cache flushes, the benefit will be enormous, but so will the potential data loss and corruption. Shouldn't the host never issue cache flushes in this case? (for direct volume access; qcow2 still needs flushes for metadata integrity).
If the guest never issues a flush, neither will the host, indeed. Data will only go to disk by background writeout or memory pressure.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote: If the batch size is larger than the virtio queue size, or if there are no flushes at all, then yes, the huge write cache gives more opportunity for reordering. But we're already talking hundreds of requests here. Yes. And remember those don't have to come from the same host. Also remember that we rather limit excessive reordering of O_DIRECT requests in the I/O scheduler because they are synchronous-type I/O, while we don't do that for pagecache writeback. And we don't have an unlimited virtio queue size, in fact it's quite limited.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote: They should be reorderable. Otherwise host filesystems on several volumes would suffer the same problems. They are reorderable, just not as extremely as in the page cache. Remember that the request queue really is just a relatively small queue of outstanding I/O, and that is absolutely intentional. Large scale _caching_ is done by the VM in the pagecache, with all the usual aging, pressure, etc. algorithms applied to it. The block device has a relatively small fixed-size request queue associated with it to facilitate request merging and limited reordering, and to have fully set up I/O requests for the device.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote: Chris, can you carry out an experiment? Write a program that pwrite()s a byte to a file at the same location repeatedly, with the file opened using O_SYNC. Measure the write rate, and run blktrace on the host to see what the disk (/dev/sda, not the volume) sees. Should be a (write, flush, write, flush) per pwrite pattern or similar (for writing the data and a journal block, perhaps even three writes will be needed). Then scale this across multiple guests, measure and trace again. If we're lucky, the flushes will be coalesced, if not, we need to work on it. As the person who has written quite a bit of the current O_SYNC implementation and also reviewed the rest of it I can tell you that those flushes won't be coalesced. If we always rewrite the same block we do the cache flush from the fsync method and there is nothing to coalesce it there. If you actually do modify metadata (e.g. by using the new real O_SYNC instead of the old one that always was O_DSYNC, which I introduced in 2.6.33 but which isn't picked up by userspace yet) you might hit a very limited transaction merging window in some filesystems, but it's generally very small for a good reason. If it were too large we'd make one process wait for I/O in another just because we might expect transactions to be coalesced later. There's been some long discussion about that fsync transaction batching tuning for ext3 a while ago.
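The experiment Avi proposes, as a sketch (the measurement and the blktrace side are left out; the function name and path handling are mine):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * The proposed experiment: repeatedly pwrite() one byte to the same
 * offset of a file opened O_SYNC.  Every iteration must reach stable
 * storage before pwrite() returns, so on the host each one shows up
 * in blktrace as write (and, with a volatile cache, flush) traffic.
 * Returns the number of successful iterations, -1 on open failure.
 */
static int osync_rewrite(const char *path, int iterations)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    int done = 0;

    if (fd < 0)
        return -1;
    while (done < iterations && pwrite(fd, "x", 1, 0) == 1)
        done++;
    close(fd);
    return done;
}
```

Timing this loop against the same loop without O_SYNC gives the per-flush cost the thread is arguing about.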
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote: Meanwhile I looked at the code, and it looks bad. There is an IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before issuing it. In any case, qemu doesn't use it as far as I could tell, and even if it did, the needed ->aio_fsync() operation isn't implemented. No one implements it, and all surrounding code is dead wood. It would require us to do asynchronous pagecache operations, which involves major surgery of the VM code. Patches to do this were rejected multiple times.
Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote: Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Who guarantees that less common virtio-blk and virtio-net guest drivers for non-Linux OSes are fine with it? Maybe you should add a feature flag that the guest has to ACK to enable it.
Re: virtio block drivers not working
I have used virtio block in a very similar setup to yours (fully static kernel, -kernel option to kvm/qemu) successfully for quite a while. Can you post your kernel .config and the contents of /proc/devices and /proc/partitions to debug this further?
Re: [Qemu-devel] [PATCH][RFC] Linux AIO support when using O_DIRECT
On Mon, Mar 23, 2009 at 06:17:36PM +0200, Avi Kivity wrote: Instead of introducing yet another layer of indirection, you could add block-raw-linux-aio, which would be registered before block-raw-posix (which is really block-raw-threadpool...), and resist a ->probe() if caching is enabled. Exactly the kind of comment I was about to make, but I need to read a little deeper to understand all the details. But my gut feeling is that this abstraction doesn't help us very much, especially with Avi's aiocb pools in place.
Re: [Qemu-devel] [PATCH][RFC] Linux AIO support when using O_DIRECT
On Mon, Mar 23, 2009 at 12:14:58PM -0500, Anthony Liguori wrote: I'd like to see the O_DIRECT bounce buffering removed in favor of the DMA API bouncing. Once that happens, raw_read and raw_pread can disappear. block-raw-posix becomes much simpler. See my vectored I/O patches for doing the bounce buffering at the optimal place for the aio path. Note that from my reading of the qcow/qcow2 code they might send down unaligned requests, which is something the DMA API would not help with. For the buffered I/O path we will always have to do some sort of buffering due to all the partition header reading etc. And given that that part isn't performance critical my preference would be to keep doing it in bdrv_pread/write and guarantee the low-level drivers proper alignment. We would drop the signaling stuff and have the thread pool use an fd to signal. The big problem with that right now is that it'll cause a performance regression on certain platforms until we have the IO thread in place. Talking about signaling, does anyone remember why the Linux signalfd/eventfd support is only in kvm but not in upstream qemu?
Re: [Qemu-devel] [PATCH][RFC] Linux AIO support when using O_DIRECT
On Mon, Mar 23, 2009 at 12:14:58PM -0500, Anthony Liguori wrote: block-raw-posix needs a major overhaul. That's why I'm not even considering committing the patch as is. I have some WIP patches that split out the host device bits into separate files to get block-raw-posix down to the pure file handling bits, without all the host-specific host device mess. But it's at the end of a really large pile, which needs to be rebased once the patches already on the list have gone in in some form.
Re: [Qemu-devel] [PATCH][RFC] Linux AIO support when using O_DIRECT
On Mon, Mar 23, 2009 at 01:10:30PM -0500, Anthony Liguori wrote: I really dislike having so many APIs. I'd rather have an aio API that took byte accesses or have pread/pwrite always be emulated with a full sector read/write I had patches to change the aio API to byte-based access, and to get rid of the read/write methods to only have the byte-based pread/pwrite APIs, but they got obsoleted by Avi's patch to kill the pread/pwrite ops. We could put in byte-based AIO without byte-based read/write, though. In my patches I put a flag into BlockDriverState indicating whether we allow byte-based access to this instance or otherwise emulate it in the block layer. We still need this, as many of the image formats can't deal with byte-granularity access without read-modify-write cycles, and I think we're better off having one read-modify-write handler in the block layer than one per image format that needs it.
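The read-modify-write cycle described above, sketched at sector granularity. The in-memory "disk" stands in for an image-format driver, and only the single-sector case is handled; none of this is qemu's actual code:

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Stand-in backend: a tiny in-memory "disk" of whole sectors. */
static uint8_t disk[4 * SECTOR_SIZE];

/*
 * Service a byte-granularity write on a backend that only handles
 * whole sectors: read the covering sector, patch the affected bytes,
 * write the sector back.  Writes crossing a sector boundary are
 * rejected here for brevity; a real handler would loop.
 */
static int rmw_pwrite(uint64_t offset, const uint8_t *data, size_t len)
{
    uint64_t sector = offset / SECTOR_SIZE;
    size_t off = offset % SECTOR_SIZE;
    uint8_t sec[SECTOR_SIZE];

    if (off + len > SECTOR_SIZE)
        return -1;                                          /* multi-sector case elided */
    memcpy(sec, disk + sector * SECTOR_SIZE, SECTOR_SIZE);  /* read   */
    memcpy(sec + off, data, len);                           /* modify */
    memcpy(disk + sector * SECTOR_SIZE, sec, SECTOR_SIZE);  /* write  */
    return 0;
}
```

Having one such handler in the block layer means each image format only ever sees nicely aligned whole-sector requests.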
Re: Split kvm source tarballs
On Wed, Mar 25, 2009 at 08:44:58AM -0500, Anthony Liguori wrote: That's what I figured. FWIW, the split tarballs work just fine for me. It may be worth waiting to do step 2 until the IO thread is merged. I think once that happens, we could probably do a sprint to get rid of libkvm in kvm-userspace. That would certainly simplify things. Yeah. And having both the common and split repos just confuses the heck out of any user of the repository. I think the right way to split it is to wait for libkvm going away and just have a qemu-kvm repository and an entirely separate kernel module repository. It's not like there is anything common but the few exported ABI headers, and we can either keep them in both (which would mean qemu-kvm can always build against a defined set of headers) or make qemu-kvm require a kernel source tree like the current kvm support in upstream qemu.
Re: Split kvm source tarballs
On Wed, Mar 25, 2009 at 08:02:48PM +0200, Avi Kivity wrote: So how about this: - keep copies of the headers in the qemu repository. 'make sync' becomes a maintainer tool rather than a developer tool Yeah. That's similar to how we maintain the headers and some shared source files for XFS and libxfs in xfsprogs. - move qemu to the root of the repository, and reparent libkvm/ user/ and friends under it. this will give us easier merging. Yeah. While you're at it user/ might be renamed to something more descriptive. - move the external module kit into kvm.git As in your kvm development kernel tree? Not sure it's a good idea to suck this into a full kernel tree. Probably worth making it a small repository of its own.
Re: Split kvm source tarballs
On Wed, Mar 25, 2009 at 04:34:31PM -0500, Anthony Liguori wrote: But if you created a qemu-svn-stable branch that followed the QEMU stable tree in kvm-userspace, like the qemu-cvs branch follows trunk, then it would be pretty easy to create and maintain a kvm_stable_0_10 branch or whatever you'd like to call it in kvm-userspace. Any chance you could do this? I suspect it's just a matter of creating the branch based off of the qemu-cvs tree at dde and then doing a git-svn fetch. Slightly offtopic, but I always wondered why qemu is hosted in svn. For all the projects keeping semi-forks of qemu that they try to keep in sync or even merge back, a distributed scm would work so much better.
Re: Split kvm source tarballs
On Thu, Mar 26, 2009 at 11:09:07AM +0200, Avi Kivity wrote: If the repo contains only the kit (external-module.h and the hack scripts) we'll be left with dual repositories with their confusion and unbisectability. If the repo contains both the kit and the code, I'll need to commit every kvm.git change into that repository, which I'm sure to botch now and then. Do you see any specific problem with dropping kvm-userspace.git/kernel/* under kvm.git/scripts/kvm/? Adding more stuff to a kernel tree that doesn't apply to building that tree seems rather odd. But if it makes your life easier go for it; it's definitely much better than the current setup.
Re: Split kvm source tarballs
On Thu, Mar 26, 2009 at 01:19:46PM -0500, Anthony Liguori wrote: Slightly offtopic, but I always wondered why qemu is hosted in svn. For all the projects keeping semi-forks of qemu that they try to keep in sync or even merge back, a distributed scm would work so much better. I'm going to switch it to git fwiw. If it's soon enough maybe the kvm repository reorganization should wait for it so that the new repository can be cloned from upstream qemu?
Re: Split kvm source tarballs
Any idea when the split will happen?
Re: [PATCH] qemu-kvm: build system Add link to qemu
On Sun, Apr 26, 2009 at 01:33:37PM +0300, Avi Kivity wrote: Jan Kiszka wrote: I'm getting closer to a working qemu-kvm, but there are still a few messy parts. The magic dance goes like this: Try a fresh fetch. ./configure && make ought to work. Works for me now.
[PATCH] virtio_blk: SG_IO passthru support
From: Hannes Reinecke h...@suse.de

Add support for SG_IO passthru to virtio_blk. We add the scsi command block after the normal outhdr, and the scsi inhdr with full status information as well as the sense buffer before the regular inhdr.

[hch: forward ported, added the VIRTIO_BLK_F_SCSI flags, some comments and tested the whole beast]

Signed-off-by: Hannes Reinecke h...@suse.de
Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2009-04-26 23:01:41.330960470 +0200
+++ linux-2.6/drivers/block/virtio_blk.c	2009-04-27 00:18:04.708076895 +0200
@@ -37,6 +37,7 @@ struct virtblk_req
 	struct list_head list;
 	struct request *req;
 	struct virtio_blk_outhdr out_hdr;
+	struct virtio_scsi_inhdr in_hdr;
 	u8 status;
 };
@@ -49,7 +50,9 @@ static void blk_done(struct virtqueue *v
 	spin_lock_irqsave(&vblk->lock, flags);
 	while ((vbr = vblk->vq->vq_ops->get_buf(vblk->vq, &len)) != NULL) {
+		unsigned int nr_bytes;
 		int error;
+
 		switch (vbr->status) {
 		case VIRTIO_BLK_S_OK:
 			error = 0;
@@ -62,7 +65,15 @@ static void blk_done(struct virtqueue *v
 			break;
 		}
-		__blk_end_request(vbr->req, error, blk_rq_bytes(vbr->req));
+		if (blk_pc_request(vbr->req)) {
+			vbr->req->data_len = vbr->in_hdr.residual;
+			nr_bytes = vbr->in_hdr.data_len;
+			vbr->req->sense_len = vbr->in_hdr.sense_len;
+			vbr->req->errors = vbr->in_hdr.errors;
+		} else
+			nr_bytes = blk_rq_bytes(vbr->req);
+
+		__blk_end_request(vbr->req, error, nr_bytes);
 		list_del(&vbr->list);
 		mempool_free(vbr, vblk->pool);
 	}
@@ -74,7 +85,7 @@ static void blk_done(struct virtqueue *v
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		   struct request *req)
 {
-	unsigned long num, out, in;
+	unsigned long num, out = 0, in = 0;
 	struct virtblk_req *vbr;
 	vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
@@ -99,18 +110,36 @@ static bool do_req(struct request_queue
 	if (blk_barrier_rq(vbr->req))
 		vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-	sg_set_buf(&vblk->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr));
-	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
-	sg_set_buf(&vblk->sg[num+1], &vbr->status, sizeof(vbr->status));
-
-	if (rq_data_dir(vbr->req) == WRITE) {
-		vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
-		out = 1 + num;
-		in = 1;
-	} else {
-		vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
-		out = 1;
-		in = 1 + num;
+	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+
+	/*
+	 * If this is a packet command we need a couple of additional headers.
+	 * Behind the normal outhdr we put a segment with the scsi command
+	 * block, and before the normal inhdr we put the sense data and the
+	 * inhdr with additional status information.
+	 */
+	if (blk_pc_request(vbr->req))
+		sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
+
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
+
+	if (blk_pc_request(vbr->req)) {
+		sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, 96);
+		sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
+			   sizeof(vbr->in_hdr));
+	}
+
+	sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
+		   sizeof(vbr->status));
+
+	if (num) {
+		if (rq_data_dir(vbr->req) == WRITE) {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+			out += num;
+		} else {
+			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+			in += num;
+		}
 	}
 	if (vblk->vq->vq_ops->add_buf(vblk->vq, vblk->sg, out, in, vbr)) {
@@ -149,8 +178,16 @@ static void do_virtblk_request(struct re
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 			 unsigned cmd, unsigned long data)
 {
-	return scsi_cmd_ioctl(bdev->bd_disk->queue,
-			      bdev->bd_disk, mode, cmd,
+	struct gendisk *disk = bdev->bd_disk;
+	struct virtio_blk *vblk = disk->private_data;
+
+	/*
+	 * Only allow the generic SCSI ioctls if the host can support it.
+	 */
+	if (!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_SCSI))
+		return -ENOIOCTLCMD;
+
+	return scsi_cmd_ioctl(disk->queue, disk, mode, cmd,
+			      (void __user *)data);
 }
@@ -356,6 +393,7 @@ static
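The descriptor layout that the patched do_req() hands to add_buf() can be summarized by a small counting rule. The sketch below is not the kernel code, just an illustration of how many out/in virtqueue segments each request type ends up with (the function name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch: out/in descriptor counts for a request with `num` data segments.
 * A packet (SG_IO) command adds the SCSI command block on the out side and
 * the sense buffer plus virtio_scsi_inhdr on the in side; the status byte
 * is always the last in segment.
 */
static void count_segments(bool is_pc, bool is_write, unsigned num,
                           unsigned *out, unsigned *in)
{
    *out = 1;             /* virtio_blk_outhdr */
    *in = 0;
    if (is_pc)
        *out += 1;        /* SCSI command block */
    if (is_pc)
        *in += 2;         /* sense buffer + virtio_scsi_inhdr */
    *in += 1;             /* status byte */
    if (is_write)
        *out += num;      /* data-out payload */
    else
        *in += num;       /* data-in payload */
}
```

For example, a regular read with four data segments yields out=1, in=5 (matching the old `out = 1; in = 1 + num` path), while a packet-command read with four segments yields out=2, in=7.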
[PATCH] virtio-blk: add SG_IO passthru support
Add support for SG_IO passthru (packet commands) to the virtio-blk backend. Conceptually based on an older patch from Hannes Reinecke but largely rewritten to match the code structure and layering in virtio-blk.

Note that currently we issue the host SG_IO synchronously. We could easily switch to async I/O, but that would require either bloating the VirtIOBlockReq by the size of struct sg_io_hdr or an additional memory allocation for each SG_IO request.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/virtio-blk.h
===================================================================
--- qemu.orig/hw/virtio-blk.h	2009-04-26 16:50:38.154074532 +0200
+++ qemu/hw/virtio-blk.h	2009-04-26 22:51:16.838076869 +0200
@@ -28,6 +28,9 @@
 #define VIRTIO_BLK_F_SIZE_MAX	1	/* Indicates maximum segment size */
 #define VIRTIO_BLK_F_SEG_MAX	2	/* Indicates maximum # of segments */
 #define VIRTIO_BLK_F_GEOMETRY	4	/* Indicates support of legacy geometry */
+#define VIRTIO_BLK_F_RO		5	/* Disk is read-only */
+#define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available */
+#define VIRTIO_BLK_F_SCSI	7	/* Supports scsi command passthru */
 struct virtio_blk_config
 {
@@ -70,6 +73,15 @@ struct virtio_blk_inhdr
     unsigned char status;
 };
+/* SCSI pass-through header */
+struct virtio_scsi_inhdr
+{
+    uint32_t errors;
+    uint32_t data_len;
+    uint32_t sense_len;
+    uint32_t residual;
+};
+
 void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs);
 #endif

Index: qemu/hw/virtio-blk.c
===================================================================
--- qemu.orig/hw/virtio-blk.c	2009-04-26 16:50:38.160074667 +0200
+++ qemu/hw/virtio-blk.c	2009-04-27 10:25:19.278074514 +0200
@@ -15,6 +15,9 @@
 #include "sysemu.h"
 #include "virtio-blk.h"
 #include "block_int.h"
+#ifdef __linux__
+# include <scsi/sg.h>
+#endif
 typedef struct VirtIOBlock
 {
@@ -35,6 +38,7 @@ typedef struct VirtIOBlockReq
     VirtQueueElement elem;
     struct virtio_blk_inhdr *in;
     struct virtio_blk_outhdr *out;
+    struct virtio_scsi_inhdr *scsi;
     QEMUIOVector qiov;
     struct VirtIOBlockReq *next;
 } VirtIOBlockReq;
@@ -103,6 +107,108 @@ static VirtIOBlockReq *virtio_blk_get_re
     return req;
 }
+#ifdef __linux__
+static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
+{
+    struct sg_io_hdr hdr;
+    int ret, size = 0;
+    int status;
+    int i;
+
+    /*
+     * We require at least one output segment each for the virtio_blk_outhdr
+     * and the SCSI command block.
+     *
+     * We also at least require the virtio_blk_inhdr, the virtio_scsi_inhdr
+     * and the sense buffer pointer in the input segments.
+     */
+    if (req->elem.out_num < 2 || req->elem.in_num < 3) {
+        virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+        return;
+    }
+
+    /*
+     * No support for bidirectional commands yet.
+     */
+    if (req->elem.out_num > 2 && req->elem.in_num > 3) {
+        virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
+        return;
+    }
+
+    /*
+     * The scsi inhdr is placed in the second-to-last input segment, just
+     * before the regular inhdr.
+     */
+    req->scsi = (void *)req->elem.in_sg[req->elem.in_num - 2].iov_base;
+    size = sizeof(*req->in) + sizeof(*req->scsi);
+
+    memset(&hdr, 0, sizeof(struct sg_io_hdr));
+    hdr.interface_id = 'S';
+    hdr.cmd_len = req->elem.out_sg[1].iov_len;
+    hdr.cmdp = req->elem.out_sg[1].iov_base;
+    hdr.dxfer_len = 0;
+
+    if (req->elem.out_num > 2) {
+        /*
+         * If there are more than the minimally required 2 output segments
+         * there is write payload starting from the third iovec.
+         */
+        hdr.dxfer_direction = SG_DXFER_TO_DEV;
+        hdr.iovec_count = req->elem.out_num - 2;
+
+        for (i = 0; i < hdr.iovec_count; i++)
+            hdr.dxfer_len += req->elem.out_sg[i + 2].iov_len;
+
+        hdr.dxferp = req->elem.out_sg + 2;
+
+    } else if (req->elem.in_num > 3) {
+        /*
+         * If we have more than 3 input segments the guest wants to actually
+         * read data.
+         */
+        hdr.dxfer_direction = SG_DXFER_FROM_DEV;
+        hdr.iovec_count = req->elem.in_num - 3;
+        for (i = 0; i < hdr.iovec_count; i++)
+            hdr.dxfer_len += req->elem.in_sg[i].iov_len;
+
+        hdr.dxferp = req->elem.in_sg;
+        size += hdr.dxfer_len;
+    } else {
+        /*
+         * Some SCSI commands don't actually transfer any data.
+         */
+        hdr.dxfer_direction = SG_DXFER_NONE;
+    }
+
+    hdr.sbp = req->elem.in_sg[req->elem.in_num - 3].iov_base;
+    hdr.mx_sb_len = req->elem.in_sg[req->elem.in_num - 3].iov_len;
+    size += hdr.mx_sb_len;
+
+    ret = bdrv_ioctl(req->dev->bs, SG_IO, &hdr);
+    if (ret) {
+        status = VIRTIO_BLK_S_UNSUPP;
+        hdr.status = ret;
+        hdr.resid = hdr.dxfer_len;
+    } else if (hdr.status) {
+        status = VIRTIO_BLK_S_IOERR;
+    } else {
+        status
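The handler's direction-and-length decision reduces to a pure function of the segment counts: with two mandatory out segments (outhdr + CDB) and three mandatory in segments (sense, scsi inhdr, status), any extra out segments are write payload and any extra in segments are read payload. A standalone sketch of that logic, with hypothetical names (`struct seg`, `sg_transfer_len`):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iovec stand-in; mirrors struct iovec's length field. */
struct seg { size_t iov_len; };

/*
 * Returns the total SG_IO transfer length; direction is reported via
 * *to_dev (1 = write to device, 0 = read from device, -1 = no data).
 */
static size_t sg_transfer_len(const struct seg *out_sg, unsigned out_num,
                              const struct seg *in_sg, unsigned in_num,
                              int *to_dev)
{
    size_t len = 0;
    unsigned i;

    if (out_num > 2) {
        /* Extra out segments: write payload starts at the third iovec. */
        *to_dev = 1;
        for (i = 2; i < out_num; i++)
            len += out_sg[i].iov_len;
    } else if (in_num > 3) {
        /* Extra in segments: read payload precedes the three trailers. */
        *to_dev = 0;
        for (i = 0; i < in_num - 3; i++)
            len += in_sg[i].iov_len;
    } else {
        /* Neither: a command that transfers no data. */
        *to_dev = -1;
    }
    return len;
}
```

For instance, an out chain of {outhdr, 6-byte CDB, 512-byte buffer} with only the three mandatory in segments classifies as a 512-byte write; the mirror case classifies as a read.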
Re: [ANNOUNCE] kvm-85 release
On Mon, Apr 27, 2009 at 04:45:23PM +0200, Alexander Graf wrote:
>> - Disabled pwritev() support until glibc stops writing random junk.
>>   See https://bugzilla.redhat.com/497429
>
> Wouldn't it be useful to have it disabled upstream then?

No glibc has been released with the broken one yet, I guess Mark must have been running some sort of snapshot.
Re: [Qemu-devel] [PATCH] virtio-blk: add SG_IO passthru support
On Mon, Apr 27, 2009 at 12:15:31PM +0300, Avi Kivity wrote:
> I think that's worthwhile. The extra bloat is trivial (especially as the
> number of inflight virtio requests is tightly bounded), and stalling the
> vcpu for requests is a pain.

Ok, new patch will follow ASAP.
[PATCH 2/2] virtio-blk: add SG_IO passthru support
Add support for SG_IO passthru (packet commands) to the virtio-blk backend. Conceptually based on an older patch from Hannes Reinecke but largely rewritten to match the code structure and layering in virtio-blk, as well as doing asynchronous I/O.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/virtio-blk.h
===================================================================
--- qemu.orig/hw/virtio-blk.h	2009-04-28 11:42:14.059074434 +0200
+++ qemu/hw/virtio-blk.h	2009-04-28 11:44:24.930074531 +0200
@@ -28,6 +28,9 @@
 #define VIRTIO_BLK_F_SIZE_MAX	1	/* Indicates maximum segment size */
 #define VIRTIO_BLK_F_SEG_MAX	2	/* Indicates maximum # of segments */
 #define VIRTIO_BLK_F_GEOMETRY	4	/* Indicates support of legacy geometry */
+#define VIRTIO_BLK_F_RO		5	/* Disk is read-only */
+#define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available */
+#define VIRTIO_BLK_F_SCSI	7	/* Supports scsi command passthru */
 struct virtio_blk_config
 {
@@ -70,6 +73,15 @@ struct virtio_blk_inhdr
     unsigned char status;
 };
+/* SCSI pass-through header */
+struct virtio_scsi_inhdr
+{
+    uint32_t errors;
+    uint32_t data_len;
+    uint32_t sense_len;
+    uint32_t residual;
+};
+
 void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs);
 #endif

Index: qemu/hw/virtio-blk.c
===================================================================
--- qemu.orig/hw/virtio-blk.c	2009-04-28 11:42:14.066074487 +0200
+++ qemu/hw/virtio-blk.c	2009-04-28 11:52:45.836079580 +0200
@@ -15,6 +15,9 @@
 #include "sysemu.h"
 #include "virtio-blk.h"
 #include "block_int.h"
+#ifdef __linux__
+# include <scsi/sg.h>
+#endif
 typedef struct VirtIOBlock
 {
@@ -35,6 +38,8 @@ typedef struct VirtIOBlockReq
     VirtQueueElement elem;
     struct virtio_blk_inhdr *in;
     struct virtio_blk_outhdr *out;
+    struct virtio_scsi_inhdr *scsi;
+    struct sg_io_hdr scsi_hdr;
     QEMUIOVector qiov;
     struct VirtIOBlockReq *next;
 } VirtIOBlockReq;
@@ -103,6 +108,108 @@ static VirtIOBlockReq *virtio_blk_get_re
     return req;
 }
+#ifdef __linux__
+static void virtio_blk_scsi_complete(void *opaque, int ret)
+{
+    VirtIOBlockReq *req = opaque;
+    int status;
+
+    if (ret) {
+        status = VIRTIO_BLK_S_UNSUPP;
+        req->scsi_hdr.status = -ret;
+        req->scsi_hdr.resid = req->scsi_hdr.dxfer_len;
+    } else if (req->scsi_hdr.status) {
+        status = VIRTIO_BLK_S_IOERR;
+    } else {
+        status = VIRTIO_BLK_S_OK;
+    }
+
+    req->scsi->errors = req->scsi_hdr.status;
+    req->scsi->residual = req->scsi_hdr.resid;
+    req->scsi->sense_len = req->scsi_hdr.sb_len_wr;
+    req->scsi->data_len = req->scsi_hdr.dxfer_len;
+
+    virtio_blk_req_complete(req, status);
+}
+
+static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
+{
+    int i;
+
+    /*
+     * We require at least one output segment each for the virtio_blk_outhdr
+     * and the SCSI command block.
+     *
+     * We also at least require the virtio_blk_inhdr, the virtio_scsi_inhdr
+     * and the sense buffer pointer in the input segments.
+     */
+    if (req->elem.out_num < 2 || req->elem.in_num < 3) {
+        virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+        return;
+    }
+
+    /*
+     * No support for bidirectional commands yet.
+     */
+    if (req->elem.out_num > 2 && req->elem.in_num > 3) {
+        virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
+        return;
+    }
+
+    /*
+     * The scsi inhdr is placed in the second-to-last input segment, just
+     * before the regular inhdr.
+     */
+    req->scsi = (void *)req->elem.in_sg[req->elem.in_num - 2].iov_base;
+
+    memset(&req->scsi_hdr, 0, sizeof(struct sg_io_hdr));
+    req->scsi_hdr.interface_id = 'S';
+    req->scsi_hdr.cmd_len = req->elem.out_sg[1].iov_len;
+    req->scsi_hdr.cmdp = req->elem.out_sg[1].iov_base;
+    req->scsi_hdr.dxfer_len = 0;
+
+    if (req->elem.out_num > 2) {
+        /*
+         * If there are more than the minimally required 2 output segments
+         * there is write payload starting from the third iovec.
+         */
+        req->scsi_hdr.dxfer_direction = SG_DXFER_TO_DEV;
+        req->scsi_hdr.iovec_count = req->elem.out_num - 2;
+
+        for (i = 0; i < req->scsi_hdr.iovec_count; i++)
+            req->scsi_hdr.dxfer_len += req->elem.out_sg[i + 2].iov_len;
+        req->scsi_hdr.dxferp = req->elem.out_sg + 2;
+    } else if (req->elem.in_num > 3) {
+        /*
+         * If we have more than 3 input segments the guest wants to actually
+         * read data.
+         */
+        req->scsi_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
+        req->scsi_hdr.iovec_count = req->elem.in_num - 3;
+
+        for (i = 0; i < req->scsi_hdr.iovec_count; i++)
+            req->scsi_hdr.dxfer_len += req->elem.in_sg[i].iov_len;
+        req->scsi_hdr.dxferp = req->elem.in_sg;
+    } else {
+        /*
+         * Some SCSI commands don't actually transfer any data.
+         */
+        req->scsi_hdr.dxfer_direction
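The completion path is a straightforward field copy from the host-side SG_IO result into the in-header the guest reads back. A self-contained sketch of that bookkeeping, using stand-in types (`struct sg_result` is hypothetical; the virtio_scsi_inhdr layout follows the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Field layout from the patch's virtio_scsi_inhdr. */
struct scsi_inhdr {
    uint32_t errors;
    uint32_t data_len;
    uint32_t sense_len;
    uint32_t residual;
};

/* Minimal stand-in for the sg_io_hdr fields the completion callback reads. */
struct sg_result {
    uint8_t  status;     /* SCSI status byte */
    uint32_t resid;      /* bytes not transferred */
    uint8_t  sb_len_wr;  /* sense bytes actually written by the device */
    uint32_t dxfer_len;  /* requested transfer length */
};

/* Copy host-side SG_IO results into the header returned to the guest. */
static void fill_inhdr(struct scsi_inhdr *in, const struct sg_result *r)
{
    in->errors = r->status;
    in->residual = r->resid;
    in->sense_len = r->sb_len_wr;
    in->data_len = r->dxfer_len;
}
```

This mirrors what virtio_blk_scsi_complete() does before calling virtio_blk_req_complete(), and on the guest side it is exactly the data blk_done() shovels back into the block-layer request (errors, residual, sense_len, data_len).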