Re: [RFC] vhost-blk implementation

2010-03-24 Thread Christoph Hellwig
 Inspired by vhost-net implementation, I did initial prototype 
 of vhost-blk to see if it provides any benefits over QEMU virtio-blk.
 I haven't handled all the error cases, fixed naming conventions etc.,
 but the implementation is stable to play with. I tried not to deviate
 from vhost-net implementation where possible.

Can you also send the qemu side of it?

 with vhost-blk:
 
 
 # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
 64+0 records in
 64+0 records out
 8388608 bytes (84 GB) copied, 126.135 seconds, 665 MB/s
 
 real    2m6.137s
 user    0m0.281s
 sys     0m14.725s
 
 without vhost-blk: (virtio)
 ---
 
 # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
 64+0 records in
 64+0 records out
 8388608 bytes (84 GB) copied, 275.466 seconds, 305 MB/s
 
 real    4m35.468s
 user    0m0.373s
 sys     0m48.074s

Which caching mode is this?  I assume data=writeback, because otherwise
you'd be doing synchronous I/O directly from the handler.

 +static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
 + struct iovec *iov, int in)
 +{
 + loff_t pos = sector << 8;
 + int ret = 0;
 +
 + if (type & VIRTIO_BLK_T_FLUSH)  {
 + ret = vfs_fsync(file, file->f_path.dentry, 1);
 + } else if (type & VIRTIO_BLK_T_OUT) {
 + ret = vfs_writev(file, iov, in, &pos);
 + } else {
 + ret = vfs_readv(file, iov, in, &pos);
 + }
 + return ret;

I have to admit I don't understand the vhost architecture at all, but
where do the actual data pointers used by the iovecs reside?
vfs_readv/writev expect both the iovec itself and the buffers
pointed to by it to reside in userspace, so just using kernel buffers
here will break badly on architectures with different user/kernel
mappings.  A lot of this is fixable using simple set_fs & co tricks,
but for direct I/O which uses get_user_pages even that will fail badly.
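
A minimal sketch of the set_fs & co trick mentioned above, assuming the
iovec and the buffers really are kernel pointers (the helper name is made
up for illustration); note it still doesn't help the O_DIRECT case, which
goes through get_user_pages():

static ssize_t kernel_readv(struct file *file, const struct iovec *iov,
                            unsigned long nr_segs, loff_t *pos)
{
        mm_segment_t old_fs = get_fs();
        ssize_t ret;

        set_fs(KERNEL_DS);      /* let vfs_readv accept kernel pointers */
        ret = vfs_readv(file, (const struct iovec __user *)iov, nr_segs, pos);
        set_fs(old_fs);
        return ret;
}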

Also it seems like you're doing all the I/O synchronously here?  For
data=writeback operations that could explain the read speedup
as you're avoiding context switches, but for actual write I/O
which has to get data to disk (either directly from vfs_writev or
later through vfs_fsync) this seems like a really bad idea stealing
a lot of guest time that should happen in the background.
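
(A hand-waving sketch of what "in the background" could look like, e.g.
using a workqueue; all names below are hypothetical and the set_fs and
completion issues discussed above still apply:)

struct vhost_blk_io {
        struct work_struct work;
        struct file *file;
        struct iovec *iov;
        int nvecs;
        loff_t pos;
};

static void vhost_blk_io_fn(struct work_struct *work)
{
        struct vhost_blk_io *io = container_of(work, struct vhost_blk_io, work);

        /* do the slow, potentially blocking I/O outside the vhost handler */
        vfs_writev(io->file, io->iov, io->nvecs, &io->pos);
        /* ... then complete the virtqueue descriptor and signal the guest ... */
}

/* the handler would just INIT_WORK(&io->work, vhost_blk_io_fn) and
 * queue_work() instead of calling vfs_writev() inline */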


Other than that the code seems quite nice and simple, but one huge
problem is that it'll only support raw images, and thus misses out
on all the nice image formats used in qemu deployments, especially
qcow2.  It's also missing the ioctl magic we're having in various
places, both for controlling host devices like cdroms and SG
passthrough.


Re: [RFC] vhost-blk implementation

2010-03-24 Thread Christoph Hellwig
On Tue, Mar 23, 2010 at 12:03:14PM +0200, Avi Kivity wrote:
 I also think it should be done at the bio layer.  File I/O is going to  
 be slower, if we do vhost-blk we should concentrate on maximum  
 performance.  The block layer also exposes more functionality we can use  
 (asynchronous barriers for example).

The block layer is more flexible, but that limits you to stacking only
directly on top of a block device, which is extremely inflexible.



Re: [RFC] vhost-blk implementation

2010-03-25 Thread Christoph Hellwig
On Thu, Mar 25, 2010 at 08:29:03AM +0200, Avi Kivity wrote:
 We still have a virtio implementation in userspace for file-based images.

 In any case, the file APIs are not asynchronous so we'll need a thread  
 pool.  That will probably minimize the difference in performance between  
 the userspace and kernel implementations.

The kernel has real aio when doing direct I/O, although currently we
can't easily use it from kernelspace.  There are a few people who've been
looking into making it usable for that, though.



Re: [RFC] vhost-blk implementation

2010-03-25 Thread Christoph Hellwig
On Wed, Mar 24, 2010 at 01:22:37PM -0700, Badari Pulavarty wrote:
 Yes. This is with default (writeback) cache model. As mentioned earlier,  
 readahead is helping here
 and in most cases, data would be ready in the pagecache.

Ok.  cache=writeback performance is something I haven't bothered looking
at at all.  For cache=none any streaming write or random workload with
large enough record sizes got basically the same performance as native
using kernel aio, and the same for writes but slightly degraded for reads
using the thread pool.  See my attached JLS presentation for some
numbers.

   
 iovecs and buffers are user-space pointers (from the host kernel point  
 of view). They are
 guest address. So, I don't need to do any set_fs tricks.

Right now they're not declared as such, so sparse would complain.

 Yes. QEMU virtio-blk is batching up all the writes and handing off the
 work to another
 thread.

If using the thread pool.  If using kernel aio it performs the io_submit
system call directly.

 What should I do here? I can create a bunch of kernel threads to do
 the IO for me.
 Or somehow fit in and reuse the AIO io_submit() mechanism. What's the best way
 here?
 I hate to duplicate all the code VFS is doing.

The only thing you can do currently is add a thread pool.  It might
be possible to use io_submit for the O_DIRECT case with some refactoring.
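
For reference, a rough userspace sketch of the io_submit() path that
qemu's aio=native uses (via libaio; the fd must be opened with O_DIRECT,
the buffer suitably aligned, and real code keeps the iocb around to match
the completion returned by io_getevents()):

#include <libaio.h>

static int submit_pread(io_context_t ctx, int fd, void *buf,
                        size_t len, off_t offset)
{
        struct iocb cb, *cbs[1] = { &cb };

        io_prep_pread(&cb, fd, buf, len, offset);  /* describe the read */
        return io_submit(ctx, 1, cbs);             /* queue it asynchronously */
}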



Re: [RFC] vhost-blk implementation

2010-04-05 Thread Christoph Hellwig
On Wed, Mar 24, 2010 at 01:22:37PM -0700, Badari Pulavarty wrote:
 iovecs and buffers are user-space pointers (from the host kernel point  
 of view). They are
 guest address. So, I don't need to do any set_fs tricks.

From verifying the code and using the sparse annotations it appears
that the actual buffers are userspace pointers, but the iovecs in
the virtqueue are kernel level pointers, so you would need some
annotations.

While we're at it here is a patch fixing the remaining sparse 
warnings in vhost-blk:


Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/vhost/blk.c
===
--- linux-2.6.orig/drivers/vhost/blk.c  2010-04-05 21:15:11.638004250 +0200
+++ linux-2.6/drivers/vhost/blk.c   2010-04-05 21:16:13.238004599 +0200
@@ -86,7 +86,7 @@ static void handle_blk(struct vhost_blk
nvecs++;
BUG_ON(vq->iov[nvecs].iov_len != 1);
 
-   if (copy_to_user(vq->iov[nvecs].iov_base, &status, sizeof status) < 0) {
+   if (copy_to_user(vq->iov[nvecs].iov_base, &status, sizeof status)) {
printk("copy to user failed\n");
vhost_discard_vq_desc(vq);
break;
@@ -199,7 +199,7 @@ static struct miscdevice vhost_blk_misc
&vhost_blk_fops,
 };
 
-int vhost_blk_init(void)
+static int vhost_blk_init(void)
 {
int r = vhost_init();
if (r)
@@ -216,7 +216,7 @@ err_init:
 }
 module_init(vhost_blk_init);
 
-void vhost_blk_exit(void)
+static void vhost_blk_exit(void)
 {
misc_deregister(&vhost_blk_misc);
vhost_cleanup();


[PATCH] vhost: fix sparse warnings

2010-04-05 Thread Christoph Hellwig

Index: linux-2.6/drivers/vhost/net.c
===
--- linux-2.6.orig/drivers/vhost/net.c  2010-04-05 21:13:24.196004388 +0200
+++ linux-2.6/drivers/vhost/net.c   2010-04-05 21:13:32.726004109 +0200
@@ -641,7 +641,7 @@ static struct miscdevice vhost_net_misc
&vhost_net_fops,
 };
 
-int vhost_net_init(void)
+static int vhost_net_init(void)
 {
int r = vhost_init();
if (r)
@@ -658,7 +658,7 @@ err_init:
 }
 module_init(vhost_net_init);
 
-void vhost_net_exit(void)
+static void vhost_net_exit(void)
 {
misc_deregister(&vhost_net_misc);
vhost_cleanup();
Index: linux-2.6/drivers/vhost/vhost.c
===
--- linux-2.6.orig/drivers/vhost/vhost.c    2010-04-05 21:12:58.806004249 +0200
+++ linux-2.6/drivers/vhost/vhost.c 2010-04-05 21:14:44.681010674 +0200
@@ -710,7 +710,7 @@ int vhost_log_write(struct vhost_virtque
return 0;
 }
 
-int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
+static int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
   struct iovec iov[], int iov_size)
 {
const struct vhost_memory_region *reg;
@@ -736,7 +736,7 @@ int translate_desc(struct vhost_dev *dev
_iov = iov + ret;
size = reg->memory_size - addr + reg->guest_phys_addr;
_iov->iov_len = min((u64)len, size);
-   _iov->iov_base = (void *)(unsigned long)
+   _iov->iov_base = (void __user *)(unsigned long)
(reg->userspace_addr + addr - reg->guest_phys_addr);
s += size;
addr += size;
@@ -990,7 +990,7 @@ void vhost_discard_vq_desc(struct vhost_
  * want to notify the guest, using eventfd. */
 int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 {
-   struct vring_used_elem *used;
+   struct vring_used_elem __user *used;
 
/* The virtqueue contains a ring of used buffers.  Get a pointer to the
 * next entry in that used ring. */
@@ -1014,7 +1014,8 @@ int vhost_add_used(struct vhost_virtqueu
smp_wmb();
/* Log used ring entry write. */
log_write(vq->log_base,
- vq->log_addr + ((void *)used - (void *)vq->used),
+ vq->log_addr +
+  ((void __user *)used - (void __user *)vq->used),
  sizeof *used);
/* Log used index update. */
log_write(vq->log_base,


Re: [RFC] vhost-blk implementation

2010-04-05 Thread Christoph Hellwig
On Thu, Mar 25, 2010 at 04:00:56PM +0100, Asdo wrote:
 Would the loop device provide the features of a block device? I recall  
 barrier support at least has been added recently.

It does, but not in a very efficient way.

 Is it recommended to run kvm on a loopback mounted file compared to on a  
 raw file?

No.



Re: Shouldn't cache=none be the default for drives?

2010-04-08 Thread Christoph Hellwig
On Thu, Apr 08, 2010 at 10:05:09AM +0400, Michael Tokarev wrote:
 LVM volumes.  This is because with cache=none, the virtual disk
 image is opened with O_DIRECT flag, which means all I/O bypasses
 host scheduler and buffer cache.

O_DIRECT does not bypass the I/O scheduler, only the page cache.



Re: [PATCH] vhost: fix sparse warnings

2010-04-13 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig h...@lst.de


Index: linux-2.6/drivers/vhost/net.c
===
--- linux-2.6.orig/drivers/vhost/net.c  2010-04-05 21:13:24.196004388 +0200
+++ linux-2.6/drivers/vhost/net.c   2010-04-05 21:13:32.726004109 +0200
@@ -641,7 +641,7 @@ static struct miscdevice vhost_net_misc
&vhost_net_fops,
 };
 
-int vhost_net_init(void)
+static int vhost_net_init(void)
 {
int r = vhost_init();
if (r)
@@ -658,7 +658,7 @@ err_init:
 }
 module_init(vhost_net_init);
 
-void vhost_net_exit(void)
+static void vhost_net_exit(void)
 {
misc_deregister(&vhost_net_misc);
vhost_cleanup();
Index: linux-2.6/drivers/vhost/vhost.c
===
--- linux-2.6.orig/drivers/vhost/vhost.c    2010-04-05 21:12:58.806004249 +0200
+++ linux-2.6/drivers/vhost/vhost.c 2010-04-05 21:14:44.681010674 +0200
@@ -710,7 +710,7 @@ int vhost_log_write(struct vhost_virtque
return 0;
 }
 
-int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
+static int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
   struct iovec iov[], int iov_size)
 {
const struct vhost_memory_region *reg;
@@ -736,7 +736,7 @@ int translate_desc(struct vhost_dev *dev
_iov = iov + ret;
size = reg->memory_size - addr + reg->guest_phys_addr;
_iov->iov_len = min((u64)len, size);
-   _iov->iov_base = (void *)(unsigned long)
+   _iov->iov_base = (void __user *)(unsigned long)
(reg->userspace_addr + addr - reg->guest_phys_addr);
s += size;
addr += size;
@@ -990,7 +990,7 @@ void vhost_discard_vq_desc(struct vhost_
  * want to notify the guest, using eventfd. */
 int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 {
-   struct vring_used_elem *used;
+   struct vring_used_elem __user *used;
 
/* The virtqueue contains a ring of used buffers.  Get a pointer to the
 * next entry in that used ring. */
@@ -1014,7 +1014,8 @@ int vhost_add_used(struct vhost_virtqueu
smp_wmb();
/* Log used ring entry write. */
log_write(vq->log_base,
- vq->log_addr + ((void *)used - (void *)vq->used),
+ vq->log_addr +
+  ((void __user *)used - (void __user *)vq->used),
  sizeof *used);
/* Log used index update. */
log_write(vq->log_base,


Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-04 Thread Christoph Hellwig
On Tue, May 04, 2010 at 02:08:24PM +0930, Rusty Russell wrote:
 On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
  I took a stab at documenting CMD and FLUSH request types in virtio
  block.  Christoph, could you look over this please?
  
  I note that the interface seems full of warts to me,
  this might be a first step to cleaning them.
 
 ISTR Christoph had withdrawn some patches in this area, and was waiting
 for him to resubmit?

Any patches I've withdrawn in this area are withdrawn for good.  But
what I really need to do is to review Michaels spec updates, sorry.

I'll get back to it today.



Re: [Qemu-devel] Qemu-KVM 0.12.3 and Multipath - Assertion

2010-05-04 Thread Christoph Hellwig
On Tue, May 04, 2010 at 04:01:35PM +0200, Kevin Wolf wrote:
 Great, I'm going to submit it as a proper patch then.
 
 Christoph, by now I'm pretty sure it's right, but can you have another
 look if this is correct, anyway?

It looks correct to me - we really shouldn't update the fields
until bdrv_aio_cancel has returned.  In fact, more often than not we
can't actually cancel a request, so there's a fairly high chance it will
complete.


Reviewed-by: Christoph Hellwig h...@lst.de


Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-04 Thread Christoph Hellwig
On Fri, Feb 19, 2010 at 12:22:20AM +0200, Michael S. Tsirkin wrote:
 I took a stab at documenting CMD and FLUSH request types in virtio
 block.  Christoph, could you look over this please?
 
 I note that the interface seems full of warts to me,
 this might be a first step to cleaning them.

The whole virtio-blk interface is full of warts.  It has been
extended rather ad-hoc, so that is rather expected.

 One issue I struggled with especially is how type
 field mixes bits and non-bit values. I ended up
 simply defining all legal values, so that we have
 CMD = 2, CMD_OUT = 3 and so on.

It's basically a complete mess without much logic behind it.

 +\change_unchanged
 +the high bit
 +\change_inserted 0 1266497301
 + (VIRTIO_BLK_T_BARRIER)
 +\change_unchanged
 + indicates that this request acts as a barrier and that all preceeding 
 requests
 + must be complete before this one, and all following requests must not be
 + started until this is complete.
 +
 +\change_inserted 0 1266504385
 + Note that a barrier does not flush caches in the underlying backend device
 + in host, and thus does not serve as data consistency guarantee.
 + Driver must use FLUSH request to flush the host cache.
 +\change_unchanged

I'm not sure it's even worth documenting it.  I can't see any way to
actually implement safe behaviour with the VIRTIO_BLK_T_BARRIER-style
barriers.


Btw, did I mention that .lyx is a really horrible format to review
diffs for?  Plain latex would be a lot better.


Re: [Qemu-devel] [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-04 Thread Christoph Hellwig
On Tue, Apr 20, 2010 at 02:46:35AM +0100, Jamie Lokier wrote:
 Does this mean that virtio-blk supports all three combinations?
 
1. FLUSH that isn't a barrier
2. FLUSH that is also a barrier
3. Barrier that is not a flush
 
 1 is good for fsync-like operations;
 2 is good for journalling-like ordered operations.
 3 sounds like it doesn't mean a lot as the host cache provides no
 guarantees and has no ordering facility that can be used.

No.  The Linux virtio_blk guest driver either supports data integrity
by using FLUSH or can send down BARRIER requests which aren't much
help at all.  Qemu only implements FLUSH anyway.



Re: [Qemu-devel] [RFC PATCH 1/2] close all the block drivers before the qemu process exits

2010-05-12 Thread Christoph Hellwig
On Wed, May 12, 2010 at 07:46:52PM +0900, MORITA Kazutaka wrote:
 This patch calls the close handler of the block driver before the qemu
 process exits.
 
 This is necessary because the sheepdog block driver releases the lock
 of VM images in the close handler.
 
 Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp

Looks good in principle, except that bdrv_first is gone and has been
replaced with a real list in the meantime, so this won't even apply.



Re: [PATCH 1/2] [block]: Fix scsi-generic breakage in find_image_format()

2010-05-16 Thread Christoph Hellwig
On Sat, May 15, 2010 at 06:30:52AM -0700, Nicholas A. Bellinger wrote:
 From: Nicholas Bellinger n...@linux-iscsi.org
 
 This patch adds a special BlockDriverState->sg check in
 block.c:find_image_format()
 after bdrv_file_open() -> block/raw-posix.c:hdev_open() has been called to
 determine
 if we are dealing with a Linux host scsi-generic device or not.
 
 The patch then returns the BlockDriver * from find_protocol(), skipping the 
 subsequent
 bdrv_read() and rest of find_image_format().

That's not quite correct as we don't want to expose formats directly.
Returning

bdrv_find_format("raw");

should fix it for now, although I really hate having these special
cases in block.c.



Re: [PATCH 2/2] [block]: Skip refresh_total_sectors() for scsi-generic devices

2010-05-16 Thread Christoph Hellwig
On Sat, May 15, 2010 at 06:30:59AM -0700, Nicholas A. Bellinger wrote:
 From: Nicholas Bellinger n...@linux-iscsi.org
 
 This patch adds a BlockDriverState->sg check in block.c:bdrv_common_open()
 to skip the new refresh_total_sectors() call once we know we are working with
 a scsi-generic device.
 
 We go ahead and skip this call for scsi-generic devices because
 block/raw-posix.c:raw_getlength() -> lseek() will return -ESPIPE.

How about moving that check into refresh_total_sectors?


Re: qemu-kvm hangs if multipath device is queing

2010-05-19 Thread Christoph Hellwig
On Tue, May 18, 2010 at 03:22:36PM +0200, Kevin Wolf wrote:
 I think it's stuck here in an endless loop:
 
  while (laiocb->ret == -EINPROGRESS)
  qemu_laio_completion_cb(laiocb->ctx);
 
 Can you verify this by single-stepping one or two loop iterations? ret
 and errno after the read call could be interesting, too.

Maybe the compiler is just too smart.  Without some form of barrier
it could just optimize the loop away as laiocb->ret couldn't change
in a normal single-threaded environment.
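
One hedged way to rule that out (just an illustration, not necessarily the
fix that went in) is to force a re-load of the field on every iteration:

    while (*(volatile ssize_t *)&laiocb->ret == -EINPROGRESS)
        qemu_laio_completion_cb(laiocb->ctx);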



Re: KVM call agenda for May 18

2010-05-19 Thread Christoph Hellwig
On Tue, May 18, 2010 at 08:52:36AM -0500, Anthony Liguori wrote:
 This should be filed in launchpad as a qemu bug and it should be tested  
 against the latest git.  This bug sounds like we're using an int to  
 represent sector offset somewhere but there's not enough info in the bug  
 report to figure out for sure.  I just audited the virtio-blk -> raw ->
 aio=threads path and I don't see an obvious place that we're getting it  
 wrong.

FYI: I'm going to ignore everything that's in launchpad - even more than
in the stupid SF bugtracker.  While the SF one is almost unusable,
launchpad is entirely unusable.  If you don't have an account with the
evil spaceman empire you can't even check the email addresses of the
reporters, so any communication with them is entirely impossible.

It's time we get a proper bugzilla.qemu.org for both qemu and qemu-kvm
that can be used sanely.  If you ask nicely you might even get a virtual
instance of bugzilla.kernel.org which works quite nicely.



Re: the 1Tb block issue

2010-05-19 Thread Christoph Hellwig
On Tue, May 18, 2010 at 08:38:22PM +0300, Avi Kivity wrote:
 Yes.  Why would Linux post overlapping requests?  Makes no sense.

 There may be a guest bug in here too.  Christoph?

Overlapping writes are entirely fine from the guest POV, although they
should be rather unusual.  We can update a page and send it out again
when it gets redirtied while still out on the wire.



Re: [PATCH +stable] block: don't attempt to merge overlapping requests

2010-05-19 Thread Christoph Hellwig
On Wed, May 19, 2010 at 10:23:44AM +0100, Stefan Hajnoczi wrote:
 On Wed, May 19, 2010 at 10:06 AM, Avi Kivity a...@redhat.com wrote:
  In the cache=writeback case the virtio-blk guest driver does:
 
  blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, ...)
 
 
   I don't follow.  What's the implication?
 
 I was wondering whether the queue is incorrectly set to a mode where
 overlapping write requests aren't being ordered.  Anyway, Christoph
 says overlapping write requests can be issued so my theory is broken.

They can happen, but won't during usual pagecache based writeback.  So
this should not happen for the pagecache based mke2fs workload.  It
could happen for a workload with mkfs that uses O_DIRECT.  And we
need to handle it either way.

And btw, our barrier handling for devices using multiwrite (aka virtio)
is utterly busted right now, patch will follow ASAP.


Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Christoph Hellwig
On Fri, May 21, 2010 at 09:49:56PM +0100, Stefan Hajnoczi wrote:
 http://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps
 
 Requires kernel support - not sure if enough of utrace is in mainline
 for this to work out-of-the-box across distros.

Nothing of utrace is in mainline, nevermind the whole systemtap code
which is intentionally kept out of the kernel tree.  Using this means
that for every probe in userspace code you need to keep the configured
source tree of the currently running kernel around, which is completely
unusable for typical developer setups.



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Christoph Hellwig
On Tue, May 25, 2010 at 02:25:53PM +0300, Avi Kivity wrote:
 Currently if someone wants to add a new block format, they have to  
 upstream it and wait for a new qemu to be released.  With a plugin API,  
 they can add a new block format to an existing, supported qemu.

So?  Unless we want a stable driver ABI - which I fundamentally oppose, as
it would make block driver development hell - they'd have to wait for
a new release of the block layer.  It's really just going to be a lot
of pain for no major gain.  qemu releases are frequent enough, and if
users care enough they can also easily patch qemu.



Re: raw disks no longer work in latest kvm (kvm-88 was fine)

2010-05-29 Thread Christoph Hellwig
On Sat, May 29, 2010 at 04:42:59PM +0700, Antoine Martin wrote:
 Can someone explain the aio options?
 All I can find is this:
 # qemu-system-x86_64 -h | grep -i aio
[,addr=A][,id=name][,aio=threads|native]
 I assume it means the aio=threads emulates the kernel's aio with
 separate threads? And is therefore likely to be slower, right?
 Is there a reason why aio=native is not the default? Shouldn't
 aio=threads be the fallback?

The kernel AIO support is unfortunately not a very generic API.
It only supports O_DIRECT I/O (cache=none for qemu), and if used on
a filesystems it might still block if we need to perform block
allocations.  We could probably make it the default for block devices,
but I'm not a big fan of these kind of conditional defaults.
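
For example (device path hypothetical), kernel AIO has to be asked for
explicitly and only really makes sense together with cache=none:

    qemu-system-x86_64 \
        -drive file=/dev/vg0/guest,if=virtio,cache=none,aio=native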



Re: raw disks no longer work in latest kvm (kvm-88 was fine)

2010-05-29 Thread Christoph Hellwig
On Sat, May 29, 2010 at 10:55:18AM +0100, Stefan Hajnoczi wrote:
 I would expect that aio=native is faster but benchmarks show that this
 isn't true for all workloads.

In what benchmark do you see worse results for aio=native compared to
aio=threads?



Re: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default

2009-09-17 Thread Christoph Hellwig
Err, I'll take this one back for now pending some more discussion.
What we need more urgently is the writeback cache flag, which is now
implemented in qemu, patch following ASAP.



Re: [PATCH 05/24] compatfd is included before, and it is compiled unconditionally

2009-09-22 Thread Christoph Hellwig
Btw, what's the state of getting compatfd upstream?  It's a pretty
annoying difference between qemu upstream and qemu-kvm.



Re: [PATCH 05/24] compatfd is included before, and it is compiled unconditionally

2009-09-22 Thread Christoph Hellwig
On Tue, Sep 22, 2009 at 03:25:13PM +0200, Juan Quintela wrote:
 Christoph Hellwig h...@infradead.org wrote:
  Btw, what's the state of getting compatfd upstream?  It's a pretty
  annoying difference between qemu upstream and qemu-kvm.
 
 I haven't tried.  I can try to send a patch.  Do you have any use case
 that will help the cause?

Well, the eventfd compat is used in the thread pool AIO code.  I don't
know what difference it makes, but I really hate this code beeing
different in both trees.  I want to see compatfd used either in both or
none.



Re: [PATCH 1/1] qemu-kvm: virtio-net: Re-instate GSO code removed upstream

2009-09-30 Thread Christoph Hellwig
I might sound like a broken record, but why isn't the full GSO support
for virtio-net upstream in qemu?



Re: kvm tuning guide

2009-09-30 Thread Christoph Hellwig
On Wed, Sep 30, 2009 at 08:20:35AM +0200, Avi Kivity wrote:
 On 09/30/2009 07:09 AM, Nikola Ciprich wrote:
 The default, IDE, is highly supported by guests but may be slow, especially 
 with disk arrays. If your guest supports it, use the virtio interface:
 Avi,
 what is the status of data integrity issues Chris Hellwig summarized some 
 time ago?


 I don't know.  Christoph?

On the qemu side everything is in git HEAD now, but I'm not sure about
the qemu-0.11 release as I haven't really followed it.

For the guest kernel the virtio cache flush support is now in mainline
(past-2.6.31).  For the host kernel side about 2/3 of the fixes are now
in mainline (past-2.6.31) with the others hopefully getting in this
merge window.


 Is it safe to recommend virtio to newbies already?

 I think so.

I wouldn't.  At least not for people caring about their data.  It will
take a while to promote the guest side fixes to all the interesting
guests.  IDE has the major advantage that cache flush support has been
around in the guest driver for a long time so we only need to fix the
host side which is a lot easier.



Re: [PATCH 05/24] compatfd is included before, and it is compiled unconditionally

2009-10-01 Thread Christoph Hellwig
On Thu, Oct 01, 2009 at 01:58:10PM +0200, Juan Quintela wrote:
 Discussed with Anthony about it.  signalfd is complicated for qemu
 upstream (too difficult to use properly), and eventfd ...  The current
 eventfd emulation is worse than the pipe code that it substitutes.
 
 His suggestion here was to create a new abstraction with an API like:
 
 push_notify()
 
 pop_notify()
 
 and then you can implement it with eventfd() pipes/whatever.
 
 What was missing for you of compatfd: qemu_eventfd/qemu_signalfd?
 Do a push_notify()/pop_notify() work for you?

I don't desperately want to use it myself anyway.  I just want to get
rid of the highly annoying spurious differences in the AIO code due
to use of compatfd.  I would be perfectly fine with just killing this
use of eventfd in qemu-kvm.


Re: sync guest calls made async on host - SQLite performance

2009-10-13 Thread Christoph Hellwig
On Sun, Oct 11, 2009 at 11:16:42AM +0200, Avi Kivity wrote:
if scsi is used, you incur the cost of virtualization,
if virtio is used, your guests fsyncs incur less cost.
 
 So back to the question to the kvm team.  It appears that with the 
 stock KVM setup customers who need higher data integrity (through 
 fsync) should steer away from virtio for the moment.
 
 Is that assessment correct?
 
 
 Christoph, wasn't there a bug where the guest didn't wait for requests 
 in response to a barrier request?

Can't remember anything like that.  The bug was the complete lack of
cache flush infrastructure for virtio, and the lack of advertising a
volatile write cache on ide.



Re: sync guest calls made async on host - SQLite performance

2009-10-14 Thread Christoph Hellwig
On Wed, Oct 14, 2009 at 08:03:41PM +0900, Avi Kivity wrote:
 Can't remember anything like that.  The bug was the complete lack of
 cache flush infrastructure for virtio, and the lack of advertising a
 volatile write cache on ide.

 
 By complete flush infrastructure, you mean host-side and guest-side 
 support for a new barrier command, yes?

The cache flush command, not barrier command.  The new virtio code
implements barrier the same way we do for IDE and SCSI - all barrier
semantics are implemented by generic code in the block layer by draining
the queues, the only thing we send over the wire are cache flush
commands in strategic places.

 But can't this be also implemented using QUEUE_ORDERED_DRAIN, and on the 
 host side disabling the backing device write cache?  I'm talking about 
 cache=none, primarily.

Yes, it could.  But as I found out in a long discussion with Stephen
it's not actually necessary.  All filesystems do the right thing for
a device not claiming to support barriers if it doesn't include write
caches, that is implement ordering internally.  So there is no urge to
set QUEUE_ORDERED_DRAIN for the case without write cache.



Re: [PATCH] virtio-blk: fallback to draining the queue if barrier ops are not supported

2009-10-14 Thread Christoph Hellwig
On Wed, Oct 14, 2009 at 07:38:45PM +0400, Michael Tokarev wrote:
 Avi Kivity wrote:
 Early implementations of virtio devices did not support barrier operations,
 but did commit the data to disk.  In such cases, drain the queue to emulate
 barrier operations.
 
 Are there any implementation currently that actually supports barriers?
 As far as I remember there's no way to invoke barriers from a user-space
 application on linux, and this is how kvm/qemu is running on this OS.

Ignore all the barrier talk.  The way Linux uses the various storage
transports, the primitives are queue draining (done entirely in the guest
block layer) and cache flushes.  Fdatasync is exactly the same primitive
as a WIN_FLUSH_CACHE in ATA or SYNCHRONIZE CACHE in SCSI, modulo the lack
of ranges in fdatasync - but that is just a performance optimization and
not actually used by Linux guests for now.



Re: sync guest calls made async on host - SQLite performance

2009-10-14 Thread Christoph Hellwig
On Thu, Oct 15, 2009 at 01:56:40AM +0900, Avi Kivity wrote:
 Does virtio say it has a write cache or not (and how does one say it?)?

Historically it didn't and the only safe way to use virtio was in
cache=writethrough mode.  Since qemu git as of 4th September and Linux
2.6.32-rc there is a virtio-blk feature to communicate the existence
of a volatile write cache, and the support for a cache flush command.
With the combination of these two data=writeback and data=none modes
are safe for the first time.



Re: sync guest calls made async on host - SQLite performance

2009-10-15 Thread Christoph Hellwig
On Wed, Oct 14, 2009 at 05:54:23PM -0500, Anthony Liguori wrote:
 Historically it didn't and the only safe way to use virtio was in
 cache=writethrough mode.
 
 Which should be the default on Ubuntu's kvm that this report is 
 concerned with so I'm a bit confused.

So can we please get the detailed setup where this happens, that is:

filesystem used in the guest
any volume manager / software raid used in the guest
kernel version in the guest
image format used
qemu command line including caching mode, using ide/scsi/virtio, etc
qemu/kvm version
filesystem used in the host
any volume manager / software raid used in the host
kernel version in the host


 Avi's patch is a performance optimization, not a correctness issue?

It could actually minimally degrade performance.  For the existing
filesystems as upper layer it does not improve correctness either.



Re: sync guest calls made async on host - SQLite performance

2009-10-15 Thread Christoph Hellwig
On Thu, Oct 15, 2009 at 02:17:02PM +0200, Christoph Hellwig wrote:
 On Wed, Oct 14, 2009 at 05:54:23PM -0500, Anthony Liguori wrote:
  Historically it didn't and the only safe way to use virtio was in
  cache=writethrough mode.
  
  Which should be the default on Ubuntu's kvm that this report is 
  concerned with so I'm a bit confused.
 
 So can we please get the detailed setup where this happens, that is:
 
 filesystem used in the guest
 any volume manager / software raid used in the guest
 kernel version in the guest
 image format used
 qemu command line including caching mode, using ide/scsi/virtio, etc
 qemu/kvm version
 filesystem used in the host
 any volume manager / software raid used in the host
 kernel version in the host

And very important the mount options (/proc/self/mounts) of both host
and guest.



Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation

2009-10-28 Thread Christoph Hellwig
On Wed, Oct 28, 2009 at 09:11:29AM +0100, Hannes Reinecke wrote:
 The problem is I don't have any documentation for the LSI parallel
 SCSI controller. So I don't know if and in what shape I/O is passed
 down, nor anything else. And as the SCSI disk emulation is really
 tied into the LSI parallel SCSI controller, any change in the former
 is likely to break the latter.

scsi-disk is not tied into the lsi/symbios driver.  It's also used
by the esp driver and usb-msd.  But it's a complete mess, and all my
attempts to untangle it have so far failed with my brain hurting for
a long time after.



Re: [Qemu-devel] [PATCH 3/4] scsi-disk: Factor out SCSI command emulation

2009-10-28 Thread Christoph Hellwig
On Tue, Oct 27, 2009 at 04:28:59PM +0100, Hannes Reinecke wrote:
 
 Other drivers might want to use SCSI command emulation without
 going through the SCSI disk abstraction, as this imposes too
 many limits on the emulation.

Might be a good idea to move something like this first into the series
and share the CDB decoder between scsi and ide (atapi) as a start.  A
little bit of refactoring of the CDB decoder, e.g. into one function
per opcode (-family) won't hurt either.



Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation

2009-10-28 Thread Christoph Hellwig
On Wed, Oct 28, 2009 at 08:25:22PM +0100, Hannes Reinecke wrote:
 I don't think we really need two modes.
 My preferred interface here is to pass down scatter-gather lists down
 with every xfer; this way it'll be the responsibility of the driver to
 create the lists in the first place. If it has hardware scatter-gather
 support it can just pass them down, if not it can as easily create a
 scatter-gather list with just one element as a bounce buffer.

Yes.  If this really causes performance problems for esp we can add
bounce buffering in that driver later.

 So something like
 - Get next request
 - Attach iovec/bounce buffer
 - handle request (command/write/read)
 - complete request (callback)

Btw, from some previous attempts to sort out this code, here are some
things that I think would be beneficial:

 - try to create generic scsi device/request structures that the hba
   driver can access, and which contain additional private data for
   scsi-disk/generic.  Information in the generic one would include
   the information about the data transfer, the tag and the command
   block.
 - try to get rid of the tag indexing stuff by just using a pointer to
   the generic scsi request in the hba drivers.  That should get rid
   of a lot of useless list searching.
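
A hypothetical sketch of such a generic request structure (the field names
are invented for illustration, not qemu's actual API at this point):

typedef struct SCSIRequest {
    SCSIDevice *dev;        /* device the request was issued to */
    uint32_t tag;           /* kept here so HBAs can drop their tag lookup tables */
    uint8_t cmd[16];        /* the command block */
    struct iovec *iov;      /* data transfer description */
    int iovcnt;
    size_t xfer_len;
    void *hba_private;      /* per-HBA driver state, opaque to scsi-disk/generic */
} SCSIRequest;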



Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation

2009-10-29 Thread Christoph Hellwig
On Thu, Oct 29, 2009 at 01:57:40PM +0100, Gerd Hoffmann wrote:
 Trying to go forward in review+bisect friendly baby steps.  Here is what 
 I have now:
 
 http://repo.or.cz/w/qemu/kraxel.git?a=shortlog;h=refs/heads/scsi.v1
 
 It is far from being completed, will continue tomorrow.  Should give an
 idea of the direction I'm heading to though.  Comments welcome.

Nice.

I had patches for the header renames, too - the current naming is
really awkward.  Would be great if Anthony could take those ASAP.

Btw, I also split out a cdrom.h from the new scsi.h, not sure if it's
worth it for the two prototypes.

Just wondering, shouldn't scsi-bus.c just become scsi.c now that it's
growing various non-bus glue bits?


Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation

2009-10-29 Thread Christoph Hellwig
On Thu, Oct 29, 2009 at 10:14:19AM -0500, Anthony Liguori wrote:
 Which patches are those?

http://repo.or.cz/w/qemu/kraxel.git?a=commitdiff;h=1ee5ee08e4427c3db7e1322d30cc0e58e5ca48b9

and

http://repo.or.cz/w/qemu/kraxel.git?a=commitdiff;h=a6e6178185786c582141f993272e00521d3f125a



Re: [patch 4/4] KVM-trace port to tracepoints

2008-07-23 Thread Christoph Hellwig
On Wed, Jul 23, 2008 at 01:08:41PM +0300, Avi Kivity wrote:
 trace_mark() is used to implement kvmtrace, which is propagated to userspace.
 So while trace_mark() itself is not a userspace interface, one of its  
 users is.

 It's an unstable interface.  But so is dmesg; that's the nature of tracing.

Trace_mark is as stable as any other kernel interface, and
the data you pass through it is as stable as you want it to be.  In most
cases like kvmtrace or my spu scheduler tracing code the trace data
is directly forwarded through a userspace interface, and that is as
stable as any freeform interface, e.g. like printk mentioned above.



Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool

2008-12-11 Thread Christoph Hellwig
On Thu, Dec 11, 2008 at 05:49:47PM +0100, Andrea Arcangeli wrote:
 On Thu, Dec 11, 2008 at 05:11:08PM +0100, Gerd Hoffmann wrote:
  Yes.  But kernel aio requires O_DIRECT, so aio users are affected
  nevertheless.
 
 Are you sure? It surely wasn't the case...

Mainline kernel aio only implements O_DIRECT.  Some RHEL version had
support for buffered kernel AIO.



Re: KVM call minutes for Feb 1

2011-02-01 Thread Christoph Hellwig
On Tue, Feb 01, 2011 at 05:36:13PM +0100, Jan Kiszka wrote:
 kvm_cpu_exec/kvm_run, and start wondering What needs to be done to
 upstream so that qemu-kvm could use that implementation?. If they
 differ, the reasons need to be understood and patched away, either by
 fixing/enhancing upstream or simplifying qemu-kvm. Once the upstream
 changes are merged back, a qemu-kvm patch is posted to switch to that
 version.

while I'm not an expert in that area I really like your approach.  I'd
really prefer to let you finish up all the major work that way before
starting massive revamping like the glib main loop.  Resolving the
qemu/qemu-kvm schism surely is more important for the overall project
than rewriting existing functionality to look a little nicer.



Re: [PATCH] kvm tools: Use mmap for working with disk image V2

2011-04-11 Thread Christoph Hellwig
How do you plan to handle I/O errors or ENOSPC conditions?  Note that
shared writeable mappings are by far the feature in the VM/FS code
that is most error prone, including the impossibility of doing sensible
error handling.

The version that accidentally used MAP_PRIVATE actually makes a lot of
sense for an equivalent of qemu's snapshot mode where the image is
readonly and changes are kept private as long as the amount of modified
blocks is small enough to not kill the host VM, but using shared
writeable mappings just seems dangerous.
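
A minimal sketch of that snapshot-like mode (paths and size handling are
hypothetical): with MAP_PRIVATE all guest writes stay in anonymous host
memory and never reach the image file.

    int fd = open("/images/guest.raw", O_RDONLY);
    void *disk = mmap(NULL, image_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, 0);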



Re: [Qemu-devel] [qemu-iotests][PATCH] Update rbd support

2011-04-12 Thread Christoph Hellwig
 @@ -43,6 +43,10 @@ _supported_fmt raw
  _supported_proto generic
  _supported_os Linux
  
 +# rbd images are not growable
 +if [ $IMGPROTO = rbd ]; then
 +_notrun "image protocol $IMGPROTO does not support growable images"
 +fi

I suspect we only support the weird writing past size for the
file protocol, so we should only run the test for it.

Or does sheepdog do anything special about it?



Re: [PATCH] kvm tool: add QCOW verions 1 read/write support

2011-04-14 Thread Christoph Hellwig
On Wed, Apr 13, 2011 at 08:01:58PM +0100, Prasad Joshi wrote:
 The patch only implements the basic read write support for QCOW version 1
 images. Many of the QCOW features are not implemented, for example

What's the point?  Qcow1 has been deprecated for a long time.



Re: [PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO

2010-06-14 Thread Christoph Hellwig
On Mon, Jun 14, 2010 at 02:44:31AM -0700, Nicholas A. Bellinger wrote:
 From: Nicholas Bellinger n...@linux-iscsi.org
 
 This patch adds posix-aio-compat.c:paio_submit_len(), which is identical
 to paio_submit() except that it expects nb_len instead of nb_sectors (* 512),
 so that it can be used by BSG AIO for write()/read() of struct sg_io_v4.

Just remove the nb_sectors argument, we already get the length passed
in the iovec.



[PATCH] virtio_blk: support barriers without FLUSH feature

2010-06-15 Thread Christoph Hellwig
If we want to support barriers with the cache=writethrough mode in qemu
we need to tell the block layer that we only need queue drains to
implement a barrier.  Follow the model set by SCSI and IDE and assume
that there is no volatile write cache if the host doesn't advertise it.
While this might imply working barriers on old qemu versions or other
hypervisors that actually have a volatile write cache this is only a
cosmetic issue - these hypervisors don't guarantee any data integrity
with or without this patch, but with the patch we at least provide
data ordering.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/block/virtio_blk.c
===
--- linux-2.6.orig/drivers/block/virtio_blk.c   2010-06-08 12:01:09.625254254 +0200
+++ linux-2.6/drivers/block/virtio_blk.c    2010-06-15 14:38:18.518273290 +0200
@@ -366,12 +366,32 @@ static int __devinit virtblk_probe(struc
vblk->disk->driverfs_dev = &vdev->dev;
index++;
 
-   /* If barriers are supported, tell block layer that queue is ordered */
-   if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+   if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
+   /*
+* If the FLUSH feature is supported we do have support for
+* flushing a volatile write cache on the host.  Use that
+* to implement write barrier support.
+*/
blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
  virtblk_prepare_flush);
-   else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
+   } else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) {
+   /*
+* If the BARRIER feature is supported the host expects us
+* to order requests by tag.  This implies there is no
+* volatile write cache on the host, and that the host
+* never re-orders outstanding I/O.  This feature is not
+* useful for real life scenarios and is deprecated.
+*/
blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL);
+   } else {
+   /*
+* If the FLUSH feature is not supported we must assume that
+* the host does not perform any kind of volatile write
+* caching. We still need to drain the queue to provide
+* proper barrier semantics.
+*/
+   blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
+   }
 
/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))


Re: KVM call minutes for June 15

2010-06-15 Thread Christoph Hellwig
On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote:
 KVM/qemu patches
 - patch rate is high, documentation is low, review is low
 - patches need to include better descriptions and documentation
   - will slow down patch writers
   - will make it easier for patch reviewers

What is the qemu patch review policy anyway?  There are no
Reviewed-by: included in the actual commits, and the requirement
for a positive review also seems to vary a lot, up to the point that
some commiters commit code that has never hit a public mailing list
before.



Re: [Qemu-devel] [PATCH 1/2] Add 'serial' attribute to virtio-blk devices

2010-06-21 Thread Christoph Hellwig
On Fri, Jun 18, 2010 at 01:38:02PM -0500, Ryan Harper wrote:
 Create a new attribute for virtio-blk devices that will fetch the serial 
 number
 of the block device.  This attribute can be used by udev to create disk/by-id
 symlinks for devices that don't have a UUID (filesystem) associated with them.
 
 ATA_IDENTIFY strings are special in that they can be up to 20 chars long
 and aren't required to be NULL-terminated.  The buffer is also zero-padded
 meaning that if the serial is 19 chars or less that we get a NULL terminated
 string.  When copying this value into a string buffer, we must be careful to
 copy up to the NULL (if it present) and only 20 if it is longer and not to
 attempt to NULL terminate; this isn't needed.

Why is this virtio-blk specific?  In a later mail you mention you want
to use it for udev.  So please export this from scsi/libata as well and
we have one proper interface that we can use for all devices.


Re: Graphical virtualisation management system

2010-06-25 Thread Christoph Hellwig
On Thu, Jun 24, 2010 at 02:01:52PM -0500, Javier Guerra Giraldez wrote:
 On Thu, Jun 24, 2010 at 1:32 PM, Freddie Cash fjwc...@gmail.com wrote:
  * virt-manager which requires X and seems to be more desktop-oriented;
 
 don't know about the others, but virt-manager runs only on the admin
 station.  on the VM hosts you run only libvirtd, which doesn't need X

While it can connect to remote systems it seems totally unusable for
that to me.  For one thing working over higher latency links like DSL
or even transatlantik links seems to be almost impossible.  Second I
still haven't figure out how to install and manage a system using the
serial console with KVM, which certainly contributes to the complete
lack of usability above.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-16 Thread Christoph Hellwig
On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote:
 Also, ext4 is _very_ slow on O_SYNC writes (which is
 used in kvm with default cache).
 
 Yeah, we probably need to switch to sync_file_range() to avoid the
 journal commit on every write.
 

No, we don't.  sync_file_range does not actually provide any data
integrity.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Mon, Aug 16, 2010 at 03:34:12PM -0500, Anthony Liguori wrote:
 On 08/16/2010 01:42 PM, Christoph Hellwig wrote:
 On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote:
 Also, ext4 is _very_ slow on O_SYNC writes (which is
 used in kvm with default cache).
 Yeah, we probably need to switch to sync_file_range() to avoid the
 journal commit on every write.
 
 No, we don't.  sync_file_range does not actually provide any data
 integrity.
 
 What do you mean by data integrity?

sync_file_range only does pagecache-level writeout of the file data.
It never calls into the actual filesystem, which means any block
allocations (for filling holes / converting preallocated space in normal
filesystems, or every write in COW-based filesystems like qcow2) never
get flushed to disk, and even more importantly the disk write cache is
never flushed.

In short it's completely worthless for any real filesystem.
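
To illustrate the difference (sketch, fd being the open image file): the
first call only starts and waits for pagecache writeback, the second goes
through the filesystem's ->fsync method and flushes the disk write cache:

    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
    fdatasync(fd);      /* this is what actually makes the data stable */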



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 12:23:01PM +0300, Avi Kivity wrote:
  On 08/17/2010 12:07 PM, Christoph Hellwig wrote:
 
 In short it's completely worthless for any real filesystem.
 
 
 The documentation should be updated then.  It suggests that it is
 usable for data integrity.

The manpage has a warning section documenting what I said above since
I added it in January.


Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 07:56:04AM -0500, Anthony Liguori wrote:
 But assuming that you had a preallocated disk image, it would
 effectively flush the page cache so it sounds like the only real
 issue is sparse and growable files.

For preallocated as in using fallocate() we are still converting unwritten
to regular extents and do have metadata updates.  For preallocated as
in writing zeroes into the whole image earlier we do indeed only
care about the data, and will not have metadata updates for most filesystems.
That still leaves COW-based filesystems that need to allocate new blocks
on every write, and from my reading NFS also needs the ->fsync callout
to actually commit unstable data to disk.

   and even more importantly the disk write cache is
 never flushed.
 
 The point is that we don't want to flush the disk write cache.  The
 intention of writethrough is not to make the disk cache writethrough
 but to treat the host's cache as writethrough.

We need to make sure data is not in the disk write cache if we want to
provide data integrity.  It has nothing to do with the qemu caching
mode - for data=writeback or none it's committed as part of the fdatasync
call, and for data=writethrough it's committed as part of the O_SYNC
write.  Note that both these paths end up calling the filesystem's ->fsync
method, which is what's required to make writes stable.  That's exactly
what is missing in sync_file_range, and that's why that API is not
useful at all for data integrity operations.  It's also what makes
fsync slow on extN - but the fix for that is not to drop data
integrity but rather to make fsync fast.  There are various other
filesystems that can already do it, and if you insist on using those
that are slow for this operation you'll have to suffer until that
issue is fixed for them.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 09:20:37AM -0500, Anthony Liguori wrote:
 On 08/17/2010 08:07 AM, Christoph Hellwig wrote:
 The point is that we don't want to flush the disk write cache.  The
 intention of writethrough is not to make the disk cache writethrough
 but to treat the host's cache as writethrough.
 
 We need to make sure data is not in the disk write cache if want to
 provide data integrity.
 
 When the guest explicitly flushes the emulated disk's write cache.
 Not on every single write completion.

That depends on the cache= mode.  For cache=none and cache=writeback
we present a write-back cache to the guest, and the guest does explicit
cache flushes.  For cache=writethrough we present a writethrough cache
to the guest, and we need to make sure data actually has hit the disk
before returning I/O completion to the guest.

It has nothing to do with the qemu caching
 mode - for data=writeback or none it's commited as part of the fdatasync
 call, and for data=writethrough it's commited as part of the O_SYNC
 write.  Note that both these path end up calling the filesystems -fsync
 method which is what's require to make writes stable.  That's exactly
 what is missing out in sync_file_range, and that's why that API is not
 useful at all for data integrity operations.
 
 For normal writes from a guest, we don't need to follow the write
 with an fsync().  We should only need to issue an fsync() given an
 explicit flush from the guest.

Define normal writes.  For cache=none and cache=writeback we don't
have to, and instead make explicit fsync()/fdatasync() calls
when we get a cache flush from the guest.  For data=writethrough we
guarantee data has made it to disk, and we implement this using
O_DSYNC/O_SYNC when opening the file.  That tells the operating system
not to return until data has hit the disk.   For Linux this is
internally implemented using a range-fsync/fdatasync after the actual
write.

 fsync() being slow is orthogonal to my point.  I don't see why we
 need to do an fsync() on *every* write.  It should only be necessary
 when a guest injects an actual barrier.

See above.
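To make the mapping above concrete, here is a rough sketch of how a
hypervisor backend could translate the cache= modes into open flags and
flush calls.  The names are hypothetical and this is not the actual qemu
code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

enum cache_mode { CACHE_NONE, CACHE_WRITEBACK, CACHE_WRITETHROUGH };

/* open the image according to the cache= mode described above */
static int image_open(const char *path, enum cache_mode mode)
{
        int flags = O_RDWR;

        if (mode == CACHE_NONE)
                flags |= O_DIRECT;      /* bypass the host pagecache */
        else if (mode == CACHE_WRITETHROUGH)
                flags |= O_DSYNC;       /* every write stable before completion */

        return open(path, flags);
}

/* a guest cache flush only needs real work for the write-back modes */
static int image_flush(int fd, enum cache_mode mode)
{
        if (mode == CACHE_WRITETHROUGH)
                return 0;               /* already stable per write */
        return fdatasync(fd);
}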



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 09:39:15AM -0500, Anthony Liguori wrote:
 The type of cache we present to the guest only should relate to how
 the hypervisor caches the storage.  It should be independent of how
 data is cached by the disk.

It is.

 There can be many levels of caching in a storage hierarchy and each
 hierarchy cached independently of the next level.
 
 If the user has a disk with a writeback cache, if we expose a
 writethrough cache to the guest, it's not our responsibility to make
 sure that we break through the writeback cache on the disk.

The user doesn't know or have to care about the caching.  The
user uses O_SYNC/fsync to say it wants data on disk, and it's the
operating system's job to make that happen.   The situation with qemu
is the same - if we tell the guest that we do not have a volatile write
cache that needs explicit management the guest can rely on the fact
that it does not have to do manual cache management.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 09:44:49AM -0500, Anthony Liguori wrote:
 I think the real issue is we're mixing host configuration with guest
 visible state.

The last time I proposed to decouple the two you and Avi were heavily
opposed to it..

 With O_SYNC, we're causing cache=writethrough to do writethrough
 through two layers of the storage heirarchy.  I don't think that's
 necessary or desirable though.

It's absolutely necessary if we tell the guest that we do not have
a volatile write cache.  Which is the only good reason to use
data=writethrough anyway - except for dealing with old guests that
can't handle a volatile write cache it's an absolutely stupid mode of
operation.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 09:54:07AM -0500, Anthony Liguori wrote:
 This is simply unrealistic.  O_SYNC might force data to be on a
 platter when using a directly attached disk but many NAS's actually
 do writeback caching and relying on having an UPS to preserve data
 integrity.  There's really no way in the general case to ensure that
 data is actually on a platter once you've involved a complex storage
 setup or you assume FUA

Yes, there is.  If you have an array that has battery backup it handles
this internally.  The normal case is to not set the WCE bit in the
mode page, which tells the operating system to never send
SYNCHRONIZE_CACHE commands.  I have one array that sets the WCE bit
nevertheless, but it also doesn't flush its non-volatile cache in
SYNCHRONIZE_CACHE, but rather implements it as an effective no-op.

 Let me put it another way.  If an admin knows the disks on a machine
 have battery backed cache, he's likely to leave writeback caching
 enabled.
 
 We are currently giving the admin two choices with QEMU, either
 ignore the fact that the disk is battery backed and do write through
 caching of the disk or do writeback caching in the host which

Again, this is not qemu's business at all.  Qemu is not different from
any other application requiring data integrity.  If that admin really
thinks he needs to override the storage-provided settings he can
mount the filesystem using -o nobarrier and we will not send cache
flushes.  I would in general recommend against this, as an external
UPS still has lots of failure modes that this doesn't account for.
Arrays with internal non-volatile memory already do the right thing
anyway.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 05:59:07PM +0300, Avi Kivity wrote:
 I agree, but there's another case: tell the guest that we have a
 write cache, use O_DSYNC, but only flush the disk cache on guest
 flushes.

O_DSYNC flushes the disk write cache on any filesystem that supports
doing so.   The disk cache is not an abstraction
exposed to applications.

 I believe this can be approximated by mounting the host filesystem
 with barrier=0?

Mounting the host filesystem with nobarrier means we will never explicitly
flush the volatile write cache on the disk.


Re: [Qemu-devel] Re: [PATCH v2] virtio-blk physical block size

2010-01-04 Thread Christoph Hellwig
On Mon, Jan 04, 2010 at 01:38:51PM +1030, Rusty Russell wrote:
 I thought this was what I was doing, but I have shown over and over that
 I have no idea about block devices.
 
 Our current driver treats BLK_SIZE as the logical and physical size (see
 blk_queue_logical_block_size).
 
 I have no idea what logical vs. physical actually means.  Anyone?  Most
 importantly, is it some Linux-internal difference or a real I/O-visible
 distinction?

Those should be the same for any sane interface.  They are for classical
disk devices with larger block sizes (MO, s390 dasd) and also for the
now appearing 4k sector scsi disks.  But in the ide world people are
concerned about DOS/Windows legacy compatibility so they came up with a
nasty hack:

 - there is a physical block size as used by the disk internally
   (4k initially)
 - all the interfaces to the operating system still happen in the
   traditional 512 byte blocks to not break any existing assumptions
 - to make sure modern operating systems can optimize for the larger
   physical sectors the disks expose this size, too.
 - even worse disks can also have alignment hacks for the traditional
   DOS partitions tables, so that the 512 byte block zero might even
   have an offset into the first larger physical block.  This is also
   exposed in the ATA identify information.

All in all I don't think this mess is a good idea to replicate in
virtio.  Virtio by definition requires virtualization-aware guests, so we
should just follow the SCSI way of larger real block sizes here.

 
 Rusty.
 
---end quoted text---
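For what it's worth, the logical/physical distinction is visible from
userspace on the host side; a small sketch using the block device topology
ioctls (assuming a kernel new enough to have BLKPBSZGET):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        unsigned int logical = 0, physical = 0;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);

        ioctl(fd, BLKSSZGET, &logical);         /* I/O-visible sector size */
        ioctl(fd, BLKPBSZGET, &physical);       /* internal physical sector size */

        printf("%s: logical %u, physical %u\n", argv[1], logical, physical);
        close(fd);
        return 0;
}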


Re: [Qemu-devel] [PATCH RFC] Advertise IDE physical block size as 4K

2010-01-04 Thread Christoph Hellwig
On Tue, Dec 29, 2009 at 02:39:38PM +0100, Luca Tettamanti wrote:
 Linux tools put the first partition at sector 63 (512-byte) to retain
 compatibility with Windows;

Well, some of them, and depending on the exact disks.  It's all rather
complicated.

  It has been discussed for hardware disk design with 4k sectors, and
  somehow there were plans to map sectors so that the Linux partition
  scheme results in nicely aligned filesystem blocks
 
 Ugh, I hope you're wrong ;-) AFAICS remapping will lead only to
 headaches... Linux does not have any problem with aligned partitions.

Linux doesn't care.  Neither does Windows.  But performance on mis-aligned
partitions will suck badly - both on 4k sector drives, SSDs or probably
various copy on write layers in virtualization once you hit the worst
case.  Fortunately the block topology information present in recent
ATA and SCSI standards allows the storage hardware to tell us about the
required alignment, and Linux now has a topology API to expose it, which
is used by the most recent versions of the partitioning tools and
filesystem creation tools.



Re: [Qemu-devel] [PATCH RFC] Advertise IDE physical block size as 4K

2010-01-04 Thread Christoph Hellwig
On Tue, Dec 29, 2009 at 12:07:58PM +0200, Avi Kivity wrote:
 Guests use this number as a hint for alignment and I/O request sizes.  Given
 that modern disks have 4K block sizes, and cached file-backed images also
 have 4K block sizes, this hint can improve guest performance.
 
 We probably need to make this configurable depending on machine type.  It
 should be the default for -M 0.13 only as it can affect guest code paths.

The information is correct per the ATA spec, but:

 (a) as mentioned above it should not be used for old machine types
 (b) we need to sort out passing through the first block alignment bits
 that are also in IDENTIFY word 106 if using a raw block device
 underneath
 (c) probably need to adjust the physical block size depending on the
 underlying storage topology.

I have a patch in my queue for a while now dealing with (b) and parts of
(c), but it's been preempted by more urgent work.



Re: [Qemu-devel] Re: [PATCH v2] virtio-blk physical block size

2010-01-08 Thread Christoph Hellwig
On Tue, Jan 05, 2010 at 08:16:15PM +, Jamie Lokier wrote:
 It would be good if virtio relayed the backing device's basic topology
 hints, so:
 
 - If the backing dev is a real disk with 512-byte sectors,
   virtio should indicate 512-byte blocks to the guest.
 
 - If the backing dev is a real disk with 4096-byte sectors,
   virtio should indicate 4096-byte blocks to the guest.
 
 With databases and filesystems, if you care about data integrity:
 
 - If the backing dev is a real disk with 4096-byte sectors,
   or a file whose access is through a 4096-byte-per-page cache,
   virtio must indicate 4096-byte blocks otherwise guest
   journalling is not host-powerfail safe.
 
 You get the idea.  If there is only one parameter, it really should be
 at least as large as the smallest unit which may be corrupted by
 writes when errors occur.

It's not that easy.  IDE only supports larger sector sizes with the
physical sector size attribute, not native larger sectors.  While scsi
does support it it's untypical and I would not expect the guests to
always get it right.  So the best is to use the transport-native
way to express that we have larger sectors and expect the guest to do
the right thing.

I've done some work on the autodetection of the larger sector sizes
a while ago.  But now that people brought up migration I wonder if that
makes sense - if we migrate from a 512 byte sector size disk to a 4096
byte sector size disk we can't simply change guest visible attributes.

Maybe we should pick one on image creation and then stick to it.  For an
image format we could write down this information in the image, but for
raw images that's impossible.



Re: [PATCHv2 3/4] Add HYPER-V apic access MSRs.

2010-01-17 Thread Christoph Hellwig
On Sun, Jan 17, 2010 at 02:20:32PM +0200, Avi Kivity wrote:
 +addr = kmap_atomic(page, KM_USER0);
 +clear_user_page(addr, vaddr, page);
 +kunmap_atomic(addr, KM_USER0);


 Surprising that clear_user_page needs kmap_atomic() (but true).

There's a clear_user_highpage helper to take care of it for you.



Re: [PATCHv2 3/4] Add HYPER-V apic access MSRs.

2010-01-17 Thread Christoph Hellwig
On Sun, Jan 17, 2010 at 02:41:42PM +0200, Gleb Natapov wrote:
 I copied code from the instead of using helper faction for
 some unknown to me reason. Anyway if I can't get struct page from
 user virtual address I can't use it. Actually I am not sure the page
 should be zeroed at all. Spec only descries first dword of the page and
 doesn't require zeroing the reset as far as I see.

There's a clear_user() function that just takes a user virtual address
and a size.



Re: [PATCH v3 06/12] Add get_user_pages() variant that fails if major fault is required.

2010-01-17 Thread Christoph Hellwig
On Tue, Jan 05, 2010 at 04:12:48PM +0200, Gleb Natapov wrote:
 This patch add get_user_pages() variant that only succeeds if getting
 a reference to a page doesn't require major fault.

  
 +int get_user_pages_noio(struct task_struct *tsk, struct mm_struct *mm,
 + unsigned long start, int nr_pages, int write, int force,
 + struct page **pages, struct vm_area_struct **vmas)
 +{
 + int flags = FOLL_TOUCH | FOLL_MINOR;
 +
 + if (pages)
 + flags |= FOLL_GET;
 + if (write)
 + flags |= FOLL_WRITE;
 + if (force)
 + flags |= FOLL_FORCE;
 +
 + return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);

Wouldn't it be better to just export __get_user_pages as a proper user
interface, maybe replacing get_user_pages by it entirely?



Re: [PATCH v3 05/12] Export __get_user_pages_fast.

2010-01-17 Thread Christoph Hellwig
On Tue, Jan 05, 2010 at 04:12:47PM +0200, Gleb Natapov wrote:
 KVM will use it to try and find a page without falling back to slow
 gup. That is why get_user_pages_fast() is not enough.

Btw, it seems like it is currently declared unconditionally in linux/mm.h
but only implemented by x86, and your code using it needs ifdefs for
that.  I think you should just introduce a stub that always returns
an error here.
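Something like the following stub would do; the guard symbol is made up
here, the point is just that architectures without a fast-path
implementation report zero pinned pages so callers fall back to the slow
path:

/* hypothetical sketch for linux/mm.h */
#ifndef CONFIG_HAVE_GET_USER_PAGES_FAST         /* guard name made up */
static inline int __get_user_pages_fast(unsigned long start, int nr_pages,
                                        int write, struct page **pages)
{
        /* no fast path on this architecture: pin nothing so the caller
         * falls back to the slow gup path */
        return 0;
}
#endif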



Re: Seabios incompatible with Linux 2.6.26 host?

2010-02-05 Thread Christoph Hellwig
On Thu, Feb 04, 2010 at 03:34:24PM +0100, Pierre Riteau wrote:
 I think I traced back the issue to the switch from Bochs BIOS to Seabios. By 
 forcing the usage of Bochs BIOS 5f08bb45861f54be478b25075b90d2406a0f8bb3 
 works, while it dies without the -bios override.
 Unfortunately, newer versions don't seem to work with Bochs BIOS.
 
 Upgrading the host kernel to 2.6.32 (Debian Squeeze) solves the issue. No 
 problem on Fedora 12 as well.

Even Linux 2.6.31 stopped working as a host for me since that commit.



Re: raw disks no longer work in latest kvm (kvm-88 was fine)

2010-03-07 Thread Christoph Hellwig
On Sun, Mar 07, 2010 at 12:32:38PM +0300, Michael Tokarev wrote:
 Antoine Martin wrote:
 []
  https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2933400&group_id=180599
 
 
  The initial report is almost 8 weeks old!
  Is data-corruption and data loss somehow less important than the
  hundreds of patches that have been submitted since?? Or is there a fix
  somewhere I've missed?
 
 I can only guess that the info collected so far is not sufficient to
 understand what's going on: except of I/O error writing block NNN
 we does not have anything at all.  So it's impossible to know where
 the problem is.

Actually it is, and the bug has been fixed long ago in:

commit e2a305fb13ff0f5cf6ff80aaa90a5ed5954c
Author: Christoph Hellwig h...@lst.de
Date:   Tue Jan 26 14:49:08 2010 +0100

block: avoid creating too large iovecs in multiwrite_merge


I've asked for it be added to the -stable series but that hasn't
happened so far.




Re: raw disks no longer work in latest kvm (kvm-88 was fine)

2010-03-07 Thread Christoph Hellwig
On Sun, Mar 07, 2010 at 07:30:06PM +0200, Avi Kivity wrote:
 It may also be that glibc is emulating preadv, incorrectly.

I've done a quick audit of all paths leading to pread and all seem
to align correctly.  So either a broken glibc emulation or something
else outside the block layer seems likely.

 Antoine, can you check this?  ltrace may help, or run 'strings libc.so |  
 grep pread'.

Or just add an

#undef CONFIG_PREADV

just before the first

#ifdef CONFIG_PREADV

in posix-aio-compat.c and see if that works.



Re: linux-aio usable?

2010-03-08 Thread Christoph Hellwig
On Mon, Mar 08, 2010 at 11:10:29AM +0200, Avi Kivity wrote:
 Are there any potential pitfalls?


 It won't work well unless running on a block device (partition or LVM).

It will actually work well on pre-allocated filesystem images, at least
on XFS and NFS.

The real pitfall is that cache=none is required for kernel support as
it only supports O_DIRECT.

 Is there any reason one should not compile that feature by default?

It's compiled by default if libaio and its development headers are
found.

 Does it do anything if not explicitly run with aio=native?

No.
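For reference, a minimal sketch of what aio=native does under the covers,
and why it depends on O_DIRECT (device path hypothetical, error handling
omitted, link with -laio):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;

        /* kernel AIO is only really asynchronous with O_DIRECT, which is
         * what cache=none gives us */
        int fd = open("/dev/vg0/guest-image", O_RDONLY | O_DIRECT);

        posix_memalign(&buf, 4096, 4096);       /* O_DIRECT wants aligned buffers */

        io_setup(32, &ctx);
        io_prep_pread(&cb, fd, buf, 4096, 0);
        io_submit(ctx, 1, cbs);
        io_getevents(ctx, 1, 1, &ev, NULL);     /* wait for the completion */
        io_destroy(ctx);

        close(fd);
        free(buf);
        return 0;
}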



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Christoph Hellwig
On Mon, Mar 15, 2010 at 06:43:06PM -0500, Anthony Liguori wrote:
 I knew someone would do this...

 This really gets down to your definition of safe behaviour.  As it  
 stands, if you suffer a power outage, it may lead to guest corruption.

 While we are correct in advertising a write-cache, write-caches are  
 volatile and should a drive lose power, it could lead to data  
 corruption.  Enterprise disks tend to have battery backed write caches  
 to prevent this.

 In the set up you're emulating, the host is acting as a giant write  
 cache.  Should your host fail, you can get data corruption.

 cache=writethrough provides a much stronger data guarantee.  Even in the  
 event of a host failure, data integrity will be preserved.

Actually cache=writeback is as safe as any normal host is with a
volatile disk cache, except that in this case the disk cache is
actually a lot larger.  With a properly implemented filesystem this
will never cause corruption.  You will lose recent updates after
the last sync/fsync/etc up to the size of the cache, but filesystem
metadata should never be corrupted, and data that has been forced to
disk using fsync/O_SYNC should never be lost either.  If it is, that's
a bug somewhere in the stack, but in my powerfail testing we never saw
that happen using xfs or ext3/4 after I fixed up the fsync code in the latter
two.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Christoph Hellwig
On Mon, Mar 15, 2010 at 08:27:25PM -0500, Anthony Liguori wrote:
 Actually cache=writeback is as safe as any normal host is with a
 volatile disk cache, except that in this case the disk cache is
 actually a lot larger.  With a properly implemented filesystem this
 will never cause corruption.

 Metadata corruption, not necessarily corruption of data stored in a file.

Again, this will not cause metadata corruption either if the filesystem
loses barriers, although we may lose up to the cache size of new (data
or metadata) operations.  The consistency of the filesystem is still
guaranteed.

 Not all software uses fsync as much as they should.  And often times,  
 it's for good reason (like ext3).

If an application needs data on disk it must call fsync, or there
is no guarantee at all, even on ext3.  And with growing disk caches
these issues show up on normal disks often enough that people have
realized it by now.


 IIUC, an O_DIRECT write using cache=writeback is not actually on the  
 spindle when the write() completes.  Rather, an explicit fsync() would  
 be required.  That will cause data corruption in many applications (like  
 databases) regardless of whether the fs gets metadata corruption.

It isn't on the spindle for O_DIRECT without qemu involved either.  The
O_DIRECT write goes through the disk cache and requires an explicit fsync
or O_SYNC
open flag to make sure it goes to disk.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Christoph Hellwig
Avi,

cache=writeback can be faster than cache=none for the same reasons
a disk cache speeds up access.  As long as the I/O mix contains more
asynchronous than synchronous writes it allows the host to do much
more reordering, only limited by the cache size (which can be quite
huge when using the host pagecache) and the amount of cache flushes
coming from the guest.  If you have an fsync-heavy workload or metadata
operations with a filesystem like the current XFS you will get lots
of cache flushes that limit the usefulness of the additional cache.

If you don't have a lot of cache flushes, e.g. due to dumb
applications that do not issue fsync, or even run ext3 in its default
mode which never issues cache flushes, the benefit will be enormous, but
the potential data loss and corruption will be enormous as well.

But even for something like btrfs that does provide data integrity
but issues cache flushes fairly efficiently data=writeback may
provide a quite nice speedup, especially if using multiple guests
accessing the same spindle(s).

But I wouldn't be surprised if IBM's extreme differences are indeed due
to the extremely unsafe ext3 default behaviour.


Re: KVM call agenda for Mar 16

2010-03-16 Thread Christoph Hellwig
On Tue, Mar 16, 2010 at 12:38:02PM +0200, Avi Kivity wrote:
 On 03/16/2010 12:31 PM, Daniel P. Berrange wrote:
 Polling loops are an indication that something is wrong.
  
 Except when people suggest they are the right answer, qcow high
 watermark ;-P


 I liked Anthony's suggestion of an lvm2 block format driver.  No polling.

I have done some work on linking the new lvm library to qemu to control
snapshotting.  But introducing a whole new block format seems like a lot
of duplication to me.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Christoph Hellwig
On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:
 Are you talking about direct volume access or qcow2?

Doesn't matter.

 For direct volume access, I still don't get it.  The number of barriers 
 issues by the host must equal (or exceed, but that's pointless) the 
 number of barriers issued by the guest.  cache=writeback allows the host 
 to reorder writes, but so does cache=none.  Where does the difference 
 come from?
 
 Put it another way.  In an unvirtualized environment, if you implement a 
 write cache in a storage driver (not device), and sync it on a barrier 
 request, would you expect to see a performance improvement?

cache=none only allows very limited reordering in the host.  O_DIRECT
is synchronous on the host, so there's just some very limited reordering
going on in the elevator if we have other I/O going on in parallel.
In addition to that the disk write cache can perform limited reordering
and caching, but the disk cache has a rather limited size.  The host
pagecache gives a much wider opportunity to reorder, especially if
the guest workload is not cache flush heavy.  If the guest workload
is extremely cache flush heavy the usefulness of the pagecache is rather
limited, as we'll only use very little of it, but pay by having to do
a data copy.  If the workload is not cache flush heavy, and we have
multiple guests doing I/O to the same spindles it will allow the host
to do much more efficient data writeout by being able to do better
ordered (less seeky) and bigger I/O (especially if the host has real
storage compared to ide for the guest).

 If you don't have a of lot of cache flushes, e.g. due to dumb
 applications that do not issue fsync, or even run ext3 in it's default
 mode never issues cache flushes the benefit will be enormous, but the
 data loss and possible corruption will be enormous.

 
 Shouldn't the host never issue cache flushes in this case? (for direct 
 volume access; qcow2 still needs flushes for metadata integrity).

If the guest never issues a flush the host will neither, indeed.  Data
will only go to disk by background writeout or memory pressure.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Christoph Hellwig
On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote:
 If the batch size is larger than the virtio queue size, or if there are 
 no flushes at all, then yes the huge write cache gives more opportunity 
 for reordering.  But we're already talking hundreds of requests here.

Yes.  And remember those don't have to come from the same guest.  Also
remember that we rather limit excessive reordering of O_DIRECT requests
in the I/O scheduler because they are synchronous-type I/O, while
we don't do that for pagecache writeback.

And we don't have unlimited virtio queue size, in fact it's quite
limited.


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Christoph Hellwig
On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote:
 They should be reorderable.  Otherwise host filesystems on several 
 volumes would suffer the same problems.

They are reorderable, just not as extremely as the page cache.
Remember that the request queue really is just a relatively small queue
of outstanding I/O, and that is absolutely intentional.  Large scale
_caching_ is done by the VM in the pagecache, with all the usual aging,
pressure, etc algorithms applied to it.  The block devices have a
relatively small fixed-size request queue associated with them to
facilitate request merging and limited reordering and having fully
set up I/O requests for the device.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Christoph Hellwig
On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote:
 Chris, can you carry out an experiment?  Write a program that pwrite()s 
 a byte to a file at the same location repeatedly, with the file opened 
 using O_SYNC.  Measure the write rate, and run blktrace on the host to 
 see what the disk (/dev/sda, not the volume) sees.  Should be a (write, 
 flush, write, flush) per pwrite pattern or similar (for writing the data 
 and a journal block, perhaps even three writes will be needed).
 
 Then scale this across multiple guests, measure and trace again.  If 
 we're lucky, the flushes will be coalesced, if not, we need to work on it.

As the person who has written quite a bit of the current O_SYNC
implementation and also reviewed the rest of it I can tell you that
those flushes won't be coalesced.  If we always rewrite the same block
we do the cache flush from the fsync method and there is nothing
to coalesce it with there.  If you actually do modify metadata (e.g. by
using the new real O_SYNC instead of the old one that always was O_DSYNC,
which I introduced in 2.6.33 but which isn't picked up by userspace yet)
you might hit a very limited transaction merging window in some
filesystems, but it's generally very small for a good reason.  If it
were too large we'd make the one in progress wait for I/O in another just
because we might expect transactions to coalesce later.  There's been
some long discussion about that fsync transaction batching tuning
for ext3 a while ago.
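For reference, a minimal version of the test program described in the
quoted text might look like this (path hypothetical); run blktrace against
the underlying disk while it runs to see the write/flush pattern:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char byte = 'x';
        int i;
        int fd = open("/mnt/test/o_sync_test", O_WRONLY | O_CREAT | O_SYNC, 0644);

        /* rewrite the same location; every pwrite has to wait for the data
         * write plus a disk cache flush before it returns */
        for (i = 0; i < 10000; i++)
                pwrite(fd, &byte, 1, 0);

        close(fd);
        return 0;
}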



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Christoph Hellwig
On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote:
 Meanwhile I looked at the code, and it looks bad.  There is an 
 IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before 
 issuing it.  In any case, qemu doesn't use it as far as I could tell, 
 and even if it did, device-matter doesn't implement the needed 
 -aio_fsync() operation.

No one implements it, and all surrounding code is dead wood.  It would
require us to do asynchronous pagecache operations, which involve
major surgery of the VM code.  Patches to do this were rejected multiple
times.



Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-11 Thread Christoph Hellwig
On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote:
 Some virtio devices are known to have guest drivers which expect a notify to 
 be
 processed synchronously and spin waiting for completion.  Only enable 
 ioeventfd
 for virtio-blk and virtio-net for now.

Who guarantees that less common virtio-blk and virtio-net guest drivers
for non-Linux OSes are fine with it?  Maybe you should add a feature flag
that the guest has to ACK to enable it.



Re: virtio block drivers not working

2009-03-22 Thread Christoph Hellwig
I do use virtio block in a very similar setup to yours (fully static
kernel, -kernel option to kvm/qemu) successfully for quite a while.

Can you post your kernel .config and the contents of /proc/devices
and /proc/partitions to debug this further?



Re: [Qemu-devel] [PATCH][RFC] Linux AIO support when using O_DIRECT

2009-03-23 Thread Christoph Hellwig
On Mon, Mar 23, 2009 at 06:17:36PM +0200, Avi Kivity wrote:
 Instead of introducing yet another layer of indirection, you could add  
 block-raw-linux-aio, which would be registered before block-raw-posix  
 (which is realy block-raw-threadpool...), and resist a -probe() if  
 caching is enabled.

Exactly the kind of comment I was about to make, but I need to read a
little deeper to understand all the details.

But my gut feeling is that this abstraction doesn't help us very much,
especially with Avi's aiocb pools in place.



Re: [Qemu-devel] [PATCH][RFC] Linux AIO support when using O_DIRECT

2009-03-23 Thread Christoph Hellwig
On Mon, Mar 23, 2009 at 12:14:58PM -0500, Anthony Liguori wrote:
 I'd like to see the O_DIRECT bounce buffering removed in favor of the  
 DMA API bouncing.  Once that happens, raw_read and raw_pread can  
 disappear.  block-raw-posix becomes much simpler.

See my vectored I/O patches for doing the bounce buffering at the
optimal place for the aio path. Note that from my reading of the
qcow/qcow2 code they might send down unaligned requests, which is
something the dma api would not help with.

For the buffered I/O path we will always have to do some sort of buffering
due to all the partition header reading / etc.  And given how that part
isn't performance critical my preference would be to keep doing it in
bdrv_pread/write and guarantee the lowlevel drivers proper alignment.

 We would drop the signaling stuff and have the thread pool use an fd to  
 signal.  The big problem with that right now is that it'll cause a  
 performance regression for certain platforms until we have the IO thread  
 in place.

Talking about signaling, does anyone remember why the Linux signalfd/
eventfd support is only in kvm but not in upstream qemu?



Re: [Qemu-devel] [PATCH][RFC] Linux AIO support when using O_DIRECT

2009-03-23 Thread Christoph Hellwig
On Mon, Mar 23, 2009 at 12:14:58PM -0500, Anthony Liguori wrote:
 block-raw-posix needs a major overhaul.  That's why I'm not even  
 considering committing the patch as is.

I have some WIP patches that split out the host device bits into
separate files to get block-raw-posix down to the pure file handling
bits without all the host-specific host device mess.  But it's at the
end of a really large pile, which needs to be rebased once we have the
patches already on the list in some form.


Re: [Qemu-devel] [PATCH][RFC] Linux AIO support when using O_DIRECT

2009-03-23 Thread Christoph Hellwig
On Mon, Mar 23, 2009 at 01:10:30PM -0500, Anthony Liguori wrote:
 I really dislike having so many APIs.  I'd rather have an aio API that 
 took byte accesses or have pread/pwrite always be emulated with a full 
 sector read/write

I had patches to change the aio API to byte-based access, and get rid
of the read/write methods to only have the byte-based pread/pwrite
APIs, but they got obsoleted by Avi's patch to kill the pread/pwrite
ops.  We could put in byte-based AIO without byte-based read/write,
though.  In my patches I put a flag into BlockDriverState indicating
whether we allow byte-based access to this instance or otherwise emulate
it in the block layer.  We still need this as many of the image formats
can't deal with byte-granularity access without read-modify-write cycles,
and I think we're better off having one read-modify-write handler in
the block layer than one per image format that needs it.
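A rough sketch of what such a single shared read-modify-write helper could
look like, assuming 512-byte sectors and qemu's sector-based
bdrv_read/bdrv_write interfaces; the function name is made up and this is
not code from the patches mentioned above (error handling omitted):

#include <string.h>
#include <stdint.h>

/* emulate a byte-granularity write on a driver that only handles whole
 * 512-byte sectors: read the containing sector, patch it, write it back */
static int emulate_pwrite(BlockDriverState *bs, int64_t offset,
                          const uint8_t *buf, int count)
{
        uint8_t sector[512];
        int64_t sector_num = offset / 512;
        int in_sector = offset % 512;

        while (count > 0) {
                int n = 512 - in_sector;

                if (n > count)
                        n = count;

                bdrv_read(bs, sector_num, sector, 1);   /* read */
                memcpy(sector + in_sector, buf, n);     /* modify */
                bdrv_write(bs, sector_num, sector, 1);  /* write */

                buf += n;
                count -= n;
                in_sector = 0;
                sector_num++;
        }
        return 0;
}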


Re: Split kvm source tarballs

2009-03-25 Thread Christoph Hellwig
On Wed, Mar 25, 2009 at 08:44:58AM -0500, Anthony Liguori wrote:
 That's what I figured.  FWIW, the split tarballs work just fine for me.

 It may be worth waiting to do step 2 until the IO thread is merged.  I  
 think once that happens, we could probably do a sprint to get rid of  
 libkvm in kvm-userspace.  That would certainly simplify things.

Yeah.  And having both the common and split repos just confuses the
heck out of any user of the repository.  I think the right way to split
it is to wait for libkvm going away and just have a qemu-kvm repository
and an entirely separate kernel module repository.  It's not like there
is anything common but the few exported ABI headers, and we can either
keep them in both (would mean qemu-kvm can always build against a
defined set of headers) or make qemu-kvm require a kernel source like
the current kvm support in upstream qemu.



Re: Split kvm source tarballs

2009-03-25 Thread Christoph Hellwig
On Wed, Mar 25, 2009 at 08:02:48PM +0200, Avi Kivity wrote:
 So how about this:

 - keep copies of the headers in the qemu repository. 'make sync' becomes  
 a maintainer tool rather than a developer tool

Yeah.  That's similar to how we maintain the headers and some shared source
file for XFS and libxfs in xfsprogs.

 - move qemu to the root of the repository, and reparent libkvm/ user/  
 and friends under it.  this will give us easier merging.

Yeah.  While you're at it user/ might be renamed to something more
descriptive.

 - move the external module kit into kvm.git

As in your kvm development kernel tree?  Not sure it's a good idea
to suck this into a full kernel tree.  Probably worth making it a small
repository of its own.



Re: Split kvm source tarballs

2009-03-26 Thread Christoph Hellwig
On Wed, Mar 25, 2009 at 04:34:31PM -0500, Anthony Liguori wrote:
 But if you created a qemu-svn-stable branch that followed the QEMU  
 stable tree in kvm-userspace, like the qemu-cvs branch follows trunk,  
 then it would be pretty easy to create and maintain a kvm_stable_0_10  
 branch of whatever you'd like to call it in kvm-userspace.

 Any chance you could do this?  I suspect it's just a matter of creating  
 the branch based off of the qemu-cvs tree at dde and then doing a  
 git-svn fetch.

Slightly offtopic, but I always wondered why qemu is hosted in svn.  For
all the projects having semi-forks of qemu that they try to keep in sync
or even merge back, a distributed scm would work so much better.



Re: Split kvm source tarballs

2009-03-26 Thread Christoph Hellwig
On Thu, Mar 26, 2009 at 11:09:07AM +0200, Avi Kivity wrote:
 If the repo contains only the kit (external-module.h and the hack  
 scripts) we'll be left with dual repositories with their confusion and  
 unbisectability.  If the repo contains both the kit and the code, I'll  
 need to commit every kvm.git change into that repository, which I'm sure  
 to botch now and then.

 Do you see any specific problem with dropping kvm-userspace.git/kernel/*  
 under kvm.git/scripts/kvm/?

Adding more stuff to a kernel tree that doesn't apply to building that
tree seems rather odd.  But if it makes your life easier go for it, it's
definitely much better than the current setup.



Re: Split kvm source tarballs

2009-03-26 Thread Christoph Hellwig
On Thu, Mar 26, 2009 at 01:19:46PM -0500, Anthony Liguori wrote:
 Slightly offtopic, but I always wondered why qemu is hosted in svn.  For
 all project having semi-forks of qemu that they try to keep in sync
 or even merge back a distributed scm would work so much better.
   

 I'm going to switch it to git fwiw.

If it's soon enough maybe the kvm repository reorganization should wait
for it so that the new repository can be cloned from upstream qemu?



Re: Split kvm source tarballs

2009-04-16 Thread Christoph Hellwig
Any idea when the split will happen?


Re: [PATCH] qemu-kvm: build system Add link to qemu

2009-04-26 Thread Christoph Hellwig
On Sun, Apr 26, 2009 at 01:33:37PM +0300, Avi Kivity wrote:
 Jan Kiszka wrote:
 I'm getting closer to a working qemu-kvm, but there are still a few
 messy parts. The magic dance goes like this:
   

 Try a fresh fetch.  ./configure && make ought to work.

Works for me now.



[PATCH] virtio_blk: SG_IO passthru support

2009-04-27 Thread Christoph Hellwig
From: Hannes Reinecke h...@suse.de

Add support for SG_IO passthru to virtio_blk.  We add the scsi command
block after the normal outhdr, and the scsi inhdr with full status
information as well as the sense buffer before the regular inhdr.

[hch: forward ported, added the VIRTIO_BLK_F_SCSI flags, some comments
 and tested the whole beast]


Signed-off-by: Hannes Reinecke h...@suse.de
Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/block/virtio_blk.c
===
--- linux-2.6.orig/drivers/block/virtio_blk.c   2009-04-26 23:01:41.330960470 
+0200
+++ linux-2.6/drivers/block/virtio_blk.c2009-04-27 00:18:04.708076895 
+0200
@@ -37,6 +37,7 @@ struct virtblk_req
struct list_head list;
struct request *req;
struct virtio_blk_outhdr out_hdr;
+   struct virtio_scsi_inhdr in_hdr;
u8 status;
 };
 
@@ -49,7 +50,9 @@ static void blk_done(struct virtqueue *v
 
spin_lock_irqsave(vblk-lock, flags);
while ((vbr = vblk-vq-vq_ops-get_buf(vblk-vq, len)) != NULL) {
+   unsigned int nr_bytes;
int error;
+
switch (vbr-status) {
case VIRTIO_BLK_S_OK:
error = 0;
@@ -62,7 +65,15 @@ static void blk_done(struct virtqueue *v
break;
}
 
-   __blk_end_request(vbr-req, error, blk_rq_bytes(vbr-req));
+   if (blk_pc_request(vbr-req)) {
+   vbr-req-data_len = vbr-in_hdr.residual;
+   nr_bytes = vbr-in_hdr.data_len;
+   vbr-req-sense_len = vbr-in_hdr.sense_len;
+   vbr-req-errors = vbr-in_hdr.errors;
+   } else
+   nr_bytes = blk_rq_bytes(vbr-req);
+
+   __blk_end_request(vbr-req, error, nr_bytes);
list_del(vbr-list);
mempool_free(vbr, vblk-pool);
}
@@ -74,7 +85,7 @@ static void blk_done(struct virtqueue *v
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
   struct request *req)
 {
-   unsigned long num, out, in;
+   unsigned long num, out = 0, in = 0;
struct virtblk_req *vbr;
 
vbr = mempool_alloc(vblk-pool, GFP_ATOMIC);
@@ -99,18 +110,36 @@ static bool do_req(struct request_queue 
if (blk_barrier_rq(vbr-req))
vbr-out_hdr.type |= VIRTIO_BLK_T_BARRIER;
 
-   sg_set_buf(vblk-sg[0], vbr-out_hdr, sizeof(vbr-out_hdr));
-   num = blk_rq_map_sg(q, vbr-req, vblk-sg+1);
-   sg_set_buf(vblk-sg[num+1], vbr-status, sizeof(vbr-status));
-
-   if (rq_data_dir(vbr-req) == WRITE) {
-   vbr-out_hdr.type |= VIRTIO_BLK_T_OUT;
-   out = 1 + num;
-   in = 1;
-   } else {
-   vbr-out_hdr.type |= VIRTIO_BLK_T_IN;
-   out = 1;
-   in = 1 + num;
+   sg_set_buf(vblk-sg[out++], vbr-out_hdr, sizeof(vbr-out_hdr));
+
+   /*
+* If this is a packet command we need a couple of additional headers.
+* Behind the normal outhdr we put a segment with the scsi command
+* block, and before the normal inhdr we put the sense data and the
+* inhdr with additional status information before the normal inhdr.
+*/
+   if (blk_pc_request(vbr-req))
+   sg_set_buf(vblk-sg[out++], vbr-req-cmd, vbr-req-cmd_len);
+
+   num = blk_rq_map_sg(q, vbr-req, vblk-sg + out);
+
+   if (blk_pc_request(vbr-req)) {
+   sg_set_buf(vblk-sg[num + out + in++], vbr-req-sense, 96);
+   sg_set_buf(vblk-sg[num + out + in++], vbr-in_hdr,
+  sizeof(vbr-in_hdr));
+   }
+
+   sg_set_buf(vblk-sg[num + out + in++], vbr-status,
+  sizeof(vbr-status));
+
+   if (num) {
+   if (rq_data_dir(vbr-req) == WRITE) {
+   vbr-out_hdr.type |= VIRTIO_BLK_T_OUT;
+   out += num;
+   } else {
+   vbr-out_hdr.type |= VIRTIO_BLK_T_IN;
+   in += num;
+   }
}
 
if (vblk-vq-vq_ops-add_buf(vblk-vq, vblk-sg, out, in, vbr)) {
@@ -149,8 +178,16 @@ static void do_virtblk_request(struct re
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 unsigned cmd, unsigned long data)
 {
-   return scsi_cmd_ioctl(bdev-bd_disk-queue,
- bdev-bd_disk, mode, cmd,
+   struct gendisk *disk = bdev-bd_disk;
+   struct virtio_blk *vblk = disk-private_data;
+
+   /*
+* Only allow the generic SCSI ioctls if the host can support it.
+*/
+   if (!virtio_has_feature(vblk-vdev, VIRTIO_BLK_F_SCSI))
+   return -ENOIOCTLCMD;
+
+   return scsi_cmd_ioctl(disk-queue, disk, mode, cmd,
  (void __user *)data);
 }
 
@@ -356,6 +393,7 @@ static

[PATCH] virtio-blk: add SGI_IO passthru support

2009-04-27 Thread Christoph Hellwig
Add support for SG_IO passthru (packet commands) to the virtio-blk
backend.  Conceptually based on an older patch from Hannes Reinecke
but largely rewritten to match the code structure and layering in
virtio-blk.

Note that currently we issue the host SG_IO synchronously.  We could
easily switch to async I/O, but that would require either bloating
the VirtIOBlockReq by the size of struct sg_io_hdr or an additional
memory allocation for each SG_IO request.


Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/virtio-blk.h
===
--- qemu.orig/hw/virtio-blk.h   2009-04-26 16:50:38.154074532 +0200
+++ qemu/hw/virtio-blk.h2009-04-26 22:51:16.838076869 +0200
@@ -28,6 +28,9 @@
 #define VIRTIO_BLK_F_SIZE_MAX   1   /* Indicates maximum segment size */
 #define VIRTIO_BLK_F_SEG_MAX2   /* Indicates maximum # of segments */
 #define VIRTIO_BLK_F_GEOMETRY   4   /* Indicates support of legacy 
geometry */
+#define VIRTIO_BLK_F_RO 5   /* Disk is read-only */
+#define VIRTIO_BLK_F_BLK_SIZE   6   /* Block size of disk is available*/
+#define VIRTIO_BLK_F_SCSI   7   /* Supports scsi command passthru */
 
 struct virtio_blk_config
 {
@@ -70,6 +73,15 @@ struct virtio_blk_inhdr
 unsigned char status;
 };
 
+/* SCSI pass-through header */
+struct virtio_scsi_inhdr
+{
+uint32_t errors;
+uint32_t data_len;
+uint32_t sense_len;
+uint32_t residual;
+};
+
 void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs);
 
 #endif
Index: qemu/hw/virtio-blk.c
===
--- qemu.orig/hw/virtio-blk.c   2009-04-26 16:50:38.160074667 +0200
+++ qemu/hw/virtio-blk.c2009-04-27 10:25:19.278074514 +0200
@@ -15,6 +15,9 @@
 #include sysemu.h
 #include virtio-blk.h
 #include block_int.h
+#ifdef __linux__
+# include scsi/sg.h
+#endif
 
 typedef struct VirtIOBlock
 {
@@ -35,6 +38,7 @@ typedef struct VirtIOBlockReq
 VirtQueueElement elem;
 struct virtio_blk_inhdr *in;
 struct virtio_blk_outhdr *out;
+struct virtio_scsi_inhdr *scsi;
 QEMUIOVector qiov;
 struct VirtIOBlockReq *next;
 } VirtIOBlockReq;
@@ -103,6 +107,108 @@ static VirtIOBlockReq *virtio_blk_get_re
 return req;
 }
 
+#ifdef __linux__
+static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
+{
+struct sg_io_hdr hdr;
+int ret, size = 0;
+int status;
+int i;
+
+/*
+ * We require at least one output segment each for the virtio_blk_outhdr
+ * and the SCSI command block.
+ *
+ * We also at least require the virtio_blk_inhdr, the virtio_scsi_inhdr
+ * and the sense buffer pointer in the input segments.
+ */
+if (req-elem.out_num  2 || req-elem.in_num  3) {
+virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+return;
+}
+
+/*
+ * No support for bidirection commands yet.
+ */
+if (req-elem.out_num  2  req-elem.in_num  3) {
+virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
+return;
+}
+
+/*
+ * The scsi inhdr is placed in the second-to-last input segment, just
+ * before the regular inhdr.
+ */
+req-scsi = (void *)req-elem.in_sg[req-elem.in_num - 2].iov_base;
+size = sizeof(*req-in) + sizeof(*req-scsi);
+
+memset(hdr, 0, sizeof(struct sg_io_hdr));
+hdr.interface_id = 'S';
+hdr.cmd_len = req-elem.out_sg[1].iov_len;
+hdr.cmdp = req-elem.out_sg[1].iov_base;
+hdr.dxfer_len = 0;
+
+if (req-elem.out_num  2) {
+/*
+ * If there are more than the minimally required 2 output segments
+ * there is write payload starting from the third iovec.
+ */
+hdr.dxfer_direction = SG_DXFER_TO_DEV;
+hdr.iovec_count = req-elem.out_num - 2;
+
+for (i = 0; i  hdr.iovec_count; i++)
+hdr.dxfer_len += req-elem.out_sg[i + 2].iov_len;
+
+hdr.dxferp = req-elem.out_sg + 2;
+
+} else if (req-elem.in_num  3) {
+/*
+ * If we have more than 3 input segments the guest wants to actually
+ * read data.
+ */
+hdr.dxfer_direction = SG_DXFER_FROM_DEV;
+hdr.iovec_count = req-elem.in_num - 3;
+for (i = 0; i  hdr.iovec_count; i++)
+hdr.dxfer_len += req-elem.in_sg[i].iov_len;
+
+hdr.dxferp = req-elem.in_sg;
+size += hdr.dxfer_len;
+} else {
+/*
+ * Some SCSI commands don't actually transfer any data.
+ */
+hdr.dxfer_direction = SG_DXFER_NONE;
+}
+
+hdr.sbp = req-elem.in_sg[req-elem.in_num - 3].iov_base;
+hdr.mx_sb_len = req-elem.in_sg[req-elem.in_num - 3].iov_len;
+size += hdr.mx_sb_len;
+
+ret = bdrv_ioctl(req-dev-bs, SG_IO, hdr);
+if (ret) {
+status = VIRTIO_BLK_S_UNSUPP;
+hdr.status = ret;
+hdr.resid = hdr.dxfer_len;
+} else if (hdr.status) {
+status = VIRTIO_BLK_S_IOERR;
+} else {
+status

Re: [ANNOUNCE] kvm-85 release

2009-04-27 Thread Christoph Hellwig
On Mon, Apr 27, 2009 at 04:45:23PM +0200, Alexander Graf wrote:
  - Disabled pwritev() support until glibc stops writing random junk.
See https://bugzilla.redhat.com/497429

 Wouldn't it be useful to have it disabled upstream then?

No glibc has been released with the broken one yet, I guess Mark must
have been running some sort of snapshot.



Re: [Qemu-devel] [PATCH] virtio-blk: add SGI_IO passthru support

2009-04-28 Thread Christoph Hellwig
On Mon, Apr 27, 2009 at 12:15:31PM +0300, Avi Kivity wrote:
 I think that's worthwhile.  The extra bloat is trivial (especially as 
 the number of inflight virtio requests is tightly bounded), and stalling 
 the vcpu for requests is a pain.

Ok, new patch will follow ASAP.



[PATCH 2/2] virtio-blk: add SG_IO passthru support

2009-04-28 Thread Christoph Hellwig
Add support for SG_IO passthru (packet commands) to the virtio-blk
backend.  Conceptually based on an older patch from Hannes Reinecke
but largely rewritten to match the code structure and layering in
virtio-blk as well as doing asynchronous I/O.


Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/virtio-blk.h
===
--- qemu.orig/hw/virtio-blk.h   2009-04-28 11:42:14.059074434 +0200
+++ qemu/hw/virtio-blk.h2009-04-28 11:44:24.930074531 +0200
@@ -28,6 +28,9 @@
 #define VIRTIO_BLK_F_SIZE_MAX   1   /* Indicates maximum segment size */
 #define VIRTIO_BLK_F_SEG_MAX2   /* Indicates maximum # of segments */
 #define VIRTIO_BLK_F_GEOMETRY   4   /* Indicates support of legacy 
geometry */
+#define VIRTIO_BLK_F_RO 5   /* Disk is read-only */
+#define VIRTIO_BLK_F_BLK_SIZE   6   /* Block size of disk is available*/
+#define VIRTIO_BLK_F_SCSI   7   /* Supports scsi command passthru */
 
 struct virtio_blk_config
 {
@@ -70,6 +73,15 @@ struct virtio_blk_inhdr
 unsigned char status;
 };
 
+/* SCSI pass-through header */
+struct virtio_scsi_inhdr
+{
+uint32_t errors;
+uint32_t data_len;
+uint32_t sense_len;
+uint32_t residual;
+};
+
 void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs);
 
 #endif
Index: qemu/hw/virtio-blk.c
===
--- qemu.orig/hw/virtio-blk.c   2009-04-28 11:42:14.066074487 +0200
+++ qemu/hw/virtio-blk.c2009-04-28 11:52:45.836079580 +0200
@@ -15,6 +15,9 @@
 #include sysemu.h
 #include virtio-blk.h
 #include block_int.h
+#ifdef __linux__
+# include scsi/sg.h
+#endif
 
 typedef struct VirtIOBlock
 {
@@ -35,6 +38,8 @@ typedef struct VirtIOBlockReq
 VirtQueueElement elem;
 struct virtio_blk_inhdr *in;
 struct virtio_blk_outhdr *out;
+struct virtio_scsi_inhdr *scsi;
+struct sg_io_hdr scsi_hdr;
 QEMUIOVector qiov;
 struct VirtIOBlockReq *next;
 } VirtIOBlockReq;
@@ -103,6 +108,108 @@ static VirtIOBlockReq *virtio_blk_get_re
 return req;
 }
 
+#ifdef __linux__
+static void virtio_blk_scsi_complete(void *opaque, int ret)
+{
+VirtIOBlockReq *req = opaque;
+int status;
+
+if (ret) {
+status = VIRTIO_BLK_S_UNSUPP;
+req-scsi_hdr.status = -ret;
+req-scsi_hdr.resid = req-scsi_hdr.dxfer_len;
+} else if (req-scsi_hdr.status) {
+status = VIRTIO_BLK_S_IOERR;
+} else {
+status = VIRTIO_BLK_S_OK;
+}
+
+req-scsi-errors = req-scsi_hdr.status;
+req-scsi-residual = req-scsi_hdr.resid;
+req-scsi-sense_len = req-scsi_hdr.sb_len_wr;
+req-scsi-data_len = req-scsi_hdr.dxfer_len;
+
+virtio_blk_req_complete(req, status);
+}
+
+static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
+{
+int i;
+
+/*
+ * We require at least one output segment each for the virtio_blk_outhdr
+ * and the SCSI command block.
+ *
+ * We also at least require the virtio_blk_inhdr, the virtio_scsi_inhdr
+ * and the sense buffer pointer in the input segments.
+ */
+if (req-elem.out_num  2 || req-elem.in_num  3) {
+virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+return;
+}
+
+/*
+ * No support for bidirection commands yet.
+ */
+if (req-elem.out_num  2  req-elem.in_num  3) {
+virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
+return;
+}
+
+/*
+ * The scsi inhdr is placed in the second-to-last input segment, just
+ * before the regular inhdr.
+ */
+req-scsi = (void *)req-elem.in_sg[req-elem.in_num - 2].iov_base;
+
+memset(req-scsi_hdr, 0, sizeof(struct sg_io_hdr));
+req-scsi_hdr.interface_id = 'S';
+req-scsi_hdr.cmd_len = req-elem.out_sg[1].iov_len;
+req-scsi_hdr.cmdp = req-elem.out_sg[1].iov_base;
+req-scsi_hdr.dxfer_len = 0;
+
+if (req-elem.out_num  2) {
+/*
+ * If there are more than the minimally required 2 output segments
+ * there is write payload starting from the third iovec.
+ */
+req-scsi_hdr.dxfer_direction = SG_DXFER_TO_DEV;
+req-scsi_hdr.iovec_count = req-elem.out_num - 2;
+
+for (i = 0; i  req-scsi_hdr.iovec_count; i++)
+req-scsi_hdr.dxfer_len += req-elem.out_sg[i + 2].iov_len;
+req-scsi_hdr.dxferp = req-elem.out_sg + 2;
+} else if (req-elem.in_num  3) {
+/*
+ * If we have more than 3 input segments the guest wants to actually
+ * read data.
+ */
+req-scsi_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
+req-scsi_hdr.iovec_count = req-elem.in_num - 3;
+
+for (i = 0; i  req-scsi_hdr.iovec_count; i++)
+req-scsi_hdr.dxfer_len += req-elem.in_sg[i].iov_len;
+req-scsi_hdr.dxferp = req-elem.in_sg;
+} else {
+/*
+ * Some SCSI commands don't actually transfer any data.
+ */
+req-scsi_hdr.dxfer_direction
