Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-03 Thread Christoph Hellwig
On Tue, Nov 03, 2015 at 10:08:13AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > > Then I would argue for naming this differently. Make it an optional
> > > hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> > > achieved via using a bypass or other means in the backend not the
> > > business of the driver.
> > > 
> > 
> > With a name like that, who wouldn't pass that flag? ;-)
> 
> xHCI for example, vs. something like 10G ethernet... but yes I agree it
> sucks. I don't like that sort of policy anywhere in drivers. On the
> other hand the platform doesn't have much information to make that sort
> of decision either.

Maybe because it should simply use what's optimal?  E.g. passthrough
whenever possible, where the arguments against possible are:  dma_mask, vfio
requirements, kernel command line option.  This is what a lot of
architectures already do; I remember the SGI Origin / Altix code having the
same behavior as well.  Those IOMMUs already had the 64-bit passthrough
and 32-bit sliding window in addition to the real IOMMU 10 years ago.


Re: [PATCH v3 0/4] scsi: cleanup ioctl headers and provide UAPI versions

2015-09-25 Thread Christoph Hellwig
The whole series looks good to me.  Thanks for picking this work up!

Reviewed-by: Christoph Hellwig <h...@lst.de>


Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq

2015-08-09 Thread Christoph Hellwig
On Fri, Aug 07, 2015 at 06:45:26PM +0200, Peter Zijlstra wrote:
> Its just the swait_wake_all() that is not. The entire purpose of them
> was to have something that allows bounded execution (RT and all).

Still not sure if that might be too big a burden for mainline, but at
least it's not as severe.


Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq

2015-08-07 Thread Christoph Hellwig
On Fri, Aug 07, 2015 at 01:14:15PM +0200, Peter Zijlstra wrote:
> On that, we cannot convert completions to swait. Because swait wake_all
> must not happen from IRQ context, and complete_all() typically is used
> from just that.

If swait queues aren't usable from IRQ context they will be fairly
useless.  What's the problem with making them irq safe?


Re: [PATCH 0/8] vhost/scsi: Add ANY_LAYOUT support

2015-02-02 Thread Christoph Hellwig
Hi Nic,

Al has been rewriting the vhost code to use iov_iter primitives,
can you please rebase it on top of that instead of using the obsolete
infrastructure?



Re: [PATCH] vhost-blk: Add vhost-blk support v2

2012-10-09 Thread Christoph Hellwig
> +static int vhost_blk_req_submit(struct vhost_blk_req *req, struct file *file)
> +{
> +
> +	struct inode *inode = file->f_mapping->host;
> +	struct block_device *bdev = inode->i_bdev;
> +	int ret;

Please just pass the block_device directly instead of a file struct.

> +
> +	ret = vhost_blk_bio_make(req, bdev);
> +	if (ret < 0)
> +		return ret;
> +
> +	vhost_blk_bio_send(req);
> +
> +	return ret;
> +}

Then again, given how simple this function is, it probably should just go
away entirely.

> +	set_fs(USER_DS);

What do we actually need the set_fs for here?

> +
> +static inline void vhost_blk_stop(struct vhost_blk *blk, struct file **file)
> +{
> +
> +	*file = vhost_blk_stop_vq(blk, &blk->vq);
> +}

What is the point of this helper?  Also I can't see anyone actually
using the returned struct file.

> +	case VIRTIO_BLK_T_FLUSH:
> +		ret = vfs_fsync(file, 1);
> +		status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
> +		if (!vhost_blk_set_status(req, status))
> +			vhost_add_used_and_signal(&blk->dev, vq, head, ret);
> +		break;

Sending an fsync here is actually wrong in two different ways:

 a) it operates at the filesystem level instead of bio level
 b) it's a blocking operation

It should instead send a REQ_FLUSH bio using the same submission scheme
as the read/write requests.
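
For illustration, a minimal sketch of that scheme (not from the patch under
review; vhost_blk_req_done() is a hypothetical completion helper): an empty
bio is submitted with REQ_FLUSH, and completion is signalled from bi_end_io,
so nothing blocks in the vhost work function.

static void vhost_blk_flush_done(struct bio *bio, int err)
{
	struct vhost_blk_req *req = bio->bi_private;

	/* signal guest completion from the bio completion handler */
	vhost_blk_req_done(req, err);		/* hypothetical helper */
	bio_put(bio);
}

static int vhost_blk_submit_flush(struct vhost_blk_req *req,
				  struct block_device *bdev)
{
	/* zero-segment bio: a pure cache flush, no data payload */
	struct bio *bio = bio_alloc(GFP_KERNEL, 0);

	if (!bio)
		return -ENOMEM;
	bio->bi_bdev = bdev;
	bio->bi_private = req;
	bio->bi_end_io = vhost_blk_flush_done;
	submit_bio(WRITE_FLUSH, bio);
	return 0;
}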


kvm forum schedule?

2012-09-26 Thread Christoph Hellwig
Does anyone know when the kvm forum schedule for this year will be
published?  I'm especially curious if Friday will be a full conference
day or if it makes sense to fly back in the afternoon.


Re: [PATCH V6 2/2] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

2012-08-07 Thread Christoph Hellwig
At least after review is done I really think this patch should be folded
into the previous one.

Some more comments below:

> @@ -58,6 +58,12 @@ struct virtblk_req
>  	struct bio *bio;
>  	struct virtio_blk_outhdr out_hdr;
>  	struct virtio_scsi_inhdr in_hdr;
> +	struct work_struct work;
> +	struct virtio_blk *vblk;

I think using bio->bi_private for the virtio_blk pointer would
be cleaner.


> +	bool is_flush;
> +	bool req_flush;
> +	bool req_data;
> +	bool req_fua;

This could be a bitmap, or better even a single state field.
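
For illustration, a sketch of that suggestion (flag names are hypothetical,
not from the patch):

enum virtblk_req_flags {
	VBR_IS_FLUSH	= 1U << 0,	/* this request is the flush itself */
	VBR_REQ_FLUSH	= 1U << 1,	/* original bio had REQ_FLUSH set */
	VBR_REQ_DATA	= 1U << 2,	/* original bio carries data */
	VBR_REQ_FUA	= 1U << 3,	/* original bio had REQ_FUA set */
};

struct virtblk_req would then carry a single "unsigned int flags" instead of
four bools, tested with e.g. (vbr->flags & VBR_REQ_FUA).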

> +static int virtblk_bio_send_flush(struct virtio_blk *vblk,
> +				  struct virtblk_req *vbr)

> +static int virtblk_bio_send_data(struct virtio_blk *vblk,
> +				 struct virtblk_req *vbr)

These should only get the struct virtblk_req * argument as the virtio_blk
structure is easily derivable from it.

> +static inline void virtblk_bio_done_flush(struct virtio_blk *vblk,
> +					  struct virtblk_req *vbr)
>  {
> +	if (vbr->req_data) {
> +		/* Send out the actual write data */
> +		struct virtblk_req *_vbr;
> +		_vbr = virtblk_alloc_req(vblk, GFP_NOIO);
> +		if (!_vbr) {
> +			bio_endio(vbr->bio, -ENOMEM);
> +			goto out;
> +		}
> +		_vbr->req_fua = vbr->req_fua;
> +		_vbr->bio = vbr->bio;
> +		_vbr->vblk = vblk;
> +		INIT_WORK(&_vbr->work, virtblk_bio_send_data_work);
> +		queue_work(virtblk_wq, &_vbr->work);

The _vbr naming isn't too nice.  Also can you explain why the original
request can't be reused in a comment here?

Also if using a state variable I think the whole code would be
a bit cleaner if the bio_done helpers are combined.

> -	if (writeback && !use_bio)
> +	if (writeback)
>  		blk_queue_flush(vblk->disk->queue, REQ_FLUSH);

Shouldn't this be REQ_FLUSH | REQ_FUA for the bio case?
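
I.e. something like the following sketch (my suggestion, not the patch as
posted; it assumes the use_bio flag from this series is visible here):

	if (writeback)
		blk_queue_flush(vblk->disk->queue,
				use_bio ? REQ_FLUSH | REQ_FUA : REQ_FLUSH);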

Btw, did you verify that flushes really work correctly for all cases
using tracing in qemu?



Re: [PATCH V6 0/2] Improve virtio-blk performance

2012-08-07 Thread Christoph Hellwig
On Tue, Aug 07, 2012 at 04:47:13PM +0800, Asias He wrote:
> 1) Ramdisk device
>    With bio-based IO path, sequential read/write, random read/write
>    IOPS boost : 28%, 24%, 21%, 16%
>    Latency improvement: 32%, 17%, 21%, 16%
> 2) Fusion IO device
>    With bio-based IO path, sequential read/write, random read/write
>    IOPS boost : 11%, 11%, 13%, 10%
>    Latency improvement: 10%, 10%, 12%, 10%

Do you also have numbers for normal SAS/SATA disks?  The changelog should
have a really good explanation of why this is optional, and numbers are a
very important part of that.



Re: [PATCH V5 4/4] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

2012-08-06 Thread Christoph Hellwig
On Thu, Aug 02, 2012 at 02:43:04PM +0800, Asias He wrote:
> > Even if it has a payload waiting is highly suboptimal and it should
> > use a non-blocking sequencing like it is done in the request layer.
>
> So, for REQ_FLUSH, what we need is to send out the VIRTIO_BLK_T_FLUSH and
> not to wait.

If it's a REQ_FLUSH without data, only a VIRTIO_BLK_T_FLUSH should be sent
out; if it's a REQ_FLUSH that has data, a VIRTIO_BLK_T_FLUSH should be sent
out, but instead of waiting for it to finish the I/O completion handler
should then submit the actual write.

> We still need to wait until the actual write is finished here?
> Like,
>
> REQ_FUA is emulated by:
> 1. Send the actual write
> 2. Wait until the actual write is finished
> 3. Send VIRTIO_BLK_T_FLUSH to device
> 4. Signal the end of the write to upper layer

Remove step 2 and run step 3 from the I/O completion handler.
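
In code, a sketch of that sequencing (hypothetical completion handler
modelled on the patch's vocabulary; virtblk_bio_send_flush and
virtblk_result are names from this series):

static void virtblk_write_done(struct virtio_blk *vblk,
			       struct virtblk_req *vbr)
{
	if (vbr->req_fua) {
		/* old step 3 runs here, from the completion handler,
		 * instead of after a blocking wait (old step 2) */
		virtblk_bio_send_flush(vblk, vbr);
		return;
	}
	/* only the final completion signals the upper layer (step 4) */
	bio_endio(vbr->bio, virtblk_result(vbr));
}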



Re: [PATCH V5 4/4] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

2012-08-02 Thread Christoph Hellwig
On Thu, Aug 02, 2012 at 02:25:56PM +0800, Asias He wrote:
> We need to support both REQ_FLUSH and REQ_FUA for bio based path since
> it does not get the sequencing of REQ_FUA into REQ_FLUSH that request
> based drivers can request.
>
> REQ_FLUSH is emulated by:
> 1. Send VIRTIO_BLK_T_FLUSH to device
> 2. Wait until the flush is finished

There is no need to wait for the flush to finish if the REQ_FLUSH
request has no data payload.

Even if it has a payload waiting is highly suboptimal and it should
use a non-blocking sequencing like it is done in the request layer.

 
> REQ_FUA is emulated by:
> 1. Send the actual write
> 2. Wait until the actual write is finished
> 3. Send VIRTIO_BLK_T_FLUSH to device
> 4. Wait until the flush is finished
> 5. Signal the end of the write to upper layer

The same comment about not blocking applies here as well.



Re: [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-07-30 Thread Christoph Hellwig
On Mon, Jul 30, 2012 at 09:31:06AM +0200, Paolo Bonzini wrote:
 You only need to add REQ_FLUSH support.  The virtio-blk protocol does
 not support REQ_FUA, because there's no easy way to do it in userspace.

A bio-based driver needs to handle both REQ_FLUSH and REQ_FUA as it does
not get the sequencing of REQ_FUA into REQ_FLUSH that request based drivers
can request.  To what the REQ_FUA request gets translated is a different story.


Re: [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-07-30 Thread Christoph Hellwig
On Mon, Jul 30, 2012 at 12:43:12PM +0800, Asias He wrote:
> I think we can add REQ_FLUSH & REQ_FUA support to bio path and that
> deserves another patch.

Adding it is a requirement for merging the code.



Re: [PATCH V4 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-07-30 Thread Christoph Hellwig
On Mon, Jul 30, 2012 at 11:25:51AM +0930, Rusty Russell wrote:
> I consider this approach a half-way step.  Quick attempts on my laptop
> and I couldn't find a case where the bio path was a loss, but in theory
> if the host wasn't doing any reordering and it was a slow device, you'd
> want the guest to do so.
>
> I'm not sure if current qemu can be configured to do such a thing?


The host kernel will do the I/O scheduling for you unless you explicitly
disable it.  And we should be able to assume an administrator will only
disable it when they have a reason for it - if not they'll get worse
performance for non-virtualized workloads as well.



Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6

2012-07-18 Thread Christoph Hellwig
On Wed, Jul 18, 2012 at 08:42:21AM -0500, Anthony Liguori wrote:
 
> If you add support for a new command, you need to provide userspace
> a way to disable this command.  If you change what gets reported for
> VPD, you need to provide userspace a way to make VPD look like what
> it did in a previous version.
>
> Basically, you need to be able to make a TCM device behave 100% the
> same as it did in an older version of the kernel.
>
> This is unique to virtualization due to live migration.  If you
> migrate from a 3.6 kernel to a 3.8 kernel, you need to make sure
> that the 3.8 kernel's TCM device behaves exactly like the 3.6 kernel
> because the guest that is interacting with it does not realize that
> live migration happened.

I don't think these strict live migration rules apply to SCSI targets.

Real life storage systems get new features and different behaviour with
firmware upgrades all the time, and SCSI initiators deal with that just
fine.  I don't see any reason to be more picky just because we're
virtualized.



Re: [PATCH RESEND 0/5] Add vhost-blk support

2012-07-14 Thread Christoph Hellwig
Please send a version that does direct block I/O similar to xen-blkback
for now.  If we get proper in-kernel aio support one day you can add
back file backend support.



Re: writeback cache + h700 controller w/1gb nvcache, corruption on power loss

2012-04-24 Thread Christoph Hellwig
On Mon, Apr 16, 2012 at 09:34:41AM +0100, Stefan Hajnoczi wrote:
> On Sun, Apr 15, 2012 at 5:16 AM, Ron Edison <r...@idthq.com> wrote:
> > The server is a Dell R710 with an H700 controller with 1gb of nvcache.
> > Writeback cache is enabled on the controller. There is a mix of linux and
> > windows guests, some with qcow2 format vdisks and others with raw format
> > vdisks. Some of these guests have wb cache enabled on the vdisks and some
> > do not.
>
> -drive cache=writeback is safe when the guest flushes appropriately.
> If the guest is not sending flushes (e.g. ext3/4 barrier=0) then there
> are no guarantees.

Which is the default for ext3 on RHEL/CentOS < 6, btw.



Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO

2012-01-02 Thread Christoph Hellwig
On Mon, Jan 02, 2012 at 05:12:00PM +0100, Paolo Bonzini wrote:
> On 01/01/2012 05:45 PM, Stefan Hajnoczi wrote:
> > By the way, drivers for solid-state devices can set QUEUE_FLAG_NONROT
> > to hint that seek time optimizations may be sub-optimal.  NBD and
> > other virtual/pseudo device drivers set this flag.  Should virtio-blk
> > set it and how does it affect performance?
>
> By itself is not a good idea in general.
>
> When QEMU uses O_DIRECT, the guest should not use QUEUE_FLAG_NONROT
> unless it is active for the host disk as well.  (In doubt, as is the
> case for remote hosts accessed over NFS, I would also avoid NONROT
> and allow more coalescing).

Do we have any benchmark numbers where QUEUE_FLAG_NONROT makes a
difference?  I tried a few times, and the only constant measurable
thing was that it regressed performance when used for rotating devices
in a few benchmarks.
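
For reference, opting in is a one-liner in the driver; a sketch of the
experiment (virtio-blk does not set this today):

	/* hint to the I/O scheduler that seek penalties don't apply */
	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, vblk->disk->queue);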



Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO

2012-01-02 Thread Christoph Hellwig
On Sun, Jan 01, 2012 at 04:45:42PM +, Stefan Hajnoczi wrote:
> win.  The fact that you added batching suggests there is some benefit
> to what the request-based code path does.  So find out what's good
> about the request-based code path and how to get the best of both
> worlds.

Batching pretty much always is a winner.  The maximum bio size is small
enough that we'll frequently see multiple contiguous bios.  Because of
that the MD layer for example uses the same kind of batching.  I've tried
to make this more general by passing a bio list to ->make_request and
make the on-stack plugging work on bios, but in the timeslice I had
available for that I didn't manage to actually make it work.



Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO

2012-01-02 Thread Christoph Hellwig
On Mon, Jan 02, 2012 at 05:18:13PM +0100, Paolo Bonzini wrote:
> > I tried a few times, and the only constant measurable
> > thing was that it regressed performance when used for rotating devices
> > in a few benchmarks.
>
> Were you trying with cache=none or writeback?  For cache=none,
> that's exactly what I'd expect.  cache=writeback could be more
> interesting...

cache=none - cache=writeback isn't something people should ever use
except for read-only or extremely read-mostly workloads.



Re: [PATCH 4/6] virtio-blk: implement -make_request

2011-12-22 Thread Christoph Hellwig
On Thu, Dec 22, 2011 at 12:20:01PM +, Stefan Hajnoczi wrote:
> virtblk_make_request() checks that the queue is not plugged.  Do we
> need to do that here too?

bio based drivers don't have a queue that could be plugged.


Re: [PATCH 0/2] qemu-io tests: More fine grained control of qemu paths

2011-12-08 Thread Christoph Hellwig
Thanks a lot Lucas,

I've applied the patches.  And sorry for the delay, I'm pretty busy at the
moment.


Re: KVM for the enterprise

2011-11-19 Thread Christoph Hellwig
On Fri, Nov 18, 2011 at 11:25:38AM -0500, Michael Waite wrote:
> Hi,
> The Open Virtualization Alliance is going to be having a webinar on
> December 8th which is intended to help promote KVM as an enterprise
> class hypervisor. I see so much great engineering work going on to
> make KVM a really robust technology and we want to help tell this
> story to the world-wide audience.

Maybe it's time to create a kvm-marketing@whatever list for sneaky
marketing BS from people like you instead of spamming development lists
like this one?



Re: [PATCH] KVM: Add wrapper script around QEMU to test kernels

2011-11-08 Thread Christoph Hellwig
On Tue, Nov 08, 2011 at 04:41:40PM +0200, Avi Kivity wrote:
> On 11/06/2011 03:35 AM, Alexander Graf wrote:
> > To quickly get going, just execute the following as user:
> >
> > $ ./Documentation/run-qemu.sh -r / -a init=/bin/bash
> >
> > This will drop you into a shell on your rootfs.
>
> Doesn't work on Fedora 15.  F15's qemu-kvm doesn't have -machine or
> -virtfs.  Even qemu.git on F15 won't build virtfs since xattr.h
> detection is broken (patch posted).

Nevermind that running virtfs as a rootfs is a really dumb idea.  You
do not want to run a VM that has a rootfs that gets changed all the
time behind your back.

Running qemu -snapshot on the actual root block device is the only
safe way to reuse the host installation, although it gets a bit
complicated if people have multiple devices mounted into the namespace.



Re: [PATCH] KVM: Add wrapper script around QEMU to test kernels

2011-11-08 Thread Christoph Hellwig
On Tue, Nov 08, 2011 at 04:57:04PM +0200, Avi Kivity wrote:
> > Running qemu -snapshot on the actual root block device is the only
> > safe way to reuse the host installation, although it gets a bit
> > complicated if people have multiple devices mounted into the namespace.
>
> How is -snapshot any different?  If the host writes a block after the
> guest has been launched, but before that block was cowed, then the guest
> will see the new block.

Right, thinko - qemu's snapshots are fairly useless due to sitting
on top of the file to be modified.

> It could work with a btrfs snapshot, but not everyone uses that.

Or an LVM snapshot.  Either way, just reusing the root fs without care
is a dumb idea, and I really don't want any tool or script that
encourages such braindead behaviour in the kernel tree.


Re: [PATCH] KVM: Add wrapper script around QEMU to test kernels

2011-11-08 Thread Christoph Hellwig
On Tue, Nov 08, 2011 at 05:26:03PM +0200, Pekka Enberg wrote:
> On Tue, Nov 8, 2011 at 4:52 PM, Christoph Hellwig <h...@infradead.org> wrote:
> > Nevermind that running virtfs as a rootfs is a really dumb idea.  You
> > do not want to run a VM that has a rootfs that gets changed all the
> > time behind your back.
>
> It's rootfs binaries that are shared, not configuration. It's
> unfortunate but works OK for the single user use case it's meant for.
> It's obviously not a proper solution for the generic case. We were
> hoping that we could use something like overlayfs to hide the issue
> under the rug. Do you think that's also a really dumb thing to do?

It doesn't hide your issues.  Any kind of unioning will have massive
consistency issues (as in, it will corrupt your fs if you do stupid things)
if the underlying layer is allowed to be written to.  Thus all the
fuss about making sure the underlying fs can never be mounted writeable
in the union mount patches.



Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-04 Thread Christoph Hellwig
On Fri, Nov 04, 2011 at 10:38:21AM +0200, Pekka Enberg wrote:
> Hi Linus,
>
> Please consider pulling the latest KVM tool tree from:

There still is absolutely zero reason to throw it in the kernel tree.

Please prepare a nice standalone git repository and tarball for it.



Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-04 Thread Christoph Hellwig
On Fri, Nov 04, 2011 at 02:35:18PM +0200, Pekka Enberg wrote:
> We are reusing kernel code and headers and I am not interested in
> copying them over. Living in the kernel tree is part of the design,
> whether you like it or not.

That's pretty much a blanket argument for throwing everything into the
kernel.  What's next, throwing your jvm into the kernel because you like
some kernel helpers?

We've been through this a few times - there is no reason why a tool
using the KVM ioctls should be considered close to the kernel - all
other users get away just fine staying out of tree, and that has
helped to keep the ABI stable.

This sounds more like you should create a libkernelutil with useful
data structures we use in the kernel for userspace programs.  I'd love
to use that for a few projects, btw.



Re: [PATCH 1 of 5] virtio: document functions better

2011-11-03 Thread Christoph Hellwig
On Thu, Nov 03, 2011 at 06:12:49PM +1030, Rusty Russell wrote:
> The old documentation is left over from when we used a structure with
> strategy pointers.

Looks good,

Reviewed-by: Christoph Hellwig <h...@lst.de>


Re: [PATCH 2 of 5] virtio: rename virtqueue_add_buf_gfp to virtqueue_add_buf

2011-11-03 Thread Christoph Hellwig
On Thu, Nov 03, 2011 at 06:12:50PM +1030, Rusty Russell wrote:
> Remove wrapper functions. This makes the allocation type explicit in
> all callers; I used GFP_KERNEL where it seemed obvious, left it at
> GFP_ATOMIC otherwise.

Looks good,

Reviewed-by: Christoph Hellwig <h...@lst.de>


Re: [PATCH 3 of 5] virtio: support unlocked queue kick

2011-11-03 Thread Christoph Hellwig
On Thu, Nov 03, 2011 at 06:12:51PM +1030, Rusty Russell wrote:
> Based on patch by Christoph for virtio_blk speedup:

Please credit it to Stefan - he also sent a pointer to his original
version in reply to the previous thread.

Also shouldn't virtqueue_kick have kerneldoc comments?


I also noticed that you documented the functions both here and in
the first patch in the headers.  At least historically the kerneldoc
tools didn't parse comments at declarations, but only at the function
definitions.  Did you check these actually get picked up?


Re: [PATCH 2/5] virtio: support unlocked queue kick

2011-11-02 Thread Christoph Hellwig
On Wed, Nov 02, 2011 at 01:49:36PM +1030, Rusty Russell wrote:
> I thought it was still a WIP?

The whole series - yes.  This patch (and the serial number rewrite): no
- these are pretty much rock solid.

> Since the problem is contention on the lock inside the block layer, the
> simplest solution is to have a separate lock to protect the virtqueue.

As long as we still use a ->request_fn based driver that is not going
to buy us anything, in fact it's going to make things worse.
->request_fn based drivers always have the queue lock held over the
invocation of ->request_fn anyway, and then need it around the call
to __blk_end_request_all.  So you might minimally reduce contention
time, but skyrocket the number of lock acquisitions when separating
them without changes to the block layer.

With the ->make_request_fn based driver vblk->lock doesn't protect
anything but the virtqueue anyway, and not having to take it
over the wakeup there is a) done easily and b) neatly fits the model.
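
For reference, a sketch of the resulting submission-path call pattern
(using the split-kick primitives from this series; vblk names as in the
->make_request patch):

	spin_lock_irqsave(&vblk->lock, flags);
	virtqueue_add_buf(vblk->vq, sg, out, in, vbr);
	notify = virtqueue_kick_prepare(vblk->vq);
	spin_unlock_irqrestore(&vblk->lock, flags);

	/* the potentially expensive host notification runs unlocked */
	if (notify)
		virtqueue_notify(vblk->vq);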



Re: [PATCH 1/5] block: add bio_map_sg

2011-10-06 Thread Christoph Hellwig
On Thu, Oct 06, 2011 at 12:51:39AM +0200, Boaz Harrosh wrote:
> I have some questions.
>
> - Could we later use this bio_map_sg() to implement blk_rq_map_sg() and
>   remove some duplicated code?

I didn't even think about that, but it actually looks very possible
to factor the meat in the for each bvec loop into a common helper
used by both.

> - Don't you need to support a chained bio (bio->next != NULL)? Because
>   I did not see any looping in the last patch
>   [PATCH 5/5] virtio-blk: implement ->make_request
>   Or is it that ->make_request() is a single bio at a time?
>   If so could we benefit from both bio-chaining and sg-chaining to
>   make bigger IOs?

At this point ->make_request is a single bio interface.  I have some
WIP patches to do the onstack plugging per bio, at which point it would
change to take a list.  For this to work we'd need major changes to
all ->make_request drivers.



Re: [PATCH 5/5] virtio-blk: implement -make_request

2011-10-06 Thread Christoph Hellwig
On Thu, Oct 06, 2011 at 12:22:14PM +1030, Rusty Russell wrote:
> On Wed, 05 Oct 2011 15:54:08 -0400, Christoph Hellwig <h...@infradead.org>
> wrote:
> > Add an alternate I/O path that implements ->make_request for virtio-blk.
> > This is required for high IOPs devices which get slowed down to 1/5th of
> > the native speed by all the locking, memory allocation and other overhead
> > in the request based I/O path.
>
> Ouch.
>
> I'd be tempted to just switch across to this, though I'd be interested
> to see if the simple add_buf change I referred to before has some effect
> by itself (I doubt it).

Benchmarking this more extensively even on low-end devices is number one
on my todo list after sorting out the virtqueue race and implementing
flush/fua support.  I'd really prefer to switch over to it
unconditionally if the performance numbers allow it.

> Also, though it's overkill I'd use standard list primitives rather than
> open-coding a single linked list.

I really prefer using standard helpers, but using a doubly linked list
and increasing memory usage seems like such a waste.  Maybe I should
annoy Linus by proposing another iteration of a common single linked
list implementation :)



[PATCH 2/5] virtio: support unlocked queue kick

2011-10-05 Thread Christoph Hellwig
Split virtqueue_kick to be able to do the actual notification outside the
lock protecting the virtqueue.  This patch was originally done by
Stefan Hajnoczi, but I can't find the original one anymore and had to
recreate it from memory.  Pointers to the original or corrections for
the commit message are welcome.

Index: linux-2.6/drivers/virtio/virtio_ring.c
===
--- linux-2.6.orig/drivers/virtio/virtio_ring.c 2011-09-15 15:28:55.891347016 
+0200
+++ linux-2.6/drivers/virtio/virtio_ring.c  2011-10-03 18:45:32.492738431 
+0200
@@ -237,9 +237,11 @@ add_head:
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 
-void virtqueue_kick(struct virtqueue *_vq)
+bool virtqueue_kick_prepare(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   bool need_kick = false;
+
u16 new, old;
START_USE(vq);
/* Descriptors and available array need to be set before we expose the
@@ -253,15 +255,32 @@ void virtqueue_kick(struct virtqueue *_v
/* Need to update avail index before checking if we should notify */
virtio_mb();
 
-   if (vq->event ?
-   vring_need_event(vring_avail_event(&vq->vring), new, old) :
-   !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
-   /* Prod other side to tell it about changes. */
-   vq->notify(&vq->vq);
-
+   if (vq->event) {
+   if (vring_need_event(vring_avail_event(&vq->vring), new, old))
+   need_kick = true;
+   } else {
+   if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
+   need_kick = true;
+   }
END_USE(vq);
+   return need_kick;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
+
+void virtqueue_notify(struct virtqueue *_vq)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* Prod other side to tell it about changes. */
+   vq->notify(_vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_notify);
+
+void virtqueue_kick(struct virtqueue *vq)
+{
+   if (virtqueue_kick_prepare(vq))
+   virtqueue_notify(vq);
 }
-EXPORT_SYMBOL_GPL(virtqueue_kick);
 
 static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
 {
Index: linux-2.6/include/linux/virtio.h
===
--- linux-2.6.orig/include/linux/virtio.h   2011-09-15 15:28:57.078857804 
+0200
+++ linux-2.6/include/linux/virtio.h2011-10-03 18:41:07.309766531 +0200
@@ -86,6 +86,8 @@ static inline int virtqueue_add_buf(stru
 }
 
 void virtqueue_kick(struct virtqueue *vq);
+bool virtqueue_kick_prepare(struct virtqueue *vq);
+void virtqueue_notify(struct virtqueue *vq);
 
 void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 



[PATCH 5/5] virtio-blk: implement -make_request

2011-10-05 Thread Christoph Hellwig
Add an alternate I/O path that implements -make_request for virtio-blk.
This is required for high IOPs devices which get slowed down to 1/5th of
the native speed by all the locking, memory allocation and other overhead
in the request based I/O path.

This patch is not quite merge ready due to two issues:

 - it doesn't implement FUA and FLUSH requests yet
 - it hardcodes which I/O path to choose

Signed-off-by: Christoph Hellwig <h...@lst.de>

Index: linux-2.6/drivers/block/virtio_blk.c
===
--- linux-2.6.orig/drivers/block/virtio_blk.c   2011-10-05 10:36:42.883913334 
-0400
+++ linux-2.6/drivers/block/virtio_blk.c2011-10-05 15:29:35.591405323 
-0400
@@ -11,6 +11,8 @@
 
 #define PART_BITS 4
 
+static int use_make_request = 1;
+
 static int major, index;
 struct workqueue_struct *virtblk_wq;
 
@@ -20,6 +22,7 @@ struct virtio_blk
 
struct virtio_device *vdev;
struct virtqueue *vq;
+   wait_queue_head_t queue_wait;
 
/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -39,11 +42,13 @@ struct virtio_blk
 struct virtblk_req
 {
void *private;
+   struct virtblk_req *next;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
u8 kind;
 #define VIRTIO_BLK_REQUEST 0x00
-#define VIRTIO_BLK_INTERNAL0x01
+#define VIRTIO_BLK_BIO 0x01
+#define VIRTIO_BLK_INTERNAL0x02
u8 status;
 };
 
@@ -74,10 +79,17 @@ static void virtblk_request_done(struct
mempool_free(vbr, vblk-pool);
 }
 
+static void virtblk_bio_done(struct virtio_blk *vblk,
+   struct virtblk_req *vbr)
+{
+   bio_endio(vbr->private, virtblk_result(vbr));
+   mempool_free(vbr, vblk-pool);
+}
+
 static void blk_done(struct virtqueue *vq)
 {
	struct virtio_blk *vblk = vq->vdev->priv;
-   struct virtblk_req *vbr;
+   struct virtblk_req *vbr, *head = NULL, *tail = NULL;
unsigned int len;
unsigned long flags;
 
@@ -88,15 +100,47 @@ static void blk_done(struct virtqueue *v
virtblk_request_done(vblk, vbr);
break;
case VIRTIO_BLK_INTERNAL:
-   complete(vbr->private);
+   case VIRTIO_BLK_BIO:
+   if (head) {
+   tail->next = vbr;
+   tail = vbr;
+   } else {
+   tail = head = vbr;
+   }
break;
default:
BUG();
}
}
-   /* In case queue is stopped waiting for more buffers. */
-   blk_start_queue(vblk->disk->queue);
+
+   if (!use_make_request) {
+   /* In case queue is stopped waiting for more buffers. */
+   blk_start_queue(vblk->disk->queue);
+   }
	spin_unlock_irqrestore(&vblk->lock, flags);
+
+   wake_up(&vblk->queue_wait);
+
+   /*
+* Process completions after freeing up space in the virtqueue and
+* dropping the lock.
+*/
+   while (head) {
+   vbr = head;
+   head = head->next;
+
+   switch (vbr->kind) {
+   case VIRTIO_BLK_BIO:
+   virtblk_bio_done(vblk, vbr);
+   break;
+   case VIRTIO_BLK_INTERNAL:
+   complete(vbr-private);
+   break;
+   default:
+   BUG();
+   }
+
+   }
 }
 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -111,6 +155,7 @@ static bool do_req(struct request_queue
return false;
 
	vbr->private = req;
+   vbr->next = NULL;
	vbr->kind = VIRTIO_BLK_REQUEST;

	if (req->cmd_flags & REQ_FLUSH) {
@@ -199,6 +244,128 @@ static void do_virtblk_request(struct re
	virtqueue_kick(vblk->vq);
 }
 
+struct virtblk_plug_cb {
+   struct blk_plug_cb cb;
+   struct virtio_blk *vblk;
+};
+
+static void virtblk_unplug(struct blk_plug_cb *bcb)
+{
+   struct virtblk_plug_cb *cb =
+   container_of(bcb, struct virtblk_plug_cb, cb);
+
+   virtqueue_notify(cb->vblk->vq);
+   kfree(cb);
+}
+
+static bool virtblk_plugged(struct virtio_blk *vblk)
+{
+   struct blk_plug *plug = current->plug;
+   struct virtblk_plug_cb *cb;
+
+   if (!plug)
+   return false;
+
+   list_for_each_entry(cb, &plug->cb_list, cb.list) {
+   if (cb->cb.callback == virtblk_unplug && cb->vblk == vblk)
+   return true;
+   }
+
+   /* Not currently on the callback list */
+   cb = kmalloc(sizeof(*cb), GFP_ATOMIC);
+   if (!cb)
+   return false;
+
+   cb->vblk = vblk;
+   cb->cb.callback = virtblk_unplug;
+   list_add(&cb->cb.list, &plug->cb_list);
+   return true;
+}
+
+static void

[PATCH 3/5] virtio-blk: remove the unused list of pending requests

2011-10-05 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig <h...@lst.de>

Index: linux-2.6/drivers/block/virtio_blk.c
===
--- linux-2.6.orig/drivers/block/virtio_blk.c   2011-10-03 19:55:29.061215040 
+0200
+++ linux-2.6/drivers/block/virtio_blk.c2011-10-03 19:55:54.196412176 
+0200
@@ -24,9 +24,6 @@ struct virtio_blk
/* The disk structure for the kernel. */
struct gendisk *disk;
 
-   /* Request tracking. */
-   struct list_head reqs;
-
mempool_t *pool;
 
/* Process context for config space updates */
@@ -41,7 +38,6 @@ struct virtio_blk
 
 struct virtblk_req
 {
-   struct list_head list;
struct request *req;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
@@ -85,7 +81,6 @@ static void blk_done(struct virtqueue *v
}
 
	__blk_end_request_all(vbr->req, error);
-   list_del(&vbr->list);
	mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
@@ -170,7 +165,6 @@ static bool do_req(struct request_queue
return false;
}
 
-   list_add_tail(&vbr->list, &vblk->reqs);
return true;
 }
 
@@ -368,7 +362,6 @@ static int __devinit virtblk_probe(struc
goto out;
}
 
-   INIT_LIST_HEAD(&vblk->reqs);
	spin_lock_init(&vblk->lock);
	vblk->vdev = vdev;
	vblk->sg_elems = sg_elems;
@@ -526,9 +519,6 @@ static void __devexit virtblk_remove(str
 
	flush_work(&vblk->config_work);
 
-   /* Nothing should be pending. */
-   BUG_ON(!list_empty(&vblk->reqs));
-
/* Stop all the virtqueues. */
	vdev->config->reset(vdev);
 



[PATCH 0/5] RFC: -make_request support for virtio-blk

2011-10-05 Thread Christoph Hellwig
This patchset allows the virtio-blk driver to support much higher IOP
rates which can be driven out of modern PCI-e flash devices.  At this
point it really is just a RFC due to various issues.

The first four patches are infrastructure that could go in fairly
soon as far as I'm concerned.  Patch 5 implements the actual ->make_request
support and still has a few issues, see there for more details.  With
it I can drive my PCI-e test devices to 85-90% of the native IOPS
and bandwidth, but be warned that this is still a fairly low end setup
as far as expensive flash storage is concerned.

One big downside it has is that it currently exposes a nasty race
in the qemu virtqueue code - just running xfstests inside a guest
using the new virtio-blk driver (even on a slow device) will trigger
it and lead to a filesystem shutdown.  I've tracked it down to getting
data I/O segments overwritten with status s/g list entries, but got
lost at that point.  I can start a separate thread on it.

Besides that it is missing a few features, and we have to decide
how to select which mode to use in virtio-blk - either a module option,
sysfs attribute or something that the host communicates.  Or maybe
decide that just going with ->make_request alone is fine, even on
my cheap laptop SSD it actually is just as fast if not slightly
faster than the request based variant.

There are a few other bottlenecks in virtio that this exposes.  The
first one is the low queue length of just 128 entries in the virtio-blk
queue - to drive higher IOPs with a deep queue we absolutely need
to increase that.

Comments welcome!



[PATCH 4/5] virtio-blk: reimplement the serial attribute without using requests

2011-10-05 Thread Christoph Hellwig
If we want to do bio-based I/O in virtio-blk we have to implement reading
the serial attribute ourselves.  Do that and also prepare struct virtblk_req
for dealing with different types of requests.

Signed-off-by: Christoph Hellwig <h...@lst.de>

Index: linux-2.6/drivers/block/virtio_blk.c
===
--- linux-2.6.orig/drivers/block/virtio_blk.c   2011-10-03 20:32:12.997713070 
+0200
+++ linux-2.6/drivers/block/virtio_blk.c2011-10-03 20:37:28.836714193 
+0200
@@ -38,12 +38,42 @@ struct virtio_blk
 
 struct virtblk_req
 {
-   struct request *req;
+   void *private;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
+   u8 kind;
+#define VIRTIO_BLK_REQUEST 0x00
+#define VIRTIO_BLK_INTERNAL0x01
u8 status;
 };
 
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+   switch (vbr->status) {
+   case VIRTIO_BLK_S_OK:
+   return 0;
+   case VIRTIO_BLK_S_UNSUPP:
+   return -ENOTTY;
+   default:
+   return -EIO;
+   }
+}
+
+static void virtblk_request_done(struct virtio_blk *vblk,
+   struct virtblk_req *vbr)
+{
+   struct request *req = vbr->private;
+
+   if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+   req->resid_len = vbr->in_hdr.residual;
+   req->sense_len = vbr->in_hdr.sense_len;
+   req->errors = vbr->in_hdr.errors;
+   }
+
+   __blk_end_request_all(req, virtblk_result(vbr));
+   mempool_free(vbr, vblk->pool);
+}
+
 static void blk_done(struct virtqueue *vq)
 {
	struct virtio_blk *vblk = vq->vdev->priv;
@@ -53,35 +83,16 @@ static void blk_done(struct virtqueue *v
 
	spin_lock_irqsave(&vblk->lock, flags);
	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
-   int error;
-
-   switch (vbr->status) {
-   case VIRTIO_BLK_S_OK:
-   error = 0;
-   break;
-   case VIRTIO_BLK_S_UNSUPP:
-   error = -ENOTTY;
-   break;
-   default:
-   error = -EIO;
-   break;
-   }
-
-   switch (vbr->req->cmd_type) {
-   case REQ_TYPE_BLOCK_PC:
-   vbr->req->resid_len = vbr->in_hdr.residual;
-   vbr->req->sense_len = vbr->in_hdr.sense_len;
-   vbr->req->errors = vbr->in_hdr.errors;
+   switch (vbr->kind) {
+   case VIRTIO_BLK_REQUEST:
+   virtblk_request_done(vblk, vbr);
break;
-   case REQ_TYPE_SPECIAL:
-   vbr->req->errors = (error != 0);
+   case VIRTIO_BLK_INTERNAL:
+   complete(vbr->private);
break;
default:
-   break;
+   BUG();
}
-
-   __blk_end_request_all(vbr->req, error);
-   mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
	blk_start_queue(vblk->disk->queue);
@@ -99,28 +110,24 @@ static bool do_req(struct request_queue
/* When another request finishes we'll try again. */
return false;
 
-   vbr->req = req;
+   vbr->private = req;
+   vbr->kind = VIRTIO_BLK_REQUEST;
 
	if (req->cmd_flags & REQ_FLUSH) {
	vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
	vbr->out_hdr.sector = 0;
-   vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+   vbr->out_hdr.ioprio = req_get_ioprio(req);
} else {
	switch (req->cmd_type) {
	case REQ_TYPE_FS:
	vbr->out_hdr.type = 0;
-   vbr->out_hdr.sector = blk_rq_pos(vbr->req);
-   vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+   vbr->out_hdr.sector = blk_rq_pos(req);
+   vbr->out_hdr.ioprio = req_get_ioprio(req);
	break;
	case REQ_TYPE_BLOCK_PC:
	vbr->out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
	vbr->out_hdr.sector = 0;
-   vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
-   break;
-   case REQ_TYPE_SPECIAL:
-   vbr->out_hdr.type = VIRTIO_BLK_T_GET_ID;
-   vbr->out_hdr.sector = 0;
-   vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+   vbr->out_hdr.ioprio = req_get_ioprio(req);
break;
default:
/* We don't put anything else in the queue. */
@@ -136,13 +143,14 @@ static bool do_req(struct request_queue
 * block, and before the normal inhdr we put the sense data and the
 * inhdr

[PATCH 1/5] block: add bio_map_sg

2011-10-05 Thread Christoph Hellwig
Add a helper to map a bio to a scatterlist, modelled after blk_rq_map_sg.
This helper is useful for any driver that wants to create a scatterlist
from its ->make_request method.

Signed-off-by: Christoph Hellwig <h...@lst.de>

Index: linux-2.6/block/blk-merge.c
===
--- linux-2.6.orig/block/blk-merge.c2011-10-04 11:49:32.857014742 -0400
+++ linux-2.6/block/blk-merge.c 2011-10-04 13:37:51.305630951 -0400
@@ -199,6 +199,69 @@ new_segment:
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+/*
+ * map a bio to a scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold bio->bi_phys_segments entries
+ */
+int bio_map_sg(struct request_queue *q, struct bio *bio,
+ struct scatterlist *sglist)
+{
+   struct bio_vec *bvec, *bvprv;
+   struct scatterlist *sg;
+   int nsegs, cluster;
+   unsigned long i;
+
+   nsegs = 0;
+   cluster = blk_queue_cluster(q);
+
+   bvprv = NULL;
+   sg = NULL;
+   bio_for_each_segment(bvec, bio, i) {
+   int nbytes = bvec->bv_len;
+
+   if (bvprv && cluster) {
+   if (sg->length + nbytes > queue_max_segment_size(q))
+   goto new_segment;
+
+   if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
+   goto new_segment;
+   if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
+   goto new_segment;
+
+   sg->length += nbytes;
+   } else {
+new_segment:
+   if (!sg)
+   sg = sglist;
+   else {
+   /*
+* If the driver previously mapped a shorter
+* list, we could see a termination bit
+* prematurely unless it fully inits the sg
+* table on each mapping. We KNOW that there
+* must be more entries here or the driver
+* would be buggy, so force clear the
+* termination bit to avoid doing a full
+* sg_init_table() in drivers for each command.
+*/
+   sg->page_link &= ~0x02;
+   sg = sg_next(sg);
+   }
+
+   sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
+   nsegs++;
+   }
+   bvprv = bvec;
+   } /* segments in bio */
+
+   if (sg)
+   sg_mark_end(sg);
+
+   BUG_ON(bio->bi_phys_segments && nsegs > bio->bi_phys_segments);
+   return nsegs;
+}
+EXPORT_SYMBOL(bio_map_sg);
+
 static inline int ll_new_hw_segment(struct request_queue *q,
struct request *req,
struct bio *bio)
Index: linux-2.6/include/linux/blkdev.h
===
--- linux-2.6.orig/include/linux/blkdev.h   2011-10-04 13:37:13.216148915 
-0400
+++ linux-2.6/include/linux/blkdev.h2011-10-04 13:37:51.317613617 -0400
@@ -854,6 +854,8 @@ extern void blk_queue_flush_queueable(st
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device 
*bdev);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct 
scatterlist *);
+extern int bio_map_sg(struct request_queue *q, struct bio *bio,
+   struct scatterlist *sglist);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 



Re: [PATCH 0/5] RFC: -make_request support for virtio-blk

2011-10-05 Thread Christoph Hellwig
On Wed, Oct 05, 2011 at 04:31:16PM -0400, Vivek Goyal wrote:
> So you no longer believe that request queue overhead can be brought
> down to manageable levels for these fast devices. And instead go for
> bio based drivers and give up on merging and implement own FLUSH/FUA
> machinery.

Not in a reasonable time frame. Having a common interface is still the
grand plan, but the virtio performance issues are hurting people so
badly that we need a fix soon.



Re: [Qemu-devel] Fix refcounting in hugetlbfs quota handling

2011-08-12 Thread Christoph Hellwig
On Thu, Aug 11, 2011 at 04:40:59PM +1000, David Gibson wrote:
> Linus, please apply
>
> hugetlbfs tracks the current usage of hugepages per hugetlbfs
> mountpoint.  To correctly track this when hugepages are released, it
> must find the right hugetlbfs super_block from the struct page
> available in free_huge_page().

a superblock is not a mountpoint, it's a filesystem instance.  You can happily
have a single filesystem mounted at multiple mount points.

> However, this usage is buggy, because nothing ensures that the
> address_space is not freed before all the hugepages that belonged to
> it are.  In practice that will usually be the case, but if extra page
> references have been taken by e.g. drivers or kvm doing
> get_user_pages() then the file, inode and address space may be
> destroyed before all the pages.
>
> In addition, the quota functions use the mapping only to get the inode
> then the super_block.  However, most of the callers already have the
> inode anyway and have to get the mapping from there.
>
> This patch, therefore, stores a pointer to the inode instead of the
> address_space in the page private data for hugepages.

What's the point?  The lifetime of inode->i_mapping is exactly the
same as that of the inode, except for those few filesystems that use
one from a different inode (and then for the whole lifetime of the
inode), so I can't see how your patch will make a difference.

> More
> importantly it correctly adjusts the reference count on the inodes
> when they're added to the page private data.  This ensures that the
> inode (and therefore the super block) will not be freed before we use
> it from free_huge_page.

That seems like the real fix.  And even if you'd still do the other bits
it should be a separate patch.
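
In code, the endorsed refcounting looks roughly like this sketch (assuming
the page private data now stores the inode; not the literal patch):

	/* when the huge page becomes associated with the inode */
	set_page_private(page, (unsigned long)igrab(inode));

	/* in free_huge_page(), after the quota has been adjusted */
	iput((struct inode *)page_private(page));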



Re: [RFC] virtio: Support releasing lock during kick

2011-08-10 Thread Christoph Hellwig
Any progress on these patches?



Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-01 Thread Christoph Hellwig
On Mon, Aug 01, 2011 at 01:46:33PM +0800, Liu Yuan wrote:
> > - I focused on using vfs interfaces in the kernel, so that I can
> > use it for file-backed devices.
> > Our use-case scenario is mostly file-backed images.
>
> vhost-blk that uses Linux AIO also supports file-backed images.
> Actually, I have run Guests both on raw partition and raw file
> images.

Note that it will only perform very well for preallocated images (just
like aio=native in qemu) - if you use sparse images it will have to
call into the allocator, which blocks for metadata I/O even in aio mode.



Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Christoph Hellwig
On Fri, Jul 29, 2011 at 03:59:53PM +0800, Liu Yuan wrote:
> I noted bdrv_aio_multiwrite() does the merging job, but I am not sure

Just like I/O schedulers it's actually fairly harmful on high IOPS,
low latency devices.  I've just started doing a lot of qemu benchmarks,
and disabling that multiwrite mess alone gives fairly nice speedups.

The major issue seems to be additional memory allocations and cache
lines - a problem that actually is fairly inherent all over the qemu
code.



Re: [RFC PATCH] vhost-blk: An in-kernel accelerator for virtio-blk

2011-07-28 Thread Christoph Hellwig
On Thu, Jul 28, 2011 at 10:29:05PM +0800, Liu Yuan wrote:
> From: Liu Yuan <tailai...@taobao.com>
>
> Vhost-blk driver is an in-kernel accelerator, intercepting the
> IO requests from KVM virtio-capable guests. It is based on the
> vhost infrastructure.
>
> This is supposed to be a module over the latest kernel tree, but it
> needs some symbols from fs/aio.c and fs/eventfd.c to compile with.
> So currently, after applying the patch, you need to *recompile*
> the kernel.
>
> Usage:
> $kernel-src: make M=drivers/vhost
> $kernel-src: sudo insmod drivers/vhost/vhost_blk.ko
>
> After insmod, you'll see /dev/vhost-blk created. done!

You'll need to send the changes for existing code separately.

If you're going mostly for raw blockdevice access just calling
submit_bio will shave even more overhead off, and simplify the
code a lot.


Re: [Qemu-devel] [PATCH 6/6] scsi-disk: Check for supported commands

2011-07-26 Thread Christoph Hellwig
On Fri, Jul 22, 2011 at 04:51:17PM +0200, Hannes Reinecke wrote:
> Not every command is supported for every device type. This patch adds
> a check for rejecting unsupported commands.
>
> Signed-off-by: Hannes Reinecke <h...@suse.de>

This seems to conflict with Markus' series.  But if we want to invest
any major effort into it, we really need different dispatch tables
for different device types.  There's two sane ways to do it:

one top-level handler with a switch per device type, or tables
with a handler pointer per device type.  I'm fine with either one.

What I really don't get with this patch is the listing of all the different
SCSI device types.  It's already a mistake that we tried to handle disks
and CDROMs with the same driver, but adding even more just makes it worse.

IMHO we should simply have one file per SCSI spec, e.g. an spc.c for the
common bits, then an sbc.c for disks, and mmc.c for cdroms.  Maybe more
the day we add more emulated device types.



Re: [GIT PULL] Native Linux KVM tool for 3.1

2011-07-25 Thread Christoph Hellwig
On Mon, Jul 25, 2011 at 10:14:13AM +0200, Alexander Graf wrote:
> So instead of thinking a bit and trying to realize that there might be a
> reason people don't want all their user space in the kernel tree you go ahead
> and start your own crusade of creating a new user space. Great. That's how I
> always hoped Linux would be :(.

It's not Linux in general yet, it's mostly a crusade of a few with a
political agenda.

> > So i wanted to have a lightweight tool that allows me to test KVM and
> > tools/kvm/ does that very nicely: i type './kvm run' and i can test a
> > native bzImage (which has some virtualization options enabled as
> > well) on the _host_ distro i am running, booting to a text shell
> > prompt.
>
> I do that all the time.
>
>   $ qemu-kvm -nographic -kernel arch/x86/boot/bzImage -append console=ttyS0

Same here.  I can't be bothered with all the stupid distro booting crap.

> > I can do that without downloading any (inevitably outdated)
> > virtualization images or maintaining my own ones. Maintaining host
> > userspace is more than enough for me.
>
> Who would need images? I usually only run -kernel and -initrd directly to
> test out things. Or if I really want to boot into a real system I do
> -snapshot /dev/sda.

Indeed.

 
> > So, since we already have the lguest tool in the kernel tree, why
> > cannot we have the much more capable tools/kvm/ in the tree?
>
> Lguest is in Documentation/ for a reason. It's not meant as a user space tool
> that you take as-is and use. It's meant for documenting how lguest works in
> general. I admit though, that that's also the reason people don't use it :).

I'd also say it's rather misplaced there, and at least in the storage
area that I know best it didn't keep it from totally misunderstanding
kernel concepts and taking them into protocols (e.g. virtio barrier
support).  That for me is a reason why you don't want to couple things
too tightly; at the least you'll have to document and/or explain the
protocol to someone.

And another argument: calling toyvisor2 "kvm" is a really bad idea.  The
kvm binary has been used for the kvm-patched qemu binary for quite a
while in various distros, so re-using that name will cause utter
confusion.

I'm happy that you guys do another independent userspace for kvm, but
please:

  a) give it a useful name
  b) just develop it where it belongs, your own little git repository
 somewhere


Re: [GIT PULL] Native Linux KVM tool for 3.1

2011-07-25 Thread Christoph Hellwig
On Mon, Jul 25, 2011 at 01:08:10PM +0200, Ingo Molnar wrote:
> Fact is that developing ABIs within an integrated project is
> *amazingly* powerful. You should try it one day, instead of
> criticizing it :-)

I've been doing this long before you declared it the rosetta stone.  Some
of the worst ABIs I know come from that kind of development, e.g. all
the ioctls we have in xfs, inherited from the tightly integrated IRIX
development cycle.



Re: [GIT PULL] Native Linux KVM tool for 3.1

2011-07-25 Thread Christoph Hellwig
On Mon, Jul 25, 2011 at 01:34:25PM +0200, Olivier Galibert wrote:
> You need someone with taste in the loop.  But if you do, evolved is
> always better than designed before you actually know what you need.
>
> As I'm sure you perfectly know, for the matter.

Neither is actually helpful.  You really want a reference implementation
on both sides of an ABI, _and_ documentation.  And yes, usually you'll
need a few iterations over it.  In theory you can do that in a tightly
integrated environment as well, but in practice it simply boils down
to committing the bloody thing.

I'm also not sure why we even bother to focus on this side-line
discussion.  It's not like the kvm (as in the kernel kvm module)
developers have written the kvm tools.  It's just another userspace for
the kvm module (the fifth if I count correctly), and so far the
reference and often only implementation of any new kvm module feature
is for qemu-kvm.  So no matter where kvm tools lives, if you guys one
day actually start doing major kvm core features they will still evolve
through discussion with the main consumer of those interfaces.


Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Tue, Jun 14, 2011 at 05:30:24PM +0200, Hannes Reinecke wrote:
 Which is exactly the problem I was referring to.
 When using more than one channel the request ordering
 _as seen by the initiator_ has to be preserved.
 
 This is quite hard to do from a device's perspective;
 it might be able to process the requests _in the order_ they've
 arrived, but it won't be able to figure out the latency of each
 request, ie how it'll take the request to be delivered to the
 initiator.
 
 What we need to do here is to ensure that virtio will deliver
 the requests in-order across all virtqueues. Not sure whether it
 does this already.

This only matters for ordered tags, or implicit or explicit HEAD OF
QUEUE tags.  For everything else there's no ordering requirement.
Given that ordered tags don't matter in practice and we don't have
to support them this just leaves HEAD OF QUEUE.  I suspect the
HEAD OF QUEUE semantics need to be implemented by draining all
queues underneath, which should be okay given that it's
usually used for slow-path commands.
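
A sketch of that draining approach, with entirely made-up names:

	/* hypothetical: drain all queues before a HEAD OF QUEUE command */
	static void submit_head_of_queue(struct vscsi_host *h,
					 struct vscsi_cmd *cmd)
	{
		int i;

		for (i = 0; i < h->num_queues; i++)
			vscsi_wait_empty(h->vq[i]);	/* in-flight I/O done */
		vscsi_submit(h->vq[0], cmd);		/* ordering now trivial */
	}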



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Sun, Jun 12, 2011 at 10:51:41AM +0300, Michael S. Tsirkin wrote:
 For example, if the driver is crazy enough to put
 all write requests on one queue and all barriers
 on another one, how is the device supposed to ensure
 ordering?

There is no such thing as barriers in SCSI.  The thing that comes
closest is ordered tags, which neither Linux nor any mainstream OS
uses, and which we don't have to (and generally don't want to)
implement.



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Wed, Jun 29, 2011 at 10:23:26AM +0200, Paolo Bonzini wrote:
 I agree here, in fact I misread Hannes's comment as if a driver
 uses more than one queue it is responsibility of the driver to
 ensure strict request ordering.  If you send requests to different
 queues, you know that those requests are independent.  I don't think
 anything else is feasible in the virtio framework.

That doesn't really fit very well with the SAM model.  If we want
to use multiple queues for a single LUN it has to be transparent to
the SCSI command stream.  Then again I don't quite see the use for
that anyway.



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Wed, Jun 29, 2011 at 10:39:42AM +0100, Stefan Hajnoczi wrote:
 I think we're missing a level of addressing.  We need the ability to
 talk to multiple target ports in order for list target ports to make
 sense.  Right now there is one implicit target that handles all
 commands.  That means there is one fixed I_T Nexus.
 
 If we introduce list target ports we also need a way to say This
 CDB is destined for target port #0.  Then it is possible to enumerate
 target ports and address targets independently of the LUN field in the
 CDB.
 
 I'm pretty sure this is also how SAS and other transports work.  In
 their framing they include the target port.

Yes, exactly.  Hierarchical LUNs are a nasty fringe feature that we should
avoid as much as possible, that is for everything but IBM vSCSI, which is
braindead enough to force them.

> The question is whether we really need to support multiple targets on
> a virtio-scsi adapter or not.  If you are selectively mapping LUNs
> that the guest may access, then multiple targets are not necessary.
> If we want to do pass-through of the entire SCSI bus then we need
> multiple targets but I'm not sure if there are other challenges like
> dependencies on the transport (Fibre Channel, SAS, etc) which make it
> impossible to pass through bus-level access?

I don't think bus-level pass-through is either easily possible or
desirable.  What multiple targets are useful for is allowing more
virtual disks than we have virtual PCI slots.  We could do this by
supporting multiple LUNs, but given that many SCSI resources are
target-based, doing multiple targets most likely is the more scalable
and more logical variant.  E.g. we could much more easily have one
virtqueue per target than per LUN.
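
E.g. with a hypothetical request header carrying an explicit target
index (nothing like this is in the draft spec; it just illustrates the
addressing):

	struct vscsi_cmd_hdr {
		uint16_t target;	/* target port, from "list target ports" */
		uint64_t lun;		/* LUN below that target */
		uint8_t  cdb[32];
	};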



Re: virtio scsi host draft specification, v3

2011-06-29 Thread Christoph Hellwig
On Wed, Jun 29, 2011 at 12:23:38PM +0200, Hannes Reinecke wrote:
 The general idea here is that we can support NPIV.
 With NPIV we'll have several scsi_hosts, each of which is assigned a
 different set of LUNs by the array.
 With virtio we need to able to react on LUN remapping on the array
 side, ie we need to be able to issue a 'REPORT LUNS' command and
 add/remove LUNs on the fly. This means we have to expose the
 scsi_host in some way via virtio.
 
 This is impossible with a one-to-one mapping between targets and
 LUNs. The actual bus-level pass-through will be just on the SCSI
 layer, ie 'REPORT LUNS' should be possible. If and how we do a LUN
 remapping internally on the host is a totally different matter.
 Same goes for the transport details; I doubt we will expose all the
 dingy details of the various transports, but rather restrict
 ourselves to an abstract transport.

If we want to support traditional NPIV that's what we have to do.
I still hope we'll see broad SR-IOV support for FC adapters soon,
which would ease all this greatly.



Re: [RFC] virtio: Support releasing lock during kick

2011-06-19 Thread Christoph Hellwig
On Sun, Jun 19, 2011 at 10:48:41AM +0300, Michael S. Tsirkin wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4ce953f..a8672ec 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -433,6 +433,8 @@ void blk_run_queue(struct request_queue *q)
>  	spin_lock_irqsave(q->queue_lock, flags);
>  	__blk_run_queue(q);
>  	spin_unlock_irqrestore(q->queue_lock, flags);
> +	if (q->request_done)
> +		q->request_done(q);

We have quite a few cases where __blk_run_queue is called directly, and
one more (although not applicable to virtio-blk) that calls ->request_fn
directly.

I think Stefan's way is the way to go for now, releasing and reacquiring
the queue lock once in ->request_fn is much less than the common IDE and
SCSI setups do today.

Eventually ->queue_lock should be split from the driver-internal lock,
and we could do a more efficient calling convention than the
one-per-request blk_peek_request.  I've started looking into that, but it's
going to take a while.
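
Roughly what releasing the lock around the notify looks like (a sketch
along the lines of Stefan's approach, not the actual patch):

	static void do_vblk_request(struct request_queue *q)
	{
		struct virtio_blk *vblk = q->queuedata;
		struct request *req;
		bool issued = false;

		/* called with q->queue_lock held */
		while ((req = blk_peek_request(q)) != NULL) {
			blk_start_request(req);
			/* ... add req to the virtqueue ... */
			issued = true;
		}

		if (issued) {
			spin_unlock_irq(q->queue_lock);
			virtqueue_kick(vblk->vq);	/* costly notify, lockless */
			spin_lock_irq(q->queue_lock);
		}
	}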



Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-16 Thread Christoph Hellwig
On Thu, Jun 16, 2011 at 09:21:03AM +0300, Pekka Enberg wrote:
> And btw, we use sync_file_range()

Which doesn't help you at all.  sync_file_range is just a hint for VM
writeback, but never commits filesystem metadata nor the physical
disk's write cache.  In short it's a completely dangerous interface, and
that is pretty well documented in the man page.



Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-16 Thread Christoph Hellwig
On Thu, Jun 16, 2011 at 12:34:04PM +0300, Pekka Enberg wrote:
 Hi Christoph,
 
 On Thu, Jun 16, 2011 at 09:21:03AM +0300, Pekka Enberg wrote:
  And btw, we use sync_file_range()
 
 On Thu, Jun 16, 2011 at 12:24 PM, Christoph Hellwig h...@infradead.org 
 wrote:
  Which doesn't help you at all. ?sync_file_range is just a hint for VM
  writeback, but never commits filesystem metadata nor the physical
  disk's write cache. ?In short it's a completely dangerous interface, and
  that is pretty well documented in the man page.
 
 Doh - I didn't read it carefully enough and got hung up with:
 
 Therefore, unless the application is strictly performing overwrites of
 already-instantiated disk blocks, there are no guarantees that the data 
 will
 be available after a crash.
 
 without noticing that it obviously doesn't work with filesystems like
 btrfs that do copy-on-write.

You also missed:

  "This system call does not flush disk write caches and thus does not
   provide any data integrity on systems with volatile disk write
   caches."

so it's not safe if you either have a cache, or are using btrfs, or
are using a sparse image, or are using an image preallocated using
fallocate/posix_fallocate.

> What's the right thing to do here? Is fdatasync() sufficient?

Yes.
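
I.e. the minimal safe pattern for an image file is (sketch):

	#include <unistd.h>

	/* pwrite + fdatasync commits the data, any metadata needed to
	 * find it again, and the physical disk's write cache */
	if (pwrite(fd, buf, len, offset) != (ssize_t)len)
		return -1;	/* short write or error */
	if (fdatasync(fd) < 0)
		return -1;	/* flush failed, data not stable */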



Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-16 Thread Christoph Hellwig
On Thu, Jun 16, 2011 at 12:57:36PM +0300, Pekka Enberg wrote:
> Uh-oh. Someone needs to apply this patch to sync_file_range():

There actually are a few cases where using it makes sense.  It's just
the minority.  



Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-16 Thread Christoph Hellwig
On Thu, Jun 16, 2011 at 01:22:30PM +0200, Ingo Molnar wrote:
> Such as? I don't think apps can actually know whether disk blocks
> have been 'instantiated' by a particular filesystem or not, so the
> manpage:

In general they can't.  The only good use case for sync_file_range
is to paper over^H^H^H^H^H^H^H^H^Hcontrol write back behaviour.
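
For the record, the writeback-control use looks roughly like this (the
flags are the real ones from the man page; the offsets are whatever
range was just dirtied):

	/* start asynchronous writeout of a dirty range */
	sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);

	/* later: throttle against that writeout; still no metadata or
	 * disk cache flush, so this is pacing, not integrity */
	sync_file_range(fd, offset, nbytes,
			SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);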



Re: [ANNOUNCE] Native Linux KVM tool v2

2011-06-16 Thread Christoph Hellwig
On Thu, Jun 16, 2011 at 01:40:45PM +0200, Ingo Molnar wrote:
> Filesystems that cannot guarantee that should map their
> sync_file_range() implementation to fdatasync() or so, right?

Filesystems aren't even told about sync_file_range, it's purely a VM
thing, which is the root of the problem.

In-kernel we have all the infrastructure for a real ranged
fsync/fdatasync, and once we get a killer user for it we can trivially
export it at the syscall level.  I don't think mapping sync_file_range
with its weird set of flags and confusing behaviour to it is a good
idea, though.



Re: [Qemu-devel] [PATCH] qemu-img: Add cache command line option

2011-06-16 Thread Christoph Hellwig
On Wed, Jun 15, 2011 at 09:46:10AM -0400, Federico Simoncelli wrote:
 qemu-img currently writes disk images using writeback and filling
 up the cache buffers which are then flushed by the kernel preventing
 other processes from accessing the storage.
 This is particularly bad in cluster environments where time-based
 algorithms might be in place and accessing the storage within
 certain timeouts is critical.
 This patch adds the option to choose a cache method when writing
 disk images.

Allowing to choose the mode is of course fine, but what about also
choosing a good default?  writethrough doesn't really make any sense
for qemu-img, given that we can trivially flush the cache at the end
of the operations.  I'd also say that using the buffer cache doesn't
make sense either, as there is little point in caching these operations.
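
In other words a sensible default would behave like this sketch (not
qemu-img's actual code):

	/* write without polluting the host page cache, flush once at
	 * the end; O_DIRECT requires aligned buffers */
	int fd = open(path, O_RDWR | O_DIRECT);

	/* ... convert / copy all image data ... */

	fdatasync(fd);		/* one cache flush when the operation is done */
	close(fd);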



Re: [PATCH 09/10] qemu-kvm: Use upstream kvm_cpu_exec

2011-05-20 Thread Christoph Hellwig
On Fri, May 20, 2011 at 07:12:39PM +0200, Jan Kiszka wrote:
> Upstream's and qemu-kvm's kvm_cpu_exec are not logically equivalent so

s/not/now/?



Re: [PATCH 13/13] kvm/powerpc: Allow book3s_hv guests to use SMT processor modes

2011-05-11 Thread Christoph Hellwig
On Wed, May 11, 2011 at 08:46:56PM +1000, Paul Mackerras wrote:
> arch/powerpc/sysdev/xics/icp-native.c.

What kernel tree do I need to actually have that file?



Re: [Qemu-devel] [RFC][PATCH] Import Linux headers for KVM and vhost

2011-05-03 Thread Christoph Hellwig
On Tue, May 03, 2011 at 04:05:37PM +0200, Jan Kiszka wrote:
> This helps reducing our build-time checks for feature support in the
> available Linux kernel headers. And it helps users that do not have
> sufficiently recent headers installed on their build machine.
>
> Header updates are triggered via 'make update-linux-headers', optionally
> specifying a kernel directory as source via 'LINUX='.

Why not make it a shell script?  It's not like it's part of the build
system.  I also think the source tree argument should be mandatory instead
of grabbing random headers.



Re: [PATCH v3 9/9] qemu-kvm: hpet: Add MSI support for in-kernel irqchip mode

2011-04-26 Thread Christoph Hellwig
On Tue, Apr 26, 2011 at 04:01:00PM +0200, Jan Kiszka wrote:
> +static bool modifying_bit(uint64_t old, uint64_t new, uint64_t mask)
> +{
> +    return (old ^ new) & mask;
> +}
> +

A more usual name would be toggle_bit.  But you're passing in a mask
to be modified, so it would be more a toggle_bits or toggle_mask.



Re: [PATCH v3 9/9] qemu-kvm: hpet: Add MSI support for in-kernel irqchip mode

2011-04-26 Thread Christoph Hellwig
On Tue, Apr 26, 2011 at 07:06:34PM +0200, Jan Kiszka wrote:
> On 2011-04-26 18:06, Christoph Hellwig wrote:
> > On Tue, Apr 26, 2011 at 04:01:00PM +0200, Jan Kiszka wrote:
> > > +static bool modifying_bit(uint64_t old, uint64_t new, uint64_t mask)
> > > +{
> > > +    return (old ^ new) & mask;
> > > +}
> > > +
> >
> > A more usual name would be toggle_bit.  But you're passing in a mask
> > to be modified, so it would be more a toggle_bits or toggle_mask.
> >
>
> This function is checking for a change. But I've no problem renaming it
> to toggling_bit or so if that's preferred.

Err, you're right, I read it as doing a ^=.  So keeping your naming
scheme sounds fine.



Re: [Qemu-devel] [qemu-iotests][PATCH] Update rbd support

2011-04-21 Thread Christoph Hellwig
On Tue, Apr 12, 2011 at 10:42:00PM -0700, Josh Durgin wrote:
> > I suspect we only support the weird writing past size for the
> > file protocol, so we should only run the test for it.
> >
> > Or does sheepdog do anything special about it?
>
> Sheepdog supports it by truncating to the right size if a write
> would be past the end. I'm not sure if other protocols support
> it.

I've changed 016 to require the file or sheepdog protocols, and then
applied the rest of your patch.  Thanks a lot!


Re: [PATCH] kvm tool: add QCOW verions 1 read/write support

2011-04-14 Thread Christoph Hellwig
On Wed, Apr 13, 2011 at 08:01:58PM +0100, Prasad Joshi wrote:
> The patch only implements the basic read write support for QCOW version 1
> images. Many of the QCOW features are not implemented, for example

What's the point?  Qcow1 has been deprecated for a long time.



Re: [Qemu-devel] [qemu-iotests][PATCH] Update rbd support

2011-04-12 Thread Christoph Hellwig
> @@ -43,6 +43,10 @@ _supported_fmt raw
>  _supported_proto generic
>  _supported_os Linux
>
> +# rbd images are not growable
> +if [ "$IMGPROTO" = "rbd" ]; then
> +    _notrun "image protocol $IMGPROTO does not support growable images"
> +fi

I suspect we only support the weird writing past size for the
file protocol, so we should only run the test for it.

Or does sheepdog do anything special about it?



Re: [PATCH] kvm tools: Use mmap for working with disk image V2

2011-04-11 Thread Christoph Hellwig
How do you plan to handle I/O errors or ENOSPC conditions?  Note that
shared writeable mappings are by far the feature in the VM/FS code
that is most error prone, including the impossibility of doing sensible
error handling.

The version that accidentally used MAP_PRIVATE actually makes a lot of
sense for an equivalent of qemu's snapshot mode where the image is
readonly and changes are kept private as long as the amount of modified
blocks is small enough to not kill the host VM, but using shared
writeable mappings just seems dangerous.
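
The snapshot-mode variant would be as simple as (sketch):

	/* private mapping: guest writes become anonymous copies and
	 * never reach the (read-only) image file */
	void *img = mmap(NULL, image_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, fd, 0);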



Re: KVM call minutes for Feb 1

2011-02-01 Thread Christoph Hellwig
On Tue, Feb 01, 2011 at 05:36:13PM +0100, Jan Kiszka wrote:
> kvm_cpu_exec/kvm_run, and start wondering "What needs to be done to
> upstream so that qemu-kvm could use that implementation?". If they
> differ, the reasons need to be understood and patched away, either by
> fixing/enhancing upstream or simplifying qemu-kvm. Once the upstream
> changes are merged back, a qemu-kvm patch is posted to switch to that
> version.

While I'm not an expert in that area I really like your approach.  I'd
really prefer to let you finish up all the major work that way before
starting massive revamping like the glib main loop.  Resolving the
qemu/qemu-kvm schism surely is more important for the overall project
than rewriting existing functionality to look a little nicer.



Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify

2010-11-11 Thread Christoph Hellwig
On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote:
> Some virtio devices are known to have guest drivers which expect a notify
> to be processed synchronously and spin waiting for completion.  Only
> enable ioeventfd for virtio-blk and virtio-net for now.

Who guarantees that less common virtio-blk and virtio-net guest drivers
for non-Linux OSes are fine with it?  Maybe you should add a feature flag
that the guest has to ACK to enable it.
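
E.g. with a hypothetical feature bit that the guest would have to ACK
(bit number and helper names are made up, just to illustrate the idea):

	#define VIRTIO_F_ASYNC_NOTIFY	30	/* made-up bit number */

	/* made-up helper: only take the ioeventfd path if the guest ACKed */
	if (guest_features & (1u << VIRTIO_F_ASYNC_NOTIFY))
		virtio_enable_ioeventfd(vdev);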



Re: 32-bit qemu on current x86-64 kernel segfauls very early

2010-11-02 Thread Christoph Hellwig
On Sun, Oct 31, 2010 at 09:06:29AM -0400, Christoph Hellwig wrote:
> With Linus' git tree from today I can't boot qemu when using kvm.  It
> seems to do fine, just glacially slow without -enable-kvm.  The
> simplest command line that fails is:
>
>   /opt/qemu/bin/qemu-system-x86_64 -enable-kvm

This issue was caused by commit 9581d442b9058d3699b4be568b6e5eae38a41493

KVM: Fix fs/gs reload oops with invalid ldt



Re: 32-bit qemu on current x86-64 kernel segfauls very early

2010-11-02 Thread Christoph Hellwig
On Tue, Nov 02, 2010 at 11:59:48AM -0400, Avi Kivity wrote:
> >  KVM: Fix fs/gs reload oops with invalid ldt
>
>
> Interesting, I guess we corrupt %fs on x86_64.
>
> Intel or AMD?

Intel:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Core(TM)2 Duo CPU T9600  @ 2.80GHz
stepping: 6
cpu MHz : 2133.000
cache size  : 6144 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm dts tpr_shadow vnmi 
flexpriority
bogomips: 5588.00
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:



32-bit qemu on current x86-64 kernel segfauls very early

2010-10-31 Thread Christoph Hellwig
With Linus' git tree from today I can't boot qemu when using kvm.  It
seems to do fine, just glacially slow without -enable-kvm.  The
simplest command line that fails is:

/opt/qemu/bin/qemu-system-x86_64 -enable-kvm

I tried to get a backtrace from gdb, but it looks like:

(gdb) bt
#0  0x0806c491 in ?? ()
#1  0x0806cc38 in ?? ()
#2  0x0811be10 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)


Re: virtio-blk XFS corruption

2010-09-25 Thread Christoph Hellwig
FYI, qemu 0.12.2 is missing:

block: fix sector comparism in multiwrite_req_compare

which in the past was very good at triggering XFS guest corruption.
Please try with the patch applied or even better latest qemu from git.



Re: virtio-blk XFS corruption

2010-09-25 Thread Christoph Hellwig
On Sat, Sep 25, 2010 at 05:40:34PM +0200, Peter Lieven wrote:
 
> Am 25.09.2010 um 17:37 schrieb Christoph Hellwig:
>
> > FYI, qemu 0.12.2 is missing:
>
> you mean 0.12.4 not 0.12.2, don't you?

Yes, sorry. (but 0.12.2 is of course missing it, too..)

> > which in the past was very good at triggering XFS guest corruption.
> > Please try with the patch applied or even better latest qemu from git.
>
> i'm just trying with 0.12.5.
>
> i'm not so familiar with git. is there a command to pull only patches
> that are marked as stable and will be in the next official release?

All the qemu stable releases are tagged and you can check them out with

    git checkout v0.12.5

but that's not the main git HEAD, which would also be interesting.



Re: [PATCH] virtio-blk: put request that was created to retrieve the device id

2010-09-21 Thread Christoph Hellwig
On Fri, Sep 17, 2010 at 09:58:48AM -0500, Ryan Harper wrote:
> Since __bio_map_kern() sets up bio->bi_end_io = bio_map_kern_endio
> (which does a bio_put(bio)) doesn't that ensure we don't leak?

Indeed, that should take care of it.



Re: [PATCH] virtio-blk: put request that was created to retrieve the device id

2010-09-09 Thread Christoph Hellwig
On Thu, Sep 09, 2010 at 05:00:42PM -0400, Mike Snitzer wrote:
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 1260628..831e75c 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -199,6 +199,7 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
>  	struct virtio_blk *vblk = disk->private_data;
>  	struct request *req;
>  	struct bio *bio;
> +	int err;
>
>  	bio = bio_map_kern(vblk->disk->queue, id_str, VIRTIO_BLK_ID_BYTES,
>  			   GFP_KERNEL);
> @@ -212,7 +213,10 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
>  	}
>
>  	req->cmd_type = REQ_TYPE_SPECIAL;
> -	return blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
> +	err = blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
> +	blk_put_request(req);

This looks correct as far as the request is concerned, but we're still
leaking the bio.



Re: [PATCH] block: Make BSG detection more sane in hdev_open()

2010-08-22 Thread Christoph Hellwig
On Sat, Aug 21, 2010 at 04:01:15PM -0700, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger <n...@linux-iscsi.org>
>
> Greetings hch, tomo and Co,

What tree is this against?  I can't see any specific BSG support in qemu.

Even more, I think all of this is in the wrong place.  The only reason
SG_IO support was shoe-horned into raw-posix is that we only had -drive
and no way to specify other devices.  We now have -device, and a scsi
generic device should be just that without an attached driver, so that
we can get rid of all the special casing for sd devices.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Mon, Aug 16, 2010 at 03:34:12PM -0500, Anthony Liguori wrote:
> On 08/16/2010 01:42 PM, Christoph Hellwig wrote:
> > On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote:
> > > > Also, ext4 is _very_ slow on O_SYNC writes (which is
> > > > used in kvm with default cache).
> > > Yeah, we probably need to switch to sync_file_range() to avoid the
> > > journal commit on every write.
> >
> > No, we don't.  sync_file_range does not actually provide any data
> > integrity.
>
> What do you mean by data integrity?

sync_file_range only does pagecache-level writeout of the file data.
It never calls into the actual filesystem.  That means any block
allocations (for filling holes / converting preallocated space in normal
filesystems, or every write in COW-based filesystems like qcow2) never
get flushed to disk, and even more importantly the disk write cache is
never flushed.

In short it's completely worthless for any real filesystem.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 12:23:01PM +0300, Avi Kivity wrote:
>  On 08/17/2010 12:07 PM, Christoph Hellwig wrote:
> >
> > In short it's completely worthless for any real filesystem.
> >
>
> The documentation should be updated then.  It suggests that it is
> usable for data integrity.

The manpage has a warning section documenting what I said above since
I added it in January.


Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 07:56:04AM -0500, Anthony Liguori wrote:
> But assuming that you had a preallocated disk image, it would
> effectively flush the page cache so it sounds like the only real
> issue is sparse and growable files.

For preallocated as in using fallocate() we are still converting unwritten
to regular extents and do have metadata updates.  For preallocated as
in writing zeroes into the whole image earlier we do indeed only
care about the data, and will not have metadata for most filesystems.
That still leaves COW-based filesystems that need to allocate new blocks
on every write, and from my reading NFS also needs the ->fsync callout
to actually commit unstable data to disk.

> > and even more importantly the disk write cache is
> > never flushed.
>
> The point is that we don't want to flush the disk write cache.  The
> intention of writethrough is not to make the disk cache writethrough
> but to treat the host's cache as writethrough.

We need to make sure data is not in the disk write cache if we want to
provide data integrity.  It has nothing to do with the qemu caching
mode - for data=writeback or none it's committed as part of the fdatasync
call, and for data=writethrough it's committed as part of the O_SYNC
write.  Note that both these paths end up calling the filesystem's ->fsync
method, which is what's required to make writes stable.  That's exactly
what is missing in sync_file_range, and that's why that API is not
useful at all for data integrity operations.  It's also what makes
fsync slow on extN - but the fix to that is not to not provide data
integrity but rather to make fsync fast.  There are various other
filesystems that can already do it, and if you insist on using those
that are slow for this operation you'll have to suffer until that
issue is fixed for them.
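
As a sketch, the mapping described above (the cache-mode names are made
up, the flags and flush calls are the real mechanism):

	int flags = O_RDWR;

	if (cache_mode == CACHE_NONE)
		flags |= O_DIRECT;	/* bypass the host page cache */
	else if (cache_mode == CACHE_WRITETHROUGH)
		flags |= O_DSYNC;	/* every write goes through ->fsync */
	int fd = open(path, flags);

	/* for cache=none/writeback, a guest cache flush becomes: */
	fdatasync(fd);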



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 09:20:37AM -0500, Anthony Liguori wrote:
> On 08/17/2010 08:07 AM, Christoph Hellwig wrote:
> > > The point is that we don't want to flush the disk write cache.  The
> > > intention of writethrough is not to make the disk cache writethrough
> > > but to treat the host's cache as writethrough.
> >
> > We need to make sure data is not in the disk write cache if we want to
> > provide data integrity.
>
> When the guest explicitly flushes the emulated disk's write cache.
> Not on every single write completion.

That depends on the cache= mode.  For cache=none and cache=writeback
we present a write-back cache to the guest, and the guest does explicit
cache flushes.  For cache=writethrough we present a writethrough cache
to the guest, and we need to make sure data actually has hit the disk
before returning I/O completion to the guest.

> > It has nothing to do with the qemu caching
> > mode - for data=writeback or none it's committed as part of the fdatasync
> > call, and for data=writethrough it's committed as part of the O_SYNC
> > write.  Note that both these paths end up calling the filesystem's ->fsync
> > method which is what's required to make writes stable.  That's exactly
> > what is missing in sync_file_range, and that's why that API is not
> > useful at all for data integrity operations.
>
> For normal writes from a guest, we don't need to follow the write
> with an fsync().  We should only need to issue an fsync() given an
> explicit flush from the guest.

Define "normal writes".  For cache=none and cache=writeback we don't
have to, and instead do explicit fsync()/fdatasync() calls when we get
a cache flush from the guest.  For data=writethrough we
guarantee data has made it to disk, and we implement this using
O_DSYNC/O_SYNC when opening the file.  That tells the operating system
to not return until data has hit the disk.  For Linux this is
internally implemented using a range-fsync/fdatasync after the actual
write.

> fsync() being slow is orthogonal to my point.  I don't see why we
> need to do an fsync() on *every* write.  It should only be necessary
> when a guest injects an actual barrier.

See above.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 09:39:15AM -0500, Anthony Liguori wrote:
> The type of cache we present to the guest only should relate to how
> the hypervisor caches the storage.  It should be independent of how
> data is cached by the disk.

It is.

> There can be many levels of caching in a storage hierarchy and each
> hierarchy cached independently of the next level.
>
> If the user has a disk with a writeback cache, if we expose a
> writethrough cache to the guest, it's not our responsibility to make
> sure that we break through the writeback cache on the disk.

The user doesn't know or have to care about the caching.  The
user uses O_SYNC/fsync to say it wants data on disk, and it's the
operating system's job to make that happen.  The situation with qemu
is the same - if we tell the guest that we do not have a volatile write
cache that needs explicit management the guest can rely on the fact
that it does not have to do manual cache management.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 09:44:49AM -0500, Anthony Liguori wrote:
> I think the real issue is we're mixing host configuration with guest
> visible state.

The last time I proposed to decouple the two, you and Avi were heavily
opposed to it...

> With O_SYNC, we're causing cache=writethrough to do writethrough
> through two layers of the storage hierarchy.  I don't think that's
> necessary or desirable though.

It's absolutely necessary if we tell the guest that we do not have
a volatile write cache.  Which is the only good reason to use
data=writethrough anyway - except for dealing with old guests that
can't handle a volatile write cache it's an absolutely stupid mode of
operation.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 09:54:07AM -0500, Anthony Liguori wrote:
> This is simply unrealistic.  O_SYNC might force data to be on a
> platter when using a directly attached disk but many NAS's actually
> do writeback caching and rely on having a UPS to preserve data
> integrity.  There's really no way in the general case to ensure that
> data is actually on a platter once you've involved a complex storage
> setup or you assume FUA

Yes, there is.  If you have an array that has battery backup it handles
this internally.  The normal case is to not set the WCE bit in the
mode page, which tells the operating system to never send
SYNCHRONIZE_CACHE commands.  I have one array that sets the WCE bit
nevertheless, but it also doesn't flush its non-volatile cache in
SYNCHRONIZE_CACHE, but rather implements it as an effective no-op.

> Let me put it another way.  If an admin knows the disks on a machine
> have battery backed cache, he's likely to leave writeback caching
> enabled.
>
> We are currently giving the admin two choices with QEMU, either
> ignore the fact that the disk is battery backed and do write through
> caching of the disk or do writeback caching in the host which

Again, this is not qemu's business at all.  Qemu is no different from
any other application requiring data integrity.  If that admin really
thinks he needs to override the storage-provided settings he can
mount the filesystem using -o nobarrier and we will not send cache
flushes.  I would in general recommend against this, as an external
UPS still has lots of failure modes that this doesn't account for.
Arrays with internal non-volatile memory already do the right thing
anyway.



Re: JFYI: ext4 bug triggerable by kvm

2010-08-17 Thread Christoph Hellwig
On Tue, Aug 17, 2010 at 05:59:07PM +0300, Avi Kivity wrote:
> I agree, but there's another case: tell the guest that we have a
> write cache, use O_DSYNC, but only flush the disk cache on guest
> flushes.

O_DSYNC flushes the disk write cache on any filesystem that supports
it.  The disk cache is not an abstraction
exposed to applications.

> I believe this can be approximated by mounting the host filesystem
> with barrier=0?

Mounting the host filesystem with nobarrier means we will never explicitly
flush the volatile write cache on the disk.


Re: JFYI: ext4 bug triggerable by kvm

2010-08-16 Thread Christoph Hellwig
On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote:
> > Also, ext4 is _very_ slow on O_SYNC writes (which is
> > used in kvm with default cache).
>
> Yeah, we probably need to switch to sync_file_range() to avoid the
> journal commit on every write.

No, we don't.  sync_file_range does not actually provide any data
integrity.



Re: Graphical virtualisation management system

2010-06-25 Thread Christoph Hellwig
On Thu, Jun 24, 2010 at 02:01:52PM -0500, Javier Guerra Giraldez wrote:
> On Thu, Jun 24, 2010 at 1:32 PM, Freddie Cash <fjwc...@gmail.com> wrote:
> >  * virt-manager which requires X and seems to be more desktop-oriented;
>
> don't know about the others, but virt-manager runs only on the admin
> station.  on the VM hosts you run only libvirtd, which doesn't need X

While it can connect to remote systems it seems totally unusable for
that to me.  For one thing working over higher-latency links like DSL
or even transatlantic links seems to be almost impossible.  Second, I
still haven't figured out how to install and manage a system using the
serial console with KVM, which certainly contributes to the complete
lack of usability above.



Re: [Qemu-devel] [PATCH 1/2] Add 'serial' attribute to virtio-blk devices

2010-06-21 Thread Christoph Hellwig
On Fri, Jun 18, 2010 at 01:38:02PM -0500, Ryan Harper wrote:
> Create a new attribute for virtio-blk devices that will fetch the serial
> number of the block device.  This attribute can be used by udev to create
> disk/by-id symlinks for devices that don't have a UUID (filesystem)
> associated with them.
>
> ATA_IDENTIFY strings are special in that they can be up to 20 chars long
> and aren't required to be NULL-terminated.  The buffer is also zero-padded
> meaning that if the serial is 19 chars or less then we get a NULL-terminated
> string.  When copying this value into a string buffer, we must be careful to
> copy up to the NULL (if it is present) and only 20 if it is longer and not to
> attempt to NULL terminate; this isn't needed.

Why is this virtio-blk specific?  In a later mail you mention you want
to use it for udev.  So please export this from scsi/libata as well, so
we have one proper interface that we can use for all devices.


[PATCH] virtio_blk: support barriers without FLUSH feature

2010-06-15 Thread Christoph Hellwig
If we want to support barriers with the cache=writethrough mode in qemu
we need to tell the block layer that we only need queue drains to
implement a barrier.  Follow the model set by SCSI and IDE and assume
that there is no volatile write cache if the host doesn't advertise it.
While this might imply working barriers on old qemu versions or other
hypervisors that actually have a volatile write cache this is only a
cosmetic issue - these hypervisors don't guarantee any data integrity
with or without this patch, but with the patch we at least provide
data ordering.

Signed-off-by: Christoph Hellwig <h...@lst.de>

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2010-06-08 12:01:09.625254254 +0200
+++ linux-2.6/drivers/block/virtio_blk.c	2010-06-15 14:38:18.518273290 +0200
@@ -366,12 +366,32 @@ static int __devinit virtblk_probe(struc
 	vblk->disk->driverfs_dev = &vdev->dev;
 	index++;
 
-	/* If barriers are supported, tell block layer that queue is ordered */
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
+		/*
+		 * If the FLUSH feature is supported we do have support for
+		 * flushing a volatile write cache on the host.  Use that
+		 * to implement write barrier support.
+		 */
 		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
 				  virtblk_prepare_flush);
-	else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
+	} else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) {
+		/*
+		 * If the BARRIER feature is supported the host expects us
+		 * to order requests by tags.  This implies there is no
+		 * volatile write cache on the host, and that the host
+		 * never re-orders outstanding I/O.  This feature is not
+		 * useful for real life scenarios and is deprecated.
+		 */
 		blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL);
+	} else {
+		/*
+		 * If the FLUSH feature is not supported we must assume that
+		 * the host does not perform any kind of volatile write
+		 * caching.  We still need to drain the queue to provide
+		 * proper barrier semantics.
+		 */
+		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
+	}
 
 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))


Re: KVM call minutes for June 15

2010-06-15 Thread Christoph Hellwig
On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote:
> KVM/qemu patches
> - patch rate is high, documentation is low, review is low
> - patches need to include better descriptions and documentation
>   - will slow down patch writers
>   - will make it easier for patch reviewers

What is the qemu patch review policy anyway?  There are no
Reviewed-by: tags included in the actual commits, and the requirements
for a positive review also seem to vary a lot, up to the point that
some committers commit code that has never hit a public mailing list
before.



Re: [PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO

2010-06-14 Thread Christoph Hellwig
On Mon, Jun 14, 2010 at 02:44:31AM -0700, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger <n...@linux-iscsi.org>
>
> This patch adds posix-aio-compat.c:paio_submit_len(), which is identical
> to paio_submit() except that it expects nb_len instead of nb_sectors (* 512)
> so that it can be used by BSG AIO for write()/read() of struct sg_io_v4.

Just remove the nb_sectors argument; we already get the length passed
in the iovec.
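
I.e. something like this sketch (using the qemu_paiocb iovec field
names of that era):

	/* derive the transfer length from the iovec itself */
	size_t len = 0;
	int i;

	for (i = 0; i < aiocb->aio_niov; i++)
		len += aiocb->aio_iov[i].iov_len;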



Re: raw disks no longer work in latest kvm (kvm-88 was fine)

2010-05-29 Thread Christoph Hellwig
On Sat, May 29, 2010 at 04:42:59PM +0700, Antoine Martin wrote:
> Can someone explain the aio options?
> All I can find is this:
> # qemu-system-x86_64 -h | grep -i aio
>     [,addr=A][,id=name][,aio=threads|native]
> I assume it means that aio=threads emulates the kernel's aio with
> separate threads? And is therefore likely to be slower, right?
> Is there a reason why aio=native is not the default? Shouldn't
> aio=threads be the fallback?

The kernel AIO support is unfortunately not a very generic API.
It only supports O_DIRECT I/O (cache=none for qemu), and if used on
a filesystem it might still block if we need to perform block
allocations.  We could probably make it the default for block devices,
but I'm not a big fan of this kind of conditional default.
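
For reference, a minimal native-AIO sketch (libaio; note the aligned
buffer that O_DIRECT requires):

	#include <libaio.h>
	#include <fcntl.h>
	#include <stdlib.h>

	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open("disk.img", O_RDWR | O_DIRECT);

	posix_memalign(&buf, 512, 4096);	/* O_DIRECT alignment */
	io_setup(8, &ctx);
	io_prep_pread(&cb, fd, buf, 4096, 0);
	io_submit(ctx, 1, cbs);			/* only truly async with O_DIRECT */
	io_getevents(ctx, 1, 1, &ev, NULL);	/* wait for the completion */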



Re: raw disks no longer work in latest kvm (kvm-88 was fine)

2010-05-29 Thread Christoph Hellwig
On Sat, May 29, 2010 at 10:55:18AM +0100, Stefan Hajnoczi wrote:
> I would expect that aio=native is faster but benchmarks show that this
> isn't true for all workloads.

In what benchmark do you see worse results for aio=native compared to
aio=threads?



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Christoph Hellwig
On Tue, May 25, 2010 at 02:25:53PM +0300, Avi Kivity wrote:
> Currently if someone wants to add a new block format, they have to
> upstream it and wait for a new qemu to be released.  With a plugin API,
> they can add a new block format to an existing, supported qemu.

So?  Unless we want a stable driver ABI (which I fundamentally oppose, as
it would make block driver development hell) they'd have to wait for
a new release of the block layer.  It's really just going to be a lot
of pain for no major gain.  qemu releases are frequent enough, and if
users care enough they can also easily patch qemu.



Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Christoph Hellwig
On Fri, May 21, 2010 at 09:49:56PM +0100, Stefan Hajnoczi wrote:
> http://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps
>
> Requires kernel support - not sure if enough of utrace is in mainline
> for this to work out-of-the-box across distros.

Nothing of utrace is in mainline, never mind the whole systemtap code,
which is intentionally kept out of the kernel tree.  Using this means
that for every probe in userspace code you need to keep the configured
source tree of the currently running kernel around, which is completely
unusable for typical developer setups.



Re: qemu-kvm hangs if multipath device is queing

2010-05-19 Thread Christoph Hellwig
On Tue, May 18, 2010 at 03:22:36PM +0200, Kevin Wolf wrote:
> I think it's stuck here in an endless loop:
>
>     while (laiocb->ret == -EINPROGRESS)
>         qemu_laio_completion_cb(laiocb->ctx);
>
> Can you verify this by single-stepping one or two loop iterations? ret
> and errno after the read call could be interesting, too.

Maybe the compiler is just too smart.  Without some form of barrier
it could just optimize the loop away as laiocb->ret couldn't change
in a normal single-threaded environment.
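
A compiler barrier or a volatile access in the loop would force the
re-read, e.g. (sketch):

	/* force the compiler to re-load laiocb->ret on every iteration */
	while (*(volatile int *)&laiocb->ret == -EINPROGRESS)
		qemu_laio_completion_cb(laiocb->ctx);

	/* or keep the plain load and put a compiler barrier in the
	 * loop body:  asm volatile("" ::: "memory"); */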
