[PATCH] virtio: Fix GFP flags passed from the virtio balloon driver
From: Balbir Singh <bal...@linux.vnet.ibm.com>

The virtio balloon driver can dig into the reservation pools of the OS
to satisfy a balloon request.  This is not advisable, and other balloon
drivers (drivers/xen/balloon.c) avoid this as well.  The patch also adds
changes to avoid printing a warning if allocation fails, since we retry
after some time anyway.

Signed-off-by: Balbir Singh <bal...@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <ru...@rustcorp.com.au>
Cc: kvm <kvm@vger.kernel.org>
Cc: sta...@kernel.org
---
 drivers/virtio/virtio_balloon.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 369f2ee..f8ffe8c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -102,7 +102,8 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY);
+		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
+					       __GFP_NOMEMALLOC | __GFP_NOWARN);
 		if (!page) {
 			if (printk_ratelimit())
 				dev_printk(KERN_INFO, &vb->vdev->dev,

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
> I took a stab at documenting CMD and FLUSH request types in virtio
> block.  Christoph, could you look over this please?
>
> I note that the interface seems full of warts to me; this might be a
> first step to cleaning them.

ISTR Christoph had withdrawn some patches in this area, and was waiting
for him to resubmit?

I've given up on figuring out the block device.  What seem to me to be
sane semantics along the lines of memory barriers are foreign to disk
people: they want (and depend on) flushing everywhere.

For example, tdb transactions do not require a flush, they only require
what I would call a barrier: that prior data be written out before any
future data.  Surely that would be more efficient in general than a
flush!  In fact, TDB wants only writes to *that file* (and metadata)
written out first; it has no ordering issues with other I/O on the same
device.

A generic I/O interface would allow you to specify "this request depends
on these outstanding requests" and leave it at that.  It might have some
sync flush command for dumb applications and OSes.  The userspace API
might not be as precise and only allow such a barrier against all prior
writes on this fd.

ISTR someone mentioning a desire for such an API years ago, so CC'ing
the usual I/O suspects...

Cheers,
Rusty.
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 04 2010, Rusty Russell wrote:
> > > ISTR someone mentioning a desire for such an API years ago, so
> > > CC'ing the usual I/O suspects...
> >
> > It would be nice to have a fuller API for this, but the reality is
> > that only the flush approach is really workable.  Even just strict
> > ordering of requests could only be supported on SCSI, and even there
> > the kernel still lacks proper guarantees on error handling to
> > prevent reordering there.
>
> There's a few I/O scheduling differences that might be useful:
>
> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
>    before a BARRIER.  That might be useful for time-critical WRITEs,
>    and those issued by high I/O priority.

This is only because no one actually wants flushes or barriers, though
I/O people seem to only offer that.  We really want "these writes must
occur before this write".  That offers maximum choice to the I/O
subsystem and potentially to smart (virtual?) disks.

> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
>    only for data belonging to a particular file (e.g. fdatasync with
>    no file size change, even on btrfs if O_DIRECT was used for the
>    writes being committed).  That would entail tagging FLUSHes and
>    WRITEs with a fs-specific identifier (such as inode number), opaque
>    to the scheduler which only checks equality.

This is closer.  In userspace I'd be happy with "all prior writes to
this struct file before all future writes".  Even if the original
guarantees were stronger (ie. inode basis).  We currently implement
transactions using 4 fsync/msync pairs:

	write_recovery_data(fd);
	fsync(fd);
	msync(mmap);
	write_recovery_header(fd);
	fsync(fd);
	msync(mmap);
	overwrite_with_new_data(fd);
	fsync(fd);
	msync(mmap);
	remove_recovery_header(fd);
	fsync(fd);
	msync(mmap);

Yet we really only need ordering, not guarantees about it actually
hitting disk before returning.  In other words, FLUSH can be more
relaxed than BARRIER inside the kernel.  It's ironic that we think of
fsync as stronger than fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it
has less information about what is required.  I'm saying I want to give
you as much information as I can, even if you don't use it yet.

Thanks,
Rusty.
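The four-step transaction above can be sketched in C.  This is a hedged
illustration, not TDB's actual code: the function names, the file
offsets, and the header bytes are all invented for the example, and the
mmap/msync half is omitted so only the fd/fsync ordering shows.  The
point it demonstrates is the complaint in the text: each fsync() buys
both ordering and durability, when ordering alone would do.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write one step of the transaction and force it out before the next
 * step may begin.  fsync() is a barrier *and* a flush: stronger (and
 * slower) than the ordering that is actually required. */
static int write_and_sync(int fd, off_t off, const void *buf, size_t len)
{
	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return -1;
	return fsync(fd);
}

/* Sketch of the 4-sync commit sequence quoted above. */
int tdb_style_commit(const char *path, const void *newdata, size_t len)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	if (write_and_sync(fd, 4096, "olddata", 7) != 0 ||	/* recovery data */
	    write_and_sync(fd, 4080, "RCVR", 4) != 0 ||		/* recovery header */
	    write_and_sync(fd, 0, newdata, len) != 0 ||		/* new data */
	    write_and_sync(fd, 4080, "\0\0\0\0", 4) != 0) {	/* remove header */
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A barrier-style primitive would let the middle two syncs become cheap
ordering requests rather than full flushes.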
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 04:24:59 am Christoph Hellwig wrote:
> On Fri, Feb 19, 2010 at 12:22:20AM +0200, Michael S. Tsirkin wrote:
> > I took a stab at documenting CMD and FLUSH request types in virtio
> > block.  Christoph, could you look over this please?
> >
> > I note that the interface seems full of warts to me; this might be a
> > first step to cleaning them.
>
> The whole virtio-blk interface is full of warts.

It has been extended rather ad-hoc, so that is rather expected.

> > One issue I struggled with especially is how the type field mixes
> > bits and non-bit values.  I ended up simply defining all legal
> > values, so that we have CMD = 2, CMD_OUT = 3 and so on.
>
> It's basically a complete mess without much logic behind it.
>
> > +\change_unchanged
> > +the high bit
> > +\change_inserted 0 1266497301
> > + (VIRTIO_BLK_T_BARRIER)
> > +\change_unchanged
> > + indicates that this request acts as a barrier and that all preceding requests
> > + must be complete before this one, and all following requests must not be
> > + started until this is complete.
> > +
> > +\change_inserted 0 1266504385
> > + Note that a barrier does not flush caches in the underlying backend device
> > + in host, and thus does not serve as data consistency guarantee.
> > + Driver must use FLUSH request to flush the host cache.
> > +\change_unchanged
>
> I'm not sure it's even worth documenting it.  I can't see any way to
> actually implement safe behaviour with the VIRTIO_BLK_T_BARRIER-style
> barriers.
>
> Btw, did I mention that .lyx is a really horrible format to review
> diffs for?  Plain latex would be a lot better...

Yeah, or export as text and post that diff for content review.  I do
like the patches to the lyx source though (I check all versions into
revision control, before and after merging changes, which makes it easy
to produce annotated versions).

Cheers,
Rusty.
Re: virtio: put last_used and last_avail index into ring itself.
On Wed, 5 May 2010 03:52:36 am Michael S. Tsirkin wrote:
> > virtio: put last_used and last_avail index into ring itself.
> >
> > Generally, the other end of the virtio ring doesn't need to see
> > where you're up to in consuming the ring.  However, to completely
> > understand what's going on from the outside, this information must
> > be exposed.  For example, if you want to save and restore a
> > virtio_ring, but you're not the consumer because the kernel is using
> > it directly.
> >
> > Fortunately, we have room to expand: the ring is always a whole
> > number of pages and there's hundreds of bytes of padding after the
> > avail ring and the used ring, whatever the number of descriptors
> > (which must be a power of 2).
> >
> > We add a feature bit so the guest can tell the host that it's
> > writing out the current value there, if it wants to use that.
> >
> > Signed-off-by: Rusty Russell <ru...@rustcorp.com.au>
>
> I've been looking at this patch some more (more on why later), and I
> wonder: would it be better to add some alignment to the last used
> index address, so that if we later add more stuff at the tail, it all
> fits in a single cache line?

In theory, but not in practice.  We don't have many rings, so the
difference between 1 and 2 cache lines is not very much.  We use a new
feature bit anyway, so a layout change should not be a problem.

> Since I raised the question of caches: for the used ring, the ring is
> not aligned to 64 bit, so on CPUs with 64 bit or larger cache lines,
> used entries will often cross cache line boundaries.  Am I right, and
> might it have been better to align ring entries to cache line
> boundaries?  What do you think?

I think everyone is settled on 128 byte cache lines for the foreseeable
future, so it's not really an issue.

Cheers,
Rusty.
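The "hundreds of bytes of padding" claim can be checked with the layout
arithmetic: 16 bytes per descriptor, then avail flags, idx, and 2 bytes
per ring entry, with the used ring pushed up to the next page boundary.
A small sketch (the helper name is invented; 4096-byte pages assumed):

```c
#include <assert.h>
#include <stddef.h>

static size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Bytes of padding between the end of the avail ring and the
 * page-aligned used ring, for a power-of-2 queue size `num`. */
size_t vring_avail_padding(unsigned num, size_t page)
{
	size_t driver_part = 16 * num		/* descriptors */
			   + 2 * (2 + num);	/* avail: flags, idx, ring[] */
	return align_up(driver_part, page) - driver_part;
}
```

For a 256-entry ring that leaves 3580 spare bytes before the used ring;
for 128 entries, 1788 — plenty of room for a published index.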
Re: question on virtio
On Wed, 5 May 2010 08:39:47 pm Michael S. Tsirkin wrote:
> Hi!
> I see this in virtio_ring.c:
>
> 	/* Put entry in available array (but don't update avail->idx
> 	 * until they do sync). */
>
> Why is it done this way?  It seems that updating the index straight
> away would be simpler, while this might allow the host to
> speculatively look up the buffer and handle it, without waiting for
> the kick.

I agree.  From my TODO: "what if we actually expose in ->add_buf?"

I don't *think* anyone adds buffers without being ready for them to be
used, so changing this should be safe.  Want to give it a try and
report back?

Thanks!
Rusty.
Re: [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 06:28:14 am Michael S. Tsirkin wrote:
> Rusty, this is a simplified form of a patch you posted in the past.
>
> I have a vhost patch that, using this feature, shows external to host
> bandwidth grow from 5 to 7 GB/s, by avoiding an interrupt in the
> window after the previous interrupt was sent and before interrupts
> were disabled for the vq.  With vhost under some external to host
> loads I see this window being hit about 30% of the time.

Fascinating.  So you use this to guess if the guest is still
processing?  I haven't thought about it hard, but is that racy?

Obviously happy to apply this when you finalize it.

Thanks!
Rusty.
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote:
> On Wed, 5 May 2010 14:28:41 +0930 Rusty Russell <ru...@rustcorp.com.au> wrote:
> > This is closer.  In userspace I'd be happy with "all prior writes to
> > this struct file before all future writes".  Even if the original
> > guarantees were stronger (ie. inode basis).  We currently implement
> > transactions using 4 fsync/msync pairs:
> >
> > 	write_recovery_data(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	write_recovery_header(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	overwrite_with_new_data(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	remove_recovery_header(fd);
> > 	fsync(fd);
> > 	msync(mmap);
>
> Seems over-zealous.
> If the recovery_header held a strong checksum of the recovery_data you
> would not need the first fsync, and as long as you have two places to
> write recovery data, you don't need the 3rd and 4th syncs.  Just:
>
> 	write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> 	fsync / msync
> 	overwrite_with_new_data()
>
> To recover you choose the most recent log_space and replay the
> content.  That may be a redundant operation, but that is no loss.

I think you missed a checksum for the new data?  Otherwise we can't
tell if the new data is completely written.  But yes, I will steal this
scheme for TDB2, thanks!

In practice, it's the first sync which is glacial; the rest are pretty
cheap.

> Also cannot see the point of msync if you have already performed an
> fsync, and if there is a point, I would expect you to call msync
> before fsync...  Maybe there is some subtlety there that I am not
> aware of.

I assume it's this from the msync man page:

	msync() flushes changes made to the in-core copy of a file that
	was mapped into memory using mmap(2) back to disk.  Without use
	of this call there is no guarantee that changes are written back
	before munmap(2) is called.

> > It's an implementation detail; barrier has less flexibility because
> > it has less information about what is required.  I'm saying I want
> > to give you as much information as I can, even if you don't use it
> > yet.
>
> Only, we know that approach doesn't work.  People will learn that they
> don't need to give the extra information to still achieve the same
> result - just like they did with ext3 and fsync.  Then when we improve
> the implementation to only provide the guarantees that you asked for,
> people will complain that they are getting empty files that they
> didn't expect.

I think that's an oversimplification: IIUC that occurred to people
*not* using fsync().  They weren't using it because it was too slow.
Providing a primitive which is as fast or faster and more specific
doesn't have the same magnitude of social issues.  And we can't write
userspace interfaces for idiots only.

> The abstraction I would like to see is a simple 'barrier' that
> contains no data and has a filesystem-wide effect.

I think you lack ambition ;)

Thinking about the single-file use case (eg. kvm guest or tdb), isn't
that suboptimal for md?  Since you have to hand your barrier to every
device, whereas a file-wide primitive may theoretically only go to
some.

Cheers,
Rusty.
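Neil's two-slot scheme can be sketched in miniature.  This is an
in-memory illustration only: the slot layout, the toy checksum, and the
helper names are invented, not TDB's real format.  It shows the
recovery rule he describes — pick the valid slot with the highest
sequence number, so a torn write in one slot simply falls back to the
other.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One of the two log spaces: sequence number, checksum over seq+data,
 * and the (internally checksummed) recovery data itself. */
struct log_slot {
	uint32_t seq;
	uint32_t csum;
	char data[16];
};

static uint32_t slot_csum(const struct log_slot *s)
{
	uint32_t c = s->seq;
	for (unsigned i = 0; i < sizeof(s->data); i++)
		c = c * 31 + (unsigned char)s->data[i];
	return c;
}

/* Recovery: replay the valid slot with the highest seq, or -1 if
 * neither slot checks out (crash before the very first commit). */
int pick_slot(const struct log_slot s[2])
{
	int best = -1;
	for (int i = 0; i < 2; i++)
		if (s[i].csum == slot_csum(&s[i]) &&
		    (best < 0 || s[i].seq > s[best].seq))
			best = i;
	return best;
}
```

Each commit then needs only one fsync: write the checksummed record
into the older slot, sync, and overwrite the real data in place.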
Re: virtio: put last_used and last_avail index into ring itself.
On Thu, 6 May 2010 03:57:55 pm Michael S. Tsirkin wrote:
> On Thu, May 06, 2010 at 10:22:12AM +0930, Rusty Russell wrote:
> > On Wed, 5 May 2010 03:52:36 am Michael S. Tsirkin wrote:
> > > What do you think?
> >
> > I think everyone is settled on 128 byte cache lines for the
> > foreseeable future, so it's not really an issue.
>
> You mean with 64 bit descriptors we will be bouncing a cache line
> between host and guest, anyway?

I'm confused by this entire thread.

Descriptors are 16 bytes.  They are at the start, so presumably aligned
to cache boundaries.  The available ring follows that at 2 bytes per
entry, so it's also packed nicely into cachelines.  Then there's padding
to the page boundary.  That puts us on a cacheline again for the used
ring; also 2 bytes per entry.

I don't see how any change in layout could be more cache friendly?

Rusty.
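The element sizes this argument rests on are easy to check with minimal
copies of the two driver-side ring structures (field layout as in the
virtio spec; these are illustrative stand-ins, not the kernel's
headers):

```c
#include <assert.h>
#include <stdint.h>

/* "Descriptors are 16 bytes": 8 + 4 + 2 + 2, naturally packed. */
struct vring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* The avail ring: 2 bytes of flags, 2 of idx, then 2 per entry. */
struct vring_avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[];
};
```

Eight descriptors fill a 128-byte line exactly, which is why the
descriptor table and avail ring pack cleanly.  (The used ring is the
part this message gets wrong, as the follow-up below concedes: its
entries are 8 bytes, not 2.)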
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
> > +	/* We publish the last-seen used index at the end of the available ring.
> > +	 * It is at the end for backwards compatibility. */
> > +	vr->last_used_idx = &(vr)->avail->ring[num];
> > +	/* Verify that last used index does not spill over the used ring. */
> > +	BUG_ON((void *)vr->last_used_idx +
> > +	       sizeof *vr->last_used_idx > (void *)vr->used);
> >  }
>
> Shouldn't this be on its own cache line?

It's next to the available ring because that's where the guest publishes
its data.  That whole page is guest-write, host-read.

Putting it on a cacheline by itself would be a slight pessimization; the
host cpu would have to get the last_used_idx cacheline and the avail
descriptor cacheline every time.  This way, they are sometimes the same
cacheline.

Hope that clarifies,
Rusty.
Re: [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 03:49:46 pm Michael S. Tsirkin wrote:
> Now, I also added an mb() in the guest between the read and the write
> so that the last used index write can not get ahead of the used index
> read.  It does feel good to have it there, but I can not say why it's
> helpful.  Works fine without it, but then these subtle races might be
> hard to trigger.  What do you think?

I couldn't see that in the patch?  I don't think it's necessary though,
since the write of last_used depends on the read of used (and no
platform we care about would reorder such a thing).

I'm reasonably happy, but we should write some convenient test for
missing interrupts.

I'm thinking of a sender which does a loop: blasts 1MB of UDP packets,
then prints the time and sleep(1).  The receiver would print the time
every 1MB of received data.  The two times should almost exactly
correspond.

Assuming that the network doesn't overflow and lose stuff, this should
identify any missing wakeups/interrupts (depending on the direction
used).

Cheers,
Rusty.
Re: virtio: put last_used and last_avail index into ring itself.
On Sun, 9 May 2010 06:27:33 pm Michael S. Tsirkin wrote:
> On Fri, May 07, 2010 at 12:35:39PM +0930, Rusty Russell wrote:
> > Then there's padding to the page boundary.  That puts us on a
> > cacheline again for the used ring; also 2 bytes per entry.
>
> Hmm, is the used ring really 2 bytes per entry?

Err, no, I am an idiot.

	/* u32 is used here for ids for padding reasons. */
	struct vring_used_elem {
		/* Index of start of used descriptor chain. */
		__u32 id;
		/* Total length of the descriptor chain which was used (written to) */
		__u32 len;
	};

	struct vring_used {
		__u16 flags;
		__u16 idx;
		struct vring_used_elem ring[];
	};

OK, now I get it.  Sorry, I was focussed on the avail ring.

> I thought that the used ring has 8 bytes per entry, and that struct
> vring_used is aligned at a page boundary; this would mean that the
> ring elements are at an offset of 4 bytes from the page boundary.
> Thus with a cacheline size of 128 bytes, each 4th element crosses a
> cacheline boundary.  If we had a 4 byte padding after idx, each used
> element would always be completely within a single cacheline.

I think the numbers are: every 16th entry hits two cachelines.  So
currently the first 15 entries are free (assuming we hit the idx
cacheline anyway), then 1 in 16 costs 2 cachelines.  That makes the
aligned version win when N > 240.

But we access the array linearly, so the extra cacheline cost is in
fact amortized.  I doubt it could be measured, but maybe vring_get_buf()
should prefetch?  While you're there, we could use an & rather than a
mod on the calculation, which may actually be measurable :)

Cheers,
Rusty.
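The "every 16th entry hits two cachelines" figure can be reproduced
with a few lines of arithmetic (the helper is invented for the
illustration): 8-byte entries starting 4 bytes into a page, against
128-byte lines.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many of the first `num` used-ring entries straddle a
 * cache-line boundary, given the entry size, the offset of ring[0]
 * from the page-aligned start of the used ring, and the line size. */
unsigned count_split_entries(unsigned num, size_t entry_size,
			     size_t offset, size_t line)
{
	unsigned split = 0;
	for (unsigned i = 0; i < num; i++) {
		size_t first = offset + i * entry_size;	/* first byte */
		size_t last = first + entry_size - 1;	/* last byte */
		if (first / line != last / line)
			split++;
	}
	return split;
}
```

With the current layout (offset 4) a 256-entry ring has 16 straddling
entries — one in sixteen; with 4 bytes of padding after idx (offset 8)
none straddle, which is the trade-off being weighed above.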
Re: virtio: imply disable_cb on callbacks
On Tue, 18 May 2010 09:03:17 pm Michael S. Tsirkin wrote:
> Rusty, the patch "virtio: imply disable_cb on callbacks" is in your
> tree.  I'd like to figure out how it works; for example:
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -69,6 +69,8 @@ static void blk_done(struct virtqueue *v
>  	/* In case queue is stopped waiting for more buffers. */
>  	blk_start_queue(vblk->disk->queue);
>  	spin_unlock_irqrestore(&vblk->lock, flags);
> +
> +	vq->vq_ops->enable_cb(vq);
>  }
>
>  static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
>
> Since this does not check the return status from enable_cb, it seems
> we could lose an interrupt if it arrives between the poll and the
> callback enable?

It's been quite a while since I wrote this patch, and I never really
liked it enough to polish it.  We would need to enable this *before*
reading the queue, AFAICT.

I'll remove it from my series; it's in the wilderness area already :)

Thanks!
Rusty.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Wed, 12 May 2010 04:57:22 am Avi Kivity wrote:
> On 05/07/2010 06:23 AM, Rusty Russell wrote:
> > It's next to the available ring because that's where the guest
> > publishes its data.  That whole page is guest-write, host-read.
> > Putting it on a cacheline by itself would be a slight pessimization;
> > the host cpu would have to get the last_used_idx cacheline and the
> > avail descriptor cacheline every time.  This way, they are sometimes
> > the same cacheline.
>
> If one peer writes the tail of the available ring, while the other
> reads last_used_idx, it's a false bounce, no?

I think we're talking about the last 2 entries of the avail ring.  That
means the worst case is 1 false bounce every time around the ring.

I think that's why we're debating it instead of measuring it :)

Note that this is an exclusive-shared-exclusive bounce only, too.

Cheers,
Rusty.
Re: [PATCH] vhost-net: utilize PUBLISH_USED_IDX feature
On Thu, 20 May 2010 07:57:18 am Michael S. Tsirkin wrote:
> On Wed, May 19, 2010 at 08:04:51PM +0300, Avi Kivity wrote:
> > On 05/18/2010 04:19 AM, Michael S. Tsirkin wrote:
> > > With PUBLISH_USED_IDX, the guest tells us which used entries it
> > > has consumed.  This can be used to reduce the number of
> > > interrupts: after we write a used entry, if the guest has not yet
> > > consumed the previous entry, or if the guest has already consumed
> > > the new entry, we do not need to interrupt.  This improves
> > > bandwidth by 30% under some workflows.
> > >
> > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > ---
> > > Rusty, Dave, this patch depends on the patch "virtio: put last
> > > seen used index into ring itself" which is currently destined for
> > > Rusty's tree.  Rusty, if you are taking that one for 2.6.35,
> > > please take this one as well.  Dave, any objections?
> >
> > I object: I think the index should have its own cacheline.
>
> The issue here is that host/guest do not know each other's cache line
> size.  I guess we could just put it at offset 128 or something like
> that ...  Rusty?

I was assuming you'd put it at the end of the padding.

I think it's a silly optimization, but Avi obviously feels strongly
about it and I respect his opinion.  Please resubmit...

Thanks,
Rusty.
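The suppression rule in the changelog — interrupt only if the guest has
consumed up to the previous entry but not the new one — reduces to a
single comparison on free-running 16-bit indices.  This is a sketch of
the rule as described, not the actual vhost code:

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero iff an interrupt is needed after publishing used
 * entry `new_used`, given the previously published index `old_used`
 * and the guest's published last-seen index.  True exactly when
 * last_seen lies in [old_used, new_used); all arithmetic is mod 2^16,
 * as the ring indices are free-running. */
int need_interrupt(uint16_t old_used, uint16_t new_used, uint16_t last_seen)
{
	return (uint16_t)(new_used - last_seen - 1) <
	       (uint16_t)(new_used - old_used);
}
```

If the guest is still behind (hasn't consumed the previous entry) it
will pick the new entry up on its own; if it has already polled past
the new entry, an interrupt would be pure overhead — in both cases the
comparison is false.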
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > Note that this is an exclusive-shared-exclusive bounce only, too.
>
> A bounce is a bounce.

I tried to measure this to show that you were wrong, but I was only
able to show that you're right.  How annoying.  Test code below.

> Virtio is already way too bouncy due to the indirection between the
> avail/used rings and the descriptor pool.

I tried to do a more careful analysis below, and I think this is an
overstatement.

> A device with out of order completion (like virtio-blk) will quickly
> randomize the unused descriptor indexes, so every descriptor fetch
> will require a bounce.
>
> In contrast, if the rings hold the descriptors themselves instead of
> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines
> for every descriptor, amortized.

We already have indirect; this would be a logical next step.  So let's
think about it.  The avail ring would contain 64 bit values, the used
ring would contain indexes into the avail ring.

So the client writes the descriptor page and adds to the avail ring,
then writes to the index.  The server reads the index, the avail ring,
and the descriptor page (3).  It writes the used entry (1), and updates
last_used (1).  The client reads used (1), derefs avail (1), updates
last_used (1), and cleans the descriptor page (1).

That's 9 cacheline transfers, worst case.  Best case of a half-full
ring in steady state, assuming 128-byte cache lines, the avail ring
costs are 1/16, the used entry is 1/64.  This drops it to 6 and 9/64
transfers.  (Note, the current scheme adds 2 more cacheline transfers,
for the descriptor table, worst case.  Assuming indirect, we get 2/8
xfer best case.  Either way, it's not the main source of cacheline
xfers).

Can we do better?  The obvious idea is to try to get rid of last_used
and used, and use the ring itself.  We would use an invalid entry to
mark the head of the ring.

Any other thoughts?
Rusty.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > > Note that this is an exclusive-shared-exclusive bounce only, too.
> >
> > A bounce is a bounce.
>
> I tried to measure this to show that you were wrong, but I was only
> able to show that you're right.  How annoying.  Test code below.

This time for sure!

#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <err.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/mman.h>

/* We share memory via an mmap. */
struct counter {
	unsigned int cacheline1;
	char pad[256];
	unsigned int cacheline2;
};

#define MAX_BOUNCES 1

enum mode {
	SHARE,
	UNSHARE,
	LOCKSHARE,
	LOCKUNSHARE,
};

int main(int argc, char *argv[])
{
	cpu_set_t cpuset;
	volatile struct counter *counter;
	struct timeval start, stop;
	bool child;
	unsigned int count;
	uint64_t usec;
	enum mode mode;

	if (argc != 4)
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare <cpu0> <cpu1>");

	if (strcmp(argv[1], "share") == 0)
		mode = SHARE;
	else if (strcmp(argv[1], "unshare") == 0)
		mode = UNSHARE;
	else if (strcmp(argv[1], "lockshare") == 0)
		mode = LOCKSHARE;
	else if (strcmp(argv[1], "lockunshare") == 0)
		mode = LOCKUNSHARE;
	else
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare <cpu0> <cpu1>");

	CPU_ZERO(&cpuset);

	counter = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE,
		       MAP_ANONYMOUS|MAP_SHARED, -1, 0);
	if (counter == MAP_FAILED)
		err(1, "Mapping page");

	/* Fault it in. */
	counter->cacheline1 = counter->cacheline2 = 0;

	child = (fork() == 0);

	CPU_SET(atoi(argv[2 + child]), &cpuset);
	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset) != 0)
		err(1, "Calling sched_setaffinity()");

	gettimeofday(&start, NULL);

	if (child) {
		count = 1;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline2 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1,
								   count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				__sync_val_compare_and_swap(&counter->cacheline2,
							    count, count+1);
				count += 2;
			}
			break;
		}
	} else {
		count = 0;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1,
								   count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				__sync_val_compare_and_swap(&counter->cacheline1,
							    count, count+1);
				count += 2;
			}
			break;
		}
	}

	gettimeofday(&stop, NULL);
	usec = (stop.tv_sec - start.tv_sec) * 1000000ULL
	       + stop.tv_usec - start.tv_usec;
	printf("%s: %u bounces in %llu usec\n",
	       child ? "child" : "parent", count,
	       (unsigned long long)usec);
	return 0;
}
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 20 May 2010 04:30:56 pm Avi Kivity wrote: On 05/20/2010 08:01 AM, Rusty Russell wrote: A device with out of order completion (like virtio-blk) will quickly randomize the unused descriptor indexes, so every descriptor fetch will require a bounce. In contrast, if the rings hold the descriptors themselves instead of pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for every descriptor, amortized. We already have indirect, this would be a logical next step. So let's think about it. The avail ring would contain 64 bit values, the used ring would contain indexes into the avail ring. Have just one ring, no indexes. The producer places descriptors into the ring and updates the head, The consumer copies out descriptors to be processed and copies back in completed descriptors. Chaining is always linear. The descriptors contain a tag that allow the producer to identify the completion. This could definitely work. The original reason for the page boundaries was for untrusted inter-guest communication: with appropriate page protections they could see each other's rings and a simply inter-guest copy hypercall could verify that the other guest really exposed that data via virtio ring. But, cute as that is, we never did that. And it's not clear that it wins much over simply having the hypervisor read both rings directly. Can we do better? The obvious idea is to try to get rid of last_used and used, and use the ring itself. We would use an invalid entry to mark the head of the ring. Interesting! So a peer will read until it hits a wall. But how to update the wall atomically? Maybe we can have a flag in the descriptor indicate headness or tailness. Update looks ugly though: write descriptor with head flag, write next descriptor with head flag, remove flag from previous descriptor. I was thinking a separate magic invalid entry. To publish an 3 descriptor chain, you would write descriptors 2 and 3, write an invalid entry at 4, barrier, write entry 1. 
It is a bit ugly, yes, but not terrible. I think that a simple simulator for this is worth writing, which tracks cacheline moves under various fullness scenarios... Cheers, Rusty.
Re: [PATCHv2-RFC 0/2] virtio: put last seen used index into ring itself
On Thu, 27 May 2010 05:20:35 am Michael S. Tsirkin wrote: Here's a rewrite of the original patch with a new layout. I haven't tested it yet so no idea how this performs, but I think this addresses the cache bounce issue raised by Avi. Posting for early flames/comments. Sorry, not without some evidence that it'll actually reduce cacheline bouncing. I *think* it will, but it's not obvious: the host may keep looking at avail_idx as we're updating last_seen. Or does qemu always look at both together anyway? Can someone convince me this is a win? Rusty.
Re: [PATCHv3 1/2] virtio: support layout with avail ring before idx
On Wed, 2 Jun 2010 12:17:12 am Michael S. Tsirkin wrote: This adds an (unused) option to put available ring before control (avail index, flags), and adds padding between index and flags. This avoids cache line sharing between control and ring, and also makes it possible to extend avail control without incurring extra cache misses. Signed-off-by: Michael S. Tsirkin m...@redhat.com

No no no no. 254? You're trying to Morton me![1] How's this (untested):

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -74,8 +74,8 @@ struct vring {
 /* The standard layout for the ring is a continuous chunk of memory which looks
  * like this. We assume num is a power of 2.
  *
- * struct vring
- * {
+ * struct vring {
+ *	*** The driver writes to this part.
  *	// The actual descriptors (16 bytes each)
  *	struct vring_desc desc[num];
  *
@@ -84,9 +84,11 @@ struct vring {
  *	__u16 avail_idx;
  *	__u16 available[num];
  *
- *	// Padding to the next align boundary.
+ *	// Padding so used_flags is on the next align boundary.
  *	char pad[];
+ *	__u16 last_used; // On a cacheline of its own.
  *
+ *	*** The device writes to this part.
  *	// A ring of used descriptor heads with free-running index.
  *	__u16 used_flags;
  *	__u16 used_idx;
@@ -110,6 +112,12 @@ static inline unsigned vring_size(unsign
 		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
 }

+/* Last used index sits at the very end of the driver part of the struct */
+static inline __u16 *vring_last_used_idx(const struct vring *vr)
+{
+	return (__u16 *)vr->used - 1;
+}
+
 #ifdef __KERNEL__
 #include <linux/irqreturn.h>
 struct virtio_device;

Cheers, Rusty.

[1] Andrew Morton has this technique where he posts a solution so ugly it forces others to fix it properly. Ego-roping, basically.
Re: [PATCHv3 1/2] virtio: support layout with avail ring before idx
On Fri, 4 Jun 2010 08:05:43 pm Michael S. Tsirkin wrote: On Fri, Jun 04, 2010 at 12:04:57PM +0930, Rusty Russell wrote: On Wed, 2 Jun 2010 12:17:12 am Michael S. Tsirkin wrote: This adds an (unused) option to put available ring before control (avail index, flags), and adds padding between index and flags. This avoids cache line sharing between control and ring, and also makes it possible to extend avail control without incurring extra cache misses. Signed-off-by: Michael S. Tsirkin m...@redhat.com No no no no. 254? You're trying to Morton me![1] Hmm, I wonder what we will do if we want a 3rd field on a separate cacheline. But ok. How's this (untested): I think we also want to put flags there as well, they are used on interrupt path, together with last used index. I'm uncomfortable with moving a field. We haven't done that before and I wonder what will break with old code. Should we instead just abandon the flags field and use last_used only? Or, more radically, put flags == last_used when the feature is on? Thoughts? Rusty.
Re: Extending virtio_console to support multiple ports
On Tue, 1 Sep 2009 02:07:05 am Anthony Liguori wrote: Amit Shah wrote: Functionally speaking, both virtio-console and virtio-serial do the same thing. In fact, virtio-console is just a subset of virtio-serial. If there are problems converging the two drivers in Linux, then I suggest you have two separate driver modules in Linux. That would obviously be rejected for Linux though because you cannot have two drivers for the same device. Why should qemu have a different policy? I've been on leave, cheerfully not thinking about this. AFAICT, a console and serial are very similar, so it makes sense to merge them. But a serial port doesn't have framing either; if it does, it's something else. And generally using virtio framing in upper layers is a mistake (one we've made in the qemu implementations, but that's unfortunate). Thanks, Rusty.
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Wed, 2 Sep 2009 01:35:18 am Stephen Hemminger wrote: On Tue, 1 Sep 2009 14:58:19 +0800 Xin, Xiaohui xiaohui@intel.com wrote: [RFC] Virtual Machine Device Queues (VMDq) support on KVM Network adapter with VMDq technology presents multiple pairs of tx/rx queues, and renders a network L2 sorting mechanism based on MAC addresses and VLAN tags for each tx/rx queue pair. Here we present a generic framework, in which network traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without any software copy. Actually this framework can apply to traditional network adapters which have just one tx/rx queue pair. And applications using the same user/kernel interface can utilize this framework to send/receive network traffic directly thru a tx/rx queue pair in a network adapter. We use the virtio-net architecture to illustrate the framework.

[ASCII architecture diagram, garbled in transit: the Qemu process hosts the virtio-net backend service, which exchanges TX (pop/add_buf) and RX (push/get_buf) with the guest kernel's virtio-net driver; AIO read/write combined with Direct I/O (substituting synced file operations) crosses into the host kernel, where reads and writes are copy-less with user space directly mapped to kernel so payloads are DMAed into user space or hooked to an skb; below sit the TUN/TAP driver and NIC driver (a likely queue pair instance) backed by a traditional adapter or a tx/rx queue pair.]

The basic idea is to utilize kernel Asynchronous I/O combined with Direct I/O to implement a copy-less TUN/TAP device. AIO and Direct I/O are not new to the kernel; we can still see them in the SCSI tape driver. With traditional file operations, a copy of payload contents between the kernel DMA address and a user buffer is needed. That's the copying we want to save.
The proposed framework is like this: a TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair on the host side. The KVM virtio-net backend service, a user space program, submits asynchronous read/write I/O requests to the host kernel through the TUN/TAP device. The requests correspond to the vqueue elements, which include both transmit and receive. They can be queued in one AIO request, and later the completion will be notified through the underlying packet tx/rx processing of the rx/tx queue pair.

Detailed path: to the guest virtio-net driver, packet receive corresponds to asynchronous read I/O requests of the backend service.

1) The guest virtio-net driver provides header and payload addresses through the receive vqueue to the virtio-net backend service.
2) The virtio-net backend service encapsulates multiple vqueue elements into multiple AIO control blocks and composes them into one AIO read request.
3) The virtio-net backend service uses the io_submit() syscall to pass the request to the TUN/TAP device.
4) The virtio-net backend service uses the io_getevents() syscall to check the completion of the request.
5) The TUN/TAP driver receives packets from the queue pair of the NIC, and prepares for Direct I/O. A modified NIC driver may render an skb whose header is allocated in the host kernel, but whose payload buffer is directly mapped from the user space buffers rendered through the AIO request by the backend service; get_user_pages() may do this. For one AIO read request, the TUN/TAP driver maintains a list of the directly mapped buffers, and the NIC driver tries to get the buffers as payload buffers to compose the new skbs. Of course, if
Re: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default
On Fri, 18 Sep 2009 03:01:42 am Christoph Hellwig wrote: Err, I'll take this one back for now pending some more discussion. What we need more urgently is the writeback cache flag, which is now implemented in qemu, patch following ASAP. OK, still catching up on mail. I'll push them out of the queue for now. Thanks, Rusty.
Re: [PATCH 1/1] virtio: adding __devexit to virtballoon_remove
Thanks, I already have this from Uwe. Cheers, Rusty.
Re: [PATCHv2] virtio: order used ring after used index read
On Sun, 25 Oct 2009 11:58:53 pm Michael S. Tsirkin wrote: On SMP guests, reads from the ring might bypass used index reads. This causes guest crashes because host writes to used index to signal ring data readiness. Fix this by inserting rmb before used ring reads. Thanks, good spotting. Applied. Rusty.
Re: [PATCH] virtio-net: fix data corruption with OOM
On Mon, 26 Oct 2009 03:33:40 am Michael S. Tsirkin wrote: virtio net used to unlink skbs from send queues on error, but ever since 48925e372f04f5e35fec6269127c62b2c71ab794 we do not do this. This causes guest data corruption and crashes with vhost since net core can requeue the skb or free it without it being taken off the list. This patch fixes this by queueing the skb after successful transmit.

I originally thought that this was racy: as soon as we do add_buf, we need to make sure we're ready for the callback (for virtio_pci, it's ->kick, but we shouldn't rely on that). So a comment would be nice. How's this?

Subject: virtio-net: fix data corruption with OOM
Date: Sun, 25 Oct 2009 19:03:40 +0200
From: Michael S. Tsirkin m...@redhat.com

virtio net used to unlink skbs from send queues on error, but ever since 48925e372f04f5e35fec6269127c62b2c71ab794 we do not do this. This causes guest data corruption and crashes with vhost since net core can requeue the skb or free it without it being taken off the list. This patch fixes this by queueing the skb after successful transmit.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Rusty Russell ru...@rustcorp.com.au (+ comment)
---
Rusty, here's a fix for another data corrupter I saw. This fixes a regression from 2.6.31, so definitely 2.6.32 I think. Comments?

 drivers/net/virtio_net.c | 8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -516,8 +516,7 @@ again:
 	/* Free up any pending old buffers before queueing new ones. */
 	free_old_xmit_skbs(vi);

-	/* Put new one in send queue and do transmit */
-	__skb_queue_head(&vi->send, skb);
+	/* Try to transmit */
 	capacity = xmit_skb(vi, skb);

 	/* This can happen with OOM and indirect buffers. */
@@ -531,8 +530,17 @@ again:
 		}
 		return NETDEV_TX_BUSY;
 	}
+	vi->svq->vq_ops->kick(vi->svq);

-	vi->svq->vq_ops->kick(vi->svq);
+	/*
+	 * Put new one in send queue.  You'd expect we'd need this before
+	 * xmit_skb calls add_buf(), since the callback can be triggered
+	 * immediately after that.  But since the callback just triggers
+	 * another call back here, normal network xmit locking prevents the
+	 * race.
+	 */
+	__skb_queue_head(&vi->send, skb);
+
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
 	nf_reset(skb);
Re: performance regression in virtio-net in 2.6.32-rc4
On Tue, 27 Oct 2009 05:18:35 am Michael S. Tsirkin wrote: Hi! I noticed a performance regression in virtio net: going from 2.6.31 to 2.6.32-rc4 I see this, for guest to host communication:

 Recv     Send     Send
 Socket   Socket   Message   Elapsed
 Size     Size     Size      Time      Throughput
 bytes    bytes    bytes     secs.     10^6bits/sec

 87380    16384    16384     10.20     7806.48
 ...
 87380    16384    16384     10.00     6814.60

Hmm, needs a bisect I'd say. Thanks, Rusty.
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking.

Hi Michael,

Now everyone else has finally kicked all the tires and it seems to pass, I've done a fairly complete review. Generally, it's really nice; just one bug and a few minor suggestions for polishing.

+/* Caller must have TX VQ lock */
+static void tx_poll_stop(struct vhost_net *net)
+{
+	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+		return;

likely? Really?

+	for (;;) {
+		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in,
+					 NULL, NULL);

Danger! You need an arg to vhost_get_vq_desc to tell it the max desc size you can handle. Otherwise, it's only limited by ring size, and a malicious guest can overflow you here, and below:

+		/* Skip header. TODO: support TSO. */
+		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
...
+	use_mm(net->dev.mm);
+	mutex_lock(&vq->mutex);
+	vhost_no_notify(vq);

I prefer a name like vhost_disable_notify().

+		/* OK, now we need to know about added descriptors. */
+		if (head == vq->num && vhost_notify(vq))
+			/* They could have slipped one in as we were doing that:
+			 * check again. */
+			continue;
+		/* Nothing new? Wait for eventfd to tell us they refilled. */
+		if (head == vq->num)
+			break;
+		/* We don't need to be notified again. */
+		vhost_no_notify(vq);

Similarly, vhost_enable_notify. This one is particularly misleading since it doesn't actually notify anything! In particular, this code would be neater as:

	if (head == vq->num) {
		if (vhost_enable_notify(vq)) {
			/* Try again, they could have slipped one in. */
			continue;
		}
		/* Nothing more to do. */
		break;
	}
	vhost_disable_notify(vq);

Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to return true only when there's more pending. Saves a loop around here most of the time. Also, the vhost_no_notify/vhost_disable_notify() can be moved out of the loop entirely. (It could be under an if (unlikely(enabled)), not sure if it's worth it).

+		len = err;
+		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);

That unsigned char * arg to memcpy_toiovec is annoying. A patch might be nice, separate from this effort.

+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+	int r;
+	if (!n)
+		return -ENOMEM;
+	f->private_data = n;
+	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
+	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;

I have a personal dislike of calloc for structures. In userspace, it's because valgrind can't spot uninitialized fields. These days a similar argument applies in the kernel, because we have KMEMCHECK now. If someone adds a field to the struct and forgets to initialize it, we can spot it.

+static void vhost_net_enable_vq(struct vhost_net *n, int index)
+{
+	struct socket *sock = n->vqs[index].private_data;

OK, I can't help but think that presenting the vqs as an array doesn't buy us very much. Esp. if you change vhost_dev_init to take a NULL-terminated varargs. I think readability would improve. It means passing a vq around rather than an index. Not completely sure it'll be a win tho.

+static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
+{
+	struct socket *sock, *oldsock = NULL;
...
+	sock = get_socket(fd);
+	if (IS_ERR(sock)) {
+		r = PTR_ERR(sock);
+		goto done;
+	}
+
+	/* start polling new socket */
+	oldsock = vq->private_data;
...
+done:
+	mutex_unlock(&n->dev.mutex);
+	if (oldsock) {
+		vhost_net_flush_vq(n, index);
+		fput(oldsock->file);

I dislike this style; I prefer multiple different goto points, one for when oldsock is set, and one for when it's not. That way, gcc warns us about uninitialized variables if we get it wrong.

+static long vhost_net_reset_owner(struct vhost_net *n)
+{
+	struct socket *tx_sock = NULL;
+	struct socket *rx_sock = NULL;
+	long r;

This should be called err, since that's what it is.

+static void vhost_net_set_features(struct vhost_net *n, u64 features)
+{
+	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
+		sizeof(struct virtio_net_hdr) : 0;
+	int i;
+	mutex_lock(&n->dev.mutex);
+	n->dev.acked_features = features;

Why is this called acked_features? Not just features? I expected to see
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
On Thu, 5 Nov 2009 03:55:42 am Paul E. McKenney wrote: On Wed, Nov 04, 2009 at 01:57:29PM +0200, Michael S. Tsirkin wrote: Can you ack this usage please? I thought I had done so in my paragraph above, but if you would like something a bit more formal... [snip verbose super-ack with qualifications] That's great guys. And yes, this is a kind of read-copy-update. And no, there's nothing wrong with it. But it's still nasty to use half an API. If it were a few places I would have open-coded it with a comment, or wrapped it. As it is, I don't think that would be a win. Cheers, Rusty.
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
On Sat, 7 Nov 2009 03:00:07 am Paul E. McKenney wrote: On Fri, Nov 06, 2009 at 03:31:20PM +1030, Rusty Russell wrote: But it's still nasty to use half an API. If it were a few places I would have open-coded it with a comment, or wrapped it. As it is, I don't think that would be a win. So would it help to have a rcu_read_lock_workqueue() and rcu_read_unlock_workqueue() that checked nesting and whether they were actually running in the context of a workqueue item? Or did you have something else in mind? Or am I misjudging the level of sarcasm in your reply? ;-) You read correctly. If we get a second user, creating an API makes sense. Thanks, Rusty.
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Sun, 8 Nov 2009 10:05:16 pm Michael S. Tsirkin wrote: On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote:

+/* Caller must have TX VQ lock */
+static void tx_poll_stop(struct vhost_net *net)
+{
+	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+		return;

likely? Really? Hmm ... yes. tx poll stop is called on each packet (as long as we do not fill up 1/2 the backend queue); the first call will stop polling, the rest check state and do nothing. This is because we normally do not care when the message has left the queue in the backend device: we tell the backend to send it and forget. We only start polling when the backend tx queue fills up. OK, good.

+static void vhost_net_set_features(struct vhost_net *n, u64 features)
+{
+	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
+		sizeof(struct virtio_net_hdr) : 0;
+	int i;
+	mutex_lock(&n->dev.mutex);
+	n->dev.acked_features = features;

Why is this called acked_features? Not just features? I expected to see code which exposed these back to userspace, and didn't. Not sure what you mean. Userspace sets them, why does it want to get them exposed back? There's something about the 'acked' which rubs me the wrong way. enabled_features is perhaps a better term than acked_features; acked seems more a user point-of-view, enabled seems more driver POV? set_features matches your ioctl names, but it sounds like a fn name :( It's marginal. And 'features' is shorter than both.

+	switch (ioctl) {
+	case VHOST_SET_VRING_NUM:

I haven't looked at your userspace implementation, but does a generic VHOST_SET_VRING_STATE/VHOST_GET_VRING_STATE with a struct make more sense? It'd be simpler here, Not by much though, right? but not sure if it'd be simpler to use? The problem is with VHOST_SET_VRING_BASE as well. I want it to be separate because I want to make it possible to relocate e.g. the used ring to another address while the ring is running.
This would be a good debugging tool (you look at kernel's used ring, check descriptor, then update guest's used ring) and also possibly an extra way to do migration. And it's nicer to have vring size separate as well, because it is initialized by host and never changed, right? Actually, this looks wrong to me:

+	case VHOST_SET_VRING_BASE:
...
+		vq->avail_idx = vq->last_avail_idx = s.num;

The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx. We could merge DESC, AVAIL, USED, and it will reduce the amount of code in userspace. With both base, size and fds separate, it seemed a bit more symmetrical to have desc/avail/used separate as well. What's your opinion? Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT. For future reference, this is *exactly* the kind of thing which would have been nice as a followup patch. Easy to separate, easy to review, not critical to the core. Yes. It's not too late to split it out though: should I do it yet? Only if you're feeling enthused. It's lightly reviewed now. Cheers, Rusty.

virtio: put last_used and last_avail index into ring itself.

Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed.
For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand: the ring is always a whole number of pages and there's hundreds of bytes of padding after the avail ring and the used ring, whatever the number of descriptors (which must be a power of 2). We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c | 23 +++
 include/linux/virtio_ring.h  | 12 +++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
 	/* Number we've added since last sync. */
 	unsigned int num_added;

-	/* Last
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Mon, 9 Nov 2009 05:40:32 pm Michael S. Tsirkin wrote: On Mon, Nov 9, 2009 at 8:17 AM, Rusty Russell ru...@rustcorp.com.au wrote: There's something about the 'acked' which rubs me the wrong way. enabled_features is perhaps a better term than acked_features; acked seems more a user point-of-view, enabled seems more driver POV? Hmm. Are you happy with the ioctl name? If yes I think being consistent with that is important. I think in my original comments I noted that I preferred GET / SET, rather than GET/ACK. Actually, this looks wrong to me:

+	case VHOST_SET_VRING_BASE:
...
+		vq->avail_idx = vq->last_avail_idx = s.num;

The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. Exactly. That's what VHOST_GET/SET_VRING_BASE does. avail_idx is just a cached value for notify on empty, so what this does is clear the value. Ah, you actually refresh it every time anyway. Hmm, could you do my poor brain a favor and either just get_user it in vhost_trigger_irq(), or call it 'cached_avail_idx' or something? The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. Yes, I remember that patch. There seems to be little point though, at this stage. Well, it avoids this ioctl, by exposing all the state. We may well need it later, to expand the ring in other ways. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx.
avail_idx is there for notify on empty: I had this thought that it's better to leave the avail cache line alone when we are triggering interrupt to avoid bouncing it around if guest is updating it meanwhile on another CPU, and I think my testing showed that it helped performance, but could be a mistake. You don't believe this can help? I believe it could help, but this is YA case where it would have been nice to have a dumb basic patch and this as a patch on top. But I am going to ask you to re-run that measurement, see if it stacks up (because it's an interesting lesson if it does..) Thanks! Rusty.
Re: [PATCHv9 3/3] vhost_net: a kernel-level virtio server
On Tue, 10 Nov 2009 03:52:30 am Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. Thanks, applied. Will be in tomorrow's linux-next. Cheers, Rusty.
Re: [PATCHv9 3/3] vhost_net: a kernel-level virtio server
One fix:

vhost: fix TUN=m VHOST_NET=y

drivers/built-in.o: In function `get_tun_socket':
net.c:(.text+0x15436e): undefined reference to `tun_get_socket'

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/vhost/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -1,6 +1,6 @@
 config VHOST_NET
 	tristate "Host kernel accelerator for virtio net (EXPERIMENTAL)"
-	depends on NET && EVENTFD && EXPERIMENTAL
+	depends on NET && EVENTFD && TUN && EXPERIMENTAL
 	---help---
 	  This kernel module can be loaded in host kernel to accelerate
 	  guest networking with virtio_net. Not to be confused with virtio_net
Re: [PATCHv9 3/3] vhost_net: a kernel-level virtio server
On Tue, 10 Nov 2009 10:06:37 pm Michael S. Tsirkin wrote: If tun is a module, vhost must be a module, too. If tun is built-in or disabled, vhost can be built-in. I really like the brainbending :) Keeps readers on their toes... Applied, Rusty.
Re: [PATCH 3/3] Split the KVM pv-ops support by feature
On Wed, 18 Nov 2009 10:43:12 am Alexander Graf wrote: Currently selecting KVM guest support enables multiple features at once that not everyone necessarily wants to have, namely: These patches make perfect sense, but please make sure Jeremy Fitzhardinge (CC'd) is in the loop, as he split the structs in the first place. Thanks! Rusty.
Re: [PATCH 0/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Fri, 20 Nov 2009 04:39:19 pm Shirley Ma wrote: Guest virtio_net receives packets from its pre-allocated vring buffers, then it delivers these packets to upper layer protocols as skb buffs. So it's not necessary to pre-allocate an skb for each mergeable buffer, then free it when it's useless. This patch defers skb allocation when receiving packets for both big packets and mergeable buffers. It reduces skb pre-allocations and skb frees. Based on Michael & Avi's suggestion. A destroy function has been created to push virtio free buffs to the vring for unused pages, and page private is used to maintain the page list. I didn't touch small packet skb allocation to avoid extra copies for small packets. This patch was tested and measured against 2.6.32-rc5 git. It is built against the 2.6.32-rc7 kernel. Tests have been done for small packets, big packets and mergeable buffers. The single netperf TCP_STREAM performance improved for host to guest. It also reduces the UDP packet drop rate. The netperf laptop results were:

mtu=1500 netperf -H xxx -l 120
                w/o patch       w/i patch (two runs)
guest to host:  3336.84Mb/s     3730.14Mb/s ~ 3582.88Mb/s
host to guest:  3165.10Mb/s     3370.39Mb/s ~ 3407.96Mb/s

Nice! Is this using mergeable_rx_bufs? Or just big_packets? I'd like to drop big packet support from our driver, but I don't know how many kvm hosts will not offer mergable rx bufs yet. Anthony? Thanks, Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Sat, 21 Nov 2009 02:51:41 am Shirley Ma wrote: Signed-off-by: Shirley Ma (x...@us.ibm.com)

Hi Shirley,

This patch seems like a good idea, but it needs some work. As this is the first time I've received a patch from you, I will go through it in extra detail. (As virtio maintainer, it's my responsibility to include this patch, or not).

-static void give_a_page(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct virtnet_info *vi, struct page *page)
 {
-	page->private = (unsigned long)vi->pages;
+	struct page *npage = (struct page *)page->private;
+
+	if (!npage)
+		page->private = (unsigned long)vi->pages;
+	else {
+		/* give a page list */
+		while (npage) {
+			if (npage->private == (unsigned long)0) {
+				npage->private = (unsigned long)vi->pages;
+				break;
+			}
+			npage = (struct page *)npage->private;
+		}
+	}
 	vi->pages = page;

The loops should cover both cases of the if branch. Either way, we want to find the last page in the list so you can set that ->private pointer to vi->pages. Also, the == (unsigned long)0 is a little verbose. How about:

	struct page *end;

	/* Find end of list, sew whole thing into vi->pages. */
	for (end = page; end->private; end = (struct page *)end->private);
	end->private = (unsigned long)vi->pages;
	vi->pages = page;

+void virtio_free_pages(void *buf)

This is a terrible name; it's specific to virtio_net, plus it should be static. Just free_pages should be sufficient here I think.

+{
+	struct page *page = (struct page *)buf;
+	struct page *npage;
+
+	while (page) {
+		npage = (struct page *)page->private;
+		__free_pages(page, 0);
+		page = npage;
+	}

This smells a lot like a for loop to me?

	struct page *i, *next;

	for (i = buf; next; i = next) {
		next = (struct page *)i->private;
		__free_pages(i, 0);
	}

+static int set_skb_frags(struct sk_buff *skb, struct page *page,
+			 int offset, int len)

A better name might be skb_add_frag()?
+static void receive_skb(struct net_device *dev, void *buf, unsigned len)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
-	int err;
+	struct skb_vnet_hdr *hdr;
+	struct sk_buff *skb;
 	int i;

 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
@@ -132,39 +173,71 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
 		goto drop;
 	}
-	if (vi->mergeable_rx_bufs) {
-		unsigned int copy;
-		char *p = page_address(skb_shinfo(skb)->frags[0].page);
+	if (!vi->mergeable_rx_bufs && !vi->big_packets) {
+		skb = (struct sk_buff *)buf;

This cast is unnecessary, but a comment would be nice:

	/* Simple case: the pointer is the 1514-byte skb */

+	} else {

And a matching comment here:

	/* The pointer is a chain of pages. */

+		struct page *page = (struct page *)buf;
+		int copy, hdr_len, num_buf, offset;
+		char *p;
+
+		p = page_address(page);
-		if (len > PAGE_SIZE)
-			len = PAGE_SIZE;
-		len -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
+		skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN);
+		if (unlikely(!skb)) {
+			dev->stats.rx_dropped++;

Does this mean we leak all those pages? Shouldn't we give_pages() here?

+			return;
+		}
+		skb_reserve(skb, NET_IP_ALIGN);
+		hdr = skb_vnet_hdr(skb);
-		memcpy(&hdr->mhdr, p, sizeof(hdr->mhdr));
-		p += sizeof(hdr->mhdr);
+		if (vi->mergeable_rx_bufs) {
+			hdr_len = sizeof(hdr->mhdr);
+			memcpy(&hdr->mhdr, p, hdr_len);
+			num_buf = hdr->mhdr.num_buffers;
+			offset = hdr_len;
+			if (len > PAGE_SIZE)
+				len = PAGE_SIZE;
+		} else {
+			/* big packets 6 bytes alignment between virtio_net
+			 * header and data */
+			hdr_len = sizeof(hdr->hdr);
+			memcpy(&hdr->hdr, p, hdr_len);
+			offset = hdr_len + 6;

Really? I can't see where the current driver does this 6 byte offset. There should be a header, then the packet... Ah, I see below that you set things up this way. The better way to do this is to create a new structure to use in various places.

	struct padded_vnet_hdr {
		struct virtio_net_hdr hdr;
		/*
Re: [PATCH] vhost: Fix warnings and bad type handling
On Sun, 22 Nov 2009 08:58:01 pm Michael S. Tsirkin wrote: From: Alan Cox a...@linux.intel.com Subject: [PATCH] vhost: fix warnings on 32 bit systems Fix compiler warning about discarding the top 32 bits of data on 32 bit systems, and document that discarded bits must be 0. Signed-off-by: Alan Cox a...@linux.intel.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- So I think the below slightly tweaked version of Alan's patch is a bit better. OK? Thanks, applied. Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote: + skb = (struct sk_buff *)buf; This cast is unnecessary, but a comment would be nice: Without this cast there is a compile warning. Hi Shirley, Looks like buf is a void *, so no cast should be necessary. But I could be reading the patch wrong. However, I question whether making it 16 bytes is the right thing: the ethernet header is 14 bytes long, so don't we want 8 bytes of padding? Because QEMU requires the 10-byte header in a separate buffer, one page is used to share between the virtio_net_hdr header (the 10-byte head) and the rest of the data. So I put a 6-byte offset here between the two buffers. I didn't look at the reason why a separate buf is used for virtio_net_hdr in QEMU. It's a qemu bug. It insists the header be an element in the scatterlist by itself. Unfortunately we have to accommodate it. However, there's no reason for the merged rxbuf and big packet layout to be the same: for the big packet case you should lay it out as follows:

#define BIG_PACKET_PAD (NET_SKB_PAD - sizeof(struct virtio_net_hdr) + NET_IP_ALIGN)

struct big_packet_page {
	struct virtio_net_hdr hdr;
	char pad[BIG_PACKET_PAD];
	/* Actual packet data starts here */
	unsigned char data[PAGE_SIZE - BIG_PACKET_PAD - sizeof(struct virtio_net_hdr)];
};

Then use this type as the template for registering the sg list for the big packet case. I think you can move the memcpy outside the if, like so: memcpy(hdr, p, hdr_len); I didn't move it out, because num_buf = hdr->mhdr.num_buffers; Yes, that has to stay inside, but the memcpy can move out. It's just a bit neater to have more common code. Thanks, Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Wed, 25 Nov 2009 01:06:32 am Anthony Liguori wrote: So does lguest. It's been that way since the beginning. Fixing this would result in breaking older guests. Agreed, we can't fix it in the guests, but it's a wart. That's why I haven't bothered fixing it, but if someone else wants to I'll cheer all the way. lguest did it because I knew I could fix lguest any time; it was a bad mistake and I will now fix lguest :) We really need to introduce a feature bit if we want to change this. I don't think it's worth it. But the spec does say that the implementation should not rely on the framing (I think there's a note that current implementations are buggy tho, so you should frame it separately anyway). That way future devices can get it right, at least. Thanks, Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Wed, 25 Nov 2009 07:45:30 pm Michael S. Tsirkin wrote: Hmm, is it really worth it to save a header copy if it's linear? We are going to access it anyway, and it fits into one cacheline nicely. On the other hand we have more code making life harder for compiler and processor. Not sure: I think there would be many places where it would be useful. We do a similar thing in the kernel to inspect non-linear packets, and it's served us well. Cheers, Rusty.
Re: [PATCH] virtio: set pci bus master enable bit
On Mon, 30 Nov 2009 02:22:01 am Michael S. Tsirkin wrote: As all virtio devices perform DMA, we must enable bus mastering for them to be spec compliant. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_pci.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index 28d9cf7..717bae1 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -648,6 +648,7 @@ static int __devinit virtio_pci_probe(struct pci_dev *pci_dev, goto out_req_regions; pci_set_drvdata(pci_dev, vp_dev); + pci_set_master(pci_dev); I can believe this, but I have no idea if it's right. I've applied it, and hope Jesse will comment if there's something wrong with it. Thanks, Rusty.
Re: [PATCH] virtio spec: add virtio-blk max sectors feature
On Thu, 3 Dec 2009 08:28:38 pm Avi Kivity wrote: On 12/03/2009 10:42 AM, Avishay Traeger1 wrote: I previously submitted a patch to have the guest virtio-blk driver get the value for the maximum I/O size from the host bdrv, rather than assume that there is no limit. Avi requested that I first patch the virtio spec (http://ozlabs.org/~rusty/virtio-spec/). Below is that patch. Please CC me on replies, as I am not subscribed. Copying Rusty and virtualizat...@. Thanks Avi... Avishay; this would be the total sectors in an I/O, as separate from SIZE_MAX (maximum size of any single scatterlist entry) and SEG_MAX (maximum number of scatterlist entries)? Seems like a reasonable idea; esp if you need it. Thanks! Rusty.
Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio
On Fri, 11 Dec 2009 11:03:25 pm Shirley Ma wrote: Signed-off-by: x...@us.ibm.com

Hi Shirley, These patches look quite close. More review to follow :) This title needs revision. It should start with virtio: (all the virtio patches do, for easy identification after merge), eg: Subject: virtio: Add destroy callback for unused buffers It also needs a commit message which explains the problem and the solution. Something like this: There's currently no way for a virtio driver to ask for unused buffers, so it has to keep a list itself to reclaim them at shutdown. This is redundant, since virtio_ring stores that information. So add a new hook to do this: virtio_net will be the first user.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c708ecc..bb5eb7b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -107,6 +107,16 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 	return p;
 }

+static void virtio_net_free_pages(void *buf)
+{
+	struct page *page, *next;
+
+	for (page = buf; page; page = next) {
+		next = (struct page *)page->private;
+		__free_pages(page, 0);
+	}
+}
+
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;

This belongs in one of the future patches: it will cause an unused warning and is logically not part of this patch anyway.

+static int vring_destroy_bufs(struct virtqueue *_vq, void (*destroy)(void *))
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	void *buf;
+	unsigned int i;
+	int freed = 0;

unsigned int for return and freed counter? Means changing prototype, but a negative return isn't useful here.

+
+	START_USE(vq);
+
+	for (i = 0; i < vq->vring.num; i++) {
+		if (vq->data[i]) {
+			/* detach_buf clears data, so grab it now. */
+			buf = vq->data[i];
+			detach_buf(vq, i);
+			destroy(buf);
+			freed++;

You could simplify this a bit, since the queue must be inactive:

			destroy(vq->data[i]);
			detach_buf(vq, i);
			freed++;

+		}
+	}
+
+	END_USE(vq);
+	return freed;

Perhaps add a:

	/* That should have freed everything. */
	BUG_ON(vq->num_free != vq->vring.num);

 	void (*disable_cb)(struct virtqueue *vq);
 	bool (*enable_cb)(struct virtqueue *vq);
+	int (*destroy_bufs)(struct virtqueue *vq, void (*destory)(void *));

destory typo :) Cheers, Rusty.
Re: [PATCH v2 2/4] Defer skb allocation -- new skb_set calls chain pages in virtio_net
On Fri, 11 Dec 2009 11:13:02 pm Shirley Ma wrote: Signed-off-by: Shirley Ma x...@us.ibm.com I don't think there's a good way of splitting this change across multiple patches. And I don't think this patch will compile; I don't think we can get rid of trim_pages yet. We *could* first split the receive paths into multiple parts as you have (eg. add_recvbuf_big etc), then actually rewrite them, but that's too much work to refactor the code twice. So just roll all the driver changes into one patch; so you will have two patches: one which creates the destroy_bufs API for virtio, and one which uses it in the virtio_net driver.

+static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct page **page,
+				    unsigned int *len)

This actually allocates an skb, and transfers the first page to it (via copy and possibly the first fragment). skb_from_page() perhaps?

+		hdr_len = sizeof(hdr->hdr);
+		offset = sizeof(struct padded_vnet_hdr);
+	}
+
+	memcpy(hdr, p, hdr_len);
+
+	*len -= hdr_len;

Perhaps you should return NULL here as an error if *len was < hdr_len? Thanks, Rusty.
Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio
On Wed, 16 Dec 2009 09:10:02 am Michael S. Tsirkin wrote: On Wed, Dec 16, 2009 at 09:06:12AM +1030, Rusty Russell wrote: On Tue, 15 Dec 2009 06:52:53 am Michael S. Tsirkin wrote: On Mon, Dec 14, 2009 at 12:08:05PM -0800, Shirley Ma wrote: Hello Michael, I agree with the comments (will have two patches instead of 4 based on Rusty's comments) except below one. On Sun, 2009-12-13 at 12:26 +0200, Michael S. Tsirkin wrote: That said - do we have to use a callback? I think destroy_buf which returns a data pointer, and which we call repeatedly until we get NULL or an error, would be a better, more flexible API. This is not critical though. The reason to use this is because in virtio_net remove, it has BUG_ON(vi->num != 0), which will be consistent with small skb packet. If we use NULL, error then we lose the track for vi->num, since we don't know how many buffers have been passed to ULPs or are still unused. Thanks Shirley I don't insist, but my idea was:

	for (;;) {
		b = vq->destroy(vq);
		if (!b)
			break;
		--vi->num;
		put_page(b);
	}

In this case it should be called get_unused_buf or something. But I like Shirley's approach here; destroy (with callback) accurately reflects the only time this can be validly used. Cheers, Rusty. I guess the actual requirement is that the device must be inactive. Technically, the vq has to be inactive. (This distinction may matter for the multiport virtio_console work.) As I said this is fine with me as well. But I think the callback should get a vq pointer besides the data pointer, so that it can e.g. find the device if it needs to. In case of virtio net this makes it possible to decrement the outstanding skb counter in the callback. Makes sense? Sure. I don't really mind either way, and I'm warming to the name detach_buf :) Cheers, Rusty.
Re: [PATCH] vhost: fix high 32 bit in FEATURES ioctls
On Tue, 22 Dec 2009 10:09:33 pm Michael S. Tsirkin wrote: From: David Stevens dlstev...@us.ibm.com Subject: vhost: fix high 32 bit in FEATURES ioctls Thanks, applied. Rusty.
Re: [PATCH/RFC] virtio_blk: check for hardsector size from host
On Tuesday 27 May 2008 19:04:59 Christian Borntraeger wrote: Rusty, Jens, I need your opinion on the following patch. It seems to work, but I would like to get some feedback if this patch is the right approach: Looks like the right approach to me. Don't know about the block side of it... Just that u64 seems like overkill: u32? Cheers, Rusty.
Re: [PATCH] virtio_config: fix len calculation of config elements
On Thursday 29 May 2008 19:08:01 Christian Borntraeger wrote: v is a pointer, so we have to use sizeof(*v) instead of sizeof(v). How embarrassing. Applied, thanks. Rusty.
Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
On Tuesday 17 June 2008 00:02:55 Anthony Liguori wrote: There's nothing that prevents zero-copy to be implemented for tun without vringfd. In fact, I seem to recall that your earlier patches implemented zero-copy :-) They didn't actually work. You need to block until the data isn't being used any more (thread pool anyone?), or implement an aio interface. I like the vringfd model and I think it's a good way to move forward. My concern is that it introduces an extra syscall in the TX path. Right now, we do a single write call whereas with vringfd we need to insert the TX packet into the queue, do a notify, and then wait for indication that the TX has succeeded. If the guest wants notification of xmit, yes you need another syscall for that. But it often doesn't (note: the current vring tun ignored the NO_NOTIFY flag, but one thing at a time). I know we'll win with TSO but we don't need vringfd for TSO. The jury's still out IMHO as to whether we should do vringfd or just try to merge TSO tun patches. Note that we can do TSO in userspace, too. No syscall reduction, but a VM exit reduction. Cheers, Rusty.
Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
On Friday 25 July 2008 09:22:53 Dor Laor wrote: Mark McLoughlin wrote:

	vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
	qemu_del_timer(n->tx_timer);
	n->tx_timer_active = 0;

As stated by newer messages, we should handle the first tx notification if the timer wasn't active to shorten latency. Cheers, Dor Here's what lguest does at the moment. Basically, we cut the timeout a tiny bit each time, until we get *fewer* packets than last time. Then we bump it up again. Rough, but seems to work (it should be a per-device var of course, not a static).

@@ -921,6 +922,7 @@ static void handle_net_output(int fd, st
 	unsigned int head, out, in, num = 0;
 	int len;
 	struct iovec iov[vq->vring.num];
+	static int last_timeout_num;

 	if (!timeout)
 		net_xmit_notify++;
@@ -941,6 +943,14 @@ static void handle_net_output(int fd, st
 	/* Block further kicks and set up a timer if we saw anything. */
 	if (!timeout && num)
 		block_vq(vq);
+
+	if (timeout) {
+		if (num < last_timeout_num)
+			timeout_usec += 10;
+		else if (timeout_usec > 1)
+			timeout_usec--;
+		last_timeout_num = num;
+	}
 }
Re: [PATCH 0/9][RFC] KVM virtio_net performance
On Saturday 26 July 2008 19:45:36 Avi Kivity wrote: Mark McLoughlin wrote: Hey, Here's a bunch of patches attempting to improve the performance of virtio_net. This is more an RFC rather than a patch submission since, as can be seen below, not all patches actually improve the performance measurably. I've tried hard to test each of these patches with as stable and informative a benchmark as I could find. The first benchmark is a netperf[1] based throughput benchmark and the second uses a flood ping[2] to measure latency differences. Each set of figures is min/average/max/standard deviation. The first set is Gb/s and the second is milliseconds. The network configuration used was very simple - the guest with a virtio_net interface and the host with a tap interface and static IP addresses assigned to both - e.g. there was no bridge in the host involved and iptables was disabled in both the host and guest. I used: 1) kvm-71-26-g6152996 with the patches that follow 2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from 219:bbd2611289c5 applied; these are the patches that have just been submitted to Linus The conclusions I draw are: 1) The length of the tx mitigation timer makes quite a difference to the throughput achieved; we probably need a good heuristic for adjusting this on the fly. The tx mitigation timer is just one part of the equation; the other is the virtio ring window size, which is now fixed. Using a maximum sized window is good when the guest and host are running flat out, doing nothing but networking. When throughput drops (because the guest is spending cpu on processing, or simply because the other side is not keeping up), we need to drop the window size so as to retain acceptable latencies. The tx timer can then be set to a bit after the end of the window, acting as a safety belt in case the throughput changes. Interestingly, I played a little with a threshold patch.
Unfortunately it seemed to just add YA variable to the mix; it'd take real research to figure out how to adjust the threshold and timeout values appropriately for different situations. This isn't too good. Low latency is important for nfs clients (or other request/response workloads). I think we can keep these low by adjusting the virtio window (for example, on an idle system it should be 1), so that the tx mitigation timer only fires when the workload transitions from throughput to request/response. From reading this thread, this seems like an implementation bug. Don't suppress notifications on an empty ring (ie. threshold will be 1 when nothing is happening). Rusty.
Re: [PATCH 0/9][RFC] KVM virtio_net performance
On Saturday 26 July 2008 19:45:36 Avi Kivity wrote: 5) Eliminating an extra copy on the host->guest path only makes a barely measurable difference. That's expected on a host->guest test. Zero copy is mostly important for guest->external, and with zerocopy already enabled in the guest (sendfile or nfs server workloads). Note that this will also need the copyless tun patches. If we're doing one copy from userspace->kernel anyway, this extra copy is probably lost in the noise. Cheers, Rusty.
Re: [PATCH] Work around dhclient brokenness
On Monday 18 August 2008 21:44:25 Herbert Xu wrote: On Mon, Aug 18, 2008 at 02:40:55PM +0300, Avi Kivity wrote: Isn't that turned on automatically for real hardware? And what's to prevent a broken dhclient together with the (presumably) hacked up initscripts that call ethtool? Well the idea is that only a fixed guest would even know about enabling this. For those not following closely: We already have a method for the guest to accept or reject features. Our problem is that the guest is already accepting the CSUM feature: but one critical userspace app (dhcp-client) can't actually handle it due to a bug. The proposal is to add another mechanism, whereby the host doesn't advertise CSUM, but advertises a new CSUM2 feature. The driver doesn't accept this by default: then guest userspace says hey, I *really can* handle CSUM. This would have to be done by resetting the device in the ethtool callback (that's how we renegotiate features). And guests need a special virtio hack in their init scripts. This leaves the small number of current users without CSUM (and hence GSO etc). Yet they might not use dhcp with bridging anyway. Worst of all, we have to document this embarrassing workaround. Neither solution is good. But I don't think Anthony's hack looks so bad after this. Rusty.
Re: [PATCH] Work around dhclient brokenness
On Tuesday 19 August 2008 13:17:57 Chris Wedgwood wrote: On Tue, Aug 19, 2008 at 10:45:20AM +1000, Rusty Russell wrote: For those not following closely: We already have a method for the guest to accept or reject features. Our problem is that the guest is already accepting the CSUM feature: but one critical userspace app (dhcp-client) can't actually handle it due to a bug. Can't we just get dhcp-client fixed upstream and let the distros update in a couple of months? Herbert has already fixed the client, but it seems upstream hasn't released yet. The proposal is to add another mechanism, whereby the host doesn't advertise CSUM, but advertises a new CSUM2 feature. The driver doesn't accept this by default: then guest userspace says hey, I *really can* handle CSUM. This would have to be done by resetting the device in the ethtool callback (that's how we renegotiate features). And guests need a special virtio hack in their init scripts. If guests can have modified scripts for virtio and what not, they can have a fixed dhcp-client. They need to do both. This way if they don't, it still works, but networking is at a penalty (no CSUM offload). Rusty.
Re: [PATCH] Work around dhclient brokenness
On Tuesday 19 August 2008 15:17:08 Herbert Xu wrote: On Mon, Aug 18, 2008 at 10:13:55PM -0700, Chris Wedgwood wrote: CSUM2 sounds so ugly though. Features seem to get added and never removed; how about if this had a documented short lifetime (if it really must go in)? All we need is a simple toggle to disable checksum offload. Every NIC that offers receive checksum offload allows it to be disabled. virtio shouldn't be any different. Resetting the NIC seems a bit over the top. Not really. We could extend the protocol, but that's currently how feature negotiation works: you can't do it while the device is live. That seemed simplest. I learnt from Xen :) (Of course, we don't need to *disable* it, we need to *enable* it). Cheers, Rusty.
Re: [PATCH] Work around dhclient brokenness
On Tuesday 19 August 2008 15:28:40 Chris Wedgwood wrote: On Tue, Aug 19, 2008 at 03:17:08PM +1000, Herbert Xu wrote: All we need is a simple toggle to disable checksum offload. Every NIC that offers receive checksum offload allows it to be disabled. virtio shouldn't be any different. So why CSUM2 and not an ethtool interface then? We need both. CSUM2 is the new virtio-level feature. The ethtool interface allows you to toggle it (we don't support that currently, but that's simply slackness). Rusty.
Re: [PATCH] Work around dhclient brokenness
On Wednesday 20 August 2008 01:05:55 Chris Wedgwood wrote: On Tue, Aug 19, 2008 at 07:10:44PM +1000, Rusty Russell wrote: We need both. CSUM2 is the new virtio-level feature. Perhaps that's what I'm missing. How is this different to CSUM? Because current guests which say they can handle CSUM can't *really* handle it. That's the entire point. We need a new one, so that existing guests don't ack it (because they don't understand it), and new guests can ack it later. Hope that clarifies, Rusty.
Re: [PATCH] [v2] linux: virtio: Standardize virtio's concept of page size
On Tuesday 11 November 2008 10:07:09 Hollis Blanchard wrote: Both sides of the virtio interface must agree about how big a pfn really is. This is particularly an issue on architectures where the page size is configurable (e.g. PowerPC, IA64) -- the interface must be independent of PAGE_SHIFT. Currently there are three distinct problems: * The shift count used when passing the physical address of the ring to a PCI-based back end. * The ring layout itself is padded to span at least two pages. * The balloon driver operates in units of pages. Hi Hollis, The more I thought about this, the more I think we're not solving this as neatly as we could. The trigger was noting that we're breaking the userspace API (vring_size and vring_init are exposed to userspace): I know that qemu cut & pastes, but that's no excuse. So instead, I've introduced separate constants for each use. Yes, all these constants are 12/4096. But just to be contrary, at the end is a patch to change lguest to 128. And there's no reason this couldn't change in future using some guest detection scheme. Patch stream to follow... Rusty.
[PATCH 4/7] virtio: use LGUEST_VRING_ALIGN instead of relying on pagesize
This doesn't really matter, since lguest is i386 only at the moment, but we could actually choose a different value. (lguest doesn't have a guaranteed ABI). Signed-off-by: Rusty Russell [EMAIL PROTECTED] ---

 Documentation/lguest/lguest.c   |    6 +++---
 drivers/lguest/lguest_device.c  |    2 +-
 include/linux/lguest_launcher.h |    2 ++
 3 files changed, 6 insertions(+), 4 deletions(-)

diff -r c6744f753314 Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c	Wed Nov 12 22:02:52 2008 +1030
+++ b/Documentation/lguest/lguest.c	Wed Nov 12 22:11:03 2008 +1030
@@ -1030,7 +1030,7 @@
 		/* Zero out the virtqueues. */
 		for (vq = dev->vq; vq; vq = vq->next) {
 			memset(vq->vring.desc, 0,
-			       vring_size(vq->config.num, getpagesize()));
+			       vring_size(vq->config.num, LGUEST_VRING_ALIGN));
 			lg_last_avail(vq) = 0;
 		}
 	} else if (dev->desc->status & VIRTIO_CONFIG_S_FAILED) {
@@ -1211,7 +1211,7 @@
 	void *p;

 	/* First we need some memory for this virtqueue. */
-	pages = (vring_size(num_descs, getpagesize()) + getpagesize() - 1)
+	pages = (vring_size(num_descs, LGUEST_VRING_ALIGN) + getpagesize() - 1)
 		/ getpagesize();
 	p = get_pages(pages);
@@ -1228,7 +1228,7 @@
 	vq->config.pfn = to_guest_phys(p) / getpagesize();

 	/* Initialize the vring. */
-	vring_init(&vq->vring, num_descs, p, getpagesize());
+	vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN);

 	/* Append virtqueue to this device's descriptor. We use
 	 * device_config() to get the end of the device's current virtqueues;

diff -r c6744f753314 drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c	Wed Nov 12 22:02:52 2008 +1030
+++ b/drivers/lguest/lguest_device.c	Wed Nov 12 22:11:03 2008 +1030
@@ -250,7 +250,7 @@
 	/* Figure out how many pages the ring will take, and map that memory */
 	lvq->pages = lguest_map((unsigned long)lvq->config.pfn << PAGE_SHIFT,
 				DIV_ROUND_UP(vring_size(lvq->config.num,
-							PAGE_SIZE),
+							LGUEST_VRING_ALIGN),
 					     PAGE_SIZE));
 	if (!lvq->pages) {
 		err = -ENOMEM;

diff -r c6744f753314 include/linux/lguest_launcher.h
--- a/include/linux/lguest_launcher.h	Wed Nov 12 22:02:52 2008 +1030
+++ b/include/linux/lguest_launcher.h	Wed Nov 12 22:11:03 2008 +1030
@@ -59,4 +59,6 @@
 	LHREQ_IRQ, /* + irq */
 	LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
 };
+
+#define LGUEST_VRING_ALIGN	4096
 #endif /* _LINUX_LGUEST_LAUNCHER */
[PATCH 2/7] virtio: rename 'pagesize' arg to vring_init/vring_size
It's really the alignment desired for consumer/producer separation; historically this was x86 pagesize, but with PowerPC it'll still be x86 pagesize. And in theory lguest could choose a different value. Signed-off-by: Rusty Russell [EMAIL PROTECTED] ---

 include/linux/virtio_ring.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff -r c6761450d706 include/linux/virtio_ring.h
--- a/include/linux/virtio_ring.h	Wed Nov 12 21:47:57 2008 +1030
+++ b/include/linux/virtio_ring.h	Wed Nov 12 21:51:09 2008 +1030
@@ -83,7 +83,7 @@
  *	__u16 avail_idx;
  *	__u16 available[num];
  *
- *	// Padding to the next page boundary.
+ *	// Padding to the next align boundary.
  *	char pad[];
  *
  *	// A ring of used descriptor heads with free-running index.
@@ -93,19 +93,19 @@
  * };
  */
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
-			      unsigned long pagesize)
+			      unsigned long align)
 {
 	vr->num = num;
 	vr->desc = p;
 	vr->avail = p + num*sizeof(struct vring_desc);
-	vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + pagesize-1)
-			    & ~(pagesize - 1));
+	vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + align-1)
+			    & ~(align - 1));
 }

-static inline unsigned vring_size(unsigned int num, unsigned long pagesize)
+static inline unsigned vring_size(unsigned int num, unsigned long align)
 {
 	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (2 + num)
-		 + pagesize - 1) & ~(pagesize - 1))
+		 + align - 1) & ~(align - 1))
 		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
 }
[PATCH 3/7] virtio: Don't use PAGE_SIZE for vring alignment in virtio_pci.
That doesn't work for non-4k guests which are now appearing. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- drivers/virtio/virtio_pci.c |5 +++-- include/linux/virtio_pci.h |4 2 files changed, 7 insertions(+), 2 deletions(-) diff -r c6f6a3ab173b drivers/virtio/virtio_pci.c --- a/drivers/virtio/virtio_pci.c Wed Nov 12 21:51:09 2008 +1030 +++ b/drivers/virtio/virtio_pci.c Wed Nov 12 21:57:25 2008 +1030 @@ -216,7 +216,7 @@ struct virtio_pci_device *vp_dev = to_vp_device(vdev); struct virtio_pci_vq_info *info; struct virtqueue *vq; - unsigned long flags; + unsigned long flags, size; u16 num; int err; @@ -237,7 +237,8 @@ info-queue_index = index; info-num = num; - info-queue = kzalloc(PAGE_ALIGN(vring_size(num,PAGE_SIZE)), GFP_KERNEL); + size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); + info-queue = kzalloc(size, GFP_KERNEL); if (info-queue == NULL) { err = -ENOMEM; goto out_info; diff -r c6f6a3ab173b include/linux/virtio_pci.h --- a/include/linux/virtio_pci.hWed Nov 12 21:51:09 2008 +1030 +++ b/include/linux/virtio_pci.hWed Nov 12 21:57:25 2008 +1030 @@ -57,4 +57,8 @@ /* How many bits to shift physical queue address written to QUEUE_PFN. * 12 is historical, and due to x86 page size. */ #define VIRTIO_PCI_QUEUE_ADDR_SHIFT12 + +/* The alignment to use between consumer and producer parts of vring. + * x86 pagesize again. */ +#define VIRTIO_PCI_VRING_ALIGN 4096 #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/7] virtio: use KVM_S390_VIRTIO_RING_ALIGN instead of relying on pagesize
This doesn't really matter, since s390 pagesize is 4k anyway. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- arch/s390/include/asm/kvm_virtio.h |3 +++ drivers/s390/kvm/kvm_virtio.c |9 ++--- 2 files changed, 9 insertions(+), 3 deletions(-) diff -r 5c9c0cc58458 arch/s390/include/asm/kvm_virtio.h --- a/arch/s390/include/asm/kvm_virtio.hWed Nov 12 22:05:59 2008 +1030 +++ b/arch/s390/include/asm/kvm_virtio.hWed Nov 12 22:10:29 2008 +1030 @@ -50,6 +50,9 @@ #define KVM_S390_VIRTIO_RESET 1 #define KVM_S390_VIRTIO_SET_STATUS 2 +/* Overkill: we have the consumer and producer writing to separate pages. */ +#define KVM_S390_VIRTIO_RING_ALIGN 4096 + #ifdef __KERNEL__ /* early virtio console setup */ #ifdef CONFIG_S390_GUEST diff -r 5c9c0cc58458 drivers/s390/kvm/kvm_virtio.c --- a/drivers/s390/kvm/kvm_virtio.c Wed Nov 12 22:05:59 2008 +1030 +++ b/drivers/s390/kvm/kvm_virtio.c Wed Nov 12 22:10:29 2008 +1030 @@ -188,7 +188,8 @@ config = kvm_vq_config(kdev-desc)+index; err = vmem_add_mapping(config-address, - vring_size(config-num, PAGE_SIZE)); + vring_size(config-num, + KVM_S390_VIRTIO_RING_ALIGN)); if (err) goto out; @@ -209,7 +210,8 @@ return vq; unmap: vmem_remove_mapping(config-address, - vring_size(config-num, PAGE_SIZE)); + vring_size(config-num, + KVM_S390_VIRTIO_RING_ALIGN)); out: return ERR_PTR(err); } @@ -220,7 +222,8 @@ vring_del_virtqueue(vq); vmem_remove_mapping(config-address, - vring_size(config-num, PAGE_SIZE)); + vring_size(config-num, + KVM_S390_VIRTIO_RING_ALIGN)); } /* -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/7] virtio: hand virtio ring alignment as argument to vring_new_virtqueue
This allows each virtio user to hand in the alignment appropriate to their virtio_ring structures. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- drivers/lguest/lguest_device.c |4 ++-- drivers/s390/kvm/kvm_virtio.c |3 ++- drivers/virtio/virtio_pci.c|4 ++-- drivers/virtio/virtio_ring.c |3 ++- include/linux/virtio_ring.h|1 + 5 files changed, 9 insertions(+), 6 deletions(-) diff -r 6ea441813d18 drivers/lguest/lguest_device.c --- a/drivers/lguest/lguest_device.cWed Nov 12 21:57:55 2008 +1030 +++ b/drivers/lguest/lguest_device.cWed Nov 12 22:01:45 2008 +1030 @@ -259,8 +259,8 @@ /* OK, tell virtio_ring.c to set up a virtqueue now we know its size * and we've got a pointer to its pages. */ - vq = vring_new_virtqueue(lvq-config.num, vdev, lvq-pages, -lg_notify, callback); + vq = vring_new_virtqueue(lvq-config.num, LGUEST_VRING_ALIGN, +vdev, lvq-pages, lg_notify, callback); if (!vq) { err = -ENOMEM; goto unmap; diff -r 6ea441813d18 drivers/s390/kvm/kvm_virtio.c --- a/drivers/s390/kvm/kvm_virtio.c Wed Nov 12 21:57:55 2008 +1030 +++ b/drivers/s390/kvm/kvm_virtio.c Wed Nov 12 22:01:45 2008 +1030 @@ -192,7 +192,8 @@ if (err) goto out; - vq = vring_new_virtqueue(config-num, vdev, (void *) config-address, + vq = vring_new_virtqueue(config-num, KVM_S390_VIRTIO_RING_ALIGN, +vdev, (void *) config-address, kvm_notify, callback); if (!vq) { err = -ENOMEM; diff -r 6ea441813d18 drivers/virtio/virtio_pci.c --- a/drivers/virtio/virtio_pci.c Wed Nov 12 21:57:55 2008 +1030 +++ b/drivers/virtio/virtio_pci.c Wed Nov 12 22:01:45 2008 +1030 @@ -249,8 +249,8 @@ vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ - vq = vring_new_virtqueue(info-num, vdev, info-queue, -vp_notify, callback); + vq = vring_new_virtqueue(info-num, VIRTIO_PCI_VRING_ALIGN, +vdev, info-queue, vp_notify, callback); if (!vq) { err = -ENOMEM; goto out_activate_queue; diff -r 6ea441813d18 drivers/virtio/virtio_ring.c --- a/drivers/virtio/virtio_ring.c Wed Nov 12 21:57:55 2008 +1030 +++ 
b/drivers/virtio/virtio_ring.c Wed Nov 12 22:01:45 2008 +1030 @@ -274,6 +274,7 @@ }; struct virtqueue *vring_new_virtqueue(unsigned int num, + unsigned int vring_align, struct virtio_device *vdev, void *pages, void (*notify)(struct virtqueue *), @@ -292,7 +293,7 @@ if (!vq) return NULL; - vring_init(vq-vring, num, pages, PAGE_SIZE); + vring_init(vq-vring, num, pages, vring_align); vq-vq.callback = callback; vq-vq.vdev = vdev; vq-vq.vq_ops = vring_vq_ops; diff -r 6ea441813d18 include/linux/virtio_ring.h --- a/include/linux/virtio_ring.h Wed Nov 12 21:57:55 2008 +1030 +++ b/include/linux/virtio_ring.h Wed Nov 12 22:01:45 2008 +1030 @@ -115,6 +115,7 @@ struct virtqueue; struct virtqueue *vring_new_virtqueue(unsigned int num, + unsigned int vring_align, struct virtio_device *vdev, void *pages, void (*notify)(struct virtqueue *vq), -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] lguest: change virtio ring alignment.
This is just to test that the constant is wired through correctly. lguest has no ABI so I could apply this patch, but it breaks the don't be an asshole rule. diff -r 19e5f3f3a4be Documentation/lguest/Makefile --- a/Documentation/lguest/Makefile Wed Nov 12 22:27:50 2008 +1030 +++ b/Documentation/lguest/Makefile Wed Nov 12 22:38:52 2008 +1030 @@ -4,5 +4,7 @@ all: lguest +lguest: ../../include/linux/lguest_launcher.h + clean: rm -f lguest diff -r 19e5f3f3a4be include/linux/lguest_launcher.h --- a/include/linux/lguest_launcher.h Wed Nov 12 22:27:50 2008 +1030 +++ b/include/linux/lguest_launcher.h Wed Nov 12 22:38:52 2008 +1030 @@ -60,5 +60,6 @@ LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */ }; -#define LGUEST_VRING_ALIGN 4096 +/* A cacheline is enough separation between producer and consumer. */ +#define LGUEST_VRING_ALIGN 128 #endif /* _LINUX_LGUEST_LAUNCHER */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] virtio: Don't use PAGE_SIZE in virtio_pci.c
The virtio PCI devices don't depend on the guest page size. This matters now PowerPC virtio is gaining ground (they like 64k pages). Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- drivers/virtio/virtio_pci.c |2 +- include/linux/virtio_pci.h |4 2 files changed, 5 insertions(+), 1 deletion(-) diff -r 8014f9c51f4e drivers/virtio/virtio_pci.c --- a/drivers/virtio/virtio_pci.c Wed Nov 12 21:12:39 2008 +1030 +++ b/drivers/virtio/virtio_pci.c Wed Nov 12 21:47:21 2008 +1030 @@ -244,7 +244,7 @@ } /* activate the queue */ - iowrite32(virt_to_phys(info-queue) PAGE_SHIFT, + iowrite32(virt_to_phys(info-queue) VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ diff -r 8014f9c51f4e include/linux/virtio_pci.h --- a/include/linux/virtio_pci.hWed Nov 12 21:12:39 2008 +1030 +++ b/include/linux/virtio_pci.hWed Nov 12 21:47:21 2008 +1030 @@ -53,4 +53,8 @@ /* Virtio ABI version, this must match exactly */ #define VIRTIO_PCI_ABI_VERSION 0 + +/* How many bits to shift physical queue address written to QUEUE_PFN. + * 12 is historical, and due to x86 page size. */ +#define VIRTIO_PCI_QUEUE_ADDR_SHIFT12 #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [v2] linux: virtio: Standardize virtio's concept of page size
On Thursday 13 November 2008 02:46:31 Hollis Blanchard wrote: On Wed, 2008-11-12 at 22:51 +1030, Rusty Russell wrote: On Tuesday 11 November 2008 10:07:09 Hollis Blanchard wrote: Both sides of the virtio interface must agree about how big a pfn really is. This is particularly an issue on architectures where the page size is configurable (e.g. PowerPC, IA64) -- the interface must be independent of PAGE_SHIFT. Currently there are three distinct problems: * The shift count used when passing the physical address of the ring to a PCI-based back end. * The ring layout itself is padded to span at least two pages. * The balloon driver operates in units of pages. Hi Hollis, The more I thought about this, the more I think we're not solving this as neatly as we could. The trigger was noting that we're breaking the userspace API (vring_size and vring_init are exposed to userspace): I know that qemu cut pastes, but that's no excuse. So instead, I've introduced separate constants for each use. Yes, all these constants are 12/4096. But just to be contrary, at the end is a patch to change lguest to 128. And there's no reason this couldn't change in future using some guest detection scheme. OK. I thought it was simpler to just say 4KB everywhere in all aspects of the virtio interface, but I'm happy as long as we solve the problem somehow. :) It is simpler, yes, but we can take this opportunity to deconflate them and make things clearer and better than the current code, not just fix it. Note that I still don't have a balloon patch: want to send me one? Thanks, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [v2] linux: virtio: Standardize virtio's concept of page size
On Friday 14 November 2008 08:18:33 Hollis Blanchard wrote: On Thu, 2008-11-13 at 08:44 +1030, Rusty Russell wrote: Note that I still don't have a balloon patch: want to send me one? linux: virtio-balloon: avoid implicit use of Linux page size in balloon interface Thanks, applied with the following diff: use tabs to indent, and add a BUILD_BUG_ON for the pagesize assumption. Signed-off-by: Rusty Russell [EMAIL PROTECTED] diff -r 50e970613233 drivers/virtio/virtio_balloon.c --- a/drivers/virtio/virtio_balloon.c Fri Nov 14 11:40:38 2008 +1030 +++ b/drivers/virtio/virtio_balloon.c Fri Nov 14 11:41:19 2008 +1030 @@ -58,10 +58,11 @@ static u32 page_to_balloon_pfn(struct page *page) { - unsigned long pfn = page_to_pfn(page); + unsigned long pfn = page_to_pfn(page); - /* Convert pfn from Linux page size to balloon page size. */ - return pfn << (PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT); + BUILD_BUG_ON(PAGE_SHIFT < VIRTIO_BALLOON_PFN_SHIFT); + /* Convert pfn from Linux page size to balloon page size. */ + return pfn << (PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT); } static void balloon_ack(struct virtqueue *vq)
Re: [PATCH 2/2] kvm-s390: implement config_changed for virtio on s390
On Wednesday 19 November 2008 08:14:13 Christian Borntraeger wrote: Can you replace the kvm-s390 patch with this one? Done. I hadn't pushed out the other one to linux-next yet anyway. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] virtio-pci queue allocation not page-aligned
On Wednesday 03 December 2008 05:38:21 Hollis Blanchard wrote: I just spent a number of hours tracking this one down, and I'm not too thrilled about it. vp_find_vq() does the memory allocation for virtio PCI rings, and it uses kzalloc() to do it. This is bad because the ring memory *must* be page-aligned. According to Anthony, at the time this code was written, various slab allocators were checked and all happened to return page-aligned buffers. So how did I hit a problem? I had enabled CONFIG_SLUB_DEBUG_ON while investigating an unrelated problem, which offset the address by 64 bytes. One option is to add a BUG_ON(addr & ~PAGE_MASK) to vp_find_vq(). That's better than nothing, but still stinks. It's a bug, we fix it. I've complained before, but since there was no evidence of it actually breaking, I didn't push. Prepare a patch, I'll try to get it in this release. Thanks, Rusty.
Re: [PATCH] virtio: make PCI devices take a virtio_pci module ref
On Thursday 04 December 2008 23:14:31 Mark McLoughlin wrote: Nothing takes a ref on virtio_pci, so even if you have devices in use, rmmod will attempt to unload the module. Fix by simply making each device take a ref on the module. Hi Mark, Taking a reference to oneself is almost always wrong. I'm a little surprised that a successful call to pci_device->probe doesn't bump the module count though. Jesse? Thanks, Rusty.
Re: [PATCH] virtio: make PCI devices take a virtio_pci module ref
On Saturday 06 December 2008 01:37:06 Mark McLoughlin wrote: Another example of a lack of an explicit dependency causing problems is Fedora's mkinitrd having this hack: if echo $PWD | grep -q /virtio-pci/ ; then findmodule virtio_pci fi which basically says if this is a virtio device, don't forget to include virtio_pci in the initrd too!. Now, mkinitrd is full of hacks, but this is a particularly unusual one. Um, I don't know what this does, sorry. I have no idea how Fedora chooses what to put in an initrd; I can't think of a sensible way of deciding what goes in and what doesn't other than lists and heuristics. But there really is no explicit dependency between virtio modules and virtio_pci. There just is for kvm/x86 at the moment, since that is how they use virtio. Running over another bus is certainly possible, though may never happen for x86 (happens today for s390). Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] kvm: use cpumask_var_t for cpus_hardware_enabled
On Monday 08 December 2008 02:25:50 Avi Kivity wrote: Rusty Russell wrote: This changes cpus_hardware_enabled from a cpumask_t to a cpumask_var_t: equivalent for CONFIG_CPUMASK_OFFSTACK=n, otherwise dynamically allocated. -static cpumask_t cpus_hardware_enabled; +static cpumask_var_t cpus_hardware_enabled; This isn't on stack, so it isn't buying us anything. It's the CONFIG_NR_CPUS=4096 but nr_cpu_ids=4 case which we win using dynamic allocation. Gotta love distribution kernels. Is the plan to drop cpumask_t? Yes. And undefine 'struct cpumask' if CONFIG_CPUMASK_OFFSTACK. That will stop assignment and on-stack declarations for all but the most determined. If so, we're penalizing non-stack users by forcing them to go through another pointer (and cacheline). Not quite: if !CONFIG_CPUMASK_OFFSTACK, cpumask_var_t == cpumask_t[1]. Blame Linus :) Cheers, Rusty.
Re: [PATCH 1/2] kvm: use modern cpumask primitives, no cpumask_t on stack
On Monday 08 December 2008 02:44:00 Avi Kivity wrote: Avi Kivity wrote: Rusty Russell wrote: We're getting rid on on-stack cpumasks for large NR_CPUS. 1) Use cpumask_var_t and alloc_cpumask_var (a noop normally). Fallback code is inefficient but never happens in practice. Wow, code duplication from Rusty. Things must be bad. Since we're in a get_cpu() here, how about a per_cpu static cpumask instead? I don't mind the inefficient fallback, just the duplication. Btw, for the general case, instead of forcing everyone to duplicate, how about: cpumask_var_t cpus; with_cpumask(cpus) { ... code to populate cpus smp_call_function_some(...); } end_with_cpumask(cpus); Where with_cpumask() allocates cpus, and uses a mutex + static fallback on failure. I'd prefer not to hide deadlocks that way :( I'll re-battle with that code to neaten it. There are only a few places which have these kind of issues. Thanks, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: Extract core of kvm_flush_remote_tlbs/kvm_reload_remote_mmus.
Avi said: Wow, code duplication from Rusty. Things must be bad. Something about glass houses comes to mind. But instead, a patch. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- virt/kvm/kvm_main.c | 44 +++- 1 file changed, 15 insertions(+), 29 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -355,10 +355,11 @@ static void ack_flush(void *_completed) { } -void kvm_flush_remote_tlbs(struct kvm *kvm) +static bool make_all_cpus_request(struct kvm *kvm, unsigned int req) { int i, cpu, me; cpumask_t cpus; + bool called = false; struct kvm_vcpu *vcpu; me = get_cpu(); @@ -367,45 +368,30 @@ void kvm_flush_remote_tlbs(struct kvm *k vcpu = kvm-vcpus[i]; if (!vcpu) continue; - if (test_and_set_bit(KVM_REQ_TLB_FLUSH, vcpu-requests)) + if (test_and_set_bit(req, vcpu-requests)) continue; cpu = vcpu-cpu; if (cpu != -1 cpu != me) cpu_set(cpu, cpus); } - if (cpus_empty(cpus)) - goto out; - ++kvm-stat.remote_tlb_flush; - smp_call_function_mask(cpus, ack_flush, NULL, 1); -out: + if (!cpus_empty(cpus)) { + smp_call_function_mask(cpus, ack_flush, NULL, 1); + called = true; + } put_cpu(); + return called; +} + +void kvm_flush_remote_tlbs(struct kvm *kvm) +{ + if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) + ++kvm-stat.remote_tlb_flush; } void kvm_reload_remote_mmus(struct kvm *kvm) { - int i, cpu, me; - cpumask_t cpus; - struct kvm_vcpu *vcpu; - - me = get_cpu(); - cpus_clear(cpus); - for (i = 0; i KVM_MAX_VCPUS; ++i) { - vcpu = kvm-vcpus[i]; - if (!vcpu) - continue; - if (test_and_set_bit(KVM_REQ_MMU_RELOAD, vcpu-requests)) - continue; - cpu = vcpu-cpu; - if (cpu != -1 cpu != me) - cpu_set(cpu, cpus); - } - if (cpus_empty(cpus)) - goto out; - smp_call_function_mask(cpus, ack_flush, NULL, 1); -out: - put_cpu(); + make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD); } - int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id) { -- To unsubscribe from this list: send the line unsubscribe kvm in the 
body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] kvm: use modern cpumask primitives, no cpumask_t on stack
On Monday 08 December 2008 20:19:57 Avi Kivity wrote: Rusty Russell wrote: Btw, for the general case, instead of forcing everyone to duplicate, how about: cpumask_var_t cpus; with_cpumask(cpus) { ... code to populate cpus smp_call_function_some(...); } end_with_cpumask(cpus); Where with_cpumask() allocates cpus, and uses a mutex + static fallback on failure. I'd prefer not to hide deadlocks that way :( I'll re-battle with that code to neaten it. There are only a few places which have these kind of issues. cpuvar_get_maybe_mutex_lock(...); ... cpuvar_put_maybe_mutex_unlock(...); My thought was something like: /* This is an empty struct for !CONFIG_CPUMASK_OFFSTACK. */ static struct cpuvar_with_mutex_fallback fallback; ... cpumask_var_t tmp; cpuvar_alloc_fallback(tmp, fallback); ... cpuvar_free_fallback(tmp, fallback); We may get there eventually, but so far I've managed to produce less horrendous code in every case. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio_net: add link status handling
On Wednesday 10 December 2008 08:02:14 Anthony Liguori wrote: Mark McLoughlin wrote: Allow the host to inform us that the link is down by adding a VIRTIO_NET_F_STATUS which indicates that device status is available in virtio_net config. This is currently useful for simulating link down conditions (e.g. using proposed qemu 'set_link' monitor command) but would also be needed if we were to support device assignment via virtio. It would be nice if the virtio-net card wrote some acknowledgement that it has received the link status down/up events. How about of every status change event? ie. a generic virtio_pci solution? Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio_net: add link status handling
On Tuesday 09 December 2008 20:49:33 Mark McLoughlin wrote: Allow the host to inform us that the link is down by adding a VIRTIO_NET_F_STATUS which indicates that device status is available in virtio_net config. This has been a TODO for a while, thanks! + if (vi->status == v) + return; + + vi->status = v; + + if (vi->status & VIRTIO_NET_S_LINK_UP) { + netif_carrier_on(vi->dev); + netif_wake_queue(vi->dev); + } else { + netif_carrier_off(vi->dev); + netif_stop_queue(vi->dev); + } New status bits will screw this logic unless we count on the host not to set them. I suggest: /* Ignore unknown (future) status bits */ v &= VIRTIO_NET_S_LINK_UP; Cheers, Rusty.
Re: [PATCH] virtio_net: add link status handling
On Thursday 11 December 2008 05:04:44 Mark McLoughlin wrote: On Tue, 2008-12-09 at 21:11 -0600, Anthony Liguori wrote: Rusty Russell wrote: On Wednesday 10 December 2008 08:02:14 Anthony Liguori wrote: It would be nice if the virtio-net card wrote some acknowledgement that it has received the link status down/up events. How about of every status change event? ie. a generic virtio_pci solution? A really simple way to do it would just be to have another status field that was the guest's status (verses the host requested status which the current field is). All config reads/writes result in exits so it's easy to track. Adding YA virtio event may be a little overkill. Sounds very reasonable; that and Rusty's mask out unknown bits suggestion in the version below. Not quite what I was after. I've taken the original patch, added the masking change. I'll test here and feed to DaveM. As far as Anthony's live migrate race, I've decided to declare with Canutian arrogance that it Will Never Happen. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: use modern cpumask primitives, no cpumask_t on stack
On Friday 12 December 2008 03:21:24 Hollis Blanchard wrote: This patch breaks uniprocessor builds, because smp_call_function_many() is only defined for CONFIG_SMP. Good catch. I missed it because it's fixed as a side-effect of a later patch in my series (before I convert users). Linus, can you please apply this fix? Thanks, Rusty. Subject: Define smp_call_function_many for UP Otherwise those using it in transition patches (eg. kvm) can't compile with CONFIG_SMP=n: arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'make_all_cpus_request': arch/x86/kvm/../../../virt/kvm/kvm_main.c:380: error: implicit declaration of function 'smp_call_function_many' Signed-off-by: Rusty Russell ru...@rustcorp.com.au diff --git a/include/linux/smp.h b/include/linux/smp.h --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -146,6 +147,8 @@ static inline void smp_send_reschedule(i }) #define smp_call_function_mask(mask, func, info, wait) \ (up_smp_call_function(func, info)) +#define smp_call_function_many(mask, func, info, wait) \ + (up_smp_call_function(func, info)) static inline void init_call_single_data(void) { } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Define smp_call_function_many for UP
Linus, please apply. Otherwise those using it in transition patches (eg. kvm) can't compile with CONFIG_SMP=n: arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'make_all_cpus_request': arch/x86/kvm/../../../virt/kvm/kvm_main.c:380: error: implicit declaration of function 'smp_call_function_many' Signed-off-by: Rusty Russell ru...@rustcorp.com.au --- include/linux/smp.h |2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/smp.h b/include/linux/smp.h --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -146,6 +147,8 @@ static inline void smp_send_reschedule(i }) #define smp_call_function_mask(mask, func, info, wait) \ (up_smp_call_function(func, info)) +#define smp_call_function_many(mask, func, info, wait) \ + (up_smp_call_function(func, info)) static inline void init_call_single_data(void) { } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Define smp_call_function_many for UP
On Saturday 20 December 2008 03:33:26 Linus Torvalds wrote: Christmas at Rusty's house: Dreary. Far from! Sure, it's Arabella's first Xmas, but a Very Kernel Christmas can be lots of fun. Why just this morning, we did this together! See if you can tell which bits are hers... Subject: Decorate lib/rbtree.c From: Rusty and Arabella Russell rustabe...@rustcorp.com.au A slight enhancement to the comments in lib/rbtree.c. Signed-off-by: Rusty and Arabella Russell rustabe...@rustcorp.com.au --- lib/rbtree.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/lib/rbtree.c b/lib/rbtree.c index 48499c2..7848186 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -17,7 +17,8 @@ along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - linux/lib/rbtree.c + An excellent description of this algorithm can be found at: + http://en.wikipedia.org/wiki/Red-black_tree */ #include linux/rbtree.h @@ -291,6 +292,7 @@ EXPORT_SYMBOL(rb_erase); /* * This function returns the first node (in sort order) of the tree. + * ssdddsd */ struct rb_node *rb_first(struct rb_root *root) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. 
Signed-off-by: Rusty Russell ru...@rustcorp.com.au diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -7,6 +7,14 @@ config VIRTIO_RING tristate depends on VIRTIO +config VIRTIO_STATS + bool Virtio debugging stats (EXPERIMENTAL) + depends on VIRTIO_RING + select DEBUG_FS + ---help--- + Virtio stats collected by how full the ring is at any time, + presented under debugfs/virtio/name-vq/num-used/ + config VIRTIO_PCI tristate PCI driver for virtio devices (EXPERIMENTAL) depends on PCI EXPERIMENTAL diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -21,6 +21,7 @@ #include linux/virtio_config.h #include linux/device.h #include linux/slab.h +#include linux/debugfs.h /* virtio guest is communicating with a virtual device that actually runs on * a host processor. Memory barriers are used to control SMP effects. */ @@ -95,6 +96,11 @@ struct vring_virtqueue /* How to notify other side. FIXME: commonalize hcalls! */ void (*notify)(struct virtqueue *vq); +#ifdef CONFIG_VIRTIO_STATS + struct vring_stat *stats; + struct dentry *statdir; +#endif + #ifdef DEBUG /* They're supposed to lock for us. */ unsigned int in_use; @@ -106,6 +112,87 @@ struct vring_virtqueue #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +#ifdef CONFIG_VIRTIO_STATS +/* We have an array of these, indexed by how full the ring is. */ +struct vring_stat { + /* How many interrupts? */ + size_t interrupt_nowork, interrupt_work; + /* How many non-notify kicks, how many notify kicks, how many add notify? */ + size_t kick_no_notify, kick_notify, add_notify; + /* How many adds? */ + size_t add_direct, add_indirect, add_fail; + /* How many gets? */ + size_t get; + /* How many disable callbacks? */ + size_t disable_cb; + /* How many enables? 
*/ + size_t enable_cb_retry, enable_cb_success; +}; + +static struct dentry *virtio_stats; + +static void create_stat_files(struct vring_virtqueue *vq) +{ + char name[80]; + unsigned int i; + + /* Racy in theory, but we don't care. */ + if (!virtio_stats) + virtio_stats = debugfs_create_dir(virtio-stats, NULL); + + sprintf(name, %s-%s, dev_name(vq-vq.vdev-dev), vq-vq.name); + vq-statdir = debugfs_create_dir(name, virtio_stats); + + for (i = 0; i vq-vring.num; i++) { + struct dentry *dir; + + sprintf(name, %i, i); + dir = debugfs_create_dir(name, vq-statdir); + debugfs_create_size_t(interrupt_nowork, 0400, dir, + vq-stats[i].interrupt_nowork); + debugfs_create_size_t(interrupt_work, 0400, dir, + vq-stats[i].interrupt_work); + debugfs_create_size_t(kick_no_notify, 0400, dir, + vq-stats[i].kick_no_notify); + debugfs_create_size_t(kick_notify, 0400, dir, + vq-stats[i].kick_notify); + debugfs_create_size_t(add_notify, 0400, dir, + vq-stats[i].add_notify); + debugfs_create_size_t(add_direct, 0400, dir, + vq-stats[i].add_direct); + debugfs_create_size_t(add_indirect, 0400, dir
Re: Network performance with small packets
On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote: On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. Intent wasn't; I've not used tracepoints before, but maybe we should consider a longer-term monitoring solution? Patch welcome! Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Tue, 08 Mar 2011 20:21:18 -0600, Andrew Theurer haban...@linux.vnet.ibm.com wrote: On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote: On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote: I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Should we also add similar stat on vhost vq as well for monitoring vhost_signal vhost_notify? Tom L has started using Rusty's patches and found some interesting results, sent yesterday: http://marc.info/?l=kvmm=129953710930124w=2 Hmm, I'm not subscribed to kvm@ any more, so I didn't get this, so replying here: Also, it looks like vhost is sending a lot of notifications for packets it has received before the guest can get scheduled to disable notifications and begin processing the packets resulting in some lock contention in the guest (and high interrupt rates). Yes, this is a virtio design flaw, but one that should be fixable. We have room at the end of the ring, which we can put a last_used count. Then we can tell if wakeups are redundant, before the guest updates the flag. Here's an old patch where I played with implementing this: virtio: put last_used and last_avail index into ring itself. Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand: the ring is always a whole number of pages and there's hundreds of bytes of padding after the avail ring and the used ring, whatever the number of descriptors (which must be a power of 2). 
We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c |   23 +++++++++++---------
 include/linux/virtio_ring.h  |   12 +++++++++++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
 	/* Number we've added since last sync. */
 	unsigned int num_added;
 
-	/* Last used index we've seen. */
-	u16 last_used_idx;
-
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
@@ -278,12 +275,13 @@ static void detach_buf(struct vring_virt
 
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-	return vq->last_used_idx != vq->vring.used->idx;
+	return vring_last_used(&vq->vring) != vq->vring.used->idx;
 }
 
 static void *vring_get_buf(struct virtqueue *_vq, unsigned int *len)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_used_elem *u;
 	void *ret;
 	unsigned int i;
 
@@ -300,8 +298,11 @@ static void *vring_get_buf(struct virtqu
 		return NULL;
 	}
 
-	i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
-	*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
+	u = &vq->vring.used->ring[vring_last_used(&vq->vring) % vq->vring.num];
+	i = u->id;
+	*len = u->len;
+	/* Make sure we don't reload i after doing checks. */
+	rmb();
 
 	if (unlikely(i >= vq->vring.num)) {
 		BAD_RING(vq, "id %u out of range\n", i);
@@ -315,7 +316,8 @@ static void *vring_get_buf(struct virtqu
 	/* detach_buf clears data, so grab it now. */
 	ret = vq->data[i];
 	detach_buf(vq, i);
-	vq->last_used_idx++;
+	vring_last_used(&vq->vring)++;
+
 	END_USE(vq);
 	return ret;
 }
@@ -402,7 +404,6 @@ struct virtqueue *vring_new_virtqueue(un
 	vq->vq.name = name;
 	vq->notify = notify;
 	vq->broken = false;
-	vq->last_used_idx = 0;
 	vq->num_added = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
@@ -413,6 +414,10 @@ struct virtqueue *vring_new_virtqueue(un
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 
+	/* We publish indices whether they offer it or not: if not, it's junk
+	 * space anyway.  But calling this acknowledges the feature. */
+	virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_INDICES);
+
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)
 		vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
@@ -443,6 +448,8 @@ void vring_transport_features(struct vir
 		switch (i) {
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break
Re: TX from KVM guest virtio_net to vhost issues
On Wed, 09 Mar 2011 13:46:36 -0800, Shirley Ma mashi...@us.ibm.com wrote: Since we have had lots of performance discussions about virtio_net and vhost communication, I think it's better to have a common understanding of the code first; then we can seek the right directions to improve it. We also need to collect more statistics data on both virtio and vhost.

Let's look at TX first: from virtio_net (guest) to vhost (host), the send vq is shared between guest virtio_net and host vhost; it uses memory barriers to sync the changes. In the start: the guest virtio_net TX send completion interrupt (for freeing used skbs) is disabled. The guest virtio_net TX send completion interrupt is enabled only when the send vq is overrun and the guest needs to wait for vhost to consume more available skbs. Host vhost notification is enabled in the beginning (for consuming available skbs); it is disabled whenever the send vq is not empty. Once the send vq is empty, the notification is enabled by vhost.

In guest start_xmit(), it first frees used skbs, then sends available skbs to vhost; ideally the guest never enables TX send completion interrupts to free used skbs if vhost keeps posting used skbs in the send vq. In vhost handle_tx(), vhost is woken by the guest whenever the send vq has an skb to send; once the send vq is not empty, vhost exits handle_tx() without enabling notification. Ideally, if the guest keeps xmitting skbs into the send vq, the notification is never enabled. I don't see issues in this implementation.

However, in our TCP_STREAM small message size test, we found that somehow the guest couldn't see more used skbs to free, which caused frequent TX send queue overruns.

So it seems like the guest is outrunning the host?

In our TCP_RR small message size multiple streams test, we found that vhost couldn't see more xmit skbs in the send vq, thus it enabled notification too often.

And here the host is outrunning the guest?

What's the possible cause here in xmit? How are guest and vhost being scheduled?
Is it possible for guest virtio_net to cooperate with vhost for ideal performance, so that both can keep pace with the send vq without many notifications and exits?

We tend to blame the scheduler for all kinds of things, until we find the real cause :) Nailing things to different CPUs might help us determine if it really is the scheduler...

But if one side outruns the other, it does a lot of unnecessary work notifying/interrupting it over and over again before the host/guest gets a chance to shut notifications/interrupts off. Hence the last_used publishing patch (Xen does this right, I screwed it up).

Long weekend here, and I'm otherwise committed. But if no one has cleaned up that patch by early next week, I'll try to do so and see if we can make a real difference. Cheers, Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Wed, 16 Mar 2011 17:12:55 -0700, Shirley Ma mashi...@us.ibm.com wrote: Signed-off-by: Shirley Ma x...@us.ibm.com

This is fascinating... and deeply weird. OK, what's the difference between calling xmit_skb and ignoring failure, and this patch which figures out it's going to fail before calling xmit_skb? ie. what if you *just* delete this:

@@ -605,20 +620,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);
 
-	/* Apparently nice girls don't return TX_BUSY; stop the queue
-	 * before it gets out of hand.  Naturally, this wastes entries. */
-	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
-			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
-			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
-			}
-		}
-	}
-
 	return NETDEV_TX_OK;
 }

Thanks! Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Tue, 22 Mar 2011 13:36:50 +0200, Michael S. Tsirkin m...@redhat.com wrote: On Mon, Mar 21, 2011 at 11:03:07AM -0700, Shirley Ma wrote: On Fri, 2011-03-18 at 18:41 -0700, Shirley Ma wrote: + /* Drop packet instead of stop queue for better performance */

I would like to see some justification as to why this is the right way to go and not just papering over the real problem.

Fair. KVM guest virtio_net TX queue stop/restart is pretty expensive, which involves: 1. Guest enable callback: one memory barrier, interrupt flag set

Missed this cost: for historical reasons, it also involves a guest exit from an I/O write (PCI_QUEUE_NOTIFY).

OK, after some research, it looks like the reason was the tx timer that qemu used to use. So the hack of avoiding the add_buf call will avoid this kick and so break these hosts. I guess we can add a feature bit to detect a new host and so avoid the kick. We are running low on feature bits unfortunately, but just for testing, could you quantify the difference that this makes using the following patch:

Performance would suffer for those ancient qemus if we didn't do this, but it wouldn't be fatal to them. I think we should just remove it; the standard certainly doesn't mention it. Cheers, Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
With simply removing the notify here, it does help the case when TX overrun hits too often, for example for 1K message size, the single TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s. OK, we'll be getting rid of the kick on full, so please delete that on all benchmarks. Now, does the capacity check before add_buf() still win anything? I can't see how unless we have some weird bug. Once we've sorted that out, we should look at the more radical change of publishing last_used and using that to intuit whether interrupts should be sent. If we're not careful with ordering and barriers that could introduce more bugs. Anything else on the optimization agenda I've missed? Thanks, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Thu, 24 Mar 2011 10:46:49 -0700, Shirley Ma mashi...@us.ibm.com wrote: On Thu, 2011-03-24 at 16:28 +0200, Michael S. Tsirkin wrote: Several other things I am looking at, welcome cooperation: 1. It's probably a good idea to update the avail index immediately instead of upon kick: for RX this might help parallelism with the host.

Is it possible to use the same idea as publishing the last used idx to publish the avail idx? Then we can save guest iowrites/exits.

Yes, it should be symmetrical. Test independently of course, but the same logic applies. Thanks! Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Thu, 24 Mar 2011 16:28:22 +0200, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Mar 24, 2011 at 11:00:53AM +1030, Rusty Russell wrote: With simply removing the notify here, it does help the case when TX overrun hits too often, for example for 1K message size, the single TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s.

OK, we'll be getting rid of the kick on full, so please delete that on all benchmarks. Now, does the capacity check before add_buf() still win anything? I can't see how unless we have some weird bug. Once we've sorted that out, we should look at the more radical change of publishing last_used and using that to intuit whether interrupts should be sent. If we're not careful with ordering and barriers that could introduce more bugs.

Right. I am working on this, and trying to be careful. One thing I'm in doubt about: sometimes we just want to disable interrupts. Should we still use flags in that case? I thought that if we make the published index run from 0 to vq->num - 1, then a special value in the index field could disable interrupts completely. We could even reuse the space for the flags field to stick the index in. Too complex?

Making the index free-running avoids the full-or-empty confusion, plus offers an extra debugging insight. I think that if they really want to disable interrupts, the flag should still work, and when the client accepts the publish last_idx feature they are accepting that interrupts may be omitted if they haven't updated last_idx yet.

Anything else on the optimization agenda I've missed? Thanks, Rusty.

Several other things I am looking at, welcome cooperation: 1. It's probably a good idea to update the avail index immediately instead of upon kick: for RX this might help parallelism with the host.

Yes, once we've done everything else, we should measure this. It makes sense.

2. Adding an API to add a single buffer instead of s/g seems to help a bit.
This goes last, since it's kind of an ugly hack, but all internal to Linux if we decide it's a win.

3. For TX sometimes we free a single buffer, sometimes a ton of them, which might make the transmit latency vary. It's probably a good idea to limit this: maybe free the minimal number possible to keep the device going without stops, maybe free up to MAX_SKB_FRAGS.

This kind of heuristic is going to be quite variable depending on circumstance, I think, so it's a lot of work to make sure we get it right.

4. If the ring is full, we now notify right after the first entry is consumed. For TX this is suboptimal; we should try delaying the interrupt on the host.

Lguest already does that: it only sends an interrupt when it's run out of things to do. It does update the used ring, however, as it processes them. This seems sensible to me, but needs to be measured separately as well.

More ideas; it would be nice if someone could try them out: 1. We are allocating/freeing buffers for indirect descriptors. Use some kind of pool instead? And we could preformat part of the descriptor.

We need some poolish mechanism for virtio_blk too; perhaps an allocation callback which both can use (virtio_blk to alloc from a pool, virtio_net to recycle?). Along similar lines to preformatting, we could actually try to prepend the skb_vnet_hdr to the vnet data, and use a single descriptor for the hdr and the first part of the packet. Though IIRC, qemu's virtio barfs if the first descriptor isn't just the hdr (barf...).

2. I didn't have time to work on virtio2 ideas presented at the kvm forum yet, any takers?

I didn't even attend. But I think that virtio2 is moribund for the moment; there wasn't enough demand and it's clear that there are optimizations unexplored in virtio1. Cheers, Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Sun, 27 Mar 2011 09:52:54 +0200, Michael S. Tsirkin m...@redhat.com wrote: Though IIRC, qemu's virtio barfs if the first descriptor isn't just the hdr (barf...).

Maybe we can try fixing this before adding more flags; then e.g. the publish used flag can be reused to also tell us the layout is flexible. Or just add a feature flag for that.

We should probably do this at some stage, yes.

2. I didn't have time to work on virtio2 ideas presented at the kvm forum yet, any takers? I didn't even attend.

Hmm, right. But what was presented there was discussed on list as well: a single R/W descriptor ring with a valid bit instead of 2 rings + a descriptor array.

I'll be happy when we reach the point that the extra cacheline is hurting us :) Then we should do direct descriptors with a cookie as the value to hand back when finished. That seems to be close to optimal.

I agree absolutely that not all lessons have been learned; playing with different ring layouts would make at least an interesting paper IMO.

Yes, I'd like to see the results... Thanks, Rusty.
Re: [PATCH 2/4] [RFC rev2] virtio-net changes
On Tue, 05 Apr 2011 20:38:52 +0530, Krishna Kumar krkum...@in.ibm.com wrote: Implement mq virtio-net driver. Though struct virtio_net_config changes, it works with the old qemu since the last element is not accessed unless qemu sets VIRTIO_NET_F_MULTIQUEUE. Signed-off-by: Krishna Kumar krkum...@in.ibm.com

Hi Krishna! This change looks fairly solid, but I'd prefer it split into a few stages for clarity. The first patch should extract out the struct send_queue and struct receive_queue, even though there's still only one. The second patch can then introduce VIRTIO_NET_F_MULTIQUEUE. You could split into more parts if that makes sense, but I'd prefer to see the mechanical changes separate from the feature addition.

-struct virtnet_info {
-	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
-	struct net_device *dev;
+/* Internal representation of a send virtqueue */
+struct send_queue {
+	/* Virtqueue associated with this send _queue */
+	struct virtqueue *svq;

You can simply call this "vq" now it's inside 'send_queue'.

+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];

Similarly, this can just be "sg".

+static void free_receive_bufs(struct virtnet_info *vi)
+{
+	int i;
+
+	for (i = 0; i < vi->numtxqs; i++) {
+		BUG_ON(vi->rq[i] == NULL);
+		while (vi->rq[i]->pages)
+			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
+	}
+}

You can skip the BUG_ON(), since the next line will have the same effect.

+/* Free memory allocated for send and receive queues */
+static void free_rq_sq(struct virtnet_info *vi)
+{
+	int i;
+
+	if (vi->rq) {
+		for (i = 0; i < vi->numtxqs; i++)
+			kfree(vi->rq[i]);
+		kfree(vi->rq);
+	}
+
+	if (vi->sq) {
+		for (i = 0; i < vi->numtxqs; i++)
+			kfree(vi->sq[i]);
+		kfree(vi->sq);
+	}

This looks weird, even though it's correct. I think we need a better name than numtxqs and shorter than num_queue_pairs. Let's just use num_queues; sure, there are both tx and rx queues, but I still think it's pretty clear.
+	for (i = 0; i < vi->numtxqs; i++) {
+		struct virtqueue *svq = vi->sq[i]->svq;
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
+			dev_kfree_skb(buf);
+		}
+	}

I know this isn't your code, but it's ugly :)

	while ((buf = virtqueue_detach_unused_buf(svq)) != NULL)
		dev_kfree_skb(buf);

+	for (i = 0; i < vi->numtxqs; i++) {
+		struct virtqueue *rvq = vi->rq[i]->rvq;
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(rvq);
+			if (!buf)
+				break;

Here too...

+#define MAX_DEVICE_NAME 16

This isn't a good idea, see below.

+static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int i, err = -ENOMEM;
+	int totalvqs;
+	char **names;

This whole routine is really messy. How about doing find_vqs first, then having routines like setup_rxq(), setup_txq() and setup_controlq()? That would make this neater:

	static int setup_rxq(struct send_queue *sq, char *name);

Also, use kasprintf() instead of kmalloc + sprintf.

+#if 1
+	/* Allocate/initialize parameters for recv/send virtqueues */

Why is this #if 1'd? I do prefer the #else method of doing two loops, myself (but use kasprintf). Cheers, Rusty.
Re: Network performance with small packets
On Tue, 12 Apr 2011 23:01:12 +0300, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote: Here's an old patch where I played with implementing this: ... virtio: put last_used and last_avail index into ring itself. Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand: This seems to be true for x86 kvm and lguest but is it true for s390? Yes, as the ring is page aligned so there's always room. Will this last bit work on s390? If I understand correctly the memory is allocated by host there? They have to offer the feature, so if the have some way of allocating non-page-aligned amounts of memory, they'll have to add those extra 2 bytes. So I think it's OK... Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Thu, 14 Apr 2011 19:03:59 +0300, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote: They have to offer the feature, so if the have some way of allocating non-page-aligned amounts of memory, they'll have to add those extra 2 bytes. So I think it's OK... Rusty. To clarify, my concern is that we always seem to try to map these extra 2 bytes, which thinkably might fail? No, if you look at the layout it's clear that there's always most of a page left for this extra room, both in the middle and at the end. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio_blk: support barriers without FLUSH feature
On Tue, 15 Jun 2010 10:13:48 pm Christoph Hellwig wrote: If we want to support barriers with the cache=writethrough mode in qemu we need to tell the block layer that we only need queue drains to implement a barrier. Follow the model set by SCSI and IDE and assume that there is no volatile write cache if the host doesn't advertize it. While this might imply working barriers on old qemu versions or other hypervisors that actually have a volatile write cache this is only a cosmetic issue - these hypervisors don't guarantee any data integrity with or without this patch, but with the patch we at least provide data ordering. Signed-off-by: Christoph Hellwig h...@lst.de Thanks, applied. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Remove virtio_blk VBID ioctl
On Sat, 19 Jun 2010 04:08:03 am Ryan Harper wrote: With the availability of a sysfs device attribute for examining disk serial numbers, the ioctl is no longer needed. The user-space changes for this aren't upstream yet so we don't have any users to worry about.

If John Cooper acks this, I'll push it to Linus immediately. Unfortunately we offered this interface in 2.6.34, and we're now removing it. That's unpleasant. Thanks, Rusty.

PS. John should have been cc'd on these patches!
Re: [PATCH 1/2] Add 'serial' attribute to virtio-blk devices
On Sat, 19 Jun 2010 04:08:02 am Ryan Harper wrote: Create a new attribute for virtio-blk devices that will fetch the serial number of the block device. This attribute can be used by udev to create disk/by-id symlinks for devices that don't have a UUID (filesystem) associated with them. ATA_IDENTIFY strings are special in that they can be up to 20 chars long and aren't required to be NULL-terminated. The buffer is also zero-padded, meaning that if the serial is 19 chars or less we get a NULL-terminated string. When copying this value into a string buffer, we must be careful to copy up to the NULL (if present), only 20 if it is longer, and not to attempt to NULL terminate; this isn't needed.

Signed-off-by: Ryan Harper ry...@us.ibm.com
Signed-off-by: john cooper john.coo...@redhat.com
---
 drivers/block/virtio_blk.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 258bc2a..f1ef26f 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -281,6 +281,31 @@ static int index_to_minor(int index)
 	return index >> PART_BITS;
 }
 
+/* Copy serial number from *s to *d.  Copy operation terminates on either
+ * encountering a nul in *s or after n bytes have been copied, whichever
+ * occurs first.  *d is not forcibly nul terminated.  Return # of bytes copied.
+ */
+static inline int serial_sysfs(char *d, char *s, int n)
+{
+	char *di = d;
+
+	while (*s && n--)
+		*d++ = *s++;
+	return d - di;
+}
+
+static ssize_t virtblk_serial_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	char id_str[VIRTIO_BLK_ID_BYTES];
+
+	if (IS_ERR(virtblk_get_id(disk, id_str)))
+		return 0;

0? Really? That doesn't seem very informative.

+	return serial_sysfs(buf, id_str, min(VIRTIO_BLK_ID_BYTES, PAGE_SIZE));

How about something like this:

	BUILD_BUG_ON(PAGE_SIZE < VIRTIO_BLK_ID_BYTES + 1);
	/* id_str is not necessarily nul-terminated! */
	buf[VIRTIO_BLK_ID_BYTES] = '\0';
	return virtblk_get_id(disk, buf);

Thanks, Rusty.