[PATCH] virtio: Fix GFP flags passed from the virtio balloon driver
From: Balbir Singh <bal...@linux.vnet.ibm.com>

The virtio balloon driver can dig into the reservation pools of the OS
to satisfy a balloon request.  This is not advisable, and other balloon
drivers (drivers/xen/balloon.c) avoid this as well.  The patch also adds
changes to avoid printing a warning if allocation fails, since we retry
after some time anyway.

Signed-off-by: Balbir Singh <bal...@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <ru...@rustcorp.com.au>
Cc: kvm <kvm@vger.kernel.org>
Cc: sta...@kernel.org
---
 drivers/virtio/virtio_balloon.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 369f2ee..f8ffe8c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -102,7 +102,8 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY);
+		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
+					       __GFP_NOMEMALLOC | __GFP_NOWARN);
 		if (!page) {
 			if (printk_ratelimit())
 				dev_printk(KERN_INFO, &vb->vdev->dev,

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
> I took a stab at documenting CMD and FLUSH request types in virtio
> block.  Christoph, could you look over this please?
>
> I note that the interface seems full of warts to me; this might be a
> first step to cleaning them.

ISTR Christoph had withdrawn some patches in this area, and was waiting
for him to resubmit?

I've given up on figuring out the block device.  What seem to me to be
sane semantics along the lines of memory barriers are foreign to disk
people: they want (and depend on) flushing everywhere.

For example, tdb transactions do not require a flush, they only require
what I would call a barrier: that prior data be written out before any
future data.  Surely that would be more efficient in general than a
flush!  In fact, TDB wants only writes to *that file* (and metadata)
written out first; it has no ordering issues with other I/O on the same
device.

A generic I/O interface would allow you to specify "this request depends
on these outstanding requests" and leave it at that.  It might have some
sync flush command for dumb applications and OSes.  The userspace API
might not be as precise and only allow such a barrier against all prior
writes on this fd.

ISTR someone mentioning a desire for such an API years ago, so CC'ing
the usual I/O suspects...

Cheers,
Rusty.
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 04 2010, Rusty Russell wrote:
> > > ISTR someone mentioning a desire for such an API years ago, so
> > > CC'ing the usual I/O suspects...
> >
> > It would be nice to have a fuller API for this, but the reality is
> > that only the flush approach is really workable.  Even just strict
> > ordering of requests could only be supported on SCSI, and even there
> > the kernel still lacks proper guarantees on error handling to
> > prevent reordering there.
>
> There's a few I/O scheduling differences that might be useful:
>
> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
>    before a BARRIER.  That might be useful for time-critical WRITEs,
>    and those issued by high I/O priority.

This is only because no one actually wants flushes or barriers, though
I/O people seem to only offer that.  We really want "these writes must
occur before this write".  That offers maximum choice to the I/O
subsystem and potentially to smart (virtual?) disks.

> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
>    only for data belonging to a particular file (e.g. fdatasync with
>    no file size change, even on btrfs if O_DIRECT was used for the
>    writes being committed).  That would entail tagging FLUSHes and
>    WRITEs with a fs-specific identifier (such as inode number), opaque
>    to the scheduler which only checks equality.

This is closer.  In userspace I'd be happy with "all prior writes to
this struct file before all future writes".  Even if the original
guarantees were stronger (ie. inode basis).  We currently implement
transactions using 4 fsync/msync pairs:

	write_recovery_data(fd);
	fsync(fd);
	msync(mmap);
	write_recovery_header(fd);
	fsync(fd);
	msync(mmap);
	overwrite_with_new_data(fd);
	fsync(fd);
	msync(mmap);
	remove_recovery_header(fd);
	fsync(fd);
	msync(mmap);

Yet we really only need ordering, not guarantees about it actually
hitting disk before returning.  In other words, FLUSH can be more
relaxed than BARRIER inside the kernel.  It's ironic that we think of
fsync as stronger than fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it
has less information about what is required.  I'm saying I want to give
you as much information as I can, even if you don't use it yet.

Thanks,
Rusty.
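The four-step transaction above can be sketched in C.  This is a hedged
illustration, not TDB's actual code: the function names, the file
offsets, and the header bytes are all invented for the example, and the
mmap/msync half is omitted so only the fd/fsync ordering shows.  The
point it demonstrates is the complaint in the text: each fsync() buys
both ordering and durability, when ordering alone would do.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write one step of the transaction and force it out before the next
 * step may begin.  fsync() is a barrier *and* a flush: stronger (and
 * slower) than the ordering that is actually required. */
static int write_and_sync(int fd, off_t off, const void *buf, size_t len)
{
	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return -1;
	return fsync(fd);
}

/* Sketch of the 4-sync commit sequence quoted above. */
int tdb_style_commit(const char *path, const void *newdata, size_t len)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	if (write_and_sync(fd, 4096, "olddata", 7) != 0 ||	/* recovery data */
	    write_and_sync(fd, 4080, "RCVR", 4) != 0 ||		/* recovery header */
	    write_and_sync(fd, 0, newdata, len) != 0 ||		/* new data */
	    write_and_sync(fd, 4080, "\0\0\0\0", 4) != 0) {	/* remove header */
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A barrier-style primitive would let the middle two syncs become cheap
ordering requests rather than full flushes.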
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 04:24:59 am Christoph Hellwig wrote:
> On Fri, Feb 19, 2010 at 12:22:20AM +0200, Michael S. Tsirkin wrote:
> > I took a stab at documenting CMD and FLUSH request types in virtio
> > block.  Christoph, could you look over this please?
> >
> > I note that the interface seems full of warts to me; this might be a
> > first step to cleaning them.
>
> The whole virtio-blk interface is full of warts.

It has been extended rather ad-hoc, so that is rather expected.

> > One issue I struggled with especially is how the type field mixes
> > bits and non-bit values.  I ended up simply defining all legal
> > values, so that we have CMD = 2, CMD_OUT = 3 and so on.
>
> It's basically a complete mess without much logic behind it.
>
> > +\change_unchanged
> > +the high bit
> > +\change_inserted 0 1266497301
> > + (VIRTIO_BLK_T_BARRIER)
> > +\change_unchanged
> > + indicates that this request acts as a barrier and that all preceding requests
> > + must be complete before this one, and all following requests must not be
> > + started until this is complete.
> > +
> > +\change_inserted 0 1266504385
> > + Note that a barrier does not flush caches in the underlying backend device
> > + in host, and thus does not serve as data consistency guarantee.
> > + Driver must use FLUSH request to flush the host cache.
> > +\change_unchanged
>
> I'm not sure it's even worth documenting it.  I can't see any way to
> actually implement safe behaviour with the VIRTIO_BLK_T_BARRIER-style
> barriers.
>
> Btw, did I mention that .lyx is a really horrible format to review
> diffs for?  Plain latex would be a lot better...

Yeah, or export as text and post that diff for content review.  I do
like the patches to the lyx source though (I check all versions into
revision control, before and after merging changes, which makes it easy
to produce annotated versions).

Cheers,
Rusty.
Re: virtio: put last_used and last_avail index into ring itself.
On Wed, 5 May 2010 03:52:36 am Michael S. Tsirkin wrote:
> > virtio: put last_used and last_avail index into ring itself.
> >
> > Generally, the other end of the virtio ring doesn't need to see
> > where you're up to in consuming the ring.  However, to completely
> > understand what's going on from the outside, this information must
> > be exposed.  For example, if you want to save and restore a
> > virtio_ring, but you're not the consumer because the kernel is using
> > it directly.
> >
> > Fortunately, we have room to expand: the ring is always a whole
> > number of pages and there's hundreds of bytes of padding after the
> > avail ring and the used ring, whatever the number of descriptors
> > (which must be a power of 2).
> >
> > We add a feature bit so the guest can tell the host that it's
> > writing out the current value there, if it wants to use that.
> >
> > Signed-off-by: Rusty Russell <ru...@rustcorp.com.au>
>
> I've been looking at this patch some more (more on why later), and I
> wonder: would it be better to add some alignment to the last used
> index address, so that if we later add more stuff at the tail, it all
> fits in a single cache line?

In theory, but not in practice.  We don't have many rings, so the
difference between 1 and 2 cache lines is not very much.  We use a new
feature bit anyway, so a layout change should not be a problem.

> Since I raised the question of caches: for the used ring, the ring is
> not aligned to 64 bit, so on CPUs with 64 bit or larger cache lines,
> used entries will often cross cache line boundaries.  Am I right, and
> might it have been better to align ring entries to cache line
> boundaries?  What do you think?

I think everyone is settled on 128 byte cache lines for the foreseeable
future, so it's not really an issue.

Cheers,
Rusty.
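The "hundreds of bytes of padding" claim can be checked with the layout
arithmetic: 16 bytes per descriptor, then avail flags, idx, and 2 bytes
per ring entry, with the used ring pushed up to the next page boundary.
A small sketch (the helper name is invented; 4096-byte pages assumed):

```c
#include <assert.h>
#include <stddef.h>

static size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Bytes of padding between the end of the avail ring and the
 * page-aligned used ring, for a power-of-2 queue size `num`. */
size_t vring_avail_padding(unsigned num, size_t page)
{
	size_t driver_part = 16 * num		/* descriptors */
			   + 2 * (2 + num);	/* avail: flags, idx, ring[] */
	return align_up(driver_part, page) - driver_part;
}
```

For a 256-entry ring that leaves 3580 spare bytes before the used ring;
for 128 entries, 1788 — plenty of room for a published index.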
Re: question on virtio
On Wed, 5 May 2010 08:39:47 pm Michael S. Tsirkin wrote:
> Hi!
> I see this in virtio_ring.c:
>
> 	/* Put entry in available array (but don't update avail->idx
> 	 * until they do sync). */
>
> Why is it done this way?  It seems that updating the index straight
> away would be simpler, while this might allow the host to
> speculatively look up the buffer and handle it, without waiting for
> the kick.

I agree.  From my TODO: "what if we actually expose in ->add_buf?"

I don't *think* anyone adds buffers without being ready for them to be
used, so changing this should be safe.  Want to give it a try and
report back?

Thanks!
Rusty.
Re: [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 06:28:14 am Michael S. Tsirkin wrote:
> Rusty, this is a simplified form of a patch you posted in the past.
>
> I have a vhost patch that, using this feature, shows external to host
> bandwidth grow from 5 to 7 GB/s, by avoiding an interrupt in the
> window after the previous interrupt was sent and before interrupts
> were disabled for the vq.  With vhost under some external to host
> loads I see this window being hit about 30% of the time.

Fascinating.  So you use this to guess if the guest is still
processing?  I haven't thought about it hard, but is that racy?

Obviously happy to apply this when you finalize it.

Thanks!
Rusty.
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote:
> On Wed, 5 May 2010 14:28:41 +0930 Rusty Russell <ru...@rustcorp.com.au> wrote:
> > This is closer.  In userspace I'd be happy with "all prior writes to
> > this struct file before all future writes".  Even if the original
> > guarantees were stronger (ie. inode basis).  We currently implement
> > transactions using 4 fsync/msync pairs:
> >
> > 	write_recovery_data(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	write_recovery_header(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	overwrite_with_new_data(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	remove_recovery_header(fd);
> > 	fsync(fd);
> > 	msync(mmap);
>
> Seems over-zealous.
> If the recovery_header held a strong checksum of the recovery_data you
> would not need the first fsync, and as long as you have two places to
> write recovery data, you don't need the 3rd and 4th syncs.  Just:
>
> 	write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> 	fsync / msync
> 	overwrite_with_new_data()
>
> To recover you choose the most recent log_space and replay the
> content.  That may be a redundant operation, but that is no loss.

I think you missed a checksum for the new data?  Otherwise we can't
tell if the new data is completely written.  But yes, I will steal this
scheme for TDB2, thanks!

In practice, it's the first sync which is glacial; the rest are pretty
cheap.

> Also cannot see the point of msync if you have already performed an
> fsync, and if there is a point, I would expect you to call msync
> before fsync...  Maybe there is some subtlety there that I am not
> aware of.

I assume it's this from the msync man page:

	msync() flushes changes made to the in-core copy of a file that
	was mapped into memory using mmap(2) back to disk.  Without use
	of this call there is no guarantee that changes are written back
	before munmap(2) is called.

> > It's an implementation detail; barrier has less flexibility because
> > it has less information about what is required.  I'm saying I want
> > to give you as much information as I can, even if you don't use it
> > yet.
>
> Only, we know that approach doesn't work.  People will learn that they
> don't need to give the extra information to still achieve the same
> result - just like they did with ext3 and fsync.  Then when we improve
> the implementation to only provide the guarantees that you asked for,
> people will complain that they are getting empty files that they
> didn't expect.

I think that's an oversimplification: IIUC that occurred to people
*not* using fsync().  They weren't using it because it was too slow.
Providing a primitive which is as fast or faster and more specific
doesn't have the same magnitude of social issues.  And we can't write
userspace interfaces for idiots only.

> The abstraction I would like to see is a simple 'barrier' that
> contains no data and has a filesystem-wide effect.

I think you lack ambition ;)

Thinking about the single-file use case (eg. kvm guest or tdb), isn't
that suboptimal for md?  Since you have to hand your barrier to every
device, whereas a file-wide primitive may theoretically only go to
some.

Cheers,
Rusty.
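Neil's two-slot scheme can be sketched in miniature.  This is an
in-memory illustration only: the slot layout, the toy checksum, and the
helper names are invented, not TDB's real format.  It shows the
recovery rule he describes — pick the valid slot with the highest
sequence number, so a torn write in one slot simply falls back to the
other.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One of the two log spaces: sequence number, checksum over seq+data,
 * and the (internally checksummed) recovery data itself. */
struct log_slot {
	uint32_t seq;
	uint32_t csum;
	char data[16];
};

static uint32_t slot_csum(const struct log_slot *s)
{
	uint32_t c = s->seq;
	for (unsigned i = 0; i < sizeof(s->data); i++)
		c = c * 31 + (unsigned char)s->data[i];
	return c;
}

/* Recovery: replay the valid slot with the highest seq, or -1 if
 * neither slot checks out (crash before the very first commit). */
int pick_slot(const struct log_slot s[2])
{
	int best = -1;
	for (int i = 0; i < 2; i++)
		if (s[i].csum == slot_csum(&s[i]) &&
		    (best < 0 || s[i].seq > s[best].seq))
			best = i;
	return best;
}
```

Each commit then needs only one fsync: write the checksummed record
into the older slot, sync, and overwrite the real data in place.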
Re: virtio: put last_used and last_avail index into ring itself.
On Thu, 6 May 2010 03:57:55 pm Michael S. Tsirkin wrote:
> On Thu, May 06, 2010 at 10:22:12AM +0930, Rusty Russell wrote:
> > On Wed, 5 May 2010 03:52:36 am Michael S. Tsirkin wrote:
> > > What do you think?
> >
> > I think everyone is settled on 128 byte cache lines for the
> > foreseeable future, so it's not really an issue.
>
> You mean with 64 bit descriptors we will be bouncing a cache line
> between host and guest, anyway?

I'm confused by this entire thread.

Descriptors are 16 bytes.  They are at the start, so presumably aligned
to cache boundaries.  The available ring follows that at 2 bytes per
entry, so it's also packed nicely into cachelines.  Then there's padding
to the page boundary.  That puts us on a cacheline again for the used
ring; also 2 bytes per entry.

I don't see how any change in layout could be more cache friendly?

Rusty.
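The element sizes this argument rests on are easy to check with minimal
copies of the two driver-side ring structures (field layout as in the
virtio spec; these are illustrative stand-ins, not the kernel's
headers):

```c
#include <assert.h>
#include <stdint.h>

/* "Descriptors are 16 bytes": 8 + 4 + 2 + 2, naturally packed. */
struct vring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* The avail ring: 2 bytes of flags, 2 of idx, then 2 per entry. */
struct vring_avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[];
};
```

Eight descriptors fill a 128-byte line exactly, which is why the
descriptor table and avail ring pack cleanly.  (The used ring is the
part this message gets wrong, as the follow-up below concedes: its
entries are 8 bytes, not 2.)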
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
> > +	/* We publish the last-seen used index at the end of the available ring.
> > +	 * It is at the end for backwards compatibility. */
> > +	vr->last_used_idx = &(vr)->avail->ring[num];
> > +	/* Verify that last used index does not spill over the used ring. */
> > +	BUG_ON((void *)vr->last_used_idx +
> > +	       sizeof *vr->last_used_idx > (void *)vr->used);
> >  }
>
> Shouldn't this be on its own cache line?

It's next to the available ring because that's where the guest publishes
its data.  That whole page is guest-write, host-read.

Putting it on a cacheline by itself would be a slight pessimization; the
host cpu would have to get the last_used_idx cacheline and the avail
descriptor cacheline every time.  This way, they are sometimes the same
cacheline.

Hope that clarifies,
Rusty.
Re: [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 03:49:46 pm Michael S. Tsirkin wrote:
> Now, I also added an mb() in the guest between the read and the write
> so that the last used index write can not get ahead of the used index
> read.  It does feel good to have it there, but I can not say why it's
> helpful.  Works fine without it, but then these subtle races might be
> hard to trigger.  What do you think?

I couldn't see that in the patch?  I don't think it's necessary though,
since the write of last_used depends on the read of used (and no
platform we care about would reorder such a thing).

I'm reasonably happy, but we should write some convenient test for
missing interrupts.

I'm thinking of a sender which does a loop: blasts 1MB of UDP packets,
then prints the time and sleep(1).  The receiver would print the time
every 1MB of received data.  The two times should almost exactly
correspond.

Assuming that the network doesn't overflow and lose stuff, this should
identify any missing wakeups/interrupts (depending on the direction
used).

Cheers,
Rusty.
Re: virtio: put last_used and last_avail index into ring itself.
On Sun, 9 May 2010 06:27:33 pm Michael S. Tsirkin wrote:
> On Fri, May 07, 2010 at 12:35:39PM +0930, Rusty Russell wrote:
> > Then there's padding to the page boundary.  That puts us on a
> > cacheline again for the used ring; also 2 bytes per entry.
>
> Hmm, is the used ring really 2 bytes per entry?

Err, no, I am an idiot.

	/* u32 is used here for ids for padding reasons. */
	struct vring_used_elem {
		/* Index of start of used descriptor chain. */
		__u32 id;
		/* Total length of the descriptor chain which was used (written to) */
		__u32 len;
	};

	struct vring_used {
		__u16 flags;
		__u16 idx;
		struct vring_used_elem ring[];
	};

OK, now I get it.  Sorry, I was focussed on the avail ring.

> I thought that the used ring has 8 bytes per entry, and that struct
> vring_used is aligned at a page boundary; this would mean that the
> ring elements are at an offset of 4 bytes from the page boundary.
> Thus with a cacheline size of 128 bytes, each 4th element crosses a
> cacheline boundary.  If we had a 4 byte padding after idx, each used
> element would always be completely within a single cacheline.

I think the numbers are: every 16th entry hits two cachelines.  So
currently the first 15 entries are free (assuming we hit the idx
cacheline anyway), then 1 in 16 costs 2 cachelines.  That makes the
aligned version win when N > 240.

But we access the array linearly, so the extra cacheline cost is in
fact amortized.  I doubt it could be measured, but maybe vring_get_buf()
should prefetch?  While you're there, we could use an & rather than a
mod on the calculation, which may actually be measurable :)

Cheers,
Rusty.
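The "every 16th entry hits two cachelines" figure can be reproduced
with a few lines of arithmetic (the helper is invented for the
illustration): 8-byte entries starting 4 bytes into a page, against
128-byte lines.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many of the first `num` used-ring entries straddle a
 * cache-line boundary, given the entry size, the offset of ring[0]
 * from the page-aligned start of the used ring, and the line size. */
unsigned count_split_entries(unsigned num, size_t entry_size,
			     size_t offset, size_t line)
{
	unsigned split = 0;
	for (unsigned i = 0; i < num; i++) {
		size_t first = offset + i * entry_size;	/* first byte */
		size_t last = first + entry_size - 1;	/* last byte */
		if (first / line != last / line)
			split++;
	}
	return split;
}
```

With the current layout (offset 4) a 256-entry ring has 16 straddling
entries — one in sixteen; with 4 bytes of padding after idx (offset 8)
none straddle, which is the trade-off being weighed above.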
Re: virtio: imply disable_cb on callbacks
On Tue, 18 May 2010 09:03:17 pm Michael S. Tsirkin wrote:
> Rusty, the patch "virtio: imply disable_cb on callbacks" is in your
> tree.  I'd like to figure out how it works; for example:
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -69,6 +69,8 @@ static void blk_done(struct virtqueue *v
>  	/* In case queue is stopped waiting for more buffers. */
>  	blk_start_queue(vblk->disk->queue);
>  	spin_unlock_irqrestore(&vblk->lock, flags);
> +
> +	vq->vq_ops->enable_cb(vq);
>  }
>
>  static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
>
> Since this does not check the return status from enable_cb, it seems
> we could lose an interrupt if it arrives between the poll and the
> callback enable?

It's been quite a while since I wrote this patch, and I never really
liked it enough to polish it.  We would need to enable this *before*
reading the queue, AFAICT.

I'll remove it from my series; it's in the wilderness area already :)

Thanks!
Rusty.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Wed, 12 May 2010 04:57:22 am Avi Kivity wrote:
> On 05/07/2010 06:23 AM, Rusty Russell wrote:
> > It's next to the available ring because that's where the guest
> > publishes its data.  That whole page is guest-write, host-read.
> > Putting it on a cacheline by itself would be a slight pessimization;
> > the host cpu would have to get the last_used_idx cacheline and the
> > avail descriptor cacheline every time.  This way, they are sometimes
> > the same cacheline.
>
> If one peer writes the tail of the available ring, while the other
> reads last_used_idx, it's a false bounce, no?

I think we're talking about the last 2 entries of the avail ring.  That
means the worst case is 1 false bounce every time around the ring.

I think that's why we're debating it instead of measuring it :)

Note that this is an exclusive-shared-exclusive bounce only, too.

Cheers,
Rusty.
Re: [PATCH] vhost-net: utilize PUBLISH_USED_IDX feature
On Thu, 20 May 2010 07:57:18 am Michael S. Tsirkin wrote:
> On Wed, May 19, 2010 at 08:04:51PM +0300, Avi Kivity wrote:
> > On 05/18/2010 04:19 AM, Michael S. Tsirkin wrote:
> > > With PUBLISH_USED_IDX, the guest tells us which used entries it
> > > has consumed.  This can be used to reduce the number of
> > > interrupts: after we write a used entry, if the guest has not yet
> > > consumed the previous entry, or if the guest has already consumed
> > > the new entry, we do not need to interrupt.  This improves
> > > bandwidth by 30% under some workflows.
> > >
> > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > ---
> > > Rusty, Dave, this patch depends on the patch "virtio: put last
> > > seen used index into ring itself" which is currently destined for
> > > Rusty's tree.  Rusty, if you are taking that one for 2.6.35,
> > > please take this one as well.  Dave, any objections?
> >
> > I object: I think the index should have its own cacheline.
>
> The issue here is that host/guest do not know each other's cache line
> size.  I guess we could just put it at offset 128 or something like
> that ...  Rusty?

I was assuming you'd put it at the end of the padding.

I think it's a silly optimization, but Avi obviously feels strongly
about it and I respect his opinion.  Please resubmit...

Thanks,
Rusty.
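The suppression rule in the changelog — interrupt only if the guest has
consumed up to the previous entry but not the new one — reduces to a
single comparison on free-running 16-bit indices.  This is a sketch of
the rule as described, not the actual vhost code:

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero iff an interrupt is needed after publishing used
 * entry `new_used`, given the previously published index `old_used`
 * and the guest's published last-seen index.  True exactly when
 * last_seen lies in [old_used, new_used); all arithmetic is mod 2^16,
 * as the ring indices are free-running. */
int need_interrupt(uint16_t old_used, uint16_t new_used, uint16_t last_seen)
{
	return (uint16_t)(new_used - last_seen - 1) <
	       (uint16_t)(new_used - old_used);
}
```

If the guest is still behind (hasn't consumed the previous entry) it
will pick the new entry up on its own; if it has already polled past
the new entry, an interrupt would be pure overhead — in both cases the
comparison is false.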
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > Note that this is an exclusive-shared-exclusive bounce only, too.
>
> A bounce is a bounce.

I tried to measure this to show that you were wrong, but I was only
able to show that you're right.  How annoying.  Test code below.

> Virtio is already way too bouncy due to the indirection between the
> avail/used rings and the descriptor pool.

I tried to do a more careful analysis below, and I think this is an
overstatement.

> A device with out of order completion (like virtio-blk) will quickly
> randomize the unused descriptor indexes, so every descriptor fetch
> will require a bounce.
>
> In contrast, if the rings hold the descriptors themselves instead of
> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines
> for every descriptor, amortized.

We already have indirect; this would be a logical next step.  So let's
think about it.  The avail ring would contain 64 bit values, the used
ring would contain indexes into the avail ring.

So the client writes the descriptor page and adds to the avail ring,
then writes to the index.  The server reads the index, the avail ring,
and the descriptor page (3).  It writes the used entry (1), and updates
last_used (1).  The client reads used (1), derefs avail (1), updates
last_used (1), and cleans the descriptor page (1).

That's 9 cacheline transfers, worst case.  Best case of a half-full
ring in steady state, assuming 128-byte cache lines, the avail ring
costs are 1/16, the used entry is 1/64.  This drops it to 6 and 9/64
transfers.  (Note, the current scheme adds 2 more cacheline transfers,
for the descriptor table, worst case.  Assuming indirect, we get 2/8
xfer best case.  Either way, it's not the main source of cacheline
xfers).

Can we do better?  The obvious idea is to try to get rid of last_used
and used, and use the ring itself.  We would use an invalid entry to
mark the head of the ring.

Any other thoughts?
Rusty.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > > Note that this is an exclusive-shared-exclusive bounce only, too.
> >
> > A bounce is a bounce.
>
> I tried to measure this to show that you were wrong, but I was only
> able to show that you're right.  How annoying.  Test code below.

This time for sure!

#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <err.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/mman.h>

/* We share memory via an mmap. */
struct counter {
	unsigned int cacheline1;
	char pad[256];
	unsigned int cacheline2;
};

#define MAX_BOUNCES 1

enum mode {
	SHARE,
	UNSHARE,
	LOCKSHARE,
	LOCKUNSHARE,
};

int main(int argc, char *argv[])
{
	cpu_set_t cpuset;
	volatile struct counter *counter;
	struct timeval start, stop;
	bool child;
	unsigned int count;
	uint64_t usec;
	enum mode mode;

	if (argc != 4)
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare <cpu0> <cpu1>");

	if (strcmp(argv[1], "share") == 0)
		mode = SHARE;
	else if (strcmp(argv[1], "unshare") == 0)
		mode = UNSHARE;
	else if (strcmp(argv[1], "lockshare") == 0)
		mode = LOCKSHARE;
	else if (strcmp(argv[1], "lockunshare") == 0)
		mode = LOCKUNSHARE;
	else
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare <cpu0> <cpu1>");

	CPU_ZERO(&cpuset);

	counter = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE,
		       MAP_ANONYMOUS|MAP_SHARED, -1, 0);
	if (counter == MAP_FAILED)
		err(1, "Mapping page");

	/* Fault it in. */
	counter->cacheline1 = counter->cacheline2 = 0;

	child = (fork() == 0);

	CPU_SET(atoi(argv[2 + child]), &cpuset);
	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset) != 0)
		err(1, "Calling sched_setaffinity()");

	gettimeofday(&start, NULL);

	if (child) {
		count = 1;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline2 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1,
								   count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				__sync_val_compare_and_swap(&counter->cacheline2,
							    count, count+1);
				count += 2;
			}
			break;
		}
	} else {
		count = 0;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1,
								   count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				__sync_val_compare_and_swap(&counter->cacheline1,
							    count, count+1);
				count += 2;
			}
			break;
		}
	}

	gettimeofday(&stop, NULL);
	usec = (stop.tv_sec - start.tv_sec) * 1000000ULL
	       + stop.tv_usec - start.tv_usec;
	printf("%s: %u bounces in %llu usec\n",
	       child ? "child" : "parent", count,
	       (unsigned long long)usec);
	return 0;
}
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 20 May 2010 04:30:56 pm Avi Kivity wrote: On 05/20/2010 08:01 AM, Rusty Russell wrote: A device with out of order completion (like virtio-blk) will quickly randomize the unused descriptor indexes, so every descriptor fetch will require a bounce. In contrast, if the rings hold the descriptors themselves instead of pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for every descriptor, amortized. We already have indirect, this would be a logical next step. So let's think about it. The avail ring would contain 64 bit values, the used ring would contain indexes into the avail ring. Have just one ring, no indexes. The producer places descriptors into the ring and updates the head, The consumer copies out descriptors to be processed and copies back in completed descriptors. Chaining is always linear. The descriptors contain a tag that allow the producer to identify the completion. This could definitely work. The original reason for the page boundaries was for untrusted inter-guest communication: with appropriate page protections they could see each other's rings and a simply inter-guest copy hypercall could verify that the other guest really exposed that data via virtio ring. But, cute as that is, we never did that. And it's not clear that it wins much over simply having the hypervisor read both rings directly. Can we do better? The obvious idea is to try to get rid of last_used and used, and use the ring itself. We would use an invalid entry to mark the head of the ring. Interesting! So a peer will read until it hits a wall. But how to update the wall atomically? Maybe we can have a flag in the descriptor indicate headness or tailness. Update looks ugly though: write descriptor with head flag, write next descriptor with head flag, remove flag from previous descriptor. I was thinking a separate magic invalid entry. To publish an 3 descriptor chain, you would write descriptors 2 and 3, write an invalid entry at 4, barrier, write entry 1. 
It is a bit ugly, yes, but not terrible. I think that a simple simulator for this is worth writing, which tracks cacheline moves under various fullness scenarios... Cheers, Rusty.
Re: [PATCHv2-RFC 0/2] virtio: put last seen used index into ring itself
On Thu, 27 May 2010 05:20:35 am Michael S. Tsirkin wrote: Here's a rewrite of the original patch with a new layout. I haven't tested it yet so no idea how this performs, but I think this addresses the cache bounce issue raised by Avi. Posting for early flames/comments. Sorry, not without some evidence that it'll actually reduce cacheline bouncing. I *think* it will, but it's not obvious: the host may keep looking at avail_idx as we're updating last_seen. Or does qemu always look at both together anyway? Can someone convince me this is a win? Rusty.
Re: [PATCHv3 1/2] virtio: support layout with avail ring before idx
On Wed, 2 Jun 2010 12:17:12 am Michael S. Tsirkin wrote: This adds an (unused) option to put available ring before control (avail index, flags), and adds padding between index and flags. This avoids cache line sharing between control and ring, and also makes it possible to extend avail control without incurring extra cache misses. Signed-off-by: Michael S. Tsirkin m...@redhat.com

No no no no. 254? You're trying to Morton me![1] How's this (untested):

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -74,8 +74,8 @@ struct vring {
 /* The standard layout for the ring is a continuous chunk of memory which looks
  * like this. We assume num is a power of 2.
  *
- * struct vring
- * {
+ * struct vring {
+ *	*** The driver writes to this part.
  *	// The actual descriptors (16 bytes each)
  *	struct vring_desc desc[num];
  *
@@ -84,9 +84,11 @@ struct vring {
  *	__u16 avail_idx;
  *	__u16 available[num];
  *
- *	// Padding to the next align boundary.
+ *	// Padding so used_flags is on the next align boundary.
  *	char pad[];
+ *	__u16 last_used; // On a cacheline of its own.
  *
+ *	*** The device writes to this part.
  *	// A ring of used descriptor heads with free-running index.
  *	__u16 used_flags;
  *	__u16 used_idx;
@@ -110,6 +112,12 @@ static inline unsigned vring_size(unsign
 		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
 }

+/* Last used index sits at the very end of the driver part of the struct */
+static inline __u16 *vring_last_used_idx(const struct vring *vr)
+{
+	return (__u16 *)vr->used - 1;
+}
+
 #ifdef __KERNEL__
 #include <linux/irqreturn.h>
 struct virtio_device;

Cheers, Rusty.

[1] Andrew Morton has this technique where he posts a solution so ugly it forces others to fix it properly. Ego-roping, basically.
Re: [PATCHv3 1/2] virtio: support layout with avail ring before idx
On Fri, 4 Jun 2010 08:05:43 pm Michael S. Tsirkin wrote: On Fri, Jun 04, 2010 at 12:04:57PM +0930, Rusty Russell wrote: On Wed, 2 Jun 2010 12:17:12 am Michael S. Tsirkin wrote: This adds an (unused) option to put available ring before control (avail index, flags), and adds padding between index and flags. This avoids cache line sharing between control and ring, and also makes it possible to extend avail control without incurring extra cache misses. Signed-off-by: Michael S. Tsirkin m...@redhat.com No no no no. 254? You're trying to Morton me![1] Hmm, I wonder what we will do if we want a 3rd field on a separate cacheline. But ok. How's this (untested): I think we also want to put flags there as well, they are used on interrupt path, together with last used index. I'm uncomfortable with moving a field. We haven't done that before and I wonder what will break with old code. Should we instead just abandon the flags field and use last_used only? Or, more radically, put flags == last_used when the feature is on? Thoughts? Rusty.
Re: Extending virtio_console to support multiple ports
On Tue, 1 Sep 2009 02:07:05 am Anthony Liguori wrote: Amit Shah wrote: Functionally speaking, both virtio-console and virtio-serial do the same thing. In fact, virtio-console is just a subset of virtio-serial. If there are problems converging the two drivers in Linux, then I suggest you have two separate driver modules in Linux. That would obviously be rejected for Linux though because you cannot have two drivers for the same device. Why should qemu have a different policy? I've been on leave, cheerfully not thinking about this. AFAICT, a console and serial are very similar, so it makes sense to merge them. But a serial port doesn't have framing either; if it does, it's something else. And generally using virtio framing in upper layers is a mistake (one we've made in the qemu implementations, but that's unfortunate). Thanks, Rusty.
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Wed, 2 Sep 2009 01:35:18 am Stephen Hemminger wrote: On Tue, 1 Sep 2009 14:58:19 +0800 Xin, Xiaohui xiaohui@intel.com wrote: [RFC] Virtual Machine Device Queues (VMDq) support on KVM Network adapter with VMDq technology presents multiple pairs of tx/rx queues, and renders a network L2 sorting mechanism based on MAC addresses and VLAN tags for each tx/rx queue pair. Here we present a generic framework, in which network traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without any software copy. Actually this framework can apply to traditional network adapters which have just one tx/rx queue pair. And applications using the same user/kernel interface can utilize this framework to send/receive network traffic directly thru a tx/rx queue pair in a network adapter. We use the virtio-net architecture to illustrate the framework.

[ASCII architecture diagram, garbled in transit: the Qemu process hosts the virtio-net backend service, which exchanges TX (pop/add_buf) and RX (push/get_buf) with the guest kernel's virtio-net driver; AIO read/write combined with Direct I/O (substituting synced file operations) crosses into the host kernel, where reads and writes are copy-less with user space directly mapped to kernel so payloads are DMAed into user space or hooked to an skb; below sit the TUN/TAP driver and NIC driver (a likely queue pair instance) backed by a traditional adapter or a tx/rx queue pair.]

The basic idea is to utilize kernel Asynchronous I/O combined with Direct I/O to implement a copy-less TUN/TAP device. AIO and Direct I/O are not new to the kernel; we can still see them in the SCSI tape driver. With traditional file operations, a copy of payload contents between the kernel DMA address and a user buffer is needed. That's the copying we want to save.
The proposed framework is like this: a TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair on the host side. The KVM virtio-net backend service, a user space program, submits asynchronous read/write I/O requests to the host kernel through the TUN/TAP device. The requests correspond to the vqueue elements, which include both transmit and receive. They can be queued in one AIO request, and later the completion will be notified through the underlying packet tx/rx processing of the rx/tx queue pair.

Detailed path: to the guest virtio-net driver, packet receive corresponds to asynchronous read I/O requests of the backend service.

1) The guest virtio-net driver provides header and payload addresses through the receive vqueue to the virtio-net backend service.
2) The virtio-net backend service encapsulates multiple vqueue elements into multiple AIO control blocks and composes them into one AIO read request.
3) The virtio-net backend service uses the io_submit() syscall to pass the request to the TUN/TAP device.
4) The virtio-net backend service uses the io_getevents() syscall to check the completion of the request.
5) The TUN/TAP driver receives packets from the queue pair of the NIC, and prepares for Direct I/O. A modified NIC driver may render an skb whose header is allocated in the host kernel, but whose payload buffer is directly mapped from the user space buffers rendered through the AIO request by the backend service; get_user_pages() may do this. For one AIO read request, the TUN/TAP driver maintains a list of the directly mapped buffers, and the NIC driver tries to get the buffers as payload buffers to compose the new skbs. Of course, if
Re: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default
On Fri, 18 Sep 2009 03:01:42 am Christoph Hellwig wrote: Err, I'll take this one back for now pending some more discussion. What we need more urgently is the writeback cache flag, which is now implemented in qemu, patch following ASAP. OK, still catching up on mail. I'll push them out of the queue for now. Thanks, Rusty.
Re: [PATCH 1/1] virtio: adding __devexit to virtballoon_remove
Thanks, I already have this from Uwe. Cheers, Rusty.
Re: [PATCHv2] virtio: order used ring after used index read
On Sun, 25 Oct 2009 11:58:53 pm Michael S. Tsirkin wrote: On SMP guests, reads from the ring might bypass used index reads. This causes guest crashes because host writes to used index to signal ring data readiness. Fix this by inserting rmb before used ring reads. Thanks, good spotting. Applied. Rusty.
Re: [PATCH] virtio-net: fix data corruption with OOM
On Mon, 26 Oct 2009 03:33:40 am Michael S. Tsirkin wrote: virtio net used to unlink skbs from send queues on error, but ever since 48925e372f04f5e35fec6269127c62b2c71ab794 we do not do this. This causes guest data corruption and crashes with vhost since net core can requeue the skb or free it without it being taken off the list. This patch fixes this by queueing the skb after successful transmit.

I originally thought that this was racy: as soon as we do add_buf, we need to make sure we're ready for the callback (for virtio_pci, it's ->kick, but we shouldn't rely on that). So a comment would be nice. How's this?

Subject: virtio-net: fix data corruption with OOM
Date: Sun, 25 Oct 2009 19:03:40 +0200
From: Michael S. Tsirkin m...@redhat.com

virtio net used to unlink skbs from send queues on error, but ever since 48925e372f04f5e35fec6269127c62b2c71ab794 we do not do this. This causes guest data corruption and crashes with vhost since net core can requeue the skb or free it without it being taken off the list. This patch fixes this by queueing the skb after successful transmit.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Rusty Russell ru...@rustcorp.com.au (+ comment)
---
Rusty, here's a fix for another data corrupter I saw. This fixes a regression from 2.6.31, so definitely 2.6.32 I think. Comments?

 drivers/net/virtio_net.c | 8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -516,8 +516,7 @@ again:
 	/* Free up any pending old buffers before queueing new ones. */
 	free_old_xmit_skbs(vi);

-	/* Put new one in send queue and do transmit */
-	__skb_queue_head(&vi->send, skb);
+	/* Try to transmit */
 	capacity = xmit_skb(vi, skb);

 	/* This can happen with OOM and indirect buffers. */
@@ -531,8 +530,17 @@ again:
 		}
 		return NETDEV_TX_BUSY;
 	}
+	vi->svq->vq_ops->kick(vi->svq);

-	vi->svq->vq_ops->kick(vi->svq);
+	/*
+	 * Put new one in send queue.  You'd expect we'd need this before
+	 * xmit_skb calls add_buf(), since the callback can be triggered
+	 * immediately after that.  But since the callback just triggers
+	 * another call back here, normal network xmit locking prevents the
+	 * race.
+	 */
+	__skb_queue_head(&vi->send, skb);
+
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
 	nf_reset(skb);
Re: performance regression in virtio-net in 2.6.32-rc4
On Tue, 27 Oct 2009 05:18:35 am Michael S. Tsirkin wrote: Hi! I noticed a performance regression in virtio net: going from 2.6.31 to 2.6.32-rc4 I see this, for guest to host communication:

 Recv     Send     Send
 Socket   Socket   Message   Elapsed
 Size     Size     Size      Time      Throughput
 bytes    bytes    bytes     secs.     10^6bits/sec

 87380    16384    16384     10.20     7806.48
 ...
 87380    16384    16384     10.00     6814.60

Hmm, needs a bisect I'd say. Thanks, Rusty.
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking.

Hi Michael,

Now everyone else has finally kicked all the tires and it seems to pass, I've done a fairly complete review. Generally, it's really nice; just one bug and a few minor suggestions for polishing.

+/* Caller must have TX VQ lock */
+static void tx_poll_stop(struct vhost_net *net)
+{
+	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+		return;

likely? Really?

+	for (;;) {
+		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in,
+					 NULL, NULL);

Danger! You need an arg to vhost_get_vq_desc to tell it the max desc size you can handle. Otherwise, it's only limited by ring size, and a malicious guest can overflow you here, and below:

+		/* Skip header. TODO: support TSO. */
+		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
...
+	use_mm(net->dev.mm);
+	mutex_lock(&vq->mutex);
+	vhost_no_notify(vq);

I prefer a name like vhost_disable_notify().

+		/* OK, now we need to know about added descriptors. */
+		if (head == vq->num && vhost_notify(vq))
+			/* They could have slipped one in as we were doing that:
+			 * check again. */
+			continue;
+		/* Nothing new? Wait for eventfd to tell us they refilled. */
+		if (head == vq->num)
+			break;
+		/* We don't need to be notified again. */
+		vhost_no_notify(vq);

Similarly, vhost_enable_notify. This one is particularly misleading since it doesn't actually notify anything! In particular, this code would be neater as:

	if (head == vq->num) {
		if (vhost_enable_notify(vq)) {
			/* Try again, they could have slipped one in. */
			continue;
		}
		/* Nothing more to do. */
		break;
	}
	vhost_disable_notify(vq);

Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to return true only when there's more pending. Saves a loop around here most of the time. Also, the vhost_no_notify/vhost_disable_notify() can be moved out of the loop entirely. (It could be under an if (unlikely(enabled)), not sure if it's worth it).

+		len = err;
+		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);

That unsigned char * arg to memcpy_toiovec is annoying. A patch might be nice, separate from this effort.

+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+	int r;
+	if (!n)
+		return -ENOMEM;
+	f->private_data = n;
+	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
+	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;

I have a personal dislike of calloc for structures. In userspace, it's because valgrind can't spot uninitialized fields. These days a similar argument applies in the kernel, because we have KMEMCHECK now. If someone adds a field to the struct and forgets to initialize it, we can spot it.

+static void vhost_net_enable_vq(struct vhost_net *n, int index)
+{
+	struct socket *sock = n->vqs[index].private_data;

OK, I can't help but think that presenting the vqs as an array doesn't buy us very much. Esp. if you change vhost_dev_init to take a NULL-terminated varargs. I think readability would improve. It means passing a vq around rather than an index. Not completely sure it'll be a win tho.

+static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
+{
+	struct socket *sock, *oldsock = NULL;
...
+	sock = get_socket(fd);
+	if (IS_ERR(sock)) {
+		r = PTR_ERR(sock);
+		goto done;
+	}
+
+	/* start polling new socket */
+	oldsock = vq->private_data;
...
+done:
+	mutex_unlock(&n->dev.mutex);
+	if (oldsock) {
+		vhost_net_flush_vq(n, index);
+		fput(oldsock->file);

I dislike this style; I prefer multiple different goto points, one for when oldsock is set, and one for when it's not. That way, gcc warns us about uninitialized variables if we get it wrong.

+static long vhost_net_reset_owner(struct vhost_net *n)
+{
+	struct socket *tx_sock = NULL;
+	struct socket *rx_sock = NULL;
+	long r;

This should be called err, since that's what it is.

+static void vhost_net_set_features(struct vhost_net *n, u64 features)
+{
+	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
+		sizeof(struct virtio_net_hdr) : 0;
+	int i;
+	mutex_lock(&n->dev.mutex);
+	n->dev.acked_features = features;

Why is this called acked_features? Not just features? I expected to see
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
On Thu, 5 Nov 2009 03:55:42 am Paul E. McKenney wrote: On Wed, Nov 04, 2009 at 01:57:29PM +0200, Michael S. Tsirkin wrote: Can you ack this usage please? I thought I had done so in my paragraph above, but if you would like something a bit more formal... [snip verbose super-ack with qualifications] That's great guys. And yes, this is a kind of read-copy-update. And no, there's nothing wrong with it. But it's still nasty to use half an API. If it were a few places I would have open-coded it with a comment, or wrapped it. As it is, I don't think that would be a win. Cheers, Rusty.
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
On Sat, 7 Nov 2009 03:00:07 am Paul E. McKenney wrote: On Fri, Nov 06, 2009 at 03:31:20PM +1030, Rusty Russell wrote: But it's still nasty to use half an API. If it were a few places I would have open-coded it with a comment, or wrapped it. As it is, I don't think that would be a win. So would it help to have a rcu_read_lock_workqueue() and rcu_read_unlock_workqueue() that checked nesting and whether they were actually running in the context of a workqueue item? Or did you have something else in mind? Or am I misjudging the level of sarcasm in your reply? ;-) You read correctly. If we get a second user, creating an API makes sense. Thanks, Rusty.
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Sun, 8 Nov 2009 10:05:16 pm Michael S. Tsirkin wrote: On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote:

+/* Caller must have TX VQ lock */
+static void tx_poll_stop(struct vhost_net *net)
+{
+	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+		return;

likely? Really? Hmm ... yes. tx poll stop is called on each packet (as long as we do not fill up 1/2 the backend queue); the first call will stop polling, the rest check state and do nothing. This is because we normally do not care when the message has left the queue in the backend device: we tell the backend to send it and forget. We only start polling when the backend tx queue fills up. OK, good.

+static void vhost_net_set_features(struct vhost_net *n, u64 features)
+{
+	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
+		sizeof(struct virtio_net_hdr) : 0;
+	int i;
+	mutex_lock(&n->dev.mutex);
+	n->dev.acked_features = features;

Why is this called acked_features? Not just features? I expected to see code which exposed these back to userspace, and didn't. Not sure what you mean. Userspace sets them, why does it want to get them exposed back? There's something about the 'acked' which rubs me the wrong way. enabled_features is perhaps a better term than acked_features; acked seems more a user point-of-view, enabled seems more driver POV? set_features matches your ioctl names, but it sounds like a fn name :( It's marginal. And 'features' is shorter than both.

+	switch (ioctl) {
+	case VHOST_SET_VRING_NUM:

I haven't looked at your userspace implementation, but does a generic VHOST_SET_VRING_STATE/VHOST_GET_VRING_STATE with a struct make more sense? It'd be simpler here, Not by much though, right? but not sure if it'd be simpler to use? The problem is with VHOST_SET_VRING_BASE as well. I want it to be separate because I want to make it possible to relocate e.g. the used ring to another address while the ring is running.
This would be a good debugging tool (you look at kernel's used ring, check descriptor, then update guest's used ring) and also possibly an extra way to do migration. And it's nicer to have vring size separate as well, because it is initialized by host and never changed, right? Actually, this looks wrong to me:

+	case VHOST_SET_VRING_BASE:
...
+		vq->avail_idx = vq->last_avail_idx = s.num;

The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx. We could merge DESC, AVAIL, USED, and it will reduce the amount of code in userspace. With both base, size and fds separate, it seemed a bit more symmetrical to have desc/avail/used separate as well. What's your opinion? Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT. For future reference, this is *exactly* the kind of thing which would have been nice as a followup patch. Easy to separate, easy to review, not critical to the core. Yes. It's not too late to split it out though: should I do it yet? Only if you're feeling enthused. It's lightly reviewed now. Cheers, Rusty.

virtio: put last_used and last_avail index into ring itself.

Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed.
For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand: the ring is always a whole number of pages and there's hundreds of bytes of padding after the avail ring and the used ring, whatever the number of descriptors (which must be a power of 2). We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c | 23 +++
 include/linux/virtio_ring.h  | 12 +++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
 	/* Number we've added since last sync. */
 	unsigned int num_added;

-	/* Last
Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Mon, 9 Nov 2009 05:40:32 pm Michael S. Tsirkin wrote: On Mon, Nov 9, 2009 at 8:17 AM, Rusty Russell ru...@rustcorp.com.au wrote: There's something about the 'acked' which rubs me the wrong way. enabled_features is perhaps a better term than acked_features; acked seems more a user point-of-view, enabled seems more driver POV? Hmm. Are you happy with the ioctl name? If yes I think being consistent with that is important. I think in my original comments I noted that I preferred GET / SET, rather than GET/ACK. Actually, this looks wrong to me:

+	case VHOST_SET_VRING_BASE:
...
+		vq->avail_idx = vq->last_avail_idx = s.num;

The last_avail_idx is part of the state of the driver. It needs to be saved and restored over susp/resume. Exactly. That's what VHOST_GET/SET_VRING_BASE does. avail_idx is just a cached value for notify on empty, so what this does is clear the value. Ah, you actually refresh it every time anyway. Hmm, could you do my poor brain a favor and either just get_user it in vhost_trigger_irq(), or call it 'cached_avail_idx' or something? The only reason it's not in the ring itself is because I figured the other side doesn't need to see it (which is true, but missed debugging opportunities as well as man-in-the-middle issues like this one). I had a patch which put this field at the end of the ring, I might resurrect it to avoid this problem. This is backwards compatible with all implementations. See patch at end. Yes, I remember that patch. There seems to be little point though, at this stage. Well, it avoids this ioctl, by exposing all the state. We may well need it later, to expand the ring in other ways. I would drop avail_idx altogether: get_user is basically free, and simplifies a lot. As most state is in the ring, all you need is an ioctl to save/restore the last_avail_idx.
avail_idx is there for notify on empty: I had this thought that it's better to leave the avail cache line alone when we are triggering interrupt to avoid bouncing it around if guest is updating it meanwhile on another CPU, and I think my testing showed that it helped performance, but could be a mistake. You don't believe this can help? I believe it could help, but this is YA case where it would have been nice to have a dumb basic patch and this as a patch on top. But I am going to ask you to re-run that measurement, see if it stacks up (because it's an interesting lesson if it does..) Thanks! Rusty.
Re: [PATCHv9 3/3] vhost_net: a kernel-level virtio server
On Tue, 10 Nov 2009 03:52:30 am Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. Thanks, applied. Will be in tomorrow's linux-next. Cheers, Rusty.
Re: [PATCHv9 3/3] vhost_net: a kernel-level virtio server
One fix:

vhost: fix TUN=m VHOST_NET=y

drivers/built-in.o: In function `get_tun_socket':
net.c:(.text+0x15436e): undefined reference to `tun_get_socket'

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/vhost/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -1,6 +1,6 @@
 config VHOST_NET
 	tristate "Host kernel accelerator for virtio net (EXPERIMENTAL)"
-	depends on NET && EVENTFD && EXPERIMENTAL
+	depends on NET && EVENTFD && TUN && EXPERIMENTAL
 	---help---
 	  This kernel module can be loaded in host kernel to accelerate
 	  guest networking with virtio_net. Not to be confused with virtio_net
Re: [PATCHv9 3/3] vhost_net: a kernel-level virtio server
On Tue, 10 Nov 2009 10:06:37 pm Michael S. Tsirkin wrote: If tun is a module, vhost must be a module, too. If tun is built-in or disabled, vhost can be built-in. I really like the brainbending :) Keeps readers on their toes... Applied, Rusty.
Re: [PATCH 3/3] Split the KVM pv-ops support by feature
On Wed, 18 Nov 2009 10:43:12 am Alexander Graf wrote: Currently selecting KVM guest support enables multiple features at once that not everyone necessarily wants to have, namely: These patches make perfect sense, but please make sure Jeremy Fitzhardinge (CC'd) is in the loop, as he split the structs in the first place. Thanks! Rusty.
Re: [PATCH 0/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Fri, 20 Nov 2009 04:39:19 pm Shirley Ma wrote: Guest virtio_net receives packets from its pre-allocated vring buffers, then it delivers these packets to upper layer protocols as skb buffs. So it's not necessary to pre-allocate an skb for each mergeable buffer, then free it when it's useless. This patch defers skb allocation when receiving packets for both big packets and mergeable buffers. It reduces skb pre-allocations and skb frees. Based on Michael & Avi's suggestion. A destroy function has been created to push virtio free buffs to the vring for unused pages, and page private is used to maintain the page list. I didn't touch small packet skb allocation to avoid extra copies for small packets. This patch was tested and measured against 2.6.32-rc5 git. It is built against the 2.6.32-rc7 kernel. Tests have been done for small packets, big packets and mergeable buffers. The single netperf TCP_STREAM performance improved for host to guest. It also reduces the UDP packet drop rate. The netperf laptop results were:

mtu=1500 netperf -H xxx -l 120
                w/o patch       w/i patch (two runs)
guest to host:  3336.84Mb/s     3730.14Mb/s ~ 3582.88Mb/s
host to guest:  3165.10Mb/s     3370.39Mb/s ~ 3407.96Mb/s

Nice! Is this using mergeable_rx_bufs? Or just big_packets? I'd like to drop big packet support from our driver, but I don't know how many kvm hosts will not offer mergable rx bufs yet. Anthony? Thanks, Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Sat, 21 Nov 2009 02:51:41 am Shirley Ma wrote: Signed-off-by: Shirley Ma (x...@us.ibm.com)

Hi Shirley,

This patch seems like a good idea, but it needs some work. As this is the first time I've received a patch from you, I will go through it in extra detail. (As virtio maintainer, it's my responsibility to include this patch, or not).

-static void give_a_page(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct virtnet_info *vi, struct page *page)
 {
-	page->private = (unsigned long)vi->pages;
+	struct page *npage = (struct page *)page->private;
+
+	if (!npage)
+		page->private = (unsigned long)vi->pages;
+	else {
+		/* give a page list */
+		while (npage) {
+			if (npage->private == (unsigned long)0) {
+				npage->private = (unsigned long)vi->pages;
+				break;
+			}
+			npage = (struct page *)npage->private;
+		}
+	}
 	vi->pages = page;

The loops should cover both cases of the if branch. Either way, we want to find the last page in the list so you can set that ->private pointer to vi->pages. Also, the == (unsigned long)0 is a little verbose. How about:

	struct page *end;

	/* Find end of list, sew whole thing into vi->pages. */
	for (end = page; end->private; end = (struct page *)end->private);
	end->private = (unsigned long)vi->pages;
	vi->pages = page;

+void virtio_free_pages(void *buf)

This is a terrible name; it's specific to virtio_net, plus it should be static. Just free_pages should be sufficient here I think.

+{
+	struct page *page = (struct page *)buf;
+	struct page *npage;
+
+	while (page) {
+		npage = (struct page *)page->private;
+		__free_pages(page, 0);
+		page = npage;
+	}

This smells a lot like a for loop to me?

	struct page *i, *next;

	for (i = buf; next; i = next) {
		next = (struct page *)i->private;
		__free_pages(i, 0);
	}

+static int set_skb_frags(struct sk_buff *skb, struct page *page,
+			 int offset, int len)

A better name might be skb_add_frag()?
+static void receive_skb(struct net_device *dev, void *buf, unsigned len)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
-	int err;
+	struct skb_vnet_hdr *hdr;
+	struct sk_buff *skb;
 	int i;

 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
@@ -132,39 +173,71 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
 		goto drop;
 	}
-	if (vi->mergeable_rx_bufs) {
-		unsigned int copy;
-		char *p = page_address(skb_shinfo(skb)->frags[0].page);
+	if (!vi->mergeable_rx_bufs && !vi->big_packets) {
+		skb = (struct sk_buff *)buf;

This cast is unnecessary, but a comment would be nice:

	/* Simple case: the pointer is the 1514-byte skb */

+	} else {

And a matching comment here:

	/* The pointer is a chain of pages. */

+		struct page *page = (struct page *)buf;
+		int copy, hdr_len, num_buf, offset;
+		char *p;
+
+		p = page_address(page);
-		if (len > PAGE_SIZE)
-			len = PAGE_SIZE;
-		len -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
+		skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN);
+		if (unlikely(!skb)) {
+			dev->stats.rx_dropped++;

Does this mean we leak all those pages? Shouldn't we give_pages() here?

+			return;
+		}
+		skb_reserve(skb, NET_IP_ALIGN);
+		hdr = skb_vnet_hdr(skb);
-		memcpy(&hdr->mhdr, p, sizeof(hdr->mhdr));
-		p += sizeof(hdr->mhdr);
+		if (vi->mergeable_rx_bufs) {
+			hdr_len = sizeof(hdr->mhdr);
+			memcpy(&hdr->mhdr, p, hdr_len);
+			num_buf = hdr->mhdr.num_buffers;
+			offset = hdr_len;
+			if (len > PAGE_SIZE)
+				len = PAGE_SIZE;
+		} else {
+			/* big packets 6 bytes alignment between virtio_net
+			 * header and data */
+			hdr_len = sizeof(hdr->hdr);
+			memcpy(&hdr->hdr, p, hdr_len);
+			offset = hdr_len + 6;

Really? I can't see where the current driver does this 6 byte offset. There should be a header, then the packet... Ah, I see below that you set things up this way. The better way to do this is to create a new structure to use in various places.

	struct padded_vnet_hdr {
		struct virtio_net_hdr hdr;
		/*
Re: [PATCH] vhost: Fix warnings and bad type handling
On Sun, 22 Nov 2009 08:58:01 pm Michael S. Tsirkin wrote: From: Alan Cox a...@linux.intel.com Subject: [PATCH] vhost: fix warnings on 32 bit systems Fix compiler warning about discarding the top 32 bits of data on 32 bit systems, and document that discarded bits must be 0. Signed-off-by: Alan Cox a...@linux.intel.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- So I think the below slightly tweaked version of Alan's patch is a bit better. OK? Thanks, applied. Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote: + skb = (struct sk_buff *)buf; This cast is unnecessary, but a comment would be nice: Without this cast there is a compile warning. Hi Shirley, Looks like buf is a void *, so no cast should be necessary. But I could be reading the patch wrong. However, I question whether making it 16 bytes is the right thing: the ethernet header is 14 bytes long, so don't we want 8 bytes of padding? Because QEMU requires the 10-byte header in a separate buffer, one page is used to share between the virtio_net_hdr header (the 10-byte head) and the rest of the data. So I put a 6-byte offset here between the two buffers. I didn't look at the reason why a separate buf is used for virtio_net_hdr in QEMU. It's a qemu bug. It insists the header be an element in the scatterlist by itself. Unfortunately we have to accommodate it. However, there's no reason for the merged rxbuf and big packet layout to be the same: for the big packet case you should lay it out as follows:

#define BIG_PACKET_PAD (NET_SKB_PAD - sizeof(struct virtio_net_hdr) + NET_IP_ALIGN)

struct big_packet_page {
	struct virtio_net_hdr hdr;
	char pad[BIG_PACKET_PAD];
	/* Actual packet data starts here */
	unsigned char data[PAGE_SIZE - BIG_PACKET_PAD - sizeof(struct virtio_net_hdr)];
};

Then use this type as the template for registering the sg list for the big packet case. I think you can move the memcpy outside the if, like so: memcpy(hdr, p, hdr_len); I didn't move it out, because num_buf = hdr->mhdr.num_buffers; Yes, that has to stay inside, but the memcpy can move out. It's just a bit neater to have more common code. Thanks, Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Wed, 25 Nov 2009 01:06:32 am Anthony Liguori wrote: So does lguest. It's been that way since the beginning. Fixing this would result in breaking older guests. Agreed, we can't fix it in the guests, but it's a wart. That's why I haven't bothered fixing it, but if someone else wants to I'll cheer all the way. lguest did it because I knew I could fix lguest any time; it was a bad mistake and I will now fix lguest :) We really need to introduce a feature bit if we want to change this. I don't think it's worth it. But the spec does say that the implementation should not rely on the framing (I think there's a note that current implementations are buggy tho, so you should frame it separately anyway). That way future devices can get it right, at least. Thanks, Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Wed, 25 Nov 2009 07:45:30 pm Michael S. Tsirkin wrote: Hmm, is it really worth it to save a header copy if it's linear? We are going to access it anyway, and it fits into one cacheline nicely. On the other hand we have more code making life harder for compiler and processor. Not sure: I think there would be many places where it would be useful. We do a similar thing in the kernel to inspect non-linear packets, and it's served us well. Cheers, Rusty.
Re: [PATCH] virtio: set pci bus master enable bit
On Mon, 30 Nov 2009 02:22:01 am Michael S. Tsirkin wrote: As all virtio devices perform DMA, we must enable bus mastering for them to be spec compliant. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_pci.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index 28d9cf7..717bae1 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -648,6 +648,7 @@ static int __devinit virtio_pci_probe(struct pci_dev *pci_dev, goto out_req_regions; pci_set_drvdata(pci_dev, vp_dev); + pci_set_master(pci_dev); I can believe this, but I have no idea if it's right. I've applied it, and hope Jesse will comment if there's something wrong with it. Thanks, Rusty.
Re: [PATCH] virtio spec: add virtio-blk max sectors feature
On Thu, 3 Dec 2009 08:28:38 pm Avi Kivity wrote: On 12/03/2009 10:42 AM, Avishay Traeger1 wrote: I previously submitted a patch to have the guest virtio-blk driver get the value for the maximum I/O size from the host bdrv, rather than assume that there is no limit. Avi requested that I first patch the virtio spec (http://ozlabs.org/~rusty/virtio-spec/). Below is that patch. Please CC me on replies, as I am not subscribed. Copying Rusty and virtualizat...@. Thanks Avi... Avishay; this would be the total sectors in an I/O, as separate from SIZE_MAX (maximum size of any single scatterlist entry) and SEG_MAX (maximum number of scatterlist entries)? Seems like a reasonable idea; esp if you need it. Thanks! Rusty.
Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio
On Fri, 11 Dec 2009 11:03:25 pm Shirley Ma wrote: Signed-off-by: x...@us.ibm.com

Hi Shirley, These patches look quite close. More review to follow :) This title needs revision. It should start with virtio: (all the virtio patches do, for easy identification after merge), eg: Subject: virtio: Add destroy callback for unused buffers It also needs a commit message which explains the problem and the solution. Something like this: There's currently no way for a virtio driver to ask for unused buffers, so it has to keep a list itself to reclaim them at shutdown. This is redundant, since virtio_ring stores that information. So add a new hook to do this: virtio_net will be the first user.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c708ecc..bb5eb7b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -107,6 +107,16 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 	return p;
 }

+static void virtio_net_free_pages(void *buf)
+{
+	struct page *page, *next;
+
+	for (page = buf; page; page = next) {
+		next = (struct page *)page->private;
+		__free_pages(page, 0);
+	}
+}
+
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;

This belongs in one of the future patches: it will cause an unused warning and is logically not part of this patch anyway.

+static int vring_destroy_bufs(struct virtqueue *_vq, void (*destroy)(void *))
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	void *buf;
+	unsigned int i;
+	int freed = 0;

unsigned int for return and freed counter? Means changing prototype, but a negative return isn't useful here.

+
+	START_USE(vq);
+
+	for (i = 0; i < vq->vring.num; i++) {
+		if (vq->data[i]) {
+			/* detach_buf clears data, so grab it now. */
+			buf = vq->data[i];
+			detach_buf(vq, i);
+			destroy(buf);
+			freed++;

You could simplify this a bit, since the queue must be inactive:

			destroy(vq->data[i]);
			detach_buf(vq, i);
			freed++;

+		}
+	}
+
+	END_USE(vq);
+	return freed;

Perhaps add a:

	/* That should have freed everything. */
	BUG_ON(vq->num_free != vq->vring.num);

 	void (*disable_cb)(struct virtqueue *vq);
 	bool (*enable_cb)(struct virtqueue *vq);
+	int (*destroy_bufs)(struct virtqueue *vq, void (*destory)(void *));

destory typo :) Cheers, Rusty.
Re: [PATCH v2 2/4] Defer skb allocation -- new skb_set calls chain pages in virtio_net
On Fri, 11 Dec 2009 11:13:02 pm Shirley Ma wrote: Signed-off-by: Shirley Ma x...@us.ibm.com I don't think there's a good way of splitting this change across multiple patches. And I don't think this patch will compile; I don't think we can get rid of trim_pages yet. We *could* first split the receive paths into multiple parts as you have (eg. add_recvbuf_big etc), then actually rewrite them, but that's too much work to refactor the code twice. So just roll all the driver changes into one patch; so you will have two patches: one which creates the destroy_bufs API for virtio, and one which uses it in the virtio_net driver.

+static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct page **page,
+				    unsigned int *len)

This actually allocates an skb, and transfers the first page to it (via copy and possibly the first fragment). skb_from_page() perhaps?

+		hdr_len = sizeof(hdr->hdr);
+		offset = sizeof(struct padded_vnet_hdr);
+	}
+
+	memcpy(hdr, p, hdr_len);
+
+	*len -= hdr_len;

Perhaps you should return NULL here as an error if *len was < hdr_len? Thanks, Rusty.
Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio
On Wed, 16 Dec 2009 09:10:02 am Michael S. Tsirkin wrote: On Wed, Dec 16, 2009 at 09:06:12AM +1030, Rusty Russell wrote: On Tue, 15 Dec 2009 06:52:53 am Michael S. Tsirkin wrote: On Mon, Dec 14, 2009 at 12:08:05PM -0800, Shirley Ma wrote: Hello Michael, I agree with the comments (will have two patches instead of 4 based on Rusty's comments) except below one. On Sun, 2009-12-13 at 12:26 +0200, Michael S. Tsirkin wrote: That said - do we have to use a callback? I think destroy_buf which returns a data pointer, and which we call repeatedly until we get NULL or an error, would be a better, more flexible API. This is not critical though. The reason to use this is because in virtio_net remove, it has BUG_ON(vi->num != 0), which will be consistent with small skb packet. If we use NULL, error then we lose the track for vi->num, since we don't know how many buffers have been passed to ULPs or are still unused. Thanks Shirley I don't insist, but my idea was:

	for (;;) {
		b = vq->destroy(vq);
		if (!b)
			break;
		--vi->num;
		put_page(b);
	}

In this case it should be called get_unused_buf or something. But I like Shirley's approach here; destroy (with callback) accurately reflects the only time this can be validly used. Cheers, Rusty. I guess the actual requirement is that the device must be inactive. Technically, the vq has to be inactive. (This distinction may matter for the multiport virtio_console work.) As I said this is fine with me as well. But I think the callback should get a vq pointer besides the data pointer, so that it can e.g. find the device if it needs to. In case of virtio net this makes it possible to decrement the outstanding skb counter in the callback. Makes sense? Sure. I don't really mind either way, and I'm warming to the name detach_buf :) Cheers, Rusty.
Re: [PATCH] vhost: fix high 32 bit in FEATURES ioctls
On Tue, 22 Dec 2009 10:09:33 pm Michael S. Tsirkin wrote: From: David Stevens dlstev...@us.ibm.com Subject: vhost: fix high 32 bit in FEATURES ioctls Thanks, applied. Rusty.
Re: [PATCH/RFC] virtio_blk: check for hardsector size from host
On Tuesday 27 May 2008 19:04:59 Christian Borntraeger wrote: Rusty, Jens, I need your opinion on the following patch. It seems to work, but I would like to get some feedback if this patch is the right approach: Looks like the right approach to me. Don't know about the block side of it... Just that u64 seems like overkill: u32? Cheers, Rusty.
Re: [PATCH] virtio_config: fix len calculation of config elements
On Thursday 29 May 2008 19:08:01 Christian Borntraeger wrote: v is a pointer, so we have to use sizeof(*v) instead of sizeof(v). How embarrassing. Applied, thanks. Rusty.
Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
On Tuesday 17 June 2008 00:02:55 Anthony Liguori wrote: There's nothing that prevents zero-copy to be implemented for tun without vringfd. In fact, I seem to recall that your earlier patches implemented zero-copy :-) They didn't actually work. You need to block until the data isn't being used any more (thread pool anyone?), or implement an aio interface. I like the vringfd model and I think it's a good way to move forward. My concern is that it introduces an extra syscall in the TX path. Right now, we do a single write call whereas with vringfd we need to insert the TX packet into the queue, do a notify, and then wait for indication that the TX has succeeded. If the guest wants notification of xmit, yes you need another syscall for that. But it often doesn't (note: the current vring tun ignored the NO_NOTIFY flag, but one thing at a time). I know we'll win with TSO but we don't need vringfd for TSO. The jury's still out IMHO as to whether we should do vringfd or just try to merge TSO tun patches. Note that we can do TSO in userspace, too. No syscall reduction, but a VM exit reduction. Cheers, Rusty.
Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
On Friday 25 July 2008 09:22:53 Dor Laor wrote: Mark McLoughlin wrote:

	vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
	qemu_del_timer(n->tx_timer);
	n->tx_timer_active = 0;

As stated by newer messages, we should handle the first tx notification if the timer wasn't active to shorten latency. Cheers, Dor Here's what lguest does at the moment. Basically, we cut the timeout a tiny bit each time, until we get *fewer* packets than last time. Then we bump it up again. Rough, but seems to work (it should be a per-device var of course, not a static).

@@ -921,6 +922,7 @@ static void handle_net_output(int fd, st
 	unsigned int head, out, in, num = 0;
 	int len;
 	struct iovec iov[vq->vring.num];
+	static int last_timeout_num;

 	if (!timeout)
 		net_xmit_notify++;
@@ -941,6 +943,14 @@ static void handle_net_output(int fd, st
 	/* Block further kicks and set up a timer if we saw anything. */
 	if (!timeout && num)
 		block_vq(vq);
+
+	if (timeout) {
+		if (num < last_timeout_num)
+			timeout_usec += 10;
+		else if (timeout_usec > 1)
+			timeout_usec--;
+		last_timeout_num = num;
+	}
 }
Re: [PATCH 0/9][RFC] KVM virtio_net performance
On Saturday 26 July 2008 19:45:36 Avi Kivity wrote: Mark McLoughlin wrote: Hey, Here's a bunch of patches attempting to improve the performance of virtio_net. This is more an RFC rather than a patch submission since, as can be seen below, not all patches actually improve the performance measurably. I've tried hard to test each of these patches with as stable and informative a benchmark as I could find. The first benchmark is a netperf[1] based throughput benchmark and the second uses a flood ping[2] to measure latency differences. Each set of figures is min/average/max/standard deviation. The first set is Gb/s and the second is milliseconds. The network configuration used was very simple - the guest with a virtio_net interface and the host with a tap interface and static IP addresses assigned to both - e.g. there was no bridge in the host involved and iptables was disabled in both the host and guest. I used: 1) kvm-71-26-g6152996 with the patches that follow 2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from 219:bbd2611289c5 applied; these are the patches that have just been submitted to Linus The conclusions I draw are: 1) The length of the tx mitigation timer makes quite a difference to the throughput achieved; we probably need a good heuristic for adjusting this on the fly. The tx mitigation timer is just one part of the equation; the other is the virtio ring window size, which is now fixed. Using a maximum sized window is good when the guest and host are running flat out, doing nothing but networking. When throughput drops (because the guest is spending cpu on processing, or simply because the other side is not keeping up), we need to drop the window size so as to retain acceptable latencies. The tx timer can then be set to a bit after the end of the window, acting as a safety belt in case the throughput changes. Interestingly, I played a little with a threshold patch.
Unfortunately it seemed to just add YA variable to the mix; it'd take real research to figure out how to adjust the threshold and timeout values appropriately for different situations. This isn't too good. Low latency is important for nfs clients (or other request/response workloads). I think we can keep these low by adjusting the virtio window (for example, on an idle system it should be 1), so that the tx mitigation timer only fires when the workload transitions from throughput to request/response. From reading this thread, this seems like an implementation bug. Don't suppress notifications on an empty ring (ie. threshold will be 1 when nothing is happening). Rusty.
Re: [PATCH 0/9][RFC] KVM virtio_net performance
On Saturday 26 July 2008 19:45:36 Avi Kivity wrote: 5) Eliminating an extra copy on the host->guest path only makes a barely measurable difference. That's expected on a host->guest test. Zero copy is mostly important for guest->external, and with zerocopy already enabled in the guest (sendfile or nfs server workloads). Note that this will also need the copyless tun patches. If we're doing one copy from userspace->kernel anyway, this extra copy is probably lost in the noise. Cheers, Rusty.
Re: [PATCH] Work around dhclient brokenness
On Monday 18 August 2008 21:44:25 Herbert Xu wrote: On Mon, Aug 18, 2008 at 02:40:55PM +0300, Avi Kivity wrote: Isn't that turned on automatically for real hardware? And what's to prevent a broken dhclient together with the (presumably) hacked up initscripts that call ethtool? Well the idea is that only a fixed guest would even know about enabling this. For those not following closely: We already have a method for the guest to accept or reject features. Our problem is that the guest is already accepting the CSUM feature: but one critical userspace app (dhcp-client) can't actually handle it due to a bug. The proposal is to add another mechanism, whereby the host doesn't advertise CSUM, but advertises a new CSUM2 feature. The driver doesn't accept this by default: then guest userspace says hey, I *really can* handle CSUM. This would have to be done by resetting the device in the ethtool callback (that's how we renegotiate features). And guests need a special virtio hack in their init scripts. This leaves the small number of current users without CSUM (and hence GSO etc). Yet they might not use dhcp with bridging anyway. Worst of all, we have to document this embarrassing workaround. Neither solution is good. But I don't think Anthony's hack looks so bad after this. Rusty.
Re: [PATCH] Work around dhclient brokenness
On Tuesday 19 August 2008 13:17:57 Chris Wedgwood wrote: On Tue, Aug 19, 2008 at 10:45:20AM +1000, Rusty Russell wrote: For those not following closely: We already have a method for the guest to accept or reject features. Our problem is that the guest is already accepting the CSUM feature: but one critical userspace app (dhcp-client) can't actually handle it due to a bug. Can't we just get dhcp-client fixed upstream and let the distros update in a couple of months? Herbert has already fixed the client, but it seems upstream hasn't released yet. The proposal is to add another mechanism, whereby the host doesn't advertise CSUM, but advertises a new CSUM2 feature. The driver doesn't accept this by default: then guest userspace says hey, I *really can* handle CSUM. This would have to be done by resetting the device in the ethtool callback (that's how we renegotiate features). And guests need a special virtio hack in their init scripts. If guests can have modified scripts for virtio and what not, they can have a fixed dhcp-client. They need to do both. This way if they don't, it still works, but networking is at a penalty (no CSUM offload). Rusty.
Re: [PATCH] Work around dhclient brokenness
On Tuesday 19 August 2008 15:17:08 Herbert Xu wrote: On Mon, Aug 18, 2008 at 10:13:55PM -0700, Chris Wedgwood wrote: CSUM2 sounds so ugly though. Features seem to get added and never removed; how about if this had a documented short lifetime (if it really must go in)? All we need is a simple toggle to disable checksum offload. Every NIC that offers receive checksum offload allows it to be disabled. virtio shouldn't be any different. Resetting the NIC seems a bit over the top. Not really. We could extend the protocol, but that's currently how feature negotiation works: you can't do it while the device is live. That seemed simplest. I learnt from Xen :) (Of course, we don't need to *disable* it, we need to *enable* it). Cheers, Rusty.
Re: [PATCH] Work around dhclient brokenness
On Tuesday 19 August 2008 15:28:40 Chris Wedgwood wrote: On Tue, Aug 19, 2008 at 03:17:08PM +1000, Herbert Xu wrote: All we need is a simple toggle to disable checksum offload. Every NIC that offers receive checksum offload allows it to be disabled. virtio shouldn't be any different. So why CSUM2 and not an ethtool interface then? We need both. CSUM2 is the new virtio-level feature. The ethtool interface allows you to toggle it (we don't support that currently, but that's simply slackness). Rusty.
Re: [PATCH] Work around dhclient brokenness
On Wednesday 20 August 2008 01:05:55 Chris Wedgwood wrote: On Tue, Aug 19, 2008 at 07:10:44PM +1000, Rusty Russell wrote: We need both. CSUM2 is the new virtio-level feature. Perhaps that's what I'm missing. How is this different to CSUM? Because current guests which say they can handle CSUM can't *really* handle it. That's the entire point. We need a new one, so that existing guests don't ack it (because they don't understand it), and new guests can ack it later. Hope that clarifies, Rusty.
Re: [PATCH] [v2] linux: virtio: Standardize virtio's concept of page size
On Tuesday 11 November 2008 10:07:09 Hollis Blanchard wrote: Both sides of the virtio interface must agree about how big a pfn really is. This is particularly an issue on architectures where the page size is configurable (e.g. PowerPC, IA64) -- the interface must be independent of PAGE_SHIFT. Currently there are three distinct problems: * The shift count used when passing the physical address of the ring to a PCI-based back end. * The ring layout itself is padded to span at least two pages. * The balloon driver operates in units of pages. Hi Hollis, The more I thought about this, the more I think we're not solving this as neatly as we could. The trigger was noting that we're breaking the userspace API (vring_size and vring_init are exposed to userspace): I know that qemu cut & pastes, but that's no excuse. So instead, I've introduced separate constants for each use. Yes, all these constants are 12/4096. But just to be contrary, at the end is a patch to change lguest to 128. And there's no reason this couldn't change in future using some guest detection scheme. Patch stream to follow... Rusty.
[PATCH 4/7] virtio: use LGUEST_VRING_ALIGN instead of relying on pagesize
This doesn't really matter, since lguest is i386 only at the moment, but we could actually choose a different value. (lguest doesn't have a guaranteed ABI). Signed-off-by: Rusty Russell [EMAIL PROTECTED] ---

 Documentation/lguest/lguest.c   |    6 +++---
 drivers/lguest/lguest_device.c  |    2 +-
 include/linux/lguest_launcher.h |    2 ++
 3 files changed, 6 insertions(+), 4 deletions(-)

diff -r c6744f753314 Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c	Wed Nov 12 22:02:52 2008 +1030
+++ b/Documentation/lguest/lguest.c	Wed Nov 12 22:11:03 2008 +1030
@@ -1030,7 +1030,7 @@
 		/* Zero out the virtqueues. */
 		for (vq = dev->vq; vq; vq = vq->next) {
 			memset(vq->vring.desc, 0,
-			       vring_size(vq->config.num, getpagesize()));
+			       vring_size(vq->config.num, LGUEST_VRING_ALIGN));
 			lg_last_avail(vq) = 0;
 		}
 	} else if (dev->desc->status & VIRTIO_CONFIG_S_FAILED) {
@@ -1211,7 +1211,7 @@
 	void *p;

 	/* First we need some memory for this virtqueue. */
-	pages = (vring_size(num_descs, getpagesize()) + getpagesize() - 1)
+	pages = (vring_size(num_descs, LGUEST_VRING_ALIGN) + getpagesize() - 1)
 		/ getpagesize();
 	p = get_pages(pages);
@@ -1228,7 +1228,7 @@
 	vq->config.pfn = to_guest_phys(p) / getpagesize();

 	/* Initialize the vring. */
-	vring_init(&vq->vring, num_descs, p, getpagesize());
+	vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN);

 	/* Append virtqueue to this device's descriptor. We use
 	 * device_config() to get the end of the device's current virtqueues;

diff -r c6744f753314 drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c	Wed Nov 12 22:02:52 2008 +1030
+++ b/drivers/lguest/lguest_device.c	Wed Nov 12 22:11:03 2008 +1030
@@ -250,7 +250,7 @@
 	/* Figure out how many pages the ring will take, and map that memory */
 	lvq->pages = lguest_map((unsigned long)lvq->config.pfn << PAGE_SHIFT,
 				DIV_ROUND_UP(vring_size(lvq->config.num,
-							PAGE_SIZE),
+							LGUEST_VRING_ALIGN),
 					     PAGE_SIZE));
 	if (!lvq->pages) {
 		err = -ENOMEM;

diff -r c6744f753314 include/linux/lguest_launcher.h
--- a/include/linux/lguest_launcher.h	Wed Nov 12 22:02:52 2008 +1030
+++ b/include/linux/lguest_launcher.h	Wed Nov 12 22:11:03 2008 +1030
@@ -59,4 +59,6 @@
 	LHREQ_IRQ, /* + irq */
 	LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
 };
+
+#define LGUEST_VRING_ALIGN	4096
 #endif /* _LINUX_LGUEST_LAUNCHER */
[PATCH 2/7] virtio: rename 'pagesize' arg to vring_init/vring_size
It's really the alignment desired for consumer/producer separation; historically this was x86 pagesize, but with PowerPC it'll still be x86 pagesize. And in theory lguest could choose a different value. Signed-off-by: Rusty Russell [EMAIL PROTECTED] ---

 include/linux/virtio_ring.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff -r c6761450d706 include/linux/virtio_ring.h
--- a/include/linux/virtio_ring.h	Wed Nov 12 21:47:57 2008 +1030
+++ b/include/linux/virtio_ring.h	Wed Nov 12 21:51:09 2008 +1030
@@ -83,7 +83,7 @@
  *	__u16 avail_idx;
  *	__u16 available[num];
  *
- *	// Padding to the next page boundary.
+ *	// Padding to the next align boundary.
  *	char pad[];
  *
  *	// A ring of used descriptor heads with free-running index.
@@ -93,19 +93,19 @@
  * };
  */
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
-			      unsigned long pagesize)
+			      unsigned long align)
 {
 	vr->num = num;
 	vr->desc = p;
 	vr->avail = p + num*sizeof(struct vring_desc);
-	vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + pagesize-1)
-			    & ~(pagesize - 1));
+	vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + align-1)
+			    & ~(align - 1));
 }

-static inline unsigned vring_size(unsigned int num, unsigned long pagesize)
+static inline unsigned vring_size(unsigned int num, unsigned long align)
 {
 	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (2 + num)
-		 + pagesize - 1) & ~(pagesize - 1))
+		 + align - 1) & ~(align - 1))
 		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
 }
[PATCH 3/7] virtio: Don't use PAGE_SIZE for vring alignment in virtio_pci.
That doesn't work for non-4k guests which are now appearing. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- drivers/virtio/virtio_pci.c |5 +++-- include/linux/virtio_pci.h |4 2 files changed, 7 insertions(+), 2 deletions(-) diff -r c6f6a3ab173b drivers/virtio/virtio_pci.c --- a/drivers/virtio/virtio_pci.c Wed Nov 12 21:51:09 2008 +1030 +++ b/drivers/virtio/virtio_pci.c Wed Nov 12 21:57:25 2008 +1030 @@ -216,7 +216,7 @@ struct virtio_pci_device *vp_dev = to_vp_device(vdev); struct virtio_pci_vq_info *info; struct virtqueue *vq; - unsigned long flags; + unsigned long flags, size; u16 num; int err; @@ -237,7 +237,8 @@ info-queue_index = index; info-num = num; - info-queue = kzalloc(PAGE_ALIGN(vring_size(num,PAGE_SIZE)), GFP_KERNEL); + size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); + info-queue = kzalloc(size, GFP_KERNEL); if (info-queue == NULL) { err = -ENOMEM; goto out_info; diff -r c6f6a3ab173b include/linux/virtio_pci.h --- a/include/linux/virtio_pci.hWed Nov 12 21:51:09 2008 +1030 +++ b/include/linux/virtio_pci.hWed Nov 12 21:57:25 2008 +1030 @@ -57,4 +57,8 @@ /* How many bits to shift physical queue address written to QUEUE_PFN. * 12 is historical, and due to x86 page size. */ #define VIRTIO_PCI_QUEUE_ADDR_SHIFT12 + +/* The alignment to use between consumer and producer parts of vring. + * x86 pagesize again. */ +#define VIRTIO_PCI_VRING_ALIGN 4096 #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/7] virtio: use KVM_S390_VIRTIO_RING_ALIGN instead of relying on pagesize
This doesn't really matter, since s390 pagesize is 4k anyway. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- arch/s390/include/asm/kvm_virtio.h |3 +++ drivers/s390/kvm/kvm_virtio.c |9 ++--- 2 files changed, 9 insertions(+), 3 deletions(-) diff -r 5c9c0cc58458 arch/s390/include/asm/kvm_virtio.h --- a/arch/s390/include/asm/kvm_virtio.hWed Nov 12 22:05:59 2008 +1030 +++ b/arch/s390/include/asm/kvm_virtio.hWed Nov 12 22:10:29 2008 +1030 @@ -50,6 +50,9 @@ #define KVM_S390_VIRTIO_RESET 1 #define KVM_S390_VIRTIO_SET_STATUS 2 +/* Overkill: we have the consumer and producer writing to separate pages. */ +#define KVM_S390_VIRTIO_RING_ALIGN 4096 + #ifdef __KERNEL__ /* early virtio console setup */ #ifdef CONFIG_S390_GUEST diff -r 5c9c0cc58458 drivers/s390/kvm/kvm_virtio.c --- a/drivers/s390/kvm/kvm_virtio.c Wed Nov 12 22:05:59 2008 +1030 +++ b/drivers/s390/kvm/kvm_virtio.c Wed Nov 12 22:10:29 2008 +1030 @@ -188,7 +188,8 @@ config = kvm_vq_config(kdev-desc)+index; err = vmem_add_mapping(config-address, - vring_size(config-num, PAGE_SIZE)); + vring_size(config-num, + KVM_S390_VIRTIO_RING_ALIGN)); if (err) goto out; @@ -209,7 +210,8 @@ return vq; unmap: vmem_remove_mapping(config-address, - vring_size(config-num, PAGE_SIZE)); + vring_size(config-num, + KVM_S390_VIRTIO_RING_ALIGN)); out: return ERR_PTR(err); } @@ -220,7 +222,8 @@ vring_del_virtqueue(vq); vmem_remove_mapping(config-address, - vring_size(config-num, PAGE_SIZE)); + vring_size(config-num, + KVM_S390_VIRTIO_RING_ALIGN)); } /* -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/7] virtio: hand virtio ring alignment as argument to vring_new_virtqueue
This allows each virtio user to hand in the alignment appropriate to their virtio_ring structures. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- drivers/lguest/lguest_device.c |4 ++-- drivers/s390/kvm/kvm_virtio.c |3 ++- drivers/virtio/virtio_pci.c|4 ++-- drivers/virtio/virtio_ring.c |3 ++- include/linux/virtio_ring.h|1 + 5 files changed, 9 insertions(+), 6 deletions(-) diff -r 6ea441813d18 drivers/lguest/lguest_device.c --- a/drivers/lguest/lguest_device.cWed Nov 12 21:57:55 2008 +1030 +++ b/drivers/lguest/lguest_device.cWed Nov 12 22:01:45 2008 +1030 @@ -259,8 +259,8 @@ /* OK, tell virtio_ring.c to set up a virtqueue now we know its size * and we've got a pointer to its pages. */ - vq = vring_new_virtqueue(lvq-config.num, vdev, lvq-pages, -lg_notify, callback); + vq = vring_new_virtqueue(lvq-config.num, LGUEST_VRING_ALIGN, +vdev, lvq-pages, lg_notify, callback); if (!vq) { err = -ENOMEM; goto unmap; diff -r 6ea441813d18 drivers/s390/kvm/kvm_virtio.c --- a/drivers/s390/kvm/kvm_virtio.c Wed Nov 12 21:57:55 2008 +1030 +++ b/drivers/s390/kvm/kvm_virtio.c Wed Nov 12 22:01:45 2008 +1030 @@ -192,7 +192,8 @@ if (err) goto out; - vq = vring_new_virtqueue(config-num, vdev, (void *) config-address, + vq = vring_new_virtqueue(config-num, KVM_S390_VIRTIO_RING_ALIGN, +vdev, (void *) config-address, kvm_notify, callback); if (!vq) { err = -ENOMEM; diff -r 6ea441813d18 drivers/virtio/virtio_pci.c --- a/drivers/virtio/virtio_pci.c Wed Nov 12 21:57:55 2008 +1030 +++ b/drivers/virtio/virtio_pci.c Wed Nov 12 22:01:45 2008 +1030 @@ -249,8 +249,8 @@ vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ - vq = vring_new_virtqueue(info-num, vdev, info-queue, -vp_notify, callback); + vq = vring_new_virtqueue(info-num, VIRTIO_PCI_VRING_ALIGN, +vdev, info-queue, vp_notify, callback); if (!vq) { err = -ENOMEM; goto out_activate_queue; diff -r 6ea441813d18 drivers/virtio/virtio_ring.c --- a/drivers/virtio/virtio_ring.c Wed Nov 12 21:57:55 2008 +1030 +++ 
b/drivers/virtio/virtio_ring.c Wed Nov 12 22:01:45 2008 +1030 @@ -274,6 +274,7 @@ }; struct virtqueue *vring_new_virtqueue(unsigned int num, + unsigned int vring_align, struct virtio_device *vdev, void *pages, void (*notify)(struct virtqueue *), @@ -292,7 +293,7 @@ if (!vq) return NULL; - vring_init(vq-vring, num, pages, PAGE_SIZE); + vring_init(vq-vring, num, pages, vring_align); vq-vq.callback = callback; vq-vq.vdev = vdev; vq-vq.vq_ops = vring_vq_ops; diff -r 6ea441813d18 include/linux/virtio_ring.h --- a/include/linux/virtio_ring.h Wed Nov 12 21:57:55 2008 +1030 +++ b/include/linux/virtio_ring.h Wed Nov 12 22:01:45 2008 +1030 @@ -115,6 +115,7 @@ struct virtqueue; struct virtqueue *vring_new_virtqueue(unsigned int num, + unsigned int vring_align, struct virtio_device *vdev, void *pages, void (*notify)(struct virtqueue *vq), -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] lguest: change virtio ring alignment.
This is just to test that the constant is wired through correctly. lguest has no ABI so I could apply this patch, but it breaks the don't be an asshole rule. diff -r 19e5f3f3a4be Documentation/lguest/Makefile --- a/Documentation/lguest/Makefile Wed Nov 12 22:27:50 2008 +1030 +++ b/Documentation/lguest/Makefile Wed Nov 12 22:38:52 2008 +1030 @@ -4,5 +4,7 @@ all: lguest +lguest: ../../include/linux/lguest_launcher.h + clean: rm -f lguest diff -r 19e5f3f3a4be include/linux/lguest_launcher.h --- a/include/linux/lguest_launcher.h Wed Nov 12 22:27:50 2008 +1030 +++ b/include/linux/lguest_launcher.h Wed Nov 12 22:38:52 2008 +1030 @@ -60,5 +60,6 @@ LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */ }; -#define LGUEST_VRING_ALIGN 4096 +/* A cacheline is enough separation between producer and consumer. */ +#define LGUEST_VRING_ALIGN 128 #endif /* _LINUX_LGUEST_LAUNCHER */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] virtio: Don't use PAGE_SIZE in virtio_pci.c
The virtio PCI devices don't depend on the guest page size. This matters now PowerPC virtio is gaining ground (they like 64k pages). Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- drivers/virtio/virtio_pci.c |2 +- include/linux/virtio_pci.h |4 2 files changed, 5 insertions(+), 1 deletion(-) diff -r 8014f9c51f4e drivers/virtio/virtio_pci.c --- a/drivers/virtio/virtio_pci.c Wed Nov 12 21:12:39 2008 +1030 +++ b/drivers/virtio/virtio_pci.c Wed Nov 12 21:47:21 2008 +1030 @@ -244,7 +244,7 @@ } /* activate the queue */ - iowrite32(virt_to_phys(info-queue) PAGE_SHIFT, + iowrite32(virt_to_phys(info-queue) VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ diff -r 8014f9c51f4e include/linux/virtio_pci.h --- a/include/linux/virtio_pci.hWed Nov 12 21:12:39 2008 +1030 +++ b/include/linux/virtio_pci.hWed Nov 12 21:47:21 2008 +1030 @@ -53,4 +53,8 @@ /* Virtio ABI version, this must match exactly */ #define VIRTIO_PCI_ABI_VERSION 0 + +/* How many bits to shift physical queue address written to QUEUE_PFN. + * 12 is historical, and due to x86 page size. */ +#define VIRTIO_PCI_QUEUE_ADDR_SHIFT12 #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [v2] linux: virtio: Standardize virtio's concept of page size
On Thursday 13 November 2008 02:46:31 Hollis Blanchard wrote: On Wed, 2008-11-12 at 22:51 +1030, Rusty Russell wrote: On Tuesday 11 November 2008 10:07:09 Hollis Blanchard wrote: Both sides of the virtio interface must agree about how big a pfn really is. This is particularly an issue on architectures where the page size is configurable (e.g. PowerPC, IA64) -- the interface must be independent of PAGE_SHIFT. Currently there are three distinct problems: * The shift count used when passing the physical address of the ring to a PCI-based back end. * The ring layout itself is padded to span at least two pages. * The balloon driver operates in units of pages. Hi Hollis, The more I thought about this, the more I think we're not solving this as neatly as we could. The trigger was noting that we're breaking the userspace API (vring_size and vring_init are exposed to userspace): I know that qemu cut pastes, but that's no excuse. So instead, I've introduced separate constants for each use. Yes, all these constants are 12/4096. But just to be contrary, at the end is a patch to change lguest to 128. And there's no reason this couldn't change in future using some guest detection scheme. OK. I thought it was simpler to just say 4KB everywhere in all aspects of the virtio interface, but I'm happy as long as we solve the problem somehow. :) It is simpler, yes, but we can take this opportunity to deconflate them and make things clearer and better than the current code, not just fix it. Note that I still don't have a balloon patch: want to send me one? Thanks, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [v2] linux: virtio: Standardize virtio's concept of page size
On Friday 14 November 2008 08:18:33 Hollis Blanchard wrote: On Thu, 2008-11-13 at 08:44 +1030, Rusty Russell wrote: Note that I still don't have a balloon patch: want to send me one? linux: virtio-balloon: avoid implicit use of Linux page size in balloon interface Thanks, applied with the following diff: use tabs to indent, and add a BUILD_BUG_ON for the pagesize assumption. Signed-off-by: Rusty Russell [EMAIL PROTECTED] diff -r 50e970613233 drivers/virtio/virtio_balloon.c --- a/drivers/virtio/virtio_balloon.c Fri Nov 14 11:40:38 2008 +1030 +++ b/drivers/virtio/virtio_balloon.c Fri Nov 14 11:41:19 2008 +1030 @@ -58,10 +58,11 @@ static u32 page_to_balloon_pfn(struct page *page) { - unsigned long pfn = page_to_pfn(page); + unsigned long pfn = page_to_pfn(page); - /* Convert pfn from Linux page size to balloon page size. */ - return pfn << (PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT); + BUILD_BUG_ON(PAGE_SHIFT < VIRTIO_BALLOON_PFN_SHIFT); + /* Convert pfn from Linux page size to balloon page size. */ + return pfn << (PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT); } static void balloon_ack(struct virtqueue *vq)
Re: [PATCH 2/2] kvm-s390: implement config_changed for virtio on s390
On Wednesday 19 November 2008 08:14:13 Christian Borntraeger wrote: Can you replace the kvm-s390 patch with this one? Done. I hadn't pushed out the other one to linux-next yet anyway. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] virtio-pci queue allocation not page-aligned
On Wednesday 03 December 2008 05:38:21 Hollis Blanchard wrote: I just spent a number of hours tracking this one down, and I'm not too thrilled about it. vp_find_vq() does the memory allocation for virtio PCI rings, and it uses kzalloc() to do it. This is bad because the ring memory *must* be page-aligned. According to Anthony, at the time this code was written, various slab allocators were checked and all happened to return page-aligned buffers. So how did I hit a problem? I had enabled CONFIG_SLUB_DEBUG_ON while investigating an unrelated problem, which offset the address by 64 bytes. One option is to add a BUG_ON(addr & ~PAGE_MASK) to vp_find_vq(). That's better than nothing, but still stinks. It's a bug, we fix it. I've complained before, but since there was no evidence of it actually breaking, I didn't push. Prepare a patch, I'll try to get it in this release. Thanks, Rusty.
Re: [PATCH] virtio: make PCI devices take a virtio_pci module ref
On Thursday 04 December 2008 23:14:31 Mark McLoughlin wrote: Nothing takes a ref on virtio_pci, so even if you have devices in use, rmmod will attempt to unload the module. Fix by simply making each device take a ref on the module. Hi Mark, Taking a reference to oneself is almost always wrong. I'm a little surprised that a successful call to pci_device->probe doesn't bump the module count though. Jesse? Thanks, Rusty.
Re: [PATCH] virtio: make PCI devices take a virtio_pci module ref
On Saturday 06 December 2008 01:37:06 Mark McLoughlin wrote: Another example of a lack of an explicit dependency causing problems is Fedora's mkinitrd having this hack: if echo $PWD | grep -q /virtio-pci/ ; then findmodule virtio_pci fi which basically says if this is a virtio device, don't forget to include virtio_pci in the initrd too!. Now, mkinitrd is full of hacks, but this is a particularly unusual one. Um, I don't know what this does, sorry. I have no idea how Fedora chooses what to put in an initrd; I can't think of a sensible way of deciding what goes in and what doesn't other than lists and heuristics. But there really is no explicit dependency between virtio modules and virtio_pci. There just is for kvm/x86 at the moment, since that is how they use virtio. Running over another bus is certainly possible, though may never happen for x86 (happens today for s390). Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] kvm: use cpumask_var_t for cpus_hardware_enabled
On Monday 08 December 2008 02:25:50 Avi Kivity wrote: Rusty Russell wrote: This changes cpus_hardware_enabled from a cpumask_t to a cpumask_var_t: equivalent for CONFIG_CPUMASK_OFFSTACK=n, otherwise dynamically allocated. -static cpumask_t cpus_hardware_enabled; +static cpumask_var_t cpus_hardware_enabled; This isn't on stack, so it isn't buying us anything. It's the CONFIG_NR_CPUS=4096 but nr_cpu_ids=4 case which we win using dynamic allocation. Gotta love distribution kernels. Is the plan to drop cpumask_t? Yes. And undefine 'struct cpumask' if CONFIG_CPUMASK_OFFSTACK. That will stop assignment and on-stack declarations for all but the most determined. If so, we're penalizing non-stack users by forcing them to go through another pointer (and cacheline). Not quite: if !CONFIG_CPUMASK_OFFSTACK, cpumask_var_t == cpumask_t[1]. Blame Linus :) Cheers, Rusty.
Re: [PATCH 1/2] kvm: use modern cpumask primitives, no cpumask_t on stack
On Monday 08 December 2008 02:44:00 Avi Kivity wrote: Avi Kivity wrote: Rusty Russell wrote: We're getting rid on on-stack cpumasks for large NR_CPUS. 1) Use cpumask_var_t and alloc_cpumask_var (a noop normally). Fallback code is inefficient but never happens in practice. Wow, code duplication from Rusty. Things must be bad. Since we're in a get_cpu() here, how about a per_cpu static cpumask instead? I don't mind the inefficient fallback, just the duplication. Btw, for the general case, instead of forcing everyone to duplicate, how about: cpumask_var_t cpus; with_cpumask(cpus) { ... code to populate cpus smp_call_function_some(...); } end_with_cpumask(cpus); Where with_cpumask() allocates cpus, and uses a mutex + static fallback on failure. I'd prefer not to hide deadlocks that way :( I'll re-battle with that code to neaten it. There are only a few places which have these kind of issues. Thanks, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: Extract core of kvm_flush_remote_tlbs/kvm_reload_remote_mmus.
Avi said: Wow, code duplication from Rusty. Things must be bad. Something about glass houses comes to mind. But instead, a patch. Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- virt/kvm/kvm_main.c | 44 +++- 1 file changed, 15 insertions(+), 29 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -355,10 +355,11 @@ static void ack_flush(void *_completed) { } -void kvm_flush_remote_tlbs(struct kvm *kvm) +static bool make_all_cpus_request(struct kvm *kvm, unsigned int req) { int i, cpu, me; cpumask_t cpus; + bool called = false; struct kvm_vcpu *vcpu; me = get_cpu(); @@ -367,45 +368,30 @@ void kvm_flush_remote_tlbs(struct kvm *k vcpu = kvm-vcpus[i]; if (!vcpu) continue; - if (test_and_set_bit(KVM_REQ_TLB_FLUSH, vcpu-requests)) + if (test_and_set_bit(req, vcpu-requests)) continue; cpu = vcpu-cpu; if (cpu != -1 cpu != me) cpu_set(cpu, cpus); } - if (cpus_empty(cpus)) - goto out; - ++kvm-stat.remote_tlb_flush; - smp_call_function_mask(cpus, ack_flush, NULL, 1); -out: + if (!cpus_empty(cpus)) { + smp_call_function_mask(cpus, ack_flush, NULL, 1); + called = true; + } put_cpu(); + return called; +} + +void kvm_flush_remote_tlbs(struct kvm *kvm) +{ + if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) + ++kvm-stat.remote_tlb_flush; } void kvm_reload_remote_mmus(struct kvm *kvm) { - int i, cpu, me; - cpumask_t cpus; - struct kvm_vcpu *vcpu; - - me = get_cpu(); - cpus_clear(cpus); - for (i = 0; i KVM_MAX_VCPUS; ++i) { - vcpu = kvm-vcpus[i]; - if (!vcpu) - continue; - if (test_and_set_bit(KVM_REQ_MMU_RELOAD, vcpu-requests)) - continue; - cpu = vcpu-cpu; - if (cpu != -1 cpu != me) - cpu_set(cpu, cpus); - } - if (cpus_empty(cpus)) - goto out; - smp_call_function_mask(cpus, ack_flush, NULL, 1); -out: - put_cpu(); + make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD); } - int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id) { -- To unsubscribe from this list: send the line unsubscribe kvm in the 
body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] kvm: use modern cpumask primitives, no cpumask_t on stack
On Monday 08 December 2008 20:19:57 Avi Kivity wrote: Rusty Russell wrote: Btw, for the general case, instead of forcing everyone to duplicate, how about: cpumask_var_t cpus; with_cpumask(cpus) { ... code to populate cpus smp_call_function_some(...); } end_with_cpumask(cpus); Where with_cpumask() allocates cpus, and uses a mutex + static fallback on failure. I'd prefer not to hide deadlocks that way :( I'll re-battle with that code to neaten it. There are only a few places which have these kind of issues. cpuvar_get_maybe_mutex_lock(...); ... cpuvar_put_maybe_mutex_unlock(...); My thought was something like: /* This is an empty struct for !CONFIG_CPUMASK_OFFSTACK. */ static struct cpuvar_with_mutex_fallback fallback; ... cpumask_var_t tmp; cpuvar_alloc_fallback(tmp, fallback); ... cpuvar_free_fallback(tmp, fallback); We may get there eventually, but so far I've managed to produce less horrendous code in every case. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio_net: add link status handling
On Wednesday 10 December 2008 08:02:14 Anthony Liguori wrote: Mark McLoughlin wrote: Allow the host to inform us that the link is down by adding a VIRTIO_NET_F_STATUS which indicates that device status is available in virtio_net config. This is currently useful for simulating link down conditions (e.g. using proposed qemu 'set_link' monitor command) but would also be needed if we were to support device assignment via virtio. It would be nice if the virtio-net card wrote some acknowledgement that it has received the link status down/up events. How about of every status change event? ie. a generic virtio_pci solution? Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio_net: add link status handling
On Tuesday 09 December 2008 20:49:33 Mark McLoughlin wrote: Allow the host to inform us that the link is down by adding a VIRTIO_NET_F_STATUS which indicates that device status is available in virtio_net config. This has been a TODO for a while, thanks! + if (vi->status == v) + return; + + vi->status = v; + + if (vi->status & VIRTIO_NET_S_LINK_UP) { + netif_carrier_on(vi->dev); + netif_wake_queue(vi->dev); + } else { + netif_carrier_off(vi->dev); + netif_stop_queue(vi->dev); + } New status bits will screw this logic unless we count on the host not to set them. I suggest: /* Ignore unknown (future) status bits */ v &= VIRTIO_NET_S_LINK_UP; Cheers, Rusty.
Re: [PATCH] virtio_net: add link status handling
On Thursday 11 December 2008 05:04:44 Mark McLoughlin wrote: On Tue, 2008-12-09 at 21:11 -0600, Anthony Liguori wrote: Rusty Russell wrote: On Wednesday 10 December 2008 08:02:14 Anthony Liguori wrote: It would be nice if the virtio-net card wrote some acknowledgement that it has received the link status down/up events. How about of every status change event? ie. a generic virtio_pci solution? A really simple way to do it would just be to have another status field that was the guest's status (verses the host requested status which the current field is). All config reads/writes result in exits so it's easy to track. Adding YA virtio event may be a little overkill. Sounds very reasonable; that and Rusty's mask out unknown bits suggestion in the version below. Not quite what I was after. I've taken the original patch, added the masking change. I'll test here and feed to DaveM. As far as Anthony's live migrate race, I've decided to declare with Canutian arrogance that it Will Never Happen. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: use modern cpumask primitives, no cpumask_t on stack
On Friday 12 December 2008 03:21:24 Hollis Blanchard wrote: This patch breaks uniprocessor builds, because smp_call_function_many() is only defined for CONFIG_SMP. Good catch. I missed it because it's fixed as a side-effect of a later patch in my series (before I convert users). Linus, can you please apply this fix? Thanks, Rusty. Subject: Define smp_call_function_many for UP Otherwise those using it in transition patches (eg. kvm) can't compile with CONFIG_SMP=n: arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'make_all_cpus_request': arch/x86/kvm/../../../virt/kvm/kvm_main.c:380: error: implicit declaration of function 'smp_call_function_many' Signed-off-by: Rusty Russell ru...@rustcorp.com.au diff --git a/include/linux/smp.h b/include/linux/smp.h --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -146,6 +147,8 @@ static inline void smp_send_reschedule(i }) #define smp_call_function_mask(mask, func, info, wait) \ (up_smp_call_function(func, info)) +#define smp_call_function_many(mask, func, info, wait) \ + (up_smp_call_function(func, info)) static inline void init_call_single_data(void) { } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Define smp_call_function_many for UP
Linus, please apply. Otherwise those using it in transition patches (eg. kvm) can't compile with CONFIG_SMP=n: arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'make_all_cpus_request': arch/x86/kvm/../../../virt/kvm/kvm_main.c:380: error: implicit declaration of function 'smp_call_function_many' Signed-off-by: Rusty Russell ru...@rustcorp.com.au --- include/linux/smp.h |2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/smp.h b/include/linux/smp.h --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -146,6 +147,8 @@ static inline void smp_send_reschedule(i }) #define smp_call_function_mask(mask, func, info, wait) \ (up_smp_call_function(func, info)) +#define smp_call_function_many(mask, func, info, wait) \ + (up_smp_call_function(func, info)) static inline void init_call_single_data(void) { } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Define smp_call_function_many for UP
On Saturday 20 December 2008 03:33:26 Linus Torvalds wrote: Christmas at Rusty's house: Dreary. Far from! Sure, it's Arabella's first Xmas, but a Very Kernel Christmas can be lots of fun. Why just this morning, we did this together! See if you can tell which bits are hers... Subject: Decorate lib/rbtree.c From: Rusty and Arabella Russell rustabe...@rustcorp.com.au A slight enhancement to the comments in lib/rbtree.c. Signed-off-by: Rusty and Arabella Russell rustabe...@rustcorp.com.au --- lib/rbtree.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/lib/rbtree.c b/lib/rbtree.c index 48499c2..7848186 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -17,7 +17,8 @@ along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - linux/lib/rbtree.c + An excellent description of this algorithm can be found at: + http://en.wikipedia.org/wiki/Red-black_tree */ #include linux/rbtree.h @@ -291,6 +292,7 @@ EXPORT_SYMBOL(rb_erase); /* * This function returns the first node (in sort order) of the tree. + * ssdddsd */ struct rb_node *rb_first(struct rb_root *root) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. 
Signed-off-by: Rusty Russell ru...@rustcorp.com.au diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -7,6 +7,14 @@ config VIRTIO_RING tristate depends on VIRTIO +config VIRTIO_STATS + bool Virtio debugging stats (EXPERIMENTAL) + depends on VIRTIO_RING + select DEBUG_FS + ---help--- + Virtio stats collected by how full the ring is at any time, + presented under debugfs/virtio/name-vq/num-used/ + config VIRTIO_PCI tristate PCI driver for virtio devices (EXPERIMENTAL) depends on PCI EXPERIMENTAL diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -21,6 +21,7 @@ #include linux/virtio_config.h #include linux/device.h #include linux/slab.h +#include linux/debugfs.h /* virtio guest is communicating with a virtual device that actually runs on * a host processor. Memory barriers are used to control SMP effects. */ @@ -95,6 +96,11 @@ struct vring_virtqueue /* How to notify other side. FIXME: commonalize hcalls! */ void (*notify)(struct virtqueue *vq); +#ifdef CONFIG_VIRTIO_STATS + struct vring_stat *stats; + struct dentry *statdir; +#endif + #ifdef DEBUG /* They're supposed to lock for us. */ unsigned int in_use; @@ -106,6 +112,87 @@ struct vring_virtqueue #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +#ifdef CONFIG_VIRTIO_STATS +/* We have an array of these, indexed by how full the ring is. */ +struct vring_stat { + /* How many interrupts? */ + size_t interrupt_nowork, interrupt_work; + /* How many non-notify kicks, how many notify kicks, how many add notify? */ + size_t kick_no_notify, kick_notify, add_notify; + /* How many adds? */ + size_t add_direct, add_indirect, add_fail; + /* How many gets? */ + size_t get; + /* How many disable callbacks? */ + size_t disable_cb; + /* How many enables? 
*/ + size_t enable_cb_retry, enable_cb_success; +}; + +static struct dentry *virtio_stats; + +static void create_stat_files(struct vring_virtqueue *vq) +{ + char name[80]; + unsigned int i; + + /* Racy in theory, but we don't care. */ + if (!virtio_stats) + virtio_stats = debugfs_create_dir(virtio-stats, NULL); + + sprintf(name, %s-%s, dev_name(vq-vq.vdev-dev), vq-vq.name); + vq-statdir = debugfs_create_dir(name, virtio_stats); + + for (i = 0; i vq-vring.num; i++) { + struct dentry *dir; + + sprintf(name, %i, i); + dir = debugfs_create_dir(name, vq-statdir); + debugfs_create_size_t(interrupt_nowork, 0400, dir, + vq-stats[i].interrupt_nowork); + debugfs_create_size_t(interrupt_work, 0400, dir, + vq-stats[i].interrupt_work); + debugfs_create_size_t(kick_no_notify, 0400, dir, + vq-stats[i].kick_no_notify); + debugfs_create_size_t(kick_notify, 0400, dir, + vq-stats[i].kick_notify); + debugfs_create_size_t(add_notify, 0400, dir, + vq-stats[i].add_notify); + debugfs_create_size_t(add_direct, 0400, dir, + vq-stats[i].add_direct); + debugfs_create_size_t(add_indirect, 0400, dir
Re: Network performance with small packets
On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote: On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. Intent wasn't; I've not used tracepoints before, but maybe we should consider a longer-term monitoring solution? Patch welcome! Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Tue, 08 Mar 2011 20:21:18 -0600, Andrew Theurer haban...@linux.vnet.ibm.com wrote: On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote: On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote: I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Should we also add similar stat on vhost vq as well for monitoring vhost_signal vhost_notify? Tom L has started using Rusty's patches and found some interesting results, sent yesterday: http://marc.info/?l=kvmm=129953710930124w=2 Hmm, I'm not subscribed to kvm@ any more, so I didn't get this, so replying here: Also, it looks like vhost is sending a lot of notifications for packets it has received before the guest can get scheduled to disable notifications and begin processing the packets resulting in some lock contention in the guest (and high interrupt rates). Yes, this is a virtio design flaw, but one that should be fixable. We have room at the end of the ring, which we can put a last_used count. Then we can tell if wakeups are redundant, before the guest updates the flag. Here's an old patch where I played with implementing this: virtio: put last_used and last_avail index into ring itself. Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand: the ring is always a whole number of pages and there's hundreds of bytes of padding after the avail ring and the used ring, whatever the number of descriptors (which must be a power of 2). 
We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c |   23 +++++++++++---------
 include/linux/virtio_ring.h  |   12 +++++++++++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
 	/* Number we've added since last sync. */
 	unsigned int num_added;
 
-	/* Last used index we've seen. */
-	u16 last_used_idx;
-
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
@@ -278,12 +275,13 @@ static void detach_buf(struct vring_virt
 
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-	return vq->last_used_idx != vq->vring.used->idx;
+	return vring_last_used(&vq->vring) != vq->vring.used->idx;
 }
 
 static void *vring_get_buf(struct virtqueue *_vq, unsigned int *len)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_used_elem *u;
 	void *ret;
 	unsigned int i;
 
@@ -300,8 +298,11 @@ static void *vring_get_buf(struct virtqu
 		return NULL;
 	}
 
-	i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
-	*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
+	u = &vq->vring.used->ring[vring_last_used(&vq->vring) % vq->vring.num];
+	i = u->id;
+	*len = u->len;
+	/* Make sure we don't reload i after doing checks. */
+	rmb();
 
 	if (unlikely(i >= vq->vring.num)) {
 		BAD_RING(vq, "id %u out of range\n", i);
@@ -315,7 +316,8 @@ static void *vring_get_buf(struct virtqu
 	/* detach_buf clears data, so grab it now. */
 	ret = vq->data[i];
 	detach_buf(vq, i);
-	vq->last_used_idx++;
+	vring_last_used(&vq->vring)++;
+
 	END_USE(vq);
 	return ret;
 }
@@ -402,7 +404,6 @@ struct virtqueue *vring_new_virtqueue(un
 	vq->vq.name = name;
 	vq->notify = notify;
 	vq->broken = false;
-	vq->last_used_idx = 0;
 	vq->num_added = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
@@ -413,6 +414,10 @@ struct virtqueue *vring_new_virtqueue(un
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 
+	/* We publish indices whether they offer it or not: if not, it's junk
+	 * space anyway.  But calling this acknowledges the feature. */
+	virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_INDICES);
+
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)
 		vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
@@ -443,6 +448,8 @@ void vring_transport_features(struct vir
 		switch (i) {
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break
Re: TX from KVM guest virtio_net to vhost issues
On Wed, 09 Mar 2011 13:46:36 -0800, Shirley Ma mashi...@us.ibm.com wrote: Since we have had lots of performance discussions about virtio_net and vhost communication, I think it's better to have a common understanding of the code first; then we can seek the right directions to improve it. We also need to collect more statistics data on both virtio and vhost.

Let's look at TX first: from virtio_net (guest) to vhost (host), the send vq is shared between guest virtio_net and host vhost; it uses memory barriers to sync the changes. In the start: the guest virtio_net TX send completion interrupt (for freeing used skbs) is disabled. The guest virtio_net TX send completion interrupt is enabled only when the send vq is overrun and the guest needs to wait for vhost to consume more available skbs. Host vhost notification is enabled in the beginning (for consuming available skbs); it is disabled whenever the send vq is not empty. Once the send vq is empty, the notification is enabled by vhost.

In guest start_xmit(), it first frees used skbs, then sends available skbs to vhost; ideally the guest never enables TX send completion interrupts to free used skbs if vhost keeps posting used skbs in the send vq. In vhost handle_tx(), vhost is woken by the guest whenever the send vq has an skb to send; once the send vq is not empty, vhost exits handle_tx() without enabling notification. Ideally, if the guest keeps xmitting skbs into the send vq, the notification is never enabled. I don't see issues in this implementation.

However, in our TCP_STREAM small message size test, we found that somehow the guest couldn't see more used skbs to free, which caused frequent TX send queue overruns.

So it seems like the guest is outrunning the host?

In our TCP_RR small message size multiple streams test, we found that vhost couldn't see more xmit skbs in the send vq, thus it enabled notification too often.

And here the host is outrunning the guest?

What's the possible cause here in xmit? How are guest and vhost being scheduled?
Is it possible for guest virtio_net to cooperate with vhost for ideal performance, so that both can keep pace with the send vq without many notifications and exits?

We tend to blame the scheduler for all kinds of things, until we find the real cause :) Nailing things to different CPUs might help us determine if it really is the scheduler...

But if one side outruns the other, it does a lot of unnecessary work notifying/interrupting it over and over again before the host/guest gets a chance to shut notifications/interrupts off. Hence the last_used publishing patch (Xen does this right, I screwed it up).

Long weekend here, and I'm otherwise committed. But if no one has cleaned up that patch by early next week, I'll try to do so and see if we can make a real difference. Cheers, Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Wed, 16 Mar 2011 17:12:55 -0700, Shirley Ma mashi...@us.ibm.com wrote: Signed-off-by: Shirley Ma x...@us.ibm.com

This is fascinating... and deeply weird. OK, what's the difference between calling xmit_skb and ignoring failure, and this patch which figures out it's going to fail before calling xmit_skb? ie. what if you *just* delete this:

@@ -605,20 +620,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);
 
-	/* Apparently nice girls don't return TX_BUSY; stop the queue
-	 * before it gets out of hand.  Naturally, this wastes entries. */
-	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
-			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
-			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
-			}
-		}
-	}
-
 	return NETDEV_TX_OK;
 }

Thanks! Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Tue, 22 Mar 2011 13:36:50 +0200, Michael S. Tsirkin m...@redhat.com wrote: On Mon, Mar 21, 2011 at 11:03:07AM -0700, Shirley Ma wrote: On Fri, 2011-03-18 at 18:41 -0700, Shirley Ma wrote: + /* Drop packet instead of stop queue for better performance */

I would like to see some justification as to why this is the right way to go and not just papering over the real problem.

Fair. KVM guest virtio_net TX queue stop/restart is pretty expensive, which involves: 1. Guest enable callback: one memory barrier, interrupt flag set

Missed this cost: for historical reasons, it also involves a guest exit from an I/O write (PCI_QUEUE_NOTIFY).

OK, after some research, it looks like the reason was the tx timer that qemu used to use. So the hack of avoiding the add_buf call will avoid this kick and so break these hosts. I guess we can add a feature bit to detect a new host and so avoid the kick. We are running low on feature bits unfortunately, but just for testing, could you quantify the difference that this makes using the following patch:

Performance would suffer for those ancient qemus if we didn't do this, but it wouldn't be fatal to them. I think we should just remove it; the standard certainly doesn't mention it. Cheers, Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
With simply removing the notify here, it does help the case when TX overrun hits too often, for example for 1K message size, the single TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s. OK, we'll be getting rid of the kick on full, so please delete that on all benchmarks. Now, does the capacity check before add_buf() still win anything? I can't see how unless we have some weird bug. Once we've sorted that out, we should look at the more radical change of publishing last_used and using that to intuit whether interrupts should be sent. If we're not careful with ordering and barriers that could introduce more bugs. Anything else on the optimization agenda I've missed? Thanks, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Thu, 24 Mar 2011 10:46:49 -0700, Shirley Ma mashi...@us.ibm.com wrote: On Thu, 2011-03-24 at 16:28 +0200, Michael S. Tsirkin wrote: Several other things I am looking at, welcome cooperation: 1. It's probably a good idea to update the avail index immediately instead of upon kick: for RX this might help parallelism with the host.

Is it possible to use the same idea as publishing the last used idx to publish the avail idx? Then we can save guest iowrites/exits.

Yes, it should be symmetrical. Test independently of course, but the same logic applies. Thanks! Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Thu, 24 Mar 2011 16:28:22 +0200, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Mar 24, 2011 at 11:00:53AM +1030, Rusty Russell wrote: With simply removing the notify here, it does help the case when TX overrun hits too often, for example for 1K message size, the single TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s.

OK, we'll be getting rid of the kick on full, so please delete that on all benchmarks. Now, does the capacity check before add_buf() still win anything? I can't see how unless we have some weird bug. Once we've sorted that out, we should look at the more radical change of publishing last_used and using that to intuit whether interrupts should be sent. If we're not careful with ordering and barriers that could introduce more bugs.

Right. I am working on this, and trying to be careful. One thing I'm in doubt about: sometimes we just want to disable interrupts. Should we still use flags in that case? I thought that if we make the published index run from 0 to vq->num - 1, then a special value in the index field could disable interrupts completely. We could even reuse the space for the flags field to stick the index in. Too complex?

Making the index free-running avoids the full-or-empty confusion, plus offers an extra debugging insight. I think that if they really want to disable interrupts, the flag should still work, and when the client accepts the publish last_idx feature they are accepting that interrupts may be omitted if they haven't updated last_idx yet.

Anything else on the optimization agenda I've missed? Thanks, Rusty.

Several other things I am looking at, welcome cooperation: 1. It's probably a good idea to update the avail index immediately instead of upon kick: for RX this might help parallelism with the host.

Yes, once we've done everything else, we should measure this. It makes sense.

2. Adding an API to add a single buffer instead of s/g seems to help a bit.
This goes last, since it's kind of an ugly hack, but all internal to Linux if we decide it's a win.

3. For TX sometimes we free a single buffer, sometimes a ton of them, which might make the transmit latency vary. It's probably a good idea to limit this: maybe free the minimal number possible to keep the device going without stops, maybe free up to MAX_SKB_FRAGS.

This kind of heuristic is going to be quite variable depending on circumstance, I think, so it's a lot of work to make sure we get it right.

4. If the ring is full, we now notify right after the first entry is consumed. For TX this is suboptimal; we should try delaying the interrupt on the host.

Lguest already does that: it only sends an interrupt when it's run out of things to do. It does update the used ring, however, as it processes them. This seems sensible to me, but needs to be measured separately as well.

More ideas; it would be nice if someone could try them out: 1. We are allocating/freeing buffers for indirect descriptors. Use some kind of pool instead? And we could preformat part of the descriptor.

We need some poolish mechanism for virtio_blk too; perhaps an allocation callback which both can use (virtio_blk to alloc from a pool, virtio_net to recycle?). Along similar lines to preformatting, we could actually try to prepend the skb_vnet_hdr to the vnet data, and use a single descriptor for the hdr and the first part of the packet. Though IIRC, qemu's virtio barfs if the first descriptor isn't just the hdr (barf...).

2. I didn't have time to work on virtio2 ideas presented at the kvm forum yet, any takers?

I didn't even attend. But I think that virtio2 is moribund for the moment; there wasn't enough demand and it's clear that there are optimizations unexplored in virtio1. Cheers, Rusty.
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
On Sun, 27 Mar 2011 09:52:54 +0200, Michael S. Tsirkin m...@redhat.com wrote: Though IIRC, qemu's virtio barfs if the first descriptor isn't just the hdr (barf...).

Maybe we can try fixing this before adding more flags; then e.g. the publish used flag can be reused to also tell us the layout is flexible. Or just add a feature flag for that.

We should probably do this at some stage, yes.

2. I didn't have time to work on virtio2 ideas presented at the kvm forum yet, any takers? I didn't even attend.

Hmm, right. But what was presented there was discussed on list as well: a single R/W descriptor ring with a valid bit instead of 2 rings + a descriptor array.

I'll be happy when we reach the point that the extra cacheline is hurting us :) Then we should do direct descriptors with a cookie as the value to hand back when finished. That seems to be close to optimal.

I agree absolutely that not all lessons have been learned; playing with different ring layouts would make at least an interesting paper IMO.

Yes, I'd like to see the results... Thanks, Rusty.
Re: [PATCH 2/4] [RFC rev2] virtio-net changes
On Tue, 05 Apr 2011 20:38:52 +0530, Krishna Kumar krkum...@in.ibm.com wrote: Implement mq virtio-net driver. Though struct virtio_net_config changes, it works with the old qemu since the last element is not accessed unless qemu sets VIRTIO_NET_F_MULTIQUEUE. Signed-off-by: Krishna Kumar krkum...@in.ibm.com

Hi Krishna! This change looks fairly solid, but I'd prefer it split into a few stages for clarity. The first patch should extract out the struct send_queue and struct receive_queue, even though there's still only one. The second patch can then introduce VIRTIO_NET_F_MULTIQUEUE. You could split into more parts if that makes sense, but I'd prefer to see the mechanical changes separate from the feature addition.

-struct virtnet_info {
-	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
-	struct net_device *dev;
+/* Internal representation of a send virtqueue */
+struct send_queue {
+	/* Virtqueue associated with this send _queue */
+	struct virtqueue *svq;

You can simply call this "vq" now it's inside 'send_queue'.

+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];

Similarly, this can just be "sg".

+static void free_receive_bufs(struct virtnet_info *vi)
+{
+	int i;
+
+	for (i = 0; i < vi->numtxqs; i++) {
+		BUG_ON(vi->rq[i] == NULL);
+		while (vi->rq[i]->pages)
+			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
+	}
+}

You can skip the BUG_ON(), since the next line will have the same effect.

+/* Free memory allocated for send and receive queues */
+static void free_rq_sq(struct virtnet_info *vi)
+{
+	int i;
+
+	if (vi->rq) {
+		for (i = 0; i < vi->numtxqs; i++)
+			kfree(vi->rq[i]);
+		kfree(vi->rq);
+	}
+
+	if (vi->sq) {
+		for (i = 0; i < vi->numtxqs; i++)
+			kfree(vi->sq[i]);
+		kfree(vi->sq);
+	}

This looks weird, even though it's correct. I think we need a better name than numtxqs and shorter than num_queue_pairs. Let's just use num_queues; sure, there are both tx and rx queues, but I still think it's pretty clear.
+	for (i = 0; i < vi->numtxqs; i++) {
+		struct virtqueue *svq = vi->sq[i]->svq;
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
+			dev_kfree_skb(buf);
+		}
+	}

I know this isn't your code, but it's ugly :)

	while ((buf = virtqueue_detach_unused_buf(svq)) != NULL)
		dev_kfree_skb(buf);

+	for (i = 0; i < vi->numtxqs; i++) {
+		struct virtqueue *rvq = vi->rq[i]->rvq;
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(rvq);
+			if (!buf)
+				break;

Here too...

+#define MAX_DEVICE_NAME 16

This isn't a good idea, see below.

+static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int i, err = -ENOMEM;
+	int totalvqs;
+	char **names;

This whole routine is really messy. How about doing find_vqs first, then having routines like setup_rxq(), setup_txq() and setup_controlq()? That would make this neater:

	static int setup_rxq(struct send_queue *sq, char *name);

Also, use kasprintf() instead of kmalloc + sprintf.

+#if 1
+	/* Allocate/initialize parameters for recv/send virtqueues */

Why is this #if 1'd? I do prefer the #else method of doing two loops, myself (but use kasprintf). Cheers, Rusty.
Re: Network performance with small packets
On Tue, 12 Apr 2011 23:01:12 +0300, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote: Here's an old patch where I played with implementing this: ... virtio: put last_used and last_avail index into ring itself. Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand: This seems to be true for x86 kvm and lguest but is it true for s390? Yes, as the ring is page aligned so there's always room. Will this last bit work on s390? If I understand correctly the memory is allocated by host there? They have to offer the feature, so if the have some way of allocating non-page-aligned amounts of memory, they'll have to add those extra 2 bytes. So I think it's OK... Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Thu, 14 Apr 2011 19:03:59 +0300, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote: They have to offer the feature, so if the have some way of allocating non-page-aligned amounts of memory, they'll have to add those extra 2 bytes. So I think it's OK... Rusty. To clarify, my concern is that we always seem to try to map these extra 2 bytes, which thinkably might fail? No, if you look at the layout it's clear that there's always most of a page left for this extra room, both in the middle and at the end. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio_blk: support barriers without FLUSH feature
On Tue, 15 Jun 2010 10:13:48 pm Christoph Hellwig wrote: If we want to support barriers with the cache=writethrough mode in qemu we need to tell the block layer that we only need queue drains to implement a barrier. Follow the model set by SCSI and IDE and assume that there is no volatile write cache if the host doesn't advertize it. While this might imply working barriers on old qemu versions or other hypervisors that actually have a volatile write cache this is only a cosmetic issue - these hypervisors don't guarantee any data integrity with or without this patch, but with the patch we at least provide data ordering. Signed-off-by: Christoph Hellwig h...@lst.de Thanks, applied. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Remove virtio_blk VBID ioctl
On Sat, 19 Jun 2010 04:08:03 am Ryan Harper wrote: With the availability of a sysfs device attribute for examining disk serial numbers, the ioctl is no longer needed. The user-space changes for this aren't upstream yet so we don't have any users to worry about.

If John Cooper acks this, I'll push it to Linus immediately. Unfortunately we offered this interface in 2.6.34, and we're now removing it. That's unpleasant. Thanks, Rusty.

PS. John should have been cc'd on these patches!
Re: [PATCH 1/2] Add 'serial' attribute to virtio-blk devices
On Sat, 19 Jun 2010 04:08:02 am Ryan Harper wrote: Create a new attribute for virtio-blk devices that will fetch the serial number of the block device. This attribute can be used by udev to create disk/by-id symlinks for devices that don't have a UUID (filesystem) associated with them. ATA_IDENTIFY strings are special in that they can be up to 20 chars long and aren't required to be NULL-terminated. The buffer is also zero-padded, meaning that if the serial is 19 chars or less we get a NULL-terminated string. When copying this value into a string buffer, we must be careful to copy up to the NULL (if present), only 20 if it is longer, and not to attempt to NULL terminate; this isn't needed.

Signed-off-by: Ryan Harper ry...@us.ibm.com
Signed-off-by: john cooper john.coo...@redhat.com
---
 drivers/block/virtio_blk.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 258bc2a..f1ef26f 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -281,6 +281,31 @@ static int index_to_minor(int index)
 	return index >> PART_BITS;
 }
 
+/* Copy serial number from *s to *d.  Copy operation terminates on either
+ * encountering a nul in *s or after n bytes have been copied, whichever
+ * occurs first.  *d is not forcibly nul terminated.  Return # of bytes copied.
+ */
+static inline int serial_sysfs(char *d, char *s, int n)
+{
+	char *di = d;
+
+	while (*s && n--)
+		*d++ = *s++;
+	return d - di;
+}
+
+static ssize_t virtblk_serial_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	char id_str[VIRTIO_BLK_ID_BYTES];
+
+	if (IS_ERR(virtblk_get_id(disk, id_str)))
+		return 0;

0? Really? That doesn't seem very informative.

+	return serial_sysfs(buf, id_str, min(VIRTIO_BLK_ID_BYTES, PAGE_SIZE));

How about something like this:

	BUILD_BUG_ON(PAGE_SIZE < VIRTIO_BLK_ID_BYTES + 1);
	/* id_str is not necessarily nul-terminated! */
	buf[VIRTIO_BLK_ID_BYTES] = '\0';
	return virtblk_get_id(disk, buf);

Thanks, Rusty.