[Bug 47451] need to re-load driver in guest to make a hot-plug VF work
https://bugzilla.kernel.org/show_bug.cgi?id=47451

--- Comment #4 from Jay Ren yongjie@intel.com 2012-09-28 06:07:50 ---
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > Can we narrow down the kvm.git commit range at all? The one provided is
> > > over 12k commits covering v3.4-rc3 to v3.5-rc6. Thanks
> >
> > I did more testing. Do you remember the bug #43328 ( VT-d/SR-IOV totally
> > doesn't work in guest)? Just use your fix commit for that bug, I'll meet
> > this hot-plug issue. Is there a chance your patch fixed one bug but
> > introduced another one? :)
> >
> > commit a76beb14123a69ca080f5a5425e28b786d62318d
> > Author: Alex Williamson alex.william...@redhat.com
> > Date: Mon Jul 9 10:53:22 2012 -0600
> >
> >     KVM: Fix device assignment threaded irq handler
>
> Thanks for the narrowing it down. It looks like perhaps that patch was
> ineffective at trying to keep us out of using IRQF_ONESHOT due to
> irq_setup_forced_threading() re-enabling it. Does the problem go away if
> you change the two calls to request_threaded_irq() in that commit to use
> IRQF_NO_THREAD for the flag value in place of 0?

No, replacing flag value with 'IRQF_NO_THREAD' can't make PCIe NIC hot-plug work. Can you try with your commit a76beb14123a6 ?

BTW, sometimes, this bug is not so stable. Using '-m 512 -smp 2' option for qemu-kvm commandline to start a RHEL6.x guest will make it very easy to reproduce.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/28/2012 11:15 AM, H. Peter Anvin wrote:
> On 09/27/2012 10:38 PM, Raghavendra K T wrote:
> > +
> > +bool kvm_overcommitted()
> > +{
>
> This better not be C...

I think you meant I should have had like kvm_overcommitted(void) and (different function name perhaps) or is it the body of function?
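For readers following the exchange above, here is a minimal sketch of the likely point behind the remark about the empty parameter list. Before C23, `f()` in a C declaration leaves the parameters unspecified, so the compiler cannot check call sites, while `f(void)` gives a real prototype. The two static variables and the comparison in the body are placeholders invented for this illustration; they are not the check from the posted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs, used only so the sketch has something to compare. */
static int hypothetical_runnable_tasks = 3;
static int hypothetical_online_cpus = 4;

/*
 * Writing (void) declares a function that takes no arguments.
 * With empty parentheses, pre-C23 compilers would accept a call such as
 * kvm_overcommitted(1, 2) without a diagnostic.
 */
bool kvm_overcommitted(void)
{
	assert(hypothetical_online_cpus > 0);
	/* Placeholder logic: more runnable tasks than CPUs. */
	return hypothetical_runnable_tasks > hypothetical_online_cpus;
}
```

With the `(void)` form, a stray argument at a call site becomes a compile error instead of silently compiling.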
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On 09/28/2012 02:37 AM, Jiannan Ouyang wrote:
> On Thu, Sep 27, 2012 at 4:50 AM, Avi Kivity a...@redhat.com wrote:
> > On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
> > > I've actually implemented this preempted_bitmap idea.
> >
> > Interesting, please share the code if you can.
> >
> > > However, I'm doing this to expose this information to the guest, so the
> > > guest is able to know if the lock holder is preempted or not before
> > > spinning. Right now, I'm doing experiment to show that this idea works.
> > >
> > > I'm wondering what do you guys think of the relationship between the
> > > pv_ticketlock approach and PLE handler approach. Are we going to adopt
> > > PLE instead of the pv ticketlock, and why?
> >
> > Right now we're searching for the best solution. The tradeoffs are more
> > or less:
> >
> > PLE:
> > - works for unmodified / non-Linux guests
> > - works for all types of spins (e.g. smp_call_function*())
> > - utilizes an existing hardware interface (PAUSE instruction) so likely
> >   more robust compared to a software interface
> >
> > PV:
> > - has more information, so it can perform better
> >
> > Given these tradeoffs, if we can get PLE to work for moderate amounts of
> > overcommit then I'll prefer it (even if it slightly underperforms PV). If
> > we are unable to make it work well, then we'll have to add PV.
> >
> > --
> > error compiling committee.c: too many arguments to function
>
> FYI. The preempted_bitmap patch.
>
> I deleted some unrelated code in the generated patch file and it seems to
> have broken the patch file format... I hope anyone could teach me some
> solutions. However, it's pretty straightforward, four things: declaration,
> initialization, set and clear. I think you guys can figure it out easily!

As Avi suggested, you could check task state TASK_RUNNING in sched_out.
> Signed-off-by: Jiannan Ouyang ouy...@cs.pitt.edu
>
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 8613cbb..4fcb648 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -73,6 +73,16 @@ struct pv_info {
>  	const char *name;
>  };
>

I suppose we need this in common place since s390 also should have this, if we are using this information in vcpu_on_spin()..

> +struct pv_sched_info {
> +	unsigned long sched_bitmap;

Thinking, whether we need something similar to cpumask here? Only thing is we are representing guest (v)cpumask.

> +} __attribute__((__packed__));
> +
>  struct pv_init_ops {
>  	/*
>  	 * Patch may replace one of the defined code sequences with
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
> index 676b8c7..2242d22 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> +struct pv_sched_info pv_sched_info = {
> +	.sched_bitmap = (unsigned long)-1,
> +};
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 44ee712..3eb277e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -494,6 +494,11 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	mutex_init(&kvm->slots_lock);
>  	atomic_set(&kvm->users_count, 1);
>
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +	kvm->pv_sched_info.sched_bitmap = (unsigned long)-1;
> +#endif
> +
>  	r = kvm_init_mmu_notifier(kvm);
>  	if (r)
>  		goto out_err;
> @@ -2697,7 +2702,13 @@ struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
>  static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
>  {
>  	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
>
> +	set_bit(vcpu->vcpu_id, &vcpu->kvm->pv_sched_info.sched_bitmap);
>  	kvm_arch_vcpu_load(vcpu, cpu);
>  }
> @@ -2705,7 +2716,13 @@ static void kvm_sched_out(struct preempt_notifier *pn, struct task_struct *next)
>  {
>  	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
>
> +	clear_bit(vcpu->vcpu_id, &vcpu->kvm->pv_sched_info.sched_bitmap);
>  	kvm_arch_vcpu_put(vcpu);
>  }
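The four steps named in the message above (declaration, initialization, set, clear) can be sketched in user space roughly as follows. This is only a model under stated assumptions: the `pv_sched_model` type and the `model_*` helpers are hypothetical names, C11 atomics stand in for the kernel's `set_bit()`/`clear_bit()`, and the final query is what a guest-side check might look like if the bitmap were exposed to it.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: bit n set means vcpu n is currently scheduled in. */
struct pv_sched_model {
	atomic_ulong sched_bitmap;
};

/* Initialization: start with every vcpu assumed running, like (unsigned long)-1. */
static void model_init(struct pv_sched_model *m)
{
	atomic_init(&m->sched_bitmap, ~0UL);
}

/* Set: what the sched-in notifier would do for this vcpu. */
static void model_sched_in(struct pv_sched_model *m, unsigned int vcpu_id)
{
	assert(vcpu_id < sizeof(unsigned long) * CHAR_BIT);
	atomic_fetch_or(&m->sched_bitmap, 1UL << vcpu_id);
}

/* Clear: what the sched-out notifier would do for this vcpu. */
static void model_sched_out(struct pv_sched_model *m, unsigned int vcpu_id)
{
	assert(vcpu_id < sizeof(unsigned long) * CHAR_BIT);
	atomic_fetch_and(&m->sched_bitmap, ~(1UL << vcpu_id));
}

/* Query: a spinner could skip waiting on a lock holder that is not running. */
static bool model_vcpu_running(struct pv_sched_model *m, unsigned int vcpu_id)
{
	assert(vcpu_id < sizeof(unsigned long) * CHAR_BIT);
	return (atomic_load(&m->sched_bitmap) >> vcpu_id) & 1UL;
}
```

A single `unsigned long` caps the model at one word's worth of vcpus, which is the limitation behind the question about using something cpumask-like.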
Re: [PATCH] virtio-blk: Disable callback in virtblk_done()
Asias He as...@redhat.com writes:
> > I forgot about the cool hack which MST put in to defer event updates
> > using disable_cb/enable_cb.
>
> Hmm, are you talking about virtqueue_enable_cb_delayed()?

Just the fact that virtqueue_disable_cb() prevents updates of used_index, and then we do the update in virtqueue_enable_cb().

Cheers,
Rusty.
Re: [PATCH] virtio-blk: Disable callback in virtblk_done()
On 09/28/2012 02:08 PM, Rusty Russell wrote:
> Asias He as...@redhat.com writes:
> > > I forgot about the cool hack which MST put in to defer event updates
> > > using disable_cb/enable_cb.
> >
> > Hmm, are you talking about virtqueue_enable_cb_delayed()?
>
> Just the fact that virtqueue_disable_cb() prevents updates of used_index,
> and then we do the update in virtqueue_enable_cb().

Okay.

-- 
Asias
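The mechanism described in this thread can be illustrated with a toy model. This is a sketch under assumptions, not the virtio ring code: `struct toy_vq` and every `toy_*` function are invented for illustration, the model is single-threaded, and only the index comparison is meant to mirror `vring_need_event()` from the virtio ring header. While callbacks are "disabled" the driver simply stops refreshing its event index, so the device sees a stale value and raises no further interrupts; re-enabling publishes the current position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy view of the ring state; field names are illustrative. */
struct toy_vq {
	uint16_t used_idx;      /* advanced by the device per completed buffer */
	uint16_t last_used_idx; /* next completion the driver will consume */
	uint16_t used_event;    /* driver hint: interrupt once this index is passed */
};

/* Same comparison shape as vring_need_event(). */
static bool toy_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Disabling is passive in this model: used_event is just left stale. */
static void toy_disable_cb(struct toy_vq *vq)
{
	(void)vq;
}

/* Re-enable: publish our position, then report whether nothing slipped in. */
static bool toy_enable_cb(struct toy_vq *vq)
{
	vq->used_event = vq->last_used_idx;
	return vq->used_idx == vq->last_used_idx;
}

/* Consume one completion if available. */
static bool toy_get_buf(struct toy_vq *vq)
{
	if (vq->last_used_idx == vq->used_idx)
		return false;
	vq->last_used_idx++;
	return true;
}

/* Completion handler in the disable / drain / re-enable shape. */
static unsigned int toy_done(struct toy_vq *vq)
{
	unsigned int handled = 0;

	do {
		toy_disable_cb(vq);
		while (toy_get_buf(vq))
			handled++;
	} while (!toy_enable_cb(vq));

	assert(vq->last_used_idx == vq->used_idx);
	return handled;
}

/* Device side: complete n buffers and say whether it would interrupt. */
static bool toy_device_complete(struct toy_vq *vq, uint16_t n)
{
	uint16_t old_idx = vq->used_idx;

	vq->used_idx = (uint16_t)(old_idx + n);
	return toy_need_event(vq->used_event, vq->used_idx, old_idx);
}
```

In this model, completions that arrive after the first interrupt and before the handler re-enables callbacks do not trigger another interrupt, because `used_event` has not moved.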
Re: vga passthrough // questions about pci passthrough
On 2012-09-27 21:18, Alex Williamson wrote:
> On Thu, 2012-09-27 at 20:43 +0200, Martin Wolf wrote:
> > thank you for the information. i will try what you mentioned...
> > do you have some additional information about rebooting a VM with a
> > passed through videocard? (amd / ati 7870)
>
> I don't. Is the bsod on reboot only or does it also happen on shutdown?
> There's a slim chance it could be traced by enabling debug in the
> pci-assign driver and analyzing what the guest driver is trying to do.
> I'm hoping that q35 chipset support might resolve some issues with vga
> assignment as it exposes a topology that looks a bit more like one that a
> driver would expect on physical hardware. Thanks,

From our attempts to get more working than what NVIDIA Quadro cards support officially, my own experiments with q35 in this context and our discussions with NVIDIA, I'm pretty skeptical that this chipset will make a difference here. Most problems are due to those non-standard side channels to configure the hardware, memory mappings etc. And getting this working requires either cooperation of the vendor or *a lot* of reverse engineering.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
Re: Re: [RFC v2 PATCH 04/21] x86: Avoid RCU warnings on slave CPUs
Hi Paul,

Thank you for your comments, and sorry for my late reply.

On 2012/09/21 2:34, Paul E. McKenney wrote:
> On Thu, Sep 06, 2012 at 08:27:40PM +0900, Tomoki Sekiyama wrote:
> > Initialize rcu related variables to avoid warnings about RCU usage while
> > slave CPUs is running specified functions. Also notify RCU subsystem
> > before the slave CPU is entered into idle state.
>
> Hello, Tomoki,
>
> A few questions and comments interspersed below.

snip

> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index e8cfe377..45dfc1d 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -382,6 +382,8 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
> >  	f = per_cpu(slave_cpu_func, cpu);
> >  	per_cpu(slave_cpu_func, cpu).func = NULL;
> >
> > +	rcu_note_context_switch(cpu);
> > +
>
> Why not use rcu_idle_enter() and rcu_idle_exit()? These would tell RCU to
> ignore the slave CPU for the duration of its idle period. The way you
> have it, if a slave CPU stayed idle for too long, you would get RCU CPU
> stall warnings, and possibly system hangs as well.

That's true, rcu_idle_enter() and rcu_idle_exit() should be used when the slave cpu is idle. Thanks.

> Or is this being called from some task that is not the idle task? If so,
> you instead want the new rcu_user_enter() and rcu_user_exit() that are
> hopefully on their way into 3.7. Or maybe better, use a real idle task,
> so that idle_task(smp_processor_id()) returns true and RCU stops
> complaining. ;-)
>
> Note that CPUs that RCU believes to be idle are not permitted to contain
> RCU read-side critical sections, which in turn means no entering the
> scheduler, no sleeping, and so on. There is an RCU_NONIDLE() macro to
> tell RCU to pay attention to the CPU only for the duration of the
> statement passed to RCU_NONIDLE, and there are also an _rcuidle variant
> of the tracing statement to allow tracing from idle.

This was for KVM is called as `func', which contains RCU read-side critical sections, and rcu_virt_note_context_switch() (that is rcu_note_context_switch(cpu)) before entering guest. Maybe it should be replaced by rcu_user_enter() and rcu_user_exit() in the future.

> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -2589,6 +2589,9 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
> >  	switch (action) {
> >  	case CPU_UP_PREPARE:
> >  	case CPU_UP_PREPARE_FROZEN:
> > +#ifdef CONFIG_SLAVE_CPU
> > +	case CPU_SLAVE_UP_PREPARE:
> > +#endif
>
> Why do you need #ifdef here? Why not define CPU_SLAVE_UP_PREPARE
> unconditionally? Then if CONFIG_SLAVE_CPU=n, rcu_cpu_notify() would
> never be invoked with CPU_SLAVE_UP_PREPARE, so no problems.

Agreed. That will make the code simpler.

Thank you again,
-- 
Tomoki Sekiyama tomoki.sekiyama...@hitachi.com
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
Re: [PATCH] virtio-blk: Disable callback in virtblk_done()
On Thu, Sep 27, 2012 at 09:40:03AM +0930, Rusty Russell wrote:
> I forgot about the cool hack which MST put in to defer event updates
> using disable_cb/enable_cb.

I considered sticking some invalid value in event index on disable but in my testing it did not seem to give any gain, and knowing actual index of the other side is better for debugging.

-- 
MST
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:
> Peter, Can I post your patch with your from/sob.. in V2?
> Please let me know..

Yeah I guess ;-)
RE: [PATCH v4] kvm/fpu: Enable fully eager restore kvm FPU
> -----Original Message-----
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Thursday, September 27, 2012 6:12 PM
> To: Hao, Xudong
> Cc: kvm@vger.kernel.org; Zhang, Xiantao
> Subject: Re: [PATCH v4] kvm/fpu: Enable fully eager restore kvm FPU
>
> On 09/26/2012 07:54 AM, Hao, Xudong wrote:
> > > -----Original Message-----
> > > From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Avi Kivity
> > > Sent: Tuesday, September 25, 2012 4:16 PM
> > > To: Hao, Xudong
> > > Cc: kvm@vger.kernel.org; Zhang, Xiantao
> > > Subject: Re: [PATCH v4] kvm/fpu: Enable fully eager restore kvm FPU
> > >
> > > On 09/25/2012 04:32 AM, Hao, Xudong wrote:
> > > > > btw, it is clear that long term the fpu will always be eagerly
> > > > > loaded, as hosts and guests (and hardware) are updated. At that
> > > > > time it will make sense to remove the lazy fpu code entirely. But
> > > > > maybe that time is here already, since exits are rare and so the
> > > > > guest has a lot of chance to use the fpu, so eager fpu saves the
> > > > > #NM vmexit.
> > > > >
> > > > > Can you check a kernel compile on a westmere system? If eager fpu
> > > > > is faster there than lazy fpu, we can just make the fpu always
> > > > > eager and remove quite a bit of code.
> > > >
> > > > I remember westmere does not support Xsave, do you want performance
> > > > of fxsave/fxrstor ?
> > >
> > > Yes. If a westmere is fast enough then we can probably justify it. If
> > > you can run tests on Sandy/Ivy Bridge, even better.
> >
> > Run kernel compile on westmere, eager fpu is about 0.4% faster, seems
> > eager does not benefit it too much, so remain lazy fpu for lazy_allowed
> > fpu state?
>
> Why not make it eager all the time then? It will simplify the code quite
> a bit, no?

The code will be simpler if we make it eager; I'll remove the lazy logic.
[PATCH 1/3] virtio: add API to query ring capacity
It's sometimes necessary to query ring capacity after dequeueing a buffer. Add an API for this.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_ring.c | 19 +++
 include/linux/virtio.h       |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..ee3d80b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -715,4 +715,23 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
 
+/**
+ * virtqueue_get_capacity - query available ring capacity
+ * @vq: the struct virtqueue we're talking about.
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted), otherwise result is unreliable.
+ *
+ * Returns remaining capacity of queue.
+ * Note that it only really makes sense to treat all
+ * return values as available: indirect buffers mean that
+ * we can put an entire sg[] array inside a single queue entry.
+ */
+unsigned int virtqueue_get_capacity(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	return vq->num_free;
+}
+EXPORT_SYMBOL_GPL(virtqueue_get_capacity);
+
 MODULE_LICENSE("GPL");
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index a1ba8bb..fab61e8 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -50,6 +50,8 @@ void *virtqueue_detach_unused_buf(struct virtqueue *vq);
 
 unsigned int virtqueue_get_vring_size(struct virtqueue *vq);
 
+unsigned int virtqueue_get_capacity(struct virtqueue *vq);
+
 /**
  * virtio_device - representation of a device using virtio
  * @index: unique position on the virtio bus
-- 
MST
[PATCH 2/3] virtio-net: correct capacity math on ring full
Capacity math on ring full is wrong: we are looking at num_sg but that might be optimistic because of indirect buffer use.

The implementation also penalizes fast path with extra memory accesses for the benefit of ring full condition handling which is slow path.

It's easy to query ring capacity so let's do just that.

This change also makes it easier to move vnet header for tx around as follow-up patch does.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/net/virtio_net.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 83d2b0c..316f1be 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -95,7 +95,6 @@ struct skb_vnet_hdr {
 		struct virtio_net_hdr hdr;
 		struct virtio_net_hdr_mrg_rxbuf mhdr;
 	};
-	unsigned int num_sg;
 };
 
 struct padded_vnet_hdr {
@@ -557,10 +556,10 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static void free_old_xmit_skbs(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
-	unsigned int len, tot_sgs = 0;
+	unsigned int len;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 
 	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
@@ -571,16 +570,15 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 		stats->tx_packets++;
 		u64_stats_update_end(&stats->tx_syncp);
 
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
 		dev_kfree_skb_any(skb);
 	}
-	return tot_sgs;
 }
 
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+	unsigned num_sg;
 
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
 
@@ -619,8 +617,8 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 	else
 		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(vi->svq, vi->tx_sg, num_sg,
 				 0, skb, GFP_ATOMIC);
 }
 
@@ -664,7 +662,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		netif_stop_queue(dev);
 		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			free_old_xmit_skbs(vi);
+			capacity = virtqueue_get_capacity(vi->svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
 				netif_start_queue(dev);
 				virtqueue_disable_cb(vi->svq);
-- 
MST
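To make the "optimistic" capacity math in the commit message above concrete, here is a toy model. It assumes, as a simplification, that with indirect descriptors a multi-element buffer occupies exactly one ring slot; `struct toy_ring` and the `toy_*` helpers are hypothetical names, not the virtio API. Summing the s/g counts of freed buffers then overstates the space gained, while reading the free count directly does not.

```c
#include <assert.h>

/* Toy ring that only tracks free descriptor slots. */
struct toy_ring {
	unsigned int num_free;
	int indirect; /* non-zero: a multi-element buffer takes one slot */
};

/* Slots a buffer with num_sg elements really occupies in this model. */
static unsigned int toy_slots_for(const struct toy_ring *r, unsigned int num_sg)
{
	return (r->indirect && num_sg > 1) ? 1 : num_sg;
}

/* Queue a buffer; returns 0 on success, -1 if the ring is full. */
static int toy_add_buf(struct toy_ring *r, unsigned int num_sg)
{
	unsigned int need;

	assert(num_sg > 0);
	need = toy_slots_for(r, num_sg);
	if (need > r->num_free)
		return -1;
	r->num_free -= need;
	return 0;
}

/* Reclaim a completed buffer that was queued with num_sg elements. */
static void toy_get_buf(struct toy_ring *r, unsigned int num_sg)
{
	r->num_free += toy_slots_for(r, num_sg);
}

/* The direct query, analogous in spirit to asking the ring for num_free. */
static unsigned int toy_get_capacity(const struct toy_ring *r)
{
	return r->num_free;
}
```

For example, after filling a 4-slot indirect ring with four 3-element buffers and reclaiming two of them, the summed-num_sg estimate would be 6 while only 2 slots are actually free.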
[PATCH 3/3] virtio-net: put virtio net header inline with data
For small packets we can simplify xmit processing by linearizing buffers with the header: most packets seem to have enough head room we can use for this purpose.

Since existing hypervisors require that header is the first s/g element, we need a feature bit for this.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/net/virtio_net.c   | 44 +++-
 include/linux/virtio_net.h |  5 -
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 316f1be..6e6e53e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -67,6 +67,9 @@ struct virtnet_info {
 	/* Host will merge rx buffers for big packets (shake it! shake it!) */
 	bool mergeable_rx_bufs;
 
+	/* Host can handle any s/g split between our header and packet data */
+	bool any_header_sg;
+
 	/* enable config space updates */
 	bool config_enable;
 
@@ -576,11 +579,28 @@ static void free_old_xmit_skbs(struct virtnet_info *vi)
 
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 {
-	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
+	struct skb_vnet_hdr *hdr;
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
 	unsigned num_sg;
+	unsigned hdr_len;
+	bool can_push;
+
 
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
+	if (vi->mergeable_rx_bufs)
+		hdr_len = sizeof hdr->mhdr;
+	else
+		hdr_len = sizeof hdr->hdr;
+
+	can_push = vi->any_header_sg &&
+		!((unsigned long)skb->data & (__alignof__(*hdr) - 1)) &&
+		!skb_header_cloned(skb) && skb_headroom(skb) >= hdr_len;
+	/* Even if we can, don't push here yet as this would skew
+	 * csum_start offset below. */
+	if (can_push)
+		hdr = (struct skb_vnet_hdr *)(skb->data - hdr_len);
+	else
+		hdr = skb_vnet_hdr(skb);
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		hdr->hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
@@ -609,15 +629,18 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 		hdr->hdr.gso_size = hdr->hdr.hdr_len = 0;
 	}
 
-	hdr->mhdr.num_buffers = 0;
-
-	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
-	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		hdr->mhdr.num_buffers = 0;
 
-	num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
+	if (can_push) {
+		__skb_push(skb, hdr_len);
+		num_sg = skb_to_sgvec(skb, vi->tx_sg, 0, skb->len);
+		/* Pull header back to avoid skew in tx bytes calculations. */
+		__skb_pull(skb, hdr_len);
+	} else {
+		sg_set_buf(vi->tx_sg, hdr, hdr_len);
+		num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
+	}
 	return virtqueue_add_buf(vi->svq, vi->tx_sg, num_sg,
 				 0, skb, GFP_ATOMIC);
 }
@@ -1128,6 +1151,9 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_ANY_HEADER_SG))
+		vi->any_header_sg = true;
+
 	err = init_vqs(vi);
 	if (err)
 		goto free_stats;
@@ -1286,7 +1312,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
-	VIRTIO_NET_F_GUEST_ANNOUNCE,
+	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_ANY_HEADER_SG
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 2470f54..16a577b 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,6 +51,7 @@
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_GUEST_ANNOUNCE 21	/* Guest can announce device on the
 					 * network */
+#define VIRTIO_NET_F_ANY_HEADER_SG 22	/* Host can handle any header s/g */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 #define VIRTIO_NET_S_ANNOUNCE	2	/* Announcement is needed */
@@ -62,7 +63,9 @@ struct virtio_net_config {
 	__u16 status;
 } __attribute__((packed));
 
-/* This is the first element of the scatter-gather list.  If you don't
+/* This header comes first in the scatter-gather list.
+ * If VIRTIO_NET_F_ANY_HEADER_SG is not negotiated, it must
+ * be the first element of the scatter-gather list.  If you don't
  * specify GSO or CSUM features, you can simply ignore the header. */
 struct virtio_net_hdr {
 #define
[PATCH 0/3] virtio-net: inline header support
Thinking about Sasha's patches, we can reduce ring usage for virtio net small packets dramatically if we put virtio net header inline with the data. This can be done for free in case guest net stack allocated extra head room for the packet, and I don't see why would this have any downsides.

Even though with my recent patches qemu no longer requires header to be the first s/g element, we need a new feature bit to detect this. A trivial qemu patch will be sent separately.

We could get rid of an extra s/g for big packets too, but since in practice everyone enables mergeable buffers, I don't see much of a point.

Rusty, if you decide to pick this up I'll send a (rather trivial) spec patch shortly afterwards, but holidays are beginning here. Considering how simple the guest patch is, I hope it can make it in 3.7?

Also note that patch 1 and 2 are IMO a good idea without patch 3. If you decide to defer patch 3 pls consider 1/2 separately.

Before:

[root@virtlab203 qemu]# ssh robin ./netperf/bin/netperf -t TCP_RR -H 11.0.0.4
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.0.0.4 (11.0.0.4) port 0 AF_INET : demo
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    2992.88
16384  87380

After:

[root@virtlab203 qemu]# ssh robin ./netperf/bin/netperf -t TCP_RR -H 11.0.0.4
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.0.0.4 (11.0.0.4) port 0 AF_INET : demo
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    3195.57
16384  87380

Michael S. Tsirkin (3):
  virtio: add API to query ring capacity
  virtio-net: correct capacity math on ring full
  virtio-net: put virtio net header inline with data

 drivers/net/virtio_net.c     | 57 +++-
 drivers/virtio/virtio_ring.c | 19 +++
 include/linux/virtio.h       |  2 ++
 include/linux/virtio_net.h   |  5 +++-
 4 files changed, 66 insertions(+), 17 deletions(-)

-- 
MST
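The condition for placing the header inline, as described in this series, combines four checks: the feature was negotiated, the packet data is suitably aligned for the header, the header area is not shared, and there is enough headroom. A stand-alone sketch of that decision follows. The `toy_vnet_hdr` layout, `struct toy_pkt`, and the `toy_*` functions are hypothetical stand-ins for the skb machinery, chosen only so the check can be exercised in user space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in header; only its size and alignment matter for this sketch. */
struct toy_vnet_hdr {
	uint8_t flags;
	uint8_t gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Stand-in packet: data points somewhere at or after head. */
struct toy_pkt {
	unsigned char *head;
	unsigned char *data;
	bool header_cloned;
};

/* Bytes available in front of the packet data. */
static size_t toy_headroom(const struct toy_pkt *p)
{
	assert(p->data >= p->head);
	return (size_t)(p->data - p->head);
}

/* True when the header may be written directly in front of the data. */
static bool toy_can_push(bool any_header_sg, const struct toy_pkt *p, size_t hdr_len)
{
	return any_header_sg &&
		!((uintptr_t)p->data & (_Alignof(struct toy_vnet_hdr) - 1)) &&
		!p->header_cloned &&
		toy_headroom(p) >= hdr_len;
}
```

When the check passes, header and payload can share one scatter-gather entry; otherwise the header needs an entry of its own, which is the extra ring slot the cover letter aims to save for small packets.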
[PATCH qemu] virtio-net: add feature bit for any header s/g
Old qemu versions required that 1st s/g entry is the header.

My recent patchset titled "virtio-net: iovec handling cleanup" removed this limitation but a feature bit is needed so guests know it's safe to lay out header differently.

This patch applies on top and adds such a feature bit.

virtio net header inline with the data is beneficial for latency and small packet bandwidth.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 hw/virtio-net.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/virtio-net.h b/hw/virtio-net.h
index 36aa463..e7187e4 100644
--- a/hw/virtio-net.h
+++ b/hw/virtio-net.h
@@ -44,6 +44,7 @@
 #define VIRTIO_NET_F_CTRL_RX    18      /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
+#define VIRTIO_NET_F_ANY_HEADER_SG 22   /* Host can handle any header s/g */
 
 #define VIRTIO_NET_S_LINK_UP    1       /* Link is up */
 
@@ -186,5 +187,6 @@ struct virtio_net_ctrl_mac {
         DEFINE_PROP_BIT("ctrl_vq", _state, _field, VIRTIO_NET_F_CTRL_VQ, true), \
         DEFINE_PROP_BIT("ctrl_rx", _state, _field, VIRTIO_NET_F_CTRL_RX, true), \
         DEFINE_PROP_BIT("ctrl_vlan", _state, _field, VIRTIO_NET_F_CTRL_VLAN, true), \
-        DEFINE_PROP_BIT("ctrl_rx_extra", _state, _field, VIRTIO_NET_F_CTRL_RX_EXTRA, true)
+        DEFINE_PROP_BIT("ctrl_rx_extra", _state, _field, VIRTIO_NET_F_CTRL_RX_EXTRA, true), \
+        DEFINE_PROP_BIT("any_header_sg", _state, _field, VIRTIO_NET_F_ANY_HEADER_SG, true)
 #endif
-- 
MST
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote: On 09/27/2012 05:33 PM, Avi Kivity wrote: On 09/27/2012 01:23 PM, Raghavendra K T wrote: This gives us a good case for tracking preemption on a per-vm basis. As long as we aren't preempted, we can keep the PLE window high, and also return immediately from the handler without looking for candidates. 1) So do you think, deferring preemption patch ( Vatsa was mentioning long back) is also another thing worth trying, so we reduce the chance of LHP. Yes, we have to keep it in mind. It will be useful for fine grained locks, not so much so coarse locks or IPIs. Agree. I would still of course prefer a PLE solution, but if we can't get it to work we can consider preemption deferral. Okay. IIRC, with defer preemption : we will have hook in spinlock/unlock path to measure depth of lock held, and shared with host scheduler (may be via MSRs now). Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather give say one chance. A downside is that we have to do that even when undercommitted. Hopefully vcpu preemption is very rare when undercommitted, so it should not happen much at all. Also there may be a lot of false positives (deferred preemptions even when there is no contention). It will be interesting to see how this behaves with a very high lock activity in a guest. Once the scheduler defers preemption, is it for a fixed amount of time, or does it know to cut the deferral short as soon as the lock depth is reduced [by x]? Yes. That is a worry. 2) looking at the result (comparing A C) , I do feel we have significant in iterating over vcpus (when compared to even vmexit) so We still would need undercommit fix sugested by PeterZ (improving by 140%). ? Looking only at the current runqueue? My worry is that it misses a lot of cases. Maybe try the current runqueue first and then others. Or were you referring to something else? No. I was referring to the same thing. However. 
I had tried following also (which works well to check undercommited scenario). But thinking to use only for yielding in case of overcommit (yield in overcommit suggested by Rik) and keep undercommit patch as suggested by PeterZ [ patch is not in proper diff I suppose ].

Will test them.

Peter, Can I post your patch with your from/sob.. in V2? Please let me know..

---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 28f00bc..9ed3759 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return eligible;
 }
 #endif
+
+bool kvm_overcommitted()
+{
+	unsigned long load;
+
+	load = avenrun[0] + FIXED_1/200;
+	load = load >> FSHIFT;
+	load = (load << 7) / num_online_cpus();
+
+	if (load > 128)
+		return true;
+
+	return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
@@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int pass;
 	int i;
 
+	if (!kvm_overcommitted())
+		return;
+
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
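The load check in the snippet above is fixed-point arithmetic on the one-minute load average. A user-space sketch of the same computation follows, assuming the shift and comparison operators lost from the archived text were `>>`, `<<` and `>`. The function name and the `TOY_*` macros are hypothetical; `TOY_FSHIFT` is set to 11 to match the kernel's load-average scale, and the load value is passed in as a parameter instead of being read from `avenrun[]`.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FSHIFT  11                   /* bits of fractional precision */
#define TOY_FIXED_1 (1UL << TOY_FSHIFT)  /* fixed-point representation of 1.0 */

/*
 * avenrun0 is the one-minute load average in fixed point.
 * The result is true when the load exceeds one runnable task per online CPU.
 */
static bool toy_overcommitted(unsigned long avenrun0, unsigned int online_cpus)
{
	unsigned long load;

	assert(online_cpus > 0);
	load = avenrun0 + TOY_FIXED_1 / 200; /* small rounding term */
	load >>= TOY_FSHIFT;                 /* whole tasks; the fraction is dropped */
	load = (load << 7) / online_cpus;    /* scale so that 128 means 1.0 per CPU */
	return load > 128;
}
```

One property worth noting: because the fraction is dropped before scaling, a load of about 4.9 on 4 CPUs is still treated as not overcommitted in this sketch, while 5.0 is.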
Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
On Fri, Sep 28, 2012 at 02:07:26AM +, Auld, Will wrote: Marcelo, I tagged my comments below with [auld] to make it easier to read. Thanks, Will -Original Message- From: Marcelo Tosatti [mailto:mtosa...@redhat.com] Sent: Thursday, September 27, 2012 4:49 AM To: Auld, Will Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao; Liu, Jinsong Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM On Thu, Sep 27, 2012 at 08:31:22AM -0300, Marcelo Tosatti wrote: On Thu, Sep 27, 2012 at 12:50:16AM +, Auld, Will wrote: Marcelo, I think I am missing something. There should be no needed changes to current algorithms that exist today. Does it seem that I have broken Zachary's implementation somehow? Yes. compute_guest_tsc() function must take ia32_tsc_adjust into account. guest_read_tsc (and the SVM equivalent) also. [auld] I don't see how that function is broken. compute_guest_tsc() should return the TSC value accordingly to what is emulated via vcpu-arch.virtual_tsc_mult, but this can be fixed later. Also, must take into account VMX-SVM migration. In that case, you should export IA32_TSC_ADJUST along with IA32_TSC MSR. [auld] I'll give this more thought. Two different ways to go, allow this to only work on host processors with this feature or enable this for all VM independent of the underlying host processor capability. In the former case migrating cross architecture might be disallowed. In the later case sending only IA32_TSC on migration should be enough as the delta would be accounted for in tsc_offset of the control structure. That is fine, yes, if you want to migrate across, don't expose the feature. Which brings us back to the initial question, if there are other means to provide stable TSC, why use this MSR? For example, VMWare guests have no need to use this MSR (because the hypervisor provides TSC guarantees). [auld] Using this MSR simplifies the process of synchronizing the tsc for each logical processor because its value does not change with the clock. 
How do you write the same value to all the IA32_TIME_STAMP_COUNTER MSRs? Well, figure out what you want to write there, get all the processors to rendezvous at the same time, and have all logical processors complete their writes in a very small amount of time. This is in contrast to deciding the offset to write and then having all the logical processors write the offset. No worries about rendezvous, synchronization of the writes in time and such.

Then we come back to the two questions:

- Is there anyone from Intel working on the Linux host side, where it makes sense to use this?

[auld] I am not aware of anyone working on this for Linux.

- Are you sure it's worthwhile to expose this to KVM guests?

[auld] At least one OS that is commonly used as a guest is moving to implement this.

OK thanks.

Thanks, Will

-----Original Message-----
From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
Sent: Wednesday, September 26, 2012 5:29 PM
To: Auld, Will
Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao; Liu, Jinsong
Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

On Wed, Sep 26, 2012 at 10:58:46PM +, Auld, Will wrote:

Avi, Still working on your suggestions.

Marcelo, The purpose is to be able to run guests that implement this change and not require they revert to the older method of adjusting the TSC. I am making no assumption about whether the guest checks to see if the times are good enough or just runs an algorithm every time, but in any case this would allow the simpler, cleaner and less expensive algorithm to run if it exists.

Will, you can choose to not expose the feature. Correct? Because this conflicts with the model that has been envisioned and developed by Zachary... for that model to continue to be functional you'll have to make sure the TSC emulation is adjusted accordingly to consider IA32_TSC_ADJUST (for example, when trapping TSC). From that point of view, the patch below is incomplete. ...
or KVM can choose to never expose the feature via CPUID and handle TSC consistency itself (I understand your perspective of getting a task complete, but unfortunately from my POV it's not so simple).

Thanks, Will

The purpose of the IA32_TSC_ADJUST control is to make it easier for the operating system (host) to decrease the delta between cores to an acceptable value, so that applications can make use of direct RDTSC, correct?

Why is it necessary for the guests to make use of such an interface, if the hypervisor could provide proper TSC? (not against exposing it to the guests, just thinking out loud).

That is, if the purpose of the IA32_TSC_ADJUST is to provide proper synchronized TSC across cores, and newer
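
A minimal sketch of the synchronization argument made in the [auld] comments, written in Python as a toy model rather than as kernel code. It assumes only the architectural coupling between the two MSRs: a write that moves IA32_TIME_STAMP_COUNTER by X also moves IA32_TSC_ADJUST by X, and a write that moves IA32_TSC_ADJUST by X also moves the TSC by X. The class, the helper names and the numbers are hypothetical and are not taken from the patch under discussion.

```python
from dataclasses import dataclass


@dataclass
class LogicalCpu:
    """Toy model of one logical processor's TSC state (hypothetical)."""
    tsc: int             # IA32_TIME_STAMP_COUNTER
    tsc_adjust: int = 0  # IA32_TSC_ADJUST, zero at reset

    def tick(self, cycles):
        # The clock advances the TSC only; TSC_ADJUST does not move.
        self.tsc += cycles

    def write_tsc(self, value):
        # Writing the TSC moves TSC_ADJUST by the same delta.
        self.tsc_adjust += value - self.tsc
        self.tsc = value

    def write_tsc_adjust(self, value):
        # Writing TSC_ADJUST moves the TSC by the same delta.
        self.tsc += value - self.tsc_adjust
        self.tsc_adjust = value


def measure_skews(cpus, reference):
    """Skew of each cpu against the reference, sampled at one instant."""
    return [cpu.tsc - reference.tsc for cpu in cpus]


def apply_offsets(cpus, skews, delays):
    """Each cpu corrects itself at a different time; no rendezvous needed."""
    for cpu, skew, delay in zip(cpus, skews, delays):
        for other in cpus:
            other.tick(delay)  # time passes on every cpu before this write
        cpu.write_tsc_adjust(cpu.tsc_adjust - skew)


cpus = [LogicalCpu(1000), LogicalCpu(1040), LogicalCpu(970)]
skews = measure_skews(cpus, cpus[0])
apply_offsets(cpus, skews, delays=[100, 200, 300])
print([cpu.tsc for cpu in cpus])
```

Because the offset is independent of the running clock, the three writes above land at different times and the counters still agree afterwards. Writing an absolute value with write_tsc() would instead require all writes to complete at nearly the same instant.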
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-28 at 06:40 -0500, Andrew Theurer wrote:

It will be interesting to see how this behaves with a very high lock activity in a guest. Once the scheduler defers preemption, is it for a fixed amount of time, or does it know to cut the deferral short as soon as the lock depth is reduced [by x]?

Since the locks live in a guest/userspace, we don't even know they're held at all, let alone when state changes. Also, afaik PLE simply exits the guest whenever you do a busy-wait, there's no guarantee it's due to a lock at all, we could be waiting for a 'virtual' hardware resource or whatnot.

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/28/2012 05:10 PM, Andrew Theurer wrote:
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:
On 09/27/2012 05:33 PM, Avi Kivity wrote:
On 09/27/2012 01:23 PM, Raghavendra K T wrote:

[...]

Also there may be a lot of false positives (deferred preemptions even when there is no contention).

It will be interesting to see how this behaves with a very high lock activity in a guest. Once the scheduler defers preemption, is it for a fixed amount of time, or does it know to cut the deferral short as soon as the lock depth is reduced [by x]?

The design/protocol that Vatsa had in mind was something like this:

- The scheduler does not defer preemption of a lock-holding vcpu forever; it may give the vcpu one chance, lasting only a few ticks. In addition to giving the chance, the scheduler also sets some indication that the vcpu has been given a chance.

- Once the vcpu releases (all) the lock(s), if it had been given a chance, it should clear that indication (ACK) and relinquish the cpu.
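
The two steps above can be sketched as a small state machine. This is only an illustration in Python under stated assumptions, not the scheduler or guest code: the names Vcpu, locks_held, chance_given, scheduler_wants_to_preempt and guest_release_lock are all hypothetical, and the sketch assumes the guest publishes its lock depth somewhere the host can read it, which the thread itself notes is not available today.

```python
from dataclasses import dataclass


@dataclass
class Vcpu:
    """Hypothetical state shared between host scheduler and guest vcpu."""
    locks_held: int = 0         # assumed to be published by the guest
    chance_given: bool = False  # set by the scheduler, cleared (ACKed) by the guest


def scheduler_wants_to_preempt(vcpu):
    """Host side: defer once for a lock holder, then preempt regardless."""
    if vcpu.locks_held > 0 and not vcpu.chance_given:
        vcpu.chance_given = True  # mark that the one chance has been used
        return "defer"            # lets the vcpu run for a few more ticks
    return "preempt"


def guest_release_lock(vcpu):
    """Guest side: drop one lock; return True if the vcpu should yield now."""
    vcpu.locks_held -= 1
    if vcpu.locks_held == 0 and vcpu.chance_given:
        vcpu.chance_given = False  # ACK the deferral
        return True                # relinquish the cpu voluntarily
    return False


vcpu = Vcpu(locks_held=1)
print(scheduler_wants_to_preempt(vcpu))  # first request while holding a lock
print(scheduler_wants_to_preempt(vcpu))  # second request, chance already used
print(guest_release_lock(vcpu))          # last lock released after a deferral
```

In this model the deferral is bounded twice over: the scheduler grants it at most once per ACK, and the guest gives the cpu back as soon as its lock depth reaches zero.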
Re: vga passthrough // questions about pci passthrough
well my first tests with the vga rom were useless because of apparmor rules i guess. now i placed the vga.rom in /usr/share/qemu ... well the error is gone now but no changes ;) so i added the bar parameter but it also made no difference :(

are you interested in the windows memory dump from the bsod?

another thing, after i ran some benchmarks after a fresh reboot on win7 i wanted to measure some values of the 7870 so i started gpu-z ( http://www.techpowerup.com/gpuz/ ) then almost immediately the vm froze. i found one log entry in one of the libvirt log files:

kvm: /build/buildd/qemu-kvm-1.2.0+noroms/exec.c:2255: register_subpage: Assertion `existing->mr->subpage || existing->mr == &io_mem_unassigned' failed.

maybe you know what this is about. thanks again for your patience and help ;)

On 28.09.2012 10:12, Jan Kiszka wrote:
On 2012-09-27 21:18, Alex Williamson wrote:
On Thu, 2012-09-27 at 20:43 +0200, Martin Wolf wrote:

thank you for the information. i will try what you mentioned... do you have some additional information about rebooting a VM with a passed through videocard? (amd / ati 7870)

I don't. Is the bsod on reboot only or does it also happen on shutdown? There's a slim chance it could be traced by enabling debug in the pci-assign driver and analyzing what the guest driver is trying to do. I'm hoping that q35 chipset support might resolve some issues with vga assignment as it exposes a topology that looks a bit more like one that a driver would expect on physical hardware. Thanks,

From our attempts to get more working than what NVIDIA Quadro cards support officially, my own experiments with q35 in this context and our discussions with NVIDIA, I'm pretty skeptical that this chipset will make a difference here. Most problems are due to those non-standard side channels to configure the hardware, memory mappings etc. And getting this working requires either cooperation of the vendor or *a lot* of reverse engineering.

Jan

--
Adiumentum GmbH
Gf. Martin Wolf
Banderbacherstraße 76
90513 Zirndorf
0911 / 9601470
mw...@adiumentum.com
Re: vga passthrough // questions about pci passthrough
On Fri, 2012-09-28 at 10:12 +0200, Jan Kiszka wrote:
On 2012-09-27 21:18, Alex Williamson wrote:
On Thu, 2012-09-27 at 20:43 +0200, Martin Wolf wrote:

thank you for the information. i will try what you mentioned... do you have some additional information about rebooting a VM with a passed through videocard? (amd / ati 7870)

I don't. Is the bsod on reboot only or does it also happen on shutdown? There's a slim chance it could be traced by enabling debug in the pci-assign driver and analyzing what the guest driver is trying to do. I'm hoping that q35 chipset support might resolve some issues with vga assignment as it exposes a topology that looks a bit more like one that a driver would expect on physical hardware. Thanks,

From our attempts to get more working than what NVIDIA Quadro cards support officially, my own experiments with q35 in this context and our discussions with NVIDIA, I'm pretty skeptical that this chipset will make a difference here. Most problems are due to those non-standard side channels to configure the hardware, memory mappings etc. And getting this working requires either cooperation of the vendor or *a lot* of reverse engineering.

I heard from an nvidia guy that the driver behaves differently depending on whether it finds an upstream express port, so we're probably causing ourselves more problems if it's trying to run in AGP mode.

There was also a lot of FUD in Xen (maybe justified) around how the BIOS determines the memory ranges and whether it bypasses the PCI BARs and gets them directly. That means some cards may require identity mapping to work. It seems like the very high-end cards are possibly fixing this, but they're far more expensive than I can justify.

Thanks, Alex
resize raw images
Hi,

I'm not very experienced in KVM. I installed two VMs in a raw image. I'm impressed by the speed of the VMs, that's nice :-). I have a lot of VMs running on VMWare Server 1.09, which is very old. I'd like to migrate them to KVM. I'd like to migrate them to raw images, because I'm able to mount a raw image from the host like a partition if the VM is having problems.

I also have to create some new VMs. What happens when disk space runs out? My idea is to create the new VMs in raw images. Inside the VM, filesystems will reside in logical volumes. When disk space is running out, I resize the raw image using:

- qemu-img create -f raw additional.raw <size>
- cat additional.raw >> vm.raw
- inside the VM, resize the filesystems easily with the lvm tools and resize2fs.

What do you think about this idea? Are there easier solutions?

Thanks in advance.

Bernd

--
Bernd Lentes
Systemadministration
Institut für Entwicklungsgenetik
Gebäude 35.34 - Raum 208
HelmholtzZentrum münchen
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/idg

We should not fear death, but the bad life
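
The first two steps of the plan above (create a zero-filled file, append it to the image) amount to extending the raw file with zeros. A minimal sketch of that single operation in Python, assuming the VM is shut down while the file is modified; the function name grow_raw_image and the example path and size are hypothetical. qemu-img also provides a resize subcommand that may cover the same step; check the documentation of the installed version before relying on it.

```python
import os


def grow_raw_image(path, extra_bytes):
    """Extend a raw disk image by extra_bytes of zeros; return the new size.

    Existing data is left untouched. On most filesystems the added range
    is sparse, so it takes no disk space until the guest writes to it.
    """
    if extra_bytes <= 0:
        raise ValueError("extra_bytes must be positive")
    new_size = os.path.getsize(path) + extra_bytes
    os.truncate(path, new_size)  # growing a file pads it with zero bytes
    return new_size


# Example call (hypothetical path and size): add 10 GiB to vm.raw.
# grow_raw_image("vm.raw", 10 * 1024**3)
```

The guest then sees a larger disk, and the remaining work is the third step: extend the partition or physical volume, the logical volume and the filesystem from inside the VM.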
Re: vga passthrough // questions about pci passthrough
On 2012-09-28 17:50, Alex Williamson wrote:
On Fri, 2012-09-28 at 10:12 +0200, Jan Kiszka wrote:
On 2012-09-27 21:18, Alex Williamson wrote:
On Thu, 2012-09-27 at 20:43 +0200, Martin Wolf wrote:

thank you for the information. i will try what you mentioned... do you have some additional information about rebooting a VM with a passed through videocard? (amd / ati 7870)

I don't. Is the bsod on reboot only or does it also happen on shutdown? There's a slim chance it could be traced by enabling debug in the pci-assign driver and analyzing what the guest driver is trying to do. I'm hoping that q35 chipset support might resolve some issues with vga assignment as it exposes a topology that looks a bit more like one that a driver would expect on physical hardware. Thanks,

From our attempts to get more working than what NVIDIA Quadro cards support officially, my own experiments with q35 in this context and our discussions with NVIDIA, I'm pretty skeptical that this chipset will make a difference here. Most problems are due to those non-standard side channels to configure the hardware, memory mappings etc. And getting this working requires either cooperation of the vendor or *a lot* of reverse engineering.

I heard from an nvidia guy that the driver behaves differently depending on whether it finds an upstream express port, so we're probably causing ourselves more problems if it's trying to run in AGP mode.

May be a point for the low- to mid-range cards. It does not apply to the virtualization-ready Quadro series according to our information back then.

There was also a lot of FUD in Xen (maybe justified) around how the BIOS determines the memory ranges and whether it bypasses the PCI BARs and gets them directly. That means some cards may require identity mapping to work. It seems like the very high-end cards are possibly fixing this, but they're far more expensive than I can justify. Thanks,

Yes, that is what makes them virtualization ready. But they also come with limitations. So far, you can't pass-through a primary card or use it for early boot messages of the guest as the BIOS is not ready for that - without identity mapping or even more.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
PLE:
- works for unmodified / non-Linux guests
- works for all types of spins (e.g. smp_call_function*())
- utilizes an existing hardware interface (PAUSE instruction) so likely more robust compared to a software interface

PV:
- has more information, so it can perform better

Should we also consider that we always have an edge here for non-PLE machines?

True. The deployment share for these is decreasing rapidly though. I hate optimizing for obsolete hardware.

Keep in mind that the patchset that Jeremy provided also cleans up (removes) parts of the pv spinlock code. It removes the various spin_lock, spin_unlock, etc. that touch paravirt code. Instead the pv code is only in the slowpath. And if you don't compile with CONFIG_PARAVIRT_SPINLOCKS the end code is the same as it is now.

On a different subject: I am curious whether the new Haswell locking instructions (the transactional ones?) can be put to use for the slow case?
Re: [libvirt] TSC scaling interface to management
On Tue, Sep 25, 2012 at 11:08:58AM +0100, Daniel P. Berrange wrote:
On Wed, Sep 12, 2012 at 12:39:39PM -0300, Marcelo Tosatti wrote:

HW TSC scaling is a feature of AMD processors that allows a multiplier to be specified to the TSC frequency exposed to the guest. KVM also contains provision to trap TSC (KVM: Infrastructure for software and hardware based TSC rate scaling cc578287e3224d0da) or advance TSC frequency.

This is useful when migrating to a host with different frequency and the guest is possibly using direct RDTSC instructions for purposes other than measuring cycles (that is, it previously calculated cycles-per-second, and uses that information which is stale after migration).

qemu-x86: Set tsc_khz in kvm when supported (e7429073ed1a76518) added support for tsc_khz= option in QEMU.

I am proposing the following changes so that management applications can work with this:

1) New option for tsc_khz, which is tsc_khz=host (QEMU command line option). Host means that QEMU is responsible for retrieving the TSC frequency of the host processor and using that. The management application does not have to deal with the burden.

FYI, libvirt already has support for expressing a number of different TSC related config options, for support of Xen and VMWare's capabilities in this area. What we currently allow for is

<timer name='tsc' frequency='NNN' mode='auto|native|emulate|smpsafe'/>

In this context the frequency attribute provides the HZ value to provide to the guest.

- auto == Emulate if TSC is unstable, else allow native TSC access
- native == Always allow native TSC access
- emulate == Always emulate TSC
- smpsafe == Always emulate TSC, and interlock SMP

These options can be mapped into KVM if necessary (they can map to tsc_khz=XXX or to the module options (unfortunately not per-guest ATM)).

Therefore it appears that this tsc_khz=auto option can be specified only if the user specifies so (it can be a per-guest flag hidden in the management configuration/manual).

Sending this email to gather suggestions (or objections) to this interface.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Karen had the suggestion to remove the burden of choice from the user, which we can achieve by knowing whether or not the guest is using a paravirtual clock. The problem is that this opens a can of races: did migration happen before or after the guest boot process enabled the paravirtual clock, etc.

I suppose leaving the option to the user is fine: if you run an obscure operating system which does not support a paravirtual clock, then it must be dealt with specially (it's in the manual, no big deal).
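
A small worked example of the stale cycles-per-second problem described at the top of this message, written in Python as plain arithmetic. The frequencies (a guest calibrated at 2 GHz, migrated to a 3 GHz host) and the function names are assumptions chosen for illustration; the multiplier is modelled as a simple guest_khz/host_khz ratio and does not reflect the fixed-point format the hardware or KVM actually uses.

```python
def guest_elapsed_seconds(tsc_delta, calibrated_khz):
    """Time the guest computes from a TSC delta and its boot-time calibration."""
    return tsc_delta / (calibrated_khz * 1000)


def scaled_tsc_delta(host_tsc_delta, host_khz, guest_khz):
    """TSC ticks the guest observes when the host TSC is scaled to guest_khz."""
    return host_tsc_delta * guest_khz // host_khz


GUEST_KHZ = 2_000_000  # frequency the guest calibrated against before migration
HOST_KHZ = 3_000_000   # frequency of the destination host

# One second of real time on the destination host.
host_delta = HOST_KHZ * 1000

# Without scaling the guest sees raw host ticks and overestimates elapsed time.
unscaled = guest_elapsed_seconds(host_delta, GUEST_KHZ)

# With scaling the guest sees ticks at the rate it was calibrated for.
scaled = guest_elapsed_seconds(
    scaled_tsc_delta(host_delta, HOST_KHZ, GUEST_KHZ), GUEST_KHZ)

print(unscaled, scaled)
```

With these numbers the unscaled guest measures 1.5 seconds for one real second, while the scaled guest measures 1.0. This is the situation a tsc_khz= value carried across migration is meant to handle.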