Re: [PATCH 0/3] Unmapped page cache control (v5)
On Fri, Apr 01, 2011 at 08:38:11AM +0530, Balbir Singh wrote:
> * Dave Chinner [2011-04-01 08:40:33]:
> > On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> > >
> > > The following series implements page cache control,
> > > this is a split out version of patch 1 of version 3 of the
> > > page cache optimization patches posted earlier at
> > > Previous posting http://lwn.net/Articles/425851/ and analysis
> > > at http://lwn.net/Articles/419713/
> > >
> > > Detailed Description
> > >
> > > This patch implements unmapped page cache control via preferred
> > > page cache reclaim. The current patch hooks into kswapd and reclaims
> > > page cache if the user has requested unmapped page control.
> > > This is useful in the following scenario
> > > - In a virtualized environment with cache=writethrough, we see
> > >   double caching - (one in the host and one in the guest). As
> > >   we try to scale guests, cache usage across the system grows.
> > >   The goal of this patch is to reclaim page cache when Linux is running
> > >   as a guest and get the host to hold the page cache and manage it.
> > >   There might be temporary duplication, but in the long run, memory
> > >   in the guests would be used for mapped pages.
> >
> > What does this do that "cache=none" for the VMs and using the page
> > cache inside the guest doesn't achieve? That avoids double caching
> > and doesn't require any new complexity inside the host OS to
> > achieve...
>
> There was a long discussion on cache=none in the first posting and the
> downsides/impact on throughput. Please see
> http://www.mail-archive.com/kvm@vger.kernel.org/msg30655.html

All there is in that thread is handwaving about the differences between cache=none and cache=writeback behaviour, and about the amount of data loss/corruption when failures occur. There is only one real example of real-world performance in the entire thread, and the root cause of the performance difference is not analysed, determined or understood. Hence this thread does not convince me that using cache=write* together with this functionality is anything other than papering over a still-unknown problem.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Unmapped page cache control (v5)
* Dave Chinner [2011-04-01 08:40:33]:

> On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> >
> > The following series implements page cache control,
> > this is a split out version of patch 1 of version 3 of the
> > page cache optimization patches posted earlier at
> > Previous posting http://lwn.net/Articles/425851/ and analysis
> > at http://lwn.net/Articles/419713/
> >
> > Detailed Description
> >
> > This patch implements unmapped page cache control via preferred
> > page cache reclaim. The current patch hooks into kswapd and reclaims
> > page cache if the user has requested unmapped page control.
> > This is useful in the following scenario
> > - In a virtualized environment with cache=writethrough, we see
> >   double caching - (one in the host and one in the guest). As
> >   we try to scale guests, cache usage across the system grows.
> >   The goal of this patch is to reclaim page cache when Linux is running
> >   as a guest and get the host to hold the page cache and manage it.
> >   There might be temporary duplication, but in the long run, memory
> >   in the guests would be used for mapped pages.
>
> What does this do that "cache=none" for the VMs and using the page
> cache inside the guest doesn't achieve? That avoids double caching
> and doesn't require any new complexity inside the host OS to
> achieve...

There was a long discussion on cache=none in the first posting and the downsides/impact on throughput. Please see
http://www.mail-archive.com/kvm@vger.kernel.org/msg30655.html

--
Three Cheers,
Balbir
Re: [PATCH 0/3] Unmapped page cache control (v5)
> On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> >
> > The following series implements page cache control,
> > this is a split out version of patch 1 of version 3 of the
> > page cache optimization patches posted earlier at
> > Previous posting http://lwn.net/Articles/425851/ and analysis
> > at http://lwn.net/Articles/419713/
> >
> > Detailed Description
> >
> > This patch implements unmapped page cache control via preferred
> > page cache reclaim. The current patch hooks into kswapd and reclaims
> > page cache if the user has requested unmapped page control.
> > This is useful in the following scenario
> > - In a virtualized environment with cache=writethrough, we see
> >   double caching - (one in the host and one in the guest). As
> >   we try to scale guests, cache usage across the system grows.
> >   The goal of this patch is to reclaim page cache when Linux is running
> >   as a guest and get the host to hold the page cache and manage it.
> >   There might be temporary duplication, but in the long run, memory
> >   in the guests would be used for mapped pages.
>
> What does this do that "cache=none" for the VMs and using the page
> cache inside the guest doesn't achieve? That avoids double caching
> and doesn't require any new complexity inside the host OS to
> achieve...

Right. "cache=none" has no double-caching issue, and KSM already solved cross-guest cache sharing. So, I _guess_ this is not a core motivation of his patch. But I'm not him. I'm not sure.
Re: [PATCH 0/3] Unmapped page cache control (v5)
On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
>
> The following series implements page cache control,
> this is a split out version of patch 1 of version 3 of the
> page cache optimization patches posted earlier at
> Previous posting http://lwn.net/Articles/425851/ and analysis
> at http://lwn.net/Articles/419713/
>
> Detailed Description
>
> This patch implements unmapped page cache control via preferred
> page cache reclaim. The current patch hooks into kswapd and reclaims
> page cache if the user has requested unmapped page control.
> This is useful in the following scenario
> - In a virtualized environment with cache=writethrough, we see
>   double caching - (one in the host and one in the guest). As
>   we try to scale guests, cache usage across the system grows.
>   The goal of this patch is to reclaim page cache when Linux is running
>   as a guest and get the host to hold the page cache and manage it.
>   There might be temporary duplication, but in the long run, memory
>   in the guests would be used for mapped pages.

What does this do that "cache=none" for the VMs and using the page cache inside the guest doesn't achieve? That avoids double caching and doesn't require any new complexity inside the host OS to achieve...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Guest virtual CPUs limited to only 50% of host CPU
Hi,

I am using Qemu-KVM-0.12.5 on Intel Xeon (VT-x enabled) processors and monitoring the system using htop on the host. On the processors that are running Qemu-KVM I am seeing a 50/50 split between userspace and guest ("gu:" in htop). I have pinned the vCPU qemu-kvm threads to specific host CPUs using taskset. In the guest, the CPUs are nearly 100% userspace in htop.

Does anyone have ideas on why this is? Is there a way I can get much higher utilization for the guest virtual CPUs with respect to the host?

Thanks in advance!
Re: [PATCH 0/3] Unmapped page cache control (v5)
On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:

> 1) zone reclaim doesn't work if the system has multiple nodes and the
>    workload is file cache oriented (eg file server, web server, mail
>    server, et al), because zone reclaim makes somewhat more free pages
>    than zone->pages_min, then new page cache requests consume the
>    nearest node's memory, and that brings on the next zone reclaim.
>    So memory utilization is reduced and unnecessary LRU discard is
>    increased dramatically.

That is only true if the webserver only allocates from a single node. If the allocation load is balanced then it will be fine. It is useful to reclaim pages from the node where we allocate memory since that keeps the dataset node local.

>    SGI folks added a CPUSET specific solution in the past
>    (cpuset.memory_spread_page), but global reclaim still has this
>    issue. zone reclaim is an HPC-workload-specific feature and HPC
>    folks have no motivation not to use CPUSETs.

The spreading can also be done via memory policies. But that is only required if the application has unbalanced allocation behavior.

> 2) Before 2.6.27, the VM had only one LRU and calc_reclaim_mapped() was
>    used to decide whether to filter out mapped pages. It made a lot of
>    problems for DB servers and large application servers, because if
>    the system has a lot of mapped pages, 1) the LRU was churned and the
>    reclaim algorithm became a lotree one, and 2) reclaim latency became
>    terribly slow, so hang-up detectors misdetected the state and
>    started to force a reboot. That was a big problem for RHEL5-based
>    banking systems.
>    So sc->may_unmap should be killed in the future. Don't increase its
>    users.

Because a bank could not configure its system properly we need to get rid of may_unmap? Maybe raise min_unmapped_ratio instead and take care that either the allocation load is balanced or a round-robin scheme is used by the app?

> And, this patch introduces new allocator fast path overhead. I haven't
> seen any justification for it.

We could do the triggering differently.

> In other words, you have to kill the following three to get an ack:
> 1) zone reclaim oriented reclaim, 2) filter based LRU scanning (eg
> sc->may_unmap), 3) fastpath overhead. In other words, if you want a
> feature for VM guests, any hardcoded machine configuration assumption
> and/or workload assumption is wrong.

It would be good if you could come up with a new reclaim scheme that avoids the need for zone reclaim and still allows one to take advantage of memory distances. I agree that the current scheme sometimes requires tuning too many esoteric knobs to get useful behavior.

> But, I agree that now we have to consider a slightly large VM change,
> perhaps (or perhaps not). OK, it's a good opportunity to fill some
> things out.
> Historically, Linux MM has had a "free memory is wasted memory" policy,
> and it has worked completely fine. But now we have a few exceptions.
>
> 1) RT, embedded and finance systems. They really hope to avoid reclaim
>    latency (ie avoid foreground reclaim completely) and they can accept
>    making somewhat more free pages before memory shortage.

In general we need a mechanism to ensure we can avoid reclaim during critical sections of an application. So some way to give hints to the machine to free up lots of memory (/proc/sys/vm/drop_caches is far too drastic) may be useful.

> And, now we have four proposals for utilization related issues:
>
> 1) cleancache (from Oracle)
> 2) VirtFS (from IBM)
> 3) kstaled (from Google)
> 4) unmapped page reclaim (from you)
>
> Probably we can't merge all of them, and we need to consolidate some
> requirements and implementations.

Well, all these approaches show that we have major issues with reclaim and large memory. Things get overly complicated. Time for a new approach that integrates all the goals that these try to accomplish?

> Personally I think cleancache or some other multi-level page cache
> framework looks promising, but another solution is also acceptable.
> Anyway, I hope everyone can step back to a 1000-foot bird's eye view at
> once and sort out all the requirements with all the related people.

Would be good if you could tackle that problem.
[ANNOUNCE] Native Linux KVM tool
Hi all,

We're proud to announce the native Linux KVM tool!

The goal of this tool is to provide a clean, from-scratch, lightweight KVM host tool implementation that can boot Linux guest images (just a hobby, won't be big and professional like QEMU) with no BIOS dependencies and with only the minimal amount of legacy device emulation.

Note that this is a development prototype for the time being: there's no networking support and no graphics support, amongst other missing essentials.

It's great as a learning tool if you want to get your feet wet in virtualization land: it's only 5 KLOC of clean C code that can already boot a guest Linux image.

Right now it can boot a Linux image and provide you output via a serial console, over the host terminal, i.e. you can use it to boot a guest Linux image in a terminal or over ssh and log into the guest without much guest or host side setup work needed.

1. To try out the tool, clone the git repository:

     git clone git://github.com/penberg/linux-kvm.git

   or alternatively, if you already have a kernel source tree:

     git checkout -b kvm/tool
     git pull git://github.com/penberg/linux-kvm.git

2. Compile the tool:

     cd tools/kvm && make

3. Download a raw userspace image:

     wget http://wiki.qemu.org/download/linux-0.2.img.bz2 && bunzip2 linux-0.2.img.bz2

4. Build a kernel with the CONFIG_VIRTIO_BLK=y and CONFIG_SERIAL_8250_CONSOLE=y configuration options. Note: also make sure you have CONFIG_EXT2_FS or CONFIG_EXT4_FS if you use the above image.

5. And finally, launch the hypervisor:

     ./kvm --image=linux-0.2.img --kernel=../../arch/x86/boot/bzImage

The tool has been written by Pekka Enberg, Cyrill Gorcunov, and Asias He. Special thanks to Avi Kivity for his help on KVM internals and Ingo Molnar for all-around support and encouragement!

See the following thread for the original discussion and motivation for this project:

  http://thread.gmane.org/gmane.linux.kernel/962051/focus=962620

			Pekka
Re: KVM, iSCSI and High Availability
On Thursday 31 March 2011, you wrote:
> That's what CLVM is for, it propagates the volume changes to every
> member of the 'cluster'.

Oh, right. I didn't know about CLVM until now. It sounds very promising though, certainly better than working with the proprietary API of whoever your SAN vendor is to create a new LUN for every VM.

Also, the machine we have here, a Dell PowerVault, appears to be limited to at most 255 LUNs. I don't know if that's a limitation of iSCSI or just a problem of this particular array.

Guido
Re: [PATCH]arch:x86:kvm:i8254.h Fix typo in kvm_pit
On 03/31/2011 02:20 AM, Avi Kivity wrote:
> On 03/30/2011 07:42 PM, Justin P. Mattock wrote:
>> On 03/30/2011 10:17 AM, Avi Kivity wrote:
>>> On 03/30/2011 06:30 PM, Justin P. Mattock wrote:
>>>> On 03/30/2011 09:26 AM, Avi Kivity wrote:
>>>>> On 03/30/2011 06:19 PM, Justin P. Mattock wrote:
>>>>>> The below patch changes base_addresss to base_address.
>>>>>>
>>>>>> Note: I have grepped for base_addresss and nothing shows up;
>>>>>> grepping for base_address gets me lots of output, telling me that
>>>>>> this is a typo, but I could be wrong.
>>>>>>
>>>>>> Signed-off-by: Justin P. Mattock
>>>>>> ---
>>>>>>  arch/x86/kvm/i8254.h |    2 +-
>>>>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
>>>>>> index 46d08ca..c2fa48b 100644
>>>>>> --- a/arch/x86/kvm/i8254.h
>>>>>> +++ b/arch/x86/kvm/i8254.h
>>>>>> @@ -33,7 +33,7 @@ struct kvm_kpit_state {
>>>>>>  };
>>>>>>
>>>>>>  struct kvm_pit {
>>>>>> -	unsigned long base_addresss;
>>>>>> +	unsigned long base_address;
>>>>>>  	struct kvm_io_device dev;
>>>>>>  	struct kvm_io_device speaker_dev;
>>>>>>  	struct kvm *kvm;
>>>>>
>>>>> Why not remove the variable completely?
>>>>
>>>> didn't even think to completely remove the variable (figured it was
>>>> used somewhere). I will look at that and resend with removal of the
>>>> variable for you..
>>>
>>> Well if it was used, you ought to have changed all of the users, no?
>>
>> at the moment I see: (keep in mind my reading skills only go so far!)
>>
>> grep -Re base_address kvm/* -n
>> kvm/ioapic.c:276:	return ((addr >= ioapic->base_address &&
>> kvm/ioapic.c:277:		 (addr < ioapic->base_address + IOAPIC_MEM_LENGTH)));
>> kvm/ioapic.c:371:	ioapic->base_address = IOAPIC_DEFAULT_BASE_ADDRESS;
>> kvm/ioapic.h:38:	u64 base_address;
>>
>> so changing base_addresss; to base_address; gets kvm_ioapic_reset to
>> function correctly as well as ioapic_in_range? (but could be wrong)
>
> Can you explain how kvm_ioapic_reset() would be affected by the change?
>
> Really, you need to understand what you're doing before sending
> patches.

well looking at the code:

virt/kvm/ioapic.c @@276
static inline int ioapic_in_range(struct kvm_ioapic *ioapic, gpa_t addr)
{
	return ((addr >= ioapic->base_address &&
		 (addr < ioapic->base_address + IOAPIC_MEM_LENGTH)));
}

I see base_address in there, but looking more at the code it's for something completely different..

Justin P. Mattock
Re: KVM, iSCSI and High Availability
That's what CLVM is for, it propagates the volume changes to every member of the 'cluster'.

David Martin

- Original Message -
> On Monday 28 March 2011, David Martin wrote:
> > - Original Message -
> > > On 3/28/11 2:46 PM, Avi Kivity wrote:
> > > > On 03/25/2011 10:26 PM, Marcin M. Jessa wrote:
> > > [...]
> > > > One LUN per image allows you to implement failover, LVM doesn't
> > > > (but cluster-LVM does). I recommend using one LUN per image; it's
> > > > much simpler.
> > >
> > > Some people say "Use one LUN, it's easier and use CLVM". Why is it
> > > easier to use CLVM and one LUN per virtual guest?
> >
> > I find it easier because i can do:
> > lvcreate -n vm1 --size 40G iscsi_vg
> > then virt-install or whatever
> > If I were using 1 LUN per VM then I would have to provision the LUN,
> > make ALL hosts aware of the LUN, and finally screw with the multipath
> > configs etc.
>
> Don't you have basically the same problem when using LVM in one LUN?
> You still have to make all the hosts aware of the new LV manually. I
> don't even know if LVM supports this; it wasn't exactly designed for a
> situation where multiple hosts might simultaneously read and write to
> a volume group, let alone create and destroy logical volumes while the
> VG is in use by any number of other hosts...
>
> Guido
[PATCH 2/2 v3] kvm/x86: move and fix substitute search for missing CPUID entries
If KVM cannot find an exact match for a requested CPUID leaf, the code will try to find the closest match instead of simply confessing its failure. The implementation was meant to satisfy the CPUID specification, but did not properly check for extended and standard leaves and also didn't account for the index subleaf. Besides that, this rule only applies to CPUID intercepts, which is not the only user of the kvm_find_cpuid_entry() function.

So fix this algorithm and call it from kvm_emulate_cpuid().

This fixes a crash of newer Linux kernels as KVM guests on AMD Bulldozer CPUs, where bogus values were returned in response to a CPUID intercept.

CC: [2.6.38]
Signed-off-by: Andre Przywara
---
 arch/x86/kvm/x86.c |   31 +++++++++++++++++++++++++------
 1 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e86cec..a38fb9b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4959,12 +4959,6 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
 			best = e;
 			break;
 		}
-		/*
-		 * Both basic or both extended?
-		 */
-		if (((e->function ^ function) & 0x80000000) == 0)
-			if (!best || e->function > best->function)
-				best = e;
 	}
 	return best;
 }
@@ -4984,6 +4978,27 @@ not_found:
 	return 36;
 }
 
+/*
+ * If no match is found, check whether we exceed the vCPU's limit
+ * and return the content of the highest valid _standard_ leaf instead.
+ * This is to satisfy the CPUID specification.
+ */
+static struct kvm_cpuid_entry2* check_cpuid_limit(struct kvm_vcpu *vcpu,
+						  u32 function, u32 index)
+{
+	struct kvm_cpuid_entry2 *maxlevel;
+
+	maxlevel = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);
+	if (!maxlevel || maxlevel->eax >= function)
+		return NULL;
+	if (function & 0x80000000) {
+		maxlevel = kvm_find_cpuid_entry(vcpu, 0, 0);
+		if (!maxlevel)
+			return NULL;
+	}
+	return kvm_find_cpuid_entry(vcpu, maxlevel->eax, index);
+}
+
 void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
 {
 	u32 function, index;
@@ -4996,6 +5011,10 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
 	kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
 	kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
 	best = kvm_find_cpuid_entry(vcpu, function, index);
+
+	if (!best)
+		best = check_cpuid_limit(vcpu, function, index);
+
 	if (best) {
 		kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
 		kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);
--
1.6.4
Re: [PATCH 2/2] kvm/x86: move and fix substitute search for missing CPUID entries
Avi Kivity wrote:
> On 03/31/2011 03:13 PM, Andre Przywara wrote:
>> If KVM cannot find an exact match for a requested CPUID leaf, the code
>> will try to find the closest match instead of simply confessing its
>> failure. The implementation was meant to satisfy the CPUID
>> specification, but did not properly check for extended and standard
>> leaves and also didn't account for the index subleaf. Besides that,
>> this rule only applies to CPUID intercepts, which is not the only user
>> of the kvm_find_cpuid_entry() function.
>> So fix this algorithm and move it into kvm_emulate_cpuid().
>> This fixes a crash of newer Linux kernels as KVM guests on AMD
>> Bulldozer CPUs, where bogus values were returned in response to a
>> CPUID intercept.
>>
>> @@ -4996,6 +4990,19 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
>>  	kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
>>  	kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
>>  	best = kvm_find_cpuid_entry(vcpu, function, index);
>> +
>> +	/* if no match is found, check whether we exceed the vCPU's limit
>> +	 * and return the content of the highest valid standard leaf instead.
>> +	 * This is to satisfy the CPUID specification.
>> +	 */
>> +	if (!best) {
>> +		best = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);
>
> "highest valid standard leaf" means the second argument should be zero,
> no?

Weird, but somehow true. I fixed this in another version (following). Thanks for spotting this.

Andre.

>> +		if (best && best->eax < function)
>> +			best = kvm_find_cpuid_entry(vcpu, best->eax, index);
>> +		else
>> +			best = NULL;
>> +	}
>> +
>>  	if (best) {
>>  		kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
>>  		kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Re: KVM, iSCSI and High Availability
On Monday 28 March 2011, David Martin wrote:
> - Original Message -
> > On 3/28/11 2:46 PM, Avi Kivity wrote:
> > > On 03/25/2011 10:26 PM, Marcin M. Jessa wrote:
> > [...]
> > > One LUN per image allows you to implement failover, LVM doesn't
> > > (but cluster-LVM does). I recommend using one LUN per image; it's
> > > much simpler.
> >
> > Some people say "Use one LUN, it's easier and use CLVM". Why is it
> > easier to use CLVM and one LUN per virtual guest?
>
> I find it easier because i can do:
> lvcreate -n vm1 --size 40G iscsi_vg
> then virt-install or whatever
> If I were using 1 LUN per VM then I would have to provision the LUN,
> make ALL hosts aware of the LUN, and finally screw with the multipath
> configs etc.

Don't you have basically the same problem when using LVM in one LUN? You still have to make all the hosts aware of the new LV manually. I don't even know if LVM supports this; it wasn't exactly designed for a situation where multiple hosts might simultaneously read and write to a volume group, let alone create and destroy logical volumes while the VG is in use by any number of other hosts...

Guido
Re: [PATCH 2/2] kvm/x86: move and fix substitute search for missing CPUID entries
On 03/31/2011 03:13 PM, Andre Przywara wrote:
> If KVM cannot find an exact match for a requested CPUID leaf, the code
> will try to find the closest match instead of simply confessing its
> failure. The implementation was meant to satisfy the CPUID
> specification, but did not properly check for extended and standard
> leaves and also didn't account for the index subleaf. Besides that,
> this rule only applies to CPUID intercepts, which is not the only user
> of the kvm_find_cpuid_entry() function.
> So fix this algorithm and move it into kvm_emulate_cpuid().
> This fixes a crash of newer Linux kernels as KVM guests on AMD
> Bulldozer CPUs, where bogus values were returned in response to a
> CPUID intercept.
>
> @@ -4996,6 +4990,19 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
>  	kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
>  	kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
>  	best = kvm_find_cpuid_entry(vcpu, function, index);
> +
> +	/* if no match is found, check whether we exceed the vCPU's limit
> +	 * and return the content of the highest valid standard leaf instead.
> +	 * This is to satisfy the CPUID specification.
> +	 */
> +	if (!best) {
> +		best = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);

"highest valid standard leaf" means the second argument should be zero, no?

> +		if (best && best->eax < function)
> +			best = kvm_find_cpuid_entry(vcpu, best->eax, index);
> +		else
> +			best = NULL;
> +	}
> +
>  	if (best) {
>  		kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
>  		kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);

--
error compiling committee.c: too many arguments to function
[PATCH 2/2] kvm/x86: move and fix substitute search for missing CPUID entries
If KVM cannot find an exact match for a requested CPUID leaf, the code will try to find the closest match instead of simply confessing its failure. The implementation was meant to satisfy the CPUID specification, but did not properly check for extended and standard leaves and also didn't account for the index subleaf. Besides that, this rule only applies to CPUID intercepts, which is not the only user of the kvm_find_cpuid_entry() function.

So fix this algorithm and move it into kvm_emulate_cpuid().

This fixes a crash of newer Linux kernels as KVM guests on AMD Bulldozer CPUs, where bogus values were returned in response to a CPUID intercept.

CC: [2.6.38]
Signed-off-by: Andre Przywara
---
 arch/x86/kvm/x86.c |   19 +++++++++++++------
 1 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e86cec..552b8f8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4959,12 +4959,6 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
 			best = e;
 			break;
 		}
-		/*
-		 * Both basic or both extended?
-		 */
-		if (((e->function ^ function) & 0x80000000) == 0)
-			if (!best || e->function > best->function)
-				best = e;
 	}
 	return best;
 }
@@ -4996,6 +4990,19 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
 	kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
 	kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
 	best = kvm_find_cpuid_entry(vcpu, function, index);
+
+	/* if no match is found, check whether we exceed the vCPU's limit
+	 * and return the content of the highest valid standard leaf instead.
+	 * This is to satisfy the CPUID specification.
+	 */
+	if (!best) {
+		best = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);
+		if (best && best->eax < function)
+			best = kvm_find_cpuid_entry(vcpu, best->eax, index);
+		else
+			best = NULL;
+	}
+
 	if (best) {
 		kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
 		kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);
--
1.6.4
Re: [PATCH 2/2] kvm/x86: remove unneeded substitute search for missing CPUID entries
On 03/31/2011 12:12 PM, Andre Przywara wrote:
> Avi Kivity wrote:
>> On 03/30/2011 03:01 PM, Andre Przywara wrote:
>>> If KVM cannot find an exact match for a requested CPUID leaf, the
>>> code will try to find the closest match instead of simply confessing
>>> its failure.
>>> The heuristic is on one hand wrong nowadays, since it does not take
>>> the KVM CPUID leaves (0x400000xx) into account. On the other hand the
>>> callers of this function can all deal with the no-match situation. So
>>> let's remove this code, as it serves no purpose.
>>> This fixes a crash of newer Linux kernels as KVM guests on AMD
>>> Bulldozer CPUs, where bogus values were returned in response to a
>>> CPUID intercept.
>>>
>>> @@ -4959,12 +4959,6 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
>>>  			best = e;
>>>  			break;
>>>  		}
>>> -		/*
>>> -		 * Both basic or both extended?
>>> -		 */
>>> -		if (((e->function ^ function) & 0x80000000) == 0)
>>> -			if (!best || e->function > best->function)
>>> -				best = e;
>>>  	}
>>>  	return best;
>>>  }
>>
>> This behaviour is mandated by the spec (looking at the Intel one),
>> though it is implemented incorrectly - it should always return the
>> largest basic leaf, and ignore the kvm leaves.
>
> But the spec says that this applies only if EAX is higher than the
> largest supported leaf. The code as is checks whether KVM has an entry
> in the cpuid "cache" for it, which is not the same. Especially this
> case that hit me was a missing index entry, which should return 0.

Ah, I see.

> The check for too large leaf numbers should be moved into
> kvm_emulate_cpuid(). There is already some code in QEMU
> (cpu_x86_cpuid) to handle this, but that path does not apply to KVM.
> I will make a new version of this patch which replaces the old check
> with a sane version in kvm_emulate_cpuid(). Thanks for pointing this
> out.
>
>> I think the correct behaviour is:
>>
>>     if (e->function < 0x40000000
>>         && (!best || e->function > best->function))
>>             best = e;
>>
>> We probably need a find_exact_cpuid_entry() that returns NULL if it
>> doesn't find a match, for internal use.
>
> As mentioned, this behavior only applies to the actual intercept case,
> not to all users of kvm_find_cpuid_entry(). So I'd like to make this
> check in the intercept code path and not in this function.

Right.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions
On Thu, Mar 31, 2011 at 06:03:37AM -0400, Avi Kivity wrote: > On 03/31/2011 11:42 AM, Roedel, Joerg wrote: > > On Thu, Mar 31, 2011 at 05:18:28AM -0400, Avi Kivity wrote: > > > On 03/31/2011 09:14 AM, Roedel, Joerg wrote: > > > > On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote: > > > > > The spec indicates we need to check the TSS and IOPL based > > > permissions > > > > > before the intercept (vmx agrees). With the code as is, it > > > happens > > > > > afterwards. > > > > > > > > > > One way to do this is to have an ExtraChecks bit in the > > > opcode::flags. > > > > > Then opcode::u.xcheck->perms() is the pre-intercept check and > > > > > opcode::u.xcheck->execute() is the post-intercept execution. > > > Should > > > > > work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us. > > > > > > > > Okay, as you suggested, I put these checks into the instruction > > > emulator > > > > and let the hard work of implementing per-arch checks to the > > > nested-vmx > > > > people ;) > > > > I doubt that this makes the opcode-tables more readable, but lets see > > > :) > > > > > > I think we're miscommunicating. I'm talking about x86 checks, not virt > > > vendor specific checks. > > > > The place of the intercept check may be vendor specific. I havn't looked > > at the Intel spec, though. But there are probably differences. > > That's why there are three hooks: pre-ex, post-ex, post-mem. If an > intercept fits in between, use the pre-ex hook and duplicate the checks > in the intercept. > > As far as I recall, everything should fit into those three, though. Okay, thats the way to go then, thanks. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 
43632
Re: [PATCH 2/2] kvm/x86: remove unneeded substitute search for missing CPUID entries
Avi Kivity wrote: On 03/30/2011 03:01 PM, Andre Przywara wrote:

If KVM cannot find an exact match for a requested CPUID leaf, the code will try to find the closest match instead of simply confessing its failure. The heuristic is on one hand wrong nowadays, since it does not take the KVM CPUID leaves (0x40xx) into account. On the other hand the callers of this function can all deal with the no-match situation. So let's remove this code, as it serves no purpose. This fixes a crash of newer Linux kernels as KVM guests on AMD Bulldozer CPUs, where bogus values were returned in response to a CPUID intercept.

@@ -4959,12 +4959,6 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
 			best = e;
 			break;
 		}
-		/*
-		 * Both basic or both extended?
-		 */
-		if (((e->function ^ function) & 0x80000000) == 0)
-			if (!best || e->function > best->function)
-				best = e;
 	}
 	return best;
 }

This behaviour is mandated by the spec (looking at the Intel one), though it is implemented incorrectly - it should always return the largest basic leaf, and ignore the kvm leaves.

But the spec says that this applies only if EAX is higher than the largest supported leaf. The code as is checks whether KVM has an entry in the cpuid "cache" for it, which is not the same. Especially the case that hit me was a missing index entry, which should return 0. The check for too large leaf numbers should be moved into kvm_emulate_cpuid(). There is already some code in QEMU (cpu_x86_cpuid) to handle this, but that path does not apply to KVM. I will make a new version of this patch which replaces the old check with a sane version in kvm_emulate_cpuid(). Thanks for pointing this out.

I think the correct behaviour is:

	if (e->function < 1 && (!best || e->function > best->function))
		best = e;

We probably need a find_exact_cpuid_entry() that returns NULL if it doesn't find a match, for internal use.

As mentioned, this behavior only applies to the actual intercept case, not to all users of kvm_find_cpuid_entry().
So I'd like to make this check in the intercept code path and not in this function.

Regards, Andre.

-- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany
[PATCHv2] KVM: emulator: do not needlessly sync registers from emulator ctxt to vcpu
Currently we sync registers back and forth before/after exiting to userspace for IO, but during IO the device model shouldn't need to read/write the registers, so we may as well skip those sync points. The only exception is the broken vmware backdoor interface. The new code syncs register content during IO only if registers are read from/written to by userspace in the middle of the IO operation, and this almost never happens in practice.

Signed-off-by: Gleb Natapov

---
v1->v2 - rebased onto see emulation patchset.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index e820c63..f68bacc 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -370,6 +370,8 @@ struct kvm_vcpu_arch { /* emulate context */ struct x86_emulate_ctxt emulate_ctxt; + bool emulate_regs_need_sync_to_vcpu; + bool emulate_regs_need_sync_from_vcpu; gpa_t time; struct pvclock_vcpu_time_info hv_clock; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9686547..eecad1b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4378,6 +4378,7 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu) ?
X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16; memset(c, 0, sizeof(struct decode_cache)); memcpy(c->regs, vcpu->arch.regs, sizeof c->regs); + vcpu->arch.emulate_regs_need_sync_from_vcpu = false; } int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq) @@ -4460,6 +4461,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, { int r; struct decode_cache *c = &vcpu->arch.emulate_ctxt.decode; + bool writeback = true; kvm_clear_exception_queue(vcpu); vcpu->arch.mmio_fault_cr2 = cr2; @@ -4500,9 +4502,12 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, return EMULATE_DONE; } - /* this is needed for vmware backdor interface to work since it + /* this is needed for vmware backdoor interface to work since it changes registers values during IO operation */ - memcpy(c->regs, vcpu->arch.regs, sizeof c->regs); + if (vcpu->arch.emulate_regs_need_sync_from_vcpu) { + vcpu->arch.emulate_regs_need_sync_from_vcpu = false; + memcpy(c->regs, vcpu->arch.regs, sizeof c->regs); + } restart: r = x86_emulate_insn(&vcpu->arch.emulate_ctxt); @@ -4520,19 +4525,28 @@ restart: } else if (vcpu->arch.pio.count) { if (!vcpu->arch.pio.in) vcpu->arch.pio.count = 0; + else + writeback = false; r = EMULATE_DO_MMIO; - } else if (vcpu->mmio_needed) + } else if (vcpu->mmio_needed) { + if (!vcpu->mmio_is_write) + writeback = false; r = EMULATE_DO_MMIO; - else if (r == EMULATION_RESTART) + } else if (r == EMULATION_RESTART) goto restart; else r = EMULATE_DONE; - toggle_interruptibility(vcpu, vcpu->arch.emulate_ctxt.interruptibility); - kvm_set_rflags(vcpu, vcpu->arch.emulate_ctxt.eflags); - kvm_make_request(KVM_REQ_EVENT, vcpu); - memcpy(vcpu->arch.regs, c->regs, sizeof c->regs); - kvm_rip_write(vcpu, vcpu->arch.emulate_ctxt.eip); + if (writeback) { + toggle_interruptibility(vcpu, + vcpu->arch.emulate_ctxt.interruptibility); + kvm_set_rflags(vcpu, vcpu->arch.emulate_ctxt.eflags); + kvm_make_request(KVM_REQ_EVENT, vcpu); + memcpy(vcpu->arch.regs, c->regs, sizeof c->regs); + 
vcpu->arch.emulate_regs_need_sync_to_vcpu = false; + kvm_rip_write(vcpu, vcpu->arch.emulate_ctxt.eip); + } else + vcpu->arch.emulate_regs_need_sync_to_vcpu = true; return r; } @@ -5516,6 +5530,18 @@ out: int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) { + if (vcpu->arch.emulate_regs_need_sync_to_vcpu) { + /* +* We are here if userspace calls get_regs() in the middle of +* instruction emulation. Registers state needs to be copied +* back from emulation context to vcpu. Userspace shouldn't do +* that usually, but some badly designed PV devices (vmware +* backdoor interface) need this to work +*/ + struct decode_cache *c = &vcpu->arch.emulate_ctxt.decode; + memcpy(vcpu->arch.regs, c->regs, sizeof c->regs); + vcpu->arch.emulate_regs_need_sync_to_vcpu = false; + } regs->rax = kvm_register_read(vcpu, VCPU_REGS_RAX); regs->rbx = kvm_register_read(vcpu, VCPU_REGS_RBX); regs->rcx = kvm_register_read(vcpu, VCPU_REGS_RCX); @@ -5543,6 +5569,9 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct
Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions
On 03/31/2011 11:42 AM, Roedel, Joerg wrote: On Thu, Mar 31, 2011 at 05:18:28AM -0400, Avi Kivity wrote: > On 03/31/2011 09:14 AM, Roedel, Joerg wrote: > > On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote: > > > The spec indicates we need to check the TSS and IOPL based permissions > > > before the intercept (vmx agrees). With the code as is, it happens > > > afterwards. > > > One way to do this is to have an ExtraChecks bit in the opcode::flags. > > > Then opcode::u.xcheck->perms() is the pre-intercept check and > > > opcode::u.xcheck->execute() is the post-intercept execution. Should > > > work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us. > > Okay, as you suggested, I put these checks into the instruction emulator > > and let the hard work of implementing per-arch checks to the nested-vmx > > people ;) > > I doubt that this makes the opcode-tables more readable, but let's see :) > I think we're miscommunicating. I'm talking about x86 checks, not virt > vendor specific checks. The place of the intercept check may be vendor specific. I haven't looked at the Intel spec, though. But there are probably differences. That's why there are three hooks: pre-ex, post-ex, post-mem. If an intercept fits in between, use the pre-ex hook and duplicate the checks in the intercept. As far as I recall, everything should fit into those three, though. > For example, the flow for IOIO would be: > > #UD check (lock prefix) > PE/IOPL/CPL/VM check > TSS bitmap check (can cause #PF) > Intercept check > Operand segment check > Possible #PF > Execution > > We need to make sure the TSS bitmap check happens before the intercept, > so we need to split ->execute() into two. Right. For the generic case, how about factoring out the checks (for the POST_EX intercept case) into a separate excp_check-callback (similar to the execute-callback) and executing it before the post-exception-intercept check? That is exactly my suggestion.
Re: KVM: x86: better fix for race between nmi injection and enabling nmi window
On Thu, Mar 31, 2011 at 11:25:46AM +0200, Avi Kivity wrote: > On 03/31/2011 11:24 AM, Gleb Natapov wrote: > >On Thu, Mar 31, 2011 at 11:23:28AM +0200, Avi Kivity wrote: > >> On 03/30/2011 08:47 PM, Gleb Natapov wrote: > >> >On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote: > >> >> On 03/30/2011 06:30 PM, Marcelo Tosatti wrote: > >> >> >Based on Gleb's idea, fix race between nmi injection and enabling > >> >> >nmi window in a simpler way. > >> >> > > >> >> >Signed-off-by: Marcelo Tosatti > >> >> > > >> >> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > >> >> >index a6a129f..9a7cc1be 100644 > >> >> >--- a/arch/x86/kvm/x86.c > >> >> >+++ b/arch/x86/kvm/x86.c > >> >> >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu) > >> >> >static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > >> >> >{ > >> >> >int r; > >> >> >+ int nmi_pending; > >> >> >bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&& > >> >> >vcpu->run->request_interrupt_window; > >> >> > > >> >> >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > >> >> >if (unlikely(r)) > >> >> >goto out; > >> >> > > >> >> >+ nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending); > >> >> >+ > >> >> >if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { > >> >> >inject_pending_event(vcpu); > >> >> > > >> >> >/* enable NMI/IRQ window open exits if needed */ > >> >> >- if (vcpu->arch.nmi_pending) > >> >> >+ if (nmi_pending) > >> >> >kvm_x86_ops->enable_nmi_window(vcpu); > >> >> >else if (kvm_cpu_has_interrupt(vcpu) || req_int_win) > >> >> >kvm_x86_ops->enable_irq_window(vcpu); > >> >> > > >> >> > >> >> What about the check in inject_pending_events()? > >> >> > >> >Didn't we decide that this check is not a problem? Worst that can happen > >> >is NMI injection will be delayed till next exit. > >> > >> Could be very far in the future. > >> > >Next host interrupt. But with tickless host and guest, yeah. > > esp. important with NMI, which may be used in a situation where your tick (and everything else) are dead.

If the host is alive, the CPU should eventually receive a reschedule IPI.

-- Gleb.
Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions
On Thu, Mar 31, 2011 at 05:18:28AM -0400, Avi Kivity wrote: > On 03/31/2011 09:14 AM, Roedel, Joerg wrote: > > On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote: > > > The spec indicates we need to check the TSS and IOPL based permissions > > > before the intercept (vmx agrees). With the code as is, it happens > > > afterwards. > > > One way to do this is to have an ExtraChecks bit in the opcode::flags. > > > Then opcode::u.xcheck->perms() is the pre-intercept check and > > > opcode::u.xcheck->execute() is the post-intercept execution. Should > > > work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us. > > Okay, as you suggested, I put these checks into the instruction emulator > > and let the hard work of implementing per-arch checks to the nested-vmx > > people ;) > > I doubt that this makes the opcode-tables more readable, but let's see :) > I think we're miscommunicating. I'm talking about x86 checks, not virt > vendor specific checks. The place of the intercept check may be vendor specific. I haven't looked at the Intel spec, though. But there are probably differences. > For example, the flow for IOIO would be: > >#UD check (lock prefix) >PE/IOPL/CPL/VM check >TSS bitmap check (can cause #PF) >Intercept check >Operand segment check >Possible #PF >Execution > > We need to make sure the TSS bitmap check happens before the intercept, > so we need to split ->execute() into two. Right. For the generic case, how about factoring out the checks (for the POST_EX intercept case) into a separate excp_check-callback (similar to the execute-callback) and executing it before the post-exception-intercept check?

Joerg
Re: KVM: x86: better fix for race between nmi injection and enabling nmi window
On Wed, Mar 30, 2011 at 08:47:03PM +0200, Gleb Natapov wrote: > On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote: > > On 03/30/2011 06:30 PM, Marcelo Tosatti wrote: > > >Based on Gleb's idea, fix race between nmi injection and enabling > > >nmi window in a simpler way. > > > > > >Signed-off-by: Marcelo Tosatti > > > > > >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > >index a6a129f..9a7cc1be 100644 > > >--- a/arch/x86/kvm/x86.c > > >+++ b/arch/x86/kvm/x86.c > > >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu) > > > static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > > > { > > > int r; > > >+ int nmi_pending; > > > bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&& > > > vcpu->run->request_interrupt_window; > > > > > >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > > > if (unlikely(r)) > > > goto out; > > > > > >+ nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending); > > >+ > > > if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { > > > inject_pending_event(vcpu); > > > > > > /* enable NMI/IRQ window open exits if needed */ > > >- if (vcpu->arch.nmi_pending) > > >+ if (nmi_pending) > > > kvm_x86_ops->enable_nmi_window(vcpu); > > > else if (kvm_cpu_has_interrupt(vcpu) || req_int_win) > > > kvm_x86_ops->enable_irq_window(vcpu); > > > > > > > What about the check in inject_pending_events()? > > > Didn't we decide that this check is not a problem? Worst that can happen > is NMI injection will be delayed till next exit. Yes.
Re: KVM: x86: better fix for race between nmi injection and enabling nmi window
On Thu, Mar 31, 2011 at 11:25:46AM +0200, Avi Kivity wrote: > >> >> >else if (kvm_cpu_has_interrupt(vcpu) || req_int_win) > >> >> >kvm_x86_ops->enable_irq_window(vcpu); > >> >> > > >> >> > >> >> What about the check in inject_pending_events()? > >> >> > >> >Didn't we decide that this check is not a problem? Worst that can happen > >> >is NMI injection will be delayed till next exit. > >> > >> Could be very far in the future. > >> > >Next host interrupt. But with tickless host and guest, yeah. > > esp. important with NMI, which may be used in a situation where your > tick (and everything else) are dead.

Well, the host must be alive. So eventually the NMI will be delivered to the guest, but not closely matching guest visible clocks. What is the problem with that? The race should be rare enough not to interfere with NMI watchdog-like things?
Re: KVM: x86: better fix for race between nmi injection and enabling nmi window
On 03/31/2011 11:24 AM, Gleb Natapov wrote: On Thu, Mar 31, 2011 at 11:23:28AM +0200, Avi Kivity wrote: > On 03/30/2011 08:47 PM, Gleb Natapov wrote: > >On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote: > >> On 03/30/2011 06:30 PM, Marcelo Tosatti wrote: > >> >Based on Gleb's idea, fix race between nmi injection and enabling > >> >nmi window in a simpler way. > >> > > >> >Signed-off-by: Marcelo Tosatti > >> > > >> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > >> >index a6a129f..9a7cc1be 100644 > >> >--- a/arch/x86/kvm/x86.c > >> >+++ b/arch/x86/kvm/x86.c > >> >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu) > >> >static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > >> >{ > >> > int r; > >> >+ int nmi_pending; > >> > bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&& > >> > vcpu->run->request_interrupt_window; > >> > > >> >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > >> > if (unlikely(r)) > >> > goto out; > >> > > >> >+ nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending); > >> >+ > >> > if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { > >> > inject_pending_event(vcpu); > >> > > >> > /* enable NMI/IRQ window open exits if needed */ > >> >- if (vcpu->arch.nmi_pending) > >> >+ if (nmi_pending) > >> > kvm_x86_ops->enable_nmi_window(vcpu); > >> > else if (kvm_cpu_has_interrupt(vcpu) || req_int_win) > >> > kvm_x86_ops->enable_irq_window(vcpu); > >> > > >> > >> What about the check in inject_pending_events()? > >> > >Didn't we decide that this check is not a problem? Worst that can happen > >is NMI injection will be delayed till next exit. > > Could be very far in the future. > Next host interrupt. But with tickless host and guest, yeah. esp. important with NMI, which may be used in a situation where your tick (and everything else) are dead.
Re: KVM: x86: better fix for race between nmi injection and enabling nmi window
On Thu, Mar 31, 2011 at 11:23:28AM +0200, Avi Kivity wrote: > On 03/30/2011 08:47 PM, Gleb Natapov wrote: > >On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote: > >> On 03/30/2011 06:30 PM, Marcelo Tosatti wrote: > >> >Based on Gleb's idea, fix race between nmi injection and enabling > >> >nmi window in a simpler way. > >> > > >> >Signed-off-by: Marcelo Tosatti > >> > > >> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > >> >index a6a129f..9a7cc1be 100644 > >> >--- a/arch/x86/kvm/x86.c > >> >+++ b/arch/x86/kvm/x86.c > >> >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu) > >> > static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > >> > { > >> > int r; > >> >+int nmi_pending; > >> > bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&& > >> > vcpu->run->request_interrupt_window; > >> > > >> >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > >> > if (unlikely(r)) > >> > goto out; > >> > > >> >+nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending); > >> >+ > >> > if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { > >> > inject_pending_event(vcpu); > >> > > >> > /* enable NMI/IRQ window open exits if needed */ > >> >-if (vcpu->arch.nmi_pending) > >> >+if (nmi_pending) > >> > kvm_x86_ops->enable_nmi_window(vcpu); > >> > else if (kvm_cpu_has_interrupt(vcpu) || req_int_win) > >> > kvm_x86_ops->enable_irq_window(vcpu); > >> > > >> > >> What about the check in inject_pending_events()? > >> > >Didn't we decide that this check is not a problem? Worst that can happen > >is NMI injection will be delayed till next exit. > > Could be very far in the future. > Next host interrupt. But with tickless host and guest, yeah.

-- Gleb.
Re: KVM: x86: better fix for race between nmi injection and enabling nmi window
On 03/30/2011 08:47 PM, Gleb Natapov wrote: On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote: > On 03/30/2011 06:30 PM, Marcelo Tosatti wrote: > >Based on Gleb's idea, fix race between nmi injection and enabling > >nmi window in a simpler way. > > > >Signed-off-by: Marcelo Tosatti > > > >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > >index a6a129f..9a7cc1be 100644 > >--- a/arch/x86/kvm/x86.c > >+++ b/arch/x86/kvm/x86.c > >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu) > > static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > > { > > int r; > >+ int nmi_pending; > > bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&& > > vcpu->run->request_interrupt_window; > > > >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > > if (unlikely(r)) > > goto out; > > > >+ nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending); > >+ > > if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { > > inject_pending_event(vcpu); > > > > /* enable NMI/IRQ window open exits if needed */ > >- if (vcpu->arch.nmi_pending) > >+ if (nmi_pending) > > kvm_x86_ops->enable_nmi_window(vcpu); > > else if (kvm_cpu_has_interrupt(vcpu) || req_int_win) > > kvm_x86_ops->enable_irq_window(vcpu); > > > > What about the check in inject_pending_events()? > Didn't we decide that this check is not a problem? Worst that can happen is NMI injection will be delayed till next exit. Could be very far in the future.
Re: [PATCH]arch:x86:kvm:i8254.h Fix typo in kvm_pit
On 03/30/2011 07:42 PM, Justin P. Mattock wrote: On 03/30/2011 10:17 AM, Avi Kivity wrote: On 03/30/2011 06:30 PM, Justin P. Mattock wrote: On 03/30/2011 09:26 AM, Avi Kivity wrote: On 03/30/2011 06:19 PM, Justin P. Mattock wrote: The below patch changes base_addresss to base_address. Note: I have grepped for base_addresss and nothing shows up, grepping for base_address gets me lots of output, telling me that this is a typo, but could be wrong. Signed-off-by: Justin P. Mattock --- arch/x86/kvm/i8254.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h index 46d08ca..c2fa48b 100644 --- a/arch/x86/kvm/i8254.h +++ b/arch/x86/kvm/i8254.h @@ -33,7 +33,7 @@ struct kvm_kpit_state { }; struct kvm_pit { - unsigned long base_addresss; + unsigned long base_address; struct kvm_io_device dev; struct kvm_io_device speaker_dev; struct kvm *kvm; Why not remove the variable completely? didn't even think to completely remove the variable (figured it was used somewhere). I will look at that and resend with removal of the variable for you. Well if it was used, you ought to have changed all of the users, no? At the moment I see: (keep in mind my reading skills only go so far!) grep -Re base_address kvm/* -n kvm/ioapic.c:276: return ((addr >= ioapic->base_address && kvm/ioapic.c:277:(addr < ioapic->base_address + IOAPIC_MEM_LENGTH))); kvm/ioapic.c:371: ioapic->base_address = IOAPIC_DEFAULT_BASE_ADDRESS; kvm/ioapic.h:38:u64 base_address; so changing base_addresss; to base_address; gets kvm_ioapic_reset to function correctly as well as ioapic_in_range? (but could be wrong) Can you explain how kvm_ioapic_reset() would be affected by the change? Really, you need to understand what you're doing before sending patches.
Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions
On 03/31/2011 09:14 AM, Roedel, Joerg wrote: On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote: > The spec indicates we need to check the TSS and IOPL based permissions > before the intercept (vmx agrees). With the code as is, it happens > afterwards. > One way to do this is to have an ExtraChecks bit in the opcode::flags. > Then opcode::u.xcheck->perms() is the pre-intercept check and > opcode::u.xcheck->execute() is the post-intercept execution. Should > work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us. Okay, as you suggested, I put these checks into the instruction emulator and let the hard work of implementing per-arch checks to the nested-vmx people ;) I doubt that this makes the opcode-tables more readable, but let's see :) I think we're miscommunicating. I'm talking about x86 checks, not virt vendor specific checks. For example, the flow for IOIO would be: #UD check (lock prefix) PE/IOPL/CPL/VM check TSS bitmap check (can cause #PF) Intercept check Operand segment check Possible #PF Execution We need to make sure the TSS bitmap check happens before the intercept, so we need to split ->execute() into two.
Re: [PATCH 0/3] Unmapped page cache control (v5)
* KOSAKI Motohiro [2011-03-31 14:40:33]: > > > > The following series implements page cache control, > > this is a split out version of patch 1 of version 3 of the > > page cache optimization patches posted earlier at > > Previous posting http://lwn.net/Articles/425851/ and analysis > > at http://lwn.net/Articles/419713/ > > > > Detailed Description > > > > This patch implements unmapped page cache control via preferred > > page cache reclaim. The current patch hooks into kswapd and reclaims > > page cache if the user has requested for unmapped page control. > > This is useful in the following scenario > > - In a virtualized environment with cache=writethrough, we see > > double caching - (one in the host and one in the guest). As > > we try to scale guests, cache usage across the system grows. > > The goal of this patch is to reclaim page cache when Linux is running > > as a guest and get the host to hold the page cache and manage it. > > There might be temporary duplication, but in the long run, memory > > in the guests would be used for mapped pages. > > > > - The option is controlled via a boot option and the administrator > > can selectively turn it on, on a need to use basis. > > > > A lot of the code is borrowed from zone_reclaim_mode logic for > > __zone_reclaim(). One might argue that the with ballooning and > > KSM this feature is not very useful, but even with ballooning, > > we need extra logic to balloon multiple VM machines and it is hard > > to figure out the correct amount of memory to balloon. With these > > patches applied, each guest has a sufficient amount of free memory > > available, that can be easily seen and reclaimed by the balloon driver. > > The additional memory in the guest can be reused for additional > > applications or used to start additional guests/balance memory in > > the host. > > If anyone thinks this series works, they are just crazy. This patch reintroduces > two old issues.
> 1) Zone reclaim doesn't work if the system has multiple nodes and the workload is file cache oriented (e.g. file server, web server, mail server, et al.), because zone reclaim makes somewhat more free pages than zone->pages_min, and then new page cache requests consume the nearest node's memory, which brings on the next zone reclaim. Memory utilization is reduced and unnecessary LRU discard is increased dramatically. SGI folks added a CPUSET-specific solution in the past (cpuset.memory_spread_page), but global reclaim still has its issue. Zone reclaim is an HPC-workload-specific feature and HPC folks have no motivation not to use CPUSET.

I am afraid you misread the patches and the intent. The intent is to explicitly enable control of unmapped pages; it has nothing specifically to do with multiple nodes at this point. The control is system wide and carefully enabled by the administrator.

> 2) Before 2.6.27, the VM had only one LRU and calc_reclaim_mapped() was used to decide whether to filter out mapped pages. It made a lot of problems for DB servers and large application servers. Because, if the system has a lot of mapped pages, 1) the LRU was churned and the reclaim algorithm became a lotree one, and 2) reclaim latency became terribly slow and hangup detectors misdetected its state and started to force a reboot. That was a big problem for RHEL5-based banking systems. So, sc->may_unmap should be killed in the future. Don't increase its uses.

Can you remove sc->may_unmap without removing zone_reclaim()? The LRU churn can be addressed at the time of isolation; I'll send out an incremental patch for that.

> But, I agree that now we have to consider a slightly large VM change, perhaps (or perhaps not). OK, it's a good opportunity to fill out some things. Historically, Linux MM has had a "free memory is wasted memory" policy, and it worked completely fine. But now we have a few exceptions.
>
> 1) RT, embedded and finance systems.
They really hope to avoid reclaim latency (i.e. avoid foreground reclaim completely), and they can accept keeping slightly more free pages ahead of memory shortage.
>
> 2) VM guest. A VM host and VM guest naturally make a two-level page cache model, and the Linux page cache doesn't work well with two levels. It has two issues: 1) It is hard to visualize real memory consumption. That makes it harder to make the balloon work well. And Google wants to visualize memory utilization to pack in more jobs. 2) It is hard to build an in-kernel memory utilization improvement mechanism.
>
> And, now we have four proposals for utilization-related issues.
>
> 1) cleancache (from Oracle)

Cleancache requires both hypervisor and guest support. With these patches, Linux can run well under a hypervisor if we know the hypervisor does a lot of the IO and maintains the cache.

> 2) VirtFS (from IBM) > 3) kstaled (from Google) > 4) unmapped page reclaim (from you) > > Probably, we can't merge all of them and we need t
Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions
On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote: > The spec indicates we need to check the TSS and IOPL based permissions > before the intercept (vmx agrees). With the code as is, it happens > afterwards. > One way to do this is to have an ExtraChecks bit in the opcode::flags. > Then opcode::u.xcheck->perms() is the pre-intercept check and > opcode::u.xcheck->execute() is the post-intercept execution. Should > work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us. Okay, as you suggested, I put these checks into the instruction emulator and let the hard work of implementing per-arch checks to the nested-vmx people ;) I doubt that this makes the opcode-tables more readable, but let's see :)

Joerg