[ kvm-Bugs-1884444 ] Virtual network (set up via TCP socket) dog slow, but vde ok
Bugs item #188, was opened at 2008-02-01 15:30
Message generated for change (Comment added) made by jessorensen
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=188&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request, not just the latest
update.

Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Alain Knaff (alain_knaff)
Assigned to: Anthony Liguori (aliguori)
Summary: Virtual network (set up via TCP socket) dog slow, but vde ok

Initial Comment:
I've got two kvm virtual machines that I connect together via a virtual
network:

1. A samba server (Kubuntu 7.10), whose network I set up with
   -net nic,vlan=1,macaddr=00:1B:FC:1C:00:01 -net socket,vlan=1,listen=127.0.0.1:1234

2. A Windows XP client, whose network I set up with
   -net nic,vlan=1,macaddr=00:1B:FC:1C:01:11 -net socket,vlan=1,connect=127.0.0.1:1234

The Windows XP VM has joined the Samba VM's domain. When a domain user logs
in, it takes literally _minutes_ until its profile data is copied to the
client, and the same thing happens copying it back to the server on logout.

Setting up both with VDE, it's *much* quicker (only a couple of seconds,
rather than minutes, for login/logout).

I also tried udp mcast networking, but couldn't get both VMs to communicate
with each other. A tcpdump on the host shows traffic on port 1234. However,
a tcpdump on the Kubuntu guest shows no traffic (not even arp) when I try to
ping it from the Windows VM.

--

Comment By: Jes Sorensen (jessorensen)
Date: 2010-06-14 08:21

Message:
Hi,

Per Anthony's add, any chance you could let us know if this problem was
resolved? I am trying to go through the old bugs and if this one is fixed,
I'd like to close it.
Thanks,
Jes

--

Comment By: Anthony Liguori (aliguori)
Date: 2008-05-26 19:16

Message:
Logged In: YES
user_id=120449
Originator: NO

I suspect this is related to whether SIGIO was enabled on the file
descriptors. With the recent IO thread refactorings, this should be fixed.
Please test if you can, or I will queue it up to test myself.

--

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=188&group_id=180599

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
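For reference, the SIGIO mechanism Anthony refers to is the fcntl F_SETOWN/O_ASYNC facility: once armed on a descriptor, the kernel delivers a signal when the fd becomes ready instead of waiting for the main loop to poll it. A minimal illustrative sketch (Python, not QEMU code; the helper name is made up):

```python
import fcntl
import os

def enable_sigio(fd):
    """Arm signal-driven I/O (SIGIO) on fd, roughly what QEMU does for
    its -net socket descriptors when the fix Anthony mentions is in place."""
    # Route the signal to this process.
    fcntl.fcntl(fd, fcntl.F_SETOWN, os.getpid())
    # O_ASYNC asks the kernel to raise SIGIO on readiness changes.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_ASYNC)
```

Without this, a descriptor is only serviced when the main loop happens to look at it, which is consistent with the minutes-long profile copies reported above.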
Re: [Autotest] [KVM-AUTOTEST PATCH 13/14] KVM test: kvm_monitor.py: add QMP interface
On 06/14/2010 05:39 AM, Amos Kong wrote:
On Sun, Jun 13, 2010 at 05:33:44PM +0300, Michael Goldish wrote:

An initial QMP client implementation. Should be fully functional and
supports asynchronous events. However, most tests must be modified to
support it, because it returns output in a different format from the human
monitor (the human monitor returns strings and the QMP one returns dicts
or lists).

To enable QMP, set main_monitor to a monitor whose monitor_type is "qmp".
For example (a single QMP monitor):

    monitors = monitor1
    monitor_type_monitor1 = qmp
    main_monitor = monitor1

Another example (multiple monitors, both human and QMP):

    monitors = MyMonitor SomeOtherMonitor YetAnotherMonitor  # defines 3 monitors
    monitor_type = human                  # default for all monitors
    monitor_type_SomeOtherMonitor = qmp   # applies only to SomeOtherMonitor
    monitor_type_YetAnotherMonitor = qmp  # applies only to YetAnotherMonitor
    main_monitor = SomeOtherMonitor       # the main monitor is a QMP one, so
                                          # the test will use QMP

Note: Monitor methods now raise exceptions such as MonitorLockError and
QMPCmdError. If this turns out to be a bad idea, it shouldn't be hard to
revert to the old convention of returning a (status, output) tuple.

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_monitor.py | 275 +++
 client/tests/kvm/kvm_vm.py      |   6 +-
 2 files changed, 279 insertions(+), 2 deletions(-)

diff --git a/client/tests/kvm/kvm_monitor.py b/client/tests/kvm/kvm_monitor.py
index c5cf9c3..76a1a83 100644
--- a/client/tests/kvm/kvm_monitor.py
+++ b/client/tests/kvm/kvm_monitor.py
@@ -6,6 +6,11 @@ Interfaces to the QEMU monitor.
 import socket, time, threading, logging
 import kvm_utils

+try:
+    import json
+except ImportError:
+    logging.warning("Could not import json module. "
+                    "QMP monitor functionality disabled.")
+

 class MonitorError(Exception):
@@ -28,6 +33,10 @@ class MonitorProtocolError(MonitorError):
     pass


+class QMPCmdError(MonitorError):
+    pass
+
+
 class Monitor:
     """
     Common code for monitor classes.
     """
@@ -114,6 +123,8 @@ class HumanMonitor(Monitor):
            suppress_exceptions is False
         @raise MonitorProtocolError: Raised if the initial (qemu) prompt isn't
            found and suppress_exceptions is False
+        @note: Other exceptions may be raised.  See _get_command_output's
+           docstring.
         """
         try:
             Monitor.__init__(self, filename)
@@ -354,3 +365,267 @@ class HumanMonitor(Monitor):
         @return: The command's output
         """
         return self._get_command_output("mouse_button %d" % state)
+
+
+class QMPMonitor(Monitor):
+    """
+    Wraps QMP monitor commands.
+    """
+
+    def __init__(self, filename, suppress_exceptions=False):
+        """
+        Connect to the monitor socket and issue the qmp_capabilities command
+
+        @param filename: Monitor socket filename
+        @raise MonitorConnectError: Raised if the connection fails and
+            suppress_exceptions is False
+        @note: Other exceptions may be raised if the qmp_capabilities command
+            fails.  See _get_command_output's docstring.
+        """
+        try:
+            Monitor.__init__(self, filename)
+
+            self.protocol = "qmp"
+            self.events = []
+
+            # Issue qmp_capabilities
+            self._get_command_output("qmp_capabilities")
+
+        except MonitorError, e:
+            if suppress_exceptions:
+                logging.warn(e)
+            else:
+                raise
+
+
+    # Private methods
+
+    def _build_cmd(self, cmd, args=None):
+        obj = {"execute": cmd}
+        if args:
+            obj["arguments"] = args
+        return obj
+
+
+    def _read_objects(self, timeout=5):
+        """
+        Read lines from monitor and try to decode them.
+        Stop when all available lines have been successfully decoded, or when
+        timeout expires.  If any decoded objects are asynchronous events, store
+        them in self.events.  Return all decoded objects.
+
+        @param timeout: Time to wait for all lines to decode successfully
+        @return: A list of objects
+        """
+        s = ""
+        objs = []
+        end_time = time.time() + timeout
+        while time.time() < end_time:
+            s += self._recvall()
+            for line in s.splitlines():
+                if not line:
+                    continue
+                try:
+                    obj = json.loads(line)
+                except:
+                    # Found an incomplete or broken line -- keep reading
+                    break
+                objs += [obj]
+            else:
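The accumulate-and-decode loop in _read_objects above can be illustrated standalone: collect bytes, decode each complete newline-delimited JSON object, and keep any incomplete tail for the next read. A simplified self-contained sketch of that idea (a hypothetical helper, not part of the patch):

```python
import json

def decode_lines(buf):
    """Decode newline-delimited JSON from a text buffer.

    Returns (objects, leftover): all objects that decoded cleanly, plus any
    trailing fragment that should be retried once more data arrives -- the
    same strategy _read_objects uses against the QMP socket.
    """
    objs = []
    leftover = ""
    for line in buf.split("\n"):
        if not line:
            continue
        try:
            objs.append(json.loads(line))
        except ValueError:
            # Incomplete or broken line -- keep it for the next read
            leftover = line
            break
    return objs, leftover
```

The `else` clause on the patch's `for` loop plays the same role as `leftover` here: it distinguishes "everything decoded" from "a partial line remains".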
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
Alex Williamson alex.william...@redhat.com writes:

qdev_get_dev_path() is intended to be the canonical utility for creating a
string representing the qdev hierarchy of a device. The path consists of
bus and device names as well as identified properties of the immediate
parent bus and device. This results in strings like:

    /main-system-bus/pci.0,addr=00.0/i440FX
    /main-system-bus/pci.0,addr=01.0/PIIX3
    /main-system-bus/pci.0,addr=02.0/cirrus-vga
    /main-system-bus/pci.0/isa.0/mc146818rtc
    /main-system-bus/pci.0/isa.0/isa-serial
    /main-system-bus/pci.0/isa.0/i8042
    /main-system-bus/pci.0/isa.0/isa-fdc
    /main-system-bus/pci.0,addr=03.0/i82551,mac=52:54:00:12:34:56
    /main-system-bus/pci.0,addr=04.0/virtio-net-pci,mac=52:54:00:12:34:57
    /main-system-bus/pci.0,addr=05.0/e1000,mac=52:54:00:12:34:58
    /main-system-bus/pci.0,addr=06.0/rtl8139,mac=52:54:00:12:34:59
    /main-system-bus/pci.0,addr=07.0/pcnet,mac=52:54:00:12:34:5a
    /main-system-bus/pci.0,addr=01.1/piix3-ide
    /main-system-bus/pci.0,addr=01.3/PIIX4_PM
    /main-system-bus/pci.0,addr=08.0/lsi53c895a
    /main-system-bus/pci.0,addr=09.0/virtio-blk-pci

There are two primary targets for these strings. The first is vmsave, where
we currently use a device-provided string plus instance number to track
SaveStateEntries. This fails when we introduce device hotplug, particularly
in a case where we have gaps in the instance numbers that cannot easily be
reproduced on a migration target. The second is for naming RAMBlocks. For
these, we would like a string that matches the vmstate string.

Could you explain why you add "identified properties of the immediate
parent bus and device"? They make the result very much *not* a dev path in
the qdev sense...
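The path format in the list above is straightforward to model: each level of the hierarchy contributes its name, plus any identifying properties appended as `,prop=value`. A rough illustrative sketch (Python, not the proposed C implementation):

```python
def qdev_path(hierarchy):
    """Build a canonical device path string.

    hierarchy: list of (name, props) pairs from the root bus down to the
    device, where props is a dict of identifying properties (may be empty).
    Mirrors the /bus/name,prop=value/... format shown in the patch posting.
    """
    parts = []
    for name, props in hierarchy:
        if props:
            # Sort for a deterministic, migration-stable string.
            name += "," + ",".join("%s=%s" % kv for kv in sorted(props.items()))
        parts.append(name)
    return "/" + "/".join(parts)
```

Determinism matters here: since the string keys SaveStateEntries across migration, the same device must always produce byte-identical output on source and target.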
Re: [PATCH] [BSG]: Add support for struct sg_io_v4->d[out,in]_iovec_count
On Mon, 2010-06-14 at 15:22 +0900, FUJITA Tomonori wrote:
On Sun, 13 Jun 2010 23:11:29 -0700 Nicholas A. Bellinger n...@linux-iscsi.org wrote:

From: Nicholas Bellinger n...@linux-iscsi.org

Greetings Jens and co,

This patch adds the missing support to block/bsg.c:bsg_map_hdr() to accept
struct sg_io_v4->d[out,in]_iovec_count and process struct
sg_io_v4->d[out,in]_xferp memory containing userspace iovecs for kernel
level BSG. It adds a new wrapper bsg_rq_map_user_iov() that will call
copy_from_user() and blk_rq_map_user_iov(), following the original SG_IO
logic in drivers/scsi/sg.c:sg_start_req().

So far this has been tested on an x86_64 v2.6.34 KVM host with TCM_Loop
Virtual SAS Port/LUNs into an x86_64 v2.6.26 KVM guest with Megasas 8707EM2
HBA emulation + my new scsi-bsg backstore code.

Please consider this for v2.6.36 as it will be required in order for
QEMU-KVM MegaSAS and VirtIO HBA emulation using QEMU scatterlist memory and
BSG backstores.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 block/bsg.c | 55 +++
 1 files changed, 47 insertions(+), 8 deletions(-)

This has been rejected several times.

I guess that would explain why the iovec_count structure members are part
of struct sg_io_v4 and block/bsg.c:bsg_map_hdr() silently accepts and
overwrites them.. :(

This doesn't work on 32bit user/64bit kernel.

Hmmm, being able to use QEMU-KVM HBA emulation with userspace scatterlist
memory for the various HBA emulations with a matched user/kernel is still
useful though, yes..? Being able to handle TMR emulation from QEMU-KVM HBA
emulation would also be useful, I think. Having QEMU-KVM fall back to SG_IO
for the 32bit user/64bit kernel case would be acceptable for me if we can
still make BSG + iovecs work for the typical case.

The compat code doesn't work for the read/write interface (it does for the
ioctl interface though). So we can't support this feature cleanly.
So I am curious to see if there will be a performance improvement between
QEMU-KVM + scsi-bsg w/ AIO IOVECs compared to the legacy SG_IO ioctl().
Anyways, I will post my QEMU-KVM scsi-bsg patches and run some benchmarks
this week with this patch on 5500 series Nehalem to see, aside from the
userspace TMR case, how useful having a BSG backstore for QEMU-KVM may be.

Best,
--nab
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-14 09:28:19]:
On Mon, 14 Jun 2010 00:01:45 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:
* Balbir Singh bal...@linux.vnet.ibm.com [2010-06-08 21:21:46]:

Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred page cache
reclaim. The current patch hooks into kswapd and reclaims page cache if the
user has requested unmapped page control.

This is useful in the following scenario: in a virtualized environment with
cache=writethrough, we see double caching (one in the host and one in the
guest). As we try to scale guests, cache usage across the system grows. The
goal of this patch is to reclaim page cache when Linux is running as a
guest and get the host to hold the page cache and manage it. There might be
temporary duplication, but in the long run, memory in the guests would be
used for mapped pages.

The option is controlled via a boot option and the administrator can
selectively turn it on, on a need-to-use basis. A lot of the code is
borrowed from the zone_reclaim_mode logic for __zone_reclaim().

One might argue that with ballooning and KSM this feature is not very
useful, but even with ballooning, we need extra logic to balloon multiple
VM machines and it is hard to figure out the correct amount of memory to
balloon. With these patches applied, each guest has a sufficient amount of
free memory available, that can be easily seen and reclaimed by the balloon
driver. The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in the host.
KSM currently does not de-duplicate host and guest page cache. The goal of
this patch is to help automatically balance unmapped page cache when
instructed to do so.
There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and
the number of pages to reclaim when the unmapped_page_control argument is
supplied. These numbers were chosen to avoid aggressiveness in reaping page
cache ever so frequently, while at the same time providing control. The
sysctl for min_unmapped_ratio provides further control from within the
guest on the amount of unmapped pages to reclaim.

Are there any major objections to this patch?

This kind of patch needs measurement of how well it works.
- How did you measure the effect of the patch ? kernbench is not enough, of
course.

I can run other benchmarks as well, I will do so.

- Why don't you believe LRU ? And if LRU doesn't work well, should it be
fixed by a knob rather than a generic approach ?
- No side effects ?

I believe in LRU, just that the problem I am trying to solve is of using
double the memory for caching the same data (consider kvm running in
cache=writethrough or writeback mode; both the hypervisor and the guest OS
maintain a page cache of the same data). As the VMs grow, the overhead is
substantial. In my runs I found up to 60% duplication in some cases.

- Linux vm guys tend to say, free memory is bad memory. OK, what is the
free memory created by your patch used for ? IOW, I can't see the benefit.
If the free memory your patch created will be used for another page-cache,
it will be dropped soon by your patch itself.

Free memory is good for cases when you want to do more on the same system.
I agree that in a bare metal environment that might be partially true. I
don't have a problem with frequently used data being cached, but I am
targeting a consolidated environment at the moment. Moreover, the
administrator has control via a boot option, so it is non-intrusive in many
ways.

If your patch just drops duplicated pages that are no longer necessary for
other kvm guests, I agree your patch may increase the available size of the
page caches. But you just drop unmapped pages.
unmapped and unused are the best targets, I plan to add slab cache control
later.

--
Three Cheers,
Balbir
Re: [RFC PATCH 0/5] Introduce canonical device hierarchy string
Hi,

My premise with this attempt is that we walk the hierarchy and use the
names to create the base of the path. As we get to the device, particularly
to the parent bus of the device, we need to start looking at properties to
ensure uniqueness.

You'll need that for every bus along the way down to the device. Create a
virtual machine with two lsi scsi host adapters, then attach a disk with
scsi id 0 to each. Just the scsi id isn't good enough to identify the
device. You'll need the lsi pci address too.

For now, the only properties I've tagged as path properties are PCI bus
addresses and MAC addresses.

mac address isn't needed here. You need the property which specifies the
bus address. For PCI this obviously is the PCI address. For scsi the scsi
id. For ISA you can use the I/O port base. virtio-serial the port
number, ...

cheers,
Gerd
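Gerd's point, that each bus type contributes its own kind of address to the path component, could be modeled as a simple lookup table. A hypothetical sketch (the property names here are assumptions for illustration; the real qdev property names may differ):

```python
# Assumed per-bus "address" property, following the examples Gerd gives.
BUS_ADDRESS_PROP = {
    "PCI":           "addr",      # PCI slot.function
    "SCSI":          "scsi-id",   # SCSI target id
    "ISA":           "iobase",    # I/O port base
    "virtio-serial": "port",      # port number
}

def path_component(dev_name, bus_type, bus_addr):
    """Build one path component, qualified by the parent bus's address
    property, so two identical devices on sibling buses stay distinct."""
    prop = BUS_ADDRESS_PROP[bus_type]
    return "%s,%s=%s" % (dev_name, prop, bus_addr)
```

With this, the two-lsi-adapters example works out: each disk's component carries its scsi id, and the parent lsi components carry distinct PCI addresses, so the full paths differ.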
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
On Mon, 14 Jun 2010 12:19:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

- Why don't you believe LRU ? And if LRU doesn't work well, should it be
fixed by a knob rather than a generic approach ?
- No side effects ?

I believe in LRU, just that the problem I am trying to solve is of using
double the memory for caching the same data (consider kvm running in
cache=writethrough or writeback mode; both the hypervisor and the guest OS
maintain a page cache of the same data). As the VMs grow, the overhead is
substantial. In my runs I found up to 60% duplication in some cases.

- Linux vm guys tend to say, free memory is bad memory. OK, what is the
free memory created by your patch used for ? IOW, I can't see the benefit.
If the free memory your patch created will be used for another page-cache,
it will be dropped soon by your patch itself.

Free memory is good for cases when you want to do more on the same system.
I agree that in a bare metal environment that might be partially true. I
don't have a problem with frequently used data being cached, but I am
targeting a consolidated environment at the moment. Moreover, the
administrator has control via a boot option, so it is non-intrusive in many
ways.

It sounds that what you want is to improve performance etc. but to make it
easy to size the system and to help admins. Right ?

From a performance perspective, I don't see any advantage to dropping
caches which can be dropped easily. It just uses cpus for a purpose that
may not be necessary.

Thanks,
-Kame
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-14 16:00:21]:
On Mon, 14 Jun 2010 12:19:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

- Why don't you believe LRU ? And if LRU doesn't work well, should it be
fixed by a knob rather than a generic approach ?
- No side effects ?

I believe in LRU, just that the problem I am trying to solve is of using
double the memory for caching the same data (consider kvm running in
cache=writethrough or writeback mode; both the hypervisor and the guest OS
maintain a page cache of the same data). As the VMs grow, the overhead is
substantial. In my runs I found up to 60% duplication in some cases.

- Linux vm guys tend to say, free memory is bad memory. OK, what is the
free memory created by your patch used for ? IOW, I can't see the benefit.
If the free memory your patch created will be used for another page-cache,
it will be dropped soon by your patch itself.

Free memory is good for cases when you want to do more on the same system.
I agree that in a bare metal environment that might be partially true. I
don't have a problem with frequently used data being cached, but I am
targeting a consolidated environment at the moment. Moreover, the
administrator has control via a boot option, so it is non-intrusive in many
ways.

It sounds that what you want is to improve performance etc. but to make it
easy to size the system and to help admins. Right ?

Right, to allow freeing up the memory used to cache the same data twice.

From a performance perspective, I don't see any advantage to dropping
caches which can be dropped easily. It just uses cpus for a purpose that
may not be necessary.

It is not that easy: in a virtualized environment, you don't directly
reclaim, but use a mechanism like ballooning, and that too requires smart
software to decide where to balloon from. This patch (optionally, if
enabled) optimizes that by 1. Reducing double caching 2. Not requiring
newer smarts or management software to monitor and balloon 3.
Allows better estimation of free memory by avoiding double caching
4. Allows immediate use of free memory for other applications or startup of
newer guest instances.

--
Three Cheers,
Balbir
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
On Mon, 14 Jun 2010 13:06:46 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

It sounds that what you want is to improve performance etc. but to make it
easy to size the system and to help admins. Right ?

Right, to allow freeing up the memory used to cache the same data twice.

Oh, sorry. ask again.. It sounds that what you want is _not_ to improve
performance etc. but to make it ... ?

-Kame
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/11/2010 07:56 AM, Balbir Singh wrote:

Just to be clear, let's say we have a mapped page (say of /sbin/init)
that's been unreferenced since _just_ after the system booted. We also have
an unmapped page cache page of a file often used at runtime, say one from
/etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction
with this patch set?

In this case the order is as follows:
1. First we pick free pages, if any
2. If we don't have free pages, we go after unmapped page cache and slab
cache
3. If that fails as well, we go after regular memory

In the scenario that you describe, we'll not be able to easily free up the
frequently referenced page from /etc/*. The code will move on to step 3 and
do its regular reclaim.

Still it seems to me you are subverting the normal order of reclaim. I
don't see why an unmapped page cache or slab cache item should be evicted
before a mapped page. Certainly the cost of rebuilding a dentry, compared
to the gain from evicting it, is much higher than that of reestablishing a
mapped page.

--
error compiling committee.c: too many arguments to function
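For clarity, the three-step preference order Balbir describes amounts to the following (an illustrative sketch only; the real logic lives in mm/vmscan.c and operates on LRU lists, not simple pools):

```python
def pick_reclaim_targets(free_pages, unmapped_cache, slab, mapped, needed):
    """Model the patch's reclaim preference: free pages first, then
    unmapped page cache and slab, and only then mapped (regular) memory."""
    reclaimed = []
    for pool in (free_pages, unmapped_cache + slab, mapped):
        while pool and len(reclaimed) < needed:
            reclaimed.append(pool.pop())
        if len(reclaimed) >= needed:
            break
    return reclaimed
```

Avi's objection maps directly onto the middle tier: a hot unmapped page (or a dentry's slab page) sits ahead of a cold mapped page in this ordering, which inverts the usual reclaim cost calculus.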
Re: [PATCH 1/24] Move nested option from svm.c to x86.c
On 06/13/2010 03:23 PM, Nadav Har'El wrote:

The SVM module had a "nested" option, on by default, which controls whether
to allow nested virtualization. Now that VMX also supports nested
virtualization, we can move this option to x86.c, for both SVM and VMX.

The nested option takes three possible values. 0 disables nested
virtualization on both SVM and VMX, and 1 enables it on both. The value 2,
which is the default when this module option is not explicitly set, asks
each of SVM or VMX to choose its own default; currently, VMX disables
nested virtualization in this case, while SVM leaves it enabled. When
nested VMX becomes more mature, this default should probably be changed to
enable nested virtualization on both architectures.

--- .before/arch/x86/kvm/x86.c 2010-06-13 15:01:28.0 +0300
+++ .after/arch/x86/kvm/x86.c 2010-06-13 15:01:28.0 +0300
@@ -95,6 +95,17 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
 int ignore_msrs = 0;
 module_param_named(ignore_msrs, ignore_msrs, bool, S_IRUGO | S_IWUSR);

+/* If nested=1, nested virtualization is supported. I.e., the guest may use
+ * VMX or SVM (as appropriate) and be a hypervisor for its own guests.
+ * If nested=0, nested virtualization is not supported.
+ * When nested starts as 2 (which is the default), it is later modified by the
+ * specific module used (VMX or SVM). Currently, nested will be left enabled
+ * on SVM, but reset to 0 on VMX.
+ */
+int nested = 2;
+EXPORT_SYMBOL_GPL(nested);
+module_param(nested, int, S_IRUGO);
+

A global variable named 'nested' is not a good idea. I recommend having a
kvm-intel scope module parameter instead, that also avoids the 0/1/2
values. After the patches are merged we can try to consolidate here.

--
error compiling committee.c: too many arguments to function
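The 0/1/2 semantics being debated reduce to a small decision function (an illustrative sketch of the patch's described behavior, not kernel code):

```python
def effective_nested(nested, arch_default):
    """Resolve the tri-state 'nested' module option.

    nested: 0 = off, 1 = on, 2 = defer to the vendor module's own default.
    arch_default: what that module would pick when deferred to -- per the
    patch description, True for SVM (left enabled) and False for VMX
    (reset to 0) at the time of posting.
    """
    if nested == 2:
        return arch_default
    return bool(nested)
```

Avi's suggestion of a per-module (kvm-intel scoped) boolean parameter would remove the need for the magic value 2 entirely, since each module would then carry its own default.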
Re: [PATCH 2/24] Add VMX and SVM to list of supported cpuid features
On 06/13/2010 03:23 PM, Nadav Har'El wrote:

Add the VMX CPU feature to the list of CPU features KVM advertises with
the KVM_GET_SUPPORTED_CPUID ioctl (unless the nested module option is off).
Qemu uses this ioctl, and intersects KVM's list with its own list of
desired cpu features (depending on the -cpu option given to qemu) to
determine the final list of features presented to the guest.

This patch also does the same for SVM: KVM now advertises it supports SVM,
unless the nested module option is off.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/x86.c 2010-06-13 15:01:28.0 +0300
+++ .after/arch/x86/kvm/x86.c 2010-06-13 15:01:28.0 +0300
@@ -1923,7 +1923,7 @@ static void do_cpuid_ent(struct kvm_cpui
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
 		F(XMM3) | 0 /* Reserved, DTES64, MONITOR */ |
-		0 /* DS-CPL, VMX, SMX, EST */ |
+		0 /* DS-CPL */ | (nested ? F(VMX) : 0) | 0 /* SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
 		0 /* Reserved */ | F(CX16) | 0 /* xTPR Update, PDCM */ |
 		0 /* Reserved, DCA */ | F(XMM4_1) |

You can use kvm_x86_ops->set_supported_cpuid() to alter features.

@@ -1931,7 +1931,8 @@ static void do_cpuid_ent(struct kvm_cpui
 		0 /* Reserved, XSAVE, OSXSAVE */;
 	/* cpuid 0x80000001.ecx */
 	const u32 kvm_supported_word6_x86_features =
-		F(LAHF_LM) | F(CMP_LEGACY) | F(SVM) | 0 /* ExtApicSpace */ |
+		F(LAHF_LM) | F(CMP_LEGACY) | (nested ? F(SVM) : 0) |
+		0 /* ExtApicSpace */ |
 		F(CR8_LEGACY) | F(ABM) | F(SSE4A) | F(MISALIGNSSE) |
 		F(3DNOWPREFETCH) | 0 /* OSVW */ | 0 /* IBS */ | F(SSE5) |
 		0 /* SKINIT */ | 0 /* WDT */;

Good idea, but let's leave it out of the nvmx patches.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 3/24] Implement VMXON and VMXOFF
On 06/13/2010 03:24 PM, Nadav Har'El wrote:

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or
disabled VMX operation.

Should probably reorder with next patch.

+/* The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */

Please (here and elsewhere) use the standard kernel style for multiline
comments - start with /* on a line by itself.

+/* Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called VMXON pointer) because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* The Intel VMX Instruction Reference lists a bunch of bits that
+	 * are prerequisite to running VMXON, most notably CR4.VMXE must be
+	 * set to 1. Otherwise, we should fail with #UD. We test these now:
+	 */
+	if (!nested) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (!(vcpu->arch.cr4 & X86_CR4_VMXE) ||
+	    !(vcpu->arch.cr0 & X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if (is_long_mode(vcpu) && !cs.l) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
+	vmx->nested.vmxon = 1;

= true

+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}

Need to block INIT signals in the local apic as well (fine for a separate
patch).

--
error compiling committee.c: too many arguments to function
Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
On 06/13/2010 03:25 PM, Nadav Har'El wrote:

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (who can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it vmcs12, as it is the VMCS
that L1 keeps for its L2 guests.

This patch also adds the notion (as required by the VMX spec) of the
"current VMCS", and finally includes utility functions for mapping the
guest-allocated VMCSs in host memory.

+#define VMCS12_REVISION 0x11e57ed0

Where did this number come from? It's not from real hardware, yes?

+
+/*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure (which is opaque to the guest), and vmcs12 is our emulated
+ * VMX's VMCS. This structure is stored in guest memory specified by VMPTRLD,
+ * and accessed by the guest using VMREAD/VMWRITE/VMCLEAR instructions. More
+ * than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a VMCS for the underlying
+ * hardware which will be used to run L2.
+ * This structure is packed in order to preserve the binary content after live
+ * migration. If there are changes in the content or layout, VMCS12_REVISION
+ * must be changed.
+ */
+struct __attribute__ ((__packed__)) vmcs12 {

__packed is a convenient define for this.

+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+};

Note that this structure becomes an ABI, it cannot change except in a
backward compatible way due to the need for live migration. So I'd like a
documentation patch that adds a description of the content to
Documentation/kvm/. It can be as simple as listing the structure
definition.

+static struct page *nested_get_page(struct kvm_vcpu *vcpu, u64 vmcs_addr)
+{
+	struct page *vmcs_page =
+		gfn_to_page(vcpu->kvm, vmcs_addr >> PAGE_SHIFT);
+
+	if (is_error_page(vmcs_page)) {
+		printk(KERN_ERR "%s error allocating page 0x%llx\n",
+		       __func__, vmcs_addr);

Those printks can be used by a malicious guest to spam the host logs.
Please wrap them with something that is conditional on a debug flag.

I'm not sure what we need to do with a vmcs that is not in RAM. It may
simplify things to return the error_page to the caller and set
KVM_REQ_TRIPLE_FAULT, so we don't have to deal with error handling later
on.

+		kvm_release_page_clean(vmcs_page);
+		return NULL;
+	}
+	return vmcs_page;
+}
+
+static void nested_unmap_current(struct kvm_vcpu *vcpu)
+{
+	struct page *page;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.current_l2_page) {
+		printk(KERN_INFO "Shadow vmcs already unmapped\n");
+		BUG_ON(1);
+		return;
+	}
+
+	page = kmap_atomic_to_page(vmx->nested.current_l2_page);
+
+	kunmap_atomic(vmx->nested.current_l2_page, KM_USER0);
+
+	kvm_release_page_dirty(page);

Do we always dirty the page? I guess it is no big deal even if we don't.

--
error compiling committee.c: too many arguments to function
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 11:09:44]: On 06/11/2010 07:56 AM, Balbir Singh wrote: Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction with this patch set? In this case the order is as follows:

1. First we pick free pages, if any
2. If we don't have free pages, we go after unmapped page cache and slab cache
3. If that fails as well, we go after regular memory

In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim. Still it seems to me you are subverting the normal order of reclaim. I don't see why an unmapped page cache page or slab cache item should be evicted before a mapped page. Certainly the cost of rebuilding a dentry, compared to the gain from evicting it, is much higher than that of reestablishing a mapped page. On "subverting" to avoid memory duplication: the word subverting is overloaded, so let me try to reason a bit. First, let me explain the problem. Memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback modes). Now here is what we are trying to do:

1. A slab page will not be freed until the entire page is free (all slabs on it have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot depends on how frequently we are scanning the LRU list and when this page got added.
2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page cache when the system is under memory pressure?
In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages. Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? Does subverting it make sense in the duplicated scenario?

-- Three Cheers, Balbir
Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures
On 06/13/2010 03:26 PM, Nadav Har'El wrote: This patch includes a couple of utility functions for extracting the pointer operands of VMX instructions issued by L1 (a guest hypervisor), and for translating guest-given vmcs12 virtual addresses to guest-physical addresses.

+/*
+ * Decode the memory-address operand of a vmx instruction, according to the
+ * Intel spec.
+ */
+#define VMX_OPERAND_SCALING(vii)	((vii) & 3)
+#define VMX_OPERAND_ADDR_SIZE(vii)	(((vii) >> 7) & 7)
+#define VMX_OPERAND_IS_REG(vii)		((vii) & (1u << 10))
+#define VMX_OPERAND_SEG_REG(vii)	(((vii) >> 15) & 7)
+#define VMX_OPERAND_INDEX_REG(vii)	(((vii) >> 18) & 0xf)
+#define VMX_OPERAND_INDEX_INVALID(vii)	((vii) & (1u << 22))
+#define VMX_OPERAND_BASE_REG(vii)	(((vii) >> 23) & 0xf)
+#define VMX_OPERAND_BASE_INVALID(vii)	((vii) & (1u << 27))
+#define VMX_OPERAND_REG(vii)		(((vii) >> 3) & 0xf)
+#define VMX_OPERAND_REG2(vii)		(((vii) >> 28) & 0xf)
+static gva_t get_vmx_mem_address(struct kvm_vcpu *vcpu,
+				 unsigned long exit_qualification,
+				 u32 vmx_instruction_info)
+{
+	int scaling = VMX_OPERAND_SCALING(vmx_instruction_info);
+	int addr_size = VMX_OPERAND_ADDR_SIZE(vmx_instruction_info);
+	bool is_reg = VMX_OPERAND_IS_REG(vmx_instruction_info);
+	int seg_reg = VMX_OPERAND_SEG_REG(vmx_instruction_info);
+	int index_reg = VMX_OPERAND_SEG_REG(vmx_instruction_info);
+	bool index_is_valid = !VMX_OPERAND_INDEX_INVALID(vmx_instruction_info);
+	int base_reg = VMX_OPERAND_BASE_REG(vmx_instruction_info);
+	bool base_is_valid = !VMX_OPERAND_BASE_INVALID(vmx_instruction_info);

Since those defines are used just once, you can fold them into their uses. It doesn't add much to repeat the variable name.

+	gva_t addr;
+
+	if (is_reg) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	switch (addr_size) {
+	case 1: /* 32 bit.
high bits are undefined according to the spec: */
+		exit_qualification &= 0xffffffff;
+		break;
+	case 2: /* 64 bit */
+		break;
+	default: /* addr_size=0 means 16 bit */
+		return 0;
+	}
+
+	/* Addr = segment_base + offset */
+	/* offset = Base + [Index * Scale] + Displacement */
+	addr = vmx_get_segment_base(vcpu, seg_reg);
+	if (base_is_valid)
+		addr += kvm_register_read(vcpu, base_reg);
+	if (index_is_valid)
+		addr += kvm_register_read(vcpu, index_reg) << scaling;
+	addr += exit_qualification; /* holds the displacement */

Do we need a segment limit and access rights check?

+
+	return addr;
+}
+

-- error compiling committee.c: too many arguments to function
Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1": On 06/13/2010 03:25 PM, Nadav Har'El wrote: +#define VMCS12_REVISION 0x11e57ed0 Where did this number come from? It's not from real hardware, yes? Basically, we are presenting emulated VMX for the L1 guest, complete with its own VMCS structure. This structure needs to have some VMCS revision id, which should be an arbitrary number that we invent - it is not related to any revision id that any real hardware uses. If you look closely, you can see that the number I used is leetspeak for "Nested0" ;-) As you can see in the following patches, MSR_IA32_VMX_BASIC will return this arbitrary VMCS revision id, and VMPTRLD will verify that the VMCS region that L1 is trying to load contains this revision id.

-- Nadav Har'El | Monday, Jun 14 2010, 2 Tammuz 5770
n...@math.technion.ac.il | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | If you're looking for a helping hand, look first at the end of your arm.
Re: [PATCH 9/24] Implement VMCLEAR
On 06/13/2010 03:27 PM, Nadav Har'El wrote: This patch implements the VMCLEAR instruction.

+
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gpa_t guest_vmcs_addr, save_current_vmptr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
+		return 1;
+
+	save_current_vmptr = vmx->nested.current_vmptr;
+
+	vmx->nested.current_vmptr = guest_vmcs_addr;
+	if (!nested_map_current(vcpu))
+		return 1;

Haven't you leaked current_vmptr here? If I read the code correctly, you are implementing a sort of stack here, pushing the current vmptr into save_current_vmptr. Perhaps it's simpler to have an nvmxptr structure which holds a vmptr and a kmap'ed pointer to it, and pass that around to functions.

+	vmx->nested.current_l2_page->launch_state = 0;
+	nested_unmap_current(vcpu);
+
+	nested_free_current_vmcs(vcpu);
+
+	if (save_current_vmptr == guest_vmcs_addr)
+		vmx->nested.current_vmptr = -1ull;
+	else
+		vmx->nested.current_vmptr = save_current_vmptr;
+
+	skip_emulated_instruction(vcpu);
+	clear_rflags_cf_zf(vcpu);
+	return 1;
+}
+

-- error compiling committee.c: too many arguments to function
Re: [PATCH 10/24] Implement VMPTRLD
On 06/13/2010 03:27 PM, Nadav Har'El wrote: This patch implements the VMPTRLD instruction.

 static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
 {
     unsigned long rflags;
@@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
     return 1;
 }
 
+static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr)
+{
+	bool ret;
+	struct vmcs12 *vmcs12;
+	struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);

Blank line so I can catch my breath.

+	if (vmcs_page == NULL)
+		return 0;

Doesn't seem right.

+	vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
+	if (vmcs12->revision_id == VMCS12_REVISION)
+		ret = 1;
+	else {
+		set_rflags_to_vmx_fail_valid(vcpu);
+		ret = 0;
+	}
+	kunmap_atomic(vmcs12, KM_USER0);
+	kvm_release_page_dirty(vmcs_page);

Can release a clean page here. But what happened to those mapping helpers?

+	return ret;
+}
+
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gpa_t guest_vmcs_addr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr)) {
+		set_rflags_to_vmx_fail_invalid(vcpu);

Need to skip_emulated_instruction() in this case.

+		return 1;
+	}
+
+	if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
+		return 1;

Here too.

+
+	if (vmx->nested.current_vmptr != guest_vmcs_addr) {
+		vmx->nested.current_vmptr = guest_vmcs_addr;
+
+		if (nested_create_current_vmcs(vcpu)) {
+			printk(KERN_ERR "%s error could not allocate memory",
+			       __func__);

In general ftrace and the ENOMEM itself are sufficient documentation that something went wrong.

+			return -ENOMEM;
+		}
+	}
+
+	clear_rflags_cf_zf(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+

-- error compiling committee.c: too many arguments to function
Re: [PATCH 11/24] Implement VMPTRST
On 06/13/2010 03:28 PM, Nadav Har'El wrote: This patch implements the VMPTRST instruction.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:29.0 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:29.0 +0300
@@ -3301,7 +3301,7 @@ static int kvm_read_guest_virt_system(gv
     return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error);
 }
 
-static int kvm_write_guest_virt_system(gva_t addr, void *val,
+int kvm_write_guest_virt_system(gva_t addr, void *val,
 				unsigned int bytes, struct kvm_vcpu *vcpu, u32 *error)

write_guest_virt_system() is used by writes which need to ignore the cpl; for example, when a cpl 3 instruction loads a segment, the processor needs to update the accessed flag even though it is only accessible to cpl 0. That's not your case, you need the ordinary write_guest_virt(). Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it.

+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+	int r = 0;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t vmcs_gva;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	vmcs_gva = get_vmx_mem_address(vcpu, exit_qualification,
+				       vmx_instruction_info);
+	if (vmcs_gva == 0)
+		return 1;

What's wrong with gva 0? It's favoured by exploiters everywhere.

+	r = kvm_write_guest_virt_system(vmcs_gva,
+				(void *)&to_vmx(vcpu)->nested.current_vmptr,
+				sizeof(u64), vcpu, NULL);
+	if (r) {

Check against the X86EMUL return codes. You'll need to inject a page fault on failure.
+		printk(KERN_INFO "%s failed to write vmptr\n", __func__);
+		return 1;
+	}
+	clear_rflags_cf_zf(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+

-- error compiling committee.c: too many arguments to function
Re: [PATCH 12/24] Add VMCS fields to the vmcs12
On 06/13/2010 03:28 PM, Nadav Har'El wrote: In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the standard VMCS fields. These fields are encapsulated in a struct shadow_vmcs. Later patches will enable L1 to read and write these fields using VMREAD/VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing a real VMCS for running L2.

+/* shadow_vmcs is a structure used in nested VMX for holding a copy of all
+ * standard VMCS fields. It is used for emulating a VMCS for L1 (see vmcs12),
+ * and also for easier access to VMCS data (see l1_shadow_vmcs).
+ */
+struct __attribute__ ((__packed__)) shadow_vmcs {
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;

I think I counted an odd number of u32 fields, which means the ulong fields will be unaligned. Please add padding to preserve natural alignment. Have you considered placing often-used fields together to reduce cache misses? I'm not sure whether it's worth the effort.

/* @@ -139,6 +269,8 @@ struct __attribute__ ((__packed__)) vmcs
 	u32 revision_id;
 	u32 abort;
+	struct shadow_vmcs shadow_vmcs;
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */

That will make it difficult to expand shadow_vmcs in the future. I suggest putting it at the end, and reserving some space in the middle.

+#define OFFSET(x) offsetof(struct shadow_vmcs, x)
+
+static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {

Can leave the size unspecified (and use ARRAY_SIZE() later on). The encoding of the indexes is very sparse, so the table will be very large. No need to deal with that now though.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 6/24] Implement reading and writing of VMX MSRs
On 06/13/2010 03:25 PM, Nadav Har'El wrote: When the guest can use VMX instructions (when the "nested" module option is on), it should also be able to read and write VMX MSRs, e.g., to query about VMX capabilities. This patch adds this support.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.0 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.0 +0300
@@ -702,7 +702,11 @@ static u32 msrs_to_save[] = {
 #ifdef CONFIG_X86_64
 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
 #endif
-	MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+	MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
+	MSR_IA32_FEATURE_CONTROL, MSR_IA32_VMX_BASIC,
+	MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
+	MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
+	MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
 };

These are read only from the guest point of view, but we need write support from the host to allow for tuning the features exposed to the guest.

+/*
+ * If we allow our guest to use VMX instructions, we should also let it use
+ * VMX-specific MSRs.
+ */
+static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+	u64 vmx_msr = 0;
+	u32 vmx_msr_high, vmx_msr_low;
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		/*
+		 * This MSR reports some information about VMX support of the
+		 * processor. We should return information about the VMX we
+		 * emulate for the guest, and the VMCS structure we give it -
+		 * not about the VMX support of the underlying hardware.
+		 * However, some capabilities of the underlying hardware are
+		 * used directly by our emulation (e.g., the physical address
+		 * width), so these are copied from what the hardware reports.
+		 */
+		*pdata = VMCS12_REVISION |
+			(((u64)sizeof(struct vmcs12)) << 32);
+		rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+#define VMX_BASIC_64		0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE	0x003c000000000000LLU
+#define VMX_BASIC_INOUT		0x0040000000000000LLU

Please move those defines (with longer names) to msr-index.h.

+		*pdata |= vmx_msr &
+			(VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);
+		break;
+#define CORE2_PINBASED_CTLS_MUST_BE_ONE	0x00000016
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS	0x48d
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+		vmx_msr_low = CORE2_PINBASED_CTLS_MUST_BE_ONE;
+		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
+			PIN_BASED_EXT_INTR_MASK |
+			PIN_BASED_NMI_EXITING |
+			PIN_BASED_VIRTUAL_NMIS;

IIRC not all processors support PIN_BASED_VIRTUAL_NMIS. Can we support this feature on hosts that don't have it?

-- error compiling committee.c: too many arguments to function
[PATCH 0/5] *** SUBJECT HERE ***
From: Nicholas Bellinger n...@linux-iscsi.org

Greetings Gerd, Hannes and co,

This series adds initial support for a hw/scsi-bsg.c backstore for scsi-bus compatible HBA emulation in QEMU-KVM on Linux hosts supporting the BSG driver. This code is available from the scsi-bsg branch in the megasas/scsi friendly QEMU-KVM tree at:

http://git.kernel.org/?p=virt/kvm/nab/qemu-kvm.git;a=shortlog;h=refs/heads/scsi-bsg

Note that this initial code is being posted for review and to see how useful a BSG backstore would be for QEMU-KVM and Linux hosts. Note that in order for BSG I/O to function using vectored AIO, a kernel patch to linux/block/bsg.c:bsg_map_hdr() is currently required, running in a bit-paired user/kernel environment. The kernel patch in question is here:

http://marc.info/?l=linux-scsim=127649585524598w=2

The first three patches involve updating block code to support the BSG backstore for scsi-bsg. The fourth patch adds the minor changes to hw/scsi-bus.c and hw/scsi-disk.c in order to function with scsi-bsg. And the fifth patch adds the main hw/scsi-bsg.c logic necessary to run the new struct SCSIDeviceInfo, and for BSG AIO using struct iovec and paio_submit_len() to function.

So far this patch series has been tested with a Linux based x86_64 KVM host and guest using the hw/megasas.c 8708EM2 HBA Emulation with TCM_Loop virtual SAS Port LUNs.

Comments are welcome,

Signed-off-by: Nicholas A.
Bellinger n...@linux-iscsi.org

Nicholas Bellinger (5):
  [block]: Add top level BSG support
  [block]: Add BSG qemu_open() in block/raw.c:raw_open()
  [block]: Add paio_submit_len() non sector sized AIO
  [scsi]: Add BSG support for scsi-bus and scsi-disk
  [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo

 Makefile.objs         |    2 +-
 block.c               |   23 ++-
 block.h               |    1 +
 block/raw-posix-aio.h |    3 +
 block/raw-posix.c     |   17 ++-
 block/raw.c           |   20 ++
 block_int.h           |    5 +
 hw/scsi-bsg.c         |  588 +
 hw/scsi-bus.c         |    3 +-
 hw/scsi-disk.c        |    4 +
 posix-aio-compat.c    |   28 +++
 11 files changed, 687 insertions(+), 7 deletions(-)
 create mode 100644 hw/scsi-bsg.c
[PATCH 1/5] [block]: Add top level BSG support
From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds top level BSG support to QEMU-KVM block and adds the BDS_* prefixed defines for SG_IO and BSG. It adds the BDS_SCSI_GENERIC and BDS_BSG assignments in block/raw-posix.c:hdev_open() using S_ISCHR() and major(st.st_rdev) in order to determine when we are dealing with scsi-generic or scsi-bsg backstores. It also adds a special case BSG check in find_protocol() using the BSG major in order to avoid the strchr() for ':', as we expect filenames to contain /dev/bsg/H:C:T:L. This patch also adds a struct BlockDriverState->fd to save an opened file descriptor with format_name = 'raw' for use by scsi-bsg.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 block.c           |   23 ++-
 block.h           |    1 +
 block/raw-posix.c |   17 ++-
 block_int.h       |    5 +
 4 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/block.c b/block.c
index 88dbc00..3cd18ec 100644
--- a/block.c
+++ b/block.c
@@ -285,8 +285,12 @@ static BlockDriver *find_protocol(const char *filename)
 {
     BlockDriver *drv1;
     char protocol[128];
-    int len;
+    int len, bsg = 0;
     const char *p;
+#if defined(__linux__)
+    struct stat st;
+#endif
+
     /* TODO Drivers without bdrv_file_open must be specified explicitly */
@@ -296,7 +300,15 @@ static BlockDriver *find_protocol(const char *filename)
         return bdrv_find_format("file");
 #endif
     p = strchr(filename, ':');
-    if (!p) {
+#if defined(__linux__)
+    if (stat(filename, &st) < 0)
+        return NULL;
+    /* This is not yet defined in include/linux/major.h..
*/
+    if (S_ISCHR(st.st_mode) && major(st.st_rdev) == 254)
+        bsg = 1;
+#endif
+
+    if (!p || bsg) {
         drv1 = find_hdev_driver(filename);
         if (!drv1) {
             drv1 = bdrv_find_format("file");
@@ -1209,7 +1221,12 @@ int bdrv_is_read_only(BlockDriverState *bs)
 int bdrv_is_sg(BlockDriverState *bs)
 {
-    return bs->sg;
+    return bs->sg == BDS_SCSI_GENERIC;
+}
+
+int bdrv_is_bsg(BlockDriverState *bs)
+{
+    return bs->sg == BDS_BSG;
 }
 
 int bdrv_enable_write_cache(BlockDriverState *bs)
diff --git a/block.h b/block.h
index f87d24e..4faeedc 100644
--- a/block.h
+++ b/block.h
@@ -146,6 +146,7 @@ int bdrv_get_translation_hint(BlockDriverState *bs);
 int bdrv_is_removable(BlockDriverState *bs);
 int bdrv_is_read_only(BlockDriverState *bs);
 int bdrv_is_sg(BlockDriverState *bs);
+int bdrv_is_bsg(BlockDriverState *bs);
 int bdrv_enable_write_cache(BlockDriverState *bs);
 int bdrv_is_inserted(BlockDriverState *bs);
 int bdrv_media_changed(BlockDriverState *bs);
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 1515ca9..d349109 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -52,6 +52,7 @@
 #include <sys/ioctl.h>
 #include <linux/cdrom.h>
 #include <linux/fd.h>
+#include <linux/major.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
 #include <signal.h>
@@ -843,6 +844,9 @@ static int hdev_probe_device(const char *filename)
 static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
 {
     BDRVRawState *s = bs->opaque;
+#if defined(__linux__)
+    struct stat st;
+#endif
 
 #ifdef CONFIG_COCOA
     if (strstart(filename, "/dev/cdrom", NULL)) {
@@ -873,8 +877,17 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
     s->type = FTYPE_FILE;
 #if defined(__linux__)
-    if (strstart(filename, "/dev/sg", NULL)) {
-        bs->sg = 1;
+    if (stat(filename, &st) < 0) {
+        printf("stat() failed errno: %d\n", errno);
+        return -1;
+    }
+    if (S_ISCHR(st.st_mode)) {
+        if (major(st.st_rdev) == SCSI_GENERIC_MAJOR) {
+            bs->sg = BDS_SCSI_GENERIC;
+        } else if (major(st.st_rdev) == 254) {
+            /* This is not yet defined in
include/linux/major.h.. */
+            bs->sg = BDS_BSG;
+        }
     }
 #endif

diff --git a/block_int.h b/block_int.h
index 1a7240c..74bcb1a 100644
--- a/block_int.h
+++ b/block_int.h
@@ -40,6 +40,10 @@
 #define BLOCK_OPT_CLUSTER_SIZE "cluster_size"
 #define BLOCK_OPT_PREALLOC "preallocation"
 
+#define BDS_NONE         0
+#define BDS_SCSI_GENERIC 1
+#define BDS_BSG          2
+
 typedef struct AIOPool {
     void (*cancel)(BlockDriverAIOCB *acb);
     int aiocb_size;
@@ -141,6 +145,7 @@ struct BlockDriverState {
     int encrypted; /* if true, the media is encrypted */
     int valid_key; /* if true, a valid encryption key has been set */
     int sg;        /* if true, the device is a /dev/sg* */
+    int fd;        /* Used for BSG file descriptor */
     /* event callback when inserting/removing */
     void (*change_cb)(void *opaque);
     void *change_opaque;
-- 
1.5.6.5
[PATCH 2/5] [block]: Add BSG qemu_open() in block/raw.c:raw_open()
From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds a BSG specific qemu_open() call in block/raw.c:raw_open() that saves the opened file descriptor for BSG AIO into BlockDriverState->fd. It also adds the reverse close() call to block/raw.c:raw_close().

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 block/raw.c |   20 ++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/block/raw.c b/block/raw.c
index 4406b8c..7820c78 100644
--- a/block/raw.c
+++ b/block/raw.c
@@ -5,7 +5,25 @@
 static int raw_open(BlockDriverState *bs, int flags)
 {
+    int fd, ret;
+
     bs->sg = bs->file->sg;
+    /*
+     * scsi-generic and other raw types do not call qemu_open()
+     */
+    if (bs->sg != BDS_BSG)
+        return 0;
+    /*
+     * Obtain a file descriptor for the underlying BSG device for AIO w/ iovecs
+     */
+    fd = qemu_open(bs->filename, flags, 0644);
+    if (fd < 0) {
+        ret = -errno;
+        if (ret == -EROFS)
+            ret = -EACCES;
+        return ret;
+    }
+    bs->fd = fd;
     return 0;
 }
 
@@ -37,6 +55,8 @@ static BlockDriverAIOCB *raw_aio_writev(BlockDriverState *bs,
 static void raw_close(BlockDriverState *bs)
 {
+    if (bs->fd > 0)
+        close(bs->fd);
 }
 
 static void raw_flush(BlockDriverState *bs)
-- 
1.5.6.5
[PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO
From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds posix-aio-compat.c:paio_submit_len(), which is identical to paio_submit() except that it expects nb_len (a raw byte length) instead of nb_sectors (* 512), so that it can be used by BSG AIO for write()/read() of struct sg_io_v4.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 block/raw-posix-aio.h |    3 +++
 posix-aio-compat.c    |   28 ++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/block/raw-posix-aio.h b/block/raw-posix-aio.h
index dfc63b8..29df842 100644
--- a/block/raw-posix-aio.h
+++ b/block/raw-posix-aio.h
@@ -30,6 +30,9 @@ int paio_init(void);
 BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
     int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
     BlockDriverCompletionFunc *cb, void *opaque, int type);
+BlockDriverAIOCB *paio_submit_len(BlockDriverState *bs, int fd,
+    int64_t sector_num, QEMUIOVector *qiov, int nb_len,
+    BlockDriverCompletionFunc *cb, void *opaque, int type);
 BlockDriverAIOCB *paio_ioctl(BlockDriverState *bs, int fd,
     unsigned long int req, void *buf,
     BlockDriverCompletionFunc *cb, void *opaque);
diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 272e998..ac9276c 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -585,6 +585,34 @@ BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
     return &acb->common;
 }
 
+BlockDriverAIOCB *paio_submit_len(BlockDriverState *bs, int fd,
+    int64_t sector_num, QEMUIOVector *qiov, int nb_len,
+    BlockDriverCompletionFunc *cb, void *opaque, int type)
+{
+    struct qemu_paiocb *acb;
+
+    acb = qemu_aio_get(&raw_aio_pool, bs, cb, opaque);
+    if (!acb)
+        return NULL;
+    acb->aio_type = type;
+    acb->aio_fildes = fd;
+    acb->ev_signo = SIGUSR2;
+    acb->async_context_id = get_async_context_id();
+
+    if (qiov) {
+        acb->aio_iov = qiov->iov;
+        acb->aio_niov = qiov->niov;
+    }
+    acb->aio_nbytes = nb_len;
+    acb->aio_offset = 0;
+
+    acb->next = posix_aio_state->first_aio;
+    posix_aio_state->first_aio = acb;
+
+    qemu_paio_submit(acb);
+    return &acb->common;
+}
+
 BlockDriverAIOCB
*paio_ioctl(BlockDriverState *bs, int fd,
     unsigned long int req, void *buf,
     BlockDriverCompletionFunc *cb, void *opaque)
-- 
1.5.6.5
[PATCH 4/5] [scsi]: Add BSG support for scsi-bus and scsi-disk
From: Nicholas Bellinger n...@linux-iscsi.org

This patch updates hw/scsi-bus.c:scsi_bus_legacy_add_drive() to check for the scsi-bsg backstore. It also updates hw/scsi-disk.c:scsi_disk_initfn() to check for when bdrv_is_bsg() is present, where we need to fail for the fileio-backed scsi-disk code.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 hw/scsi-bus.c  |    3 ++-
 hw/scsi-disk.c |    4 ++++
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index 7d80405..b5f5fbb 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -90,7 +90,8 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, DriveInfo *dinfo, int unit)
     const char *driver;
     DeviceState *dev;
 
-    driver = bdrv_is_sg(dinfo->bdrv) ? "scsi-generic" : "scsi-disk";
+    driver = bdrv_is_bsg(dinfo->bdrv) ? "scsi-bsg" :
+             bdrv_is_sg(dinfo->bdrv) ? "scsi-generic" : "scsi-disk";
     dev = qdev_create(&bus->qbus, driver);
     qdev_prop_set_uint32(dev, "scsi-id", unit);
     qdev_prop_set_drive(dev, "drive", dinfo);
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index fb3c189..290a2e7 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -704,6 +704,10 @@ static int scsi_disk_initfn(SCSIDevice *dev)
         error_report("scsi-disk: unwanted /dev/sg*");
         return -1;
     }
+    if (bdrv_is_bsg(s->bs)) {
+        error_report("scsi-disk: unwanted BSG");
+        return -1;
+    }
 
     if (bdrv_get_type_hint(s->bs) == BDRV_TYPE_CDROM) {
         s->qdev.blocksize = 2048;
-- 
1.5.6.5
[PATCH 5/5] [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo
From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds initial support for using the Linux BSG interface with write/read vectored AIO as a QEMU backstore (SCSIDeviceInfo) with hw/scsi-bus.c compatible HBA emulation. So far it has been tested with an x86_64 host and guest using hw/megasas.c and TCM_Loop LLD Port LUNs.

Because this patch uses struct iovec for struct sg_io_v4->d[out,in]_xferp payloads, it currently requires a patch to linux/block/bsg.c:bsg_map_hdr() in order to set up the user -> kernel iovecs. It also will only currently work with paired user/kernel pointer sizes (eg: 64-bit user / 64-bit kernel) because of different pointer sizes in struct iovec->iov_base.

There are also two FIXMEs in hw/scsi-bsg.c:bsg_generic_initfn() related to extraction of SCSI LUN and device type values using BSG, as required by QEMU-KVM.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 Makefile.objs |    2 +-
 hw/scsi-bsg.c |  588 +
 2 files changed, 589 insertions(+), 1 deletions(-)
 create mode 100644 hw/scsi-bsg.c

diff --git a/Makefile.objs b/Makefile.objs
index 188d617..c4fcb72 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -197,7 +197,7 @@ hw-obj-$(CONFIG_IDE_CMD646) += ide/cmd646.o
 hw-obj-$(CONFIG_IDE_MACIO) += ide/macio.o
 
 # SCSI layer
-hw-obj-y += scsi-disk.o scsi-generic.o
+hw-obj-y += scsi-disk.o scsi-generic.o scsi-bsg.o
 hw-obj-y += lsi53c895a.o megasas.o
 
 hw-obj-$(CONFIG_ESP) += esp.o
diff --git a/hw/scsi-bsg.c b/hw/scsi-bsg.c
new file mode 100644
index 000..fc76b76
--- /dev/null
+++ b/hw/scsi-bsg.c
@@ -0,0 +1,588 @@
+/*
+ * block layer implementation of the sg v4 interface for Linux hosts
+ *
+ * Copyright (c) 2010 Rising Tide Systems
+ * Written by Nicholas A. Bellinger n...@linux-iscsi.org
+ *
+ * Based on hw/scsi-generic code by Laurent Vivier, Paul Brook, and Fabrice Bellard
+ *
+ * This code is licensed under the LGPL.
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block.h"
+#include "scsi.h"
+#include "dma.h"
+#include "block/raw-posix-aio.h"
+
+#ifdef __linux__
+
+#define DEBUG_BSG
+#undef DEBUG_BSG_IO
+#undef DEBUG_BSG_MAP
+
+#ifdef DEBUG_BSG
+#define DPRINTF(fmt, ...) \
+    do { printf("scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) do {} while(0)
+#endif
+
+#define BADF(fmt, ...) \
+    do { fprintf(stderr, "scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/epoll.h>
+#include <unistd.h>
+#include <scsi/sg.h>
+#include <linux/bsg.h>
+#include "scsi-defs.h"
+
+#define SCSI_SENSE_BUF_SIZE 96
+
+#define SG_ERR_DRIVER_TIMEOUT 0x06
+#define SG_ERR_DRIVER_SENSE 0x08
+
+#ifndef MAX_UINT
+#define MAX_UINT ((unsigned int)-1)
+#endif
+
+typedef struct SCSIBSGState SCSIBSGState;
+
+typedef struct SCSIBSGReq {
+    SCSIRequest req;
+    uint8_t *buf;
+    int buflen;
+    QEMUIOVector iov;
+    QEMUIOVector aio_iov;
+    struct sg_io_v4 bsg_hdr;
+} SCSIBSGReq;
+
+struct SCSIBSGState {
+    SCSIDevice qdev;
+    BlockDriverState *bs;
+    int lun;
+    int driver_status;
+    uint8_t sensebuf[SCSI_SENSE_BUF_SIZE];
+    uint8_t senselen;
+};
+
+static int bsg_read(int fd, void *p_read, int to_read)
+{
+    int err;
+
+    while (to_read > 0) {
+        err = read(fd, p_read, to_read);
+        if (err >= 0) {
+            to_read -= err;
+            p_read += err;
+        } else if (errno == EINTR)
+            continue;
+        else {
+            printf("bsg device %d read failed, errno: %d\n",
+                    fd, errno);
+            return errno;
+        }
+    }
+    return 0;
+}
+
+static SCSIBSGReq *bsg_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun)
+{
+    SCSIRequest *req;
+    SCSIBSGReq *r;
+
+    req = scsi_req_alloc(sizeof(SCSIBSGReq), d, tag, lun);
+    r = DO_UPCAST(SCSIBSGReq, req, req);
+    qemu_iovec_init(&r->iov, 1);
+    qemu_iovec_init(&r->aio_iov, 1);
+    return r;
+}
+
+static void bsg_remove_request(SCSIBSGReq *r)
+{
+    qemu_free(r->buf);
+    qemu_iovec_destroy(&r->iov);
+    qemu_iovec_destroy(&r->aio_iov);
+    scsi_req_free(&r->req);
+}
+
+static void bsg_command_complete(void *opaque,
int ret)
+{
+    SCSIBSGReq *r = (SCSIBSGReq *)opaque;
+    SCSIBSGState *s = DO_UPCAST(SCSIBSGState, qdev, r->req.dev);
+
+    s->driver_status = r->bsg_hdr.driver_status;
+    if (s->driver_status)
+        s->senselen = SCSI_SENSE_BUF_SIZE;
+
+    if (ret != 0) {
+        scsi_req_print(&r->req);
+        fprintf(stderr, "%s: ret %d (%s)\n", __FUNCTION__,
+                ret, strerror(-ret));
+        s->senselen = scsi_build_sense(SENSE_CODE(INVALID_FIELD),
+                                       s->sensebuf, SCSI_SENSE_BUF_SIZE, 0);
+        s->driver_status = SG_ERR_DRIVER_SENSE;
+        r->req.status = CHECK_CONDITION;
+    } else {
+        if (s->driver_status & SG_ERR_DRIVER_TIMEOUT) {
+            scsi_req_print(&r->req);
+            fprintf(stderr, "%s: timeout\n", __FUNCTION__);
+            r->req.status = BUSY << 1;
+        } else if
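The retry loop in bsg_read() above is a common pattern: keep issuing read() until the requested byte count is satisfied, restarting on EINTR and accumulating short reads. A self-contained userspace sketch of the same structure (read_full is a hypothetical name, not part of the patch; unlike bsg_read() it returns -1 rather than errno on failure):

```c
#include <errno.h>
#include <unistd.h>

/* Read exactly to_read bytes from fd, retrying short reads and EINTR.
 * Returns 0 on success, -1 on error or premature EOF.
 * Hypothetical helper mirroring the loop in bsg_read() above. */
static int read_full(int fd, void *buf, size_t to_read)
{
    char *p = buf;

    while (to_read > 0) {
        ssize_t n = read(fd, p, to_read);
        if (n > 0) {
            to_read -= (size_t)n;   /* short read: loop for the rest */
            p += n;
        } else if (n == 0) {
            return -1;              /* EOF before all bytes arrived */
        } else if (errno != EINTR) {
            return -1;              /* real error; EINTR just retries */
        }
    }
    return 0;
}
```

The same loop also works for write() with the obvious substitutions, which is how the companion bsg_write() path would handle partial writes of struct sg_io_v4.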
[PATCH 0/5] [QEMU-KVM]: Add BSG backstore using struct sg_io_v4
From: Nicholas Bellinger n...@linux-iscsi.org

Quick resend with project subject for cover letter..

Greetings Gerd, Hannes and co,

This series adds initial support for a hw/scsi-bsg.c backstore for scsi-bus compatible HBA emulation in QEMU-KVM on Linux hosts supporting the BSG driver. This code is available from the scsi-bsg branch in the megasas/scsi friendly QEMU-KVM tree at:

http://git.kernel.org/?p=virt/kvm/nab/qemu-kvm.git;a=shortlog;h=refs/heads/scsi-bsg

Note that this initial code is being posted for review and to see how useful a BSG backstore would be for QEMU-KVM and Linux hosts. Note that in order for BSG I/O to function using vectored AIO, a kernel patch to linux/block/bsg.c:bsg_map_hdr() is currently required when running on a bit-paired user/kernel environment. The kernel patch in question is here:

http://marc.info/?l=linux-scsim=127649585524598w=2

The first three patches involve updating block code to support the BSG backstore for scsi-bsg. The fourth patch adds the minor changes to hw/scsi-bus.c and hw/scsi-disk.c in order to function with scsi-bsg. And the fifth patch adds the main hw/scsi-bsg.c logic necessary to run the new struct SCSIDeviceInfo and for BSG AIO using struct iovec and paio_submit_len() to function.

So far this patch series has been tested with a Linux based x86_64 KVM host and guest using the hw/megasas.c 8708EM2 HBA Emulation with TCM_Loop virtual SAS Port LUNs.

Comments are welcome,

Signed-off-by: Nicholas A.
Bellinger n...@linux-iscsi.org

Nicholas Bellinger (5):
  [block]: Add top level BSG support
  [block]: Add BSG qemu_open() in block/raw.c:raw_open()
  [block]: Add paio_submit_len() non sector sized AIO
  [scsi]: Add BSG support for scsi-bus and scsi-disk
  [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo

 Makefile.objs         |    2 +-
 block.c               |   23 ++-
 block.h               |    1 +
 block/raw-posix-aio.h |    3 +
 block/raw-posix.c     |   17 ++-
 block/raw.c           |   20 ++
 block_int.h           |    5 +
 hw/scsi-bsg.c         |  588 +
 hw/scsi-bus.c         |    3 +-
 hw/scsi-disk.c        |    4 +
 posix-aio-compat.c    |   28 +++
 11 files changed, 687 insertions(+), 7 deletions(-)
 create mode 100644 hw/scsi-bsg.c

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12
On 06/13/2010 03:29 PM, Nadav Har'El wrote:

This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the
information in shadow_vmcs that L1 built for L2 (vmcs12), and that in the
VMCS that we built for L1 (vmcs01).

VMREAD/WRITE can only access one VMCS at a time (the current VMCS), which
makes it difficult for us to read from vmcs01 while writing to vmcs12. This
is why we first make a copy of vmcs01 in memory (l1_shadow_vmcs) and then
read that memory copy while writing to vmcs12.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
+++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
@@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
 	return flexpriority_enabled;
 }

+static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_vmx_tpr_shadow() &&
+	       get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
+	       CPU_BASED_TPR_SHADOW;

Operator precedence is with you, but the line width limit is not. Use
parentheses to improve readability.

@@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
 	preempt_enable();
 }

+int load_vmcs_host_state(struct shadow_vmcs *src)
+{
+	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
+	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
+	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
+	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
+	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
+	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
+	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);

Why do we need to go through a shadow_vmcs for host fields? Instead of
cloning a vmcs, you can call a common init routine to initialize the host
fields.

+
+	vmcs_write64(TSC_OFFSET, src->tsc_offset);

Don't you need to adjust for our TSC_OFFSET?

+/* prepare_vmcs_02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest.
L1 has a vmcs for L2 (vmcs12), and this function merges it
+ * with L0's wishes for its guest (vmcs01), so we can run the L2 guest in a
+ * way that will both be appropriate to L1's requests, and our needs.
+ */
+int prepare_vmcs_02(struct kvm_vcpu *vcpu,
+		    struct shadow_vmcs *vmcs12, struct shadow_vmcs *vmcs01)
+{
+	u32 exec_control;
+
+	load_vmcs_common(vmcs12);
+
+	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);

Not sure about this. We don't use the vmcs link pointer, it's better to
keep it at its default of -1ull.

+	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
+	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);

I guess merging the io bitmaps doesn't make much sense, at least at this
stage.

+	if (cpu_has_vmx_msr_bitmap())
+		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);

However, merging the msr bitmaps is critical. Perhaps you do it in a later
patch.

+
+	if (vmcs12->vm_entry_msr_load_count > 0 ||
+	    vmcs12->vm_exit_msr_load_count > 0 ||
+	    vmcs12->vm_exit_msr_store_count > 0) {
+		printk(KERN_WARNING
+		       "%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);

Unfortunate, since kvm has started to use this feature. For all unsupported
mandatory features, we need reporting that is always enabled (i.e. no
dprintk()), but to avoid flooding dmesg, use printk_ratelimit() or report
just once. Also, it's better to kill the guest (KVM_REQ_TRIPLE_FAULT, or a
VM instruction error) than to let it continue incorrectly.

+	}
+
+	if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
+		if (!page)
+			return 1;

?

+		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);

Hm. The host can move this page around. If it happens not to be mapped
into the guest, we won't get any notification. So we need to keep a
reference to this page, or else force an exit from the mmu notifier
callbacks if it is removed by the host.
+	}
+
+	if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->apic_access_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}

Ditto.

+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
+		     (vmcs01->pin_based_vm_exec_control |
+		      vmcs12->pin_based_vm_exec_control));

We don't really want the guest's pin-based controls to influence
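The pin-based controls write at the end of the hunk is a plain bitwise OR of the two hypervisors' settings, so an exit condition requested by either L0 or L1 stays armed while L2 runs. A minimal sketch of that merge (function and parameter names are illustrative, not the kernel's):

```c
#include <stdint.h>

/* Merge L0's (vmcs01) and L1's (vmcs12) pin-based VM-execution controls:
 * the L2 guest runs with the union of both hypervisors' intercepts.
 * Illustrative stand-in for the vmcs_write32() call in the patch. */
static uint32_t merge_pin_based_controls(uint32_t vmcs01_ctl,
                                         uint32_t vmcs12_ctl)
{
    return vmcs01_ctl | vmcs12_ctl;
}
```

Avi's objection is precisely about this union: some of L1's pin-based bits should not leak into the controls L0 actually programs, so a real merge has to mask before OR-ing.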
Re: [PATCH 0/3]: Fixes to IRQ routing
On Thu, Jun 10, 2010 at 04:44:04PM -0400, Chris Lalancette wrote:

As we've discussed previously, here is a series of patches to fix some of
the IRQ routing issues we have in KVM. With this series in place I was
able to successfully kdump a RHEL-5 64-bit, and RHEL-6 32- and 64-bit
guest on CPUs other than the BSP. RHEL-5 32-bit kdump still does not work;
it gets stuck on "Checking 'hlt' instruction". However, it does that both
before and after this series, so there is something else going on there
that I still have to debug. I also need to change the kvm_migrate_pit_timer
function to migrate the timer over to the last CPU that handled the timer
interrupt, on the theory that that particular CPU is likely to handle the
timer interrupt again in the near future.

Changes since RFC:
- Changed ps->inject_lock from raw_spinlock_t to spinlock_t
- Fixed up some formatting issues
- Changed to have one PIT workqueue per-guest
- Remember to cancel_work_sync when destroying the PIT

Chris Lalancette

Looks good to me.

--
			Gleb.
Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME
On 06/13/2010 03:30 PM, Nadav Har'El wrote:

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
+++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
@@ -272,6 +272,9 @@ struct __attribute__ ((__packed__)) vmcs
 	struct shadow_vmcs shadow_vmcs;

 	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;

Not sure cpu should be here. It certainly won't survive live migration.
Perhaps in struct vmcs_list (which should be renamed, perhaps struct
cached_vmcs).

+	int launched;
 };

What's the difference between this and launch_state?

 struct vmcs_list {
@@ -297,6 +300,24 @@ struct nested_vmx {
 	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
 	struct list_head l2_vmcs_list; /* a vmcs_list */
 	int l2_vmcs_num;
+
+	/* Are we running a nested guest now */
+	bool nested_mode;
+	/* Level 1 state for switching to level 2 and back */
+	struct {
+		u64 efer;
+		unsigned long cr3;
+		unsigned long cr4;
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		int cpu;
+		int launched;
+	} l1_state;

This state needs save/restore support (as well as the current vmptr and
vmxon state).

+	/* Level 1 shadow vmcs for switching to level 2 and back */
+	struct shadow_vmcs *l1_shadow_vmcs;

Again, not really happy about shadowing the non-nested vmcs.
+	/* Level 1 vmcs loaded into the processor */
+	struct vmcs *l1_vmcs;
 };

 enum vmcs_field_type {
@@ -1407,6 +1428,19 @@ static void vmx_vcpu_load(struct kvm_vcp
 			new_offset = vmcs_read64(TSC_OFFSET) + delta;
 			vmcs_write64(TSC_OFFSET, new_offset);
 		}
+
+		if (vmx->nested.l1_shadow_vmcs != NULL) {
+			struct shadow_vmcs *l1svmcs =
+				vmx->nested.l1_shadow_vmcs;
+			l1svmcs->host_tr_base = vmcs_readl(HOST_TR_BASE);
+			l1svmcs->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+			l1svmcs->host_ia32_sysenter_esp =
+				vmcs_readl(HOST_IA32_SYSENTER_ESP);

These are all static (at least on a single cpu). No need to read them from
a vmcs.

+			if (tsc_this < vcpu->arch.host_tsc)
+				l1svmcs->tsc_offset = vmcs_read64(TSC_OFFSET);
+			if (vmx->nested.nested_mode)
+				load_vmcs_host_state(l1svmcs);
+		}
 	}
 }

@@ -4348,6 +4392,42 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }

+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!nested_map_current(vcpu))
+		return 1;

Better error handling needed, perhaps triple fault.

+	if (to_vmx(vcpu)->nested.current_l2_page->launch_state == launch) {
+		/* Must use VMLAUNCH for the first time, VMRESUME later */
+		set_rflags_to_vmx_fail_valid(vcpu);
+		nested_unmap_current(vcpu);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	nested_unmap_current(vcpu);
+
+	skip_emulated_instruction(vcpu);
+
+	nested_vmx_run(vcpu);
+	return 1;
+}

@@ -4958,7 +5038,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);

-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (!vmx->nested.nested_mode &&
+	    unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {

Too much indent. The unlikely() looks like the first statement of the
block.

I think it isn't enough to check for nested mode. If the guest hasn't
enabled virtual NMIs, then the nested guest should behave exactly like the
guest.
+static int nested_vmx_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->nested.nested_mode = 1;

true

+	sync_cached_regs_to_vmcs(vcpu);
+	save_vmcs(vmx->nested.l1_shadow_vmcs);
+
+	vmx->nested.l1_state.efer = vcpu->arch.efer;

Not sure why you need to save efer. Ordinarily, vmx reconstructs it from
the guest efer and the host size exit control, you can do the same.

+	if (!enable_ept)
+		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;

Ditto, isn't that HOST_CR3?

+	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;

Ditto.

+
+	if (!nested_map_current(vcpu)) {
Re: [PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO
On Mon, Jun 14, 2010 at 02:44:31AM -0700, Nicholas A. Bellinger wrote:

From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds posix-aio-compat.c:paio_submit_len(), which is identical
to paio_submit() except that it expects nb_len instead of nb_sectors
(* 512), so that it can be used by BSG AIO for write()/read() of struct
sg_io_v4.

Just remove the nb_sectors argument, we already get the length passed in
the iovec.
Re: [PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO
On Mon, 2010-06-14 at 07:42 -0400, Christoph Hellwig wrote:

On Mon, Jun 14, 2010 at 02:44:31AM -0700, Nicholas A. Bellinger wrote:

From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds posix-aio-compat.c:paio_submit_len(), which is identical
to paio_submit() except that it expects nb_len instead of nb_sectors
(* 512), so that it can be used by BSG AIO for write()/read() of struct
sg_io_v4.

Just remove the nb_sectors argument, we already get the length passed in
the iovec.

Mmmm, should this stay in a separate paio_submit_len() function..? Or
would it be preferred to calculate struct qemu_paiocb->aio_nbytes from the
passed iov[i].iov_len for the BlockDriverState->sg == BDS_BSG case..?

Best,

--nab
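Christoph's suggestion amounts to deriving the byte count from the iovec itself rather than passing it alongside. A sketch of that calculation (iov_total_len is a hypothetical helper; QEMU's QEMUIOVector already caches this sum in its `size` field):

```c
#include <stddef.h>
#include <sys/uio.h>

/* Sum the lengths of all iovec segments: the total transfer size that
 * an explicit nb_len argument would otherwise duplicate. */
static size_t iov_total_len(const struct iovec *iov, int iovcnt)
{
    size_t total = 0;
    int i;

    for (i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;
    return total;
}
```

With this, the BSG path could hand the AIO layer only the iovec and let it compute aio_nbytes, removing the sector-size assumption without a second submit function.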
Re: [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit
On 06/13/2010 03:32 PM, Nadav Har'El wrote:

This patch contains the logic of whether an L2 exit should be handled by
L0 and then L2 should be resumed, or whether L1 should be run to handle
this exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0, we
check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any bit
which changed; if it did, we exit to L1. But if it didn't, it means that
it is we (L0) that wished to trap this event, so we handle it ourselves.

The next two patches add additional logic of what to do when an interrupt
or exception is injected: does L0 need to do it, should we exit to L1 to
do it, or should we resume L2 and keep the exception to be injected later.

We keep a new flag, nested_run_pending, which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must*
run L2 next, not L1. This is necessary in several situations where, had L1
run on bare metal, it would not have expected to be resumed at this stage.
One example is when L1 did a VMLAUNCH of L2 and therefore expects L2 to be
run. Another example is when L2 exits on an #NM exception that L0 asked
for (because of lazy FPU loading), and L0 must deal with the exception and
resume L2, which was in the middle of an instruction, and not resume L1,
which does not expect to see an exit from L2 at this point.
nested_run_pending is especially intended to avoid switching to L1 in the
injection decision-point described above.

@@ -3819,6 +3841,8 @@ static int handle_exception(struct kvm_v
 	if (is_no_device(intr_info)) {
 		vmx_fpu_activate(vcpu);
+		if (vmx->nested.nested_mode)
+			vmx->nested.nested_run_pending = 1;
 		return 1;
 	}

Isn't this true for many other exceptions? #UD which we emulate (but the
guest doesn't trap), page faults which we handle completely...
+
+/* Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+				       struct shadow_vmcs *l2svmcs)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (l2svmcs->cr0_guest_host_mask &
+			    (val ^ l2svmcs->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if (l2svmcs->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_LOAD_EXITING)
+				return 1;
+			break;
+		case 4:
+			if (l2svmcs->cr4_guest_host_mask &
+			    (l2svmcs->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (l2svmcs->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_LOAD_EXITING)
+				return 1;

Should check TPR threshold here too if enabled.

+	case 3: /* lmsw */
+		if (l2svmcs->cr0_guest_host_mask &
+		    (val ^ l2svmcs->cr0_read_shadow))
+			return 1;

Need to mask off bit 0 (cr0.pe) of val, since lmsw can't clear it.

+		break;
+	}
+	return 0;
+}
+
+/* Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0. Only call this when in nested_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu, bool afterexit)
+{
+	u32 exit_code = vmcs_read32(VM_EXIT_REASON);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct shadow_vmcs *l2svmcs;
+	int r = 0;
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	if (afterexit) {
+		/* There are some cases where we should let L1 handle certain
+		 * events when these are injected (afterexit==0) but we should
+		 * handle them in L0 on an exit (afterexit==1).
+		 */
+		switch (exit_code) {
+		case EXIT_REASON_EXTERNAL_INTERRUPT:
+			return 0;
+
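The bit-twiddling at the top of nested_vmx_exit_handled_cr() follows the CR-access exit qualification layout from the Intel SDM: bits 3:0 hold the CR number, bits 5:4 the access type (0 = MOV to CR, 1 = MOV from CR, 2 = CLTS, 3 = LMSW), and bits 11:8 the general-purpose register. A standalone sketch of that decode (struct and names invented for illustration):

```c
#include <stdint.h>

struct cr_access {
    int cr;          /* control register number (0, 3, 4, 8, ...) */
    int access_type; /* 0 = mov to cr, 1 = mov from cr, 2 = clts, 3 = lmsw */
    int reg;         /* GPR index, valid for mov accesses */
};

/* Decode a CR-access VM-exit qualification (Intel SDM field layout). */
static struct cr_access decode_cr_exit(uint64_t qual)
{
    struct cr_access a;

    a.cr = qual & 15;
    a.access_type = (qual >> 4) & 3;
    a.reg = (qual >> 8) & 15;
    return a;
}
```

The intercept decision then compares the written value against L1's read shadow under the guest/host mask, exactly as the patch does for CR0 and CR4.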
Re: [PATCH 20/24] Correct handling of interrupt injection
On 06/13/2010 03:32 PM, Nadav Har'El wrote:

When KVM wants to inject an interrupt, the guest should think a real
interrupt has happened. Normally (in the non-nested case) this means
checking that the guest doesn't block interrupts (and if it does, inject
when it doesn't - using the interrupt window VMX mechanism), and setting
up the appropriate VMCS fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, the one we were injecting; only when L1 asked not to be
notified of interrupts should we inject the interrupt directly to the
running guest L2 (i.e., the normal code path).

However, properly doing what is described above requires invasive changes
to the flow of the existing code, which we elected not to do at this
stage. Instead we do something more simplistic and less efficient: we
modify vmx_interrupt_allowed(), which kvm calls to see if it can inject
the interrupt now, to exit from L2 to L1 before continuing the normal
code. The normal kvm code then notices that L1 is blocking interrupts, and
sets the interrupt window to inject the interrupt later to L1. Shortly
after, L1 gets the interrupt while it is itself running, not as an exit
from L2. The cost is an extra L1 exit (the interrupt window).

That's a little sad.

 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3718,6 +3738,13 @@ static int nested_vmx_vmexit(struct kvm_
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (to_vmx(vcpu)->nested.nested_mode && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu, true);
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+

What exit is reported here?
--
error compiling committee.c: too many arguments to function
Re: [PATCH 0/24] Nested VMX, v5
On 06/13/2010 03:22 PM, Nadav Har'El wrote:

Hi Avi,

This is a followup of our nested VMX patches that Orit Wasserman posted in
December. We've addressed most of the comments and concerns that you and
others on the mailing list had with the previous patch set. We hope you'll
find these patches easier to understand, and suitable for applying to KVM.

The following 24 patches implement nested VMX support. The patches enable
a guest to use the VMX APIs in order to run its own nested guests. I.e.,
it allows running hypervisors (that use VMX) under KVM. We describe the
theory behind this work, our implementation, and its performance
characteristics, in IBM Research report H-0282, "The Turtles Project:
Design and Implementation of Nested Virtualization", available at:

	http://bit.ly/a0o9te

The current patches support running Linux under a nested KVM using shadow
page tables (with bypass_guest_pf disabled). They support multiple nested
hypervisors, which can run multiple guests. Only 64-bit nested hypervisors
are supported. SMP is supported. Additional patches for running Windows
under nested KVM, and Linux under nested VMware server, and support for
nested EPT, are currently running in the lab, and will be sent as
follow-on patchsets.

These patches were written by:
	Abel Gordon, abelg at il.ibm.com
	Nadav Har'El, nyh at il.ibm.com
	Orit Wasserman, oritw at il.ibm.com
	Ben-Ami Yassour, benami at il.ibm.com
	Muli Ben-Yehuda, muli at il.ibm.com

With contributions by:
	Anthony Liguori, aliguori at us.ibm.com
	Mike Day, mdday at us.ibm.com

This work was inspired by the nested SVM support by Alexander Graf and
Joerg Roedel.

Changes since v4:
* Rebased to the current KVM tree.
* Support for lazy FPU loading.
* Implemented about 90 requests and suggestions made on the mailing list
  regarding the previous version of this patch set.
* Split the changes into many more, and better documented, patches.

Overall, very nice. The finer split and better documentation really help
reviewing, thanks.
Let's try to get this merged quickly.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
On 06/14/2010 11:49 AM, Nadav Har'El wrote:

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce
vmcs12: a VMCS structure for L1":

On 06/13/2010 03:25 PM, Nadav Har'El wrote:

+#define VMCS12_REVISION 0x11e57ed0

Where did this number come from? It's not from real hardware, yes?

Basically, we are presenting emulated VMX for the L1 guest, complete with
its own VMCS structure. This structure needs to have some VMCS revision
id, which should be an arbitrary number that we invent - it is not related
to any revision id that any real hardware uses. If you look closely, you
can see that the number I used is leetspeak for "Nested0" ;-)

Will have to brush up on my leetspeak, I see.

As you can see in the following patches, MSR_IA32_VMX_BASIC will return
this arbitrary VMCS revision id, and VMPTRLD will verify that the VMCS
region that L1 is trying to load contains this revision id.

Ok good, I was worried this was a real hardware ID.

--
error compiling committee.c: too many arguments to function
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 11:48 AM, Balbir Singh wrote:

In this case the order is as follows

1. First we pick free pages, if any
2. If we don't have free pages, we go after unmapped page cache and slab
   cache
3. If that fails as well, we go after regular memory

In the scenario that you describe, we'll not be able to easily free up the
frequently referenced page from /etc/*. The code will move on to step 3
and do its regular reclaim.

Still it seems to me you are subverting the normal order of reclaim. I
don't see why an unmapped page cache or slab cache item should be evicted
before a mapped page. Certainly the cost of rebuilding a dentry compared
to the gain from evicting it, is much higher than that of reestablishing a
mapped page.

Subverting to avoid memory duplication; the word "subverting" is
overloaded.

Right, should have used a different one.

Let me try to reason a bit. First let me explain the problem.

Memory is a precious resource in a consolidated environment. We don't want
to waste memory via page cache duplication (cache=writethrough and
cache=writeback mode).

Now here is what we are trying to do:

1. A slab page will not be freed until the entire page is free (all slabs
   have been kfree'd, so to speak). Normal reclaim will definitely free
   this page, but a lot of it depends on how frequently we are scanning
   the LRU list and when this page got added.
2. In the case of page cache (specifically unmapped page cache), there is
   duplication already, so why not go after unmapped page caches when the
   system is under memory pressure?

In the case of 1, we don't force a dentry to be freed, but rather a freed
page in the slab cache to be reclaimed ahead of forcing reclaim of mapped
pages.

Sounds like this should be done unconditionally, then. An empty slab page
is worth less than an unmapped pagecache page at all times, no?

Does the problem statement make sense? If so, do you agree with 1 and 2?
Is there a major concern about subverting regular reclaim?
Does subverting it make sense in the duplicated scenario?

In the case of 2, how do you know there is duplication? You know the guest
caches the page, but you have no information about the host. Since the
page is cached in the guest, the host doesn't see it referenced, and is
likely to drop it. If there is no duplication, then you may have dropped a
recently-used page and will likely cause a major fault soon.

--
error compiling committee.c: too many arguments to function
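The three-step ordering Balbir describes can be summarized as a priority function: free pages first, then unmapped page cache and reclaimable slab, and mapped memory only as a last resort. A sketch of that policy, with all names invented for illustration (the real patch hooks into the kernel's reclaim path, not a standalone function like this):

```c
/* Reclaim sources in the order the patch prefers them. */
enum reclaim_source {
    RECLAIM_FREE,     /* step 1: free pages, no work needed        */
    RECLAIM_UNMAPPED, /* step 2: unmapped page cache, slab cache   */
    RECLAIM_MAPPED    /* step 3: regular (mapped) memory           */
};

/* Pick where to reclaim from, given current counts of free and
 * unmapped-cache pages. Illustrative policy only. */
static enum reclaim_source pick_reclaim_source(long nr_free, long nr_unmapped)
{
    if (nr_free > 0)
        return RECLAIM_FREE;
    if (nr_unmapped > 0)
        return RECLAIM_UNMAPPED;
    return RECLAIM_MAPPED;
}
```

Avi's objection maps directly onto step 2: without knowing whether the host also caches the page, preferring RECLAIM_UNMAPPED may evict a page that is not actually duplicated.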
Re: [PATCH 20/24] Correct handling of interrupt injection
On 06/14/2010 03:29 PM, Avi Kivity wrote:

On 06/13/2010 03:32 PM, Nadav Har'El wrote:

When KVM wants to inject an interrupt, the guest should think a real
interrupt has happened. Normally (in the non-nested case) this means
checking that the guest doesn't block interrupts (and if it does, inject
when it doesn't - using the interrupt window VMX mechanism), and setting
up the appropriate VMCS fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, the one we were injecting; only when L1 asked not to be
notified of interrupts should we inject the interrupt directly to the
running guest L2 (i.e., the normal code path).

However, properly doing what is described above requires invasive changes
to the flow of the existing code, which we elected not to do at this
stage. Instead we do something more simplistic and less efficient: we
modify vmx_interrupt_allowed(), which kvm calls to see if it can inject
the interrupt now, to exit from L2 to L1 before continuing the normal
code. The normal kvm code then notices that L1 is blocking interrupts, and
sets the interrupt window to inject the interrupt later to L1. Shortly
after, L1 gets the interrupt while it is itself running, not as an exit
from L2. The cost is an extra L1 exit (the interrupt window).

That's a little sad.

It can also be broken if the guest chooses to keep interrupts disabled
during exits, and instead asks vmx to ack interrupts. The guest can then
vmread the vector number and dispatch the interrupt itself.

--
error compiling committee.c: too many arguments to function
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 15:40:28]:

On 06/14/2010 11:48 AM, Balbir Singh wrote:

In this case the order is as follows

1. First we pick free pages, if any
2. If we don't have free pages, we go after unmapped page cache and slab
   cache
3. If that fails as well, we go after regular memory

In the scenario that you describe, we'll not be able to easily free up the
frequently referenced page from /etc/*. The code will move on to step 3
and do its regular reclaim.

Still it seems to me you are subverting the normal order of reclaim. I
don't see why an unmapped page cache or slab cache item should be evicted
before a mapped page. Certainly the cost of rebuilding a dentry compared
to the gain from evicting it, is much higher than that of reestablishing a
mapped page.

Subverting to avoid memory duplication; the word "subverting" is
overloaded.

Right, should have used a different one.

Let me try to reason a bit. First let me explain the problem.

Memory is a precious resource in a consolidated environment. We don't want
to waste memory via page cache duplication (cache=writethrough and
cache=writeback mode).

Now here is what we are trying to do:

1. A slab page will not be freed until the entire page is free (all slabs
   have been kfree'd, so to speak). Normal reclaim will definitely free
   this page, but a lot of it depends on how frequently we are scanning
   the LRU list and when this page got added.
2. In the case of page cache (specifically unmapped page cache), there is
   duplication already, so why not go after unmapped page caches when the
   system is under memory pressure?

In the case of 1, we don't force a dentry to be freed, but rather a freed
page in the slab cache to be reclaimed ahead of forcing reclaim of mapped
pages.

Sounds like this should be done unconditionally, then. An empty slab page
is worth less than an unmapped pagecache page at all times, no?

In a consolidated environment, even at the cost of some CPU to run
shrinkers, I think potentially yes.
Does the problem statement make sense? If so, do you agree with 1 and 2?
Is there a major concern about subverting regular reclaim? Does subverting
it make sense in the duplicated scenario?

In the case of 2, how do you know there is duplication? You know the guest
caches the page, but you have no information about the host. Since the
page is cached in the guest, the host doesn't see it referenced, and is
likely to drop it.

True, that is why the first patch is controlled via a boot parameter that
the host can pass. For the second patch, I think we'll need something like
a balloon size cache? with the cache argument being optional.

If there is no duplication, then you may have dropped a recently-used page
and will likely cause a major fault soon.

Yes, agreed.

--
	Three Cheers,
	Balbir
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
On Mon, 2010-06-14 at 08:39 +0200, Markus Armbruster wrote: Alex Williamson alex.william...@redhat.com writes:

qdev_get_dev_path() is intended to be the canonical utility for creating a string representing the qdev hierarchy of a device. The path consists of bus and device names as well as identified properties of the immediate parent bus and device. This results in strings like:

/main-system-bus/pci.0,addr=00.0/i440FX
/main-system-bus/pci.0,addr=01.0/PIIX3
/main-system-bus/pci.0,addr=02.0/cirrus-vga
/main-system-bus/pci.0/isa.0/mc146818rtc
/main-system-bus/pci.0/isa.0/isa-serial
/main-system-bus/pci.0/isa.0/i8042
/main-system-bus/pci.0/isa.0/isa-fdc
/main-system-bus/pci.0,addr=03.0/i82551,mac=52:54:00:12:34:56
/main-system-bus/pci.0,addr=04.0/virtio-net-pci,mac=52:54:00:12:34:57
/main-system-bus/pci.0,addr=05.0/e1000,mac=52:54:00:12:34:58
/main-system-bus/pci.0,addr=06.0/rtl8139,mac=52:54:00:12:34:59
/main-system-bus/pci.0,addr=07.0/pcnet,mac=52:54:00:12:34:5a
/main-system-bus/pci.0,addr=01.1/piix3-ide
/main-system-bus/pci.0,addr=01.3/PIIX4_PM
/main-system-bus/pci.0,addr=08.0/lsi53c895a
/main-system-bus/pci.0,addr=09.0/virtio-blk-pci

There are two primary targets for these strings. The first is vmsave, where we currently use a device provided string plus instance number to track SaveStateEntries. This fails when we introduce device hotplug, particularly in a case where we have gaps in the instance numbers that cannot easily be reproduced on a migration target. The second is for naming RAMBlocks. For these, we would like a string that matches the vmstate string.

Could you explain why you add identified properties of the immediate parent bus and device? They make the result very much *not* a dev path in the qdev sense...

In order to try to get a unique string.
Without looking into device properties, two e1000s would both be:

/main-system-bus/pci.0/e1000
/main-system-bus/pci.0/e1000

Which is no better than simply e1000 and would require us to fall back to instance ids again. The goal here is that anything that makes use of passing a dev when registering a vmstate gets an instance id of zero.

Alex
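The path elements discussed above combine a name with its identified properties, e.g. "pci.0,addr=04.0" or "virtio-net-pci,mac=52:54:00:12:34:57". A minimal sketch of composing one such element — the helper is hypothetical, not the actual qdev_get_dev_path() implementation:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical: compose one path element from a bus/device name and a
 * pre-formatted property string.  With no properties the element is just
 * the name, which is exactly why two e1000s on the same bus collide. */
static void format_path_element(char *buf, size_t len,
                                const char *name, const char *props)
{
    if (props && *props)
        snprintf(buf, len, "%s,%s", name, props);
    else
        snprintf(buf, len, "%s", name);
}
```

Joining such elements with '/' yields strings of the shape shown in the message above.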
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
Alex Williamson wrote: On Mon, 2010-06-14 at 08:39 +0200, Markus Armbruster wrote: Alex Williamson alex.william...@redhat.com writes:

qdev_get_dev_path() is intended to be the canonical utility for creating a string representing the qdev hierarchy of a device. The path consists of bus and device names as well as identified properties of the immediate parent bus and device. This results in strings like:

/main-system-bus/pci.0,addr=00.0/i440FX
/main-system-bus/pci.0,addr=01.0/PIIX3
/main-system-bus/pci.0,addr=02.0/cirrus-vga
/main-system-bus/pci.0/isa.0/mc146818rtc
/main-system-bus/pci.0/isa.0/isa-serial
/main-system-bus/pci.0/isa.0/i8042
/main-system-bus/pci.0/isa.0/isa-fdc
/main-system-bus/pci.0,addr=03.0/i82551,mac=52:54:00:12:34:56
/main-system-bus/pci.0,addr=04.0/virtio-net-pci,mac=52:54:00:12:34:57
/main-system-bus/pci.0,addr=05.0/e1000,mac=52:54:00:12:34:58
/main-system-bus/pci.0,addr=06.0/rtl8139,mac=52:54:00:12:34:59
/main-system-bus/pci.0,addr=07.0/pcnet,mac=52:54:00:12:34:5a
/main-system-bus/pci.0,addr=01.1/piix3-ide
/main-system-bus/pci.0,addr=01.3/PIIX4_PM
/main-system-bus/pci.0,addr=08.0/lsi53c895a
/main-system-bus/pci.0,addr=09.0/virtio-blk-pci

There are two primary targets for these strings. The first is vmsave, where we currently use a device provided string plus instance number to track SaveStateEntries. This fails when we introduce device hotplug, particularly in a case where we have gaps in the instance numbers that cannot easily be reproduced on a migration target. The second is for naming RAMBlocks. For these, we would like a string that matches the vmstate string.

Could you explain why you add identified properties of the immediate parent bus and device? They make the result very much *not* a dev path in the qdev sense...

In order to try to get a unique string.
Without looking into device properties, two e1000s would both be:

/main-system-bus/pci.0/e1000
/main-system-bus/pci.0/e1000

Which is no better than simply e1000 and would require us to fall back to instance ids again. The goal here is that anything that makes use of passing a dev when registering a vmstate gets an instance id of zero.

Will soon (re-)post a patch that adds per-bus instance numbers to deal with identically named devices. That's required as not all buses have canonical node IDs (e.g. ISA or the main system bus).

Jan

-- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 03:50 PM, Balbir Singh wrote:

Let me try to reason a bit. First let me explain the problem: memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback mode). Now here is what we are trying to do:

1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added.
2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page cache when the system is under memory pressure?

In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages.

Sounds like this should be done unconditionally, then. An empty slab page is worth less than an unmapped pagecache page at all times, no?

In a consolidated environment, even at the cost of some CPU to run shrinkers, I think potentially yes.

I don't understand. If you're running the shrinkers then you're evicting live entries, which could cost you an I/O each. That's expensive, consolidated or not. If you're not running the shrinkers, why does it matter if you're consolidated or not? Drop that page unconditionally.

Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? Does subverting it make sense in the duplicated scenario?

In the case of 2, how do you know there is duplication? You know the guest caches the page, but you have no information about the host. Since the page is cached in the guest, the host doesn't see it referenced, and is likely to drop it.

True, that is why the first patch is controlled via a boot parameter that the host can pass.
For the second patch, I think we'll need something like a balloon size cache command, with the cache argument being optional.

Whether a page is duplicated on the host or not is per-page; it cannot be a boot parameter. If we drop unmapped pagecache pages, we need to be sure they can be backed by the host, and that depends on the amount of sharing. Overall, I don't see how a user can tune this. If I were a guest admin, I'd play it safe by not assuming the host will back me, and disabling the feature. To get something like this to work, we need to reward cooperating guests somehow.

If there is no duplication, then you may have dropped a recently-used page and will likely cause a major fault soon.

Yes, agreed.

So how do we deal with this?

-- error compiling committee.c: too many arguments to function
Re: [PATCH 0/24] Nested VMX, v5
On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 0/24] Nested VMX, v5:

Overall, very nice. The finer split and better documentation really help reviewing, thanks.

Thank you for the review and all the accurate comments! Let's try to get this merged quickly. I'll start fixing the individual patches and resending them individually, and when I've fixed everything I'll resubmit the whole lot. I hope that this time I can do it in a matter of days, not months.

Thanks, Nadav.

-- Nadav Har'El              | Monday, Jun 14 2010, 2 Tammuz 5770
n...@math.technion.ac.il    | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il   | An egotist is a person of low taste, more interested in himself than in me.
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
/main-system-bus/pci.0,addr=09.0/virtio-blk-pci

There's a device missing between the main system bus and the pci bus. Should be something like: /main-system-bus/piix4-pcihost/pci.0/_09.0

Could you explain why you add identified properties of the immediate parent bus and device? They make the result very much *not* a dev path in the qdev sense...

In order to try to get a unique string. Without looking into device properties, two e1000s would both be:

/main-system-bus/pci.0/e1000
/main-system-bus/pci.0/e1000

Which is no better than simply e1000 and would require us to fall back to instance ids again. The goal here is that anything that makes use of passing a dev when registering a vmstate gets an instance id of zero.

You already got the information you need, you just put it in the wrong place. The canonical ID for the device could be its bus address. In practice we'd probably want to allow the user to specify it by name, provided these are unique. e.g. in the above machine we could accept [...]/virtio-blk-pci as an alias for [...]:_09.0. Device names have a restricted namespace, so we can use an initial prefix to disambiguate a name/label from a bus address.

For busses that don't have a consistent addressing scheme, some sort of instance ID is unavoidable. I guess it may be possible to invent something based on other device properties (e.g. address of the first IO port/memory region).

Paul
[KVM_AUTOTEST][PATCH] Add support for newer (sysfs) ksm interface
This patch adds support for the newer sysfs interface used since Fedora 12.
[ kvm-Bugs-1760424 ] Intel real mode emulation failures
Bugs item #1760424, was opened at 2007-07-25 18:40 Message generated for change (Comment added) made by technologov You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1760424&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: intel Group: None Status: Open Resolution: Fixed Priority: 8 Private: No Submitted By: Technologov (technologov) Assigned to: Nobody/Anonymous (nobody) Summary: Intel real mode emulation failures

Initial Comment: Major Linux distros crash on KVM-29, Intel64 CPU. Those include: Mandriva 2007 (32/64-bit), Ubuntu 7.04 (32/64-bit), openSUSE 10.2 (32/64-bit), PCLinuxOS 2007 (32-bit). They all crash for the same reason: the graphical bootloader (GRUB). If we fix the root of the issue, all the problems will be gone. All of the above work fine with -no-kvm (QEMU). I thought that bug 1740031 (Kubuntu 7.04 fails to boot as guest - hangs after kernel boot) is related to this one, but since it was opened against the kernel, not against the bootloader, I have decided to open this bug. -Alexey Eremenko.

-- Comment By: Technologov (technologov) Date: 2010-06-14 16:55 Message: The problem happened on Core 2 CPUs... Do you have a system with a Core 2 CPU?

-- Comment By: Jes Sorensen (jessorensen) Date: 2010-06-11 15:49 Message: Pulled down a copy and tested. Ubuntu 7.04 installs and boots fine under recent KVM/QEMU. Looks like the problem has been resolved. Closing. Jes

-- Comment By: Technologov (technologov) Date: 2008-08-03 11:25 Message: Logged In: YES user_id=1839746 Originator: YES The situation is improving - distros integrate the newest GRUB, which works around this problem. Fixed distros: openSUSE 11.0, Ubuntu 8.04. Mandriva Linux (2007/2008/2009-beta1) are still problematic. Bug opened.
Mandriva Linux unable to boot or install on KVM Virtualizer due to old GRUB: https://qa.mandriva.com/show_bug.cgi?id=42358 -Alexey, 3.8.2008.

-- Comment By: Henrik Holst (henrik_holst) Date: 2008-02-08 16:47 Message: Logged In: YES user_id=2003951 Originator: NO I see the same here with KVM-60, on a Core 2 Duo 6600. Host is debian testing on a 2.6.22-3-amd64 kernel, using kvm-intel. Trying to boot a kubuntu-7.10 iso crashes directly; it works flawlessly with -no-kvm, and I have tried to run both a debian and a w2k3 as guests without any problems.

-- Comment By: Technologov (technologov) Date: 2007-09-05 13:19 Message: Logged In: YES user_id=1839746 Originator: YES Update: with KVM-36 this is partially fixed. Works on SVM, but still gets stuck on VMX. openSUSE 10.2 (32/64-bit) and Mandriva Linux 2007 boot now. -Alexey Technologov

-- Comment By: Technologov (technologov) Date: 2007-07-25 18:51 Message: Logged In: YES user_id=1839746 Originator: YES File Added: KVM29-VMX64-Ubuntu7-i386.txt

-- Comment By: Technologov (technologov) Date: 2007-07-25 18:46 Message: Logged In: YES user_id=1839746 Originator: YES File Added: KVM29-VMX64-PCLinuxOS2007-i386.txt

-- Comment By: Technologov (technologov) Date: 2007-07-25 18:45 Message: Logged In: YES user_id=1839746 Originator: YES File Added: KVM29-VMX64-openSUSE10.2.txt

-- Comment By: Technologov (technologov) Date: 2007-07-25 18:44 Message: Logged In: YES user_id=1839746 Originator: YES File Added: KVM29-VMX64-Mandriva2007-i386.txt

-- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1760424&group_id=180599
[ kvm-Bugs-1760424 ] Intel real mode emulation failures
Bugs item #1760424, was opened at 2007-07-25 17:40 Message generated for change (Settings changed) made by jessorensen You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1760424&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: intel Group: None Status: Closed Resolution: Fixed Priority: 8 Private: No Submitted By: Technologov (technologov) Assigned to: Nobody/Anonymous (nobody) Summary: Intel real mode emulation failures

Initial Comment: Major Linux distros crash on KVM-29, Intel64 CPU. Those include: Mandriva 2007 (32/64-bit), Ubuntu 7.04 (32/64-bit), openSUSE 10.2 (32/64-bit), PCLinuxOS 2007 (32-bit). They all crash for the same reason: the graphical bootloader (GRUB). If we fix the root of the issue, all the problems will be gone. All of the above work fine with -no-kvm (QEMU). I thought that bug 1740031 (Kubuntu 7.04 fails to boot as guest - hangs after kernel boot) is related to this one, but since it was opened against the kernel, not against the bootloader, I have decided to open this bug. -Alexey Eremenko.

-- Comment By: Jes Sorensen (jessorensen) Date: 2010-06-14 15:59 Message: Yes, I ran my test on a Core 2. If you can reproduce the problem, please open a new bug for it in Launchpad.

-- Comment By: Technologov (technologov) Date: 2010-06-14 15:55 Message: The problem happened on Core 2 CPUs... Do you have a system with a Core 2 CPU?

-- Comment By: Jes Sorensen (jessorensen) Date: 2010-06-11 14:49 Message: Pulled down a copy and tested. Ubuntu 7.04 installs and boots fine under recent KVM/QEMU. Looks like the problem has been resolved. Closing. Jes

-- Comment By: Technologov (technologov) Date: 2008-08-03 10:25 Message: Logged In: YES user_id=1839746 Originator: YES The situation is improving - distros integrate the newest GRUB, which works around this problem. Fixed distros: openSUSE 11.0, Ubuntu 8.04.
Mandriva Linux (2007/2008/2009-beta1) are still problematic. Bug opened. Mandriva Linux unable to boot or install on KVM Virtualizer due to old GRUB: https://qa.mandriva.com/show_bug.cgi?id=42358 -Alexey, 3.8.2008.

-- Comment By: Henrik Holst (henrik_holst) Date: 2008-02-08 15:47 Message: Logged In: YES user_id=2003951 Originator: NO I see the same here with KVM-60, on a Core 2 Duo 6600. Host is debian testing on a 2.6.22-3-amd64 kernel, using kvm-intel. Trying to boot a kubuntu-7.10 iso crashes directly; it works flawlessly with -no-kvm, and I have tried to run both a debian and a w2k3 as guests without any problems.

-- Comment By: Technologov (technologov) Date: 2007-09-05 12:19 Message: Logged In: YES user_id=1839746 Originator: YES Update: with KVM-36 this is partially fixed. Works on SVM, but still gets stuck on VMX. openSUSE 10.2 (32/64-bit) and Mandriva Linux 2007 boot now. -Alexey Technologov

-- Comment By: Technologov (technologov) Date: 2007-07-25 17:51 Message: Logged In: YES user_id=1839746 Originator: YES File Added: KVM29-VMX64-Ubuntu7-i386.txt

-- Comment By: Technologov (technologov) Date: 2007-07-25 17:46 Message: Logged In: YES user_id=1839746 Originator: YES File Added: KVM29-VMX64-PCLinuxOS2007-i386.txt

-- Comment By: Technologov (technologov) Date: 2007-07-25 17:45 Message: Logged In: YES user_id=1839746 Originator: YES File Added: KVM29-VMX64-openSUSE10.2.txt

-- Comment By: Technologov (technologov) Date: 2007-07-25 17:44 Message: Logged In: YES user_id=1839746 Originator: YES File Added: KVM29-VMX64-Mandriva2007-i386.txt

-- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1760424&group_id=180599
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Mon, 2010-06-14 at 16:01 +0300, Avi Kivity wrote:

If we drop unmapped pagecache pages, we need to be sure they can be backed by the host, and that depends on the amount of sharing.

You also have to set the host up properly, and continue to maintain it in a way that finds and eliminates duplicates. I saw some benchmarks where KSM was doing great, finding lots of duplicate pages. Then, the host filled up, and guests started reclaiming. As memory pressure got worse, so did KSM's ability to find duplicates.

At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim?

-- Dave
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 06:12 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:

1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added.

You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure.

Depends. If you've evicted something that will be referenced soon, you're increasing system pressure.

If unmapped page cache is the easiest thing to evict, then it should be the first thing that goes when a balloon request comes in, which is the case this patch is trying to handle. If it isn't the easiest thing to evict, then we _shouldn't_ evict it.

Easy to evict is just one measure. There's benefit (size of data evicted), cost to refill (seeks, cpu), and likelihood that the cost to refill will be incurred (recency). It's all very complicated.

We need better information to make these decisions. For one thing, I'd like to see age information tied to objects. We may have two pages that were referenced at wildly different times be next to each other in LRU order. We have many LRUs, but no idea of the relative recency of the tails of those LRUs. If each page or object had an age, we could scale those ages by the benefit from reclaim and cost to refill and make a better decision as to what to evict first.

But of course page-age means increasing sizeof struct page, and we can only approximate its value by scanning the accessed bit, not determine it accurately (unlike the other objects managed by the cache).
-- error compiling committee.c: too many arguments to function
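The age/benefit/refill-cost weighing sketched in the message above could look roughly like this. The struct and scoring function are purely illustrative assumptions, not existing kernel code, and the units are arbitrary:

```c
/* Hypothetical eviction score: older objects (larger age) that free more
 * memory (benefit) and are cheap to rebuild (refill_cost) score higher,
 * i.e. should be evicted first.  Illustrative only. */
struct cached_object {
    unsigned long age;         /* time since last reference */
    unsigned long benefit;     /* bytes freed by evicting it */
    unsigned long refill_cost; /* e.g. seeks + CPU to rebuild */
};

static unsigned long eviction_score(const struct cached_object *o)
{
    if (o->refill_cost == 0)
        return ~0UL;           /* free to rebuild: evict immediately */
    return o->age * o->benefit / o->refill_cost;
}

/* Pick the better eviction candidate of two. */
static const struct cached_object *
prefer_eviction(const struct cached_object *a, const struct cached_object *b)
{
    return eviction_score(a) >= eviction_score(b) ? a : b;
}
```

This also illustrates the thread's earlier point: a dentry with a high rebuild cost scores lower than an unmapped pagecache page of the same age, so it is kept longer.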
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
On Mon, 2010-06-14 at 14:09 +0100, Paul Brook wrote:

/main-system-bus/pci.0,addr=09.0/virtio-blk-pci

There's a device missing between the main system bus and the pci bus. Should be something like: /main-system-bus/piix4-pcihost/pci.0/_09.0

Ok, I can easily come up with: /System/main-system-bus/i440FX-pcihost/PCI/pci.0/_09.0/virtio-blk-pci/virtio-blk

No. Now you've got way too many elements in the path. My point is that you replace the name of the device with the bus address of the device. However you determine the element names, the path should be /busname/devicename pairs. I'm undecided whether the root element should be the main system bus, or a system device node that contains the main system bus.

You already got the information you need, you just put it in the wrong place. The canonical ID for the device could be its bus address. In practice we'd probably want to allow the user to specify it by name, provided these are unique. e.g. in the above machine we could accept [...]/virtio-blk-pci as an alias for [...]:_09.0.

Sure, if there was a guaranteed unique char* on a DeviceState that was consistent between guest invocations, we could just use that instead. I suppose all devices could have a global system id property and if that was present we'd use that instead of creating the device path.

All we require is some way of uniquely addressing each device on the bus. For PCI that's trivial (devfn). For other busses we get to pick something appropriate.

Device names have a restricted namespace, so we can use an initial prefix to disambiguate a name/label from a bus address.

I'm not sure it's necessary. It seems like it would only be necessary to differentiate the two if we wanted to translate between namespaces. But I think it's reasonable to require that if a global name is provided, it must always be provided for the life of the VM.

I don't think requiring that all devices are given a globally unique name is going to fly.
Locally (per-bus) unique user-specified IDs are still a PITA, but may be acceptable. Having a canonical addressing system that's independent of user-assigned IDs seems like a good thing.

For busses that don't have a consistent addressing scheme, some sort of instance ID is unavoidable. I guess it may be possible to invent something based on other device properties (e.g. address of the first IO port/memory region).

Instance IDs aren't always bad, we just run into trouble with hotplug and maybe creating unique ramblock names. But such busses typically don't support hotplug and don't have multiple instances of the same device on the bus, so I don't expect us to hit many issues there as long as we can get a stable address path. Thanks,

Simple consecutive instance IDs are inherently unstable. They depend on the order of device creation. The ID really wants to be something that can be reliably determined from the device bus itself (and/or its interaction with its parent bus).

Paul
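For PCI, the canonical bus address Paul refers to is derivable from devfn, which packs slot and function into one byte (slot = devfn >> 3, function = devfn & 7). A sketch rendering it in the "_slot.function" spelling used in this thread (the underscore-prefix convention is simply what the examples above show, not an established API):

```c
#include <stdio.h>

/* Render a PCI devfn as a canonical address token, e.g. 0x48 -> "_09.0". */
static void pci_canonical_addr(char *buf, size_t len, unsigned int devfn)
{
    snprintf(buf, len, "_%02x.%x", devfn >> 3, devfn & 7);
}
```

Because devfn is assigned by the bus rather than by creation order, such a token stays stable across guest invocations, which is exactly the property consecutive instance IDs lack.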
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 06:33 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 16:01 +0300, Avi Kivity wrote:

If we drop unmapped pagecache pages, we need to be sure they can be backed by the host, and that depends on the amount of sharing.

You also have to set the host up properly, and continue to maintain it in a way that finds and eliminates duplicates. I saw some benchmarks where KSM was doing great, finding lots of duplicate pages. Then, the host filled up, and guests started reclaiming. As memory pressure got worse, so did KSM's ability to find duplicates.

Yup. KSM needs to be backed up by ballooning, swap, and live migration.

At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim?

Isn't the knob in this proposal the balloon? AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change.

My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer.

btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs?

-- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.
On 06/04/2010 04:45 PM, Cam Macdonell wrote:

This is useful for devices that do not want to take memory regions data with them on migration.
---
 arch_init.c  | 28
 cpu-all.h    |  2 ++
 cpu-common.h |  2 ++
 exec.c       | 12
 4 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index cfc03ea..7a234fa 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
                                         current_addr + TARGET_PAGE_SIZE,
                                         MIGRATION_DIRTY_FLAG);

-        p = qemu_get_ram_ptr(current_addr);
-
-        if (is_dup_page(p, *p)) {
-            qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
-            qemu_put_byte(f, *p);
-        } else {
-            qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
-            qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
-        }
+        if (!cpu_physical_memory_get_dirty(current_addr,
+                                           NO_MIGRATION_FLAG)) {
+            p = qemu_get_ram_ptr(current_addr);
+
+            if (is_dup_page(p, *p)) {
+                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
+                qemu_put_byte(f, *p);
+            } else {
+                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
+                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+            }

-        found = 1;
-        break;
+            found = 1;
+            break;
+        }
     }

Shouldn't we just disable live migration outright? I would rather that the device mark migration as impossible, having the user hot-remove the device before migration and then add it again after migration. Device assignment could also use this functionality. What this does is make migration possible but fundamentally broken, which is not a good thing.
Regards,

Anthony Liguori

         addr += TARGET_PAGE_SIZE;
         current_addr = (saved_addr + addr) % last_ram_offset;
@@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
     ram_addr_t count = 0;

     for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
-        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
+        if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) &&
+            cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
             count++;
         }
     }
diff --git a/cpu-all.h b/cpu-all.h
index 9080cc7..4df00ab 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -887,6 +887,8 @@ extern int mem_prealloc;
 #define CODE_DIRTY_FLAG 0x02
 #define MIGRATION_DIRTY_FLAG 0x08

+#define NO_MIGRATION_FLAG 0x10
+
 #define DIRTY_ALL_FLAG (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)

 /* read dirty bit (return 0 or 1) */
diff --git a/cpu-common.h b/cpu-common.h
index 4b0ba60..a1ebbbe 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -39,6 +39,8 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr,
     cpu_register_physical_memory_offset(start_addr, size, phys_offset, 0);
 }

+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
+
 ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
 ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(ram_addr_t);
diff --git a/exec.c b/exec.c
index 39c18a7..c11d22f 100644
--- a/exec.c
+++ b/exec.c
@@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
 }
 #endif

+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
+{
+    int i, len;
+    uint8_t *p;
+
+    len = length >> TARGET_PAGE_BITS;
+    p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
+    for (i = 0; i < len; i++) {
+        p[i] |= NO_MIGRATION_FLAG;
+    }
+}
+
 ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
 {
     RAMBlock *new_block;
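The flag logic in the patch above can be exercised standalone. The following is a simplified model under stated assumptions — a plain byte array standing in for phys_ram_flags and 4 KiB pages — not QEMU code:

```c
#include <stdint.h>
#include <stddef.h>

#define MIGRATION_DIRTY_FLAG 0x08
#define NO_MIGRATION_FLAG    0x10

/* Simplified model: one flag byte per page, 4 KiB pages, tiny RAM. */
#define TARGET_PAGE_BITS 12
#define NR_PAGES 16
static uint8_t phys_ram_flags[NR_PAGES];

/* Mirrors the patch's loop: set NO_MIGRATION_FLAG on each page in range. */
static void mark_pages_no_migrate(size_t start, size_t length)
{
    size_t i, len = length >> TARGET_PAGE_BITS;
    uint8_t *p = phys_ram_flags + (start >> TARGET_PAGE_BITS);

    for (i = 0; i < len; i++)
        p[i] |= NO_MIGRATION_FLAG;
}

/* A page is sent only if dirty for migration and not excluded,
 * matching the combined test in ram_save_remaining(). */
static int should_migrate(size_t addr)
{
    uint8_t f = phys_ram_flags[addr >> TARGET_PAGE_BITS];

    return (f & MIGRATION_DIRTY_FLAG) && !(f & NO_MIGRATION_FLAG);
}
```

This also makes Anthony's objection concrete: pages skipped this way silently vanish from the migration stream, so the destination sees a hole rather than an error.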
Re: [Qemu-devel] Re: [PATCH v6 0/6] Inter-VM Shared Memory Device with migration support
On 06/11/2010 05:03 PM, Cam Macdonell wrote:

Hi Anthony, is my implementation of master/peer roles acceptable?

Yes, it looks good.

I realize with Alex's RAMList changes I may need to modify my patch, but is the approach of marking memory non-migratable an acceptable implementation?

Please make sure to address some of the CODING_STYLE comments too.

Regards, Anthony Liguori

Thanks, Cam

On Fri, Jun 4, 2010 at 3:45 PM, Cam Macdonell c...@cs.ualberta.ca wrote:

Latest patch for the PCI shared memory device that maps a host shared memory object to be shared between guests.

New in this series:
- migration support, with 'master' and 'peer' roles for the guest to determine who owns memory. With 'master', the guest has the canonical copy of the shared memory and will copy it with it on migration. With 'role=peer', the guest will not copy the shared memory, but attach to what is on the destination machine.
- modified phys_ram_dirty array for marking memory as not to be migrated
- add support for non-migrated memory regions

v5:
- fixed segfault for non-server case
- code style fixes
- removed limit on the number of guests
- shared memory server is now in qemu.git/contrib
- made ioeventfd setup function generic
- removed interrupts when guest joined (let application handle it)

v4:
- moved to single Doorbell register and use datamatch to trigger different VMs rather than one register per eventfd
- remove writing arbitrary values to eventfds. Only values of 1 are now written to ensure correct usage

Cam Macdonell (6):
  Device specification for shared memory PCI device
  Adds two new functions for assigning ioeventfd and irqfds.
  Change phys_ram_dirty to phys_ram_status
  Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.
Inter-VM shared memory PCI device The stand-alone shared memory server for inter-VM shared memory Makefile.target | 3 + arch_init.c | 28 +- contrib/ivshmem-server/Makefile | 16 + contrib/ivshmem-server/README | 30 ++ contrib/ivshmem-server/ivshmem_server.c | 353 + contrib/ivshmem-server/send_scm.c | 208 contrib/ivshmem-server/send_scm.h | 19 + cpu-all.h | 18 +- cpu-common.h | 2 + docs/specs/ivshmem_device_spec.txt | 96 exec.c | 48 ++- hw/ivshmem.c | 852 +++ kvm-all.c | 32 ++ kvm.h | 1 + qemu-char.c | 6 + qemu-char.h | 3 + qemu-doc.texi | 32 ++ 17 files changed, 1710 insertions(+), 37 deletions(-) create mode 100644 contrib/ivshmem-server/Makefile create mode 100644 contrib/ivshmem-server/README create mode 100644 contrib/ivshmem-server/ivshmem_server.c create mode 100644 contrib/ivshmem-server/send_scm.c create mode 100644 contrib/ivshmem-server/send_scm.h create mode 100644 docs/specs/ivshmem_device_spec.txt create mode 100644 hw/ivshmem.c
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote: On 06/14/2010 06:33 PM, Dave Hansen wrote: At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim? Isn't the knob in this proposal the balloon? AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. I think the patch was made on the following assumptions: 1. Guests will keep filling their memory with relatively worthless page cache that they don't really need. 2. When they do this, it hurts the overall system with no real gain for anyone. In the case of a ballooned guest, they _won't_ keep filling memory. The balloon will prevent them. So, I guess I was just going down the path of considering if this would be useful without ballooning in place. To me, it's really hard to justify _with_ ballooning in place. My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer. btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs? Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure. Balbir, can you elaborate a bit on why you would need these patches on a guest that is being ballooned? -- Dave
Re: [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory
On 06/04/2010 04:45 PM, Cam Macdonell wrote: this code is a standalone server which will pass file descriptors for the shared memory region and eventfds to support interrupts between guests using inter-VM shared memory. --- contrib/ivshmem-server/Makefile | 16 ++ contrib/ivshmem-server/README | 30 +++ contrib/ivshmem-server/ivshmem_server.c | 353 +++ contrib/ivshmem-server/send_scm.c | 208 ++ contrib/ivshmem-server/send_scm.h | 19 ++ 5 files changed, 626 insertions(+), 0 deletions(-) create mode 100644 contrib/ivshmem-server/Makefile create mode 100644 contrib/ivshmem-server/README create mode 100644 contrib/ivshmem-server/ivshmem_server.c create mode 100644 contrib/ivshmem-server/send_scm.c create mode 100644 contrib/ivshmem-server/send_scm.h diff --git a/contrib/ivshmem-server/Makefile b/contrib/ivshmem-server/Makefile new file mode 100644 index 000..da40ffa --- /dev/null +++ b/contrib/ivshmem-server/Makefile @@ -0,0 +1,16 @@ +CC = gcc +CFLAGS = -O3 -Wall -Werror +LIBS = -lrt + +# a very simple makefile to build the inter-VM shared memory server + +all: ivshmem_server + +.c.o: + $(CC) $(CFLAGS) -c $^ -o $@ + +ivshmem_server: ivshmem_server.o send_scm.o + $(CC) $(CFLAGS) -o $@ $^ $(LIBS) + +clean: + rm -f *.o ivshmem_server diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README new file mode 100644 index 000..b1fc2a2 --- /dev/null +++ b/contrib/ivshmem-server/README @@ -0,0 +1,30 @@ +Using the ivshmem shared memory server +-- + +This server is only supported on Linux. + +To use the shared memory server, first compile it. Running 'make' should +accomplish this. An executable named 'ivshmem_server' will be built. + +to display the options run: + +./ivshmem_server -h + +Options +--- + +-h print help message + +-ppath on host +unix socket to listen on. The qemu-kvm chardev needs to connect on +this socket. 
(default: '/tmp/ivshmem_socket') + +-s <string> +POSIX shared object to create that is the shared memory (default: 'ivshmem') + +-m <#> +size of the POSIX object in MBs (default: 1) + +-n <#> +number of eventfds for each guest. This number must match the +'vectors' argument passed to the ivshmem device. (default: 1) diff --git a/contrib/ivshmem-server/ivshmem_server.c b/contrib/ivshmem-server/ivshmem_server.c new file mode 100644 index 000..e0a7b98 --- /dev/null +++ b/contrib/ivshmem-server/ivshmem_server.c There's no licensing here. I don't think this belongs in the qemu tree either, to be honest. If it were to be included, it ought to use all of the existing qemu infrastructure like the other qemu-* tools. Regards, Anthony Liguori @@ -0,0 +1,353 @@ +/* + * A stand-alone shared memory server for inter-VM shared memory for KVM + */ + +#include <errno.h> +#include <string.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/un.h> +#include <unistd.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/eventfd.h> +#include <sys/mman.h> +#include <sys/select.h> +#include <stdio.h> +#include <stdlib.h> +#include "send_scm.h" + +#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket" +#define DEFAULT_SHM_OBJ "ivshmem" + +#define DEBUG 1 + +typedef struct server_state { +vmguest_t *live_vms; +int nr_allocated_vms; +int shm_size; +long live_count; +long total_count; +int shm_fd; +char *path; +char *shmobj; +int maxfd, conn_socket; +long msi_vectors; +} server_state_t; + +void usage(char const *prg); +int find_set(fd_set *readset, int max); +void print_vec(server_state_t *s, const char *c); + +void add_new_guest(server_state_t *s); +void parse_args(int argc, char **argv, server_state_t *s); +int create_listening_socket(char *path); + +int main(int argc, char **argv) +{ +fd_set readset; +server_state_t *s; + +s = (server_state_t *)calloc(1, sizeof(server_state_t)); + +s->live_count = 0; +s->total_count = 0; +parse_args(argc, argv, s); + +/* open shared memory file */ +if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU)) < 0) +{ +fprintf(stderr, "kvm_ivshmem: could not open shared file\n"); +exit(-1); +} + +ftruncate(s->shm_fd, s->shm_size); + +s->conn_socket = create_listening_socket(s->path); + +s->maxfd = s->conn_socket; + +for (;;) { +int ret, handle, i; +char buf[1024]; + +print_vec(s, "vm_sockets"); + +FD_ZERO(&readset); +/* conn socket is in Live_vms at posn 0 */ +FD_SET(s->conn_socket, &readset); +for (i = 0; i < s->total_count; i++) { +if (s->live_vms[i].alive != 0) { +FD_SET(s->live_vms[i].sockfd, &readset); +} +} + +printf("\nWaiting (maxfd = %d)\n", s->maxfd); + +ret = select(s->maxfd + 1, &readset, NULL, NULL, NULL); + +if
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
Alex Williamson wrote: On Mon, 2010-06-14 at 14:09 +0100, Paul Brook wrote: /main-system-bus/pci.0,addr=09.0/virtio-blk-pci There's a device missing between the main system bus and the pci bus. Should be something like: /main-system-bus/piix4-pcihost/pci.0/_09.0 Ok, I can easily come up with: /System/main-system-bus/i440FX-pcihost/PCI/pci.0/_09.0/virtio-blk-pci/virtio-blk First two elements are redundant, '/' already stands for the main system bus. Then I don't get what the last element expresses (the target device is virtio-blk-pci). Is this due to some vmstate layout? But that should not be part of qdev paths (maybe I'm missing your use case). And instead of introducing another hierarchy level with the bus address, I would also prefer to add this as prefix or suffix to the device name, e.g. driver.busaddr, to expand the previous output. Code below. Could you explain why you add identified properties of the immediate parent bus and device? They make the result very much *not* a dev path in the qdev sense... In order to try to get a unique string. Without looking into device properties, two e1000s would both be: /main-system-bus/pci.0/e1000 /main-system-bus/pci.0/e1000 Which is no better than simply e1000 and would require us to fall back to instance ids again. The goal here is that anything that makes use of passing a dev when registering a vmstate gets an instance id of zero. You already got the information you need, you just put it in the wrong place. The canonical ID for the device could be its bus address. In practice we'd probably want to allow the user to specify it by name, provided these are unique. e.g. in the above machine we could accept [...]/virtio-blk-pci as an alias for [...]:_09.0. Sure, if there was a guaranteed unique char* on a DeviceState that was consistent between guest invocations, we could just use that instead.
I suppose all devices could have a global system id property and if that was present we'd use that instead of creating the device path. Device names have a restricted namespace, so we can use an initial prefix to disambiguate a name/label from a bus address. I'm not sure it's necessary. It seems like it would only be necessary to differentiate the two if we wanted to translate between namespaces. But I think it's reasonable to require that if a global name is provided, it must always be provided for the life of the VM. For busses that don't have a consistent addressing scheme then some sort of instance ID is unavoidable. I guess it may be possible to invent something based on other device properties (e.g. address of the first IO port/memory region). Instance IDs aren't always bad, we just run into trouble with hotplug and maybe creating unique ramblock names. But, such busses typically don't support hotplug and don't have multiple instances of the same device on the bus, so I don't expect us to hit many issues there as long as we can get a stable address path. Thanks, If stable instance numbers are required, we could simply keep them in DeviceState and search for the smallest free one on additions. Actually, I'm more in favor of this direction than including the bus address. That way we could keep a canonical format across all buses and do not have to provide possibly complex ID generation rules for each of them. BTW, the problem isn't truly solved by printing paths. We also need to parse them. It would be counterproductive if such paths ever see the light of day (ie. leak outside vmstate) and a user tries to hack it into the monitor or pass it on the command line. 
With my series, I'm currently able to process paths like this one: /i440FX-pcihost.0/pci.0/PIIX4_PM.0/i2c/smbus-eeprom.6 Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.
On Mon, Jun 14, 2010 at 9:51 AM, Anthony Liguori anth...@codemonkey.ws wrote: On 06/04/2010 04:45 PM, Cam Macdonell wrote: This is useful for devices that do not want to take memory regions data with them on migration. --- arch_init.c | 28 cpu-all.h | 2 ++ cpu-common.h | 2 ++ exec.c | 12 4 files changed, 32 insertions(+), 12 deletions(-) diff --git a/arch_init.c b/arch_init.c index cfc03ea..7a234fa 100644 --- a/arch_init.c +++ b/arch_init.c @@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f) current_addr + TARGET_PAGE_SIZE, MIGRATION_DIRTY_FLAG); - p = qemu_get_ram_ptr(current_addr); - - if (is_dup_page(p, *p)) { - qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS); - qemu_put_byte(f, *p); - } else { - qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE); - qemu_put_buffer(f, p, TARGET_PAGE_SIZE); - } + if (!cpu_physical_memory_get_dirty(current_addr, + NO_MIGRATION_FLAG)) { + p = qemu_get_ram_ptr(current_addr); + + if (is_dup_page(p, *p)) { + qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS); + qemu_put_byte(f, *p); + } else { + qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE); + qemu_put_buffer(f, p, TARGET_PAGE_SIZE); + } - found = 1; - break; + found = 1; + break; + } } Shouldn't we just disable live migration out right? I'm confused, as you seemed insistent on migration before. Do you want to support static migration (suspend/resume), but not live migration? What information do the master/peer roles represent then? I would rather that the device mark migration as impossible having the user hot remove the device before migration and then add it again after migration. Device assignment could also use this functionality. Would marking migration impossible be a new mechanism or are there other devices that mark migration impossible? or something added to QMP Sorry, you can't migrate with device 'x' attached? 
Cam addr += TARGET_PAGE_SIZE; current_addr = (saved_addr + addr) % last_ram_offset; @@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void) ram_addr_t count = 0; for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) { - if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) { + if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) && + cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) { count++; } } diff --git a/cpu-all.h b/cpu-all.h index 9080cc7..4df00ab 100644 --- a/cpu-all.h +++ b/cpu-all.h @@ -887,6 +887,8 @@ extern int mem_prealloc; #define CODE_DIRTY_FLAG 0x02 #define MIGRATION_DIRTY_FLAG 0x08 +#define NO_MIGRATION_FLAG 0x10 + #define DIRTY_ALL_FLAG (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG) /* read dirty bit (return 0 or 1) */ diff --git a/cpu-common.h b/cpu-common.h index 4b0ba60..a1ebbbe 100644 --- a/cpu-common.h +++ b/cpu-common.h @@ -39,6 +39,8 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr, cpu_register_physical_memory_offset(start_addr, size, phys_offset, 0); } +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size); + ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr); ram_addr_t qemu_ram_map(ram_addr_t size, void *host); ram_addr_t qemu_ram_alloc(ram_addr_t); diff --git a/exec.c b/exec.c index 39c18a7..c11d22f 100644 --- a/exec.c +++ b/exec.c @@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path) } #endif +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length) +{ + int i, len; + uint8_t *p; + + len = length >> TARGET_PAGE_BITS; + p = phys_ram_flags + (start >> TARGET_PAGE_BITS); + for (i = 0; i < len; i++) { + p[i] |= NO_MIGRATION_FLAG; + } +} + ram_addr_t qemu_ram_map(ram_addr_t size, void *host) { RAMBlock *new_block;
Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.
On 06/14/2010 11:08 AM, Cam Macdonell wrote: On Mon, Jun 14, 2010 at 9:51 AM, Anthony Liguorianth...@codemonkey.ws wrote: On 06/04/2010 04:45 PM, Cam Macdonell wrote: This is useful for devices that do not want to take memory regions data with them on migration. --- arch_init.c | 28 cpu-all.h|2 ++ cpu-common.h |2 ++ exec.c | 12 4 files changed, 32 insertions(+), 12 deletions(-) diff --git a/arch_init.c b/arch_init.c index cfc03ea..7a234fa 100644 --- a/arch_init.c +++ b/arch_init.c @@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f) current_addr + TARGET_PAGE_SIZE, MIGRATION_DIRTY_FLAG); -p = qemu_get_ram_ptr(current_addr); - -if (is_dup_page(p, *p)) { -qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS); -qemu_put_byte(f, *p); -} else { -qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE); -qemu_put_buffer(f, p, TARGET_PAGE_SIZE); -} +if (!cpu_physical_memory_get_dirty(current_addr, +NO_MIGRATION_FLAG)) { +p = qemu_get_ram_ptr(current_addr); + +if (is_dup_page(p, *p)) { +qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS); +qemu_put_byte(f, *p); +} else { +qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE); +qemu_put_buffer(f, p, TARGET_PAGE_SIZE); +} -found = 1; -break; +found = 1; +break; +} } Shouldn't we just disable live migration out right? I'm confused, as you seemed insistent on migration before. Do you want to support static migration (suspend/resume), but not live migration? What information do the master/peer roles represent then? When role=master, you should not disable live migration and you should still migrate the contents of the data. Otherwise, when you migrate, you lose the contents of shared memory and since the role is master, it's the one responsible for the data. I would rather that the device mark migration as impossible having the user hot remove the device before migration and then add it again after migration. Device assignment could also use this functionality. 
Would marking migration impossible be a new mechanism or are there other devices that mark migration impossible? or something added to QMP: Sorry, you can't migrate with device 'x' attached? We don't have such a mechanism today. Regards, Anthony Liguori Cam addr += TARGET_PAGE_SIZE; current_addr = (saved_addr + addr) % last_ram_offset; @@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void) ram_addr_t count = 0; for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) { -if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) { +if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) && +cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) { count++; } } diff --git a/cpu-all.h b/cpu-all.h index 9080cc7..4df00ab 100644 --- a/cpu-all.h +++ b/cpu-all.h @@ -887,6 +887,8 @@ extern int mem_prealloc; #define CODE_DIRTY_FLAG 0x02 #define MIGRATION_DIRTY_FLAG 0x08 +#define NO_MIGRATION_FLAG 0x10 + #define DIRTY_ALL_FLAG (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG) /* read dirty bit (return 0 or 1) */ diff --git a/cpu-common.h b/cpu-common.h index 4b0ba60..a1ebbbe 100644 --- a/cpu-common.h +++ b/cpu-common.h @@ -39,6 +39,8 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr, cpu_register_physical_memory_offset(start_addr, size, phys_offset, 0); } +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size); + ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr); ram_addr_t qemu_ram_map(ram_addr_t size, void *host); ram_addr_t qemu_ram_alloc(ram_addr_t); diff --git a/exec.c b/exec.c index 39c18a7..c11d22f 100644 --- a/exec.c +++ b/exec.c @@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path) } #endif +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length) +{ +int i, len; +uint8_t *p; + +len = length >> TARGET_PAGE_BITS; +p = phys_ram_flags + (start >> TARGET_PAGE_BITS); +for (i = 0; i < len; i++) { +p[i] |= NO_MIGRATION_FLAG; +} +} + ram_addr_t qemu_ram_map(ram_addr_t size, void *host) { RAMBlock *new_block;
Re: [Qemu-devel] [PATCH v2] ram_blocks: Convert to a QLIST
On 06/11/2010 12:11 PM, Alex Williamson wrote: This makes the RAM block list easier to manipulate. Also incorporate relevant variables into the RAMList struct. Signed-off-by: Alex Williamson <alex.william...@redhat.com> Acked-by: Chris Wright <chr...@redhat.com> Applied. Thanks. Regards, Anthony Liguori --- v2: For qemu.git this time arch_init.c | 14 +-- cpu-all.h | 28 +++-- exec.c | 78 --- 3 files changed, 54 insertions(+), 66 deletions(-) diff --git a/arch_init.c b/arch_init.c index 8e849a8..eb5b67c 100644 --- a/arch_init.c +++ b/arch_init.c @@ -110,7 +110,7 @@ static int ram_save_block(QEMUFile *f) ram_addr_t addr = 0; int bytes_sent = 0; -while (addr < last_ram_offset) { +while (addr < ram_list.last_offset) { if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) { uint8_t *p; @@ -133,7 +133,7 @@ static int ram_save_block(QEMUFile *f) break; } addr += TARGET_PAGE_SIZE; -current_addr = (saved_addr + addr) % last_ram_offset; +current_addr = (saved_addr + addr) % ram_list.last_offset; } return bytes_sent; @@ -146,7 +146,7 @@ static ram_addr_t ram_save_remaining(void) ram_addr_t addr; ram_addr_t count = 0; -for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) { +for (addr = 0; addr < ram_list.last_offset; addr += TARGET_PAGE_SIZE) { if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) { count++; } @@ -167,7 +167,7 @@ uint64_t ram_bytes_transferred(void) uint64_t ram_bytes_total(void) { -return last_ram_offset; +return ram_list.last_offset; } int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) @@ -191,7 +191,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) bytes_transferred = 0; /* Make sure all dirty bits are set */ -for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) { +for (addr = 0; addr < ram_list.last_offset; addr += TARGET_PAGE_SIZE) { if (!cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) { cpu_physical_memory_set_dirty(addr); } @@ -200,7 +200,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) /* Enable dirty memory tracking */ cpu_physical_memory_set_dirty_tracking(1); -qemu_put_be64(f, last_ram_offset | RAM_SAVE_FLAG_MEM_SIZE); +qemu_put_be64(f, ram_list.last_offset | RAM_SAVE_FLAG_MEM_SIZE); } bytes_transferred_last = bytes_transferred; @@ -259,7 +259,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) addr &= TARGET_PAGE_MASK; if (flags & RAM_SAVE_FLAG_MEM_SIZE) { -if (addr != last_ram_offset) { +if (addr != ram_list.last_offset) { return -EINVAL; } } diff --git a/cpu-all.h b/cpu-all.h index 77eaf85..e31c2de 100644 --- a/cpu-all.h +++ b/cpu-all.h @@ -859,9 +859,21 @@ target_phys_addr_t cpu_get_phys_page_debug(CPUState *env, target_ulong addr); /* memory API */ extern int phys_ram_fd; -extern uint8_t *phys_ram_dirty; extern ram_addr_t ram_size; -extern ram_addr_t last_ram_offset; + +typedef struct RAMBlock { +uint8_t *host; +ram_addr_t offset; +ram_addr_t length; +QLIST_ENTRY(RAMBlock) next; +} RAMBlock; + +typedef struct RAMList { +uint8_t *phys_dirty; +ram_addr_t last_offset; +QLIST_HEAD(ram, RAMBlock) blocks; +} RAMList; +extern RAMList ram_list; extern const char *mem_path; extern int mem_prealloc; @@ -891,29 +903,29 @@ extern int mem_prealloc; /* read dirty bit (return 0 or 1) */ static inline int cpu_physical_memory_is_dirty(ram_addr_t addr) { -return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff; +return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] == 0xff; } static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr) { -return phys_ram_dirty[addr >> TARGET_PAGE_BITS]; +return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS]; } static inline int cpu_physical_memory_get_dirty(ram_addr_t addr, int dirty_flags) { -return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags; +return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags; } static inline void cpu_physical_memory_set_dirty(ram_addr_t addr) { -phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff; +ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] = 0xff; } static inline int cpu_physical_memory_set_dirty_flags(ram_addr_t addr, int dirty_flags) { -return phys_ram_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags; +return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags; } static
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
On Mon, 2010-06-14 at 18:00 +0200, Jan Kiszka wrote: Alex Williamson wrote: On Mon, 2010-06-14 at 14:09 +0100, Paul Brook wrote: /main-system-bus/pci.0,addr=09.0/virtio-blk-pci There's a device missing between the main system bus and the pci bus. Should be something like: /main-system-bus/piix4-pcihost/pci.0/_09.0 Ok, I can easily come up with: /System/main-system-bus/i440FX-pcihost/PCI/pci.0/_09.0/virtio-blk-pci/virtio-blk First two elements are redundant, '/' already stands for the main system bus. Ok, we can start printing after the system bus. Then I don't get what last element expresses (the target device is virtio-blk-pci). Is this due to some vmstate layout? But that should not be part into qdev paths (maybe I'm missing your use case). Sorry, I wasn't clear about that. My printf is in the savevm code, after it's already appended the idstr passed in from the device. The device path in the example above ends at virtio-blk-pci. That's the dev-info-name of the device passed into this function. And instead of introducing another hierarchy level with the bus address, I would also prefer to add this as prefix or suffix to the device name, e.g. driver.busaddr. That's what I had started with. The first post in this thread has pci.0,addr=09.0 as a single hierarchy level. The addr= may be unnecessary there, but I also prefer something along those lines. For PCI it'd make sense to have name:addr, which comes out to pci.0:09.0. (Maybe rather than flagging properties as being relevant to the path and printing them generically, we should extract specific properties based on the bus type.) For busses that don't have a consistent addressing scheme then some sort of instance ID is unavoidable. I guess it may be possible to invent something based on other device properties (e.g. address of the first IO port/memory region). Instance IDs aren't always bad, we just run into trouble with hotplug and maybe creating unique ramblock names. 
But, such busses typically don't support hotplug and don't have multiple instances of the same device on the bus, so I don't expect us to hit many issues there as long as we can get a stable address path. Thanks, If stable instance numbers are required, we could simply keep them in DeviceState and search for the smallest free one on additions. Actually, I'm more in favor of this direction than including the bus address. That way we could keep a canonical format across all buses and do not have to provide possibly complex ID generation rules for each of them. I started down that path, but it still breaks for hotplug. If we start a VM with two e1000 NICs, then remove the first, we can no longer migrate because the target can't represent having a single e1000 with a non-zero instance ID. BTW, the problem isn't truly solved by printing paths. We also need to parse them. It would be counterproductive if such paths ever see the light of day (ie. leak outside vmstate) and a user tries to hack it into the monitor or pass it on the command line. With my series, I'm currently able to process paths like this one: /i440FX-pcihost.0/pci.0/PIIX4_PM.0/i2c/smbus-eeprom.6 Yes, these are only intended for internal use, but we should get them to sync up as much as possible. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Graphics PCI Passthrough support...
All, Are there any plans for PCI Passthrough support for Graphics within KVM, similar to current Xen support? Regards, Scott Parker
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
Alex Williamson wrote: On Mon, 2010-06-14 at 18:00 +0200, Jan Kiszka wrote: Alex Williamson wrote: On Mon, 2010-06-14 at 14:09 +0100, Paul Brook wrote: /main-system-bus/pci.0,addr=09.0/virtio-blk-pci There's a device missing between the main system bus and the pci bus. Should be something like: /main-system-bus/piix4-pcihost/pci.0/_09.0 Ok, I can easily come up with: /System/main-system-bus/i440FX-pcihost/PCI/pci.0/_09.0/virtio-blk-pci/virtio-blk First two elements are redundant, '/' already stands for the main system bus. Ok, we can start printing after the system bus. Then I don't get what last element expresses (the target device is virtio-blk-pci). Is this due to some vmstate layout? But that should not be part into qdev paths (maybe I'm missing your use case). Sorry, I wasn't clear about that. My printf is in the savevm code, after it's already appended the idstr passed in from the device. The device path in the example above ends at virtio-blk-pci. That's the dev-info-name of the device passed into this function. And instead of introducing another hierarchy level with the bus address, I would also prefer to add this as prefix or suffix to the device name, e.g. driver.busaddr. That's what I had started with. The first post in this thread has pci.0,addr=09.0 as a single hierarchy level. The addr= may be unnecessary there, but I also prefer something along those lines. For PCI it'd make sense to have name:addr, which comes out to pci.0:09.0. (Maybe rather than flagging properties as being relevant to the path and printing them generically, we should extract specific properties based on the bus type.) Not bus.addr, driver.addr. We only have one PCI bus here, not as many as there are slots on that bus. For busses that don't have a consistent addressing scheme then some sort of instance ID is unavoidable. I guess it may be possible to invent something based on other device properties (e.g. address of the first IO port/memory region). 
Instance IDs aren't always bad, we just run into trouble with hotplug and maybe creating unique ramblock names. But, such busses typically don't support hotplug and don't have multiple instances of the same device on the bus, so I don't expect us to hit many issues there as long as we can get a stable address path. Thanks, If stable instance numbers are required, we could simply keep them in DeviceState and search for the smallest free one on additions. Actually, I'm more in favor of this direction than including the bus address. That way we could keep a canonical format across all buses and do not have to provide possibly complex ID generation rules for each of them. I started down that path, but it still breaks for hotplug. If we start a VM with two e1000 NICs, then remove the first, we can no longer migrate because the target can't represent having a single e1000 with a non-zero instance ID. That's indeed a good point. Still, I'm worried about having to define numbering schemes for all the buses qemu supports. Maybe we can run a mixture: address-based for hotplug-capable buses (they tend to be cooperative in this regard) and simple instance numbers for the rest, likely the majority. BTW, the problem isn't truly solved by printing paths. We also need to parse them. It would be counterproductive if such paths ever see the light of day (ie. leak outside vmstate) and a user tries to hack it into the monitor or pass it on the command line. With my series, I'm currently able to process paths like this one: /i440FX-pcihost.0/pci.0/PIIX4_PM.0/i2c/smbus-eeprom.6 Yes, these are only intended for internal use, but we should get them to sync up as much as possible. Thanks, Unless there is a good reason, the match should be 100%.
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
[KVM-AUTOTEST PATCH v2 1/3] KVM test: add kvm_monitor.py, an interface to QEMU monitors
This module should replace vm.send_monitor_cmd(). Instead of connecting to the monitor each time a command is issued, this module maintains a continuous connection to the monitor. It disconnects when a test terminates and reconnects as soon as the next test begins (upon unpickling). It currently contains only an interface to the human monitor. A QMP interface will be introduced in a future patch.

Changes from v1:
- Add name parameter to __init__()
- Remove help() method
- Rename help attribute to _help_str to indicate private use
- Rename lock to _lock
- Rename socket to _socket

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_monitor.py |  354 +++
 1 files changed, 354 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/kvm_monitor.py

diff --git a/client/tests/kvm/kvm_monitor.py b/client/tests/kvm/kvm_monitor.py
new file mode 100644
index 000..27045a4
--- /dev/null
+++ b/client/tests/kvm/kvm_monitor.py
@@ -0,0 +1,354 @@
+"""
+Interfaces to the QEMU monitor.
+
+@copyright: 2008-2010 Red Hat Inc.
+"""
+
+import socket, time, threading, logging
+import kvm_utils
+
+
+class MonitorError(Exception):
+    pass
+
+
+class MonitorConnectError(MonitorError):
+    pass
+
+
+class MonitorSendError(MonitorError):
+    pass
+
+
+class MonitorLockError(MonitorError):
+    pass
+
+
+class MonitorProtocolError(MonitorError):
+    pass
+
+
+class Monitor:
+    """
+    Common code for monitor classes.
+    """
+
+    def __init__(self, name, filename):
+        """
+        Initialize the instance.
+
+        @param name: Monitor identifier (a string)
+        @param filename: Monitor socket filename
+        @raise MonitorConnectError: Raised if the connection fails
+        """
+        self.name = name
+        self.filename = filename
+        self._lock = threading.RLock()
+        self._socket = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        self._socket.setblocking(False)
+
+        try:
+            self._socket.connect(filename)
+        except socket.error:
+            raise MonitorConnectError("Could not connect to monitor socket")
+
+
+    def __del__(self):
+        # Automatically close the connection when the instance is garbage
+        # collected
+        try:
+            self._socket.shutdown(socket.SHUT_RDWR)
+        except socket.error:
+            pass
+        self._socket.close()
+
+
+    # The following two functions are defined to make sure the state is set
+    # exclusively by the constructor call as specified in __getinitargs__().
+
+    def __getstate__(self):
+        pass
+
+
+    def __setstate__(self, state):
+        pass
+
+
+    def __getinitargs__(self):
+        # Save some information when pickling -- will be passed to the
+        # constructor upon unpickling
+        return self.name, self.filename, True
+
+
+    def _acquire_lock(self, timeout=20):
+        end_time = time.time() + timeout
+        while time.time() < end_time:
+            if self._lock.acquire(False):
+                return True
+            time.sleep(0.05)
+        return False
+
+
+    def _recvall(self):
+        s = ""
+        while True:
+            try:
+                data = self._socket.recv(1024)
+            except socket.error:
+                break
+            if not data:
+                break
+            s += data
+        return s
+
+
+class HumanMonitor(Monitor):
+    """
+    Wraps human monitor commands.
+    """
+
+    def __init__(self, name, filename, suppress_exceptions=False):
+        """
+        Connect to the monitor socket and find the (qemu) prompt.
+
+        @param name: Monitor identifier (a string)
+        @param filename: Monitor socket filename
+        @raise MonitorConnectError: Raised if the connection fails and
+                suppress_exceptions is False
+        @raise MonitorProtocolError: Raised if the initial (qemu) prompt isn't
+                found and suppress_exceptions is False
+        @note: Other exceptions may be raised. See _get_command_output's
+                docstring.
+        """
+        try:
+            Monitor.__init__(self, name, filename)
+
+            self.protocol = "human"
+
+            # Find the initial (qemu) prompt
+            s, o = self._read_up_to_qemu_prompt(20)
+            if not s:
+                raise MonitorProtocolError("Could not find (qemu) prompt "
+                                           "after connecting to monitor. "
+                                           "Output so far: %r" % o)
+
+            # Save the output of 'help' for future use
+            self._help_str = self._get_command_output("help")
+
+        except MonitorError, e:
+            if suppress_exceptions:
+                logging.warn(e)
+            else:
+                raise
+
+
+    # Private methods
+
+    def _read_up_to_qemu_prompt(self, timeout=20):
+        o = ""
+        end_time = time.time() + timeout
+        while time.time() < end_time:
+            try:
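The `__getstate__`/`__setstate__`/`__getinitargs__` trio above relies on Python 2 classic-class pickling: only the constructor arguments are saved, so unpickling re-runs `__init__()` and thereby reconnects to the monitor socket. A minimal standalone model of the same reconnect-on-unpickle idea, using `__reduce__` (which also works on modern Python); the class name and the socket stand-in are invented for illustration:

```python
import pickle

class ReconnectingMonitor:
    def __init__(self, name, filename):
        self.name = name
        self.filename = filename
        # Stand-in for the real AF_UNIX connection established in __init__()
        self._socket = object()

    def __reduce__(self):
        # Pickle only the constructor arguments; unpickling calls
        # ReconnectingMonitor(name, filename) again, which in the real
        # class re-establishes the monitor connection.
        return (self.__class__, (self.name, self.filename))

m = ReconnectingMonitor("monitor1", "/tmp/monitor-sock")
m2 = pickle.loads(pickle.dumps(m))
print(m2.name, m2.filename)
```

The connection state itself (here `_socket`) is deliberately never serialized; it is always rebuilt by the constructor on the unpickling side.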
[KVM-AUTOTEST PATCH v2 2/3] KVM test: use new monitor interface
- Add new monitor definition syntax that allows definition of multiple monitors. Monitors are now defined like other objects in the config file:

    monitors = MyMonitor SomeOtherMonitor YetAnotherMonitor  # defines 3 monitors
    monitor_type = human                  # default for all monitors
    monitor_type_SomeOtherMonitor = qmp   # applies only to SomeOtherMonitor
    monitor_type_YetAnotherMonitor = qmp  # applies only to YetAnotherMonitor
    main_monitor = MyMonitor              # defines the main monitor to use
                                          # in the test

- Use the new syntax in tests_base.cfg.sample.
- Establish monitor connections using kvm_monitor in VM.create(). Store all monitors in self.monitors. Store the main monitor in self.monitor.
- Replace calls to send_monitor_cmd() with appropriate calls to methods of self.monitor (the main monitor).
- For now, ignore the parameter screendump_verbose because currently monitor commands are always silent (when successful).

Changes from v1:
- Turn VM.monitor into a property

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_preprocessing.py              |   33 ++-
 client/tests/kvm/kvm_test_utils.py                 |   35 +--
 client/tests/kvm/kvm_vm.py                         |  236 +---
 client/tests/kvm/tests/balloon_check.py            |   12 +-
 client/tests/kvm/tests/boot_savevm.py              |   41 ++--
 client/tests/kvm/tests/ksm_overcommit.py           |    8 +-
 client/tests/kvm/tests/pci_hotplug.py              |   13 +-
 client/tests/kvm/tests/physical_resources_check.py |   40 ++--
 client/tests/kvm/tests/shutdown.py                 |    2 +-
 client/tests/kvm/tests/stepmaker.py                |   44 ++--
 client/tests/kvm/tests/steps.py                    |   12 +-
 client/tests/kvm/tests_base.cfg.sample             |    7 +-
 12 files changed, 241 insertions(+), 242 deletions(-)

diff --git a/client/tests/kvm/kvm_preprocessing.py b/client/tests/kvm/kvm_preprocessing.py
index 76c8268..ee3e9b2 100644
--- a/client/tests/kvm/kvm_preprocessing.py
+++ b/client/tests/kvm/kvm_preprocessing.py
@@ -1,7 +1,7 @@
 import sys, os, time, commands, re, logging, signal, glob, threading, shutil
 from autotest_lib.client.bin import test, utils
 from autotest_lib.client.common_lib
 import error
-import kvm_vm, kvm_utils, kvm_subprocess, ppm_utils
+import kvm_vm, kvm_utils, kvm_subprocess, kvm_monitor, ppm_utils
 try:
     import PIL.Image
 except ImportError:
@@ -83,7 +83,11 @@ def preprocess_vm(test, params, env, name):
         raise error.TestError("Could not start VM")
 
     scrdump_filename = os.path.join(test.debugdir, "pre_%s.ppm" % name)
-    vm.send_monitor_cmd("screendump %s" % scrdump_filename)
+    try:
+        if vm.monitor:
+            vm.monitor.screendump(scrdump_filename)
+    except kvm_monitor.MonitorError, e:
+        logging.warn(e)
 
 
 def postprocess_image(test, params):
@@ -117,7 +121,11 @@ def postprocess_vm(test, params, env, name):
         return
 
     scrdump_filename = os.path.join(test.debugdir, "post_%s.ppm" % name)
-    vm.send_monitor_cmd("screendump %s" % scrdump_filename)
+    try:
+        if vm.monitor:
+            vm.monitor.screendump(scrdump_filename)
+    except kvm_monitor.MonitorError, e:
+        logging.warn(e)
 
     if params.get("kill_vm") == "yes":
         kill_vm_timeout = float(params.get("kill_vm_timeout", 0))
@@ -356,8 +364,9 @@ def postprocess(test, params, env):
     for vm in kvm_utils.env_get_all_vms(env):
         if not vm.is_dead():
             logging.info("VM '%s' is alive.", vm.name)
-            logging.info("The monitor unix socket of '%s' is: %s",
-                         vm.name, vm.monitor_file_name)
+            for m in vm.monitors:
+                logging.info("'%s' has a %s monitor unix socket at: %s",
+                             vm.name, m.protocol, m.filename)
             logging.info("The command line used to start '%s' was:\n%s",
                          vm.name, vm.make_qemu_command())
     raise error.JobError("Abort requested (%s)" % exc_string)
@@ -403,10 +412,6 @@ def _take_screendumps(test, params, env):
                                  kvm_utils.generate_random_string(6))
     delay = float(params.get("screendump_delay", 5))
     quality = int(params.get("screendump_quality", 30))
-    if params.get("screendump_verbose") == 'yes':
-        screendump_verbose = True
-    else:
-        screendump_verbose = False
 
     cache = {}
 
@@ -414,11 +419,11 @@ def _take_screendumps(test, params, env):
         for vm in kvm_utils.env_get_all_vms(env):
             if vm.is_dead():
                 continue
-            if screendump_verbose:
-                vm.send_monitor_cmd("screendump %s" % temp_filename)
-            else:
-                vm.send_monitor_cmd("screendump %s" % temp_filename,
-                                    verbose=False)
+            try:
+
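The per-monitor override scheme introduced by this patch (monitor_type_SomeOtherMonitor beats the plain monitor_type default) boils down to a suffixed-key lookup. A standalone sketch of that rule with an invented helper name (the actual framework resolves parameters through its own config machinery):

```python
def object_param(params, name, obj_name):
    """Return params["<name>_<obj_name>"] if present, else params["<name>"]."""
    return params.get("%s_%s" % (name, obj_name), params.get(name))

params = {
    "monitors": "MyMonitor SomeOtherMonitor YetAnotherMonitor",
    "monitor_type": "human",                 # default for all monitors
    "monitor_type_SomeOtherMonitor": "qmp",  # per-monitor override
    "monitor_type_YetAnotherMonitor": "qmp",
}

for mon in params["monitors"].split():
    print(mon, object_param(params, "monitor_type", mon))
```

MyMonitor falls back to the "human" default, while the two overridden monitors resolve to "qmp" -- the same resolution order the config example above describes.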
[KVM-AUTOTEST PATCH v2 3/3] KVM test: kvm_monitor.py: add QMP interface
An initial QMP client implementation. It should be fully functional and supports asynchronous events. However, most tests must be modified to support it, because it returns output in a different format from the human monitor (the human monitor returns strings and the QMP one returns dicts or lists).

To enable QMP, set main_monitor to a monitor whose monitor_type is "qmp". For example (a single QMP monitor):

    monitors = monitor1
    monitor_type_monitor1 = qmp
    main_monitor = monitor1

Another example (multiple monitors, both human and QMP):

    monitors = MyMonitor SomeOtherMonitor YetAnotherMonitor  # defines 3 monitors
    monitor_type = human                  # default for all monitors
    monitor_type_SomeOtherMonitor = qmp   # applies only to SomeOtherMonitor
    monitor_type_YetAnotherMonitor = qmp  # applies only to YetAnotherMonitor
    main_monitor = SomeOtherMonitor       # the main monitor is a QMP one, so
                                          # the test will use QMP

Note: Monitor methods now raise exceptions such as MonitorLockError and QMPCmdError. If this turns out to be a bad idea, it shouldn't be hard to revert to the old convention of returning a (status, output) tuple.

Changes from v1:
- Upon connection make sure the QMP greeting is received (otherwise raise an exception)
- Upon connection make sure the json module is available
- Rename the events attribute to _events to indicate private use
- Use _lock instead of lock
- Use _socket instead of socket

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_monitor.py |  305 +++
 client/tests/kvm/kvm_vm.py      |    8 +-
 2 files changed, 310 insertions(+), 3 deletions(-)

diff --git a/client/tests/kvm/kvm_monitor.py b/client/tests/kvm/kvm_monitor.py
index 27045a4..12cf773 100644
--- a/client/tests/kvm/kvm_monitor.py
+++ b/client/tests/kvm/kvm_monitor.py
@@ -6,6 +6,11 @@ Interfaces to the QEMU monitor.
 import socket, time, threading, logging
 import kvm_utils
 
+try:
+    import json
+except ImportError:
+    logging.warning("Could not import json module. "
+                    "QMP monitor functionality disabled.")
 
 class MonitorError(Exception):
@@ -28,6 +33,14 @@ class MonitorProtocolError(MonitorError):
     pass
 
 
+class MonitorNotSupportedError(MonitorError):
+    pass
+
+
+class QMPCmdError(MonitorError):
+    pass
+
+
 class Monitor:
     """
     Common code for monitor classes.
@@ -352,3 +365,295 @@ class HumanMonitor(Monitor):
         @return: The command's output
         """
         return self._get_command_output("mouse_button %d" % state)
+
+
+class QMPMonitor(Monitor):
+    """
+    Wraps QMP monitor commands.
+    """
+
+    def __init__(self, name, filename, suppress_exceptions=False):
+        """
+        Connect to the monitor socket, read the greeting message and issue the
+        qmp_capabilities command. Also make sure the json module is available.
+
+        @param name: Monitor identifier (a string)
+        @param filename: Monitor socket filename
+        @raise MonitorConnectError: Raised if the connection fails and
+                suppress_exceptions is False
+        @raise MonitorProtocolError: Raised if no QMP greeting message is
+                received and suppress_exceptions is False
+        @raise MonitorNotSupportedError: Raised if json isn't available and
+                suppress_exceptions is False
+        @note: Other exceptions may be raised if the qmp_capabilities command
+                fails. See _get_command_output's docstring.
+        """
+        try:
+            Monitor.__init__(self, name, filename)
+
+            self.protocol = "qmp"
+            self._greeting = None
+            self._events = []
+
+            # Make sure json is available
+            try:
+                json
+            except NameError:
+                raise MonitorNotSupportedError("QMP requires the json module "
+                                               "(Python 2.6 and up)")
+
+            # Read greeting message
+            end_time = time.time() + 20
+            while time.time() < end_time:
+                for obj in self._read_objects():
+                    if "QMP" in obj:
+                        self._greeting = obj["QMP"]
+                        break
+                if self._greeting:
+                    break
+                time.sleep(0.1)
+            else:
+                raise MonitorProtocolError("No QMP greeting message received")
+
+            # Issue qmp_capabilities
+            self._get_command_output("qmp_capabilities")
+
+        except MonitorError, e:
+            if suppress_exceptions:
+                logging.warn(e)
+            else:
+                raise
+
+
+    # Private methods
+
+    def _build_cmd(self, cmd, args=None):
+        obj = {"execute": cmd}
+        if args:
+            obj["arguments"] = args
+        return obj
+
+
+    def _read_objects(self, timeout=5):
+        """
+        Read
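The wire format that `_build_cmd()` produces, and the greeting check in `__init__()`, follow the QMP convention: commands are JSON objects with an "execute" key (plus optional "arguments"), and a fresh connection announces itself with an object wrapped in a "QMP" key. A standalone sketch of both sides (the greeting literal below is illustrative, not a captured server response):

```python
import json

def build_cmd(cmd, args=None):
    # Same shape as the patch's _build_cmd(): {"execute": ..., "arguments": ...}
    obj = {"execute": cmd}
    if args:
        obj["arguments"] = args
    return json.dumps(obj)

# A QMP server greets new connections with an object under the "QMP" key;
# this is what the greeting-detection loop in __init__() looks for.
greeting = json.loads('{"QMP": {"version": {}, "capabilities": []}}')
assert "QMP" in greeting

print(build_cmd("qmp_capabilities"))
print(build_cmd("eject", {"device": "ide1-cd0"}))
```

Because everything on the wire is JSON, tests get back dicts and lists after parsing, which is exactly the output-format difference from the human monitor that the patch description warns about.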
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 08:12:56]:

On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:

1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added.

You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure. If unmapped page cache is the easiest thing to evict, then it should be the first thing that goes when a balloon request comes in, which is the case this patch is trying to handle. If it isn't the easiest thing to evict, then we _shouldn't_ evict it.

Like I said earlier, a lot of that works correctly as you said, but it is also an idealization. If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first?

--
Three Cheers,
Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:

If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first?

I agree with this in theory. But the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. Unmapped page cache may be the best proxy that we have at the moment for "easy to recreate", but I think it's still too poor a match to make these patches useful.

--
Dave
[PATCH v2 0/3]: Fixes to IRQ routing
As we've discussed previously, here is a series of patches to fix some of the IRQ routing issues we have in KVM. With this series in place I was able to successfully kdump a RHEL-5 64-bit, and RHEL-6 32- and 64-bit guest on CPUs other than the BSP. RHEL-5 32-bit kdump still does not work; it gets stuck on "Checking 'hlt' instruction". However, it does that both before and after this series, so there is something else going on there that I still have to debug.

I also need to change the kvm_migrate_pit_timer function to migrate the timer over to the last CPU that handled the timer interrupt, on the theory that that particular CPU is likely to handle the timer interrupt again in the near future. However, this is an optimization and shouldn't delay the inclusion of the rest of the series for correctness.

Changes since RFC:
- Changed ps->inject_lock from raw_spinlock_t to spinlock_t
- Fixed up some formatting issues
- Changed to have one PIT workqueue per-guest
- Remember to cancel_work_sync when destroying the PIT

Changes since v1:
- Call cancel_work_sync everywhere we call hrtimer_cancel
- Bring back the reinjection logic
- Fix up formatting issues from checkpatch
[PATCH v2 1/3] Introduce a workqueue to deliver PIT timer interrupts.
We really want to kvm_set_irq during the hrtimer callback, but that is risky because that is during interrupt context. Instead, offload the work to a workqueue, which is a bit safer and should provide most of the same functionality.

Signed-off-by: Chris Lalancette clala...@redhat.com
---
 arch/x86/kvm/i8254.c |  125 -
 arch/x86/kvm/i8254.h |    4 +-
 arch/x86/kvm/irq.c   |    1 -
 3 files changed, 74 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 188d827..3bed8ac 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -34,6 +34,7 @@
 
 #include <linux/kvm_host.h>
 #include <linux/slab.h>
+#include <linux/workqueue.h>
 
 #include "irq.h"
 #include "i8254.h"
@@ -244,11 +245,11 @@ static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
 {
 	struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
 						 irq_ack_notifier);
-	raw_spin_lock(&ps->inject_lock);
+	spin_lock(&ps->inject_lock);
 	if (atomic_dec_return(&ps->pit_timer.pending) < 0)
 		atomic_inc(&ps->pit_timer.pending);
 	ps->irq_ack = 1;
-	raw_spin_unlock(&ps->inject_lock);
+	spin_unlock(&ps->inject_lock);
 }
 
 void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
@@ -267,7 +268,8 @@ void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
 static void destroy_pit_timer(struct kvm_timer *pt)
 {
 	pr_debug("execute del timer!\n");
-	hrtimer_cancel(&pt->timer);
+	if (hrtimer_cancel(&pt->timer))
+		cancel_work_sync(&pt->kvm->arch.vpit->expired);
 }
 
 static bool kpit_is_periodic(struct kvm_timer *ktimer)
@@ -281,6 +283,58 @@ static struct kvm_timer_ops kpit_ops = {
 	.is_periodic = kpit_is_periodic,
 };
 
+static void pit_do_work(struct work_struct *work)
+{
+	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
+	struct kvm *kvm = pit->kvm;
+	struct kvm_vcpu *vcpu;
+	int i;
+	struct kvm_kpit_state *ps = &pit->pit_state;
+	int inject = 0;
+
+	/* Try to inject pending interrupts when
+	 * last one has been acked.
+	 */
+	spin_lock(&ps->inject_lock);
+	if (ps->irq_ack) {
+		ps->irq_ack = 0;
+		inject = 1;
+	}
+	spin_unlock(&ps->inject_lock);
+	if (inject) {
+		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
+		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
+
+		/*
+		 * Provides NMI watchdog support via Virtual Wire mode.
+		 * The route is: PIT -> PIC -> LVT0 in NMI mode.
+		 *
+		 * Note: Our Virtual Wire implementation is simplified, only
+		 * propagating PIT interrupts to all VCPUs when they have set
+		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
+		 * VCPU0, and only if its LVT0 is in EXTINT mode.
+		 */
+		if (kvm->arch.vapics_in_nmi_mode > 0)
+			kvm_for_each_vcpu(i, vcpu, kvm)
+				kvm_apic_nmi_wd_deliver(vcpu);
+	}
+}
+
+static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
+{
+	struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+	struct kvm_pit *pt = ktimer->kvm->arch.vpit;
+
+	if (ktimer->reinject)
+		queue_work(pt->wq, &pt->expired);
+
+	if (ktimer->t_ops->is_periodic(ktimer)) {
+		hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
+		return HRTIMER_RESTART;
+	} else
+		return HRTIMER_NORESTART;
+}
+
 static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
 {
 	struct kvm_timer *pt = &ps->pit_timer;
@@ -291,14 +345,14 @@ static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
 	pr_debug("create pit timer, interval is %llu nsec\n", interval);
 
 	/* TODO The new value only affected after the retriggered */
-	hrtimer_cancel(&pt->timer);
+	if (hrtimer_cancel(&pt->timer))
+		cancel_work_sync(&pt->kvm->arch.vpit->expired);
 	pt->period = interval;
 	ps->is_periodic = is_period;
 
-	pt->timer.function = kvm_timer_fn;
+	pt->timer.function = pit_timer_fn;
 	pt->t_ops = &kpit_ops;
 	pt->kvm = ps->pit->kvm;
-	pt->vcpu = pt->kvm->bsp_vcpu;
 
 	atomic_set(&pt->pending, 0);
 	ps->irq_ack = 1;
@@ -626,7 +680,14 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
 
 	mutex_init(&pit->pit_state.lock);
 	mutex_lock(&pit->pit_state.lock);
-	raw_spin_lock_init(&pit->pit_state.inject_lock);
+	spin_lock_init(&pit->pit_state.inject_lock);
+
+	pit->wq = create_singlethread_workqueue("kvm-pit-wq");
+	if (!pit->wq) {
+		kfree(pit);
+		return NULL;
+	}
+	INIT_WORK(&pit->expired, pit_do_work);
 
 	kvm->arch.vpit =
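The interplay between `pending`, `irq_ack`, and the deferred work handler can be hard to follow in diff form. The following is a deliberately loose, single-threaded Python model of the ack-gating idea only (no locking, no real workqueue; the class and method names are invented): a tick injects only if the previous interrupt was acked, and unacked ticks accumulate in a pending counter that drains as the guest acks.

```python
class PitModel:
    """Toy model of ack-gated PIT injection; not the kernel code."""

    def __init__(self):
        self.pending = 0     # ticks not yet accounted for (pit_timer.pending)
        self.irq_ack = True  # guest acked the last injection? (ps->irq_ack)
        self.injected = 0    # total interrupts delivered

    def timer_fires(self):
        # pit_timer_fn queues pit_do_work; here we just run it inline.
        self.pending += 1
        self._try_inject()

    def _try_inject(self):
        # pit_do_work: only inject when the last interrupt has been acked.
        if self.irq_ack and self.pending > 0:
            self.irq_ack = False
            self.injected += 1

    def guest_acks(self):
        # kvm_pit_ack_irq: consume one pending tick, allow the next injection.
        self.pending = max(self.pending - 1, 0)
        self.irq_ack = True
        self._try_inject()   # reinject a missed tick, if any

pit = PitModel()
pit.timer_fires(); pit.timer_fires(); pit.timer_fires()
print(pit.injected)  # only the first tick was injected; two are pending
pit.guest_acks()
print(pit.injected)  # the ack lets one pending tick be reinjected
```

In the kernel the "run inline" step is the part that moves into the workqueue: `pit_timer_fn` runs in hrtimer (interrupt) context and merely queues the work, while the `kvm_set_irq` calls happen later in `pit_do_work` in process context.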
[PATCH v2 2/3] Allow any LAPIC to accept PIC interrupts.
If the guest wants to accept timer interrupts on a CPU other than the BSP, we need to remove this gate.

Signed-off-by: Chris Lalancette clala...@redhat.com
---
 arch/x86/kvm/lapic.c |   12 +---
 1 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d8258a0..ee0f76c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1107,13 +1107,11 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
 	u32 lvt0 = apic_get_reg(vcpu->arch.apic, APIC_LVT0);
 	int r = 0;
 
-	if (kvm_vcpu_is_bsp(vcpu)) {
-		if (!apic_hw_enabled(vcpu->arch.apic))
-			r = 1;
-		if ((lvt0 & APIC_LVT_MASKED) == 0 &&
-		    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
-			r = 1;
-	}
+	if (!apic_hw_enabled(vcpu->arch.apic))
+		r = 1;
+	if ((lvt0 & APIC_LVT_MASKED) == 0 &&
+	    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
+		r = 1;
 
 	return r;
 }
-- 
1.6.6.1
[PATCH v2 3/3] In DM_LOWEST, only deliver interrupts to vcpus with enabled LAPIC's
Otherwise we might try to deliver a timer interrupt to a cpu that can't possibly handle it.

Signed-off-by: Chris Lalancette clala...@redhat.com
---
 virt/kvm/irq_comm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 52f412f..06cf61e 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -100,7 +100,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 			if (r < 0)
 				r = 0;
 			r += kvm_apic_set_irq(vcpu, irq);
-		} else {
+		} else if (kvm_lapic_enabled(vcpu)) {
 			if (!lowest)
 				lowest = vcpu;
 			else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
-- 
1.6.6.1
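The one-line fix above makes only vcpus with an enabled LAPIC candidates for lowest-priority delivery. A small standalone model of that selection (invented names; vcpus are illustrative dicts, not kernel structs):

```python
def pick_lowest_priority(vcpus):
    # Mirrors the DM_LOWEST arm after the fix: skip vcpus whose LAPIC is
    # disabled, then keep the vcpu with the lowest arbitration priority.
    lowest = None
    for vcpu in vcpus:
        if not vcpu["lapic_enabled"]:
            continue
        if lowest is None or vcpu["prio"] < lowest["prio"]:
            lowest = vcpu
    return lowest

vcpus = [
    {"id": 0, "lapic_enabled": False, "prio": 0},  # e.g. a kdump target CPU
    {"id": 1, "lapic_enabled": True,  "prio": 5},
    {"id": 2, "lapic_enabled": True,  "prio": 3},
]
print(pick_lowest_priority(vcpus)["id"])  # 2: lowest priority among enabled
```

Before the fix, the disabled vcpu 0 would still have been considered and could have "won" the arbitration, leaving the timer interrupt with a cpu that can't handle it.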
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 10:09:31]:

On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:

If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first?

I agree with this in theory. But the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. Unmapped page cache may be the best proxy that we have at the moment for "easy to recreate", but I think it's still too poor a match to make these patches useful.

That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated. The first patch today is again enabled by the host. Both of them are expected to be useful in the cache != none case. The data I have shows more details, including the performance and overhead.

--
Three Cheers,
Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 06:55 PM, Dave Hansen wrote:

On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote:

On 06/14/2010 06:33 PM, Dave Hansen wrote:

At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim?

Isn't the knob in this proposal the balloon?

AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. I think the patch was made on the following assumptions: 1. Guests will keep filling their memory with relatively worthless page cache that they don't really need. 2. When they do this, it hurts the overall system with no real gain for anyone. In the case of a ballooned guest, they _won't_ keep filling memory. The balloon will prevent them. So, I guess I was just going down the path of considering if this would be useful without ballooning in place. To me, it's really hard to justify _with_ ballooning in place.

There are two decisions that need to be made:
- how much memory a guest should be given
- given some guest memory, what's the best use for it

The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized, except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit. How the patch helps answer either question, I'm not sure. I don't think preferential dropping of unmapped page cache is the answer.

My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer.
btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs?

Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure.

If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 18:34:58]:

On 06/14/2010 06:12 PM, Dave Hansen wrote:

On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:

1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added.

You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure.

Depends. If you've evicted something that will be referenced soon, you're increasing system pressure.

I don't think slab pages care about being referenced soon; they are either allocated or freed. A page is just a storage unit for the data structure. A new one can be allocated on demand.

--
Three Cheers,
Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 19:34:00]:

On 06/14/2010 06:55 PM, Dave Hansen wrote:

On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote:

On 06/14/2010 06:33 PM, Dave Hansen wrote:

At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim?

Isn't the knob in this proposal the balloon?

AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. I think the patch was made on the following assumptions: 1. Guests will keep filling their memory with relatively worthless page cache that they don't really need. 2. When they do this, it hurts the overall system with no real gain for anyone. In the case of a ballooned guest, they _won't_ keep filling memory. The balloon will prevent them. So, I guess I was just going down the path of considering if this would be useful without ballooning in place. To me, it's really hard to justify _with_ ballooning in place.

There are two decisions that need to be made:
- how much memory a guest should be given
- given some guest memory, what's the best use for it

The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized, except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit.

I think there is another way of looking at it: given some free memory,
1. Can the guest run more applications or run faster?
2. Can the host potentially get this memory via ballooning or some other means to start newer guest instances?
I think the answer to 1 and 2 is yes.
How the patch helps answer either question, I'm not sure. I don't think preferential dropping of unmapped page cache is the answer.

Preferential dropping as selected by the host, which knows about the setup and whether there is duplication involved. While we use the term "preferential dropping", remember it is still via the LRU and we don't always succeed. It is a best-effort scenario (drop only if you can and the unmapped pages are not highly referenced).

My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer.

btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs?

Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure.

If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction.

There are two situations:
1. Voluntarily drop cache, if it was set up to do so (the host knows that it caches that information anyway).
2. Drop the cache on a special balloon option; again, the host knows it caches that very same information, so it prefers to free that up first.

--
Three Cheers,
Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Mon, 2010-06-14 at 19:34 +0300, Avi Kivity wrote:

Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure.

If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure.

I can think of quite a few places where this would be beneficial. Ballooning is dangerous; I've OOMed quite a few guests by over-ballooning them. Anything that's voluntary like this is safer than things imposed by the host, although you do trade off effectiveness. If all the guests do this, then it leaves that much more free memory on the host, which can be used flexibly for extra host page cache, new guests, etc. A system in this state, where everyone is proactively keeping their footprints down, is more likely to be able to handle load spikes. Reclaim is an expensive, costly activity, and this ensures that we don't have to do it when we're busy doing other things like handling load spikes. This was one of the concepts behind CMM2: reduce the overhead during peak periods.

It's also handy for planning. Guests exhibiting this behavior will _act_ as if they're under pressure. That's a good way to approximate how a guest will act when it _is_ under pressure.

If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction.

I think we're trying to consider things slightly outside of ballooning at this point. If ballooning were the end-all solution, I'm fairly sure Balbir wouldn't be looking at this stuff. Just trying to keep options open. :)

--
Dave
Re: [Qemu-devel] [PATCH 5/5] [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo
On Mon, Jun 14, 2010 at 9:44 AM, Nicholas A. Bellinger n...@linux-iscsi.org wrote: From: Nicholas Bellinger n...@linux-iscsi.org This patch adds initial support for using the Linux BSG interface with write/read vectored AIO as a QEMU backstore (SCSIDeviceInfo) with hw/scsi-bus.c compatible HBA emulation. Did I miss the docs? So far it has been tested with x86_64 host and guest using hw/megasas.c and TCM_Loop LLD Port LUNs. Because this path uses struct iovec for struct sg_io_v4-d[out,in]_xferp payloads, which currently requires a patch to linux/block/bsg.c:bsg_map_hdr() in order to setup the user - kernel iovecs. This also will only currently work with paired user/kernel (eg: 64bit user / 64bit kernel) because of different pointer sizes in struct iovec-iov_base. There are also two FIXMEs in hw/scsi-bsg.c:bsg_generic_initfn() related to extraction of SCSI LUN and device type values using BSG and required by QEMU-KVM. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org --- Makefile.objs | 2 +- hw/scsi-bsg.c | 588 + 2 files changed, 589 insertions(+), 1 deletions(-) create mode 100644 hw/scsi-bsg.c diff --git a/Makefile.objs b/Makefile.objs index 188d617..c4fcb72 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -197,7 +197,7 @@ hw-obj-$(CONFIG_IDE_CMD646) += ide/cmd646.o hw-obj-$(CONFIG_IDE_MACIO) += ide/macio.o # SCSI layer -hw-obj-y += scsi-disk.o scsi-generic.o +hw-obj-y += scsi-disk.o scsi-generic.o scsi-bsg.o Instead of '#ifdef __linux__' (which should be '#ifdef CONFIG_LINUX'), please compile the object only if CONFIG_LINUX is set, something like: hw-obj-$(CONFIG_LINUX) += scsi-bsg.o Please also check if this could be compiled in common-obj set. 
hw-obj-y += lsi53c895a.o megasas.o hw-obj-$(CONFIG_ESP) += esp.o diff --git a/hw/scsi-bsg.c b/hw/scsi-bsg.c new file mode 100644 index 000..fc76b76 --- /dev/null +++ b/hw/scsi-bsg.c @@ -0,0 +1,588 @@ +/* + * block layer implementation of the sg v4 interface for Linux hosts + * + * Copyright (c) 2010 Rising Tide Systems + * Written by Nicholas A. Bellinger n...@linux-iscsi.org + * + * Based on hw/scsi-generic code by Laurent Vivier, Paul Brook, and Fabrice Bellard + * + * This code is licenced under the LGPL. + */ + +#include qemu-common.h +#include qemu-error.h +#include block.h +#include scsi.h +#include dma.h +#include block/raw-posix-aio.h + +#ifdef __linux__ + +#define DEBUG_BSG +#undef DEBUG_BSG_IO +#undef DEBUG_BSG_MAP This should be //#define DEBUG_BSG //#define DEBUG_BSG_IO //#define DEBUG_BSG_MAP + +#ifdef DEBUG_BSG +#define DPRINTF(fmt, ...) \ +do { printf(scsi-bsg: fmt , ## __VA_ARGS__); } while (0) +#else +#define DPRINTF(fmt, ...) do {} while(0) +#endif + +#define BADF(fmt, ...) \ +do { fprintf(stderr, scsi-bsg: fmt , ## __VA_ARGS__); } while (0) + +#include stdio.h +#include sys/types.h +#include sys/stat.h +#include sys/epoll.h +#include unistd.h +#include scsi/sg.h +#include linux/bsg.h +#include scsi-defs.h + +#define SCSI_SENSE_BUF_SIZE 96 + +#define SG_ERR_DRIVER_TIMEOUT 0x06 +#define SG_ERR_DRIVER_SENSE 0x08 + +#ifndef MAX_UINT +#define MAX_UINT ((unsigned int)-1) The standard macro is UINT_MAX. 
+#endif + +typedef struct SCSIBSGState SCSIBSGState; + +typedef struct SCSIBSGReq { + SCSIRequest req; + uint8_t *buf; + int buflen; + QEMUIOVector iov; + QEMUIOVector aio_iov; + struct sg_io_v4 bsg_hdr; +} SCSIBSGReq; + +struct SCSIBSGState { + SCSIDevice qdev; + BlockDriverState *bs; + int lun; + int driver_status; + uint8_t sensebuf[SCSI_SENSE_BUF_SIZE]; + uint8_t senselen; +}; + +static int bsg_read(int fd, void *p_read, int to_read) +{ + int err; + + while (to_read 0) { + err = read(fd, p_read, to_read); + if (err = 0) { + to_read -= err; + p_read += err; + } else if (errno == EINTR) + continue; + else { + printf(bsg device %d read failed, errno: %d\n, + fd, errno); DPRINTF? + return errno; + } + } + return 0; +} + +static SCSIBSGReq *bsg_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun) +{ + SCSIRequest *req; + SCSIBSGReq *r; + + req = scsi_req_alloc(sizeof(SCSIBSGReq), d, tag, lun); + r = DO_UPCAST(SCSIBSGReq, req, req); + qemu_iovec_init(r-iov, 1); + qemu_iovec_init(r-aio_iov, 1); + return r; +} + +static void bsg_remove_request(SCSIBSGReq *r) +{ + qemu_free(r-buf); + qemu_iovec_destroy(r-iov); + qemu_iovec_destroy(r-aio_iov); + scsi_req_free(r-req); +} + +static void bsg_command_complete(void *opaque, int ret) +{ + SCSIBSGReq *r = (SCSIBSGReq *)opaque; Useless cast in C. + SCSIBSGState *s = DO_UPCAST(SCSIBSGState, qdev, r-req.dev); + + s-driver_status =
[ kvm-Bugs-2933400 ] virtio-blk io errors / data corruption on raw drives > 1 TB
Bugs item #2933400, was opened at 2010-01-16 15:35 Message generated for change (Comment added) made by gyver You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2933400group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 9 Private: No Submitted By: MaSc82 (masc82) Assigned to: Nobody/Anonymous (nobody) Summary: virtio-blk io errors / data corruption on raw drives 1 TB Initial Comment: When attaching raw drives 1 TB, buffer io errors will most likely occur, filesystems get corrupted. Processes (like mkfs.ext4) might freeze completely when filesystems are created on the guest. Here's a typical log excerpt when using mkfs.ext4 on a 1.5 TB drive on a Ubuntu 9.10 guest: (kern.log) Jan 15 20:40:44 q kernel: [ 677.076602] Buffer I/O error on device vde, logical block 366283764 Jan 15 20:40:44 q kernel: [ 677.076607] Buffer I/O error on device vde, logical block 366283765 Jan 15 20:40:44 q kernel: [ 677.076611] Buffer I/O error on device vde, logical block 366283766 Jan 15 20:40:44 q kernel: [ 677.076616] Buffer I/O error on device vde, logical block 366283767 Jan 15 20:40:44 q kernel: [ 677.076621] Buffer I/O error on device vde, logical block 366283768 Jan 15 20:40:44 q kernel: [ 677.076626] Buffer I/O error on device vde, logical block 366283769 (messages) Jan 15 20:40:44 q kernel: [ 677.076534] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076541] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076546] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076599] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076604] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076609] lost page write due to I/O error on vde Jan 15 20:40:44 
q kernel: [ 677.076613] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076618] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076623] lost page write due to I/O error on vde Jan 15 20:40:44 q kernel: [ 677.076628] lost page write due to I/O error on vde Jan 15 20:45:55 q Backgrounding to notify hosts... (The following entries will repeat infinitely, mkfs.ext4 will not exit and cannot be killed) Jan 15 20:49:27 q kernel: [ 1200.520096] mkfs.ext4 D 0 1839 1709 0x Jan 15 20:49:27 q kernel: [ 1200.520101] 88004e157cb8 0082 88004e157c58 00015880 Jan 15 20:49:27 q kernel: [ 1200.520115] 88004ef6c7c0 00015880 00015880 00015880 Jan 15 20:49:27 q kernel: [ 1200.520118] 00015880 88004ef6c7c0 00015880 00015880 Jan 15 20:49:27 q kernel: [ 1200.520123] Call Trace: Jan 15 20:49:27 q kernel: [ 1200.520157] [810da0f0] ? sync_page+0x0/0x50 Jan 15 20:49:27 q kernel: [ 1200.520178] [815255f8] io_schedule+0x28/0x40 Jan 15 20:49:27 q kernel: [ 1200.520182] [810da12d] sync_page+0x3d/0x50 Jan 15 20:49:27 q kernel: [ 1200.520185] [81525b17] __wait_on_bit+0x57/0x80 Jan 15 20:49:27 q kernel: [ 1200.520192] [810da29e] wait_on_page_bit+0x6e/0x80 Jan 15 20:49:27 q kernel: [ 1200.520205] [81078650] ? wake_bit_function+0x0/0x40 Jan 15 20:49:27 q kernel: [ 1200.520210] [810e44e0] ? 
pagevec_lookup_tag+0x20/0x30 Jan 15 20:49:27 q kernel: [ 1200.520213] [810da745] wait_on_page_writeback_range+0xf5/0x190 Jan 15 20:49:27 q kernel: [ 1200.520217] [810da807] filemap_fdatawait+0x27/0x30 Jan 15 20:49:27 q kernel: [ 1200.520220] [810dacb4] filemap_write_and_wait+0x44/0x50 Jan 15 20:49:27 q kernel: [ 1200.520235] [8114ba9f] __sync_blockdev+0x1f/0x40 Jan 15 20:49:27 q kernel: [ 1200.520239] [8114bace] sync_blockdev+0xe/0x10 Jan 15 20:49:27 q kernel: [ 1200.520241] [8114baea] block_fsync+0x1a/0x20 Jan 15 20:49:27 q kernel: [ 1200.520249] [81142f26] vfs_fsync+0x86/0xf0 Jan 15 20:49:27 q kernel: [ 1200.520252] [81142fc9] do_fsync+0x39/0x60 Jan 15 20:49:27 q kernel: [ 1200.520255] [8114301b] sys_fsync+0xb/0x10 Jan 15 20:49:27 q kernel: [ 1200.520271] [81011fc2] system_call_fastpath+0x16/0x1b In my case I was switching to virtio at one point, but the behaviour didn't show until there was 1 TB data on the filesystem. very dangerous. Tested using 2 different SATA controllers, 1.5 TB lvm/mdraid, single 1.5 TB drive and 2 TB lvm/mdraid. The behaviour does not occur with if=scsi or if=ide. #2914397 might be related: https://sourceforge.net/tracker/?func=detailaid=2914397group_id=180599atid=893831 This blog post might also relate: http://www.neuhalfen.name/2009/08/05/OpenSolaris_KVM_and_large_IDE_drives/ CPU: Intel
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
On Mon, 2010-06-14 at 18:49 +0200, Jan Kiszka wrote: Alex Williamson wrote: On Mon, 2010-06-14 at 18:00 +0200, Jan Kiszka wrote: And instead of introducing another hierarchy level with the bus address, I would also prefer to add this as prefix or suffix to the device name, e.g. driver.busaddr. That's what I had started with. The first post in this thread has pci.0,addr=09.0 as a single hierarchy level. The addr= may be unnecessary there, but I also prefer something along those lines. For PCI it'd make sense to have name:addr, which comes out to pci.0:09.0. (Maybe rather than flagging properties as being relevant to the path and printing them generically, we should extract specific properties based on the bus type.) Not bus.addr, driver.addr. We only have one PCI bus here, not as many as there are slots on that bus. Ok, I can get it down to something like: /i440FX-pcihost/pci.0/virtio-blk-pci,09.0 The addr on the device is initially a little non-intuitive to me since it's a property of the bus, but I guess it make sense if we think of that level as slot, which includes an address and driver. For busses that don't have a consistent addressing scheme then some sort of instance ID is unavoidable. I guess it may be possible to invent something based on other device properties (e.g. address of the first IO port/memory region). Instance IDs aren't always bad, we just run into trouble with hotplug and maybe creating unique ramblock names. But, such busses typically don't support hotplug and don't have multiple instances of the same device on the bus, so I don't expect us to hit many issues there as long as we can get a stable address path. Thanks, If stable instance numbers are required, we could simply keep them in DeviceState and search for the smallest free one on additions. Actually, I'm more in favor of this direction than including the bus address. 
That way we could keep a canonical format across all buses and do not have to provide possibly complex ID generation rules for each of them. I started down that path, but it still breaks for hotplug. If we start a VM with two e1000 NICs, then remove the first, we can no longer migrate because the target can't represent having a single e1000 with a non-zero instance ID. That's indeed a good point. Still, I'm worried about having to define numbering schemes for all the buses qemu supports. Maybe we can run a mixture: address-based for hotplug-capable buses (they tend to be cooperative in this regard) and simple instance numbers for the rest, likely the majority. Yep, I share that concern, which is why I say instance numbers aren't always bad. If we have devices that we don't care about doing hotplug on, we can live with instance numbers. Thanks, Alex
Re: [RFC PATCH 0/5] Introduce canonical device hierarchy string
On Mon, 2010-06-14 at 09:02 +0200, Gerd Hoffmann wrote: Hi, My premise with this attempt is that we walk the hierarchy and use the names to create the base of the path. As we get to the device, particularly to the parent bus of the device, we need to start looking at properties to ensure uniqueness. You'll need that for every bus along the way down to the device. Create a virtual machine with two lsi scsi host adapters, then attach a disk with scsi id 0 to each. Just the scsi id isn't good enough to identify the device. You'll need the lsi pci address too. Yep, see below. For now, the only properties I've tagged as path properties are PCI bus addresses and MAC addresses. mac address isn't needed here. You need the property which specifies the bus address. For PCI this obviously is the PCI address. For scsi the scsi id. For ISA you can use the I/O port base. virtio-serial the port number, ...
PCI: addr
SCSI: scsi-id
ISA: serial/parallel = iobase, others??
ide-drive: unit
I2C: address
virtio-serial doesn't seem to make a DeviceState per port, so I think it can be skipped. I'm sure I'm still missing some...
That gives me the list below (printed from qdev_init()): /i440FX-pcihost /i440FX-pcihost/pci.0/i440FX.00.0 /i440FX-pcihost/pci.0/PIIX3.01.0 /i440FX-pcihost/pci.0/cirrus-vga.02.0 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/mc146818rtc /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-serial.0x3f8 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-serial.0x2f8 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-parallel.0x378 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/i8042 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-fdc /i440FX-pcihost/pci.0/i82551.03.0 /i440FX-pcihost/pci.0/virtio-net-pci.04.0 /i440FX-pcihost/pci.0/e1000.05.0 /i440FX-pcihost/pci.0/rtl8139.06.0 /i440FX-pcihost/pci.0/pcnet.07.0 /i440FX-pcihost/pci.0/piix3-ide.01.1 /i440FX-pcihost/pci.0/piix3-ide.01.1/ide.0/ide-drive.0 /i440FX-pcihost/pci.0/piix3-ide.01.1/ide.0/ide-drive.1 /i440FX-pcihost/pci.0/piix3-ide.01.1/ide.1/ide-drive.0 /i440FX-pcihost/pci.0/PIIX4_PM.01.3 /i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.80 /i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.81 /i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.82 /i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.83 /i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.84 /i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.85 /i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.86 /i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.87 /i440FX-pcihost/pci.0/lsi53c895a.08.0/scsi.0/scsi-disk.0 /i440FX-pcihost/pci.0/lsi53c895a.08.0/scsi.0/scsi-disk.3 /i440FX-pcihost/pci.0/lsi53c895a.08.0 /i440FX-pcihost/pci.0/virtio-serial-pci.09.0 /i440FX-pcihost/pci.0/virtio-serial-pci.0a.0 /i440FX-pcihost/pci.0/virtio-blk-pci.0b.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status
PCI hotplug currently doesn't work after a migration because we don't migrate the enable bits of the GPE state. Pull hotplug structs into vmstate. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/acpi_piix4.c | 29 - 1 files changed, 28 insertions(+), 1 deletions(-) diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c index a87286b..8d1a628 100644 --- a/hw/acpi_piix4.c +++ b/hw/acpi_piix4.c @@ -283,9 +283,33 @@ static int vmstate_acpi_post_load(void *opaque, int version_id) return 0; } +static const VMStateDescription vmstate_gpe = { +.name = gpe, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT16(sts, struct gpe_regs), +VMSTATE_UINT16(en, struct gpe_regs), +VMSTATE_END_OF_LIST() +} +}; + +static const VMStateDescription vmstate_pci_status = { +.name = pci_status, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT32(up, struct pci_status), +VMSTATE_UINT32(down, struct pci_status), +VMSTATE_END_OF_LIST() +} +}; + static const VMStateDescription vmstate_acpi = { .name = piix4_pm, -.version_id = 1, +.version_id = 2, .minimum_version_id = 1, .minimum_version_id_old = 1, .post_load = vmstate_acpi_post_load, @@ -297,6 +321,9 @@ static const VMStateDescription vmstate_acpi = { VMSTATE_STRUCT(apm, PIIX4PMState, 0, vmstate_apm, APMState), VMSTATE_TIMER(tmr_timer, PIIX4PMState), VMSTATE_INT64(tmr_overflow_time, PIIX4PMState), +VMSTATE_STRUCT(gpe, PIIX4PMState, 2, vmstate_gpe, struct gpe_regs), +VMSTATE_STRUCT(pci0_status, PIIX4PMState, 2, vmstate_pci_status, + struct pci_status), VMSTATE_END_OF_LIST() } }; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] KVM: x86: XSAVE/XRSTOR live migration support
On Sun, Jun 13, 2010 at 05:29:39PM +0800, Sheng Yang wrote: This patch enables save/restore of xsave state. Signed-off-by: Sheng Yang sh...@linux.intel.com Applied, thanks.
Re: [PATCH v3] qemu: kvm: Enable XSAVE live migration support
On Fri, Jun 11, 2010 at 12:36:49PM +0800, Sheng Yang wrote: Signed-off-by: Sheng Yang sh...@linux.intel.com --- qemu-kvm-x86.c| 109 - qemu-kvm.c| 24 +++ qemu-kvm.h| 28 + target-i386/cpu.h |5 ++ target-i386/kvm.c |2 + target-i386/machine.c | 20 + 6 files changed, 169 insertions(+), 19 deletions(-) Applied, thanks.
Re: [PATCH v4] test: Add XSAVE unit test
On Sun, Jun 13, 2010 at 04:49:19PM +0800, Sheng Yang wrote: Based on IDT test framework. Signed-off-by: Sheng Yang sh...@linux.intel.com Applied, thanks.
Re: [PATCH v3] test: Add IDT framework
On Sun, Jun 13, 2010 at 11:02:02AM +0300, Avi Kivity wrote: Signed-off-by: Sheng Yang sh...@linux.intel.com Signed-off-by: Avi Kivity a...@redhat.com --- v3: rearrange printf()s v2: accurate instruction boundary tests avoid playing with the stack; use exception tables instead Applied, thanks.
Re: [Qemu-devel] [RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status
On 06/14/2010 03:28 PM, Alex Williamson wrote: PCI hotplug currently doesn't work after a migration because we don't migrate the enable bits of the GPE state. Pull hotplug structs into vmstate. Signed-off-by: Alex Williamson alex.william...@redhat.com Applied. Thanks. Regards, Anthony Liguori
Re: [Qemu-devel] [RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status
On 14.06.2010, at 22:58, Anthony Liguori wrote: On 06/14/2010 03:28 PM, Alex Williamson wrote: PCI hotplug currently doesn't work after a migration because we don't migrate the enable bits of the GPE state. Pull hotplug structs into vmstate. Signed-off-by: Alex Williamson alex.william...@redhat.com Applied. Thanks. Is this stable material? Alex
[PATCH] fix -DDEBUG oops
Patch is fairly self-explanatory. KVM -DDEBUG fix Fix a slight error with assertion in local APIC code. Signed-off-by: Zachary Amsden zams...@redhat.com diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index d8258a0..024f6d1 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -329,7 +329,7 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source, dest_mode 0x%x, short_hand 0x%x\n, target, source, dest, dest_mode, short_hand); - ASSERT(!target); + ASSERT(target); switch (short_hand) { case APIC_DEST_NOSHORT: if (dest_mode == 0)
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
On Mon, 2010-06-14 at 18:49 +0200, Jan Kiszka wrote: Alex Williamson wrote: On Mon, 2010-06-14 at 18:00 +0200, Jan Kiszka wrote: And instead of introducing another hierarchy level with the bus address, I would also prefer to add this as prefix or suffix to the device name, e.g. driver.busaddr. That's what I had started with. The first post in this thread has pci.0,addr=09.0 as a single hierarchy level. The addr= may be unnecessary there, but I also prefer something along those lines. For PCI it'd make sense to have name:addr, which comes out to pci.0:09.0. (Maybe rather than flagging properties as being relevant to the path and printing them generically, we should extract specific properties based on the bus type.) Not bus.addr, driver.addr. We only have one PCI bus here, not as many as there are slots on that bus. Ok, I can get it down to something like: /i440FX-pcihost/pci.0/virtio-blk-pci,09.0 The addr on the device is initially a little non-intuitive to me since it's a property of the bus, but I guess it make sense if we think of that level as slot, which includes an address and driver. That indicates you're thinking about things the wrong way. The point of this path is to uniquely identify an entity. /i440FX-pcihost/pci0 identifies a PCI bus attached to the i440FX-pcihost device. To identify a device attached to that bus all you need to know is the devfn of the device. For an end-user it may be helpful to allow devices to be identified by the device type (assuming only one device of a particular type on that bus), or specify the device type as a crude error checking mechanism. However we're talking about canonical addresses. These need not include the device type. Verifying that the device is actually what you expect is a separate problem, and the device type is not sufficient for that. i.e. /i440FX-pcihost/pci.0/,09.0 Is an appropriate canonical address. I started down that path, but it still breaks for hotplug. 
If we start a VM with two e1000 NICs, then remove the first, we can no longer migrate because the target can't represent having a single e1000 with a non-zero instance ID. That's indeed a good point. That's a feature. If you start with two NICs and remove the first, the chances are that the second will be in a different place to the NIC created in a single-nic config. Allowing migration between these two will cause a PCI device to suddenly change location. This is not physically or logically possible, and everyone loses. Hot-removing a NIC and then hotplugging a new NIC in the same location may result in something that is reasonable to migrate. Paul
Re: [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory
On Mon, Jun 14, 2010 at 9:53 AM, Anthony Liguori anth...@codemonkey.ws wrote: On 06/04/2010 04:45 PM, Cam Macdonell wrote: this code is a standalone server which will pass file descriptors for the shared memory region and eventfds to support interrupts between guests using inter-VM shared memory. --- contrib/ivshmem-server/Makefile | 16 ++ contrib/ivshmem-server/README | 30 +++ contrib/ivshmem-server/ivshmem_server.c | 353 +++ contrib/ivshmem-server/send_scm.c | 208 ++ contrib/ivshmem-server/send_scm.h | 19 ++ 5 files changed, 626 insertions(+), 0 deletions(-) create mode 100644 contrib/ivshmem-server/Makefile create mode 100644 contrib/ivshmem-server/README create mode 100644 contrib/ivshmem-server/ivshmem_server.c create mode 100644 contrib/ivshmem-server/send_scm.c create mode 100644 contrib/ivshmem-server/send_scm.h diff --git a/contrib/ivshmem-server/Makefile b/contrib/ivshmem-server/Makefile new file mode 100644 index 000..da40ffa --- /dev/null +++ b/contrib/ivshmem-server/Makefile @@ -0,0 +1,16 @@ +CC = gcc +CFLAGS = -O3 -Wall -Werror +LIBS = -lrt + +# a very simple makefile to build the inter-VM shared memory server + +all: ivshmem_server + +.c.o: + $(CC) $(CFLAGS) -c $^ -o $@ + +ivshmem_server: ivshmem_server.o send_scm.o + $(CC) $(CFLAGS) -o $@ $^ $(LIBS) + +clean: + rm -f *.o ivshmem_server diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README new file mode 100644 index 000..b1fc2a2 --- /dev/null +++ b/contrib/ivshmem-server/README @@ -0,0 +1,30 @@ +Using the ivshmem shared memory server +-- + +This server is only supported on Linux. + +To use the shared memory server, first compile it. Running 'make' should +accomplish this. An executable named 'ivshmem_server' will be built. + +to display the options run: + +./ivshmem_server -h + +Options +--- + + -h print help message + + -ppath on host + unix socket to listen on. The qemu-kvm chardev needs to connect on + this socket. 
(default: '/tmp/ivshmem_socket') + + -sstring + POSIX shared object to create that is the shared memory (default: 'ivshmem') + + -m# + size of the POSIX object in MBs (default: 1) + + -n# + number of eventfds for each guest. This number must match the + 'vectors' argument passed the ivshmem device. (default: 1) diff --git a/contrib/ivshmem-server/ivshmem_server.c b/contrib/ivshmem-server/ivshmem_server.c new file mode 100644 index 000..e0a7b98 --- /dev/null +++ b/contrib/ivshmem-server/ivshmem_server.c There's no licensing here. I don't think this belongs in the qemu tree either to be honest. If it were to be included, it ought to use all of the existing qemu infrastructure like the other qemu-* tools. For the time being, I'm willing to leave it out and host it externally. Regards, Anthony Liguori @@ -0,0 +1,353 @@ +/* + * A stand-alone shared memory server for inter-VM shared memory for KVM +*/ + +#includeerrno.h +#includestring.h +#includesys/types.h +#includesys/socket.h +#includesys/un.h +#includeunistd.h +#includesys/types.h +#includesys/stat.h +#includefcntl.h +#includesys/eventfd.h +#includesys/mman.h +#includesys/select.h +#includestdio.h +#includestdlib.h +#include send_scm.h + +#define DEFAULT_SOCK_PATH /tmp/ivshmem_socket +#define DEFAULT_SHM_OBJ ivshmem + +#define DEBUG 1 + +typedef struct server_state { + vmguest_t *live_vms; + int nr_allocated_vms; + int shm_size; + long live_count; + long total_count; + int shm_fd; + char * path; + char * shmobj; + int maxfd, conn_socket; + long msi_vectors; +} server_state_t; + +void usage(char const *prg); +int find_set(fd_set * readset, int max); +void print_vec(server_state_t * s, const char * c); + +void add_new_guest(server_state_t * s); +void parse_args(int argc, char **argv, server_state_t * s); +int create_listening_socket(char * path); + +int main(int argc, char ** argv) +{ + fd_set readset; + server_state_t * s; + + s = (server_state_t *)calloc(1, sizeof(server_state_t)); + + s-live_count = 0; + 
s-total_count = 0; + parse_args(argc, argv, s); + + /* open shared memory file */ + if ((s-shm_fd = shm_open(s-shmobj, O_CREAT|O_RDWR, S_IRWXU)) 0) + { + fprintf(stderr, kvm_ivshmem: could not open shared file\n); + exit(-1); + } + + ftruncate(s-shm_fd, s-shm_size); + + s-conn_socket = create_listening_socket(s-path); + + s-maxfd = s-conn_socket; + + for(;;) { + int ret, handle, i; + char buf[1024]; + + print_vec(s, vm_sockets); + + FD_ZERO(readset); + /* conn socket is in Live_vms at posn 0 */ + FD_SET(s-conn_socket,readset); +
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
On Mon, 2010-06-14 at 22:43 +0100, Paul Brook wrote: On Mon, 2010-06-14 at 18:49 +0200, Jan Kiszka wrote: Alex Williamson wrote: Ok, I can get it down to something like: /i440FX-pcihost/pci.0/virtio-blk-pci,09.0 The addr on the device is initially a little non-intuitive to me since it's a property of the bus, but I guess it make sense if we think of that level as slot, which includes an address and driver. That indicates you're thinking about things the wrong way. The point of this path is to uniquely identify an entity. /i440FX-pcihost/pci0 identifies a PCI bus attached to the i440FX-pcihost device. To identify a device attached to that bus all you need to know is the devfn of the device. Hmm, I think that identifies where the device is, not what the device is. It's more helpful to say the e1000 in slot 7 than it is to say the device in slot 7. For an end-user it may be helpful to allow devices to be identified by the device type (assuming only one device of a particular type on that bus), or specify the device type as a crude error checking mechanism. However we're talking about canonical addresses. These need not include the device type. Verifying that the device is actually what you expect is a separate problem, and the device type is not sufficient for that. i.e. /i440FX-pcihost/pci.0/,09.0 Is an appropriate canonical address. We seem to keep introducing new problems, and I'm not sure this one exists. If I drop the device name/type and use only the devfn, then what's to prevent the /e1000,09.0/rom (/,09.0/rom) from being stuffed into the /rtl8139,09.0/rom (/,09.0/rom) on a migration? (or we match it to /09.0/savevm when trying to migrate state) We can argue that e1000 isn't a sufficient identifier, but I can't think of a case where it'd fail. I started down that path, but it still breaks for hotplug. 
If we start a VM with two e1000 NICs, then remove the first, we can no longer migrate because the target can't represent having a single e1000 with a non-zero instance ID. That's indeed a good point. That's a feature. If you start with two NICs and remove the first, the chances are that the second will be in a different place to the nice created in a single-nic config. Allowing migration between these two will cause a PCI device to suddenly change location. This is not physically or logically possible, and everyone looses. If the BAR addresses don't follow the VM when it's migrated, that's another bug that needs to be fixed, but we can't get there until we at least have some infrastructure in place to make that bug possible. Hot-removing a nic and then hotplugging a new nic in the same location may result in something that is reasonable to migrate. It might... it might not. I'd rather make it work than try to document the restrictions. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Mon, Jun 14, 2010 at 01:11:20PM -0400, Chris Lalancette wrote:
> We really want to kvm_set_irq during the hrtimer callback, but that is
> risky because that is during interrupt context. Instead, offload the
> work to a workqueue, which is a bit safer and should provide most of
> the same functionality.
>
> Signed-off-by: Chris Lalancette <clala...@redhat.com>
> ---
>  arch/x86/kvm/i8254.c | 125 -
>  arch/x86/kvm/i8254.h |   4 +-
>  arch/x86/kvm/irq.c   |   1 -
>  3 files changed, 74 insertions(+), 56 deletions(-)
>
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index 188d827..3bed8ac 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -34,6 +34,7 @@
>  #include <linux/kvm_host.h>
>  #include <linux/slab.h>
> +#include <linux/workqueue.h>
>
>  #include "irq.h"
>  #include "i8254.h"
> @@ -244,11 +245,11 @@ static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
>  {
>      struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
>                                               irq_ack_notifier);
> -    raw_spin_lock(&ps->inject_lock);
> +    spin_lock(&ps->inject_lock);
>      if (atomic_dec_return(&ps->pit_timer.pending) > 0)
>          atomic_inc(&ps->pit_timer.pending);
>      ps->irq_ack = 1;
> -    raw_spin_unlock(&ps->inject_lock);
> +    spin_unlock(&ps->inject_lock);
>  }
>
>  void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
> @@ -267,7 +268,8 @@ void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
>  static void destroy_pit_timer(struct kvm_timer *pt)
>  {
>      pr_debug("execute del timer!\n");
> -    hrtimer_cancel(&pt->timer);
> +    if (hrtimer_cancel(&pt->timer))
> +        cancel_work_sync(&pt->kvm->arch.vpit->expired);
>  }
>
>  static bool kpit_is_periodic(struct kvm_timer *ktimer)
> @@ -281,6 +283,58 @@ static struct kvm_timer_ops kpit_ops = {
>      .is_periodic = kpit_is_periodic,
>  };
>
> +static void pit_do_work(struct work_struct *work)
> +{
> +    struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
> +    struct kvm *kvm = pit->kvm;
> +    struct kvm_vcpu *vcpu;
> +    int i;
> +    struct kvm_kpit_state *ps = &pit->pit_state;
> +    int inject = 0;
> +
> +    /* Try to inject pending interrupts when
> +     * last one has been acked.
> +     */
> +    spin_lock(&ps->inject_lock);
> +    if (ps->irq_ack) {
> +        ps->irq_ack = 0;
> +        inject = 1;
> +    }
> +    spin_unlock(&ps->inject_lock);
> +    if (inject) {
> +        kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
> +        kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
> +
> +        /*
> +         * Provides NMI watchdog support via Virtual Wire mode.
> +         * The route is: PIT -> PIC -> LVT0 in NMI mode.
> +         *
> +         * Note: Our Virtual Wire implementation is simplified, only
> +         * propagating PIT interrupts to all VCPUs when they have set
> +         * LVT0 to NMI delivery. Other PIC interrupts are just sent to
> +         * VCPU0, and only if its LVT0 is in EXTINT mode.
> +         */
> +        if (kvm->arch.vapics_in_nmi_mode > 0)
> +            kvm_for_each_vcpu(i, vcpu, kvm)
> +                kvm_apic_nmi_wd_deliver(vcpu);
> +    }
> +}
> +
> +static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
> +{
> +    struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
> +    struct kvm_pit *pt = ktimer->kvm->arch.vpit;
> +
> +    if (ktimer->reinject)
> +        queue_work(pt->wq, &pt->expired);

The problem is queue_work only queues the work if it was not already
queued. So multiple queue_work() calls can collapse into one executed
job. You need to maintain a counter here in pit_timer_fn, and reinject
at some point (perhaps on ACK) if there are multiple interrupts
pending.

> +    if (ktimer->t_ops->is_periodic(ktimer)) {
> +        hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
> +        return HRTIMER_RESTART;
> +    } else
> +        return HRTIMER_NORESTART;
> +}
> +
>  static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
>  {
>      struct kvm_timer *pt = &ps->pit_timer;
> @@ -291,14 +345,14 @@ static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
>      pr_debug("create pit timer, interval is %llu nsec\n", interval);
>
>      /* TODO The new value only affected after the retriggered */
> -    hrtimer_cancel(&pt->timer);
> +    if (hrtimer_cancel(&pt->timer))
> +        cancel_work_sync(&pt->kvm->arch.vpit->expired);

There can be a queued work instance even if the hrtimer is not active,
so cancel_work_sync should be unconditional.
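The coalescing problem Marcelo points out can be modeled in a few lines: a work item that is already queued makes a second queue_work() a no-op, so the tick accounting must live in a counter that the ACK path drains. All names below are illustrative, not the kernel's, and the single-slot `work_queued` flag stands in for the workqueue's pending bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the reinjection accounting Marcelo asks for. */
struct pit_model {
    int pending;      /* timer ticks still owed to the guest */
    bool work_queued; /* models the single pending slot of a work item */
};

/* hrtimer callback: always account the tick, even when queue_work()
 * would be a no-op because the work item is already queued. */
void timer_fire(struct pit_model *p)
{
    p->pending++;
    p->work_queued = true;  /* queue_work(); idempotent when set */
}

/* work function: inject one interrupt pulse, leave the rest pending */
void work_run(struct pit_model *p)
{
    p->work_queued = false;
    if (p->pending > 0)
        p->pending--;       /* one kvm_set_irq() raise/lower pair */
}

/* irq ACK: if ticks were coalesced, requeue instead of losing them */
void irq_ack(struct pit_model *p)
{
    if (p->pending > 0 && !p->work_queued)
        p->work_queued = true;
}
```

Three timer fires with only one work execution leave two ticks pending, and the ACK path requeues the work so they are eventually delivered rather than silently collapsed.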
Re: [PATCH 7/7] KVM: MMU: don't walk every parent pages while mark unsync
On Fri, Jun 11, 2010 at 09:35:15PM +0800, Xiao Guangrong wrote:
> While we mark the parent's unsync_child_bitmap, if the parent is
> already unsynced, there is no need to walk its parents; this avoids
> some unnecessary work.
>
> Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
> ---
>  arch/x86/kvm/mmu.c | 61 ++-
>  1 files changed, 17 insertions(+), 44 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index eb20682..a92863f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -175,7 +175,7 @@ struct kvm_shadow_walk_iterator {
>           shadow_walk_okay(&(_walker));                       \
>           shadow_walk_next(&(_walker)))
>
> -typedef int (*mmu_parent_walk_fn) (struct kvm_mmu_page *sp);
> +typedef void (*mmu_parent_walk_fn) (struct kvm_mmu_page *sp, u64 *spte);
>
>  static struct kmem_cache *pte_chain_cache;
>  static struct kmem_cache *rmap_desc_cache;
> @@ -1024,7 +1024,6 @@ static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
>          BUG();
>  }
>
> -
>  static void mmu_parent_walk(struct kvm_mmu_page *sp, mmu_parent_walk_fn fn)
>  {
>      struct kvm_pte_chain *pte_chain;
> @@ -1034,63 +1033,37 @@ static void mmu_parent_walk(struct kvm_mmu_page *sp, mmu_parent_walk_fn fn)
>      if (!sp->multimapped && sp->parent_pte) {
>          parent_sp = page_header(__pa(sp->parent_pte));
> -        fn(parent_sp);
> -        mmu_parent_walk(parent_sp, fn);
> +        fn(parent_sp, sp->parent_pte);
>          return;
>      }
> +
>      hlist_for_each_entry(pte_chain, node, &sp->parent_ptes, link)
>          for (i = 0; i < NR_PTE_CHAIN_ENTRIES; ++i) {
> -            if (!pte_chain->parent_ptes[i])
> +            u64 *spte = pte_chain->parent_ptes[i];
> +
> +            if (!spte)
>                  break;
> -            parent_sp = page_header(__pa(pte_chain->parent_ptes[i]));
> -            fn(parent_sp);
> -            mmu_parent_walk(parent_sp, fn);
> +            parent_sp = page_header(__pa(spte));
> +            fn(parent_sp, spte);
>          }
>  }
>
> -static void kvm_mmu_update_unsync_bitmap(u64 *spte)
> +static void mark_unsync(struct kvm_mmu_page *sp, u64 *spte);
> +static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  {
> -    unsigned int index;
> -    struct kvm_mmu_page *sp = page_header(__pa(spte));
> -
> -    index = spte - sp->spt;
> -    if (!__test_and_set_bit(index, sp->unsync_child_bitmap))
> -        sp->unsync_children++;
> -    WARN_ON(!sp->unsync_children);
> +    mmu_parent_walk(sp, mark_unsync);
>  }
>
> -static void kvm_mmu_update_parents_unsync(struct kvm_mmu_page *sp)
> +static void mark_unsync(struct kvm_mmu_page *sp, u64 *spte)
>  {
> -    struct kvm_pte_chain *pte_chain;
> -    struct hlist_node *node;
> -    int i;
> +    unsigned int index;
>
> -    if (!sp->parent_pte)
> +    index = spte - sp->spt;
> +    if (__test_and_set_bit(index, sp->unsync_child_bitmap))
>          return;
> -
> -    if (!sp->multimapped) {
> -        kvm_mmu_update_unsync_bitmap(sp->parent_pte);
> +    if (sp->unsync_children++)
>          return;

This looks wrong. If the sp has an unrelated child marked as unsync
(which increased sp->unsync_children), you stop the walk?

Applied 1-6, thanks.
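The pruning idea in the patch — stop walking upward as soon as a parent's count was already nonzero — can be modeled on a simple parent chain. This toy (not KVM code; node layout and names are invented) only demonstrates the optimization's intent; whether a nonzero count always implies the ancestors were already marked is exactly the invariant Marcelo questions above.

```c
#include <assert.h>

/* Toy model of the walk-pruning optimization: propagate an unsync mark
 * up a chain of parent pages, stopping at the first node whose count
 * was already nonzero, on the assumption that an earlier walk marked
 * its ancestors too. */
struct page_node {
    int parent;          /* index of the parent node, -1 at the root */
    int unsync_children; /* unsync marks that arrived from below */
};

void mark_up(struct page_node *t, int i)
{
    for (; i >= 0; i = t[i].parent)
        if (t[i].unsync_children++ != 0)
            break;  /* already marked: assume the ancestors are, too */
}
```

On a chain root(0) <- 1 <- 2, the first walk from node 2 touches all three nodes; a second walk stops immediately at node 2 without revisiting 1 and 0, which is the saved workload the commit message claims.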
KVM: VMX: optimize APIC EOI
Use information provided in exit_qualification to shortcut EOI path. Reduces EOI latency from 4k to 2k cycles on Nehalem.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Index: kvm/arch/x86/kvm/lapic.c
===================================================================
--- kvm.orig/arch/x86/kvm/lapic.c
+++ kvm/arch/x86/kvm/lapic.c
@@ -1279,3 +1279,9 @@ int kvm_hv_vapic_msr_read(struct kvm_vcp
     return 0;
 }
+
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+    apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
Index: kvm/arch/x86/kvm/lapic.h
===================================================================
--- kvm.orig/arch/x86/kvm/lapic.h
+++ kvm/arch/x86/kvm/lapic.h
@@ -52,6 +52,8 @@ int kvm_x2apic_msr_read(struct kvm_vcpu
 int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
+
 static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
 {
     return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
Index: kvm/arch/x86/kvm/vmx.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx.c
+++ kvm/arch/x86/kvm/vmx.c
@@ -3395,6 +3395,17 @@ static int handle_wbinvd(struct kvm_vcpu
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+    unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+    int access_type, offset;
+
+    access_type = (exit_qualification >> 12) & 0xf;
+    offset = exit_qualification & 0xfff;
+    if (access_type == 1 && offset == APIC_EOI) {
+        kvm_lapic_set_eoi(vcpu);
+        skip_emulated_instruction(vcpu);
+        return 1;
+    }
+
     return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
 }
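The decode the fast path relies on can be checked in isolation: in an APIC-access VM exit, bits 11:0 of the exit qualification give the offset into the APIC page and bits 15:12 the access type, with type 1 meaning a linear write. Only a linear write to offset 0xB0 (the EOI register) takes the shortcut; everything else still falls through to the emulator.

```c
#include <assert.h>
#include <stdint.h>

#define APIC_EOI 0xB0   /* EOI register offset within the APIC MMIO page */

/* Standalone model of the check added to handle_apic_access(): nonzero
 * when the exit qualification describes a linear write to the EOI
 * register, i.e. the only case the patch short-circuits. */
int is_eoi_write(uint64_t exit_qualification)
{
    int access_type = (exit_qualification >> 12) & 0xf;
    int offset = exit_qualification & 0xfff;

    return access_type == 1 && offset == APIC_EOI;
}
```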
Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()
Ok, I can get it down to something like: /i440FX-pcihost/pci.0/virtio-blk-pci,09.0. The addr on the device is initially a little non-intuitive to me since it's a property of the bus, but I guess it makes sense if we think of that level as a slot, which includes an address and driver.

That indicates you're thinking about things the wrong way. The point of this path is to uniquely identify an entity. /i440FX-pcihost/pci.0 identifies a PCI bus attached to the i440FX-pcihost device. To identify a device attached to that bus, all you need to know is the devfn of the device.

Hmm, I think that identifies where the device is, not what the device is. It's more helpful to say "the e1000 in slot 7" than it is to say "the device in slot 7".

Why is this more useful? Canonical addresses should not be helpful. They should identify entities within a machine that is already known to be consistent. Making them helpful just makes them more volatile.

For an end-user it may be helpful to allow devices to be identified by the device type (assuming only one device of a particular type on that bus), or to specify the device type as a crude error-checking mechanism. However, we're talking about canonical addresses. These need not include the device type. Verifying that the device is actually what you expect is a separate problem, and the device type is not sufficient for that. i.e. /i440FX-pcihost/pci.0/,09.0 is an appropriate canonical address.

We seem to keep introducing new problems, and I'm not sure this one exists. If I drop the device name/type and use only the devfn, then what's to prevent the /e1000,09.0/rom (/,09.0/rom) from being stuffed into the /rtl8139,09.0/rom (/,09.0/rom) on a migration? (Or we match it to /09.0/savevm when trying to migrate state.) We can argue that e1000 isn't a sufficient identifier, but I can't think of a case where it'd fail.

The migration code needs to check that the devices are actually compatible. I'd expect this to require much more than just the device name. What you actually need is more like "an e1000 with 64k eeprom, fast ethernet PHY, and frobnitz B". In fact what you really want to do is transfer the device tree (including properties), and create the machine from scratch, not load state into a pre-supplied device tree.

I started down that path, but it still breaks for hotplug. If we start a VM with two e1000 NICs, then remove the first, we can no longer migrate because the target can't represent having a single e1000 with a non-zero instance ID.

That's indeed a good point.

That's a feature. If you start with two NICs and remove the first, the chances are that the second will be in a different place to the NIC created in a single-NIC config. Allowing migration between these two will cause a PCI device to suddenly change location. This is not physically or logically possible, and everyone loses.

If the BAR addresses don't follow the VM when it's migrated, that's another bug that needs to be fixed, but we can't get there until we at least have some infrastructure in place to make that bug possible.

Not BAR addresses, the actual PCI device addresses. Devices on the PCI bus are addressed by device and function. This is guest visible. The device part of this address corresponds to the physical slot, which typically affects IRQ routing (amongst other things). If you arbitrarily move a device from slot A to slot B then this will have catastrophic effects on a running machine.

Paul
Re: [PATCH] VT-d: fix PCI device detach from virtual machine
On Thu, 2009-02-26 at 17:31 +0800, Han, Weidong wrote:
> When assigning a device behind a conventional PCI bridge or a
> PCIe-to-PCI/PCI-X bridge to a domain, its bridge must also be
> assigned, and the secondary interface may also need to be assigned to
> the same domain. Dependent assignment is already there, but dependent
> deassignment is missing when detaching a device from a virtual
> machine. This results in conventional PCI device assignment failing
> after it has been assigned once. This patch adds dependent
> deassignment and fixes the issue.

Um, this code makes my head hurt. Why are we doing this in the first place? Because the IOMMU works on the source-id in PCIe transactions, the pci_find_upstream_pcie_bridge() function effectively tells us which PCI device our own device will be masquerading as, for the purposes of DMA.

So why do we bother setting up a context in the IOMMU for the device itself, when no DMA will ever appear to come from this device? And likewise, why do we bother setting up a context for intermediate PCI bridges? Why not just jump straight to the 'DMA proxy' device, and use that _only_?

We'll have to cope with multiple devices behind the same 'proxy', but it looks like our handling of that is totally screwed already... what happens right now if you have two PCI devices behind the same PCIe-PCI bridge, and try to attach both of them to different domains... or both to the _same_ domain, in fact, and then detach just one of them?

I think the answer to the latter question is that your newly-added iommu_detach_dependent_devices() routine will tear down the mapping on the 'proxy' device, and faults will start happening for the device which is supposed to still be mapped?

Confused... and tempted to rip it all out and start over.

-- dwmw2
Re: [Qemu-devel] [PATCH 5/5] [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo
On Mon, 2010-06-14 at 18:11 +0000, Blue Swirl wrote:
> On Mon, Jun 14, 2010 at 9:44 AM, Nicholas A. Bellinger <n...@linux-iscsi.org> wrote:
>> From: Nicholas Bellinger <n...@linux-iscsi.org>
>>
>> This patch adds initial support for using the Linux BSG interface with
>> write/read vectored AIO as a QEMU backstore (SCSIDeviceInfo) with
>> hw/scsi-bus.c compatible HBA emulation.
>
> Did I miss the docs?

I assume you mean the docs for CLI usage, yes..? The ops are the same as scsi-generic; using megasas HBA emulation, invocation looks like:

qemu-system-x86_64 -m 512 -boot c ~/lenny-32bit.img -drive if=none,id=mydisk1,file=/dev/4\:0\:1\:0 -device megasas,id=raid -device scsi-bsg,bus=raid.0,scsi-id=1,drive=mydisk1

>> So far it has been tested with x86_64 host and guest using hw/megasas.c
>> and TCM_Loop LLD Port LUNs. Because this path uses struct iovec for
>> struct sg_io_v4->d[out,in]_xferp payloads, it currently requires a
>> patch to linux/block/bsg.c:bsg_map_hdr() in order to set up the
>> user -> kernel iovecs. It will also currently only work with paired
>> user/kernel sizes (eg: 64bit user / 64bit kernel) because of differing
>> pointer sizes in struct iovec->iov_base.
>>
>> There are also two FIXMEs in hw/scsi-bsg.c:bsg_generic_initfn() related
>> to extraction of SCSI LUN and device type values using BSG and required
>> by QEMU-KVM.
>>
>> Signed-off-by: Nicholas A. Bellinger <n...@linux-iscsi.org>
>> ---
>>  Makefile.objs |    2 +-
>>  hw/scsi-bsg.c |  588 +
>>  2 files changed, 589 insertions(+), 1 deletions(-)
>>  create mode 100644 hw/scsi-bsg.c
>>
>> diff --git a/Makefile.objs b/Makefile.objs
>> index 188d617..c4fcb72 100644
>> --- a/Makefile.objs
>> +++ b/Makefile.objs
>> @@ -197,7 +197,7 @@ hw-obj-$(CONFIG_IDE_CMD646) += ide/cmd646.o
>>  hw-obj-$(CONFIG_IDE_MACIO) += ide/macio.o
>>
>>  # SCSI layer
>> -hw-obj-y += scsi-disk.o scsi-generic.o
>> +hw-obj-y += scsi-disk.o scsi-generic.o scsi-bsg.o
>
> Instead of '#ifdef __linux__' (which should be '#ifdef CONFIG_LINUX'),
> please compile the object only if CONFIG_LINUX is set, something like:
> hw-obj-$(CONFIG_LINUX) += scsi-bsg.o
> Please also check if this could be compiled in common-obj set.

Ok, I added the 'hw-obj-$(CONFIG_LINUX) += scsi-bsg.o' mentioned above..

>>  hw-obj-y += lsi53c895a.o megasas.o
>>  hw-obj-$(CONFIG_ESP) += esp.o
>>
>> diff --git a/hw/scsi-bsg.c b/hw/scsi-bsg.c
>> new file mode 100644
>> index 000..fc76b76
>> --- /dev/null
>> +++ b/hw/scsi-bsg.c
>> @@ -0,0 +1,588 @@
>> +/*
>> + * block layer implementation of the sg v4 interface for Linux hosts
>> + *
>> + * Copyright (c) 2010 Rising Tide Systems
>> + * Written by Nicholas A. Bellinger <n...@linux-iscsi.org>
>> + *
>> + * Based on hw/scsi-generic code by Laurent Vivier, Paul Brook, and Fabrice Bellard
>> + *
>> + * This code is licenced under the LGPL.
>> + */
>> +
>> +#include "qemu-common.h"
>> +#include "qemu-error.h"
>> +#include "block.h"
>> +#include "scsi.h"
>> +#include "dma.h"
>> +#include "block/raw-posix-aio.h"
>> +
>> +#ifdef __linux__
>> +
>> +#define DEBUG_BSG
>> +#undef DEBUG_BSG_IO
>> +#undef DEBUG_BSG_MAP
>
> This should be
> //#define DEBUG_BSG
> //#define DEBUG_BSG_IO
> //#define DEBUG_BSG_MAP

Fixed

>> +#ifdef DEBUG_BSG
>> +#define DPRINTF(fmt, ...) \
>> +do { printf("scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
>> +#else
>> +#define DPRINTF(fmt, ...) do {} while (0)
>> +#endif
>> +
>> +#define BADF(fmt, ...) \
>> +do { fprintf(stderr, "scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
>> +
>> +#include <stdio.h>
>> +#include <sys/types.h>
>> +#include <sys/stat.h>
>> +#include <sys/epoll.h>
>> +#include <unistd.h>
>> +#include <scsi/sg.h>
>> +#include <linux/bsg.h>
>> +#include "scsi-defs.h"
>> +
>> +#define SCSI_SENSE_BUF_SIZE 96
>> +
>> +#define SG_ERR_DRIVER_TIMEOUT 0x06
>> +#define SG_ERR_DRIVER_SENSE 0x08
>> +
>> +#ifndef MAX_UINT
>> +#define MAX_UINT ((unsigned int)-1)
>
> The standard macro is UINT_MAX.

Fixed (This was originally from hw/scsi-generic.c)

>> +#endif
>> +
>> +typedef struct SCSIBSGState SCSIBSGState;
>> +
>> +typedef struct SCSIBSGReq {
>> +    SCSIRequest req;
>> +    uint8_t *buf;
>> +    int buflen;
>> +    QEMUIOVector iov;
>> +    QEMUIOVector aio_iov;
>> +    struct sg_io_v4 bsg_hdr;
>> +} SCSIBSGReq;
>> +
>> +struct SCSIBSGState {
>> +    SCSIDevice qdev;
>> +    BlockDriverState *bs;
>> +    int lun;
>> +    int driver_status;
>> +    uint8_t sensebuf[SCSI_SENSE_BUF_SIZE];
>> +    uint8_t senselen;
>> +};
>> +
>> +static int bsg_read(int fd, void *p_read, int to_read)
>> +{
>> +    int err;
>> +
>> +    while (to_read > 0) {
>> +        err = read(fd, p_read, to_read);
>> +        if (err >= 0) {
>> +            to_read -= err;
>> +            p_read += err;
>> +        } else if (errno == EINTR)
>> +            continue;
>> +        else {
>> +            printf("bsg device %d read failed, errno: %d\n",
>> +                   fd, errno);
>
> DPRINTF?

I made this into a error_report()

>> +            return errno;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +static SCSIBSGReq
[PATCH] KVM: x86 emulator: fix pusha instruction emulation
The pusha emulation only writes back the final (EDI) push; the writebacks for the other pushed registers are dropped. This patch fixes it.
---
 arch/x86/kvm/emulate.c | 133 ++--
 1 files changed, 73 insertions(+), 60 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a4c2dcd..c990db0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1553,6 +1553,64 @@ exception:
     return X86EMUL_PROPAGATE_FAULT;
 }
 
+static inline int writeback(struct x86_emulate_ctxt *ctxt,
+                            struct x86_emulate_ops *ops)
+{
+    int rc;
+    struct decode_cache *c = &ctxt->decode;
+    u32 err;
+
+    switch (c->dst.type) {
+    case OP_REG:
+        /* The 4-byte case *is* correct:
+         * in 64-bit mode we zero-extend.
+         */
+        switch (c->dst.bytes) {
+        case 1:
+            *(u8 *)c->dst.ptr = (u8)c->dst.val;
+            break;
+        case 2:
+            *(u16 *)c->dst.ptr = (u16)c->dst.val;
+            break;
+        case 4:
+            *c->dst.ptr = (u32)c->dst.val;
+            break; /* 64b: zero-ext */
+        case 8:
+            *c->dst.ptr = c->dst.val;
+            break;
+        }
+        break;
+    case OP_MEM:
+        if (c->lock_prefix)
+            rc = ops->cmpxchg_emulated(
+                    (unsigned long)c->dst.ptr,
+                    &c->dst.orig_val,
+                    &c->dst.val,
+                    c->dst.bytes,
+                    &err,
+                    ctxt->vcpu);
+        else
+            rc = ops->write_emulated(
+                    (unsigned long)c->dst.ptr,
+                    &c->dst.val,
+                    c->dst.bytes,
+                    &err,
+                    ctxt->vcpu);
+        if (rc == X86EMUL_PROPAGATE_FAULT)
+            emulate_pf(ctxt,
+                       (unsigned long)c->dst.ptr, err);
+        if (rc != X86EMUL_CONTINUE)
+            return rc;
+        break;
+    case OP_NONE:
+        /* no writeback */
+        break;
+    default:
+        break;
+    }
+    return X86EMUL_CONTINUE;
+}
+
 static inline void emulate_push(struct x86_emulate_ctxt *ctxt,
                                 struct x86_emulate_ops *ops)
 {
@@ -1651,11 +1709,12 @@ static int emulate_pop_sreg(struct x86_emulate_ctxt *ctxt,
     return rc;
 }
 
-static void emulate_pusha(struct x86_emulate_ctxt *ctxt,
+static int emulate_pusha(struct x86_emulate_ctxt *ctxt,
                           struct x86_emulate_ops *ops)
 {
     struct decode_cache *c = &ctxt->decode;
     unsigned long old_esp = c->regs[VCPU_REGS_RSP];
+    int rc = X86EMUL_CONTINUE;
     int reg = VCPU_REGS_RAX;
 
     while (reg <= VCPU_REGS_RDI) {
@@ -1663,8 +1722,18 @@ static void emulate_pusha(struct x86_emulate_ctxt *ctxt,
         (reg == VCPU_REGS_RSP) ?
         (c->src.val = old_esp) : (c->src.val = c->regs[reg]);
 
         emulate_push(ctxt, ops);
+
+        rc = writeback(ctxt, ops);
+        if (rc != X86EMUL_CONTINUE)
+            return rc;
+
         ++reg;
     }
+
+    /* Disable writeback. */
+    c->dst.type = OP_NONE;
+
+    return rc;
 }
 
 static int emulate_popa(struct x86_emulate_ctxt *ctxt,
@@ -1817,64 +1886,6 @@ static int emulate_ret_far(struct x86_emulate_ctxt *ctxt,
     return rc;
 }
 
-static inline int writeback(struct x86_emulate_ctxt *ctxt,
-                            struct x86_emulate_ops *ops)
-{
-    int rc;
-    struct decode_cache *c = &ctxt->decode;
-    u32 err;
-
-    switch (c->dst.type) {
-    case OP_REG:
-        /* The 4-byte case *is* correct:
-         * in 64-bit mode we zero-extend.
-         */
-        switch (c->dst.bytes) {
-        case 1:
-            *(u8 *)c->dst.ptr = (u8)c->dst.val;
-            break;
-        case 2:
-            *(u16 *)c->dst.ptr = (u16)c->dst.val;
-            break;
-        case 4:
-            *c->dst.ptr = (u32)c->dst.val;
-            break; /* 64b: zero-ext */
-        case 8:
-            *c->dst.ptr = c->dst.val;
-            break;
-        }
-        break;
-    case OP_MEM:
-        if (c->lock_prefix)
-            rc = ops->cmpxchg_emulated(
-                    (unsigned long)c->dst.ptr,
-                    &c->dst.orig_val,
-
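For reference, the architectural PUSHA order the loop above walks (RAX through RDI, substituting the pre-push stack pointer for RSP) can be modeled outside the emulator. This toy (not the kvm emulator; names are invented) commits each push immediately, which is the per-iteration writeback behavior the patch restores.

```c
#include <assert.h>
#include <stdint.h>

enum { RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, NREGS };

/* Toy model of 32-bit PUSHA: registers are pushed in RAX..RDI order,
 * the value pushed for RSP is the stack pointer as it was *before* the
 * first push, and each push is committed immediately rather than only
 * the last one surviving. */
void pusha(uint32_t *regs, uint32_t *pushed /* filled in push order */)
{
    uint32_t old_esp = regs[RSP];
    int reg;

    for (reg = RAX; reg <= RDI; reg++) {
        pushed[reg] = (reg == RSP) ? old_esp : regs[reg];
        regs[RSP] -= 4;   /* committed now, not deferred to the end */
    }
}
```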