Re: Raw vs. tap (was: Re: [Qemu-devel] Re: Release plan for 0.12.0)
On Wed, 2009-10-14 at 17:53 -0500, Anthony Liguori wrote:

> So at this point, I think it's a mistake to include raw socket support. If the goal is to improve networking usability such that it just works as a root user, let's incorporate a default network script that creates a bridge or something like that. There are better ways to achieve that goal.

FWIW, I haven't really played with the raw backend yet, but my initial thought was also "what exactly does this gain us apart from yet more confusion for users?".

So, I tend to agree, but I'm not so hung up on the user confusion aspect - the users that I worry about confusing (e.g. virt-manager users) would never even know the backend exists, even if qemu did support it.

The one hope I had for raw is that it might allow us to get closer to the NIC, get more details on the NIC tx queue and have more intelligent tx mitigation. This is probably better explored in the context of vhost-net, though.

Wrt configuring bridges, libvirt comes with a good default setup - a bridge without any physical NICs connected, but NAT set up for access to the outside world. For bridging to a physical NIC, our plan continues to be that NetworkManager will eventually make this trivial for users, but that's still in progress. In the meantime, the config isn't all that complex: http://wiki.libvirt.org/page/Networking#Fedora.2FRHEL_Bridging

Cheers,
Mark.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM test: Add PCI pass through test
On Wed, Oct 14, 2009 at 09:08:00AM -0300, Lucas Meneghel Rodrigues wrote:

> Add a new PCI pass through test. It supports both SR-IOV virtual functions and physical NIC card pass through.
>
> Single Root I/O Virtualization (SR-IOV) allows a single PCI device to be shared amongst multiple virtual machines while retaining the performance benefit of assigning a PCI device to a virtual machine. A common example is where a single SR-IOV capable NIC - with perhaps only a single physical network port - might be shared with multiple virtual machines by assigning a virtual function to each VM.
>
> SR-IOV support is implemented in the kernel. The core implementation is contained in the PCI subsystem, but there must also be driver support for both the Physical Function (PF) and Virtual Function (VF) devices. With an SR-IOV capable device one can allocate VFs from a PF. The VFs surface as PCI devices which are backed on the physical PCI device by resources (queues, and register sets).
>
> Device support: in 2.6.30, the Intel® 82576 Gigabit Ethernet Controller is the only SR-IOV capable device supported. The igb driver has PF support and the igbvf driver has VF support. In 2.6.31 the Neterion® X3100™ is supported as well. This device uses the same vxge driver for the PF as well as the VFs.

Wow, a new NIC card supports SR-IOV... At this rate, do we need to move the driver name and its parameters into the config file, so that if a new NIC card using a different driver is supported in the future, we could handle it without changing code?

In order to configure the test:

* For SR-IOV virtual function passthrough, we can specify the module parameter 'max_vfs' in the config file.
* For physical NIC card pass through, we should specify the device name(s).
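As a side note, the VF discovery step the test needs can be sketched by filtering `lspci -D` output. This is only an illustration; the sample output and the device addresses below are hypothetical, not taken from the patch:

```python
# Hypothetical sample of `lspci -D` output on a host with an SR-IOV
# capable 82576 NIC (igb PF plus igbvf VFs); addresses are illustrative.
SAMPLE_LSPCI = """\
0000:01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection
0000:01:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function
0000:01:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function
"""

def find_virtual_functions(lspci_output):
    """Return the full PCI IDs of lines describing SR-IOV Virtual Functions."""
    vfs = []
    for line in lspci_output.splitlines():
        if "Virtual Function" in line:
            # The first whitespace-separated field is the full PCI ID.
            vfs.append(line.split()[0])
    return vfs

print(find_virtual_functions(SAMPLE_LSPCI))   # ['0000:01:10.0', '0000:01:10.2']
```

On a real host the input would come from running lspci, as the patch's helpers do with commands.getstatusoutput().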
Signed-off-by: Yolkfull Chow yz...@redhat.com
---
 client/tests/kvm/kvm_tests.cfg.sample |   11 ++-
 client/tests/kvm/kvm_utils.py         |  278 +++++++++++++++++++++++++++++++++
 client/tests/kvm/kvm_vm.py            |   72 +++++++++
 3 files changed, 360 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample
index cc3228a..1dad188 100644
--- a/client/tests/kvm/kvm_tests.cfg.sample
+++ b/client/tests/kvm/kvm_tests.cfg.sample
@@ -786,13 +786,22 @@ variants:
         only default
         image_format = raw

 variants:
     - @smallpages:
     - hugepages:
         pre_command = "/usr/bin/python scripts/hugepage.py /mnt/kvm_hugepage"
         extra_params += " -mem-path /mnt/kvm_hugepage"

+variants:
+    - @no_passthrough:
+        pass_through = no
+    - nic_passthrough:
+        pass_through = pf
+        passthrough_devs = eth1
+    - vfs_passthrough:
+        pass_through = vf
+        max_vfs = 7
+        vfs_count = 7

 variants:
     - @basic:

diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
index 53b664a..0e3398c 100644
--- a/client/tests/kvm/kvm_utils.py
+++ b/client/tests/kvm/kvm_utils.py
@@ -788,3 +788,281 @@ def md5sum_file(filename, size=None):
         size -= len(data)
     f.close()
     return o.hexdigest()
+
+
+def get_full_id(pci_id):
+    """
+    Get full PCI ID of pci_id.
+    """
+    cmd = "lspci -D | awk '/%s/ {print $1}'" % pci_id
+    status, full_id = commands.getstatusoutput(cmd)
+    if status != 0:
+        return None
+    return full_id
+
+
+def get_vendor_id(pci_id):
+    """
+    Check out the device vendor ID according to PCI ID.
+    """
+    cmd = "lspci -n | awk '/%s/ {print $3}'" % pci_id
+    return re.sub(":", " ", commands.getoutput(cmd))
+
+
+def release_dev(pci_id, pci_dict):
+    """
+    Release a single PCI device.
+
+    @param pci_id: PCI ID of a given PCI device
+    @param pci_dict: Dictionary with information about PCI devices
+    """
+    base_dir = "/sys/bus/pci"
+    full_id = get_full_id(pci_id)
+    vendor_id = get_vendor_id(pci_id)
+    drv_path = os.path.join(base_dir, "devices/%s/driver" % full_id)
+    if 'pci-stub' in os.readlink(drv_path):
+        cmd = "echo '%s' > %s/new_id" % (vendor_id, drv_path)
+        if os.system(cmd):
+            return False
+
+        stub_path = os.path.join(base_dir, "drivers/pci-stub")
+        cmd = "echo '%s' > %s/unbind" % (full_id, stub_path)
+        if os.system(cmd):
+            return False
+
+        prev_driver = pci_dict[pci_id]
+        cmd = "echo '%s' > %s/bind" % (full_id, prev_driver)
+        if os.system(cmd):
+            return False
+    return True
+
+
+def release_pci_devs(pci_dict):
+    """
+    Release all PCI devices assigned to host.
+
+    @param pci_dict: Dictionary with information about PCI devices
+    """
+    for pci_id in pci_dict:
+        if not release_dev(pci_id, pci_dict):
+            logging.error("Failed to release device [%s] to host" % pci_id)
+        else:
+            logging.info("Release device [%s] successfully" % pci_id)
+
+
+class PassThrough(object):
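The release sequence in release_dev() above boils down to three sysfs writes (new_id, unbind, bind). A dry-run sketch that only builds the command strings makes the ordering easy to check without touching the host; the PCI ID, vendor ID and driver path below are hypothetical:

```python
def release_commands(full_id, vendor_id, prev_driver, base_dir="/sys/bus/pci"):
    """Build, without running, the sysfs writes release_dev() would issue
    to detach a device from pci-stub and rebind it to its previous driver."""
    drv_path = "%s/devices/%s/driver" % (base_dir, full_id)
    stub_path = "%s/drivers/pci-stub" % base_dir
    return [
        "echo '%s' > %s/new_id" % (vendor_id, drv_path),   # re-register the ID
        "echo '%s' > %s/unbind" % (full_id, stub_path),    # detach from pci-stub
        "echo '%s' > %s/bind" % (full_id, prev_driver),    # give it back
    ]

# Illustrative values only: a VF previously driven by igbvf.
for cmd in release_commands("0000:01:10.0", "8086 10ca",
                            "/sys/bus/pci/drivers/igbvf"):
    print(cmd)
```

In the real helper each command is passed to os.system() and a non-zero status aborts the sequence.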
Re: [Autotest] [PATCH] Add pass through feature test (support SR-IOV)
On Wed, Oct 14, 2009 at 09:13:59AM -0300, Lucas Meneghel Rodrigues wrote:

> Yolkfull, I've studied single root IO virtualization before reviewing your patch, and the general approach here looks good. There were some stylistic points as far as code is concerned, so I have rebased your patch against the latest trunk, and added some explanation about the features being tested and referenced (extracted from a Fedora 12 blueprint).
>
> Please let me know if you are OK with it. I guess I will review this patch a couple more times, as the code and the features being tested are fairly complex. Thanks!

Lucas, thank you very much for adding a detailed explanation and improving this test. I have reviewed the new patch and some new considerations came to mind. I have added them in this email, please review. :)

On Mon, Sep 14, 2009 at 11:20 PM, Yolkfull Chow yz...@redhat.com wrote:

> It supports both SR-IOV virtual functions and physical NIC card pass through.
>
> * For SR-IOV virtual function passthrough, we can specify the module parameter 'max_vfs' in the config file.
> * For physical NIC card pass through, we should specify the device name(s).
Signed-off-by: Yolkfull Chow yz...@redhat.com
---
 client/tests/kvm/kvm_tests.cfg.sample |   12 ++
 client/tests/kvm/kvm_utils.py         |  248 +++++++++++++++++++++++++++++++-
 client/tests/kvm/kvm_vm.py            |   68 +++++++-
 3 files changed, 326 insertions(+), 2 deletions(-)

diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample
index a83ef9b..c6037da 100644
--- a/client/tests/kvm/kvm_tests.cfg.sample
+++ b/client/tests/kvm/kvm_tests.cfg.sample
@@ -627,6 +627,18 @@ variants:

 variants:
+    - @no_passthrough:
+        pass_through = no
+    - nic_passthrough:
+        pass_through = pf
+        passthrough_devs = eth1
+    - vfs_passthrough:
+        pass_through = vf
+        max_vfs = 7
+        vfs_count = 7
+
+
 variants:
     - @basic:
         only Fedora Windows
     - @full:

diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
index dfca938..1fe3b31 100644
--- a/client/tests/kvm/kvm_utils.py
+++ b/client/tests/kvm/kvm_utils.py
@@ -1,5 +1,5 @@
 import md5, thread, subprocess, time, string, random, socket, os, signal, pty
-import select, re, logging, commands
+import select, re, logging, commands, cPickle
 from autotest_lib.client.bin import utils
 from autotest_lib.client.common_lib import error
 import kvm_subprocess
@@ -795,3 +795,249 @@ def md5sum_file(filename, size=None):
         size -= len(data)
     f.close()
     return o.hexdigest()
+
+
+def get_full_id(pci_id):
+    """
+    Get full PCI ID of pci_id.
+    """
+    cmd = "lspci -D | awk '/%s/ {print $1}'" % pci_id
+    status, full_id = commands.getstatusoutput(cmd)
+    if status != 0:
+        return None
+    return full_id
+
+
+def get_vendor_id(pci_id):
+    """
+    Check out the device vendor ID according to PCI ID.
+    """
+    cmd = "lspci -n | awk '/%s/ {print $3}'" % pci_id
+    return re.sub(":", " ", commands.getoutput(cmd))
+
+
+def release_pci_devs(dict):
+    """
+    Release assigned PCI devices to host.
+    """
+    def release_dev(pci_id):
+        base_dir = "/sys/bus/pci"
+        full_id = get_full_id(pci_id)
+        vendor_id = get_vendor_id(pci_id)
+        drv_path = os.path.join(base_dir, "devices/%s/driver" % full_id)
+        if 'pci-stub' in os.readlink(drv_path):
+            cmd = "echo '%s' > %s/new_id" % (vendor_id, drv_path)
+            if os.system(cmd):
+                return False
+
+            stub_path = os.path.join(base_dir, "drivers/pci-stub")
+            cmd = "echo '%s' > %s/unbind" % (full_id, stub_path)
+            if os.system(cmd):
+                return False
+
+            prev_driver = self.dev_prev_drivers[pci_id]
+            cmd = "echo '%s' > %s/bind" % (full_id, prev_driver)
+            if os.system(cmd):
+                return False
+        return True
+
+    for pci_id in dict.keys():
+        if not release_dev(pci_id):
+            logging.error("Failed to release device [%s] to host" % pci_id)
+        else:
+            logging.info("Release device [%s] successfully" % pci_id)
+
+
+class PassThrough:
+    """
+    Request passthroughable devices on host. It will check whether to request
+    PF (physical NIC cards) or VF (Virtual Functions).
+    """
+    def __init__(self, type="nic_vf", max_vfs=None, names=None):
+        """
+        Initialize parameter 'type' which could be:
+        nic_vf: Virtual Functions
+        nic_pf: Physical NIC card
+        mixed:  Both includes VFs and PFs
+
+        If pass through Physical NIC cards, we need to specify which devices
+        to be assigned, e.g. 'eth1 eth2'.
+
+        If pass through Virtual Functions, we
Re: Add qemu_send_raw() to vlan.
Hi Gleb,

On Tue, 2009-05-26 at 13:03 +0300, Gleb Natapov wrote:

> It gets packet without virtio header and adds it if needed. Allows to inject packets to vlan from outside. To send gratuitous ARP for instance.
...
> diff --git a/net.h b/net.h
> index 931133b..3d0b6f2 100644
> --- a/net.h
> +++ b/net.h
...
> @@ -63,6 +64,7 @@ int qemu_can_send_packet(VLANClientState *vc);
>  ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov,
>                            int iovcnt);
>  int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
> +void qemu_send_packet_raw(VLANClientState *vc, const uint8_t *buf, int size);
>  void qemu_format_nic_info_str(VLANClientState *vc, uint8_t macaddr[6]);
>  void qemu_check_nic_model(NICInfo *nd, const char *model);
>  void qemu_check_nic_model_list(NICInfo *nd, const char * const *models,

I've only just now noticed that we never actually made announce_self() use this ... care to do that?

Cheers,
Mark.
Re: [PATCH] Xen PV-on-HVM guest support (v2)
On 10/15/2009 02:41 PM, Ed Swierk wrote:

> Support for Xen PV-on-HVM guests can be implemented almost entirely in userspace, except for handling one annoying MSR that maps a Xen hypercall blob into guest address space.
>
> A generic mechanism to delegate MSR writes to userspace seems overkill and risks encouraging similar MSR abuse in the future. Thus this patch adds special support for the Xen HVM MSR.
>
> I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell KVM which MSR the guest will write to, as well as the starting address and size of the hypercall blobs (one each for 32-bit and 64-bit) that userspace has loaded from files. When the guest writes to the MSR, KVM copies one page of the blob from userspace to the guest.
>
> I've tested this patch with a hacked-up version of Gerd's userspace code, booting a number of guests (CentOS 5.3 i386 and x86_64, and FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.
>
> v2: fix ioctl struct padding; renumber CAP and ioctl constants; check kvm_write_guest() return value; change printks to KERN_DEBUG (I think they're worth keeping for debugging userspace)
>
> +#ifdef KVM_CAP_XEN_HVM
> +struct kvm_xen_hvm_config {
> +	__u32 msr;
> +	__u8 pad[2];
> +	__u8 blob_size[2];
> +	__u64 blob_addr[2];
> +};
> +#endif

Please change the arrays to separate variables (e.g. blob_size_32, blob_size_64), so readers don't have to guess the meaning. Also, reserve a bunch of space at the end in case we need more hackery.

Is the msr number really variable? Isn't it an ABI?

>  /*
>   * ioctls for vcpu fds
>
> Index: kvm-kmod/include/linux/kvm_host.h
> ===
> --- kvm-kmod.orig/include/linux/kvm_host.h
> +++ kvm-kmod/include/linux/kvm_host.h
> @@ -236,6 +236,10 @@ struct kvm {
>  	unsigned long mmu_notifier_seq;
>  	long mmu_notifier_count;
>  #endif
> +
> +#ifdef KVM_CAP_XEN_HVM
> +	struct kvm_xen_hvm_config xen_hvm_config;
> +#endif
>  };

struct kvm_arch is a better place for this.

>  /* The guest did something we don't support. */
>
> Index: kvm-kmod/x86/x86.c
> ===
> --- kvm-kmod.orig/x86/x86.c
> +++ kvm-kmod/x86/x86.c
> @@ -875,6 +875,35 @@ static int set_msr_mce(struct kvm_vcpu *
>  	return 0;
>  }
>
> +#ifdef KVM_CAP_XEN_HVM

No need for the ifdef - it will always be defined for x86.

> +static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	int blob = !!(vcpu->arch.shadow_efer & EFER_LME);

Can use is_long_mode() for this.

> +	u32 pnum = data & ~PAGE_MASK;
> +	u64 paddr = data & PAGE_MASK;
> +	u8 *page;
> +	int r = 1;
> +
> +	if (pnum >= vcpu->kvm->xen_hvm_config.blob_size[blob])
> +		goto out;
> +	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!page)
> +		goto out;
> +	if (copy_from_user(page, (u8 *)vcpu->kvm->xen_hvm_config.blob_addr[blob]
> +			   + pnum * PAGE_SIZE, PAGE_SIZE))
> +		goto out_free;

We want to return -EFAULT here (but make sure the entire code path allows this).

> +	if (kvm_write_guest(vcpu->kvm, paddr, page, PAGE_SIZE))
> +		goto out_free;
> +	printk(KERN_DEBUG "kvm: copied xen hvm blob %d page %d to 0x%llx\n",
> +	       blob, pnum, paddr);
> +	r = 0;
> +out_free:
> +	kfree(page);
> +out:
> +	return r;
> +}
> +#endif
> +
>  int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  {
>  	switch (msr) {
> @@ -990,6 +1019,10 @@ int kvm_set_msr_common(struct kvm_vcpu *
>  			"0x%x data 0x%llx\n", msr, data);
>  		break;
>  	default:
> +#ifdef KVM_CAP_XEN_HVM
> +		if (msr && (msr == vcpu->kvm->xen_hvm_config.msr))
> +			return xen_hvm_config(vcpu, data);
> +#endif

Again, can skip the ifdef.

>  		if (!ignore_msrs) {
>  			pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
>  				msr, data);
> @@ -2453,6 +2486,17 @@ long kvm_arch_vm_ioctl(struct file *filp
>  		r = 0;
>  		break;
>  	}
> +#ifdef KVM_CAP_XEN_HVM
> +	case KVM_XEN_HVM_CONFIG: {
> +		r = -EFAULT;
> +		if (copy_from_user(&kvm->xen_hvm_config, argp,
> +				   sizeof(struct kvm_xen_hvm_config)))
> +			goto out;
> +		printk(KERN_DEBUG "kvm: configured xen hvm\n");
> +		r = 0;
> +		break;
> +	}
> +#endif
>  	default:
>  		;
>  	}

Do we need support for reading the msr?

IMO you can drop the debugging printk()s. I don't see how they add much value.

Please submit the patch against a current kernel tree, not kvm-kmod.
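The page-selection arithmetic reviewed above is compact enough to check in isolation. A sketch (assuming 4 KiB pages, as on x86) of how xen_hvm_config() splits the single 64-bit MSR write into a blob page index and a guest physical destination:

```python
PAGE_SIZE = 4096                      # assumption: 4 KiB pages
PAGE_MASK = ~(PAGE_SIZE - 1)          # kernel-style PAGE_MASK

def split_msr_write(data):
    """Mimic xen_hvm_config(): the low bits of the written value select the
    blob page number, the page-aligned part is the guest physical address
    the page is copied to."""
    pnum = data & ~PAGE_MASK          # blob page index (low 12 bits)
    paddr = data & PAGE_MASK          # page-aligned guest physical address
    return pnum, paddr

# Ask for blob page 2 to be copied to guest physical address 0x100000.
pnum, paddr = split_msr_write(0x00100002)
print(pnum, hex(paddr))               # 2 0x100000
```

This also makes Avi's -EFAULT remark concrete: the only validation in the kernel path is the `pnum >= blob_size` bound check before the copy.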
--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH] kvm: Prevent kvm_init from corrupting debugfs structures
On 10/15/2009 08:21 AM, Darrick J. Wong wrote:

> I'm seeing an oops condition when kvm-intel and kvm-amd are modprobe'd during boot (say on an Intel system) and then rmmod'd:
>
>     # modprobe kvm-intel
>       kvm_init()
>         kvm_init_debug()
>         kvm_arch_init()  <-- stores debugfs dentries internally (success, etc)
>
>     # modprobe kvm-amd
>       kvm_init()
>         kvm_init_debug() <-- second initialization clobbers kvm's internal pointers to dentries
>         kvm_arch_init()
>         kvm_exit_debug() <-- and frees them
>
>     # rmmod kvm-intel
>       kvm_exit()
>         kvm_exit_debug() <-- double free of debugfs files!  *BOOM*
>
> If execution gets to the end of kvm_init(), then the calling module has been established as the kvm provider. Move the debugfs initialization to the end of the function, and remove the now-unnecessary call to kvm_exit_debug() from the error path. That way we avoid trampling on the debugfs entries and freeing them twice.

Looks good.
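The ordering fix is easy to see in a toy model (Python, not kernel code) where the debugfs entries are a global owned by whichever module completes kvm_init() first; all names here are stand-ins for the real kernel state:

```python
# Toy model of the fixed kvm_init(): run every fallible step first, and only
# create the global debugfs entries once this module has won.
debugfs_entries = []          # stands in for the global dentry pointers
kvm_provider = None           # which module currently provides kvm

def kvm_init(module, arch_init_ok=True):
    global kvm_provider
    if kvm_provider is not None:
        return False          # arch_init fails for the second module
    if not arch_init_ok:
        return False          # error path: debugfs was never touched
    kvm_provider = module
    debugfs_entries.append(module)   # kvm_init_debug(), now done last
    return True

print(kvm_init("kvm-intel"))  # True
print(kvm_init("kvm-amd"))    # False: can no longer clobber or free entries
print(debugfs_entries)        # ['kvm-intel']: created exactly once
```

The buggy ordering corresponds to appending to debugfs_entries before the provider check, which is what let the second modprobe clobber and free the first module's entries.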
Re: [PATCH] Xen PV-on-HVM guest support (v2)
Ed Swierk wrote:

> Support for Xen PV-on-HVM guests can be implemented almost entirely in userspace, except for handling one annoying MSR that maps a Xen hypercall blob into guest address space.
>
> A generic mechanism to delegate MSR writes to userspace seems overkill and risks encouraging similar MSR abuse in the future. Thus this patch adds special support for the Xen HVM MSR.
>
> I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell KVM which MSR the guest will write to, as well as the starting address and size of the hypercall blobs (one each for 32-bit and 64-bit) that userspace has loaded from files. When the guest writes to the MSR, KVM copies one page of the blob from userspace to the guest.
>
> I've tested this patch with a hacked-up version of Gerd's userspace code, booting a number of guests (CentOS 5.3 i386 and x86_64, and FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.
>
> v2: fix ioctl struct padding; renumber CAP and ioctl constants; check kvm_write_guest() return value; change printks to KERN_DEBUG (I think they're worth keeping for debugging userspace)

I disagree /wrt the print in the IOCTL path (missing configuration can also be reported on access), and the guest triggered path at least requires a pr_debug conversion. Looks fine to me otherwise.

Jan

> Signed-off-by: Ed Swierk eswi...@aristanetworks.com
> ---
>
> Index: kvm-kmod/include/asm-x86/kvm.h
> ===
> --- kvm-kmod.orig/include/asm-x86/kvm.h
> +++ kvm-kmod/include/asm-x86/kvm.h
> @@ -59,6 +59,7 @@
>  #define __KVM_HAVE_MSIX
>  #define __KVM_HAVE_MCE
>  #define __KVM_HAVE_PIT_STATE2
> +#define __KVM_HAVE_XEN_HVM
>
>  /* Architectural interrupt line count. */
>  #define KVM_NR_INTERRUPTS 256
>
> Index: kvm-kmod/include/linux/kvm.h
> ===
> --- kvm-kmod.orig/include/linux/kvm.h
> +++ kvm-kmod/include/linux/kvm.h
> @@ -476,6 +476,9 @@ struct kvm_ioeventfd {
>  #endif
>  #define KVM_CAP_IOEVENTFD 36
>  #define KVM_CAP_SET_IDENTITY_MAP_ADDR 37
> +#ifdef __KVM_HAVE_XEN_HVM
> +#define KVM_CAP_XEN_HVM 38
> +#endif
>
>  #ifdef KVM_CAP_IRQ_ROUTING
> @@ -528,6 +531,15 @@ struct kvm_x86_mce {
>  };
>  #endif
>
> +#ifdef KVM_CAP_XEN_HVM
> +struct kvm_xen_hvm_config {
> +	__u32 msr;
> +	__u8 pad[2];
> +	__u8 blob_size[2];
> +	__u64 blob_addr[2];
> +};
> +#endif
> +
>  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
>
>  struct kvm_irqfd {
> @@ -586,6 +598,7 @@ struct kvm_irqfd {
>  #define KVM_CREATE_PIT2		_IOW(KVMIO, 0x77, struct kvm_pit_config)
>  #define KVM_SET_BOOT_CPU_ID	_IO(KVMIO, 0x78)
>  #define KVM_IOEVENTFD		_IOW(KVMIO, 0x79, struct kvm_ioeventfd)
> +#define KVM_XEN_HVM_CONFIG	_IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config)
>
>  /*
>   * ioctls for vcpu fds
>
> Index: kvm-kmod/include/linux/kvm_host.h
> ===
> --- kvm-kmod.orig/include/linux/kvm_host.h
> +++ kvm-kmod/include/linux/kvm_host.h
> @@ -236,6 +236,10 @@ struct kvm {
>  	unsigned long mmu_notifier_seq;
>  	long mmu_notifier_count;
>  #endif
> +
> +#ifdef KVM_CAP_XEN_HVM
> +	struct kvm_xen_hvm_config xen_hvm_config;
> +#endif
>  };
>
>  /* The guest did something we don't support. */
>
> Index: kvm-kmod/x86/x86.c
> ===
> --- kvm-kmod.orig/x86/x86.c
> +++ kvm-kmod/x86/x86.c
> @@ -875,6 +875,35 @@ static int set_msr_mce(struct kvm_vcpu *
>  	return 0;
>  }
>
> +#ifdef KVM_CAP_XEN_HVM
> +static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	int blob = !!(vcpu->arch.shadow_efer & EFER_LME);
> +	u32 pnum = data & ~PAGE_MASK;
> +	u64 paddr = data & PAGE_MASK;
> +	u8 *page;
> +	int r = 1;
> +
> +	if (pnum >= vcpu->kvm->xen_hvm_config.blob_size[blob])
> +		goto out;
> +	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!page)
> +		goto out;
> +	if (copy_from_user(page, (u8 *)vcpu->kvm->xen_hvm_config.blob_addr[blob]
> +			   + pnum * PAGE_SIZE, PAGE_SIZE))
> +		goto out_free;
> +	if (kvm_write_guest(vcpu->kvm, paddr, page, PAGE_SIZE))
> +		goto out_free;
> +	printk(KERN_DEBUG "kvm: copied xen hvm blob %d page %d to 0x%llx\n",
> +	       blob, pnum, paddr);
> +	r = 0;
> +out_free:
> +	kfree(page);
> +out:
> +	return r;
> +}
> +#endif
> +
>  int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  {
>  	switch (msr) {
> @@ -990,6 +1019,10 @@ int kvm_set_msr_common(struct kvm_vcpu *
>  			"0x%x data 0x%llx\n", msr, data);
>  		break;
>  	default:
> +#ifdef KVM_CAP_XEN_HVM
> +		if (msr && (msr == vcpu->kvm->xen_hvm_config.msr))
> +			return xen_hvm_config(vcpu, data);
> +#endif
>  		if (!ignore_msrs) {
>  			pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
>  				msr, data);
> @@ -2453,6
Re: [PATCH][RFC] Xen PV-on-HVM guest support
On 10/15/09 09:17, Jan Kiszka wrote:

> Ed Swierk wrote:
>> Overall it seems pretty solid for Linux PV-on-HVM guests. I think more work is needed to support full PV guests, but I don't know how much. Have folks been asking about PV-on-HVM or full PV?
>
> Not all requests were that concrete /wrt technology, but some had older setups and were definitely using full PV.

I had full pv working at some point as well, but I think those patches are almost a year old by now and need quite some work to make them work on today's master branch ...

cheers,
Gerd
Re: [PATCH][RFC] Xen PV-on-HVM guest support
Ed Swierk wrote:

> Thanks for the feedback; I'll post a new version shortly.
>
> On Tue, Oct 13, 2009 at 11:45 PM, Jan Kiszka jan.kis...@web.de wrote:
>> Interesting stuff. How usable is your work at this point? I've no immediate demand, but the question if one could integrate Xen guests with KVM already popped up more than once @work.
>
> So far I've managed to boot CentOS 5.3 (both i386 and x86_64) and use the Xen PV block and net devices, with pretty good performance. I've also booted FreeBSD 8.0-RC1 (amd64 only) with a XENHVM kernel and used the Xen PV block and net devices, but the performance of the net device is significantly worse than with CentOS. Also some FreeBSD applications use a flag that's not yet implemented in the net device emulation, but I'm working on fixing that.
>
> Overall it seems pretty solid for Linux PV-on-HVM guests. I think more work is needed to support full PV guests, but I don't know how much. Have folks been asking about PV-on-HVM or full PV?

Not all requests were that concrete /wrt technology, but some had older setups and were definitely using full PV.

Jan
Re: Add qemu_send_raw() to vlan.
On Thu, Oct 15, 2009 at 08:04:45AM +0100, Mark McLoughlin wrote:

> Hi Gleb,
>
> On Tue, 2009-05-26 at 13:03 +0300, Gleb Natapov wrote:
>> It gets packet without virtio header and adds it if needed. Allows to inject packets to vlan from outside. To send gratuitous ARP for instance.
> [...]
>> +void qemu_send_packet_raw(VLANClientState *vc, const uint8_t *buf, int size);
> [...]
>
> I've only just now noticed that we never actually made announce_self() use this ... care to do that?

Something like this:

---
Use qemu_send_packet_raw to send gratuitous ARP. This will ensure that vnet header is handled properly.

Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/savevm.c b/savevm.c
index 7a363b6..8ea2daf 100644
--- a/savevm.c
+++ b/savevm.c
@@ -132,7 +132,7 @@ static void qemu_announce_self_once(void *opaque)
         len = announce_self_create(buf, nd_table[i].macaddr);
         vlan = nd_table[i].vlan;
         for(vc = vlan->first_client; vc != NULL; vc = vc->next) {
-            vc->receive(vc, buf, len);
+            qemu_send_packet_raw(vc, buf, len);
         }
     }
     if (count--) {

--
			Gleb.
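For reference, here is a sketch of the kind of frame announce_self_create() emits after migration: a broadcast Ethernet header followed by an ARP request whose sender and target protocol addresses match. Field values and layout details are illustrative, not copied from qemu:

```python
import struct

def build_gratuitous_arp(mac, ip=b"\x00\x00\x00\x00"):
    """Build a minimal gratuitous ARP frame: broadcast destination,
    EtherType 0x0806, ARP request with sender == target protocol address."""
    eth = b"\xff" * 6 + mac + struct.pack("!H", 0x0806)   # dst, src, EtherType
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)       # Ethernet/IPv4, request
    arp += mac + ip                                       # sender hw/proto address
    arp += b"\x00" * 6 + ip                               # target hw/proto address
    return eth + arp

frame = build_gratuitous_arp(b"\x52\x54\x00\x12\x34\x56")
print(len(frame))   # 42: 14-byte Ethernet header + 28-byte ARP payload
```

The point of the patch above is precisely that such a frame, when injected from outside the guest, lacks the virtio-net header that some backends expect, which qemu_send_packet_raw() adds when needed.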
Re: Add qemu_send_raw() to vlan.
On Thu, 2009-10-15 at 09:33 +0200, Gleb Natapov wrote:

> [earlier exchange about qemu_send_packet_raw() and announce_self() trimmed]
>
> Something like this:
>
> ---
> Use qemu_send_packet_raw to send gratuitous ARP. This will ensure that vnet header is handled properly.
>
> Signed-off-by: Gleb Natapov g...@redhat.com

Acked-by: Mark McLoughlin mar...@redhat.com

> diff --git a/savevm.c b/savevm.c
> index 7a363b6..8ea2daf 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -132,7 +132,7 @@ static void qemu_announce_self_once(void *opaque)
>          len = announce_self_create(buf, nd_table[i].macaddr);
>          vlan = nd_table[i].vlan;
>          for(vc = vlan->first_client; vc != NULL; vc = vc->next) {
> -            vc->receive(vc, buf, len);
> +            qemu_send_packet_raw(vc, buf, len);

This makes things even more gratuitous because we're making every net client send the packet rather than receive it, but it works fine in practice.

Cheers,
Mark.
Re: [Qemu-devel] [STABLE PATCH] hotplug: fix scsi hotplug.
On 10/14/09 19:30, Dustin Kirkland wrote:

> Also note that I did not replace the bios.bin, as it appears to me that the qemu-kvm-0.11 bios.bin is working properly.

Yes, kvm has its own bios, only for vanilla upstream the bios must be replaced.

cheers,
Gerd
Re: [Qemu-devel] Re: Release plan for 0.12.0
On Wed, Oct 14, 2009 at 02:10:00PM -0700, Sridhar Samudrala wrote:

> On Wed, 2009-10-14 at 17:50 +0200, Michael S. Tsirkin wrote:
>> On Wed, Oct 14, 2009 at 04:19:17PM +0100, Jamie Lokier wrote:
>>> Michael S. Tsirkin wrote:
>>>> On Wed, Oct 14, 2009 at 09:17:15AM -0500, Anthony Liguori wrote:
>>>>> Michael S. Tsirkin wrote:
>>>>>> Looks like Or has abandoned it. I have an updated version which works with new APIs, etc. Let me post it and we'll go from there.
>>>>>
>>>>> I'm generally inclined to oppose the functionality as I don't think it offers any advantages over the existing backends.
>>>>
>>>> I patch it in and use it all the time. It's much easier to setup on a random machine than a bridged config.
>>>>
>>>>> Having two things that do the same thing is just going to lead to user confusion.
>>>>
>>>> They do not do the same thing. With raw socket you can use windows update without a bridge in the host, with tap you can't.
>>>
>>> On the other hand, with raw socket, guest Windows can't access files on the host's Samba share, can it? So it's not that useful even for Windows guests.
>>
>> I guess this depends on whether you use the same host for samba :)
>>
>>> If the problem is tap is too hard to setup, we should try to simplify tap configuration.
>>
>> The problem is bridge is too hard to setup. Simplifying that is a good idea, but outside the scope of the qemu project.
>>
>>> I venture it's important enough for qemu that it's worth working on that. Something that looks like the raw socket but behaves like an automatically instantiated bridge attached to the bound interface would be a useful interface.
>>
>> I agree, that would be good to have.
>
> Can't we bind the raw socket to the tap interface instead of the physical interface and allow the bridge config to work?

We can, kind of (e.g. with veth), but what's the point then?

> Thanks
> Sridhar

I don't have much time, but I'll help anybody who wants to do that.
Re: Raw vs. tap (was: Re: [Qemu-devel] Re: Release plan for 0.12.0)
On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote:

> I would be much more inclined to consider taking raw and improving the performance long term if guest->host networking worked. This appears to be a fundamental limitation though and I think it's something that will forever plague users if we include this feature.

In fact, I think it's fixable with a raw socket bound to a macvlan. Would that be enough?

--
MST
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
On 10/14/2009 01:06 AM, Jan Kiszka wrote:

> Add a new IOCTL pair to retrieve or set the VCPU state in one chunk. More precisely, the IOCTL is able to process a list of substates to be read or written. This list is easily extensible without breaking the existing ABI, thus we will no longer have to add new IOCTLs when we discover a missing VCPU state field or want to support new hardware features.
>
> This patch establishes the generic infrastructure for KVM_GET/SET_VCPU_STATE and adds support for the generic substates REGS, SREGS, FPU, and MP. To avoid code duplication, the entry point for the corresponding original IOCTLs are converted to make use of the new infrastructure internally, too.
>
> +/* for KVM_GET_VCPU_STATE and KVM_SET_VCPU_STATE */
> +#define KVM_VCPU_REGS	0
> +#define KVM_VCPU_SREGS	1
> +#define KVM_VCPU_FPU	2
> +#define KVM_VCPU_MP	3

KVM_VCPU_STATE_*, to avoid collisions. Better to split sse from fpu since we already know it is about to be replaced.

> +
> +struct kvm_vcpu_substate {
> +	__u32 type;
> +	__u32 pad;
> +	__s64 offset;
> +};
> +
> +#define KVM_MAX_VCPU_SUBSTATES 64
> +
> +struct kvm_vcpu_state {
> +	__u32 nsubstates;  /* number of elements in substates */
> +	__u32 nprocessed;  /* return value: successfully processed substates */
> +	struct kvm_vcpu_substate substates[0];
> +};
> +

Wouldn't having an ordinary struct with lots of reserved space be simpler? If we add a bitmask, we can even selectively get/set the fields we want (important if new state extends old state: avx vs sse).
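To see how userspace might fill such an extensible request, here is a sketch packing a substate list according to the structures quoted above (assumptions: little-endian layout, offsets pointing into a hypothetical user buffer):

```python
import struct

# Layouts matching the posted structures: two __u32 header fields, then an
# array of kvm_vcpu_substate entries (__u32 type, __u32 pad, __s64 offset).
HEADER_FMT = "<II"
SUBSTATE_FMT = "<IIq"

KVM_VCPU_REGS, KVM_VCPU_SREGS, KVM_VCPU_FPU, KVM_VCPU_MP = range(4)

def pack_vcpu_state(substates):
    """Serialize a kvm_vcpu_state request: header (nsubstates, nprocessed=0)
    followed by one entry per requested substate."""
    buf = struct.pack(HEADER_FMT, len(substates), 0)
    for stype, offset in substates:
        buf += struct.pack(SUBSTATE_FMT, stype, 0, offset)
    return buf

req = pack_vcpu_state([(KVM_VCPU_REGS, 0), (KVM_VCPU_SREGS, 1024)])
print(len(req))   # 40: 8-byte header plus two 16-byte entries
```

The extensibility argument is visible in the layout: adding a new substate type only appends another 16-byte entry, it never changes the meaning of existing ones.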
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
On 10/14/2009 01:06 AM, Jan Kiszka wrote:

@@ -1586,6 +1719,7 @@ static long kvm_dev_ioctl_check_extension_generic(long arg)
 	case KVM_CAP_USER_MEMORY:
 	case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
 	case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
+	case KVM_CAP_VCPU_STATE:
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
 	case KVM_CAP_SET_BOOT_CPU_ID:
 #endif

This should be done only for the archs that implement the ioctl.

-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 4/4] KVM: x86: Add VCPU substate for NMI states
On 10/14/2009 01:06 AM, Jan Kiszka wrote: This plugs an NMI-related hole in the VCPU synchronization between kernel and user space. So far, neither pending NMIs nor the inhibit-NMI mask was properly read/set, which could cause problems on vmsave/restore, live migration and system reset. Fix it by making use of the new VCPU substate interface.

+struct kvm_nmi_state {
+	__u8 pending;
+	__u8 masked;
+	__u8 pad1[2];
+};

Best to be conservative and use 64-bit alignment. Who knows what we might put after this someday.

@@ -513,6 +513,8 @@ struct kvm_x86_ops {
 				unsigned char *hypercall_addr);
 	void (*set_irq)(struct kvm_vcpu *vcpu);
 	void (*set_nmi)(struct kvm_vcpu *vcpu);
+	int (*get_nmi_mask)(struct kvm_vcpu *vcpu);
+	void (*set_nmi_mask)(struct kvm_vcpu *vcpu, int masked);

Prefer bool for booleans, please. Needs a KVM_CAP as well.

-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
On 10/14/2009 01:06 AM, Jan Kiszka wrote: Add a new IOCTL pair to retrieve or set the VCPU state in one chunk. More precisely, the IOCTL is able to process a list of substates to be read or written. This list is easily extensible without breaking the existing ABI, thus we will no longer have to add new IOCTLs when we discover a missing VCPU state field or want to support new hardware features. This patch establishes the generic infrastructure for KVM_GET/SET_VCPU_STATE and adds support for the generic substates REGS, SREGS, FPU, and MP. To avoid code duplication, the entry points for the corresponding original IOCTLs are converted to make use of the new infrastructure internally, too.

One last thing - Documentation/kvm/api.txt needs updating. Glauber, this holds for your patches as well.

-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH] Xen PV-on-HVM guest support (v2)
Hi, Is the msr number really variable? Isn't it an ABI? Yes, it is variable. The guest gets the msr number via cpuid ... Do we need support for reading the msr? I don't think so. cheers, Gerd
Re: [PATCH] Xen PV-on-HVM guest support (v2)
On 10/15/2009 05:11 PM, Gerd Hoffmann wrote: Hi, Is the msr number really variable? Isn't it an ABI? Yes, it is variable. The guest gets the msr number via cpuid ... Do we need support for reading the msr? I don't think so. Thanks. So Ed, I think you're good to go, but please update Documentation/kvm/api.txt for your next round. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
Avi Kivity wrote: On 10/14/2009 01:06 AM, Jan Kiszka wrote: Add a new IOCTL pair to retrieve or set the VCPU state in one chunk. More precisely, the IOCTL is able to process a list of substates to be read or written. This list is easily extensible without breaking the existing ABI, thus we will no longer have to add new IOCTLs when we discover a missing VCPU state field or want to support new hardware features. This patch establishes the generic infrastructure for KVM_GET/SET_VCPU_STATE and adds support for the generic substates REGS, SREGS, FPU, and MP. To avoid code duplication, the entry points for the corresponding original IOCTLs are converted to make use of the new infrastructure internally, too.

+/* for KVM_GET_VCPU_STATE and KVM_SET_VCPU_STATE */
+#define KVM_VCPU_REGS  0
+#define KVM_VCPU_SREGS 1
+#define KVM_VCPU_FPU   2
+#define KVM_VCPU_MP    3

KVM_VCPU_STATE_*, to avoid collisions.

OK.

Better to split sse from fpu since we already know it is about to be replaced.

The idea is to reuse the existing state structures, including struct kvm_fpu. This allows to provide the above substates for all archs and simplifies the migration (see my qemu conversion patch). I think, once we need support for new/wider registers in x86, we can introduce new KVM_X86_VCPU_STATE_FPU_WHATEVER substates that are able to replace the old one.

+
+struct kvm_vcpu_substate {
+	__u32 type;
+	__u32 pad;
+	__s64 offset;
+};
+
+#define KVM_MAX_VCPU_SUBSTATES 64
+
+struct kvm_vcpu_state {
+	__u32 nsubstates;   /* number of elements in substates */
+	__u32 nprocessed;   /* return value: successfully processed substates */
+	struct kvm_vcpu_substate substates[0];
+};
+

Wouldn't having an ordinary struct with lots of reserved space be simpler? If we add a bitmask, we can even selectively get/set the fields we want (important if new state extends old state: avx vs sse).

Simpler - hmm, maybe. But also less flexible. This would establish a static order inside this constantly growing mega struct.
And a user only interested in something small at its end would still have to allocate memory for the whole thing (maybe megabytes in the future, who knows?). And this mega struct will always carry all the legacy substates, even if they aren't used anymore in practice. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux
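For readers following the ABI discussion, the layout under review can be sketched in Python with ctypes. This is only an illustration of the structures quoted above — the interface was still being reviewed, and the type constants were slated to be renamed KVM_VCPU_STATE_* — not the final ABI:

```python
import ctypes

# Substate type constants as posted (pre-rename).
KVM_VCPU_REGS, KVM_VCPU_SREGS, KVM_VCPU_FPU, KVM_VCPU_MP = 0, 1, 2, 3

class KVMVCPUSubstate(ctypes.Structure):
    # Mirrors: __u32 type; __u32 pad; __s64 offset;
    _fields_ = [("type", ctypes.c_uint32),
                ("pad", ctypes.c_uint32),
                ("offset", ctypes.c_int64)]

def make_vcpu_state(substate_types):
    """Build the header plus a substate list, the way userspace would
    before issuing KVM_GET_VCPU_STATE / KVM_SET_VCPU_STATE."""
    n = len(substate_types)
    class KVMVCPUState(ctypes.Structure):
        # Mirrors: __u32 nsubstates; __u32 nprocessed;
        #          struct kvm_vcpu_substate substates[0];
        _fields_ = [("nsubstates", ctypes.c_uint32),
                    ("nprocessed", ctypes.c_uint32),
                    ("substates", KVMVCPUSubstate * n)]
    state = KVMVCPUState(nsubstates=n)
    for i, t in enumerate(substate_types):
        state.substates[i].type = t
    return state

# A caller interested only in REGS and FPU allocates just those entries --
# the flexibility Jan argues for versus one big static struct.
state = make_vcpu_state([KVM_VCPU_REGS, KVM_VCPU_FPU])
print(ctypes.sizeof(state))  # 8-byte header + 2 * 16-byte entries = 40
```

The point of the variable-length tail is visible in the arithmetic: the buffer grows only with the substates actually requested, instead of with every state field KVM has ever exported.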
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
Avi Kivity wrote: On 10/14/2009 01:06 AM, Jan Kiszka wrote:

@@ -1586,6 +1719,7 @@ static long kvm_dev_ioctl_check_extension_generic(long arg)
 	case KVM_CAP_USER_MEMORY:
 	case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
 	case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
+	case KVM_CAP_VCPU_STATE:
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
 	case KVM_CAP_SET_BOOT_CPU_ID:
 #endif

This should be done only for the archs that implement the ioctl.

All archs already implement the core substates.

Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux
Re: [PATCH 4/4] KVM: x86: Add VCPU substate for NMI states
Avi Kivity wrote: On 10/14/2009 01:06 AM, Jan Kiszka wrote: This plugs an NMI-related hole in the VCPU synchronization between kernel and user space. So far, neither pending NMIs nor the inhibit-NMI mask was properly read/set, which could cause problems on vmsave/restore, live migration and system reset. Fix it by making use of the new VCPU substate interface.

+struct kvm_nmi_state {
+	__u8 pending;
+	__u8 masked;
+	__u8 pad1[2];
+};

Best to be conservative and use 64-bit alignment. Who knows what we might put after this someday.

OK.

@@ -513,6 +513,8 @@ struct kvm_x86_ops {
 				unsigned char *hypercall_addr);
 	void (*set_irq)(struct kvm_vcpu *vcpu);
 	void (*set_nmi)(struct kvm_vcpu *vcpu);
+	int (*get_nmi_mask)(struct kvm_vcpu *vcpu);
+	void (*set_nmi_mask)(struct kvm_vcpu *vcpu, int masked);

Prefer bool for booleans, please.

OK.

Needs a KVM_CAP as well.

KVM_CAP_VCPU_STATE will imply KVM_CAP_NMI_STATE, so I skipped the latter (user space code would use the former anyway to avoid yet another #ifdef layer).

Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux
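Avi's alignment suggestion is easy to make concrete with ctypes: the struct as posted occupies 4 bytes, while a padded variant (the naming here is mine, not from the patch) fills out a 64-bit boundary so that whatever gets appended later starts naturally aligned:

```python
import ctypes

class KVMNMIStateV1(ctypes.Structure):
    # As posted: __u8 pending; __u8 masked; __u8 pad1[2];
    _fields_ = [("pending", ctypes.c_uint8),
                ("masked", ctypes.c_uint8),
                ("pad1", ctypes.c_uint8 * 2)]

class KVMNMIStateAligned(ctypes.Structure):
    # Hypothetical follow-up per the review comment: pad to 8 bytes.
    _fields_ = [("pending", ctypes.c_uint8),
                ("masked", ctypes.c_uint8),
                ("pad1", ctypes.c_uint8 * 6)]

print(ctypes.sizeof(KVMNMIStateV1))       # 4
print(ctypes.sizeof(KVMNMIStateAligned))  # 8
```

Since substates are meant to live in a frozen ABI, the extra four pad bytes cost nothing now but avoid misaligned 64-bit fields if the state is ever extended.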
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
Avi Kivity wrote: On 10/14/2009 01:06 AM, Jan Kiszka wrote: Add a new IOCTL pair to retrieve or set the VCPU state in one chunk. More precisely, the IOCTL is able to process a list of substates to be read or written. This list is easily extensible without breaking the existing ABI, thus we will no longer have to add new IOCTLs when we discover a missing VCPU state field or want to support new hardware features. This patch establishes the generic infrastructure for KVM_GET/SET_VCPU_STATE and adds support for the generic substates REGS, SREGS, FPU, and MP. To avoid code duplication, the entry points for the corresponding original IOCTLs are converted to make use of the new infrastructure internally, too.

One last thing - Documentation/kvm/api.txt needs updating. Glauber, this holds for your patches as well.

OK, will be done once the interface has settled.

Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux
Re: [PATCH 4/4] KVM: x86: Add VCPU substate for NMI states
On 10/15/2009 06:22 PM, Jan Kiszka wrote: Needs a KVM_CAP as well. KVM_CAP_VCPU_STATE will imply KVM_CAP_NMI_STATE, so I skipped the latter (user space code would use the former anyway to avoid yet another #ifdef layer). OK. New bits will need the KVM_CAP, though. Perhaps it makes sense to query about individual states, including existing ones? That will allow us to deprecate and then phase out broken states. It's probably not worth it. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
RE: Can't make virtio block driver work on Windows 2003
Maybe you can find some useful hints in this thread: http://www.proxmox.com/forum/showthread.php?t=1990 Best Regards, Martin

-Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Asdo Sent: Wednesday, 14. October 2009 19:52 To: kvm@vger.kernel.org Subject: Can't make virtio block driver work on Windows 2003

Hi all I have a new installation of Windows 2003 SBS server 32bit which I installed using IDE disk. KVM version is QEMU PC emulator version 0.10.50 (qemu-kvm-devel-86) compiled by myself on kernel 2.6.28-11-server. I have already moved networking from e1000 to virtio (e1000 was performing very sluggishly btw, probably was losing many packets, virtio seems to work) Now I want to move the disk to virtio... This is complex so I thought that first I wanted to see virtio installed and working on another drive. So I tried adding another drive, a virtio one, (a new 100MB file at host side) to the virtual machine and rebooting. A first problem is that Windows does not detect the new device upon boot or Add Hardware scan. Here is the kvm commandline (it's complex because it comes from libvirt):

/usr/local/kvm/bin/qemu-system-x86_64 -S -M pc -m 4096 -smp 4 -name winserv2 -uuid -monitor pty -boot c -drive file=/virtual_machines/kvm/nfsimport/winserv2.raw,if=ide,index=0,boot=on -drive file=/virtual_machines/kvm/nfsimport/zerofile,if=virtio,index=1 -net nic,macaddr=xx:xx:xx:xx:xx:xx,vlan=0,model=virtio -net tap,fd=25,vlan=0 -serial none -parallel none -usb -vnc 127.0.0.1:4

Even if Windows couldn't detect the new device I tried to install the driver anyway. On Add Hardware I go through to -- SCSI and RAID controllers -- Have Disk ..
and point it to the location of viostor files (windows 2003 x86) downloaded from: http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers http://people.redhat.com/~yvugenfi/24.09.2009/viostor.zip Windows does install the driver, however at the end it says: The software for this device is now installed, but may not work correctly. This device cannot start. (Code 10) and the new device gets flagged with a yellow exclamation mark in Device Manager. I don't know if it's the same reason as before, that the device is not detected so the driver cannot work, or another reason. Any idea? Thanks for your help
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
On 10/15/2009 06:22 PM, Jan Kiszka wrote: Better to split sse from fpu since we already know it is about to be replaced. The idea is to reuse the existing state structures, including struct kvm_fpu. This allows to provide the above substates for all archs and simplifies the migration (see my qemu conversion patch). I think, once we need support for new/wider registers in x86, we can introduce new KVM_X86_VCPU_STATE_FPU_WHATEVER substates that are able to replace the old one.

Makes sense, especially if we keep the list instead of the structure.

+
+struct kvm_vcpu_substate {
+	__u32 type;
+	__u32 pad;
+	__s64 offset;
+};
+
+#define KVM_MAX_VCPU_SUBSTATES 64
+
+struct kvm_vcpu_state {
+	__u32 nsubstates;   /* number of elements in substates */
+	__u32 nprocessed;   /* return value: successfully processed substates */
+	struct kvm_vcpu_substate substates[0];
+};
+

Wouldn't having an ordinary struct with lots of reserved space be simpler? If we add a bitmask, we can even selectively get/set the fields we want (important if new state extends old state: avx vs sse).

Simpler - hmm, maybe. But also less flexible. This would establish a static order inside this constantly growing mega struct. And a user only interested in something small at its end would still have to allocate memory for the whole thing (maybe megabytes in the future, who knows?). And this mega struct will always carry all the legacy substates, even if they aren't used anymore in practice.

I hope cpu state doesn't grow into megabytes, or we'll have problems live migrating them. But I see your point. The initial split assumed userspace would be interested in optimizing access (we used to have many more exits, and really old versions relied on qemu for emulation); that turned out not to be the case, but it's better to keep this capability for other possible userspaces. So let's go ahead with the list.

-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
[Autotest] [PATCH] Test 802.1Q vlan of nic
Test 802.1Q vlan of nic, config it by vconfig command. 1) Create two VMs 2) Setup guests in different vlan by vconfig and test communication by ping using hard-coded ip address 3) Setup guests in same vlan and test communication by ping 4) Recover the vlan config Signed-off-by: Amos Kong ak...@redhat.com --- client/tests/kvm/kvm_tests.cfg.sample |6 +++ client/tests/kvm/tests/vlan_tag.py| 73 + 2 files changed, 79 insertions(+), 0 deletions(-) mode change 100644 = 100755 client/tests/kvm/scripts/qemu-ifup create mode 100644 client/tests/kvm/tests/vlan_tag.py diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample index 9ccc9b5..4e47767 100644 --- a/client/tests/kvm/kvm_tests.cfg.sample +++ b/client/tests/kvm/kvm_tests.cfg.sample @@ -166,6 +166,12 @@ variants: used_cpus = 5 used_mem = 2560 +- vlan_tag: install setup +type = vlan_tag +subnet2 = 192.168.123 +vlans = 10 20 +nic_mode = tap +nic_model = e1000 - autoit: install setup type = autoit diff --git a/client/tests/kvm/scripts/qemu-ifup b/client/tests/kvm/scripts/qemu-ifup old mode 100644 new mode 100755 diff --git a/client/tests/kvm/tests/vlan_tag.py b/client/tests/kvm/tests/vlan_tag.py new file mode 100644 index 000..15e763f --- /dev/null +++ b/client/tests/kvm/tests/vlan_tag.py @@ -0,0 +1,73 @@ +import logging, time +from autotest_lib.client.common_lib import error +import kvm_subprocess, kvm_test_utils, kvm_utils + +def run_vlan_tag(test, params, env): + +Test 802.1Q vlan of nic, config it by vconfig command. + +1) Create two VMs +2) Setup guests in different vlan by vconfig and test communication by ping + using hard-coded ip address +3) Setup guests in same vlan and test communication by ping +4) Recover the vlan config + +@param test: Kvm test object +@param params: Dictionary with the test parameters. +@param env: Dictionary with test environment. 
+ + +vm = [] +session = [] +subnet2 = params.get(subnet2) +vlans = params.get(vlans).split() + +vm.append(kvm_test_utils.get_living_vm(env, %s % params.get(main_vm))) + +params_vm2 = params.copy() +params_vm2['image_snapshot'] = yes +params_vm2['kill_vm_gracefully'] = no +params_vm2[address_index] = int(params.get(address_index, 0))+1 +vm.append(vm[0].clone(vm2, params_vm2)) +kvm_utils.env_register_vm(env, vm2, vm[1]) +if not vm[1].create(): +raise error.TestError(VM 1 create faild) + +for i in range(2): +session.append(kvm_test_utils.wait_for_login(vm[i])) + +try: +vconfig_cmd = vconfig add eth0 %s;ifconfig eth0.%s %s.%s +# Attempt to configure IPs for the VMs and record the results in +# boolean variables +# Make vm1 and vm2 in the different vlan + +ip_config_vm1_ok = (session[0].get_command_status(vconfig_cmd + % (vlans[0], vlans[0], subnet2, 11)) == 0) +ip_config_vm2_ok = (session[1].get_command_status(vconfig_cmd + % (vlans[1], vlans[1], subnet2, 12)) == 0) +if not ip_config_vm1_ok or not ip_config_vm2_ok: +raise error.TestError, Fail to config VMs ip address +ping_diff_vlan_ok = (session[0].get_command_status( + ping -c 2 %s.12 % subnet2) == 0) + +if ping_diff_vlan_ok: +raise error.TestFail(VM 2 is unexpectedly pingable in different + vlan) +# Make vm2 in the same vlan with vm1 +vlan_config_vm2_ok = (session[1].get_command_status( + vconfig rem eth0.%s;vconfig add eth0 %s; + ifconfig eth0.%s %s.12 % + (vlans[1], vlans[0], vlans[0], subnet2)) == 0) +if not vlan_config_vm2_ok: +raise error.TestError, Fail to config ip address of VM 2 + +ping_same_vlan_ok = (session[0].get_command_status( + ping -c 2 %s.12 % subnet2) == 0) +if not ping_same_vlan_ok: +raise error.TestFail(Fail to ping the guest in same vlan) +finally: +# Clean the vlan config +for i in range(2): +session[i].sendline(vconfig rem eth0.%s % vlans[0]) +session[i].close() -- 1.5.5.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org 
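As a rough sketch of what this test sends to each guest, here is the command construction with the quoting restored (the patch text above lost its quotes in transit); the interface name and addresses are the test's own examples:

```python
def vconfig_cmds(iface, vlan_id, subnet, host_suffix):
    """Compose the two commands the test drives in a guest: create an
    802.1Q VLAN subinterface with vconfig, then give it an address."""
    return ["vconfig add %s %s" % (iface, vlan_id),
            "ifconfig %s.%s %s.%s" % (iface, vlan_id, subnet, host_suffix)]

# vm1 on VLAN 10, vm2 on VLAN 20: pings between them are expected to
# fail; after moving vm2 to VLAN 10 they should succeed again.
print(vconfig_cmds("eth0", "10", "192.168.123", "11"))
```

The test's pass/fail logic then reduces to running `ping -c 2` across the two subinterfaces and checking the exit status.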
[KVM-AUTOTEST PATCH 1/3] KVM test: VM.send_monitor_cmd() minor cleanup
- Move some of the code into a try..finally block
- Shutdown the socket before closing it

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_vm.py | 45 ---
 1 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index 82f1eb4..a8d96ca 100755
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -465,7 +465,7 @@ class VM:
         end_time = time.time() + timeout
         while time.time() < end_time:
             try:
-                o += s.recv(16384)
+                o += s.recv(1024)
                 if o.splitlines()[-1].split()[-1] == "(qemu)":
                     return (True, o)
             except:
@@ -481,27 +481,32 @@ class VM:
         except:
             logging.debug("Could not connect to monitor socket")
             return (1, "")
-        status, data = read_up_to_qemu_prompt(s, timeout)
-        if not status:
-            s.close()
-            logging.debug("Could not find (qemu) prompt; output so far: " \
-                          + kvm_utils.format_str_for_message(data))
-            return (1, "")
-        # Send command
-        s.sendall(command + "\n")
-        # Receive command output
-        data = ""
-        if block:
+
+        # Send the command and get the resulting output
+        try:
             status, data = read_up_to_qemu_prompt(s, timeout)
-            data = "\n".join(data.splitlines()[1:])
             if not status:
-                s.close()
-                logging.debug("Could not find (qemu) prompt after command; "
-                              "output so far: %s",
-                              kvm_utils.format_str_for_message(data))
-                return (1, data)
-        s.close()
-        return (0, data)
+                logging.debug("Could not find (qemu) prompt; output so far: "
+                              + kvm_utils.format_str_for_message(data))
+                return (1, "")
+            # Send command
+            s.sendall(command + "\n")
+            # Receive command output
+            data = ""
+            if block:
+                status, data = read_up_to_qemu_prompt(s, timeout)
+                data = "\n".join(data.splitlines()[1:])
+                if not status:
+                    logging.debug("Could not find (qemu) prompt after command; "
+                                  "output so far: " +
+                                  kvm_utils.format_str_for_message(data))
+                    return (1, data)
+            return (0, data)
+
+        # Clean up before exiting
+        finally:
+            s.shutdown(socket.SHUT_RDWR)
+            s.close()

     def destroy(self, gracefully=True):
--
1.5.4.1
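The try..finally pattern the patch introduces can be shown in isolation. This sketch stands in for send_monitor_cmd() and talks to a local socket pair instead of a real QEMU monitor; the point is that the shutdown/close cleanup runs on success, on error, and on early return alike:

```python
import socket
import threading

def query_monitor(s, command, timeout=20.0):
    """Send one command over a connected socket and always shut the
    socket down afterwards, whatever happens in between."""
    try:
        s.settimeout(timeout)
        s.sendall(command.encode() + b"\n")
        return s.recv(1024)
    finally:
        # Reached via return and via exceptions alike.
        s.shutdown(socket.SHUT_RDWR)
        s.close()

# Demonstrate with a socketpair: one end plays a tiny fake monitor.
a, b = socket.socketpair()
def fake_monitor():
    cmd = b.recv(64)
    b.sendall(b"(qemu) " + cmd)
t = threading.Thread(target=fake_monitor)
t.start()
reply = query_monitor(a, "info status")
t.join()
print(reply)  # b'(qemu) info status\n'
```

Calling shutdown() before close() also tells the peer explicitly that no more data is coming, rather than relying on garbage collection to drop the descriptor.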
[KVM-AUTOTEST PATCH 3/3] KVM test: modify messages in kvm_test_utils.wait_for_login()
Use "logged into guest" instead of "logged in guest". AFAIK this is more correct.

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_test_utils.py | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py
index e51520a..bf8aed2 100644
--- a/client/tests/kvm/kvm_test_utils.py
+++ b/client/tests/kvm/kvm_test_utils.py
@@ -52,12 +52,12 @@ def wait_for_login(vm, nic_index=0, timeout=240, start=0, step=2):
     @param timeout: Time to wait before giving up.
     @return: A shell session object.
     """
-    logging.info("Try to login to guest '%s'..." % vm.name)
+    logging.info("Trying to log into guest '%s'..." % vm.name)
     session = kvm_utils.wait_for(lambda: vm.remote_login(nic_index=nic_index),
                                  timeout, start, step)
     if not session:
         raise error.TestFail("Could not log into guest '%s'" % vm.name)
-    logging.info("Logged in '%s'" % vm.name)
+    logging.info("Logged into guest '%s'" % vm.name)
     return session
--
1.5.4.1
[KVM-AUTOTEST PATCH 2/3] KVM test: corrections to guest_s4
- Log into the guest again after it resumes from S4 ('session2' doesn't survive S4 in user networking mode).
- Use != 0 when checking the status returned by get_command_status(), because when things go wrong, it can also return None.
- Use time.sleep(float(params.get(...))) instead of time.sleep(params.get(...)) (time.sleep() doesn't accept strings).
- Do not check that the VM is alive after vm.create() because vm.create() already does that.
- Use get_command_output(kill_test_s4_cmd) instead of sendline(kill_test_s4_cmd), because get_command_output() waits for the prompt to return, and allows us to be sure that the command got delivered to the guest. This is especially important if the session is closed immediately after sending the command. In the case of the command that performs suspend to disk (set_s4_cmd), and the test_s4_cmd for Linux (nohup ...), it is OK to use sendline() because the prompt may never return. In the latter case, sleep(5) after the sendline() call should ensure that the command got delivered before the test proceeds.
- Change the timeouts in the wait_for(vm.is_dead, ...) call.
- For Windows guests modify test_s4_cmd to 'start ping -t localhost'. Using start /B isn't safe because then ping's output is redirected to a dead shell session.
- For Windows guests modify check_s4_cmd to 'tasklist | find /I ping.exe'. (It's a little more specific than just ping.)
- Run guest_s4 after autoit tests because it can leave Windows guests with an annoying welcome screen.
- Make the logging messages more consistent with the style of other tests.
Signed-off-by: Michael Goldish mgold...@redhat.com --- client/tests/kvm/kvm_tests.cfg.sample | 23 - client/tests/kvm/tests/guest_s4.py| 45 +++- 2 files changed, 38 insertions(+), 30 deletions(-) diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample index 9ccc9b5..296449d 100644 --- a/client/tests/kvm/kvm_tests.cfg.sample +++ b/client/tests/kvm/kvm_tests.cfg.sample @@ -118,15 +118,6 @@ variants: - linux_s3: install setup type = linux_s3 -- guest_s4: -type = guest_s4 -check_s4_support_cmd = grep -q disk /sys/power/state -test_s4_cmd = cd /tmp/;nohup tcpdump -q -t ip host localhost -check_s4_cmd = pgrep tcpdump -set_s4_cmd = echo disk /sys/power/state -kill_test_s4_cmd = pkill tcpdump -services_up_timeout = 30 - - timedrift:install setup extra_params += -rtc-td-hack variants: @@ -166,7 +157,6 @@ variants: used_cpus = 5 used_mem = 2560 - - autoit: install setup type = autoit autoit_binary = D:\AutoIt3.exe @@ -176,6 +166,15 @@ variants: - notepad: autoit_script = autoit/notepad1.au3 +- guest_s4: +type = guest_s4 +check_s4_support_cmd = grep -q disk /sys/power/state +test_s4_cmd = cd /tmp; nohup tcpdump -q -t ip host localhost +check_s4_cmd = pgrep tcpdump +set_s4_cmd = echo disk /sys/power/state +kill_test_s4_cmd = pkill tcpdump +services_up_timeout = 30 + - nic_hotplug: install setup type = pci_hotplug pci_type = nic @@ -518,8 +517,8 @@ variants: host_load_instances = 8 guest_s4: check_s4_support_cmd = powercfg /hibernate on -test_s4_cmd = start /B ping -n 3000 localhost -check_s4_cmd = tasklist | find /I ping +test_s4_cmd = start ping -t localhost +check_s4_cmd = tasklist | find /I ping.exe set_s4_cmd = rundll32.exe PowrProf.dll, SetSuspendState kill_test_s4_cmd = taskkill /IM ping.exe /F services_up_timeout = 30 diff --git a/client/tests/kvm/tests/guest_s4.py b/client/tests/kvm/tests/guest_s4.py index 7147e3b..f08b9d2 100644 --- a/client/tests/kvm/tests/guest_s4.py +++ b/client/tests/kvm/tests/guest_s4.py @@ -5,7 +5,7 @@ import 
kvm_test_utils, kvm_utils def run_guest_s4(test, params, env): -Suspend guest to disk,supports both Linux Windows OSes. +Suspend guest to disk, supports both Linux Windows OSes. @param test: kvm test object. @param params: Dictionary with test parameters. @@ -14,53 +14,62 @@ def run_guest_s4(test, params, env): vm = kvm_test_utils.get_living_vm(env, params.get(main_vm)) session = kvm_test_utils.wait_for_login(vm) -logging.info(Checking whether guest OS supports suspend to disk (S4)) +logging.info(Checking whether guest OS supports suspend to disk (S4)...) status = session.get_command_status(params.get(check_s4_support_cmd)) if status is None: logging.error(Failed to check if guest OS supports S4) elif status != 0: raise error.TestFail(Guest OS does not support S4) -logging.info(Wait until all guest OS services are fully started)
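The "!= 0" rationale from the changelog in miniature: get_command_status() returns 0 on success, a nonzero exit code on command failure, and None when the command could not be delivered at all, so the comparison has to treat both bad outcomes as failures:

```python
def command_failed(status):
    """Interpret a get_command_status()-style result: 0 means success;
    a nonzero exit code or None (command never delivered) means failure.
    A plain 'if status:' would be wrong too, since None is falsy."""
    return status != 0

print([command_failed(s) for s in (0, 1, None)])  # [False, True, True]
```

This is why the patch replaces bare truthiness checks with explicit `!= 0` comparisons throughout guest_s4.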
Re: [PATCH 4/4] KVM: x86: Add VCPU substate for NMI states
Avi Kivity wrote: On 10/15/2009 06:22 PM, Jan Kiszka wrote: Needs a KVM_CAP as well. KVM_CAP_VCPU_STATE will imply KVM_CAP_NMI_STATE, so I skipped the latter (user space code would use the former anyway to avoid yet another #ifdef layer). OK. New bits will need the KVM_CAP, though. For sure. Perhaps it makes sense to query about individual states, including existing ones? That will allow us to deprecate and then phase out broken states. It's probably not worth it. You may do this already with the given design: Set up a VCPU, then issue KVM_GET_VCPU_STATE on the substate in question. You will either get an error code or 0 if the substate is supported. At least no additional kernel code required. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux
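Jan's probe-by-requesting idea can be sketched with a stand-in for the ioctl. `fake_ioctl` below is purely illustrative — it models a kernel that knows only the four core substate types and rejects everything else, which is exactly the signal a userspace probe would rely on:

```python
import errno

def probe_substate(get_vcpu_state, substate_type):
    """Probe support for a substate by simply requesting it: success
    means supported, a rejection means unsupported.  'get_vcpu_state'
    stands in for issuing KVM_GET_VCPU_STATE on a set-up VCPU."""
    try:
        get_vcpu_state(substate_type)
        return True
    except OSError as e:
        if e.errno == errno.EINVAL:
            return False
        raise  # anything else is a real error, not "unsupported"

# Fake kernel side supporting only types 0-3 (REGS, SREGS, FPU, MP).
def fake_ioctl(substate_type):
    if substate_type not in (0, 1, 2, 3):
        raise OSError(errno.EINVAL, "unsupported substate")

print(probe_substate(fake_ioctl, 2), probe_substate(fake_ioctl, 99))  # True False
```

As the thread notes, this pushes the capability query into the interface that already exists, instead of adding a KVM_CAP per substate.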
Re: [Autotest] [PATCH] KVM test: Add PCI pass through test
On Thu, Oct 15, 2009 at 3:45 AM, Yolkfull Chow yz...@redhat.com wrote: On Wed, Oct 14, 2009 at 09:08:00AM -0300, Lucas Meneghel Rodrigues wrote: Add a new PCI pass through test. It supports both SR-IOV virtual functions and physical NIC card pass through. Single Root I/O Virtualization (SR-IOV) allows a single PCI device to be shared amongst multiple virtual machines while retaining the performance benefit of assigning a PCI device to a virtual machine. A common example is where a single SR-IOV capable NIC - with perhaps only a single physical network port - might be shared with multiple virtual machines by assigning a virtual function to each VM. SR-IOV support is implemented in the kernel. The core implementation is contained in the PCI subsystem, but there must also be driver support for both the Physical Function (PF) and Virtual Function (VF) devices. With an SR-IOV capable device one can allocate VFs from a PF. The VFs surface as PCI devices which are backed on the physical PCI device by resources (queues, and register sets).

Device support: In 2.6.30, the Intel® 82576 Gigabit Ethernet Controller is the only SR-IOV capable device supported. The igb driver has PF support and the igbvf has VF support. In 2.6.31 the Neterion® X3100™ is supported as well. This device uses the same vxge driver for the PF as well as the VFs.

Wow, new NIC card supports SR-IOV... At this rate, do we need to move the driver name and its parameter into the config file, so that in the future, if a new NIC card using a different driver is supported, we could handle it without changing code?

Absolutely, it didn't occur to me at first, but yes, we ought to move the driver name and parameters to the config file.

In order to configure the test: * For SR-IOV virtual functions passthrough, we could specify the module parameter 'max_vfs' in the config file. * For physical NIC card pass through, we should specify the device name(s).
Signed-off-by: Yolkfull Chow yz...@redhat.com --- client/tests/kvm/kvm_tests.cfg.sample | 11 ++- client/tests/kvm/kvm_utils.py | 278 + client/tests/kvm/kvm_vm.py | 72 + 3 files changed, 360 insertions(+), 1 deletions(-) diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample index cc3228a..1dad188 100644 --- a/client/tests/kvm/kvm_tests.cfg.sample +++ b/client/tests/kvm/kvm_tests.cfg.sample @@ -786,13 +786,22 @@ variants: only default image_format = raw - variants: - @smallpages: - hugepages: pre_command = /usr/bin/python scripts/hugepage.py /mnt/kvm_hugepage extra_params += -mem-path /mnt/kvm_hugepage +variants: + - @no_passthrough: + pass_through = no + - nic_passthrough: + pass_through = pf + passthrough_devs = eth1 + - vfs_passthrough: + pass_through = vf + max_vfs = 7 + vfs_count = 7 variants: - @basic: diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py index 53b664a..0e3398c 100644 --- a/client/tests/kvm/kvm_utils.py +++ b/client/tests/kvm/kvm_utils.py @@ -788,3 +788,281 @@ def md5sum_file(filename, size=None): size -= len(data) f.close() return o.hexdigest() + + +def get_full_id(pci_id): + + Get full PCI ID of pci_id. + + cmd = lspci -D | awk '/%s/ {print $1}' % pci_id + status, full_id = commands.getstatusoutput(cmd) + if status != 0: + return None + return full_id + + +def get_vendor_id(pci_id): + + Check out the device vendor ID according to PCI ID. + + cmd = lspci -n | awk '/%s/ {print $3}' % pci_id + return re.sub(:, , commands.getoutput(cmd)) + + +def release_dev(pci_id, pci_dict): + + Release a single PCI device. 
+ + @param pci_id: PCI ID of a given PCI device + @param pci_dict: Dictionary with information about PCI devices + """ + base_dir = "/sys/bus/pci" + full_id = get_full_id(pci_id) + vendor_id = get_vendor_id(pci_id) + drv_path = os.path.join(base_dir, "devices/%s/driver" % full_id) + if 'pci-stub' in os.readlink(drv_path): + cmd = "echo '%s' > %s/new_id" % (vendor_id, drv_path) + if os.system(cmd): + return False + + stub_path = os.path.join(base_dir, "drivers/pci-stub") + cmd = "echo '%s' > %s/unbind" % (full_id, stub_path) + if os.system(cmd): + return False + + prev_driver = pci_dict[pci_id] + cmd = "echo '%s' > %s/bind" % (full_id, prev_driver) + if os.system(cmd): + return False + return True + + +def release_pci_devs(pci_dict): + """ + Release all PCI devices assigned to host. + + @param pci_dict: Dictionary with information about PCI devices + """ + for pci_id in pci_dict: + if not release_dev(pci_id, pci_dict): +
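As a side note on the helpers above: get_full_id() shells out to lspci piped through awk just to pick the first column. A minimal sketch of doing the same match in pure Python (the sample `lspci -D` line and the function name are illustrative, not part of the patch):

```python
def full_id_from_lspci(output, pci_id):
    """Return the domain-qualified PCI ID whose short form matches pci_id.

    `lspci -D` prints the domain-qualified ID (e.g. "0000:00:19.0") as the
    first field of each line; the short form is just the tail of it.
    """
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0].endswith(pci_id):
            return fields[0]
    return None

# Illustrative lspci -D output line, not from a real host:
sample = "0000:00:19.0 Ethernet controller: Intel Corporation 82576"
print(full_id_from_lspci(sample, "00:19.0"))  # 0000:00:19.0
```

Parsing in-process avoids one shell invocation per lookup and an awk dependency, at the cost of capturing the lspci output once up front.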
Re: Can't make virtio block driver work on Windows 2003
Vadim Rozenfeld wrote: On 10/14/2009 07:52 PM, Asdo wrote: ... So I tried adding another drive, a virtio one, (a new 100MB file at host side) to the virtual machine and rebooting. A first problem is that Windows does not detect the new device upon boot or Add Hardware scan. Check PCI devices with info pci. You must have SCSI controller: PCI device 1af4:1001 device reported. It's not there. Does this make it a KVM bug? I'm attaching the PCI32.EXE output at the bottom of this email. BTW I would probably be able to switch to virtio anyway on this installation of Windows 2003, if I knew the way to insert the viostor driver into the windows boot image (windows's initrd), because if I set the first disk hda as virtio then kvm really makes it virtio (so maybe it's a configuration with one IDE and one virtio that does not work in KVM) and Windows bluescreens at boot. However I don't know how to insert the viostor driver in the windows boot image. Any suggestions? Here is the kvm commandline (it's complex because it comes from libvirt): /usr/local/kvm/bin/qemu-system-x86_64 -S -M pc -m 4096 -smp 4 -name winserv2 -uuid -monitor pty -boot c -drive file=/virtual_machines/kvm/nfsimport/winserv2.raw,if=ide,index=0,boot=on -drive file=/virtual_machines/kvm/nfsimport/zerofile,if=virtio,index=1 -net nic,macaddr=xx:xx:xx:xx:xx:xx,vlan=0,model=virtio -net tap,fd=25,vlan=0 -serial none -parallel none -usb -vnc 127.0.0.1:4 Craig Hart's PCI+AGP bus sniffer, Version 1.6, freeware made in 1996-2005. Searching for Devices using CFG Mechanism 1 [OS: Win 2003 Service Pack 1] Bus 0 (PCI), Device Number 0, Device Function 0 Vendor 8086h Intel Corporation Device 1237h 82441FX 440FX (Natoma) System Controller Rev 2 (SU053) Command 0000h (Bus Access Disabled!!)
Status 0000h Revision 02h, Header Type 00h, Bus Latency Timer 00h Self test 00h (Self test not supported) PCI Class Bridge, type PCI to HOST Subsystem ID 11001AF4h Unknown Subsystem Vendor 1AF4h Unknown Bus 0 (PCI), Device Number 1, Device Function 0 Vendor 8086h Intel Corporation Device 7000h 82371SB PIIX3 ISA Bridge Command 0007h (I/O Access, Memory Access, BusMaster) Status 0200h (Medium Timing) Revision 00h, Header Type 80h, Bus Latency Timer 00h Self test 00h (Self test not supported) PCI Class Bridge, type PCI to ISA Subsystem ID 11001AF4h Unknown Subsystem Vendor 1AF4h Unknown Bus 0 (PCI), Device Number 1, Device Function 1 Vendor 8086h Intel Corporation Device 7010h 82371SB PIIX3 EIDE Controller Command 0007h (I/O Access, Memory Access, BusMaster) Status 0280h (Supports Back-To-Back Trans., Medium Timing) Revision 00h, Header Type 00h, Bus Latency Timer 00h Self test 00h (Self test not supported) PCI Class Storage, type IDE (ATA) PCI EIDE Controller Features : BusMaster EIDE is supported Primary Channel is at I/O Port 01F0h and IRQ 14 Secondary Channel is at I/O Port 0170h and IRQ 15 Subsystem ID 11001AF4h Unknown Subsystem Vendor 1AF4h Unknown Address 4 is an I/O Port : C000h Bus 0 (PCI), Device Number 1, Device Function 2 Vendor 8086h Intel Corporation Device 7020h 82371SB PIIX3 USB Controller Rev 1 (SU093) Command 0007h (I/O Access, Memory Access, BusMaster) Status 0000h Revision 01h, Header Type 00h, Bus Latency Timer 00h Self test 00h (Self test not supported) PCI Class Serial, type USB (UHCI) Subsystem ID 11001AF4h Unknown Subsystem Vendor 1AF4h Unknown Address 4 is an I/O Port : C020h System IRQ 11, INT# D Bus 0 (PCI), Device Number 1, Device Function 3 Vendor 8086h Intel Corporation Device 7113h 82371MB PIIX4M Power Management Controller Command 0000h (Bus Access Disabled!!)
Status 0280h (Supports Back-To-Back Trans., Medium Timing) Revision 03h, Header Type 00h, Bus Latency Timer 00h Self test 00h (Self test not supported) PCI Class Bridge, type PCI to Other Subsystem ID 11001AF4h Unknown Subsystem Vendor 1AF4h Unknown System IRQ 9, INT# A Bus 0 (PCI), Device Number 2, Device Function 0 Vendor 1013h Cirrus Logic Device 00B8h CL-GD5446 PCI Command 0007h (I/O Access, Memory Access, BusMaster) Status 0000h Revision 00h, Header Type 00h, Bus Latency Timer 00h Self test 00h (Self test not supported) PCI Class Display, type VGA Subsystem ID 11001AF4h Unknown Subsystem Vendor 1AF4h Unknown Address 0 is a Memory Address (anywhere in 0-4Gb, Prefetchable) : F0000000h Address 1 is a Memory Address (anywhere in 0-4Gb) : F2000000h Bus 0 (PCI), Device Number 3, Device Function 0 Vendor 1AF4h Unknown Device 1000h Unknown Command 0007h (I/O Access, Memory Access, BusMaster) Status 0000h Revision 00h, Header Type 00h, Bus Latency Timer 00h Self test 00h (Self test not supported) PCI Class Network, type Ethernet Subsystem ID 00011AF4h Unknown Subsystem Vendor 1AF4h Unknown Address 0 is an I/O Port : C040h System IRQ 10, INT# A Bus 0 (PCI), Device Number 4, Device Function 0 Vendor 1AF4h Unknown Device 1002h Unknown Command 0001h (I/O Access) Status 0000h Revision 00h, Header Type
Re: sync guest calls made async on host - SQLite performance
On Wed, Oct 14, 2009 at 05:54:23PM -0500, Anthony Liguori wrote: Historically it didn't, and the only safe way to use virtio was in cache=writethrough mode. Which should be the default on Ubuntu's kvm that this report is concerned with, so I'm a bit confused. So can we please get the detailed setup where this happens, that is:
- filesystem used in the guest
- any volume manager / software raid used in the guest
- kernel version in the guest
- image format used
- qemu command line, including caching mode, using ide/scsi/virtio, etc
- qemu/kvm version
- filesystem used in the host
- any volume manager / software raid used in the host
- kernel version in the host
Avi's patch is a performance optimization, not a correctness issue? It could actually minimally degrade performance. For the existing filesystems as the upper layer it does not improve correctness either. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Single memory slot
On 15.10.2009, at 09:33, Avi Kivity wrote: One way to improve the gfn_to_pfn() memslot search is to register just one slot. This can only work on 64-bit, since even the smallest guests need 4GB of physical address space. Apart from speeding up gfn_to_page(), it would also speed up mmio, which must iterate over all slots, so a lookup cache cannot help. This would require quite a bunch of changes: - modify gfn_to_pfn() to fail gracefully if the page is in the slot but unmapped (hole handling) - modify qemu to reserve the guest physical address space - modify qemu memory allocation to use MAP_FIXED to allocate memory - some hack for the vga aliases (mmap an fd multiple times?) - some hack for the vmx-specific pages (e.g. APIC-access page) Not sure it's worthwhile, but something to keep in mind if a simple cache or sort by size is insufficient due to mmio. One thing I've been wondering for quite a while is that slot loop. Why do we loop over all possible slots? Couldn't we just remember the max entry (usually 1 or 2) and not loop MAX_SLOT_AMOUNT times? That would be a really easy patch and give instant speed improvements for everyone. Alex
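To make the alternatives discussed here concrete, the following is an illustrative Python model of a last-hit cache in front of a full slot scan. The slot layout and the class itself are made up for the example and do not mirror kvm's actual data structures:

```python
class MemslotCache:
    """Toy model of a gfn -> memslot lookup with a last-hit cache."""

    def __init__(self, slots):
        self.slots = slots   # list of (base_gfn, npages) pairs
        self.last = 0        # index of the most recently matched slot

    def lookup(self, gfn):
        # Fast path first: most lookups hit the same slot as the previous one,
        # so check the cached index before scanning the rest.
        order = [self.last] + [i for i in range(len(self.slots)) if i != self.last]
        for i in order:
            base, npages = self.slots[i]
            if base <= gfn < base + npages:
                self.last = i
                return i
        return None          # gfn falls in no slot: treat as mmio

cache = MemslotCache([(0, 0xa0), (0x100, 0x1000)])
print(cache.lookup(0x50))   # 0
print(cache.lookup(0x200))  # 1
print(cache.lookup(0x201))  # 1  (served by the cached index first)
```

Note the cache only helps hits; an mmio lookup still scans every slot before concluding the gfn is unmapped, which is exactly why the single-slot idea above is attractive for mmio-heavy workloads.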
Re: sync guest calls made async on host - SQLite performance
On Thu, Oct 15, 2009 at 02:17:02PM +0200, Christoph Hellwig wrote: On Wed, Oct 14, 2009 at 05:54:23PM -0500, Anthony Liguori wrote: Historically it didn't and the only safe way to use virtio was in cache=writethrough mode. Which should be the default on Ubuntu's kvm that this report is concerned with so I'm a bit confused. So can we please get the detailed setup where this happens, that is: filesystem used in the guest any volume manager / software raid used in the guest kernel version in the guest image format used qemu command line including caching mode, using ide/scsi/virtio, etc qemu/kvm version filesystem used in the host any volume manager / software raid used in the host kernel version in the host And very important the mount options (/proc/self/mounts) of both host and guest.
Re: Can't make virtio block driver work on Windows 2003
On 10/15/2009 01:42 PM, Asdo wrote: Vadim Rozenfeld wrote: On 10/14/2009 07:52 PM, Asdo wrote: ... So I tried adding another drive, a virtio one, (a new 100MB file at host side) to the virtual machine and rebooting. A first problem is that Windows does not detect the new device upon boot or Add Hardware scan. Check PCI devices with info pci. You must have SCSI controller: PCI device 1af4:1001 device reported. It's not there. Does this make it a KVM bug? Looks like virtio-blk device wasn't initialized. Otherwise I cannot explain why 0x1100 device is here. Try to start block device without index=1 Anyway, if you can, please send info pci output from QEMU monitor console. Thank you, Vadim. I'm attaching the PCI32.EXE output at the bottom of this email BTW I would probably be able to switch to virtio anyway on this installation of Windows 2003, if I knew the way to insert the viostor driver into the windows boot image (windows's initrd), because if I set the first disk hda as virtio then kvm really makes it virtio (so maybe it's a configuration with one IDE and one virtio that does not work in KVM) and Windows bluescreens at boot. However I don't know how to insert the viostor driver in the windows boot image. Any suggestions? Here is the kvm commandline (it's complex because it comes from libvirt): /usr/local/kvm/bin/qemu-system-x86_64 -S -M pc -m 4096-smp 4 -name winserv2 -uuid -monitor pty -boot c -drive file=/virtual_machines/kvm/nfsimport/winserv2.raw,if=ide,index=0,boot=on -drive file=/virtual_machines/kvm/nfsimport/zerofile,if=virtio,index=1 -net nic,macaddr=xx:xx:xx:xx:xx:xx,vlan=0,model=virtio -net tap,fd=25,vlan=0 -serial none -parallel none -usb -vnc 127.0.0.1:4 Craig Hart's PCI+AGP bus sniffer, Version 1.6, freeware made in 1996-2005. 
[PCI32.EXE listing snipped - it is identical to the output quoted earlier in this thread]
Re: Raw vs. tap
Michael S. Tsirkin wrote: On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote: I would be much more inclined to consider taking raw and improving the performance long term if guest-host networking worked. This appears to be a fundamental limitation though and I think it's something that will forever plague users if we include this feature. In fact, I think it's fixable with a raw socket bound to a macvlan. Would that be enough? What setup does that entail on the part of a user? Wouldn't we be back to square one wrt users having to run archaic networking commands in order to set things up? Regards, Anthony Liguori
Re: Can't make virtio block driver work on Windows 2003
Vadim Rozenfeld wrote: On 10/15/2009 01:42 PM, Asdo wrote: Vadim Rozenfeld wrote: On 10/14/2009 07:52 PM, Asdo wrote: ... So I tried adding another drive, a virtio one, (a new 100MB file at host side) to the virtual machine and rebooting. A first problem is that Windows does not detect the new device upon boot or Add Hardware scan. Check PCI devices with info pci. You must have SCSI controller: PCI device 1af4:1001 device reported. It's not there. Does this make it a KVM bug? Looks like virtio-blk device wasn't initialized. Otherwise I cannot explain why 0x1100 device is here. Try to start block device without index=1 Anyway, if you can, please send info pci output from QEMU monitor console. Owh! Ok THAT was info pci Ok I am copying by hand before removing index=1 (qemu) info pci Bus 0, device 0, function 0: Host bridge: PCI device 8086:1237 Bus 0, device 1, function 0: ISA bridge: PCI device 8086:7000 Bus 0, device 1, function 1: IDE controller: PCI device 8086:7010 BAR4: I/O at 0xc000 [0xc00f]. Bus 0, device 1, function 3: Bridge: PCI device 8086:7113 IRQ 9 Bus 0, device 2, function 0: VGA controller: PCI device 1013:00b8 BAR0: 32 bit memory at 0xf000 [0xf1ff] BAR1: 32 bit memory at 0xf200 [0xf2000fff] Bus 0, device 3, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 11 BAR0: I/O at 0xc020 [0xc03f] Bus 0, device 4, function 0: RAM controller: PCI device 1af4:1002 IRQ 11 BAR0: I/O at 0xc040 (qemu) so it's not there Now I remove index=1: WOW it's there now! ... Bus 0, device 4, function 0: Storage controller: PCI device 1af4:1001 IRQ 11 BAR0: I/O at 0xc040 [0xc07f] (just before the 1002 device) So now windows sees it and I was able to install the viostor drivers (btw Windows was not happy with the previously installed viostor drivers, I had to reinstall those and I got two devices, and the previous one still had the yellow exclamation mark, so I had to uninstall that one. After the procedure I was able to boot on virtio too! Yeah!).
Great so yes, I'd say you *DO* have a KVM bug: one has to remove index=1 for the second disk to appear. How did you know that, Vadim, is it a known issue with kvm? It's better to fix that, because libvirt puts index=n for all drives, so it's impossible to work around the problem if one uses libvirt. I had to launch manually... Thanks a lot Vadim. Asdo
[PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst
From: Orit Wasserman or...@il.ibm.com --- arch/x86/kvm/vmx.c | 468 ++-- arch/x86/kvm/x86.c |3 +- 2 files changed, 459 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 411cbdb..8c186e0 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -61,20 +61,168 @@ module_param_named(unrestricted_guest, static int __read_mostly emulate_invalid_guest_state = 0; module_param(emulate_invalid_guest_state, bool, S_IRUGO); + +struct __attribute__ ((__packed__)) shadow_vmcs { + u32 revision_id; + u32 abort; + u16 virtual_processor_id; + u16 guest_es_selector; + u16 guest_cs_selector; + u16 guest_ss_selector; + u16 guest_ds_selector; + u16 guest_fs_selector; + u16 guest_gs_selector; + u16 guest_ldtr_selector; + u16 guest_tr_selector; + u16 host_es_selector; + u16 host_cs_selector; + u16 host_ss_selector; + u16 host_ds_selector; + u16 host_fs_selector; + u16 host_gs_selector; + u16 host_tr_selector; + u64 io_bitmap_a; + u64 io_bitmap_b; + u64 msr_bitmap; + u64 vm_exit_msr_store_addr; + u64 vm_exit_msr_load_addr; + u64 vm_entry_msr_load_addr; + u64 tsc_offset; + u64 virtual_apic_page_addr; + u64 apic_access_addr; + u64 ept_pointer; + u64 guest_physical_address; + u64 vmcs_link_pointer; + u64 guest_ia32_debugctl; + u64 guest_ia32_pat; + u64 guest_pdptr0; + u64 guest_pdptr1; + u64 guest_pdptr2; + u64 guest_pdptr3; + u64 host_ia32_pat; + u32 pin_based_vm_exec_control; + u32 cpu_based_vm_exec_control; + u32 exception_bitmap; + u32 page_fault_error_code_mask; + u32 page_fault_error_code_match; + u32 cr3_target_count; + u32 vm_exit_controls; + u32 vm_exit_msr_store_count; + u32 vm_exit_msr_load_count; + u32 vm_entry_controls; + u32 vm_entry_msr_load_count; + u32 vm_entry_intr_info_field; + u32 vm_entry_exception_error_code; + u32 vm_entry_instruction_len; + u32 tpr_threshold; + u32 secondary_vm_exec_control; + u32 vm_instruction_error; + u32 vm_exit_reason; + u32 vm_exit_intr_info; + u32 vm_exit_intr_error_code; + u32 idt_vectoring_info_field; 
+ u32 idt_vectoring_error_code; + u32 vm_exit_instruction_len; + u32 vmx_instruction_info; + u32 guest_es_limit; + u32 guest_cs_limit; + u32 guest_ss_limit; + u32 guest_ds_limit; + u32 guest_fs_limit; + u32 guest_gs_limit; + u32 guest_ldtr_limit; + u32 guest_tr_limit; + u32 guest_gdtr_limit; + u32 guest_idtr_limit; + u32 guest_es_ar_bytes; + u32 guest_cs_ar_bytes; + u32 guest_ss_ar_bytes; + u32 guest_ds_ar_bytes; + u32 guest_fs_ar_bytes; + u32 guest_gs_ar_bytes; + u32 guest_ldtr_ar_bytes; + u32 guest_tr_ar_bytes; + u32 guest_interruptibility_info; + u32 guest_activity_state; + u32 guest_sysenter_cs; + u32 host_ia32_sysenter_cs; + unsigned long cr0_guest_host_mask; + unsigned long cr4_guest_host_mask; + unsigned long cr0_read_shadow; + unsigned long cr4_read_shadow; + unsigned long cr3_target_value0; + unsigned long cr3_target_value1; + unsigned long cr3_target_value2; + unsigned long cr3_target_value3; + unsigned long exit_qualification; + unsigned long guest_linear_address; + unsigned long guest_cr0; + unsigned long guest_cr3; + unsigned long guest_cr4; + unsigned long guest_es_base; + unsigned long guest_cs_base; + unsigned long guest_ss_base; + unsigned long guest_ds_base; + unsigned long guest_fs_base; + unsigned long guest_gs_base; + unsigned long guest_ldtr_base; + unsigned long guest_tr_base; + unsigned long guest_gdtr_base; + unsigned long guest_idtr_base; + unsigned long guest_dr7; + unsigned long guest_rsp; + unsigned long guest_rip; + unsigned long guest_rflags; + unsigned long guest_pending_dbg_exceptions; + unsigned long guest_sysenter_esp; + unsigned long guest_sysenter_eip; + unsigned long host_cr0; + unsigned long host_cr3; + unsigned long host_cr4; + unsigned long host_fs_base; + unsigned long host_gs_base; + unsigned long host_tr_base; + unsigned long host_gdtr_base; + unsigned long host_idtr_base; + unsigned long host_ia32_sysenter_esp; + unsigned long host_ia32_sysenter_eip; + unsigned long host_rsp; + unsigned long host_rip; +}; + struct 
__attribute__ ((__packed__)) level_state { /* Has the level1 guest done vmclear? */ bool vmclear; + u16 vpid; + u64 shadow_efer; + unsigned long cr2; +
[PATCH 2/5] Nested VMX patch 2 implements vmclear
From: Orit Wasserman or...@il.ibm.com --- arch/x86/kvm/vmx.c | 70 --- 1 files changed, 65 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 71bd91a..411cbdb 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -61,15 +61,26 @@ module_param_named(unrestricted_guest, static int __read_mostly emulate_invalid_guest_state = 0; module_param(emulate_invalid_guest_state, bool, S_IRUGO); -struct vmcs { - u32 revision_id; - u32 abort; - char data[0]; +struct __attribute__ ((__packed__)) level_state { + /* Has the level1 guest done vmclear? */ + bool vmclear; }; struct nested_vmx { /* Has the level1 guest done vmxon? */ bool vmxon; + + /* +* Level 2 state : includes vmcs,registers and +* a copy of vmcs12 for vmread/vmwrite +*/ + struct level_state *l2_state; +}; + +struct vmcs { + u32 revision_id; + u32 abort; + char data[0]; }; struct vcpu_vmx { @@ -186,6 +197,8 @@ static struct kvm_vmx_segment_field { static void ept_save_pdptrs(struct kvm_vcpu *vcpu); +static int create_l2_state(struct kvm_vcpu *vcpu); + /* * Keep MSR_K6_STAR at the end, as setup_msrs() will try to optimize it * away by decrementing the array size. @@ -1293,6 +1306,30 @@ static void vmclear_local_vcpus(void) __vcpu_clear(vmx); } +struct level_state *create_state(void) +{ + struct level_state *state = NULL; + + state = kzalloc(sizeof(struct level_state), GFP_KERNEL); + if (!state) { + printk(KERN_INFO "Error create level state\n"); + return NULL; + } + return state; +} + +int create_l2_state(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + + if (!vmx->nested.l2_state) { + vmx->nested.l2_state = create_state(); + if (!vmx->nested.l2_state) + return -ENOMEM; + } + + return 0; +} /* Just like cpu_vmxoff(), but with the __kvm_handle_fault_on_reboot() * tricks.
@@ -3261,6 +3298,27 @@ static int handle_vmx_insn(struct kvm_vcpu *vcpu) return 1; } +static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu) +{ + unsigned long rflags; + rflags = vmx_get_rflags(vcpu); + rflags &= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF); + vmx_set_rflags(vcpu, rflags); +} + +static int handle_vmclear(struct kvm_vcpu *vcpu) +{ + if (!nested_vmx_check_permission(vcpu)) + return 1; + + to_vmx(vcpu)->nested.l2_state->vmclear = 1; + + skip_emulated_instruction(vcpu); + clear_rflags_cf_zf(vcpu); + + return 1; +} + static int handle_vmoff(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -3310,6 +3368,8 @@ static int handle_vmon(struct kvm_vcpu *vcpu) vmx->nested.vmxon = 1; + create_l2_state(vcpu); + skip_emulated_instruction(vcpu); return 1; } @@ -3582,7 +3642,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { [EXIT_REASON_HLT] = handle_halt, [EXIT_REASON_INVLPG] = handle_invlpg, [EXIT_REASON_VMCALL] = handle_vmcall, - [EXIT_REASON_VMCLEAR] = handle_vmx_insn, + [EXIT_REASON_VMCLEAR] = handle_vmclear, [EXIT_REASON_VMLAUNCH]= handle_vmx_insn, [EXIT_REASON_VMPTRLD] = handle_vmx_insn, [EXIT_REASON_VMPTRST] = handle_vmx_insn, -- 1.6.0.4
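For readers unfamiliar with the convention handle_vmclear() relies on: a VMX instruction signals success ("VMsucceed") by clearing both CF and ZF in guest RFLAGS, which is what clear_rflags_cf_zf() implements. A small Python model of the same bit manipulation:

```python
# Flag bit positions in RFLAGS (architectural values).
X86_EFLAGS_CF = 1 << 0
X86_EFLAGS_ZF = 1 << 6

def vmsucceed(rflags):
    """Clear CF and ZF, leaving all other flag bits untouched."""
    return rflags & ~(X86_EFLAGS_CF | X86_EFLAGS_ZF)

print(hex(vmsucceed(0x41)))  # 0x0   (CF and ZF both cleared)
print(hex(vmsucceed(0x97)))  # 0x96  (only CF was set; other bits preserved)
```

The failure paths (VMfailInvalid/VMfailValid) set these flags instead, which is why the handler must explicitly clear them on the success path.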
[PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite
From: Orit Wasserman or...@il.ibm.com --- arch/x86/kvm/vmx.c | 591 +++- 1 files changed, 589 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 8c186e0..6a4c252 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -225,6 +225,21 @@ struct nested_vmx { struct level_state *l1_state; }; +enum vmcs_field_type { + VMCS_FIELD_TYPE_U16 = 0, + VMCS_FIELD_TYPE_U64 = 1, + VMCS_FIELD_TYPE_U32 = 2, + VMCS_FIELD_TYPE_ULONG = 3 +}; + +#define VMCS_FIELD_LENGTH_OFFSET 13 +#define VMCS_FIELD_LENGTH_MASK 0x6000 + +static inline int vmcs_field_length(unsigned long field) +{ + return (VMCS_FIELD_LENGTH_MASK & field) >> 13; +} + struct vmcs { u32 revision_id; u32 abort; @@ -288,6 +303,404 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu) return container_of(vcpu, struct vcpu_vmx, vcpu); } +#define SHADOW_VMCS_OFFSET(x) offsetof(struct shadow_vmcs, x) + +static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = { + + [VIRTUAL_PROCESSOR_ID] = + SHADOW_VMCS_OFFSET(virtual_processor_id), + [GUEST_ES_SELECTOR] = + SHADOW_VMCS_OFFSET(guest_es_selector), + [GUEST_CS_SELECTOR] = + SHADOW_VMCS_OFFSET(guest_cs_selector), + [GUEST_SS_SELECTOR] = + SHADOW_VMCS_OFFSET(guest_ss_selector), + [GUEST_DS_SELECTOR] = + SHADOW_VMCS_OFFSET(guest_ds_selector), + [GUEST_FS_SELECTOR] = + SHADOW_VMCS_OFFSET(guest_fs_selector), + [GUEST_GS_SELECTOR] = + SHADOW_VMCS_OFFSET(guest_gs_selector), + [GUEST_LDTR_SELECTOR] = + SHADOW_VMCS_OFFSET(guest_ldtr_selector), + [GUEST_TR_SELECTOR] = + SHADOW_VMCS_OFFSET(guest_tr_selector), + [HOST_ES_SELECTOR] = + SHADOW_VMCS_OFFSET(host_es_selector), + [HOST_CS_SELECTOR] = + SHADOW_VMCS_OFFSET(host_cs_selector), + [HOST_SS_SELECTOR] = + SHADOW_VMCS_OFFSET(host_ss_selector), + [HOST_DS_SELECTOR] = + SHADOW_VMCS_OFFSET(host_ds_selector), + [HOST_FS_SELECTOR] = + SHADOW_VMCS_OFFSET(host_fs_selector), + [HOST_GS_SELECTOR] = + SHADOW_VMCS_OFFSET(host_gs_selector), + [HOST_TR_SELECTOR] = +
SHADOW_VMCS_OFFSET(host_tr_selector), + [IO_BITMAP_A] = + SHADOW_VMCS_OFFSET(io_bitmap_a), + [IO_BITMAP_A_HIGH] = + SHADOW_VMCS_OFFSET(io_bitmap_a)+4, + [IO_BITMAP_B] = + SHADOW_VMCS_OFFSET(io_bitmap_b), + [IO_BITMAP_B_HIGH] = + SHADOW_VMCS_OFFSET(io_bitmap_b)+4, + [MSR_BITMAP] = + SHADOW_VMCS_OFFSET(msr_bitmap), + [MSR_BITMAP_HIGH] = + SHADOW_VMCS_OFFSET(msr_bitmap)+4, + [VM_EXIT_MSR_STORE_ADDR] = + SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr), + [VM_EXIT_MSR_STORE_ADDR_HIGH] = + SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr)+4, + [VM_EXIT_MSR_LOAD_ADDR] = + SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr), + [VM_EXIT_MSR_LOAD_ADDR_HIGH] = + SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr)+4, + [VM_ENTRY_MSR_LOAD_ADDR] = + SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr), + [VM_ENTRY_MSR_LOAD_ADDR_HIGH] = + SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr)+4, + [TSC_OFFSET] = + SHADOW_VMCS_OFFSET(tsc_offset), + [TSC_OFFSET_HIGH] = + SHADOW_VMCS_OFFSET(tsc_offset)+4, + [VIRTUAL_APIC_PAGE_ADDR] = + SHADOW_VMCS_OFFSET(virtual_apic_page_addr), + [VIRTUAL_APIC_PAGE_ADDR_HIGH] = + SHADOW_VMCS_OFFSET(virtual_apic_page_addr)+4, + [APIC_ACCESS_ADDR] = + SHADOW_VMCS_OFFSET(apic_access_addr), + [APIC_ACCESS_ADDR_HIGH] = + SHADOW_VMCS_OFFSET(apic_access_addr)+4, + [EPT_POINTER] = + SHADOW_VMCS_OFFSET(ept_pointer), + [EPT_POINTER_HIGH] = + SHADOW_VMCS_OFFSET(ept_pointer)+4, + [GUEST_PHYSICAL_ADDRESS] = + SHADOW_VMCS_OFFSET(guest_physical_address), + [GUEST_PHYSICAL_ADDRESS_HIGH] = + SHADOW_VMCS_OFFSET(guest_physical_address)+4, + [VMCS_LINK_POINTER] = + SHADOW_VMCS_OFFSET(vmcs_link_pointer), + [VMCS_LINK_POINTER_HIGH] = + SHADOW_VMCS_OFFSET(vmcs_link_pointer)+4, + [GUEST_IA32_DEBUGCTL] = + SHADOW_VMCS_OFFSET(guest_ia32_debugctl), + [GUEST_IA32_DEBUGCTL_HIGH] = + SHADOW_VMCS_OFFSET(guest_ia32_debugctl)+4, + [GUEST_IA32_PAT] = + SHADOW_VMCS_OFFSET(guest_ia32_pat), + [GUEST_IA32_PAT_HIGH] = + SHADOW_VMCS_OFFSET(guest_ia32_pat)+4, + [GUEST_PDPTR0] = + SHADOW_VMCS_OFFSET(guest_pdptr0), + [GUEST_PDPTR0_HIGH] = +
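The table above maps each architectural VMCS field encoding to a byte offset inside the shadow_vmcs struct, with the _HIGH half of every 64-bit field placed at offset+4. The width of a field is encoded in bits 13:14 of the encoding itself, which is exactly what vmcs_field_length() extracts. A Python sketch of that decode (the example encodings are the architectural values also used by the kernel's vmx.h; the helper itself is just a model):

```python
# Bits 13:14 of a VMCS field encoding select the access width.
VMCS_FIELD_TYPE = {0: "u16", 1: "u64", 2: "u32", 3: "natural-width"}
VMCS_FIELD_LENGTH_MASK = 0x6000

def field_type(encoding):
    """Decode the width class of a VMCS field from its encoding."""
    return VMCS_FIELD_TYPE[(encoding & VMCS_FIELD_LENGTH_MASK) >> 13]

print(field_type(0x0800))  # u16           (GUEST_ES_SELECTOR)
print(field_type(0x2000))  # u64           (IO_BITMAP_A)
print(field_type(0x4002))  # u32           (CPU_BASED_VM_EXEC_CONTROL)
print(field_type(0x681e))  # natural-width (GUEST_RIP)
```

This is why a single flat offset table works for vmread/vmwrite: the encoding alone tells the handler how many bytes to copy at the looked-up offset.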
[PATCH 1/5] Nested VMX patch 1 implements vmon and vmoff
From: Orit Wasserman or...@il.ibm.com --- arch/x86/kvm/svm.c |3 - arch/x86/kvm/vmx.c | 217 +++- arch/x86/kvm/x86.c |6 +- arch/x86/kvm/x86.h |2 + 4 files changed, 222 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 2df9b45..3c1f22a 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -124,9 +124,6 @@ static int npt = 1; module_param(npt, int, S_IRUGO); -static int nested = 1; -module_param(nested, int, S_IRUGO); - static void svm_flush_tlb(struct kvm_vcpu *vcpu); static void svm_complete_interrupts(struct vcpu_svm *svm); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 78101dd..71bd91a 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -67,6 +67,11 @@ struct vmcs { char data[0]; }; +struct nested_vmx { + /* Has the level1 guest done vmxon? */ + bool vmxon; +}; + struct vcpu_vmx { struct kvm_vcpu vcpu; struct list_head local_vcpus_link; @@ -114,6 +119,9 @@ struct vcpu_vmx { ktime_t entry_time; s64 vnmi_blocked_time; u32 exit_reason; + + /* Nested vmx */ + struct nested_vmx nested; }; static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu) @@ -967,6 +975,95 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc) } /* + * Handles msr read for nested virtualization + */ +static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, + u64 *pdata) +{ + u64 vmx_msr = 0; + + switch (msr_index) { + case MSR_IA32_FEATURE_CONTROL: + *pdata = 0; + break; + case MSR_IA32_VMX_BASIC: + *pdata = 0; + rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr); + *pdata = (vmx_msr & 0x00cf); + break; + case MSR_IA32_VMX_PINBASED_CTLS: + rdmsrl(MSR_IA32_VMX_PINBASED_CTLS, vmx_msr); + *pdata = (PIN_BASED_EXT_INTR_MASK & vmcs_config.pin_based_exec_ctrl) | + (PIN_BASED_NMI_EXITING & vmcs_config.pin_based_exec_ctrl) | + (PIN_BASED_VIRTUAL_NMIS & vmcs_config.pin_based_exec_ctrl); + break; + case MSR_IA32_VMX_PROCBASED_CTLS: + { + u32 vmx_msr_high, vmx_msr_low; + u32 control = CPU_BASED_HLT_EXITING | +#ifdef CONFIG_X86_64 +
CPU_BASED_CR8_LOAD_EXITING | + CPU_BASED_CR8_STORE_EXITING | +#endif + CPU_BASED_CR3_LOAD_EXITING | + CPU_BASED_CR3_STORE_EXITING | + CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_MOV_DR_EXITING | + CPU_BASED_USE_TSC_OFFSETING | + CPU_BASED_INVLPG_EXITING | + CPU_BASED_TPR_SHADOW | + CPU_BASED_USE_MSR_BITMAPS | + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; + + rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high); + + control &= vmx_msr_high; /* bit == 0 in high word == must be zero */ + control |= vmx_msr_low; /* bit == 1 in low word == must be one */ + + *pdata = (CPU_BASED_HLT_EXITING & control) | +#ifdef CONFIG_X86_64 + (CPU_BASED_CR8_LOAD_EXITING & control) | + (CPU_BASED_CR8_STORE_EXITING & control) | +#endif + (CPU_BASED_CR3_LOAD_EXITING & control) | + (CPU_BASED_CR3_STORE_EXITING & control) | + (CPU_BASED_USE_IO_BITMAPS & control) | + (CPU_BASED_MOV_DR_EXITING & control) | + (CPU_BASED_USE_TSC_OFFSETING & control) | + (CPU_BASED_INVLPG_EXITING & control); + + if (cpu_has_secondary_exec_ctrls()) + *pdata |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; + + if (vm_need_tpr_shadow(vcpu->kvm)) + *pdata |= CPU_BASED_TPR_SHADOW; + break; + } + case MSR_IA32_VMX_EXIT_CTLS: + *pdata = 0; +#ifdef CONFIG_X86_64 + *pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE; +#endif + break; + case MSR_IA32_VMX_ENTRY_CTLS: + *pdata = 0; + break; + case MSR_IA32_VMX_PROCBASED_CTLS2: + *pdata = 0; + if (vm_need_virtualize_apic_accesses(vcpu->kvm)) + *pdata |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; + break; + case MSR_IA32_VMX_EPT_VPID_CAP: + *pdata = 0; + break; + default: + return 1; + } + + return 0; +} + +/* * Reads an msr value (of 'msr_index') into 'pdata'. * Returns 0 on success, non-0 otherwise. * Assumes vcpu_load() was already called. @@ -1005,6 +1102,9 @@ static int
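nested_vmx_get_msr() applies the standard VMX capability-MSR convention here: the low 32 bits of a CTLS MSR report the control bits that must be 1, the high 32 bits report the bits that are allowed to be 1, and the requested controls are masked with "AND high, OR low". An illustrative Python model (the MSR value in the example is made up, not a real CPU's capabilities):

```python
def filter_controls(wanted, cap_msr):
    """Mask requested control bits against a VMX capability MSR."""
    must_be_one = cap_msr & 0xffffffff  # low word: bit set => must be 1
    may_be_one = cap_msr >> 32          # high word: bit clear => must be 0
    ctl = wanted
    ctl &= may_be_one    # drop bits the CPU does not allow
    ctl |= must_be_one   # force bits the CPU requires
    return ctl

# Example: bit 7 requested but not allowed; bit 1 forced on regardless.
print(hex(filter_controls(0x80, (0x7f << 32) | 0x02)))  # 0x2
```

The same filtering is what lets the nested code advertise to L1 only the subset of controls the host kernel is actually prepared to emulate.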
Nested VMX support v3
Avi, We have addressed all of the comments, please apply. The following patches implement nested VMX support. The patches enable a guest to use the VMX APIs in order to run its own nested guest (i.e., enable running other hypervisors which use VMX under KVM). The current patches support running Linux under a nested KVM using shadow page tables (with bypass_guest_pf disabled). SMP support was fixed. Reworking EPT support to mesh cleanly with the current shadow paging design per Avi's comments is a work-in-progress. The current patches only support a single nested hypervisor, which can only run a single guest (multiple guests are work in progress). Only 64-bit nested hypervisors are supported. Additional patches for running Windows under nested KVM, and Linux under nested VMware server(!), are currently running in the lab. We are in the process of forward-porting those patches to -tip. These patches were written by: Orit Wasserman, or...@il.ibm.com Ben-Ami Yassor, ben...@il.ibm.com Abel Gordon, ab...@il.ibm.com Muli Ben-Yehuda, m...@il.ibm.com With contributions by: Anthony Liguori, aligu...@us.ibm.com Mike Day, md...@us.ibm.com This work was inspired by the nested SVM support by Alexander Graf and Joerg Roedel. Changes since v2: Added check to nested_vmx_get_msr. Static initialization of the vmcs_field_to_offset_table array. Use the memory allocated by L1 for VMCS12 to store the shadow vmcs. Some optimizations to the prepare_vmcs_12 function. vpid allocation will be updated with the multiguest support (work in progress). We are working on fixing the cr0.TS handling; it works for nested KVM but not for VMware Server. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] allow userspace to adjust kvmclock offset
On Thu, Oct 15, 2009 at 09:46:52AM +0900, Avi Kivity wrote: On 10/13/2009 09:46 PM, Glauber Costa wrote: On Tue, Oct 13, 2009 at 03:31:08PM +0300, Avi Kivity wrote: On 10/13/2009 03:28 PM, Glauber Costa wrote: Do we want an absolute or relative adjustment? What exactly do you mean? Absolute adjustment: clock = t Relative adjustment: clock += t The delta is absolute, but the adjustment in the clock is relative. So we take the difference between what userspace is passing us and what we currently have, then apply it as a relative adjustment, so we can make sure the clock won't go backwards or suffer too big a skew. The motivation for relative adjustment is when you have a jitter-resistant place to gather timing information (like the kernel, which can disable interrupts and preemption), then pass it on to kvm without losing information due to scheduling. For migration there is no such place since it involves two hosts, but it makes sense to support relative adjustments. Since we added the padding you asked for, we could use that bit of information to define whether it will be a relative or absolute adjustment, then. Right now, I don't see the point of implementing a code path that will be completely untested. I'd leave it this way until someone comes up with a need.
Re: Can't make virtio block driver work on Windows 2003
On 10/15/2009 04:23 PM, Asdo wrote: Vadim Rozenfeld wrote: On 10/15/2009 01:42 PM, Asdo wrote: Vadim Rozenfeld wrote: On 10/14/2009 07:52 PM, Asdo wrote: ... So I tried adding another drive, a virtio one, (a new 100MB file at host side) to the virtual machine and rebooting. A first problem is that Windows does not detect the new device upon boot or Add Hardware scan. Check PCI devices with info pci. You must have SCSI controller: PCI device 1af4:1001 device reported. It's not there. Does this make it a KVM bug? Looks like the virtio-blk device wasn't initialized. Otherwise I cannot explain why the 0x1100 device is here. Try to start the block device without index=1. Anyway, if you can, please send info pci output from the QEMU monitor console. Owh! Ok THAT was info pci. Ok, I am copying by hand before removing index=1:

(qemu) info pci
Bus 0, device 0, function 0: Host bridge: PCI device 8086:1237
Bus 0, device 1, function 0: ISA bridge: PCI device 8086:7000
Bus 0, device 1, function 1: IDE controller: PCI device 8086:7010 BAR4: I/O at 0xc000 [0xc00f]
Bus 0, device 1, function 3: Bridge: PCI device 8086:7133 IRQ 9
Bus 0, device 2, function 0: VGA controller: PCI device 1013:00b8 BAR0: 32-bit memory at 0xf000 [0xf1ff] BAR1: 32-bit memory at 0xf200 [0xf2000fff]
Bus 0, device 3, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 11 BAR0: I/O at 0xc020 [0xc03f]
Bus 0, device 4, function 0: RAM controller: PCI device 1af4:1002 IRQ 11 BAR0: I/O at 0xc040
(qemu)

so it's not there. Now I remove index=1: WOW it's there now! ...

Bus 0, device 4, function 0: Storage controller: PCI device 1af4:1001 IRQ 11 BAR0: I/O at 0xc040 [0xc07f]

(just before the 1002 device) So now Windows sees it and I was able to install the viostor drivers (btw Windows was not happy with the previously installed viostor drivers, I had to reinstall those and I got two devices, and the previous one still had the yellow exclamation mark, so I had to uninstall that one. After the procedure I was able to boot on virtio too! Yeah!).
Great, so yes, I'd say you *DO* have a KVM bug: one has to remove index=1 for the second disk to appear. How did you know that, Vadim, is it a known issue with kvm? I don't know. I think I've seen it once or twice while debugging viostor on old qemu-kvm. But it definitely works with the recent versions. Regards, Vadim It's better to fix that because libvirt puts index=n for all drives, so it's impossible to work around the problem if one uses libvirt. I had to launch manually... Thanks a lot Vadim. Asdo
Re: Raw vs. tap
On Thu, Oct 15, 2009 at 08:32:03AM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote: I would be much more inclined to consider taking raw and improving the performance long term if guest-host networking worked. This appears to be a fundamental limitation though and I think it's something that will forever plague users if we include this feature. In fact, I think it's fixable with a raw socket bound to a macvlan. Would that be enough? What setup does that entail on the part of a user? Wouldn't we be back to square one wrt users having to run archaic networking commands in order to set things up? Unlike bridge, qemu could set up macvlan without disrupting host networking. The only issue would be cleanup if qemu is killed. Regards, Anthony Liguori
Re: Raw vs. tap
Michael S. Tsirkin wrote: On Thu, Oct 15, 2009 at 08:32:03AM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote: I would be much more inclined to consider taking raw and improving the performance long term if guest-host networking worked. This appears to be a fundamental limitation though and I think it's something that will forever plague users if we include this feature. In fact, I think it's fixable with a raw socket bound to a macvlan. Would that be enough? What setup does that entail on the part of a user? Wouldn't we be back to square one wrt users having to run archaic networking commands in order to set things up? Unlike bridge, qemu could set up macvlan without disrupting host networking. The only issue would be cleanup if qemu is killed. But this would require additional features in macvlan, correct? This also only works if a guest uses the mac address assigned to it, correct? If a guest was bridging the virtual nic, this would all come apart? Regards, Anthony Liguori
Re: Raw vs. tap
On Thu, Oct 15, 2009 at 10:18:18AM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: On Thu, Oct 15, 2009 at 08:32:03AM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote: I would be much more inclined to consider taking raw and improving the performance long term if guest-host networking worked. This appears to be a fundamental limitation though and I think it's something that will forever plague users if we include this feature. In fact, I think it's fixable with a raw socket bound to a macvlan. Would that be enough? What setup does that entail on the part of a user? Wouldn't we be back to square one wrt users having to run archaic networking commands in order to set things up? Unlike bridge, qemu could set up macvlan without disrupting host networking. The only issue would be cleanup if qemu is killed. But this would require additional features in macvlan, correct? Not sure: what is the "this" that you are talking about? It can already be set up without disturbing host networking. This also only works if a guest uses the mac address assigned to it, correct? If a guest was bridging the virtual nic, this would all come apart? Hmm, you could enable promisc mode, but generally this is true: if you require bridging, use a bridge. Regards, Anthony Liguori
Re: [PATCH 0/2] Switch pcbios to submodule
On Mon, Oct 12, 2009 at 12:25:40PM +0200, Avi Kivity wrote: Instead of carrying pcbios as a subtree in kvm/bios/, switch to a submodule in roms/pcbios/. The submodule contains all of the subtree history and is merge-compatible with qemu.git's pcbios submodule. Applied, thanks.
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
Glauber Costa wrote: On Thu, Oct 15, 2009 at 05:11:27PM +0900, Avi Kivity wrote: On 10/14/2009 01:06 AM, Jan Kiszka wrote: Add a new IOCTL pair to retrieve or set the VCPU state in one chunk. More precisely, the IOCTL is able to process a list of substates to be read or written. This list is easily extensible without breaking the existing ABI, thus we will no longer have to add new IOCTLs when we discover a missing VCPU state field or want to support new hardware features. This patch establishes the generic infrastructure for KVM_GET/ SET_VCPU_STATE and adds support for the generic substates REGS, SREGS, FPU, and MP. To avoid code duplication, the entry point for the corresponding original IOCTLs are converted to make use of the new infrastructure internally, too. One last thing - Documentation/kvm/api.txt needs updating. Glauber, this holds for your patches as well. Now looking at it... you do realize that that file is terribly outdated, right? At least it's terribly incomplete. I just decided to add my stuff at the bottom and wait for a bored soul to refactor, fix, extend, whatever this thing. :) Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux
Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
On Thu, Oct 15, 2009 at 06:06:04PM +0200, Jan Kiszka wrote: Glauber Costa wrote: On Thu, Oct 15, 2009 at 05:11:27PM +0900, Avi Kivity wrote: On 10/14/2009 01:06 AM, Jan Kiszka wrote: Add a new IOCTL pair to retrieve or set the VCPU state in one chunk. More precisely, the IOCTL is able to process a list of substates to be read or written. This list is easily extensible without breaking the existing ABI, thus we will no longer have to add new IOCTLs when we discover a missing VCPU state field or want to support new hardware features. This patch establishes the generic infrastructure for KVM_GET/ SET_VCPU_STATE and adds support for the generic substates REGS, SREGS, FPU, and MP. To avoid code duplication, the entry point for the corresponding original IOCTLs are converted to make use of the new infrastructure internally, too. One last thing - Documentation/kvm/api.txt needs updating. Glauber, this holds for your patches as well. Now looking at it... you do realize that that file is terribly outdated, right? At least it's terribly incomplete. I just decided to add my stuff at the bottom and wait for a bored soul to refactor, fix, extend, whatever this thing. :) We'll probably clash, then =p.
Re: [PATCH] kvm: fix MSR_COUNT for kvm_arch_save_regs()
On Wed, Oct 14, 2009 at 03:02:27PM -0300, Eduardo Habkost wrote: A new register was added to the load/save list on commit d283d5a65a2bdcc570065267be21848bd6fe3d78, but MSR_COUNT was not updated, leading to potential stack corruption on kvm_arch_save_regs(). The following registers are saved by kvm_arch_save_regs(): 1) MSR_IA32_SYSENTER_CS 2) MSR_IA32_SYSENTER_ESP 3) MSR_IA32_SYSENTER_EIP 4) MSR_STAR 5) MSR_IA32_TSC 6) MSR_VM_HSAVE_PA 7) MSR_CSTAR (x86_64 only) 8) MSR_KERNELGSBASE (x86_64 only) 9) MSR_FMASK (x86_64 only) 10) MSR_LSTAR (x86_64 only) Signed-off-by: Eduardo Habkost ehabk...@redhat.com Applied, thanks.
Re: [PATCH] kvm: Prevent kvm_init from corrupting debugfs structures
On Wed, Oct 14, 2009 at 04:21:00PM -0700, Darrick J. Wong wrote: I'm seeing an oops condition when kvm-intel and kvm-amd are modprobe'd during boot (say on an Intel system) and then rmmod'd: # modprobe kvm-intel kvm_init() kvm_init_debug() kvm_arch_init() -- stores debugfs dentries internally (success, etc) # modprobe kvm-amd kvm_init() kvm_init_debug() -- second initialization clobbers kvm's internal pointers to dentries kvm_arch_init() kvm_exit_debug() -- and frees them # rmmod kvm-intel kvm_exit() kvm_exit_debug() -- double free of debugfs files! *BOOM* If execution gets to the end of kvm_init(), then the calling module has been established as the kvm provider. Move the debugfs initialization to the end of the function, and remove the now-unnecessary call to kvm_exit_debug() from the error path. That way we avoid trampling on the debugfs entries and freeing them twice. Signed-off-by: Darrick J. Wong djw...@us.ibm.com Applied, thanks.
[PATCH v2 1/3] change function signatures so that they don't take a vcpu argument
At this point, vcpu arguments are passed only for the fd field. We already provide that in env, as kvm_fd. Replace it. Signed-off-by: Glauber Costa glom...@redhat.com --- cpu-defs.h |1 - hw/apic.c |4 +- kvm-tpr-opt.c | 16 +- qemu-kvm-x86.c | 91 ++-- qemu-kvm.c | 97 +++ qemu-kvm.h | 74 ++- 6 files changed, 134 insertions(+), 149 deletions(-) diff --git a/cpu-defs.h b/cpu-defs.h index 1f48267..cf502e9 100644 --- a/cpu-defs.h +++ b/cpu-defs.h @@ -141,7 +141,6 @@ struct qemu_work_item; struct KVMCPUState { pthread_t thread; int signalled; -void *vcpu_ctx; struct qemu_work_item *queued_work_first, *queued_work_last; int regs_modified; }; diff --git a/hw/apic.c b/hw/apic.c index b8fe529..9e707bd 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -900,7 +900,7 @@ static void kvm_kernel_lapic_save_to_user(APICState *s) struct kvm_lapic_state *kapic = apic; int i, v; -kvm_get_lapic(s-cpu_env-kvm_cpu_state.vcpu_ctx, kapic); +kvm_get_lapic(s-cpu_env, kapic); s-id = kapic_reg(kapic, 0x2) 24; s-tpr = kapic_reg(kapic, 0x8); @@ -953,7 +953,7 @@ static void kvm_kernel_lapic_load_from_user(APICState *s) kapic_set_reg(klapic, 0x38, s-initial_count); kapic_set_reg(klapic, 0x3e, s-divide_conf); -kvm_set_lapic(s-cpu_env-kvm_cpu_state.vcpu_ctx, klapic); +kvm_set_lapic(s-cpu_env, klapic); } #endif diff --git a/kvm-tpr-opt.c b/kvm-tpr-opt.c index f7b6f3b..932b49b 100644 --- a/kvm-tpr-opt.c +++ b/kvm-tpr-opt.c @@ -70,7 +70,7 @@ static uint8_t read_byte_virt(CPUState *env, target_ulong virt) { struct kvm_sregs sregs; -kvm_get_sregs(env-kvm_cpu_state.vcpu_ctx, sregs); +kvm_get_sregs(env, sregs); return ldub_phys(map_addr(sregs, virt, NULL)); } @@ -78,7 +78,7 @@ static void write_byte_virt(CPUState *env, target_ulong virt, uint8_t b) { struct kvm_sregs sregs; -kvm_get_sregs(env-kvm_cpu_state.vcpu_ctx, sregs); +kvm_get_sregs(env, sregs); stb_phys(map_addr(sregs, virt, NULL), b); } @@ -86,7 +86,7 @@ static __u64 kvm_rsp_read(CPUState *env) { struct kvm_regs regs; 
-kvm_get_regs(env-kvm_cpu_state.vcpu_ctx, regs); +kvm_get_regs(env, regs); return regs.rsp; } @@ -192,7 +192,7 @@ static int bios_is_mapped(CPUState *env, uint64_t rip) if (bios_enabled) return 1; -kvm_get_sregs(env-kvm_cpu_state.vcpu_ctx, sregs); +kvm_get_sregs(env, sregs); probe = (rip 0xf000) + 0xe; phys = map_addr(sregs, probe, perms); @@ -240,7 +240,7 @@ static int enable_vapic(CPUState *env) if (pcr_cpu 0) return 0; -kvm_enable_vapic(env-kvm_cpu_state.vcpu_ctx, vapic_phys + (pcr_cpu 7)); +kvm_enable_vapic(env, vapic_phys + (pcr_cpu 7)); cpu_physical_memory_rw(vapic_phys + (pcr_cpu 7) + 4, one, 1, 1); bios_enabled = 1; @@ -313,7 +313,7 @@ void kvm_tpr_access_report(CPUState *env, uint64_t rip, int is_write) void kvm_tpr_vcpu_start(CPUState *env) { -kvm_enable_tpr_access_reporting(env-kvm_cpu_state.vcpu_ctx); +kvm_enable_tpr_access_reporting(env); if (bios_enabled) enable_vapic(env); } @@ -363,7 +363,7 @@ static void vtpr_ioport_write(void *opaque, uint32_t addr, uint32_t val) struct kvm_sregs sregs; uint32_t rip; -kvm_get_regs(env-kvm_cpu_state.vcpu_ctx, regs); +kvm_get_regs(env, regs); rip = regs.rip - 2; write_byte_virt(env, rip, 0x66); write_byte_virt(env, rip + 1, 0x90); @@ -371,7 +371,7 @@ static void vtpr_ioport_write(void *opaque, uint32_t addr, uint32_t val) return; if (!bios_is_mapped(env, rip)) printf(bios not mapped?\n); -kvm_get_sregs(env-kvm_cpu_state.vcpu_ctx, sregs); +kvm_get_sregs(env, sregs); for (addr = 0xf000u; addr = 0x8000u; addr -= 4096) if (map_addr(sregs, addr, NULL) == 0xfee0u) { real_tpr = addr + 0x80; diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index fffcfd8..8c4140d 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -172,14 +172,14 @@ static int kvm_handle_tpr_access(CPUState *env) } -int kvm_enable_vapic(kvm_vcpu_context_t vcpu, uint64_t vapic) +int kvm_enable_vapic(CPUState *env, uint64_t vapic) { int r; struct kvm_vapic_addr va = { .vapic_addr = vapic, }; - r = ioctl(vcpu-fd, KVM_SET_VAPIC_ADDR, va); + r = ioctl(env-kvm_fd, 
KVM_SET_VAPIC_ADDR, va); if (r == -1) { r = -errno; perror(kvm_enable_vapic); @@ -281,12 +281,12 @@ int kvm_destroy_memory_alias(kvm_context_t kvm, uint64_t phys_start) #ifdef KVM_CAP_IRQCHIP -int kvm_get_lapic(kvm_vcpu_context_t vcpu, struct kvm_lapic_state *s) +int kvm_get_lapic(CPUState *env, struct kvm_lapic_state *s) { int r; if
[PATCH v2 2/3] get rid of vcpu structure
We have no use for it anymore. Only trace of it was in vcpu_create. Make it disappear. Signed-off-by: Glauber Costa glom...@redhat.com --- qemu-kvm.c | 11 +++ qemu-kvm.h |5 - 2 files changed, 3 insertions(+), 13 deletions(-) diff --git a/qemu-kvm.c b/qemu-kvm.c index 700d030..7943281 100644 --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -440,16 +440,13 @@ static void kvm_create_vcpu(CPUState *env, int id) { long mmap_size; int r; -kvm_vcpu_context_t vcpu_ctx = qemu_malloc(sizeof(struct kvm_vcpu_context)); r = kvm_vm_ioctl(kvm_state, KVM_CREATE_VCPU, id); if (r 0) { fprintf(stderr, kvm_create_vcpu: %m\n); -goto err; +return; } -vcpu_ctx-fd = r; - env-kvm_fd = r; env-kvm_state = kvm_state; @@ -459,7 +456,7 @@ static void kvm_create_vcpu(CPUState *env, int id) goto err_fd; } env-kvm_run = -mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu_ctx-fd, +mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, env-kvm_fd, 0); if (env-kvm_run == MAP_FAILED) { fprintf(stderr, mmap vcpu area: %m\n); @@ -468,9 +465,7 @@ static void kvm_create_vcpu(CPUState *env, int id) return; err_fd: -close(vcpu_ctx-fd); - err: -free(vcpu_ctx); +close(env-kvm_fd); } static int kvm_set_boot_vcpu_id(kvm_context_t kvm, uint32_t id) diff --git a/qemu-kvm.h b/qemu-kvm.h index abcb98d..588bc80 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -76,12 +76,7 @@ struct kvm_context { int max_gsi; }; -struct kvm_vcpu_context { -int fd; -}; - typedef struct kvm_context *kvm_context_t; -typedef struct kvm_vcpu_context *kvm_vcpu_context_t; #include kvm.h int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory, -- 1.6.2.5
[PATCH v2 3/3] use upstream kvm_vcpu_ioctl
[v2: we already return -errno, so fix testers ] Signed-off-by: Glauber Costa glom...@redhat.com --- kvm-all.c |3 -- qemu-kvm-x86.c | 57 +++ qemu-kvm.c | 31 - qemu-kvm.h |1 + 4 files changed, 26 insertions(+), 66 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index 1356aa8..5ea999e 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -861,7 +861,6 @@ int kvm_vm_ioctl(KVMState *s, int type, ...) return ret; } -#ifdef KVM_UPSTREAM int kvm_vcpu_ioctl(CPUState *env, int type, ...) { int ret; @@ -879,8 +878,6 @@ int kvm_vcpu_ioctl(CPUState *env, int type, ...) return ret; } -#endif - int kvm_has_sync_mmu(void) { #ifdef KVM_CAP_SYNC_MMU diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 8c4140d..c8e37ed 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -174,18 +174,11 @@ static int kvm_handle_tpr_access(CPUState *env) int kvm_enable_vapic(CPUState *env, uint64_t vapic) { - int r; struct kvm_vapic_addr va = { .vapic_addr = vapic, }; - r = ioctl(env-kvm_fd, KVM_SET_VAPIC_ADDR, va); - if (r == -1) { - r = -errno; - perror(kvm_enable_vapic); - return r; - } - return 0; + return kvm_vcpu_ioctl(env, KVM_SET_VAPIC_ADDR, va); } #endif @@ -283,28 +276,16 @@ int kvm_destroy_memory_alias(kvm_context_t kvm, uint64_t phys_start) int kvm_get_lapic(CPUState *env, struct kvm_lapic_state *s) { - int r; if (!kvm_irqchip_in_kernel()) return 0; - r = ioctl(env-kvm_fd, KVM_GET_LAPIC, s); - if (r == -1) { - r = -errno; - perror(kvm_get_lapic); - } - return r; + return kvm_vcpu_ioctl(env, KVM_GET_LAPIC, s); } int kvm_set_lapic(CPUState *env, struct kvm_lapic_state *s) { - int r; if (!kvm_irqchip_in_kernel()) return 0; - r = ioctl(env-kvm_fd, KVM_SET_LAPIC, s); - if (r == -1) { - r = -errno; - perror(kvm_set_lapic); - } - return r; + return kvm_vcpu_ioctl(env, KVM_SET_LAPIC, s); } #endif @@ -420,29 +401,25 @@ struct kvm_msr_list *kvm_get_msr_list(kvm_context_t kvm) int kvm_get_msrs(CPUState *env, struct kvm_msr_entry *msrs, int n) { struct kvm_msrs *kmsrs = qemu_malloc(sizeof *kmsrs + n * sizeof *msrs); 
-int r, e; +int r; kmsrs-nmsrs = n; memcpy(kmsrs-entries, msrs, n * sizeof *msrs); -r = ioctl(env-kvm_fd, KVM_GET_MSRS, kmsrs); -e = errno; +r = kvm_vcpu_ioctl(env, KVM_GET_MSRS, kmsrs); memcpy(msrs, kmsrs-entries, n * sizeof *msrs); free(kmsrs); -errno = e; return r; } int kvm_set_msrs(CPUState *env, struct kvm_msr_entry *msrs, int n) { struct kvm_msrs *kmsrs = qemu_malloc(sizeof *kmsrs + n * sizeof *msrs); -int r, e; +int r; kmsrs-nmsrs = n; memcpy(kmsrs-entries, msrs, n * sizeof *msrs); -r = ioctl(env-kvm_fd, KVM_SET_MSRS, kmsrs); -e = errno; +r = kvm_vcpu_ioctl(env, KVM_SET_MSRS, kmsrs); free(kmsrs); -errno = e; return r; } @@ -464,7 +441,7 @@ int kvm_get_mce_cap_supported(kvm_context_t kvm, uint64_t *mce_cap, int kvm_setup_mce(CPUState *env, uint64_t *mcg_cap) { #ifdef KVM_CAP_MCE -return ioctl(env-kvm_fd, KVM_X86_SETUP_MCE, mcg_cap); +return kvm_vcpu_ioctl(env, KVM_X86_SETUP_MCE, mcg_cap); #else return -ENOSYS; #endif @@ -473,7 +450,7 @@ int kvm_setup_mce(CPUState *env, uint64_t *mcg_cap) int kvm_set_mce(CPUState *env, struct kvm_x86_mce *m) { #ifdef KVM_CAP_MCE -return ioctl(env-kvm_fd, KVM_X86_SET_MCE, m); +return kvm_vcpu_ioctl(env, KVM_X86_SET_MCE, m); #else return -ENOSYS; #endif @@ -563,7 +540,7 @@ int kvm_setup_cpuid(CPUState *env, int nent, cpuid-nent = nent; memcpy(cpuid-entries, entries, nent * sizeof(*entries)); - r = ioctl(env-kvm_fd, KVM_SET_CPUID, cpuid); + r = kvm_vcpu_ioctl(env, KVM_SET_CPUID, cpuid); free(cpuid); return r; @@ -579,11 +556,7 @@ int kvm_setup_cpuid2(CPUState *env, int nent, cpuid-nent = nent; memcpy(cpuid-entries, entries, nent * sizeof(*entries)); - r = ioctl(env-kvm_fd, KVM_SET_CPUID2, cpuid); - if (r == -1) { - fprintf(stderr, kvm_setup_cpuid2: %m\n); - r = -errno; - } + r = kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid); free(cpuid); return r; } @@ -634,13 +607,7 @@ static int tpr_access_reporting(CPUState *env, int enabled) r = kvm_ioctl(kvm_state, KVM_CHECK_EXTENSION, KVM_CAP_VAPIC); if (r = 0) return -ENOSYS; - r = 
ioctl(env-kvm_fd, KVM_TPR_ACCESS_REPORTING, tac); - if (r == -1) { - r = -errno; - perror(KVM_TPR_ACCESS_REPORTING); - return r; - } - return 0; + return
[PATCH v2 3/4] KVM: x86: Add support for KVM_GET/SET_VCPU_STATE
Add support for getting/setting MSRs, CPUID tree, and the LACPIC via the new VCPU state interface. Also in this case we convert the existing IOCTLs to use the new infrastructure internally. The MSR interface has to be extended to pass back the number of processed MSRs via the header structure instead of the return code as the latter is not available with the new IOCTL. The semantic of the original KVM_GET/SET_MSRS is not affected by this change. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Documentation/kvm/api.txt | 18 arch/x86/include/asm/kvm.h |8 +- arch/x86/kvm/x86.c | 209 3 files changed, 156 insertions(+), 79 deletions(-) diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index 7c0be8d..bee5bbd 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -830,3 +830,21 @@ Deprecates: KVM_GET/SET_MP_STATE struct kvm_mp_state { __u32 mp_state; /* KVM_MP_STATE_* */ }; + +6.5 KVM_X86_VCPU_STATE_MSRS + +Architectures: x86 +Payload: struct kvm_msrs (see KVM_GET_MSRS) +Deprecates: KVM_GET/SET_MSRS + +6.6 KVM_X86_VCPU_STATE_CPUID + +Architectures: x86 +Payload: struct kvm_cpuid2 +Deprecates: KVM_GET/SET_CPUID2 + +6.7 KVM_X86_VCPU_STATE_LAPIC + +Architectures: x86 +Payload: struct kvm_lapic +Deprecates: KVM_GET/SET_LAPIC diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h index f02e87a..326615a 100644 --- a/arch/x86/include/asm/kvm.h +++ b/arch/x86/include/asm/kvm.h @@ -150,7 +150,7 @@ struct kvm_msr_entry { /* for KVM_GET_MSRS and KVM_SET_MSRS */ struct kvm_msrs { __u32 nmsrs; /* number of msrs in entries */ - __u32 pad; + __u32 nprocessed; /* return value: successfully processed entries */ struct kvm_msr_entry entries[0]; }; @@ -251,4 +251,10 @@ struct kvm_reinject_control { __u8 pit_reinject; __u8 reserved[31]; }; + +/* for KVM_GET/SET_VCPU_STATE */ +#define KVM_X86_VCPU_STATE_MSRS1000 +#define KVM_X86_VCPU_STATE_CPUID 1001 +#define KVM_X86_VCPU_STATE_LAPIC 1002 + #endif /* _ASM_X86_KVM_H */ diff --git 
a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 685215b..46fad88 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1182,11 +1182,11 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs, static int msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs __user *user_msrs, int (*do_msr)(struct kvm_vcpu *vcpu, unsigned index, u64 *data), - int writeback) + int writeback, int write_nprocessed) { struct kvm_msrs msrs; struct kvm_msr_entry *entries; - int r, n; + int r; unsigned size; r = -EFAULT; @@ -1207,15 +1207,22 @@ static int msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs __user *user_msrs, if (copy_from_user(entries, user_msrs-entries, size)) goto out_free; - r = n = __msr_io(vcpu, msrs, entries, do_msr); + r = __msr_io(vcpu, msrs, entries, do_msr); if (r 0) goto out_free; + msrs.nprocessed = r; + r = -EFAULT; + if (write_nprocessed + copy_to_user(user_msrs-nprocessed, msrs.nprocessed, +sizeof(msrs.nprocessed))) + goto out_free; + if (writeback copy_to_user(user_msrs-entries, entries, size)) goto out_free; - r = n; + r = msrs.nprocessed; out_free: vfree(entries); @@ -1792,55 +1799,36 @@ long kvm_arch_vcpu_ioctl(struct file *filp, { struct kvm_vcpu *vcpu = filp-private_data; void __user *argp = (void __user *)arg; + struct kvm_vcpu_substate substate; int r; - struct kvm_lapic_state *lapic = NULL; switch (ioctl) { - case KVM_GET_LAPIC: { - lapic = kzalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL); - - r = -ENOMEM; - if (!lapic) - goto out; - r = kvm_vcpu_ioctl_get_lapic(vcpu, lapic); - if (r) - goto out; - r = -EFAULT; - if (copy_to_user(argp, lapic, sizeof(struct kvm_lapic_state))) - goto out; - r = 0; + case KVM_GET_LAPIC: + substate.type = KVM_X86_VCPU_STATE_LAPIC; + substate.offset = 0; + r = kvm_arch_vcpu_get_substate(vcpu, argp, substate); break; - } - case KVM_SET_LAPIC: { - lapic = kmalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL); - r = -ENOMEM; - if (!lapic) - goto out; - r = -EFAULT; - if (copy_from_user(lapic, argp, sizeof(struct 
kvm_lapic_state))) - goto out; - r = kvm_vcpu_ioctl_set_lapic(vcpu, lapic); - if (r) -
[PATCH v2 1/4] KVM: Reorder IOCTLs in main kvm.h
Obviously, people tend to extend this header at the bottom - more or less blindly. Ensure that deprecated stuff gets its own corner again by moving things to the top. Also add some comments and reindent IOCTLs to make them more readable and reduce the risk of number collisions. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm.h | 228 ++- 1 files changed, 114 insertions(+), 114 deletions(-) diff --git a/include/linux/kvm.h b/include/linux/kvm.h index f8f8900..7d8c382 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -14,12 +14,76 @@ #define KVM_API_VERSION 12 -/* for KVM_TRACE_ENABLE, deprecated */ +/* *** Deprecated interfaces *** */ + +#define KVM_TRC_SHIFT 16 + +#define KVM_TRC_ENTRYEXIT (1 KVM_TRC_SHIFT) +#define KVM_TRC_HANDLER (1 (KVM_TRC_SHIFT + 1)) + +#define KVM_TRC_VMENTRY (KVM_TRC_ENTRYEXIT + 0x01) +#define KVM_TRC_VMEXIT (KVM_TRC_ENTRYEXIT + 0x02) +#define KVM_TRC_PAGE_FAULT (KVM_TRC_HANDLER + 0x01) + +#define KVM_TRC_HEAD_SIZE 12 +#define KVM_TRC_CYCLE_SIZE 8 +#define KVM_TRC_EXTRA_MAX 7 + +#define KVM_TRC_INJ_VIRQ (KVM_TRC_HANDLER + 0x02) +#define KVM_TRC_REDELIVER_EVT(KVM_TRC_HANDLER + 0x03) +#define KVM_TRC_PEND_INTR(KVM_TRC_HANDLER + 0x04) +#define KVM_TRC_IO_READ (KVM_TRC_HANDLER + 0x05) +#define KVM_TRC_IO_WRITE (KVM_TRC_HANDLER + 0x06) +#define KVM_TRC_CR_READ (KVM_TRC_HANDLER + 0x07) +#define KVM_TRC_CR_WRITE (KVM_TRC_HANDLER + 0x08) +#define KVM_TRC_DR_READ (KVM_TRC_HANDLER + 0x09) +#define KVM_TRC_DR_WRITE (KVM_TRC_HANDLER + 0x0A) +#define KVM_TRC_MSR_READ (KVM_TRC_HANDLER + 0x0B) +#define KVM_TRC_MSR_WRITE(KVM_TRC_HANDLER + 0x0C) +#define KVM_TRC_CPUID(KVM_TRC_HANDLER + 0x0D) +#define KVM_TRC_INTR (KVM_TRC_HANDLER + 0x0E) +#define KVM_TRC_NMI (KVM_TRC_HANDLER + 0x0F) +#define KVM_TRC_VMMCALL (KVM_TRC_HANDLER + 0x10) +#define KVM_TRC_HLT (KVM_TRC_HANDLER + 0x11) +#define KVM_TRC_CLTS (KVM_TRC_HANDLER + 0x12) +#define KVM_TRC_LMSW (KVM_TRC_HANDLER + 0x13) +#define KVM_TRC_APIC_ACCESS (KVM_TRC_HANDLER + 0x14) 
+#define KVM_TRC_TDP_FAULT (KVM_TRC_HANDLER + 0x15) +#define KVM_TRC_GTLB_WRITE (KVM_TRC_HANDLER + 0x16) +#define KVM_TRC_STLB_WRITE (KVM_TRC_HANDLER + 0x17) +#define KVM_TRC_STLB_INVAL (KVM_TRC_HANDLER + 0x18) +#define KVM_TRC_PPC_INSTR (KVM_TRC_HANDLER + 0x19) + struct kvm_user_trace_setup { - __u32 buf_size; /* sub_buffer size of each per-cpu */ - __u32 buf_nr; /* the number of sub_buffers of each per-cpu */ + __u32 buf_size; + __u32 buf_nr; }; +#define __KVM_DEPRECATED_MAIN_W_0x06 \ + _IOW(KVMIO, 0x06, struct kvm_user_trace_setup) +#define __KVM_DEPRECATED_MAIN_0x07 _IO(KVMIO, 0x07) +#define __KVM_DEPRECATED_MAIN_0x08 _IO(KVMIO, 0x08) + +#define __KVM_DEPRECATED_VM_R_0x70 _IOR(KVMIO, 0x70, struct kvm_assigned_irq) + +struct kvm_breakpoint { + __u32 enabled; + __u32 padding; + __u64 address; +}; + +struct kvm_debug_guest { + __u32 enabled; + __u32 pad; + struct kvm_breakpoint breakpoints[4]; + __u32 singlestep; +}; + +#define __KVM_DEPRECATED_VCPU_W_0x87 _IOW(KVMIO, 0x87, struct kvm_debug_guest) + +/* *** End of deprecated interfaces *** */ + + /* for KVM_CREATE_MEMORY_REGION */ struct kvm_memory_region { __u32 slot; @@ -329,24 +393,6 @@ struct kvm_ioeventfd { __u8 pad[36]; }; -#define KVM_TRC_SHIFT 16 -/* - * kvm trace categories - */ -#define KVM_TRC_ENTRYEXIT (1 << KVM_TRC_SHIFT) -#define KVM_TRC_HANDLER (1 << (KVM_TRC_SHIFT + 1)) /* only 12 bits */ -/* - * kvm trace action - */ -#define KVM_TRC_VMENTRY (KVM_TRC_ENTRYEXIT + 0x01) -#define KVM_TRC_VMEXIT (KVM_TRC_ENTRYEXIT + 0x02) -#define KVM_TRC_PAGE_FAULT (KVM_TRC_HANDLER + 0x01) - -#define KVM_TRC_HEAD_SIZE 12 -#define KVM_TRC_CYCLE_SIZE 8 -#define KVM_TRC_EXTRA_MAX 7 - #define KVMIO 0xAE /* @@ -367,12 +413,10 @@ struct kvm_ioeventfd { */ #define KVM_GET_VCPU_MMAP_SIZE _IO(KVMIO, 0x04) /* in bytes */ #define KVM_GET_SUPPORTED_CPUID _IOWR(KVMIO, 0x05, struct kvm_cpuid2) -/* - * ioctls for kvm trace - */ -#define KVM_TRACE_ENABLE _IOW(KVMIO, 0x06, struct kvm_user_trace_setup) -#define KVM_TRACE_PAUSE _IO(KVMIO, 0x07) -#define KVM_TRACE_DISABLE _IO(KVMIO, 0x08) +#define KVM_TRACE_ENABLE __KVM_DEPRECATED_MAIN_W_0x06 +#define KVM_TRACE_PAUSE __KVM_DEPRECATED_MAIN_0x07 +#define KVM_TRACE_DISABLE __KVM_DEPRECATED_MAIN_0x08 + /* * Extension capability list. */ @@ -500,52 +544,54 @@ struct kvm_irqfd { /* * ioctls for VM fds */ -#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 0x40, struct kvm_memory_region) +#define KVM_SET_MEMORY_REGION
[PATCH v2 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
Add a new IOCTL pair to retrieve or set the VCPU state in one chunk. More precisely, the IOCTL is able to process a list of substates to be read or written. This list is easily extensible without breaking the existing ABI, thus we will no longer have to add new IOCTLs when we discover a missing VCPU state field or want to support new hardware features. This patch establishes the generic infrastructure for KVM_GET/ SET_VCPU_STATE and adds support for the generic substates REGS, SREGS, FPU, and MP. To avoid code duplication, the entry point for the corresponding original IOCTLs are converted to make use of the new infrastructure internally, too. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Documentation/kvm/api.txt | 73 ++ arch/ia64/kvm/kvm-ia64.c | 12 ++ arch/powerpc/kvm/powerpc.c | 12 ++ arch/s390/kvm/kvm-s390.c | 12 ++ arch/x86/kvm/x86.c | 12 ++ include/linux/kvm.h| 24 +++ include/linux/kvm_host.h |5 + virt/kvm/kvm_main.c| 318 +++- 8 files changed, 376 insertions(+), 92 deletions(-) diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index 5a4bc8c..7c0be8d 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -593,6 +593,49 @@ struct kvm_irqchip { } chip; }; +4.27 KVM_GET/SET_VCPU_STATE + +Capability: KVM_CAP_VCPU_STATE +Architectures: all (substate support may vary across architectures) +Type: vcpu ioctl +Parameters: struct kvm_vcpu_state (in/out) +Returns: 0 on success, -1 on error + +Reads or sets one or more vcpu substates. + +The data structures exchanged between user space and kernel are organized +in two layers. Layer one is the header structure kvm_vcpu_state: + +struct kvm_vcpu_state { + __u32 nsubstates; /* number of elements in substates */ + __u32 nprocessed; /* return value: successfully processed substates */ + struct kvm_vcpu_substate substates[0]; +}; + +The kernel accepts up to KVM_MAX_VCPU_SUBSTATES elements in the substates +array. 
An element is described by kvm_vcpu_substate: + +struct kvm_vcpu_substate { + __u32 type; /* KVM_VCPU_STATE_* or KVM_$(ARCH)_VCPU_STATE_* */ + __u32 pad; + __s64 offset; /* payload offset to kvm_vcpu_state in bytes */ +}; + +Layer two are the substate-specific payload structures. See section 6 for a +list of supported substates and their payload format. + +Exemplary setup for a single-substate query via KVM_GET_VCPU_STATE: + + struct { + struct kvm_vcpu_state header; + struct kvm_vcpu_substate substates[1]; + } request; + struct kvm_regs regs; + + request.header.nsubstates = 1; + request.header.substates[0].type = KVM_VCPU_STATE_REGS; + request.header.substates[0].offset = (size_t)&regs - (size_t)&request; + 5. The kvm_run structure Application code obtains a pointer to the kvm_run structure by @@ -757,3 +800,33 @@ powerpc specific. char padding[256]; }; }; + +6. Supported vcpu substates + +6.1 KVM_VCPU_STATE_REGS + +Architectures: all +Payload: struct kvm_regs (see KVM_GET_REGS) +Deprecates: KVM_GET/SET_REGS + +6.2 KVM_VCPU_STATE_SREGS + +Architectures: all +Payload: struct kvm_sregs (see KVM_GET_SREGS) +Deprecates: KVM_GET/SET_SREGS + +6.3 KVM_VCPU_STATE_FPU + +Architectures: all +Payload: struct kvm_fpu (see KVM_GET_FPU) +Deprecates: KVM_GET/SET_FPU + +6.4 KVM_VCPU_STATE_MP + +Architectures: x86, ia64 +Payload: struct kvm_mp_state +Deprecates: KVM_GET/SET_MP_STATE + +struct kvm_mp_state { + __u32 mp_state; /* KVM_MP_STATE_* */ +}; diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 5fdeec5..c3450a6 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1991,3 +1991,15 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, vcpu_put(vcpu); return r; } + +int kvm_arch_vcpu_get_substate(struct kvm_vcpu *vcpu, uint8_t __user *arg_base, + struct kvm_vcpu_substate *substate) +{ + return -EINVAL; +} + +int kvm_arch_vcpu_set_substate(struct kvm_vcpu *vcpu, uint8_t __user *arg_base, + struct kvm_vcpu_substate *substate) +{ + 
return -EINVAL; +} diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 5902bbc..3336ad5 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -436,3 +436,15 @@ int kvm_arch_init(void *opaque) void kvm_arch_exit(void) { } + +int kvm_arch_vcpu_get_substate(struct kvm_vcpu *vcpu, uint8_t __user *arg_base, + struct kvm_vcpu_substate *substate) +{ + return -EINVAL; +} + +int kvm_arch_vcpu_set_substate(struct kvm_vcpu *vcpu, uint8_t __user *arg_base, + struct kvm_vcpu_substate *substate) +{ + return -EINVAL; +} diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 5445058..978ed6c 100644 ---
[PATCH v2 0/4] Extensible VCPU state IOCTL
This version addresses the review comments: - rename KVM[_X86]_VCPU_* -> KVM[_X86]_VCPU_STATE_* - more padding for kvm_nmi_state - use bool in get/set_nmi_mask - add basic documentation. Find this series also at git://git.kiszka.org/linux-kvm.git queues/vcpu-state Jan Kiszka (4): KVM: Reorder IOCTLs in main kvm.h KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL KVM: x86: Add support for KVM_GET/SET_VCPU_STATE KVM: x86: Add VCPU substate for NMI states Documentation/kvm/api.txt | 103 + arch/ia64/kvm/kvm-ia64.c| 12 ++ arch/powerpc/kvm/powerpc.c | 12 ++ arch/s390/kvm/kvm-s390.c| 12 ++ arch/x86/include/asm/kvm.h | 15 ++- arch/x86/include/asm/kvm_host.h |2 + arch/x86/kvm/svm.c | 22 +++ arch/x86/kvm/vmx.c | 30 arch/x86/kvm/x86.c | 243 - include/linux/kvm.h | 246 +-- include/linux/kvm_host.h|5 + virt/kvm/kvm_main.c | 318 +++--- 12 files changed, 740 insertions(+), 280 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raw vs. tap
Michael S. Tsirkin wrote: Not sure: what is the this that you are talking about. I meant, fixing guest-host traffic. Your former argument hinged on being able to have networking that Just Worked without adding new things. If this doesn't work today with macvlan, then the argument is invalid. Regards, Anthony Liguori
Re: Single memory slot
Avi Kivity wrote: One way to improve the gfn_to_pfn() memslot search is to register just one slot. This can only work on 64-bit, since even the smallest guests need 4GB of physical address space. Apart from speeding up gfn_to_page(), it would also speed up mmio which must iterate over all slots, so a lookup cache cannot help. This would require quite a bunch of changes: - modify gfn_to_pfn() to fail gracefully if the page is in the slot but unmapped (hole handling) - modify qemu to reserve the guest physical address space It could potentially speed up qemu quite a lot too as we would return to a model where host va == fixed address + guest pa. That makes things like stl_phys/ldl_phys trivial. Regards, Anthony Liguori
Re: Single memory slot
On Thu, Oct 15, 2009 at 04:33:11PM +0900, Avi Kivity wrote: One way to improve the gfn_to_pfn() memslot search is to register just one slot. This can only work on 64-bit, since even the smallest guests need 4GB of physical address space. Apart from speeding up gfn_to_page(), it would also speed up mmio which must iterate over all slots, so a lookup cache cannot help. This would require quite a bunch of changes: - modify gfn_to_pfn() to fail gracefully if the page is in the slot but unmapped (hole handling) - modify qemu to reserve the guest physical address space - modify qemu memory allocation to use MAP_FIXED to allocate memory - some hack for the vga aliases (mmap an fd multiple times?) - some hack for the vmx-specific pages (e.g. APIC-access page) Not sure it's worthwhile, but something to keep in mind if a simple cache or sort by size is insufficient due to mmio. Downside is you lose the ability to write protect a small slot only (could mprotect(PROT_READ) the desired area but get_log+write_protect must be atomic). Also if you enable dirty log for the large slot largepages are disabled.
Re: Single memory slot
On Thu, Oct 15, 2009 at 02:46:38PM +0200, Alexander Graf wrote: On 15.10.2009, at 09:33, Avi Kivity wrote: One way to improve the gfn_to_pfn() memslot search is to register just one slot. This can only work on 64-bit, since even the smallest guests need 4GB of physical address space. Apart from speeding up gfn_to_page(), it would also speed up mmio which must iterate over all slots, so a lookup cache cannot help. This would require quite a bunch of changes: - modify gfn_to_pfn() to fail gracefully if the page is in the slot but unmapped (hole handling) - modify qemu to reserve the guest physical address space - modify qemu memory allocation to use MAP_FIXED to allocate memory - some hack for the vga aliases (mmap an fd multiple times?) - some hack for the vmx-specific pages (e.g. APIC-access page) Not sure it's worthwhile, but something to keep in mind if a simple cache or sort by size is insufficient due to mmio. One thing I've been wondering for quite a while is that slot loop. Why do we loop over all possible slots? Couldn't we just remember the max entry (usually 1 or 2) and not loop MAX_SLOT_AMOUNT times? That would be a really easy patch and give instant speed improvements for everyone. gfn_to_memslot_unaliased uses kvm->nmemslots which is the max entry. Oh, kvm_is_visible_gfn does not. It should just use gfn_to_memslot_unaliased.
Re: [RFC] Do clock adjustments over migration
On Thu, Oct 15, 2009 at 01:15:25PM -0400, Glauber Costa wrote: Hey, This patch is a proposal only. Among other things, it relies on a patch Juan is yet to send, and I also would want to give it a bit more testing. It shows my intended use of the new ioctl interface I've been proposing. First of all, we have to save the kvmclock msrs. This is per-cpu, and we were failing to do it so far. The ioctls are issued in pre-save and post-load sections of a new vmstate handler. I am not doing it in the cpu vmstate handler, because this has to be done once per VM, not cpu. What I basically do is to grab the time from GET ioctl, pass on through migration, and then do a SET on the other side. Should be straightforward. Please let me hear your thoughts. And don't get me started with this you can't hear thoughts thing! Signed-off-by: Glauber Costa glom...@redhat.com --- kvm/include/linux/kvm.h |9 + qemu-kvm-x86.c |6 ++ qemu-kvm.c | 29 + target-i386/cpu.h |3 ++- target-i386/machine.c |2 ++ 5 files changed, 48 insertions(+), 1 deletions(-) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index fffcfd8..75e2ffd 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -834,6 +834,9 @@ static int get_msr_entry(struct kvm_msr_entry *entry, CPUState *env) case MSR_VM_HSAVE_PA: env->vm_hsave = entry->data; break; +case MSR_KVM_SYSTEM_TIME: +env->system_time_msr = entry->data; +break; default: printf("Warning unknown msr index 0x%x\n", entry->index); return 1; @@ -1001,6 +1004,7 @@ void kvm_arch_load_regs(CPUState *env) set_msr_entry(&msrs[n++], MSR_LSTAR , env->lstar); } #endif +set_msr_entry(&msrs[n++], MSR_KVM_SYSTEM_TIME, env->system_time_msr); rc = kvm_set_msrs(env->kvm_cpu_state.vcpu_ctx, msrs, n); if (rc == -1) @@ -1179,6 +1183,8 @@ void kvm_arch_save_regs(CPUState *env) msrs[n++].index = MSR_LSTAR; } #endif +msrs[n++].index = MSR_KVM_SYSTEM_TIME; + fix MSR_COUNT for kvm_arch_save_regs() Otherwise looks good. 
Re: kernel bug in kvm_intel
On Thu, 2009-10-15 at 02:10 +0900, Avi Kivity wrote: On 10/13/2009 11:04 PM, Andrew Theurer wrote: Look at the address where vmx_vcpu_run starts, add 0x26d, and show the surrounding code. Thinking about it, it probably _is_ what you showed, due to module page alignment. But please verify this; I can't reconcile the fault address (9fe9a2b) with %rsp at the time of the fault. Here is the start of the function: 3884 <vmx_vcpu_run>: 3884: 55 push %rbp 3885: 48 89 e5 mov %rsp,%rbp and 0x26d later is 0x3af1: 3ad2: 4c 8b b1 88 01 00 00 mov 0x188(%rcx),%r14 3ad9: 4c 8b b9 90 01 00 00 mov 0x190(%rcx),%r15 3ae0: 48 8b 89 20 01 00 00 mov 0x120(%rcx),%rcx 3ae7: 75 05 jne 3aee <vmx_vcpu_run+0x26a> 3ae9: 0f 01 c2 vmlaunch 3aec: eb 03 jmp 3af1 <vmx_vcpu_run+0x26d> 3aee: 0f 01 c3 vmresume 3af1: 48 87 0c 24 xchg %rcx,(%rsp) 3af5: 48 89 81 18 01 00 00 mov %rax,0x118(%rcx) 3afc: 48 89 99 30 01 00 00 mov %rbx,0x130(%rcx) 3b03: ff 34 24 pushq (%rsp) 3b06: 8f 81 20 01 00 00 popq 0x120(%rcx) Ok. So it faults on the xchg instruction, rsp is 8806369ffc80 but the fault address is 9fe9a2b4. So it looks like the IDT is corrupted. Can you check what's around 9fe9a2b4 in System.map? 85d85b24 B __bss_stop 85d86000 B __brk_base 85d96000 b .brk.dmi_alloc 85da6000 B __brk_limit ff60 T vgettimeofday ff600100 t vread_tsc ff600130 t vread_hpet ff600140 D __vsyscall_gtod_data ff600400 T vtime -Andrew
[PATCH] get rid of MSR_COUNT
qemu.git uses an array of 100 entries for the msr list, which is arguably large enough (tm). I propose we follow the same path, for two reasons: 1) ease future merge. 2) avoid stack overflow problems that had already begun to appear Signed-off-by: Glauber Costa glom...@redhat.com --- qemu-kvm-x86.c | 10 ++ 1 files changed, 2 insertions(+), 8 deletions(-) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index c8e37ed..350e5fd 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -808,12 +808,6 @@ static int get_msr_entry(struct kvm_msr_entry *entry, CPUState *env) return 0; } -#ifdef TARGET_X86_64 -#define MSR_COUNT 9 -#else -#define MSR_COUNT 5 -#endif - static void set_v8086_seg(struct kvm_segment *lhs, const SegmentCache *rhs) { lhs->selector = rhs->selector; @@ -868,7 +862,7 @@ void kvm_arch_load_regs(CPUState *env) struct kvm_regs regs; struct kvm_fpu fpu; struct kvm_sregs sregs; -struct kvm_msr_entry msrs[MSR_COUNT]; +struct kvm_msr_entry msrs[100]; int rc, n, i; regs.rax = env->regs[R_EAX]; @@ -1021,7 +1015,7 @@ void kvm_arch_save_regs(CPUState *env) struct kvm_regs regs; struct kvm_fpu fpu; struct kvm_sregs sregs; -struct kvm_msr_entry msrs[MSR_COUNT]; +struct kvm_msr_entry msrs[100]; uint32_t hflags; uint32_t i, n, rc; -- 1.6.2.5
[PATCH] Xen PV-on-HVM guest support (v3)
Support for Xen PV-on-HVM guests can be implemented almost entirely in userspace, except for handling one annoying MSR that maps a Xen hypercall blob into guest address space. A generic mechanism to delegate MSR writes to userspace seems overkill and risks encouraging similar MSR abuse in the future. Thus this patch adds special support for the Xen HVM MSR. I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell KVM which MSR the guest will write to, as well as the starting address and size of the hypercall blobs (one each for 32-bit and 64-bit) that userspace has loaded from files. When the guest writes to the MSR, KVM copies one page of the blob from userspace to the guest. I've tested this patch with a hacked-up version of Gerd's userspace code, booting a number of guests (CentOS 5.3 i386 and x86_64, and FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices. v3: separate blob_{addr,size}_{32,64}; move xen_hvm_config to struct kvm_arch; remove unneeded ifdefs; return -EFAULT, -E2BIG, etc. from xen_hvm_config; use is_long_mode(); remove debug printks; document ioctl in api.txt Signed-off-by: Ed Swierk eswi...@aristanetworks.com --- diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index 5a4bc8c..5980113 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -593,6 +593,30 @@ struct kvm_irqchip { } chip; }; +4.27 KVM_XEN_HVM_CONFIG + +Capability: KVM_CAP_XEN_HVM +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_xen_hvm_config (in) +Returns: 0 on success, -1 on error + +Sets the MSR that the Xen HVM guest uses to initialize its hypercall +page, and provides the starting address and size of the hypercall +blobs in userspace. When the guest writes the MSR, kvm copies one +page of a blob (32- or 64-bit, depending on the vcpu mode) to guest +memory. 
+ +struct kvm_xen_hvm_config { + __u32 msr; + __u32 pad1; + __u64 blob_addr_32; + __u64 blob_addr_64; + __u8 blob_size_32; + __u8 blob_size_64; + __u8 pad2[30]; +}; + 5. The kvm_run structure Application code obtains a pointer to the kvm_run structure by diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h index f02e87a..ef9b4b7 100644 --- a/arch/x86/include/asm/kvm.h +++ b/arch/x86/include/asm/kvm.h @@ -19,6 +19,7 @@ #define __KVM_HAVE_MSIX #define __KVM_HAVE_MCE #define __KVM_HAVE_PIT_STATE2 +#define __KVM_HAVE_XEN_HVM /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 45226f0..aee95b2 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -410,6 +410,8 @@ struct kvm_arch{ unsigned long irq_sources_bitmap; u64 vm_init_tsc; + + struct kvm_xen_hvm_config xen_hvm_config; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1d454d9..66149fa 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -835,6 +835,37 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data) return 0; } +static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data) +{ + int lm = is_long_mode(vcpu); + u8 *blob_addr = lm ? (u8 *)vcpu->kvm->arch.xen_hvm_config.blob_addr_64 + : (u8 *)vcpu->kvm->arch.xen_hvm_config.blob_addr_32; + u8 blob_size = lm ? vcpu->kvm->arch.xen_hvm_config.blob_size_64 + : vcpu->kvm->arch.xen_hvm_config.blob_size_32; + u32 page_num = data & ~PAGE_MASK; + u64 page_addr = data & PAGE_MASK; + u8 *page; + int r; + + r = -E2BIG; + if (page_num >= blob_size) + goto out; + r = -ENOMEM; + page = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!page) + goto out; + r = -EFAULT; + if (copy_from_user(page, blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE)) + goto out_free; + if (kvm_write_guest(vcpu->kvm, page_addr, page, PAGE_SIZE)) + goto out_free; + r = 0; +out_free: + kfree(page); +out: + return r; +} + int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) { switch (msr) { @@ -950,6 +981,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) 0x%x data 0x%llx\n, msr, data); break; default: + if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr)) + return xen_hvm_config(vcpu, data); if (!ignore_msrs) { pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n", msr, data); @@ -2411,6 +2444,14 @@ long kvm_arch_vm_ioctl(struct file *filp, r = 0; break; } + case KVM_XEN_HVM_CONFIG: { + r = -EFAULT; + if (copy_from_user(&kvm->arch.xen_hvm_config, argp, + sizeof(struct
Re: Raw vs. tap
On Thu, Oct 15, 2009 at 01:37:20PM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: Not sure: what is the this that you are talking about. I meant, fixing guest-host traffic. This needs to be supported in networking core. Your former argument hinged on being able to have networking that Just Worked without adding new things. If this doesn't work today with macvlan, then the argument is invalid. Regards, Anthony Liguori
[PATCH][REPOST] Xen PV-on-HVM guest support (v3)
[Repost; the patch was garbled in my previous attempt.] Support for Xen PV-on-HVM guests can be implemented almost entirely in userspace, except for handling one annoying MSR that maps a Xen hypercall blob into guest address space. A generic mechanism to delegate MSR writes to userspace seems overkill and risks encouraging similar MSR abuse in the future. Thus this patch adds special support for the Xen HVM MSR. I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell KVM which MSR the guest will write to, as well as the starting address and size of the hypercall blobs (one each for 32-bit and 64-bit) that userspace has loaded from files. When the guest writes to the MSR, KVM copies one page of the blob from userspace to the guest. I've tested this patch with a hacked-up version of Gerd's userspace code, booting a number of guests (CentOS 5.3 i386 and x86_64, and FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices. v3: separate blob_{addr,size}_{32,64}; move xen_hvm_config to struct kvm_arch; remove unneeded ifdefs; return -EFAULT, -E2BIG, etc. from xen_hvm_config; use is_long_mode(); remove debug printks; document ioctl in api.txt Signed-off-by: Ed Swierk eswi...@aristanetworks.com --- diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index 5a4bc8c..5980113 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -593,6 +593,30 @@ struct kvm_irqchip { } chip; }; +4.27 KVM_XEN_HVM_CONFIG + +Capability: KVM_CAP_XEN_HVM +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_xen_hvm_config (in) +Returns: 0 on success, -1 on error + +Sets the MSR that the Xen HVM guest uses to initialize its hypercall +page, and provides the starting address and size of the hypercall +blobs in userspace. When the guest writes the MSR, kvm copies one +page of a blob (32- or 64-bit, depending on the vcpu mode) to guest +memory. 
+ +struct kvm_xen_hvm_config { + __u32 msr; + __u32 pad1; + __u64 blob_addr_32; + __u64 blob_addr_64; + __u8 blob_size_32; + __u8 blob_size_64; + __u8 pad2[30]; +}; + 5. The kvm_run structure Application code obtains a pointer to the kvm_run structure by diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h index f02e87a..ef9b4b7 100644 --- a/arch/x86/include/asm/kvm.h +++ b/arch/x86/include/asm/kvm.h @@ -19,6 +19,7 @@ #define __KVM_HAVE_MSIX #define __KVM_HAVE_MCE #define __KVM_HAVE_PIT_STATE2 +#define __KVM_HAVE_XEN_HVM /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 45226f0..aee95b2 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -410,6 +410,8 @@ struct kvm_arch{ unsigned long irq_sources_bitmap; u64 vm_init_tsc; + + struct kvm_xen_hvm_config xen_hvm_config; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1d454d9..66149fa 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -835,6 +835,37 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data) return 0; } +static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data) +{ + int lm = is_long_mode(vcpu); + u8 *blob_addr = lm ? (u8 *)vcpu->kvm->arch.xen_hvm_config.blob_addr_64 + : (u8 *)vcpu->kvm->arch.xen_hvm_config.blob_addr_32; + u8 blob_size = lm ? vcpu->kvm->arch.xen_hvm_config.blob_size_64 + : vcpu->kvm->arch.xen_hvm_config.blob_size_32; + u32 page_num = data & ~PAGE_MASK; + u64 page_addr = data & PAGE_MASK; + u8 *page; + int r; + + r = -E2BIG; + if (page_num >= blob_size) + goto out; + r = -ENOMEM; + page = kzalloc(PAGE_SIZE, GFP_KERNEL); + if (!page) + goto out; + r = -EFAULT; + if (copy_from_user(page, blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE)) + goto out_free; + if (kvm_write_guest(vcpu->kvm, page_addr, page, PAGE_SIZE)) + goto out_free; + r = 0; +out_free: + kfree(page); +out: + return r; +} + int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) { switch (msr) { @@ -950,6 +981,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) 0x%x data 0x%llx\n, msr, data); break; default: + if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr)) + return xen_hvm_config(vcpu, data); if (!ignore_msrs) { pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n", msr, data); @@ -2411,6 +2444,14 @@ long kvm_arch_vm_ioctl(struct file *filp, r = 0; break; } + case KVM_XEN_HVM_CONFIG: { + r = -EFAULT; + if
Re: Add qemu_send_raw() to vlan.
On Thu, Oct 15, 2009 at 09:33:12AM +0200, Gleb Natapov wrote: On Thu, Oct 15, 2009 at 08:04:45AM +0100, Mark McLoughlin wrote: Hi Gleb, On Tue, 2009-05-26 at 13:03 +0300, Gleb Natapov wrote: It gets packet without virtio header and adds it if needed. Allows to inject packets to vlan from outside. To send gratuitous ARP for instance. ... diff --git a/net.h b/net.h index 931133b..3d0b6f2 100644 --- a/net.h +++ b/net.h ... @@ -63,6 +64,7 @@ int qemu_can_send_packet(VLANClientState *vc); ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov, int iovcnt); int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); +void qemu_send_packet_raw(VLANClientState *vc, const uint8_t *buf, int size); void qemu_format_nic_info_str(VLANClientState *vc, uint8_t macaddr[6]); void qemu_check_nic_model(NICInfo *nd, const char *model); void qemu_check_nic_model_list(NICInfo *nd, const char * const *models, I've only just now noticed that we never actually made announce_self() use this ... care to do that? Something like this: --- Use qemu_send_packet_raw to send gratuitous ARP. This will ensure that vnet header is handled properly. Applied, thanks.
Re: [RFC] Do clock adjustments over migration
On Thu, Oct 15, 2009 at 05:08:16PM -0300, Marcelo Tosatti wrote: On Thu, Oct 15, 2009 at 01:15:25PM -0400, Glauber Costa wrote: Hey, This patch is a proposal only. Among other things, it relies on a patch Juan is yet to send, and I also would want to give it a bit more testing. It shows my intended use of the new ioctl interface I've been proposing. First of all, we have to save the kvmclock msrs. This is per-cpu, and we were failing to do it so far. The ioctls are issued in pre-save and post-load sections of a new vmstate handler. I am not doing it in the cpu vmstate handler, because this has to be done once per VM, not cpu. What I basically do is to grab the time from GET ioctl, pass on through migration, and then do a SET on the other side. Should be straightforward. Please let me hear your thoughts. And don't get me started with this you can't hear thoughts thing! Otherwise looks good. Also, note that, in vmstate table for kvmclock, I am using U64, instead of UINT64. U64 does not yet exist in current qemu, I had to patch it. Juan said he's planning on sending it to qemu-devel today or tomorrow.
Re: Single memory slot
On 10/16/2009 03:51 AM, Anthony Liguori wrote: Avi Kivity wrote: One way to improve the gfn_to_pfn() memslot search is to register just one slot. This can only work on 64-bit, since even the smallest guests need 4GB of physical address space. Apart from speeding up gfn_to_page(), it would also speed up mmio which must iterate over all slots, so a lookup cache cannot help. This would require quite a bunch of changes: - modify gfn_to_pfn() to fail gracefully if the page is in the slot but unmapped (hole handling) - modify qemu to reserve the guest physical address space It could potentially speed up qemu quite a lot too as we would return to a model where host va == fixed address + guest pa. That makes things like stl_phys/ldl_phys trivial. This doesn't work on 32-bit, and you still need to perform a lookup for mmio. It just shortens the loop. Note qemu can't depend on mmio holes being unmapped (you could trap the SEGV, but that would be unbearably slow). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: Single memory slot
On 10/16/2009 04:46 AM, Marcelo Tosatti wrote: On Thu, Oct 15, 2009 at 04:33:11PM +0900, Avi Kivity wrote: One way to improve the gfn_to_pfn() memslot search is to register just one slot. This can only work on 64-bit, since even the smallest guests need 4GB of physical address space. Apart from speeding up gfn_to_page(), it would also speed up mmio which must iterate over all slots, so a lookup cache cannot help. This would require quite a bunch of changes: - modify gfn_to_pfn() to fail gracefully if the page is in the slot but unmapped (hole handling) - modify qemu to reserve the guest physical address space - modify qemu memory allocation to use MAP_FIXED to allocate memory - some hack for the vga aliases (mmap an fd multiple times?) - some hack for the vmx-specific pages (e.g. APIC-access page) Not sure it's worthwhile, but something to keep in mind if a simple cache or sort by size is insufficient due to mmio. Downside is you lose the ability to write protect a small slot only (could mprotect(MAP_READ) the desired area but get_log+write_protect must be atomic). Also if you enable dirty log for the large slot largepages are disabled. I guess that shoots this idea down. We could perhaps only enable it if a vnc client is not connected and we don't track vga updates. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
[patch 4/5] kvmclock: account stolen time
This makes stolen time information available in procfs/vmstat.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kernel/kvmclock.c
===================================================================
--- kvm.orig/arch/x86/kernel/kvmclock.c
+++ kvm/arch/x86/kernel/kvmclock.c
@@ -22,11 +22,14 @@
 #include <asm/msr.h>
 #include <asm/apic.h>
 #include <linux/percpu.h>
+#include <linux/kernel_stat.h>
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
+#include <asm/cputime.h>
 
 #define KVM_SCALE 22
+#define NS_PER_TICK (1000000000LL / HZ)
 
 static int kvmclock = 1;
@@ -50,6 +53,29 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(str
 static struct pvclock_wall_clock wall_clock;
 
+static DEFINE_PER_CPU(u64, total_stolen);
+static DEFINE_PER_CPU(u64, residual_stolen);
+
+void kvm_account_steal_time(void)
+{
+	struct kvm_vcpu_runtime_info *rinfo;
+	cputime_t ticks;
+	u64 stolen_time, stolen_delta;
+
+	rinfo = &get_cpu_var(run_info);
+	stolen_time = rinfo->stolen_time;
+	stolen_delta = stolen_time - __get_cpu_var(total_stolen);
+
+	__get_cpu_var(total_stolen) = stolen_time;
+	put_cpu_var(rinfo);
+
+	stolen_delta += __get_cpu_var(residual_stolen);
+
+	ticks = iter_div_u64_rem(stolen_delta, NS_PER_TICK, &stolen_delta);
+	__get_cpu_var(residual_stolen) = stolen_delta;
+	account_steal_ticks(ticks);
+}
+
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
Index: kvm/kernel/sched.c
===================================================================
--- kvm.orig/kernel/sched.c
+++ kvm/kernel/sched.c
@@ -74,6 +74,9 @@
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 
+#ifdef CONFIG_KVM_CLOCK
+#include <asm/kvm_para.h>
+#endif
 
 #include "sched_cpupri.h"
@@ -5102,6 +5105,9 @@ void account_process_tick(struct task_st
 			one_jiffy_scaled);
 	else
 		account_idle_time(cputime_one_jiffy);
+#ifdef CONFIG_KVM_CLOCK
+	kvm_account_steal_time();
+#endif
 }
 
 /*
Index: kvm/arch/x86/include/asm/kvm_para.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_para.h
+++ kvm/arch/x86/include/asm/kvm_para.h
@@ -58,6 +58,7 @@ struct kvm_vcpu_runtime_info {
 };
 
 extern void kvmclock_init(void);
+extern void kvm_account_steal_time(void);
 
 /* This instruction is vmcall. On non-VT architectures, it will generate a
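The residual-carrying conversion that kvm_account_steal_time() performs — whole scheduler ticks are accounted, the sub-tick remainder of stolen nanoseconds is carried over so nothing is lost — can be sketched in userspace like this. This is a simplified model with an assumed HZ of 100; iter_div_u64_rem() and the per-cpu variables are replaced by plain division and globals.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 100
#define NS_PER_TICK (1000000000ULL / HZ)

static uint64_t total_stolen;    /* last stolen-time total read from the host */
static uint64_t residual_stolen; /* sub-tick remainder carried to next call   */

/* Returns the number of whole ticks to hand to account_steal_ticks(). */
static uint64_t account_stolen(uint64_t hv_stolen_ns)
{
    uint64_t delta = hv_stolen_ns - total_stolen;
    uint64_t ticks;

    total_stolen = hv_stolen_ns;

    delta += residual_stolen;           /* add back last call's remainder */
    ticks = delta / NS_PER_TICK;        /* iter_div_u64_rem() in the patch */
    residual_stolen = delta % NS_PER_TICK;
    return ticks;
}
```

Two successive calls show the carry in action: 15 ms of stolen time yields 1 tick with 5 ms left over, and a further 6 ms then yields another tick because the 5 ms residual is added back.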
[patch 5/5] qemu-kvm-x86: report pvclock runtime capability
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index fffcfd8..0b8a858 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1223,6 +1223,9 @@ struct kvm_para_features {
 #ifdef KVM_CAP_CR3_CACHE
 	{ KVM_CAP_CR3_CACHE, KVM_FEATURE_CR3_CACHE },
 #endif
+#ifdef KVM_CAP_PVCLOCK_RUNTIME
+	{ KVM_CAP_PVCLOCK_RUNTIME, KVM_FEATURE_RUNTIME_INFO },
+#endif
 	{ -1, -1 }
 };
[patch 0/5] report stolen time via pvclock
Stolen time can be useful diagnostic information when available to guests. Xen has provided it for some time, so recent vmstat versions already display it. It also increases the accuracy of the guest's sched_clock.
[patch 2/5] pvclock: move code to pvclock.h
To be used by kvmclock.c.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/pvclock.h
===================================================================
--- kvm.orig/arch/x86/include/asm/pvclock.h
+++ kvm/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,20 @@
 #include <linux/clocksource.h>
 #include <asm/pvclock-abi.h>
 
+/*
+ * These are periodically updated
+ *	xen: magic shared_info page
+ *	kvm: gpa registered via msr
+ * and then copied here.
+ */
+struct pvclock_shadow_time {
+	u64 tsc_timestamp;	/* TSC at last update of time vals. */
+	u64 system_timestamp;	/* Time, in nanosecs, since boot.   */
+	u32 tsc_to_nsec_mul;
+	int tsc_shift;
+	u32 version;
+};
+
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
@@ -11,4 +25,8 @@ void pvclock_read_wallclock(struct pvclo
 			    struct pvclock_vcpu_time_info *vcpu,
 			    struct timespec *ts);
 
+u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow);
+unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
+				 struct pvclock_vcpu_time_info *src);
+
 #endif /* _ASM_X86_PVCLOCK_H */
Index: kvm/arch/x86/kernel/pvclock.c
===================================================================
--- kvm.orig/arch/x86/kernel/pvclock.c
+++ kvm/arch/x86/kernel/pvclock.c
@@ -20,20 +20,6 @@
 #include <asm/pvclock.h>
 
 /*
- * These are periodically updated
- *	xen: magic shared_info page
- *	kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
-	u64 tsc_timestamp;	/* TSC at last update of time vals. */
-	u64 system_timestamp;	/* Time, in nanosecs, since boot.   */
-	u32 tsc_to_nsec_mul;
-	int tsc_shift;
-	u32 version;
-};
-
-/*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
  */
@@ -71,7 +57,7 @@ static inline u64 scale_delta(u64 delta,
 	return product;
 }
 
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
 {
 	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
 	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
@@ -81,8 +67,8 @@ static u64 pvclock_get_nsec_offset(struc
  * Reads a consistent set of time-base values from hypervisor,
  * into a shadow data area.
  */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
-					struct pvclock_vcpu_time_info *src)
+unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
+				 struct pvclock_vcpu_time_info *src)
 {
 	do {
 		dst->version = src->version;
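pvclock_get_time_values() is the reader side of a seqlock-style version protocol: the hypervisor bumps the version around each update, and the guest retries the copy until it observes the same stable version before and after. A minimal single-threaded sketch of that reader loop (field names simplified; these are not the real pvclock structures):

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for pvclock_vcpu_time_info: an odd version means the
 * producer is mid-update; readers must see the same even version on
 * both sides of the copy. */
struct time_vals {
    uint32_t version;
    uint64_t system_ns;
};

static struct time_vals src;  /* shared with the (hypothetical) producer */

static uint64_t read_consistent(void)
{
    uint32_t ver;
    uint64_t ns;

    do {
        ver = src.version;
        /* barrier() here in real code, so the reads are not reordered */
        ns = src.system_ns;
        /* barrier() */
    } while ((ver & 1) || ver != src.version);  /* retry: mid-update or changed */

    return ns;
}
```

In the kernel code the same retry condition is expressed by comparing the snapshot's version against src->version after computing the offset, exactly as in kvm_runtime_read() in patch 3/5.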
[patch 1/5] KVM: x86: report stolen time
Report stolen time (run_delay field from schedstat) to guests via pvclock.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/kvm_para.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_para.h
+++ kvm/arch/x86/include/asm/kvm_para.h
@@ -15,9 +15,11 @@
 #define KVM_FEATURE_CLOCKSOURCE		0
 #define KVM_FEATURE_NOP_IO_DELAY	1
 #define KVM_FEATURE_MMU_OP		2
+#define KVM_FEATURE_RUNTIME_INFO	3
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
+#define MSR_KVM_RUN_TIME    0x13
 
 #define KVM_MAX_MMU_OP_BATCH 32
@@ -50,6 +52,11 @@ struct kvm_mmu_op_release_pt {
 #ifdef __KERNEL__
 #include <asm/processor.h>
 
+struct kvm_vcpu_runtime_info {
+	u64 stolen_time;	/* time spent starving */
+	u64 reserved[3];	/* for future use */
+};
+
 extern void kvmclock_init(void);
Index: kvm/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -354,6 +354,10 @@ struct kvm_vcpu_arch {
 	unsigned int time_offset;
 	struct page *time_page;
 
+	bool stolen_time_enable;
+	struct kvm_vcpu_runtime_info stolen_time;
+	unsigned int stolen_time_offset;
+
 	bool singlestep; /* guest is single stepped by KVM */
 	bool nmi_pending;
 	bool nmi_injected;
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -507,9 +507,9 @@ static inline u32 bit(int bitno)
  * kvm-specific. Those are put in the beginning of the list.
  */
-#define KVM_SAVE_MSRS_BEGIN	2
+#define KVM_SAVE_MSRS_BEGIN	3
 static u32 msrs_to_save[] = {
-	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_RUN_TIME,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_K6_STAR,
 #ifdef CONFIG_X86_64
@@ -679,6 +679,7 @@ static void kvm_write_guest_time(struct
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	void *shared_kaddr;
 	unsigned long this_tsc_khz;
+	struct task_struct *task = current;
 
 	if ((!vcpu->time_page))
 		return;
@@ -700,6 +701,9 @@ static void kvm_write_guest_time(struct
 	vcpu->hv_clock.system_time = ts.tv_nsec +
 				     (NSEC_PER_SEC * (u64)ts.tv_sec);
+
+	vcpu->stolen_time.stolen_time = task->sched_info.run_delay;
+
 	/*
 	 * The interface expects us to write an even number signaling that the
 	 * update is finished. Since the guest won't see the intermediate
@@ -712,6 +716,10 @@ static void kvm_write_guest_time(struct
 	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
 	       sizeof(vcpu->hv_clock));
 
+	if (vcpu->stolen_time_enable)
+		memcpy(shared_kaddr + vcpu->stolen_time_offset,
+		       &vcpu->stolen_time, sizeof(vcpu->stolen_time));
+
 	kunmap_atomic(shared_kaddr, KM_USER0);
 
 	mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
@@ -937,6 +945,35 @@ int kvm_set_msr_common(struct kvm_vcpu *
 		kvm_request_guest_time_update(vcpu);
 		break;
 	}
+	case MSR_KVM_RUN_TIME: {
+		struct page *page;
+		unsigned int stolen_time_offset;
+
+		if (!vcpu->arch.time_page)
+			return 1;
+
+		/* we verify if the enable bit is set... */
+		if (!(data & 1))
+			break;
+
+		/* ...but clean it before doing the actual write */
+		stolen_time_offset = data & ~(PAGE_MASK | 1);
+
+		/* that it matches the hvclock page */
+		page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
+		if (is_error_page(page)) {
+			kvm_release_page_clean(page);
+			return 1;
+		}
+		if (page != vcpu->arch.time_page) {
+			kvm_release_page_clean(page);
+			return 1;
+		}
+		kvm_release_page_clean(page);
+		vcpu->arch.stolen_time_offset = stolen_time_offset;
+		vcpu->arch.stolen_time_enable = 1;
+		break;
+	}
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1246,6 +1283,7 @@ int kvm_dev_ioctl_check_extension(long e
 	case KVM_CAP_PIT2:
 	case KVM_CAP_PIT_STATE2:
 	case KVM_CAP_SET_IDENTITY_MAP_ADDR:
+	case KVM_CAP_PVCLOCK_RUNTIME:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
Index: kvm/arch/x86/kvm/Kconfig
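The MSR_KVM_RUN_TIME write decoded in the hunk above packs three things into one 64-bit register: bit 0 is the enable flag, the remaining low page bits are the offset within the already-registered hv_clock page, and the upper bits name the guest frame that must match that page. A small sketch of just the bit manipulation (hypothetical helper, host-side page checks omitted):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  (~((uint64_t)(1 << PAGE_SHIFT) - 1))

struct decoded {
    int enable;           /* bit 0 of the MSR value                 */
    uint64_t gfn;         /* guest frame; must match the hvclock page */
    unsigned int offset;  /* in-page offset, with the enable bit cleared */
};

/* Hypothetical illustration of the decoding in kvm_set_msr_common(). */
static struct decoded decode_runtime_msr(uint64_t data)
{
    struct decoded d;

    d.enable = data & 1;
    d.offset = data & ~(PAGE_MASK | 1);  /* low bits minus the enable bit */
    d.gfn    = data >> PAGE_SHIFT;
    return d;
}
```

The guest-side kvm_register_run_info() in patch 3/5 builds the value the same way: physical address of the per-cpu run_info structure OR'd with the enable bit.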
[patch 3/5] kvmclock: stolen time aware sched_clock
sched_clock() should time the vcpu run time. Subtract stolen time from realtime pvclock.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kernel/kvmclock.c
===================================================================
--- kvm.orig/arch/x86/kernel/kvmclock.c
+++ kvm/arch/x86/kernel/kvmclock.c
@@ -38,7 +38,16 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+struct time_info {
+	struct pvclock_vcpu_time_info hv_clock;
+	struct kvm_vcpu_runtime_info run_info;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct time_info, time_info);
+
+#define hv_clock time_info.hv_clock
+#define run_info time_info.run_info
+
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -84,6 +93,40 @@ static cycle_t kvm_clock_get_cycles(stru
 	return kvm_clock_read();
 }
 
+cycle_t kvm_runtime_read(struct pvclock_vcpu_time_info *src,
+			 struct kvm_vcpu_runtime_info *rinfo)
+{
+	struct pvclock_shadow_time shadow;
+	unsigned version;
+	cycle_t ret, offset;
+	unsigned long long stolen;
+
+	do {
+		version = pvclock_get_time_values(&shadow, src);
+		barrier();
+		offset = pvclock_get_nsec_offset(&shadow);
+		stolen = rinfo->stolen_time;
+		ret = shadow.system_timestamp + offset - stolen;
+		barrier();
+	} while (version != src->version);
+
+	return ret;
+}
+
+static cycle_t kvm_clock_read_unstolen(void)
+{
+	struct pvclock_vcpu_time_info *src;
+	struct kvm_vcpu_runtime_info *rinfo;
+	cycle_t ret;
+
+	src = &get_cpu_var(hv_clock);
+	rinfo = &get_cpu_var(run_info);
+	ret = kvm_runtime_read(src, rinfo);
+	put_cpu_var(run_info);
+	put_cpu_var(hv_clock);
+	return ret;
+}
+
 /*
  * If we don't do that, there is the possibility that the guest
  * will calibrate under heavy load - thus, getting a lower lpj -
@@ -133,14 +176,30 @@ static int kvm_register_clock(char *txt)
 	return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
 }
 
+static int kvm_register_run_info(char *txt)
+{
+	int cpu = smp_processor_id();
+	int low, high;
+
+	low = (int) __pa(&per_cpu(run_info, cpu)) | 1;
+	high = ((u64)__pa(&per_cpu(run_info, cpu)) >> 32);
+	printk(KERN_INFO "kvm-runtime-info: cpu %d, msr %x:%x, %s\n",
+	       cpu, high, low, txt);
+	return native_write_msr_safe(MSR_KVM_RUN_TIME, low, high);
+}
+
 #ifdef CONFIG_X86_LOCAL_APIC
 static void __cpuinit kvm_setup_secondary_clock(void)
 {
+	char *txt = "secondary cpu clock";
+
 	/*
 	 * Now that the first cpu already had this clocksource initialized,
 	 * we shouldn't fail.
 	 */
-	WARN_ON(kvm_register_clock("secondary cpu clock"));
+	WARN_ON(kvm_register_clock(txt));
+	if (kvm_para_has_feature(KVM_FEATURE_RUNTIME_INFO))
+		kvm_register_run_info(txt);
 	/* ok, done with our trickery, call native */
 	setup_secondary_APIC_clock();
 }
@@ -149,7 +208,11 @@ static void __cpuinit kvm_setup_secondar
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
-	WARN_ON(kvm_register_clock("primary cpu clock"));
+	char *txt = "primary cpu clock";
+
+	WARN_ON(kvm_register_clock(txt));
+	if (kvm_para_has_feature(KVM_FEATURE_RUNTIME_INFO))
+		kvm_register_run_info(txt);
 	native_smp_prepare_boot_cpu();
 }
 #endif
@@ -204,4 +267,6 @@ void __init kvmclock_init(void)
 		pv_info.paravirt_enabled = 1;
 		pv_info.name = "KVM";
 	}
+	if (kvm_para_has_feature(KVM_FEATURE_RUNTIME_INFO))
+		pv_time_ops.sched_clock = kvm_clock_read_unstolen;
 }
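At its core, kvm_clock_read_unstolen() just subtracts the stolen-time total published by the host from the pvclock reading, so sched_clock counts only time the vcpu actually ran. A stripped-down sketch (the per-cpu shared page and the version-retry loop are replaced by plain globals for illustration):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t pvclock_ns; /* ns since boot, as the host's pvclock reports */
static uint64_t stolen_ns;  /* ns the vcpu spent preempted, per run_info    */

/* Simplified model of kvm_clock_read_unstolen(); the real code re-reads
 * both values inside the pvclock version loop to get a consistent pair. */
static uint64_t sched_clock_unstolen(void)
{
    return pvclock_ns - stolen_ns;
}
```

So if the host clock advanced 5 ms but 1.2 ms of that was spent runnable-but-preempted, sched_clock only advances 3.8 ms — which is why the cover letter notes improved sched_clock accuracy.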