RE: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
-Original Message- From: David Miller [mailto:da...@davemloft.net] Sent: Thursday, November 11, 2010 1:47 AM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net. From: xiaohui@intel.com Date: Wed, 10 Nov 2010 17:23:28 +0800 From: Xin Xiaohui xiaohui@intel.com 2) The idea to key off of skb-dev in skb_release_data() is fundamentally flawed since many actions can change skb-dev on you, which will end up causing a leak of your external data areas. How about this one? If the destructor_arg is not a good candidate, then I have to add an apparent field in shinfo. If destructor_arg is actually a net_device pointer or similar, you will need to take a reference count on it or similar. Do you mean destructor_arg will be consumed by other user? If that case, may I add a new structure member in shinfo? Thus only zero-copy will use it, and no need for the reference count. Which means -- good bye performance especially on SMP. You're going to be adding new serialization points and at least two new atomics per packet. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: buildbot for kvm.git
On Thursday, November 11, 2010 02:31:06 am Avi Kivity wrote: Daniel, the buildbot has been fairly effective in keeping qemu-kvm.git building. I'd like to extend that to kvm.git, especially for non-x86 architectures. [...] Can you help with this? Sure. I'll look into that next week. Best Regards, Daniel -- Daniel Gollub Linux Consultant Developer Mail: gol...@b1-systems.de B1 Systems GmbH Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537 signature.asc Description: This is a digitally signed message part.
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On 11/11/2010 02:56 AM, Huang Ying wrote: On Thu, 2010-11-11 at 00:49 +0800, Anthony Liguori wrote: On 11/10/2010 02:34 AM, Avi Kivity wrote: Why the gpa - hva mapping is not consistent for RAM if -mempath is not used? Video RAM in the range a-b and PCI mapped RAM can change gpas (while remaining in the same hva). Even for ordinary RAM, which doesn't normally change gpa/hva, I'm not sure we want to guarantee that it won't. We can't universally either. Memory hot remove potentially breaks the mapping and some non-x86 architectures (like ARM) can alias RAM via a guest triggered mechanism. Thanks for clarification. Now I think we have two options. 1) Clearly mark gpa2hva (pfa2hva now, should renamed to gpa2hva) as a testing only interface, and should be used only on restricted environment (normal memory, without hot-remove, maybe only on x86). 2) Find some way to lock the gpa - hva mapping during operating. Such as gpa2hva_begin and gpa2hva_end and lock the mapping in between. I think 2) may be possible. But if there are no other users, why do that for a test case? So I think 1) may be the better option. 3) Move the poisoning code into qemu, so the command becomes posison-address addr (though physical addresses aren't stable either) 4) Use -mempath on /dev/shm and poison a page in the backing file (if poisoning works on tmpfs) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 08/14] Add get_fw_dev_path callback for pci bus.
On 11/10/10 18:34, Michael S. Tsirkin wrote: On Wed, Nov 10, 2010 at 07:14:15PM +0200, Gleb Natapov wrote: Signed-off-by: Gleb Natapovg...@redhat.com Good stuff. We should also consider using this for CLI and monitor. Some comments below. Oh, we already have a table to map pci classes to descriptions for 'info pci'. I'd strongly suggest to just add the fw names to that table instead of creating a second one ... cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 03/14] Keep track of ISA ports ISA device is using in qdev.
Hi, register_ioport_write (s-port, 1, 1, gus_writeb, s); register_ioport_write (s-port, 1, 2, gus_writew, s); +isa_init_ioport_range(dev, s-port, 2); register_ioport_read ((s-port + 0x100) 0xf00, 1, 1, gus_readb, s); register_ioport_read ((s-port + 0x100) 0xf00, 1, 2, gus_readw, s); +isa_init_ioport_range(dev, (s-port + 0x100) 0xf00, 2); register_ioport_write (s-port + 6, 10, 1, gus_writeb, s); register_ioport_write (s-port + 6, 10, 2, gus_writew, s); register_ioport_read (s-port + 6, 10, 1, gus_readb, s); register_ioport_read (s-port + 6, 10, 2, gus_readw, s); +isa_init_ioport_range(dev, s-port + 6, 10); register_ioport_write (s-port + 0x100, 8, 1, gus_writeb, s); register_ioport_write (s-port + 0x100, 8, 2, gus_writew, s); register_ioport_read (s-port + 0x100, 8, 1, gus_readb, s); register_ioport_read (s-port + 0x100, 8, 2, gus_readw, s); +isa_init_ioport_range(dev, s-port + 0x100, 8); Seeing all the duplication here and elsewhere ... How about moving the register_ioport_{read,write} calls into isa_init_ioport_range() ? cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 08/14] Add get_fw_dev_path callback for pci bus.
On Thu, Nov 11, 2010 at 11:07:01AM +0100, Gerd Hoffmann wrote: On 11/10/10 18:34, Michael S. Tsirkin wrote: On Wed, Nov 10, 2010 at 07:14:15PM +0200, Gleb Natapov wrote: Signed-off-by: Gleb Natapovg...@redhat.com Good stuff. We should also consider using this for CLI and monitor. Some comments below. Oh, we already have a table to map pci classes to descriptions for 'info pci'. I'd strongly suggest to just add the fw names to that table instead of creating a second one ... Do you mean pci_class_descriptions? For some classes open firmware spec defines single name for all subclasses. For instance all 0Axx should be called dock no matter what xx is. This can't be encoded in a simple table. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: VMX: Fix host userspace gsbase corruption
We now use load_gs_index() to load gs safely; unfortunately this also changes MSR_KERNEL_GS_BASE, which we managed separately. This resulted in confusion and breakage running 32-bit host userspace on a 64-bit kernel. Fix by - saving guest MSR_KERNEL_GS_BASE before we we reload the host's gs - doing the host save/load unconditionally, instead of only when in guest long mode Things can be cleaned up further, but this is the minmal fix for now. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/vmx.c | 15 +++ 1 files changed, 7 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 9367abc..0badeac 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -821,10 +821,9 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu) #endif #ifdef CONFIG_X86_64 - if (is_long_mode(vmx-vcpu)) { - rdmsrl(MSR_KERNEL_GS_BASE, vmx-msr_host_kernel_gs_base); + rdmsrl(MSR_KERNEL_GS_BASE, vmx-msr_host_kernel_gs_base); + if (is_long_mode(vmx-vcpu)) wrmsrl(MSR_KERNEL_GS_BASE, vmx-msr_guest_kernel_gs_base); - } #endif for (i = 0; i vmx-save_nmsrs; ++i) kvm_set_shared_msr(vmx-guest_msrs[i].index, @@ -839,11 +838,14 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx) ++vmx-vcpu.stat.host_state_reload; vmx-host_state.loaded = 0; +#ifdef CONFIG_X86_64 + if (is_long_mode(vmx-vcpu)) + rdmsrl(MSR_KERNEL_GS_BASE, vmx-msr_guest_kernel_gs_base); +#endif if (vmx-host_state.gs_ldt_reload_needed) { kvm_load_ldt(vmx-host_state.ldt_sel); #ifdef CONFIG_X86_64 load_gs_index(vmx-host_state.gs_sel); - wrmsrl(MSR_KERNEL_GS_BASE, current-thread.gs); #else loadsegment(gs, vmx-host_state.gs_sel); #endif @@ -852,10 +854,7 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx) loadsegment(fs, vmx-host_state.fs_sel); reload_tss(); #ifdef CONFIG_X86_64 - if (is_long_mode(vmx-vcpu)) { - rdmsrl(MSR_KERNEL_GS_BASE, vmx-msr_guest_kernel_gs_base); - wrmsrl(MSR_KERNEL_GS_BASE, vmx-msr_host_kernel_gs_base); - } + wrmsrl(MSR_KERNEL_GS_BASE, vmx-msr_host_kernel_gs_base); #endif if (current_thread_info()-status TS_USEDFPU) clts(); -- 1.7.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 08/14] Add get_fw_dev_path callback for pci bus.
Oh, we already have a table to map pci classes to descriptions for 'info pci'. I'd strongly suggest to just add the fw names to that table instead of creating a second one ... Do you mean pci_class_descriptions? Exactly. For some classes open firmware spec defines single name for all subclasses. For instance all 0Axx should be called dock no matter what xx is. This can't be encoded in a simple table. Well, actually it is a list not a table. Should be easy to extend, like adding a mask, so you can have { .class = 0xa000, .mask = 0xff00, .fw_name = dock, .desc = whatever } to match 0xA0xx ... cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM test: Add subtest clock_getres
In order to track a regression that happens using kvmclock in guests, add test of the clock_getres() syscall, to see what is the clocksource resolution reported by the guest. This, combined with variants that set different -cpu params for the qemu command line, will give us the desired outcome. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/deps/test_clock_getres/Makefile | 11 .../kvm/deps/test_clock_getres/test_clock_getres.c | 58 client/tests/kvm/tests/clock_getres.py | 44 +++ client/tests/kvm/tests_base.cfg.sample |3 + 4 files changed, 116 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/deps/test_clock_getres/Makefile create mode 100644 client/tests/kvm/deps/test_clock_getres/test_clock_getres.c create mode 100644 client/tests/kvm/tests/clock_getres.py diff --git a/client/tests/kvm/deps/test_clock_getres/Makefile b/client/tests/kvm/deps/test_clock_getres/Makefile new file mode 100644 index 000..b4f73c7 --- /dev/null +++ b/client/tests/kvm/deps/test_clock_getres/Makefile @@ -0,0 +1,11 @@ +CC = gcc +PROG = test_clock_getres +SRC = test_clock_getres.c +LIBS = -lrt + +all: $(PROG) + +$(PROG): + $(CC) $(LIBS) -o $(PROG) $(SRC) +clean: + rm -f $(PROG) diff --git a/client/tests/kvm/deps/test_clock_getres/test_clock_getres.c b/client/tests/kvm/deps/test_clock_getres/test_clock_getres.c new file mode 100644 index 000..81d3b9c --- /dev/null +++ b/client/tests/kvm/deps/test_clock_getres/test_clock_getres.c @@ -0,0 +1,58 @@ +/* + * Test clock resolution for KVM guests that have kvm-clock as clock source + * + * Copyright (c) 2010 Red Hat, Inc + * Author: Lucas Meneghel Rodrigues l...@redhat.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see http://www.gnu.org/licenses/. + */ +#include stdio.h +#include time.h +#include stdlib.h +#include string.h + +int main(void) { + struct timespec res; + int clock_return = clock_getres(CLOCK_MONOTONIC, res); + char clocksource[50]; + char line[80]; + FILE *fr; + if ((fr = fopen( + /sys/devices/system/clocksource/clocksource0/current_clocksource, + rt)) == NULL) { + perror(fopen); + return EXIT_FAILURE; + } + while (fgets(line, 80, fr) != NULL) { + sscanf(line, %s, clocksource); + } + fclose(fr); + if (!strncmp(clocksource, kvm-clock, strlen(kvm-clock))) { + if (clock_return == 0) { + if (res.tv_sec 1 || res.tv_nsec 100) { + printf(FAIL: clock_getres returned bad clock resolution\n); + return EXIT_FAILURE; + } else { + printf(PASS: check successful\n); + return EXIT_SUCCESS; + } + } else { + printf(FAIL: clock_getres failed\n); + return EXIT_FAILURE; + } + } else { + printf(FAIL: invalid clock source: %s\n, clocksource); + return EXIT_FAILURE; + } +} diff --git a/client/tests/kvm/tests/clock_getres.py b/client/tests/kvm/tests/clock_getres.py new file mode 100644 index 000..7cadae8 --- /dev/null +++ b/client/tests/kvm/tests/clock_getres.py @@ -0,0 +1,44 @@ +import logging, time, os +from autotest_lib.client.common_lib import error +from autotest_lib.client.common_lib import utils +import kvm_test_utils, kvm_utils + + +def run_clock_getres(test, params, env): + +Verify if guests using kvm-clock as the time source have a sane clock +resolution. + +@param test: kvm test object. +@param params: Dictionary with test parameters. +@param env: Dictionary with the test environment. + +t_name = test_clock_getres +base_dir = /tmp + +deps_dir = os.path.join(test.bindir, deps, t_name) +os.chdir(deps_dir) +try: +utils.system(make clean) +utils.system(make) +except: +raise error.TestError(Failed to compile %s % t_name) + +test_clock = os.path.join(deps_dir, t_name) +if not os.path.isfile(test_clock): +raise error.TestError(Could not find %s % t_name) + +vm = kvm_test_utils.get_living_vm(env,
RE: Unable to start VM using COWed image
From: Prasad Joshi Sent: 10 November 2010 13:01 To: Stefan Hajnoczi Cc: Keqin Hong; kvm@vger.kernel.org Subject: RE: Unable to start VM using COWed image From: Stefan Hajnoczi [stefa...@gmail.com] Sent: 10 November 2010 12:47 To: Prasad Joshi Cc: Keqin Hong; kvm@vger.kernel.org Subject: Re: Unable to start VM using COWed image On Wed, Nov 10, 2010 at 12:40 PM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote: From: Stefan Hajnoczi [stefa...@gmail.com] Sent: 10 November 2010 11:12 To: Prasad Joshi Cc: Keqin Hong; kvm@vger.kernel.org Subject: Re: Unable to start VM using COWed image On Wed, Nov 10, 2010 at 10:08 AM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote: Where can I get the code of the qemu-kvm program? I cloned the qemu-lvm git repository and compiled the code. But it looks like qemu-kvm program is not part of this code.-- qemu-kvm.git contains the qemu-kvm codebase but the binary is built in x86_64-softmmu/qemu-system-x86_64. Distro packages typically rename it to qemu-kvm. Thanks Stefan for your reply. I guess you pointed out the problem in the first mail. QEMU places a restriction on location of the COWed file. The source image and COWed image should be in the same drectory. In my case the source image was in directory /var/lib/libvirt/images/ and the COWed image was in /home/prasad/Virtual directory. While debuging the source code using gdb I realized this limitation. It would be good to fix this problem. I will see if I can solve this problem. This behavior is a feature. You chose to use a relative backing file path when you used qemu-img create -b relative-path. If you want an absolute path you need to use qemu-img create -b /home/prasad/Virtual/... (i.e. specify an absolute path instead of a relative path). Oh I see. I am such a stupid. ha ha ha ha Thanks a lot for letting me know. It worked after using absolute paths. Gr8 Thanks a lot. Though specifying the absolute path for source image worked for me. Can any one please let me know the situation in which one would not want to specify the absolute path? How does relative path help? Advantage of using relative path rather than absolute path. I think using the absolute path would always work. One more question on the same lines, How does QEMU detect the file is COWed and the name of the file (not whole path) from it is COWed? COW support comes from the image file format that you choose. A qcow file is not just a raw image file like the kind you can dd from a real disk. Instead it has its own file format including a header and metadata for tracking allocated space. The header contains the name of the backing file. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to start VM using COWed image
On Thu, Nov 11, 2010 at 12:17 PM, Prasad Joshi p.g.jo...@student.reading.ac.uk wrote: Though specifying the absolute path for source image worked for me. Can any one please let me know the situation in which one would not want to specify the absolute path? How does relative path help? Advantage of using relative path rather than absolute path. I think using the absolute path would always work. Relative paths are useful when sharing images with other people. An absolute path won't work on another machine unless you use the same parent directory structure. If you send me an image file with an absolute path in your home directory, I won't be able to use it easily on my machine. (Actually the new qemu-img rebase -u command can be used to fix up the image file on the destination machine but it's an extra step and not user-friendly.) Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/3] virtio: Use ioeventfd for virtqueue notify
This is a rewrite of the virtio-ioeventfd patchset to work at the virtio-pci.c level instead of virtio.c. This results in better integration with the host/guest notifier code and makes the code simpler (no more state machine). Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies. The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially. Now that this code is in virtio-pci.c it is possible to explicitly enable devices for which virtio-ioeventfd should be used. Only virtio-blk and virtio-net are enabled at this time. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies. The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially. Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Care must be taken not to interfere with vhost-net, which already uses ioeventfd host notifiers. The following list shows the behavior implemented in this patch and is designed to take vhost-net into account: * VIRTIO_CONFIG_S_DRIVER_OK - assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read) * reset - qemu_set_fd_handler(NULL), deassign host notifiers * virtio_pci_set_host_notifier(true) - qemu_set_fd_handler(NULL) * virtio_pci_set_host_notifier(false) - qemu_set_fd_handler(virtio_pci_host_notifier_read) Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- hw/virtio-pci.c | 155 +++--- hw/virtio.c |5 ++ hw/virtio.h |1 + 3 files changed, 129 insertions(+), 32 deletions(-) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index 549118d..436fc59 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -83,6 +83,10 @@ /* Flags track per-device state like workarounds for quirks in older guests. */ #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG (1 0) +/* Performance improves when virtqueue kick processing is decoupled from the + * vcpu thread using ioeventfd for some devices. */ +#define VIRTIO_PCI_FLAG_USE_IOEVENTFD (1 1) + /* QEMU doesn't strictly need write barriers since everything runs in * lock-step. We'll leave the calls to wmb() in though to make it obvious for * KVM or if kqemu gets SMP support. @@ -179,12 +183,108 @@ static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f) return 0; } +static int virtio_pci_set_host_notifier_ioeventfd(VirtIOPCIProxy *proxy, int n, bool assign) +{ +VirtQueue *vq = virtio_get_queue(proxy-vdev, n); +EventNotifier *notifier = virtio_queue_get_host_notifier(vq); +int r; +if (assign) { +r = event_notifier_init(notifier, 1); +if (r 0) { +return r; +} +r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier), + proxy-addr + VIRTIO_PCI_QUEUE_NOTIFY, + n, assign); +if (r 0) { +event_notifier_cleanup(notifier); +} +} else { +r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier), + proxy-addr + VIRTIO_PCI_QUEUE_NOTIFY, + n, assign); +if (r 0) { +return r; +} +event_notifier_cleanup(notifier); +} +return r; +} + +static void virtio_pci_host_notifier_read(void *opaque) +{ +VirtQueue *vq = opaque; +EventNotifier *n = virtio_queue_get_host_notifier(vq); +if (event_notifier_test_and_clear(n)) { +virtio_queue_notify_vq(vq); +} +} + +static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy, int n, bool assign) +{ +VirtQueue *vq = virtio_get_queue(proxy-vdev, n); +EventNotifier *notifier = virtio_queue_get_host_notifier(vq); +if (assign) { +qemu_set_fd_handler(event_notifier_get_fd(notifier), +virtio_pci_host_notifier_read, NULL, vq); +} else { +qemu_set_fd_handler(event_notifier_get_fd(notifier), +NULL, NULL, NULL); +} +} + +static int virtio_pci_set_host_notifiers(VirtIOPCIProxy *proxy, bool assign) +{ +int n, r; + +for (n = 0; n VIRTIO_PCI_QUEUE_MAX; n++) { +if (!virtio_queue_get_num(proxy-vdev, n)) { +continue; +} + +if (assign) { +r = virtio_pci_set_host_notifier_ioeventfd(proxy, n, true); +if (r 0) { +goto assign_error; +} + +virtio_pci_set_host_notifier_fd_handler(proxy, n, true); +} else { +virtio_pci_set_host_notifier_fd_handler(proxy, n, false); +virtio_pci_set_host_notifier_ioeventfd(proxy, n, false); +} +} +return 0; + +assign_error: +proxy-flags = ~VIRTIO_PCI_FLAG_USE_IOEVENTFD; +while (--n = 0) { +virtio_pci_set_host_notifier_fd_handler(proxy, n, false); +
[PATCH 1/3] virtio-pci: Rename bugs field to flags
The VirtIOPCIProxy bugs field is currently used to enable workarounds for older guests. Rename it to flags so that other per-device behavior can be tracked. A later patch uses the flags field to remember whether ioeventfd should be used for virtqueue host notification. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- hw/virtio-pci.c | 15 +++ 1 files changed, 7 insertions(+), 8 deletions(-) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index 729917d..549118d 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -80,9 +80,8 @@ * 12 is historical, and due to x86 page size. */ #define VIRTIO_PCI_QUEUE_ADDR_SHIFT12 -/* We can catch some guest bugs inside here so we continue supporting older - guests. */ -#define VIRTIO_PCI_BUG_BUS_MASTER (1 0) +/* Flags track per-device state like workarounds for quirks in older guests. */ +#define VIRTIO_PCI_FLAG_BUS_MASTER_BUG (1 0) /* QEMU doesn't strictly need write barriers since everything runs in * lock-step. We'll leave the calls to wmb() in though to make it obvious for @@ -95,7 +94,7 @@ typedef struct { PCIDevice pci_dev; VirtIODevice *vdev; -uint32_t bugs; +uint32_t flags; uint32_t addr; uint32_t class_code; uint32_t nvectors; @@ -159,7 +158,7 @@ static int virtio_pci_load_config(void * opaque, QEMUFile *f) in ready state. Then we have a buggy guest OS. */ if ((proxy-vdev-status VIRTIO_CONFIG_S_DRIVER_OK) !(proxy-pci_dev.config[PCI_COMMAND] PCI_COMMAND_MASTER)) { -proxy-bugs |= VIRTIO_PCI_BUG_BUS_MASTER; +proxy-flags |= VIRTIO_PCI_FLAG_BUS_MASTER_BUG; } return 0; } @@ -185,7 +184,7 @@ static void virtio_pci_reset(DeviceState *d) VirtIOPCIProxy *proxy = container_of(d, VirtIOPCIProxy, pci_dev.qdev); virtio_reset(proxy-vdev); msix_reset(proxy-pci_dev); -proxy-bugs = 0; +proxy-flags = 0; } static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) @@ -235,7 +234,7 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) some safety checks. */ if ((val VIRTIO_CONFIG_S_DRIVER_OK) !(proxy-pci_dev.config[PCI_COMMAND] PCI_COMMAND_MASTER)) { -proxy-bugs |= VIRTIO_PCI_BUG_BUS_MASTER; +proxy-flags |= VIRTIO_PCI_FLAG_BUS_MASTER_BUG; } break; case VIRTIO_MSI_CONFIG_VECTOR: @@ -403,7 +402,7 @@ static void virtio_write_config(PCIDevice *pci_dev, uint32_t address, if (PCI_COMMAND == address) { if (!(val PCI_COMMAND_MASTER)) { -if (!(proxy-bugs VIRTIO_PCI_BUG_BUS_MASTER)) { +if (!(proxy-flags VIRTIO_PCI_FLAG_BUS_MASTER_BUG)) { virtio_set_status(proxy-vdev, proxy-vdev-status ~VIRTIO_CONFIG_S_DRIVER_OK); } -- 1.7.2.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] virtio-pci: Don't use ioeventfd on old kernels
There used to be a limit of 6 KVM io bus devices inside the kernel. On such a kernel, don't use ioeventfd for virtqueue host notification since the limit is reached too easily. This ensures that existing vhost-net setups (which always use ioeventfd) have ioeventfds available so they can continue to work. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- hw/virtio-pci.c |4 kvm-all.c | 46 ++ kvm-stub.c |5 + kvm.h |1 + 4 files changed, 56 insertions(+), 0 deletions(-) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index 436fc59..365a26b 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -646,6 +646,10 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev, pci_register_bar(proxy-pci_dev, 0, size, PCI_BASE_ADDRESS_SPACE_IO, virtio_map); +if (!kvm_has_many_ioeventfds()) { +proxy-flags = ~VIRTIO_PCI_FLAG_USE_IOEVENTFD; +} + virtio_bind_device(vdev, virtio_pci_bindings, proxy); proxy-host_features |= 0x1 VIRTIO_F_NOTIFY_ON_EMPTY; proxy-host_features |= 0x1 VIRTIO_F_BAD_FEATURE; diff --git a/kvm-all.c b/kvm-all.c index 37b99c7..ba302bc 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -28,6 +28,11 @@ #include kvm.h #include bswap.h +/* This check must be after config-host.h is included */ +#ifdef CONFIG_EVENTFD +#include sys/eventfd.h +#endif + /* KVM uses PAGE_SIZE in it's definition of COALESCED_MMIO_MAX */ #define PAGE_SIZE TARGET_PAGE_SIZE @@ -72,6 +77,7 @@ struct KVMState int irqchip_in_kernel; int pit_in_kernel; int xsave, xcrs; +int many_ioeventfds; }; static KVMState *kvm_state; @@ -441,6 +447,39 @@ int kvm_check_extension(KVMState *s, unsigned int extension) return ret; } +static int kvm_check_many_ioeventfds(void) +{ +/* Older kernels have a 6 device limit on the KVM io bus. Find out so we + * can avoid creating too many ioeventfds. + */ +#ifdef CONFIG_EVENTFD +int ioeventfds[7]; +int i, ret = 0; +for (i = 0; i ARRAY_SIZE(ioeventfds); i++) { +ioeventfds[i] = eventfd(0, EFD_CLOEXEC); +if (ioeventfds[i] 0) { +break; +} +ret = kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, true); +if (ret 0) { +close(ioeventfds[i]); +break; +} +} + +/* Decide whether many devices are supported or not */ +ret = i == ARRAY_SIZE(ioeventfds); + +while (i-- 0) { +kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, false); +close(ioeventfds[i]); +} +return ret; +#else +return 0; +#endif +} + static void kvm_set_phys_mem(target_phys_addr_t start_addr, ram_addr_t size, ram_addr_t phys_offset) @@ -717,6 +756,8 @@ int kvm_init(int smp_cpus) kvm_state = s; cpu_register_phys_memory_client(kvm_cpu_phys_memory_client); +s-many_ioeventfds = kvm_check_many_ioeventfds(); + return 0; err: @@ -1046,6 +1087,11 @@ int kvm_has_xcrs(void) return kvm_state-xcrs; } +int kvm_has_many_ioeventfds(void) +{ +return kvm_state-many_ioeventfds; +} + void kvm_setup_guest_memory(void *start, size_t size) { if (!kvm_has_sync_mmu()) { diff --git a/kvm-stub.c b/kvm-stub.c index 5384a4b..33d4476 100644 --- a/kvm-stub.c +++ b/kvm-stub.c @@ -99,6 +99,11 @@ int kvm_has_robust_singlestep(void) return 0; } +int kvm_has_many_ioeventfds(void) +{ +return 0; +} + void kvm_setup_guest_memory(void *start, size_t size) { } diff --git a/kvm.h b/kvm.h index 60a9b42..ce08d42 100644 --- a/kvm.h +++ b/kvm.h @@ -42,6 +42,7 @@ int kvm_has_robust_singlestep(void); int kvm_has_debugregs(void); int kvm_has_xsave(void); int kvm_has_xcrs(void); +int kvm_has_many_ioeventfds(void); #ifdef NEED_CPU_H int kvm_init_vcpu(CPUState *env); -- 1.7.2.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GPGPU passthrough in linux kvm
On 10/15/2010 09:11 AM, André Weidemann wrote: Hi Federico, On 15.06.2010 18:18, Fede wrote: On Wed, Jun 9, 2010 at 18:51, Adhyas Avasthiadh...@gmail.com wrote: I read an old email thread which talked about GPGPU passthroughin linux-kvm. Was this implemented? If not, are there some quick hacks I can use to enable it in my tree? Right now, I try to follow the same commands as I do for NIC passthrough, but it complains that my device is busy. I have two NVidia Graphics cards on my box. I'm working on it. It doesn't work because graphics cards have BIOSes. I need to debug some issues with the BIOS of my 9600GT, but I think I'm quite close to succeed. Are you still working on implementing this feature? Did you make some progress in debugging the BIOS issue? I am still very much looking forward to kvm being able to pass a video card to a VM. hope you soon have success so kvm can do GPU passthrough we can not let xen be the only one who can do this see beta XenServer 5.6 now with GPU passthrough Regards, Jason -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: recent nested-virtualization publications
On Tue, Oct 26, 2010 at 03:10:58PM +0200, Nadav Har'El wrote: I put copies of all above mentioned documents (in case there's difficulty in finding them), in http://www.math.technion.ac.il/~nyh/nested/ Very interesting. Thanks for the overview on that topic. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM test : Adding RHEL 6 to the list of guests
Signed-off-by: Pradeep Kumar psuri...@linux.vnet.ibm.com --- client/tests/kvm/tests_base.cfg.sample | 26 +- client/tests/kvm/unattended/RHEL-6-series.ks | 37 ++ 2 files changed, 62 insertions(+), 1 deletions(-) create mode 100644 client/tests/kvm/unattended/RHEL-6-series.ks diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index 07328df..07b6015 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -1334,6 +1334,30 @@ variants: #floppy = images/rhel55-64/ks.vfd cdrom_unattended = images/rhel55-64/ks.iso +- 6.0.i386: +no setup +image_name = rhel6-32 +cdrom_cd1 = linux/RHEL-6.0-i386-DVD.iso +md5sum = 291d234c93442405972689b4b41c14bc +md5sum_1m = ee2cc3d3babe91a1d581a07099c4318b +unattended_install.cdrom: +unattended_file = unattended/RHEL-6-series.ks +tftp = images/rhel60-32/tftpboot +#floppy = images/rhel60-32/ks.vfd +cdrom_unattended = images/rhel60-32/ks.iso + +- 6.0.x86_64: +no setup +image_name = rhel6-64 +cdrom_cd1 = linux/RHEL-6.0-x86_64-DVD.iso +md5sum = f7141396c6a19399d63e8c195354317d +md5sum_1m = b060eeef63e2c8700db54ae02056e80c +unattended_install.cdrom: +unattended_file = unattended/RHEL-6-series.ks +tftp = images/rhel60-64/tftpboot +#floppy = images/rhel60-64/ks.vfd +cdrom_unattended = images/rhel60-64/ks.iso + # Windows section - @Windows: @@ -1930,7 +1954,7 @@ variants: virtio_net|virtio_blk|e1000|balloon_check: -only Fedora.11 Fedora.12 Fedora.13 RHEL.5 OpenSUSE.11 SLES.11 Ubuntu-8.10-server +only Fedora.11 Fedora.12 Fedora.13 RHEL.5 RHEL.6 OpenSUSE.11 SLES.11 Ubuntu-8.10-server # only WinXP Win2003 Win2008 WinVista Win7 Fedora.11 Fedora.12 Fedora.13 RHEL.5 OpenSUSE.11 SLES.11 Ubuntu-8.10-server kdump: diff --git a/client/tests/kvm/unattended/RHEL-6-series.ks b/client/tests/kvm/unattended/RHEL-6-series.ks new file mode 100644 index 000..3ee84f1 --- /dev/null +++ b/client/tests/kvm/unattended/RHEL-6-series.ks @@ -0,0 +1,37 @@ +install +KVM_TEST_MEDIUM +text +reboot +lang en_US.UTF-8 +keyboard us +key --skip +network --bootproto dhcp +rootpw 123456 +firewall --enabled --ssh +selinux --enforcing +timezone --utc America/New_York +firstboot --disable +bootloader --location=mbr --append=console=tty0 console=ttyS0,115200 +zerombr +clearpart --all --initlabel +autopart +reboot + +%packages +...@base +...@development-libs +...@development-tools +kexec-tools + +%post --interpreter /usr/bin/python +import socket, os +os.system('dhclient') +os.system('chkconfig sshd on') +os.system('iptables -F') +os.system('echo 0 /selinux/enforce') +server = socket.socket(socket.AF_INET, socket.SOCK_STREAM) +server.bind(('', 12323)) +server.listen(1) +(client, addr) = server.accept() +client.send(done) +client.close() -- 1.7.2.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM test: Add a subtest kdump
From: Jason Wang jasow...@redhat.com Add a new subtest to check whether kdump work correctly in guest. This test just try to trigger crash on each vcpu and then verify it by checking the vmcore. Signed-off-by: Jason Wang jasow...@redhat.com --- client/tests/kvm/tests/kdump.py | 80 ++ client/tests/kvm/tests_base.cfg.sample | 11 client/tests/kvm/unattended/RHEL-5-series.ks |1 + 3 files changed, 92 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/kdump.py diff --git a/client/tests/kvm/tests/kdump.py b/client/tests/kvm/tests/kdump.py new file mode 100644 index 000..a5843c7 --- /dev/null +++ b/client/tests/kvm/tests/kdump.py @@ -0,0 +1,80 @@ +import logging, time +from autotest_lib.client.common_lib import error +import kvm_subprocess, kvm_test_utils, kvm_utils + + +def run_kdump(test, params, env): + +KVM reboot test: +1) Log into a guest +2) Check and enable the kdump +3) For each vcpu, trigger a crash and check the vmcore + +@param test: kvm test object +@param params: Dictionary with the test parameters +@param env: Dictionary with test environment. + +vm = kvm_test_utils.get_living_vm(env, params.get(main_vm)) +timeout = float(params.get(login_timeout, 240)) +crash_timeout = float(params.get(crash_timeout, 360)) +session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2) +def_kernel_param_cmd = (grubby --update-kernel=`grubby --default-kernel` + --args=crashkernel=1...@64m) +kernel_param_cmd = params.get(kernel_param_cmd, def_kernel_param_cmd) +def_kdump_enable_cmd = chkconfig kdump on service kdump start +kdump_enable_cmd = params.get(kdump_enable_cmd, def_kdump_enable_cmd) + +def crash_test(vcpu): + +Trigger a crash dump through sysrq-trigger + +@param vcpu: vcpu which is used to trigger a crash + +session = kvm_test_utils.wait_for_login(vm, 0, timeout, 0, 2) +session.get_command_status(rm -rf /var/crash/*) + +logging.info(Triggering crash on vcpu %d ..., vcpu) +crash_cmd = taskset -c %d echo c /proc/sysrq-trigger % vcpu +session.sendline(crash_cmd) + +if not kvm_utils.wait_for(lambda: not session.is_responsive(), 240, 0, + 1): +raise error.TestFail(Could not trigger crash on vcpu %d % vcpu) + +logging.info(Waiting for kernel crash dump to complete) +session = kvm_test_utils.wait_for_login(vm, 0, crash_timeout, 0, 2) + +logging.info(Probing vmcore file...) +s = session.get_command_status(ls -R /var/crash | grep vmcore) +if s != 0: +raise error.TestFail(Could not find the generated vmcore file) +else: +logging.info(Found vmcore.) + +session.get_command_status(rm -rf /var/crash/*) + +try: +logging.info(Checking the existence of crash kernel...) +prob_cmd = grep -q 1 /sys/kernel/kexec_crash_loaded +s = session.get_command_status(prob_cmd) +if s != 0: +logging.info(Crash kernel is not loaded. Trying to load it) +# We need to setup the kernel params +s, o = session.get_command_status_output(kernel_param_cmd) +if s != 0: +raise error.TestFail(Could not add crashkernel params to + kernel) +session = kvm_test_utils.reboot(vm, session, timeout=timeout); + +logging.info(Enabling kdump service...) +# the initrd may be rebuilt here so we need to wait a little more +s, o = session.get_command_status_output(kdump_enable_cmd, timeout=120) +if s != 0: +raise error.TestFail(Could not enable kdump service: %s % o) + +nvcpu = int(params.get(smp, 1)) +for i in range (nvcpu): +crash_test(i) + +finally: +session.close() diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index 8fab6e2..00bd3bc 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -664,6 +664,15 @@ variants: image_name_snapshot1 = sn1 image_name_snapshot2 = sn2 +- kdump: +type = kdump +# time waited for the completion of crash dump +# crash_timeout = 360 +# command to add the crashkerne...@y to kernel cmd line +# kernel_param_cmd = grubby --update-kernel=`grubby --default-kernel` --args=crashkernel=1...@64m +# command to enable kdump service +# kdump_enable_cmd = chkconfig kdump on service kdump start + # system_powerdown, system_reset and shutdown *must* be the last ones # defined (in this order), since the effect of such tests can leave # the VM on a bad state. @@ -1924,6 +1933,8 @@ virtio_net|virtio_blk|e1000|balloon_check: only
Re: [PATCH v2 1/2] KVM: SVM: Replace svm_has() by standard Linux cpuid accessors
On Tue, Nov 09, 2010 at 04:15:42PM +0200, Avi Kivity wrote: - if (nr == BP_VECTOR !svm_has(SVM_FEATURE_NRIP)) { + if (nr == BP_VECTOR !static_cpu_has(X86_FEATURE_NRIPS)) { What is static_cpu_has and why you use it only here and boot_cpu_has in all other places? Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] KVM: SVM: Replace svm_has() by standard Linux cpuid accessors
On 11/11/2010 04:46 PM, Joerg Roedel wrote: On Tue, Nov 09, 2010 at 04:15:42PM +0200, Avi Kivity wrote: - if (nr == BP_VECTOR !svm_has(SVM_FEATURE_NRIP)) { + if (nr == BP_VECTOR !static_cpu_has(X86_FEATURE_NRIPS)) { What is static_cpu_has It's like boot_cpu_has, only it works by patching instead of a dynamic test. and why you use it only here and boot_cpu_has in all other places? A nano optimization, this is a more commonly used path. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 08/14] Add get_fw_dev_path callback for pci bus.
On Thu, Nov 11, 2010 at 11:07:01AM +0100, Gerd Hoffmann wrote: On 11/10/10 18:34, Michael S. Tsirkin wrote: On Wed, Nov 10, 2010 at 07:14:15PM +0200, Gleb Natapov wrote: Signed-off-by: Gleb Natapovg...@redhat.com Good stuff. We should also consider using this for CLI and monitor. Some comments below. Oh, we already have a table to map pci classes to descriptions for 'info pci'. You remind me, that one must be fixed to use names from pci_ids.h I'd strongly suggest to just add the fw names to that table instead of creating a second one ... cheers, Gerd No, I mean the path. We are currently passing domain:bus:slot.function to point at a device and we know it's broken for nested bridges and kind of broken for multiple domains. It's a bug that we must fix. Class names aren't that interesting to me, and I'd be just as happy with packing these into 2 tables as a single one personally, it all seems to be a matter of style. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] KVM: SVM: Replace svm_has() by standard Linux cpuid accessors
On Thu, Nov 11, 2010 at 04:50:10PM +0200, Avi Kivity wrote: On 11/11/2010 04:46 PM, Joerg Roedel wrote: On Tue, Nov 09, 2010 at 04:15:42PM +0200, Avi Kivity wrote: - if (nr == BP_VECTOR !svm_has(SVM_FEATURE_NRIP)) { + if (nr == BP_VECTOR !static_cpu_has(X86_FEATURE_NRIPS)) { What is static_cpu_has It's like boot_cpu_has, only it works by patching instead of a dynamic test. and why you use it only here and boot_cpu_has in all other places? A nano optimization, this is a more commonly used path. Ok, I was just curious because I couldn't find the static_cpu_has by a quick grep. Thanks for the explanation. Acked-by: Joerg Roedel joerg.roe...@amd.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote: Virtqueue notify is currently handled synchronously in userspace virtio. This prevents the vcpu from executing guest code while hardware emulation code handles the notify. On systems that support KVM, the ioeventfd mechanism can be used to make virtqueue notify a lightweight exit by deferring hardware emulation to the iothread and allowing the VM to continue execution. This model is similar to how vhost receives virtqueue notifies. The result of this change is improved performance for userspace virtio devices. Virtio-blk throughput increases especially for multithreaded scenarios and virtio-net transmit throughput increases substantially. Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Care must be taken not to interfere with vhost-net, which already uses ioeventfd host notifiers. The following list shows the behavior implemented in this patch and is designed to take vhost-net into account: * VIRTIO_CONFIG_S_DRIVER_OK - assign host notifiers, qemu_set_fd_handler(virtio_pci_host_notifier_read) we should also deassign when VIRTIO_CONFIG_S_DRIVER_OK is cleared by io write or bus master bit? * reset - qemu_set_fd_handler(NULL), deassign host notifiers * virtio_pci_set_host_notifier(true) - qemu_set_fd_handler(NULL) * virtio_pci_set_host_notifier(false) - qemu_set_fd_handler(virtio_pci_host_notifier_read) Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- hw/virtio-pci.c | 155 +++--- hw/virtio.c |5 ++ hw/virtio.h |1 + 3 files changed, 129 insertions(+), 32 deletions(-) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index 549118d..436fc59 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -83,6 +83,10 @@ /* Flags track per-device state like workarounds for quirks in older guests. */ #define VIRTIO_PCI_FLAG_BUS_MASTER_BUG (1 0) +/* Performance improves when virtqueue kick processing is decoupled from the + * vcpu thread using ioeventfd for some devices. */ +#define VIRTIO_PCI_FLAG_USE_IOEVENTFD (1 1) + /* QEMU doesn't strictly need write barriers since everything runs in * lock-step. We'll leave the calls to wmb() in though to make it obvious for * KVM or if kqemu gets SMP support. @@ -179,12 +183,108 @@ static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f) return 0; } +static int virtio_pci_set_host_notifier_ioeventfd(VirtIOPCIProxy *proxy, int n, bool assign) +{ +VirtQueue *vq = virtio_get_queue(proxy-vdev, n); +EventNotifier *notifier = virtio_queue_get_host_notifier(vq); +int r; +if (assign) { +r = event_notifier_init(notifier, 1); +if (r 0) { +return r; +} +r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier), + proxy-addr + VIRTIO_PCI_QUEUE_NOTIFY, + n, assign); +if (r 0) { +event_notifier_cleanup(notifier); +} +} else { +r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier), + proxy-addr + VIRTIO_PCI_QUEUE_NOTIFY, + n, assign); +if (r 0) { +return r; +} +event_notifier_cleanup(notifier); +} +return r; +} + +static void virtio_pci_host_notifier_read(void *opaque) +{ +VirtQueue *vq = opaque; +EventNotifier *n = virtio_queue_get_host_notifier(vq); +if (event_notifier_test_and_clear(n)) { +virtio_queue_notify_vq(vq); +} +} + +static void virtio_pci_set_host_notifier_fd_handler(VirtIOPCIProxy *proxy, int n, bool assign) +{ +VirtQueue *vq = virtio_get_queue(proxy-vdev, n); +EventNotifier *notifier = virtio_queue_get_host_notifier(vq); +if (assign) { +qemu_set_fd_handler(event_notifier_get_fd(notifier), +virtio_pci_host_notifier_read, NULL, vq); +} else { +qemu_set_fd_handler(event_notifier_get_fd(notifier), +NULL, NULL, NULL); +} +} + +static int virtio_pci_set_host_notifiers(VirtIOPCIProxy *proxy, bool assign) +{ +int n, r; + +for (n = 0; n VIRTIO_PCI_QUEUE_MAX; n++) { +if (!virtio_queue_get_num(proxy-vdev, n)) { +continue; +} + +if (assign) { +r = virtio_pci_set_host_notifier_ioeventfd(proxy, n, true); +if (r 0) { +goto assign_error; +} + +virtio_pci_set_host_notifier_fd_handler(proxy, n, true); +} else { +virtio_pci_set_host_notifier_fd_handler(proxy, n, false); +
Re: [PATCHv3 08/14] Add get_fw_dev_path callback for pci bus.
On Thu, Nov 11, 2010 at 05:05:11PM +0200, Michael S. Tsirkin wrote: On Thu, Nov 11, 2010 at 11:07:01AM +0100, Gerd Hoffmann wrote: On 11/10/10 18:34, Michael S. Tsirkin wrote: On Wed, Nov 10, 2010 at 07:14:15PM +0200, Gleb Natapov wrote: Signed-off-by: Gleb Natapovg...@redhat.com Good stuff. We should also consider using this for CLI and monitor. Some comments below. Oh, we already have a table to map pci classes to descriptions for 'info pci'. You remind me, that one must be fixed to use names from pci_ids.h For now I used numbers like other elements of the table. If I would do the conversion I would use convenience marcros to not have names like PCI_CLASS_NETWORK_TOKEN_RING all over the table :) -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO
On Thu, Nov 11, 2010 at 03:47:00PM +0800, Sheng Yang wrote: Add marco for big-endian machine.(Untested!) Signed-off-by: Sheng Yang sh...@linux.intel.com I presume this is tested at the same level as the previous patch? So you want to fold this into the previous patch. Also, please build with sparse (C=1) to find some endian-ness issues. --- virt/kvm/assigned-dev.c |7 --- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 3010d7d..15b5f74 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -848,9 +848,9 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len, idx = (addr - adev-msix_mmio_base) / PCI_MSIX_ENTRY_SIZE; if ((addr % PCI_MSIX_ENTRY_SIZE) == PCI_MSIX_ENTRY_VECTOR_CTRL) - *(unsigned long *)val = + *(unsigned long *)val = le32_to_cpu( This should be cpu_to_le32. And val cast to __le32. test_bit(idx, adev-msix_mask_bitmap) ? - PCI_MSIX_ENTRY_CTRL_MASKBIT : 0; + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0); else r = -EOPNOTSUPP; goto out; In this function, entry must be __le32 too, and fix up endian-ness where you fill it in. @@ -869,6 +869,7 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len, adev-msix_mask_bitmap); memcpy(val, entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len); + *(unsigned long *)val = le32_to_cpu(*(unsigned long *)val); out: mutex_unlock(adev-kvm-lock); return r; @@ -881,7 +882,7 @@ static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len, container_of(this, struct kvm_assigned_dev_kernel, msix_mmio_dev); int idx, r = 0; - unsigned long new_val = *(unsigned long *)val; + unsigned long new_val = cpu_to_le32(*(unsigned long *)val); __le32 mutex_lock(adev-kvm-lock); if (!msix_mmio_in_range(adev, addr, len)) { -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Thu, Nov 11, 2010 at 01:47:21PM +, Stefan Hajnoczi wrote: Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Who guarantees that less common virtio-blk and virtio-net guest drivers for non-Linux OSes are fine with it? Maybe you should add a feature flag that the guest has to ACK to enable it. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On 11/11/2010 03:47 PM, Stefan Hajnoczi wrote: Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Which drivers are these? I only know of the virtio-gl driver, which isn't upstream. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Thu, Nov 11, 2010 at 06:59:29PM +0200, Avi Kivity wrote: On 11/11/2010 03:47 PM, Stefan Hajnoczi wrote: Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Which drivers are these? I only know of the virtio-gl driver, which isn't upstream. I think that PXE virtio drivers do polling. But I think this change does not break these drivers. It'll probably make them a bit slower. Right, Gleb? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 08/14] Add get_fw_dev_path callback for pci bus.
On Thu, Nov 11, 2010 at 06:07:53PM +0200, Gleb Natapov wrote: On Thu, Nov 11, 2010 at 05:05:11PM +0200, Michael S. Tsirkin wrote: On Thu, Nov 11, 2010 at 11:07:01AM +0100, Gerd Hoffmann wrote: On 11/10/10 18:34, Michael S. Tsirkin wrote: On Wed, Nov 10, 2010 at 07:14:15PM +0200, Gleb Natapov wrote: Signed-off-by: Gleb Natapovg...@redhat.com Good stuff. We should also consider using this for CLI and monitor. Some comments below. Oh, we already have a table to map pci classes to descriptions for 'info pci'. You remind me, that one must be fixed to use names from pci_ids.h For now I used numbers like other elements of the table. Fair enough. If I would do the conversion I would use convenience marcros to not have names like PCI_CLASS_NETWORK_TOKEN_RING all over the table :) -- Gleb. I'd prefer to keep it simple :) -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] PCI: Add mask bit definition for MSI-X table
On Thu, 11 Nov 2010 15:46:55 +0800 Sheng Yang sh...@linux.intel.com wrote: Then we can use it instead of magic number 1. Reviewed-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com Cc: Matthew Wilcox wi...@linux.intel.com Cc: Jesse Barnes jbar...@virtuousgeek.org Cc: linux-...@vger.kernel.org Signed-off-by: Sheng Yang sh...@linux.intel.com --- drivers/pci/msi.c|5 +++-- include/linux/pci_regs.h |1 + 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 69b7be3..095634e 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -158,8 +158,9 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32 flag) u32 mask_bits = desc-masked; unsigned offset = desc-msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE + PCI_MSIX_ENTRY_VECTOR_CTRL; - mask_bits = ~1; - mask_bits |= flag; + mask_bits = ~PCI_MSIX_ENTRY_CTRL_MASKBIT; + if (flag) + mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT; writel(mask_bits, desc-mask_base + offset); return mask_bits; diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h index acfc224..ff51632 100644 --- a/include/linux/pci_regs.h +++ b/include/linux/pci_regs.h @@ -313,6 +313,7 @@ #define PCI_MSIX_ENTRY_UPPER_ADDR 4 #define PCI_MSIX_ENTRY_DATA 8 #define PCI_MSIX_ENTRY_VECTOR_CTRL 12 +#define PCI_MSIX_ENTRY_CTRL_MASKBIT1 /* CompactPCI Hotswap Register */ Applied 1/7 and 2/7 to my linux-next tree, thanks. If it's easier to push them both through the kvm tree let me know; you can just add my acked-by in that case. -- Jesse Barnes, Intel Open Source Technology Center -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] virtio-pci: Use ioeventfd for virtqueue notify
On Thu, Nov 11, 2010 at 07:12:40PM +0200, Michael S. Tsirkin wrote: On Thu, Nov 11, 2010 at 06:59:29PM +0200, Avi Kivity wrote: On 11/11/2010 03:47 PM, Stefan Hajnoczi wrote: Some virtio devices are known to have guest drivers which expect a notify to be processed synchronously and spin waiting for completion. Only enable ioeventfd for virtio-blk and virtio-net for now. Which drivers are these? I only know of the virtio-gl driver, which isn't upstream. I think that PXE virtio drivers do polling. But I think this change does not break these drivers. It'll probably make them a bit slower. Right, Gleb? Can't tell about PXE virtio drivers, but virtio-blk in seabios does polling. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Always false condition in rmap_add
Hello All, I have question on code of rmap_add Here is the code of the function 613 static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn) 614 { 624 rmapp = gfn_to_rmap(vcpu-kvm, gfn, sp-role.level); 625 if (!*rmapp) { 626 rmap_printk(rmap_add: %p %llx 0-1\n, spte, *spte); 627 *rmapp = (unsigned long)spte; 628 } else if (!(*rmapp 1)) { 629 rmap_printk(rmap_add: %p %llx 1-many\n, spte, *spte); 630 desc = mmu_alloc_rmap_desc(vcpu); 631 desc-sptes[0] = (u64 *)*rmapp; 632 desc-sptes[1] = spte; 633 *rmapp = (unsigned long)desc | 1; 634 ++count; 635 } else { The line 628 checks whether the last bit of the rmapp is 1. If it is one then line 633 assigns a new value to rmapp with and sets the last bit to 1. But the line 633 is the only place that sets rmapp's last bit is set to 1. IMHO the condition on line 628 would never be true. Please let me know if I am wrong. Thanks and Regards, Prasad -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/3] virtio: Use ioeventfd for virtqueue notify
On Thu, Nov 11, 2010 at 01:47:19PM +, Stefan Hajnoczi wrote: This is a rewrite of the virtio-ioeventfd patchset to work at the virtio-pci.c level instead of virtio.c. This results in better integration with the host/guest notifier code and makes the code simpler (no more state machine). Yep, much cleaner. Sent some comments separately. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 00/14] boot order specification
On Thu, Nov 11, 2010 at 10:21 AM, Gerd Hoffmann kra...@redhat.com wrote: On 11/10/10 18:14, Gleb Natapov wrote: This is current sate of the patch series for people to comment on. I am using open firmware naming scheme to specify device path names. Names look like this on pci machine: /p...@i0cf8/i...@1,1/dr...@1/d...@0 /p...@i0cf8/i...@1/f...@03f1/flo...@1 /p...@i0cf8/i...@1/f...@03f1/flo...@0 /p...@i0cf8/i...@1,1/dr...@1/d...@1 /p...@i0cf8/i...@1,1/dr...@0/d...@0 /p...@i0cf8/s...@3/d...@0 /p...@i0cf8/ether...@4/ethernet-...@0 /p...@i0cf8/ether...@5/ethernet-...@0 /p...@i0cf8/i...@1,1/dr...@0/d...@1 /p...@i0cf8/i...@1/i...@01e8/dr...@0/d...@0 /p...@i0cf8/u...@1,2/netw...@0/ether...@0 /p...@i0cf8/u...@1,2/h...@1/netw...@0/ether...@0 Good stuff overall, but see replies to patches for some nits. IIRC some powerpc (+sparc?) boards pass a device tree to the guest. So with this (and maybe some more bits) we might be able to dynamically generate a device tree from our qdev tree? Any comments from the ppc/sparc folks on this? Sparc does not use the device tree in real life, except a binary machine description is used by Sun4v hypervisor. But using a device tree could simplify how OpenBIOS (for both Sparc and PPC) supports different machine types. Device tree is already used by for example Bamboo (PPC) and Petalogix (Microblaze) boards. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem booting Microsoft Windows KVM virtual machine
Il giorno Gio 04 Nov 2010 14:04:59 CET, Martin Maurer ha scritto: Hi, Before you begin, prepare the running w2k to use ide disk (and boot on KVM with ide disks) For w2k I followed this link (solution 2 worked for me): http://www.motherboard.windowsreinstall.com/problems.htm And I am using clonezilla live cd´s to do cloning jobs, for me this is a fast and convenient way. Also, take a look on our http://pve.proxmox.com/wiki/Migration_of_servers_to_Proxmox_VE, including some hints for several sceanrios. Br, Martin Hi Martin, I can finally confirm that the solution 2 of the link you gave me worked for me too! I followed those indications before creating a vmdk with vmware converter, then I converted the vmdk in a raw image with qemu-img convert and finally doing a dd in the drbd device done the trick! Still remains the fact that doing P2V with Win2k is some kind of wild horse (for example if you use the vmvga video driver it freezes), but you give me a big hand. So thank you very much! -- Raoul Scarazzini Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Always false condition in rmap_add
On Thu, Nov 11, 2010 at 06:38:47PM +, Prasad Joshi wrote: Hello All, I have question on code of rmap_add Here is the code of the function 613 static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn) 614 { 624 rmapp = gfn_to_rmap(vcpu-kvm, gfn, sp-role.level); 625 if (!*rmapp) { 626 rmap_printk(rmap_add: %p %llx 0-1\n, spte, *spte); 627 *rmapp = (unsigned long)spte; 628 } else if (!(*rmapp 1)) { 629 rmap_printk(rmap_add: %p %llx 1-many\n, spte, *spte); 630 desc = mmu_alloc_rmap_desc(vcpu); 631 desc-sptes[0] = (u64 *)*rmapp; 632 desc-sptes[1] = spte; 633 *rmapp = (unsigned long)desc | 1; 634 ++count; 635 } else { The line 628 checks whether the last bit of the rmapp is 1. If it is one then line 633 assigns a new value to rmapp with and sets the last bit to 1. No, line 628 checks if the last bit is not 1, so the check works. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ceph/rbd block driver for qemu-kvm (v7)
On Fri, Oct 15, 2010 at 8:54 PM, Christian Brunner c...@muc.de wrote: Hi, once again, Yehuda committed fixes for all the suggestions made on the list (and more). Here is the next update for the ceph/rbd block driver. Please let us know if there are any pending issues. For those who didn't follow the previous postings: This is an block driver for the distributed file system Ceph (http://ceph.newdream.net/). This driver uses librados (which is part of the Ceph server) for direct access to the Ceph object store and is running entirely in userspace (Yehuda also wrote a driver for the linux kernel, that can be used to access rbd volumes as a block device). Kind Regards, Christian Signed-off-by: Christian Brunner c...@muc.de Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net --- Makefile.objs |1 + block/rbd.c | 1059 + block/rbd_types.h | 71 configure | 31 ++ 4 files changed, 1162 insertions(+), 0 deletions(-) create mode 100644 block/rbd.c create mode 100644 block/rbd_types.h This patch is close now. Just some minor issues below. diff --git a/Makefile.objs b/Makefile.objs index 6ee077c..56a13c1 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -19,6 +19,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o block-nested-$(CONFIG_WIN32) += raw-win32.o block-nested-$(CONFIG_POSIX) += raw-posix.o block-nested-$(CONFIG_CURL) += curl.o +block-nested-$(CONFIG_RBD) += rbd.o block-obj-y += $(addprefix block/, $(block-nested-y)) diff --git a/block/rbd.c b/block/rbd.c new file mode 100644 index 000..fbfb93e --- /dev/null +++ b/block/rbd.c @@ -0,0 +1,1059 @@ +/* + * QEMU Block driver for RADOS (Ceph) + * + * Copyright (C) 2010 Christian Brunner c...@muc.de + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include qemu-error.h + +#include rbd_types.h +#include block_int.h + +#include rados/librados.h + + + +/* + * When specifying the image filename use: + * + * rbd:poolname/devicename + * + * poolname must be the name of an existing rados pool + * + * devicename is the basename for all objects used to + * emulate the raw device. + * + * Metadata information (image size, ...) is stored in an + * object with the name devicename.rbd. + * + * The raw device is split into 4MB sized objects by default. + * The sequencenumber is encoded in a 12 byte long hex-string, + * and is attached to the devicename, separated by a dot. + * e.g. devicename.1234567890ab + * + */ + +#define OBJ_MAX_SIZE (1UL OBJ_DEFAULT_OBJ_ORDER) + +typedef struct RBDAIOCB { +BlockDriverAIOCB common; +QEMUBH *bh; +int ret; +QEMUIOVector *qiov; +char *bounce; +int write; +int64_t sector_num; +int aiocnt; +int error; +struct BDRVRBDState *s; +int cancelled; +} RBDAIOCB; + +typedef struct RADOSCB { +int rcbid; +RBDAIOCB *acb; +struct BDRVRBDState *s; +int done; +int64_t segsize; +char *buf; +int ret; +} RADOSCB; + +#define RBD_FD_READ 0 +#define RBD_FD_WRITE 1 + +typedef struct BDRVRBDState { +int fds[2]; +rados_pool_t pool; +rados_pool_t header_pool; +char name[RBD_MAX_OBJ_NAME_SIZE]; +char block_name[RBD_MAX_BLOCK_NAME_SIZE]; +uint64_t size; +uint64_t objsize; +int qemu_aio_count; +int read_only; +int event_reader_pos; +RADOSCB *event_rcb; +} BDRVRBDState; + +typedef struct rbd_obj_header_ondisk RbdHeader1; + +static int rbd_next_tok(char *dst, int dst_len, +char *src, char delim, +const char *name, +char **p) +{ +int l; +char *end; + +*p = NULL; + +if (delim != '\0') { +end = strchr(src, delim); +if (end) { +*p = end + 1; +*end = '\0'; +} +} +l = strlen(src); +if (l = dst_len) { +error_report(%s too long, name); +return -EINVAL; +} else if (l == 0) { +error_report(%s too short, name); +return -EINVAL; +} + +pstrcpy(dst, dst_len, src); + +return 0; +} + +static int rbd_parsename(const char *filename, + char *pool, int pool_len, + char *snap, int snap_len, + char *name, int name_len) +{ +const char *start; +char *p, *buf; +int ret; + +if (!strstart(filename, rbd:, start)) { +return -EINVAL; +} + +buf = qemu_strdup(start); +p = buf; + +ret = rbd_next_tok(pool, pool_len, p, '/', pool name, p); +if (ret 0 || !p) { +ret = -EINVAL; +goto done; +} +ret = rbd_next_tok(name, name_len, p, '@', object name, p); +if (ret 0) { +
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Thu, 2010-11-11 at 17:39 +0800, Avi Kivity wrote: On 11/11/2010 02:56 AM, Huang Ying wrote: On Thu, 2010-11-11 at 00:49 +0800, Anthony Liguori wrote: On 11/10/2010 02:34 AM, Avi Kivity wrote: Why the gpa - hva mapping is not consistent for RAM if -mempath is not used? Video RAM in the range a-b and PCI mapped RAM can change gpas (while remaining in the same hva). Even for ordinary RAM, which doesn't normally change gpa/hva, I'm not sure we want to guarantee that it won't. We can't universally either. Memory hot remove potentially breaks the mapping and some non-x86 architectures (like ARM) can alias RAM via a guest triggered mechanism. Thanks for clarification. Now I think we have two options. 1) Clearly mark gpa2hva (pfa2hva now, should renamed to gpa2hva) as a testing only interface, and should be used only on restricted environment (normal memory, without hot-remove, maybe only on x86). 2) Find some way to lock the gpa - hva mapping during operating. Such as gpa2hva_begin and gpa2hva_end and lock the mapping in between. I think 2) may be possible. But if there are no other users, why do that for a test case? So I think 1) may be the better option. 3) Move the poisoning code into qemu, so the command becomes posison-address addr (though physical addresses aren't stable either) Poisoning includes: a) gva - gpa b) gpa - hva c) hva - hpa d) inject the MCE into host via some external tool poison-address need to do b, c, d. Do you think it is good to do all these inside qemu? 4) Use -mempath on /dev/shm and poison a page in the backing file If we can poison a page in the backing file, how do we know the corresponding gpa and hpa? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO
On Friday 12 November 2010 00:12:21 Michael S. Tsirkin wrote: On Thu, Nov 11, 2010 at 03:47:00PM +0800, Sheng Yang wrote: Add marco for big-endian machine.(Untested!) Signed-off-by: Sheng Yang sh...@linux.intel.com I presume this is tested at the same level as the previous patch? So you want to fold this into the previous patch. Of course tested on Intel's platform, but it didn't make sense for big-endian patch... I'd like to make it separate still, because I can't test it properly. Also, please build with sparse (C=1) to find some endian-ness issues. Sure. (Have it run with the patches, nothing new found...) --- virt/kvm/assigned-dev.c |7 --- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 3010d7d..15b5f74 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -848,9 +848,9 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len, idx = (addr - adev-msix_mmio_base) / PCI_MSIX_ENTRY_SIZE; if ((addr % PCI_MSIX_ENTRY_SIZE) == PCI_MSIX_ENTRY_VECTOR_CTRL) - *(unsigned long *)val = + *(unsigned long *)val = le32_to_cpu( This should be cpu_to_le32. And val cast to __le32. I think we can just cast val to __le32 here? The result should be big-endian. No, no, I suppose all parameter of msix_mmio_read/write should be little-endian because it's being written into PCI MSI-X table? test_bit(idx, adev-msix_mask_bitmap) ? - PCI_MSIX_ENTRY_CTRL_MASKBIT : 0; + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0); else r = -EOPNOTSUPP; goto out; In this function, entry must be __le32 too, and fix up endian-ness where you fill it in. @@ -869,6 +869,7 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len, adev-msix_mask_bitmap); memcpy(val, entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len); + *(unsigned long *)val = le32_to_cpu(*(unsigned long *)val); out: mutex_unlock(adev-kvm-lock); return r; @@ -881,7 +882,7 @@ static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len, container_of(this, struct kvm_assigned_dev_kernel, msix_mmio_dev); int idx, r = 0; - unsigned long new_val = *(unsigned long *)val; + unsigned long new_val = cpu_to_le32(*(unsigned long *)val); __le32 Should it be unsigned long new_val = le32_to_cpu(*(unsigned long *)val); *val should be little endian I guess? Or Michael, could you provide a patch for this? I really don't think I can guarantee correctness of this patch... -- regards Yang, Sheng mutex_lock(adev-kvm-lock); if (!msix_mmio_in_range(adev, addr, len)) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/8] PCI capability and device assignment improvements
This series attempts to clean up capability support between common code and device assignment. In doing so, we can move existing MSI MSI-X capabilities to offsets matching real hardware, and further enable more capabilities to be exposed. The last patch is only for RFC, I'd like some input on what we should pass directly and where we should only provide read-only/emulated access. Patches 1-7 are submitted for commit. Thanks, Alex --- Alex Williamson (8): device-assignment: pass through and stub more PCI caps pci: Pass ID for capability read/write handlers device-assignment: Move PCI capabilities to match physical hardware pci: Remove cap.length, cap.start, cap.supported pci: Replace used bitmap with capability byte map device-assignment: Use PCI capabilities support pci: Remove pci_enable_capability_support() pci: pci_default_cap_write_config ignores wmask hw/device-assignment.c | 273 hw/pci.c | 103 +++--- hw/pci.h | 25 ++-- 3 files changed, 256 insertions(+), 145 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] pci: pci_default_cap_write_config ignores wmask
Make use of wmask, just like the rest of config space. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/pci.c | 19 --- 1 files changed, 8 insertions(+), 11 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 92aaa85..12c47ac 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1175,13 +1175,14 @@ uint32_t pci_default_read_config(PCIDevice *d, return pci_read_config(d, address, len); } -static void pci_write_config(PCIDevice *pci_dev, - uint32_t address, uint32_t val, int len) +static void pci_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { int i; -for (i = 0; i len; i++) { -pci_dev-config[address + i] = val 0xff; -val = 8; +uint32_t config_size = pci_config_size(d); + +for (i = 0; i l addr + i config_size; val = 8, ++i) { +uint8_t wmask = d-wmask[addr + i]; +d-config[addr + i] = (d-config[addr + i] ~wmask) | (val wmask); } } @@ -1207,18 +1208,14 @@ void pci_default_cap_write_config(PCIDevice *pci_dev, void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { -int i, was_irq_disabled = pci_irq_disabled(d); -uint32_t config_size = pci_config_size(d); +int was_irq_disabled = pci_irq_disabled(d); if (pci_access_cap_config(d, addr, l)) { d-cap.config_write(d, addr, val, l); return; } -for (i = 0; i l addr + i config_size; val = 8, ++i) { -uint8_t wmask = d-wmask[addr + i]; -d-config[addr + i] = (d-config[addr + i] ~wmask) | (val wmask); -} +pci_write_config(d, addr, val, l); #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT if (kvm_enabled() kvm_irqchip_in_kernel() -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/8] pci: Remove pci_enable_capability_support()
This interface doesn't make much sense, adding a capability can take care of everything, just provide a means to register capability read/write handlers. Device assignment does it's own thing, so requires a couple ugly hacks that will be cleaned by subsequent patches. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 12 --- hw/pci.c | 52 +--- hw/pci.h |9 +++- 3 files changed, 35 insertions(+), 38 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index bde231d..311c197 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1292,7 +1292,12 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) PCIRegion *pci_region = dev-real_device.regions; int next_cap_pt = 0; +pci_dev-cap.supported = 1; +pci_dev-cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; pci_dev-cap.length = 0; +pci_dev-config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; +pci_dev-config[PCI_CAPABILITY_LIST] = pci_dev-cap.start; + #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability @@ -1468,9 +1473,10 @@ static int assigned_initfn(struct PCIDevice *pci_dev) dev-h_busnr = dev-host.bus; dev-h_devfn = PCI_DEVFN(dev-host.dev, dev-host.func); -if (pci_enable_capability_support(pci_dev, 0, NULL, -assigned_device_pci_cap_write_config, -assigned_device_pci_cap_init) 0) +pci_register_capability_handlers(pci_dev, NULL, + assigned_device_pci_cap_write_config); + +if (assigned_device_pci_cap_init(pci_dev) 0) goto out; /* assign device to guest */ diff --git a/hw/pci.c b/hw/pci.c index 12c47ac..6b2b320 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -787,6 +787,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, config_write = pci_default_write_config; pci_dev-config_read = config_read; pci_dev-config_write = config_write; +pci_dev-cap.config_read = pci_default_cap_read_config; +pci_dev-cap.config_write = pci_default_cap_write_config; bus-devices[devfn] = pci_dev; pci_dev-irq = qemu_allocate_irqs(pci_set_irq, pci_dev, PCI_NUM_PINS); pci_dev-version_id = 2; /* Current pci device vmstate version */ @@ -1893,35 +1895,21 @@ PCIDevice *pci_create_simple_multifunction(PCIBus *bus, int devfn, return dev; } -int pci_enable_capability_support(PCIDevice *pci_dev, - uint32_t config_start, - PCICapConfigReadFunc *config_read, - PCICapConfigWriteFunc *config_write, - PCICapConfigInitFunc *config_init) +void pci_register_capability_handlers(PCIDevice *pdev, + PCICapConfigReadFunc *config_read, + PCICapConfigWriteFunc *config_write) { -if (!pci_dev) -return -ENODEV; - -pci_dev-config[0x06] |= 0x10; // status = capabilities - -if (config_start == 0) - pci_dev-cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; -else if (config_start = 0x40 config_start 0xff) -pci_dev-cap.start = config_start; -else -return -EINVAL; +if (config_read) { +pdev-cap.config_read = config_read; +} else { +pdev-cap.config_read = pci_default_cap_read_config; +} -if (config_read) -pci_dev-cap.config_read = config_read; -else -pci_dev-cap.config_read = pci_default_cap_read_config; -if (config_write) -pci_dev-cap.config_write = config_write; -else -pci_dev-cap.config_write = pci_default_cap_write_config; -pci_dev-cap.supported = 1; -pci_dev-config[PCI_CAPABILITY_LIST] = pci_dev-cap.start; -return config_init(pci_dev); +if (config_write) { +pdev-cap.config_write = config_write; +} else { +pdev-cap.config_write = pci_default_cap_write_config; +} } PCIDevice *pci_create(PCIBus *bus, int devfn, const char *name) @@ -2045,12 +2033,16 @@ int pci_add_capability_at_offset(PCIDevice *pdev, uint8_t cap_id, config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev-config[PCI_CAPABILITY_LIST]; pdev-config[PCI_CAPABILITY_LIST] = offset; -pdev-config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; memset(pdev-used + offset, 0xFF, size); /* Make capability read-only by default */ memset(pdev-wmask + offset, 0, size); /* Check capability by default */ memset(pdev-cmask + offset, 0xFF, size); + +pdev-config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; +pdev-cap.supported = 1; +pdev-cap.start = pdev-cap.start ? MIN(pdev-cap.start, offset) : offset; + return offset; } @@ -2078,8 +2070,10 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
[PATCH 3/8] device-assignment: Use PCI capabilities support
Convert to use common pci_add_capabilities() rather than creating our own mess. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 112 +++- 1 files changed, 63 insertions(+), 49 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 311c197..74cdd26 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -38,6 +38,7 @@ #include device-assignment.h #include loader.h #include monitor.h +#include range.h #include pci/header.h /* From linux/ioport.h */ @@ -1075,17 +1076,17 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos) } if (ctrl_byte PCI_MSI_FLAGS_ENABLE) { +int pos = ctrl_pos - PCI_MSI_FLAGS; assigned_dev-entry = calloc(1, sizeof(struct kvm_irq_routing_entry)); if (!assigned_dev-entry) { perror(assigned_dev_update_msi: ); return; } assigned_dev-entry-u.msi.address_lo = -*(uint32_t *)(pci_dev-config + pci_dev-cap.start + - PCI_MSI_ADDRESS_LO); +pci_get_long(pci_dev-config + pos + PCI_MSI_ADDRESS_LO); assigned_dev-entry-u.msi.address_hi = 0; -assigned_dev-entry-u.msi.data = *(uint16_t *)(pci_dev-config + -pci_dev-cap.start + PCI_MSI_DATA_32); +assigned_dev-entry-u.msi.data = +pci_get_word(pci_dev-config + pos + PCI_MSI_DATA_32); assigned_dev-entry-type = KVM_IRQ_ROUTING_MSI; r = kvm_get_irq_route_gsi(); if (r 0) { @@ -1123,10 +1124,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev) struct kvm_assigned_msix_entry msix_entry; void *va = adev-msix_table_page; -if (adev-cap.available ASSIGNED_DEVICE_CAP_MSI) -pos = pci_dev-cap.start + PCI_CAPABILITY_CONFIG_MSI_LENGTH; -else -pos = pci_dev-cap.start; +pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); entries_max_nr = *(uint16_t *)(pci_dev-config + pos + 2); entries_max_nr = PCI_MSIX_TABSIZE; @@ -1260,26 +1258,23 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t ad uint32_t val, int len) { AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev); -unsigned int pos = pci_dev-cap.start, ctrl_pos; pci_default_cap_write_config(pci_dev, address, val, len); #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI if (assigned_dev-cap.available ASSIGNED_DEVICE_CAP_MSI) { -ctrl_pos = pos + PCI_MSI_FLAGS; -if (address = ctrl_pos address + len ctrl_pos) -assigned_dev_update_msi(pci_dev, ctrl_pos); -pos += PCI_CAPABILITY_CONFIG_MSI_LENGTH; +int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSI); +if (ranges_overlap(address, len, pos + PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, pos + PCI_MSI_FLAGS); +} } #endif #ifdef KVM_CAP_DEVICE_MSIX if (assigned_dev-cap.available ASSIGNED_DEVICE_CAP_MSIX) { -ctrl_pos = pos + 3; -if (address = ctrl_pos address + len ctrl_pos) { -ctrl_pos--; /* control is word long */ -assigned_dev_update_msix(pci_dev, ctrl_pos); +int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); +if (ranges_overlap(address, len, pos + PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, pos + PCI_MSIX_FLAGS); } -pos += PCI_CAPABILITY_CONFIG_MSIX_LENGTH; } #endif #endif @@ -1290,58 +1285,77 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) { AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev); PCIRegion *pci_region = dev-real_device.regions; -int next_cap_pt = 0; -pci_dev-cap.supported = 1; -pci_dev-cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; +/* Clear initial capabilities pointer and status copied from hw */ +pci_set_byte(pci_dev-config + PCI_CAPABILITY_LIST, 0); +pci_set_word(pci_dev-config + PCI_STATUS, + pci_get_word(pci_dev-config + PCI_STATUS) + ~PCI_STATUS_CAP_LIST); + pci_dev-cap.length = 0; -pci_dev-config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; -pci_dev-config[PCI_CAPABILITY_LIST] = pci_dev-cap.start; #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability * MSI capability is the 1st capability in capability config */ if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) { +int vpos, ppos; +uint16_t flags; + dev-cap.available |= ASSIGNED_DEVICE_CAP_MSI; -memset(pci_dev-config[pci_dev-cap.start + pci_dev-cap.length], - 0, PCI_CAPABILITY_CONFIG_MSI_LENGTH); -pci_dev-config[pci_dev-cap.start + pci_dev-cap.length] = -PCI_CAP_ID_MSI; +vpos =
[PATCH 4/8] pci: Replace used bitmap with capability byte map
Capabilities are allocated in bytes, so we can track both whether a byte is used and by what capability in the same structure. Remove pci_reserve_capability() as there are no users. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/pci.c | 16 +--- hw/pci.h |6 ++ 2 files changed, 7 insertions(+), 15 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 6b2b320..e1e8a77 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -730,7 +730,7 @@ static void pci_config_alloc(PCIDevice *pci_dev) pci_dev-config = qemu_mallocz(config_size); pci_dev-cmask = qemu_mallocz(config_size); pci_dev-wmask = qemu_mallocz(config_size); -pci_dev-used = qemu_mallocz(config_size); +pci_dev-cap_map = qemu_mallocz(config_size); } static void pci_config_free(PCIDevice *pci_dev) @@ -738,7 +738,7 @@ static void pci_config_free(PCIDevice *pci_dev) qemu_free(pci_dev-config); qemu_free(pci_dev-cmask); qemu_free(pci_dev-wmask); -qemu_free(pci_dev-used); +qemu_free(pci_dev-cap_map); } /* -1 for devfn means auto assign */ @@ -1928,7 +1928,7 @@ static int pci_find_space(PCIDevice *pdev, uint8_t size) int offset = PCI_CONFIG_HEADER_SIZE; int i; for (i = PCI_CONFIG_HEADER_SIZE; i config_size; ++i) -if (pdev-used[i]) +if (pdev-cap_map[i]) offset = i + 1; else if (i - offset + 1 == size) return offset; @@ -2033,7 +2033,7 @@ int pci_add_capability_at_offset(PCIDevice *pdev, uint8_t cap_id, config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev-config[PCI_CAPABILITY_LIST]; pdev-config[PCI_CAPABILITY_LIST] = offset; -memset(pdev-used + offset, 0xFF, size); +memset(pdev-cap_map + offset, cap_id, size); /* Make capability read-only by default */ memset(pdev-wmask + offset, 0, size); /* Check capability by default */ @@ -2068,7 +2068,7 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) memset(pdev-wmask + offset, 0xff, size); /* Clear cmask as device-specific registers can't be checked */ memset(pdev-cmask + offset, 0, size); -memset(pdev-used + offset, 0, size); +memset(pdev-cap_map + offset, 0, size); if (!pdev-config[PCI_CAPABILITY_LIST]) { pdev-config[PCI_STATUS] = ~PCI_STATUS_CAP_LIST; @@ -2076,12 +2076,6 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) } } -/* Reserve space for capability at a known offset (to call after load). */ -void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size) -{ -memset(pdev-used + offset, 0xff, size); -} - uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id) { return pci_find_capability_list(pdev, cap_id, NULL); diff --git a/hw/pci.h b/hw/pci.h index 0712e55..2265c70 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -151,8 +151,8 @@ struct PCIDevice { /* Used to implement R/W bytes */ uint8_t *wmask; -/* Used to allocate config space for capabilities. */ -uint8_t *used; +/* Used to allocate config space and track capabilities. */ +uint8_t *cap_map; /* the following fields are read only */ PCIBus *bus; @@ -239,8 +239,6 @@ int pci_add_capability_at_offset(PCIDevice *pci_dev, uint8_t cap_id, void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size); -void pci_reserve_capability(PCIDevice *pci_dev, uint8_t offset, uint8_t size); - uint8_t pci_find_capability(PCIDevice *pci_dev, uint8_t cap_id); uint32_t pci_default_read_config(PCIDevice *d, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/8] pci: Remove cap.length, cap.start, cap.supported
Capabilities aren't required to be contiguous, so cap.length never really made much sense. Likewise, cap.start is mostly meaningless too. Both of these are better served by the capability map. We can also get rid of cap.supported, since it's really now unused and redundant with flag in the status word anyway. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c |4 hw/pci.c |8 +--- hw/pci.h |2 -- 3 files changed, 1 insertions(+), 13 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 74cdd26..322fa9f 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1292,8 +1292,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) pci_get_word(pci_dev-config + PCI_STATUS) ~PCI_STATUS_CAP_LIST); -pci_dev-cap.length = 0; - #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability @@ -1320,7 +1318,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE); pci_set_long(pci_dev-wmask + vpos + PCI_MSI_ADDRESS_LO, 0xfffc); pci_set_long(pci_dev-wmask + vpos + PCI_MSI_DATA_32, 0x); -pci_dev-cap.length += PCI_CAPABILITY_CONFIG_MSI_LENGTH; } #endif #ifdef KVM_CAP_DEVICE_MSIX @@ -1356,7 +1353,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) bar_nr = msix_table_entry PCI_MSIX_BIR; msix_table_entry = ~PCI_MSIX_BIR; dev-msix_table_addr = pci_region[bar_nr].base_addr + msix_table_entry; -pci_dev-cap.length += PCI_CAPABILITY_CONFIG_MSIX_LENGTH; } #endif #endif diff --git a/hw/pci.c b/hw/pci.c index e1e8a77..d566e33 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1190,10 +1190,7 @@ static void pci_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) int pci_access_cap_config(PCIDevice *pci_dev, uint32_t address, int len) { -if (pci_dev-cap.supported address = pci_dev-cap.start -(address + len) pci_dev-cap.start + pci_dev-cap.length) -return 1; -return 0; +return pci_dev-cap_map[address]; } uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, @@ -2040,8 +2037,6 @@ int pci_add_capability_at_offset(PCIDevice *pdev, uint8_t cap_id, memset(pdev-cmask + offset, 0xFF, size); pdev-config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; -pdev-cap.supported = 1; -pdev-cap.start = pdev-cap.start ? MIN(pdev-cap.start, offset) : offset; return offset; } @@ -2072,7 +2067,6 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) if (!pdev-config[PCI_CAPABILITY_LIST]) { pdev-config[PCI_STATUS] = ~PCI_STATUS_CAP_LIST; -pdev-cap.start = pdev-cap.length = 0; } } diff --git a/hw/pci.h b/hw/pci.h index 2265c70..177008a 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -208,8 +208,6 @@ struct PCIDevice { /* Device capability configuration space */ struct { -int supported; -unsigned int start, length; PCICapConfigReadFunc *config_read; PCICapConfigWriteFunc *config_write; } cap; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/8] device-assignment: Move PCI capabilities to match physical hardware
Now that common PCI code doesn't have a hangup on capabilities being contiguous, move assigned device capabilities to match their offset on physical hardware. This helps for drivers that assume a capability configuration and don't bother searching. We can also remove several calls to assigned_dev_pci_read_* because we're overlaying the capability at the same location as the initial copy we made of config space. We can therefore just use pci_get_*. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 67 +++- 1 files changed, 21 insertions(+), 46 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 322fa9f..39f19be 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -366,16 +366,6 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos) return (uint8_t)assigned_dev_pci_read(d, pos, 1); } -static uint16_t assigned_dev_pci_read_word(PCIDevice *d, int pos) -{ -return (uint16_t)assigned_dev_pci_read(d, pos, 2); -} - -static uint32_t assigned_dev_pci_read_long(PCIDevice *d, int pos) -{ -return assigned_dev_pci_read(d, pos, 4); -} - static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap) { int id; @@ -1285,6 +1275,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) { AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev); PCIRegion *pci_region = dev-real_device.regions; +int pos; /* Clear initial capabilities pointer and status copied from hw */ pci_set_byte(pci_dev-config + PCI_CAPABILITY_LIST, 0); @@ -1296,60 +1287,44 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability * MSI capability is the 1st capability in capability config */ -if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) { -int vpos, ppos; -uint16_t flags; - +if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI))) { dev-cap.available |= ASSIGNED_DEVICE_CAP_MSI; -vpos = pci_add_capability(pci_dev, PCI_CAP_ID_MSI, - PCI_CAPABILITY_CONFIG_MSI_LENGTH); - -memset(pci_dev-config + vpos + PCI_CAP_FLAGS, 0, - PCI_CAPABILITY_CONFIG_MSI_LENGTH - PCI_CAP_FLAGS); +pci_add_capability_at_offset(pci_dev, PCI_CAP_ID_MSI, pos, + PCI_CAPABILITY_CONFIG_MSI_LENGTH); /* Only 32-bit/no-mask currently supported */ -ppos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI); -flags = assigned_dev_pci_read_word(pci_dev, ppos + PCI_MSI_FLAGS); -flags = PCI_MSI_FLAGS_QMASK; -pci_set_word(pci_dev-config + vpos + PCI_MSI_FLAGS, flags); +pci_set_word(pci_dev-config + pos + PCI_MSI_FLAGS, + pci_get_word(pci_dev-config + pos + PCI_MSI_FLAGS) + PCI_MSI_FLAGS_QMASK); +pci_set_long(pci_dev-config + pos + PCI_MSI_ADDRESS_LO, 0); +pci_set_word(pci_dev-config + pos + PCI_MSI_DATA_32, 0); /* Set writable fields */ -pci_set_word(pci_dev-wmask + vpos + PCI_MSI_FLAGS, +pci_set_word(pci_dev-wmask + pos + PCI_MSI_FLAGS, PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE); -pci_set_long(pci_dev-wmask + vpos + PCI_MSI_ADDRESS_LO, 0xfffc); -pci_set_long(pci_dev-wmask + vpos + PCI_MSI_DATA_32, 0x); +pci_set_long(pci_dev-wmask + pos + PCI_MSI_ADDRESS_LO, 0xfffc); +pci_set_word(pci_dev-wmask + pos + PCI_MSI_DATA_32, 0x); } #endif #ifdef KVM_CAP_DEVICE_MSIX /* Expose MSI-X capability */ -if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX)) { -int vpos, ppos, entry_nr, bar_nr; +if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX))) { +int bar_nr; uint32_t msix_table_entry; dev-cap.available |= ASSIGNED_DEVICE_CAP_MSIX; -vpos = pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, - PCI_CAPABILITY_CONFIG_MSIX_LENGTH); +pci_add_capability_at_offset(pci_dev, PCI_CAP_ID_MSIX, pos, + PCI_CAPABILITY_CONFIG_MSIX_LENGTH); -memset(pci_dev-config + vpos + PCI_CAP_FLAGS, 0, - PCI_CAPABILITY_CONFIG_MSIX_LENGTH - PCI_CAP_FLAGS); +pci_set_word(pci_dev-config + pos + PCI_MSIX_FLAGS, + pci_get_word(pci_dev-config + pos + PCI_MSIX_FLAGS) + PCI_MSIX_TABSIZE); /* Only enable and function mask bits are writable */ -pci_set_word(pci_dev-wmask + vpos + PCI_MSIX_FLAGS, +pci_set_word(pci_dev-wmask + pos + PCI_MSIX_FLAGS, PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL); -ppos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX); - -entry_nr = assigned_dev_pci_read_word(pci_dev, ppos + PCI_MSIX_FLAGS); -entry_nr = PCI_MSIX_TABSIZE; -
[PATCH 7/8] pci: Pass ID for capability read/write handlers
Any handlers that actually want to interact with specific capabilities are going to want to know the capability ID being accessed. With the capability map, this is readily available, so we can save handlers the trouble of figuring it out. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 36 ++-- hw/pci.c | 14 -- hw/pci.h |8 3 files changed, 34 insertions(+), 24 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 39f19be..179c7dc 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1244,30 +1244,38 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos) #endif #endif -static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address, +static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, + uint8_t cap_id, + uint32_t address, uint32_t val, int len) { -AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev); +pci_default_cap_write_config(pci_dev, cap_id, address, val, len); -pci_default_cap_write_config(pci_dev, address, val, len); +switch (cap_id) { #ifdef KVM_CAP_IRQ_ROUTING +case PCI_CAP_ID_MSI: #ifdef KVM_CAP_DEVICE_MSI -if (assigned_dev-cap.available ASSIGNED_DEVICE_CAP_MSI) { -int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSI); -if (ranges_overlap(address, len, pos + PCI_MSI_FLAGS, 1)) { -assigned_dev_update_msi(pci_dev, pos + PCI_MSI_FLAGS); +{ +uint8_t cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); +} } -} #endif +break; + +case PCI_CAP_ID_MSIX: #ifdef KVM_CAP_DEVICE_MSIX -if (assigned_dev-cap.available ASSIGNED_DEVICE_CAP_MSIX) { -int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); -if (ranges_overlap(address, len, pos + PCI_MSIX_FLAGS + 1, 1)) { -assigned_dev_update_msix(pci_dev, pos + PCI_MSIX_FLAGS); - } -} +{ +uint8_t cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); +} +} #endif +break; #endif +} return; } diff --git a/hw/pci.c b/hw/pci.c index d566e33..e1c0917 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1168,10 +1168,11 @@ static uint32_t pci_read_config(PCIDevice *d, uint32_t pci_default_read_config(PCIDevice *d, uint32_t address, int len) { +uint8_t cap_id; assert(len == 1 || len == 2 || len == 4); -if (pci_access_cap_config(d, address, len)) { -return d-cap.config_read(d, address, len); +if ((cap_id = pci_access_cap_config(d, address, len))) { +return d-cap.config_read(d, cap_id, address, len); } return pci_read_config(d, address, len); @@ -1193,13 +1194,13 @@ int pci_access_cap_config(PCIDevice *pci_dev, uint32_t address, int len) return pci_dev-cap_map[address]; } -uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, +uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, uint8_t cap_id, uint32_t address, int len) { return pci_read_config(pci_dev, address, len); } -void pci_default_cap_write_config(PCIDevice *pci_dev, +void pci_default_cap_write_config(PCIDevice *pci_dev, uint8_t cap_id, uint32_t address, uint32_t val, int len) { pci_write_config(pci_dev, address, val, len); @@ -1208,9 +1209,10 @@ void pci_default_cap_write_config(PCIDevice *pci_dev, void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { int was_irq_disabled = pci_irq_disabled(d); +uint8_t cap_id; -if (pci_access_cap_config(d, addr, l)) { -d-cap.config_write(d, addr, val, l); +if ((cap_id = pci_access_cap_config(d, addr, l))) { +d-cap.config_write(d, cap_id, addr, val, l); return; } diff --git a/hw/pci.h b/hw/pci.h index 177008a..3f0b4e0 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -83,9 +83,9 @@ typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int region_num, pcibus_t addr, pcibus_t size, int type); typedef int PCIUnregisterFunc(PCIDevice *pci_dev); -typedef void PCICapConfigWriteFunc(PCIDevice *pci_dev, +typedef void PCICapConfigWriteFunc(PCIDevice *pci_dev, uint8_t cap_id, uint32_t address, uint32_t val, int len); -typedef uint32_t PCICapConfigReadFunc(PCIDevice *pci_dev, +typedef
[RFC PATCH 8/8] device-assignment: pass through and stub more PCI caps
Some drivers depend on finding capabilities like power management, PCI express/X, vital product data, or vendor specific fields. Now that we have better capability support, we can pass more of these tables through to the guest. Note that VPD and VNDR are direct pass through capabilies, the rest are mostly empty shells with a few writable bits where necessary. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 160 +--- 1 files changed, 149 insertions(+), 11 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 179c7dc..1b228ad 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -366,6 +366,27 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos) return (uint8_t)assigned_dev_pci_read(d, pos, 1); } +static void assigned_dev_pci_write(PCIDevice *d, int pos, uint32_t val, int len) +{ +AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); +ssize_t ret; +int fd = pci_dev-real_device.config_fd; + +again: +ret = pwrite(fd, val, len, pos); +if (ret != len) { + if ((ret 0) (errno == EINTR || errno == EAGAIN)) + goto again; + + fprintf(stderr, %s: pwrite failed, ret = %zd errno = %d\n, + __func__, ret, errno); + + exit(1); +} + +return; +} + static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap) { int id; @@ -1244,37 +1265,75 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos) #endif #endif +static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev, +uint8_t cap_id, +uint32_t address, int len) +{ +uint8_t cap; + +switch (cap_id) { + +case PCI_CAP_ID_VPD: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap = PCI_CAP_FLAGS) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; + +case PCI_CAP_ID_VNDR: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap PCI_CAP_FLAGS) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; +} + +return pci_default_cap_read_config(pci_dev, cap_id, address, len); +} + static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint8_t cap_id, uint32_t address, uint32_t val, int len) { +uint8_t cap; + pci_default_cap_write_config(pci_dev, cap_id, address, val, len); switch (cap_id) { #ifdef KVM_CAP_IRQ_ROUTING case PCI_CAP_ID_MSI: #ifdef KVM_CAP_DEVICE_MSI -{ -uint8_t cap = pci_find_capability(pci_dev, cap_id); -if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { -assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); -} +cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); } #endif break; case PCI_CAP_ID_MSIX: #ifdef KVM_CAP_DEVICE_MSIX -{ -uint8_t cap = pci_find_capability(pci_dev, cap_id); -if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { -assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); -} +cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); } #endif break; #endif + +case PCI_CAP_ID_VPD: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap = PCI_CAP_FLAGS) { +assigned_dev_pci_write(pci_dev, address, val, len); +} +break; + +case PCI_CAP_ID_VNDR: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap PCI_CAP_FLAGS) { +assigned_dev_pci_write(pci_dev, address, val, len); +} +break; } return; } @@ -1340,6 +1399,84 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) #endif #endif +/* Minimal PM support, make the state bits writable so the guest + * thinks it's doing something. */ +if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_PM))) { +uint16_t pmc, pmcsr; +pci_add_capability_at_offset(pci_dev, PCI_CAP_ID_PM, pos, + PCI_PM_SIZEOF); + +pmc = pci_get_word(pci_dev-config + pos + PCI_CAP_FLAGS); +pmc = (PCI_PM_CAP_VER_MASK | PCI_PM_CAP_DSI); +pci_set_word(pci_dev-config + pos + PCI_CAP_FLAGS, pmc); + +pmcsr = pci_get_word(pci_dev-config + pos +
Re: [PATCH 1/8] pci: pci_default_cap_write_config ignores wmask
On Thu, Nov 11, 2010 at 07:55:01PM -0700, Alex Williamson wrote: Make use of wmask, just like the rest of config space. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/pci.c | 19 --- 1 files changed, 8 insertions(+), 11 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 92aaa85..12c47ac 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1175,13 +1175,14 @@ uint32_t pci_default_read_config(PCIDevice *d, return pci_read_config(d, address, len); } -static void pci_write_config(PCIDevice *pci_dev, - uint32_t address, uint32_t val, int len) +static void pci_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { int i; -for (i = 0; i len; i++) { -pci_dev-config[address + i] = val 0xff; -val = 8; +uint32_t config_size = pci_config_size(d); + +for (i = 0; i l addr + i config_size; val = 8, ++i) { +uint8_t wmask = d-wmask[addr + i]; +d-config[addr + i] = (d-config[addr + i] ~wmask) | (val wmask); } } Let's not name an internal static helper pci_write_config. This is really update_config_by_mask or something like that. But see below: maybe we don't need it at all? @@ -1207,18 +1208,14 @@ void pci_default_cap_write_config(PCIDevice *pci_dev, void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { -int i, was_irq_disabled = pci_irq_disabled(d); -uint32_t config_size = pci_config_size(d); +int was_irq_disabled = pci_irq_disabled(d); if (pci_access_cap_config(d, addr, l)) { d-cap.config_write(d, addr, val, l); return; } I would like to also examine the need for _cap_ functions. Why can assigned devices just do pci_default_write_config if (range_overlap(...msi)) { } if (range_overlap(...msix)) { } and then we could remove all the _cap_ extensions altogether? -for (i = 0; i l addr + i config_size; val = 8, ++i) { -uint8_t wmask = d-wmask[addr + i]; -d-config[addr + i] = (d-config[addr + i] ~wmask) | (val wmask); -} +pci_write_config(d, addr, val, l); #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT if (kvm_enabled() kvm_irqchip_in_kernel() -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 8/8] device-assignment: pass through and stub more PCI caps
On Thu, Nov 11, 2010 at 07:56:46PM -0700, Alex Williamson wrote: Some drivers depend on finding capabilities like power management, PCI express/X, vital product data, or vendor specific fields. Now that we have better capability support, we can pass more of these tables through to the guest. Note that VPD and VNDR are direct pass through capabilies, the rest are mostly empty shells with a few writable bits where necessary. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 160 +--- 1 files changed, 149 insertions(+), 11 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 179c7dc..1b228ad 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -366,6 +366,27 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos) return (uint8_t)assigned_dev_pci_read(d, pos, 1); } +static void assigned_dev_pci_write(PCIDevice *d, int pos, uint32_t val, int len) +{ +AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); +ssize_t ret; +int fd = pci_dev-real_device.config_fd; + +again: +ret = pwrite(fd, val, len, pos); +if (ret != len) { + if ((ret 0) (errno == EINTR || errno == EAGAIN)) + goto again; do {} while() ? + + fprintf(stderr, %s: pwrite failed, ret = %zd errno = %d\n, + __func__, ret, errno); + + exit(1); +} + +return; +} + static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap) { int id; @@ -1244,37 +1265,75 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos) #endif #endif +static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev, +uint8_t cap_id, +uint32_t address, int len) +{ +uint8_t cap; + +switch (cap_id) { + +case PCI_CAP_ID_VPD: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap = PCI_CAP_FLAGS) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; + +case PCI_CAP_ID_VNDR: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap PCI_CAP_FLAGS) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; +} + +return pci_default_cap_read_config(pci_dev, cap_id, address, len); +} + static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint8_t cap_id, uint32_t address, uint32_t val, int len) { +uint8_t cap; + pci_default_cap_write_config(pci_dev, cap_id, address, val, len); switch (cap_id) { #ifdef KVM_CAP_IRQ_ROUTING case PCI_CAP_ID_MSI: #ifdef KVM_CAP_DEVICE_MSI -{ -uint8_t cap = pci_find_capability(pci_dev, cap_id); -if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { -assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); -} +cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); } #endif break; case PCI_CAP_ID_MSIX: #ifdef KVM_CAP_DEVICE_MSIX -{ -uint8_t cap = pci_find_capability(pci_dev, cap_id); -if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { -assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); -} +cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); } #endif break; #endif + +case PCI_CAP_ID_VPD: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap = PCI_CAP_FLAGS) { +assigned_dev_pci_write(pci_dev, address, val, len); +} +break; + +case PCI_CAP_ID_VNDR: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap PCI_CAP_FLAGS) { +assigned_dev_pci_write(pci_dev, address, val, len); +} +break; I have a feeling we should use overlap functions instead of address math. What do you think? Also - put cap offsets in assigned device structure to avoid find calls? } return; } @@ -1340,6 +1399,84 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) #endif #endif +/* Minimal PM support, make the state bits writable so the guest + * thinks it's doing something. */ +if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_PM))) { +
Re: [PATCH 0/8] PCI capability and device assignment improvements
On Thu, Nov 11, 2010 at 07:54:49PM -0700, Alex Williamson wrote: This series attempts to clean up capability support between common code and device assignment. In doing so, we can move existing MSI MSI-X capabilities to offsets matching real hardware, and further enable more capabilities to be exposed. Very good cleanup overall. Some further suggestions posted. Thanks! The last patch is only for RFC, I'd like some input on what we should pass directly and where we should only provide read-only/emulated access. Patches 1-7 are submitted for commit. Thanks, Alex --- Alex Williamson (8): device-assignment: pass through and stub more PCI caps pci: Pass ID for capability read/write handlers device-assignment: Move PCI capabilities to match physical hardware pci: Remove cap.length, cap.start, cap.supported pci: Replace used bitmap with capability byte map device-assignment: Use PCI capabilities support pci: Remove pci_enable_capability_support() pci: pci_default_cap_write_config ignores wmask hw/device-assignment.c | 273 hw/pci.c | 103 +++--- hw/pci.h | 25 ++-- 3 files changed, 256 insertions(+), 145 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] pci: Replace used bitmap with capability byte map
On Thu, Nov 11, 2010 at 07:55:43PM -0700, Alex Williamson wrote: Capabilities are allocated in bytes, so we can track both whether a byte is used and by what capability in the same structure. Remove pci_reserve_capability() as there are no users. Signed-off-by: Alex Williamson alex.william...@redhat.com I actually wanted to remove the used array completely and ask all users to add offsets directly. Will this be needed then? --- hw/pci.c | 16 +--- hw/pci.h |6 ++ 2 files changed, 7 insertions(+), 15 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 6b2b320..e1e8a77 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -730,7 +730,7 @@ static void pci_config_alloc(PCIDevice *pci_dev) pci_dev-config = qemu_mallocz(config_size); pci_dev-cmask = qemu_mallocz(config_size); pci_dev-wmask = qemu_mallocz(config_size); -pci_dev-used = qemu_mallocz(config_size); +pci_dev-cap_map = qemu_mallocz(config_size); } static void pci_config_free(PCIDevice *pci_dev) @@ -738,7 +738,7 @@ static void pci_config_free(PCIDevice *pci_dev) qemu_free(pci_dev-config); qemu_free(pci_dev-cmask); qemu_free(pci_dev-wmask); -qemu_free(pci_dev-used); +qemu_free(pci_dev-cap_map); } /* -1 for devfn means auto assign */ @@ -1928,7 +1928,7 @@ static int pci_find_space(PCIDevice *pdev, uint8_t size) int offset = PCI_CONFIG_HEADER_SIZE; int i; for (i = PCI_CONFIG_HEADER_SIZE; i config_size; ++i) -if (pdev-used[i]) +if (pdev-cap_map[i]) offset = i + 1; else if (i - offset + 1 == size) return offset; @@ -2033,7 +2033,7 @@ int pci_add_capability_at_offset(PCIDevice *pdev, uint8_t cap_id, config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev-config[PCI_CAPABILITY_LIST]; pdev-config[PCI_CAPABILITY_LIST] = offset; -memset(pdev-used + offset, 0xFF, size); +memset(pdev-cap_map + offset, cap_id, size); /* Make capability read-only by default */ memset(pdev-wmask + offset, 0, size); /* Check capability by default */ @@ -2068,7 +2068,7 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) memset(pdev-wmask + offset, 0xff, size); /* Clear cmask as device-specific registers can't be checked */ memset(pdev-cmask + offset, 0, size); -memset(pdev-used + offset, 0, size); +memset(pdev-cap_map + offset, 0, size); if (!pdev-config[PCI_CAPABILITY_LIST]) { pdev-config[PCI_STATUS] = ~PCI_STATUS_CAP_LIST; @@ -2076,12 +2076,6 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) } } -/* Reserve space for capability at a known offset (to call after load). */ -void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size) -{ -memset(pdev-used + offset, 0xff, size); -} - uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id) { return pci_find_capability_list(pdev, cap_id, NULL); diff --git a/hw/pci.h b/hw/pci.h index 0712e55..2265c70 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -151,8 +151,8 @@ struct PCIDevice { /* Used to implement R/W bytes */ uint8_t *wmask; -/* Used to allocate config space for capabilities. */ -uint8_t *used; +/* Used to allocate config space and track capabilities. */ +uint8_t *cap_map; /* the following fields are read only */ PCIBus *bus; @@ -239,8 +239,6 @@ int pci_add_capability_at_offset(PCIDevice *pci_dev, uint8_t cap_id, void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size); -void pci_reserve_capability(PCIDevice *pci_dev, uint8_t offset, uint8_t size); - uint8_t pci_find_capability(PCIDevice *pci_dev, uint8_t cap_id); uint32_t pci_default_read_config(PCIDevice *d, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] pci: pci_default_cap_write_config ignores wmask
On Fri, 2010-11-12 at 07:22 +0200, Michael S. Tsirkin wrote: On Thu, Nov 11, 2010 at 07:55:01PM -0700, Alex Williamson wrote: Make use of wmask, just like the rest of config space. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/pci.c | 19 --- 1 files changed, 8 insertions(+), 11 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 92aaa85..12c47ac 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1175,13 +1175,14 @@ uint32_t pci_default_read_config(PCIDevice *d, return pci_read_config(d, address, len); } -static void pci_write_config(PCIDevice *pci_dev, - uint32_t address, uint32_t val, int len) +static void pci_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { int i; -for (i = 0; i len; i++) { -pci_dev-config[address + i] = val 0xff; -val = 8; +uint32_t config_size = pci_config_size(d); + +for (i = 0; i l addr + i config_size; val = 8, ++i) { +uint8_t wmask = d-wmask[addr + i]; +d-config[addr + i] = (d-config[addr + i] ~wmask) | (val wmask); } } Let's not name an internal static helper pci_write_config. This is really update_config_by_mask or something like that. But see below: maybe we don't need it at all? The function already exists, I just made it do what it seems like it should have been doing already. @@ -1207,18 +1208,14 @@ void pci_default_cap_write_config(PCIDevice *pci_dev, void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { -int i, was_irq_disabled = pci_irq_disabled(d); -uint32_t config_size = pci_config_size(d); +int was_irq_disabled = pci_irq_disabled(d); if (pci_access_cap_config(d, addr, l)) { d-cap.config_write(d, addr, val, l); return; } I would like to also examine the need for _cap_ functions. Why can assigned devices just do pci_default_write_config if (range_overlap(...msi)) { } if (range_overlap(...msix)) { } and then we could remove all the _cap_ extensions altogether? I think that somewhere we need to track what capabilities are at what offset, config space isn't a performance path, but that look horribly inefficient and gets worse with more capabilities. Why don't we define capability id 0xff as normal config space (first 64 bytes), then add the capability id to read/write_config (this is what vfio does). Then the driver can split capability handling off from their main functions if they want. Anyway, I think such an improvement could be added incrementally later. Thanks, Alex -for (i = 0; i l addr + i config_size; val = 8, ++i) { -uint8_t wmask = d-wmask[addr + i]; -d-config[addr + i] = (d-config[addr + i] ~wmask) | (val wmask); -} +pci_write_config(d, addr, val, l); #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT if (kvm_enabled() kvm_irqchip_in_kernel() -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] pci: Replace used bitmap with capability byte map
On Fri, 2010-11-12 at 07:40 +0200, Michael S. Tsirkin wrote: On Thu, Nov 11, 2010 at 07:55:43PM -0700, Alex Williamson wrote: Capabilities are allocated in bytes, so we can track both whether a byte is used and by what capability in the same structure. Remove pci_reserve_capability() as there are no users. Signed-off-by: Alex Williamson alex.william...@redhat.com I actually wanted to remove the used array completely and ask all users to add offsets directly. Will this be needed then? Can you give an example, I don't understand what you mean by asking users to add offsets directly. I think some kind of tracking what's where in config space needs to be done somewhere and the common PCI code seems like it'd be the place. Thanks, Alex --- hw/pci.c | 16 +--- hw/pci.h |6 ++ 2 files changed, 7 insertions(+), 15 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 6b2b320..e1e8a77 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -730,7 +730,7 @@ static void pci_config_alloc(PCIDevice *pci_dev) pci_dev-config = qemu_mallocz(config_size); pci_dev-cmask = qemu_mallocz(config_size); pci_dev-wmask = qemu_mallocz(config_size); -pci_dev-used = qemu_mallocz(config_size); +pci_dev-cap_map = qemu_mallocz(config_size); } static void pci_config_free(PCIDevice *pci_dev) @@ -738,7 +738,7 @@ static void pci_config_free(PCIDevice *pci_dev) qemu_free(pci_dev-config); qemu_free(pci_dev-cmask); qemu_free(pci_dev-wmask); -qemu_free(pci_dev-used); +qemu_free(pci_dev-cap_map); } /* -1 for devfn means auto assign */ @@ -1928,7 +1928,7 @@ static int pci_find_space(PCIDevice *pdev, uint8_t size) int offset = PCI_CONFIG_HEADER_SIZE; int i; for (i = PCI_CONFIG_HEADER_SIZE; i config_size; ++i) -if (pdev-used[i]) +if (pdev-cap_map[i]) offset = i + 1; else if (i - offset + 1 == size) return offset; @@ -2033,7 +2033,7 @@ int pci_add_capability_at_offset(PCIDevice *pdev, uint8_t cap_id, config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev-config[PCI_CAPABILITY_LIST]; pdev-config[PCI_CAPABILITY_LIST] = offset; -memset(pdev-used + offset, 0xFF, size); +memset(pdev-cap_map + offset, cap_id, size); /* Make capability read-only by default */ memset(pdev-wmask + offset, 0, size); /* Check capability by default */ @@ -2068,7 +2068,7 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) memset(pdev-wmask + offset, 0xff, size); /* Clear cmask as device-specific registers can't be checked */ memset(pdev-cmask + offset, 0, size); -memset(pdev-used + offset, 0, size); +memset(pdev-cap_map + offset, 0, size); if (!pdev-config[PCI_CAPABILITY_LIST]) { pdev-config[PCI_STATUS] = ~PCI_STATUS_CAP_LIST; @@ -2076,12 +2076,6 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) } } -/* Reserve space for capability at a known offset (to call after load). */ -void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size) -{ -memset(pdev-used + offset, 0xff, size); -} - uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id) { return pci_find_capability_list(pdev, cap_id, NULL); diff --git a/hw/pci.h b/hw/pci.h index 0712e55..2265c70 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -151,8 +151,8 @@ struct PCIDevice { /* Used to implement R/W bytes */ uint8_t *wmask; -/* Used to allocate config space for capabilities. */ -uint8_t *used; +/* Used to allocate config space and track capabilities. */ +uint8_t *cap_map; /* the following fields are read only */ PCIBus *bus; @@ -239,8 +239,6 @@ int pci_add_capability_at_offset(PCIDevice *pci_dev, uint8_t cap_id, void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size); -void pci_reserve_capability(PCIDevice *pci_dev, uint8_t offset, uint8_t size); - uint8_t pci_find_capability(PCIDevice *pci_dev, uint8_t cap_id); uint32_t pci_default_read_config(PCIDevice *d, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 8/8] device-assignment: pass through and stub more PCI caps
On Fri, 2010-11-12 at 07:36 +0200, Michael S. Tsirkin wrote: On Thu, Nov 11, 2010 at 07:56:46PM -0700, Alex Williamson wrote: Some drivers depend on finding capabilities like power management, PCI express/X, vital product data, or vendor specific fields. Now that we have better capability support, we can pass more of these tables through to the guest. Note that VPD and VNDR are direct pass through capabilies, the rest are mostly empty shells with a few writable bits where necessary. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 160 +--- 1 files changed, 149 insertions(+), 11 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 179c7dc..1b228ad 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -366,6 +366,27 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos) return (uint8_t)assigned_dev_pci_read(d, pos, 1); } +static void assigned_dev_pci_write(PCIDevice *d, int pos, uint32_t val, int len) +{ +AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); +ssize_t ret; +int fd = pci_dev-real_device.config_fd; + +again: +ret = pwrite(fd, val, len, pos); +if (ret != len) { + if ((ret 0) (errno == EINTR || errno == EAGAIN)) + goto again; do {} while() ? Sure, this is just a copy of another place that does something similar. They should either be merged or both converted in a separate patch. + + fprintf(stderr, %s: pwrite failed, ret = %zd errno = %d\n, + __func__, ret, errno); + + exit(1); +} + +return; +} + static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap) { int id; @@ -1244,37 +1265,75 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos) #endif #endif +static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev, +uint8_t cap_id, +uint32_t address, int len) +{ +uint8_t cap; + +switch (cap_id) { + +case PCI_CAP_ID_VPD: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap = PCI_CAP_FLAGS) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; + +case PCI_CAP_ID_VNDR: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap PCI_CAP_FLAGS) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; +} + +return pci_default_cap_read_config(pci_dev, cap_id, address, len); +} + static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint8_t cap_id, uint32_t address, uint32_t val, int len) { +uint8_t cap; + pci_default_cap_write_config(pci_dev, cap_id, address, val, len); switch (cap_id) { #ifdef KVM_CAP_IRQ_ROUTING case PCI_CAP_ID_MSI: #ifdef KVM_CAP_DEVICE_MSI -{ -uint8_t cap = pci_find_capability(pci_dev, cap_id); -if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { -assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); -} +cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); } #endif break; case PCI_CAP_ID_MSIX: #ifdef KVM_CAP_DEVICE_MSIX -{ -uint8_t cap = pci_find_capability(pci_dev, cap_id); -if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { -assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); -} +cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); } #endif break; #endif + +case PCI_CAP_ID_VPD: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap = PCI_CAP_FLAGS) { +assigned_dev_pci_write(pci_dev, address, val, len); +} +break; + +case PCI_CAP_ID_VNDR: +cap = pci_find_capability(pci_dev, cap_id); +if (address - cap PCI_CAP_FLAGS) { +assigned_dev_pci_write(pci_dev, address, val, len); +} +break; I have a feeling we should use overlap functions instead of address math. What do you think? if (!ranges_overlap(address - cap, len, 0, PCI_CAP_FLAGS)) ?
[PATCH v2 1/5] KVM: MMU: fix missing post sync audit
Add AUDIT_POST_SYNC audit for long mode shadow page Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5275c50..f3fad4f 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2539,6 +2539,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu) hpa_t root = vcpu-arch.mmu.root_hpa; sp = page_header(root); mmu_sync_children(vcpu, sp); + trace_kvm_mmu_audit(vcpu, AUDIT_POST_SYNC); return; } for (i = 0; i 4; ++i) { -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/5] KVM: MMU: clear apfs if page state is changed
If CR0.PG is changed, the page fault cann't be avoid when the prefault address is accessed later And it also fix a bug: it can retry a page enabled #PF in page disabled context if mmu is shadow page This idear is from Gleb Natapov Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fc29223..c071d73 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -525,6 +525,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) kvm_x86_ops-set_cr0(vcpu, cr0); + if ((cr0 ^ old_cr0) X86_CR0_PG) + kvm_clear_async_pf_completion_queue(vcpu); + if ((cr0 ^ old_cr0) update_bits) kvm_mmu_reset_context(vcpu); return 0; -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/5] KVM: MMU: support apf for nonpaing guest
Let's support apf for nonpaing guest Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 12 +--- 1 files changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index f3fad4f..5ee5b97 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2286,7 +2286,11 @@ static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn) return 1; } -static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) +static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, +gva_t gva, pfn_t *pfn, bool write, bool *writable); + +static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn, +bool no_apf) { int r; int level; @@ -2307,7 +2311,9 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) mmu_seq = vcpu-kvm-mmu_notifier_seq; smp_rmb(); - pfn = gfn_to_pfn_prot(vcpu-kvm, gfn, write, map_writable); + + if (try_async_pf(vcpu, no_apf, gfn, v, pfn, write, map_writable)) + return 0; /* mmio */ if (is_error_pfn(pfn)) @@ -2594,7 +2600,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, gfn = gva PAGE_SHIFT; return nonpaging_map(vcpu, gva PAGE_MASK, -error_code PFERR_WRITE_MASK, gfn); +error_code PFERR_WRITE_MASK, gfn, no_apf); } static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn) -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 5/5] KVM: MMU: retry #PF for softmmu
Retry #PF for softmmu only when the current vcpu has the same root shadow page as the time when #PF occurs. it means they have same paging environment Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |6 ++ arch/x86/kvm/mmu.c | 35 ++- arch/x86/kvm/x86.c | 15 ++- virt/kvm/async_pf.c |1 + 4 files changed, 55 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b04c0fa..2cefe00 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -192,6 +192,8 @@ struct kvm_mmu_page { struct list_head link; struct hlist_node hash_link; + struct kref apfs_counter; + /* * The following two entries are used to key the shadow page in the * hash table. @@ -600,6 +602,7 @@ struct kvm_x86_ops { struct kvm_arch_async_pf { u32 token; gfn_t gfn; + struct kvm_mmu_page *root_sp; bool direct_map; }; @@ -698,6 +701,8 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu); int fx_init(struct kvm_vcpu *vcpu); +struct kvm_mmu_page *get_vcpu_root_sp(struct kvm_vcpu *vcpu, gva_t gva); +void kvm_mmu_release_apf_sp(struct kvm_mmu_page *sp); void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes, @@ -816,6 +821,7 @@ void kvm_set_shared_msr(unsigned index, u64 val, u64 mask); bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip); +void kvm_arch_clear_async_pf(struct kvm_async_pf *work); void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, struct kvm_async_pf *work); void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index bdb9fa9..4b6d54c 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -993,6 +993,19 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr) percpu_counter_add(kvm_total_used_mmu_pages, nr); } +static void free_shadow_page(struct kref *kref) +{ + struct kvm_mmu_page *sp; + + sp = container_of(kref, struct kvm_mmu_page, apfs_counter); + kmem_cache_free(mmu_page_header_cache, sp); +} + +void kvm_mmu_release_apf_sp(struct kvm_mmu_page *sp) +{ + kref_put(sp-apfs_counter, free_shadow_page); +} + static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp) { ASSERT(is_empty_shadow_page(sp-spt)); @@ -1001,7 +1014,7 @@ static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp) __free_page(virt_to_page(sp-spt)); if (!sp-role.direct) __free_page(virt_to_page(sp-gfns)); - kmem_cache_free(mmu_page_header_cache, sp); + kvm_mmu_release_apf_sp(sp); kvm_mod_used_mmu_pages(kvm, -1); } @@ -1026,6 +1039,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, sp-multimapped = 0; sp-parent_pte = parent_pte; kvm_mod_used_mmu_pages(vcpu-kvm, +1); + kref_init(sp-apfs_counter); + return sp; } @@ -2603,13 +2618,31 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, error_code PFERR_WRITE_MASK, gfn, no_apf); } +struct kvm_mmu_page *get_vcpu_root_sp(struct kvm_vcpu *vcpu, gva_t gva) +{ + struct kvm_shadow_walk_iterator iterator; + bool ret; + + shadow_walk_init(iterator, vcpu, gva); + ret = shadow_walk_okay(iterator); + WARN_ON(!ret); + + return page_header(__pa(iterator.sptep)); +} + static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn) { struct kvm_arch_async_pf arch; + arch.token = (vcpu-arch.apf.id++ 12) | vcpu-vcpu_id; arch.gfn = gfn; arch.direct_map = vcpu-arch.mmu.direct_map; + if (!arch.direct_map) { + arch.root_sp = get_vcpu_root_sp(vcpu, gva); + kref_get(arch.root_sp-apfs_counter); + } + return kvm_setup_async_pf(vcpu, gva, gfn, arch); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 003a0ca..1ecc1a9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6169,7 +6169,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) { int r; - if (!vcpu-arch.mmu.direct_map || !work-arch.direct_map || + if (vcpu-arch.mmu.direct_map != work-arch.direct_map || is_error_page(work-page)) return; @@ -6177,6 +6177,10 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) if (unlikely(r)) return; + if (!vcpu-arch.mmu.direct_map + get_vcpu_root_sp(vcpu, work-gva) != work-arch.root_sp) + return; + vcpu-arch.mmu.page_fault(vcpu,
[PATCH v2 4/5] KVM: MMU: fix apf prefault if nested guest is enabled
If apf is generated in L2 guest and is completed in L1 guest, it will prefault this apf in L1 guest's mmu context. Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c |1 + arch/x86/kvm/x86.c |3 ++- 3 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7f20f2c..b04c0fa 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -600,6 +600,7 @@ struct kvm_x86_ops { struct kvm_arch_async_pf { u32 token; gfn_t gfn; + bool direct_map; }; extern struct kvm_x86_ops *kvm_x86_ops; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5ee5b97..bdb9fa9 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2608,6 +2608,7 @@ static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn) struct kvm_arch_async_pf arch; arch.token = (vcpu-arch.apf.id++ 12) | vcpu-vcpu_id; arch.gfn = gfn; + arch.direct_map = vcpu-arch.mmu.direct_map; return kvm_setup_async_pf(vcpu, gva, gfn, arch); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c071d73..003a0ca 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6169,7 +6169,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) { int r; - if (!vcpu-arch.mmu.direct_map || is_error_page(work-page)) + if (!vcpu-arch.mmu.direct_map || !work-arch.direct_map || + is_error_page(work-page)) return; r = kvm_mmu_reload(vcpu); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html