Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-24 19:23, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 07:05:08PM +0200, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote: On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind: - devices set PBA bit if MSI message cannot be sent due to mask (*) - core checksclears PBA bit on unmask, injects message if bit was set - devices clear PBA bit if message reason is resolved before unmask (*) OK, but practically, when exactly does the device clear PBA? Consider a network adapter that signals messages in a RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt inject on a later vector unmask operation. Jan Do you mean virtio here? Maybe, but I'm also thinking of fully emulated devices. Do you expect this optimization to give a significant performance gain? Hard to asses in general. But I have a silly guest here that obviously masks MSI vectors for each event. This currently not only kicks us into a heavy-weight exit, it also enforces serialization on qemu_global_mutex (while we have the rest already isolated). It would also be challenging to implement this in a race free manner. Clearing on interrupt status read seems straight-forward. With an in-kernel MSI-X MMIO handler, this race will be naturally unavoidable as there is no more global lock shared between table/PBA accesses and the device model. But, when using atomic bit ops, I don't think that will cause headache. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 10/24/2011 02:06 PM, Jan Kiszka wrote: On 2011-10-24 13:09, Avi Kivity wrote: On 10/24/2011 12:19 PM, Jan Kiszka wrote: With the new feature it may be worthwhile, but I'd like to see the whole thing, with numbers attached. It's not a performance issue, it's a resource limitation issue: With the new API we can stop worrying about user space device models consuming limited IRQ routes of the KVM subsystem. Only if those devices are in the same process (or have access to the vmfd). Interrupt routing together with irqfd allows you to disaggregate the device model. Instead of providing a competing implementation with new limitations, we need to remove the limitations of the old implementation. That depends on where we do the cut. Currently we let the IRQ source signal an abstract edge on a pre-allocated pseudo IRQ line. But we cannot build correct MSI-X on top of the current irqfd model as we lack the level information (for PBA emulation). *) So we either need to extend the existing model anyway -- or push per-vector masking back to the IRQ source. In the latter case, it would be a very good chance to give up on limited pseudo GSIs with static routes and do MSI messaging from external IRQ sources to KVM directly. Good point. But all those considerations affect different APIs than what I'm proposing here. We will always need a way to inject MSIs in the context of the VM as there will always be scenarios where devices are better run in that very same context, for performance or simplicity or whatever reasons. E.g., I could imagine that one would like to execute an emulated IRQ remapper rather in the hypervisor context than over-microkernelized in a separate process. Right. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[BUG] qemu-kvm: memory_region_add_subregion_common: Assertion `!subregion-parent' failed.
This bug is triggered for my Windows XP guest, but not for my linux guests. The gdb result shows that a vga.vram memoryregion is added twice. libvirt log --- 2011-10-25 16:18:58.117: starting up LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /home/laijs/bin/qemu.sh -S -M pc-0.13 -enable-kvm -m 256 -smp 1,sockets=1,cores=1,threads=1 -name Windows -uuid 3b0f4f29-1ca1-c199-a080-3ccac8f745a9 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/Windows.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -drive file=/home/laijs/Windows/windows.img,if=none,id=drive-ide0-0-0,format=qcow2,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive file=/home/laijs/OFFICE2003_STD_EN.ISO,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=17,id=hostnet0 -device pcnet,netdev=hostnet0,id=net0,mac=52:54:00:a5:d2:ea,bus=pci.0,multifunction=on,addr=0x3.0x0 -usb -vnc 127.0.0.1:0 -vga std -device virtio-balloon-pci,id=balloon0,bus=pci.0,multifunction=on,addr=0 x5.0x0 Domain id=45 is tainted: high-privileges qemu-system-x86_64: /home/laijs/work/qemu-kvm/memory.c:1083: memory_region_add_subregion_common: Assertion `!subregion-parent' failed. 2011-10-25 16:18:58.380: shutting down -git bisect result b195043003d90ea4027ea01cc7a6c974ac915108 is the first bad commit commit b195043003d90ea4027ea01cc7a6c974ac915108 Author: Avi Kivity a...@redhat.com Date: Mon Aug 8 16:08:57 2011 +0300 vga: convert vga and its derivatives to the memory API Convert all vga memory to the memory API. Note we need to fall back to get_system_memory(), since the various buses don't pass the vga window as a memory region. We no longer need to sync the dirty bitmap of the cirrus mapped memory banks, since the memory API takes care of that for us. [jan: fix vga-pci logging] Reviewed-by: Richard Henderson r...@twiddle.net Signed-off-by: Avi Kivity a...@redhat.com Signed-off-by: Anthony Liguori aligu...@us.ibm.com :04 04 e8faf8e539a4ec814ef212ce97040046363a67f3 917701ddf6a42f49204a5b2adfa9654c03c0d6f6 M hw ---gdb result(b19504300) (gdb) bt #0 0x0042995a in memory_region_add_subregion_common (mr=0x2a799a0, offset=4026531840, subregion=0x2d17250) at /home/laijs/work/qemu-kvm/memory.c:1083 #1 0x00590408 in pci_update_mappings (d=0x2d16f40) at /home/laijs/work/qemu-kvm/hw/pci.c:1123 #2 0x00590846 in pci_default_write_config (d=0x2d16f40, addr=4, val=value optimized out, l=value optimized out) at /home/laijs/work/qemu-kvm/hw/pci.c:1173 #3 0x00427616 in kvm_handle_io (env=0x2bdf500) at /home/laijs/work/qemu-kvm/kvm-all.c:837 #4 kvm_cpu_exec (env=0x2bdf500) at /home/laijs/work/qemu-kvm/kvm-all.c:976 #5 0x0040c308 in cpu_exec_all () at /home/laijs/work/qemu-kvm/cpus.c:1102 #6 0x00587281 in main_loop (argc=value optimized out, argv=value optimized out, envp=value optimized out) at /home/laijs/work/qemu-kvm/vl.c:1392 #7 main (argc=value optimized out, argv=value optimized out, envp=value optimized out) at /home/laijs/work/qemu-kvm/vl.c:3356 (gdb) p mr-name $1 = 0x2a79a70 system (gdb) p subregion-name $2 = 0x2c02960 vga.vram (gdb) p *subregion $3 = {ops = 0x0, opaque = 0x0, parent = 0x2a799a0, size = 8388608, addr = 3758096384, offset = 0, backend_registered = true, ram_addr = 268697600, iorange = {ops = 0x0, base = 0, len = 0}, terminates = true, alias = 0x0, alias_offset = 0, priority = 1, may_overlap = true, subregions = {tqh_first = 0x0, tqh_last = 0x2d172c8}, subregions_link = {tqe_next = 0x2b7d610, tqe_prev = 0x2b7d788}, coalesced = {tqh_first = 0x0, tqh_last = 0x2d172e8}, name = 0x2c02960 vga.vram, dirty_log_mask = 1 '\001', ioeventfd_nb = 0, ioeventfds = 0x0} (gdb) p subregion-parent $4 = (MemoryRegion *) 0x2a799a0 (gdb) p *subregion-parent $5 = {ops = 0x0, opaque = 0x0, parent = 0x0, size = 9223372036854775807, addr = 0, offset = 0, backend_registered = false, ram_addr = 0, iorange = {ops = 0x0, base = 0, len = 0}, terminates = false, alias = 0x0, alias_offset = 0, priority = 0, may_overlap = false, subregions = { tqh_first = 0x2acc120, tqh_last = 0x2b4d168}, subregions_link = {tqe_next = 0x0, tqe_prev = 0x0}, coalesced = {tqh_first = 0x0, tqh_last = 0x2a79a38}, name = 0x2a79a70 system, dirty_log_mask = 0 '\000', ioeventfd_nb = 0, ioeventfds = 0x0} -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1 V6 resent ] qemu-kvm: fix improper nmi emulation
Previous discussions: Which approach you prefer to? I need to know the result before wasting too much time to respin the approach. Yes, sorry about the slow and sometimes conflicting feedback. 1) Fix KVM_NMI emulation approach (which is v3 patchset) - It directly fixes the problem and matches the real hard ware more, but it changes KVM_NMI bahavior. - Require both kernel-site and userspace-site fix. 2) Get the LAPIC state from kernel irqchip, and inject NMI if it is allowed (which is v4 patchset) - Simple, don't changes any kernel behavior. - Only need the userspace-site fix 3) Add KVM_SET_LINT1 approach (which is v5 patchset) - don't changes the kernel's KVM_NMI behavior. - much complex - Require both kernel-site and userspace-site fix. - userspace-site should also handle the !KVM_SET_LINT1 condition, it uses all the 2) approach' code. it means this approach equals the 2) approach + KVM_SET_LINT1 ioctl. This is an urgent bug of us, we need to settle it down soo While (1) is simple, it overloads a single ioctl with two meanings, that's not so good. Whether we do (1) or (3), we need (2) as well, for older kernels. So I recommend first focusing on (2) and merging it, then doing (3). (note an additional issue with 3 is whether to make it a vm or vcpu ioctl - we've been assuming vcpu ioctl but it's not necessarily the best choice). It is the 2) approach. It only changes the user space site, the kernel site is not touched. It is changed from previous v4 patch, fixed problems found by Jan. end previous discussions From: Lai Jiangshan la...@cn.fujitsu.com Currently, NMI interrupt is blindly sent to all the vCPUs when NMI button event happens. This doesn't properly emulate real hardware on which NMI button event triggers LINT1. Because of this, NMI is sent to the processor even when LINT1 is maskied in LVT. For example, this causes the problem that kdump initiated by NMI sometimes doesn't work on KVM, because kdump assumes NMI is masked on CPUs other than CPU0. With this patch, inject-nmi request is handled as follows. - When in-kernel irqchip is disabled, deliver LINT1 instead of NMI interrupt. - When in-kernel irqchip is enabled, get the in-kernel LAPIC states and test the APIC_LVT_MASKED, if LINT1 is unmasked, and then delivering the NMI directly. (Suggested by Jan Kiszka) Changed from old version: re-implement it by the Jan's suggestion. fix the race found by Jan. Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com Reported-by: Kenji Kaneshige kaneshige.ke...@jp.fujitsu.com Acked-by: Avi Kivity a...@redhat.com Acked-by: Jan Kiszka jan.kis...@web.de --- hw/apic.c | 33 + hw/apic.h |1 + monitor.c |6 +- 3 files changed, 39 insertions(+), 1 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index 69d6ac5..922796a 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -205,6 +205,39 @@ void apic_deliver_pic_intr(DeviceState *d, int level) } } +static inline uint32_t kapic_reg(struct kvm_lapic_state *kapic, int reg_id); + +static void kvm_irqchip_deliver_nmi(void *p) +{ +APICState *s = p; +struct kvm_lapic_state klapic; +uint32_t lvt; + +kvm_get_lapic(s-cpu_env, klapic); +lvt = kapic_reg(klapic, 0x32 + APIC_LVT_LINT1); + +if (lvt APIC_LVT_MASKED) { +return; +} + +if (((lvt 8) 7) != APIC_DM_NMI) { +return; +} + +kvm_vcpu_ioctl(s-cpu_env, KVM_NMI); +} + +void apic_deliver_nmi(DeviceState *d) +{ +APICState *s = DO_UPCAST(APICState, busdev.qdev, d); + +if (kvm_irqchip_in_kernel()) { +run_on_cpu(s-cpu_env, kvm_irqchip_deliver_nmi, s); +} else { +apic_local_deliver(s, APIC_LVT_LINT1); +} +} + #define foreach_apic(apic, deliver_bitmask, code) \ {\ int __i, __j, __mask;\ diff --git a/hw/apic.h b/hw/apic.h index c857d52..3a4be0a 100644 --- a/hw/apic.h +++ b/hw/apic.h @@ -10,6 +10,7 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode, uint8_t trigger_mode); int apic_accept_pic_intr(DeviceState *s); void apic_deliver_pic_intr(DeviceState *s, int level); +void apic_deliver_nmi(DeviceState *d); int apic_get_interrupt(DeviceState *s); void apic_reset_irq_delivered(void); int apic_get_irq_delivered(void); diff --git a/monitor.c b/monitor.c index cb485bf..0b81f17 100644 --- a/monitor.c +++ b/monitor.c @@ -2616,7 +2616,11 @@ static int do_inject_nmi(Monitor *mon, const QDict *qdict, QObject **ret_data) CPUState *env; for (env = first_cpu; env != NULL; env = env-next_cpu) { -cpu_interrupt(env, CPU_INTERRUPT_NMI); +if (!env-apic_state) { +cpu_interrupt(env, CPU_INTERRUPT_NMI); +} else { +apic_deliver_nmi(env-apic_state); +} } return 0; -- To unsubscribe from this list: send the line unsubscribe kvm in the
Re: [BUG] qemu-kvm: memory_region_add_subregion_common: Assertion `!subregion-parent' failed.
On 2011-10-25 11:42, Lai Jiangshan wrote: This bug is triggered for my Windows XP guest, but not for my linux guests. The gdb result shows that a vga.vram memoryregion is added twice. libvirt log --- 2011-10-25 16:18:58.117: starting up LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /home/laijs/bin/qemu.sh -S -M pc-0.13 -enable-kvm -m 256 -smp 1,sockets=1,cores=1,threads=1 -name Windows -uuid 3b0f4f29-1ca1-c199-a080-3ccac8f745a9 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/Windows.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -drive file=/home/laijs/Windows/windows.img,if=none,id=drive-ide0-0-0,format=qcow2,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive file=/home/laijs/OFFICE2003_STD_EN.ISO,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=17,id=hostnet0 -device pcnet ,netdev=hostnet0,id=net0,mac=52:54:00:a5:d2:ea,bus=pci.0,multifunction=on,addr=0x3.0x0 -usb -vnc 127.0.0.1:0 -vga std -device virtio-balloon-pci,id=balloon0,bus=pci.0,multifunction=on,addr=0 x5.0x0 Domain id=45 is tainted: high-privileges qemu-system-x86_64: /home/laijs/work/qemu-kvm/memory.c:1083: memory_region_add_subregion_common: Assertion `!subregion-parent' failed. 2011-10-25 16:18:58.380: shutting down -git bisect result b195043003d90ea4027ea01cc7a6c974ac915108 is the first bad commit commit b195043003d90ea4027ea01cc7a6c974ac915108 Author: Avi Kivity a...@redhat.com Date: Mon Aug 8 16:08:57 2011 +0300 vga: convert vga and its derivatives to the memory API Convert all vga memory to the memory API. Note we need to fall back to get_system_memory(), since the various buses don't pass the vga window as a memory region. We no longer need to sync the dirty bitmap of the cirrus mapped memory banks, since the memory API takes care of that for us. [jan: fix vga-pci logging] Reviewed-by: Richard Henderson r...@twiddle.net Signed-off-by: Avi Kivity a...@redhat.com Signed-off-by: Anthony Liguori aligu...@us.ibm.com :04 04 e8faf8e539a4ec814ef212ce97040046363a67f3 917701ddf6a42f49204a5b2adfa9654c03c0d6f6 Mhw ---gdb result(b19504300) (gdb) bt #0 0x0042995a in memory_region_add_subregion_common (mr=0x2a799a0, offset=4026531840, subregion=0x2d17250) at /home/laijs/work/qemu-kvm/memory.c:1083 #1 0x00590408 in pci_update_mappings (d=0x2d16f40) at /home/laijs/work/qemu-kvm/hw/pci.c:1123 #2 0x00590846 in pci_default_write_config (d=0x2d16f40, addr=4, val=value optimized out, l=value optimized out) at /home/laijs/work/qemu-kvm/hw/pci.c:1173 #3 0x00427616 in kvm_handle_io (env=0x2bdf500) at /home/laijs/work/qemu-kvm/kvm-all.c:837 #4 kvm_cpu_exec (env=0x2bdf500) at /home/laijs/work/qemu-kvm/kvm-all.c:976 #5 0x0040c308 in cpu_exec_all () at /home/laijs/work/qemu-kvm/cpus.c:1102 #6 0x00587281 in main_loop (argc=value optimized out, argv=value optimized out, envp=value optimized out) at /home/laijs/work/qemu-kvm/vl.c:1392 #7 main (argc=value optimized out, argv=value optimized out, envp=value optimized out) at /home/laijs/work/qemu-kvm/vl.c:3356 (gdb) p mr-name $1 = 0x2a79a70 system (gdb) p subregion-name $2 = 0x2c02960 vga.vram (gdb) p *subregion $3 = {ops = 0x0, opaque = 0x0, parent = 0x2a799a0, size = 8388608, addr = 3758096384, offset = 0, backend_registered = true, ram_addr = 268697600, iorange = {ops = 0x0, base = 0, len = 0}, terminates = true, alias = 0x0, alias_offset = 0, priority = 1, may_overlap = true, subregions = {tqh_first = 0x0, tqh_last = 0x2d172c8}, subregions_link = {tqe_next = 0x2b7d610, tqe_prev = 0x2b7d788}, coalesced = {tqh_first = 0x0, tqh_last = 0x2d172e8}, name = 0x2c02960 vga.vram, dirty_log_mask = 1 '\001', ioeventfd_nb = 0, ioeventfds = 0x0} (gdb) p subregion-parent $4 = (MemoryRegion *) 0x2a799a0 (gdb) p *subregion-parent $5 = {ops = 0x0, opaque = 0x0, parent = 0x0, size = 9223372036854775807, addr = 0, offset = 0, backend_registered = false, ram_addr = 0, iorange = {ops = 0x0, base = 0, len = 0}, terminates = false, alias = 0x0, alias_offset = 0, priority = 0, may_overlap = false, subregions = { tqh_first = 0x2acc120, tqh_last = 0x2b4d168}, subregions_link = {tqe_next = 0x0, tqe_prev = 0x0}, coalesced = {tqh_first = 0x0, tqh_last = 0x2a79a38}, name = 0x2a79a70 system, dirty_log_mask = 0 '\000', ioeventfd_nb = 0, ioeventfds = 0x0} That's likely vga_init_vbe vs. pci_register_bar with vga.vram. Either the PCI bar or the VBE mapping should be wrapped by an alias region
[PATCH] kvm tools: Add method to stop ipc thread
Stop the ipc thread when shutting down the hypervisor. This solves a bug where the .sock files weren't removed upon shutdown. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/include/kvm/kvm-ipc.h |1 + tools/kvm/kvm-ipc.c | 33 + tools/kvm/kvm.c |1 + 3 files changed, 31 insertions(+), 4 deletions(-) diff --git a/tools/kvm/include/kvm/kvm-ipc.h b/tools/kvm/include/kvm/kvm-ipc.h index c932052..731767f 100644 --- a/tools/kvm/include/kvm/kvm-ipc.h +++ b/tools/kvm/include/kvm/kvm-ipc.h @@ -22,5 +22,6 @@ enum { int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)); int kvm_ipc__handle(int fd, struct kvm_ipc_msg *msg); int kvm_ipc__start(int sock); +int kvm_ipc__stop(void); #endif diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c index f05e926..65209e1 100644 --- a/tools/kvm/kvm-ipc.c +++ b/tools/kvm/kvm-ipc.c @@ -7,12 +7,14 @@ #include sys/un.h #include sys/types.h #include sys/socket.h +#include sys/eventfd.h #define KVM_IPC_MAX_MSGS 16 static void (*msgs[KVM_IPC_MAX_MSGS])(int fd, u32 type, u32 len, u8 *msg); static DECLARE_RWSEM(msgs_rwlock); -static int epoll_fd, server_fd; +static int epoll_fd, server_fd, stop_fd; +static pthread_t thread; int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)) { @@ -109,8 +111,11 @@ static void *kvm_ipc__thread(void *param) nfds = epoll_wait(epoll_fd, event, 1, -1); if (nfds 0) { int fd = event.data.fd; - - if (fd == server_fd) { + printf(bleh\n); + if (fd == stop_fd) { + printf(Got stop signal\n); + break; + } else if (fd == server_fd) { int client; client = kvm_ipc__new_conn(fd); @@ -128,7 +133,6 @@ static void *kvm_ipc__thread(void *param) int kvm_ipc__start(int sock) { - pthread_t thread; struct epoll_event ev; server_fd = sock; @@ -140,8 +144,29 @@ int kvm_ipc__start(int sock) if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock, ev) 0) die(Failed starting IPC thread); + stop_fd = eventfd(0, 0); + ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET; + ev.data.fd = stop_fd; + if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, stop_fd, ev) 0) + die(Failed adding stop event to epoll); + if (pthread_create(thread, NULL, kvm_ipc__thread, NULL) != 0) die(Failed starting IPC thread); return 0; } + +int kvm_ipc__stop(void) +{ + u64 val = 1; + int ret; + + ret = write(stop_fd, val, sizeof(val)); + if (ret 0) + return ret; + + close(server_fd); + close(epoll_fd); + + return ret; +} diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c index 40ae6a5..8d6b5e1 100644 --- a/tools/kvm/kvm.c +++ b/tools/kvm/kvm.c @@ -237,6 +237,7 @@ void kvm__delete(struct kvm *kvm) kvm__stop_timer(kvm); munmap(kvm-ram_start, kvm-ram_size); + kvm_ipc__stop(); kvm__remove_socket(kvm-name); free(kvm); } -- 1.7.7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Add method to stop ipc thread
On Tue, Oct 25, 2011 at 1:40 PM, Sasha Levin levinsasha...@gmail.com wrote: Stop the ipc thread when shutting down the hypervisor. This solves a bug where the .sock files weren't removed upon shutdown. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/include/kvm/kvm-ipc.h | 1 + tools/kvm/kvm-ipc.c | 33 + tools/kvm/kvm.c | 1 + 3 files changed, 31 insertions(+), 4 deletions(-) diff --git a/tools/kvm/include/kvm/kvm-ipc.h b/tools/kvm/include/kvm/kvm-ipc.h index c932052..731767f 100644 --- a/tools/kvm/include/kvm/kvm-ipc.h +++ b/tools/kvm/include/kvm/kvm-ipc.h @@ -22,5 +22,6 @@ enum { int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)); int kvm_ipc__handle(int fd, struct kvm_ipc_msg *msg); int kvm_ipc__start(int sock); +int kvm_ipc__stop(void); #endif diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c index f05e926..65209e1 100644 --- a/tools/kvm/kvm-ipc.c +++ b/tools/kvm/kvm-ipc.c @@ -7,12 +7,14 @@ #include sys/un.h #include sys/types.h #include sys/socket.h +#include sys/eventfd.h #define KVM_IPC_MAX_MSGS 16 static void (*msgs[KVM_IPC_MAX_MSGS])(int fd, u32 type, u32 len, u8 *msg); static DECLARE_RWSEM(msgs_rwlock); -static int epoll_fd, server_fd; +static int epoll_fd, server_fd, stop_fd; +static pthread_t thread; int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)) { @@ -109,8 +111,11 @@ static void *kvm_ipc__thread(void *param) nfds = epoll_wait(epoll_fd, event, 1, -1); if (nfds 0) { int fd = event.data.fd; - - if (fd == server_fd) { + printf(bleh\n); + if (fd == stop_fd) { + printf(Got stop signal\n); + break; LOL. Can I have a version of the patch without those printf calls? :-) + } else if (fd == server_fd) { int client; client = kvm_ipc__new_conn(fd); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ANNOUNCE] kvm-kmod-3.1
This is the kvm-kmod compat wrapper release based on KVM modules of Linux 3.1 (see [1] for background information). The package is available for download from http://sourceforge.net/projects/kvm/files/kvm-kmod/3.1/kvm-kmod-3.1.tar.bz2/download KVM changes since kvm-kmod-3.0b: - initial version of VMX nesting - steal time support (inform guest about the time it was scheduled out) - faster MMIO - additional CPU features exposed to the guest, most importantly SMEP kvm-kmod changes: - fix for cross-building kvm-kmod - Drop unused PROCESSOR variable from config.mak As kernel.org is still out of reach for me, I'm hosting the kvm-kmod git tree now on my server: git://git.kiszka.org/kvm-kmod.git http://git.kiszka.org/kvm-kmod.git This will likely remain the primary repository because it's also not suffering from the excessively long mirroring latencies kernel.org had to cope with. Those latencies made buildbot runs that require public git access really unhandy. Jan [1] http://www.linux-kvm.org/page/Getting_the_kvm_kernel_modules -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Add method to stop ipc thread
于 2011年10月25日 18:40, Sasha Levin 写道: Stop the ipc thread when shutting down the hypervisor. This solves a bug where the .sock files weren't removed upon shutdown. I tested the patch, the socket is still there. (I'm testing with --sdl, and shutdown the vm window). And the guest starting succeeded even if a socket with the name same as the specified guest name exists, which means I can start two guests with same socket, which is not right. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/include/kvm/kvm-ipc.h |1 + tools/kvm/kvm-ipc.c | 33 + tools/kvm/kvm.c |1 + 3 files changed, 31 insertions(+), 4 deletions(-) diff --git a/tools/kvm/include/kvm/kvm-ipc.h b/tools/kvm/include/kvm/kvm-ipc.h index c932052..731767f 100644 --- a/tools/kvm/include/kvm/kvm-ipc.h +++ b/tools/kvm/include/kvm/kvm-ipc.h @@ -22,5 +22,6 @@ enum { int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)); int kvm_ipc__handle(int fd, struct kvm_ipc_msg *msg); int kvm_ipc__start(int sock); +int kvm_ipc__stop(void); #endif diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c index f05e926..65209e1 100644 --- a/tools/kvm/kvm-ipc.c +++ b/tools/kvm/kvm-ipc.c @@ -7,12 +7,14 @@ #include sys/un.h #include sys/types.h #include sys/socket.h +#include sys/eventfd.h #define KVM_IPC_MAX_MSGS 16 static void (*msgs[KVM_IPC_MAX_MSGS])(int fd, u32 type, u32 len, u8 *msg); static DECLARE_RWSEM(msgs_rwlock); -static int epoll_fd, server_fd; +static int epoll_fd, server_fd, stop_fd; +static pthread_t thread; int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)) { @@ -109,8 +111,11 @@ static void *kvm_ipc__thread(void *param) nfds = epoll_wait(epoll_fd, event, 1, -1); if (nfds 0) { int fd = event.data.fd; - - if (fd == server_fd) { + printf(bleh\n); + if (fd == stop_fd) { + printf(Got stop signal\n); + break; + } else if (fd == server_fd) { int client; client = kvm_ipc__new_conn(fd); @@ -128,7 +133,6 @@ static void *kvm_ipc__thread(void *param) int kvm_ipc__start(int sock) { - pthread_t thread; struct epoll_event ev; server_fd = sock; @@ -140,8 +144,29 @@ int kvm_ipc__start(int sock) if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock, ev) 0) die(Failed starting IPC thread); + stop_fd = eventfd(0, 0); + ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET; + ev.data.fd = stop_fd; + if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, stop_fd, ev) 0) + die(Failed adding stop event to epoll); + if (pthread_create(thread, NULL, kvm_ipc__thread, NULL) != 0) die(Failed starting IPC thread); return 0; } + +int kvm_ipc__stop(void) +{ + u64 val = 1; + int ret; + + ret = write(stop_fd, val, sizeof(val)); + if (ret 0) + return ret; + + close(server_fd); + close(epoll_fd); + + return ret; +} diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c index 40ae6a5..8d6b5e1 100644 --- a/tools/kvm/kvm.c +++ b/tools/kvm/kvm.c @@ -237,6 +237,7 @@ void kvm__delete(struct kvm *kvm) kvm__stop_timer(kvm); munmap(kvm-ram_start, kvm-ram_size); + kvm_ipc__stop(); kvm__remove_socket(kvm-name); free(kvm); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Tue, Oct 25, 2011 at 09:24:17AM +0200, Jan Kiszka wrote: On 2011-10-24 19:23, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 07:05:08PM +0200, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote: On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind: - devices set PBA bit if MSI message cannot be sent due to mask (*) - core checksclears PBA bit on unmask, injects message if bit was set - devices clear PBA bit if message reason is resolved before unmask (*) OK, but practically, when exactly does the device clear PBA? Consider a network adapter that signals messages in a RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt inject on a later vector unmask operation. Jan Do you mean virtio here? Maybe, but I'm also thinking of fully emulated devices. One thing seems certain: actual, assigned devices don't have this fake msi-x level so they don't notify host when that changes. Do you expect this optimization to give a significant performance gain? Hard to asses in general. But I have a silly guest here that obviously masks MSI vectors for each event. This currently not only kicks us into a heavy-weight exit, it also enforces serialization on qemu_global_mutex (while we have the rest already isolated). It easy to see how MSIX mask support in kernel would help. Not sure whether it's worth it to also add special APIs to reduce the number of spurious interrupts for such silly guests. It would also be challenging to implement this in a race free manner. Clearing on interrupt status read seems straight-forward. With an in-kernel MSI-X MMIO handler, this race will be naturally unavoidable as there is no more global lock shared between table/PBA accesses and the device model. But, when using atomic bit ops, I don't think that will cause headache. Jan This is not the race I meant. The challenge is for the device to determine that it can clear the PBA. atomic accesses on PBA won't help here I think. -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Add method to stop ipc thread
On Tue, 2011-10-25 at 13:42 +0300, Pekka Enberg wrote: On Tue, Oct 25, 2011 at 1:40 PM, Sasha Levin levinsasha...@gmail.com wrote: Stop the ipc thread when shutting down the hypervisor. This solves a bug where the .sock files weren't removed upon shutdown. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/include/kvm/kvm-ipc.h |1 + tools/kvm/kvm-ipc.c | 33 + tools/kvm/kvm.c |1 + 3 files changed, 31 insertions(+), 4 deletions(-) diff --git a/tools/kvm/include/kvm/kvm-ipc.h b/tools/kvm/include/kvm/kvm-ipc.h index c932052..731767f 100644 --- a/tools/kvm/include/kvm/kvm-ipc.h +++ b/tools/kvm/include/kvm/kvm-ipc.h @@ -22,5 +22,6 @@ enum { int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)); int kvm_ipc__handle(int fd, struct kvm_ipc_msg *msg); int kvm_ipc__start(int sock); +int kvm_ipc__stop(void); #endif diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c index f05e926..65209e1 100644 --- a/tools/kvm/kvm-ipc.c +++ b/tools/kvm/kvm-ipc.c @@ -7,12 +7,14 @@ #include sys/un.h #include sys/types.h #include sys/socket.h +#include sys/eventfd.h #define KVM_IPC_MAX_MSGS 16 static void (*msgs[KVM_IPC_MAX_MSGS])(int fd, u32 type, u32 len, u8 *msg); static DECLARE_RWSEM(msgs_rwlock); -static int epoll_fd, server_fd; +static int epoll_fd, server_fd, stop_fd; +static pthread_t thread; int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)) { @@ -109,8 +111,11 @@ static void *kvm_ipc__thread(void *param) nfds = epoll_wait(epoll_fd, event, 1, -1); if (nfds 0) { int fd = event.data.fd; - - if (fd == server_fd) { + printf(bleh\n); + if (fd == stop_fd) { + printf(Got stop signal\n); + break; LOL. Can I have a version of the patch without those printf calls? :-) Woopsie :) -- Sasha. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] kvm tools: Add method to stop ipc thread
Stop the ipc thread when shutting down the hypervisor. This solves a bug where the .sock files weren't removed upon shutdown. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/include/kvm/kvm-ipc.h |1 + tools/kvm/kvm-ipc.c | 30 +++--- tools/kvm/kvm.c |1 + 3 files changed, 29 insertions(+), 3 deletions(-) diff --git a/tools/kvm/include/kvm/kvm-ipc.h b/tools/kvm/include/kvm/kvm-ipc.h index c932052..731767f 100644 --- a/tools/kvm/include/kvm/kvm-ipc.h +++ b/tools/kvm/include/kvm/kvm-ipc.h @@ -22,5 +22,6 @@ enum { int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)); int kvm_ipc__handle(int fd, struct kvm_ipc_msg *msg); int kvm_ipc__start(int sock); +int kvm_ipc__stop(void); #endif diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c index f05e926..3131fe8 100644 --- a/tools/kvm/kvm-ipc.c +++ b/tools/kvm/kvm-ipc.c @@ -7,12 +7,14 @@ #include sys/un.h #include sys/types.h #include sys/socket.h +#include sys/eventfd.h #define KVM_IPC_MAX_MSGS 16 static void (*msgs[KVM_IPC_MAX_MSGS])(int fd, u32 type, u32 len, u8 *msg); static DECLARE_RWSEM(msgs_rwlock); -static int epoll_fd, server_fd; +static int epoll_fd, server_fd, stop_fd; +static pthread_t thread; int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg)) { @@ -110,7 +112,9 @@ static void *kvm_ipc__thread(void *param) if (nfds 0) { int fd = event.data.fd; - if (fd == server_fd) { + if (fd == stop_fd) { + break; + } else if (fd == server_fd) { int client; client = kvm_ipc__new_conn(fd); @@ -128,7 +132,6 @@ static void *kvm_ipc__thread(void *param) int kvm_ipc__start(int sock) { - pthread_t thread; struct epoll_event ev; server_fd = sock; @@ -140,8 +143,29 @@ int kvm_ipc__start(int sock) if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock, ev) 0) die(Failed starting IPC thread); + stop_fd = eventfd(0, 0); + ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET; + ev.data.fd = stop_fd; + if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, stop_fd, ev) 0) + die(Failed adding stop event to epoll); + if (pthread_create(thread, NULL, kvm_ipc__thread, NULL) != 0) die(Failed starting IPC thread); return 0; } + +int kvm_ipc__stop(void) +{ + u64 val = 1; + int ret; + + ret = write(stop_fd, val, sizeof(val)); + if (ret 0) + return ret; + + close(server_fd); + close(epoll_fd); + + return ret; +} diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c index 40ae6a5..8d6b5e1 100644 --- a/tools/kvm/kvm.c +++ b/tools/kvm/kvm.c @@ -237,6 +237,7 @@ void kvm__delete(struct kvm *kvm) kvm__stop_timer(kvm); munmap(kvm-ram_start, kvm-ram_size); + kvm_ipc__stop(); kvm__remove_socket(kvm-name); free(kvm); } -- 1.7.7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Add method to stop ipc thread
On Tue, 2011-10-25 at 19:08 +0800, Osier Yang wrote: 于 2011年10月25日 18:40, Sasha Levin 写道: Stop the ipc thread when shutting down the hypervisor. This solves a bug where the .sock files weren't removed upon shutdown. I tested the patch, the socket is still there. (I'm testing with --sdl, and shutdown the vm window). And the guest starting succeeded even if a socket with the name same as the specified guest name exists, which means I can start two guests with same socket, which is not right. I've just tested it with --sdl, looks like closing the sdl window uses the old method of IPC which I forgot to convert. I'll fix that in a separate patch. If you test it without --sdl it should work right. -- Sasha. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-25 13:20, Michael S. Tsirkin wrote: On Tue, Oct 25, 2011 at 09:24:17AM +0200, Jan Kiszka wrote: On 2011-10-24 19:23, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 07:05:08PM +0200, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote: On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind: - devices set PBA bit if MSI message cannot be sent due to mask (*) - core checksclears PBA bit on unmask, injects message if bit was set - devices clear PBA bit if message reason is resolved before unmask (*) OK, but practically, when exactly does the device clear PBA? Consider a network adapter that signals messages in a RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt inject on a later vector unmask operation. Jan Do you mean virtio here? Maybe, but I'm also thinking of fully emulated devices. One thing seems certain: actual, assigned devices don't have this fake msi-x level so they don't notify host when that changes. But they have real PBA. We just need to replicate the emulated vector mask state into real hw. Doesn't this happen anyway when we disable the IRQ on the host? If not, that may require a bit more work, maybe a special masking mode that can be requested by the managing backend of an assigned device from the MSI-X in-kernel service. Do you expect this optimization to give a significant performance gain? Hard to asses in general. But I have a silly guest here that obviously masks MSI vectors for each event. This currently not only kicks us into a heavy-weight exit, it also enforces serialization on qemu_global_mutex (while we have the rest already isolated). It easy to see how MSIX mask support in kernel would help. Not sure whether it's worth it to also add special APIs to reduce the number of spurious interrupts for such silly guests. I do not get the latter point. What could be simplified (without making it incorrect) when ignoring excessive mask accesses? Also, if sane guests do not access the mask that frequently, why was in-kernel MSI-X MMIO proposed at all? It would also be challenging to implement this in a race free manner. Clearing on interrupt status read seems straight-forward. With an in-kernel MSI-X MMIO handler, this race will be naturally unavoidable as there is no more global lock shared between table/PBA accesses and the device model. But, when using atomic bit ops, I don't think that will cause headache. Jan This is not the race I meant. The challenge is for the device to determine that it can clear the PBA. atomic accesses on PBA won't help here I think. The device knows best if the interrupt reason persists. It can synchronize MSI assertion and PBA bit clearance. If it clears too late, than this reflects what may happen on real hw as well when host and device race for changing vector mask vs. device state. It's not stated that those changes need to be serialized inside the device, is it? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm tools: Fix SDL exit on window close
We've changed IPC to use sockets instead of signals, but the process of closing the SDL window was still trigerring an exit signal causing an ugly message and not cleaning up after itself. This patch switches that to use the proper method of cleaning up. Reported-by: Osier Yang jy...@redhat.com Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/ui/sdl.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/tools/kvm/ui/sdl.c b/tools/kvm/ui/sdl.c index b84cd66..5bf11fa 100644 --- a/tools/kvm/ui/sdl.c +++ b/tools/kvm/ui/sdl.c @@ -4,6 +4,7 @@ #include kvm/i8042.h #include kvm/util.h #include kvm/kvm.h +#include kvm/kvm-cpu.h #include SDL/SDL.h #include pthread.h @@ -254,7 +255,7 @@ static void *sdl__thread(void *p) SDL_Delay(1000 / FRAME_RATE); } exit: - kill(0, SIGKVMSTOP); + kvm_cpu__reboot(); return NULL; } -- 1.7.7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Tue, Oct 25, 2011 at 01:41:39PM +0200, Jan Kiszka wrote: On 2011-10-25 13:20, Michael S. Tsirkin wrote: On Tue, Oct 25, 2011 at 09:24:17AM +0200, Jan Kiszka wrote: On 2011-10-24 19:23, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 07:05:08PM +0200, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote: On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind: - devices set PBA bit if MSI message cannot be sent due to mask (*) - core checksclears PBA bit on unmask, injects message if bit was set - devices clear PBA bit if message reason is resolved before unmask (*) OK, but practically, when exactly does the device clear PBA? Consider a network adapter that signals messages in a RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt inject on a later vector unmask operation. Jan Do you mean virtio here? Maybe, but I'm also thinking of fully emulated devices. One thing seems certain: actual, assigned devices don't have this fake msi-x level so they don't notify host when that changes. But they have real PBA. We just need to replicate the emulated vector mask state into real hw. Doesn't this happen anyway when we disable the IRQ on the host? Not immediately I think. If not, that may require a bit more work, maybe a special masking mode that can be requested by the managing backend of an assigned device from the MSI-X in-kernel service. True. OTOH this might have cost (extra mmio) for the doubtful benefit of making PBA values exact. Do you expect this optimization to give a significant performance gain? Hard to asses in general. But I have a silly guest here that obviously masks MSI vectors for each event. This currently not only kicks us into a heavy-weight exit, it also enforces serialization on qemu_global_mutex (while we have the rest already isolated). It easy to see how MSIX mask support in kernel would help. Not sure whether it's worth it to also add special APIs to reduce the number of spurious interrupts for such silly guests. I do not get the latter point. What could be simplified (without making it incorrect) when ignoring excessive mask accesses? Clearing PBA when we detect an empty ring in host is not required, IMO. It's an optimization. Also, if sane guests do not access the mask that frequently, why was in-kernel MSI-X MMIO proposed at all? Apparently whether mask accesses happen a lot depends on the workload. It would also be challenging to implement this in a race free manner. Clearing on interrupt status read seems straight-forward. With an in-kernel MSI-X MMIO handler, this race will be naturally unavoidable as there is no more global lock shared between table/PBA accesses and the device model. But, when using atomic bit ops, I don't think that will cause headache. Jan This is not the race I meant. The challenge is for the device to determine that it can clear the PBA. atomic accesses on PBA won't help here I think. The device knows best if the interrupt reason persists. It might not know this unless notified by driver. E.g. virtio drivers currently don't do interrupt status reads. It can synchronize MSI assertion and PBA bit clearance. If it clears too late, than this reflects what may happen on real hw as well when host and device race for changing vector mask vs. device state. It's not stated that those changes need to be serialized inside the device, is it? Jan Talking about emulated devices? It's not sure that real hardware clears PBA. Considering that no guests I know of use PBA ATM, I would not be surprised if many devices had broken PBA support. -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH] kvm tools, bios: Don't include glibc headers
We should not include glibc headers from BIOS code which can't rely on glibc at runtime. This patch fixes the following build breakage on 64-bit: CC bios/e820.o In file included from /usr/include/features.h:387:0, from /usr/include/unistd.h:26, from include/kvm/util.h:12, from bios/e820.c:5: /usr/include/gnu/stubs.h:7:27: fatal error: gnu/stubs-32.h: No such file or directory compilation terminated. make: *** [bios/bios.bin.elf] Error 1 when built without the 'glibc-devel.i686' package on Fedora, for example. Cc: Asias He asias.he...@gmail.com Cc: Cyrill Gorcunov gorcu...@gmail.com Cc: Ingo Molnar mi...@elte.hu Cc: Sasha Levin levinsasha...@gmail.com Signed-off-by: Pekka Enberg penb...@kernel.org --- tools/kvm/README |2 +- tools/kvm/bios/e820.c |1 - tools/kvm/bios/int10.c |3 --- 3 files changed, 1 insertions(+), 5 deletions(-) diff --git a/tools/kvm/README b/tools/kvm/README index 7cbf530..c262995 100644 --- a/tools/kvm/README +++ b/tools/kvm/README @@ -102,7 +102,7 @@ For rpm based systems: 32-bit: yum install glibc-devel 64-bit: -yum install glibc-devel glibc-devel.i686 glibc-static +yum install glibc-devel glibc-static On 64-bit Arch Linux make sure the multilib repository is enabled in your /etc/pacman.conf and run diff --git a/tools/kvm/bios/e820.c b/tools/kvm/bios/e820.c index 1eafb5b..a9bca29 100644 --- a/tools/kvm/bios/e820.c +++ b/tools/kvm/bios/e820.c @@ -2,7 +2,6 @@ #include kvm/segment.h #include kvm/bios.h -#include kvm/util.h #include asm/processor-flags.h #include asm/e820.h diff --git a/tools/kvm/bios/int10.c b/tools/kvm/bios/int10.c index 6387ca8..32f7c92 100644 --- a/tools/kvm/bios/int10.c +++ b/tools/kvm/bios/int10.c @@ -1,12 +1,9 @@ #include kvm/segment.h #include kvm/bios.h -#include kvm/util.h #include kvm/vesa.h #include bios/memcpy.h -#include stdint.h - #define VESA_MAGIC ('V' + ('E' 8) + ('S' 16) + ('A' 24)) /* VESA General Information table */ -- 1.7.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] kvm tools: Kill double slash from kvm__get_dir()
In preparation for showing kvm__get_dir() path to the user, kill the unnecessary double slash to make the output more readable. Cc: Asias He asias.he...@gmail.com Cc: Cyrill Gorcunov gorcu...@gmail.com Cc: Ingo Molnar mi...@elte.hu Cc: Sasha Levin levinsasha...@gmail.com Signed-off-by: Pekka Enberg penb...@kernel.org --- tools/kvm/include/kvm/kvm.h |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/tools/kvm/include/kvm/kvm.h b/tools/kvm/include/kvm/kvm.h index 2b3024a..38253da 100644 --- a/tools/kvm/include/kvm/kvm.h +++ b/tools/kvm/include/kvm/kvm.h @@ -22,7 +22,7 @@ #define SIGKVMSTOP (SIGRTMIN + 4) #define SIGKVMRESUME (SIGRTMIN + 5) -#define KVM_PID_FILE_PATH /.kvm-tools/ +#define KVM_PID_FILE_PATH .kvm-tools/ #define HOME_DIR getenv(HOME) struct kvm { -- 1.7.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] kvm tools: Improve kvm setup info and error messages
This patch improves the output of 'kvm setup' as follows: [penberg@tux kvm]$ ./kvm setup kvm setup creates a new rootfs under /home/penberg//.kvm-tools/. This can be used later by the '-d' parameter of 'kvm run'. usage: kvm setup [name] [penberg@tux kvm]$ ./kvm setup test-box A new rootfs 'test-box' has been created in '/home/penberg//.kvm-tools/test-box'. You can now start it by running the following command: kvm run -d test-box [penberg@tux kvm]$ ./kvm setup test-box Unable to create rootfs in /home/penberg//.kvm-tools/test-box: File exists Cc: Asias He asias.he...@gmail.com Cc: Cyrill Gorcunov gorcu...@gmail.com Cc: Ingo Molnar mi...@elte.hu Cc: Sasha Levin levinsasha...@gmail.com Signed-off-by: Pekka Enberg penb...@kernel.org --- tools/kvm/builtin-setup.c | 15 +-- 1 files changed, 9 insertions(+), 6 deletions(-) diff --git a/tools/kvm/builtin-setup.c b/tools/kvm/builtin-setup.c index dce1748..7deb0ed 100644 --- a/tools/kvm/builtin-setup.c +++ b/tools/kvm/builtin-setup.c @@ -46,8 +46,9 @@ static void parse_setup_options(int argc, const char **argv) void kvm_setup_help(void) { - printf(\nkvm setup creates a new rootfs and stores it under ~/.kvm-tools/ .\n - This can be used later by the '-d' parameter of 'kvm run'.\n); + printf(\nkvm setup creates a new rootfs under %s.\n + This can be used later by the '-d' parameter of 'kvm run'.\n, + kvm__get_dir()); usage_with_options(setup_usage, setup_options); } @@ -215,11 +216,13 @@ int kvm_cmd_setup(int argc, const char **argv, const char *prefix) r = do_setup(instance_name); if (r == 0) - pr_info(Your new rootfs named %s has been created.\n - You can now start it by running 'kvm run -d %s'\n, - instance_name, instance_name); + printf(A new rootfs '%s' has been created in '%s%s'.\n\n + You can now start it by running the following command:\n\n + kvm run -d %s\n, + instance_name, kvm__get_dir(), instance_name, instance_name); else - perror(Error creating rootfs); + printf(Unable to create rootfs in %s%s: %s\n, + kvm__get_dir(), instance_name, strerror(errno)); return r; } -- 1.7.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-25 14:05, Michael S. Tsirkin wrote: On Tue, Oct 25, 2011 at 01:41:39PM +0200, Jan Kiszka wrote: On 2011-10-25 13:20, Michael S. Tsirkin wrote: On Tue, Oct 25, 2011 at 09:24:17AM +0200, Jan Kiszka wrote: On 2011-10-24 19:23, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 07:05:08PM +0200, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote: On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind: - devices set PBA bit if MSI message cannot be sent due to mask (*) - core checksclears PBA bit on unmask, injects message if bit was set - devices clear PBA bit if message reason is resolved before unmask (*) OK, but practically, when exactly does the device clear PBA? Consider a network adapter that signals messages in a RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt inject on a later vector unmask operation. Jan Do you mean virtio here? Maybe, but I'm also thinking of fully emulated devices. One thing seems certain: actual, assigned devices don't have this fake msi-x level so they don't notify host when that changes. But they have real PBA. We just need to replicate the emulated vector mask state into real hw. Doesn't this happen anyway when we disable the IRQ on the host? Not immediately I think. If not, that may require a bit more work, maybe a special masking mode that can be requested by the managing backend of an assigned device from the MSI-X in-kernel service. True. OTOH this might have cost (extra mmio) for the doubtful benefit of making PBA values exact. I think correctness come before performance unless the latter hurts significantly. Do you expect this optimization to give a significant performance gain? Hard to asses in general. But I have a silly guest here that obviously masks MSI vectors for each event. This currently not only kicks us into a heavy-weight exit, it also enforces serialization on qemu_global_mutex (while we have the rest already isolated). It easy to see how MSIX mask support in kernel would help. Not sure whether it's worth it to also add special APIs to reduce the number of spurious interrupts for such silly guests. I do not get the latter point. What could be simplified (without making it incorrect) when ignoring excessive mask accesses? Clearing PBA when we detect an empty ring in host is not required, IMO. It's an optimization. For virtio that might be true - as we are free to define the device behaviour to our benefit. What emulated real devices do is another thing. Also, if sane guests do not access the mask that frequently, why was in-kernel MSI-X MMIO proposed at all? Apparently whether mask accesses happen a lot depends on the workload. It would also be challenging to implement this in a race free manner. Clearing on interrupt status read seems straight-forward. With an in-kernel MSI-X MMIO handler, this race will be naturally unavoidable as there is no more global lock shared between table/PBA accesses and the device model. But, when using atomic bit ops, I don't think that will cause headache. Jan This is not the race I meant. The challenge is for the device to determine that it can clear the PBA. atomic accesses on PBA won't help here I think. The device knows best if the interrupt reason persists. It might not know this unless notified by driver. E.g. virtio drivers currently don't do interrupt status reads. Talking about real devices, they obviously know as they maintain the hardware state. It can synchronize MSI assertion and PBA bit clearance. If it clears too late, than this reflects what may happen on real hw as well when host and device race for changing vector mask vs. device state. It's not stated that those changes need to be serialized inside the device, is it? Jan Talking about emulated devices? It's not sure that real hardware clears PBA. Considering that no guests I know of use PBA ATM, I would not be surprised if many devices had broken PBA support. OK, if there are no conforming MSI-X devices out there, then we can forget about all the PBA maintenance beyond set if message hit mask, cleared again on unmask. But I doubt that this is generally true. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm tools: Improve kvm list default output
This patch changes 'kvm list' default output to look as follows: [penberg@tux kvm]$ ./kvm list PID GUEST 8156 guest-8156 default (not running) test-box (not running) Cc: Asias He asias.he...@gmail.com Cc: Cyrill Gorcunov gorcu...@gmail.com Cc: Ingo Molnar mi...@elte.hu Cc: Sasha Levin levinsasha...@gmail.com Signed-off-by: Pekka Enberg penb...@kernel.org --- tools/kvm/builtin-list.c | 10 -- 1 files changed, 4 insertions(+), 6 deletions(-) diff --git a/tools/kvm/builtin-list.c b/tools/kvm/builtin-list.c index 12e2537..2ff3472 100644 --- a/tools/kvm/builtin-list.c +++ b/tools/kvm/builtin-list.c @@ -91,8 +91,6 @@ cleanup: static int kvm_list_running_instances(void) { - printf( PID GUEST\n); - return kvm__enumerate_instances(print_guest); } @@ -107,13 +105,11 @@ static int kvm_list_rootfs(void) if (dir == NULL) return -1; - printf( ROOTFS\n); - while ((dirent = readdir(dir))) { if (dirent-d_type == DT_DIR strcmp(dirent-d_name, .) strcmp(dirent-d_name, ..)) - printf(%s\n, dirent-d_name); + printf( %s (not running)\n, dirent-d_name); } return 0; @@ -136,7 +132,9 @@ int kvm_cmd_list(int argc, const char **argv, const char *prefix) parse_setup_options(argc, argv); if (!run !rootfs) - kvm_list_help(); + run = rootfs = true; + + printf( PID GUEST\n); if (run) { r = kvm_list_running_instances(); -- 1.7.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for October 25
Am 24.10.2011 13:35, schrieb Paolo Bonzini: On 10/24/2011 01:04 PM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - What's left to merge for 1.0. I would still like to cache the default cache mode (probably to cache=writeback). We don't allow guests to toggle WCE yet which Anthony would have liked to see before doing the change. Is it a strict requirement? Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] PCI: Rework config space locking, add INTx masking services
On 2011-10-14 16:48, Jan Kiszka wrote: On 2011-10-06 17:48, Jesse Barnes wrote: On Mon, 12 Sep 2011 18:54:01 +0200 Jan Kiszka jan.kis...@siemens.com wrote: This series tries to heal the currently broken locking scheme around PCI config space accesses. We have an interface lock out access via sysfs, but that service wrongly assumes it is only called by one instance at a time for some device. So two loops doing echo 1 /sys/bus/pci/devices/some-device/reset in parallel will trigger a kernel BUG at the moment. Besides synchronizing with user space, we also need to manage config space access of generic PCI drivers. They need to mask legacy interrupt lines while the specific driver runs in user space or a guest OS. The approach taken here is provide mutex-like locking for general access - which still requires a special mechanism due to requirements of the IBM Power RAID SCSI driver. Furthermore, INTx masking is now available via the PCI core and synchronized via the internal pci_lock. Not sure who may want to take this, so I'm CC'ing broadly. ISTR a bunch of discussion about this (just back from lots of work travel and vacation, sorry I missed most of it). Is this the agreed upon way of handling it? If so, can I get some Reviewed/Acked-bys from people? I hope this is acceptable. These changes are required for further improvements of the KVM device assignment support (INTx sharing). So I would appreciate any ack or whatever feedback as well. At least two acks arrived by now. But I guess this series is not yet lined up for the 3.2 window, is it? Can we get it merged for 3.3? Just let me know if any rebasing is required and against which tree. Thanks, Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] qemu-kvm: memory_region_add_subregion_common: Assertion `!subregion-parent' failed.
Hi, -M pc-0.13 That's likely vga_init_vbe vs. pci_register_bar with vga.vram. Either the PCI bar or the VBE mapping should be wrapped by an alias region pointing to the original vram. And vmware_vga seems to be affected as well. Will there be a migration flag day before 1.0? If yes we can just zap the bochs vbe mapping for pci vga cards. It is not used any more since 0.14 and it is only there for migration compatibility with 0.13 and older ... Easiest way around this is to use a more recent machine type. cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for October 25
On 10/25/2011 07:35 AM, Kevin Wolf wrote: Am 24.10.2011 13:35, schrieb Paolo Bonzini: On 10/24/2011 01:04 PM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - What's left to merge for 1.0. I would still like to cache the default cache mode (probably to cache=writeback). We don't allow guests to toggle WCE yet which Anthony would have liked to see before doing the change. Is it a strict requirement? I don't see a way around it. If the default mode is cache=writeback, then we're open to data corruption in any guest where barrier=0. With guest togglable WCE, it ends up being a guest configuration issue so we can more or less defer responsibility. Do you think it's a good idea to change the default mode w/o guest WCE toggle support? What's your view about older guests if we change the default mode? What's your main motivation for wanting to change the default mode? I'd be much more open to changing the default mode to cache=none FWIW since the risk of data loss there is much, much lower. Regards, Anthony Liguori Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] kvm tools: Kill double slash from kvm__get_dir()
In preparation for showing kvm__get_dir() path to the user, kill the unnecessary double slash to make the output more readable. Cc: Asias He asias.he...@gmail.com Cc: Cyrill Gorcunov gorcu...@gmail.com Cc: Ingo Molnar mi...@elte.hu Cc: Sasha Levin levinsasha...@gmail.com Signed-off-by: Pekka Enberg penb...@kernel.org --- tools/kvm/kvm.c |6 +- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c index 8d6b5e1..8605fc8 100644 --- a/tools/kvm/kvm.c +++ b/tools/kvm/kvm.c @@ -86,7 +86,11 @@ static char kvm_dir[PATH_MAX]; static void set_dir(const char *fmt, va_list args) { - vsnprintf(kvm_dir, sizeof(kvm_dir), fmt, args); + char tmp[PATH_MAX]; + + vsnprintf(tmp, sizeof(tmp), fmt, args); + + realpath(tmp, kvm_dir); } void kvm__set_dir(const char *fmt, ...) -- 1.7.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for October 25
On 10/25/2011 03:05 PM, Anthony Liguori wrote: On 10/25/2011 07:35 AM, Kevin Wolf wrote: Am 24.10.2011 13:35, schrieb Paolo Bonzini: On 10/24/2011 01:04 PM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - What's left to merge for 1.0. I would still like to cache the default cache mode (probably to cache=writeback). We don't allow guests to toggle WCE yet which Anthony would have liked to see before doing the change. Is it a strict requirement? I don't see a way around it. If the default mode is cache=writeback, then we're open to data corruption in any guest where barrier=0. With guest togglable WCE, it ends up being a guest configuration issue so we can more or less defer responsibility. Do you think it's a good idea to change the default mode w/o guest WCE toggle support? What's your view about older guests if we change the default mode? What's your main motivation for wanting to change the default mode? I'd be much more open to changing the default mode to cache=none FWIW since the risk of data loss there is much, much lower. A bit related to this, it would be nice to mark a VM un-migratable if cache!=none. Juan reports that currently it such VMs are exposed to data integrity issues so we need to fail migrating them automatically. Regards, Anthony Liguori Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm tools: Avoid overwriting socket files with same name
This patch prevents overwriting socket files by running two instances with the same name. Reported-by: Osier Yang jy...@redhat.com Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/kvm.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c index 8d6b5e1..f700d78 100644 --- a/tools/kvm/kvm.c +++ b/tools/kvm/kvm.c @@ -152,6 +152,9 @@ static int kvm__create_socket(struct kvm *kvm) sprintf(full_name, %s, kvm__get_dir()); mkdir(full_name, 0777); sprintf(full_name, %s/%s.sock, kvm__get_dir(), kvm-name); + if (access(full_name, F_OK) == 0) + die(Socket file %s already exist, full_name); + s = socket(AF_UNIX, SOCK_STREAM, 0); if (s 0) return s; -- 1.7.7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Tue, Oct 25, 2011 at 02:21:01PM +0200, Jan Kiszka wrote: On 2011-10-25 14:05, Michael S. Tsirkin wrote: On Tue, Oct 25, 2011 at 01:41:39PM +0200, Jan Kiszka wrote: On 2011-10-25 13:20, Michael S. Tsirkin wrote: On Tue, Oct 25, 2011 at 09:24:17AM +0200, Jan Kiszka wrote: On 2011-10-24 19:23, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 07:05:08PM +0200, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote: On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind: - devices set PBA bit if MSI message cannot be sent due to mask (*) - core checksclears PBA bit on unmask, injects message if bit was set - devices clear PBA bit if message reason is resolved before unmask (*) OK, but practically, when exactly does the device clear PBA? Consider a network adapter that signals messages in a RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt inject on a later vector unmask operation. Jan Do you mean virtio here? Maybe, but I'm also thinking of fully emulated devices. One thing seems certain: actual, assigned devices don't have this fake msi-x level so they don't notify host when that changes. But they have real PBA. We just need to replicate the emulated vector mask state into real hw. Doesn't this happen anyway when we disable the IRQ on the host? Not immediately I think. If not, that may require a bit more work, maybe a special masking mode that can be requested by the managing backend of an assigned device from the MSI-X in-kernel service. True. OTOH this might have cost (extra mmio) for the doubtful benefit of making PBA values exact. I think correctness come before performance unless the latter hurts significantly. Do you expect this optimization to give a significant performance gain? Hard to asses in general. But I have a silly guest here that obviously masks MSI vectors for each event. This currently not only kicks us into a heavy-weight exit, it also enforces serialization on qemu_global_mutex (while we have the rest already isolated). It easy to see how MSIX mask support in kernel would help. Not sure whether it's worth it to also add special APIs to reduce the number of spurious interrupts for such silly guests. I do not get the latter point. What could be simplified (without making it incorrect) when ignoring excessive mask accesses? Clearing PBA when we detect an empty ring in host is not required, IMO. It's an optimization. For virtio that might be true - as we are free to define the device behaviour to our benefit. What emulated real devices do is another thing. Anything specific in mind? Also, if sane guests do not access the mask that frequently, why was in-kernel MSI-X MMIO proposed at all? Apparently whether mask accesses happen a lot depends on the workload. It would also be challenging to implement this in a race free manner. Clearing on interrupt status read seems straight-forward. With an in-kernel MSI-X MMIO handler, this race will be naturally unavoidable as there is no more global lock shared between table/PBA accesses and the device model. But, when using atomic bit ops, I don't think that will cause headache. Jan This is not the race I meant. The challenge is for the device to determine that it can clear the PBA. atomic accesses on PBA won't help here I think. The device knows best if the interrupt reason persists. It might not know this unless notified by driver. E.g. virtio drivers currently don't do interrupt status reads. Talking about real devices, they obviously know as they maintain the hardware state. Not necessarily. It's quite common to keep the ring in coherent memory allocated by driver, not within the device, the state is then maintained by driver and device together. It can synchronize MSI assertion and PBA bit clearance. If it clears too late, than this reflects what may happen on real hw as well when host and device race for changing vector mask vs. device state. It's not stated that those changes need to be serialized inside the device, is it? Jan Talking about emulated devices? It's not sure that real hardware clears PBA. Considering that no guests I know of use PBA ATM, I would not be surprised if many devices had broken PBA support. OK, if there are no conforming MSI-X devices out there, Oh, I'm guessing some devices are conforming :) then we can forget about all the PBA maintenance beyond set if message hit mask, cleared again on unmask. But I doubt that this is generally true. Jan We seem to get by basically with what you describe but I'm not saying it's perfect, just that it's hard to make it perfect. --
Re: [Qemu-devel] KVM call agenda for October 25
On 10/25/2011 08:18 AM, Dor Laor wrote: On 10/25/2011 03:05 PM, Anthony Liguori wrote: On 10/25/2011 07:35 AM, Kevin Wolf wrote: Am 24.10.2011 13:35, schrieb Paolo Bonzini: On 10/24/2011 01:04 PM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - What's left to merge for 1.0. I would still like to cache the default cache mode (probably to cache=writeback). We don't allow guests to toggle WCE yet which Anthony would have liked to see before doing the change. Is it a strict requirement? I don't see a way around it. If the default mode is cache=writeback, then we're open to data corruption in any guest where barrier=0. With guest togglable WCE, it ends up being a guest configuration issue so we can more or less defer responsibility. Do you think it's a good idea to change the default mode w/o guest WCE toggle support? What's your view about older guests if we change the default mode? What's your main motivation for wanting to change the default mode? I'd be much more open to changing the default mode to cache=none FWIW since the risk of data loss there is much, much lower. A bit related to this, it would be nice to mark a VM un-migratable if cache!=none. Juan reports that currently it such VMs are exposed to data integrity issues so we need to fail migrating them automatically. That's not correct. cache!=none is perfectly safe *if* you have coherent shared storage. Regards, Anthony Liguori Regards, Anthony Liguori Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for October 25
Am 25.10.2011 15:18, schrieb Dor Laor: [...] it would be nice to mark a VM un-migratable [snip] Speaking of which, I'm working on the missing migration support for AHCI but fear I won't quite make it for the Nov 1 deadline. Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for October 25
Am 25.10.2011 15:05, schrieb Anthony Liguori: On 10/25/2011 07:35 AM, Kevin Wolf wrote: Am 24.10.2011 13:35, schrieb Paolo Bonzini: On 10/24/2011 01:04 PM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - What's left to merge for 1.0. I would still like to cache the default cache mode (probably to cache=writeback). We don't allow guests to toggle WCE yet which Anthony would have liked to see before doing the change. Is it a strict requirement? I don't see a way around it. If the default mode is cache=writeback, then we're open to data corruption in any guest where barrier=0. With guest togglable WCE, it ends up being a guest configuration issue so we can more or less defer responsibility. So do you think that offering a WCE inside the guest would be a real solution or just a way to have an excuse? Christoph said that OSes don't usually change this by themselves, it would need an administrator manually changing the setting. But if we require that, we can just as well require that the administrator set cache=writethrough on the qemu command line. Do you think it's a good idea to change the default mode w/o guest WCE toggle support? What's your view about older guests if we change the default mode? What's your main motivation for wanting to change the default mode? Because people are constantly complaining about the awful (cache=writethrough) performance they get before they are told they should use a different cache option. And they are right. The out-of-the-box experience with qemu's block performance really sucks. I'd be much more open to changing the default mode to cache=none FWIW since the risk of data loss there is much, much lower. I think people said that they'd rather not have cache=none as default because O_DIRECT doesn't work everywhere. Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for October 25
On 10/25/2011 08:56 AM, Kevin Wolf wrote: Am 25.10.2011 15:05, schrieb Anthony Liguori: On 10/25/2011 07:35 AM, Kevin Wolf wrote: Am 24.10.2011 13:35, schrieb Paolo Bonzini: On 10/24/2011 01:04 PM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - What's left to merge for 1.0. I would still like to cache the default cache mode (probably to cache=writeback). We don't allow guests to toggle WCE yet which Anthony would have liked to see before doing the change. Is it a strict requirement? I don't see a way around it. If the default mode is cache=writeback, then we're open to data corruption in any guest where barrier=0. With guest togglable WCE, it ends up being a guest configuration issue so we can more or less defer responsibility. So do you think that offering a WCE inside the guest would be a real solution or just a way to have an excuse? No, it offers a mechanism to fix mistakes at run-time verses at start up time. It also means that you can make template images that understand that they don't support barriers and change the WCE setting appropriately. Christoph said that OSes don't usually change this by themselves, it would need an administrator manually changing the setting. But if we require that, we can just as well require that the administrator set cache=writethrough on the qemu command line. The administrator of the guest != the administrator of the host. Do you think it's a good idea to change the default mode w/o guest WCE toggle support? What's your view about older guests if we change the default mode? What's your main motivation for wanting to change the default mode? Because people are constantly complaining about the awful (cache=writethrough) performance they get before they are told they should use a different cache option. And they are right. The out-of-the-box experience with qemu's block performance really sucks. With qcow2 you mean, right? I'd be much more open to changing the default mode to cache=none FWIW since the risk of data loss there is much, much lower. I think people said that they'd rather not have cache=none as default because O_DIRECT doesn't work everywhere. Where doesn't it work these days? I know it doesn't work on tmpfs. I know it works on ext[234], btrfs, nfs. Regards, Anthony Liguori Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm 0.15 usb problem
Hi Gerd, I tried with the branch you suggested but it didn't work either. The whole WinXP VM crashed once I virtually plugged in the USB ISDN card. I also tried the qemu-kvm-15.1 release but that also didn't work. Do you have any further suggestions? How can I assist in debugging that problem? Cheers Thomas On Monday 19 September 2011 08:30:11 Gerd Hoffmann wrote: On 09/16/11 18:22, Thomas Beinicke wrote: I am experiencing the same problem with version 0.15 while 0.14 works flawless. We're using it for an USB ISDN card which freezes the windows host with qemu-kvm 0.15 when plugged in while the system is already running or it completely prevents the start of the VM. We experience the same problem with two completely different USB ISDN adapters. I can further debug this problem if necessary. Try http://www.kraxel.org/cgit/qemu/log/?h=usb.25.stable cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for October 25
Am 25.10.2011 16:06, schrieb Anthony Liguori: On 10/25/2011 08:56 AM, Kevin Wolf wrote: Am 25.10.2011 15:05, schrieb Anthony Liguori: On 10/25/2011 07:35 AM, Kevin Wolf wrote: Am 24.10.2011 13:35, schrieb Paolo Bonzini: On 10/24/2011 01:04 PM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - What's left to merge for 1.0. I would still like to cache the default cache mode (probably to cache=writeback). We don't allow guests to toggle WCE yet which Anthony would have liked to see before doing the change. Is it a strict requirement? I don't see a way around it. If the default mode is cache=writeback, then we're open to data corruption in any guest where barrier=0. With guest togglable WCE, it ends up being a guest configuration issue so we can more or less defer responsibility. So do you think that offering a WCE inside the guest would be a real solution or just a way to have an excuse? No, it offers a mechanism to fix mistakes at run-time verses at start up time. This is true (in both directions). But I think it's independent from the right default. It also means that you can make template images that understand that they don't support barriers and change the WCE setting appropriately. Isn't that really a job for management tools? Christoph said that OSes don't usually change this by themselves, it would need an administrator manually changing the setting. But if we require that, we can just as well require that the administrator set cache=writethrough on the qemu command line. The administrator of the guest != the administrator of the host. But the administrator of the guest == the owner of the qemu instance, no? He should be the one to use the management tools and configure his VMs. Do you think it's a good idea to change the default mode w/o guest WCE toggle support? What's your view about older guests if we change the default mode? What's your main motivation for wanting to change the default mode? Because people are constantly complaining about the awful (cache=writethrough) performance they get before they are told they should use a different cache option. And they are right. The out-of-the-box experience with qemu's block performance really sucks. With qcow2 you mean, right? No, with any format, including raw. Which isn't surprising at all, O_SYNC makes writes very expensive. I'd be much more open to changing the default mode to cache=none FWIW since the risk of data loss there is much, much lower. I think people said that they'd rather not have cache=none as default because O_DIRECT doesn't work everywhere. Where doesn't it work these days? I know it doesn't work on tmpfs. I know it works on ext[234], btrfs, nfs. I think tmpfs was named (and failing to start with default settings on tmpfs would be nasty enough), but iirc Alex had another one. Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC v2 PATCH 5/4 PATCH] virtio-net: send gratuitous packet when needed
On Tue, Oct 25, 2011 at 10:50:41AM +0800, Jason Wang wrote: On 10/24/2011 01:25 PM, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 02:54:59PM +1030, Rusty Russell wrote: On Sat, 22 Oct 2011 13:43:11 +0800, Jason Wang jasow...@redhat.com wrote: This make let virtio-net driver can send gratituous packet by a new config bit - VIRTIO_NET_S_ANNOUNCE in each config update interrupt. When this bit is set by backend, the driver would schedule a workqueue to send gratituous packet through NETDEV_NOTIFY_PEERS. This feature is negotiated through bit VIRTIO_NET_F_GUEST_ANNOUNCE. Signed-off-by: Jason Wang jasow...@redhat.com This seems like a huge layering violation. Imagine this in real hardware, for example. commits 06c4648d46d1b757d6b9591a86810be79818b60c and 99606477a5888b0ead0284fecb13417b1da8e3af document the need for this: NETDEV_NOTIFY_PEERS notifier indicates that a device moved to a different physical link. and In real hardware such notifications are only generated when the device comes up or the address changes. So hypervisor could get the same behaviour by sending link up/down events, this is just an optimization so guest won't do unecessary stuff like try to reconfigure an IP address. Maybe LOCATION_CHANGE would be a better name? ANNOUNCE_SELF? It would be nice to formulate what kind of event are we notifying the guest about. The announce part of it is really up to the guest, isn't it? There may be a good reason why virtual devices might want this kind of reconfiguration cheat, which is unnecessary for normal machines, I think yes, the difference with real hardware is guest can change location without link getting dropped. FWIW, Xen seems to use this capability too. So does ms netvsc. but it'd have to be spelled out clearly in the spec to justify it... Cheers, Rusty. Agree, and I'd like to see the spec too. The interface seems to involve the guest clearing the status bit when it detects an event? I would describe this in spec. The interface need guest to clear the status bit, this would let the back-end know it has finished the work as we may need to send the gratuitous packets many times. Also - how does it interact with the link up event? We probably don't want to schedule this when we detect a link status change or during initialization, as this patch seems to do? What if link goes down while the work is running? Is that OK? Looks like there's are duplications if guest enable arp_notify vm is started, How hard would it be to avoid these duplicates? but we need to handle the situation that resuming a stopped virtual machine. For the link down race, I don't see any real issue, either dropping or queued. For example, you do unregister_netdev(vi-dev); cancel_work_sync(vi-announce); which looks scary as announce seems to use the netdev. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next-2.6 PATCH 0/8 RFC v2] macvlan: MAC Address filtering support for passthru mode
On Mon, Oct 24, 2011 at 11:15:05AM -0700, Roopa Prabhu wrote: ... And also I dont know of any hw that provides an interface to set hw filters on a per queue basis. VMDq hardware would support this, no? Am not really sure. This patch uses netdev to pass filters to hw. And I don't see any netdev infrastructure that would support per queue filters. Sure. I was only saying that as far as I understand, VMDq hardware does support this functionality. Right, Greg? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [net-next-2.6 PATCH 0/8 RFC v2] macvlan: MAC Address filtering support for passthru mode
-Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Tuesday, October 25, 2011 8:46 AM To: Roopa Prabhu Cc: net...@vger.kernel.org; s...@us.ibm.com; dragos.tatu...@gmail.com; a...@arndb.de; kvm@vger.kernel.org; da...@davemloft.net; mc...@broadcom.com; dwa...@cisco.com; shemmin...@vyatta.com; eric.duma...@gmail.com; ka...@trash.net; be...@cisco.com; Rose, Gregory V Subject: Re: [net-next-2.6 PATCH 0/8 RFC v2] macvlan: MAC Address filtering support for passthru mode On Mon, Oct 24, 2011 at 11:15:05AM -0700, Roopa Prabhu wrote: ... And also I dont know of any hw that provides an interface to set hw filters on a per queue basis. VMDq hardware would support this, no? Am not really sure. This patch uses netdev to pass filters to hw. And I don't see any netdev infrastructure that would support per queue filters. Sure. I was only saying that as far as I understand, VMDq hardware does support this functionality. Right, Greg? In the case of Intel HW yes. I'll refrain from speaking for other HW vendors although I'm guessing it would be true in their cases also. YMMV, caveat emptor, etc. etc. - Greg -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V2 4/5] kvm guest : Added configuration support to enable debug information for KVM Guests
On 10/24/2011 03:45 PM, Avi Kivity wrote: On 10/23/2011 09:07 PM, Raghavendra K T wrote: Added configuration support to enable debug information for KVM Guests in debugfs Signed-off-by: Srivatsa Vaddagiriva...@linux.vnet.ibm.com Signed-off-by: Suzuki Poulosesuz...@in.ibm.com Signed-off-by: Raghavendra K Traghavendra...@linux.vnet.ibm.com --- diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1f03f82..ed34269 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -562,6 +562,15 @@ config KVM_GUEST This option enables various optimizations for running under the KVM hypervisor. +config KVM_DEBUG_FS + bool Enable debug information for KVM Guests in debugfs + depends on KVM_GUEST + default n + ---help--- + This option enables collection of various statistics for KVM guest. + Statistics are displayed in debugfs filesystem. Enabling this option + may incur significant overhead. + source arch/x86/lguest/Kconfig This might be better implemented through tracepoints, which an be enabled dynamically. Interesting option. I 'll explore this one. Thanks for pointing. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V2 3/5] kvm hypervisor : Add two hypercalls to support pv-ticketlock
On 10/24/2011 07:20 PM, Srivatsa Vaddagiri wrote: * Avi Kivitya...@redhat.com [2011-10-24 15:09:25]: I guess with that change, we can also dropthe need for other hypercall introduced in this patch (kvm_pv_kick_cpu_op()). Essentially a vcpu sleeping because of HLT instruction can be woken up by a IPI issued by vcpu releasing a lock. Not if interrupts are disabled. Hmm yes ..so we need a kick hypercall then. So then do also you foresee the need for directed yield at some point, to address LHP? provided we have good improvements to prove. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V2 4/5] kvm guest : Added configuration support to enable debug information for KVM Guests
On 10/24/2011 03:15 AM, Avi Kivity wrote: On 10/23/2011 09:07 PM, Raghavendra K T wrote: Added configuration support to enable debug information for KVM Guests in debugfs Signed-off-by: Srivatsa Vaddagiri va...@linux.vnet.ibm.com Signed-off-by: Suzuki Poulose suz...@in.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1f03f82..ed34269 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -562,6 +562,15 @@ config KVM_GUEST This option enables various optimizations for running under the KVM hypervisor. +config KVM_DEBUG_FS +bool Enable debug information for KVM Guests in debugfs +depends on KVM_GUEST +default n +---help--- + This option enables collection of various statistics for KVM guest. + Statistics are displayed in debugfs filesystem. Enabling this option + may incur significant overhead. + source arch/x86/lguest/Kconfig This might be better implemented through tracepoints, which an be enabled dynamically. Tracepoints use spinlocks, so that could get awkward. J -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V2 5/5] kvm guest : pv-ticketlocks support for linux guests running on KVM hypervisor
On 10/23/2011 12:07 PM, Raghavendra K T wrote: This patch extends Linux guests running on KVM hypervisor to support pv-ticketlocks. Very early during bootup, paravirtualied KVM guest detects if the hypervisor has required feature (KVM_FEATURE_WAIT_FOR_KICK) to support pv-ticketlocks. If so, support for pv-ticketlocks is registered via pv_lock_ops. Signed-off-by: Srivatsa Vaddagiri va...@linux.vnet.ibm.com Signed-off-by: Suzuki Poulose suz...@in.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 2874c19..c7f34b7 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -195,10 +195,18 @@ void kvm_async_pf_task_wait(u32 token); void kvm_async_pf_task_wake(u32 token); u32 kvm_read_and_reset_pf_reason(void); extern void kvm_disable_steal_time(void); -#else + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +void __init kvm_guest_early_init(void); +#else /* CONFIG_PARAVIRT_SPINLOCKS */ +#define kvm_guest_early_init() do { } while (0) +#endif /* CONFIG_PARAVIRT_SPINLOCKS */ + +#else /* CONFIG_KVM_GUEST */ #define kvm_guest_init() do { } while (0) #define kvm_async_pf_task_wait(T) do {} while(0) #define kvm_async_pf_task_wake(T) do {} while(0) +#define kvm_guest_early_init() do { } while (0) static inline u32 kvm_read_and_reset_pf_reason(void) { return 0; diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c index 3bb0850..fb25bca 100644 --- a/arch/x86/kernel/head32.c +++ b/arch/x86/kernel/head32.c @@ -9,6 +9,7 @@ #include linux/start_kernel.h #include linux/mm.h #include linux/memblock.h +#include linux/kvm_para.h #include asm/setup.h #include asm/sections.h @@ -59,6 +60,8 @@ void __init i386_start_kernel(void) break; } + kvm_guest_early_init(); + /* * At this point everything still needed from the boot loader * or BIOS or kernel text should be early reserved or marked not diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c index 5655c22..cabf8ec 100644 --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -13,6 +13,7 @@ #include linux/start_kernel.h #include linux/io.h #include linux/memblock.h +#include linux/kvm_para.h #include asm/processor.h #include asm/proto.h @@ -115,6 +116,8 @@ void __init x86_64_start_reservations(char *real_mode_data) reserve_ebda_region(); + kvm_guest_early_init(); + /* * At this point everything still needed from the boot loader * or BIOS or kernel text should be early reserved or marked not diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index a9c2116..f4f341f 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -39,6 +39,16 @@ #include asm/desc.h #include asm/tlbflush.h +#ifdef CONFIG_PARAVIRT_SPINLOCKS + +#ifdef CONFIG_KVM_DEBUG_FS + +#include linux/debugfs.h + +#endif /* CONFIG_KVM_DEBUG_FS */ + +#endif /* CONFIG_PARAVIRT_SPINLOCKS */ + #define MMU_QUEUE_SIZE 1024 static int kvmapf = 1; @@ -627,3 +637,240 @@ static __init int activate_jump_labels(void) return 0; } arch_initcall(activate_jump_labels); + +#ifdef CONFIG_PARAVIRT_SPINLOCKS + +#ifdef CONFIG_KVM_DEBUG_FS + +static struct kvm_spinlock_stats +{ + u32 taken_slow; + u32 taken_slow_pickup; + + u32 released_slow; + u32 released_slow_kicked; + +#define HISTO_BUCKETS30 + u32 histo_spin_blocked[HISTO_BUCKETS+1]; + + u64 time_blocked; +} spinlock_stats; + +static u8 zero_stats; + +static inline void check_zero(void) +{ + if (unlikely(zero_stats)) { + memset(spinlock_stats, 0, sizeof(spinlock_stats)); + zero_stats = 0; + } +} + +#define ADD_STATS(elem, val) \ + do { check_zero(); spinlock_stats.elem += (val); } while (0) + +static inline u64 spin_time_start(void) +{ + return sched_clock(); +} + +static void __spin_time_accum(u64 delta, u32 *array) +{ + unsigned index = ilog2(delta); + + check_zero(); + + if (index HISTO_BUCKETS) + array[index]++; + else + array[HISTO_BUCKETS]++; +} + +static inline void spin_time_accum_blocked(u64 start) +{ + u32 delta = sched_clock() - start; + + __spin_time_accum(delta, spinlock_stats.histo_spin_blocked); + spinlock_stats.time_blocked += delta; +} + +static struct dentry *d_spin_debug; +static struct dentry *d_kvm_debug; + +struct dentry *kvm_init_debugfs(void) +{ + d_kvm_debug = debugfs_create_dir(kvm, NULL); + if (!d_kvm_debug) + printk(KERN_WARNING Could not create 'kvm' debugfs directory\n); + + return d_kvm_debug; +} + +static int __init kvm_spinlock_debugfs(void) +{ + struct dentry *d_kvm = kvm_init_debugfs(); + + if (d_kvm == NULL) +
[PATCH 0/6] Avoid soft lockup message when KVM is stopped by host
When a guest kernel is stopped by the host hypervisor it can look like a soft lockup to the guest kernel. This false warning can mask later soft lockup warnings which may be real. This patch series adds a method for a host hypervisor to communicate to a guest kernel that it is being stopped. The final patch in the series has the watchdog check this flag when it goes to issue a soft lockup warning and skip the warning if the guest knows it was stopped. It was attempted to solve this in Qemu, but the side effects of saving and restoring the clock and tsc for each vcpu put the wall clock of the guest behind by the amount of time of the pause. This forces a guest to have ntp running in order to keep the wall clock accurate. Eric B Munson (6): Add flag to indicate that a vm was stopped by the host Add functions to check if the host has stopped the vm Add ioctl for KVM_GUEST_STOPPED Add generic stubs for kvm stop check functions Add check for suspended vm in softlockup detector Add age out of guest paused flag arch/x86/include/asm/pvclock-abi.h |1 + arch/x86/include/asm/pvclock.h |7 arch/x86/kernel/kvmclock.c | 55 arch/x86/kvm/x86.c | 14 + include/asm-generic/pvclock.h | 19 include/linux/kvm.h|4 ++ include/linux/kvm_host.h |2 + kernel/watchdog.c |9 ++ 8 files changed, 111 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/pvclock.h -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] Add flag to indicate that a vm was stopped by the host
This flag will be used to check if the vm was stopped by the host when a soft lockup was detected. Signed-off-by: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/pvclock-abi.h |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h index 35f2d19..6167fd7 100644 --- a/arch/x86/include/asm/pvclock-abi.h +++ b/arch/x86/include/asm/pvclock-abi.h @@ -40,5 +40,6 @@ struct pvclock_wall_clock { } __attribute__((__packed__)); #define PVCLOCK_TSC_STABLE_BIT (1 0) +#define PVCLOCK_GUEST_STOPPED (1 1) #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_PVCLOCK_ABI_H */ -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] Add ioctl for KVM_GUEST_STOPPED
Now that we have a flag that will tell the guest it was suspended, create an interface for that communication using a KVM ioctl. Signed-off-by: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/pvclock.h |3 +++ arch/x86/kernel/kvmclock.c | 12 arch/x86/kvm/x86.c |5 + include/linux/kvm.h|2 ++ 4 files changed, 22 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 7d3ba41..9312814 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -3,6 +3,7 @@ #include linux/clocksource.h #include asm/pvclock-abi.h +#include linux/kvm_host.h /* some helper functions for xen and kvm pv clock sources */ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src); @@ -13,6 +14,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall, struct timespec *ts); void pvclock_resume(void); +void kvm_set_host_stopped(struct kvm_vcpu *vcpu); + bool kvm_check_and_clear_host_stopped(int cpu); /* diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index 8ddcfaf..f4fff3d 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -22,6 +22,7 @@ #include asm/msr.h #include asm/apic.h #include linux/percpu.h +#include linux/kvm_host.h #include asm/x86_init.h #include asm/reboot.h @@ -113,6 +114,17 @@ static void kvm_get_preset_lpj(void) preset_lpj = lpj; } +/* + * kvm_set_host_stopped() indicates to the guest kernel that it has been + * stopped by the hypervisor. This function will be called from the host only. + */ +void kvm_set_host_stopped(struct kvm_vcpu *vcpu) +{ + struct pvclock_vcpu_time_info *src = vcpu-arch.hv_clock; + src-flags |= PVCLOCK_GUEST_STOPPED; +} +EXPORT_SYMBOL_GPL(kvm_set_host_stopped); + bool kvm_check_and_clear_host_stopped(int cpu) { bool ret = false; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 84a28ea..16e029e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3264,6 +3264,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp, goto out; } + case KVM_PAUSE_GUEST: { + r = 0; + kvm_set_host_stopped(vcpu); + break; + } default: r = -EINVAL; } diff --git a/include/linux/kvm.h b/include/linux/kvm.h index aace6b8..a2697be 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -759,6 +759,8 @@ struct kvm_clock_data { #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce) /* Available with KVM_CAP_RMA */ #define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma) +/* VM is being stopped by host */ +#define KVM_PAUSE_GUEST _IO(KVMIO, 0xaa) #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 0) -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] Add generic stubs for kvm stop check functions
Signed-off-by: Eric B Munson emun...@mgebm.net --- include/asm-generic/pvclock.h | 19 +++ 1 files changed, 19 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/pvclock.h diff --git a/include/asm-generic/pvclock.h b/include/asm-generic/pvclock.h new file mode 100644 index 000..b42a8eb --- /dev/null +++ b/include/asm-generic/pvclock.h @@ -0,0 +1,19 @@ +#ifndef _ASM_GENERIC_PVCLOCK_H +#define _ASM_GENERIC_PVCLOCK_H + + +/* + * These functions are used by architectures that support kvm to avoid issuing + * false soft lockup messages. + */ +static inline bool kvm_check_and_clear_host_stopped(int cpu) +{ + return false; +} + +static inline void kvm_clear_guest_paused(struct kvm_vcpu *vcpu) +{ + return; +} + +#endif -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6] Add functions to check if the host has stopped the vm
When a host stops or suspends a VM it will set a flag to show this. The watchdog will use these functions to determine if a softlockup is real, or the result of a suspended VM. Signed-off-by: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/pvclock.h |2 ++ arch/x86/kernel/kvmclock.c | 19 +++ 2 files changed, 21 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index c59cc97..7d3ba41 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall, struct timespec *ts); void pvclock_resume(void); +bool kvm_check_and_clear_host_stopped(int cpu); + /* * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, * yielding a 64-bit result. diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index c1a0188..8ddcfaf 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -113,6 +113,25 @@ static void kvm_get_preset_lpj(void) preset_lpj = lpj; } +bool kvm_check_and_clear_host_stopped(int cpu) +{ + bool ret = false; + struct pvclock_vcpu_time_info *src; + + /* +* per_cpu() is safe here because this function is only called from +* timer functions where preemption is already disabled. +*/ + WARN_ON(!in_atomic()); + src = per_cpu(hv_clock, cpu); + if ((src-flags PVCLOCK_GUEST_STOPPED) != 0) { + src-flags = src-flags (~PVCLOCK_GUEST_STOPPED); + ret = true; + } + + return ret; +} + static struct clocksource kvm_clock = { .name = kvm-clock, .read = kvm_clock_get_cycles, -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] Add age out of guest paused flag
The KVM_GUEST_PAUSED flag will prevent a guest from compaining about a soft lockup but it can mask real soft lockups if the flag isn't cleared when it is no longer relevant. This patch adds a kvm ioctl that the hypervisor will use when it resumes a guest to start a timer for aging out the flag. The time out will be specified by the hypervisor in the ioctl call. Signed-off-by: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/pvclock.h |2 ++ arch/x86/kernel/kvmclock.c | 24 arch/x86/kvm/x86.c |9 + include/linux/kvm.h|2 ++ include/linux/kvm_host.h |2 ++ 5 files changed, 39 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 9312814..e8460b9 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -18,6 +18,8 @@ void kvm_set_host_stopped(struct kvm_vcpu *vcpu); bool kvm_check_and_clear_host_stopped(int cpu); +void kvm_clear_guest_paused(struct kvm_vcpu *vcpu, unsigned int length); + /* * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, * yielding a 64-bit result. diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index f4fff3d..f3f3935 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -23,6 +23,8 @@ #include asm/apic.h #include linux/percpu.h #include linux/kvm_host.h +#include linux/kobject.h +#include linux/sysfs.h #include asm/x86_init.h #include asm/reboot.h @@ -144,6 +146,28 @@ bool kvm_check_and_clear_host_stopped(int cpu) return ret; } +static void kvm_timer_clear_guest_paused(unsigned long vcpu_addr) +{ + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)vcpu_addr; + struct pvclock_vcpu_time_info *src = vcpu-arch.hv_clock; + src-flags = src-flags (!PVCLOCK_GUEST_STOPPED); +} + +/* + * Host has resumed the guest, we need to clear the guest paused flag so we + * don't mask any real soft lockups. + */ +void kvm_clear_guest_paused(struct kvm_vcpu *vcpu, unsigned int length) +{ + if (!timer_pending(vcpu-flag_timer)) + setup_timer(vcpu-flag_timer, + kvm_timer_clear_guest_paused, + (unsigned long)vcpu); + mod_timer(vcpu-flag_timer, + jiffies + (length * HZ)); +} +EXPORT_SYMBOL_GPL(kvm_clear_guest_paused); + static struct clocksource kvm_clock = { .name = kvm-clock, .read = kvm_clock_get_cycles, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 16e029e..b907b5a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3269,6 +3269,15 @@ long kvm_arch_vcpu_ioctl(struct file *filp, kvm_set_host_stopped(vcpu); break; } + case KVM_CLEAR_GUEST_PAUSED: { + unsigned int length; + r = -EFAULT; + if (copy_from_user(length, argp, sizeof length)) + goto out; + r = 0; + kvm_clear_guest_paused(vcpu, length); + break; + } default: r = -EINVAL; } diff --git a/include/linux/kvm.h b/include/linux/kvm.h index a2697be..f333274 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -761,6 +761,8 @@ struct kvm_clock_data { #define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma) /* VM is being stopped by host */ #define KVM_PAUSE_GUEST _IO(KVMIO, 0xaa) +/* Start the timer to clear the paused flag */ +#define KVM_CLEAR_GUEST_PAUSED _IO(KVMIO, 0xab) #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 0) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index eabb21a..e21be4b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -18,6 +18,7 @@ #include linux/msi.h #include linux/slab.h #include linux/rcupdate.h +#include linux/timer.h #include asm/signal.h #include linux/kvm.h @@ -152,6 +153,7 @@ struct kvm_vcpu { #endif struct kvm_vcpu_arch arch; + struct timer_list flag_timer; }; static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu) -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] Add check for suspended vm in softlockup detector
A suspended VM can cause spurious soft lockup warnings. To avoid these, the watchdog now checks if the kernel knows it was stopped by the host and skips the warning if so. Signed-off-by: Eric B Munson emun...@mgebm.net --- kernel/watchdog.c |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 36491cd..8107ba7 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -25,6 +25,7 @@ #include linux/sysctl.h #include asm/irq_regs.h +#include asm/pvclock.h #include linux/perf_event.h int watchdog_enabled = 1; @@ -292,6 +293,14 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) */ duration = is_softlockup(touch_ts); if (unlikely(duration)) { + /* +* If a virtual machine is stopped by the host it can look to +* the watchdog like a soft lockup, check to see if the host +* stopped the vm before we issue the warning +*/ + if (kvm_check_and_clear_host_stopped(smp_processor_id())) + return HRTIMER_RESTART; + /* only warn once */ if (__this_cpu_read(soft_watchdog_warn) == true) return HRTIMER_RESTART; -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for October 25
On 25.10.2011, at 17:32, Kevin Wolf kw...@redhat.com wrote: Am 25.10.2011 16:06, schrieb Anthony Liguori: On 10/25/2011 08:56 AM, Kevin Wolf wrote: Am 25.10.2011 15:05, schrieb Anthony Liguori: On 10/25/2011 07:35 AM, Kevin Wolf wrote: Am 24.10.2011 13:35, schrieb Paolo Bonzini: On 10/24/2011 01:04 PM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - What's left to merge for 1.0. I would still like to cache the default cache mode (probably to cache=writeback). We don't allow guests to toggle WCE yet which Anthony would have liked to see before doing the change. Is it a strict requirement? I don't see a way around it. If the default mode is cache=writeback, then we're open to data corruption in any guest where barrier=0. With guest togglable WCE, it ends up being a guest configuration issue so we can more or less defer responsibility. So do you think that offering a WCE inside the guest would be a real solution or just a way to have an excuse? No, it offers a mechanism to fix mistakes at run-time verses at start up time. This is true (in both directions). But I think it's independent from the right default. It also means that you can make template images that understand that they don't support barriers and change the WCE setting appropriately. Isn't that really a job for management tools? Christoph said that OSes don't usually change this by themselves, it would need an administrator manually changing the setting. But if we require that, we can just as well require that the administrator set cache=writethrough on the qemu command line. The administrator of the guest != the administrator of the host. But the administrator of the guest == the owner of the qemu instance, no? He should be the one to use the management tools and configure his VMs. Do you think it's a good idea to change the default mode w/o guest WCE toggle support? What's your view about older guests if we change the default mode? What's your main motivation for wanting to change the default mode? Because people are constantly complaining about the awful (cache=writethrough) performance they get before they are told they should use a different cache option. And they are right. The out-of-the-box experience with qemu's block performance really sucks. With qcow2 you mean, right? No, with any format, including raw. Which isn't surprising at all, O_SYNC makes writes very expensive. I'd be much more open to changing the default mode to cache=none FWIW since the risk of data loss there is much, much lower. I think people said that they'd rather not have cache=none as default because O_DIRECT doesn't work everywhere. Where doesn't it work these days? I know it doesn't work on tmpfs. I know it works on ext[234], btrfs, nfs. I think tmpfs was named (and failing to start with default settings on tmpfs would be nasty enough), but iirc Alex had another one. Yeah, IIRC NFS also failed on me :) Alex Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V2 5/5] kvm guest : pv-ticketlocks support for linux guests running on KVM hypervisor
On 10/23/2011 12:07 PM, Raghavendra K T wrote: This patch extends Linux guests running on KVM hypervisor to support pv-ticketlocks. Very early during bootup, paravirtualied KVM guest detects if the hypervisor has required feature (KVM_FEATURE_WAIT_FOR_KICK) to support pv-ticketlocks. If so, support for pv-ticketlocks is registered via pv_lock_ops. Signed-off-by: Srivatsa Vaddagiri va...@linux.vnet.ibm.com Signed-off-by: Suzuki Poulose suz...@in.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 2874c19..c7f34b7 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -195,10 +195,18 @@ void kvm_async_pf_task_wait(u32 token); void kvm_async_pf_task_wake(u32 token); u32 kvm_read_and_reset_pf_reason(void); extern void kvm_disable_steal_time(void); -#else + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +void __init kvm_guest_early_init(void); +#else /* CONFIG_PARAVIRT_SPINLOCKS */ +#define kvm_guest_early_init() do { } while (0) +#endif /* CONFIG_PARAVIRT_SPINLOCKS */ + +#else /* CONFIG_KVM_GUEST */ #define kvm_guest_init() do { } while (0) #define kvm_async_pf_task_wait(T) do {} while(0) #define kvm_async_pf_task_wake(T) do {} while(0) +#define kvm_guest_early_init() do { } while (0) static inline u32 kvm_read_and_reset_pf_reason(void) { return 0; diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c index 3bb0850..fb25bca 100644 --- a/arch/x86/kernel/head32.c +++ b/arch/x86/kernel/head32.c @@ -9,6 +9,7 @@ #include linux/start_kernel.h #include linux/mm.h #include linux/memblock.h +#include linux/kvm_para.h #include asm/setup.h #include asm/sections.h @@ -59,6 +60,8 @@ void __init i386_start_kernel(void) break; } + kvm_guest_early_init(); + /* * At this point everything still needed from the boot loader * or BIOS or kernel text should be early reserved or marked not diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c index 5655c22..cabf8ec 100644 --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -13,6 +13,7 @@ #include linux/start_kernel.h #include linux/io.h #include linux/memblock.h +#include linux/kvm_para.h #include asm/processor.h #include asm/proto.h @@ -115,6 +116,8 @@ void __init x86_64_start_reservations(char *real_mode_data) reserve_ebda_region(); + kvm_guest_early_init(); + /* * At this point everything still needed from the boot loader * or BIOS or kernel text should be early reserved or marked not diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index a9c2116..f4f341f 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -39,6 +39,16 @@ #include asm/desc.h #include asm/tlbflush.h +#ifdef CONFIG_PARAVIRT_SPINLOCKS + +#ifdef CONFIG_KVM_DEBUG_FS + +#include linux/debugfs.h + +#endif /* CONFIG_KVM_DEBUG_FS */ + +#endif /* CONFIG_PARAVIRT_SPINLOCKS */ This is a big mess. Is there any problem with including linux/debugfs.h unconditionally? Or at least using #if defined(CONFIG_PARAVIRT_SPINLOCKS) defined(CONFIG_KVM_DEBUG_FS)? J -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: Fix SDL exit on window close
于 2011年10月25日 19:46, Sasha Levin 写道: We've changed IPC to use sockets instead of signals, but the process of closing the SDL window was still trigerring an exit signal causing an ugly message and not cleaning up after itself. This patch switches that to use the proper method of cleaning up. The patch works, thanks. Osier -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
about NPIV with qemu-kvm.
hi, hannes: I want to use NPIV with qemu-kvm, I issued the following command: echo ':' /sys/class/fc_host/host0/vport_create and it will produce a new host6 and one vport succesfully, but it does not create any virtual hba pci device. so I don't know how to assign the virtual host to qemu-kvm. from your this mail, does array will first need to assign a lun to this vport? and through this new created disk, like device /dev/sdf, then I add qemu-kvm with -drive file=/dev/sdf,if=virtio... arguments? Regards. Suya. 2011/6/29, Hannes Reinecke h...@suse.de: On 06/29/2011 12:07 PM, Christoph Hellwig wrote: On Wed, Jun 29, 2011 at 10:39:42AM +0100, Stefan Hajnoczi wrote: I think we're missing a level of addressing. We need the ability to talk to multiple target ports in order for list target ports to make sense. Right now there is one implicit target that handles all commands. That means there is one fixed I_T Nexus. If we introduce list target ports we also need a way to say This CDB is destined for target port #0. Then it is possible to enumerate target ports and address targets independently of the LUN field in the CDB. I'm pretty sure this is also how SAS and other transports work. In their framing they include the target port. Yes, exactly. Hierachial LUNs are a nasty fringe feature that we should avoid as much as possible, that is for everything but IBM vSCSI which is braindead enough to force them. Yep. The question is whether we really need to support multiple targets on a virtio-scsi adapter or not. If you are selectively mapping LUNs that the guest may access, then multiple targets are not necessary. If we want to do pass-through of the entire SCSI bus then we need multiple targets but I'm not sure if there are other challenges like dependencies on the transport (Fibre Channel, SAS, etc) which make it impossible to pass through bus-level access? I don't think bus-level pass through is either easily possible nor desirable. What multiple targets are useful for is allowing more virtual disks than we have virtual PCI slots. We could do this by supporting multiple LUNs, but given that many SCSI ressources are target-based doing multiple targets most likely is the more scabale and more logical variant. E.g. we could much more easily have one virtqueue per target than per LUN. The general idea here is that we can support NPIV. With NPIV we'll have several scsi_hosts, each of which is assigned a different set of LUNs by the array. With virtio we need to able to react on LUN remapping on the array side, ie we need to be able to issue a 'REPORT LUNS' command and add/remove LUNs on the fly. This means we have to expose the scsi_host in some way via virtio. This is impossible with a one-to-one mapping between targets and LUNs. The actual bus-level pass-through will be just on the SCSI layer, ie 'REPORT LUNS' should be possible. If and how we do a LUN remapping internally on the host is a totally different matter. Same goes for the transport details; I doubt we will expose all the dingy details of the various transports, but rather restrict ourselves to an abstract transport. Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de+49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC v2 PATCH 5/4 PATCH] virtio-net: send gratuitous packet when needed
On 10/25/2011 11:41 PM, Michael S. Tsirkin wrote: On Tue, Oct 25, 2011 at 10:50:41AM +0800, Jason Wang wrote: On 10/24/2011 01:25 PM, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 02:54:59PM +1030, Rusty Russell wrote: On Sat, 22 Oct 2011 13:43:11 +0800, Jason Wang jasow...@redhat.com wrote: This make let virtio-net driver can send gratituous packet by a new config bit - VIRTIO_NET_S_ANNOUNCE in each config update interrupt. When this bit is set by backend, the driver would schedule a workqueue to send gratituous packet through NETDEV_NOTIFY_PEERS. This feature is negotiated through bit VIRTIO_NET_F_GUEST_ANNOUNCE. Signed-off-by: Jason Wang jasow...@redhat.com This seems like a huge layering violation. Imagine this in real hardware, for example. commits 06c4648d46d1b757d6b9591a86810be79818b60c and 99606477a5888b0ead0284fecb13417b1da8e3af document the need for this: NETDEV_NOTIFY_PEERS notifier indicates that a device moved to a different physical link. and In real hardware such notifications are only generated when the device comes up or the address changes. So hypervisor could get the same behaviour by sending link up/down events, this is just an optimization so guest won't do unecessary stuff like try to reconfigure an IP address. Maybe LOCATION_CHANGE would be a better name? ANNOUNCE_SELF? It would be nice to formulate what kind of event are we notifying the guest about. The announce part of it is really up to the guest, isn't it? Right. There may be a good reason why virtual devices might want this kind of reconfiguration cheat, which is unnecessary for normal machines, I think yes, the difference with real hardware is guest can change location without link getting dropped. FWIW, Xen seems to use this capability too. So does ms netvsc. but it'd have to be spelled out clearly in the spec to justify it... Cheers, Rusty. Agree, and I'd like to see the spec too. The interface seems to involve the guest clearing the status bit when it detects an event? I would describe this in spec. The interface need guest to clear the status bit, this would let the back-end know it has finished the work as we may need to send the gratuitous packets many times. Also - how does it interact with the link up event? We probably don't want to schedule this when we detect a link status change or during initialization, as this patch seems to do? What if link goes down while the work is running? Is that OK? Looks like there's are duplications if guest enable arp_notify vm is started, How hard would it be to avoid these duplicates? Not hard, it could be done in backend by distinguishing the reason : fresh start or cont after migration or stop. but we need to handle the situation that resuming a stopped virtual machine. For the link down race, I don't see any real issue, either dropping or queued. For example, you do unregister_netdev(vi-dev); cancel_work_sync(vi-announce); which looks scary as announce seems to use the netdev. oops, it's a bug, I would fix it. Thanks -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: about NPIV with qemu-kvm.
On 10/26/2011 06:40 AM, ya su wrote: hi, hannes: I want to use NPIV with qemu-kvm, I issued the following command: echo ':' /sys/class/fc_host/host0/vport_create and it will produce a new host6 and one vport succesfully, but it does not create any virtual hba pci device. so I don't know how to assign the virtual host to qemu-kvm. Well, you can't. There is no mechanism for. When using NPIV you need to pass in the individual LUNs via eg virtio-blk. from your this mail, does array will first need to assign a lun to this vport? and through this new created disk, like device /dev/sdf, then I add qemu-kvm with -drive file=/dev/sdf,if=virtio... arguments? Yes. That's what you need to do. Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html