Re: Clock jumps
Adding kvm to CC.

On Mon, May 24, 2010 at 04:06:32PM +, Orion Poplawski wrote:
> I have a KVM virtual machine running 2.6.33.4-95.fc13.x86_64 on a CentOS 5.5
> host whose clock jumps about 8-12 hours a couple times a day. I have no idea
> what is causing it. Fedora 12 and CentOS 5.5 KVM machines run fine on the
> same host. Is there any debugging I can enable to see what is jumping the
> clock?
>
> kvm-clock: cpu 0, msr 0:1ba4741, boot clock
> kvm-clock: cpu 0, msr 0:1e15741, primary cpu clock
> Switching to clocksource kvm-clock
> rtc_cmos 00:01: setting system clock to 2010-05-20 16:59:48 UTC (1274374788)
>
> Thanks,
> Orion
> --
> To unsubscribe from this list: send the line unsubscribe linux-kernel in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] KVM: VMX: Enable XSAVE/XRSTORE for guest
On Monday 24 May 2010 21:36:12 Avi Kivity wrote:
> On 05/24/2010 01:03 PM, Sheng Yang wrote:
> > From: Dexuan Cui dexuan@intel.com
> >
> > Enable XSAVE/XRSTORE for guest.
> >
> > Change from V3:
> > 1. Enforced the assumption that host OS would use all available xstate bits.
> > 2. Various fixes, addressed Avi's comments.
> >
> > I am still not clear about why we need to reload guest xcr0 when
> > cr4.osxsave set...
>
> When cr4.osxsave=0, then the guest executes with the host xcr0 (since
> xgetbv will trap; this is similar to the guest running with the host fpu
> if cr0.ts=0). So if cr4.osxsave transitions, we need to transition xcr0 as
> well.

Yes...

> > @@ -3354,6 +3356,29 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu)
> >          return 1;
> >  }
> >
> > +static int handle_xsetbv(struct kvm_vcpu *vcpu)
> > +{
> > +        u64 new_bv = kvm_read_edx_eax(vcpu);
> > +
> > +        if (kvm_register_read(vcpu, VCPU_REGS_RCX) != 0)
> > +                goto err;
> > +        if (vmx_get_cpl(vcpu) != 0)
> > +                goto err;
> > +        if (!(new_bv & XSTATE_FP))
> > +                goto err;
> > +        if ((new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE))
> > +                goto err;
> > +        if (new_bv & ~XCNTXT_MASK)
> > +                goto err;
>
> Ok. This means we must update kvm immediately when XCNTXT_MASK changes.
> (Otherwise we would use KVM_XCNTXT_MASK which is always smaller than
> XCNTXT_MASK).

I guess use host_xcr0 here is better?

> > +        vcpu->arch.xcr0 = new_bv;
> > +        xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
> > +        skip_emulated_instruction(vcpu);
> > +        return 1;
> > +err:
> > +        kvm_inject_gp(vcpu, 0);
> > +        return 1;
> > +}
> > +
> > @@ -4124,6 +4176,8 @@ int kvm_arch_init(void *opaque)
> >
> >          perf_register_guest_info_callbacks(&kvm_guest_cbs);
> >
> > +        host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
> > +
> >          return 0;
>
> Will fault on old cpu.
>
> > ...
> >  EXPORT_SYMBOL_GPL(fx_init);
> > @@ -5134,6 +5195,12 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
> >
> >          vcpu->guest_fpu_loaded = 1;
> >          unlazy_fpu(current);
> > +        /*
> > +         * Restore all possible states in the guest,
> > +         * and assume host would use all available bits.
> > +         */
> > +        if (cpu_has_xsave && vcpu->arch.xcr0)
> > +                xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
> >          fpu_restore_checking(&vcpu->arch.guest_fpu);
>
> I think we need to reload xcr0 now to the guest's value.
>
> >          trace_kvm_fpu(1);
> >  }
> >
> > @@ -5144,6 +5211,13 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
> >                  return;
> >
> >          vcpu->guest_fpu_loaded = 0;
> > +        /*
> > +         * Save all possible states in the guest,
> > +         * and assume host would use all available bits.
> > +         * Also load host_xcr0 for host usage.
> > +         */
> > +        if (cpu_has_xsave && vcpu->arch.xcr0)
> > +                xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
> >          fpu_save_init(&vcpu->arch.guest_fpu);
> >          ++vcpu->stat.fpu_reload;
> >          set_bit(KVM_REQ_DEACTIVATE_FPU, &vcpu->requests);
>
> This might be unnecessary. So far xcr0 life cycle is almost that of
> save_host_state()/load_host_state(), but not exactly. When loading the
> guest fpu we switch temporarily to host xcr0, then we have to switch back,
> but only if gcr4.osxsave. When saving the guest fpu, we're already using
> the host xcr0:
>
>   void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>   {
>           kvm_x86_ops->vcpu_put(vcpu);
>           kvm_put_guest_fpu(vcpu);
>   }
>
> One way to simplify this is to have a vcpu->guest_xcr0_loaded flag and
> check it when needed. So the transition matrix is:
>
>   save_host_state: if gcr4.osxsave, set guest_xcr0_loaded, load it
>   set gcr4.osxsave: ditto
>   clear gcr4.osxsave: do nothing
>   load_host_state: if guest_xcr0_loaded, clear it, reload host xcr0
>   fpu switching: if (switched) switch; reload fpu; if (switched) switch
>
> may be simplified if we move xcr0 reload back to guest entry (... :) but
> make it lazy:
>
>   save_host_state: nothing
>   set cr4.osxsave: nothing
>   clear cr4.osxsave: nothing
>   guest entry: if (gcr4.osxsave && !guest_xcr0_loaded) {
>                    guest_xcr0_loaded = true, load gxcr0 }
>   load_host_state: if (guest_xcr0_loaded) {
>                    guest_xcr0_loaded = false; load host xcr0 }
>   fpu switching: if (guest_xcr0_loaded) {
>                    guest_xcr0_loaded = false; load host xcr0 }, do fpu stuff
>
> So we delay xcr0 reload as late as possible for both entry and exit.

I think I got it. But why do we need to do it at load_host_state()? I guess
just putting the code before the fpu testing in kvm_put_guest_fpu() is fine?

--
regards
Yang, Sheng
Re: [Qemu-devel] KVM call agenda for May 25
On Mon, May 24, 2010 at 05:21:04PM -0700, Chris Wright wrote:
> Please send in any agenda items you are interested in covering.

Sorry for the delayed response. If the community is interested, I would like
to discuss the Generic Asynchronous task offloading framework patches posted
to the community on 24th May 2010.

URL: http://lists.gnu.org/archive/html/qemu-devel/2010-05/msg02227.html

Brief Description:
The patch series extracts the task offloading framework code from
posix-aio-compat.c, which is currently used only by the paio subsystem, to
create a generic task offloading framework that could be used by other
subsystems within qemu. Currently virtio-9p and asynchronous-encoding from
the vnc server can make use of the generic framework.

Points for discussion:
- Is a generic task offloading framework the way to go for subsystems such
  as virtio-9p, which would like to emulate the AIO behaviour that allows us
  to free the vcpu thread to handle any other guest requests?
- Currently the AIO helper threads indicate the completion of the task to
  the IO-thread by sending a SIGUSR2, the handler for which does a write()
  to the file descriptor on which the IO thread is waiting using a select.
  Should we use this signal-handling mechanism to communicate between the
  generic asynchronous helper threads and the IO-Thread?

> If we have a lack of agenda items I'll cancel the week's call.
>
> thanks,
> -chris

--
Thanks and Regards
gautham
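
For readers unfamiliar with the completion mechanism described in the second discussion point, here is a minimal sketch of the general pattern (a signal handler that writes to a pipe which the waiting thread watches with select()). It is not the code from posix-aio-compat.c; the names wake_fds, wakeup_init() and wait_for_completion() are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical names: nothing below is taken from posix-aio-compat.c. */
static int wake_fds[2] = { -1, -1 };

/* Signal handler: only async-signal-safe calls are allowed here. */
static void completion_handler(int signo)
{
    int saved_errno = errno;
    char byte = 1;
    ssize_t n;

    (void)signo;
    /* If the pipe is full, a wake-up is already pending, so ignore EAGAIN. */
    n = write(wake_fds[1], &byte, 1);
    (void)n;
    errno = saved_errno;
}

/* Create the pipe and install the SIGUSR2 handler. Returns 0 on success. */
int wakeup_init(void)
{
    struct sigaction sa;
    int i;

    if (pipe(wake_fds) < 0)
        return -1;
    for (i = 0; i < 2; i++) {
        int flags = fcntl(wake_fds[i], F_GETFL);
        if (flags < 0 || fcntl(wake_fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = completion_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Wait in select() until a completion arrives or the timeout expires.
 * Returns the number of wake-up bytes drained, or -1 on timeout or error. */
int wait_for_completion(int timeout_sec)
{
    char buf[64];
    ssize_t n;
    int rc;

    assert(wake_fds[0] >= 0);
    do {
        fd_set rfds;
        struct timeval tv = { timeout_sec, 0 };

        FD_ZERO(&rfds);
        FD_SET(wake_fds[0], &rfds);
        rc = select(wake_fds[0] + 1, &rfds, NULL, NULL, &tv);
    } while (rc < 0 && errno == EINTR);
    if (rc <= 0)
        return -1;
    n = read(wake_fds[0], buf, sizeof(buf));
    return n > 0 ? (int)n : -1;
}
```

A helper thread would signal completion with kill() or pthread_kill(); the waiting thread then finds the pipe readable. Whether a generic framework should keep the signal hop or write to the descriptor directly from the helper thread is exactly the open question above.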
Re: [PATCH v4] KVM: VMX: Enable XSAVE/XRSTORE for guest
On 05/25/2010 09:28 AM, Sheng Yang wrote:
> > > @@ -3354,6 +3356,29 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu)
> > >          return 1;
> > >  }
> > >
> > > +static int handle_xsetbv(struct kvm_vcpu *vcpu)
> > > +{
> > > +        u64 new_bv = kvm_read_edx_eax(vcpu);
> > > +
> > > +        if (kvm_register_read(vcpu, VCPU_REGS_RCX) != 0)
> > > +                goto err;
> > > +        if (vmx_get_cpl(vcpu) != 0)
> > > +                goto err;
> > > +        if (!(new_bv & XSTATE_FP))
> > > +                goto err;
> > > +        if ((new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE))
> > > +                goto err;
> > > +        if (new_bv & ~XCNTXT_MASK)
> > > +                goto err;
> >
> > Ok. This means we must update kvm immediately when XCNTXT_MASK changes.
> > (Otherwise we would use KVM_XCNTXT_MASK which is always smaller than
> > XCNTXT_MASK).
>
> I guess use host_xcr0 here is better?

Yes - it might be smaller than XCNTXT_MASK

> > may be simplified if we move xcr0 reload back to guest entry (... :) but
> > make it lazy:
> >
> >   save_host_state: nothing
> >   set cr4.osxsave: nothing
> >   clear cr4.osxsave: nothing
> >   guest entry: if (gcr4.osxsave && !guest_xcr0_loaded) {
> >                    guest_xcr0_loaded = true, load gxcr0 }
> >   load_host_state: if (guest_xcr0_loaded) {
> >                    guest_xcr0_loaded = false; load host xcr0 }
> >   fpu switching: if (guest_xcr0_loaded) {
> >                    guest_xcr0_loaded = false; load host xcr0 }, do fpu stuff
> >
> > So we delay xcr0 reload as late as possible for both entry and exit.
>
> I think I got it. But why we need do it at load_host_state()? I guess just
> put code before fpu testing in kvm_put_guest_fpu() is fine?

Right, load_host_state() is bad because it is vmx specific.
kvm_put_guest_fpu() (or perhaps kvm_arch_vcpu_put()) is better.

--
error compiling committee.c: too many arguments to function
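
The lazy scheme proposed in this thread can be modelled in plain C as a rough sketch. This is not KVM code: struct vcpu_model, the hw_xcr0 field standing in for the real register, and the function names are all assumptions made for illustration, and the counter exists only to show how few register writes the flag allows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of one vcpu; hw_xcr0 stands in for the real XCR0. */
struct vcpu_model {
    bool guest_osxsave;      /* guest CR4.OSXSAVE */
    bool guest_xcr0_loaded;  /* is the guest value currently in hw_xcr0? */
    uint64_t guest_xcr0;
    uint64_t host_xcr0;
    uint64_t hw_xcr0;
    unsigned xsetbv_count;   /* number of simulated register writes */
};

static void model_xsetbv(struct vcpu_model *v, uint64_t val)
{
    v->hw_xcr0 = val;
    v->xsetbv_count++;
}

/* Guest entry: load the guest value only if needed and not yet loaded. */
void model_guest_entry(struct vcpu_model *v)
{
    assert(v != NULL);
    if (v->guest_osxsave && !v->guest_xcr0_loaded) {
        v->guest_xcr0_loaded = true;
        model_xsetbv(v, v->guest_xcr0);
    }
}

/* Called before the host touches FPU state (for example from the
 * put-guest-fpu path): restore the host value only if the guest's is live. */
void model_restore_host(struct vcpu_model *v)
{
    assert(v != NULL);
    if (v->guest_xcr0_loaded) {
        v->guest_xcr0_loaded = false;
        model_xsetbv(v, v->host_xcr0);
    }
}
```

Repeated entries or repeated restores cost nothing extra because the flag short-circuits them, which is the point of delaying the reload on both sides.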
Re: [PATCH] vhost: Fix host panic if ioctl called with wrong index
On Tue, May 25, 2010 at 11:10:36AM +0530, Krishna Kumar wrote:
> From: Krishna Kumar krkum...@in.ibm.com
>
> Missed a boundary value check in vhost_set_vring. The host panics if
> idx == nvqs is used in ioctl commands in vhost_virtqueue_init.
>
> Signed-off-by: Krishna Kumar krkum...@in.ibm.com

Thanks, applied.

> ---
>  drivers/vhost/vhost.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff -ruNp org/drivers/vhost/vhost.c new/drivers/vhost/vhost.c
> --- org/drivers/vhost/vhost.c 2010-05-24 09:25:57.0 +0530
> +++ new/drivers/vhost/vhost.c 2010-05-24 09:26:53.0 +0530
> @@ -374,7 +374,7 @@ static long vhost_set_vring(struct vhost
>          r = get_user(idx, idxp);
>          if (r < 0)
>                  return r;
> -        if (idx > d->nvqs)
> +        if (idx >= d->nvqs)
>                  return -ENOBUFS;
>
>          vq = d->vqs + idx;
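
To make the off-by-one concrete, here is a small stand-alone sketch of the corrected check. The demo_ types and demo_get_vq() are hypothetical stand-ins, not the vhost structures; only the comparison mirrors the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device and its virtqueue array. */
struct demo_vq { int num; };
struct demo_dev {
    struct demo_vq *vqs;
    unsigned int nvqs;   /* number of valid entries in vqs[] */
};

/* Valid indexes are 0 .. nvqs-1, so idx == nvqs must be rejected too.
 * With "idx > nvqs" the caller would get a pointer one past the array. */
int demo_get_vq(struct demo_dev *d, unsigned int idx, struct demo_vq **out)
{
    assert(d != NULL && out != NULL);
    if (idx >= d->nvqs)
        return -ENOBUFS;
    *out = d->vqs + idx;
    return 0;
}
```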
Re: [Qemu-devel] [PATCH 1/5] trace: Add trace-events file for declaring trace events
On Mon, May 24, 2010 at 11:20 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote:
>> +# check if trace backend exists
>> +
>> +sh tracetool --$trace_backend --check-backend > /dev/null 2> /dev/null
>
> This will fail if objdir != srcdir. You have to qualify tracetool with the
> path to srcdir.

Thanks Anthony, fixed on my branch. I'll resend a v2 together with other
fixes.

Stefan
Re: Windows guest debugging on KVM/Qemu
On 05/24/2010 11:07 PM, Neo Jia wrote:
> hi,
>
> I am using KVM/Qemu to debug my Windows guest according to the KVM wiki page
> (http://www.linux-kvm.org/page/WindowsGuestDrivers/GuestDebugging). It works
> for me and also I can only use one Windows guest and bind its serial port to
> a TCP port and run Virtual Serial Ports Emulator on my Windows dev machine.
>
> The problem is that this kind of connection is really slow. Is there any
> known issue with the KVM serial port driver?
>
> There was a good discussion about the same issue one year ago. Not sure if
> there has been any improvement since then.

How slow? Can you measure it (without a debugger, just guest-to-guest file
transfer)?

slirp used to be ridiculously slow but some recent change made it fairly
fast. Probably a missing wakeup, perhaps serial has the same problem.

In any case I recommend testing with qemu-kvm.git master.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: [RFC PATCH] AMD IOMMU emulation
On Mon, May 24, 2010 at 08:10:16PM +, Blue Swirl wrote:
> On Mon, May 24, 2010 at 3:40 PM, Joerg Roedel j...@8bytes.org wrote:
> >> +
> >> +#define MMIO_SIZE 0x2028
> >
> > This size should be a power-of-two value. In this case probably 0x4000.
>
> Not really, the devices can reserve regions of any size. There were
> some implementation deficiencies in earlier versions of QEMU, where
> the whole page would be reserved anyway, but this limitation has been
> removed long time ago.

The drivers for AMD IOMMU expect that to be 0x4000. At least the Linux
driver maps the MMIO region with this size. So the emulation should reserve
this amount of MMIO space too.

Joerg
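
As a side note on the arithmetic only: rounding 0x2028 up to the next power of two does give the 0x4000 mentioned above. The helper below is a hypothetical illustration, not part of the patch, and it does not settle whether QEMU itself requires power-of-two region sizes.

```c
#include <assert.h>
#include <stdint.h>

/* Round size up to the next power of two (hypothetical helper).
 * 0x2028 lies between 0x2000 and 0x4000, so it rounds up to 0x4000. */
uint32_t round_up_pow2(uint32_t size)
{
    uint32_t p = 1;

    assert(size > 0 && size <= 0x80000000u);
    while (p < size)
        p <<= 1;
    return p;
}
```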
[RFC PATCH 19/23] Introduce ft_tranx_ready(), and modify migrate_fd_put_ready() when ft_mode is on.
Introduce ft_tranx_ready() which kicks the FT transaction cycle. When
ft_mode is on, migrate_fd_put_ready() opens the ft_transaction file and
turns on event_tap. To end or cancel ft_transaction, ft_mode and event_tap
are turned off.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration.c | 78 --
 1 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/migration.c b/migration.c
index 2adf7ad..5b90d37 100644
--- a/migration.c
+++ b/migration.c
@@ -21,6 +21,7 @@
 #include "qemu_socket.h"
 #include "block-migration.h"
 #include "qemu-objects.h"
+#include "event-tap.h"

 //#define DEBUG_MIGRATION

@@ -375,6 +376,49 @@ void migrate_fd_connect(FdMigrationState *s)
     migrate_fd_put_ready(s);
 }

+static int ft_tranx_ready(void)
+{
+    FdMigrationState *s = migrate_to_fms(current_migration);
+    int ret = -1;
+
+    if (ft_mode != FT_TRANSACTION && ft_mode != FT_INIT) {
+        return ret;
+    }
+
+    if (qemu_transaction_begin(s->file) < 0) {
+        fprintf(stderr, "tranx_begin failed\n");
+        goto error_out;
+    }
+
+    /* make the VM state consistent by flushing outstanding requests. */
+    vm_stop(0);
+    qemu_aio_flush();
+    bdrv_flush_all();
+
+    if (qemu_savevm_state_all(s->mon, s->file) < 0) {
+        fprintf(stderr, "savevm_state_all failed\n");
+        goto error_out;
+    }
+
+    if (qemu_transaction_commit(s->file) < 0) {
+        fprintf(stderr, "tranx_commit failed\n");
+        goto error_out;
+    }
+
+    ret = 0;
+    goto unpause_out;
+
+error_out:
+    ft_mode = FT_OFF;
+    qemu_savevm_state_cancel(s->mon, s->file);
+    migrate_fd_cleanup(s);
+    event_tap_unregister();
+
+unpause_out:
+    vm_start();
+    return ret;
+}
+
 void migrate_fd_put_ready(void *opaque)
 {
     FdMigrationState *s = opaque;
@@ -402,8 +446,30 @@ void migrate_fd_put_ready(void *opaque)
         } else {
             state = MIG_STATE_COMPLETED;
         }
-        migrate_fd_cleanup(s);
-        s->state = state;
+
+        if (ft_mode && state == MIG_STATE_COMPLETED) {
+            /* close buffered_file and open ft_transaction.
+             * Note: file descriptor won't get closed,
+             * but reused by ft_transaction. */
+            socket_set_block(s->fd);
+            socket_set_nodelay(s->fd);
+            qemu_fclose(s->file);
+            s->file = qemu_fopen_ops_ft_tranx(s,
+                                              migrate_fd_put_buffer,
+                                              migrate_fd_get_buffer,
+                                              migrate_fd_close,
+                                              1);
+
+            /* events are tapped from now. */
+            event_tap_register(ft_tranx_ready);
+
+            if (old_vm_running) {
+                vm_start();
+            }
+        } else {
+            migrate_fd_cleanup(s);
+            s->state = state;
+        }
     }
 }

@@ -423,8 +489,14 @@ void migrate_fd_cancel(MigrationState *mig_state)
     DPRINTF("cancelling migration\n");

     s->state = MIG_STATE_CANCELLED;
-    qemu_savevm_state_cancel(s->mon, s->file);

+    if (ft_mode == FT_TRANSACTION) {
+        qemu_transaction_cancel(s->file);
+        ft_mode = FT_OFF;
+        event_tap_unregister();
+    }
+
+    qemu_savevm_state_cancel(s->mon, s->file);
     migrate_fd_cleanup(s);
 }

--
1.7.0.31.g1df487
[RFC PATCH 23/23] Add a parser to accept FT migration incoming mode.
The option looks like:

    -incoming protocol:address:port,ft_mode

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration.c | 14 +-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/migration.c b/migration.c
index 3334650..a4850f9 100644
--- a/migration.c
+++ b/migration.c
@@ -42,7 +42,19 @@ static MigrationState *current_migration;

 void qemu_start_incoming_migration(const char *uri)
 {
-    const char *p;
+    const char *p = uri;
+
+    /* check ft_mode option */
+    while (*p != '\0') {
+        if (*p == ',') {
+            p++;
+            if (!strcmp(p, "ft_mode")) {
+                ft_mode = FT_INIT;
+                break;
+            }
+        }
+        p++;
+    }

     if (strstart(uri, "tcp:", &p))
         tcp_start_incoming_migration(p);
--
1.7.0.31.g1df487
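
One possible way to check for a comma-separated option such as ft_mode is sketched below. This is an illustration rather than the patch's parser: uri_has_option() is a hypothetical name, and the port numbers in the usage are made up. Unlike the strcmp() in the patch, it also matches when the option is not the last one, and it never advances past the string terminator when the URI ends in a comma.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if opt appears as a whole comma-separated token after the
 * first comma in uri (hypothetical helper, not a QEMU function). */
bool uri_has_option(const char *uri, const char *opt)
{
    size_t optlen;
    const char *p;

    assert(uri != NULL && opt != NULL);
    optlen = strlen(opt);
    for (p = strchr(uri, ','); p != NULL; p = strchr(p, ',')) {
        p++;    /* step over the comma to the start of the token */
        if (strncmp(p, opt, optlen) == 0 &&
            (p[optlen] == '\0' || p[optlen] == ','))
            return true;
    }
    return false;
}
```

For example, uri_has_option("tcp:0.0.0.0:4444,ft_mode", "ft_mode") is true, while a URI without the suffix, or with a longer token such as ft_mode_x, is rejected.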
[RFC PATCH 16/23] Insert event_tap_mmio() to cpu_physical_memory_rw().
Record mmio write event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 exec.c |4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index d5c2a05..e9ed477 100644
--- a/exec.c
+++ b/exec.c
@@ -44,6 +44,7 @@
 #include "hw/hw.h"
 #include "osdep.h"
 #include "kvm.h"
+#include "event-tap.h"
 #if defined(CONFIG_USER_ONLY)
 #include <qemu.h>
 #include <signal.h>
@@ -3373,6 +3374,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
                 io_index = (pd >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
                 if (p)
                     addr1 = (addr & ~TARGET_PAGE_MASK) + p->region_offset;
+
+                event_tap_mmio(addr, buf, len);
+
                 /* XXX: could force cpu_single_env to NULL to avoid
                    potential bugs */
                 if (l >= 4 && ((addr1 & 3) == 0)) {
--
1.7.0.31.g1df487
[RFC PATCH 17/23] Skip assert() when event_tap_state isn't EVENT_TAP_OFF.
Skip assert(!cpu_single_env) in resume_all_threads() when event_tap_state
isn't EVENT_TAP_OFF.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 qemu-kvm.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 1414f49..e28bf59 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -18,6 +18,7 @@
 #include "compatfd.h"
 #include "gdbstub.h"
 #include "monitor.h"
+#include "event-tap.h"

 #include "qemu-kvm.h"
 #include "libkvm.h"
@@ -1770,7 +1771,8 @@ static void resume_all_threads(void)
 {
     CPUState *penv = first_cpu;

-    assert(!cpu_single_env);
+    if (event_tap_get_state() == EVENT_TAP_OFF)
+        assert(!cpu_single_env);

     while (penv) {
         penv->stop = 0;
--
1.7.0.31.g1df487
[RFC PATCH 04/23] Use cpu_physical_memory_get_dirty_range() to check multiple dirty pages.
Modifies ram_save_block() and ram_save_remaining() to use
cpu_physical_memory_get_dirty_range() to check multiple dirty and non-dirty
pages at once.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 vl.c | 52 +---
 1 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/vl.c b/vl.c
index 729c955..70a8aed 100644
--- a/vl.c
+++ b/vl.c
@@ -2779,7 +2779,8 @@ static int ram_save_block(QEMUFile *f)
     static ram_addr_t current_addr = 0;
     ram_addr_t saved_addr = current_addr;
     ram_addr_t addr = 0;
-    int found = 0;
+    ram_addr_t dirty_rams[HOST_LONG_BITS];
+    int i, found = 0;

     while (addr < last_ram_offset) {
         if (kvm_enabled() && current_addr == 0) {
@@ -2791,28 +2792,33 @@ static int ram_save_block(QEMUFile *f)
                 return 0;
             }
         }
-        if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
+        if ((found = cpu_physical_memory_get_dirty_range(
+                 current_addr, last_ram_offset, dirty_rams, HOST_LONG_BITS,
+                 MIGRATION_DIRTY_FLAG))) {
             uint8_t *p;

-            cpu_physical_memory_reset_dirty(current_addr,
-                                            current_addr + TARGET_PAGE_SIZE,
-                                            MIGRATION_DIRTY_FLAG);
+            for (i = 0; i < found; i++) {
+                ram_addr_t page_addr = dirty_rams[i];
+                cpu_physical_memory_reset_dirty(page_addr,
+                                                page_addr + TARGET_PAGE_SIZE,
+                                                MIGRATION_DIRTY_FLAG);

-            p = qemu_get_ram_ptr(current_addr);
+                p = qemu_get_ram_ptr(page_addr);

-            if (is_dup_page(p, *p)) {
-                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, *p);
-            } else {
-                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
-                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+                if (is_dup_page(p, *p)) {
+                    qemu_put_be64(f, page_addr | RAM_SAVE_FLAG_COMPRESS);
+                    qemu_put_byte(f, *p);
+                } else {
+                    qemu_put_be64(f, page_addr | RAM_SAVE_FLAG_PAGE);
+                    qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+                }
             }

-            found = 1;
             break;
+        } else {
+            addr += dirty_rams[0];
+            current_addr = (saved_addr + addr) % last_ram_offset;
         }
-        addr += TARGET_PAGE_SIZE;
-        current_addr = (saved_addr + addr) % last_ram_offset;
     }

     return found;
@@ -2822,12 +2828,20 @@ static uint64_t bytes_transferred;

 static ram_addr_t ram_save_remaining(void)
 {
-    ram_addr_t addr;
+    ram_addr_t addr = 0;
     ram_addr_t count = 0;
+    ram_addr_t dirty_rams[HOST_LONG_BITS];
+    int found = 0;

-    for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
-        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
-            count++;
+    while (addr < last_ram_offset) {
+        if ((found = cpu_physical_memory_get_dirty_range(
+                 addr, last_ram_offset, dirty_rams, HOST_LONG_BITS,
+                 MIGRATION_DIRTY_FLAG))) {
+            count += found;
+            addr = dirty_rams[found - 1] + TARGET_PAGE_SIZE;
+        } else {
+            addr += dirty_rams[0];
+        }
     }

     return count;
--
1.7.0.31.g1df487
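
The benefit of checking many pages at once can be illustrated with a plain bitmap, one bit per page. The sketch below is not cpu_physical_memory_get_dirty_range() (that function is introduced in another patch of the series and is not shown here); count_dirty_pages() and DEMO_BITS_PER_LONG are hypothetical names, and the only idea borrowed is skipping a whole machine word of clean pages in one step.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Count set bits (dirty pages) among the first nr_pages bits of bitmap.
 * The bitmap must hold at least nr_pages bits, rounded up to whole words.
 * A word that is entirely zero is skipped without testing each bit. */
size_t count_dirty_pages(const unsigned long *bitmap, size_t nr_pages)
{
    size_t w, count = 0;

    assert(bitmap != NULL);
    for (w = 0; w * DEMO_BITS_PER_LONG < nr_pages; w++) {
        unsigned long word = bitmap[w];
        size_t base = w * DEMO_BITS_PER_LONG;
        size_t bit;

        if (word == 0)
            continue;   /* a whole word of clean pages */
        for (bit = 0; bit < DEMO_BITS_PER_LONG && base + bit < nr_pages; bit++) {
            if (word & (1UL << bit))
                count++;
        }
    }
    return count;
}
```

When most of guest memory is clean, nearly every iteration takes the early continue, so the scan cost is closer to one test per word than one test per page.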
[RFC PATCH 00/23] Kemari for KVM v0.1.1
Hi,

This patch series is a revised version of Kemari for KVM, which applies the
comments on the previous post. The current code is based on qemu-kvm.git
2b644fd0e737407133c88054ba498e772ce01f27.

In contrast to the previous version, this series doesn't require any
modifications to KVM. The I/O events are captured in the net/block layer
instead of the device emulation layer. The transmission/transaction
protocol and most of the control logic are implemented in QEMU.

We prepared a demonstration video again. This time the guest is Windows XP
without virtio drivers. The demonstration scenario is:

1. Play with a guest VM (This guest has e1000 and ide)
   # The guest image should be a NFS/SAN.
2. Start incoming side with,
   -incoming protocol:address:port,ft_mode
3. Start Kemari to synchronize the VM by running the following command in
   QEMU. Just add -k option to usual migrate command.
   migrate -d -k tcp:192.168.0.20:
4. Check the status by calling info migrate.
5. Go back to the VM to play the pinball.
6. Kill the VM. (VNC client also disappears)
7. Press c to continue the VM on the other host.
8. Bring up the VNC client (Sorry, it pops outside of video capture.)
9. Confirm that the pinball works, then shutdown.

http://www.osrg.net/kemari/download/kemari-kvm-winxp.mov

The repository contains all patches we're sending with this message. For
those who want to try, please pull the following repository.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari

The changes from v0.1 -> v0.1.1 are:

- events are tapped in net/block layer instead of device emulation layer.
- Introduce a new option for -incoming to accept FT transaction.
- Removed writev() support to QEMUFile and FdMigrationState for now. I
  would post this work in a different series.
- Modified virtio-blk save/load handler to send inuse variable to correctly
  replay.
- Removed configure --enable-ft-mode.
- Removed unnecessary check for qemu_realloc().

I hope people like this approach, and I am looking forward to
suggestions/comments.

Thanks,

Yoshi

Yoshiaki Tamura (23):
  Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of
    bit-based phys_ram_dirty.
  Introduce cpu_physical_memory_get_dirty_range().
  Use cpu_physical_memory_set_dirty_range() to update phys_ram_dirty.
  Use cpu_physical_memory_get_dirty_range() to check multiple dirty pages.
  Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and
    qemu_clear_buffer().
  Introduce read() to FdMigrationState.
  Introduce skip_header parameter to qemu_loadvm_state().
  Introduce some socket util functions.
  Introduce fault tolerant VM transaction QEMUFile and ft_mode.
  Introduce util functions to control ft_transaction from savevm layer.
  Introduce qemu_savevm_state_all().
  Insert event-tap callbacks to net/block layer.
  Introduce event-tap.
  Call init handler of event-tap at main().
  Insert event_tap_ioport() to ioport_write().
  Insert event_tap_mmio() to cpu_physical_memory_rw().
  Skip assert() when event_tap_state isn't EVENT_TAP_OFF.
  Call event_tap_replay() at vm_start().
  Introduce ft_tranx_ready(), and modify migrate_fd_put_ready() when
    ft_mode is on.
  Modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack
    not to close fd when ft_mode is enabled.
  virtio-blk: Modify save/load handler to handle inuse variable.
  Introduce -k option to enable FT migration mode (Kemari).
  Add a parser to accept FT migration incoming mode.

 Makefile.objs    |    1 +
 Makefile.target  |    1 +
 block.c          |   22 +++
 block.h          |    4 +
 cpu-all.h        |  134 -
 event-tap.c      |  184
 event-tap.h      |   32
 exec.c           |  131 +
 ft_transaction.c |  418 ++
 ft_transaction.h |   54 +++
 hw/hw.h          |    7 +
 hw/virtio.c      |    8 +-
 ioport.c         |    2 +
 migration-exec.c |    2 +-
 migration-fd.c   |    2 +-
 migration-tcp.c  |   52 +++-
 migration-unix.c |    2 +-
 migration.c      |  110 ++-
 migration.h      |    3 +
 net/queue.c      |   18 +++
 net/queue.h      |    3 +
 osdep.c          |   13 ++
 qemu-char.c      |   25 +++-
 qemu-kvm.c       |   23 ++--
 qemu-monitor.hx  |    7 +-
 qemu_socket.h    |    4 +
 savevm.c         |  146 +--
 sysemu.h         |    3 +-
 vl.c             |   57 +---
 29 files changed, 1371 insertions(+), 97 deletions(-)
 create mode 100644 event-tap.c
 create mode 100644 event-tap.h
 create mode 100644 ft_transaction.c
 create mode 100644 ft_transaction.h
[RFC PATCH 05/23] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
Currently buf size is fixed at 32KB. It would be useful if it could be
flexible.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hw/hw.h |2 ++
 savevm.c | 21 -
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 05131a0..fc9ed29 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -61,6 +61,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void *qemu_realloc_buffer(QEMUFile *f, int size);
+void qemu_clear_buffer(QEMUFile *f);

 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index 2fd3de6..b9bb9f4 100644
--- a/savevm.c
+++ b/savevm.c
@@ -174,7 +174,8 @@ struct QEMUFile {
                            when reading */
     int buf_index;
     int buf_size; /* 0 when writing */
-    uint8_t buf[IO_BUF_SIZE];
+    int buf_max_size;
+    uint8_t *buf;

     int has_error;
 };
@@ -424,6 +425,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
     f->get_rate_limit = get_rate_limit;
     f->is_write = 0;

+    f->buf_max_size = IO_BUF_SIZE;
+    f->buf = qemu_mallocz(sizeof(uint8_t) * f->buf_max_size);
+
     return f;
 }

@@ -454,6 +458,20 @@ void qemu_fflush(QEMUFile *f)
     }
 }

+void *qemu_realloc_buffer(QEMUFile *f, int size)
+{
+    f->buf_max_size = size;
+    f->buf = qemu_realloc(f->buf, f->buf_max_size);
+
+    return f->buf;
+}
+
+void qemu_clear_buffer(QEMUFile *f)
+{
+    f->buf_size = f->buf_index = f->buf_offset = 0;
+    memset(f->buf, 0, f->buf_max_size);
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
     int len;
@@ -479,6 +497,7 @@ int qemu_fclose(QEMUFile *f)
     qemu_fflush(f);
     if (f->close)
         ret = f->close(f->opaque);
+    qemu_free(f->buf);
     qemu_free(f);
     return ret;
 }
--
1.7.0.31.g1df487
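
How a caller might use an expandable buffer is sketched below with plain libc calls. The demo_buf type and its functions are hypothetical and are not the QEMUFile API. One difference is deliberate: libc realloc() can return NULL, so the sketch keeps the old pointer until the call succeeds instead of assigning the result directly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer, loosely modelled on buf/buf_max_size. */
struct demo_buf {
    uint8_t *data;
    int size;       /* bytes currently stored */
    int max_size;   /* bytes allocated */
};

int demo_buf_init(struct demo_buf *b, int max_size)
{
    assert(b != NULL && max_size > 0);
    b->data = calloc((size_t)max_size, 1);
    b->size = 0;
    b->max_size = b->data ? max_size : 0;
    return b->data ? 0 : -1;
}

/* Append len bytes, doubling the allocation until the data fits. */
int demo_buf_put(struct demo_buf *b, const uint8_t *src, int len)
{
    assert(b != NULL && b->data != NULL && src != NULL && len >= 0);
    if (b->size + len > b->max_size) {
        int new_max = b->max_size;
        uint8_t *p;

        while (new_max < b->size + len)
            new_max *= 2;
        p = realloc(b->data, (size_t)new_max);
        if (p == NULL)
            return -1;          /* the old buffer is still valid */
        b->data = p;
        b->max_size = new_max;
    }
    memcpy(b->data + b->size, src, (size_t)len);
    b->size += len;
    return 0;
}

void demo_buf_free(struct demo_buf *b)
{
    assert(b != NULL);
    free(b->data);
    b->data = NULL;
    b->size = b->max_size = 0;
}
```

Starting from 4 bytes and appending 10 grows the allocation to 16 (4, 8, 16). Doubling keeps the number of reallocations logarithmic in the final size.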
[RFC PATCH 09/23] Introduce fault tolerant VM transaction QEMUFile and ft_mode.
This code implements VM transaction protocol. Like buffered_file, it sits between savevm and migration layer. With this architecture, VM transaction protocol is implemented mostly independent from other existing code. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp --- Makefile.objs|1 + ft_transaction.c | 418 ++ ft_transaction.h | 54 +++ migration.c |3 + 4 files changed, 476 insertions(+), 0 deletions(-) create mode 100644 ft_transaction.c create mode 100644 ft_transaction.h diff --git a/Makefile.objs b/Makefile.objs index b73e2cb..4388fb3 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -78,6 +78,7 @@ common-obj-y += qemu-char.o savevm.o #aio.o common-obj-y += msmouse.o ps2.o common-obj-y += qdev.o qdev-properties.o common-obj-y += qemu-config.o block-migration.o +common-obj-y += ft_transaction.o common-obj-$(CONFIG_BRLAPI) += baum.o common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o diff --git a/ft_transaction.c b/ft_transaction.c new file mode 100644 index 000..92dc681 --- /dev/null +++ b/ft_transaction.c @@ -0,0 +1,418 @@ +/* + * Fault tolerant VM transaction QEMUFile + * + * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * This source code is based on buffered_file.c. + * Copyright IBM, Corp. 
2008 + * Authors: + * Anthony Liguorialigu...@us.ibm.com + */ + +#include qemu-common.h +#include hw/hw.h +#include qemu-timer.h +#include sysemu.h +#include qemu-char.h +#include ft_transaction.h + +// #define DEBUG_FT_TRANSACTION + +typedef struct QEMUFileFtTranx +{ +FtTranxPutBufferFunc *put_buffer; +FtTranxGetBufferFunc *get_buffer; +FtTranxCloseFunc *close; +void *opaque; +QEMUFile *file; +int has_error; +int is_sender; +int buf_max_size; +enum QEMU_VM_TRANSACTION_STATE tranx_state; +uint16_t tranx_id; +uint32_t seq; +} QEMUFileFtTranx; + +#define IO_BUF_SIZE 32768 + +#ifdef DEBUG_FT_TRANSACTION +#define dprintf(fmt, ...) \ +do { printf(ft_transaction: fmt, ## __VA_ARGS__); } while (0) +#else +#define dprintf(fmt, ...) \ +do { } while (0) +#endif + +static ssize_t ft_tranx_flush_buffer(void *opaque, void *buf, int size) +{ +QEMUFileFtTranx *s = opaque; +size_t offset = 0; +ssize_t len; + +while (offset size) { +len = s-put_buffer(s-opaque, (uint8_t *)buf + offset, size - offset); + +if (len = 0) { +fprintf(stderr, ft transaction flush buffer failed \n); +s-has_error = 1; +offset = -EINVAL; +break; +} + +offset += len; +} + +return offset; +} + +static int ft_tranx_send_header(QEMUFileFtTranx *s) +{ +int ret = -1; + +dprintf(send header %d\n, s-tranx_state); + +ret = ft_tranx_flush_buffer(s, s-tranx_state, sizeof(uint16_t)); +if (ret 0) { +goto out; +} +ret = ft_tranx_flush_buffer(s, s-tranx_id, sizeof(uint16_t)); + +out: +return ret; +} + +static int ft_tranx_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size) +{ +QEMUFileFtTranx *s = opaque; +ssize_t ret = -1; + +if (s-has_error) { +fprintf(stderr, flush when error, bailing\n); +return -EINVAL; +} + +ret = ft_tranx_send_header(s); +if (ret 0) { +goto out; +} + +ret = ft_tranx_flush_buffer(s, s-seq, sizeof(s-seq)); +if (ret 0) { +goto out; +} +s-seq++; + +ret = ft_tranx_flush_buffer(s, size, sizeof(uint32_t)); +if (ret 0) { +goto out; +} + +ret = ft_tranx_flush_buffer(s, (uint8_t *)buf, size); 
+ +out: +return ret; +} + +#if 0 +static int ft_tranx_put_vector(void *opaque, struct iovec *vector, int64_t pos, int count) +{ +QEMUFileFtTranx *s = opaque; +ssize_t ret = -1; +int i; +uint32_t size = 0; + +dprintf(putting %d vectors at % PRId64 \n, count, pos); + +if (s-has_error) { +dprintf(put vector when error, bailing\n); +return -EINVAL; +} + +ret = ft_tranx_send_header(s); +if (ret 0) { +return ret; +} + +ret = ft_tranx_flush_buffer(s, s-seq, sizeof(s-seq)); +if (ret 0) { +return ret; +} +s-seq++; + +for (i = 0; i count; i++) +size += vector[i].iov_len; + +ret = ft_tranx_flush_buffer(s, size, sizeof(uint32_t)); +if (ret 0) { +return ret; +} + +while (count 0) { +/* + * It will continue calling put_vector even if count IOV_MAX. + */ +ret = s-put_vector(s-opaque, vector, +((countIOV_MAX)?IOV_MAX:count)); + +if (ret = 0) { +fprintf(stderr, ft transaction putting vector\n); +s-has_error = 1; +
[RFC PATCH 11/23] Introduce qemu_savevm_state_all().
Introduce qemu_savevm_state_all() to send the memory and device info
together, while avoiding cancelling memory state tracking.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 savevm.c |   60
 sysemu.h |    1
 2 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/savevm.c b/savevm.c
index 81cb711..25ccbb8 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1468,6 +1468,66 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
     return 0;
 }
 
+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        int len;
+
+        if (se->save_live_state == NULL)
+            continue;
+
+        /* Section type */
+        qemu_put_byte(f, QEMU_VM_SECTION_START);
+        qemu_put_be32(f, se->section_id);
+
+        /* ID string */
+        len = strlen(se->idstr);
+        qemu_put_byte(f, len);
+        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+        qemu_put_be32(f, se->instance_id);
+        qemu_put_be32(f, se->version_id);
+        if (ft_mode == FT_INIT) {
+            /* This is workaround. */
+            se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+        } else {
+            se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+        }
+    }
+
+    ft_mode = FT_TRANSACTION;
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        int len;
+
+        if (se->save_state == NULL && se->vmsd == NULL)
+            continue;
+
+        /* Section type */
+        qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+        qemu_put_be32(f, se->section_id);
+
+        /* ID string */
+        len = strlen(se->idstr);
+        qemu_put_byte(f, len);
+        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+        qemu_put_be32(f, se->instance_id);
+        qemu_put_be32(f, se->version_id);
+
+        vmstate_save(f, se);
+    }
+
+    qemu_put_byte(f, QEMU_VM_EOF);
+
+    if (qemu_file_has_error(f))
+        return -EIO;
+
+    return 0;
+}
+
+
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
 {
     SaveStateEntry *se;
diff --git a/sysemu.h b/sysemu.h
index 6c1441f..df314bb 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
                             int shared);
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
 int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
-- 
1.7.0.31.g1df487
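As a rough illustration of the per-section framing that both loops in qemu_savevm_state_all() write before the section payload, here is a standalone sketch. ByteSink, sink_put_byte(), sink_put_be32() and put_section_header() are hypothetical stand-ins invented for this example; they are not QEMU's QEMUFile API, and the type value used in the usage note is only a placeholder.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for QEMUFile: a small bounded in-memory sink. */
typedef struct {
    uint8_t buf[256];
    size_t  len;
} ByteSink;

static void sink_put_byte(ByteSink *s, uint8_t v)
{
    if (s->len < sizeof(s->buf)) {
        s->buf[s->len++] = v;
    }
}

/* Big-endian 32-bit write, mirroring what a "put_be32" helper does. */
static void sink_put_be32(ByteSink *s, uint32_t v)
{
    sink_put_byte(s, (uint8_t)(v >> 24));
    sink_put_byte(s, (uint8_t)(v >> 16));
    sink_put_byte(s, (uint8_t)(v >> 8));
    sink_put_byte(s, (uint8_t)v);
}

/*
 * Frame layout: type byte, section id, one-byte length, id string
 * (no terminator), instance id, version id.
 * Returns the number of bytes appended.
 */
static size_t put_section_header(ByteSink *s, uint8_t type,
                                 uint32_t section_id, const char *idstr,
                                 uint32_t instance_id, uint32_t version_id)
{
    size_t start = s->len;
    size_t n = strlen(idstr);

    if (n > 255) {
        n = 255; /* the length prefix is a single byte */
    }
    sink_put_byte(s, type);
    sink_put_be32(s, section_id);
    sink_put_byte(s, (uint8_t)n);
    for (size_t i = 0; i < n; i++) {
        sink_put_byte(s, (uint8_t)idstr[i]);
    }
    sink_put_be32(s, instance_id);
    sink_put_be32(s, version_id);
    return s->len - start;
}
```

Calling put_section_header(&sink, 0x04, 7, "ram", 0, 3) on an empty sink appends 17 bytes: 1 for the type, 4 for the section id, 1 + 3 for the id string, and 4 each for the instance and version ids.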
[RFC PATCH 00/23] Kemari for KVM v0.1.1
Hi,

This patch series is a revised version of Kemari for KVM, which addresses the comments on the previous post. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

In contrast to the previous version, this series doesn't require any modifications to KVM. The I/O events are captured in the net/block layer instead of the device emulation layer. The transmission/transaction protocol and most of the control logic are implemented in QEMU.

We prepared a demonstration video again. This time the guest is Windows XP without virtio drivers. The demonstration scenario is:

1. Play with a guest VM (this guest has e1000 and ide).
   # The guest image should be on NFS/SAN.
2. Start the incoming side with
   -incoming protocol:address:port,ft_mode
3. Start Kemari to synchronize the VM by running the following command in QEMU. Just add the -k option to the usual migrate command.
   migrate -d -k tcp:192.168.0.20:
4. Check the status by calling info migrate.
5. Go back to the VM to play the pinball.
6. Kill the VM. (The VNC client also disappears.)
7. Press c to continue the VM on the other host.
8. Bring up the VNC client. (Sorry, it pops up outside of the video capture.)
9. Confirm that the pinball works, then shut down.

http://www.osrg.net/kemari/download/kemari-kvm-winxp.mov

The repository contains all patches we're sending with this message. For those who want to try, pull the following repository.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari

The changes from v0.1 to v0.1.1 are:

- Events are tapped in the net/block layer instead of the device emulation layer.
- Introduced a new option for -incoming to accept FT transactions.
- Removed writev() support from QEMUFile and FdMigrationState for now. I would post this work in a different series.
- Modified the virtio-blk save/load handler to send the inuse variable so that requests are replayed correctly.
- Removed configure --enable-ft-mode.
- Removed an unnecessary check for qemu_realloc().

I hope people like this approach, and I am looking forward to suggestions/comments.
Thanks,

Yoshi

Yoshiaki Tamura (23):
  Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.
  Introduce cpu_physical_memory_get_dirty_range().
  Use cpu_physical_memory_set_dirty_range() to update phys_ram_dirty.
  Use cpu_physical_memory_get_dirty_range() to check multiple dirty pages.
  Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  Introduce read() to FdMigrationState.
  Introduce skip_header parameter to qemu_loadvm_state().
  Introduce some socket util functions.
  Introduce fault tolerant VM transaction QEMUFile and ft_mode.
  Introduce util functions to control ft_transaction from savevm layer.
  Introduce qemu_savevm_state_all().
  Insert event-tap callbacks to net/block layer.
  Introduce event-tap.
  Call init handler of event-tap at main().
  Insert event_tap_ioport() to ioport_write().
  Insert event_tap_mmio() to cpu_physical_memory_rw().
  Skip assert() when event_tap_state isn't EVENT_TAP_OFF.
  Call event_tap_replay() at vm_start().
  Introduce ft_tranx_ready(), and modify migrate_fd_put_ready() when ft_mode is on.
  Modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.
  virtio-blk: Modify save/load handler to handle inuse variable.
  Introduce -k option to enable FT migration mode (Kemari).
  Add a parser to accept FT migration incoming mode.
 Makefile.objs    |    1
 Makefile.target  |    1
 block.c          |   22
 block.h          |    4
 cpu-all.h        |  134
 event-tap.c      |  184
 event-tap.h      |   32
 exec.c           |  131
 ft_transaction.c |  418
 ft_transaction.h |   54
 hw/hw.h          |    7
 hw/virtio.c      |    8
 ioport.c         |    2
 migration-exec.c |    2
 migration-fd.c   |    2
 migration-tcp.c  |   52
 migration-unix.c |    2
 migration.c      |  110
 migration.h      |    3
 net/queue.c      |   18
 net/queue.h      |    3
 osdep.c          |   13
 qemu-char.c      |   25
 qemu-kvm.c       |   23
 qemu-monitor.hx  |    7
 qemu_socket.h    |    4
 savevm.c         |  146
 sysemu.h         |    3
 vl.c             |   57
 29 files changed, 1371 insertions(+), 97 deletions(-)
 create mode 100644 event-tap.c
 create mode 100644 event-tap.h
 create mode 100644 ft_transaction.c
 create mode 100644 ft_transaction.h
[RFC PATCH 13/23] Introduce event-tap.
event-tap controls when to start ft transaction, and inserts callbacks
to the net/block.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 Makefile.target |    1
 event-tap.c     |  184
 event-tap.h     |   32
 3 files changed, 217 insertions(+), 0 deletions(-)
 create mode 100644 event-tap.c
 create mode 100644 event-tap.h

diff --git a/Makefile.target b/Makefile.target
index 82caf20..a49b21f 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -188,6 +188,7 @@ obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 # MSI-X depends on kvm for interrupt injection,
 # so moved it from Makefile.objs to Makefile.target for now
 obj-y += msix.o
+obj-y += event-tap.o
 
 obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
 LIBS+=-lz
diff --git a/event-tap.c b/event-tap.c
new file mode 100644
index 0000000..5d3a338
--- /dev/null
+++ b/event-tap.c
@@ -0,0 +1,184 @@
+/*
+ * Event Tap functions for QEMU
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "block.h"
+#include "ioport.h"
+#include "osdep.h"
+#include "hw/hw.h"
+#include "net/queue.h"
+#include "event-tap.h"
+
+static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF;
+
+typedef struct EventTapIOport {
+    uint32_t address;
+    uint32_t data;
+    int      index;
+} EventTapIOport;
+
+#define MMIO_BUF_SIZE 8
+
+typedef struct EventTapMMIO {
+    uint64_t address;
+    uint8_t  buf[MMIO_BUF_SIZE];
+    int      len;
+} EventTapMMIO;
+
+#define EVENT_TAP_IOPORT 1
+#define EVENT_TAP_MMIO   2
+
+typedef struct EventTapLog {
+    int mode;
+    union {
+        EventTapIOport ioport ;
+        EventTapMMIO mmio;
+    };
+} EventTapLog;
+
+static EventTapLog last_event_tap;
+
+int event_tap_register(int (*cb)(void))
+{
+    if (cb == NULL || event_tap_state != EVENT_TAP_OFF)
+        return -1;
+
+    bdrv_event_tap_register(cb);
+    qemu_net_event_tap_register(cb);
+    event_tap_state = EVENT_TAP_ON;
+
+    return 0;
+}
+
+int event_tap_unregister(void)
+{
+    if (event_tap_state == EVENT_TAP_OFF)
+        return -1;
+
+    bdrv_event_tap_unregister();
+    qemu_net_event_tap_unregister();
+    event_tap_state = EVENT_TAP_OFF;
+
+    return 0;
+}
+
+void event_tap_suspend(void)
+{
+    if (event_tap_state == EVENT_TAP_ON)
+        event_tap_state = EVENT_TAP_SUSPEND;
+}
+
+void event_tap_resume(void)
+{
+    if (event_tap_state == EVENT_TAP_SUSPEND)
+        event_tap_state = EVENT_TAP_ON;
+}
+
+int event_tap_get_state(void)
+{
+    return event_tap_state;
+}
+
+void event_tap_ioport(int index, uint32_t address, uint32_t data)
+{
+    if (event_tap_state != EVENT_TAP_ON) {
+        return;
+    }
+
+    last_event_tap.mode = EVENT_TAP_IOPORT;
+    last_event_tap.ioport.index = index;
+    last_event_tap.ioport.address = address;
+    last_event_tap.ioport.data = data;
+}
+
+void event_tap_mmio(uint64_t address, uint8_t *buf, int len)
+{
+    if (event_tap_state != EVENT_TAP_ON || len > MMIO_BUF_SIZE) {
+        return;
+    }
+
+    last_event_tap.mode = EVENT_TAP_MMIO;
+    last_event_tap.mmio.address = address;
+    last_event_tap.mmio.len = len;
+    memcpy(last_event_tap.mmio.buf, buf, len);
+}
+
+static void event_tap_reset(void)
+{
+    memset(&last_event_tap, 0, sizeof(last_event_tap));
+}
+
+void event_tap_replay(void)
+{
+    if (event_tap_state != EVENT_TAP_REPLAY) {
+        return;
+    }
+
+    switch (last_event_tap.mode) {
+    case EVENT_TAP_IOPORT:
+        switch (last_event_tap.ioport.index) {
+        case 0:
+            cpu_outb(last_event_tap.ioport.address, last_event_tap.ioport.data);
+            break;
+        case 1:
+            cpu_outw(last_event_tap.ioport.address, last_event_tap.ioport.data);
+            break;
+        case 2:
+            cpu_outl(last_event_tap.ioport.address, last_event_tap.ioport.data);
+            break;
+        }
+        event_tap_reset();
+        break;
+    case EVENT_TAP_MMIO:
+        cpu_physical_memory_rw(last_event_tap.mmio.address,
+                               last_event_tap.mmio.buf,
+                               last_event_tap.mmio.len, 1);
+        event_tap_reset();
+        break;
+    }
+}
+
+static void event_tap_save(QEMUFile *f, void *opaque)
+{
+    qemu_put_byte(f, last_event_tap.mode);
+
+    if (last_event_tap.mode == EVENT_TAP_IOPORT) {
+        qemu_put_be32(f, last_event_tap.ioport.index);
+        qemu_put_be32(f, last_event_tap.ioport.address);
+        qemu_put_byte(f, last_event_tap.ioport.data);
+    } else {
+        qemu_put_be64(f, last_event_tap.mmio.address);
+        qemu_put_byte(f, last_event_tap.mmio.len);
+        qemu_put_buffer(f, last_event_tap.mmio.buf, last_event_tap.mmio.len);
+    }
+}
+
+static int event_tap_load(QEMUFile *f, void *opaque, int version_id)
+{
+    last_event_tap.mode = qemu_get_byte(f);
+
+    if (last_event_tap.mode ==
[RFC PATCH 22/23] Introduce -k option to enable FT migration mode (Kemari).
When the -k option is passed to the migrate command, ft_mode is turned
on to start FT migration mode (Kemari).

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 migration.c     |    3
 qemu-monitor.hx |    7
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/migration.c b/migration.c
index 5b90d37..3334650 100644
--- a/migration.c
+++ b/migration.c
@@ -71,6 +71,9 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
         return -1;
     }
 
+    if (qdict_get_int(qdict, "ft"))
+        ft_mode = FT_INIT;
+
     if (strstart(uri, "tcp:", &p)) {
         s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
                                          (int)qdict_get_int(qdict, "blk"),
diff --git a/qemu-monitor.hx b/qemu-monitor.hx
index 16c45b7..22b72d9 100644
--- a/qemu-monitor.hx
+++ b/qemu-monitor.hx
@@ -765,13 +765,14 @@ ETEXI
 
     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
+        .args_type  = "detach:-d,blk:-b,inc:-i,ft:-k,uri:s",
+        .params     = "[-d] [-b] [-i] [-k] uri",
         .help       = "migrate to URI (using -d to not wait for completion)"
                       "\n\t\t\t -b for migration without shared storage with"
                       " full copy of disk\n\t\t\t -i for migration without "
                       "shared storage with incremental copy of disk "
-                      "(base image shared between src and destination)",
+                      "(base image shared between src and destination)"
+                      "\n\t\t\t -k for FT migration mode (Kemari)",
         .user_print = monitor_user_noop,
         .mhandler.cmd_new = do_migrate,
     },
-- 
1.7.0.31.g1df487
[RFC PATCH 03/23] Use cpu_physical_memory_set_dirty_range() to update phys_ram_dirty.
Modifies kvm_get_dirty_pages_log_range to use
cpu_physical_memory_set_dirty_range() to update the row of the
bit-based phys_ram_dirty bitmap at once.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura@lab.ntt.co.jp>
---
 qemu-kvm.c |   19
 1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 29365a9..1414f49 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -2323,8 +2323,8 @@ static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
                                          unsigned long offset,
                                          unsigned long mem_size)
 {
-    unsigned int i, j;
-    unsigned long page_number, addr, addr1, c;
+    unsigned int i;
+    unsigned long page_number, addr, addr1;
     ram_addr_t ram_addr;
     unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + HOST_LONG_BITS - 1) /
         HOST_LONG_BITS;
@@ -2335,16 +2335,11 @@ static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
      */
     for (i = 0; i < len; i++) {
         if (bitmap[i] != 0) {
-            c = leul_to_cpu(bitmap[i]);
-            do {
-                j = ffsl(c) - 1;
-                c &= ~(1ul << j);
-                page_number = i * HOST_LONG_BITS + j;
-                addr1 = page_number * TARGET_PAGE_SIZE;
-                addr = offset + addr1;
-                ram_addr = cpu_get_physical_page_desc(addr);
-                cpu_physical_memory_set_dirty(ram_addr);
-            } while (c != 0);
+            page_number = i * HOST_LONG_BITS;
+            addr1 = page_number * TARGET_PAGE_SIZE;
+            addr = offset + addr1;
+            ram_addr = cpu_get_physical_page_desc(addr);
+            cpu_physical_memory_set_dirty_range(ram_addr, leul_to_cpu(bitmap[i]));
         }
     }
     return 0;
-- 
1.7.0.31.g1df487
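To make the intent of this change concrete, the sketch below contrasts marking pages one bit at a time with merging a whole bitmap word in a single operation. It assumes a plain array of unsigned long as the dirty bitmap; mark_per_page() and mark_row() are hypothetical helpers written for this example, and cpu_physical_memory_set_dirty_range() itself (introduced in another patch of the series) is not reproduced here.

```c
#include <stddef.h>

/* Old style: walk the word and set one dirty bit per iteration. */
static void mark_per_page(unsigned long *dirty, size_t row, unsigned long word)
{
    while (word != 0) {
        /* Isolate the lowest set bit (two's-complement trick). */
        unsigned long lowest = word & (~word + 1UL);

        dirty[row] |= lowest;
        word &= ~lowest;
    }
}

/* New style: merge the whole row of the bitmap at once. */
static void mark_row(unsigned long *dirty, size_t row, unsigned long word)
{
    dirty[row] |= word;
}
```

Both helpers leave the bitmap in the same state; the row-based form simply replaces up to HOST_LONG_BITS iterations with one OR.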
[RFC PATCH 10/23] Introduce util functions to control ft_transaction from savevm layer.
To utilize ft_transaction function, savevm needs interfaces to be
exported.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 hw/hw.h  |    5
 savevm.c |   41
 2 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index fc9ed29..5a48a91 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -54,6 +54,8 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_transaction(int fd);
+QEMUFile *qemu_fopen_tranx_sender(void *opaque);
 QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
@@ -63,6 +65,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
+int qemu_transaction_begin(QEMUFile *f);
+int qemu_transaction_commit(QEMUFile *f);
+int qemu_transaction_cancel(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index 2ab883b..81cb711 100644
--- a/savevm.c
+++ b/savevm.c
@@ -82,6 +82,7 @@
 #include "migration.h"
 #include "qemu_socket.h"
 #include "qemu-queue.h"
+#include "ft_transaction.h"
 
 /* point to the block driver where the snapshots are managed */
 static BlockDriverState *bs_snapshots;
@@ -207,6 +208,21 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
     return len;
 }
 
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+    QEMUFileSocket *s = opaque;
+    ssize_t len;
+
+    do {
+        len = send(s->fd, (void *)buf, size, 0);
+    } while (len == -1 && socket_error() == EINTR);
+
+    if (len == -1)
+        len = -socket_error();
+
+    return len;
+}
+
 static int socket_close(void *opaque)
 {
     QEMUFileSocket *s = opaque;
@@ -335,6 +351,16 @@ QEMUFile *qemu_fopen_socket(int fd)
     return s->file;
 }
 
+QEMUFile *qemu_fopen_transaction(int fd)
+{
+    QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
+
+    s->fd = fd;
+    s->file = qemu_fopen_ops_ft_tranx(s, socket_put_buffer, socket_get_buffer,
+                                      socket_close, 0);
+    return s->file;
+}
+
 static int file_put_buffer(void *opaque, const uint8_t *buf,
                             int64_t pos, int size)
 {
@@ -472,6 +498,21 @@ void qemu_clear_buffer(QEMUFile *f)
     memset(f->buf, 0, f->buf_max_size);
 }
 
+int qemu_transaction_begin(QEMUFile *f)
+{
+    return qemu_ft_tranx_begin(f->opaque);
+}
+
+int qemu_transaction_commit(QEMUFile *f)
+{
+    return qemu_ft_tranx_commit(f->opaque);
+}
+
+int qemu_transaction_cancel(QEMUFile *f)
+{
+    return qemu_ft_tranx_cancel(f->opaque);
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
     int len;
-- 
1.7.0.31.g1df487
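The retry loop in socket_put_buffer() follows a common POSIX idiom: repeat send() while it fails with EINTR, and report any other failure as a negative errno value. A minimal standalone sketch of that idiom is shown below. It assumes a POSIX system and reads errno directly instead of QEMU's socket_error() wrapper; send_retry_eintr() is a hypothetical name used only here.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send once, retrying only when the call was interrupted by a signal.
 * Returns the number of bytes sent, or a negative errno value.
 */
static ssize_t send_retry_eintr(int fd, const void *buf, size_t size)
{
    ssize_t len;

    do {
        len = send(fd, buf, size, 0);
    } while (len == -1 && errno == EINTR);

    if (len == -1) {
        return -errno;
    }
    return len;
}
```

Note that a successful return may still be a short write; a caller that needs all bytes delivered has to loop on the remaining data itself.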
[RFC PATCH 20/23] Modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.
When ft_mode is set in the header, tcp_accept_incoming_migration()
receives ft_transaction iteratively. We also need a hack not to close
the fd before moving to ft_transaction mode, so that we can reuse the
fd for it.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 migration-tcp.c |   36
 1 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index 767a2f1..a5d9b6d 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -18,6 +18,7 @@
 #include "sysemu.h"
 #include "buffered_file.h"
 #include "block.h"
+#include "ft_transaction.h"
 
 //#define DEBUG_MIGRATION_TCP
 
@@ -55,7 +56,8 @@ static int socket_read(FdMigrationState *s, const void * buf, size_t size)
 static int tcp_close(FdMigrationState *s)
 {
     DPRINTF("tcp_close\n");
-    if (s->fd != -1) {
+    /* FIX ME: accessing ft_mode here isn't clean */
+    if (s->fd != -1 && ft_mode != FT_INIT) {
         close(s->fd);
         s->fd = -1;
     }
@@ -181,6 +183,38 @@ static void tcp_accept_incoming_migration(void *opaque)
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;
     }
+
+    /* ft_mode is set by qemu_loadvm_state(). */
+    if (ft_mode == FT_INIT) {
+        /* close normal QEMUFile first before reusing connection. */
+        qemu_fclose(f);
+        socket_set_nodelay(c);
+        socket_set_timeout(c, 5);
+        /* don't autostart to avoid split brain. */
+        autostart = 0;
+
+        f = qemu_fopen_transaction(c);
+        if (f == NULL) {
+            fprintf(stderr, "could not qemu_fopen transaction\n");
+            goto out;
+        }
+
+        /* need to wait sender to setup. */
+        if (qemu_transaction_begin(f) < 0) {
+            goto out_fopen;
+        }
+
+        /* loop until transaction breaks */
+        while ((ft_mode != FT_OFF) && (ret == 0)) {
+            ret = qemu_loadvm_state(f, 1);
+        }
+
+        /* if migrate_cancel was called at the sender */
+        if (ft_mode == FT_OFF) {
+            goto out_fopen;
+        }
+    }
+
     qemu_announce_self();
     DPRINTF("successfully loaded vm state\n");
 
-- 
1.7.0.31.g1df487
[RFC PATCH 14/23] Call init handler of event-tap at main().
Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 vl.c |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index 70a8aed..56d12c7 100644
--- a/vl.c
+++ b/vl.c
@@ -169,6 +169,8 @@ int main(int argc, char **argv)
 
 #include "qemu-queue.h"
 
+#include "event-tap.h"
+
 //#define DEBUG_NET
 //#define DEBUG_SLIRP
 
@@ -5949,6 +5951,8 @@ int main(int argc, char **argv, char **envp)
 
     blk_mig_init();
 
+    event_tap_init();
+
     if (default_cdrom) {
         /* we always create the cdrom drive, even if no disk is there */
         drive_add(NULL, CDROM_ALIAS);
-- 
1.7.0.31.g1df487
[RFC PATCH 15/23] Insert event_tap_ioport() to ioport_write().
Record ioport event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 ioport.c |    2
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index 53dd87a..ad7a017 100644
--- a/ioport.c
+++ b/ioport.c
@@ -26,6 +26,7 @@
  */
 
 #include "ioport.h"
+#include "event-tap.h"
 
 /***********************************************************/
 /* IO Port */
@@ -75,6 +76,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
         default_ioport_writel
     };
     IOPortWriteFunc *func = ioport_write_table[index][address];
+    event_tap_ioport(index, address, data);
     if (!func)
         func = default_func[index];
     func(ioport_opaque[address], address, data);
-- 
1.7.0.31.g1df487
[RFC PATCH 08/23] Introduce some socket util functions.
Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 osdep.c       |   13
 qemu-char.c   |   25
 qemu_socket.h |    4
 3 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/osdep.c b/osdep.c
index 3bab79a..63444e7 100644
--- a/osdep.c
+++ b/osdep.c
@@ -201,6 +201,12 @@ void socket_set_nonblock(int fd)
     ioctlsocket(fd, FIONBIO, &opt);
 }
 
+void socket_set_block(int fd)
+{
+    unsigned long opt = 0;
+    ioctlsocket(fd, FIONBIO, &opt);
+}
+
 int inet_aton(const char *cp, struct in_addr *ia)
 {
     uint32_t addr = inet_addr(cp);
@@ -223,6 +229,13 @@ void socket_set_nonblock(int fd)
     fcntl(fd, F_SETFL, f | O_NONBLOCK);
 }
 
+void socket_set_block(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFL);
+    fcntl(fd, F_SETFL, f & ~O_NONBLOCK);
+}
+
 void qemu_set_cloexec(int fd)
 {
     int f;
diff --git a/qemu-char.c b/qemu-char.c
index 4169492..ccdf394 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2092,12 +2092,35 @@ static void tcp_chr_telnet_init(int fd)
     send(fd, (char *)buf, 3, 0);
 }
 
-static void socket_set_nodelay(int fd)
+void socket_set_delay(int fd)
+{
+    int val = 0;
+    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
+}
+
+void socket_set_nodelay(int fd)
 {
     int val = 1;
     setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
 }
 
+void socket_set_timeout(int fd, int s)
+{
+    struct timeval tv = {
+        .tv_sec = s,
+        .tv_usec = 0
+    };
+    /* Set socket_timeout */
+    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO,
+                   &tv, sizeof(tv)) < 0) {
+        fprintf(stderr, "failed to set SO_RCVTIMEO\n");
+    }
+    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO,
+                   &tv, sizeof(tv)) < 0) {
+        fprintf(stderr, "fialed to set SO_SNDTIMEO\n");
+    }
+}
+
 static void tcp_chr_accept(void *opaque)
 {
     CharDriverState *chr = opaque;
diff --git a/qemu_socket.h b/qemu_socket.h
index 7ee46ac..8eae465 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -35,6 +35,10 @@ int inet_aton(const char *cp, struct in_addr *ia);
 int qemu_socket(int domain, int type, int protocol);
 int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
 void socket_set_nonblock(int fd);
+void socket_set_block(int fd);
+void socket_set_nodelay(int fd);
+void socket_set_delay(int fd);
+void socket_set_timeout(int fd, int s);
 int send_all(int fd, const void *buf, int len1);
 
 /* New, ipv6-ready socket helper functions, see qemu-sockets.c */
-- 
1.7.0.31.g1df487
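For readers who want to try the timeout helper in isolation, here is a small standalone variant. It assumes a POSIX system; set_io_timeout() is a hypothetical name, and unlike socket_set_timeout() in the patch it returns an error code instead of printing to stderr, which is a design choice made for this sketch only.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Apply the same timeout to blocking receives and sends on fd.
 * Returns 0 on success, -1 if either option could not be set.
 */
static int set_io_timeout(int fd, int seconds)
{
    struct timeval tv = {
        .tv_sec = seconds,
        .tv_usec = 0
    };

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0) {
        return -1;
    }
    return 0;
}
```

With a timeout set, a blocking recv() or send() that makes no progress within the interval fails with EAGAIN or EWOULDBLOCK, which is what lets the receiving side notice a silent peer.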
[RFC PATCH 18/23] Call event_tap_replay() at vm_start().
Call event_tap_replay() at vm_start() to replay the last ioport/mmio
event upon failover.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 vl.c |    1
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index 56d12c7..762440d 100644
--- a/vl.c
+++ b/vl.c
@@ -3094,6 +3094,7 @@ void vm_start(void)
         vm_state_notify(1, 0);
         qemu_rearm_alarm_timer(alarm_timer);
         resume_all_vcpus();
+        event_tap_replay();
     }
 }
 
-- 
1.7.0.31.g1df487
[RFC PATCH 07/23] Introduce skip_header parameter to qemu_loadvm_state().
Introduce skip_header parameter to qemu_loadvm_state() so that it can
be called iteratively without reading the header.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 migration-exec.c |    2
 migration-fd.c   |    2
 migration-tcp.c  |    2
 migration-unix.c |    2
 savevm.c         |   24
 sysemu.h         |    2
 6 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/migration-exec.c b/migration-exec.c
index 3edc026..5839a6d 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void *opaque)
     QEMUFile *f = opaque;
     int ret;
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto err;
diff --git a/migration-fd.c b/migration-fd.c
index 0cc74ad..0e97ed0 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void *opaque)
     QEMUFile *f = opaque;
     int ret;
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto err;
diff --git a/migration-tcp.c b/migration-tcp.c
index cffc4df..767a2f1 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -176,7 +176,7 @@ static void tcp_accept_incoming_migration(void *opaque)
         goto out;
     }
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;
diff --git a/migration-unix.c b/migration-unix.c
index b7aab38..dd99a73 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void *opaque)
         goto out;
     }
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;
diff --git a/savevm.c b/savevm.c
index b9bb9f4..2ab883b 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1489,7 +1489,7 @@ typedef struct LoadStateEntry {
     int version_id;
 } LoadStateEntry;
 
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
 {
     QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
         QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1498,17 +1498,19 @@ int qemu_loadvm_state(QEMUFile *f)
     unsigned int v;
     int ret;
 
-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC)
-        return -EINVAL;
+    if (!skip_header) {
+        v = qemu_get_be32(f);
+        if (v != QEMU_VM_FILE_MAGIC)
+            return -EINVAL;
 
-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-        return -ENOTSUP;
+        v = qemu_get_be32(f);
+        if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+            fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
+            return -ENOTSUP;
+        }
+        if (v != QEMU_VM_FILE_VERSION)
+            return -ENOTSUP;
     }
-    if (v != QEMU_VM_FILE_VERSION)
-        return -ENOTSUP;
 
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
@@ -1833,7 +1835,7 @@ int load_vmstate(Monitor *mon, const char *name)
         monitor_printf(mon, "Could not open VM state file\n");
         return -EINVAL;
     }
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     qemu_fclose(f);
     if (ret < 0) {
         monitor_printf(mon, "Error %d while loading VM state\n", ret);
diff --git a/sysemu.h b/sysemu.h
index 647a468..6c1441f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 void qemu_errors_to_file(FILE *fp);
 void qemu_errors_to_mon(Monitor *mon);
-- 
1.7.0.31.g1df487
[RFC PATCH 12/23] Insert event-tap callbacks to net/block layer.
Introduce event-tap callbacks to functions which actually fire outputs
at net/block layer. By synchronizing VMs before outputs are fired, we
can failover to the receiver upon failure.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 block.c     |   22
 block.h     |    4
 net/queue.c |   18
 net/queue.h |    3
 4 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 31d1ba4..cf73c47 100644
--- a/block.c
+++ b/block.c
@@ -59,6 +59,8 @@ BlockDriverState *bdrv_first;
 
 static BlockDriver *first_drv;
 
+static int (*bdrv_event_tap)(void);
+
 /* If non-zero, use only whitelisted block drivers */
 static int use_bdrv_whitelist;
 
@@ -787,6 +789,10 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
         set_dirty_bitmap(bs, sector_num, nb_sectors, 1);
     }
 
+    if (bdrv_event_tap != NULL) {
+        bdrv_event_tap();
+    }
+
     return drv->bdrv_write(bs, sector_num, buf, nb_sectors);
 }
 
@@ -1851,6 +1857,10 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
     MultiwriteCB *mcb;
     int i;
 
+    if (bdrv_event_tap != NULL) {
+        bdrv_event_tap();
+    }
+
     if (num_reqs == 0) {
         return 0;
     }
@@ -2277,3 +2287,15 @@ int64_t bdrv_get_dirty_count(BlockDriverState *bs)
 {
     return bs->dirty_count;
 }
+
+void bdrv_event_tap_register(int (*cb)(void))
+{
+    if (bdrv_event_tap == NULL) {
+        bdrv_event_tap = cb;
+    }
+}
+
+void bdrv_event_tap_unregister(void)
+{
+    bdrv_event_tap = NULL;
+}
diff --git a/block.h b/block.h
index edf5704..b5139db 100644
--- a/block.h
+++ b/block.h
@@ -207,4 +207,8 @@ int bdrv_get_dirty(BlockDriverState *bs, int64_t sector);
 void bdrv_reset_dirty(BlockDriverState *bs, int64_t cur_sector,
                       int nr_sectors);
 int64_t bdrv_get_dirty_count(BlockDriverState *bs);
+
+void bdrv_event_tap_register(int (*cb)(void));
+void bdrv_event_tap_unregister(void);
+
 #endif
diff --git a/net/queue.c b/net/queue.c
index 2ea6cd0..a542efe 100644
--- a/net/queue.c
+++ b/net/queue.c
@@ -57,6 +57,8 @@ struct NetQueue {
     unsigned delivering : 1;
 };
 
+static int (*net_event_tap)(void);
+
 NetQueue *qemu_new_net_queue(NetPacketDeliver *deliver,
                              NetPacketDeliverIOV *deliver_iov,
                              void *opaque)
@@ -151,6 +153,8 @@ static ssize_t qemu_net_queue_deliver(NetQueue *queue,
     ssize_t ret = -1;
 
     queue->delivering = 1;
+    if (net_event_tap)
+        net_event_tap();
     ret = queue->deliver(sender, flags, data, size, queue->opaque);
     queue->delivering = 0;
 
@@ -166,6 +170,8 @@ static ssize_t qemu_net_queue_deliver_iov(NetQueue *queue,
     ssize_t ret = -1;
 
     queue->delivering = 1;
+    if (net_event_tap)
+        net_event_tap();
     ret = queue->deliver_iov(sender, flags, iov, iovcnt, queue->opaque);
     queue->delivering = 0;
 
@@ -258,3 +264,15 @@ void qemu_net_queue_flush(NetQueue *queue)
         qemu_free(packet);
     }
 }
+
+void qemu_net_event_tap_register(int (*cb)(void))
+{
+    if (net_event_tap == NULL) {
+        net_event_tap = cb;
+    }
+}
+
+void qemu_net_event_tap_unregister(void)
+{
+    net_event_tap = NULL;
+}
diff --git a/net/queue.h b/net/queue.h
index a31958e..5b031c1 100644
--- a/net/queue.h
+++ b/net/queue.h
@@ -68,4 +68,7 @@ ssize_t qemu_net_queue_send_iov(NetQueue *queue,
 void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from);
 void qemu_net_queue_flush(NetQueue *queue);
 
+void qemu_net_event_tap_register(int (*cb)(void));
+void qemu_net_event_tap_unregister(void);
+
 #endif /* QEMU_NET_QUEUE_H */
-- 
1.7.0.31.g1df487
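The hook used in both block.c and net/queue.c is a single optional function pointer that is called just before an output is fired. A self-contained sketch of the pattern follows; tap_register(), tap_unregister(), emit_output() and count_sync() are hypothetical names for this example, with count_sync() standing in for the callback that would synchronize the two VMs.

```c
#include <stddef.h>

typedef int (*tap_cb)(void);

/* NULL while no tap is installed. */
static tap_cb output_tap;

/* The first registration wins; later ones are ignored until unregister. */
static void tap_register(tap_cb cb)
{
    if (output_tap == NULL) {
        output_tap = cb;
    }
}

static void tap_unregister(void)
{
    output_tap = NULL;
}

static int emitted;

/* Fire the tap (if any) before the output itself leaves. */
static int emit_output(void)
{
    if (output_tap != NULL) {
        output_tap();
    }
    return ++emitted;
}

/* Example callback: counts how often a synchronization was requested. */
static int sync_count;

static int count_sync(void)
{
    sync_count++;
    return 0;
}
```

Because the tap runs before the write or packet delivery, the backup can be brought up to date before any externally visible effect occurs, which is the property the commit message relies on for failover.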
[RFC PATCH 02/23] Introduce cpu_physical_memory_get_dirty_range().
It checks the first row and puts the dirty addrs in the array. If the
first row is empty, it skips to the first dirty row or the end addr,
and puts the length in the first entry of the array.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura@lab.ntt.co.jp>
---
 cpu-all.h |    4
 exec.c    |   67
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 3f8762d..27187d4 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -1007,6 +1007,10 @@ static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
     }
 }
 
+int cpu_physical_memory_get_dirty_range(ram_addr_t start, ram_addr_t end,
+                                        ram_addr_t *dirty_rams, int length,
+                                        int dirty_flags);
+
 void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
                                      int dirty_flags);
 void cpu_tlb_update_dirty(CPUState *env);
diff --git a/exec.c b/exec.c
index bf8d703..d5c2a05 100644
--- a/exec.c
+++ b/exec.c
@@ -1962,6 +1962,73 @@ static inline void tlb_reset_dirty_range(CPUTLBEntry *tlb_entry,
     }
 }
 
+/* It checks the first row and puts dirty addrs in the array.
+   If the first row is empty, it skips to the first non-dirty row
+   or the end addr, and put the length in the first entry of the array. */
+int cpu_physical_memory_get_dirty_range(ram_addr_t start, ram_addr_t end,
+                                        ram_addr_t *dirty_rams, int length,
+                                        int dirty_flag)
+{
+    unsigned long p = 0, page_number;
+    ram_addr_t addr;
+    ram_addr_t s_idx = (start >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    ram_addr_t e_idx = (end >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int i, j, offset, dirty_idx = dirty_flag_to_idx(dirty_flag);
+
+    /* mask bits before the start addr */
+    offset = (start >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+    cpu_physical_memory_sync_master(s_idx);
+    p |= phys_ram_dirty[dirty_idx][s_idx] & ~((1UL << offset) - 1);
+
+    if (s_idx == e_idx) {
+        /* mask bits after the end addr */
+        offset = (end >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+        p &= (1UL << offset) - 1;
+    }
+
+    if (p == 0) {
+        /* when the row is empty */
+        ram_addr_t skip;
+        if (s_idx == e_idx) {
+            skip = end;
+        } else {
+            /* skip empty rows */
+            while (s_idx < e_idx) {
+                s_idx++;
+                cpu_physical_memory_sync_master(s_idx);
+
+                if (phys_ram_dirty[dirty_idx][s_idx] != 0) {
+                    break;
+                }
+            }
+            skip = (s_idx * HOST_LONG_BITS * TARGET_PAGE_SIZE);
+        }
+        dirty_rams[0] = skip - start;
+        i = 0;
+
+    } else if (p == ~0UL) {
+        /* when the row is fully dirtied */
+        addr = start;
+        for (i = 0; i < length; i++) {
+            dirty_rams[i] = addr;
+            addr += TARGET_PAGE_SIZE;
+        }
+    } else {
+        /* when the row is partially dirtied */
+        i = 0;
+        do {
+            j = ffsl(p) - 1;
+            p &= ~(1UL << j);
+            page_number = s_idx * HOST_LONG_BITS + j;
+            addr = page_number * TARGET_PAGE_SIZE;
+            dirty_rams[i] = addr;
+            i++;
+        } while (p != 0 && i < length);
+    }
+
+    return i;
+}
+
 /* Note: start and end must be within the same ram block.  */
 void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
                                      int dirty_flags)
-- 
1.7.0.31.g1df487
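The two masking steps at the top of cpu_physical_memory_get_dirty_range() can be looked at in isolation. The sketch below applies them to a single bitmap word; mask_row() is a hypothetical helper for this example, and it assumes both bit positions are smaller than the number of bits in an unsigned long, as they are in the patch where they come from a value masked with HOST_LONG_BITS - 1.

```c
/*
 * Keep only the bits of one bitmap row that fall inside
 * [start_bit, end_bit). Both positions must be < bits per long.
 */
static unsigned long mask_row(unsigned long row,
                              unsigned start_bit, unsigned end_bit)
{
    row &= ~((1UL << start_bit) - 1); /* drop pages before the start */
    row &= (1UL << end_bit) - 1;      /* drop pages at or after the end */
    return row;
}
```

For example, mask_row(0xFF, 2, 6) yields 0x3C: bits 0 and 1 lie before the start, bits 6 and 7 lie at or past the end, and bits 2 through 5 survive. In the patch the second mask is only applied when start and end fall in the same row.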
[RFC PATCH 21/23] virtio-blk: Modify save/load handler to handle inuse variable.
Modify the type of inuse to uint16_t, let save/load handle it, and revert last_avail_idx by inuse if there are outstanding emulations.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 hw/virtio.c |    8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index 7c020a3..502929c 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -70,7 +70,7 @@ struct VirtQueue
     VRing vring;
     target_phys_addr_t pa;
     uint16_t last_avail_idx;
-    int inuse;
+    uint16_t inuse;
     uint16_t vector;
     void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
 };
@@ -641,6 +641,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
         qemu_put_be32(f, vdev->vq[i].vring.num);
         qemu_put_be64(f, vdev->vq[i].pa);
         qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+        qemu_put_be16s(f, &vdev->vq[i].inuse);
         if (vdev->binding->save_queue)
             vdev->binding->save_queue(vdev->binding_opaque, i, f);
     }
@@ -678,6 +679,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
         vdev->vq[i].vring.num = qemu_get_be32(f);
         vdev->vq[i].pa = qemu_get_be64(f);
         qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
+        qemu_get_be16s(f, &vdev->vq[i].inuse);
+
+        /* revert last_avail_idx if there are outstanding emulation. */
+        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
+        vdev->vq[i].inuse = 0;
 
         if (vdev->vq[i].pa) {
             virtqueue_init(&vdev->vq[i]);
-- 
1.7.0.31.g1df487
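The load-side rewind in this patch relies on virtqueue indices being free-running 16-bit counters, so subtracting the number of in-flight requests has to wrap modulo 65536. A small sketch of that arithmetic, with rewind_avail_idx() as a hypothetical helper that is not part of the patch:

```c
#include <stdint.h>

/* Step a free-running 16-bit ring index back by `inuse` entries.
 * The cast makes the modulo-65536 wraparound explicit, since the
 * subtraction itself is performed in int after integer promotion. */
static uint16_t rewind_avail_idx(uint16_t last_avail_idx, uint16_t inuse)
{
    return (uint16_t)(last_avail_idx - inuse);
}
```

For example, an index of 2 with 5 requests in flight rewinds to 65533 rather than going negative, which is presumably why the patch narrows inuse from int to uint16_t alongside the index.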
Re: Gentoo guest with smp: emerge freeze while recompile world
On 05/24/2010 12:15 AM, Riccardo wrote:
>> Please try with kvmclock disabled.
>
> I have recompiled gentoo-sources-2.6.34 without kvm-clock:
>
> # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> tsc
> # dmesg | grep clock
> [0.00] hpet clockevent registered
> [0.661050] Switching to clocksource tsc
>
> And with this kernel everything works fine! (emerge -e world)
> It's a problem in the kvm-clock for kernel=2.6.32

Can you provide the traces with kvmclock enabled so we can see what went wrong?

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 1/5] trace: Add trace-events file for declaring trace events
On 05/25/2010 01:07 AM, Anthony Liguori wrote:
> Interesting approach as it lets us defer the tracing backend decision.

Also, it's compatible with the multiplatform nature of qemu.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/24/2010 10:38 PM, Anthony Liguori wrote:
>> - Building a plugin API seems a bit simpler to me, although I'm not
>>   sure if I get the idea correctly:
>>   The block layer already has some kind of API (.bdrv_file_open,
>>   .bdrv_read). We could simply compile the block drivers as shared
>>   objects and create a method for loading the necessary modules at
>>   runtime.
>
> That approach would be a recipe for disaster. We would have to
> introduce a new, reduced functionality block API that was supported for
> plugins. Otherwise, the only way a plugin could keep up with our API
> changes would be if it was in tree, which defeats the purpose of having
> plugins.

We could guarantee API/ABI stability in a stable branch but not across releases.

-- 
error compiling committee.c: too many arguments to function
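One way to picture the "reduced functionality block API" under discussion is a small, versioned table of callbacks that the host checks before accepting a plugin. The sketch below is purely illustrative and is not a QEMU interface: every name in it (sketch_block_plugin, SKETCH_PLUGIN_ABI, the "zero" driver) is hypothetical, and a 512-byte sector size is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ABI revision the host was built against. */
#define SKETCH_PLUGIN_ABI 1u

/* Deliberately narrow callback table: no host-internal structures leak out. */
struct sketch_block_plugin {
    uint32_t abi_version;
    const char *name;
    int (*open)(const char *filename);
    int (*read)(int64_t sector, uint8_t *buf, int nb_sectors);
};

/* Example plugin: a device that always reads back zeroes. */
static int sketch_zero_open(const char *filename)
{
    return filename != NULL ? 0 : -1;
}

static int sketch_zero_read(int64_t sector, uint8_t *buf, int nb_sectors)
{
    (void)sector;
    memset(buf, 0, (size_t)nb_sectors * 512);
    return 0;
}

static const struct sketch_block_plugin sketch_zero_plugin = {
    .abi_version = SKETCH_PLUGIN_ABI,
    .name = "zero",
    .open = sketch_zero_open,
    .read = sketch_zero_read,
};

/* Host side: refuse plugins built against a different ABI revision
 * or with missing callbacks. Returns 0 on success, -1 otherwise. */
static int sketch_plugin_accept(const struct sketch_block_plugin *p)
{
    if (p == NULL || p->abi_version != SKETCH_PLUGIN_ABI) {
        return -1;
    }
    if (p->open == NULL || p->read == NULL) {
        return -1;
    }
    return 0;
}
```

A version check like this only detects a mismatch. It does not solve the maintenance problem raised in the thread, namely that an out-of-tree plugin still has to be rebuilt whenever the table changes.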
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/24/2010 10:16 PM, Anthony Liguori wrote: On 05/24/2010 06:56 AM, Avi Kivity wrote: On 05/24/2010 02:42 PM, MORITA Kazutaka wrote: The server would be local and talk over a unix domain socket, perhaps anonymous. nbd has other issues though, such as requiring a copy and no support for metadata operations such as snapshot and file size extension. Sorry, my explanation was unclear. I'm not sure how running servers on localhost can solve the problem. The local server can convert from the local (nbd) protocol to the remote (sheepdog, ceph) protocol. What I wanted to say was that we cannot specify the image of VM. With nbd protocol, command line arguments are as follows: $ qemu nbd:hostname:port As this syntax shows, with nbd protocol the client cannot pass the VM image name to the server. We would extend it to allow it to connect to a unix domain socket: qemu nbd:unix:/path/to/socket nbd is a no-go because it only supports a single, synchronous I/O operation at a time and has no mechanism for extensibility. If we go this route, I think two options are worth considering. The first would be a purely socket based approach where we just accepted the extra copy. The other potential approach would be shared memory based. We export all guest ram as shared memory along with a small bounce buffer pool. We would then use a ring queue (potentially even using virtio-blk) and an eventfd for notification. We can't actually export guest memory unless we allocate it as a shared memory object, which has many disadvantages. The only way to export anonymous memory now is vmsplice(), which is fairly limited. The server at the other end would associate the socket with a filename and forward it to the server using the remote protocol. However, I don't think nbd would be a good protocol. My preference would be for a plugin API, or for a new local protocol that uses splice() to avoid copies. I think a good shared memory implementation would be preferable to plugins. 
I think it's worth attempting to do a plugin interface for the block layer but I strongly suspect it would not be sufficient. I would not want to see plugins that interacted with BlockDriverState directly, for instance. We change it far too often. Our main loop functions are also not terribly stable so I'm not sure how we would handle that (unless we forced all block plugins to be in a separate thread). If we manage to make a good long-term stable plugin API, it would be a good candidate for the block layer itself. Some OSes manage to have a stable block driver ABI, so it should be possible, if difficult. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/24/2010 10:19 PM, Anthony Liguori wrote:
> On 05/24/2010 06:03 AM, Avi Kivity wrote:
>> On 05/24/2010 11:27 AM, Stefan Hajnoczi wrote:
>>> On Sun, May 23, 2010 at 1:01 PM, Avi Kivity <a...@redhat.com> wrote:
>>>> On 05/21/2010 12:29 AM, Anthony Liguori wrote:
>>>>> I'd be more interested in enabling people to build these types of
>>>>> storage systems without touching qemu.
>>>>>
>>>>> Both sheepdog and ceph ultimately transmit I/O over a socket to a
>>>>> central daemon, right?
>>>>
>>>> That incurs an extra copy.
>>>
>>> Besides a shared memory approach, I wonder if the splice() family of
>>> syscalls could be used to send/receive data through a storage daemon
>>> without the daemon looking at or copying the data?
>>
>> Excellent idea.
>
> splice() eventually requires a copy. You cannot splice() to linux-aio
> so you'd have to splice() to a temporary buffer and then call into
> linux-aio. With shared memory, you can avoid ever bringing the data
> into memory via O_DIRECT and linux-aio.

If the final destination is a socket, then you end up queuing guest memory as an skbuff. In theory we could do an aio splice to block devices but I don't think that's realistic given our experience with aio changes.

-- 
error compiling committee.c: too many arguments to function
Re: [qemu-kvm tests PATCH] qemu-kvm tests cleanup
On 05/15/2010 11:12 AM, Asias He wrote:
> fix test/x86/msr.c fail to build on i386

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 2/2 v2] KVM: MMU: allow more page become unsync at getting sp time
On 05/24/2010 10:41 AM, Xiao Guangrong wrote:
> Allow more page become asynchronous at getting sp time, if need create
> new shadow page for gfn but it not allow unsync(level > 1), we should
> unsync all gfn's unsync page

Both applied, thanks.

-- 
error compiling committee.c: too many arguments to function
RE: ixgbe: macvlan on PF/VF when SRIOV is enabled
On Mon, 2010-05-24 at 10:54 -0700, Rose, Gregory V wrote:

We look forward to it and will be happy to provide feedback.

I have submitted the patch to make macvlan on PF work when SRIOV is enabled.

One thing you can do is allocate VFs and then load the VF driver in your host domain and then assign each of them a macvlan filter. You'd get a similar effect.

That's what I am trying to do: make it work for macvlan on VFs in the host domain. I need to add VF secondary addresses in the address filter, right?

Do you have any aggregation performance comparison between multiple macvlans on PF and single macvlan per VF in host domain?

I will run some tests to figure it out. If you have some data to share that would be great.

Thanks
Shirley
Re: [PATCH] VMX: Fix and improve guest state validity checks
On 05/13/2010 11:15 PM, Mohammed Gamal wrote:
> On Thu, May 13, 2010 at 9:24 AM, Avi Kivity <a...@redhat.com> wrote:
>> On 05/11/2010 07:52 PM, Mohammed Gamal wrote:
>>> - Add 's' and 'g' field checks on segment registers
>>> - Correct SS checks for request and descriptor privilege levels
>>>
>>> Signed-off-by: Mohammed Gamal <m.gamal...@gmail.com>
>>> ---
>>>  arch/x86/kvm/vmx.c |   73 +++
>>>  1 files changed, 67 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 777e00d..9805c2a 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -2121,16 +2121,30 @@ static bool stack_segment_valid(struct kvm_vcpu *vcpu)
>>>      vmx_get_segment(vcpu, &ss, VCPU_SREG_SS);
>>>      ss_rpl = ss.selector & SELECTOR_RPL_MASK;
>>>
>>> -    if (ss.unusable)
>>> +    if (ss.dpl != ss_rpl) /* DPL != RPL */
>>> +        return false;
>>> +
>>> +    if (ss.unusable) /* Short-circuit */
>>>          return true;
>>
>> If ss.unusable, do the dpl and rpl have any meaning?
>
> The idea is that dpl and rpl are checked on vmentry regardless of
> whether ss is usable or not, while the other checks are performed only
> if ss is usable.

Any reference to back this up? I think rpl is valid regardless of ss.unusable (e.g. loading selector 0003 results in an unusable segment with rpl=3), but I don't see how dpl can be valid in an unusable segment.

-- 
error compiling committee.c: too many arguments to function
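The "selector 0003" example in the reply can be made concrete by decoding an x86 segment selector: the low two bits are the RPL, bit 2 is the table indicator, and the remaining bits are the descriptor index. A minimal sketch follows; the helper names and SKETCH_RPL_MASK are hypothetical and only mirror what SELECTOR_RPL_MASK does in the quoted patch.

```c
#include <stdint.h>

/* Low two bits of a selector hold the requested privilege level. */
#define SKETCH_RPL_MASK 0x3u

static unsigned selector_rpl(uint16_t sel)
{
    return sel & SKETCH_RPL_MASK;
}

/* Descriptor table index: everything above the TI bit. */
static unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}

/* A null selector has index 0 in the GDT (TI = 0); only the RPL bits may vary. */
static int selector_is_null(uint16_t sel)
{
    return (sel & ~SKETCH_RPL_MASK) == 0;
}
```

Selector 0x0003 therefore decodes as a null selector carrying rpl=3. The RPL lives in the selector itself, whereas the DPL comes from the descriptor, which a null selector never loads.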
Re: KVM call agenda for May 18
On 05/19/2010 11:20 AM, Christoph Hellwig wrote:
> It's time we get a proper bugzilla.qemu.org for both qemu and qemu-kvm
> that can be used sanely. If you ask nicely you might even get a virtual
> instance of bugzilla.kernel.org which works quite nicely.

That would be my preference too but there's a limit to how much we can juggle the bug database around.

-- 
error compiling committee.c: too many arguments to function
Re: host panic on kernel 2.6.34
Copying netdev, bridge mailing lists. On 05/24/2010 11:23 AM, Hao, Xudong wrote: Hi all I build latest kvm 37dec075a7854f0f550540bf3b9bbeef37c11e2a, based on kernel 2.6.34, after kvm and kvm_intel module loaded, then /etc/init.d/kvm start, a few minutes later, the system will panic. kernel: 2.6.34 kvm: 37dec075a7854f0f550540bf3b9bbeef37c11e2a qemu-kvm: 69dd59a66aaf56d1e8e4c96d0a0923c9cf8f79a0 BUG: unable to handle kernel NULL pointer dereference at 0018 IP: [f914c05b] br_mdb_ip_get+0x2e/0x1aa [bridge] *pdpt = 35fbb001 *pde = Oops: [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map Modules linked in: bridge stp autofs4 hidp rfcomm l2cap crc16 bluetooth rfkill ] Pid: 0, comm: swapper Not tainted 2.6.34 #1 X7DWA/X7DWA EIP: 0060:[f914c05b] EFLAGS: 00010246 CPU: 0 EIP is at br_mdb_ip_get+0x2e/0x1aa [bridge] EAX: c5801d40 EBX: c5801d40 ECX: faef EDX: ESI: f67e03c0 EDI: f5249200 EBP: c5801c94 ESP: c5801c80 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068 Process swapper (pid: 0, ti=c5801000 task=c07f2fe0 task.ti=c07de000) Stack: c5801d40 c5801d40 f67e03c0 f5249200 c5801cb0 f914c6fd fff90006 0 f67e0940 f6326740 f627e064 f67e03c0 c5801d78 f914dd0c f76af140 f6326740 0 f5249200 f67e03c0 0014 f6326758 c5801d54 c08eb440 c5801cf4 c5801d00 Call Trace: [f914c6fd] ? br_multicast_leave_group+0x52/0x128 [bridge] [f914dd0c] ? br_multicast_rcv+0x6dc/0xe90 [bridge] [c0650420] ? fib_lookup+0x2c/0x3a [c064cd15] ? fib_validate_source+0x29d/0x2b4 [c0621175] ? nf_hook_slow+0x3b/0x92 [f9147b39] ? br_handle_frame_finish+0x53/0x17e [bridge] [f914b880] ? br_nf_pre_routing_finish+0x264/0x27c [bridge] [c0621175] ? nf_hook_slow+0x3b/0x92 [f914b61c] ? br_nf_pre_routing_finish+0x0/0x27c [bridge] [f914bf6f] ? br_nf_pre_routing+0x553/0x570 [bridge] [c0621107] ? nf_iterate+0x2f/0x62 [f9147ae6] ? br_handle_frame_finish+0x0/0x17e [bridge] [c0621175] ? nf_hook_slow+0x3b/0x92 [f9147ae6] ? br_handle_frame_finish+0x0/0x17e [bridge] [f9147dda] ? 
br_handle_frame+0x176/0x198 [bridge] [f9147ae6] ? br_handle_frame_finish+0x0/0x17e [bridge] [c060643b] ? __netif_receive_skb+0x29a/0x37e [c0607023] ? dev_gro_receive+0xfd/0x1d2 [c0606e03] ? netif_receive_skb+0x61/0x67 [c0607199] ? __napi_gro_receive+0xa1/0xba [c0606e7e] ? napi_skb_finish+0x1e/0x33 [c0607201] ? napi_gro_receive+0x20/0x24 [f8867cfc] ? igb_poll+0x706/0xa39 [igb] [c06093b2] ? net_rx_action+0x97/0x13b [c0430641] ? __do_softirq+0x80/0xf4 [c04305c1] ? __do_softirq+0x0/0xf4 IRQ [c04305bf] ? irq_exit+0x29/0x2b [c040373e] ? do_IRQ+0x85/0x9b [c0402ca9] ? common_interrupt+0x29/0x30 [c0407c4f] ? mwait_idle+0x4c/0x52 [c0401a08] ? cpu_idle+0x3a/0x4e [c066cf16] ? rest_init+0x62/0x64 [c08248dd] ? start_kernel+0x2c2/0x2c7 [c08240b3] ? i386_start_kernel+0xb3/0xb8 Code: 57 56 53 83 ec 08 89 45 f0 89 55 ec 8b 42 10 66 83 f8 08 74 0e 31 db 66 3 EIP: [f914c05b] br_mdb_ip_get+0x2e/0x1aa [bridge] SS:ESP 0068:c5801c80 CR2: 0018 ---[ end trace 907f878ab4cd8031 ]--- Kernel panic - not syncing: Fatal exception in interrupt Pid: 0, comm: swapper Tainted: G D 2.6.34 #1 Call Trace: [c042c31b] panic+0x3e/0xaa [c0681caa] oops_end+0x8c/0x9b [c041e710] no_context+0x153/0x15d [c041e8a2] __bad_area_nosemaphore+0xe5/0xed [c041e90e] bad_area_nosemaphore+0xd/0x13 [c06838b0] do_page_fault+0x375/0x37d [c0650420] ? fib_lookup+0x2c/0x3a [c0624431] ? ip_route_input_common+0x695/0xf2f [c068353b] ? do_page_fault+0x0/0x37d [c06813d6] error_code+0x66/0x6c [c068353b] ? do_page_fault+0x0/0x37d [f914c05b] ? br_mdb_ip_get+0x2e/0x1aa [bridge] [f914c6fd] br_multicast_leave_group+0x52/0x128 [bridge] [f914dd0c] br_multicast_rcv+0x6dc/0xe90 [bridge] [c0650420] ? fib_lookup+0x2c/0x3a [c064cd15] ? fib_validate_source+0x29d/0x2b4 [c0621175] ? nf_hook_slow+0x3b/0x92 [f9147b39] br_handle_frame_finish+0x53/0x17e [bridge] [f914b880] br_nf_pre_routing_finish+0x264/0x27c [bridge] [c0621175] ? nf_hook_slow+0x3b/0x92 [f914b61c] ? 
br_nf_pre_routing_finish+0x0/0x27c [bridge] [f914bf6f] br_nf_pre_routing+0x553/0x570 [bridge] [c0621107] nf_iterate+0x2f/0x62 [f9147ae6] ? br_handle_frame_finish+0x0/0x17e [bridge] [c0621175] nf_hook_slow+0x3b/0x92 [f9147ae6] ? br_handle_frame_finish+0x0/0x17e [bridge] [f9147dda] br_handle_frame+0x176/0x198 [bridge] [f9147ae6] ? br_handle_frame_finish+0x0/0x17e [bridge] [c060643b] __netif_receive_skb+0x29a/0x37e [c0607023] ? dev_gro_receive+0xfd/0x1d2 [c0606e03] netif_receive_skb+0x61/0x67 [c0607199] ? __napi_gro_receive+0xa1/0xba [c0606e7e] napi_skb_finish+0x1e/0x33 [c0607201] napi_gro_receive+0x20/0x24 [f8867cfc] igb_poll+0x706/0xa39 [igb] [c06093b2] net_rx_action+0x97/0x13b [c0430641] __do_softirq+0x80/0xf4 [c04305c1] ? __do_softirq+0x0/0xf4 IRQ [c04305bf] ? irq_exit+0x29/0x2b [c040373e] ? do_IRQ+0x85/0x9b [c0402ca9] ? common_interrupt+0x29/0x30 [c0407c4f] ?
[PATCH v2 0/7] Tracing backends
After the RFC discussion, here are updated patches which I propose for review and merge.

The following patches against qemu.git allow static trace events to be declared in QEMU. Trace events use a lightweight syntax and are independent of the backend tracing system (e.g. LTTng UST).

Supported backends are:
 * my trivial tracer ("simple")
 * LTTng Userspace Tracer ("ust")
 * no tracer ("nop", the default)

The ./configure option to choose a backend is --trace-backend=.

Main point of this patchset: adding new trace events is easy and we can switch between backends without modifying the code.

These patches are also available at:
http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/tracing

v2:
[PATCH 1/7] trace: Add trace-events file for declaring trace events
 * Use $source_path/tracetool in ./configure
 * Include qemu-common.h in trace.h so common types are available

[PATCH 2/7] trace: Support disabled events in trace-events
 * New in v2: makes it easy to build only a subset of trace events

[PATCH 3/7] trace: Add simple built-in tracing backend
 * Make simpletrace.py parse trace-events instead of generating Python

[PATCH 4/7] trace: Add LTTng Userspace Tracer backend

[PATCH 5/7] trace: Trace qemu_malloc() and qemu_vmalloc()
 * Record pointer result from allocation functions

[PATCH 6/7] trace: Trace virtio-blk, multiwrite, and paio_submit

[PATCH 7/7] trace: Trace virtqueue operations
 * New in v2: observe virtqueue buffer add/remove and notifies
[PATCH 2/7] trace: Support disabled events in trace-events
Sometimes it is useful to disable a trace event. Removing the event from trace-events is not enough, since the source code will still call the trace_*() function for the event.

This patch makes it easy to build without specific trace events by marking them disabled in trace-events:

disable multiwrite_cb(void *mcb, int ret) "mcb %p ret %d"

This builds without the multiwrite_cb trace event.

Signed-off-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
---
v2:
 * This patch is new in v2

 trace-events |    4 +++-
 tracetool    |   10 --
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/trace-events b/trace-events
index a37d3cc..5efaa86 100644
--- a/trace-events
+++ b/trace-events
@@ -12,10 +12,12 @@
 #
 # Format of a trace event:
 #
-# <name>(<type1> <arg1>[, <type2> <arg2>] ...) "<format-string>"
+# [disable] <name>(<type1> <arg1>[, <type2> <arg2>] ...) "<format-string>"
 #
 # Example: qemu_malloc(size_t size) "size %zu"
 #
+# The "disable" keyword will build without the trace event.
+#
 # The <name> must be a valid as a C function name.
 #
 # Types should be standard C types.  Use void * for pointers because the trace
diff --git a/tracetool b/tracetool
index 766a9ba..53d3612 100755
--- a/tracetool
+++ b/tracetool
@@ -110,7 +110,7 @@ linetoc_end_nop()
 # Process stdin by calling begin, line, and end functions for the backend
 convert()
 {
-    local begin process_line end
+    local begin process_line end str disable
     begin="lineto$1_begin_$backend"
     process_line="lineto$1_$backend"
     end="lineto$1_end_$backend"
@@ -123,8 +123,14 @@ convert()
         str=${str%%#*}
         test -z "$str" && continue
 
+        # Process the line.  The nop backend handles disabled lines.
+        disable=${str%%disable*}
         echo
-        "$process_line" "$str"
+        if test -z "$disable"; then
+            "lineto$1_nop" "${str##disable}"
+        else
+            "$process_line" "$str"
+        fi
     done
 
     echo
-- 
1.7.1
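The tracetool change above treats a declaration as disabled when the line starts with the disable keyword, then hands the rest of the line to the nop backend. The same prefix test can be sketched in C for readers less used to shell parameter expansion. split_disable() is a hypothetical helper, not part of tracetool, and it assumes the keyword is followed by a single space.

```c
#include <stddef.h>
#include <string.h>

/* If `line` begins with "disable ", return 1 and point *rest at the
 * declaration that follows the keyword; otherwise return 0 and leave
 * *rest pointing at the whole line. */
static int split_disable(const char *line, const char **rest)
{
    static const char keyword[] = "disable ";
    size_t n = sizeof(keyword) - 1;

    if (strncmp(line, keyword, n) == 0) {
        *rest = line + n;
        return 1;
    }
    *rest = line;
    return 0;
}
```

Only a leading keyword counts, so an event whose name merely contains "disable" stays enabled, which matches the behaviour of the `${str%%disable*}` test in the patch.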
[PATCH 1/7] trace: Add trace-events file for declaring trace events
This patch introduces the trace-events file where trace events can be declared like so: qemu_malloc(size_t size) size %zu qemu_free(void *ptr) ptr %p These trace event declarations are processed by a new tool called tracetool to generate code for the trace events. Trace event declarations are independent of the backend tracing system (LTTng User Space Tracing, ftrace markers, DTrace). The default nop backend generates empty trace event functions. Therefore trace events are disabled by default. The trace-events file serves two purposes: 1. Adding trace events is easy. It is not necessary to understand the details of a backend tracing system. The trace-events file is a single location where trace events can be declared without code duplication. 2. QEMU is not tightly coupled to one particular backend tracing system. In order to support tracing across QEMU host platforms and to anticipate new backend tracing systems that are currently maturing, it is important to be flexible and not tied to one system. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- v2: * Use $source_path/tracetool in ./configure * Include qemu-common.h in trace.h so common types are available .gitignore |2 + Makefile| 17 - Makefile.objs |5 ++ Makefile.target |1 + configure | 19 ++ trace-events| 24 tracetool | 165 +++ 7 files changed, 229 insertions(+), 4 deletions(-) create mode 100644 trace-events create mode 100755 tracetool diff --git a/.gitignore b/.gitignore index fdfe2f0..4644557 100644 --- a/.gitignore +++ b/.gitignore @@ -2,6 +2,8 @@ config-devices.* config-all-devices.* config-host.* config-target.* +trace.h +trace.c *-softmmu *-darwin-user *-linux-user diff --git a/Makefile b/Makefile index 7986bf6..a9f79a9 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ # Makefile for QEMU. -GENERATED_HEADERS = config-host.h +GENERATED_HEADERS = config-host.h trace.h ifneq ($(wildcard config-host.mak),) # Put the all: rule here so that config-host.mak can contain dependencies. 
@@ -130,16 +130,24 @@ bt-host.o: QEMU_CFLAGS += $(BLUEZ_CFLAGS) iov.o: iov.c iov.h +trace.h: trace-events + $(call quiet-command,sh $(SRC_PATH)/tracetool --$(TRACE_BACKEND) -h $ $@, GEN $@) + +trace.c: trace-events + $(call quiet-command,sh $(SRC_PATH)/tracetool --$(TRACE_BACKEND) -c $ $@, GEN $@) + +trace.o: trace.c + ## qemu-img.o: qemu-img-cmds.h qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o: $(GENERATED_HEADERS) -qemu-img$(EXESUF): qemu-img.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y) +qemu-img$(EXESUF): qemu-img.o qemu-tool.o qemu-error.o $(trace-obj-y) $(block-obj-y) $(qobject-obj-y) -qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y) +qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o qemu-error.o $(trace-obj-y) $(block-obj-y) $(qobject-obj-y) -qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y) +qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o $(trace-obj-y) $(block-obj-y) $(qobject-obj-y) qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx $(call quiet-command,sh $(SRC_PATH)/hxtool -h $ $@, GEN $@) @@ -157,6 +165,7 @@ clean: rm -f *.o *.d *.a $(TOOLS) TAGS cscope.* *.pod *~ */*~ rm -f slirp/*.o slirp/*.d audio/*.o audio/*.d block/*.o block/*.d net/*.o net/*.d rm -f qemu-img-cmds.h + rm -f trace.c trace.h $(MAKE) -C tests clean for d in $(ALL_SUBDIRS) libhw32 libhw64 libuser libdis libdis-user; do \ if test -d $$d; then $(MAKE) -C $$d $@ || exit 1; fi; \ diff --git a/Makefile.objs b/Makefile.objs index 1a942e5..20e709e 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -251,6 +251,11 @@ libdis-$(CONFIG_S390_DIS) += s390-dis.o libdis-$(CONFIG_SH4_DIS) += sh4-dis.o libdis-$(CONFIG_SPARC_DIS) += sparc-dis.o +## +# trace + +trace-obj-y = trace.o + vl.o: QEMU_CFLAGS+=$(GPROF_CFLAGS) vl.o: QEMU_CFLAGS+=$(SDL_CFLAGS) diff --git a/Makefile.target b/Makefile.target index fda5bf3..8f7b564 100644 --- a/Makefile.target +++ b/Makefile.target @@ -293,6 +293,7 @@ $(obj-y) 
$(obj-$(TARGET_BASE_ARCH)-y): $(GENERATED_HEADERS) obj-y += $(addprefix ../, $(common-obj-y)) obj-y += $(addprefix ../libdis/, $(libdis-y)) +obj-y += $(addprefix ../, $(trace-obj-y)) obj-y += $(libobj-y) obj-y += $(addprefix $(HWDIR)/, $(hw-obj-y)) diff --git a/configure b/configure index 3cd2c5f..e94e113 100755 --- a/configure +++ b/configure @@ -299,6 +299,7 @@ pkgversion= check_utests=no user_pie=no zero_malloc= +trace_backend=nop # OS specific if check_define __linux__ ; then @@ -494,6 +495,8 @@ for opt do ;; --target-list=*) target_list=$optarg ;; +
[PATCH 4/7] trace: Add LTTng Userspace Tracer backend
This patch adds LTTng Userspace Tracer (UST) backend support. The UST system requires no kernel support but libust and liburcu must be installed. $ ./configure --trace-backend ust $ make Start the UST daemon: $ ustd List available tracepoints and enable some: $ ustctl --list-markers $(pgrep qemu) [...] {PID: 5458, channel/marker: ust/paio_submit, state: 0, fmt: acb %p opaque %p sector_num %lu nb_sectors %lu type %lu 0x4b32ba} $ ustctl --enable-marker ust/paio_submit $(pgrep qemu) Run the trace: $ ustctl --create-trace $(pgrep qemu) $ ustctl --start-trace $(pgrep qemu) [...] $ ustctl --stop-trace $(pgrep qemu) $ ustctl --destroy-trace $(pgrep qemu) Trace results can be viewed using lttv-gui. More information about UST: http://lttng.org/ust Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- configure |5 +++- tracetool | 77 +++- 2 files changed, 79 insertions(+), 3 deletions(-) diff --git a/configure b/configure index 7d2c69b..675d0fc 100755 --- a/configure +++ b/configure @@ -829,7 +829,7 @@ echo --enable-docsenable documentation build echo --disable-docs disable documentation build echo --disable-vhost-net disable vhost-net acceleration support echo --enable-vhost-net enable vhost-net acceleration support -echo --trace-backend=BTrace backend nop simple +echo --trace-backend=BTrace backend nop simple ust echo echo NOTE: The object files are built at the place where configure is launched exit 1 @@ -2302,6 +2302,9 @@ bsd) esac echo TRACE_BACKEND=$trace_backend $config_host_mak +if test $trace_backend = ust; then + LIBS=-lust $LIBS +fi tools= if test `expr $target_list : .*softmmu.*` != 0 ; then diff --git a/tracetool b/tracetool index f094ddc..9ea9c08 100755 --- a/tracetool +++ b/tracetool @@ -3,12 +3,13 @@ usage() { cat 2 EOF -usage: $0 [--nop | --simple] [-h | -c] +usage: $0 [--nop | --simple | --ust] [-h | -c] Generate tracing code for a file on stdin. 
Backends: --nop Tracing disabled --simple Simple built-in backend + --ust LTTng User Space Tracing backend Output formats: -hGenerate .h file @@ -179,6 +180,78 @@ linetoc_end_simple() return } +linetoh_begin_ust() +{ +echo #include ust/tracepoint.h +} + +linetoh_ust() +{ +local name args argnames +name=$(get_name $1) +args=$(get_args $1) +argnames=$(get_argnames $1) + +cat EOF +DECLARE_TRACE(ust_$name, TPPROTO($args), TPARGS($argnames)); +#define trace_$name trace_ust_$name +EOF +} + +linetoh_end_ust() +{ +# Clean up after UST headers which pollute the namespace +cat EOF +#undef mutex_lock +#undef mutex_unlock +EOF +} + +linetoc_begin_ust() +{ +cat EOF +#include ust/marker.h +#include trace.h +EOF +} + +linetoc_ust() +{ +local name args argnames fmt +name=$(get_name $1) +args=$(get_args $1) +argnames=$(get_argnames $1) +fmt=$(get_fmt $1) + +cat EOF +DEFINE_TRACE(ust_$name); + +static void ust_${name}_probe($args) +{ +trace_mark(ust, $name, $fmt, $argnames); +} +EOF + +# Collect names for later +names=$names $name +} + +linetoc_end_ust() +{ +cat EOF +static void __attribute__((constructor)) trace_init(void) +{ +EOF + +for name in $names; do +cat EOF +register_trace_ust_$name(ust_${name}_probe); +EOF +done + +echo } +} + # Process stdin by calling begin, line, and end functions for the backend convert() { @@ -228,7 +301,7 @@ tracetoc() # Choose backend case $1 in ---nop | --simple) backend=${1#--} ;; +--nop | --simple | --ust) backend=${1#--} ;; *) usage ;; esac shift -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
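The code emitted by linetoc_end_ust in this patch wraps all probe registrations in a function marked __attribute__((constructor)), so they run before main() without any explicit init call in QEMU. A minimal sketch of that pattern, independent of libust: the registry, the probe functions, and every sketch_ name below are hypothetical stand-ins for register_trace_ust_*() and the generated probes, and the attribute is a GCC/Clang extension.

```c
#include <stddef.h>

#define SKETCH_MAX_PROBES 8

typedef void (*sketch_probe_fn)(void);

/* Hypothetical registry standing in for the tracer's own bookkeeping. */
static sketch_probe_fn sketch_probes[SKETCH_MAX_PROBES];
static size_t sketch_probe_count;

static int sketch_register_probe(sketch_probe_fn fn)
{
    if (sketch_probe_count == SKETCH_MAX_PROBES) {
        return -1;
    }
    sketch_probes[sketch_probe_count++] = fn;
    return 0;
}

/* Empty stand-ins for the per-event probe functions. */
static void sketch_paio_submit_probe(void) {}
static void sketch_multiwrite_cb_probe(void) {}

/* Runs automatically before main() on GCC and Clang. */
static void __attribute__((constructor)) sketch_trace_init(void)
{
    sketch_register_probe(sketch_paio_submit_probe);
    sketch_register_probe(sketch_multiwrite_cb_probe);
}
```

By the time main() starts, both probes are already registered. The generated trace.c gets the same effect with one register call per name collected in $names.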
[PATCH 7/7] trace: Trace virtqueue operations
This patch adds trace events for virtqueue operations including adding/removing buffers, notifying the guest, and receiving a notify from the guest. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- v2: * This patch is new in v2 hw/virtio.c |8 trace-events |8 2 files changed, 16 insertions(+), 0 deletions(-) diff --git a/hw/virtio.c b/hw/virtio.c index 4475bb3..a5741ae 100644 --- a/hw/virtio.c +++ b/hw/virtio.c @@ -13,6 +13,7 @@ #include inttypes.h +#include trace.h #include virtio.h #include sysemu.h @@ -205,6 +206,8 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem, unsigned int offset; int i; +trace_virtqueue_fill(vq, elem, len, idx); + offset = 0; for (i = 0; i elem-in_num; i++) { size_t size = MIN(len - offset, elem-in_sg[i].iov_len); @@ -232,6 +235,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count) { /* Make sure buffer is written before we update index. */ wmb(); +trace_virtqueue_flush(vq, count); vring_used_idx_increment(vq, count); vq-inuse -= count; } @@ -422,6 +426,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem) vq-inuse++; +trace_virtqueue_pop(vq, elem, elem-in_num, elem-out_num); return elem-in_num + elem-out_num; } @@ -560,6 +565,7 @@ int virtio_queue_get_num(VirtIODevice *vdev, int n) void virtio_queue_notify(VirtIODevice *vdev, int n) { if (n VIRTIO_PCI_QUEUE_MAX vdev-vq[n].vring.desc) { +trace_virtio_queue_notify(vdev, n, vdev-vq[n]); vdev-vq[n].handle_output(vdev, vdev-vq[n]); } } @@ -597,6 +603,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, void virtio_irq(VirtQueue *vq) { +trace_virtio_irq(vq); vq-vdev-isr |= 0x01; virtio_notify_vector(vq-vdev, vq-vector); } @@ -609,6 +616,7 @@ void virtio_notify(VirtIODevice *vdev, VirtQueue *vq) (vq-inuse || vring_avail_idx(vq) != vq-last_avail_idx))) return; +trace_virtio_notify(vdev, vq); vdev-isr |= 0x01; virtio_notify_vector(vdev, vq-vector); } diff --git a/trace-events b/trace-events index 48415f8..a533414 100644 --- 
a/trace-events +++ b/trace-events @@ -35,6 +35,14 @@ qemu_memalign(size_t alignment, size_t size, void *ptr) alignment %zu size %zu qemu_valloc(size_t size, void *ptr) size %zu ptr %p qemu_vfree(void *ptr) ptr %p +# hw/virtio.c +virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) vq %p elem %p len %u idx %u +virtqueue_flush(void *vq, unsigned int count) vq %p count %u +virtqueue_pop(void *vq, void *elem, unsigned int in_num, unsigned int out_num) vq %p elem %p in_num %u out_num %u +virtio_queue_notify(void *vdev, int n, void *vq) vdev %p n %d vq %p +virtio_irq(void *vq) vq %p +virtio_notify(void *vdev, void *vq) vdev %p vq %p + # block.c multiwrite_cb(void *mcb, int ret) mcb %p ret %d bdrv_aio_multiwrite(void *mcb, int num_callbacks, int num_reqs) mcb %p num_callbacks %d num_reqs %d -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/7] trace: Trace virtio-blk, multiwrite, and paio_submit
This patch adds trace events that make it possible to observe
virtio-blk.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 block.c            |    7 +++
 hw/virtio-blk.c    |    7 +++
 posix-aio-compat.c |    2 ++
 trace-events       |   14 ++
 4 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 0b0966c..56db112 100644
--- a/block.c
+++ b/block.c
@@ -23,6 +23,7 @@
  */
 #include "config-host.h"
 #include "qemu-common.h"
+#include "trace.h"
 #include "monitor.h"
 #include "block_int.h"
 #include "module.h"
@@ -1922,6 +1923,8 @@ static void multiwrite_cb(void *opaque, int ret)
 {
     MultiwriteCB *mcb = opaque;
 
+    trace_multiwrite_cb(mcb, ret);
+
     if (ret < 0 && !mcb->error) {
         mcb->error = ret;
         multiwrite_user_cb(mcb);
@@ -2065,6 +2068,8 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
     // Check for mergable requests
     num_reqs = multiwrite_merge(bs, reqs, num_reqs, mcb);
 
+    trace_bdrv_aio_multiwrite(mcb, mcb->num_callbacks, num_reqs);
+
     // Run the aio requests
     for (i = 0; i < num_reqs; i++) {
         acb = bdrv_aio_writev(bs, reqs[i].sector, reqs[i].qiov,
@@ -2075,9 +2080,11 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
             // submitted yet. Otherwise we'll wait for the submitted AIOs to
             // complete and report the error in the callback.
             if (mcb->num_requests == 0) {
+                trace_bdrv_aio_multiwrite_earlyfail(mcb);
                 reqs[i].error = -EIO;
                 goto fail;
             } else {
+                trace_bdrv_aio_multiwrite_latefail(mcb, i);
                 mcb->num_requests++;
                 multiwrite_cb(mcb, -EIO);
                 break;
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 5d7f1a2..706f109 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -13,6 +13,7 @@
 
 #include <qemu-common.h>
 #include <sysemu.h>
+#include "trace.h"
 #include "virtio-blk.h"
 #include "block_int.h"
 #ifdef __linux__
@@ -50,6 +51,8 @@ static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
 {
     VirtIOBlock *s = req->dev;
 
+    trace_virtio_blk_req_complete(req, status);
+
     req->in->status = status;
     virtqueue_push(s->vq, &req->elem, req->qiov.size + sizeof(*req->in));
     virtio_notify(&s->vdev, s->vq);
@@ -87,6 +90,8 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
 {
     VirtIOBlockReq *req = opaque;
 
+    trace_virtio_blk_rw_complete(req, ret);
+
     if (ret) {
         int is_read = !(req->out->type & VIRTIO_BLK_T_OUT);
         if (virtio_blk_handle_rw_error(req, -ret, is_read))
@@ -263,6 +268,8 @@ static void virtio_blk_handle_flush(BlockRequest *blkreq, int *num_writes,
 static void virtio_blk_handle_write(BlockRequest *blkreq, int *num_writes,
     VirtIOBlockReq *req, BlockDriverState **old_bs)
 {
+    trace_virtio_blk_handle_write(req, req->out->sector, req->qiov.size / 512);
+
     if (req->out->sector & req->dev->sector_mask) {
         virtio_blk_rw_complete(req, -EIO);
         return;
diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index b43c531..c2200fe 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -25,6 +25,7 @@
 #include "qemu-queue.h"
 #include "osdep.h"
 #include "qemu-common.h"
+#include "trace.h"
 #include "block_int.h"
 
 #include "block/raw-posix-aio.h"
@@ -583,6 +584,7 @@ BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
     acb->next = posix_aio_state->first_aio;
     posix_aio_state->first_aio = acb;
 
+    trace_paio_submit(acb, opaque, sector_num, nb_sectors, type);
     qemu_paio_submit(acb);
     return &acb->common;
 }
diff --git a/trace-events b/trace-events
index 3fde0c6..48415f8 100644
--- a/trace-events
+++ b/trace-events
@@ -34,3 +34,17 @@ qemu_free(void *ptr) "ptr %p"
 qemu_memalign(size_t alignment, size_t size, void *ptr) "alignment %zu size %zu ptr %p"
 qemu_valloc(size_t size, void *ptr) "size %zu ptr %p"
 qemu_vfree(void *ptr) "ptr %p"
+
+# block.c
+multiwrite_cb(void *mcb, int ret) "mcb %p ret %d"
+bdrv_aio_multiwrite(void *mcb, int num_callbacks, int num_reqs) "mcb %p num_callbacks %d num_reqs %d"
+bdrv_aio_multiwrite_earlyfail(void *mcb) "mcb %p"
+bdrv_aio_multiwrite_latefail(void *mcb, int i) "mcb %p i %d"
+
+# hw/virtio-blk.c
+virtio_blk_req_complete(void *req, int status) "req %p status %d"
+virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
+virtio_blk_handle_write(void *req, unsigned long sector, unsigned long nsectors) "req %p sector %lu nsectors %lu"
+
+# posix-aio-compat.c
+paio_submit(void *acb, void *opaque, unsigned long sector_num, unsigned long nb_sectors, unsigned long type) "acb %p opaque %p sector_num %lu nb_sectors %lu type %lu"
--
1.7.1
[PATCH 5/7] trace: Trace qemu_malloc() and qemu_vmalloc()
It is often useful to instrument memory management functions in order
to find leaks or performance problems. This patch adds trace events for
the memory allocation primitives.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
v2:
 * Record pointer result from allocation functions

 osdep.c       |   24 ++--
 qemu-malloc.c |   12 ++--
 trace-events  |   10 ++
 3 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/osdep.c b/osdep.c
index abbc8a2..a6b7726 100644
--- a/osdep.c
+++ b/osdep.c
@@ -50,6 +50,7 @@
 #endif
 
 #include "qemu-common.h"
+#include "trace.h"
 #include "sysemu.h"
 #include "qemu_socket.h"
 
@@ -71,25 +72,34 @@ static void *oom_check(void *ptr)
 #if defined(_WIN32)
 void *qemu_memalign(size_t alignment, size_t size)
 {
+    void *ptr;
+
     if (!size) {
         abort();
     }
-    return oom_check(VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE));
+    ptr = oom_check(VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE));
+    trace_qemu_memalign(alignment, size, ptr);
+    return ptr;
 }
 
 void *qemu_vmalloc(size_t size)
 {
+    void *ptr;
+
     /* FIXME: this is not exactly optimal solution since VirtualAlloc
        has 64Kb granularity, but at least it guarantees us that the
        memory is page aligned. */
     if (!size) {
         abort();
     }
-    return oom_check(VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE));
+    ptr = oom_check(VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE));
+    trace_qemu_vmalloc(size, ptr);
+    return ptr;
 }
 
 void qemu_vfree(void *ptr)
 {
+    trace_qemu_vfree(ptr);
     VirtualFree(ptr, 0, MEM_RELEASE);
 }
 
@@ -97,21 +107,22 @@ void qemu_vfree(void *ptr)
 
 void *qemu_memalign(size_t alignment, size_t size)
 {
+    void *ptr;
 #if defined(_POSIX_C_SOURCE) && !defined(__sun__)
     int ret;
-    void *ptr;
     ret = posix_memalign(&ptr, alignment, size);
     if (ret != 0) {
         fprintf(stderr, "Failed to allocate %zu B: %s\n",
                 size, strerror(ret));
         abort();
     }
-    return ptr;
 #elif defined(CONFIG_BSD)
-    return oom_check(valloc(size));
+    ptr = oom_check(valloc(size));
 #else
-    return oom_check(memalign(alignment, size));
+    ptr = oom_check(memalign(alignment, size));
 #endif
+    trace_qemu_memalign(alignment, size, ptr);
+    return ptr;
 }
 
 /* alloc shared memory pages */
@@ -122,6 +133,7 @@ void *qemu_vmalloc(size_t size)
 
 void qemu_vfree(void *ptr)
 {
+    trace_qemu_vfree(ptr);
     free(ptr);
 }
 
diff --git a/qemu-malloc.c b/qemu-malloc.c
index 6cdc5de..72de60a 100644
--- a/qemu-malloc.c
+++ b/qemu-malloc.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu-common.h"
+#include "trace.h"
 #include <stdlib.h>
 
 static void *oom_check(void *ptr)
@@ -39,6 +40,7 @@ void *get_mmap_addr(unsigned long size)
 
 void qemu_free(void *ptr)
 {
+    trace_qemu_free(ptr);
     free(ptr);
 }
 
@@ -53,18 +55,24 @@ static int allow_zero_malloc(void)
 
 void *qemu_malloc(size_t size)
 {
+    void *ptr;
     if (!size && !allow_zero_malloc()) {
         abort();
     }
-    return oom_check(malloc(size ? size : 1));
+    ptr = oom_check(malloc(size ? size : 1));
+    trace_qemu_malloc(size, ptr);
+    return ptr;
 }
 
 void *qemu_realloc(void *ptr, size_t size)
 {
+    void *newptr;
     if (!size && !allow_zero_malloc()) {
         abort();
     }
-    return oom_check(realloc(ptr, size ? size : 1));
+    newptr = oom_check(realloc(ptr, size ? size : 1));
+    trace_qemu_realloc(ptr, size, newptr);
+    return newptr;
 }
 
 void *qemu_mallocz(size_t size)
diff --git a/trace-events b/trace-events
index 5efaa86..3fde0c6 100644
--- a/trace-events
+++ b/trace-events
@@ -24,3 +24,13 @@
 # system may not have the necessary headers included.
 #
 # The format-string should be a sprintf()-compatible format string.
+
+# qemu-malloc.c
+qemu_malloc(size_t size, void *ptr) "size %zu ptr %p"
+qemu_realloc(void *ptr, size_t size, void *newptr) "ptr %p size %zu newptr %p"
+qemu_free(void *ptr) "ptr %p"
+
+# osdep.c
+qemu_memalign(size_t alignment, size_t size, void *ptr) "alignment %zu size %zu ptr %p"
+qemu_valloc(size_t size, void *ptr) "size %zu ptr %p"
+qemu_vfree(void *ptr) "ptr %p"
--
1.7.1
[PATCH 3/7] trace: Add simple built-in tracing backend
This patch adds a simple tracer which produces binary trace files and
is built into QEMU. The main purpose of this patch is to show how new
tracing backends can be added to tracetool.

To try out the simple backend:

./configure --trace-backend=simple
make

After running QEMU you can pretty-print the trace:

./simpletrace.py trace-events /tmp/trace.log

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
I intend for this tracing backend to be replaced by something based on Prerna's
work. For now it is useful for basic tracing.

v2:
 * Make simpletrace.py parse trace-events instead of generating Python

 .gitignore     |    1 +
 Makefile.objs  |    3 ++
 configure      |    2 +-
 simpletrace.c  |   64 ++
 simpletrace.py |   53 ++
 tracetool      |   78 +--
 6 files changed, 197 insertions(+), 4 deletions(-)
 create mode 100644 simpletrace.c
 create mode 100755 simpletrace.py

diff --git a/.gitignore b/.gitignore
index 4644557..5128452 100644
--- a/.gitignore
+++ b/.gitignore
@@ -39,6 +39,7 @@ qemu-monitor.texi
 *.log
 *.pdf
 *.pg
+*.pyc
 *.toc
 *.tp
 *.vr
diff --git a/Makefile.objs b/Makefile.objs
index 20e709e..7cb40ac 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -255,6 +255,9 @@ libdis-$(CONFIG_SPARC_DIS) += sparc-dis.o
 # trace
 
 trace-obj-y = trace.o
+ifeq ($(TRACE_BACKEND),simple)
+trace-obj-y += simpletrace.o
+endif
 
 vl.o: QEMU_CFLAGS+=$(GPROF_CFLAGS)
 
diff --git a/configure b/configure
index e94e113..7d2c69b 100755
--- a/configure
+++ b/configure
@@ -829,7 +829,7 @@ echo "  --enable-docs            enable documentation build"
 echo "  --disable-docs           disable documentation build"
 echo "  --disable-vhost-net      disable vhost-net acceleration support"
 echo "  --enable-vhost-net       enable vhost-net acceleration support"
-echo "  --trace-backend=B        Trace backend nop"
+echo "  --trace-backend=B        Trace backend nop simple"
 echo ""
 echo "NOTE: The object files are built at the place where configure is launched"
 exit 1
diff --git a/simpletrace.c b/simpletrace.c
new file mode 100644
index 0000000..2fec4d3
--- /dev/null
+++ b/simpletrace.c
@@ -0,0 +1,64 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include "trace.h"
+
+typedef struct {
+    unsigned long event;
+    unsigned long x1;
+    unsigned long x2;
+    unsigned long x3;
+    unsigned long x4;
+    unsigned long x5;
+} TraceRecord;
+
+enum {
+    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
+};
+
+static TraceRecord trace_buf[TRACE_BUF_LEN];
+static unsigned int trace_idx;
+static FILE *trace_fp;
+
+static void trace(TraceEvent event, unsigned long x1,
+                  unsigned long x2, unsigned long x3,
+                  unsigned long x4, unsigned long x5) {
+    TraceRecord *rec = &trace_buf[trace_idx];
+    rec->event = event;
+    rec->x1 = x1;
+    rec->x2 = x2;
+    rec->x3 = x3;
+    rec->x4 = x4;
+    rec->x5 = x5;
+
+    if (++trace_idx == TRACE_BUF_LEN) {
+        trace_idx = 0;
+
+        if (!trace_fp) {
+            trace_fp = fopen("/tmp/trace.log", "w");
+        }
+        if (trace_fp) {
+            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
+            result = result;
+        }
+    }
+}
+
+void trace1(TraceEvent event, unsigned long x1) {
+    trace(event, x1, 0, 0, 0, 0);
+}
+
+void trace2(TraceEvent event, unsigned long x1, unsigned long x2) {
+    trace(event, x1, x2, 0, 0, 0);
+}
+
+void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3) {
+    trace(event, x1, x2, x3, 0, 0);
+}
+
+void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4) {
+    trace(event, x1, x2, x3, x4, 0);
+}
+
+void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5) {
+    trace(event, x1, x2, x3, x4, x5);
+}
diff --git a/simpletrace.py b/simpletrace.py
new file mode 100755
index 0000000..d6631ba
--- /dev/null
+++ b/simpletrace.py
@@ -0,0 +1,53 @@
+#!/usr/bin/env python
+import sys
+import struct
+import re
+
+trace_fmt = 'LLLLLL'
+trace_len = struct.calcsize(trace_fmt)
+event_re  = re.compile(r'(disable\s+)?([a-zA-Z0-9_]+)\(([^)]*)\)\s+"([^"]*)"')
+
+def parse_events(fobj):
+    def get_argnames(args):
+        return tuple(arg.split()[-1].lstrip('*') for arg in args.split(','))
+
+    events = {}
+    event_num = 0
+    for line in fobj:
+        m = event_re.match(line.strip())
+        if m is None:
+            continue
+
+        disable, name, args, fmt = m.groups()
+        if disable:
+            continue
+
+        events[event_num] = (name,) + get_argnames(args)
+        event_num += 1
+    return events
+
+def read_record(fobj):
+    s = fobj.read(trace_len)
+    if len(s) != trace_len:
+        return None
+    return struct.unpack(trace_fmt, s)
+
+def format_record(events, rec):
+    event = events[rec[0]]
+    fields =
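The archived copy of the patch breaks off inside format_record(), so here is a minimal, self-contained sketch of how the binary records could be round-tripped and pretty-printed. It assumes one record is six native unsigned longs (event id plus x1..x5), mirroring the TraceRecord struct above. The helper names pack_record and read_records, the EVENTS table, and the "name=0xvalue" output style are made up for illustration and are not taken from the patch.

```python
import io
import struct

# Assumption: one record is six native unsigned longs (event, x1..x5),
# mirroring the TraceRecord struct in simpletrace.c.
RECORD_FMT = "6L"
RECORD_LEN = struct.calcsize(RECORD_FMT)

# Hypothetical event table in the shape parse_events() builds:
# event number -> (name, argname, argname, ...)
EVENTS = {
    0: ("qemu_malloc", "size", "ptr"),
    1: ("qemu_free", "ptr"),
}


def pack_record(event, *args):
    """Pad the arguments to five slots with zeros, as trace1()..trace5() do."""
    if len(args) > 5:
        raise ValueError("a record holds at most five arguments")
    padded = args + (0,) * (5 - len(args))
    return struct.pack(RECORD_FMT, event, *padded)


def read_records(fobj):
    """Yield (event, x1, ..., x5) tuples until the stream has no full record left."""
    while True:
        chunk = fobj.read(RECORD_LEN)
        if len(chunk) != RECORD_LEN:
            return
        yield struct.unpack(RECORD_FMT, chunk)


def format_record(events, rec):
    """Render one record as 'name arg=0x...' using the event table."""
    name, *argnames = events[rec[0]]
    # zip() stops at the shorter sequence, so unused zero slots are dropped.
    fields = ["%s=0x%x" % (arg, val) for arg, val in zip(argnames, rec[1:])]
    return " ".join([name] + fields)


if __name__ == "__main__":
    log = io.BytesIO(pack_record(0, 64, 0x1000) + pack_record(1, 0x1000))
    for rec in read_records(log):
        print(format_record(EVENTS, rec))
```

Because the record uses the native unsigned long size, a trace file written this way is only readable on a host with the same word size and byte order.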
Re: [PATCH] VMX: Fix and improve guest state validity checks
On Tue, May 25, 2010 at 12:37 PM, Avi Kivity a...@redhat.com wrote:
> On 05/13/2010 11:15 PM, Mohammed Gamal wrote:
>> On Thu, May 13, 2010 at 9:24 AM, Avi Kivity a...@redhat.com wrote:
>>> On 05/11/2010 07:52 PM, Mohammed Gamal wrote:
>>>> - Add 's' and 'g' field checks on segment registers
>>>> - Correct SS checks for request and descriptor privilege levels
>>>>
>>>> Signed-off-by: Mohammed Gamal m.gamal...@gmail.com
>>>> ---
>>>>  arch/x86/kvm/vmx.c |   73 +++
>>>>  1 files changed, 67 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 777e00d..9805c2a 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -2121,16 +2121,30 @@ static bool stack_segment_valid(struct kvm_vcpu *vcpu)
>>>>  	vmx_get_segment(vcpu, &ss, VCPU_SREG_SS);
>>>>  	ss_rpl = ss.selector & SELECTOR_RPL_MASK;
>>>>
>>>> -	if (ss.unusable)
>>>> +	if (ss.dpl != ss_rpl) /* DPL != RPL */
>>>> +		return false;
>>>> +
>>>> +	if (ss.unusable) /* Short-circuit */
>>>>  		return true;
>>>
>>> If ss.unusable, do the dpl and rpl have any meaning?
>>
>> The idea is that dpl and rpl are checked on vmentry regardless of
>> whether ss is usable or not. While the other checks are performed only
>> if ss is usable.
>
> Any reference to back this up? I think rpl is valid regardless of
> ss.unusable (i.e. loading selector 0003 results in an unusable segment
> with rpl=3), but I don't see how dpl can be valid in an unusable segment.

Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3B,
System Programming Guide, Part 2, Chapter 22, Section 22.3.1.2: Checks
on Guest Segment Registers.

You'll note that DS, ES, FS, GS checks are done when the segment is
usable. SS checks are not necessarily checked only when the segment is
usable.

> --
> error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 23.05.2010 14:01, Avi Kivity wrote:
> On 05/21/2010 12:29 AM, Anthony Liguori wrote:
>> I'd be more interested in enabling people to build these types of
>> storage systems without touching qemu.
>>
>> Both sheepdog and ceph ultimately transmit I/O over a socket to a
>> central daemon, right?
>
> That incurs an extra copy.
>
>> So could we not standardize a protocol for this that both sheepdog and
>> ceph could implement?
>
> The protocol already exists, nbd. It doesn't support snapshotting etc.
> but we could extend it.
>
> But IMO what's needed is a plugin API for the block layer.

What would it buy us, apart from more downstreams and having to maintain
a stable API and ABI? Hiding block drivers somewhere else doesn't make
them stop existing, they just might not be properly integrated, but
rather hacked in to fit that limited stable API.

Kevin
Re: [Qemu-devel] Re: irq problems after live migration with 0.12.4
Michael Tokarev wrote: 23.05.2010 13:55, Peter Lieven wrote: Hi, after live migrating ubuntu 9.10 server (2.6.31-14-server) and suse linux 10.1 (2.6.16.13-4-smp) it happens sometimes that the guest runs into irq problems. i mention these 2 guest oss since i have seen the error there. there are likely others around with the same problem. on the host i run 2.6.33.3 (kernel+mod) and qemu-kvm 0.12.4. i started a vm with: /usr/bin/qemu-kvm-0.12.4 -net tap,vlan=141,script=no,downscript=no,ifname=tap0 -net nic,vlan=141,model=e1000,macaddr=52:54:00:ff:00:72 -drive file=/dev/sdb,if=ide,boot=on,cache=none,aio=native -m 1024 -cpu qemu64,model_id='Intel(R) Xeon(R) CPU E5430 @ 2.66GHz' -monitor tcp:0:4001,server,nowait -vnc :1 -name 'migration-test-9-10' -boot order=dc,menu=on -k de -incoming tcp:172.21.55.22:5001 -pidfile /var/run/qemu/vm-155.pid -mem-path /hugepages -mem-prealloc -rtc base=utc,clock=host -usb -usbdevice tablet for testing i have a clean ubuntu 9.10 server 64-bit install and created a small script with fetches a dvd iso from a local server and checking md5sum in an endless loop. the download performance is approx. 50MB/s on that vm. to trigger the error i did several migrations of the vm throughout the last days. finally I ended up in the following oops in the guest: [64442.298521] irq 10: nobody cared (try booting with the irqpoll option) [64442.299175] Pid: 0, comm: swapper Not tainted 2.6.31-14-server #48-Ubuntu [64442.299179] Call Trace: [64442.299185]IRQ [810b4b96] __report_bad_irq+0x26/0xa0 [64442.299227] [810b4d9c] note_interrupt+0x18c/0x1d0 [64442.299232] [810b5415] handle_fasteoi_irq+0xd5/0x100 [64442.299244] [81014bdd] handle_irq+0x1d/0x30 [64442.299246] [810140b7] do_IRQ+0x67/0xe0 [64442.299249] [810129d3] ret_from_intr+0x0/0x11 [64442.299266] [810b3234] ? handle_IRQ_event+0x24/0x160 [64442.299269] [810b529f] ? handle_edge_irq+0xcf/0x170 [64442.299271] [81014bdd] ? handle_irq+0x1d/0x30 [64442.299273] [810140b7] ? 
do_IRQ+0x67/0xe0 [64442.299275] [810129d3] ? ret_from_intr+0x0/0x11 [64442.299290] [81526b14] ? _spin_unlock_irqrestore+0x14/0x20 [64442.299302] [8133257c] ? scsi_dispatch_cmd+0x16c/0x2d0 [64442.299307] [8133963a] ? scsi_request_fn+0x3aa/0x500 [64442.299322] [8125fafc] ? __blk_run_queue+0x6c/0x150 [64442.299324] [8125fcbb] ? blk_run_queue+0x2b/0x50 [64442.299327] [8133899f] ? scsi_run_queue+0xcf/0x2a0 [64442.299336] [81339a0d] ? scsi_next_command+0x3d/0x60 [64442.299338] [8133a21b] ? scsi_end_request+0xab/0xb0 [64442.299340] [8133a50e] ? scsi_io_completion+0x9e/0x4d0 [64442.299348] [81036419] ? default_spin_lock_flags+0x9/0x10 [64442.299351] [8133224d] ? scsi_finish_command+0xbd/0x130 [64442.299353] [8133aa95] ? scsi_softirq_done+0x145/0x170 [64442.299356] [81264e6d] ? blk_done_softirq+0x7d/0x90 [64442.299368] [810651fd] ? __do_softirq+0xbd/0x200 [64442.299370] [810131ac] ? call_softirq+0x1c/0x30 [64442.299372] [81014b85] ? do_softirq+0x55/0x90 [64442.299374] [81064f65] ? irq_exit+0x85/0x90 [64442.299376] [810140c0] ? do_IRQ+0x70/0xe0 [64442.299379] [810129d3] ? ret_from_intr+0x0/0x11 [64442.299380]EOI [810356f6] ? native_safe_halt+0x6/0x10 [64442.299390] [8101a20c] ? default_idle+0x4c/0xe0 [64442.299395] [815298f5] ? atomic_notifier_call_chain+0x15/0x20 [64442.299398] [81010e02] ? cpu_idle+0xb2/0x100 [64442.299406] [815123c6] ? rest_init+0x66/0x70 [64442.299424] [81838047] ? start_kernel+0x352/0x35b [64442.299427] [8183759a] ? x86_64_start_reservations+0x125/0x129 [64442.299429] [81837698] ? x86_64_start_kernel+0xfa/0x109 [64442.299433] handlers: [64442.299840] [ab80] (e1000_intr+0x0/0x190 [e1000]) [64442.300046] Disabling IRQ #10 See also LP bug #584131 (https://bugs.launchpad.net/bugs/584131) and original Debian bug#580649 (http://bugs.debian.org/580649) Not sure if they're related... /mjt michael, do you have any ideas what i got do to debug whats happening? looking at launchpad and debian bug tracker i found other bugs also with a maybe related problem. 
so this issue might be greater... thanks peter
Re: [PATCH] VMX: Fix and improve guest state validity checks
On 05/25/2010 01:36 PM, Mohammed Gamal wrote:
>> Any reference to back this up? I think rpl is valid regardless of
>> ss.unusable (i.e. loading selector 0003 results in an unusable segment
>> with rpl=3), but I don't see how dpl can be valid in an unusable segment.
>
> Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3B,
> System Programming Guide, Part 2, Chapter 22, Section 22.3.1.2: Checks
> on Guest Segment Registers.
>
> You'll note that DS, ES, FS, GS checks are done when the segment is
> usable. SS checks are not necessarily checked only when the segment is
> usable.

Strange, but consistent with

  If the unusable bit is 1, the base address, the segment limit, and the
  remainder of the access rights are undefined after VM entry. The only
  exceptions are the following:
  - Bits 3:0 of the base address for SS are cleared to 0.
  - SS.DPL: always loaded from the SS access-rights field. This will be
    the current privilege level (CPL) after the VM entry completes.
  - SS.B: set to 1.
  - The base addresses for FS and GS: always loaded. On processors that
    support Intel 64 architecture, the values loaded for base addresses
    for FS and GS are also manifest in the FS.base and GS.base MSRs.
  - The base address for LDTR on processors that support Intel 64
    architecture: set to an undefined but canonical value.
  - Bits 63:32 of the base addresses for SS, DS, and ES on processors
    that support Intel 64 architecture: cleared to 0.

So you are right.

Seems to me we can simplify vmx_get_cpl() on this basis to look at ss.dpl.

--
error compiling committee.c: too many arguments to function
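To make the ordering discussed in this exchange concrete, here is a small Python model of the check, not the kernel code itself. It assumes the DPL/RPL comparison runs first regardless of the unusable bit, and that the remaining checks (shown here as type, S and present, which are an illustrative stand-in) apply only to usable segments. The Segment class and the get_cpl helper are hypothetical names for this sketch.

```python
from dataclasses import dataclass

SELECTOR_RPL_MASK = 0x3  # low two bits of a selector hold the RPL


@dataclass
class Segment:
    """Hypothetical stand-in for the fields of a cached segment register."""
    selector: int
    dpl: int
    unusable: bool
    type: int = 3        # read/write, accessed data segment
    s: bool = True       # code/data (non-system) descriptor
    present: bool = True


def stack_segment_valid(ss):
    rpl = ss.selector & SELECTOR_RPL_MASK

    # Checked first: SS.DPL is loaded on VM entry even for an unusable SS.
    if ss.dpl != rpl:
        return False

    # Short-circuit: the remaining fields are undefined when unusable.
    if ss.unusable:
        return True

    # Illustrative remaining checks for a usable stack segment.
    return ss.type in (3, 7) and ss.s and ss.present


def get_cpl(ss):
    """The simplification suggested above: CPL is simply SS.DPL."""
    return ss.dpl


if __name__ == "__main__":
    unusable_ring3 = Segment(selector=0x0003, dpl=3, unusable=True)
    print(stack_segment_valid(unusable_ring3), get_cpl(unusable_ring3))
```

The example at the bottom is the case raised in the thread: loading selector 0003 gives an unusable segment with rpl=3, which passes only if the DPL is also 3.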
Re: [Qemu-devel] Re: [RFC PATCH] AMD IOMMU emulation
On Tue, May 25, 2010 at 10:39:22AM +0200, Joerg Roedel wrote:
> On Mon, May 24, 2010 at 08:10:16PM +, Blue Swirl wrote:
>> On Mon, May 24, 2010 at 3:40 PM, Joerg Roedel j...@8bytes.org wrote:
>>>> +
>>>> +#define MMIO_SIZE               0x2028
>>>
>>> This size should be a power-of-two value. In this case probably 0x4000.
>>
>> Not really, the devices can reserve regions of any size. There were
>> some implementation deficiencies in earlier versions of QEMU, where
>> the whole page would be reserved anyway, but this limitation has been
>> removed long time ago.
>
> The drivers for AMD IOMMU expect that to be 0x4000. At least the Linux
> driver maps the MMIO region with this size. So the emulation should
> reserve this amount of MMIO space too.
>
> 	Joerg

Yeah, I'll change that, since I already reserve 0x4000 bytes in SeaBIOS
for it (I did that to deal with the 16 KiB alignment requirement).

	Eduard
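The arithmetic behind the 0x2028 versus 0x4000 sizes can be checked with a short sketch. The helper names and the example base addresses below are made up for illustration; only the two sizes and the 16 KiB alignment come from the thread.

```python
def round_up_pow2(n):
    """Smallest power of two that is >= n."""
    if n <= 0:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def is_naturally_aligned(base, size):
    """True when base is a multiple of size (size must be a power of two)."""
    return base & (size - 1) == 0


if __name__ == "__main__":
    mmio_size = round_up_pow2(0x2028)
    print(hex(mmio_size))                                  # 0x4000, i.e. 16 KiB
    # Example base addresses, chosen only to exercise the check.
    print(is_naturally_aligned(0xfed80000, mmio_size))
    print(is_naturally_aligned(0xfed82000, mmio_size))
```

Rounding the register block up to 0x4000 makes the reserved size match the 16 KiB alignment, so a region that starts on an aligned base never shares its 16 KiB window with another device.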
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 02:02 PM, Kevin Wolf wrote:
>>> So could we not standardize a protocol for this that both sheepdog and
>>> ceph could implement?
>>
>> The protocol already exists, nbd. It doesn't support snapshotting etc.
>> but we could extend it.
>>
>> But IMO what's needed is a plugin API for the block layer.
>
> What would it buy us, apart from more downstreams and having to maintain
> a stable API and ABI?

Currently if someone wants to add a new block format, they have to
upstream it and wait for a new qemu to be released. With a plugin API,
they can add a new block format to an existing, supported qemu.

> Hiding block drivers somewhere else doesn't make
> them stop existing, they just might not be properly integrated, but
> rather hacked in to fit that limited stable API.

They would hack it to fit the current API, and hack the API in qemu.git
to fit their requirements for the next release.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 1/2] VMX: Properly return error to userspace on vmentry failure
On 05/24/2010 01:01 AM, Mohammed Gamal wrote:
> The vmexit handler returns KVM_EXIT_UNKNOWN since there is no handler
> for vmentry failures. This intercepts vmentry failures and returns
> KVM_FAIL_ENTRY to userspace instead.
>
> Signed-off-by: Mohammed Gamal m.gamal...@gmail.com
> ---
>  arch/x86/kvm/vmx.c |    7 +++
>  1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 99ae513..4edcffb 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3665,6 +3665,13 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>  	if (enable_ept && is_paging(vcpu))
>  		vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
>
> +	if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
> +		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
> +		vcpu->run->fail_entry.hardware_entry_failure_reason
> +			= exit_reason & ~VMX_EXIT_REASONS_FAILED_VMENTRY;
> +		return 0;
> +	}
> +
>  	if (unlikely(vmx->fail)) {
>  		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>  		vcpu->run->fail_entry.hardware_entry_failure_reason

How does the user distinguish between KVM_EXIT_FAIL_ENTRY due to an exit
reason with bit 31 set and vmlaunch/vmresume failure (vmx->fail set)?

We need separate exit codes (with documentation in api.txt).

--
error compiling committee.c: too many arguments to function
Re: [PATCH 2/2] VMX: Add constant for invalid guest state exit reason
On 05/24/2010 01:01 AM, Mohammed Gamal wrote:
> For the sake of completeness, this patch adds a symbolic constant for
> VMX exit reason 0x21 (invalid guest state).

Applied, thanks.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 1/2] VMX: Properly return error to userspace on vmentry failure
On Tue, May 25, 2010 at 2:45 PM, Avi Kivity a...@redhat.com wrote:
> On 05/24/2010 01:01 AM, Mohammed Gamal wrote:
>> The vmexit handler returns KVM_EXIT_UNKNOWN since there is no handler
>> for vmentry failures. This intercepts vmentry failures and returns
>> KVM_FAIL_ENTRY to userspace instead.
>>
>> Signed-off-by: Mohammed Gamal m.gamal...@gmail.com
>> ---
>>  arch/x86/kvm/vmx.c |    7 +++
>>  1 files changed, 7 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 99ae513..4edcffb 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -3665,6 +3665,13 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>>  	if (enable_ept && is_paging(vcpu))
>>  		vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
>>
>> +	if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
>> +		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>> +		vcpu->run->fail_entry.hardware_entry_failure_reason
>> +			= exit_reason & ~VMX_EXIT_REASONS_FAILED_VMENTRY;
>> +		return 0;
>> +	}
>> +
>>  	if (unlikely(vmx->fail)) {
>>  		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>>  		vcpu->run->fail_entry.hardware_entry_failure_reason
>
> How does the user distinguish between KVM_EXIT_FAIL_ENTRY due to an exit
> reason with bit 31 set and vmlaunch/vmresume failure (vmx->fail set)?
>
> We need separate exit codes (with documentation in api.txt).

In both cases the vm fails entry, and I don't think the hardware entry
failure reason codes would overlap between the vmx->fail case and exit
reasons with bit 31 set, so why should there be such distinction
between both cases?

> --
> error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On Tue, May 25, 2010 at 02:25:53PM +0300, Avi Kivity wrote:
> Currently if someone wants to add a new block format, they have to
> upstream it and wait for a new qemu to be released. With a plugin API,
> they can add a new block format to an existing, supported qemu.

So? Unless we want a stable driver ABI which I fundamentally oppose as
it would make block driver development hell they'd have to wait for a
new release of the block layer. It's really just going to be a lot of
pain for no major gain. qemu releases are frequent enough, and if users
care enough they can also easily patch qemu.
Re: [PATCH 7/7] trace: Trace virtqueue operations
On 05/25/2010 01:24 PM, Stefan Hajnoczi wrote:
> This patch adds trace events for virtqueue operations including
> adding/removing buffers, notifying the guest, and receiving a notify
> from the guest.
>
> diff --git a/trace-events b/trace-events
> index 48415f8..a533414 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -35,6 +35,14 @@ qemu_memalign(size_t alignment, size_t size, void *ptr) "alignment %zu size %zu
>  qemu_valloc(size_t size, void *ptr) "size %zu ptr %p"
>  qemu_vfree(void *ptr) "ptr %p"
>
> +# hw/virtio.c
> +virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
> +virtqueue_flush(void *vq, unsigned int count) "vq %p count %u"
> +virtqueue_pop(void *vq, void *elem, unsigned int in_num, unsigned int out_num) "vq %p elem %p in_num %u out_num %u"
> +virtio_queue_notify(void *vdev, int n, void *vq) "vdev %p n %d vq %p"
> +virtio_irq(void *vq) "vq %p"
> +virtio_notify(void *vdev, void *vq) "vdev %p vq %p"
> +

Those %ps are more or less useless. We need better ways of identifying
them.

Linux uses %pTYPE to pretty print arbitrary types. We could do something
similar (not the same since we don't want our own printf implementation).

--
error compiling committee.c: too many arguments to function
Re: [PATCH 1/2] VMX: Properly return error to userspace on vmentry failure
On 05/25/2010 03:01 PM, Mohammed Gamal wrote:
>> How does the user distinguish between KVM_EXIT_FAIL_ENTRY due to an exit
>> reason with bit 31 set and vmlaunch/vmresume failure (vmx->fail set)?
>>
>> We need separate exit codes (with documentation in api.txt).
>
> In both cases the vm fails entry, and I don't think the hardware entry
> failure reason codes would overlap between the vmx->fail case and exit
> reasons with bit 31 set, so why should there be such distinction
> between both cases?

Only 5 more error codes (28-33) and we have overlap.

If you return the new codes with bit 31 still set then we can use the
existing KVM_EXIT_FAIL_ENTRY.

--
error compiling committee.c: too many arguments to function
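The overlap concern can be illustrated with a few lines of arithmetic. This is a sketch of the encoding question only, not KVM's userspace API: the function names are made up, and the value 0x80000021 is simply exit reason 0x21 (invalid guest state, per the other patch in this series) with bit 31 set.

```python
FAILED_VMENTRY = 1 << 31  # models VMX_EXIT_REASONS_FAILED_VMENTRY (bit 31)


def report_stripped(exit_reason):
    """What the patch as posted reports: the exit reason with bit 31 cleared."""
    return exit_reason & ~FAILED_VMENTRY


def report_flagged(exit_reason):
    """The suggested alternative: pass the exit reason through with bit 31 kept."""
    return exit_reason


def is_failed_vmentry(code):
    """Userspace-side test that only works if bit 31 was preserved."""
    return bool(code & FAILED_VMENTRY)


if __name__ == "__main__":
    invalid_guest_state = FAILED_VMENTRY | 0x21

    # Stripped, the code is a small integer (33) that shares a number space
    # with the small error numbers reported on the vmx->fail path.
    print(report_stripped(invalid_guest_state))

    # Flagged, the two sources cannot be confused with each other.
    print(hex(report_flagged(invalid_guest_state)),
          is_failed_vmentry(report_flagged(invalid_guest_state)),
          is_failed_vmentry(33))
```

Keeping bit 31 means one exit code, KVM_EXIT_FAIL_ENTRY, can carry both kinds of failure while userspace still tells them apart by testing a single bit.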
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 03:03 PM, Christoph Hellwig wrote:
> On Tue, May 25, 2010 at 02:25:53PM +0300, Avi Kivity wrote:
>> Currently if someone wants to add a new block format, they have to
>> upstream it and wait for a new qemu to be released. With a plugin API,
>> they can add a new block format to an existing, supported qemu.
>
> So? Unless we want a stable driver ABI which I fundamentally oppose as
> it would make block driver development hell

We'd only freeze it for a major release.

> they'd have to wait for a
> new release of the block layer. It's really just going to be a lot of
> pain for no major gain. qemu releases are frequent enough, and if users
> care enough they can also easily patch qemu.

May not be so easy for them, they lose binary updates from their distro
and have to keep repatching.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 1/2] VMX: Properly return error to userspace on vmentry failure
On Tue, May 25, 2010 at 3:10 PM, Avi Kivity a...@redhat.com wrote: On 05/25/2010 03:01 PM, Mohammed Gamal wrote: How does the user distinguish between KVM_EXIT_FAIL_ENTRY due to an exit reason with bit 31 set and vmlaunch/vmresume failure (vmx-fail set)? We need separate exit codes (with documentation in api.txt). In both cases the VM fails entry, and I don't think the hardware entry failure reason codes would overlap between the vmx-fail case and exit reasons with bit 31 set, so why should there be such a distinction between the two cases? Only 5 more error codes (28-33) and we have overlap. If you return the new codes with bit 31 still set then we can use the existing KVM_EXIT_FAIL_ENTRY. That'd be a better idea. -- error compiling committee.c: too many arguments to function
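A minimal sketch of the convention agreed on above, as it might look to a consumer of the failure code: codes that come from a VM exit keep bit 31 set, codes from a failed vmlaunch/vmresume do not, so one field can carry both. All names here (`entry_failure_kind`, `encode_exit_reason_failure`, `classify_entry_failure`) are hypothetical and are not part of the KVM API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of a KVM_EXIT_FAIL_ENTRY reason code. */
enum entry_failure_kind {
    ENTRY_FAIL_VMX_INSTRUCTION, /* vmlaunch/vmresume itself failed (vmx-fail) */
    ENTRY_FAIL_EXIT_REASON      /* VM exit whose reason has bit 31 set */
};

#define ENTRY_FAILURE_BIT (UINT64_C(1) << 31)

/* Producer side: report an exit-reason failure with bit 31 kept set. */
static uint64_t encode_exit_reason_failure(uint32_t basic_reason)
{
    assert((basic_reason & ENTRY_FAILURE_BIT) == 0);
    return (uint64_t)basic_reason | ENTRY_FAILURE_BIT;
}

/* Consumer side: bit 31 alone tells the two cases apart. */
static enum entry_failure_kind classify_entry_failure(uint64_t reason)
{
    return (reason & ENTRY_FAILURE_BIT) ? ENTRY_FAIL_EXIT_REASON
                                        : ENTRY_FAIL_VMX_INSTRUCTION;
}

/* Strip bit 31 to recover the plain code, e.g. for logging. */
static uint64_t entry_failure_code(uint64_t reason)
{
    return reason & ~ENTRY_FAILURE_BIT;
}
```

With this split the two numbering spaces can grow independently without ever colliding, which is the overlap concern raised in the thread.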
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 04:14 AM, Avi Kivity wrote: On 05/24/2010 10:38 PM, Anthony Liguori wrote: - Building a plugin API seems a bit simpler to me, although I'm not sure I understood the idea correctly: the block layer already has some kind of API (.bdrv_file_open, .bdrv_read). We could simply compile the block drivers as shared objects and create a method for loading the necessary modules at runtime. That approach would be a recipe for disaster. We would have to introduce a new, reduced-functionality block API that was supported for plugins. Otherwise, the only way a plugin could keep up with our API changes would be if it was in tree, which defeats the purpose of having plugins. We could guarantee API/ABI stability in a stable branch but not across releases. We have releases every six months. There would be tons of block plugins that didn't work for random sets of releases. That creates a lot of user confusion and unhappiness. Regards, Anthony Liguori
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 06:25 AM, Avi Kivity wrote: On 05/25/2010 02:02 PM, Kevin Wolf wrote: So could we not standardize a protocol for this that both sheepdog and ceph could implement? The protocol already exists, nbd. It doesn't support snapshotting etc. but we could extend it. But IMO what's needed is a plugin API for the block layer. What would it buy us, apart from more downstreams and having to maintain a stable API and ABI? Currently if someone wants to add a new block format, they have to upstream it and wait for a new qemu to be released. With a plugin API, they can add a new block format to an existing, supported qemu. Whether we have a plugin or protocol based mechanism to implement block formats really ends up being just an implementation detail. In order to implement either, we need to take a subset of block functionality that we feel we can support long term and expose that. Right now, that's basically just querying characteristics (like size and geometry) and asynchronous reads and writes. A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach. Plugins that just expose chunks of QEMU internal state directly (like BlockDriver) are a really bad idea IMHO. Regards, Anthony Liguori
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 04:17 PM, Anthony Liguori wrote: On 05/25/2010 04:14 AM, Avi Kivity wrote: On 05/24/2010 10:38 PM, Anthony Liguori wrote: - Building a plugin API seems a bit simpler to me, although I'm not sure I understood the idea correctly: the block layer already has some kind of API (.bdrv_file_open, .bdrv_read). We could simply compile the block drivers as shared objects and create a method for loading the necessary modules at runtime. That approach would be a recipe for disaster. We would have to introduce a new, reduced-functionality block API that was supported for plugins. Otherwise, the only way a plugin could keep up with our API changes would be if it was in tree, which defeats the purpose of having plugins. We could guarantee API/ABI stability in a stable branch but not across releases. We have releases every six months. There would be tons of block plugins that didn't work for random sets of releases. That creates a lot of user confusion and unhappiness. The current situation is that those block format drivers only exist in qemu.git or as patches. Surely that's even more unhappiness. Confusion could be mitigated: $ qemu -module my-fancy-block-format-driver.so my-fancy-block-format-driver.so does not support this version of qemu (0.19.2). Please contact my-fancy-block-format-driver-de...@example.org. The question is how many such block format drivers we expect. We now have two in the pipeline (ceph, sheepdog), and it's reasonable to assume we'll want an lvm2 driver and a btrfs driver. This is an area with a lot of activity and a relatively simple interface. -- error compiling committee.c: too many arguments to function
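The version check suggested above could be sketched roughly as follows. This is only an illustration under assumed names: `struct block_plugin_info` and `block_plugin_check` are hypothetical, and qemu has no such plugin descriptor. The plugin declares the range of host versions it supports, and the host refuses to load it with a readable diagnostic otherwise.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical descriptor that a block-format plugin would export. */
struct block_plugin_info {
    const char *name;
    const char *contact;
    unsigned min_major, min_minor; /* oldest supported host version */
    unsigned max_major, max_minor; /* newest supported host version */
};

/* Three-way comparison of two major.minor versions. */
static int version_cmp(unsigned a_major, unsigned a_minor,
                       unsigned b_major, unsigned b_minor)
{
    if (a_major != b_major)
        return a_major < b_major ? -1 : 1;
    if (a_minor != b_minor)
        return a_minor < b_minor ? -1 : 1;
    return 0;
}

/*
 * Returns 0 and an empty buf if the plugin supports host version
 * major.minor; otherwise returns -1 and writes a diagnostic into buf.
 */
static int block_plugin_check(const struct block_plugin_info *p,
                              unsigned major, unsigned minor,
                              char *buf, size_t len)
{
    assert(p && buf && len > 0);

    if (version_cmp(major, minor, p->min_major, p->min_minor) >= 0 &&
        version_cmp(major, minor, p->max_major, p->max_minor) <= 0) {
        buf[0] = '\0';
        return 0;
    }
    snprintf(buf, len,
             "%s does not support this version of qemu (%u.%u). "
             "Please contact %s.",
             p->name, major, minor, p->contact);
    return -1;
}
```

A check like this only reports incompatibility cleanly; it does nothing to reduce how often plugins break, which is the objection raised in the quoted text.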
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
At Mon, 24 May 2010 14:16:32 -0500, Anthony Liguori wrote: On 05/24/2010 06:56 AM, Avi Kivity wrote: On 05/24/2010 02:42 PM, MORITA Kazutaka wrote: The server would be local and talk over a unix domain socket, perhaps anonymous. nbd has other issues though, such as requiring a copy and no support for metadata operations such as snapshot and file size extension. Sorry, my explanation was unclear. I'm not sure how running servers on localhost can solve the problem. The local server can convert from the local (nbd) protocol to the remote (sheepdog, ceph) protocol. What I wanted to say was that we cannot specify the image of VM. With nbd protocol, command line arguments are as follows: $ qemu nbd:hostname:port As this syntax shows, with nbd protocol the client cannot pass the VM image name to the server. We would extend it to allow it to connect to a unix domain socket: qemu nbd:unix:/path/to/socket nbd is a no-go because it only supports a single, synchronous I/O operation at a time and has no mechanism for extensibility. If we go this route, I think two options are worth considering. The first would be a purely socket based approach where we just accepted the extra copy. The other potential approach would be shared memory based. We export all guest ram as shared memory along with a small bounce buffer pool. We would then use a ring queue (potentially even using virtio-blk) and an eventfd for notification. The shared memory approach assumes that there is a local server who can talk with the storage system. But Ceph doesn't require the local server, and Sheepdog would be extended to support VMs running outside the storage system. We could run a local daemon who can only work as proxy, but I don't think it looks a clean approach. So I think a socket based approach is the right way to go. BTW, is it required to design a common interface? The way Sheepdog replicates data is different from Ceph, so I think it is not possible to define a common protocol as Christian says. 
Regards, Kazutaka The server at the other end would associate the socket with a filename and forward it to the server using the remote protocol. However, I don't think nbd would be a good protocol. My preference would be for a plugin API, or for a new local protocol that uses splice() to avoid copies. I think a good shared memory implementation would be preferable to plugins. I think it's worth attempting to do a plugin interface for the block layer but I strongly suspect it would not be sufficient. I would not want to see plugins that interacted with BlockDriverState directly, for instance. We change it far too often. Our main loop functions are also not terribly stable so I'm not sure how we would handle that (unless we forced all block plugins to be in a separate thread).
Re: [PATCH 7/7] trace: Trace virtqueue operations
On Tue, May 25, 2010 at 1:04 PM, Avi Kivity a...@redhat.com wrote: Those %ps are more or less useless. We need better ways of identifying them. You're right, the vq pointer is useless in isolation. We don't know which virtio device or which virtqueue number. With the full context of a trace it would be possible to correlate the vq pointer if we had trace events for vdev and vq setup. Adding custom formatters could be tricky since the format string is passed only to tracing backends that use it, like UST. And UST uses its own sprintf implementation which we don't have direct control over. I think we just need to guarantee that any pointer can be correlated with previous trace entries that give context for that pointer. Stefan
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 08:25 AM, Avi Kivity wrote: On 05/25/2010 04:17 PM, Anthony Liguori wrote: On 05/25/2010 04:14 AM, Avi Kivity wrote: On 05/24/2010 10:38 PM, Anthony Liguori wrote: - Building a plugin API seems a bit simpler to me, although I'm not sure I understood the idea correctly: the block layer already has some kind of API (.bdrv_file_open, .bdrv_read). We could simply compile the block drivers as shared objects and create a method for loading the necessary modules at runtime. That approach would be a recipe for disaster. We would have to introduce a new, reduced-functionality block API that was supported for plugins. Otherwise, the only way a plugin could keep up with our API changes would be if it was in tree, which defeats the purpose of having plugins. We could guarantee API/ABI stability in a stable branch but not across releases. We have releases every six months. There would be tons of block plugins that didn't work for random sets of releases. That creates a lot of user confusion and unhappiness. The current situation is that those block format drivers only exist in qemu.git or as patches. Surely that's even more unhappiness. Confusion could be mitigated: $ qemu -module my-fancy-block-format-driver.so my-fancy-block-format-driver.so does not support this version of qemu (0.19.2). Please contact my-fancy-block-format-driver-de...@example.org. The question is how many such block format drivers we expect. We now have two in the pipeline (ceph, sheepdog), and it's reasonable to assume we'll want an lvm2 driver and a btrfs driver. This is an area with a lot of activity and a relatively simple interface. If we expose a simple interface, I'm all for it. But BlockDriver is not simple and things like the snapshotting API need love. Of course, there's certainly a question of why we're solving this in qemu at all. 
Wouldn't it be more appropriate to either (1) implement a kernel module for ceph/sheepdog if performance matters or (2) implement BUSE to complement FUSE and CUSE to enable proper userspace block devices? If you want to use a block device within qemu, you almost certainly want to be able to manipulate it on the host using standard tools (like mount and parted), so it stands to reason that addressing this in the kernel makes more sense. Regards, Anthony Liguori
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 04:25 PM, Anthony Liguori wrote: Currently if someone wants to add a new block format, they have to upstream it and wait for a new qemu to be released. With a plugin API, they can add a new block format to an existing, supported qemu. Whether we have a plugin or protocol based mechanism to implement block formats really ends up being just an implementation detail. True. In order to implement either, we need to take a subset of block functionality that we feel we can support long term and expose that. Right now, that's basically just querying characteristics (like size and geometry) and asynchronous reads and writes. Unfortunately, you're right. A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach. May be hard due to difficulty of exposing guest memory. Plugins that just expose chunks of QEMU internal state directly (like BlockDriver) are a really bad idea IMHO. Also, we don't want to expose all of the qemu API. We should default the visibility attribute to hidden and expose only select functions, perhaps under their own interface. And no inlines. -- error compiling committee.c: too many arguments to function
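A rough illustration of the "hidden by default, export only select functions" idea, assuming a GCC-compatible compiler and a build that passes -fvisibility=hidden. The `PLUGIN_API` macro and both exported functions are hypothetical names chosen for this sketch, not anything qemu defines.

```c
#include <assert.h>

/*
 * With -fvisibility=hidden, only symbols marked PLUGIN_API are visible to
 * a dynamically loaded module. The macro name is an assumption.
 */
#if defined(__GNUC__)
#define PLUGIN_API __attribute__((visibility("default")))
#else
#define PLUGIN_API
#endif

#define PLUGIN_API_VERSION 1 /* hypothetical interface version */

/* Internal detail: static, so it is never part of the plugin interface. */
static long internal_sector_size(void)
{
    return 512;
}

/* Exported out of line so plugins never compile in host internals. */
PLUGIN_API int plugin_api_version(void)
{
    return PLUGIN_API_VERSION;
}

/* Exported helper: round a byte count up to whole sectors. */
PLUGIN_API long plugin_bytes_to_sectors(long bytes)
{
    assert(bytes >= 0);
    return (bytes + internal_sector_size() - 1) / internal_sector_size();
}
```

Keeping the exported functions out of line is one reading of the "no inlines" remark: an inline function in a public header would bake the current sector size into every plugin binary.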
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 04:35 PM, Anthony Liguori wrote: On 05/25/2010 08:31 AM, Avi Kivity wrote: A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach. May be hard due to difficulty of exposing guest memory. If someone did a series to add plugins, I would expect a very strong argument as to why a shared memory mechanism was not possible or at least plausible. I'm not sure I understand why shared memory is such a bad thing wrt KVM. Can you elaborate? Is it simply a matter of fork()? fork() doesn't work in the presence of memory hotplug. What else is there? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 08:31 AM, Avi Kivity wrote: A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach. May be hard due to difficulty of exposing guest memory. If someone did a series to add plugins, I would expect a very strong argument as to why a shared memory mechanism was not possible or at least plausible. I'm not sure I understand why shared memory is such a bad thing wrt KVM. Can you elaborate? Is it simply a matter of fork()? Plugins that just expose chunks of QEMU internal state directly (like BlockDriver) are a really bad idea IMHO. Also, we don't want to expose all of the qemu API. We should default the visibility attribute to hidden and expose only select functions, perhaps under their own interface. And no inlines. Yeah, if we did plugins, this would be a key requirement. Regards, Anthony Liguori
Re: [PATCH] add support for protocol driver create_options
On 24.05.2010 08:34, MORITA Kazutaka wrote: At Fri, 21 May 2010 18:57:36 +0200, Kevin Wolf wrote: On 20.05.2010 07:36, MORITA Kazutaka wrote:

+
+/*
+ * Append an option list (list) to an option list (dest).
+ *
+ * If dest is NULL, a new copy of list is created.
+ *
+ * Returns a pointer to the first element of dest (or the newly allocated copy)
+ */
+QEMUOptionParameter *append_option_parameters(QEMUOptionParameter *dest,
+    QEMUOptionParameter *list)
+{
+    size_t num_options, num_dest_options;
+
+    num_options = count_option_parameters(dest);
+    num_dest_options = num_options;
+
+    num_options += count_option_parameters(list);
+
+    dest = qemu_realloc(dest, (num_options + 1) * sizeof(QEMUOptionParameter));
+
+    while (list && list->name) {
+        if (get_option_parameter(dest, list->name) == NULL) {
+            dest[num_dest_options++] = *list;

You need to add a dest[num_dest_options].name = NULL; here. Otherwise the next loop iteration works on uninitialized memory and possibly an unterminated list. I got a segfault for that reason. I forgot to add it, sorry. Fixed version is below. Thanks, Kazutaka == This patch enables protocol drivers to use their create options which are not supported by the format. For example, protocol drivers can use a backing_file option with raw format. Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp

$ ./qemu-img create -f qcow2 -o cluster_size=4k /tmp/test.qcow2 4G
Unknown option 'cluster_size'
qemu-img: Invalid options for file format 'qcow2'.

I think you added another num_dest_options++ which shouldn't be there. Kevin
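The termination issue Kevin describes can be shown in a standalone sketch. It uses a simplified stand-in (`struct opt_param`) rather than the real QEMUOptionParameter, plain realloc() instead of qemu_realloc(), and hypothetical helper names, so it illustrates the pattern and is not the patch itself. The point is that the list must be NULL-terminated before every lookup, with the index advanced exactly once per appended entry.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for QEMUOptionParameter; a list ends at name == NULL. */
struct opt_param {
    const char *name;
    long value;
};

static size_t count_opts(const struct opt_param *list)
{
    size_t n = 0;

    while (list && list[n].name)
        n++;
    return n;
}

static struct opt_param *find_opt(struct opt_param *list, const char *name)
{
    for (; list && list->name; list++)
        if (strcmp(list->name, name) == 0)
            return list;
    return NULL;
}

/*
 * Append the entries of list that are not yet in dest. Returns the
 * (possibly moved) dest, or NULL if the allocation failed.
 */
static struct opt_param *append_opts(struct opt_param *dest,
                                     const struct opt_param *list)
{
    size_t n_dest = count_opts(dest);
    size_t n_total = n_dest + count_opts(list);
    struct opt_param *grown = realloc(dest, (n_total + 1) * sizeof(*grown));

    if (!grown)
        return NULL;
    dest = grown;
    dest[n_dest].name = NULL; /* terminated even when dest started as NULL */

    for (; list && list->name; list++) {
        if (find_opt(dest, list->name) == NULL) {
            dest[n_dest++] = *list;   /* advance exactly once per entry */
            dest[n_dest].name = NULL; /* re-terminate before the next lookup */
        }
    }
    assert(n_dest <= n_total);
    return dest;
}
```

Entries already present in dest are kept as they are, which matches the lookup-before-append logic in the quoted patch.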
Re: [PATCH 7/7] trace: Trace virtqueue operations
On 05/25/2010 04:27 PM, Stefan Hajnoczi wrote: On Tue, May 25, 2010 at 1:04 PM, Avi Kivity a...@redhat.com wrote: Those %ps are more or less useless. We need better ways of identifying them. You're right, the vq pointer is useless in isolation. We don't know which virtio device or which virtqueue number. With the full context of a trace it would be possible to correlate the vq pointer if we had trace events for vdev and vq setup. Adding custom formatters could be tricky since the format string is passed only to tracing backends that use it, like UST. And UST uses its own sprintf implementation which we don't have direct control over. Hm. Perhaps we can convert %{type} to %p for backends which don't support it, and to whatever format they do support for those that do. -- error compiling committee.c: too many arguments to function
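One way the fallback half of that idea might look: a small rewrite pass that turns every typed directive into a plain %p before the format string reaches a backend with no custom formatters. The %{type} syntax is taken from the message above; the function name and the error convention are assumptions for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy fmt into out, replacing each "%{type}" directive with "%p".
 * "%%" is copied through unchanged. Returns 0 on success, or -1 if out is
 * too small or a directive is missing its closing brace.
 */
static int fmt_downgrade_to_pointer(const char *fmt, char *out, size_t len)
{
    size_t o = 0;

    assert(fmt && out);
    while (*fmt) {
        if (fmt[0] == '%' && fmt[1] == '%') {
            if (o + 2 >= len)
                return -1;
            out[o++] = *fmt++;
            out[o++] = *fmt++;
        } else if (fmt[0] == '%' && fmt[1] == '{') {
            const char *end = fmt + 2;

            while (*end && *end != '}')
                end++;
            if (*end != '}' || o + 2 >= len)
                return -1;
            out[o++] = '%';
            out[o++] = 'p';
            fmt = end + 1; /* skip the whole directive */
        } else {
            if (o + 1 >= len)
                return -1;
            out[o++] = *fmt++;
        }
    }
    if (o >= len)
        return -1;
    out[o] = '\0';
    return 0;
}
```

For example, "vq %{virtqueue} len %u" becomes "vq %p len %u". A backend that does understand typed directives would get its own translation table instead of this blanket downgrade.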
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 25.05.2010 15:25, Anthony Liguori wrote: On 05/25/2010 06:25 AM, Avi Kivity wrote: On 05/25/2010 02:02 PM, Kevin Wolf wrote: So could we not standardize a protocol for this that both sheepdog and ceph could implement? The protocol already exists, nbd. It doesn't support snapshotting etc. but we could extend it. But IMO what's needed is a plugin API for the block layer. What would it buy us, apart from more downstreams and having to maintain a stable API and ABI? Currently if someone wants to add a new block format, they have to upstream it and wait for a new qemu to be released. With a plugin API, they can add a new block format to an existing, supported qemu. Whether we have a plugin or protocol based mechanism to implement block formats really ends up being just an implementation detail. In order to implement either, we need to take a subset of block functionality that we feel we can support long term and expose that. Right now, that's basically just querying characteristics (like size and geometry) and asynchronous reads and writes. A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach. Plugins that just expose chunks of QEMU internal state directly (like BlockDriver) are a really bad idea IMHO. I'm still not convinced that we need either. I share Christoph's concern that we would make our life harder for almost no gain. It's probably a very small group of users (if it exists at all) that wants to add new block drivers themselves, but at the same time can't run upstream qemu. But if we were to decide that there's no way around it, I agree with you that directly exposing the internal API isn't going to work. Kevin
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 08:36 AM, Avi Kivity wrote: We'd need a kernel-level generic snapshot API for this eventually. or (2) implement BUSE to complement FUSE and CUSE to enable proper userspace block devices. Likely slow due to lots of copying. Also needs a snapshot API. The kernel could use splice. (ABUSE was proposed a while ago by Zach). If you want to use a block device within qemu, you almost certainly want to be able to manipulate it on the host using standard tools (like mount and parted) so it stands to reason that addressing this in the kernel makes more sense. qemu-nbd also allows this. This reasoning also applies to qcow2, btw. I know. Regards, Anthony Liguori
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 08:38 AM, Avi Kivity wrote: On 05/25/2010 04:35 PM, Anthony Liguori wrote: On 05/25/2010 08:31 AM, Avi Kivity wrote: A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach. May be hard due to difficulty of exposing guest memory. If someone did a series to add plugins, I would expect a very strong argument as to why a shared memory mechanism was not possible or at least plausible. I'm not sure I understand why shared memory is such a bad thing wrt KVM. Can you elaborate? Is it simply a matter of fork()? fork() doesn't work in the presence of memory hotplug. What else is there? Is it that fork() doesn't work or is it that fork() is very expensive? Regards, Anthony Liguori
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 04:54 PM, Anthony Liguori wrote: On 05/25/2010 08:36 AM, Avi Kivity wrote: We'd need a kernel-level generic snapshot API for this eventually. or (2) implement BUSE to complement FUSE and CUSE to enable proper userspace block devices. Likely slow due to lots of copying. Also needs a snapshot API. The kernel could use splice. Still can't make guest memory appear in (A)BUSE process memory without either mmu tricks (vmsplice in reverse) or a copy. May be workable for an (A)BUSE driver that talks over a network, and thus can splice() its way out. -- error compiling committee.c: too many arguments to function
Re: [PATCH 7/7] trace: Trace virtqueue operations
On Tue, May 25, 2010 at 2:52 PM, Avi Kivity a...@redhat.com wrote: Hm. Perhaps we can convert %{type} to %p for backends which don't support it, and to whatever format they do support for those that do. True. Stefan
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 04:55 PM, Anthony Liguori wrote: On 05/25/2010 08:38 AM, Avi Kivity wrote: On 05/25/2010 04:35 PM, Anthony Liguori wrote: On 05/25/2010 08:31 AM, Avi Kivity wrote: A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach. May be hard due to difficulty of exposing guest memory. If someone did a series to add plugins, I would expect a very strong argument as to why a shared memory mechanism was not possible or at least plausible. I'm not sure I understand why shared memory is such a bad thing wrt KVM. Can you elaborate? Is it simply a matter of fork()? fork() doesn't work in the presence of memory hotplug. What else is there? Is it that fork() doesn't work or is it that fork() is very expensive? It doesn't work: fork() is done at block device creation time, which freezes the child memory map, while guest memory is allocated at hotplug time. fork() actually isn't very expensive since we use MADV_DONTFORK (probably fast enough for everything except realtime). It may be possible to do a processfd() which can be mmap()ed by another process to export anonymous memory using mmu notifiers, not sure how easy or mergeable that is. -- error compiling committee.c: too many arguments to function
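For readers unfamiliar with MADV_DONTFORK, here is a minimal Linux-only sketch of marking a region so that a later fork() does not map it into the child. It is not qemu's allocation code; `alloc_dontfork` is a hypothetical helper, and only mmap(), madvise() and munmap() are real calls.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate anonymous memory that will not be inherited across fork().
 * Returns NULL on failure. Linux-specific (MADV_DONTFORK).
 */
static void *alloc_dontfork(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The child of a later fork() gets no mapping for this range. */
    if (madvise(p, len, MADV_DONTFORK) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

This is why fork() stays cheap for a large guest: there are no guest pages to set up copy-on-write for. It is also why a forked helper cannot see guest RAM at all, independent of the hotplug timing problem described above.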
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 25.05.2010 15:25, Avi Kivity wrote: On 05/25/2010 04:17 PM, Anthony Liguori wrote: On 05/25/2010 04:14 AM, Avi Kivity wrote: On 05/24/2010 10:38 PM, Anthony Liguori wrote: - Building a plugin API seems a bit simpler to me, although I'm not sure I understood the idea correctly: the block layer already has some kind of API (.bdrv_file_open, .bdrv_read). We could simply compile the block drivers as shared objects and create a method for loading the necessary modules at runtime. That approach would be a recipe for disaster. We would have to introduce a new, reduced-functionality block API that was supported for plugins. Otherwise, the only way a plugin could keep up with our API changes would be if it was in tree, which defeats the purpose of having plugins. We could guarantee API/ABI stability in a stable branch but not across releases. We have releases every six months. There would be tons of block plugins that didn't work for random sets of releases. That creates a lot of user confusion and unhappiness. The current situation is that those block format drivers only exist in qemu.git or as patches. Surely that's even more unhappiness. The difference is that in the current situation these drivers will be part of the next qemu release, so the patch may be obsolete, but you don't even need it any more. If you start keeping block drivers outside qemu and don't even try to integrate them, they'll stay external. Confusion could be mitigated: $ qemu -module my-fancy-block-format-driver.so my-fancy-block-format-driver.so does not support this version of qemu (0.19.2). Please contact my-fancy-block-format-driver-de...@example.org. The question is how many such block format drivers we expect. We now have two in the pipeline (ceph, sheepdog), and it's reasonable to assume we'll want an lvm2 driver and a btrfs driver. This is an area with a lot of activity and a relatively simple interface. What's the reason for not having these drivers upstream? 
Do we gain anything by hiding them from our users and requiring them to install the drivers separately from somewhere else? Kevin
[PATCH] KVM: x86: Propagate fpu_alloc errors
Memory allocation may fail. Propagate such errors.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/svm.c              |    7 ++-
 arch/x86/kvm/vmx.c              |    4 +++-
 arch/x86/kvm/x86.c              |   11 +--
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d08bb4a..0cd0f29 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -624,7 +624,7 @@ int kvm_pic_set_irq(void *opaque, int irq, int level);
 
 void kvm_inject_nmi(struct kvm_vcpu *vcpu);
 
-void fx_init(struct kvm_vcpu *vcpu);
+int fx_init(struct kvm_vcpu *vcpu);
 
 void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4af2c12..5f25e59 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -903,13 +903,18 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 	svm->asid_generation = 0;
 	init_vmcb(svm);
 
-	fx_init(&svm->vcpu);
+	err = fx_init(&svm->vcpu);
+	if (err)
+		goto free_page4;
+
 	svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
 	if (kvm_vcpu_is_bsp(&svm->vcpu))
 		svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
 
 	return &svm->vcpu;
 
+free_page4:
+	__free_page(hsave_page);
 free_page3:
 	__free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER);
 free_page2:
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 99ae513..61bdae3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2661,7 +2661,9 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 		msr |= MSR_IA32_APICBASE_BSP;
 	kvm_set_apic_base(&vmx->vcpu, msr);
 
-	fx_init(&vmx->vcpu);
+	ret = fx_init(&vmx->vcpu);
+	if (ret != 0)
+		goto out;
 
 	seg_setup(VCPU_SREG_CS);
 	/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7be1d36..e773d93 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5113,12 +5113,19 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 	return 0;
 }
 
-void fx_init(struct kvm_vcpu *vcpu)
+int fx_init(struct kvm_vcpu *vcpu)
 {
-	fpu_alloc(&vcpu->arch.guest_fpu);
+	int err;
+
+	err = fpu_alloc(&vcpu->arch.guest_fpu);
+	if (err)
+		return err;
+
 	fpu_finit(&vcpu->arch.guest_fpu);
 
 	vcpu->arch.cr0 |= X86_CR0_ET;
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(fx_init);
 
-- 
1.6.0.2
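The error-propagation pattern in this patch (return the allocator's error, then unwind with goto labels in reverse order of acquisition) can be tried outside the kernel with a userspace stand-in. Everything prefixed `fake_` is hypothetical, and malloc() replaces the kernel allocators; only the shape of the control flow mirrors the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace stand-in for a vcpu with two separately allocated parts. */
struct fake_vcpu {
    void *hsave;
    void *guest_fpu;
};

/* Stand-in for fpu_alloc(): may fail and reports -ENOMEM when it does. */
static int fake_fpu_alloc(void **fpu, int simulate_failure)
{
    *fpu = simulate_failure ? NULL : malloc(512);
    return *fpu ? 0 : -ENOMEM;
}

/* Like the patched fx_init(): propagate the allocation error. */
static int fake_fx_init(struct fake_vcpu *v, int simulate_failure)
{
    int err = fake_fpu_alloc(&v->guest_fpu, simulate_failure);

    if (err)
        return err;
    /* FPU state initialisation would go here. */
    return 0;
}

/* Like svm_create_vcpu(): unwind in reverse order on failure. */
static struct fake_vcpu *fake_create_vcpu(int simulate_failure, int *errp)
{
    struct fake_vcpu *v;
    int err = -ENOMEM;

    assert(errp);
    v = calloc(1, sizeof(*v));
    if (!v)
        goto out;
    v->hsave = malloc(4096);
    if (!v->hsave)
        goto free_vcpu;
    err = fake_fx_init(v, simulate_failure);
    if (err)
        goto free_hsave;

    *errp = 0;
    return v;

free_hsave:
    free(v->hsave);
free_vcpu:
    free(v);
out:
    *errp = err;
    return NULL;
}
```

Each new resource adds one label above the existing ones, which is what the added free_page4 label does in the svm.c hunk.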
[PATCH] KVM: svm: Drop unused local variable
Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/svm.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5f25e59..3c03c36 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1491,8 +1491,6 @@ static void svm_handle_mce(struct vcpu_svm *svm)
 		 * Erratum 383 triggered. Guest state is corrupt so kill the
 		 * guest.
 		 */
-		struct kvm_run *kvm_run = svm->vcpu.run;
-
 		pr_err("KVM: Guest triggered AMD Erratum 383\n");
 
 		set_bit(KVM_REQ_TRIPLE_FAULT, &svm->vcpu.requests);
-- 
1.6.0.2
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 08:57 AM, Avi Kivity wrote: On 05/25/2010 04:54 PM, Anthony Liguori wrote: On 05/25/2010 08:36 AM, Avi Kivity wrote: We'd need a kernel-level generic snapshot API for this eventually. or (2) implement BUSE to complement FUSE and CUSE to enable proper userspace block devices. Likely slow due to lots of copying. Also needs a snapshot API. The kernel could use splice. Still can't make guest memory appear in (A)BUSE process memory without either mmu tricks (vmsplice in reverse) or a copy. May be workable for an (A)BUSE driver that talks over a network, and thus can splice() its way out. splice() actually takes an offset parameter, so it may be possible to treat that offset parameter as a file offset. That would essentially allow you to implement a splice() based thread pool where splice() replaces preadv/pwritev. It's not quite linux-aio, but it should take you pretty far. I think the main point is that the problem of allowing block plugins to qemu is the same as block plugins for the kernel. The kernel doesn't provide a stable interface (and we probably can't for the same reasons) and it's generally discouraged from a code quality perspective. That said, making an external program work well as a block backend is identical to making userspace block devices fast. Regards, Anthony Liguori
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 08:55 AM, Avi Kivity wrote: On 05/25/2010 04:53 PM, Kevin Wolf wrote: I'm still not convinced that we need either. I share Christoph's concern that we would make our life harder for almost no gain. It's probably a very small group of users (if it exists at all) that wants to add new block drivers themselves, but at the same time can't run upstream qemu. The first part of your argument may be true, but the second isn't. No user can run upstream qemu.git. It's not tested or supported, and has no backwards compatibility guarantees. Yes, it does have backwards compatibility guarantees. Regards, Anthony Liguori
[PATCH] vhost-net: fix reversed logic in mask notifiers
When guest notifier is assigned, we set mask notifier, which will assign kvm irqfd. When guest notifier is unassigned, mask notifier is unset, which should unassign kvm irqfd.

The way to do this is to call mask notifier telling it to mask the vector. This, unless vector is already masked which unassigns irqfd already.

The logic in unassign was reversed, which left kvm irqfd assigned.

This patch is qemu-kvm only as irqfd is not upstream.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Reported-by: Amit Shah <amit.s...@redhat.com>
---
 hw/msix.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index 8f9a621..1398680 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -617,6 +617,7 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
     assert(opaque);
     assert(!dev->msix_mask_notifier_opaque[vector]);
 
+    /* Unmask the new notifier unless vector is masked. */
     if (msix_is_masked(dev, vector)) {
         return 0;
     }
@@ -638,12 +639,13 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
     assert(dev->msix_mask_notifier);
     assert(dev->msix_mask_notifier_opaque[vector]);
 
+    /* Mask the old notifier unless it is already masked. */
     if (msix_is_masked(dev, vector)) {
         return 0;
     }
     r = dev->msix_mask_notifier(dev, vector,
                                 dev->msix_mask_notifier_opaque[vector],
-                                msix_is_masked(dev, vector));
+                                !msix_is_masked(dev, vector));
     if (r < 0) {
         return r;
     }
--
1.7.1.12.g42b7f
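The intended set/unset behaviour can be modelled in isolation with a toy sketch like the one below. It is not qemu code: struct vector_model and the model_* functions are hypothetical stand-ins for PCIDevice and the msix_* functions, and the irqfd is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one MSI-X vector; not qemu's PCIDevice. */
struct vector_model {
    bool masked;          /* guest-visible mask bit */
    bool irqfd_assigned;  /* state left behind by the last notifier call */
    void *opaque;         /* non-NULL while a notifier is registered */
};

/* Notifier callback: masking unassigns the irqfd, unmasking assigns it. */
static int model_notifier(struct vector_model *v, bool masked)
{
    v->irqfd_assigned = !masked;
    return 0;
}

int model_set_mask_notifier(struct vector_model *v, void *opaque)
{
    assert(opaque);
    assert(!v->opaque);
    v->opaque = opaque;

    /* Unmask the new notifier unless vector is masked. */
    if (v->masked)
        return 0;
    return model_notifier(v, false);
}

int model_unset_mask_notifier(struct vector_model *v)
{
    int r = 0;

    assert(v->opaque);

    /* Mask the old notifier unless it is already masked. */
    if (!v->masked)
        r = model_notifier(v, true);  /* the reversed logic passed "unmasked" */
    v->opaque = NULL;
    return r;
}
```

With the reversed argument, the unset path would call the notifier with "unmasked" and the model's irqfd_assigned flag would stay true, which corresponds to the leftover irqfd described in the patch.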
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 09:01 AM, Avi Kivity wrote: On 05/25/2010 04:55 PM, Anthony Liguori wrote: On 05/25/2010 08:38 AM, Avi Kivity wrote: On 05/25/2010 04:35 PM, Anthony Liguori wrote: On 05/25/2010 08:31 AM, Avi Kivity wrote:

A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach.

May be hard due to difficulty of exposing guest memory.

If someone did a series to add plugins, I would expect a very strong argument as to why a shared memory mechanism was not possible or at least plausible. I'm not sure I understand why shared memory is such a bad thing wrt KVM. Can you elaborate? Is it simply a matter of fork()?

fork() doesn't work with memory hotplug. What else is there?

Is it that fork() doesn't work or is it that fork() is very expensive?

It doesn't work: fork() is done at block device creation time, which freezes the child memory map, while guest memory is allocated at hotplug time.

Now I'm confused. I thought you were saying shared memory somehow affects fork(). If you're talking about shared memory inheritance via fork(), that's less important. You can also pass /dev/shm fd's via SCM_RIGHTS to establish shared memory segments dynamically.

Regards, Anthony Liguori

fork() actually isn't very expensive since we use MADV_DONTFORK (probably fast enough for everything except realtime). It may be possible to do a processfd() which can be mmap()ed by another process to export anonymous memory using mmu notifiers, not sure how easy or mergeable that is.
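Passing a file descriptor over SCM_RIGHTS, as suggested in the message above, could look roughly like this. It is a minimal sketch assuming a connected AF_UNIX socket between the two processes; send_fd() and recv_fd() are hypothetical helper names, and in the scenario discussed the descriptor would refer to a /dev/shm file that the receiver then mmap()s.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one descriptor over a connected AF_UNIX socket. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 0;                       /* at least one data byte is sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd in this process, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new number in the receiving process that refers to the same open file, so a later mmap() with MAP_SHARED on both sides gives the shared segment. This only covers file-backed memory, not anonymous mappings.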
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 25.05.2010 15:55, Avi Kivity wrote: On 05/25/2010 04:53 PM, Kevin Wolf wrote: I'm still not convinced that we need either. I share Christoph's concern that we would make our life harder for almost no gain. It's probably a very small group of users (if it exists at all) that wants to add new block drivers themselves, but at the same time can't run upstream qemu. The first part of your argument may be true, but the second isn't. No user can run upstream qemu.git. It's not tested or supported, and has no backwards compatibility guarantees. The second part was basically meant to say developers don't count here. Kevin
Re: [PATCH] KVM: svm: Drop unused local variable
Ah right, thanks :)

On Tue, May 25, 2010 at 10:02:15AM -0400, Jan Kiszka wrote:

Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>

Acked-by: Joerg Roedel <joerg.roe...@amd.com>

---
 arch/x86/kvm/svm.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5f25e59..3c03c36 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1491,8 +1491,6 @@ static void svm_handle_mce(struct vcpu_svm *svm)
 		 * Erratum 383 triggered. Guest state is corrupt so kill the
 		 * guest.
 		 */
-		struct kvm_run *kvm_run = svm->vcpu.run;
-
 		pr_err("KVM: Guest triggered AMD Erratum 383\n");
 		set_bit(KVM_REQ_TRIPLE_FAULT, &svm->vcpu.requests);
--
1.6.0.2
Re: [PATCH] vhost-net: fix reversed logic in mask notifiers
On (Tue) May 25 2010 [17:00:43], Michael S. Tsirkin wrote: When guest notifier is assigned, we set mask notifier, which will assign kvm irqfd. When guest notifier is unassigned, mask notifier is unset, which should unassign kvm irqfd. The way to do this is to call mask notifier telling it to mask the vector. This, unless vector is already masked which unassigns irqfd already. The logic in unassign was reversed, which left kvm irqfd assigned. This patch is qemu-kvm only as irqfd is not upstream. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> Reported-by: Amit Shah <amit.s...@redhat.com> Acked-by: Amit Shah <amit.s...@redhat.com> Amit
Re: [PATCH] vhost-net: fix reversed logic in mask notifiers
Michael S. Tsirkin <m...@redhat.com> wrote:

When guest notifier is assigned, we set mask notifier, which will assign kvm irqfd. When guest notifier is unassigned, mask notifier is unset, which should unassign kvm irqfd. The way to do this is to call mask notifier telling it to mask the vector. This, unless vector is already masked which unassigns irqfd already. The logic in unassign was reversed, which left kvm irqfd assigned. This patch is qemu-kvm only as irqfd is not upstream.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Reported-by: Amit Shah <amit.s...@redhat.com>
---
 hw/msix.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index 8f9a621..1398680 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -617,6 +617,7 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
     assert(opaque);
     assert(!dev->msix_mask_notifier_opaque[vector]);
 
+    /* Unmask the new notifier unless vector is masked. */
     if (msix_is_masked(dev, vector)) {
         return 0;
     }
@@ -638,12 +639,13 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
     assert(dev->msix_mask_notifier);
     assert(dev->msix_mask_notifier_opaque[vector]);
 
+    /* Mask the old notifier unless it is already masked. */
     if (msix_is_masked(dev, vector)) {
         return 0;
     }
     r = dev->msix_mask_notifier(dev, vector,
                                 dev->msix_mask_notifier_opaque[vector],
-                                msix_is_masked(dev, vector));
+                                !msix_is_masked(dev, vector));

Why not just put a 1 here? We have:

if (msix_is_masked()) return 0
r = msix_mask_notifier(., !msix_is_masked());

i.e. at that point msix_is_masked() is false, or we really, really need locking. Putting a !foo when we know that it needs to be a 1 looks strange.

Later, Juan.

PS. Yes, I already asked in a previous version to just have two methods, mask/unmask. We know at call time which one we need.
Re: [PATCH] vhost-net: fix reversed logic in mask notifiers
On Tue, May 25, 2010 at 04:37:36PM +0200, Juan Quintela wrote:

Michael S. Tsirkin <m...@redhat.com> wrote:

When guest notifier is assigned, we set mask notifier, which will assign kvm irqfd. When guest notifier is unassigned, mask notifier is unset, which should unassign kvm irqfd. The way to do this is to call mask notifier telling it to mask the vector. This, unless vector is already masked which unassigns irqfd already. The logic in unassign was reversed, which left kvm irqfd assigned. This patch is qemu-kvm only as irqfd is not upstream.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Reported-by: Amit Shah <amit.s...@redhat.com>
---
 hw/msix.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index 8f9a621..1398680 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -617,6 +617,7 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
     assert(opaque);
     assert(!dev->msix_mask_notifier_opaque[vector]);
 
+    /* Unmask the new notifier unless vector is masked. */
     if (msix_is_masked(dev, vector)) {
         return 0;
     }
@@ -638,12 +639,13 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
     assert(dev->msix_mask_notifier);
     assert(dev->msix_mask_notifier_opaque[vector]);
 
+    /* Mask the old notifier unless it is already masked. */
     if (msix_is_masked(dev, vector)) {
         return 0;
     }
     r = dev->msix_mask_notifier(dev, vector,
                                 dev->msix_mask_notifier_opaque[vector],
-                                msix_is_masked(dev, vector));
+                                !msix_is_masked(dev, vector));

Why not just put a 1 here? We have:

if (msix_is_masked()) return 0
r = msix_mask_notifier(., !msix_is_masked());

i.e. at that point msix_is_masked() is false, or we really, really need locking. Putting a !foo when we know that it needs to be a 1 looks strange.

Later, Juan.

PS. Yes, I already asked in a previous version to just have two methods, mask/unmask. We know at call time which one we need.
I find msix_is_masked clearer here than true, since you don't need to look up the definition to understand what this 'true' stands for. The value is clear from the code above. What do you think?

-- MST
Re: [PATCH] vhost-net: fix reversed logic in mask notifiers
On 05/25/10 16:00, Michael S. Tsirkin wrote: When guest notifier is assigned, we set mask notifier, which will assign kvm irqfd. When guest notifier is unassigned, mask notifier is unset, which should unassign kvm irqfd. The way to do this is to call mask notifier telling it to mask the vector. This, unless vector is already masked which unassigns irqfd already. The logic in unassign was reversed, which left kvm irqfd assigned. This patch is qemu-kvm only as irqfd is not upstream. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> Reported-by: Amit Shah <amit.s...@redhat.com> Acked-by: Gerd Hoffmann <kra...@redhat.com> cheers, Gerd
Re: [PATCH v2 00/10] Redirct and make use of the guest serial console
On Tue, 2010-05-11 at 17:03 +0800, Jason Wang wrote:

The guest console is useful for failure troubleshooting, especially when the guest produces a call trace. And as we plan to push the network related tests in the next few weeks, we found the serial session is more reliable during network testing. So this patchset logs the guest serial output through the redirected serial port of the guest and also enables logging into the guest through the serial console. I only open the serial console for Linux; I will do some investigation on Windows guests.

Change from v1:
- Coding style improvement according to the suggestions from Michael Goldish
- Improve the username sending handling in remote_login()
- Change the matching re of login to [Ll]ogin:\s*$
- Check whether the vm is already dead in the dumping thread
- Return none rather than raise an exception when meeting an unknown shell_client
- Keep tty0 for all linux guests
- Enable the serial console in unattended installation
- Add a helper to check whether panic information has occurred
- Keep the process() at its original location in preprocess()

Jason, after a long conversation I've had with Michael during the previous week, we reached some common points:

1 - We believe it is possible to both log in *and* log serial console output. That will require changes to kvm_subprocess and might take a little bit more time.

2 - We know you guys are depending on this patchset being accepted in order to proceed with the network related cases. However, we ask for a little more patience, and we'd like to get your opinions on the patches that we are going to roll out. This way we can get to a better solution for all of us.

So, please bear with us and I'll try to see with Michael and Dor if we can prioritize this work to not block work items for you guys.

Cheers, Lucas
Re: [PATCH] vhost-net: fix reversed logic in mask notifiers
Michael S. Tsirkin <m...@redhat.com> wrote: On Tue, May 25, 2010 at 04:37:36PM +0200, Juan Quintela wrote:

We have:

if (msix_is_masked()) return 0
r = msix_mask_notifier(., !msix_is_masked());

i.e. at that point msix_is_masked() is false, or we really, really need locking. Putting a !foo when we know that it needs to be a 1 looks strange.

Later, Juan.

PS. Yes, I already asked in a previous version to just have two methods, mask/unmask. We know at call time which one we need.

I find msix_is_masked clearer here than true since you don't need to look up definition to understand what this 'true' stands for. The value is clear from code above. What do you think?

I prefer the change, but it is up to you. At that point, we are using !msix_masked() to mean true, i.e. we know that msix_masked() is false. What you want to do is mask.

Later, Juan.
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 05:05 PM, Anthony Liguori wrote: On 05/25/2010 09:01 AM, Avi Kivity wrote: On 05/25/2010 04:55 PM, Anthony Liguori wrote: On 05/25/2010 08:38 AM, Avi Kivity wrote: On 05/25/2010 04:35 PM, Anthony Liguori wrote: On 05/25/2010 08:31 AM, Avi Kivity wrote:

A protocol based mechanism has the advantage of being more robust in the face of poorly written block backends so if it's possible to make it perform as well as a plugin, it's a preferable approach.

May be hard due to difficulty of exposing guest memory.

If someone did a series to add plugins, I would expect a very strong argument as to why a shared memory mechanism was not possible or at least plausible. I'm not sure I understand why shared memory is such a bad thing wrt KVM. Can you elaborate? Is it simply a matter of fork()?

fork() doesn't work with memory hotplug. What else is there?

Is it that fork() doesn't work or is it that fork() is very expensive?

It doesn't work: fork() is done at block device creation time, which freezes the child memory map, while guest memory is allocated at hotplug time.

Now I'm confused. I thought you were saying shared memory somehow affects fork(). If you're talking about shared memory inheritance via fork(), that's less important.

The latter. Why is it less important? If you don't inherit the memory, you can't access it.

You can also pass /dev/shm fd's via SCM_RIGHTS to establish shared memory segments dynamically.

Doesn't work for anonymous memory.

-- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 05:09 PM, Kevin Wolf wrote: The first part of your argument may be true, but the second isn't. No user can run upstream qemu.git. It's not tested or supported, and has no backwards compatibility guarantees. The second part was basically meant to say developers don't count here. Agreed. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On 05/25/2010 10:00 AM, Avi Kivity wrote: The latter. Why is it less important? If you don't inherit the memory, you can't access it. You can also pass /dev/shm fd's via SCM_RIGHTs to establish shared memory segments dynamically. Doesn't work for anonymous memory. What's wrong with /dev/shm memory? Regards, Anthony Liguori
Re: [PATCH] vhost-net: fix reversed logic in mask notifiers
On Tue, May 25, 2010 at 04:58:15PM +0200, Juan Quintela wrote: Michael S. Tsirkin <m...@redhat.com> wrote: On Tue, May 25, 2010 at 04:37:36PM +0200, Juan Quintela wrote:

We have:

if (msix_is_masked()) return 0
r = msix_mask_notifier(., !msix_is_masked());

i.e. at that point msix_is_masked() is false, or we really, really need locking. Putting a !foo when we know that it needs to be a 1 looks strange.

Later, Juan.

PS. Yes, I already asked in a previous version to just have two methods, mask/unmask. We know at call time which one we need.

I find msix_is_masked clearer here than true since you don't need to look up definition to understand what this 'true' stands for. The value is clear from code above. What do you think?

I prefer the change, but it is up to you. At that point, we are using !msix_masked() to mean true, i.e. we know that msix_masked() is false. What you want to do is mask.

Later, Juan.

Right. I guess I'll keep it as is; when I look at it with a fresh mind next time, I'll clean it all up.

-- MST