Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
At Fri, 21 May 2010 06:28:42 +0100, Stefan Hajnoczi wrote:
> On Thu, May 20, 2010 at 11:16 PM, Christian Brunner c...@muc.de wrote:
>> 2010/5/20 Anthony Liguori anth...@codemonkey.ws:
>>> Both sheepdog and ceph ultimately transmit I/O over a socket to a
>>> central daemon, right? So could we not standardize a protocol for
>>> this that both sheepdog and ceph could implement?
>>
>> There is no central daemon. The concept is that they talk to many
>> storage nodes at the same time. Data is distributed and replicated
>> over many nodes in the network. The mechanism to do this is quite
>> complex. I don't know about sheepdog, but in Ceph this is called
>> RADOS (reliable autonomic distributed object store). Sheepdog and
>> Ceph may look similar, but this is where they act differently. I
>> don't think that it would be possible to implement a common protocol.
>
> I believe Sheepdog has a local daemon on each node. The QEMU storage
> backend talks to the daemon on the same node, which then does the real
> network communication with the rest of the distributed storage system.

Yes. This is because Sheepdog has no configuration for cluster membership, as I mentioned in another mail, so the driver doesn't know which node to access other than localhost.

> So I think we're not talking about a network protocol here, we're
> talking about a common interface that can be used by QEMU and other
> programs to take advantage of Ceph, Sheepdog, etc services available
> on the local node.
>
> Haven't looked into your patch enough yet, but does librados talk
> directly over the network or does it connect to a local daemon/driver?

AFAIK, librados accesses the network directly, so I think it is difficult to define a common interface.

Thanks,

Kazutaka
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][v3] KVM: VMX: Enable XSAVE/XRSTORE for guest
On Thursday 20 May 2010 17:46:40 Avi Kivity wrote:
> On 05/20/2010 12:16 PM, Sheng Yang wrote:
>> From: Dexuan Cui dexuan@intel.com
>>
>> Enable XSAVE/XRSTORE for guest.
>>
>> Change from V2:
>> Addressed comments from Avi.
>>
>> Change from V1:
>> 1. Use FPU API.
>> 2. Fix CPUID issue.
>> 3. Save/restore all possible guest xstate fields when switching,
>> because we don't know which fields the guest has already touched.
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index d08bb4a..3938bd1 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -302,6 +302,7 @@ struct kvm_vcpu_arch {
>>  	} update_pte;
>>
>>  	struct fpu guest_fpu;
>> +	u64 xcr0;
>>
>>  	gva_t mmio_fault_cr2;
>>  	struct kvm_pio_request pio;
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index 9e6779f..346ea66 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -266,6 +266,7 @@ enum vmcs_field {
>>  #define EXIT_REASON_EPT_VIOLATION       48
>>  #define EXIT_REASON_EPT_MISCONFIG       49
>>  #define EXIT_REASON_WBINVD		54
>> +#define EXIT_REASON_XSETBV		55
>>
>>  /*
>>   * Interruption-information format
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 99ae513..a63f206 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -36,6 +36,8 @@
>>  #include <asm/vmx.h>
>>  #include <asm/virtext.h>
>>  #include <asm/mce.h>
>> +#include <asm/i387.h>
>> +#include <asm/xcr.h>
>>
>>  #include "trace.h"
>>
>> @@ -247,6 +249,9 @@ static const u32 vmx_msr_index[] = {
>>  };
>>  #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index)
>>
>> +#define MERGE_TO_U64(low, high) \
>> +		(((low) & -1u) | ((u64)((high) & -1u) << 32))
>> +
>
> static inline u64 kvm_read_edx_eax(vcpu) in cache_regs.h
>
>> +static int handle_xsetbv(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 new_bv = MERGE_TO_U64(kvm_register_read(vcpu, VCPU_REGS_RAX),
>> +		kvm_register_read(vcpu, VCPU_REGS_RDX));
>> +
>> +	if (kvm_register_read(vcpu, VCPU_REGS_RCX) != 0)
>> +		goto err;
>> +	if (vmx_get_cpl(vcpu) != 0)
>> +		goto err;
>> +	if (!(new_bv & XSTATE_FP))
>> +		goto err;
>> +	if ((new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE))
>> +		goto err;
>
> What about a check against unknown bits?
>
>> +	vcpu->arch.xcr0 = new_bv;
>> +	xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>> +	skip_emulated_instruction(vcpu);
>> +	return 1;
>> +err:
>> +	kvm_inject_gp(vcpu, 0);
>> +	return 1;
>> +}
>> +
>>  static int handle_apic_access(struct kvm_vcpu *vcpu)
>>  {
>>  	return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
>>
>> +static u64 host_xcr0;
>
> __read_mostly.
>
>> +
>> +static void update_cpuid(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_cpuid_entry2 *best;
>> +
>> +	best = kvm_find_cpuid_entry(vcpu, 1, 0);
>> +	if (!best)
>> +		return;
>> +
>> +	/* Update OSXSAVE bit */
>> +	if (cpu_has_xsave && best->function == 0x1) {
>> +		best->ecx &= ~(bit(X86_FEATURE_OSXSAVE));
>> +		if (kvm_read_cr4(vcpu) & X86_CR4_OSXSAVE)
>> +			best->ecx |= bit(X86_FEATURE_OSXSAVE);
>> +	}
>> +}
>
> Note: need to update after userspace writes cpuid as well.

I don't quite understand. An OSXSAVE bit set by userspace should be trimmed, IMO...

>> +
>>  int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>  {
>>  	unsigned long old_cr4 = kvm_read_cr4(vcpu);
>> @@ -481,6 +513,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>  	if (cr4 & CR4_RESERVED_BITS)
>>  		return 1;
>>
>> +	if (!guest_cpuid_has_xsave(vcpu) && (cr4 & X86_CR4_OSXSAVE))
>> +		return 1;
>> +
>>  	if (is_long_mode(vcpu)) {
>>  		if (!(cr4 & X86_CR4_PAE))
>>  			return 1;
>> @@ -497,6 +532,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>  	if ((cr4 ^ old_cr4) & pdptr_bits)
>>  		kvm_mmu_reset_context(vcpu);
>>
>> +	if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
>> +		update_cpuid(vcpu);
>> +
>
> I think we need to reload the guest's xcr0 at this point.
> Alternatively, call vmx_load_host_state() to ensure the next entry
> will reload it.

The current xcr0 would be loaded at the next vmentry. And if we use prepare_guest_switch(), how about SVM?

>> @@ -1931,7 +1964,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>>  	switch (function) {
>>  	case 0:
>> -		entry->eax = min(entry->eax, (u32)0xb);
>> +		entry->eax = min(entry->eax, (u32)0xd);
>
> Do we need any special handling for leaf 0xc?

I don't think so. CPUID would return all 0 for it.

>> @@ -4567,6 +4616,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>  	kvm_x86_ops->prepare_guest_switch(vcpu);
>>  	if (vcpu->fpu_active)
>>  		kvm_load_guest_fpu(vcpu);
>> +	if (kvm_read_cr4(vcpu) & X86_CR4_OSXSAVE)
>> +
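The question about unknown bits in the XSETBV handler can be made concrete with a small userspace sketch in C. This is not the patch's code: `xcr0_is_valid`, `merge_to_u64` and the `supported` parameter are hypothetical names chosen for illustration, and `supported` is assumed to be whatever set of XCR0 bits the host is willing to expose. Only the three feature bits discussed in the thread are defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* XCR0 feature bits discussed in the thread (bits 0, 1 and 2). */
#define XSTATE_FP  0x1ULL
#define XSTATE_SSE 0x2ULL
#define XSTATE_YMM 0x4ULL

/* Combine the low 32 bits of two registers into one 64-bit value (EDX:EAX). */
static uint64_t merge_to_u64(uint64_t low, uint64_t high)
{
	return (low & 0xffffffffULL) | ((high & 0xffffffffULL) << 32);
}

/*
 * Hypothetical helper: would this XSETBV value be acceptable?
 * 'supported' is an assumption of this sketch: the mask of XCR0 bits
 * the host allows. It must itself contain the x87 bit.
 */
static bool xcr0_is_valid(uint64_t new_bv, uint64_t supported)
{
	assert(supported & XSTATE_FP);

	if (!(new_bv & XSTATE_FP))	/* x87 state must stay enabled */
		return false;
	if ((new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE))
		return false;		/* YMM state requires SSE state */
	if (new_bv & ~supported)	/* reject bits the host does not know */
		return false;
	return true;
}
```

With `supported` set to `XSTATE_FP | XSTATE_SSE | XSTATE_YMM`, a value of 0x7 is accepted, while 0x5 (YMM without SSE) and 0x9 (an unknown bit) are rejected. In the handler, a rejected value would correspond to the `goto err` path.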
Re: [PATCH 0/7] Consolidate vcpu ioctl locking
On 15.05.2010 10:26, Alexander Graf wrote:
> On S390, I'm also still sceptical if the implementation we have really
> works. A device injects an S390_INTERRUPT with its address and on the
> next vcpu_run, an according interrupt is issued. But what happens if
> two devices trigger an S390_INTERRUPT before the vcpu_run? We'd have
> lost an interrupt by then...

We're safe on that: the interrupt info fields in both struct kvm (for floating interrupts) and struct vcpu (for cpu local interrupts) have their own locking and can queue up interrupts.

cheers,
Carsten
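The point that a locked queue keeps a second interrupt from overwriting the first can be sketched in userspace C. This is only an illustration of the idea, not the S390 KVM data structures, which are not shown in this thread: `struct pending_irq`, `struct irq_queue` and the three functions are hypothetical, and a pthread mutex stands in for the kernel lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* One queued interrupt; the fields are illustrative only. */
struct pending_irq {
	unsigned int type;
	unsigned int parm;
	struct pending_irq *next;
};

/* A FIFO of pending interrupts protected by its own lock. */
struct irq_queue {
	pthread_mutex_t lock;
	struct pending_irq *head;
	struct pending_irq *tail;
};

static void irq_queue_init(struct irq_queue *q)
{
	assert(q != NULL);
	pthread_mutex_init(&q->lock, NULL);
	q->head = q->tail = NULL;
}

/* Append an interrupt; returns 0 on success, -1 if allocation fails. */
static int irq_inject(struct irq_queue *q, unsigned int type, unsigned int parm)
{
	struct pending_irq *irq = malloc(sizeof(*irq));

	if (!irq)
		return -1;
	irq->type = type;
	irq->parm = parm;
	irq->next = NULL;

	pthread_mutex_lock(&q->lock);
	if (q->tail)
		q->tail->next = irq;
	else
		q->head = irq;
	q->tail = irq;
	pthread_mutex_unlock(&q->lock);
	return 0;
}

/* Remove the oldest interrupt, or return NULL; the caller frees it. */
static struct pending_irq *irq_dequeue(struct irq_queue *q)
{
	struct pending_irq *irq;

	pthread_mutex_lock(&q->lock);
	irq = q->head;
	if (irq) {
		q->head = irq->next;
		if (!q->head)
			q->tail = NULL;
	}
	pthread_mutex_unlock(&q->lock);
	return irq;
}
```

Two calls to `irq_inject` before the consumer runs leave two entries in the queue, so the consumer sees both in injection order.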
Re: [PATCH][v3] KVM: VMX: Enable XSAVE/XRSTORE for guest
On 05/21/2010 10:26 AM, Sheng Yang wrote:
>>> +
>>> +static void update_cpuid(struct kvm_vcpu *vcpu)
>>> +{
>>> +	struct kvm_cpuid_entry2 *best;
>>> +
>>> +	best = kvm_find_cpuid_entry(vcpu, 1, 0);
>>> +	if (!best)
>>> +		return;
>>> +
>>> +	/* Update OSXSAVE bit */
>>> +	if (cpu_has_xsave && best->function == 0x1) {
>>> +		best->ecx &= ~(bit(X86_FEATURE_OSXSAVE));
>>> +		if (kvm_read_cr4(vcpu) & X86_CR4_OSXSAVE)
>>> +			best->ecx |= bit(X86_FEATURE_OSXSAVE);
>>> +	}
>>> +}
>>
>> Note: need to update after userspace writes cpuid as well.
>
> I don't quite understand. An OSXSAVE bit set by userspace should be
> trimmed, IMO...

Two cases: userspace does KVM_SET_CPUID2 with osxsave set but cr4.xsave clear, or the other way round. So we should set cpuid.osxsave according to cr4.xsave whenever cr4 OR cpuid is modified, and completely ignore the userspace setting for that bit.

>>> @@ -497,6 +532,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>>  	if ((cr4 ^ old_cr4) & pdptr_bits)
>>>  		kvm_mmu_reset_context(vcpu);
>>>
>>> +	if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
>>> +		update_cpuid(vcpu);
>>> +
>>
>> I think we need to reload the guest's xcr0 at this point.
>> Alternatively, call vmx_load_host_state() to ensure the next entry
>> will reload it.
>
> The current xcr0 would be loaded at the next vmentry.

True.

> And if we use prepare_guest_switch(), how about SVM?

kvm_arch_vcpu_load() looks like a good place, as long as interrupts don't use the fpu.

>>> @@ -5134,6 +5197,10 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
>>>  	vcpu->guest_fpu_loaded = 1;
>>>  	unlazy_fpu(current);
>>> +	/* Restore all possible states in the guest */
>>> +	if (cpu_has_xsave && guest_cpuid_has_xsave(vcpu))
>>> +		xsetbv(XCR_XFEATURE_ENABLED_MASK,
>>> +			cpuid_get_possible_xcr0(vcpu));
>>
>> Best to calculate it out of the fast path, when guest cpuid is set.
>> Need to check it at this time as well.
>
> You mean guest_cpuid_has_xsave()? I don't quite understand the point
> here...

Also cpuid_get_possible_cr0(). So we have something like

    if (vcpu->save_xcr0)
        xsetbv(vcpu->save_xcr0);

Those cpuid functions have loops, and we don't want them running on every context switch.

>> Also can avoid it if guest xcr0 == host xcr0.
>
> I don't know whether the assumption that the host uses all possible
> xcr0 bits applies. If so, using only host_xcr0 should be fine.

I think we can rely on it. Those bits are a service to userspace and the guest is just a different kind of userspace, so it makes sense to expose the same set.

> Would update other points. Thanks.

Thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
[RFC][PATCH v6 01/19] Add a new structure for skb buffer from external.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..cf309c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,18 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+/* The structure is for a skb which skb->data may point to
+ * an external buffer, which is not allocated from kernel space.
+ * Since the buffer is external, then the shinfo or frags are
+ * also extern too. It also contains a destructor for itself.
+ */
+struct skb_external_page {
+	u8		*start;
+	int		size;
+	struct skb_frag_struct *frags;
+	struct skb_shared_info *ushinfo;
+	void		(*dtor)(struct skb_external_page *);
+};
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
--
1.5.4.4
[RFC][PATCH v6 04/19] Add a ndo_mp_port_prep pointer to net_device_ops.
From: Xin Xiaohui xiaohui@intel.com

If the driver wants to allocate external buffers, it can export its
capabilities, such as the skb buffer header length and the page length
that can be used for DMA. The owner of the external buffers may make
use of this.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index efb575a..183c786 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -707,6 +707,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mpassthru_port *port);
+#endif
 };
 
 /*
--
1.5.4.4
[RFC][PATCH v6 07/19] Add interface to get external buffers.
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |   12 ++++++++++++
 net/core/skbuff.c      |   16 ++++++++++++++++
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cf309c9..281a1c0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1519,6 +1519,18 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
 	__free_page(page);
 }
 
+extern struct skb_external_page *netdev_alloc_external_pages(
+			struct net_device *dev,
+			struct sk_buff *skb, int npages);
+
+static inline struct skb_external_page *netdev_alloc_external_page(
+			struct net_device *dev,
+			struct sk_buff *skb, unsigned int size)
+{
+	return netdev_alloc_external_pages(dev, skb,
+			DIV_ROUND_UP(size, PAGE_SIZE));
+}
+
 /**
  *	skb_clone_writable - is the header of a clone writable
  *	@skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..fbdb1f1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -278,6 +278,22 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+struct skb_external_page *netdev_alloc_external_pages(struct net_device *dev,
+			struct sk_buff *skb, int npages)
+{
+	struct mpassthru_port *port;
+	struct skb_external_page *ext_page = NULL;
+
+	port = rcu_dereference(dev->mp_port);
+	if (!port)
+		goto out;
+	WARN_ON(npages > port->npages);
+	ext_page = port->ctor(port, skb, npages);
+out:
+	return ext_page;
+}
+EXPORT_SYMBOL(netdev_alloc_external_pages);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
--
1.5.4.4
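The allocation path in this patch has two parts: a byte size is rounded up to a page count, and the request is forwarded to a constructor callback if the device has a port bound. The following userspace C sketch mirrors that shape under stated assumptions: a 4 KiB `PAGE_SIZE`, and simplified stand-in types (`struct mp_port`, `struct net_dev`, `struct ext_page`, `demo_ctor`) that are hypothetical and omit the skb argument and RCU.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u				/* assumption: 4 KiB pages */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Simplified stand-ins for the kernel structures (hypothetical). */
struct ext_page {
	unsigned int npages;
};

struct mp_port {
	unsigned int npages;			/* pages the port can hand out */
	struct ext_page *(*ctor)(struct mp_port *port, unsigned int npages);
};

struct net_dev {
	struct mp_port *mp_port;		/* NULL when nothing is bound */
};

/* Forward to the port's constructor, or return NULL if no port is bound. */
static struct ext_page *alloc_external_pages(struct net_dev *dev,
					     unsigned int npages)
{
	struct mp_port *port;

	assert(dev != NULL);
	port = dev->mp_port;
	if (!port)
		return NULL;
	return port->ctor(port, npages);
}

/* Byte-sized variant: round the size up to whole pages first. */
static struct ext_page *alloc_external_page(struct net_dev *dev,
					    unsigned int size)
{
	return alloc_external_pages(dev, DIV_ROUND_UP(size, PAGE_SIZE));
}

/* Demo constructor that just records how many pages were requested. */
static struct ext_page demo_page;

static struct ext_page *demo_ctor(struct mp_port *port, unsigned int npages)
{
	(void)port;
	demo_page.npages = npages;
	return &demo_page;
}
```

With these definitions a request for 4097 bytes asks the constructor for two pages, and a device without a bound port yields NULL so the caller can fall back to ordinary kernel buffers.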
[RFC][PATCH v6 06/19] Add a function to indicate if device use external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 31d9c4a..0cb78f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern void netdev_mp_port_detach(struct net_device *dev);
 extern int netdev_mp_port_prep(struct net_device *dev,
 				struct mpassthru_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return (dev && dev->mp_port);
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
--
1.5.4.4
[RFC][PATCH v6 11/19] Use callback to deal with skb_release_data() specially.
From: Xin Xiaohui xiaohui@intel.com

If the buffer is external, use the callback to destroy the buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37587f0..418457c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 
 static void skb_release_data(struct sk_buff *skb)
 {
+	/* check if the skb has external buffers, we have use destructor_arg
+	 * here to indicate
+	 */
+	struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
+
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
@@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		/* if the skb has external buffers, use destructor here,
+		 * since after that skb->head will be kfree, in case skb->head
+		 * from external buffer cannot use kfree to destroy.
+		 */
+		if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
+			ext_page->dtor(ext_page);
 		kfree(skb->head);
 	}
 }
--
1.5.4.4
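The release path in this patch follows a common pattern: if the buffer carries an owner-supplied destructor, call it before freeing the part the core allocated. A minimal userspace C sketch of that pattern follows. The types `struct ext_buf` and `struct pkt`, and the functions `pkt_release` and `demo_dtor`, are hypothetical simplifications; in particular the sketch assumes `head` always comes from `malloc`, which sidesteps the kfree question raised in the patch comment.

```c
#include <assert.h>
#include <stdlib.h>

/* Externally owned buffer with its own destructor (hypothetical type). */
struct ext_buf {
	void *start;
	void (*dtor)(struct ext_buf *buf);
};

/* Simplified packet: 'ext' plays the role of destructor_arg. */
struct pkt {
	void *head;			/* allocated with malloc() in this sketch */
	struct ext_buf *ext;		/* NULL for ordinary packets */
};

/* Release a packet, giving external memory back to its owner first. */
static void pkt_release(struct pkt *p)
{
	assert(p != NULL);
	if (p->ext && p->ext->dtor)
		p->ext->dtor(p->ext);
	free(p->head);
	p->head = NULL;
}

/* Demo destructor that counts how often it ran. */
static int dtor_calls;

static void demo_dtor(struct ext_buf *buf)
{
	dtor_calls++;
	buf->start = NULL;
}
```

A packet with `ext` set triggers the destructor exactly once on release; a packet without it only frees `head`.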
[RFC][PATCH v6 13/19] To skip GRO if buffer is external currently.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index dc2f225..6c6b2fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2787,6 +2787,10 @@ enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 	if (skb_is_gso(skb) || skb_has_frags(skb))
 		goto normal;
 
+	/* currently GRO is not supported by mediate passthru */
+	if (dev_is_mpassthru(skb->dev))
+		goto normal;
+
 	rcu_read_lock();
 	list_for_each_entry_rcu(ptype, head, list) {
 		if (ptype->type != type || ptype->dev || !ptype->gro_receive)
--
1.5.4.4
[RFC][PATCH v6 14/19] Add header file for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/mpassthru.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
--
1.5.4.4
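The two ioctl numbers in this header are built with the standard `_IOW` and `_IO` encoding macros, which pack a direction, a type character, a sequence number and an argument size into one value. As a rough illustration, the userspace C sketch below decodes them again with the matching `_IOC_*` macros. It assumes a Linux build environment where `<linux/ioctl.h>` is available; `struct ioctl_fields` and `decode_ioctl` are hypothetical helper names, and the two defines are copied from the patch.

```c
#include <assert.h>
#include <linux/ioctl.h>

/* Copied from the patch above. */
#define MPASSTHRU_BINDDEV   _IOW('M', 213, int)
#define MPASSTHRU_UNBINDDEV _IO('M', 214)

/* Hypothetical helper type holding the decoded parts of a command. */
struct ioctl_fields {
	unsigned int dir;	/* _IOC_NONE, _IOC_READ and/or _IOC_WRITE */
	unsigned int type;	/* the "magic" character, here 'M' */
	unsigned int nr;	/* sequence number */
	unsigned int size;	/* size of the argument type in bytes */
};

/* Split an ioctl command number into its encoded fields. */
static struct ioctl_fields decode_ioctl(unsigned int cmd)
{
	struct ioctl_fields f;

	f.dir = _IOC_DIR(cmd);
	f.type = _IOC_TYPE(cmd);
	f.nr = _IOC_NR(cmd);
	f.size = _IOC_SIZE(cmd);
	assert(f.dir == _IOC_NONE || f.size > 0);	/* data implies a size */
	return f;
}
```

Decoding `MPASSTHRU_BINDDEV` gives type 'M', number 213, a write direction and the size of an `int`; `MPASSTHRU_UNBINDDEV` carries no argument, so its size is 0 and its direction is `_IOC_NONE`.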
[RFC][PATCH v6 18/19] Add a kconfig entry and make entry for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support, we call it as mediate passthru to
+	  be distinguish with hardware passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
--
1.5.4.4
[PATCH v6 17/19] Export proto_ops to vhost-net driver.
From: Xin Xiaohui xiaohui@intel.com

Currently, vhost-net is the only user of the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/mpassthru.c |  330 -
 1 files changed, 325 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index de07f1e..d0df691 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -414,6 +414,11 @@ static void mp_put(struct mp_file *mfile)
 	mp_detach(mfile->mp);
 }
 
+static void iocb_tag(struct kiocb *iocb)
+{
+	iocb->ki_flags = 1;
+}
+
 /* The callback to destruct the external buffers or skb */
 static void page_dtor(struct skb_external_page *ext_page)
 {
@@ -449,7 +454,7 @@ static void page_dtor(struct skb_external_page *ext_page)
 	 * Queue the notifier to wake up the backend driver
 	 */
 
-	create_iocb(info, info->total);
+	iocb_tag(info->iocb);
 
 	sk = ctor->port.sock->sk;
 	sk->sk_write_space(sk);
@@ -569,8 +574,323 @@ failed:
 	return NULL;
 }
 
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = NULL;
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	struct ethhdr *eth;
+	struct kiocb *iocb = NULL;
+	int len, i;
+
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		if (skb_shinfo(skb)->destructor_arg) {
+			info = container_of(skb_shinfo(skb)->destructor_arg,
+					struct page_info, ext_page);
+			info->skb = skb;
+			if (skb->len > info->len) {
+				mp->dev->stats.rx_dropped++;
+				DBG(KERN_INFO "Discarded truncated rx packet: "
+					" len %d > %zd\n", skb->len, info->len);
+				info->total = skb->len;
+				goto clean;
+			} else {
+				int i;
+				struct skb_shared_info *gshinfo =
+				(struct skb_shared_info *)
+				(&info->ushinfo);
+				struct skb_shared_info *hshinfo =
+				skb_shinfo(skb);
+
+				if (gshinfo->nr_frags < hshinfo->nr_frags)
+					goto clean;
+				eth = eth_hdr(skb);
+				skb_push(skb, ETH_HLEN);
+
+				hdr.hdr_len = skb_headlen(skb);
+				info->total = skb->len;
+
+				for (i = 0; i < gshinfo->nr_frags; i++)
+					gshinfo->frags[i].size = 0;
+				for (i = 0; i < hshinfo->nr_frags; i++)
+					gshinfo->frags[i].size =
+						hshinfo->frags[i].size;
+			}
+		} else {
+			/* The skb composed with kernel buffers
+			 * in case external buffers are not sufficient.
+			 * The case should be rare.
+			 */
+			unsigned long flags;
+			int i;
+			struct skb_shared_info *gshinfo = NULL;
+
+			info = NULL;
+
+			spin_lock_irqsave(&ctor->read_lock, flags);
+			if (!list_empty(&ctor->readq)) {
+				info = list_first_entry(&ctor->readq,
+						struct page_info, list);
+				list_del(&info->list);
+			}
+			spin_unlock_irqrestore(&ctor->read_lock, flags);
+			if (!info) {
+				DBG(KERN_INFO
+					"No external buffer available %p\n",
+					skb);
+
[RFC][PATCH v6 19/19] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend currently supports only synchronous send/recv
operations. This patch adds multiple submits and asynchronous
notifications, which are needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  255 -
 drivers/vhost/vhost.c |  120 +--
 drivers/vhost/vhost.h |   14 +++
 3 files changed, 333 insertions(+), 56 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9777583..9a0d162 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -45,10 +47,13 @@ enum vhost_net_poll_state {
 	VHOST_NET_POLL_STOPPED = 2,
 };
 
+static struct kmem_cache *notify_cache;
+
 struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache       *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -93,11 +98,146 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		size = iocb->ki_nbytes;
+		head = iocb->ki_pos;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		/* when log is enabled, recomputing the log info is needed,
+		 * since these buffers are in async queue, and may not get
+		 * the log info before.
+		 */
+		if (unlikely(vq_log)) {
+			if (!log)
+				__vhost_get_vq_desc(&net->dev, vq, vq->iov,
+						ARRAY_SIZE(vq->iov),
+						&out, &in, vq_log,
+						&log, head);
+			vhost_log_write(vq, vq_log, log, size);
+		}
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct list_head *entry, *tmp;
+	unsigned long flags;
+	int tx_total_len = 0;
+
+	if (!is_async_vq(vq))
+		return;
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_for_each_safe(entry, tmp, &vq->notifier) {
+		iocb = list_entry(entry,
+				struct kiocb, ki_list);
+		if (!iocb->ki_flags)
+			continue;
+		list_del(&iocb->ki_list);
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+
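The RX notify loop above drains completed requests until a byte budget is used up, then reschedules itself instead of monopolising the worker. A stripped-down userspace C sketch of that weight-limited drain follows. It is an illustration only: `struct completion`, `struct notify_list`, `drain_completions` and the `weight` parameter are hypothetical, locking is left out, and the real code signals the guest and frees each iocb where this sketch only adds up byte counts.

```c
#include <assert.h>
#include <stddef.h>

/* One completed request; only the byte count matters here. */
struct completion {
	size_t nbytes;
	struct completion *next;
};

/* Singly linked FIFO of completions (no locking in this sketch). */
struct notify_list {
	struct completion *head;
};

/* Pop the oldest completion, or return NULL if the list is empty. */
static struct completion *notify_dequeue(struct notify_list *list)
{
	struct completion *c;

	assert(list != NULL);
	c = list->head;
	if (c)
		list->head = c->next;
	return c;
}

/*
 * Drain completions until the list is empty or 'weight' bytes have
 * been handled. Returns 1 if it stopped early, meaning the caller
 * should requeue the work, and 0 if the list was fully drained.
 * The number of bytes handled is stored in *total.
 */
static int drain_completions(struct notify_list *list, size_t weight,
			     size_t *total)
{
	struct completion *c;

	*total = 0;
	while ((c = notify_dequeue(list)) != NULL) {
		*total += c->nbytes;
		if (*total >= weight)
			return 1;
	}
	return 0;
}
```

As in the patch, the budget check comes after an entry has been handled, so one call may exceed the weight by at most one entry before yielding.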
[RFC][PATCH v6 16/19] Manipulate external buffers in mp device.
From: Xin, Xiaohui xiaohui@intel.com

This patch covers where external buffers come from and how they are
destroyed.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/mpassthru.c |  253 -
 1 files changed, 251 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 25e2f3e..de07f1e 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -161,6 +161,39 @@ static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
 	return ret;
 }
 
+/* The main function to allocate external buffers */
+static struct skb_external_page *page_ctor(struct mpassthru_port *port,
+		struct sk_buff *skb, int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_ctor *ctor;
+	struct page_info *info = NULL;
+
+	ctor = container_of(port, struct page_ctor, port);
+
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++) {
+		get_page(info->pages[i]);
+		info->frag[i].page = info->pages[i];
+		info->frag[i].page_offset = i ? 0 : info->offset;
+		info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+			port->data_len;
+	}
+	info->skb = skb;
+	info->ext_page.frags = info->frag;
+	info->ext_page.ushinfo = &info->ushinfo;
+	return &info->ext_page;
+}
+
 static int page_ctor_attach(struct mp_struct *mp)
 {
 	int rc;
@@ -186,7 +219,7 @@ static int page_ctor_attach(struct mp_struct *mp)
 	dev_hold(dev);
 	ctor->dev = dev;
 
-	ctor->port.ctor = NULL;
+	ctor->port.ctor = page_ctor;
 	ctor->port.sock = &mp->socket;
 	ctor->lock_pages = 0;
 	rc = netdev_mp_port_attach(dev, &ctor->port);
@@ -252,11 +285,66 @@ static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
 	return 0;
 }
 
+static void relinquish_resource(struct page_ctor *ctor)
+{
+	if (!(ctor->dev->flags & IFF_UP) &&
+			!(ctor->wq_len + ctor->rq_len))
+		printk(KERN_INFO "relinquish_resource\n");
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	if (info->flags == INFO_READ) {
+		for (i = 0; i < info->pnum; i++) {
+			if (info->pages[i]) {
+				set_page_dirty_lock(info->pages[i]);
+				put_page(info->pages[i]);
+			}
+		}
+		info->skb->destructor = NULL;
+		kfree_skb(info->skb);
+		info->ctor->rq_len--;
+	} else
+		info->ctor->wq_len--;
+	/* Decrement the number of locked pages */
+	info->ctor->lock_pages -= info->pnum;
+	kmem_cache_free(ext_page_info_cache, info);
+	relinquish_resource(info->ctor);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_iovec = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_dtor(iocb);
+	iocb->private = (void *)info;
+	iocb->ki_dtor = mp_ki_dtor;
+
+	return iocb;
+}
+
 static int page_ctor_detach(struct mp_struct *mp)
 {
 	struct page_ctor *ctor;
 	struct page_info *info;
-	struct kiocb *iocb = NULL;
 	int i;
 
 	/* locked by mp_mutex */
@@ -268,11 +356,17 @@ static int page_ctor_detach(struct mp_struct *mp)
 		for (i = 0; i < info->pnum; i++)
 			if (info->pages[i])
 				put_page(info->pages[i]);
+		create_iocb(info, 0);
+		ctor->rq_len--;
 		kmem_cache_free(ext_page_info_cache, info);
 	}
+
+	relinquish_resource(ctor);
+
 	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
 			   ctor->o_rlim.rlim_cur,
 			   ctor->o_rlim.rlim_max);
+
 	netdev_mp_port_detach(ctor->dev);
 	dev_put(ctor->dev);
 
@@ -320,6 +414,161 @@ static void mp_put(struct mp_file *mfile)
 	mp_detach(mfile->mp);
 }
 
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_external_page *ext_page)
+{
+	struct page_info *info;
+	struct
[RFC][PATCH v6 15/19] Add basic funcs and ioctl to mp device.
From: Xin Xiaohui xiaohui@intel.com

The ioctl is used by mp device to bind an underlying NIC, it will
query hardware capability and declare the NIC to use external buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
memory leak fixed, kconfig made, do_unbind() made, mp_chr_ioctl() cleanup
by Jeff Dike jd...@linux.intel.com

 drivers/vhost/mpassthru.c |  681 +
 1 files changed, 681 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..25e2f3e
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,681 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16 offset;
+	u16 size;
+};
+
+struct page_info {
+	struct list_head list;
+	int header;
+	/* indicate the actual length of bytes
+	 * send/recv in the external buffers
+	 */
+	int total;
+	int offset;
+	struct page *pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct frag[MAX_SKB_FRAGS+1];
+	struct sk_buff *skb;
+	struct page_ctor *ctor;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a external allocated skb or kernel
+	 */
+	struct skb_external_page ext_page;
+	struct skb_shared_info ushinfo;
+
+#define INFO_READ  0
+#define INFO_WRITE 1
+	unsigned flags;
+	unsigned pnum;
+
+	/* It's meaningful for receive, means
+	 * the max length allowed
+	 */
+	size_t len;
+
+	/* The fields after that is for backend
+	 * driver, now for vhost-net.
+	 */
+
+	struct kiocb *iocb;
+	unsigned int desc_pos;
+	struct iovec hdr[MAX_SKB_FRAGS + 2];
+	struct iovec iov[MAX_SKB_FRAGS + 2];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_ctor {
+	struct list_head readq;
+	int wq_len;
+	int rq_len;
+	spinlock_t read_lock;
+	/* record the locked pages */
+	int lock_pages;
+	struct rlimit o_rlim;
+	struct net_device *dev;
+	struct mpassthru_port port;
+};
+
+struct mp_struct {
+	struct mp_file *mfile;
+	struct net_device *dev;
+	struct page_ctor *ctor;
+	struct socket socket;
+
+#ifdef MPASSTHRU_DEBUG
+	int debug;
+#endif
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock sk;
+	struct mp_struct *mp;
+};
+
+static int
[RFC][PATCH v6 12/19] Add a hook to intercept external buffers from NIC driver.
From: Xin Xiaohui xiaohui@intel.com

The hook is called in netif_receive_skb().

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c |   35 +++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 37b389a..dc2f225 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2548,6 +2548,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mpassthru_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev))
+		return skb;
+	mp_port = skb->dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2629,6 +2660,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
 		goto out;
-- 
1.5.4.4
[RFC][PATCH v6 10/19] Don't do skb recycle, if device use external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |    6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 38d19d0..37587f0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -553,6 +553,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return 0;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * get external buffer, so don't recycle
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return 0;
+
 	skb_release_head_state(skb);
 	shinfo = skb_shinfo(skb);
 	atomic_set(&shinfo->dataref, 1);
-- 
1.5.4.4
[RFC][PATCH v6 09/19] Ignore room skb_reserve() when device is using external buffer.
From: Xin Xiaohui xiaohui@intel.com

To keep skb->data and skb->head of an external buffer consistent,
we ignore the room reserved by the driver for a kernel skb.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5ff8c27..193b259 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1200,6 +1200,15 @@ static inline int skb_tailroom(const struct sk_buff *skb)
  */
 static inline void skb_reserve(struct sk_buff *skb, int len)
 {
+	/* skb_reserve() is only for an empty buffer. When the skb
+	 * gets an external buffer, we cannot guarantee that the
+	 * external buffer has the same reserved space in the header
+	 * as a kernel allocated skb, so we have to ignore the request.
+	 * We have recorded the external buffer info in the
+	 * destructor_arg field, so use it as the indicator.
+	 */
+	if (skb_shinfo(skb)->destructor_arg)
+		return;
 	skb->data += len;
 	skb->tail += len;
 }
-- 
1.5.4.4
[RFC][PATCH v6 08/19] Make __alloc_skb() to get external buffer.
From: Xin Xiaohui xiaohui@intel.com

Add a dev parameter to __alloc_skb(). skb->data points to the external
buffer; recompute skb->head, maintain the shinfo of the external buffer,
and record the external buffer info in the destructor_arg field.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
__alloc_skb() cleanup by Jeff Dike jd...@linux.intel.com

 include/linux/skbuff.h |    7 ---
 net/core/skbuff.c      |   43 +--
 2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 281a1c0..5ff8c27 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -442,17 +442,18 @@ extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int fclone,
+				   int node, struct net_device *dev);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
-	return __alloc_skb(size, priority, 0, -1);
+	return __alloc_skb(size, priority, 0, -1, NULL);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, 1, -1, NULL);
 }
 
 extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fbdb1f1..38d19d0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -161,7 +161,8 @@ EXPORT_SYMBOL(skb_under_panic);
  *	@fclone: allocate from fclone cache instead of head cache
  *		and allocate a cloned (child) skb
  *	@node: numa node to allocate memory on
- *
+ *	@dev: the device that owns the skb if the skb tries to get an
+ *		external buffer, otherwise NULL.
  *	Allocate a new &sk_buff. The returned buffer has no headroom and a
  *	tail room of size bytes. The object has a reference count of one.
  *	The return is the buffer. On a failure the return is %NULL.
@@ -170,12 +171,13 @@ EXPORT_SYMBOL(skb_under_panic);
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int fclone, int node, struct net_device *dev)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
-	u8 *data;
+	u8 *data = NULL;
+	struct skb_external_page *ext_page = NULL;
 
 	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
 
@@ -185,8 +187,23 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		goto out;
 
 	size = SKB_DATA_ALIGN(size);
-	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-			gfp_mask, node);
+
+	/* If the device wants to do mediate passthru(zero-copy),
+	 * the skb may try to get external buffers from outside.
+	 * If that fails, fall back to allocating buffers from the kernel.
+	 */
+	if (dev && dev->mp_port) {
+		ext_page = netdev_alloc_external_page(dev, skb, size);
+		if (ext_page) {
+			data = ext_page->start;
+			size = ext_page->size;
+		}
+	}
+
+	if (!data)
+		data = kmalloc_node_track_caller(
+				size + sizeof(struct skb_shared_info),
+				gfp_mask, node);
 	if (!data)
 		goto nodata;
 
@@ -208,6 +225,15 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	skb->mac_header = ~0U;
 #endif
 
+	/* If the skb got external buffers successfully, since the shinfo is
+	 * at the end of the buffer, we may retain the shinfo once we
+	 * need it sometime.
+	 */
+	if (ext_page) {
+		skb->head = skb->data - NET_IP_ALIGN - NET_SKB_PAD;
+		memcpy(ext_page->ushinfo, skb_shinfo(skb),
+		       sizeof(struct skb_shared_info));
+	}
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	atomic_set(&shinfo->dataref, 1);
@@ -231,6 +257,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	/* Record the external buffer info in this field. It's not so good,
+	 * but we cannot find another place easily.
+	 */
+	shinfo->destructor_arg = ext_page;
+
 out:
 	return skb;
 nodata:
@@ -259,7 +290,7 @@ struct sk_buff
[RFC][PATCH v6 05/19] Add a function make external buffer owner to query capability.
From: Xin Xiaohui xiaohui@intel.com

The external buffer owner can use the functions to get
the capability of the underlying NIC driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   51 +
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 183c786..31d9c4a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_attach(struct net_device *dev,
 				 struct mpassthru_port *port);
 extern void netdev_mp_port_detach(struct net_device *dev);
+extern int netdev_mp_port_prep(struct net_device *dev,
+			       struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index ecbb6b1..37b389a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2497,6 +2497,57 @@ void netdev_mp_port_detach(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_mp_port_detach);
 
+/* To support mediate passthru(zero-copy) with NIC driver,
+ * we'd better query NIC driver for the capability it can
+ * provide, especially for packet split mode, now we only
+ * query for the header size, and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value, currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+			struct mpassthru_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	/* needed by packet split */
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.5.4.4
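The range check in netdev_mp_port_prep() is easy to misread, so here is a small userspace model of it. This is a minimal sketch in Python, not kernel code: the helper name mp_port_params_valid is hypothetical, and PAGE_SIZE = 4096 and MAX_SKB_FRAGS = 18 are assumptions that depend on the kernel configuration. The comparisons follow my reading of the patch.

```python
PAGE_SIZE = 4096     # assumption: 4 KiB pages
MAX_SKB_FRAGS = 18   # assumption: 65536 / PAGE_SIZE + 2 with 4 KiB pages

# Fallback values the patch uses when the driver has no ndo_mp_port_prep.
DEFAULT_PORT = {"hdr_len": 128, "data_len": 2048, "npages": 1}


def mp_port_params_valid(hdr_len, data_len, npages):
    """Model of the checks; True where the kernel code would return 0."""
    if hdr_len <= 0:
        return False
    if npages <= 0 or npages > MAX_SKB_FRAGS:
        return False
    # data_len must lie between (npages - 1) and npages pages, inclusive.
    if data_len < PAGE_SIZE * (npages - 1) or data_len > PAGE_SIZE * npages:
        return False
    return True


if __name__ == "__main__":
    print(mp_port_params_valid(**DEFAULT_PORT))
    print(mp_port_params_valid(hdr_len=128, data_len=8192, npages=1))
```

Under these assumptions the defaults pass, while a data_len of 8192 with npages = 1 is rejected because the payload would not fit in the declared number of pages.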
[RFC][PATCH v6 03/19] Export 2 func for device to assign/deassign new structure
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    3 +++
 net/core/dev.c            |   28
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bae725c..efb575a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1592,6 +1592,9 @@ extern gro_result_t napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_attach(struct net_device *dev,
+				 struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index f769098..ecbb6b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+/* Export two functions to assign/de-assign mp_port pointer
+ * to a net device.
+ */
+
+int netdev_mp_port_attach(struct net_device *dev,
+			  struct mpassthru_port *port)
+{
+	/* locked by mp_mutex */
+	if (rcu_dereference(dev->mp_port))
+		return -EBUSY;
+
+	rcu_assign_pointer(dev->mp_port, port);
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_mp_port_attach);
+
+void netdev_mp_port_detach(struct net_device *dev)
+{
+	/* locked by mp_mutex */
+	if (!rcu_dereference(dev->mp_port))
+		return;
+
+	rcu_assign_pointer(dev->mp_port, NULL);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(netdev_mp_port_detach);
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.5.4.4
[RFC][PATCH v6 02/19] Add a new struct for device to manipulate external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |   19 ++-
 1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..bae725c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,22 @@ struct netdev_queue {
 	unsigned long tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/* Add a structure in structure net_device, the new field is
+ * named as mp_port. It's for mediate passthru (zero-copy).
+ * It contains the capability for the net device driver,
+ * a socket, and an external buffer creator, external means
+ * skb buffer belongs to the device may not be allocated from
+ * kernel space.
+ */
+struct mpassthru_port {
+	int hdr_len;
+	int data_len;
+	int npages;
+	unsigned flags;
+	struct socket *sock;
+	struct skb_external_page *(*ctor)(struct mpassthru_port *,
+					  struct sk_buff *, int);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +968,8 @@ struct net_device {
 	struct macvlan_port *macvlan_port;
 	/* GARP */
 	struct garp_port *garp_port;
-
+	/* mpassthru */
+	struct mpassthru_port *mp_port;
 	/* class/net/name entry */
 	struct device dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
-- 
1.5.4.4
[RFC][PATCH v6 00/19] Provide a zero-copy method on KVM virtio-net.
We provide a zero-copy method by which the driver side may get external
buffers to DMA into. Here external means the driver doesn't use kernel
space to allocate skb buffers. Currently the external buffer can come
from the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then let the host
NIC driver DMA to it directly. The patches are based on the vhost-net
backend driver. We add a device which provides proto_ops such as
sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC
driver. A KVM guest that uses the vhost-net backend may bind any ethX
interface on the host side to get copyless data transfer through the
guest virtio-net frontend.

patch 01-13: net core changes.
patch 14-18: new device as interface to manipulate external buffers.
patch 19:    for vhost-net.

The guest virtio-net driver submits multiple requests through the
vhost-net backend driver to the kernel. The requests are queued and
then completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx
when a page constructor API is invoked. This means NICs can allocate
user buffers from a page constructor. We add a hook in the
netif_receive_skb() function to intercept the incoming packets and
notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We have considered two ways to utilize the page constructor API to
dispense the user buffers.

One: Modify the __alloc_skb() function a bit so that it only allocates
the sk_buff structure, and the data pointer points to a user buffer
which comes from a page constructor API. Then the shinfo of the skb is
also from the guest. When a packet is received from hardware,
skb->data is filled directly by h/w. This is the way we have
implemented.

Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate skbs in almost the
      same way as the host NIC drivers, say the size of
      netdev_alloc_skb() and the same reserved space in the head of
      the skb. Many NIC drivers match the guest and are ok with this.
      But some of the latest NIC drivers reserve special room in the
      skb head. To deal with it, we suggest providing a method in the
      guest virtio-net driver to ask the NIC driver for the parameters
      we are interested in once we know which device we have bound to
      do zero-copy. Then we ask the guest to do so.

Two: Modify the driver to get user buffers allocated from a page
constructor API (to substitute alloc_page()). The user buffers are
used as payload buffers and filled by h/w directly when a packet is
received. The driver should associate the pages with the skb
(skb_shinfo(skb)->frags). For the head buffer side, let the host
allocate the skb and h/w fill it. After that, the data filled in the
host skb header is copied into the guest header buffer, which is
submitted together with the payload buffer.

Pros: We need to care less about how the guest or host allocates its
      buffers.
Cons: We still need a small copy here for the skb header.

We are not sure which way is better. This is the first thing we want
to get comments on from the community. We wish the modification to the
network part to be generic, so that it is not used by the vhost-net
backend only, but a user application may use it as well once the
zero-copy device provides async read/write operations later.

We have got comments from Michael. He said the first method will break
the compatibility of the virtio-net driver and may complicate qemu
live migration. Currently, we ignore skb_reserve() if the device is
doing zero-copy, so the guest virtio-net driver will not change. So we
now continue with the first way. But comments about the two ways are
still appreciated.

We provide multiple submits and asynchronous notification to vhost-net
too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact
performance data will be provided later. But in a simple test with
netperf, we found that bandwidth goes up and CPU % goes up too, but
the bandwidth increases by a much larger ratio than CPU % does.

What we have not done yet:
	packet split support
	To support GRO
	Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it look generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONFIG_VHOST_MPASSTHRU to limit the usage when module is not load
	a small fix for missing dev_put when fail
	using
[RFC 0/2] Tracing
Trace events in QEMU/KVM can be very useful for debugging and
performance analysis. I'd like to discuss tracing support and hope
others have an interest in this feature, too.

Following this email are patches I am using to debug virtio-blk and
storage. The patches provide trivial tracing support, but they don't
address the details of real tracing tools: enabling/disabling events
at runtime, no overhead for disabled events, multithreading support,
etc.

It would be nice to have userland tracing facilities that work
out-of-the-box on production systems. Unfortunately, I'm not aware of
any such facilities out there right now on Linux. Perhaps SystemTap
userspace tracing is the way to go, has anyone tried it with KVM?

For the medium term, without userspace tracing facilities in the OS we
could put something into QEMU to address the need for tracing. Here
are my thoughts on fleshing out the tracing patch I have posted:

1. Make it possible to enable/disable events at runtime. Users enable
   only the events they are interested in and aren't flooded with
   trace data for all other events.

2. Either make trace events cheap or build without trace events by
   default. Disable by default still allows tracing to be used for
   development but less for production.

3. Allow events in any execution context (cpu, io, aio emulation
   threads). The current code does not support concurrency and is
   meant for when the iothread mutex is held.

4. Make it easy to add new events. Instead of keeping trace.h and
   trace.py in sync manually, use something like .hx to produce the
   appropriate C and Python.

Summary: Tracing is useful, are there external tools we can use right
now? If not, should we put in something that works well enough until
external tools catch up?

Stefan
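Point 4 could be approached by generating both sides from a single event list. What follows is a minimal sketch in Python, not part of the posted patches: the one-line-per-event input format `name(arg1, arg2)` and the helper names parse_events, gen_c and gen_py are made up for illustration. The generated C only assumes the trace1() to trace5() helpers described in the patches that follow this email.

```python
import re

EVENT_RE = re.compile(r"^(\w+)\((.*)\)$")


def parse_events(text):
    """Parse lines such as 'multiwrite_cb(mcb, ret)' into (name, [args])."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = EVENT_RE.match(line)
        if m is None:
            raise ValueError("bad event definition: %r" % line)
        args = [a.strip() for a in m.group(2).split(",") if a.strip()]
        if not 1 <= len(args) <= 5:
            raise ValueError("events take 1 to 5 arguments: %r" % line)
        events.append((m.group(1), args))
    return events


def gen_c(events):
    """Emit the TraceEvent enum plus one inline wrapper per event."""
    out = ["typedef enum {"]
    out += ["    TRACE_%s," % name.upper() for name, _ in events]
    out += ["} TraceEvent;", ""]
    for name, args in events:
        params = ", ".join("unsigned long %s" % a for a in args)
        out.append("static inline void trace_%s(%s) {" % (name, params))
        out.append("    trace%d(TRACE_%s, %s);"
                   % (len(args), name.upper(), ", ".join(args)))
        out.append("}")
    return "\n".join(out)


def gen_py(events):
    """Build the table a formatter script indexes by event number."""
    return {i: tuple([name] + args) for i, (name, args) in enumerate(events)}


if __name__ == "__main__":
    evs = parse_events("multiwrite_cb(mcb, ret)\npaio_submit(acb)\n")
    print(gen_c(evs))
    print(gen_py(evs))
```

Because the enum values and the Python table are both derived from list order, the two cannot drift apart the way hand-edited files can. The sketch types every wrapper argument as unsigned long for brevity; real wrappers would likely keep the typed parameters and cast.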
[PATCH 1/2] trace: Add simple tracing support
Trace events should be defined in trace.h. Events are written to
/tmp/trace.log and can be formatted using trace.py. Remember to add
events to trace.py for pretty-printing.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 Makefile.objs |    2 +-
 trace.c       |   64 +
 trace.h       |    9
 trace.py      |   30 ++
 4 files changed, 104 insertions(+), 1 deletions(-)
 create mode 100644 trace.c
 create mode 100644 trace.h
 create mode 100755 trace.py

diff --git a/Makefile.objs b/Makefile.objs
index acbaf22..307e989 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
 # block-obj-y is code used by both qemu system emulation and qemu-img
 
 block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
-block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
+block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
 block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
diff --git a/trace.c b/trace.c
new file mode 100644
index 0000000..2fec4d3
--- /dev/null
+++ b/trace.c
@@ -0,0 +1,64 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include "trace.h"
+
+typedef struct {
+    unsigned long event;
+    unsigned long x1;
+    unsigned long x2;
+    unsigned long x3;
+    unsigned long x4;
+    unsigned long x5;
+} TraceRecord;
+
+enum {
+    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
+};
+
+static TraceRecord trace_buf[TRACE_BUF_LEN];
+static unsigned int trace_idx;
+static FILE *trace_fp;
+
+static void trace(TraceEvent event, unsigned long x1,
+                  unsigned long x2, unsigned long x3,
+                  unsigned long x4, unsigned long x5) {
+    TraceRecord *rec = &trace_buf[trace_idx];
+    rec->event = event;
+    rec->x1 = x1;
+    rec->x2 = x2;
+    rec->x3 = x3;
+    rec->x4 = x4;
+    rec->x5 = x5;
+
+    if (++trace_idx == TRACE_BUF_LEN) {
+        trace_idx = 0;
+
+        if (!trace_fp) {
+            trace_fp = fopen("/tmp/trace.log", "w");
+        }
+        if (trace_fp) {
+            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
+            result = result;
+        }
+    }
+}
+
+void trace1(TraceEvent event, unsigned long x1) {
+    trace(event, x1, 0, 0, 0, 0);
+}
+
+void trace2(TraceEvent event, unsigned long x1, unsigned long x2) {
+    trace(event, x1, x2, 0, 0, 0);
+}
+
+void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3) {
+    trace(event, x1, x2, x3, 0, 0);
+}
+
+void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4) {
+    trace(event, x1, x2, x3, x4, 0);
+}
+
+void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5) {
+    trace(event, x1, x2, x3, x4, x5);
+}
diff --git a/trace.h b/trace.h
new file mode 100644
index 0000000..144aa1e
--- /dev/null
+++ b/trace.h
@@ -0,0 +1,9 @@
+typedef enum {
+    TRACE_MAX
+} TraceEvent;
+
+void trace1(TraceEvent event, unsigned long x1);
+void trace2(TraceEvent event, unsigned long x1, unsigned long x2);
+void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3);
+void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4);
+void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5);
diff --git a/trace.py b/trace.py
new file mode 100755
index 0000000..f38ab6b
--- /dev/null
+++ b/trace.py
@@ -0,0 +1,30 @@
+#!/usr/bin/env python
+import sys
+import struct
+
+trace_fmt = 'LLLLLL'
+trace_len = struct.calcsize(trace_fmt)
+
+events = {
+}
+
+def read_record(fobj):
+    s = fobj.read(trace_len)
+    if len(s) != trace_len:
+        return None
+    return struct.unpack(trace_fmt, s)
+
+def format_record(rec):
+    event = events[rec[0]]
+    fields = [event[0]]
+    for i in xrange(1, len(event)):
+        fields.append('%s=0x%x' % (event[i], rec[i]))
+    return ' '.join(fields)
+
+f = open(sys.argv[1], 'rb')
+while True:
+    rec = read_record(f)
+    if rec is None:
+        break
+
+    print format_record(rec)
-- 
1.7.1
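To make the record layout concrete, here is a minimal Python 3 sketch of decoding the binary log. It is not the posted trace.py: it assumes six native unsigned longs per record (mirroring TraceRecord in trace.c), and the EVENTS entry and the decode helper are hypothetical, since the posted script ships with an empty events table.

```python
import io
import struct

# Assumption: six native unsigned longs per record (event, x1..x5), as in
# TraceRecord. The reader must run on the same ABI as the writer.
RECORD = struct.Struct("LLLLLL")

# Hypothetical table: event number -> (name, argument names...)
EVENTS = {
    0: ("multiwrite_cb", "mcb", "ret"),
}


def decode(fobj, events=EVENTS):
    """Yield one formatted line per complete record read from fobj."""
    while True:
        buf = fobj.read(RECORD.size)
        if len(buf) != RECORD.size:
            return  # end of file, or a truncated trailing record
        rec = RECORD.unpack(buf)
        event = events.get(rec[0])
        if event is None:
            yield "unknown event %d" % rec[0]
            continue
        fields = [event[0]]
        fields += ["%s=0x%x" % (name, rec[i])
                   for i, name in enumerate(event[1:], 1)]
        yield " ".join(fields)


if __name__ == "__main__":
    sample = RECORD.pack(0, 0x1000, 5, 0, 0, 0)
    for line in decode(io.BytesIO(sample)):
        print(line)
```

Note that trace() in the patch only calls fwrite() when the in-memory buffer wraps, so records still sitting in the buffer when QEMU exits never reach the log file; a decoder like this sees whole 64 KiB batches only.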
[PATCH 2/2] trace: Trace write requests in virtio-blk, multiwrite, and paio_submit
Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 block.c            |    7 +++
 hw/virtio-blk.c    |    6 ++
 posix-aio-compat.c |    2 ++
 trace.h            |   42 +-
 trace.py           |    8
 5 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/block.c b/block.c
index bfe46e3..a7fb040 100644
--- a/block.c
+++ b/block.c
@@ -27,6 +27,7 @@
 #include "block_int.h"
 #include "module.h"
 #include "qemu-objects.h"
+#include "trace.h"
 
 #ifdef CONFIG_BSD
 #include <sys/types.h>
@@ -1913,6 +1914,8 @@ static void multiwrite_cb(void *opaque, int ret)
 {
     MultiwriteCB *mcb = opaque;
 
+    trace_multiwrite_cb(mcb, ret);
+
     if (ret < 0 && !mcb->error) {
         mcb->error = ret;
         multiwrite_user_cb(mcb);
@@ -2044,6 +2047,8 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
     // Check for mergable requests
     num_reqs = multiwrite_merge(bs, reqs, num_reqs, mcb);
 
+    trace_bdrv_aio_multiwrite(mcb, mcb->num_callbacks, num_reqs);
+
     // Run the aio requests
     for (i = 0; i < num_reqs; i++) {
         acb = bdrv_aio_writev(bs, reqs[i].sector, reqs[i].qiov,
@@ -2054,9 +2059,11 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
             // submitted yet. Otherwise we'll wait for the submitted AIOs to
             // complete and report the error in the callback.
             if (mcb->num_requests == 0) {
+                trace_bdrv_aio_multiwrite_earlyfail(mcb);
                 reqs[i].error = -EIO;
                 goto fail;
             } else {
+                trace_bdrv_aio_multiwrite_latefail(mcb, i);
                 mcb->num_requests++;
                 multiwrite_cb(mcb, -EIO);
                 break;
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b05d15e..73b873e 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -13,6 +13,7 @@
 
 #include <qemu-common.h>
 #include <sysemu.h>
+#include "trace.h"
 #include "virtio-blk.h"
 #include "block_int.h"
 #ifdef __linux__
@@ -50,6 +51,8 @@ static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
 {
     VirtIOBlock *s = req->dev;
 
+    trace_virtio_blk_req_complete(req, status);
+
     req->in->status = status;
     virtqueue_push(s->vq, &req->elem, req->qiov.size + sizeof(*req->in));
     virtio_notify(&s->vdev, s->vq);
@@ -87,6 +90,8 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
 {
     VirtIOBlockReq *req = opaque;
 
+    trace_virtio_blk_rw_complete(req, ret);
+
     if (ret) {
         int is_read = !(req->out->type & VIRTIO_BLK_T_OUT);
         if (virtio_blk_handle_rw_error(req, -ret, is_read))
@@ -270,6 +275,7 @@ static void virtio_blk_handle_write(BlockRequest *blkreq, int *num_writes,
     blkreq[*num_writes].cb = virtio_blk_rw_complete;
     blkreq[*num_writes].opaque = req;
     blkreq[*num_writes].error = 0;
+    trace_virtio_blk_handle_write(req, req->out->sector, req->qiov.size / 512);
 
     (*num_writes)++;
 }
diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index b43c531..57d83f0 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -23,6 +23,7 @@
 #include <stdio.h>
 
 #include "qemu-queue.h"
+#include "trace.h"
 #include "osdep.h"
 #include "qemu-common.h"
 #include "block_int.h"
@@ -583,6 +584,7 @@ BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
     acb->next = posix_aio_state->first_aio;
     posix_aio_state->first_aio = acb;
 
+    trace_paio_submit(acb, opaque, sector_num, nb_sectors, type);
     qemu_paio_submit(acb);
     return &acb->common;
 }
diff --git a/trace.h b/trace.h
index 144aa1e..3c4564f 100644
--- a/trace.h
+++ b/trace.h
@@ -1,5 +1,12 @@
 typedef enum {
-    TRACE_MAX
+    TRACE_MULTIWRITE_CB,
+    TRACE_BDRV_AIO_MULTIWRITE,
+    TRACE_BDRV_AIO_MULTIWRITE_EARLYFAIL,
+    TRACE_BDRV_AIO_MULTIWRITE_LATEFAIL,
+    TRACE_VIRTIO_BLK_REQ_COMPLETE,
+    TRACE_VIRTIO_BLK_RW_COMPLETE,
+    TRACE_VIRTIO_BLK_HANDLE_WRITE,
+    TRACE_PAIO_SUBMIT,
 } TraceEvent;
 
 void trace1(TraceEvent event, unsigned long x1);
@@ -7,3 +14,36 @@ void trace2(TraceEvent event, unsigned long x1, unsigned long x2);
 void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3);
 void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4);
 void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5);
+
+static inline void trace_multiwrite_cb(void *mcb, int ret) {
+    trace2(TRACE_MULTIWRITE_CB, (unsigned long)mcb, ret);
+}
+
+static inline void trace_bdrv_aio_multiwrite(void *mcb, int num_callbacks, int num_reqs) {
+    trace3(TRACE_BDRV_AIO_MULTIWRITE, (unsigned long)mcb, num_callbacks, num_reqs);
+}
+
+static inline void trace_bdrv_aio_multiwrite_earlyfail(void *mcb) {
+    trace1(TRACE_BDRV_AIO_MULTIWRITE_EARLYFAIL, (unsigned long)mcb);
+}
+
+static inline void trace_bdrv_aio_multiwrite_latefail(void *mcb, int i) {
+
Re: repeatable hang with loop mount and heavy IO in guest (now in host - not KVM then..)
On 02/27/2010 12:38 AM, Antoine Martin wrote:
> 1 0 0 98 0 1| 0 0 | 66B 354B| 0 0 | 3011
> 1 1 0 98 0 0| 0 0 | 66B 354B| 0 0 | 2911
>
> From that point onwards, nothing will happen. The host has disk IO to
> spare... So what is it waiting for??
>
> Moved to an AMD64 host. No effect.
> Disabled swap before running the test. No effect.
> Moved the guest to a fully up-to-date FC12 server
> (2.6.31.6-145.fc12.x86_64), no effect.
>
> I have narrowed it down to the guest's filesystem used for backing the
> disk image which is loop mounted: although it was not completely full
> (and had enough inodes), freeing some space on it prevents the system
> from misbehaving.
>
> FYI: the disk image was clean and was fscked before each test. kvm had
> been updated to 0.12.3
>
> The weird thing is that the same filesystem works fine (no system hang)
> if used directly from the host, it is only misbehaving via kvm...
> So I am not dismissing the possibility that kvm may be at least partly
> to blame, or that it is exposing a filesystem bug (race?) not normally
> encountered.
> (I have backed up the full 32GB virtual disk in case someone suggests
> further investigation)

Well, well. I've just hit the exact same bug on another *host* (not a guest), running stock Fedora 12. So this isn't a kvm bug after all. Definitely a loop+ext(4?) bug. Looks like you need a pretty big loop mounted partition to trigger it. (bigger than available ram?)

This is what triggered it on a quad amd system with 8Gb of ram, software raid-1 partition:

    mount -o loop 2GB.dd source
    dd if=/dev/zero of=8GB.dd bs=1048576 count=8192
    mkfs.ext4 -f 8GB.dd
    mount -o loop 8GB.dd dest
    rsync -rplogtD source/* dest/
    umount source
    umount dest
    ^ this is where it hangs

I then tried to issue a 'sync' from another terminal, which also hung. It took more than 10 minutes to settle itself, during that time one CPU was stuck in wait state. dstat reported almost no IO at the time (1MB/s).

I assume dstat reports page write back like any other disk IO?
That raid partition does ~60MB/s, so writing back 8GB shouldn't take 10 minutes. (that's even assuming it would have to write back the whole 8GB at umount time - which should not be the case)

Cheers
Antoine

Here's the hung trace:

INFO: task umount:526 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umountD 0002 0 526 32488 0x
 880140f9fc88 0086 880008e3c228 810d5fd9
 880140f9fc28 880140f9fcd8 880140f9ffd8 880140f9ffd8
 88021b5e03d8 f980 00015740 88021b5e03d8
Call Trace:
 [810d5fd9] ? sync_page+0x0/0x4a
 [81046fbd] ? __enqueue_entity+0x7b/0x7d
 [8113a047] ? bdi_sched_wait+0x0/0x12
 [8113a055] bdi_sched_wait+0xe/0x12
 [814549f0] __wait_on_bit+0x48/0x7b
 [8102649f] ? native_smp_send_reschedule+0x5c/0x5e
 [81454a91] out_of_line_wait_on_bit+0x6e/0x79
 [8113a047] ? bdi_sched_wait+0x0/0x12
 [810748dc] ? wake_bit_function+0x0/0x33
 [8113ad0b] wait_on_bit.clone.1+0x1e/0x20
 [8113ad71] bdi_sync_writeback+0x64/0x6b
 [8113ad9a] sync_inodes_sb+0x22/0xec
 [8113e547] __sync_filesystem+0x4e/0x77
 [8113e71d] sync_filesystem+0x4b/0x4f
 [8111d6d9] generic_shutdown_super+0x27/0xc9
 [8111d7a2] kill_block_super+0x27/0x3f
 [8111ded7] deactivate_super+0x56/0x6b
 [81134262] mntput_no_expire+0xb4/0xec
 [8113482a] sys_umount+0x2d5/0x304
 [81458133] ? do_page_fault+0x270/0x2a0
 [81011d32] system_call_fastpath+0x16/0x1b
INFO: task umount:526 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umountD 0002 0 526 32488 0x
 880140f9fc88 0086 880008e3c228 810d5fd9
 880140f9fc28 880140f9fcd8 880140f9ffd8 880140f9ffd8
 88021b5e03d8 f980 00015740 88021b5e03d8
Call Trace:
 [810d5fd9] ? sync_page+0x0/0x4a
 [81046fbd] ? __enqueue_entity+0x7b/0x7d
 [8113a047] ? bdi_sched_wait+0x0/0x12
 [8113a055] bdi_sched_wait+0xe/0x12
 [814549f0] __wait_on_bit+0x48/0x7b
 [8102649f] ? native_smp_send_reschedule+0x5c/0x5e
 [81454a91] out_of_line_wait_on_bit+0x6e/0x79
 [8113a047] ? bdi_sched_wait+0x0/0x12
 [810748dc] ? wake_bit_function+0x0/0x33
 [8113ad0b] wait_on_bit.clone.1+0x1e/0x20
 [8113ad71] bdi_sync_writeback+0x64/0x6b
 [8113ad9a] sync_inodes_sb+0x22/0xec
 [8113e547] __sync_filesystem+0x4e/0x77
 [8113e71d] sync_filesystem+0x4b/0x4f
 [8111d6d9] generic_shutdown_super+0x27/0xc9
 [8111d7a2] kill_block_super+0x27/0x3f
 [8111ded7] deactivate_super+0x56/0x6b
 [81134262] mntput_no_expire+0xb4/0xec
 [8113482a]
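As a rough check of the write-back estimate in this message, the figures can be worked through as below. This is only a back-of-the-envelope sketch: it assumes decimal megabytes for the "~60MB/s" and "1MB/s" figures, and the variable names are illustrative.

```python
# Size of the image created with: dd bs=1048576 count=8192
image_bytes = 8192 * 1048576

# Throughput figures quoted in the message, taken as decimal MB/s (assumption).
raid_rate = 60 * 10**6        # what the raid partition normally sustains
observed_rate = 1 * 10**6     # what dstat showed during the hang

# Time to write back the whole image at each rate, in seconds.
full_writeback_s = image_bytes / raid_rate
at_observed_rate_s = image_bytes / observed_rate

print("at ~60MB/s: %.0f s" % full_writeback_s)
print("at ~1MB/s:  %.0f min" % (at_observed_rate_s / 60))
```

Even in the worst case of writing back the entire 8GB, the normal disk rate would need a bit over two minutes, which supports the point that a ten-minute stall is not explained by raw disk throughput.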
Re: [PATCH 1/2] trace: Add simple tracing support
I should have used the [RFC] tag to make it clear that I'm not proposing these patches for merge, sorry.

Stefan
Re: [PATCH 1/2] trace: Add simple tracing support
Stefan Hajnoczi wrote:
> Trace events should be defined in trace.h. Events are written to
> /tmp/trace.log and can be formatted using trace.py. Remember to add
> events to trace.py for pretty-printing.

When already writing to a file, why not reuse QEMU's logging infrastructure (log foo / -d foo)? It shouldn't make a huge performance difference if the data is saved in clear-text.

Also, having support for ftrace's user space markers would be a very nice option (only an option as it's Linux-specific), see http://lwn.net/Articles/366796. This allows correlating kernel events (KVM as well as others) with what goes on in QEMU. It simply enables integration with the whole kernel tracing infrastructure, e.g. KernelShark (http://people.redhat.com/srostedt/kernelshark/HTML).

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC 0/2] Tracing
Hi Stefan,

Nice to see the patchset. I am working on something similar, on the lines of static trace events for QEMU, that collect traces in a qemu-internal buffer. This would employ monitor commands to read traces, as well as enable/disable trace events at runtime. I plan to post a prototype early next week.

On 05/21/2010 03:12 PM, Stefan Hajnoczi wrote:
> Trace events in QEMU/KVM can be very useful for debugging and
> performance analysis. I'd like to discuss tracing support and hope
> others have an interest in this feature, too.
>
> Following this email are patches I am using to debug virtio-blk and
> storage. The patches provide trivial tracing support, but they don't
> address the details of real tracing tools: enabling/disabling events at
> runtime, no overhead for disabled events, multithreading support, etc.
>
> It would be nice to have userland tracing facilities that work
> out-of-the-box on production systems. Unfortunately, I'm not aware of
> any such facilities out there right now on Linux. Perhaps SystemTap
> userspace tracing is the way to go, has anyone tried it with KVM?
>
> For the medium term, without userspace tracing facilities in the OS we
> could put something into QEMU to address the need for tracing. Here are
> my thoughts on fleshing out the tracing patch I have posted:
>
> 1. Make it possible to enable/disable events at runtime. Users enable
>    only the events they are interested in and aren't flooded with trace
>    data for all other events.

Agree, my upcoming patchset should address this.

> 2. Either make trace events cheap or build without trace events by
>    default. Disable by default still allows tracing to be used for
>    development but less for production.

I'm trying to do this too, though quite a lot remains to be improved in my current implementation :-)

> 3. Allow events in any execution context (cpu, io, aio emulation
>    threads).

Agree.

> 4. Make it easy to add new events.

Agree ! I'm trying to provide a unified macro interface like trace events which makes it easy enough to add new events.

Regards,
--
Prerna Saxena
Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India
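The four points discussed in this message (runtime enable/disable, cheap disabled events, events from any thread, easy event definition) can be illustrated with a small sketch. This is not the interface of either patch set: `TraceRegistry`, `define`, `enable`, `emit` and `dump` are hypothetical names chosen for the example, and Python is used only to show the shape of the idea.

```python
import threading
from collections import deque


class TraceRegistry:
    """Hypothetical registry of static trace events (illustration only)."""

    def __init__(self, capacity=1024):
        self._enabled = {}                   # event name -> enabled flag
        self._buf = deque(maxlen=capacity)   # in-memory ring of records
        self._lock = threading.Lock()        # emit() may run in any thread

    def define(self, name):
        """Point 4: adding an event is one call; events start disabled."""
        self._enabled[name] = False

    def enable(self, name, on=True):
        """Point 1: toggle an event at runtime (e.g. from a monitor command)."""
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = on

    def emit(self, name, *args):
        """Point 2: a disabled event costs a single dictionary lookup."""
        if not self._enabled.get(name, False):
            return
        with self._lock:                     # point 3: safe across threads
            self._buf.append((name, args))

    def dump(self):
        """Return a snapshot of the buffered records."""
        with self._lock:
            return list(self._buf)


registry = TraceRegistry()
registry.define("virtio_blk_rw_complete")
registry.define("paio_submit")
registry.enable("paio_submit")
registry.emit("virtio_blk_rw_complete", 0x1000, 0)   # dropped: disabled
registry.emit("paio_submit", 0x2000, 8)              # recorded
```

Only the enabled event ends up in the buffer, so a user who turns on a single event is not flooded with records from the others.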
Re: [PATCH] add support for protocol driver create_options
On 20.05.2010 07:36, MORITA Kazutaka wrote:
> This patch enables protocol drivers to use their create options which
> are not supported by the format. For example, protocol drivers can use
> a backing_file option with raw format.
>
> Signed-off-by: MORITA Kazutaka <morita.kazut...@lab.ntt.co.jp>

Hm, this is not stackable, right? Though I do see that making it stackable would require some bigger changes, so maybe we can get away with claiming that this approach covers everything that happens in practice.

If we accept that this is the desired behaviour, the code looks good to me.

Kevin
Re: network problem with Solaris 10u8 guest
On 05/12/10 12:41, Harald Dunkel wrote:
> Hi folks,
>
> I am trying to run Solaris 10u8 as a guest in kvm (kernel 2.6.33.2).
> Problem: The virtual network devices don't work with this Solaris
> version.

Short update: VirtualBox 3.1.6 seems to be more reliable in this case.

Regards

Harri
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:
> Trace events should be defined in trace.h. Events are written to
> /tmp/trace.log and can be formatted using trace.py. Remember to add
> events to trace.py for pretty-printing.
>
> Signed-off-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
> ---
>  Makefile.objs |    2 +-
>  trace.c       |   64 +
>  trace.h       |    9
>  trace.py      |   30 ++
>  4 files changed, 104 insertions(+), 1 deletions(-)
>  create mode 100644 trace.c
>  create mode 100644 trace.h
>  create mode 100755 trace.py
>
> diff --git a/Makefile.objs b/Makefile.objs
> index acbaf22..307e989 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
>  # block-obj-y is code used by both qemu system emulation and qemu-img
>
>  block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
> -block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
> +block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
>  block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>
> diff --git a/trace.c b/trace.c
> new file mode 100644
> index 000..2fec4d3
> --- /dev/null
> +++ b/trace.c
> @@ -0,0 +1,64 @@
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include "trace.h"
> +
> +typedef struct {
> +    unsigned long event;
> +    unsigned long x1;
> +    unsigned long x2;
> +    unsigned long x3;
> +    unsigned long x4;
> +    unsigned long x5;
> +} TraceRecord;
> +
> +enum {
> +    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
> +};
> +
> +static TraceRecord trace_buf[TRACE_BUF_LEN];
> +static unsigned int trace_idx;
> +static FILE *trace_fp;
> +
> +static void trace(TraceEvent event, unsigned long x1,
> +                  unsigned long x2, unsigned long x3,
> +                  unsigned long x4, unsigned long x5) {
> +    TraceRecord *rec = &trace_buf[trace_idx];
> +    rec->event = event;
> +    rec->x1 = x1;
> +    rec->x2 = x2;
> +    rec->x3 = x3;
> +    rec->x4 = x4;
> +    rec->x5 = x5;
> +
> +    if (++trace_idx == TRACE_BUF_LEN) {
> +        trace_idx = 0;
> +
> +        if (!trace_fp) {
> +            trace_fp = fopen("/tmp/trace.log", "w");
> +        }
> +        if (trace_fp) {
> +            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
> +            result = result;
> +        }
> +    }
> +}

It is probably worth while to read trace points via the monitor or through some other mechanism. My concern would be that writing even 64k out to disk would introduce enough performance overhead mainly because it runs lock-step with the guest's VCPU.

Maybe it's worth adding a thread that syncs the ring to disk if we want to write to disk?

> +
> +void trace1(TraceEvent event, unsigned long x1) {
> +    trace(event, x1, 0, 0, 0, 0);
> +}
> +
> +void trace2(TraceEvent event, unsigned long x1, unsigned long x2) {
> +    trace(event, x1, x2, 0, 0, 0);
> +}
> +
> +void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3) {
> +    trace(event, x1, x2, x3, 0, 0);
> +}
> +
> +void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4) {
> +    trace(event, x1, x2, x3, x4, 0);
> +}
> +
> +void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5) {
> +    trace(event, x1, x2, x3, x4, x5);
> +}
> diff --git a/trace.h b/trace.h
> new file mode 100644
> index 000..144aa1e
> --- /dev/null
> +++ b/trace.h
> @@ -0,0 +1,9 @@
> +typedef enum {
> +    TRACE_MAX
> +} TraceEvent;
> +
> +void trace1(TraceEvent event, unsigned long x1);
> +void trace2(TraceEvent event, unsigned long x1, unsigned long x2);
> +void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3);
> +void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4);
> +void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5);

Looks good. I think we definitely need something like this.

Regards,

Anthony Liguori

> diff --git a/trace.py b/trace.py
> new file mode 100755
> index 000..f38ab6b
> --- /dev/null
> +++ b/trace.py
> @@ -0,0 +1,30 @@
> +#!/usr/bin/env python
> +import sys
> +import struct
> +
> +trace_fmt = 'LLLLLL'
> +trace_len = struct.calcsize(trace_fmt)
> +
> +events = {
> +}
> +
> +def read_record(fobj):
> +    s = fobj.read(trace_len)
> +    if len(s) != trace_len:
> +        return None
> +    return struct.unpack(trace_fmt, s)
> +
> +def format_record(rec):
> +    event = events[rec[0]]
> +    fields = [event[0]]
> +    for i in xrange(1, len(event)):
> +        fields.append('%s=0x%x' % (event[i], rec[i]))
> +    return ' '.join(fields)
> +
> +f = open(sys.argv[1], 'rb')
> +while True:
> +    rec = read_record(f)
> +    if rec is None:
> +        break
> +
> +    print format_record(rec)
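For readers who want to try decoding such a log, here is a minimal Python 3 sketch of the same record-reading idea. It assumes each record is six native `unsigned long` values (event id plus x1..x5, matching the `TraceRecord` struct in trace.c) and that the log is read on the machine that wrote it. The `EVENTS` table is hypothetical: the ids follow the enum order of the follow-up patch, and the argument names are taken from its `trace_*` calls.

```python
import io
import struct

# Assumption: one record = event id + x1..x5, all native unsigned long.
TRACE_FMT = "6L"
TRACE_LEN = struct.calcsize(TRACE_FMT)

# Hypothetical event table: id -> (event name, argument names...).
EVENTS = {
    0: ("multiwrite_cb", "mcb", "ret"),
    7: ("paio_submit", "acb", "opaque", "sector_num", "nb_sectors", "type"),
}


def read_records(fobj):
    """Yield one 6-tuple per complete record; stop at a short read."""
    while True:
        data = fobj.read(TRACE_LEN)
        if len(data) != TRACE_LEN:
            return
        yield struct.unpack(TRACE_FMT, data)


def format_record(rec):
    """Pretty-print a record, tolerating ids missing from the table."""
    event = EVENTS.get(rec[0])
    if event is None:
        return "unknown_event_%d" % rec[0]
    fields = [event[0]]
    for i in range(1, len(event)):
        fields.append("%s=0x%x" % (event[i], rec[i]))
    return " ".join(fields)


# Build one synthetic record in memory instead of opening /tmp/trace.log.
sample = io.BytesIO(struct.pack(TRACE_FMT, 0, 0x1000, 5, 0, 0, 0))
lines = [format_record(r) for r in read_records(sample)]
print("\n".join(lines))
```

To decode a real log, the `io.BytesIO` object would be replaced by `open(path, "rb")`. Unknown ids are reported rather than raising, which is a design choice of this sketch and not behaviour of the posted script.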
Bug tracking?
So, what's the current state of the bug tracking system? As far as I can see, qemu is moving to launchpad. Where should qemu-kvm-related issues be submitted nowadays?

Thanks!

/mjt
Re: [PATCH 1/2] trace: Add simple tracing support
On Fri, May 21, 2010 at 12:13 PM, Jan Kiszka <jan.kis...@siemens.com> wrote:
> Stefan Hajnoczi wrote:
>> Trace events should be defined in trace.h. Events are written to
>> /tmp/trace.log and can be formatted using trace.py. Remember to add
>> events to trace.py for pretty-printing.
>
> When already writing to a file, why not reuse QEMU's logging
> infrastructure (log foo / -d foo)? Shouldn't make a huge performance
> difference if the data is saved in clear-text.
>
> Also, having support for ftrace's user space markers would be a very
> nice option (only an option as it's Linux-specific), see
> http://lwn.net/Articles/366796.

Thanks for the links.

I think using the platform's tracing facility has many advantages. The main one being that we can focus on QEMU/KVM development rather than re-implementing tracing infrastructure :).

It may be possible to have SystemTap, DTrace, or nop static trace event code. A platform with no tracing support can only use the nop backend, which results in a build without static trace events. Platforms with tracing support can build with the appropriate backend or nop. The backend tracing facility is abstracted and most of QEMU doesn't need to know which one is being used.

I hadn't seen trace markers. However, I suspect they aren't ideal for static trace events because logging an event requires a write system call. They look useful for annotating kernel tracing information, but less for high frequency/low overhead userspace tracing.

Stefan
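The backend idea described in this message (callers use one trace call, and the build picks SystemTap, DTrace, or nop behind it) can be sketched roughly as follows. In QEMU the choice would be made at build time in C; this Python sketch only models the abstraction at run time, and `NopBackend`, `ListBackend`, `BACKENDS` and `make_tracer` are hypothetical names, not part of any posted patch.

```python
class NopBackend:
    """Backend for platforms without tracing support: discards everything."""

    def emit(self, name, args):
        pass


class ListBackend:
    """Stand-in for a real backend; simply collects records in memory."""

    def __init__(self):
        self.records = []

    def emit(self, name, args):
        self.records.append((name, args))


# Backends available in this (hypothetical) build.
BACKENDS = {"nop": NopBackend, "simple": ListBackend}


def make_tracer(backend_name):
    """Return (trace function, backend); unknown backends fall back to nop."""
    backend = BACKENDS.get(backend_name, NopBackend)()

    def trace(name, *args):
        # Call sites never need to know which backend is in use.
        backend.emit(name, args)

    return trace, backend


trace, backend = make_tracer("simple")
trace("virtio_blk_req_complete", 0x1000, 0)

# "dtrace" is not available in this sketch, so the nop backend is used.
fallback_trace, fallback = make_tracer("dtrace")
fallback_trace("virtio_blk_req_complete", 0x1000, 0)
```

The point of the indirection is that trace call sites stay identical regardless of which facility, if any, the platform provides.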
Re: [PATCH 1/2] trace: Add simple tracing support
Stefan Hajnoczi wrote:
> On Fri, May 21, 2010 at 12:13 PM, Jan Kiszka <jan.kis...@siemens.com> wrote:
>> Stefan Hajnoczi wrote:
>>> Trace events should be defined in trace.h. Events are written to
>>> /tmp/trace.log and can be formatted using trace.py. Remember to add
>>> events to trace.py for pretty-printing.
>>
>> When already writing to a file, why not reuse QEMU's logging
>> infrastructure (log foo / -d foo)? Shouldn't make a huge performance
>> difference if the data is saved in clear-text.
>>
>> Also, having support for ftrace's user space markers would be a very
>> nice option (only an option as it's Linux-specific), see
>> http://lwn.net/Articles/366796.
>
> Thanks for the links.
>
> I think using the platform's tracing facility has many advantages. The
> main one being that we can focus on QEMU/KVM development rather than
> re-implementing tracing infrastructure :).

Indeed. :)

> It may be possible to have SystemTap, DTrace, or nop static trace event
> code. A platform with no tracing support can only use the nop backend,
> which results in a build without static trace events. Platforms with
> tracing support can build with the appropriate backend or nop. The
> backend tracing facility is abstracted and most of QEMU doesn't need to
> know which one is being used.

That would be ideal.

> I hadn't seen trace markers. However, I suspect they aren't ideal for
> static trace events because logging an event requires a write system
> call. They look useful for annotating kernel tracing information, but
> less for high frequency/low overhead userspace tracing.

You never know for sure until you tried :). There are surely lots of scenarios where this overhead does not matter.

Moreover, I'm sure that something of LTTng's high-frequency/low-overhead tracing capabilities will make it (in whatever form) into mainline sooner or later. So we need that smart infrastructure to make use of it once it's available (actually, LTTng is already available, just still requires some kernel patching).

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: network problem with Solaris 10u8 guest
21.05.2010 16:36, Harald Dunkel wrote:
> On 05/12/10 12:41, Harald Dunkel wrote:
>> Hi folks,
>>
>> I am trying to run Solaris 10u8 as a guest in kvm (kernel 2.6.33.2).
>> Problem: The virtual network devices don't work with this Solaris
>> version.
>
> Short update: Virtualbox 3.1.6 seems to be more reliable in this case.

I forgot to send my testing results.

I installed Solaris from sol-10-u8-ga-x86-dvd.iso. It was with the default rtl8139 NIC, and the installer configured the rtls0 interface (so it is at least recognized). I left a `ping -f $solaris-ip' process running for the whole night; it was still running in the morning without any visible issues, at 100% CPU usage (2 cores: for ping, host kernel and kvm processes).

Now, it looks like I have forgotten enough Solaris to be unable to set up a new network driver, so I can't easily switch to e1000. Maybe a reinstall will be faster for me.

So, basically, I can't reproduce the issue.

/mjt
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
Anthony Liguori wrote:
> On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:
>> Trace events should be defined in trace.h. Events are written to
>> /tmp/trace.log and can be formatted using trace.py. Remember to add
>> events to trace.py for pretty-printing.
>>
>> Signed-off-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
>> ---
>>  Makefile.objs |    2 +-
>>  trace.c       |   64 +
>>  trace.h       |    9
>>  trace.py      |   30 ++
>>  4 files changed, 104 insertions(+), 1 deletions(-)
>>  create mode 100644 trace.c
>>  create mode 100644 trace.h
>>  create mode 100755 trace.py
>>
>> diff --git a/Makefile.objs b/Makefile.objs
>> index acbaf22..307e989 100644
>> --- a/Makefile.objs
>> +++ b/Makefile.objs
>> @@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
>>  # block-obj-y is code used by both qemu system emulation and qemu-img
>>
>>  block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
>> -block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
>> +block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
>>  block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
>>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>>
>> diff --git a/trace.c b/trace.c
>> new file mode 100644
>> index 000..2fec4d3
>> --- /dev/null
>> +++ b/trace.c
>> @@ -0,0 +1,64 @@
>> +#include <stdlib.h>
>> +#include <stdio.h>
>> +#include "trace.h"
>> +
>> +typedef struct {
>> +    unsigned long event;
>> +    unsigned long x1;
>> +    unsigned long x2;
>> +    unsigned long x3;
>> +    unsigned long x4;
>> +    unsigned long x5;
>> +} TraceRecord;
>> +
>> +enum {
>> +    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
>> +};
>> +
>> +static TraceRecord trace_buf[TRACE_BUF_LEN];
>> +static unsigned int trace_idx;
>> +static FILE *trace_fp;
>> +
>> +static void trace(TraceEvent event, unsigned long x1,
>> +                  unsigned long x2, unsigned long x3,
>> +                  unsigned long x4, unsigned long x5) {
>> +    TraceRecord *rec = &trace_buf[trace_idx];
>> +    rec->event = event;
>> +    rec->x1 = x1;
>> +    rec->x2 = x2;
>> +    rec->x3 = x3;
>> +    rec->x4 = x4;
>> +    rec->x5 = x5;
>> +
>> +    if (++trace_idx == TRACE_BUF_LEN) {
>> +        trace_idx = 0;
>> +
>> +        if (!trace_fp) {
>> +            trace_fp = fopen("/tmp/trace.log", "w");
>> +        }
>> +        if (trace_fp) {
>> +            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
>> +            result = result;
>> +        }
>> +    }
>> +}
>
> It is probably worth while to read trace points via the monitor or
> through some other mechanism. My concern would be that writing even 64k
> out to disk would introduce enough performance overhead mainly because
> it runs lock-step with the guest's VCPU.
>
> Maybe it's worth adding a thread that syncs the ring to disk if we want
> to write to disk?

That's not what QEMU should worry about. If somehow possible, let's push this into the hands of a (user space) tracing framework, ideally one that is already designed for such requirements. E.g. there exists quite useful work in the context of LTTng (user space RCU for application tracing).

We may need simple stubs for the case that no such framework is (yet) available. But effort should focus on a QEMU infrastructure to add useful tracepoints to the code. Specifically when tracing over KVM, you usually need information about kernel states as well, so you depend on an integrated approach, not Yet Another Log File.

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
On 05/21/2010 08:46 AM, Jan Kiszka wrote:
> Anthony Liguori wrote:
>> On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:
>>> Trace events should be defined in trace.h. Events are written to
>>> /tmp/trace.log and can be formatted using trace.py. Remember to add
>>> events to trace.py for pretty-printing.
>>>
>>> Signed-off-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
>>> ---
>>>  Makefile.objs |    2 +-
>>>  trace.c       |   64 +
>>>  trace.h       |    9
>>>  trace.py      |   30 ++
>>>  4 files changed, 104 insertions(+), 1 deletions(-)
>>>  create mode 100644 trace.c
>>>  create mode 100644 trace.h
>>>  create mode 100755 trace.py
>>>
>>> diff --git a/Makefile.objs b/Makefile.objs
>>> index acbaf22..307e989 100644
>>> --- a/Makefile.objs
>>> +++ b/Makefile.objs
>>> @@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
>>>  # block-obj-y is code used by both qemu system emulation and qemu-img
>>>
>>>  block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
>>> -block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
>>> +block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
>>>  block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
>>>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>>>
>>> diff --git a/trace.c b/trace.c
>>> new file mode 100644
>>> index 000..2fec4d3
>>> --- /dev/null
>>> +++ b/trace.c
>>> @@ -0,0 +1,64 @@
>>> +#include <stdlib.h>
>>> +#include <stdio.h>
>>> +#include "trace.h"
>>> +
>>> +typedef struct {
>>> +    unsigned long event;
>>> +    unsigned long x1;
>>> +    unsigned long x2;
>>> +    unsigned long x3;
>>> +    unsigned long x4;
>>> +    unsigned long x5;
>>> +} TraceRecord;
>>> +
>>> +enum {
>>> +    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
>>> +};
>>> +
>>> +static TraceRecord trace_buf[TRACE_BUF_LEN];
>>> +static unsigned int trace_idx;
>>> +static FILE *trace_fp;
>>> +
>>> +static void trace(TraceEvent event, unsigned long x1,
>>> +                  unsigned long x2, unsigned long x3,
>>> +                  unsigned long x4, unsigned long x5) {
>>> +    TraceRecord *rec = &trace_buf[trace_idx];
>>> +    rec->event = event;
>>> +    rec->x1 = x1;
>>> +    rec->x2 = x2;
>>> +    rec->x3 = x3;
>>> +    rec->x4 = x4;
>>> +    rec->x5 = x5;
>>> +
>>> +    if (++trace_idx == TRACE_BUF_LEN) {
>>> +        trace_idx = 0;
>>> +
>>> +        if (!trace_fp) {
>>> +            trace_fp = fopen("/tmp/trace.log", "w");
>>> +        }
>>> +        if (trace_fp) {
>>> +            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
>>> +            result = result;
>>> +        }
>>> +    }
>>> +}
>>
>> It is probably worth while to read trace points via the monitor or
>> through some other mechanism. My concern would be that writing even 64k
>> out to disk would introduce enough performance overhead mainly because
>> it runs lock-step with the guest's VCPU.
>>
>> Maybe it's worth adding a thread that syncs the ring to disk if we want
>> to write to disk?
>
> That's not what QEMU should worry about. If somehow possible, let's push
> this into the hands of a (user space) tracing framework, ideally one
> that is already designed for such requirements. E.g. there exists quite
> useful work in the context of LTTng (user space RCU for application
> tracing).

From what I understand, none of the current kernel approaches to userspace tracing have much momentum at the moment.

> We may need simple stubs for the case that no such framework is (yet)
> available. But effort should focus on a QEMU infrastructure to add
> useful tracepoints to the code. Specifically when tracing over KVM, you
> usually need information about kernel states as well, so you depend on
> an integrated approach, not Yet Another Log File.

I think the simple code that Stefan pasted gives us 95% of what we need.

Regards,

Anthony Liguori

> Jan
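The suggestion quoted above, a thread that syncs the ring to disk so the traced code path never blocks on file I/O, can be sketched as below. This is only an illustration of the hand-off pattern, not code from the patch: `BufferedTracer` and its methods are hypothetical, the record layout of six native `unsigned long` values is assumed from `TraceRecord`, and `trace()` is assumed to be called from a single producer thread.

```python
import io
import queue
import struct
import threading

RECORD = struct.Struct("6L")   # assumed layout: event id plus x1..x5


class BufferedTracer:
    """Hypothetical tracer: the hot path only packs and queues records."""

    def __init__(self, fobj, batch=4):
        self._fobj = fobj
        self._batch = batch
        self._pending = []                  # records not yet handed off
        self._queue = queue.Queue()         # full batches for the writer
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def trace(self, event, x1=0, x2=0, x3=0, x4=0, x5=0):
        """Hot path (single producer assumed): no file I/O happens here."""
        self._pending.append(RECORD.pack(event, x1, x2, x3, x4, x5))
        if len(self._pending) >= self._batch:
            self._queue.put(b"".join(self._pending))
            self._pending = []

    def _drain(self):
        """Writer thread: blocks on the queue and performs the slow writes."""
        while True:
            chunk = self._queue.get()
            if chunk is None:               # sentinel from close()
                break
            self._fobj.write(chunk)

    def close(self):
        """Hand off any partial batch, then stop and join the writer."""
        if self._pending:
            self._queue.put(b"".join(self._pending))
            self._pending = []
        self._queue.put(None)
        self._writer.join()


out = io.BytesIO()                          # stands in for /tmp/trace.log
tracer = BufferedTracer(out, batch=4)
for i in range(10):
    tracer.trace(1, i)
tracer.close()
```

Unlike the posted trace.c, which only writes when the buffer wraps, this sketch also flushes the partial batch on `close()`; that is a choice made for the example.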
Re: Gentoo guest with smp: emerge freeze while recompile world
> These are almost impossible to debug. Try copying vmlinux out of your
> guest and attach with gdb when it hangs. Then issue the command
>
>   (gdb) thread apply all backtrace
>
> to see what the guest is doing.
>
> --
> Do not meddle in the internals of kernels, for they are subtle and
> quick to panic.
--- End of Original Message ---

Hi,

I compiled gentoo-sources-2.6.31-r10, and with this kernel emerge -e world completes without errors! I always use the same .config.

Afterwards I tried gentoo-sources-2.6.34 and vanilla-sources-2.6.34, but the problem remains: the compile freezes and I see this in ps -elf:

5 S root 1013 1 0 76 -4 - 3125 poll_s 13:00 ? 00:00:00 /sbin/udevd --daemon
1 S root 2669 1 0 80 0 - 7523 wait 13:00 ? 00:00:00 supervising syslog-ng
5 S root 2670 2669 0 80 0 - 7556 poll_s 13:00 ? 00:00:00 /usr/sbin/syslog-ng
1 S root 3258 1 0 80 0 - 9505 poll_s 13:00 ? 00:00:00 /usr/sbin/sshd
1 S root 3378 1 0 80 0 - 4115 hrtime 13:00 ? 00:00:00 /usr/sbin/cron
0 S root 3446 1 0 80 0 - 1493 n_tty_ 13:00 tty2 00:00:00 /sbin/agetty 38400 tty2 linux
0 S root 3447 1 0 80 0 - 1493 n_tty_ 13:00 tty3 00:00:00 /sbin/agetty 38400 tty3 linux
0 S root 3448 1 0 80 0 - 1493 n_tty_ 13:00 tty4 00:00:00 /sbin/agetty 38400 tty4 linux
0 S root 3449 1 0 80 0 - 1493 n_tty_ 13:00 tty5 00:00:00 /sbin/agetty 38400 tty5 linux
0 S root 3450 1 0 80 0 - 1493 n_tty_ 13:00 tty6 00:00:00 /sbin/agetty 38400 tty6 linux
5 S root 3457 1 0 80 0 - 5959 poll_s 13:00 ? 00:00:00 SCREEN -S sb1
4 S root 3458 3457 0 80 0 - 4454 wait 13:00 pts/0 00:00:00 -/bin/bash
4 S root 3462 3458 0 75 -5 - 45171 poll_s 13:00 pts/0 00:00:34 /usr/bin/python2.6 /usr/bin/emerge -e world
4 S root 3613 1 0 80 0 - 14014 wait 13:01 tty1 00:00:00 /bin/login --
4 S root 3953 3613 0 80 0 - 4429 n_tty_ 13:01 tty1 00:00:00 -bash
0 S root 6614 3462 0 75 -5 - 972 wait 14:26 pts/0 00:00:00 [dev-util/pkgconfig-0.23] sandbox /usr/lib64/portage/bin/ebuild.sh compile
4 S root 6615 6614 0 75 -5 - 6362 wait 14:26 pts/0 00:00:00 /bin/bash /usr/lib64/portage/bin/ebuild.sh compile
5 S root 6646 6615 0 75 -5 - 6745 wait 14:26 pts/0 00:00:00 /bin/bash /usr/lib64/portage/bin/ebuild.sh compile
4 S root 13235 6646 0 75 -5 - 3651 wait 14:27 pts/0 00:00:00 make -j8
4 S root 13238 13235 0 75 -5 - 3652 wait 14:27 pts/0 00:00:00 make all-recursive
4 S root 13239 13238 0 75 -5 - 5956 wait 14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
5 S root 13243 13239 0 75 -5 - 5956 wait 14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
4 S root 13244 13243 0 75 -5 - 3686 wait 14:27 pts/0 00:00:00 make all
4 S root 13358 13244 0 75 -5 - 3684 wait 14:27 pts/0 00:00:00 make all-recursive
4 S root 13359 13358 0 75 -5 - 5956 wait 14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
5 S root 16546 13359 0 75 -5 - 5956 wait 14:28 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
4 S root 16547 16546 0 75 -5 - 3652 wait 14:28 pts/0 00:00:00 make all
4 S root 16548 16547 0 75 -5 - 3652 n_tty_ 14:28 pts/0 00:00:00 make all-am
4 S root 16599 3258 0 80 0 - 17937 poll_s 15:07 ? 00:00:00 sshd: r...@pts/2
4 S root 16602 16599 0 80 0 - 4429 wait 15:07 pts/2 00:00:00 -bash
4 R root 16611 16602 0 80 0 - 3698 - 15:08 pts/2 00:00:00 ps -elf
1 S root 31506 2 0 80 0 - 0 bdi_wr 14:25 ? 00:00:00 [flush-253:0]

Everything is in wait?

After this test I rebooted into 2.6.31-r10 and completed emerge -e world successfully. The problem always shows up with all kernels >= 2.6.32.

Have I set up something wrong in the kernel? I posted the .config in the previous email.

Best regards,
Riccardo
Re: [PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup
On 5/20/2010 3:22 PM, Paul Menage wrote:
> On Tue, May 18, 2010 at 5:04 PM, Sridhar Samudrala
> <samudrala.srid...@gmail.com> wrote:
>> Add a new kernel API to attach a task to current task's cgroup
>> in all the active hierarchies.
>>
>> Signed-off-by: Sridhar Samudrala <s...@us.ibm.com>
>
> Reviewed-by: Paul Menage <men...@google.com>
>
> It would be more efficient to just attach directly to current->cgroups
> rather than potentially creating/destroying one css_set for each
> hierarchy until we've completely converged on current->cgroups - but
> that would require a bunch of refactoring of the guts of
> cgroup_attach_task() to ensure that the right can_attach()/attach()
> callbacks are made. That doesn't really seem worthwhile right now for
> the initial use, that I imagine isn't going to be
> performance-sensitive.

Yes. In our use-case, this will be called only once per guest interface when the guest comes up. Hope you or someone more familiar with cgroups subsystem can optimize this function later.

Thanks
Sridhar
Re: Gentoo guest with smp: emerge freeze while recompile world
On 05/21/2010 04:16 PM, Riccardo wrote:
> ...
>> These are almost impossible to debug. Try copying vmlinux out of your
>> guest and attach with gdb when it hangs. Then issue the command
>>
>>   (gdb) thread apply all backtrace
>>
>> to see what the guest is doing.
>>
>> --
>> Do not meddle in the internals of kernels, for they are subtle and
>> quick to panic.
> --- End of Original Message ---
>
> Hi,
> I compiled gentoo-sources-2.6.31-r10, and with this kernel emerge -e
> world completes without errors!

Interesting. Can you do a git bisect to see where it stops working?

> I always use the same .config.
> Afterwards I tried gentoo-sources-2.6.34 and vanilla-sources-2.6.34, but
> the problem remains: the compile freezes and I see this in ps -elf:
>
> 5 S root 1013 1 0 76 -4 - 3125 poll_s 13:00 ? 00:00:00 /sbin/udevd --daemon
> 1 S root 2669 1 0 80 0 - 7523 wait 13:00 ? 00:00:00 supervising syslog-ng
> 5 S root 2670 2669 0 80 0 - 7556 poll_s 13:00 ? 00:00:00 /usr/sbin/syslog-ng
> 1 S root 3258 1 0 80 0 - 9505 poll_s 13:00 ? 00:00:00 /usr/sbin/sshd
> 1 S root 3378 1 0 80 0 - 4115 hrtime 13:00 ? 00:00:00 /usr/sbin/cron
> 0 S root 3446 1 0 80 0 - 1493 n_tty_ 13:00 tty2 00:00:00 /sbin/agetty 38400 tty2 linux
> 0 S root 3447 1 0 80 0 - 1493 n_tty_ 13:00 tty3 00:00:00 /sbin/agetty 38400 tty3 linux
> 0 S root 3448 1 0 80 0 - 1493 n_tty_ 13:00 tty4 00:00:00 /sbin/agetty 38400 tty4 linux
> 0 S root 3449 1 0 80 0 - 1493 n_tty_ 13:00 tty5 00:00:00 /sbin/agetty 38400 tty5 linux
> 0 S root 3450 1 0 80 0 - 1493 n_tty_ 13:00 tty6 00:00:00 /sbin/agetty 38400 tty6 linux
> 5 S root 3457 1 0 80 0 - 5959 poll_s 13:00 ? 00:00:00 SCREEN -S sb1
> 4 S root 3458 3457 0 80 0 - 4454 wait 13:00 pts/0 00:00:00 -/bin/bash
> 4 S root 3462 3458 0 75 -5 - 45171 poll_s 13:00 pts/0 00:00:34 /usr/bin/python2.6 /usr/bin/emerge -e world
> 4 S root 3613 1 0 80 0 - 14014 wait 13:01 tty1 00:00:00 /bin/login --
> 4 S root 3953 3613 0 80 0 - 4429 n_tty_ 13:01 tty1 00:00:00 -bash
> 0 S root 6614 3462 0 75 -5 - 972 wait 14:26 pts/0 00:00:00 [dev-util/pkgconfig-0.23] sandbox /usr/lib64/portage/bin/ebuild.sh compile
> 4 S root 6615 6614 0 75 -5 - 6362 wait 14:26 pts/0 00:00:00 /bin/bash /usr/lib64/portage/bin/ebuild.sh compile
> 5 S root 6646 6615 0 75 -5 - 6745 wait 14:26 pts/0 00:00:00 /bin/bash /usr/lib64/portage/bin/ebuild.sh compile
> 4 S root 13235 6646 0 75 -5 - 3651 wait 14:27 pts/0 00:00:00 make -j8
> 4 S root 13238 13235 0 75 -5 - 3652 wait 14:27 pts/0 00:00:00 make all-recursive
> 4 S root 13239 13238 0 75 -5 - 5956 wait 14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
> 5 S root 13243 13239 0 75 -5 - 5956 wait 14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
> 4 S root 13244 13243 0 75 -5 - 3686 wait 14:27 pts/0 00:00:00 make all
> 4 S root 13358 13244 0 75 -5 - 3684 wait 14:27 pts/0 00:00:00 make all-recursive
> 4 S root 13359 13358 0 75 -5 - 5956 wait 14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
> 5 S root 16546 13359 0 75 -5 - 5956 wait 14:28 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
> 4 S root 16547 16546 0 75 -5 - 3652 wait 14:28 pts/0 00:00:00 make all
> 4 S root 16548 16547 0 75 -5 - 3652 n_tty_ 14:28 pts/0 00:00:00 make all-am
> 4 S root 16599 3258 0 80 0 - 17937 poll_s 15:07 ? 00:00:00 sshd: r...@pts/2
> 4 S root 16602 16599 0 80 0 - 4429 wait 15:07 pts/2 00:00:00 -bash
> 4 R root 16611 16602 0 80 0 - 3698 - 15:08 pts/2 00:00:00 ps -elf
> 1 S root 31506 2 0 80 0 - 0 bdi_wr 14:25 ? 00:00:00 [flush-253:0]
>
> Everything is in wait?

Maybe a block driver problem? Are you using virtio?

> After this test I rebooted into 2.6.31-r10 and completed emerge -e world
> successfully. The problem always shows up with all kernels >= 2.6.32.
>
> Have I set up something wrong in the kernel? I posted the .config in
> the previous email.

It should work for all .configs.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: Gentoo guest with smp: emerge freeze while recompile world
-- Original Message ---
From: Avi Kivity a...@redhat.com
To: Riccardo andrighetto.ricca...@gmail.com
Cc: kvm@vger.kernel.org
Sent: Fri, 21 May 2010 18:21:20 +0300
Subject: Re: Gentoo guest with smp: emerge freeze while recompile world

> On 05/21/2010 04:16 PM, Riccardo wrote:
>> ...
>> These are almost impossible to debug. Try copying vmlinux out of your guest and attach with gdb when it hangs. Then issue the command
>>
>>   (gdb) thread apply all backtrace
>>
>> to see what the guest is doing.
>>
>> --
>> Do not meddle in the internals of kernels, for they are subtle and quick to panic.
>> --- End of Original Message ---
>>
>> Hi,
>> I compiled gentoo-sources-2.6.31-r10, and with this kernel emerge -e world completes without errors!
>
> Interesting. Can you do a git bisect to see where it stops working?

Ehm, sorry, I don't understand the request. Do you have a link?

>> I always use the same .config.
>> Afterwards I tried gentoo-sources-2.6.34 and vanilla-sources-2.6.34, but the problem remains: the compile freezes and I see this in ps -elf:
>>
>> [...]
>
> All in wait? Maybe a
Re: Bug tracking?
On 05/21/2010 07:45 AM, Michael Tokarev wrote:
> So, what's the current state of the bug tracking system? As far as I can see, qemu is moving to Launchpad. Where should qemu-kvm-related issues be submitted nowadays?

Kernel issues should be filed in bugzilla.kernel.org. qemu issues should be filed in Launchpad.

Regards,

Anthony Liguori

> Thanks!
>
> /mjt
Re: Bug tracking?
On 05/21/2010 06:50 PM, Anthony Liguori wrote:
> On 05/21/2010 07:45 AM, Michael Tokarev wrote:
>> So, what's the current state of the bug tracking system? As far as I can see, qemu is moving to Launchpad. Where should qemu-kvm-related issues be submitted nowadays?
>
> Kernel issues should be filed in bugzilla.kernel.org. qemu issues should be filed in Launchpad.

qemu-kvm issues, even if not present in upstream qemu, should be filed in Launchpad (but clearly marked as qemu-kvm specific).

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: Bug tracking?
21.05.2010 19:56, Avi Kivity wrote:
> On 05/21/2010 06:50 PM, Anthony Liguori wrote:
>> On 05/21/2010 07:45 AM, Michael Tokarev wrote:
>>> So, what's the current state of the bug tracking system? As far as I can see, qemu is moving to Launchpad. Where should qemu-kvm-related issues be submitted nowadays?
>>
>> Kernel issues should be filed in bugzilla.kernel.org. qemu issues should be filed in Launchpad.
>
> qemu-kvm issues, even if not present in upstream qemu, should be filed in Launchpad (but clearly marked as qemu-kvm specific).

Aha. That makes perfect sense now.

Thanks!

/mjt
Re: Gentoo guest with smp: emerge freeze while recompile world
On Friday, May 21, 2010 10:46:10 am Riccardo wrote:
> -- Original Message ---
> From: Avi Kivity a...@redhat.com
> To: Riccardo andrighetto.ricca...@gmail.com
> Cc: kvm@vger.kernel.org
> Sent: Fri, 21 May 2010 18:21:20 +0300
> Subject: Re: Gentoo guest with smp: emerge freeze while recompile world
>
>> On 05/21/2010 04:16 PM, Riccardo wrote:
>>> ...
>>> These are almost impossible to debug. Try copying vmlinux out of your guest and attach with gdb when it hangs. Then issue the command
>>>
>>>   (gdb) thread apply all backtrace
>>>
>>> to see what the guest is doing.
>>> --- End of Original Message ---
>>>
>>> Hi,
>>> I compiled gentoo-sources-2.6.31-r10, and with this kernel emerge -e world completes without errors!
>>
>> Interesting. Can you do a git bisect to see where it stops working?
>
> Ehm, sorry, I don't understand the request. Do you have a link?
>
>>> I always use the same .config.
>>> Afterwards I tried gentoo-sources-2.6.34 and vanilla-sources-2.6.34, but the problem remains: the compile freezes and I see this in ps -elf:
>>>
>>> [...]
>>
>> All in wait? Maybe a block driver problem? Are you using virtio?
>
> Yes, I always
Re: Gentoo guest with smp: emerge freeze while recompile world
-- Original Message ---
From: Brian Jackson i...@theiggy.com
To: Riccardo andrighetto.ricca...@gmail.com
Cc: kvm@vger.kernel.org
Sent: Fri, 21 May 2010 11:35:36 -0500
Subject: Re: Gentoo guest with smp: emerge freeze while recompile world

> On Friday, May 21, 2010 10:46:10 am Riccardo wrote:
>> -- Original Message ---
>> From: Avi Kivity a...@redhat.com
>> To: Riccardo andrighetto.ricca...@gmail.com
>> Cc: kvm@vger.kernel.org
>> Sent: Fri, 21 May 2010 18:21:20 +0300
>> Subject: Re: Gentoo guest with smp: emerge freeze while recompile world
>>
>>> On 05/21/2010 04:16 PM, Riccardo wrote:
>>>> ...
>>>> These are almost impossible to debug. Try copying vmlinux out of your guest and attach with gdb when it hangs. Then issue the command
>>>>
>>>>   (gdb) thread apply all backtrace
>>>>
>>>> to see what the guest is doing.
>>>> --- End of Original Message ---
>>>>
>>>> Hi,
>>>> I compiled gentoo-sources-2.6.31-r10, and with this kernel emerge -e world completes without errors!
>>>
>>> Interesting. Can you do a git bisect to see where it stops working?
>>
>> Ehm, sorry, I don't understand the request. Do you have a link?
>>
>>>> I always use the same .config.
>>>> Afterwards I tried gentoo-sources-2.6.34 and vanilla-sources-2.6.34, but the problem remains: the compile freezes and I see this in ps -elf:
>>>>
>>>> [...]
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
Anthony Liguori wrote:
> On 05/21/2010 08:46 AM, Jan Kiszka wrote:
>> Anthony Liguori wrote:
>>> On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:
>>>> Trace events should be defined in trace.h. Events are written to /tmp/trace.log and can be formatted using trace.py. Remember to add events to trace.py for pretty-printing.
>>>>
>>>> Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
>>>> ---
>>>>  Makefile.objs |  2 +-
>>>>  trace.c       | 64 +
>>>>  trace.h       |  9
>>>>  trace.py      | 30 ++
>>>>  4 files changed, 104 insertions(+), 1 deletions(-)
>>>>  create mode 100644 trace.c
>>>>  create mode 100644 trace.h
>>>>  create mode 100755 trace.py
>>>>
>>>> diff --git a/Makefile.objs b/Makefile.objs
>>>> index acbaf22..307e989 100644
>>>> --- a/Makefile.objs
>>>> +++ b/Makefile.objs
>>>> @@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
>>>>  # block-obj-y is code used by both qemu system emulation and qemu-img
>>>>  block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
>>>> -block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
>>>> +block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
>>>>  block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
>>>>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>>>> diff --git a/trace.c b/trace.c
>>>> new file mode 100644
>>>> index 0000000..2fec4d3
>>>> --- /dev/null
>>>> +++ b/trace.c
>>>> @@ -0,0 +1,64 @@
>>>> +#include <stdlib.h>
>>>> +#include <stdio.h>
>>>> +#include "trace.h"
>>>> +
>>>> +typedef struct {
>>>> +    unsigned long event;
>>>> +    unsigned long x1;
>>>> +    unsigned long x2;
>>>> +    unsigned long x3;
>>>> +    unsigned long x4;
>>>> +    unsigned long x5;
>>>> +} TraceRecord;
>>>> +
>>>> +enum {
>>>> +    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
>>>> +};
>>>> +
>>>> +static TraceRecord trace_buf[TRACE_BUF_LEN];
>>>> +static unsigned int trace_idx;
>>>> +static FILE *trace_fp;
>>>> +
>>>> +static void trace(TraceEvent event, unsigned long x1,
>>>> +                  unsigned long x2, unsigned long x3,
>>>> +                  unsigned long x4, unsigned long x5) {
>>>> +    TraceRecord *rec = &trace_buf[trace_idx];
>>>> +    rec->event = event;
>>>> +    rec->x1 = x1;
>>>> +    rec->x2 = x2;
>>>> +    rec->x3 = x3;
>>>> +    rec->x4 = x4;
>>>> +    rec->x5 = x5;
>>>> +
>>>> +    if (++trace_idx == TRACE_BUF_LEN) {
>>>> +        trace_idx = 0;
>>>> +
>>>> +        if (!trace_fp) {
>>>> +            trace_fp = fopen("/tmp/trace.log", "w");
>>>> +        }
>>>> +        if (trace_fp) {
>>>> +            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
>>>> +            result = result;
>>>> +        }
>>>> +    }
>>>> +}
>>>
>>> It is probably worthwhile to read trace points via the monitor or through some other mechanism. My concern would be that writing even 64k out to disk would introduce noticeable performance overhead, mainly because it runs in lock-step with the guest's VCPU.
>>>
>>> Maybe it's worth adding a thread that syncs the ring to disk if we want to write to disk?
>>
>> That's not what QEMU should worry about. If somehow possible, let's push this into the hands of a (user space) tracing framework, ideally one that is already designed for such requirements. E.g. there exists quite useful work in the context of LTTng (user space RCU for application tracing).
>
> From what I understand, none of the current kernel approaches to userspace tracing have much momentum at the moment.
>
>> We may need simple stubs for the case that no such framework is (yet) available. But effort should focus on a QEMU infrastructure to add useful tracepoints to the code. Specifically when tracing over KVM, you usually need information about kernel states as well, so you depend on an integrated approach, not Yet Another Log File.
>
> I think the simple code that Stefan pasted gives us 95% of what we need.

IMHO not 95%, but it is a start.

I would just like to avoid too much effort being spent on re-inventing smart trace buffers, trace daemons, or trace visualization tools. Better to pick up some semi-perfect approach (e.g. [1]; unfortunately it still seems to lack kernel integration) and drive it according to our needs.

Jan

[1] http://lttng.org/ust

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
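As a reading aid for the discussion above, here is a minimal, self-contained sketch of the buffering scheme in the quoted patch. It is not the patch itself: the `TraceRing` wrapper, `trace_ring_init`, `trace_ring_add`, and the `wraps` counter are hypothetical names added so the logic can be exercised without touching /tmp/trace.log, and the error handling on a failed write is an assumption.

```c
#include <stdio.h>

/* One fixed-size binary record, same fields as in the quoted patch. */
typedef struct {
    unsigned long event;
    unsigned long x1, x2, x3, x4, x5;
} TraceRecord;

enum {
    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
};

/* Hypothetical wrapper type; the patch uses file-scope statics instead. */
typedef struct {
    TraceRecord buf[TRACE_BUF_LEN];
    unsigned int idx;     /* next free slot in buf */
    unsigned long wraps;  /* how many times buf has filled up */
    FILE *fp;             /* sink for full buffers; NULL discards them */
} TraceRing;

static void trace_ring_init(TraceRing *r, FILE *fp)
{
    r->idx = 0;
    r->wraps = 0;
    r->fp = fp;
}

/*
 * Store one record. When the buffer fills up, write all of it to the sink
 * and start again at slot 0. The fwrite() runs in the caller's context,
 * which is the lock-step cost raised in the thread.
 */
static void trace_ring_add(TraceRing *r, unsigned long event,
                           unsigned long x1, unsigned long x2,
                           unsigned long x3, unsigned long x4,
                           unsigned long x5)
{
    TraceRecord *rec = &r->buf[r->idx];

    rec->event = event;
    rec->x1 = x1;
    rec->x2 = x2;
    rec->x3 = x3;
    rec->x4 = x4;
    rec->x5 = x5;

    if (++r->idx == TRACE_BUF_LEN) {
        r->idx = 0;
        r->wraps++;
        if (r->fp && fwrite(r->buf, sizeof r->buf, 1, r->fp) != 1) {
            r->fp = NULL; /* assumption: stop writing after a failed flush */
        }
    }
}
```

Passing a NULL sink keeps the sketch usable as an in-memory ring, which is roughly what reading trace points through the monitor would need; moving the write into a separate thread, as suggested above, would additionally require synchronization that this sketch does not attempt.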
Re: [PATCH] add support for protocol driver create_options
On 20.05.2010 07:36, MORITA Kazutaka wrote:
> This patch enables protocol drivers to use their create options which are not supported by the format. For example, protocol drivers can use a backing_file option with raw format.
>
> Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
> ---
>  block.c       |  7 +++
>  block.h       |  1 +
>  qemu-img.c    | 49 ++---
>  qemu-option.c | 52 +---
>  qemu-option.h |  2 ++
>  5 files changed, 85 insertions(+), 26 deletions(-)
>
> diff --git a/block.c b/block.c
> index 48d8468..0ab9424 100644
> --- a/block.c
> +++ b/block.c
> @@ -56,7 +56,6 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
>                          uint8_t *buf, int nb_sectors);
>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>                           const uint8_t *buf, int nb_sectors);
> -static BlockDriver *find_protocol(const char *filename);
>
>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>      QTAILQ_HEAD_INITIALIZER(bdrv_states);
> @@ -210,7 +209,7 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options)
>  {
>      BlockDriver *drv;
>
> -    drv = find_protocol(filename);
> +    drv = bdrv_find_protocol(filename);
>      if (drv == NULL) {
>          drv = bdrv_find_format("file");
>      }
> @@ -283,7 +282,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
>      return drv;
>  }
>
> -static BlockDriver *find_protocol(const char *filename)
> +BlockDriver *bdrv_find_protocol(const char *filename)
>  {
>      BlockDriver *drv1;
>      char protocol[128];
> @@ -469,7 +468,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, int flags)
>      BlockDriver *drv;
>      int ret;
>
> -    drv = find_protocol(filename);
> +    drv = bdrv_find_protocol(filename);
>      if (!drv) {
>          return -ENOENT;
>      }
> diff --git a/block.h b/block.h
> index 24efeb6..9034ebb 100644
> --- a/block.h
> +++ b/block.h
> @@ -54,6 +54,7 @@ void bdrv_info_stats(Monitor *mon, QObject **ret_data);
>
>  void bdrv_init(void);
>  void bdrv_init_with_whitelist(void);
> +BlockDriver *bdrv_find_protocol(const char *filename);
>  BlockDriver *bdrv_find_format(const char *format_name);
>  BlockDriver *bdrv_find_whitelisted_format(const char *format_name);
>  int bdrv_create(BlockDriver *drv, const char* filename,
> diff --git a/qemu-img.c b/qemu-img.c
> index d3c30a7..8ae7184 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -252,8 +252,8 @@ static int img_create(int argc, char **argv)
>      const char *base_fmt = NULL;
>      const char *filename;
>      const char *base_filename = NULL;
> -    BlockDriver *drv;
> -    QEMUOptionParameter *param = NULL;
> +    BlockDriver *drv, *proto_drv;
> +    QEMUOptionParameter *param = NULL, *create_options = NULL;
>      char *options = NULL;
>
>      flags = 0;
> @@ -286,33 +286,42 @@ static int img_create(int argc, char **argv)
>          }
>      }
>
> +    /* Get the filename */
> +    if (optind >= argc)
> +        help();
> +    filename = argv[optind++];
> +
>      /* Find driver and parse its options */
>      drv = bdrv_find_format(fmt);
>      if (!drv)
>          error("Unknown file format '%s'", fmt);
>
> +    proto_drv = bdrv_find_protocol(filename);
> +    if (!proto_drv)
> +        error("Unknown protocol '%s'", filename);
> +
> +    create_options = append_option_parameters(create_options,
> +                                              drv->create_options);
> +    create_options = append_option_parameters(create_options,
> +                                              proto_drv->create_options);
> +
>      if (options && !strcmp(options, "?")) {
> -        print_option_help(drv->create_options);
> +        print_option_help(create_options);
>          return 0;
>      }
>
>      /* Create parameter list with default values */
> -    param = parse_option_parameters("", drv->create_options, param);
> +    param = parse_option_parameters("", create_options, param);
>      set_option_parameter_int(param, BLOCK_OPT_SIZE, -1);
>
>      /* Parse -o options */
>      if (options) {
> -        param = parse_option_parameters(options, drv->create_options, param);
> +        param = parse_option_parameters(options, create_options, param);
>          if (param == NULL) {
>              error("Invalid options for file format '%s'.", fmt);
>          }
>      }
>
> -    /* Get the filename */
> -    if (optind >= argc)
> -        help();
> -    filename = argv[optind++];
> -
>      /* Add size to parameters */
>      if (optind < argc) {
>          set_option_parameter(param, BLOCK_OPT_SIZE, argv[optind++]);
> @@ -362,6 +371,7 @@ static int img_create(int argc, char **argv)
>      puts("");
>
>      ret = bdrv_create(drv, filename, param);
> +    free_option_parameters(create_options);
>      free_option_parameters(param);
>
>      if (ret < 0) {
> @@ -543,14 +553,14 @@ static int img_convert(int argc, char **argv)
>  {
>      int c, ret, n, n1, bs_n, bs_i, flags, cluster_size,
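The quoted hunk for img_create() merges the format driver's and the protocol driver's create options into one list before parsing -o. The body of append_option_parameters() is not visible in this excerpt, so the following is only a sketch of that merging idea under stated assumptions: `Opt`, `opt_count`, `opt_find`, and `opt_append` are hypothetical stand-ins (not QEMU's QEMUOptionParameter API), lists are terminated by an entry with a NULL name, and skipping duplicate names is a design choice made here for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified option descriptor; not QEMU's real type. */
typedef struct {
    const char *name;
    const char *help;
} Opt;

/* Number of entries before the NULL-name terminator (0 for a NULL list). */
static size_t opt_count(const Opt *list)
{
    size_t n = 0;

    while (list && list[n].name) {
        n++;
    }
    return n;
}

/* Return 1 if one of the first n entries of list is called name. */
static int opt_find(const Opt *list, size_t n, const char *name)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(list[i].name, name) == 0) {
            return 1;
        }
    }
    return 0;
}

/*
 * Append the entries of src that dest does not already contain.
 * dest must be NULL or heap-allocated; the returned list replaces it and
 * is released with free(). On allocation failure NULL is returned and
 * dest is left untouched.
 */
static Opt *opt_append(Opt *dest, const Opt *src)
{
    size_t n_dest = opt_count(dest);
    size_t n_src = opt_count(src);
    size_t i;
    Opt *out = realloc(dest, (n_dest + n_src + 1) * sizeof(*out));

    if (!out) {
        return NULL;
    }
    for (i = 0; i < n_src; i++) {
        if (!opt_find(out, n_dest, src[i].name)) {
            out[n_dest++] = src[i];
        }
    }
    out[n_dest].name = NULL; /* keep the list terminated */
    out[n_dest].help = NULL;
    return out;
}
```

With a format list containing size and backing_file and a protocol list containing backing_file plus one protocol-specific option, two successive calls yield a three-entry list, which is the shape a help printer or option parser would then consume.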
ixgbe: macvlan on PF/VF when SRIOV is enabled
Hello Jeff,

macvlan doesn't work on the PF when SRIOV is enabled. Creating the macvlan succeeds, but ping (ICMP request) traffic goes to the VF interface instead of the PF/macvlan, even though the ARP entry is correct.

I patched the ixgbe driver, and macvlan on the PF works with the patch. But I am not sure whether it is right, since I don't have the HW spec. What I did in the ixgbe driver was:

1. PF's rar index is 0, VMDQ index is adapter->num_vfs;
2. VF's rar is based on rar_used_count and mc_addr_in_rar_count, VMDQ index is ;
3. PF's secondary addresses use PF's rar index + i, VMDQ index is adapter->num_vfs.

Before I submit the patch, I want to understand the right index assignment for both the rar index and the VMDQ index when SRIOV is enabled:

1. Is the VMDQ index for the PF adapter->num_vfs, or 0? Is the rar index 0?
2. Is the PF's secondary address rar index based on rar_used_count/mc_addr_in_rar_count?
3. Is the VF's VMDQ index based on the vf number?
4. Is the VF's rar index vf + 1, or should it be based on rar_used_count?

I am also working on macvlan on the VF. The question here is whether macvlan on a VF should work at all. It looks like ixgbevf secondary addresses are not in the receive address filter, so macvlan on a VF doesn't work.

Thanks
Shirley
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
On Fri, May 21, 2010 at 5:52 PM, Jan Kiszka jan.kis...@siemens.com wrote:
> I would just like to avoid too much effort being spent on re-inventing smart trace buffers, trace daemons, or trace visualization tools. Better to pick up some semi-perfect approach (e.g. [1]; unfortunately it still seems to lack kernel integration) and drive it according to our needs.

I agree we have to consider existing solutions. The killer is the usability: what dependencies are required to build with tracing? Is a patched kernel or module required? How easy is it to add static trace events during debugging?

If there are too many dependencies, especially on unpackaged software, many people will stop right there and not bother. A patched kernel or module isn't acceptable, since the hassle of reconfiguring a system for tracing becomes too great (or in some cases changing the kernel is not possible/allowed).

Adding new static trace events should be easy, too. Ideally it doesn't require adding information about the trace event in multiple places (header files, C files, etc). It also shouldn't require learning about the tracing system; adding a trace event should be self-explanatory so anyone can easily add one for debugging.

A lot of opinions there, but what I'm saying is that friction must be low. If the tracing system is a pain to use, then no-one will use it.

http://lttng.org/files/ust/manual/ust.html

LTTng Userspace Tracer looks interesting - no kernel support required AFAICT. Toggling trace events in a running process is supported. Similar to kernel tracepoint.h, with an existing report/visualization tool. x86 (32- and 64-bit) only. Like you say, no correlation with kernel trace data. I'll try to give LTTng UST a spin by converting my trace events to use UST. This seems closest to an existing tracing system we can drop in.

http://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps

Requires kernel support - not sure if enough of utrace is in mainline for this to work out-of-the-box across distros. Unclear how exactly SystemTap userspace probing would work out. Does anyone have experience or want to try this?

Stefan
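One common C idiom for the "declare a trace event in one place only" wish above is an X-macro list. The sketch below is purely illustrative and was not proposed in the thread: the event names and the helper `trace_event_name` are hypothetical, and a real tracer would also need per-event argument formats.

```c
#include <stdio.h>

/*
 * Single source of truth for trace events (hypothetical names).
 * Adding an event means adding one line here and nothing else.
 */
#define TRACE_EVENTS(X) \
    X(BLOCK_READ)       \
    X(BLOCK_WRITE)      \
    X(VIRTIO_KICK)

/* Expand the list once into enum constants... */
typedef enum {
#define X(name) TRACE_##name,
    TRACE_EVENTS(X)
#undef X
    TRACE_EVENT_COUNT
} TraceEvent;

/* ...and once into a name table used for pretty-printing. */
static const char *const trace_event_names[] = {
#define X(name) #name,
    TRACE_EVENTS(X)
#undef X
};

/* Map an event number back to its name; out-of-range values are reported. */
static const char *trace_event_name(TraceEvent ev)
{
    return (unsigned int)ev < TRACE_EVENT_COUNT ? trace_event_names[ev]
                                                : "UNKNOWN";
}
```

Because the enum and the name table are generated from the same list, they cannot drift apart, which is the kind of friction a separately maintained pretty-printer table introduces.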
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
On 05/21/2010 11:52 AM, Jan Kiszka wrote: Anthony Liguori wrote: On 05/21/2010 08:46 AM, Jan Kiszka wrote: Anthony Liguori wrote: On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote: Trace events should be defined in trace.h. Events are written to /tmp/trace.log and can be formatted using trace.py. Remember to add events to trace.py for pretty-printing. Signed-off-by: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- Makefile.objs |2 +- trace.c | 64 + trace.h |9 trace.py | 30 ++ 4 files changed, 104 insertions(+), 1 deletions(-) create mode 100644 trace.c create mode 100644 trace.h create mode 100755 trace.py diff --git a/Makefile.objs b/Makefile.objs index acbaf22..307e989 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -8,7 +8,7 @@ qobject-obj-y += qerror.o # block-obj-y is code used by both qemu system emulation and qemu-img block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o -block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o +block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o block-obj-$(CONFIG_POSIX) += posix-aio-compat.o block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o diff --git a/trace.c b/trace.c new file mode 100644 index 000..2fec4d3 --- /dev/null +++ b/trace.c @@ -0,0 +1,64 @@ +#includestdlib.h +#includestdio.h +#include trace.h + +typedef struct { +unsigned long event; +unsigned long x1; +unsigned long x2; +unsigned long x3; +unsigned long x4; +unsigned long x5; +} TraceRecord; + +enum { +TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord), +}; + +static TraceRecord trace_buf[TRACE_BUF_LEN]; +static unsigned int trace_idx; +static FILE *trace_fp; + +static void trace(TraceEvent event, unsigned long x1, + unsigned long x2, unsigned long x3, + unsigned long x4, unsigned long x5) { +TraceRecord *rec =trace_buf[trace_idx]; +rec-event = event; +rec-x1 = x1; +rec-x2 = x2; +rec-x3 = x3; +rec-x4 = x4; +rec-x5 = x5; + +if (++trace_idx == TRACE_BUF_LEN) { +trace_idx = 0; + +if (!trace_fp) { +trace_fp = 
fopen(/tmp/trace.log, w); +} +if (trace_fp) { +size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp); +result = result; +} +} +} It is probably worth while to read trace points via the monitor or through some other mechanism. My concern would be that writing even 64k out to disk would introduce enough performance overhead mainly because it runs lock-step with the guest's VCPU. Maybe it's worth adding a thread that syncs the ring to disk if we want to write to disk? That's not what QEMU should worry about. If somehow possible, let's push this into the hands of a (user space) tracing framework, ideally one that is already designed for such requirements. E.g. there exists quite useful work in the context of LTTng (user space RCU for application tracing). From what I understand, none of the current kernel approaches to userspace tracing have much momentum at the moment. We may need simple stubs for the case that no such framework is (yet) available. But effort should focus on a QEMU infrastructure to add useful tracepoints to the code. Specifically when tracing over KVM, you usually need information about kernel states as well, so you depend on an integrated approach, not Yet Another Log File. I think the simple code that Stefan pasted gives us 95% of what we need. IMHO not 95%, but it is a start. I'm not opposed to using a framework, but I'd rather have an equivalent to kvm_stat tomorrow than wait 3 years for LTTng to not get merged. So let's have a dirt-simple tracing mechanism and focus on adding useful trace points. Then when we have a framework we can use, we can just convert the tracepoints to the new framework. Regards, Anthony Liguori I would just like to avoid that too much efforts are spent on re-inventing smart trace buffers, trace daemons, or trace visualization tools. Then better pick up some semi-perfect approach (e.g. [1], it unfortunately still seems to lack kernel integration) and drive it according to our needs. 
Jan

[1] http://lttng.org/ust
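The patch quoted in this message writes fixed-size binary records (an event number plus five unsigned long arguments) to the log. trace.py itself is not quoted here, so the following is only a minimal sketch, in C, of what a reader for that record layout could look like. The function name trace_dump and the output format are assumptions made for illustration; the reader must be built for the same architecture as the writer, since unsigned long varies in size.

```c
#include <assert.h>
#include <stdio.h>

/* Mirrors the TraceRecord layout from the quoted patch: six unsigned longs. */
typedef struct {
    unsigned long event;
    unsigned long x1;
    unsigned long x2;
    unsigned long x3;
    unsigned long x4;
    unsigned long x5;
} TraceRecord;

/*
 * Hypothetical helper: read records from `in` and print one text line per
 * record to `out`. Returns the number of complete records decoded; a
 * trailing partial record is ignored.
 */
static size_t trace_dump(FILE *in, FILE *out)
{
    TraceRecord rec;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (fread(&rec, sizeof rec, 1, in) == 1) {
        fprintf(out, "event=%lu x1=%#lx x2=%#lx x3=%#lx x4=%#lx x5=%#lx\n",
                rec.event, rec.x1, rec.x2, rec.x3, rec.x4, rec.x5);
        n++;
    }
    return n;
}
```

A caller would open the log with fopen("/tmp/trace.log", "rb") and pass stdout as the second argument. Note that the patch as quoted only writes when the buffer fills, so records still in memory at exit would not appear in the file.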
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
On Fri, May 21, 2010 at 09:49:56PM +0100, Stefan Hajnoczi wrote:

http://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps

Requires kernel support - not sure if enough of utrace is in mainline for this to work out-of-the-box across distros.

Nothing of utrace is in mainline, never mind the whole systemtap code, which is intentionally kept out of the kernel tree. Using this means that for every probe in userspace code you need to keep the configured source tree of the currently running kernel around, which is completely unusable for typical developer setups.
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
Stefan Hajnoczi wrote:
On Fri, May 21, 2010 at 5:52 PM, Jan Kiszka jan.kis...@siemens.com wrote:

I would just like to avoid that too much effort is spent on re-inventing smart trace buffers, trace daemons, or trace visualization tools. Then better pick up some semi-perfect approach (e.g. [1], it unfortunately still seems to lack kernel integration) and drive it according to our needs.

I agree we have to consider existing solutions. The killer is the usability: what dependencies are required to build with tracing? Is a patched kernel or module required? How easy is it to add static trace events during debugging?

If there are too many dependencies, especially to unpackaged software, many people will stop right there and not bother. A patched kernel or module isn't acceptable since the hassle of reconfiguring a system for tracing becomes too great (or in some cases changing the kernel is not possible/allowed).

Adding new static trace events should be easy, too. Ideally it doesn't require adding information about the trace event in multiple places (header files, C files, etc). It also shouldn't require learning about the tracing system; adding a trace event should be self-explanatory so anyone can easily add one for debugging.

A lot of opinions there, but what I'm saying is that friction must be low. If the tracing system is a pain to use, then no-one will use it.

No question. I mentioned LTTng as it is most promising wrt performance (both when enabled and disabled). But LTTng was so far not best in class when it came to usability.

http://lttng.org/files/ust/manual/ust.html

LTTng Userspace Tracer looks interesting - no kernel support required AFAICT. Toggling trace events in a running process supported. Similar to kernel tracepoint.h and existing report/visualization tool.

x86 (32- and 64-bit) only.

Sure? I thought there might be an arch dependency due to urcu but it has generic support as well now.

Like you say, no correlation with kernel trace data.

It would be good if we could still hook into tracepoints and stream them out differently. That would allow for ad-hoc tracing when performance does not matter that much (trace to file, trace to kernel). But we would still benefit from enabling tracepoints during runtime and keeping them built in.

Jan
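To make the wishes in this message concrete (declare an event in one place, keep tracepoints built in, toggle them at runtime, and route them to different outputs), here is a minimal sketch. It is not taken from Stefan's patch or from LTTng; the event names, the TRACE_EVENTS list macro, and the backend functions are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Single place where trace events are declared (names are hypothetical). */
#define TRACE_EVENTS(X) \
    X(BLOCK_READ)       \
    X(BLOCK_WRITE)      \
    X(IRQ_RAISE)

/* Expand the list into an enum of event ids ... */
#define X_ENUM(name) TRACE_##name,
typedef enum { TRACE_EVENTS(X_ENUM) TRACE_EVENT_COUNT } TraceEvent;
#undef X_ENUM

/* ... and into a matching name table for pretty-printing. */
#define X_NAME(name) #name,
static const char *const trace_names[] = { TRACE_EVENTS(X_NAME) };
#undef X_NAME

/* A backend decides where an enabled event goes (file, ring buffer, ...). */
typedef void (*TraceBackend)(TraceEvent ev, unsigned long x1, unsigned long x2);

static bool trace_enabled[TRACE_EVENT_COUNT]; /* all disabled at start */
static TraceBackend trace_backend;
static unsigned long trace_hits;              /* used by the counting backend */

static void trace_set_backend(TraceBackend b) { trace_backend = b; }

static void trace_set_enabled(TraceEvent ev, bool on)
{
    assert(ev < TRACE_EVENT_COUNT);
    trace_enabled[ev] = on;
}

/* The tracepoint stays compiled in; when disabled it costs one branch. */
static void trace(TraceEvent ev, unsigned long x1, unsigned long x2)
{
    if (trace_enabled[ev] && trace_backend) {
        trace_backend(ev, x1, x2);
    }
}

/* Example backend: one text line per event on stderr. */
static void trace_backend_stderr(TraceEvent ev, unsigned long x1, unsigned long x2)
{
    fprintf(stderr, "%s %#lx %#lx\n", trace_names[ev], x1, x2);
}

/* Example backend: only count events, e.g. for cheap statistics. */
static void trace_backend_count(TraceEvent ev, unsigned long x1, unsigned long x2)
{
    (void)ev; (void)x1; (void)x2;
    trace_hits++;
}
```

Adding an event then means adding one X(...) line. A monitor command could call trace_set_enabled() on a running process, and swapping the backend would redirect the same tracepoints to a file or another tracer without touching the call sites.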
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
Anthony Liguori wrote:

I'm not opposed to using a framework, but I'd rather have an equivalent to kvm_stat tomorrow than wait 3 years for LTTng to not get merged. So let's have a dirt-simple tracing mechanism and focus on adding useful trace points. Then when we have a framework we can use, we can just convert the tracepoints to the new framework.

That could mean serializing the tracepoints to strings and dumping them to our log file - no concerns.

Jan
Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support
On 05/21/2010 04:41 PM, Jan Kiszka wrote:
Anthony Liguori wrote:

I'm not opposed to using a framework, but I'd rather have an equivalent to kvm_stat tomorrow than wait 3 years for LTTng to not get merged. So let's have a dirt-simple tracing mechanism and focus on adding useful trace points. Then when we have a framework we can use, we can just convert the tracepoints to the new framework.

That could mean serializing the tracepoints to strings and dumping them to our log file - no concerns.

Which I really don't mind.

Regards,
Anthony Liguori

Jan
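As a rough illustration of the "serialize the tracepoints to strings" option agreed on here, the sketch below formats one tracepoint as a single text line. The helper name trace_format and the line layout are assumptions for the example, not an existing QEMU interface.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical helper: format one tracepoint as "name: <formatted args>\n"
 * into buf. Returns the number of bytes written (excluding the terminating
 * NUL), or -1 if the line does not fit into `size` bytes.
 */
static int trace_format(char *buf, size_t size, const char *name,
                        const char *fmt, ...)
{
    va_list ap;
    int n, m;

    assert(buf != NULL && name != NULL && fmt != NULL);

    n = snprintf(buf, size, "%s: ", name);
    if (n < 0 || (size_t)n >= size) {
        return -1;
    }

    va_start(ap, fmt);
    m = vsnprintf(buf + n, size - (size_t)n, fmt, ap);
    va_end(ap);

    /* Need room for the arguments plus '\n' and the terminating NUL. */
    if (m < 0 || (size_t)m >= size - (size_t)n - 1) {
        return -1;
    }

    buf[n + m] = '\n';
    buf[n + m + 1] = '\0';
    return n + m + 1;
}
```

The resulting line can be appended to the log with fputs(). Text output is slower than dumping binary records, but it needs no separate pretty-printer and can be read or grepped directly.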
Re: [PATCH 0/7] Consolidate vcpu ioctl locking
On 15.05.2010 10:26, Alexander Graf wrote:

On S390, I'm also still sceptical whether the implementation we have really works. A device injects an S390_INTERRUPT with its address and on the next vcpu_run, a corresponding interrupt is issued. But what happens if two devices trigger an S390_INTERRUPT before the vcpu_run? We'd have lost an interrupt by then...

We're safe on that: the interrupt info fields in both struct kvm (for floating interrupts) and struct vcpu (for cpu local interrupts) have their own locking and can queue up interrupts.

cheers,
Carsten
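The point of the reply is that pending interrupts are queued under a lock rather than stored in a single slot, so two injections before the next vcpu_run are both delivered. The following user-space sketch shows that idea only; it is not the KVM s390 code, and the IrqQueue type, its functions, and the fixed capacity are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define IRQ_QUEUE_LEN 16  /* arbitrary capacity for this sketch */

typedef struct {
    atomic_flag lock;                     /* simple spinlock guarding the queue */
    unsigned int pending[IRQ_QUEUE_LEN];  /* ring of queued interrupt numbers */
    size_t head, count;
} IrqQueue;

static void irq_queue_init(IrqQueue *q)
{
    assert(q != NULL);
    atomic_flag_clear(&q->lock);
    q->head = q->count = 0;
}

static void irq_lock(IrqQueue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) {
        /* spin until the holder releases the lock */
    }
}

static void irq_unlock(IrqQueue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Called by a device: append to the queue instead of overwriting a slot. */
static bool irq_inject(IrqQueue *q, unsigned int irq)
{
    bool ok = false;

    irq_lock(q);
    if (q->count < IRQ_QUEUE_LEN) {
        q->pending[(q->head + q->count) % IRQ_QUEUE_LEN] = irq;
        q->count++;
        ok = true;
    }
    irq_unlock(q);
    return ok;
}

/* Called on the next vcpu run: take the oldest pending interrupt, if any. */
static bool irq_deliver(IrqQueue *q, unsigned int *irq)
{
    bool ok = false;

    irq_lock(q);
    if (q->count > 0) {
        *irq = q->pending[q->head];
        q->head = (q->head + 1) % IRQ_QUEUE_LEN;
        q->count--;
        ok = true;
    }
    irq_unlock(q);
    return ok;
}
```

With this structure, two irq_inject() calls followed by two irq_deliver() calls return both interrupts in order, which is the behaviour the question was concerned about.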