Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-21 Thread MORITA Kazutaka
At Fri, 21 May 2010 06:28:42 +0100,
Stefan Hajnoczi wrote:
 
 On Thu, May 20, 2010 at 11:16 PM, Christian Brunner c...@muc.de wrote:
  2010/5/20 Anthony Liguori anth...@codemonkey.ws:
  Both sheepdog and ceph ultimately transmit I/O over a socket to a central
  daemon, right?  So could we not standardize a protocol for this that both
  sheepdog and ceph could implement?
 
  There is no central daemon. The concept is that they talk to many
  storage nodes at the same time. Data is distributed and replicated
  over many nodes in the network. The mechanism to do this is quite
  complex. I don't know about sheepdog, but in Ceph this is called RADOS
  (reliable autonomic distributed object store). Sheepdog and Ceph may
  look similar, but this is where they act different. I don't think that
  it would be possible to implement a common protocol.
 
 I believe Sheepdog has a local daemon on each node.  The QEMU storage
 backend talks to the daemon on the same node, which then does the real
 network communication with the rest of the distributed storage system.

Yes.  That is because Sheepdog has no configuration for cluster
membership, as I mentioned in another mail, so the driver doesn't
know which node to access other than localhost.

  So I think we're not talking about a network protocol here, we're
 talking about a common interface that can be used by QEMU and other
 programs to take advantage of Ceph, Sheepdog, etc services available
 on the local node.
 
 Haven't looked into your patch enough yet, but does librados talk
 directly over the network or does it connect to a local daemon/driver?
 

AFAIK, librados accesses the network directly, so I think it is
difficult to define a common interface.


Thanks,

Kazutaka

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][v3] KVM: VMX: Enable XSAVE/XRSTORE for guest

2010-05-21 Thread Sheng Yang
On Thursday 20 May 2010 17:46:40 Avi Kivity wrote:
 On 05/20/2010 12:16 PM, Sheng Yang wrote:
  From: Dexuan Cui dexuan@intel.com
  
  Enable XSAVE/XRSTORE for guest.
  
  Change from V2:
  Addressed comments from Avi.
  
  Change from V1:
  
  1. Use FPU API.
  2. Fix CPUID issue.
  3. Save/restore all possible guest xstate fields when switching, because
  we don't know which fields the guest has already touched.
  
  
  diff --git a/arch/x86/include/asm/kvm_host.h
  b/arch/x86/include/asm/kvm_host.h index d08bb4a..3938bd1 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -302,6 +302,7 @@ struct kvm_vcpu_arch {
  
  } update_pte;
  
  struct fpu guest_fpu;
  
  +   u64 xcr0;
  
  gva_t mmio_fault_cr2;
  struct kvm_pio_request pio;
  
  diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
  index 9e6779f..346ea66 100644
  --- a/arch/x86/include/asm/vmx.h
  +++ b/arch/x86/include/asm/vmx.h
  @@ -266,6 +266,7 @@ enum vmcs_field {
  
#define EXIT_REASON_EPT_VIOLATION   48
#define EXIT_REASON_EPT_MISCONFIG   49
#define EXIT_REASON_WBINVD          54
  
  +#define EXIT_REASON_XSETBV 55
  
/*

 * Interruption-information format
  
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 99ae513..a63f206 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -36,6 +36,8 @@
  
#include <asm/vmx.h>
#include <asm/virtext.h>
#include <asm/mce.h>
  
  +#include <asm/i387.h>
  +#include <asm/xcr.h>
  
#include "trace.h"
  
  @@ -247,6 +249,9 @@ static const u32 vmx_msr_index[] = {
  
};
#define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index)
  
  +#define MERGE_TO_U64(low, high) \
  +   (((low) & -1u) | ((u64)((high) & -1u) << 32))
  +
 
 static inline u64 kvm_read_edx_eax(vcpu) in cache_regs.h
 
  +static int handle_xsetbv(struct kvm_vcpu *vcpu)
  +{
  +   u64 new_bv = MERGE_TO_U64(kvm_register_read(vcpu, VCPU_REGS_RAX),
  +   kvm_register_read(vcpu, VCPU_REGS_RDX));
  +
  +   if (kvm_register_read(vcpu, VCPU_REGS_RCX) != 0)
  +   goto err;
  +   if (vmx_get_cpl(vcpu) != 0)
  +   goto err;
  +   if (!(new_bv & XSTATE_FP))
  +   goto err;
  +   if ((new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE))
  +   goto err;
 
 What about a check against unknown bits?
 
  +   vcpu->arch.xcr0 = new_bv;
  +   xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
  +   skip_emulated_instruction(vcpu);
  +   return 1;
  +err:
  +   kvm_inject_gp(vcpu, 0);
  +   return 1;
  +}
  +
  
static int handle_apic_access(struct kvm_vcpu *vcpu)
{

  return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
  
  +static u64 host_xcr0;
 
 __read_mostly.
 
  +
  +static void update_cpuid(struct kvm_vcpu *vcpu)
  +{
  +   struct kvm_cpuid_entry2 *best;
  +
  +   best = kvm_find_cpuid_entry(vcpu, 1, 0);
  +   if (!best)
  +   return;
  +
  +   /* Update OSXSAVE bit */
  +   if (cpu_has_xsave && best->function == 0x1) {
  +   best->ecx &= ~(bit(X86_FEATURE_OSXSAVE));
  +   if (kvm_read_cr4(vcpu) & X86_CR4_OSXSAVE)
  +   best->ecx |= bit(X86_FEATURE_OSXSAVE);
  +   }
  +}
 
 Note: need to update after userspace writes cpuid as well.

I don't quite understand. An OSXSAVE bit set by userspace should be trimmed, IMO...
 
  +
  
int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{

  unsigned long old_cr4 = kvm_read_cr4(vcpu);
  
  @@ -481,6 +513,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned
  long cr4)
  
  if (cr4 & CR4_RESERVED_BITS)
  
  return 1;
  
  +   if (!guest_cpuid_has_xsave(vcpu) && (cr4 & X86_CR4_OSXSAVE))
  +   return 1;
  +
  
  if (is_long_mode(vcpu)) {
  
  if (!(cr4 & X86_CR4_PAE))
  
  return 1;
  
  @@ -497,6 +532,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned
  long cr4)
  
  if ((cr4 ^ old_cr4) & pdptr_bits)
  
  kvm_mmu_reset_context(vcpu);
  
  +   if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
  +   update_cpuid(vcpu);
  +
 
 I think we need to reload the guest's xcr0 at this point.
 Alternatively, call vmx_load_host_state() to ensure the next entry
 will reload it.

The current xcr0 will be loaded at the next vmentry.

And if we use prepare_guest_switch(), how about SVM?

 
  @@ -1931,7 +1964,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2
  *entry, u32 function,
  
  switch (function) {
  
  case 0:
  -   entry->eax = min(entry->eax, (u32)0xb);
  +   entry->eax = min(entry->eax, (u32)0xd);
 
 Do we need any special handling for leaf 0xc?

I don't think so. CPUID would return all zeros for it.
 
  @@ -4567,6 +4616,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  
  kvm_x86_ops->prepare_guest_switch(vcpu);
  if (vcpu->fpu_active)
  
  kvm_load_guest_fpu(vcpu);
  
  +   if (kvm_read_cr4(vcpu) & X86_CR4_OSXSAVE)
  +   
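
As an illustration of the XSETBV checks in handle_xsetbv() above, including
the check against unknown bits asked about in review, here is a minimal
user-space sketch. The helper names merge_to_u64() and xcr0_is_valid(), and
the `supported` mask parameter, are assumptions made for this example and are
not part of the patch; only the XCR0 bit positions (x87 = bit 0, SSE = bit 1,
AVX/YMM = bit 2) come from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* XCR0 feature bits as defined by the x86 architecture. */
#define XSTATE_FP  (1ULL << 0)  /* x87 state */
#define XSTATE_SSE (1ULL << 1)  /* SSE state */
#define XSTATE_YMM (1ULL << 2)  /* AVX (YMM) state */

/* Combine the EDX:EAX register pair into one 64-bit value. */
static inline uint64_t merge_to_u64(uint64_t low, uint64_t high)
{
    return (low & 0xffffffffULL) | ((high & 0xffffffffULL) << 32);
}

/*
 * Hypothetical helper: returns true if new_bv would be acceptable as XCR0.
 * `supported` is an assumed mask of the bits the host is willing to expose.
 */
static inline bool xcr0_is_valid(uint64_t new_bv, uint64_t supported)
{
    if (!(new_bv & XSTATE_FP))
        return false;   /* the x87 bit must always be set */
    if ((new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE))
        return false;   /* YMM state requires SSE state */
    if (new_bv & ~supported)
        return false;   /* reject unknown or unsupported bits */
    return true;
}
```

In this sketch a failed check corresponds to the `goto err` path, where the
patch injects #GP into the guest.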

Re: [PATCH 0/7] Consolidate vcpu ioctl locking

2010-05-21 Thread Carsten Otte

On 15.05.2010 10:26, Alexander Graf wrote:

On S390, I'm also still sceptical whether the implementation we have really
works. A device injects an S390_INTERRUPT with its address, and on the next
vcpu_run a corresponding interrupt is issued. But what happens if two devices
trigger an S390_INTERRUPT before the vcpu_run? We'd have lost an interrupt by
then...

We're safe on that: the interrupt info fields in both struct kvm (for
floating interrupts) and struct vcpu (for cpu local interrupts) have
their own locking and can queue up interrupts.
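
To illustrate why two injections before one vcpu_run need not lose an
interrupt, here is a small user-space sketch of a locked queue. It is not the
s390 KVM code: struct irq_queue, irq_inject() and irq_deliver() are
hypothetical names, and a pthread mutex stands in for the kernel locking.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IRQ_QUEUE_LEN 16

/* Hypothetical per-vcpu (or per-VM) pending interrupt queue. */
struct irq_queue {
    pthread_mutex_t lock;
    uint32_t pending[IRQ_QUEUE_LEN];
    size_t head, count;
};

static inline void irq_queue_init(struct irq_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = 0;
    q->count = 0;
}

/* Called by a device: append an interrupt; returns false if the queue is full. */
static inline bool irq_inject(struct irq_queue *q, uint32_t addr)
{
    bool ok = false;

    pthread_mutex_lock(&q->lock);
    if (q->count < IRQ_QUEUE_LEN) {
        q->pending[(q->head + q->count) % IRQ_QUEUE_LEN] = addr;
        q->count++;
        ok = true;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Called on vcpu entry: take the oldest pending interrupt, if any. */
static inline bool irq_deliver(struct irq_queue *q, uint32_t *addr)
{
    bool ok = false;

    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        *addr = q->pending[q->head];
        q->head = (q->head + 1) % IRQ_QUEUE_LEN;
        q->count--;
        ok = true;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}
```

Because injections are appended under the lock rather than overwriting a
single slot, both interrupts are still pending when the vcpu runs next.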


cheers,
Carsten


Re: [PATCH][v3] KVM: VMX: Enable XSAVE/XRSTORE for guest

2010-05-21 Thread Avi Kivity

On 05/21/2010 10:26 AM, Sheng Yang wrote:



+
+static void update_cpuid(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 1, 0);
+   if (!best)
+   return;
+
+   /* Update OSXSAVE bit */
+   if (cpu_has_xsave && best->function == 0x1) {
+   best->ecx &= ~(bit(X86_FEATURE_OSXSAVE));
+   if (kvm_read_cr4(vcpu) & X86_CR4_OSXSAVE)
+   best->ecx |= bit(X86_FEATURE_OSXSAVE);
+   }
+}
   

Note: need to update after userspace writes cpuid as well.
 

I don't quite understand. An OSXSAVE bit set by userspace should be trimmed, IMO...
   


Two cases: userspace does KVM_SET_CPUID2 with osxsave set but cr4.xsave 
clear, or the other way round.


So we should set cpuid.osxsave depending on cr4.xsave whenever cr4 OR 
cpuid is modified, and completely ignore the userspace setting for that bit.
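
A minimal sketch of that rule, assuming a hypothetical helper sync_osxsave()
that would be called from both the CR4 write path and the CPUID update path.
Only the bit positions (CR4.OSXSAVE is bit 18, CPUID.1:ECX.OSXSAVE is bit 27)
are architectural; the helper itself is illustrative.

```c
#include <stdint.h>

#define X86_CR4_OSXSAVE     (1UL << 18)  /* CR4.OSXSAVE */
#define CPUID_1_ECX_OSXSAVE (1U << 27)   /* CPUID.1:ECX.OSXSAVE */

/*
 * Hypothetical helper: derive the guest-visible OSXSAVE bit purely from CR4,
 * discarding whatever userspace put into that bit of ECX.
 */
static inline uint32_t sync_osxsave(uint32_t ecx, unsigned long cr4)
{
    ecx &= ~CPUID_1_ECX_OSXSAVE;
    if (cr4 & X86_CR4_OSXSAVE)
        ecx |= CPUID_1_ECX_OSXSAVE;
    return ecx;
}
```

Both cases mentioned above come out the same way: the bit in ECX always
mirrors CR4, whichever of the two was written last.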



@@ -497,6 +532,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned
long cr4)

if ((cr4 ^ old_cr4) & pdptr_bits)

kvm_mmu_reset_context(vcpu);

+   if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
+   update_cpuid(vcpu);
+
   

I think we need to reload the guest's xcr0 at this point.
Alternatively, call vmx_load_host_state() to ensure the next entry
will reload it.
 

The current xcr0 will be loaded at the next vmentry.
   


True.


And if we use prepare_guest_switch(), how about SVM?
   


kvm_arch_vcpu_load() looks like a good place, as long as interrupts 
don't use the fpu.





@@ -5134,6 +5197,10 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)

vcpu->guest_fpu_loaded = 1;
unlazy_fpu(current);

+   /* Restore all possible states in the guest */
+   if (cpu_has_xsave && guest_cpuid_has_xsave(vcpu))
+   xsetbv(XCR_XFEATURE_ENABLED_MASK,
+   cpuid_get_possible_xcr0(vcpu));
   

Best to calculate it out of the fast path, when guest cpuid is set.
Need to check it at this time as well.
 

You mean guest_cpuid_has_xsave()? I don't quite understand the point here...
   


Also cpuid_get_possible_xcr0().  So we have something like

   if (vcpu->save_xcr0)
       xsetbv(vcpu->save_xcr0);

Those cpuid functions have loops, we don't want them running every 
context switch.
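
One way to picture this: do the table walk once when the guest CPUID is set,
and leave only a comparison in the entry path. The sketch below is user-space
C with hypothetical names (struct cpuid_entry, struct vcpu_sketch,
set_guest_cpuid(), need_xcr0_switch()); it relies only on the architectural
fact that CPUID leaf 0xD, sub-leaf 0 reports the valid XCR0 bits in EDX:EAX.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a guest CPUID table entry. */
struct cpuid_entry {
    uint32_t function, index, eax, edx;
};

/* Hypothetical per-vcpu state for this sketch. */
struct vcpu_sketch {
    uint64_t guest_xcr0_mask;  /* cached when the guest CPUID is set */
    uint64_t host_xcr0;
};

/* Slow path: scan the table for leaf 0xD, sub-leaf 0. */
static inline uint64_t possible_xcr0(const struct cpuid_entry *e, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (e[i].function == 0xd && e[i].index == 0)
            return ((uint64_t)e[i].edx << 32) | e[i].eax;
    return 0;  /* guest has no XSAVE leaf */
}

/* Runs rarely: whenever userspace installs a new CPUID table. */
static inline void set_guest_cpuid(struct vcpu_sketch *v,
                                   const struct cpuid_entry *e, size_t n)
{
    v->guest_xcr0_mask = possible_xcr0(e, n);
}

/* Fast path: no loops, just two comparisons on cached values. */
static inline bool need_xcr0_switch(const struct vcpu_sketch *v)
{
    return v->guest_xcr0_mask != 0 && v->guest_xcr0_mask != v->host_xcr0;
}
```

The second comparison also covers the case where the guest and host values
are equal, in which the write can be skipped entirely.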



Also can avoid it if guest xcr0 == host xcr0.
 

I don't know whether we can assume that the host uses all possible xcr0 bits. If
so, using only host_xcr0 should be fine.
   


I think we can rely on it.  Those bits are a service to userspace and 
the guest is just a different kind of userspace, so it makes sense to 
expose the same set.



I will update the other points. Thanks.
   


Thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



[RFC][PATCH v6 01/19] Add a new structure for skb buffer from external.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..cf309c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,18 @@ struct skb_shared_info {
void *  destructor_arg;
 };
 
+/* The structure is for a skb whose skb->data may point to
+ * an external buffer, which is not allocated from kernel space.
+ * Since the buffer is external, the shinfo and frags are
+ * external too. It also contains a destructor for itself.
+ */
+struct skb_external_page {
+   u8  *start;
+   int size;
+   struct skb_frag_struct *frags;
+   struct skb_shared_info *ushinfo;
+   void(*dtor)(struct skb_external_page *);
+};
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.5.4.4
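
To show how a structure like skb_external_page can carry its own destructor,
here is a small user-space sketch. struct ext_page_sketch,
ext_page_release() and owner_dtor() are hypothetical names used only for
illustration; the real structure also carries the frags and shared-info
pointers shown in the patch.

```c
#include <stddef.h>

/* Simplified stand-in for an externally owned buffer descriptor. */
struct ext_page_sketch {
    unsigned char *start;                    /* start of the external buffer */
    int size;                                /* buffer size in bytes */
    void (*dtor)(struct ext_page_sketch *);  /* owner's release callback */
};

static int owner_releases;  /* counts how often the owner got a buffer back */

/* Example owner callback: hand the buffer back and drop the pointer. */
static void owner_dtor(struct ext_page_sketch *p)
{
    owner_releases++;
    p->start = NULL;
}

/* Release path: only call the destructor when one is installed. */
static inline void ext_page_release(struct ext_page_sketch *ext)
{
    if (ext && ext->dtor)
        ext->dtor(ext);
}
```

The point of the callback is that the networking core does not need to know
how the owner allocated the memory; it only reports that it is done with it.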



[RFC][PATCH v6 04/19] Add a ndo_mp_port_prep pointer to net_device_ops.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

If the driver wants to allocate external buffers, it can
export its capabilities, such as the skb buffer header
length, the page length that can be DMA-mapped, etc.
The owner of the external buffers may make use of this.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index efb575a..183c786 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -707,6 +707,10 @@ struct net_device_ops {
int (*ndo_fcoe_get_wwn)(struct net_device *dev,
u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+   int (*ndo_mp_port_prep)(struct net_device *dev,
+   struct mpassthru_port *port);
+#endif
 };
 
 /*
-- 
1.5.4.4



[RFC][PATCH v6 07/19] Add interface to get external buffers.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |   12 
 net/core/skbuff.c  |   16 
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cf309c9..281a1c0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1519,6 +1519,18 @@ static inline void netdev_free_page(struct net_device 
*dev, struct page *page)
__free_page(page);
 }
 
+extern struct skb_external_page *netdev_alloc_external_pages(
+   struct net_device *dev,
+   struct sk_buff *skb, int npages);
+
+static inline struct skb_external_page *netdev_alloc_external_page(
+   struct net_device *dev,
+   struct sk_buff *skb, unsigned int size)
+{
+   return netdev_alloc_external_pages(dev, skb,
+  DIV_ROUND_UP(size, PAGE_SIZE));
+}
+
 /**
  * skb_clone_writable - is the header of a clone writable
  * @skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..fbdb1f1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -278,6 +278,22 @@ struct page *__netdev_alloc_page(struct net_device *dev, 
gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+struct skb_external_page *netdev_alloc_external_pages(struct net_device *dev,
+   struct sk_buff *skb, int npages)
+{
+   struct mpassthru_port *port;
+   struct skb_external_page *ext_page = NULL;
+
+   port = rcu_dereference(dev->mp_port);
+   if (!port)
+   goto out;
+   WARN_ON(npages > port->npages);
+   ext_page = port->ctor(port, skb, npages);
+out:
+   return ext_page;
+}
+EXPORT_SYMBOL(netdev_alloc_external_pages);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
int size)
 {
-- 
1.5.4.4
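
The size-to-pages conversion done by netdev_alloc_external_page() is a plain
round-up division. A minimal sketch, assuming a 4096-byte page (the real
PAGE_SIZE is architecture dependent) and a hypothetical helper name
pages_for():

```c
/* Same definition as the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Assumed page size for this sketch only. */
#define PAGE_SIZE_SKETCH 4096u

/* Number of whole pages needed to hold `size` bytes. */
static inline unsigned int pages_for(unsigned int size)
{
    return DIV_ROUND_UP(size, PAGE_SIZE_SKETCH);
}
```

A request for exactly one page stays at one page, while one byte more rounds
up to two.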



[RFC][PATCH v6 06/19] Add a function to indicate if device use external buffer.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 31d9c4a..0cb78f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern void netdev_mp_port_detach(struct net_device 
*dev);
 extern int netdev_mp_port_prep(struct net_device *dev,
struct mpassthru_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+   return (dev && dev->mp_port);
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
kfree_skb(napi->skb);
-- 
1.5.4.4



[RFC][PATCH v6 11/19] Use callback to deal with skb_release_data() specially.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

If buffer is external, then use the callback to destruct
buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37587f0..418457c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 
 static void skb_release_data(struct sk_buff *skb)
 {
+   /* Check whether the skb has external buffers; destructor_arg is
+* used here to indicate that.
+*/
+   struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
+
if (!skb->cloned ||
!atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
   &skb_shinfo(skb)->dataref)) {
@@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
if (skb_has_frags(skb))
skb_drop_fraglist(skb);
 
+   /* If the skb has external buffers, call the destructor here,
+* since skb->head is kfree()d right after this and a skb->head
+* from an external buffer cannot be destroyed with kfree().
+*/
+   if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
+   ext_page->dtor(ext_page);
kfree(skb->head);
}
 }
-- 
1.5.4.4
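
The expression passed to atomic_sub_return() in skb_release_data() follows
from the split dataref counter: the upper 16 bits count references to the
payload only, the lower 16 bits count references to the whole of skb->data.
A user-space sketch of that arithmetic, with hypothetical helper names and
DATAREF_SHIFT assumed to be 16 as the comment in skbuff.h describes:

```c
#include <stdbool.h>
#include <stdint.h>

#define DATAREF_SHIFT 16
#define DATAREF_MASK  ((1u << DATAREF_SHIFT) - 1)

/* Amount one reference contributes to (and later removes from) dataref. */
static inline uint32_t dataref_unit(bool nohdr)
{
    /* A header-less reference counts in both halves. */
    return nohdr ? (1u << DATAREF_SHIFT) + 1 : 1;
}

/* Drop one reference and return the new counter value. */
static inline uint32_t dataref_drop(uint32_t dataref, bool nohdr)
{
    return dataref - dataref_unit(nohdr);
}

/* References to the entire data area (lower half). */
static inline uint32_t dataref_total(uint32_t dataref)
{
    return dataref & DATAREF_MASK;
}

/* References to the payload only (upper half). */
static inline uint32_t dataref_payload_only(uint32_t dataref)
{
    return dataref >> DATAREF_SHIFT;
}
```

The data is freed only when the subtraction brings the whole counter to zero,
which is the point at which the patch calls the external destructor.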



[RFC][PATCH v6 13/19] To skip GRO if buffer is external currently.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index dc2f225..6c6b2fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2787,6 +2787,10 @@ enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff *skb)
if (skb_is_gso(skb) || skb_has_frags(skb))
goto normal;
 
+   /* currently GRO is not supported by mediate passthru */
+   if (dev_is_mpassthru(skb->dev))
+   goto normal;
+
rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {
if (ptype->type != type || ptype->dev || !ptype->gro_receive)
-- 
1.5.4.4



[RFC][PATCH v6 14/19] Add header file for mp device.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/mpassthru.h |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV  _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.5.4.4



[RFC][PATCH v6 18/19] Add a kconfig entry and make entry for mp device.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/Kconfig  |   10 ++
 drivers/vhost/Makefile |2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+   tristate "mediate passthru network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ Zero-copy network I/O support. We call it mediate passthru to
+ distinguish it from hardware passthru.
+
+ To compile this driver as a module, choose M here: the module will
+ be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.5.4.4



[PATCH v6 17/19] Export proto_ops to vhost-net driver.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Currently, vhost-net is the only user of the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/mpassthru.c |  330 -
 1 files changed, 325 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index de07f1e..d0df691 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -414,6 +414,11 @@ static void mp_put(struct mp_file *mfile)
mp_detach(mfile->mp);
 }
 
+static void iocb_tag(struct kiocb *iocb)
+{
+   iocb->ki_flags = 1;
+}
+
 /* The callback to destruct the external buffers or skb */
 static void page_dtor(struct skb_external_page *ext_page)
 {
@@ -449,7 +454,7 @@ static void page_dtor(struct skb_external_page *ext_page)
 * Queue the notifier to wake up the backend driver
 */
 
-   create_iocb(info, info->total);
+   iocb_tag(info->iocb);
 
sk = ctor->port.sock->sk;
sk->sk_write_space(sk);
@@ -569,8 +574,323 @@ failed:
return NULL;
 }
 
+static void mp_sock_destruct(struct sock *sk)
+{
+   struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+   kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+   if (sk_has_sleeper(sk))
+   wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+   if (sk_has_sleeper(sk))
+   wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+   struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+   struct page_ctor *ctor = NULL;
+   struct sk_buff *skb = NULL;
+   struct page_info *info = NULL;
+   struct ethhdr *eth;
+   struct kiocb *iocb = NULL;
+   int len, i;
+
+   struct virtio_net_hdr hdr = {
+   .flags = 0,
+   .gso_type = VIRTIO_NET_HDR_GSO_NONE
+   };
+
+   ctor = rcu_dereference(mp->ctor);
+   if (!ctor)
+   return;
+
+   while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+   if (skb_shinfo(skb)->destructor_arg) {
+   info = container_of(skb_shinfo(skb)->destructor_arg,
+   struct page_info, ext_page);
+   info->skb = skb;
+   if (skb->len > info->len) {
+   mp->dev->stats.rx_dropped++;
+   DBG(KERN_INFO "Discarded truncated rx packet: "
+"len %d > %zd\n", skb->len, info->len);
+   info->total = skb->len;
+   goto clean;
+   } else {
+   int i;
+   struct skb_shared_info *gshinfo =
+   (struct skb_shared_info *)
+   (&info->ushinfo);
+   struct skb_shared_info *hshinfo =
+   skb_shinfo(skb);
+
+   if (gshinfo->nr_frags < hshinfo->nr_frags)
+   goto clean;
+   eth = eth_hdr(skb);
+   skb_push(skb, ETH_HLEN);
+
+   hdr.hdr_len = skb_headlen(skb);
+   info->total = skb->len;
+
+   for (i = 0; i < gshinfo->nr_frags; i++)
+   gshinfo->frags[i].size = 0;
+   for (i = 0; i < hshinfo->nr_frags; i++)
+   gshinfo->frags[i].size =
+   hshinfo->frags[i].size;
+   }
+   } else {
+   /* The skb is composed of kernel buffers
+* in case external buffers are not sufficient.
+* The case should be rare.
+*/
+   unsigned long flags;
+   int i;
+   struct skb_shared_info *gshinfo = NULL;
+
+   info = NULL;
+
+   spin_lock_irqsave(&ctor->read_lock, flags);
+   if (!list_empty(&ctor->readq)) {
+   info = list_first_entry(&ctor->readq,
+   struct page_info, list);
+   list_del(&info->list);
+   }
+   spin_unlock_irqrestore(&ctor->read_lock, flags);
+   if (!info) {
+   DBG(KERN_INFO
+   "No external buffer available %p\n",
+   skb);
+   

[RFC][PATCH v6 19/19] Provides multiple submits and asynchronous notifications.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend currently supports only synchronous send/recv
operations. This patch adds multiple submits and asynchronous
notifications, which are needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  255 -
 drivers/vhost/vhost.c |  120 +--
 drivers/vhost/vhost.h |   14 +++
 3 files changed, 333 insertions(+), 56 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9777583..9a0d162 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -45,10 +47,13 @@ enum vhost_net_poll_state {
VHOST_NET_POLL_STOPPED = 2,
 };
 
+static struct kmem_cache *notify_cache;
+
 struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -93,11 +98,146 @@ static void tx_poll_start(struct vhost_net *net, struct 
socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   if (!list_empty(&vq->notifier)) {
+   iocb = list_first_entry(&vq->notifier,
+   struct kiocb, ki_list);
+   list_del(&iocb->ki_list);
+   }
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+   return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+   struct vhost_virtqueue *vq = iocb->private;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   list_add_tail(&iocb->ki_list, &vq->notifier);
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+   return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq,
+ struct socket *sock)
+{
+   struct kiocb *iocb = NULL;
+   struct vhost_log *vq_log = NULL;
+   int rx_total_len = 0;
+   unsigned int head, log, in, out;
+   int size;
+
+   if (!is_async_vq(vq))
+   return;
+
+   if (sock->sk->sk_data_ready)
+   sock->sk->sk_data_ready(sock->sk, 0);
+
+   vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+   vq->log : NULL;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, iocb->ki_nbytes);
+   size = iocb->ki_nbytes;
+   head = iocb->ki_pos;
+   rx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+   kmem_cache_free(net->cache, iocb);
+
+   /* when log is enabled, recomputing the log info is needed,
+* since these buffers are in async queue, and may not get
+* the log info before.
+*/
+   if (unlikely(vq_log)) {
+   if (!log)
+   __vhost_get_vq_desc(&net->dev, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in, vq_log,
+   &log, head);
+   vhost_log_write(vq, vq_log, log, size);
+   }
+   if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   struct list_head *entry, *tmp;
+   unsigned long flags;
+   int tx_total_len = 0;
+
+   if (!is_async_vq(vq))
+   return;
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   list_for_each_safe(entry, tmp, &vq->notifier) {
+   iocb = list_entry(entry,
+struct kiocb, ki_list);
+   if (!iocb->ki_flags)
+   continue;
+   list_del(&iocb->ki_list);
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, 0);
+   tx_total_len += iocb->ki_nbytes;
+
+   

[RFC][PATCH v6 16/19] Manipulate external buffers in mp device.

2010-05-21 Thread xiaohui . xin
From: Xin, Xiaohui xiaohui@intel.com

Where external buffers come from and how they are destroyed.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/mpassthru.c |  253 -
 1 files changed, 251 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 25e2f3e..de07f1e 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -161,6 +161,39 @@ static int mp_dev_change_flags(struct net_device *dev, 
unsigned flags)
return ret;
 }
 
+/* The main function to allocate external buffers */
+static struct skb_external_page *page_ctor(struct mpassthru_port *port,
+   struct sk_buff *skb, int npages)
+{
+   int i;
+   unsigned long flags;
+   struct page_ctor *ctor;
+   struct page_info *info = NULL;
+
+   ctor = container_of(port, struct page_ctor, port);
+
+   spin_lock_irqsave(&ctor->read_lock, flags);
+   if (!list_empty(&ctor->readq)) {
+   info = list_first_entry(&ctor->readq, struct page_info, list);
+   list_del(&info->list);
+   }
+   spin_unlock_irqrestore(&ctor->read_lock, flags);
+   if (!info)
+   return NULL;
+
+   for (i = 0; i < info->pnum; i++) {
+   get_page(info->pages[i]);
+   info->frag[i].page = info->pages[i];
+   info->frag[i].page_offset = i ? 0 : info->offset;
+   info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+   port->data_len;
+   }
+   info->skb = skb;
+   info->ext_page.frags = info->frag;
+   info->ext_page.ushinfo = &info->ushinfo;
+   return &info->ext_page;
+}
+
 static int page_ctor_attach(struct mp_struct *mp)
 {
int rc;
@@ -186,7 +219,7 @@ static int page_ctor_attach(struct mp_struct *mp)
 
dev_hold(dev);
ctor->dev = dev;
-   ctor->port.ctor = NULL;
+   ctor->port.ctor = page_ctor;
ctor->port.sock = &mp->socket;
ctor->lock_pages = 0;
rc = netdev_mp_port_attach(dev, &ctor->port);
@@ -252,11 +285,66 @@ static int set_memlock_rlimit(struct page_ctor *ctor, int 
resource,
return 0;
 }
 
+static void relinquish_resource(struct page_ctor *ctor)
+{
+   if (!(ctor->dev->flags & IFF_UP) &&
+   !(ctor->wq_len + ctor->rq_len))
+   printk(KERN_INFO "relinquish_resource\n");
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+   struct page_info *info = (struct page_info *)(iocb->private);
+   int i;
+
+   if (info->flags == INFO_READ) {
+   for (i = 0; i < info->pnum; i++) {
+   if (info->pages[i]) {
+   set_page_dirty_lock(info->pages[i]);
+   put_page(info->pages[i]);
+   }
+   }
+   info->skb->destructor = NULL;
+   kfree_skb(info->skb);
+   info->ctor->rq_len--;
+   } else
+   info->ctor->wq_len--;
+   /* Decrement the number of locked pages */
+   info->ctor->lock_pages -= info->pnum;
+   kmem_cache_free(ext_page_info_cache, info);
+   relinquish_resource(info->ctor);
+
+   return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+   struct kiocb *iocb = NULL;
+
+   iocb = info-iocb;
+   if (!iocb)
+   return iocb;
+   iocb-ki_flags = 0;
+   iocb-ki_users = 1;
+   iocb-ki_key = 0;
+   iocb-ki_ctx = NULL;
+   iocb-ki_cancel = NULL;
+   iocb-ki_retry = NULL;
+   iocb-ki_iovec = NULL;
+   iocb-ki_eventfd = NULL;
+   iocb-ki_pos = info-desc_pos;
+   iocb-ki_nbytes = size;
+   iocb-ki_dtor(iocb);
+   iocb-private = (void *)info;
+   iocb-ki_dtor = mp_ki_dtor;
+
+   return iocb;
+}
+
 static int page_ctor_detach(struct mp_struct *mp)
 {
    struct page_ctor *ctor;
    struct page_info *info;
-   struct kiocb *iocb = NULL;
    int i;
 
    /* locked by mp_mutex */
@@ -268,11 +356,17 @@ static int page_ctor_detach(struct mp_struct *mp)
        for (i = 0; i < info->pnum; i++)
            if (info->pages[i])
                put_page(info->pages[i]);
+       create_iocb(info, 0);
+       ctor->rq_len--;
        kmem_cache_free(ext_page_info_cache, info);
    }
+
+   relinquish_resource(ctor);
+
    set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
                       ctor->o_rlim.rlim_cur,
                       ctor->o_rlim.rlim_max);
+
    netdev_mp_port_detach(ctor->dev);
    dev_put(ctor->dev);
 
@@ -320,6 +414,161 @@ static void mp_put(struct mp_file *mfile)
        mp_detach(mfile->mp);
 }
 
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_external_page *ext_page)
+{
+   struct page_info *info;
+   struct 

[RFC][PATCH v6 15/19] Add basic funcs and ioctl to mp device.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The ioctl is used by the mp device to bind an underlying
NIC. It queries the hardware capability and declares that
the NIC uses external buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---

memory leak fixed,
kconfig made,
do_unbind() made,
mp_chr_ioctl() cleanup

by Jeff Dike jd...@linux.intel.com

 drivers/vhost/mpassthru.c |  681 +
 1 files changed, 681 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..25e2f3e
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,681 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_info {
+   struct list_head        list;
+   int                     header;
+   /* indicate the actual length of bytes
+    * send/recv in the external buffers
+    */
+   int                     total;
+   int                     offset;
+   struct page             *pages[MAX_SKB_FRAGS+1];
+   struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+   struct sk_buff          *skb;
+   struct page_ctor        *ctor;
+
+   /* The pointer relayed to skb, to indicate whether
+    * it's an externally allocated skb or a kernel one
+    */
+   struct skb_external_page    ext_page;
+   struct skb_shared_info      ushinfo;
+
+#define INFO_READ              0
+#define INFO_WRITE             1
+   unsigned                flags;
+   unsigned                pnum;
+
+   /* It's meaningful for receive, means
+    * the max length allowed
+    */
+   size_t                  len;
+
+   /* The fields after this are for the backend
+    * driver, currently vhost-net.
+    */
+
+   struct kiocb            *iocb;
+   unsigned int            desc_pos;
+   struct iovec            hdr[MAX_SKB_FRAGS + 2];
+   struct iovec            iov[MAX_SKB_FRAGS + 2];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_ctor {
+   struct list_head        readq;
+   int                     wq_len;
+   int                     rq_len;
+   spinlock_t              read_lock;
+   /* record the locked pages */
+   int                     lock_pages;
+   struct rlimit           o_rlim;
+   struct net_device       *dev;
+   struct mpassthru_port   port;
+};
+
+struct mp_struct {
+   struct mp_file          *mfile;
+   struct net_device       *dev;
+   struct page_ctor        *ctor;
+   struct socket           socket;
+
+#ifdef MPASSTHRU_DEBUG
+   int debug;
+#endif
+};
+
+struct mp_file {
+   atomic_t count;
+   struct mp_struct *mp;
+   struct net *net;
+};
+
+struct mp_sock {
+   struct sock             sk;
+   struct mp_struct        *mp;
+};
+
+static int 

[RFC][PATCH v6 12/19] Add a hook to intercept external buffers from NIC driver.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The hook is called in netif_receive_skb().
Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c |   35 +++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 37b389a..dc2f225 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2548,6 +2548,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru (zero-copy) packets
+ * and insert them into the socket queue owned by the mp_port.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+                                               struct packet_type **pt_prev,
+                                               int *ret,
+                                               struct net_device *orig_dev)
+{
+   struct mpassthru_port *mp_port = NULL;
+   struct sock *sk = NULL;
+
+   if (!dev_is_mpassthru(skb->dev))
+       return skb;
+   mp_port = skb->dev->mp_port;
+
+   if (*pt_prev) {
+       *ret = deliver_skb(skb, *pt_prev, orig_dev);
+       *pt_prev = NULL;
+   }
+
+   sk = mp_port->sock->sk;
+   skb_queue_tail(&sk->sk_receive_queue, skb);
+   sk->sk_state_change(sk);
+
+   return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
@@ -2629,6 +2660,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+   /* To intercept mediate passthru (zero-copy) packets here */
+   skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+   if (!skb)
+       goto out;
    skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
    if (!skb)
        goto out;
-- 
1.5.4.4
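
The hook in this patch follows the same convention as handle_bridge():
return the skb to let later handlers see it, or return NULL once the
packet has been consumed. A minimal user-space sketch of that convention
follows. It is not kernel code; struct pkt, struct mp_queue, handle_mp(),
handle_next() and receive() are hypothetical names chosen for
illustration, and the fixed-size queue is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff; only what the sketch needs. */
struct pkt {
    int from_mp_dev;   /* plays the role of dev_is_mpassthru(skb->dev) */
};

/* Hypothetical per-port queue (the patch uses sk->sk_receive_queue). */
struct mp_queue {
    struct pkt *slots[8];
    size_t len;
};

/* Returns the packet if the hook is not interested, NULL once consumed. */
static struct pkt *handle_mp(struct pkt *p, struct mp_queue *q)
{
    assert(p != NULL && q != NULL);
    if (!p->from_mp_dev)
        return p;                 /* not ours: later hooks still run */
    if (q->len < 8)
        q->slots[q->len++] = p;   /* queue it; a full queue drops it in this toy model */
    return NULL;                  /* consumed: the caller must stop here */
}

/* Stand-in for a later hook such as handle_bridge(); counts what it sees. */
static struct pkt *handle_next(struct pkt *p, int *seen)
{
    (*seen)++;
    return p;
}

/* Same call order as in the hunk: returns 0 if consumed, 1 otherwise. */
static int receive(struct pkt *p, struct mp_queue *q, int *seen)
{
    p = handle_mp(p, q);
    if (!p)
        return 0;
    p = handle_next(p, seen);
    if (!p)
        return 0;
    return 1;
}
```

Calling receive() with a packet whose from_mp_dev is set leaves it in the
queue and never reaches handle_next(); any other packet passes through
untouched.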



[RFC][PATCH v6 10/19] Don't do skb recycle, if device use external buffer.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 38d19d0..37587f0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -553,6 +553,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
    if (skb_shared(skb) || skb_cloned(skb))
        return 0;
 
+   /* If the device wants to do mediate passthru, the skb may
+    * get an external buffer, so don't recycle it.
+    */
+   if (dev_is_mpassthru(skb->dev))
+       return 0;
+
    skb_release_head_state(skb);
    shinfo = skb_shinfo(skb);
    atomic_set(&shinfo->dataref, 1);
-- 
1.5.4.4



[RFC][PATCH v6 09/19] Ignore room skb_reserve() when device is using external buffer.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

To keep skb->data and skb->head from an external buffer
consistent, we ignore the room reserved by the driver
for a kernel skb.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5ff8c27..193b259 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1200,6 +1200,15 @@ static inline int skb_tailroom(const struct sk_buff *skb)
  */
 static inline void skb_reserve(struct sk_buff *skb, int len)
 {
+   /* skb_reserve() is only meant for an empty buffer. When the
+    * skb gets an external buffer, we cannot guarantee that the
+    * external buffer has the same reserved space in the header
+    * as a kernel-allocated skb, so we have to ignore the request.
+    * The external buffer info is recorded in the destructor_arg
+    * field, so we use it as the indicator.
+    */
+   if (skb_shinfo(skb)->destructor_arg)
+       return;
    skb->data += len;
    skb->tail += len;
 }
-- 
1.5.4.4
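
To make the effect of this change concrete, here is a minimal user-space
sketch of a reserve operation that becomes a no-op when the storage is
external. struct buf and buf_reserve() are hypothetical names; the
"external" field only plays the role that destructor_arg plays in the
patch, and nothing here is the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced buffer descriptor; names follow sk_buff loosely. */
struct buf {
    unsigned char *head;   /* start of the underlying storage */
    unsigned char *data;   /* start of packet data */
    unsigned char *tail;   /* end of packet data */
    void *external;        /* non-NULL when the storage came from outside */
};

/* Reserve headroom, except when the storage is external. */
static void buf_reserve(struct buf *b, size_t len)
{
    assert(b != NULL);
    if (b->external)
        return;            /* layout is fixed by the external owner */
    b->data += len;
    b->tail += len;
}
```

With a kernel-style buffer the data and tail pointers advance by len; with
an external buffer they stay where the owner put them, which is the
behaviour the patch introduces.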



[RFC][PATCH v6 08/19] Make __alloc_skb() to get external buffer.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Add a dev parameter to __alloc_skb(). When an external buffer is
obtained, skb->data points to it, skb->head is recomputed, the
shinfo of the external buffer is maintained, and the external
buffer info is recorded in the destructor_arg field.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---

__alloc_skb() cleanup by

Jeff Dike jd...@linux.intel.com

 include/linux/skbuff.h |7 ---
 net/core/skbuff.c  |   43 +--
 2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 281a1c0..5ff8c27 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -442,17 +442,18 @@ extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-  gfp_t priority, int fclone, int node);
+  gfp_t priority, int fclone,
+  int node, struct net_device *dev);
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
 {
-   return __alloc_skb(size, priority, 0, -1);
+   return __alloc_skb(size, priority, 0, -1, NULL);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
   gfp_t priority)
 {
-   return __alloc_skb(size, priority, 1, -1);
+   return __alloc_skb(size, priority, 1, -1, NULL);
 }
 
 extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fbdb1f1..38d19d0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -161,7 +161,8 @@ EXPORT_SYMBOL(skb_under_panic);
  * @fclone: allocate from fclone cache instead of head cache
  * and allocate a cloned (child) skb
  * @node: numa node to allocate memory on
- *
+ * @dev: the device that owns the skb if the skb tries to get an
+ * external buffer; otherwise NULL.
  * Allocate a new sk_buff. The returned buffer has no headroom and a
  * tail room of size bytes. The object has a reference count of one.
  * The return is the buffer. On a failure the return is %NULL.
@@ -170,12 +171,13 @@ EXPORT_SYMBOL(skb_under_panic);
  * %GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-   int fclone, int node)
+   int fclone, int node, struct net_device *dev)
 {
    struct kmem_cache *cache;
    struct skb_shared_info *shinfo;
    struct sk_buff *skb;
-   u8 *data;
+   u8 *data = NULL;
+   struct skb_external_page *ext_page = NULL;
 
    cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
 
@@ -185,8 +187,23 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
        goto out;
 
    size = SKB_DATA_ALIGN(size);
-   data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-           gfp_mask, node);
+
+   /* If the device wants to do mediate passthru (zero-copy),
+    * the skb may try to get external buffers from outside.
+    * If that fails, fall back to allocating buffers from the kernel.
+    */
+   if (dev && dev->mp_port) {
+       ext_page = netdev_alloc_external_page(dev, skb, size);
+       if (ext_page) {
+           data = ext_page->start;
+           size = ext_page->size;
+       }
+   }
+
+   if (!data)
+       data = kmalloc_node_track_caller(
+               size + sizeof(struct skb_shared_info),
+               gfp_mask, node);
    if (!data)
        goto nodata;
 
@@ -208,6 +225,15 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
    skb->mac_header = ~0U;
 #endif
 
+   /* If the skb got external buffers successfully, the shinfo is
+    * at the end of the buffer, so keep a copy of it in case we
+    * need it later.
+    */
+   if (ext_page) {
+       skb->head = skb->data - NET_IP_ALIGN - NET_SKB_PAD;
+       memcpy(ext_page->ushinfo, skb_shinfo(skb),
+              sizeof(struct skb_shared_info));
+   }
    /* make sure we initialize shinfo sequentially */
    shinfo = skb_shinfo(skb);
    atomic_set(&shinfo->dataref, 1);
@@ -231,6 +257,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
        child->fclone = SKB_FCLONE_UNAVAILABLE;
    }
+   /* Record the external buffer info in this field. It's not so good,
+    * but we cannot find another place easily.
+    */
+   shinfo->destructor_arg = ext_page;
+
 out:
    return skb;
 nodata:
@@ -259,7 +290,7 @@ struct sk_buff 
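
The allocation hunk in this patch tries an external source first and
falls back to the normal kernel allocator when that fails. A minimal
user-space sketch of the same pattern follows, with malloc() standing in
for kmalloc_node_track_caller(). The names ext_alloc_fn, alloc_result,
alloc_buffer(), one_shot_pool and pool_alloc() are hypothetical and only
serve the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical external allocator: returns NULL when it has nothing to offer. */
typedef void *(*ext_alloc_fn)(size_t size, void *ctx);

struct alloc_result {
    void *data;
    int is_external;   /* 1 if data came from the external allocator */
};

/* Try the external source first; fall back to malloc() on failure. */
static struct alloc_result alloc_buffer(size_t size, ext_alloc_fn ext, void *ctx)
{
    struct alloc_result r = { NULL, 0 };

    if (ext) {
        r.data = ext(size, ctx);
        if (r.data)
            r.is_external = 1;
    }
    if (!r.data)
        r.data = malloc(size);   /* may still be NULL; the caller must check */
    return r;
}

/* Toy external pool: a single fixed buffer that can be handed out once. */
struct one_shot_pool {
    unsigned char storage[256];
    int used;
};

static void *pool_alloc(size_t size, void *ctx)
{
    struct one_shot_pool *p = ctx;

    assert(p != NULL);
    if (p->used || size > sizeof(p->storage))
        return NULL;
    p->used = 1;
    return p->storage;
}
```

The caller has to remember which path produced the buffer, because an
external buffer must not be passed to free(). The patch keeps that
information in destructor_arg; the sketch keeps it in is_external.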

[RFC][PATCH v6 05/19] Add a function make external buffer owner to query capability.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The external buffer owner can use the functions to get
the capability of the underlying NIC driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |2 +
 net/core/dev.c|   51 +
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 183c786..31d9c4a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t   napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_attach(struct net_device *dev,
                                  struct mpassthru_port *port);
 extern void netdev_mp_port_detach(struct net_device *dev);
+extern int netdev_mp_port_prep(struct net_device *dev,
+                               struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index ecbb6b1..37b389a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2497,6 +2497,57 @@ void netdev_mp_port_detach(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_mp_port_detach);
 
+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we query the NIC driver for the capability it can
+ * provide, especially for packet split mode. For now we only
+ * query the header size and the payload a descriptor
+ * may carry. If a driver does not use the API to export them,
+ * we fall back to default values; currently
+ * the defaults are taken from the IGB driver. For now,
+ * this is only called by the mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+                        struct mpassthru_port *port)
+{
+   int rc;
+   int npages, data_len;
+   const struct net_device_ops *ops = dev->netdev_ops;
+
+   /* needed by packet split */
+
+   if (ops->ndo_mp_port_prep) {
+       rc = ops->ndo_mp_port_prep(dev, port);
+       if (rc)
+           return rc;
+   } else {
+       /* If the NIC driver did not report this,
+        * then we try to use default value.
+        */
+       port->hdr_len = 128;
+       port->data_len = 2048;
+       port->npages = 1;
+   }
+
+   if (port->hdr_len <= 0)
+       goto err;
+
+   npages = port->npages;
+   data_len = port->data_len;
+   if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+       (data_len < PAGE_SIZE * (npages - 1) ||
+        data_len > PAGE_SIZE * npages))
+       goto err;
+
+   return 0;
+err:
+   dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+   return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
-- 
1.5.4.4



[RFC][PATCH v6 03/19] Export 2 func for device to assign/deassign new structure

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |3 +++
 net/core/dev.c|   28 
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bae725c..efb575a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1592,6 +1592,9 @@ extern gro_result_t   napi_frags_finish(struct napi_struct *napi,
                                           gro_result_t ret);
 extern struct sk_buff *napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t    napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_attach(struct net_device *dev,
+                                 struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index f769098..ecbb6b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
rcu_read_unlock();
 }
 
+/* Export two functions to assign/de-assign mp_port pointer
+ * to a net device.
+ */
+
+int netdev_mp_port_attach(struct net_device *dev,
+                          struct mpassthru_port *port)
+{
+   /* locked by mp_mutex */
+   if (rcu_dereference(dev->mp_port))
+       return -EBUSY;
+
+   rcu_assign_pointer(dev->mp_port, port);
+
+   return 0;
+}
+EXPORT_SYMBOL(netdev_mp_port_attach);
+
+void netdev_mp_port_detach(struct net_device *dev)
+{
+   /* locked by mp_mutex */
+   if (!rcu_dereference(dev->mp_port))
+       return;
+
+   rcu_assign_pointer(dev->mp_port, NULL);
+   synchronize_rcu();
+}
+EXPORT_SYMBOL(netdev_mp_port_detach);
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
-- 
1.5.4.4
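
The attach/detach pair enforces "at most one port per device". The
user-space sketch below shows only that ownership rule; it deliberately
omits the RCU publication and the mp_mutex locking that the kernel code
relies on, so it is not a substitute for the patch. struct port,
struct device, port_attach() and port_detach() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mpassthru_port and struct net_device. */
struct port { int id; };
struct device { struct port *mp_port; };

/* Attach fails with -EBUSY if the device already has a port. */
static int port_attach(struct device *dev, struct port *port)
{
    assert(dev != NULL && port != NULL);
    if (dev->mp_port)
        return -EBUSY;
    dev->mp_port = port;
    return 0;
}

/* Detach is idempotent: detaching an unbound device does nothing. */
static void port_detach(struct device *dev)
{
    assert(dev != NULL);
    dev->mp_port = NULL;
}
```

A second attach on the same device is refused until the first port has
been detached, which matches the -EBUSY path in netdev_mp_port_attach().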



[RFC][PATCH v6 02/19] Add a new struct for device to manipulate external buffer.

2010-05-21 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |   19 ++-
 1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..bae725c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,22 @@ struct netdev_queue {
    unsigned long   tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/* Add a structure to struct net_device; the new field is
+ * named mp_port. It is for mediate passthru (zero-copy).
+ * It contains the capability of the net device driver,
+ * a socket, and an external buffer creator. "External" means
+ * the skb buffers belonging to the device may not be allocated
+ * from kernel space.
+ */
+struct mpassthru_port  {
+   int             hdr_len;
+   int             data_len;
+   int             npages;
+   unsigned        flags;
+   struct socket   *sock;
+   struct skb_external_page *(*ctor)(struct mpassthru_port *,
+                                     struct sk_buff *, int);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +968,8 @@ struct net_device {
    struct macvlan_port     *macvlan_port;
    /* GARP */
    struct garp_port        *garp_port;
-
+   /* mpassthru */
+   struct mpassthru_port   *mp_port;
    /* class/net/name entry */
    struct device           dev;
    /* space for optional device, statistics, and wireless sysfs groups */
-- 
1.5.4.4



[RFC][PATCH v6 00/19] Provide a zero-copy method on KVM virtio-net.

2010-05-21 Thread xiaohui . xin
We provide a zero-copy method by which the driver side may get external
buffers to DMA into. Here "external" means the driver does not use kernel
space to allocate skb buffers. Currently the external buffers can come
from the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then let the host
NIC driver DMA to it directly.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops (sendmsg/recvmsg) to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

patch 01-13: net core changes.
patch 14-18: new device as interface to manipulate external buffers.
patch 19:    for vhost-net.

The guest virtio-net driver submits multiple requests through the
vhost-net backend driver to the kernel. The requests are queued and
then completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user
buffers from a page constructor. We add a hook in netif_receive_skb()
to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device allocates a new host skb, puts the
payload on skb_shinfo(skb)->frags, and copies the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We have considered 2 ways to utilize the page constructor
API to dispense the user buffers.

One:    Modify __alloc_skb() a bit so that it only allocates the
        sk_buff structure, and the data pointer points to a
        user buffer which comes from a page constructor API.
        Then the shinfo of the skb is also from the guest.
        When a packet is received from hardware, skb->data is filled
        directly by h/w. This is the way we have implemented.

        Pros:   We can avoid any copy here.
        Cons:   The guest virtio-net driver needs to allocate skbs in
                almost the same way as the host NIC drivers, i.e. with
                the size of netdev_alloc_skb() and the same reserved
                space at the head of the skb. Many NIC drivers match the
                guest and are fine with this, but some of the latest NIC
                drivers reserve special room in the skb head. To deal
                with it, we suggest providing a method in the guest
                virtio-net driver to ask the NIC driver for the
                parameters we are interested in once we know which
                device we have bound for zero-copy. Then we ask the
                guest to do so.


Two:    Modify the driver to get user buffers allocated from a page
        constructor API (as a substitute for alloc_page()). The user
        buffers are used as payload buffers and filled by h/w directly
        when a packet is received. The driver should associate the
        pages with the skb (skb_shinfo(skb)->frags). For the head
        buffer side, let the host allocate the skb, and h/w fills it.
        After that, the data filled in the host skb header is copied
        into the guest header buffer, which is submitted together with
        the payload buffer.

        Pros:   We need to care less about how the guest or host
                allocates their buffers.
        Cons:   We still need a small copy here for the skb header.

We are not sure which way is better. This is the first thing on which
we want comments from the community. We hope the modification to the
network part will be generic, used not only by the vhost-net backend but
also by user applications, once the zero-copy device provides async
read/write operations later.

We have got comments from Michael. He said the first method will break
the compatibility of the virtio-net driver and may complicate qemu live
migration. Currently, we try to ignore skb_reserve() if the device
is doing zero-copy, so the guest virtio-net driver does not need to
change. So we now continue with the first way.
Comments about the two ways are still appreciated.

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. In a simple
test with netperf, we found that both bandwidth and CPU % went up,
but the bandwidth increase ratio is much larger than the CPU % increase ratio.

What we have not done yet:
        packet split support
        GRO support
        performance tuning

What we have done in v1:
        polish the RCU usage
        deal with write logging in asynchronous mode in vhost
        add notifier block for mp device
        rename page_ctor to mp_port in netdevice.h to make it look generic
        add mp_dev_change_flags() for mp device to change NIC state
        add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
        a small fix for missing dev_put on failure
using 

[RFC 0/2] Tracing

2010-05-21 Thread Stefan Hajnoczi
Trace events in QEMU/KVM can be very useful for debugging and performance
analysis.  I'd like to discuss tracing support and hope others have an interest
in this feature, too.

Following this email are patches I am using to debug virtio-blk and storage.
The patches provide trivial tracing support, but they don't address the details
of real tracing tools: enabling/disabling events at runtime, no overhead for
disabled events, multithreading support, etc.

It would be nice to have userland tracing facilities that work out-of-the-box
on production systems.  Unfortunately, I'm not aware of any such facilities out
there right now on Linux.  Perhaps SystemTap userspace tracing is the way to
go.  Has anyone tried it with KVM?

For the medium term, without userspace tracing facilities in the OS we could
put something into QEMU to address the need for tracing.  Here are my thoughts
on fleshing out the tracing patch I have posted:

1. Make it possible to enable/disable events at runtime.  Users enable only the
   events they are interested in and aren't flooded with trace data for all
   other events.

2. Either make trace events cheap or build without trace events by default.
   Disabling them by default still allows tracing to be used for development,
   but makes it less available in production.

3. Allow events in any execution context (cpu, io, aio emulation threads).  The
   current code does not support concurrency and is meant for when the iothread
   mutex is held.

4. Make it easy to add new events.  Instead of keeping trace.h and trace.py in
   sync manually, use something like .hx to produce the appropriate C and
   Python.
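
Points 1 and 2 can be combined by guarding each trace call with a cheap
per-event enable bit. The following is only a sketch under stated
assumptions, not part of the posted patches: it limits the number of
events to 64 so that a single mask suffices, it ignores the concurrency
question from point 3, and the names EV_MAX, enabled_mask, recorded,
trace_enable(), trace_disable() and trace_event() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { EV_MAX = 64 };   /* hypothetical cap so one 64-bit mask is enough */

static uint64_t enabled_mask;    /* bit n set => event n is recorded */
static unsigned long recorded;   /* stands in for writing a trace record */

static void trace_enable(unsigned ev)
{
    assert(ev < EV_MAX);
    enabled_mask |= UINT64_C(1) << ev;
}

static void trace_disable(unsigned ev)
{
    assert(ev < EV_MAX);
    enabled_mask &= ~(UINT64_C(1) << ev);
}

static void trace_event(unsigned ev)
{
    assert(ev < EV_MAX);
    if (!(enabled_mask & (UINT64_C(1) << ev)))
        return;      /* disabled: one load and one test, nothing recorded */
    recorded++;      /* enabled: a real tracer would store the record here */
}
```

A monitor command or command-line option could call trace_enable() and
trace_disable() at runtime; whether the remaining load-and-test is cheap
enough for production builds would have to be measured.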

Summary: Tracing is useful, are there external tools we can use right now?  If
not, should we put in something that works well enough until external tools
catch up?

Stefan



[PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Stefan Hajnoczi
Trace events should be defined in trace.h.  Events are written to
/tmp/trace.log and can be formatted using trace.py.  Remember to add
events to trace.py for pretty-printing.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 Makefile.objs |2 +-
 trace.c   |   64 +
 trace.h   |9 
 trace.py  |   30 ++
 4 files changed, 104 insertions(+), 1 deletions(-)
 create mode 100644 trace.c
 create mode 100644 trace.h
 create mode 100755 trace.py

diff --git a/Makefile.objs b/Makefile.objs
index acbaf22..307e989 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
 # block-obj-y is code used by both qemu system emulation and qemu-img
 
 block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
-block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
+block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
 block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
diff --git a/trace.c b/trace.c
new file mode 100644
index 000..2fec4d3
--- /dev/null
+++ b/trace.c
@@ -0,0 +1,64 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include "trace.h"
+
+typedef struct {
+    unsigned long event;
+    unsigned long x1;
+    unsigned long x2;
+    unsigned long x3;
+    unsigned long x4;
+    unsigned long x5;
+} TraceRecord;
+
+enum {
+    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
+};
+
+static TraceRecord trace_buf[TRACE_BUF_LEN];
+static unsigned int trace_idx;
+static FILE *trace_fp;
+
+static void trace(TraceEvent event, unsigned long x1,
+                  unsigned long x2, unsigned long x3,
+                  unsigned long x4, unsigned long x5) {
+    TraceRecord *rec = &trace_buf[trace_idx];
+    rec->event = event;
+    rec->x1 = x1;
+    rec->x2 = x2;
+    rec->x3 = x3;
+    rec->x4 = x4;
+    rec->x5 = x5;
+
+    if (++trace_idx == TRACE_BUF_LEN) {
+        trace_idx = 0;
+
+        if (!trace_fp) {
+            trace_fp = fopen("/tmp/trace.log", "w");
+        }
+        if (trace_fp) {
+            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
+            result = result;
+        }
+    }
+}
+
+void trace1(TraceEvent event, unsigned long x1) {
+    trace(event, x1, 0, 0, 0, 0);
+}
+
+void trace2(TraceEvent event, unsigned long x1, unsigned long x2) {
+    trace(event, x1, x2, 0, 0, 0);
+}
+
+void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3) {
+    trace(event, x1, x2, x3, 0, 0);
+}
+
+void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4) {
+    trace(event, x1, x2, x3, x4, 0);
+}
+
+void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5) {
+    trace(event, x1, x2, x3, x4, x5);
+}
diff --git a/trace.h b/trace.h
new file mode 100644
index 000..144aa1e
--- /dev/null
+++ b/trace.h
@@ -0,0 +1,9 @@
+typedef enum {
+    TRACE_MAX
+} TraceEvent;
+
+void trace1(TraceEvent event, unsigned long x1);
+void trace2(TraceEvent event, unsigned long x1, unsigned long x2);
+void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3);
+void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4);
+void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5);
diff --git a/trace.py b/trace.py
new file mode 100755
index 000..f38ab6b
--- /dev/null
+++ b/trace.py
@@ -0,0 +1,30 @@
+#!/usr/bin/env python
+import sys
+import struct
+
+trace_fmt = 'LLLLLL'
+trace_len = struct.calcsize(trace_fmt)
+
+events = {
+}
+
+def read_record(fobj):
+    s = fobj.read(trace_len)
+    if len(s) != trace_len:
+        return None
+    return struct.unpack(trace_fmt, s)
+
+def format_record(rec):
+    event = events[rec[0]]
+    fields = [event[0]]
+    for i in xrange(1, len(event)):
+        fields.append('%s=0x%x' % (event[i], rec[i]))
+    return ' '.join(fields)
+
+f = open(sys.argv[1], 'rb')
+while True:
+    rec = read_record(f)
+    if rec is None:
+        break
+
+    print format_record(rec)
-- 
1.7.1
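
For anyone following the record format, here is a minimal sketch of a round trip through it. It assumes each record is six native unsigned longs (event, x1..x5), as in TraceRecord. The event table entry and the helpers pack_record and decode are hypothetical and only for illustration; they are not part of the patch.

```python
import struct

# Assumption: one record = six native unsigned longs (event, x1..x5),
# mirroring the TraceRecord struct in trace.c.
TRACE_FMT = 'LLLLLL'
TRACE_LEN = struct.calcsize(TRACE_FMT)

# Hypothetical event table: event id -> (name, field names...).
EVENTS = {
    0: ('example_event', 'ptr', 'ret'),
}

def pack_record(event, *args):
    """Pack one record, zero-filling the unused argument slots."""
    padded = list(args) + [0] * (5 - len(args))
    return struct.pack(TRACE_FMT, event, *padded)

def format_record(rec):
    """Render a decoded record as 'name field=0x...' text."""
    event = EVENTS[rec[0]]
    fields = [event[0]]
    for i in range(1, len(event)):
        fields.append('%s=0x%x' % (event[i], rec[i]))
    return ' '.join(fields)

def decode(data):
    """Yield one formatted line per complete record found in data."""
    for off in range(0, len(data) - TRACE_LEN + 1, TRACE_LEN):
        yield format_record(struct.unpack_from(TRACE_FMT, data, off))
```

Because the format uses native sizes, a log written by a 64-bit QEMU would need to be decoded on a host with the same unsigned long width.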



[PATCH 2/2] trace: Trace write requests in virtio-blk, multiwrite, and paio_submit

2010-05-21 Thread Stefan Hajnoczi
Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 block.c            |    7 +++++++
 hw/virtio-blk.c    |    6 ++++++
 posix-aio-compat.c |    2 ++
 trace.h            |   42 +++++++++++++++++++++++++++++++++++++++++-
 trace.py           |    8 ++++++++
 5 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/block.c b/block.c
index bfe46e3..a7fb040 100644
--- a/block.c
+++ b/block.c
@@ -27,6 +27,7 @@
 #include "block_int.h"
 #include "module.h"
 #include "qemu-objects.h"
+#include "trace.h"
 
 #ifdef CONFIG_BSD
 #include <sys/types.h>
@@ -1913,6 +1914,8 @@ static void multiwrite_cb(void *opaque, int ret)
 {
     MultiwriteCB *mcb = opaque;
 
+    trace_multiwrite_cb(mcb, ret);
+
     if (ret < 0 && !mcb->error) {
         mcb->error = ret;
         multiwrite_user_cb(mcb);
@@ -2044,6 +2047,8 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
     // Check for mergable requests
     num_reqs = multiwrite_merge(bs, reqs, num_reqs, mcb);
 
+    trace_bdrv_aio_multiwrite(mcb, mcb->num_callbacks, num_reqs);
+
     // Run the aio requests
     for (i = 0; i < num_reqs; i++) {
         acb = bdrv_aio_writev(bs, reqs[i].sector, reqs[i].qiov,
@@ -2054,9 +2059,11 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
             // submitted yet. Otherwise we'll wait for the submitted AIOs to
             // complete and report the error in the callback.
             if (mcb->num_requests == 0) {
+                trace_bdrv_aio_multiwrite_earlyfail(mcb);
                 reqs[i].error = -EIO;
                 goto fail;
             } else {
+                trace_bdrv_aio_multiwrite_latefail(mcb, i);
                 mcb->num_requests++;
                 multiwrite_cb(mcb, -EIO);
                 break;
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b05d15e..73b873e 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -13,6 +13,7 @@
 
 #include <qemu-common.h>
 #include <sysemu.h>
+#include "trace.h"
 #include "virtio-blk.h"
 #include "block_int.h"
 #ifdef __linux__
@@ -50,6 +51,8 @@ static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
 {
     VirtIOBlock *s = req->dev;
 
+    trace_virtio_blk_req_complete(req, status);
+
     req->in->status = status;
     virtqueue_push(s->vq, &req->elem, req->qiov.size + sizeof(*req->in));
     virtio_notify(&s->vdev, s->vq);
@@ -87,6 +90,8 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
 {
     VirtIOBlockReq *req = opaque;
 
+    trace_virtio_blk_rw_complete(req, ret);
+
     if (ret) {
         int is_read = !(req->out->type & VIRTIO_BLK_T_OUT);
         if (virtio_blk_handle_rw_error(req, -ret, is_read))
@@ -270,6 +275,7 @@ static void virtio_blk_handle_write(BlockRequest *blkreq, int *num_writes,
     blkreq[*num_writes].cb = virtio_blk_rw_complete;
     blkreq[*num_writes].opaque = req;
     blkreq[*num_writes].error = 0;
+    trace_virtio_blk_handle_write(req, req->out->sector, req->qiov.size / 512);
 
     (*num_writes)++;
 }
diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index b43c531..57d83f0 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -23,6 +23,7 @@
 #include <stdio.h>
 
 #include "qemu-queue.h"
+#include "trace.h"
 #include "osdep.h"
 #include "qemu-common.h"
 #include "block_int.h"
@@ -583,6 +584,7 @@ BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
     acb->next = posix_aio_state->first_aio;
     posix_aio_state->first_aio = acb;
 
+    trace_paio_submit(acb, opaque, sector_num, nb_sectors, type);
     qemu_paio_submit(acb);
     return &acb->common;
 }
diff --git a/trace.h b/trace.h
index 144aa1e..3c4564f 100644
--- a/trace.h
+++ b/trace.h
@@ -1,5 +1,12 @@
 typedef enum {
-    TRACE_MAX
+    TRACE_MULTIWRITE_CB,
+    TRACE_BDRV_AIO_MULTIWRITE,
+    TRACE_BDRV_AIO_MULTIWRITE_EARLYFAIL,
+    TRACE_BDRV_AIO_MULTIWRITE_LATEFAIL,
+    TRACE_VIRTIO_BLK_REQ_COMPLETE,
+    TRACE_VIRTIO_BLK_RW_COMPLETE,
+    TRACE_VIRTIO_BLK_HANDLE_WRITE,
+    TRACE_PAIO_SUBMIT,
 } TraceEvent;
 
 void trace1(TraceEvent event, unsigned long x1);
@@ -7,3 +14,36 @@ void trace2(TraceEvent event, unsigned long x1, unsigned long x2);
 void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3);
 void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4);
 void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5);
+
+static inline void trace_multiwrite_cb(void *mcb, int ret) {
+    trace2(TRACE_MULTIWRITE_CB, (unsigned long)mcb, ret);
+}
+
+static inline void trace_bdrv_aio_multiwrite(void *mcb, int num_callbacks, int num_reqs) {
+    trace3(TRACE_BDRV_AIO_MULTIWRITE, (unsigned long)mcb, num_callbacks, num_reqs);
+}
+
+static inline void trace_bdrv_aio_multiwrite_earlyfail(void *mcb) {
+    trace1(TRACE_BDRV_AIO_MULTIWRITE_EARLYFAIL, (unsigned long)mcb);
+}
+
+static inline void trace_bdrv_aio_multiwrite_latefail(void *mcb, int i) {
+

Re: repeatable hang with loop mount and heavy IO in guest (now in host - not KVM then..)

2010-05-21 Thread Antoine Martin

On 02/27/2010 12:38 AM, Antoine Martin wrote:

  1   0   0  98   0   1|   0 0 |  66B  354B|   0 0 |  3011
  1   1   0  98   0   0|   0 0 |  66B  354B|   0 0 |  2911
From that point onwards, nothing will happen.
The host has disk IO to spare... So what is it waiting for??

Moved to an AMD64 host. No effect.
Disabled swap before running the test. No effect.
Moved the guest to a fully up-to-date FC12 server 
(2.6.31.6-145.fc12.x86_64), no effect.
I have narrowed it down to the guest's filesystem used for backing the 
disk image which is loop mounted: although it was not completely full 
(and had enough inodes), freeing some space on it prevents the system 
from misbehaving.


FYI: the disk image was clean and was fscked before each test. kvm had 
been updated to 0.12.3
The weird thing is that the same filesystem works fine (no system 
hang) if used directly from the host, it is only misbehaving via kvm...


So I am not dismissing the possibility that kvm may be at least partly 
to blame, or that it is exposing a filesystem bug (race?) not normally 
encountered.
(I have backed up the full 32GB virtual disk in case someone suggests 
further investigation)
Well, well. I've just hit the exact same bug on another *host* (not a 
guest), running stock Fedora 12.

So this isn't a kvm bug after all. Definitely a loop+ext(4?) bug.
Looks like you need a pretty big loop mounted partition to trigger it. 
(bigger than available ram?)


This is what triggered it on a quad-core AMD system with 8GB of RAM, on a 
software RAID-1 partition:

mount -o loop 2GB.dd source
dd if=/dev/zero of=8GB.dd bs=1048576 count=8192
mkfs.ext4 -F 8GB.dd
mount -o loop 8GB.dd dest
rsync -rplogtD source/* dest/
umount source
umount dest
^ This is where it hangs. I then tried to issue a 'sync' from another 
terminal, which also hung.
It took more than 10 minutes to settle; during that time one CPU 
was stuck in wait state.

dstat reported almost no IO at the time (1MB/s)
I assume dstat reports page write back like any other disk IO?
That raid partition does ~60MB/s, so writing back 8GB shouldn't take 10 
minutes. (that's even assuming it would have to write back the whole 8GB 
at umount time - which should not be the case)
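
A rough back-of-the-envelope check of those figures, as a sketch only: the 60MB/s and 1MB/s values are the approximate numbers quoted above, and the helper name is just for illustration.

```python
def writeback_seconds(size_mb, throughput_mb_s):
    """Seconds needed to write size_mb at a sustained throughput."""
    return size_mb / float(throughput_mb_s)

image_mb = 8192       # 8GB.dd was created with bs=1048576 count=8192
raid_mb_s = 60        # approximate throughput of the RAID-1 partition
observed_mb_s = 1     # roughly what dstat showed during the hang

# Worst case: the whole image has to be written back at umount time.
full = writeback_seconds(image_mb, raid_mb_s)
print('full writeback at 60MB/s: %.0f s (%.1f min)' % (full, full / 60))

# What actually moved during a 10 minute stall at the observed rate.
print('data moved in 10 min at 1MB/s: %d MB' % (observed_mb_s * 600))
```

So even the worst case should finish in a little over two minutes, while the observed rate accounts for only a small fraction of the image.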


Cheers
Antoine

Here's the hung trace:
INFO: task umount:526 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umountD 0002 0   526  32488 0x
 880140f9fc88 0086 880008e3c228 810d5fd9
 880140f9fc28 880140f9fcd8 880140f9ffd8 880140f9ffd8
 88021b5e03d8 f980 00015740 88021b5e03d8
Call Trace:
 [810d5fd9] ? sync_page+0x0/0x4a
 [81046fbd] ? __enqueue_entity+0x7b/0x7d
 [8113a047] ? bdi_sched_wait+0x0/0x12
 [8113a055] bdi_sched_wait+0xe/0x12
 [814549f0] __wait_on_bit+0x48/0x7b
 [8102649f] ? native_smp_send_reschedule+0x5c/0x5e
 [81454a91] out_of_line_wait_on_bit+0x6e/0x79
 [8113a047] ? bdi_sched_wait+0x0/0x12
 [810748dc] ? wake_bit_function+0x0/0x33
 [8113ad0b] wait_on_bit.clone.1+0x1e/0x20
 [8113ad71] bdi_sync_writeback+0x64/0x6b
 [8113ad9a] sync_inodes_sb+0x22/0xec
 [8113e547] __sync_filesystem+0x4e/0x77
 [8113e71d] sync_filesystem+0x4b/0x4f
 [8111d6d9] generic_shutdown_super+0x27/0xc9
 [8111d7a2] kill_block_super+0x27/0x3f
 [8111ded7] deactivate_super+0x56/0x6b
 [81134262] mntput_no_expire+0xb4/0xec
 [8113482a] sys_umount+0x2d5/0x304
 [81458133] ? do_page_fault+0x270/0x2a0
 [81011d32] system_call_fastpath+0x16/0x1b

Re: [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Stefan Hajnoczi
I should have used the [RFC] tag to make it clear that I'm not
proposing these patches for merge, sorry.

Stefan


Re: [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Jan Kiszka
Stefan Hajnoczi wrote:
 Trace events should be defined in trace.h.  Events are written to
 /tmp/trace.log and can be formatted using trace.py.  Remember to add
 events to trace.py for pretty-printing.

If the data is already being written to a file, why not reuse QEMU's
logging infrastructure (log foo / -d foo)? It shouldn't make a huge
performance difference if the data is saved in clear text.

Also, having support for ftrace's user space markers would be a very
nice option (only an option as it's Linux-specific), see
http://lwn.net/Articles/366796. This allows correlating kernel events
(KVM as well as others) with what goes on in QEMU. It simply enables
integration with the whole kernel tracing infrastructure, e.g.
KernelShark (http://people.redhat.com/srostedt/kernelshark/HTML).
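
To make the marker idea concrete, here is a minimal sketch of emitting a user space marker. It assumes debugfs is mounted at /sys/kernel/debug and that the process is allowed to write to trace_marker; the helper names open_marker and mark are hypothetical.

```python
import os

# Conventional debugfs location. The mount point is an assumption and may
# differ on a given system; root access is typically required.
TRACE_MARKER = '/sys/kernel/debug/tracing/trace_marker'

def open_marker(path=TRACE_MARKER):
    """Return a file descriptor for the marker file, or None if unavailable."""
    try:
        return os.open(path, os.O_WRONLY)
    except OSError:
        return None

def mark(fd, text):
    """Emit one marker; each call costs one write() system call."""
    if fd is not None:
        os.write(fd, text.encode('ascii'))
```

When the marker file is missing, open_marker returns None and mark does nothing, so call sites need no platform checks.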

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [RFC 0/2] Tracing

2010-05-21 Thread Prerna Saxena

Hi Stefan,

Nice to see the patchset.
I am working on something similar, along the lines of static trace events 
for QEMU that collect traces in a QEMU-internal buffer. This would 
use monitor commands to read traces, as well as to enable/disable trace 
events at runtime.

I plan to post a prototype early next week.

On 05/21/2010 03:12 PM, Stefan Hajnoczi wrote:

Trace events in QEMU/KVM can be very useful for debugging and performance
analysis.  I'd like to discuss tracing support and hope others have an interest
in this feature, too.

Following this email are patches I am using to debug virtio-blk and storage.
The patches provide trivial tracing support, but they don't address the details
of real tracing tools: enabling/disabling events at runtime, no overhead for
disabled events, multithreading support, etc.

It would be nice to have userland tracing facilities that work out-of-the-box
on production systems.  Unfortunately, I'm not aware of any such facilities out
there right now on Linux.  Perhaps SystemTap userspace tracing is the way to
go, has anyone tried it with KVM?

For the medium term, without userspace tracing facilities in the OS we could
put something into QEMU to address the need for tracing.  Here are my thoughts
on fleshing out the tracing patch I have posted:

1. Make it possible to enable/disable events at runtime.  Users enable only the
events they are interested in and aren't flooded with trace data for all
other events.



Agree, my upcoming patchset should address this.


2. Either make trace events cheap or build without trace events by default.
Disable by default still allows tracing to be used for development but
less for production.



I'm trying to do this too, though quite a lot remains to be improved in 
my current implementation :-)



3. Allow events in any execution context (cpu, io, aio emulation threads).



Agree.


4. Make it easy to add new events.


Agreed! I'm trying to provide a unified macro interface, like trace 
events, which makes it easy to add new events.
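
As a rough illustration of points 1 and 2 above (runtime enable/disable, cheap disabled events) combined with an internal buffer, here is a minimal sketch. The class and method names are hypothetical and this is not the upcoming patchset; a monitor command would simply call enable, disable or dump.

```python
import collections

class TraceBuffer(object):
    """Fixed-size in-memory ring of (event, args) with per-event switches."""

    def __init__(self, capacity=1024):
        # A bounded deque drops the oldest record once capacity is reached.
        self.records = collections.deque(maxlen=capacity)
        self.enabled = set()

    def enable(self, event):
        self.enabled.add(event)

    def disable(self, event):
        self.enabled.discard(event)

    def trace(self, event, *args):
        # A disabled event costs one set lookup and records nothing.
        if event in self.enabled:
            self.records.append((event, args))

    def dump(self):
        """Return the buffered records, oldest first."""
        return list(self.records)
```

All events start disabled, so a user only sees the ones explicitly switched on.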



Regards,
--
Prerna Saxena

Linux Technology Centre,
IBM Systems and Technology Lab,
Bangalore, India


Re: [PATCH] add support for protocol driver create_options

2010-05-21 Thread Kevin Wolf
Am 20.05.2010 07:36, schrieb MORITA Kazutaka:
 This patch enables protocol drivers to use their create options which
 are not supported by the format.  For example, protocol drivers can use
 a backing_file option with raw format.
 
 Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp

Hm, this is not stackable, right? Though I do see that making it
stackable would require some bigger changes, so maybe we can get away
with claiming that this approach covers everything that happens in practice.

If we accept that this is the desired behaviour, the code looks good to me.

Kevin


Re: network problem with Solaris 10u8 guest

2010-05-21 Thread Harald Dunkel
On 05/12/10 12:41, Harald Dunkel wrote:
 Hi folks,
 
 I am trying to run Solaris 10u8 as a guest in kvm (kernel
 2.6.33.2). Problem: The virtual network devices don't work
 with this Solaris version.
 

Short update: Virtualbox 3.1.6 seems to be more reliable in
this case.


Regards

Harri


Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Anthony Liguori

On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:

Trace events should be defined in trace.h.  Events are written to
/tmp/trace.log and can be formatted using trace.py.  Remember to add
events to trace.py for pretty-printing.

Signed-off-by: Stefan Hajnoczistefa...@linux.vnet.ibm.com
---
  Makefile.objs |    2 +-
  trace.c       |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  trace.h       |    9 ++++++++
  trace.py      |   30 ++++++++++++++++++++++++++
  4 files changed, 104 insertions(+), 1 deletions(-)
  create mode 100644 trace.c
  create mode 100644 trace.h
  create mode 100755 trace.py

diff --git a/Makefile.objs b/Makefile.objs
index acbaf22..307e989 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
  # block-obj-y is code used by both qemu system emulation and qemu-img

  block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
-block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
+block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
  block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o

diff --git a/trace.c b/trace.c
new file mode 100644
index 0000000..2fec4d3
--- /dev/null
+++ b/trace.c
@@ -0,0 +1,64 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include "trace.h"
+
+typedef struct {
+    unsigned long event;
+    unsigned long x1;
+    unsigned long x2;
+    unsigned long x3;
+    unsigned long x4;
+    unsigned long x5;
+} TraceRecord;
+
+enum {
+    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
+};
+
+static TraceRecord trace_buf[TRACE_BUF_LEN];
+static unsigned int trace_idx;
+static FILE *trace_fp;
+
+static void trace(TraceEvent event, unsigned long x1,
+                  unsigned long x2, unsigned long x3,
+                  unsigned long x4, unsigned long x5) {
+    TraceRecord *rec = &trace_buf[trace_idx];
+    rec->event = event;
+    rec->x1 = x1;
+    rec->x2 = x2;
+    rec->x3 = x3;
+    rec->x4 = x4;
+    rec->x5 = x5;
+
+    if (++trace_idx == TRACE_BUF_LEN) {
+        trace_idx = 0;
+
+        if (!trace_fp) {
+            trace_fp = fopen("/tmp/trace.log", "w");
+        }
+        if (trace_fp) {
+            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
+            result = result;
+        }
+    }
+}
   


It is probably worthwhile to read trace points via the monitor or 
through some other mechanism.  My concern would be that writing even 64k 
out to disk would introduce significant performance overhead, mainly because 
it runs in lock-step with the guest's VCPU.


Maybe it's worth adding a thread that syncs the ring to disk if we want 
to write to disk?
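
To illustrate the writer-thread idea, here is a minimal sketch in Python rather than C. It assumes a single producer, and the class name and buf_len parameter are hypothetical. The tracing path only appends to memory and hands full buffers to a background thread that does the file I/O.

```python
import threading
try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2

class AsyncTraceWriter(object):
    """Buffer records in memory; a background thread writes full buffers."""

    def __init__(self, fobj, buf_len=4):
        self.fobj = fobj
        self.buf_len = buf_len
        self.buf = []
        self.full = queue.Queue()
        self.thread = threading.Thread(target=self._run)
        self.thread.daemon = True
        self.thread.start()

    def trace(self, record):
        # Hot path: append only, then hand off a full buffer without I/O.
        self.buf.append(record)
        if len(self.buf) == self.buf_len:
            self.full.put(self.buf)
            self.buf = []

    def _run(self):
        while True:
            chunk = self.full.get()
            if chunk is None:         # shutdown sentinel
                break
            for record in chunk:
                self.fobj.write(record)
            self.fobj.flush()

    def close(self):
        """Flush the partial buffer and stop the writer thread."""
        if self.buf:
            self.full.put(self.buf)
            self.buf = []
        self.full.put(None)
        self.thread.join()
```

The queue is unbounded here, so a slow disk makes memory grow instead of stalling the producer; a real implementation would have to pick a policy for that case.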



+void trace1(TraceEvent event, unsigned long x1) {
+    trace(event, x1, 0, 0, 0, 0);
+}
+
+void trace2(TraceEvent event, unsigned long x1, unsigned long x2) {
+    trace(event, x1, x2, 0, 0, 0);
+}
+
+void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3) {
+    trace(event, x1, x2, x3, 0, 0);
+}
+
+void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4) {
+    trace(event, x1, x2, x3, x4, 0);
+}
+
+void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5) {
+    trace(event, x1, x2, x3, x4, x5);
+}
diff --git a/trace.h b/trace.h
new file mode 100644
index 0000000..144aa1e
--- /dev/null
+++ b/trace.h
@@ -0,0 +1,9 @@
+typedef enum {
+    TRACE_MAX
+} TraceEvent;
+
+void trace1(TraceEvent event, unsigned long x1);
+void trace2(TraceEvent event, unsigned long x1, unsigned long x2);
+void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3);
+void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4);
+void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned long x3, unsigned long x4, unsigned long x5);
   


Looks good.  I think we definitely need something like this.

Regards,

Anthony Liguori


diff --git a/trace.py b/trace.py
new file mode 100755
index 0000000..f38ab6b
--- /dev/null
+++ b/trace.py
@@ -0,0 +1,30 @@
+#!/usr/bin/env python
+import sys
+import struct
+
+trace_fmt = 'LLLLLL'
+trace_len = struct.calcsize(trace_fmt)
+
+events = {
+}
+
+def read_record(fobj):
+    s = fobj.read(trace_len)
+    if len(s) != trace_len:
+        return None
+    return struct.unpack(trace_fmt, s)
+
+def format_record(rec):
+    event = events[rec[0]]
+    fields = [event[0]]
+    for i in xrange(1, len(event)):
+        fields.append('%s=0x%x' % (event[i], rec[i]))
+    return ' '.join(fields)
+
+f = open(sys.argv[1], 'rb')
+while True:
+    rec = read_record(f)
+    if rec is None:
+        break
+
+    print format_record(rec)
   




Bug tracking?

2010-05-21 Thread Michael Tokarev

So, what's the current state of the bug tracking system?
As far as I can see, qemu is moving to launchpad.
Where should qemu-kvm-related issues be submitted nowadays?

Thanks!

/mjt


Re: [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Stefan Hajnoczi
On Fri, May 21, 2010 at 12:13 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 Stefan Hajnoczi wrote:
 Trace events should be defined in trace.h.  Events are written to
 /tmp/trace.log and can be formatted using trace.py.  Remember to add
 events to trace.py for pretty-printing.

 When already writing to a file, why not reusing QEMU's logging
 infrastructure (log foo / -d foo)? Shouldn't make a huge
 performance difference if the data is saved in clear-text.

 Also, having support for ftrace's user space markers would be a very
 nice option (only an option as it's Linux-specific), see
 http://lwn.net/Articles/366796.

Thanks for the links.

I think using the platform's tracing facility has many advantages.
The main one being that we can focus on QEMU/KVM development rather
than re-implementing tracing infrastructure :).

It may be possible to have SystemTap, DTrace, or nop static trace
event code.  A platform with no tracing support can only use the nop
backend, which results in a build without static trace events.
Platforms with tracing support can build with the appropriate backend
or nop.  The backend tracing facility is abstracted and most of QEMU
doesn't need to know which one is being used.
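
A minimal sketch of that abstraction, in Python for brevity: call sites use a single trace_event() entry point and never see which backend is active. All names here are hypothetical, and ListBackend merely stands in for a real SystemTap or DTrace backend.

```python
class NopBackend(object):
    """Backend for platforms without tracing support: events are discarded."""
    def emit(self, name, args):
        pass

class ListBackend(object):
    """Stand-in for a real backend (hypothetical); it just collects events."""
    def __init__(self):
        self.events = []

    def emit(self, name, args):
        self.events.append((name, args))

# The nop backend is the default, so tracing is off unless one is selected.
_backend = NopBackend()

def set_backend(backend):
    """Select the backend once, e.g. at build or start-up time."""
    global _backend
    _backend = backend

def trace_event(name, *args):
    """The only call that instrumented code needs to know about."""
    _backend.emit(name, args)
```

In C the same selection would more likely happen at build time, so that the nop backend compiles the trace calls away entirely.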

I hadn't seen trace markers.  However, I suspect they aren't ideal for
static trace events because logging an event requires a write system
call.  They look useful for annotating kernel tracing information, but
less for high frequency/low overhead userspace tracing.

Stefan


Re: [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Jan Kiszka
Stefan Hajnoczi wrote:
 On Fri, May 21, 2010 at 12:13 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 Stefan Hajnoczi wrote:
 Trace events should be defined in trace.h.  Events are written to
 /tmp/trace.log and can be formatted using trace.py.  Remember to add
 events to trace.py for pretty-printing.
 When already writing to a file, why not reusing QEMU's logging
 infrastructure (log foo / -d foo)? Shouldn't make a huge
 performance difference if the data is saved in clear-text.
 
 Also, having support for ftrace's user space markers would be a very
 nice option (only an option as it's Linux-specific), see
 http://lwn.net/Articles/366796.
 
 Thanks for the links.
 
 I think using the platform's tracing facility has many advantages.
 The main one being that we can focus on QEMU/KVM development rather
 than re-implementing tracing infrastructure :).

Indeed. :)

 
 It may be possible to have SystemTap, DTrace, or nop static trace
 event code.  A platform with no tracing support can only use the nop
 backend, which results in a build without static trace events.
 Platforms with tracing support can build with the appropriate backend
 or nop.  The backend tracing facility is abstracted and most of QEMU
 doesn't need to know which one is being used.

That would be ideal.

 
 I hadn't seen trace markers.  However, I suspect they aren't ideal for
 static trace events because logging an event requires a write system
 call.  They look useful for annotating kernel tracing information, but
 less for high frequency/low overhead userspace tracing.

You never know for sure until you've tried :). There are surely lots of
scenarios where this overhead does not matter.

Moreover, I'm sure that something of LTTng's high-frequency/low-overhead
tracing capabilities will make it (in whatever form) into mainline
sooner or later. So we need that smart infrastructure to make use of it
once it's available (actually, LTTng is already available, just still
requires some kernel patching).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: network problem with Solaris 10u8 guest

2010-05-21 Thread Michael Tokarev

21.05.2010 16:36, Harald Dunkel wrote:

On 05/12/10 12:41, Harald Dunkel wrote:

Hi folks,

I am trying to run Solaris 10u8 as a guest in kvm (kernel
2.6.33.2). Problem: The virtual network devices don't work
with this Solaris version.


Short update: Virtualbox 3.1.6 seems to be more reliable in
this case.


I forgot to send my testing results.

I installed Solaris from sol-10-u8-ga-x86-dvd.iso.
It was with the default rtl8139 NIC, and the installer
configured the rtls0 interface (so the device is at least
recognized).

I left a `ping -f $solaris-ip' process running for the whole
night; it was still running in the morning without any
visible issues, at 100% CPU usage (2 cores - for ping,
host kernel and kvm processes).

Now, it looks like I have forgotten enough Solaris to be unable
to set up a new network driver, so I can't easily switch to
e1000.  Maybe a reinstall will be faster for me.

So, basically, I can't reproduce the issue.

/mjt


Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Jan Kiszka
Anthony Liguori wrote:
 On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:
 Trace events should be defined in trace.h.  Events are written to
 /tmp/trace.log and can be formatted using trace.py.  Remember to add
 events to trace.py for pretty-printing.

 Signed-off-by: Stefan Hajnoczistefa...@linux.vnet.ibm.com
 ---
   Makefile.objs |    2 +-
   trace.c       |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   trace.h       |    9 ++++++++
   trace.py      |   30 ++++++++++++++++++++++++++
   4 files changed, 104 insertions(+), 1 deletions(-)
   create mode 100644 trace.c
   create mode 100644 trace.h
   create mode 100755 trace.py

 diff --git a/Makefile.objs b/Makefile.objs
 index acbaf22..307e989 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
   # block-obj-y is code used by both qemu system emulation and qemu-img

   block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o
 module.o
 -block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
 +block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
   block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
   block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o

 diff --git a/trace.c b/trace.c
 new file mode 100644
 index 0000000..2fec4d3
 --- /dev/null
 +++ b/trace.c
 @@ -0,0 +1,64 @@
 +#include <stdlib.h>
 +#include <stdio.h>
 +#include "trace.h"
 +
 +typedef struct {
 +    unsigned long event;
 +    unsigned long x1;
 +    unsigned long x2;
 +    unsigned long x3;
 +    unsigned long x4;
 +    unsigned long x5;
 +} TraceRecord;
 +
 +enum {
 +    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
 +};
 +
 +static TraceRecord trace_buf[TRACE_BUF_LEN];
 +static unsigned int trace_idx;
 +static FILE *trace_fp;
 +
 +static void trace(TraceEvent event, unsigned long x1,
 +                  unsigned long x2, unsigned long x3,
 +                  unsigned long x4, unsigned long x5) {
 +    TraceRecord *rec = &trace_buf[trace_idx];
 +    rec->event = event;
 +    rec->x1 = x1;
 +    rec->x2 = x2;
 +    rec->x3 = x3;
 +    rec->x4 = x4;
 +    rec->x5 = x5;
 +
 +    if (++trace_idx == TRACE_BUF_LEN) {
 +        trace_idx = 0;
 +
 +        if (!trace_fp) {
 +            trace_fp = fopen("/tmp/trace.log", "w");
 +        }
 +        if (trace_fp) {
 +            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
 +            result = result;
 +        }
 +    }
 +}

 
 It is probably worth while to read trace points via the monitor or
 through some other mechanism.  My concern would be that writing even 64k
 out to disk would introduce enough performance overhead mainly because
 it runs lock-step with the guest's VCPU.
 
 Maybe it's worth adding a thread that syncs the ring to disk if we want
 to write to disk?

That's not what QEMU should worry about. If somehow possible, let's push
this into the hands of a (user space) tracing framework, ideally one
that is already designed for such requirements. E.g. there exists quite
useful work in the context of LTTng (user space RCU for application
tracing).

We may need simple stubs for the case that no such framework is (yet)
available. But effort should focus on a QEMU infrastructure to add
useful tracepoints to the code. Specifically when tracing over KVM, you
usually need information about kernel states as well, so you depend on
an integrated approach, not Yet Another Log File.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Anthony Liguori

On 05/21/2010 08:46 AM, Jan Kiszka wrote:

Anthony Liguori wrote:
   

On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:
 

Trace events should be defined in trace.h.  Events are written to
/tmp/trace.log and can be formatted using trace.py.  Remember to add
events to trace.py for pretty-printing.

Signed-off-by: Stefan Hajnoczistefa...@linux.vnet.ibm.com
---
   Makefile.objs |    2 +-
   trace.c       |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   trace.h       |    9 ++++++++
   trace.py      |   30 ++++++++++++++++++++++++++
   4 files changed, 104 insertions(+), 1 deletions(-)
   create mode 100644 trace.c
   create mode 100644 trace.h
   create mode 100755 trace.py

diff --git a/Makefile.objs b/Makefile.objs
index acbaf22..307e989 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
   # block-obj-y is code used by both qemu system emulation and qemu-img

   block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o
module.o
-block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
+block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
   block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
   block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o

diff --git a/trace.c b/trace.c
new file mode 100644
index 0000000..2fec4d3
--- /dev/null
+++ b/trace.c
@@ -0,0 +1,64 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include "trace.h"
+
+typedef struct {
+    unsigned long event;
+    unsigned long x1;
+    unsigned long x2;
+    unsigned long x3;
+    unsigned long x4;
+    unsigned long x5;
+} TraceRecord;
+
+enum {
+    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
+};
+
+static TraceRecord trace_buf[TRACE_BUF_LEN];
+static unsigned int trace_idx;
+static FILE *trace_fp;
+
+static void trace(TraceEvent event, unsigned long x1,
+                  unsigned long x2, unsigned long x3,
+                  unsigned long x4, unsigned long x5) {
+    TraceRecord *rec = &trace_buf[trace_idx];
+    rec->event = event;
+    rec->x1 = x1;
+    rec->x2 = x2;
+    rec->x3 = x3;
+    rec->x4 = x4;
+    rec->x5 = x5;
+
+    if (++trace_idx == TRACE_BUF_LEN) {
+        trace_idx = 0;
+
+        if (!trace_fp) {
+            trace_fp = fopen("/tmp/trace.log", "w");
+        }
+        if (trace_fp) {
+            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
+            result = result;
+        }
+    }
+}

   

It is probably worth while to read trace points via the monitor or
through some other mechanism.  My concern would be that writing even 64k
out to disk would introduce enough performance overhead mainly because
it runs lock-step with the guest's VCPU.

Maybe it's worth adding a thread that syncs the ring to disk if we want
to write to disk?
 

That's not what QEMU should worry about. If somehow possible, let's push
this into the hands of a (user space) tracing framework, ideally one
that is already designed for such requirements. E.g. there exists quite
useful work in the context of LTTng (user space RCU for application
tracing).
   


From what I understand, none of the current kernel approaches to 
userspace tracing have much momentum at the moment.



We may need simple stubs for the case that no such framework is (yet)
available. But effort should focus on a QEMU infrastructure to add
useful tracepoints to the code. Specifically when tracing over KVM, you
usually need information about kernel states as well, so you depend on
an integrated approach, not Yet Another Log File.
   


I think the simple code that Stefan pasted gives us 95% of what we need.

Regards,

Anthony Liguori


Jan

   




Re: Gentoo guest with smp: emerge freeze while recompile world

2010-05-21 Thread Riccardo
 These are almost impossible to debug.

 Try copying vmlinux out of your guest and attach with gdb when it
 hangs.  Then issue the command

   (gdb) thread apply all backtrace

 to see what the guest is doing.

 --
 Do not meddle in the internals of kernels, for they are subtle and quick to
panic.
--- End of Original Message ---

Hi,
I compiled gentoo-sources-2.6.31-r10, and with this kernel emerge -e world
completes without errors!
I always use the same .config.

Afterwards I tried gentoo-sources-2.6.34 and vanilla-sources-2.6.34, but the
problem remains: the compile freezes and I see this in ps -elf:

5 S root  1013     1  0  76  -4 -  3125 poll_s 13:00 ?     00:00:00 /sbin/udevd --daemon
1 S root  2669     1  0  80   0 -  7523 wait   13:00 ?     00:00:00 supervising syslog-ng
5 S root  2670  2669  0  80   0 -  7556 poll_s 13:00 ?     00:00:00 /usr/sbin/syslog-ng
1 S root  3258     1  0  80   0 -  9505 poll_s 13:00 ?     00:00:00 /usr/sbin/sshd
1 S root  3378     1  0  80   0 -  4115 hrtime 13:00 ?     00:00:00 /usr/sbin/cron
0 S root  3446     1  0  80   0 -  1493 n_tty_ 13:00 tty2  00:00:00 /sbin/agetty 38400 tty2 linux
0 S root  3447     1  0  80   0 -  1493 n_tty_ 13:00 tty3  00:00:00 /sbin/agetty 38400 tty3 linux
0 S root  3448     1  0  80   0 -  1493 n_tty_ 13:00 tty4  00:00:00 /sbin/agetty 38400 tty4 linux
0 S root  3449     1  0  80   0 -  1493 n_tty_ 13:00 tty5  00:00:00 /sbin/agetty 38400 tty5 linux
0 S root  3450     1  0  80   0 -  1493 n_tty_ 13:00 tty6  00:00:00 /sbin/agetty 38400 tty6 linux
5 S root  3457     1  0  80   0 -  5959 poll_s 13:00 ?     00:00:00 SCREEN -S sb1
4 S root  3458  3457  0  80   0 -  4454 wait   13:00 pts/0 00:00:00 -/bin/bash
4 S root  3462  3458  0  75  -5 - 45171 poll_s 13:00 pts/0 00:00:34 /usr/bin/python2.6 /usr/bin/emerge -e world
4 S root  3613     1  0  80   0 - 14014 wait   13:01 tty1  00:00:00 /bin/login --
4 S root  3953  3613  0  80   0 -  4429 n_tty_ 13:01 tty1  00:00:00 -bash
0 S root  6614  3462  0  75  -5 -   972 wait   14:26 pts/0 00:00:00 [dev-util/pkgconfig-0.23] sandbox /usr/lib64/portage/bin/ebuild.sh compile
4 S root  6615  6614  0  75  -5 -  6362 wait   14:26 pts/0 00:00:00 /bin/bash /usr/lib64/portage/bin/ebuild.sh compile
5 S root  6646  6615  0  75  -5 -  6745 wait   14:26 pts/0 00:00:00 /bin/bash /usr/lib64/portage/bin/ebuild.sh compile
4 S root 13235  6646  0  75  -5 -  3651 wait   14:27 pts/0 00:00:00 make -j8
4 S root 13238 13235  0  75  -5 -  3652 wait   14:27 pts/0 00:00:00 make all-recursive
4 S root 13239 13238  0  75  -5 -  5956 wait   14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
5 S root 13243 13239  0  75  -5 -  5956 wait   14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
4 S root 13244 13243  0  75  -5 -  3686 wait   14:27 pts/0 00:00:00 make all
4 S root 13358 13244  0  75  -5 -  3684 wait   14:27 pts/0 00:00:00 make all-recursive
4 S root 13359 13358  0  75  -5 -  5956 wait   14:27 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
5 S root 16546 13359  0  75  -5 -  5956 wait   14:28 pts/0 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list=
4 S root 16547 16546  0  75  -5 -  3652 wait   14:28 pts/0 00:00:00 make all
4 S root 16548 16547  0  75  -5 -  3652 n_tty_ 14:28 pts/0 00:00:00 make all-am
4 S root 16599  3258  0  80   0 - 17937 poll_s 15:07 ?     00:00:00 sshd: r...@pts/2
4 S root 16602 16599  0  80   0 -  4429 wait   15:07 pts/2 00:00:00 -bash
4 R root 16611 16602  0  80   0 -  3698 -      15:08 pts/2 00:00:00 ps -elf
1 S root 31506     2  0  80   0 -     0 bdi_wr 14:25 ?     00:00:00 [flush-253:0]

Is everything in wait?
After this test I rebooted into 2.6.31-r10 and completed emerge -e world
successfully.
The problem always shows up with all kernels >= 2.6.32.
Have I set up something wrong in the kernel? I posted the .config in the
previous email.

Best regards,

Riccardo


Re: [PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup

2010-05-21 Thread Sridhar Samudrala

On 5/20/2010 3:22 PM, Paul Menage wrote:

On Tue, May 18, 2010 at 5:04 PM, Sridhar Samudrala
samudrala.srid...@gmail.com  wrote:
   

Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
 

Reviewed-by: Paul Menage men...@google.com

It would be more efficient to just attach directly to current->cgroups
rather than potentially creating/destroying one css_set for each
hierarchy until we've completely converged on current->cgroups - but
that would require a bunch of refactoring of the guts of
cgroup_attach_task() to ensure that the right can_attach()/attach()
callbacks are made. That doesn't really seem worthwhile right now for
the initial use, which I imagine isn't going to be
performance-sensitive.
   
Yes. In our use case, this will be called only once per guest interface,
when the guest comes up.
I hope you or someone more familiar with the cgroups subsystem can optimize
this function later.


Thanks
Sridhar



Re: Gentoo guest with smp: emerge freeze while recompile world

2010-05-21 Thread Avi Kivity

On 05/21/2010 04:16 PM, Riccardo wrote:

...
   

These are almost impossible to debug.

Try copying vmlinux out of your guest and attach with gdb when it
hangs.  Then issue the command

   (gdb) thread apply all backtrace

to see what the guest is doing.

--
Do not meddle in the internals of kernels, for they are subtle and quick to
panic.
--- End of Original Message ---

Hi,
I compile gentoo-sources-2.6.31-r10 and with this kernel emerge -e world
complete without errors!
   


Interesting.  Can you do a git bisect to see where it stops working?


I always use the same .config

After I try gentoo-sources-2.6.34 and vanilla-sources-2.6.34 but the problem
remain, the compile freeze and I see this in ps -elf:

5 S root  1013 1  0  76  -4 -  3125 poll_s 13:00 ?00:00:00
/sbin/udevd --daemon
1 S root  2669 1  0  80   0 -  7523 wait   13:00 ?00:00:00
supervising syslog-ng
5 S root  2670  2669  0  80   0 -  7556 poll_s 13:00 ?00:00:00
/usr/sbin/syslog-ng
1 S root  3258 1  0  80   0 -  9505 poll_s 13:00 ?00:00:00
/usr/sbin/sshd
1 S root  3378 1  0  80   0 -  4115 hrtime 13:00 ?00:00:00
/usr/sbin/cron
0 S root  3446 1  0  80   0 -  1493 n_tty_ 13:00 tty2 00:00:00
/sbin/agetty 38400 tty2 linux
0 S root  3447 1  0  80   0 -  1493 n_tty_ 13:00 tty3 00:00:00
/sbin/agetty 38400 tty3 linux
0 S root  3448 1  0  80   0 -  1493 n_tty_ 13:00 tty4 00:00:00
/sbin/agetty 38400 tty4 linux
0 S root  3449 1  0  80   0 -  1493 n_tty_ 13:00 tty5 00:00:00
/sbin/agetty 38400 tty5 linux
0 S root  3450 1  0  80   0 -  1493 n_tty_ 13:00 tty6 00:00:00
/sbin/agetty 38400 tty6 linux
5 S root  3457 1  0  80   0 -  5959 poll_s 13:00 ?00:00:00
SCREEN -S sb1
4 S root  3458  3457  0  80   0 -  4454 wait   13:00 pts/000:00:00
-/bin/bash
4 S root  3462  3458  0  75  -5 - 45171 poll_s 13:00 pts/000:00:34
/usr/bin/python2.6 /usr/bin/emerge -e world
4 S root  3613 1  0  80   0 - 14014 wait   13:01 tty1 00:00:00
/bin/login --
4 S root  3953  3613  0  80   0 -  4429 n_tty_ 13:01 tty1 00:00:00 -bash
0 S root  6614  3462  0  75  -5 -   972 wait   14:26 pts/000:00:00
[dev-util/pkgconfig-0.23] sandbox /usr/lib64/portage/bin/ebuild.sh compile
4 S root  6615  6614  0  75  -5 -  6362 wait   14:26 pts/000:00:00
/bin/bash /usr/lib64/portage/bin/ebuild.sh compile
5 S root  6646  6615  0  75  -5 -  6745 wait   14:26 pts/000:00:00
/bin/bash /usr/lib64/portage/bin/ebuild.sh compile
4 S root 13235  6646  0  75  -5 -  3651 wait   14:27 pts/000:00:00
make -j8
4 S root 13238 13235  0  75  -5 -  3652 wait   14:27 pts/000:00:00
make all-recursive
4 S root 13239 13238  0  75  -5 -  5956 wait   14:27 pts/000:00:00
/bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo
all-recursive | sed s/-recursive//`; \?list=
5 S root 13243 13239  0  75  -5 -  5956 wait   14:27 pts/000:00:00
/bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo
all-recursive | sed s/-recursive//`; \?list=
4 S root 13244 13243  0  75  -5 -  3686 wait   14:27 pts/000:00:00
make all
4 S root 13358 13244  0  75  -5 -  3684 wait   14:27 pts/000:00:00
make all-recursive
4 S root 13359 13358  0  75  -5 -  5956 wait   14:27 pts/000:00:00
/bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo
all-recursive | sed s/-recursive//`; \?list=
5 S root 16546 13359  0  75  -5 -  5956 wait   14:28 pts/000:00:00
/bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo
all-recursive | sed s/-recursive//`; \?list=
4 S root 16547 16546  0  75  -5 -  3652 wait   14:28 pts/000:00:00
make all
4 S root 16548 16547  0  75  -5 -  3652 n_tty_ 14:28 pts/000:00:00
make all-am
4 S root 16599  3258  0  80   0 - 17937 poll_s 15:07 ?00:00:00
sshd: r...@pts/2
4 S root 16602 16599  0  80   0 -  4429 wait   15:07 pts/200:00:00 -bash
4 R root 16611 16602  0  80   0 -  3698 -  15:08 pts/200:00:00 ps 
-elf
1 S root 31506 2  0  80   0 - 0 bdi_wr 14:25 ?00:00:00
[flush-253:0]

All in wait?
   


Maybe a block driver problem?  Are you using virtio?


After this test I rebooted into 2.6.31-r10 and completed emerge -e world
successfully.
The problem always shows up with all kernels >= 2.6.32.
Have I set up something wrong in the kernel? I posted the .config in the
previous email.

   


It should work for all .configs.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Gentoo guest with smp: emerge freeze while recompile world

2010-05-21 Thread Riccardo
-- Original Message --- 
 From: Avi Kivity a...@redhat.com 
 To: Riccardo andrighetto.ricca...@gmail.com 
 Cc: kvm@vger.kernel.org 
 Sent: Fri, 21 May 2010 18:21:20 +0300 
 Subject: Re: Gentoo guest with smp: emerge freeze while recompile world

  On 05/21/2010 04:16 PM, Riccardo wrote: 
   ... 
   
   There are almost impossible to debug. 
   
   Try copying vmlinux out of your guest and attach with gdb when it 
   hangs.  Then issue the command 
   
  (gdb) thread apply all backtrace 
   
   to see what the guest is doing. 
   
   -- 
   Do not meddle in the internals of kernels, for they are subtle and quick 
   to 
 
   panic. 
   --- End of Original Message --- 
   
   Hi, 
   I compile gentoo-sources-2.6.31-r10 and with this kernel emerge -e world 
   complete without errors! 
   
  
  Interesting.  Can you do a git bisect to see where it stops working? 
 Ehm, sorry, I don't understand the request. Do you have a link? 
   I always use the same .config 
   
   After I try gentoo-sources-2.6.34 and vanilla-sources-2.6.34 but the 
   problem 
   remain, the compile freeze and I see this in ps -elf: 
   
   5 S root  1013 1  0  76  -4 -  3125 poll_s 13:00 ?00:00:00 
   /sbin/udevd --daemon 
   1 S root  2669 1  0  80   0 -  7523 wait   13:00 ?00:00:00 
   supervising syslog-ng 
   5 S root  2670  2669  0  80   0 -   7556 poll_s 13:00 ?
   00:00:00 
   /usr/sbin/syslog-ng 
   1 S root  3258 1  0  80   0 -  9505 poll_s 13:00 ?00:00:00 
   /usr/sbin/sshd 
   1 S root  3378 1  0  80   0 -  4115 hrtime 13:00 ?00:00:00 
   /usr/sbin/cron 
   0 S root  3446 1  0  80   0 -  1493 n_tty_ 13:00 tty2 00:00:00 
   /sbin/agetty 38400 tty2 linux 
   0 S root  3447 1  0  80   0 -  1493 n_tty_ 13:00 tty3 00:00:00 
   /sbin/agetty 38400 tty3 linux 
   0 S root  3448 1  0  80   0 -  1493 n_tty_ 13:00 tty4 00:00:00 
   /sbin/agetty 38400 tty4 linux 
   0 S root  3449 1  0  80   0 -  1493 n_tty_ 13:00 tty5 00:00:00 
   /sbin/agetty 38400 tty5 linux 
   0 S root  3450 1  0  80   0 -  1493 n_tty_ 13:00 tty6 00:00:00 
   /sbin/agetty 38400 tty6 linux 
   5 S root  3457 1  0  80   0 -  5959 poll_s 13:00 ?00:00:00 
   SCREEN -S sb1 
   4 S root  3458  3457  0  80   0 -   4454 wait   13:00 pts/0
   00:00:00 
   -/bin/bash 
   4 S root  3462  3458  0  75  -5 - 45171 poll_s 13:00 pts/000:00:34 
   /usr/bin/python2.6 /usr/bin/emerge -e world 
   4 S root  3613 1  0  80   0 - 14014 wait   13:01 tty1 00:00:00 
   /bin/login -- 
   4 S root  3953  3613  0  80   0 -   4429 n_tty_ 13:01 tty1 
 00:00:00 -bash 
   0 S root  6614  3462  0  75  -5 -   972 wait   14:26 pts/000:00:00 
   [dev-util/pkgconfig-0.23] sandbox /usr/lib64/portage/bin/ebuild.sh 
   compile 
   4 S root  6615  6614  0  75  -5 -   6362 wait   14:26 pts/0
   00:00:00 
   /bin/bash /usr/lib64/portage/bin/ebuild.sh compile 
   5 S root  6646  6615  0  75  -5 -   6745 wait   14:26 pts/0
   00:00:00 
   /bin/bash /usr/lib64/portage/bin/ebuild.sh compile 
   4 S root 13235  6646  0  75  -5 -   3651 wait   14:27 pts/0
   00:00:00 
   make -j8 
   4 S root 13238 13235  0  75  -5 -  3652 wait   14:27 pts/000:00:00 
   make all-recursive 
   4 S root 13239 13238  0  75  -5 -  5956 wait   14:27 pts/000:00:00 
   /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo 
   all-recursive | sed s/-recursive//`; \?list= 
   5 S root 13243 13239  0  75  -5 -  5956 wait   14:27 pts/000:00:00 
   /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo 
   all-recursive | sed s/-recursive//`; \?list= 
   4 S root 13244 13243  0  75  -5 -  3686 wait   14:27 pts/000:00:00 
   make all 
   4 S root 13358 13244  0  75  -5 -  3684 wait   14:27 pts/000:00:00 
   make all-recursive 
   4 S root 13359 13358  0  75  -5 -  5956 wait   14:27 pts/000:00:00 
   /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo 
   all-recursive | sed s/-recursive//`; \?list= 
   5 S root 16546 13359  0  75  -5 -  5956 wait   14:28 pts/000:00:00 
   /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; \?target=`echo 
   all-recursive | sed s/-recursive//`; \?list= 
   4 S root 16547 16546  0  75  -5 -  3652 wait   14:28 pts/000:00:00 
   make all 
   4 S root 16548 16547  0  75  -5 -  3652 n_tty_ 14:28 pts/000:00:00 
   make all-am 
   4 S root 16599  3258  0  80   0 - 17937 poll_s 15:07 ?00:00:00 
   sshd: r...@pts/2 
   4 S root 16602 16599  0  80   0 -  4429 wait   15:07 pts/200:00:00 
 -bash 
   4 R root 16611 16602  0  80   0 -  3698 -  15:08 pts/200:00:00 
 ps -elf 
   1 S root 31506 2  0  80   0 - 0 bdi_wr 14:25 ?00:00:00 
   [flush-253:0] 
   
   All in wait? 
   
  
  Maybe a 

Re: Bug tracking?

2010-05-21 Thread Anthony Liguori

On 05/21/2010 07:45 AM, Michael Tokarev wrote:

So, what's the current state of the bug tracking system?
As far as I can see, qemu is moving to launchpad.
Where should qemu-kvm-related issues be submitted nowadays?


Kernel issues should be filed in bugzilla.kernel.org.

qemu issues should be filed in LaunchPad.

Regards,

Anthony Liguori


Thanks!

/mjt


Re: Bug tracking?

2010-05-21 Thread Avi Kivity

On 05/21/2010 06:50 PM, Anthony Liguori wrote:

On 05/21/2010 07:45 AM, Michael Tokarev wrote:

So, what's the current state of the bug tracking system?
As far as I can see, qemu is moving to launchpad.
Where should qemu-kvm-related issues be submitted nowadays?


Kernel issues should be filed in bugzilla.kernel.org.

qemu issues should be filed in LaunchPad.



qemu-kvm issues, even if not present in upstream qemu, should be filed 
in launchpad (but clearly marked to be qemu-kvm specific).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Bug tracking?

2010-05-21 Thread Michael Tokarev

21.05.2010 19:56, Avi Kivity wrote:

On 05/21/2010 06:50 PM, Anthony Liguori wrote:

On 05/21/2010 07:45 AM, Michael Tokarev wrote:

So, what's the current state of the bug tracking system?
As far as I can see, qemu is moving to launchpad.
Where should qemu-kvm-related issues be submitted nowadays?


Kernel issues should be filed in bugzilla.kernel.org.

qemu issues should be filed in LaunchPad.


qemu-kvm issues, even if not present in upstream qemu, should be filed
in launchpad (but clearly marked to be qemu-kvm specific).


Aha. That makes perfect sense now.  Thanks!

/mjt


Re: Gentoo guest with smp: emerge freeze while recompile world

2010-05-21 Thread Brian Jackson
On Friday, May 21, 2010 10:46:10 am Riccardo wrote:
 -- Original Message ---
  From: Avi Kivity a...@redhat.com
  To: Riccardo andrighetto.ricca...@gmail.com
  Cc: kvm@vger.kernel.org
  Sent: Fri, 21 May 2010 18:21:20 +0300
  Subject: Re: Gentoo guest with smp: emerge freeze while recompile world
 
   On 05/21/2010 04:16 PM, Riccardo wrote:
...

There are almost impossible to debug.

Try copying vmlinux out of your guest and attach with gdb when it
hangs.  Then issue the command

   (gdb) thread apply all backtrace

to see what the guest is doing.

panic.
--- End of Original Message ---

Hi,
I compile gentoo-sources-2.6.31-r10 and with this kernel emerge -e
world complete without errors!
   
   Interesting.  Can you do a git bisect to see where it stops working?
 
  Ehm, sorry, I don't understand the request. Do you have a link?
 
I always use the same .config

After I try gentoo-sources-2.6.34 and vanilla-sources-2.6.34 but the
problem remain, the compile freeze and I see this in ps -elf:

5 S root  1013 1  0  76  -4 -  3125 poll_s 13:00 ?   
00:00:00 /sbin/udevd --daemon
1 S root  2669 1  0  80   0 -  7523 wait   13:00 ?   
00:00:00 supervising syslog-ng
5 S root  2670  2669  0  80   0 -   7556 poll_s 13:00 ?   
00:00:00 /usr/sbin/syslog-ng
1 S root  3258 1  0  80   0 -  9505 poll_s 13:00 ?   
00:00:00 /usr/sbin/sshd
1 S root  3378 1  0  80   0 -  4115 hrtime 13:00 ?   
00:00:00 /usr/sbin/cron
0 S root  3446 1  0  80   0 -  1493 n_tty_ 13:00 tty2
00:00:00 /sbin/agetty 38400 tty2 linux
0 S root  3447 1  0  80   0 -  1493 n_tty_ 13:00 tty3
00:00:00 /sbin/agetty 38400 tty3 linux
0 S root  3448 1  0  80   0 -  1493 n_tty_ 13:00 tty4
00:00:00 /sbin/agetty 38400 tty4 linux
0 S root  3449 1  0  80   0 -  1493 n_tty_ 13:00 tty5
00:00:00 /sbin/agetty 38400 tty5 linux
0 S root  3450 1  0  80   0 -  1493 n_tty_ 13:00 tty6
00:00:00 /sbin/agetty 38400 tty6 linux
5 S root  3457 1  0  80   0 -  5959 poll_s 13:00 ?   
00:00:00 SCREEN -S sb1
4 S root  3458  3457  0  80   0 -   4454 wait   13:00 pts/0   
00:00:00 -/bin/bash
4 S root  3462  3458  0  75  -5 - 45171 poll_s 13:00 pts/0   
00:00:34 /usr/bin/python2.6 /usr/bin/emerge -e world
4 S root  3613 1  0  80   0 - 14014 wait   13:01 tty1
00:00:00 /bin/login --
4 S root  3953  3613  0  80   0 -   4429 n_tty_ 13:01 tty1
 
  00:00:00 -bash
 
0 S root  6614  3462  0  75  -5 -   972 wait   14:26 pts/0   
00:00:00 [dev-util/pkgconfig-0.23] sandbox
/usr/lib64/portage/bin/ebuild.sh compile 4 S root  6615  6614 
0  75  -5 -   6362 wait   14:26 pts/000:00:00 /bin/bash
/usr/lib64/portage/bin/ebuild.sh compile
5 S root  6646  6615  0  75  -5 -   6745 wait   14:26 pts/0   
00:00:00 /bin/bash /usr/lib64/portage/bin/ebuild.sh compile
4 S root 13235  6646  0  75  -5 -   3651 wait   14:27 pts/0   
00:00:00 make -j8
4 S root 13238 13235  0  75  -5 -  3652 wait   14:27 pts/0   
00:00:00 make all-recursive
4 S root 13239 13238  0  75  -5 -  5956 wait   14:27 pts/0   
00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no;
\?target=`echo all-recursive | sed s/-recursive//`; \?list=
5 S root 13243 13239  0  75  -5 -  5956 wait   14:27 pts/0   
00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no;
\?target=`echo all-recursive | sed s/-recursive//`; \?list=
4 S root 13244 13243  0  75  -5 -  3686 wait   14:27 pts/0   
00:00:00 make all
4 S root 13358 13244  0  75  -5 -  3684 wait   14:27 pts/0   
00:00:00 make all-recursive
4 S root 13359 13358  0  75  -5 -  5956 wait   14:27 pts/0   
00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no;
\?target=`echo all-recursive | sed s/-recursive//`; \?list=
5 S root 16546 13359  0  75  -5 -  5956 wait   14:28 pts/0   
00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no;
\?target=`echo all-recursive | sed s/-recursive//`; \?list=
4 S root 16547 16546  0  75  -5 -  3652 wait   14:28 pts/0   
00:00:00 make all
4 S root 16548 16547  0  75  -5 -  3652 n_tty_ 14:28 pts/0   
00:00:00 make all-am
4 S root 16599  3258  0  80   0 - 17937 poll_s 15:07 ?   
00:00:00 sshd: r...@pts/2
4 S root 16602 16599  0  80   0 -  4429 wait   15:07 pts/2   
00:00:00
 
  -bash
 
4 R root 16611 16602  0  80   0 -  3698 -  15:08 pts/2   
00:00:00
 
  ps -elf
 
1 S root 31506 2  0  80   0 - 0 bdi_wr 14:25 ?   
00:00:00 [flush-253:0]

All in wait?
   
   Maybe a block driver problem?  Are you using virtio?
 
  Yes, I always 

Re: Gentoo guest with smp: emerge freeze while recompile world

2010-05-21 Thread Riccardo
-- Original Message --- 
 From: Brian Jackson i...@theiggy.com 
 To: Riccardo andrighetto.ricca...@gmail.com 
 Cc: kvm@vger.kernel.org 
 Sent: Fri, 21 May 2010 11:35:36 -0500 
 Subject: Re: Gentoo guest with smp: emerge freeze while recompile world

 On Friday, May 21, 2010 10:46:10 am Riccardo wrote: 
  -- Original Message --- 
   From: Avi Kivity a...@redhat.com 
   To: Riccardo andrighetto.ricca...@gmail.com 
   Cc: kvm@vger.kernel.org 
   Sent: Fri, 21 May 2010 18:21:20 +0300 
   Subject: Re: Gentoo guest with smp: emerge freeze while recompile world 
  
On 05/21/2010 04:16 PM, Riccardo wrote: 
 ... 
 
 There are almost impossible to debug. 
 
 Try copying vmlinux out of your guest and attach with gdb when it 
 hangs.  Then issue the command 
 
(gdb) thread apply all backtrace 
 
 to see what the guest is doing. 
 
 panic. 
 --- End of Original Message --- 
 
 Hi, 
 I compile gentoo-sources-2.6.31-r10 and with this kernel emerge -e 
 world complete without errors! 

Interesting.  Can you do a git bisect to see where it stops working? 
  
   Ehm, sorry, I don't understand the request. Do you have a link? 
  
 I always use the same .config 
 
 After I try gentoo-sources-2.6.34 and vanilla-sources-2.6.34 but the 
 problem remain, the compile freeze and I see this in ps -elf: 
 
 5 S root  1013 1  0   76  -4 -  3125 poll_s 13:00 ?   
 00:00:00 /sbin/udevd --daemon 
 1 S root  2669 1  0   80   0 -  7523 wait   13:00 ?   
 00:00:00 supervising syslog-ng 
 5 S root  2670  2669  0   80   0 -   7556 poll_s 13:00 ?   
 00:00:00 /usr/sbin/syslog-ng 
 1 S root  3258 1  0   80   0 -  9505 poll_s 13:00 ?   
 00:00:00 /usr/sbin/sshd 
 1 S root  3378 1  0   80   0 -  4115 hrtime 13:00 ?   
 00:00:00 /usr/sbin/cron 
 0 S root  3446 1  0   80   0 -  1493 n_tty_ 13:00 tty2 
 00:00:00 /sbin/agetty 38400 tty2 linux 
 0 S root  3447 1  0   80   0 -  1493 n_tty_ 13:00 tty3 
 00:00:00 /sbin/agetty 38400 tty3 linux 
 0 S root  3448 1  0   80   0 -  1493 n_tty_ 13:00 tty4 
 00:00:00 /sbin/agetty 38400 tty4 linux 
 0 S root  3449 1  0   80   0 -  1493 n_tty_ 13:00 tty5 
 00:00:00 /sbin/agetty 38400 tty5 linux 
 0 S root  3450 1  0   80   0 -  1493 n_tty_ 13:00 tty6 
 00:00:00 /sbin/agetty 38400 tty6 linux 
 5 S root  3457 1  0   80   0 -  5959 poll_s 13:00 ?   
 00:00:00 SCREEN -S sb1 
 4 S root  3458  3457  0   80   0 -   4454 wait   13:00 pts/0   
 00:00:00 -/bin/bash 
 4 S root  3462  3458  0   75  -5 - 45171 poll_s 13:00 pts/0   
 00:00:34 /usr/bin/python2.6 /usr/bin/emerge -e world 
 4 S root  3613 1  0   80   0 - 14014 wait   13:01 tty1 
 00:00:00 /bin/login -- 
 4 S root  3953  3613  0   80   0 -   4429 n_tty_ 13:01 tty1 
  
   00:00:00 -bash 
  
 0 S root  6614  3462  0   75  -5 -   972 wait   14:26 pts/0   
 00:00:00 [dev-util/pkgconfig-0.23] sandbox 
 /usr/lib64/portage/bin/ebuild.sh compile 4 S root  6615  6614 
 0  75  -5 -   6362 wait   14:26 pts/000:00:00 /bin/bash 
 /usr/lib64/portage/bin/ebuild.sh compile 
 5 S root  6646  6615  0   75  -5 -   6745 wait   14:26 pts/0   
 00:00:00 /bin/bash /usr/lib64/portage/bin/ebuild.sh compile 
 4 S root 13235  6646  0  75   -5 -   3651 wait   14:27 pts/0   
 00:00:00 make -j8 
 4 S root 13238 13235  0  75   -5 -  3652 wait   14:27 pts/0   
 00:00:00 make all-recursive 
 4 S root 13239 13238  0  75   -5 -  5956 wait   14:27 pts/0   
 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; 
 \?target=`echo all-recursive | sed s/-recursive//`; \?list= 
 5 S root 13243 13239  0  75   -5 -  5956 wait   14:27 pts/0   
 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; 
 \?target=`echo all-recursive | sed s/-recursive//`; \?list= 
 4 S root 13244 13243  0  75   -5 -  3686 wait   14:27 pts/0   
 00:00:00 make all 
 4 S root 13358 13244  0  75   -5 -  3684 wait   14:27 pts/0   
 00:00:00 make all-recursive 
 4 S root 13359 13358  0  75   -5 -  5956 wait   14:27 pts/0   
 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; 
 \?target=`echo all-recursive | sed s/-recursive//`; \?list= 
 5 S root 16546 13359  0  75   -5 -  5956 wait   14:28 pts/0   
 00:00:00 /bin/sh -c set fnord $MAKEFLAGS; amf=$2; \?dot_seen=no; 
 \?target=`echo all-recursive | sed s/-recursive//`; \?list= 
 4 S root 16547 16546  0  75   -5 -  3652 wait   14:28 pts/0   
 00:00:00 make all 
 4 S root 16548 16547  0  75   -5 -  3652 n_tty_ 14:28 pts/0   
 00:00:00 make all-am 
 4 S root 16599  3258  0  

Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Jan Kiszka
Anthony Liguori wrote:
 On 05/21/2010 08:46 AM, Jan Kiszka wrote:
 Anthony Liguori wrote:

 On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:
  
 Trace events should be defined in trace.h.  Events are written to
 /tmp/trace.log and can be formatted using trace.py.  Remember to add
 events to trace.py for pretty-printing.

 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 ---
Makefile.objs |2 +-
trace.c   |   64
 +
trace.h   |9 
trace.py  |   30 ++
4 files changed, 104 insertions(+), 1 deletions(-)
create mode 100644 trace.c
create mode 100644 trace.h
create mode 100755 trace.py

 diff --git a/Makefile.objs b/Makefile.objs
 index acbaf22..307e989 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
# block-obj-y is code used by both qemu system emulation and qemu-img

block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o
 module.o
 -block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
 +block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o

 diff --git a/trace.c b/trace.c
 new file mode 100644
 index 000..2fec4d3
 --- /dev/null
 +++ b/trace.c
 @@ -0,0 +1,64 @@
 +#include <stdlib.h>
 +#include <stdio.h>
 +#include "trace.h"
 +
 +typedef struct {
 +    unsigned long event;
 +    unsigned long x1;
 +    unsigned long x2;
 +    unsigned long x3;
 +    unsigned long x4;
 +    unsigned long x5;
 +} TraceRecord;
 +
 +enum {
 +    TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
 +};
 +
 +static TraceRecord trace_buf[TRACE_BUF_LEN];
 +static unsigned int trace_idx;
 +static FILE *trace_fp;
 +
 +static void trace(TraceEvent event, unsigned long x1,
 +                  unsigned long x2, unsigned long x3,
 +                  unsigned long x4, unsigned long x5) {
 +    TraceRecord *rec = &trace_buf[trace_idx];
 +    rec->event = event;
 +    rec->x1 = x1;
 +    rec->x2 = x2;
 +    rec->x3 = x3;
 +    rec->x4 = x4;
 +    rec->x5 = x5;
 +
 +    if (++trace_idx == TRACE_BUF_LEN) {
 +        trace_idx = 0;
 +
 +        if (!trace_fp) {
 +            trace_fp = fopen("/tmp/trace.log", "w");
 +        }
 +        if (trace_fp) {
 +            size_t result = fwrite(trace_buf, sizeof trace_buf, 1, trace_fp);
 +            result = result;
 +        }
 +    }
 +}


 It is probably worthwhile to read trace points via the monitor or
 through some other mechanism.  My concern would be that writing even 64k
 out to disk would introduce significant performance overhead, mainly because
 it runs in lock-step with the guest's VCPU.

 Maybe it's worth adding a thread that syncs the ring to disk if we want
 to write to disk?
  
 That's not what QEMU should worry about. If somehow possible, let's push
 this into the hands of a (user space) tracing framework, ideally one
 that is already designed for such requirements. E.g. there exists quite
 useful work in the context of LTTng (user space RCU for application
 tracing).

 
  From what I understand, none of the current kernel approaches to 
 userspace tracing have much momentum at the moment.
 
 We may need simple stubs for the case that no such framework is (yet)
 available. But effort should focus on a QEMU infrastructure to add
 useful tracepoints to the code. Specifically when tracing over KVM, you
 usually need information about kernel states as well, so you depend on
 an integrated approach, not Yet Another Log File.

 
 I think the simple code that Stefan pasted gives us 95% of what we need.

IMHO not 95%, but it is a start.

I would just like to avoid spending too much effort on re-inventing
smart trace buffers, trace daemons, or trace visualization tools. It
would be better to pick up some semi-perfect approach (e.g. [1], which
unfortunately still seems to lack kernel integration) and drive it
according to our needs.
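
The "simple stubs" fallback mentioned above could look roughly like the
following sketch. It is only an illustration under stated assumptions:
CONFIG_TRACE_BACKEND, trace_event(), trace_backend_emit(), the two event
IDs and example_read() are all hypothetical names, not part of QEMU or of
any existing tracing framework.

```c
#include <assert.h>

/* Hypothetical event IDs; a real list would live in trace.h. */
typedef enum {
    TRACE_EXAMPLE_READ,
    TRACE_EXAMPLE_WRITE,
} TraceEvent;

#ifdef CONFIG_TRACE_BACKEND   /* hypothetical build-time switch */
/* Provided by whichever tracing framework is linked in. */
void trace_backend_emit(TraceEvent event, unsigned long x1, unsigned long x2);
#define trace_event(event, x1, x2) trace_backend_emit((event), (x1), (x2))
#else
/* No framework available: the tracepoint compiles to an empty inline stub. */
static inline void trace_event(TraceEvent event, unsigned long x1,
                               unsigned long x2)
{
    (void)event;
    (void)x1;
    (void)x2;
}
#endif

/* Example call site: the tracepoint stays in the code either way. */
static unsigned long example_read(unsigned long sector, unsigned long count)
{
    assert(count > 0);
    trace_event(TRACE_EXAMPLE_READ, sector, count);
    return sector + count;  /* first sector after the request */
}
```

With this shape, call sites do not change when a real framework becomes
available; only the definition behind trace_event() does.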

Jan

[1] http://lttng.org/ust

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH] add support for protocol driver create_options

2010-05-21 Thread Kevin Wolf
On 20.05.2010 07:36, MORITA Kazutaka wrote:
 This patch enables protocol drivers to use their create options which
 are not supported by the format.  For example, protocol drivers can use
 a backing_file option with raw format.
 
 Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
 ---
  block.c   |7 +++
  block.h   |1 +
  qemu-img.c|   49 ++---
  qemu-option.c |   52 +---
  qemu-option.h |2 ++
  5 files changed, 85 insertions(+), 26 deletions(-)
 
 diff --git a/block.c b/block.c
 index 48d8468..0ab9424 100644
 --- a/block.c
 +++ b/block.c
 @@ -56,7 +56,6 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t 
 sector_num,
  uint8_t *buf, int nb_sectors);
  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
   const uint8_t *buf, int nb_sectors);
 -static BlockDriver *find_protocol(const char *filename);
  
  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
  QTAILQ_HEAD_INITIALIZER(bdrv_states);
 @@ -210,7 +209,7 @@ int bdrv_create_file(const char* filename, 
 QEMUOptionParameter *options)
  {
  BlockDriver *drv;
  
 -drv = find_protocol(filename);
 +drv = bdrv_find_protocol(filename);
  if (drv == NULL) {
  drv = bdrv_find_format("file");
  }
 @@ -283,7 +282,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
  return drv;
  }
  
 -static BlockDriver *find_protocol(const char *filename)
 +BlockDriver *bdrv_find_protocol(const char *filename)
  {
  BlockDriver *drv1;
  char protocol[128];
 @@ -469,7 +468,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char 
 *filename, int flags)
  BlockDriver *drv;
  int ret;
  
 -drv = find_protocol(filename);
 +drv = bdrv_find_protocol(filename);
  if (!drv) {
  return -ENOENT;
  }
 diff --git a/block.h b/block.h
 index 24efeb6..9034ebb 100644
 --- a/block.h
 +++ b/block.h
 @@ -54,6 +54,7 @@ void bdrv_info_stats(Monitor *mon, QObject **ret_data);
  
  void bdrv_init(void);
  void bdrv_init_with_whitelist(void);
 +BlockDriver *bdrv_find_protocol(const char *filename);
  BlockDriver *bdrv_find_format(const char *format_name);
  BlockDriver *bdrv_find_whitelisted_format(const char *format_name);
  int bdrv_create(BlockDriver *drv, const char* filename,
 diff --git a/qemu-img.c b/qemu-img.c
 index d3c30a7..8ae7184 100644
 --- a/qemu-img.c
 +++ b/qemu-img.c
 @@ -252,8 +252,8 @@ static int img_create(int argc, char **argv)
  const char *base_fmt = NULL;
  const char *filename;
  const char *base_filename = NULL;
 -BlockDriver *drv;
 -QEMUOptionParameter *param = NULL;
 +BlockDriver *drv, *proto_drv;
 +QEMUOptionParameter *param = NULL, *create_options = NULL;
  char *options = NULL;
  
  flags = 0;
 @@ -286,33 +286,42 @@ static int img_create(int argc, char **argv)
  }
  }
  
 +/* Get the filename */
 +if (optind >= argc)
 +help();
 +filename = argv[optind++];
 +
 /* Find driver and parse its options */
 drv = bdrv_find_format(fmt);
 if (!drv)
 error("Unknown file format '%s'", fmt);
 
 +proto_drv = bdrv_find_protocol(filename);
 +if (!proto_drv)
 +error("Unknown protocol '%s'", filename);
 +
 +create_options = append_option_parameters(create_options,
 +  drv->create_options);
 +create_options = append_option_parameters(create_options,
 +  proto_drv->create_options);
 +
 if (options && !strcmp(options, "?")) {
 -print_option_help(drv->create_options);
 +print_option_help(create_options);
 return 0;
 }
 
 /* Create parameter list with default values */
 -param = parse_option_parameters("", drv->create_options, param);
 +param = parse_option_parameters("", create_options, param);
 set_option_parameter_int(param, BLOCK_OPT_SIZE, -1);
 
 /* Parse -o options */
 if (options) {
 -param = parse_option_parameters(options, drv->create_options, param);
 +param = parse_option_parameters(options, create_options, param);
 if (param == NULL) {
 error("Invalid options for file format '%s'.", fmt);
 }
 }
 
 -/* Get the filename */
 -if (optind >= argc)
 -help();
 -filename = argv[optind++];
 -
 /* Add size to parameters */
 if (optind < argc) {
 set_option_parameter(param, BLOCK_OPT_SIZE, argv[optind++]);
 @@ -362,6 +371,7 @@ static int img_create(int argc, char **argv)
 puts("");
 
 ret = bdrv_create(drv, filename, param);
 +free_option_parameters(create_options);
 free_option_parameters(param);
 
 if (ret < 0) {
 @@ -543,14 +553,14 @@ static int img_convert(int argc, char **argv)
  {
  int c, ret, n, n1, bs_n, bs_i, flags, cluster_size, 
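
The quoted patch relies on append_option_parameters() to merge the format
driver's and the protocol driver's create options, but its body is cut off
in this excerpt. Below is a minimal sketch of that idea, not the patch's
actual implementation. OptionParam, count_params() and append_params() are
hypothetical stand-ins for the QEMU types, and skipping duplicate names is
an assumption about how a merge could reasonably behave.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for QEMUOptionParameter:
 * a list is an array terminated by an entry whose name is NULL. */
typedef struct {
    const char *name;
    const char *help;
} OptionParam;

/* Count the entries before the NULL-name terminator (NULL list -> 0). */
static size_t count_params(const OptionParam *list)
{
    size_t n = 0;
    while (list && list[n].name) {
        n++;
    }
    return n;
}

/* Grow dest so that it also holds the entries of list.
 * Entries whose name is already present in dest are skipped (assumption).
 * Returns the new list, or NULL if allocation fails (dest is then untouched). */
static OptionParam *append_params(OptionParam *dest, const OptionParam *list)
{
    size_t n_dest = count_params(dest);
    size_t n_list = count_params(list);
    OptionParam *out = realloc(dest, (n_dest + n_list + 1) * sizeof(*out));

    if (!out) {
        return NULL;
    }
    for (size_t i = 0; i < n_list; i++) {
        int dup = 0;
        for (size_t j = 0; j < n_dest; j++) {
            if (strcmp(out[j].name, list[i].name) == 0) {
                dup = 1;
                break;
            }
        }
        if (!dup) {
            out[n_dest++] = list[i];
        }
    }
    /* Re-terminate the merged list. */
    out[n_dest].name = NULL;
    out[n_dest].help = NULL;
    return out;
}
```

With this shape, calling append_params(NULL, format_opts) and then
append_params(all, protocol_opts) yields one list that can be handed to
both the help printer and the option parser, which is how img_create()
uses create_options in the patch.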

ixgbe: macvlan on PF/VF when SRIOV is enabled

2010-05-21 Thread Shirley Ma
Hello Jeff,

macvlan doesn't work on the PF when SRIOV is enabled. Creating the macvlan
succeeds, but ping (ICMP request) goes to the VF interface instead of the
PF/macvlan, even though the ARP entry is correct. I patched the ixgbe driver,
and macvlan on the PF works with the patch. But I am not sure whether it is
right, since I don't have the HW spec. What I did in the ixgbe driver was:

1. PF's rar index is 0, VMDQ index is adapter->num_vfs;
2. VF's rar is based on rar_used_count and mc_addr_in_rar_count, VMDQ
index is ;
3. The rar index of PF's secondary addresses is PF's rar index + i, VMDQ
index is adapter->num_vfs.


Before I submit the patch, I want to understand the right assignment
for both the rar index and the VMDQ index when SRIOV is enabled:
1. Is the VMDQ index for the PF adapter->num_vfs, or 0? Is the rar index 0?
2. Is the PF's secondary address rar index based on
rar_used_count/mc_addr_in_rar_count?
3. Is the VF's VMDQ index based on the vf number?
4. Is the VF's rar index vf + 1, or should it be based on rar_used_count?

I am also working on macvlan on the VF. The question here is whether macvlan
on the VF should work or not. It looks like ixgbevf secondary addresses are
not in the receive address filter, so macvlan on the VF doesn't work.

Thanks
Shirley



Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Stefan Hajnoczi
On Fri, May 21, 2010 at 5:52 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 I would just like to avoid that too much efforts are spent on
 re-inventing smart trace buffers, trace daemons, or trace visualization
 tools. Then better pick up some semi-perfect approach (e.g. [1], it
 unfortunately still seems to lack kernel integration) and drive it
 according to our needs.

I agree we have to consider existing solutions.  The killer is the
usability: what dependencies are required to build with tracing?  Is a
patched kernel or module required?  How easy is it to add static trace
events during debugging?

If there are too many dependencies, especially to unpackaged software,
many people will stop right there and not bother.  A patched kernel or
module isn't acceptable since the hassle of reconfiguring a system for
tracing becomes too great (or in some cases changing the kernel is not
possible/allowed).

Adding new static trace events should be easy, too.  Ideally it
doesn't require adding information about the trace event in multiple
places (header files, C files, etc).  It also shouldn't require
learning about the tracing system; adding a trace event should be
self-explanatory so anyone can easily add one for debugging.

A lot of opinions there, but what I'm saying is that friction must be
low.  If the tracing system is a pain to use, then no-one will use it.
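
One common C idiom for the "describe each event in exactly one place"
goal is an X-macro list. The sketch below is only an illustration of
that idiom, not something proposed in this thread; the event names and
the helper trace_event_name() are hypothetical.

```c
#include <stddef.h>

/* Hypothetical event list: adding an event means adding one line here. */
#define TRACE_EVENT_LIST(X) \
    X(BDRV_AIO_READ)        \
    X(BDRV_AIO_WRITE)       \
    X(VIRTIO_NOTIFY)

/* Expand the list into an enum of event IDs. */
#define X_ENUM(name) TRACE_##name,
typedef enum {
    TRACE_EVENT_LIST(X_ENUM)
    TRACE_EVENT_COUNT
} TraceEvent;
#undef X_ENUM

/* Expand the same list into a table of printable names. */
#define X_NAME(name) #name,
static const char *const trace_event_names[] = {
    TRACE_EVENT_LIST(X_NAME)
};
#undef X_NAME

/* Map an event ID to its name; out-of-range IDs yield "unknown". */
static const char *trace_event_name(TraceEvent ev)
{
    if ((unsigned)ev >= TRACE_EVENT_COUNT) {
        return "unknown";
    }
    return trace_event_names[ev];
}
```

Because the enum and the name table are generated from the same list,
they cannot drift apart, and a pretty-printer could be generated from
that list as well instead of being updated by hand.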

http://lttng.org/files/ust/manual/ust.html

LTTng Userspace Tracer looks interesting - no kernel support required
AFAICT.  Toggling trace events in a running process supported.
Similar to kernel tracepoint.h and existing report/visualization tool.

x86 (32- and 64-bit) only.  Like you say, no correlation with kernel trace data.

I'll try to give LTTng UST a spin by converting my trace events to use
UST.  This seems closest to an existing tracing system we can drop in.

http://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps

Requires kernel support - not sure if enough of utrace is in mainline
for this to work out-of-the-box across distros.

Unclear how exactly SystemTap userspace probing would work out.  Does
anyone have experience or want to try this?

Stefan


Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Anthony Liguori

On 05/21/2010 11:52 AM, Jan Kiszka wrote:

Anthony Liguori wrote:
   

On 05/21/2010 08:46 AM, Jan Kiszka wrote:
 

Anthony Liguori wrote:

   

On 05/21/2010 04:42 AM, Stefan Hajnoczi wrote:

 

Trace events should be defined in trace.h.  Events are written to
/tmp/trace.log and can be formatted using trace.py.  Remember to add
events to trace.py for pretty-printing.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
Makefile.objs |2 +-
trace.c   |   64
+
trace.h   |9 
trace.py  |   30 ++
4 files changed, 104 insertions(+), 1 deletions(-)
create mode 100644 trace.c
create mode 100644 trace.h
create mode 100755 trace.py

diff --git a/Makefile.objs b/Makefile.objs
index acbaf22..307e989 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -8,7 +8,7 @@ qobject-obj-y += qerror.o
# block-obj-y is code used by both qemu system emulation and qemu-img

block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o
module.o
-block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
+block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o trace.o
block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o

diff --git a/trace.c b/trace.c
new file mode 100644
index 000..2fec4d3
--- /dev/null
+++ b/trace.c
@@ -0,0 +1,64 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include "trace.h"
+
+typedef struct {
+unsigned long event;
+unsigned long x1;
+unsigned long x2;
+unsigned long x3;
+unsigned long x4;
+unsigned long x5;
+} TraceRecord;
+
+enum {
+TRACE_BUF_LEN = 64 * 1024 / sizeof(TraceRecord),
+};
+
+static TraceRecord trace_buf[TRACE_BUF_LEN];
+static unsigned int trace_idx;
+static FILE *trace_fp;
+
+static void trace(TraceEvent event, unsigned long x1,
+  unsigned long x2, unsigned long x3,
+  unsigned long x4, unsigned long x5) {
+TraceRecord *rec = &trace_buf[trace_idx];
+rec->event = event;
+rec->x1 = x1;
+rec->x2 = x2;
+rec->x3 = x3;
+rec->x4 = x4;
+rec->x5 = x5;
+
+if (++trace_idx == TRACE_BUF_LEN) {
+trace_idx = 0;
+
+if (!trace_fp) {
+trace_fp = fopen("/tmp/trace.log", "w");
+}
+if (trace_fp) {
+size_t result = fwrite(trace_buf, sizeof trace_buf, 1,
trace_fp);
+result = result;
+}
+}
+}


   

It is probably worthwhile to read trace points via the monitor or
through some other mechanism.  My concern would be that writing even 64k
out to disk would introduce noticeable performance overhead, mainly
because it runs in lock-step with the guest's VCPU.

Maybe it's worth adding a thread that syncs the ring to disk if we want
to write to disk?

 

That's not what QEMU should worry about. If somehow possible, let's push
this into the hands of a (user space) tracing framework, ideally one
that is already designed for such requirements. E.g. there exists quite
useful work in the context of LTTng (user space RCU for application
tracing).

   

   From what I understand, none of the current kernel approaches to
userspace tracing have much momentum at the moment.

 

We may need simple stubs for the case that no such framework is (yet)
available. But effort should focus on a QEMU infrastructure to add
useful tracepoints to the code. Specifically when tracing over KVM, you
usually need information about kernel states as well, so you depend on
an integrated approach, not Yet Another Log File.

   

I think the simple code that Stefan pasted gives us 95% of what we need.
 

IMHO not 95%, but it is a start.
   


I'm not opposed to using a framework, but I'd rather have an equivalent 
to kvm_stat tomorrow than wait 3 years for LTTng to not get merged.


So let's have a dirt-simple tracing mechanism and focus on adding useful 
trace points.  Then when we have a framework we can use, we can just 
convert the tracepoints to the new framework.


Regards,

Anthony Liguori


I would just like to avoid that too much efforts are spent on
re-inventing smart trace buffers, trace daemons, or trace visualization
tools. Then better pick up some semi-perfect approach (e.g. [1], it
unfortunately still seems to lack kernel integration) and drive it
according to our needs.

Jan

[1] http://lttng.org/ust

   




Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Christoph Hellwig
On Fri, May 21, 2010 at 09:49:56PM +0100, Stefan Hajnoczi wrote:
 http://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps
 
 Requires kernel support - not sure if enough of utrace is in mainline
 for this to work out-of-the-box across distros.

Nothing of utrace is in mainline, never mind the whole systemtap code,
which is intentionally kept out of the kernel tree.  Using this means
that for every probe in userspace code you need to keep the configured
source tree of the currently running kernel around, which is completely
unusable for typical developer setups.



Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Jan Kiszka
Stefan Hajnoczi wrote:
 On Fri, May 21, 2010 at 5:52 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 I would just like to avoid that too much efforts are spent on
 re-inventing smart trace buffers, trace daemons, or trace visualization
 tools. Then better pick up some semi-perfect approach (e.g. [1], it
 unfortunately still seems to lack kernel integration) and drive it
 according to our needs.
 
 I agree we have to consider existing solutions.  The killer is the
 usability: what dependencies are required to build with tracing?  Is a
 patched kernel or module required?  How easy is it to add static trace
 events during debugging?
 
 If there are too many dependencies, especially to unpackaged software,
 many people will stop right there and not bother.  A patched kernel or
 module isn't acceptable since the hassle of reconfiguring a system for
 tracing becomes too great (or in some cases changing the kernel is not
 possible/allowed).
 
 Adding new static trace events should be easy, too.  Ideally it
 doesn't require adding information about the trace event in multiple
 places (header files, C files, etc).  It also shouldn't require
 learning about the tracing system, adding a trace event should be
 self-explanatory so anyone can easily add one for debugging.
 
 A lot of opinions there, but what I'm saying is that friction must be
 low.  If the tracing system is a pain to use, then no-one will use it.

No question.

I mentioned LTTng as it is most promising /wrt performance (both when
enabled and disabled). But LTTng was so far not best in class when it
came to usability.

 
 http://lttng.org/files/ust/manual/ust.html
 
 LTTng Userspace Tracer looks interesting - no kernel support required
 AFAICT.  Toggling trace events in a running process supported.
 Similar to kernel tracepoint.h and existing report/visualization tool.
 
 x86 (32- and 64-bit) only.

Sure? I thought there might be an arch dependency due to urcu but it has
generic support as well now.

  Like you say, no correlation with kernel trace data.

It would be good if we could still hook into tracepoints and stream
them out differently. That would allow for ad-hoc tracing when
performance does not matter that much (trace to file, trace to kernel).
But we would still benefit from enabling tracepoints during runtime and
keeping them built in.
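
A rough sketch of what "built in, enabled at runtime, streamed out
through a replaceable hook" could look like in C. This is an assumption
about one possible shape, not code from the patch series; every name
here (TraceBackend, trace_point(), trace_set_enabled(), the counting
backend) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* A backend receives every enabled event; it could write to a file,
 * to the kernel, or to a tracing framework (hypothetical interface). */
typedef void (*TraceBackend)(unsigned event, unsigned long arg);

enum { TRACE_MAX_EVENTS = 32 };   /* hypothetical upper bound */

static bool trace_enabled[TRACE_MAX_EVENTS];  /* all disabled at start */
static TraceBackend trace_backend;            /* NULL means discard */

/* Install or replace the backend at runtime. */
static void trace_set_backend(TraceBackend be)
{
    trace_backend = be;
}

/* Toggle one event; returns false for an out-of-range event ID. */
static bool trace_set_enabled(unsigned event, bool on)
{
    if (event >= TRACE_MAX_EVENTS) {
        return false;
    }
    trace_enabled[event] = on;
    return true;
}

/* The tracepoint itself stays compiled in; when the event is disabled
 * the cost is a bounds check and one array load. */
static inline void trace_point(unsigned event, unsigned long arg)
{
    if (event < TRACE_MAX_EVENTS && trace_enabled[event] && trace_backend) {
        trace_backend(event, arg);
    }
}

/* Example backend that only counts how many events it has seen. */
static unsigned long trace_count_hits;

static void trace_count_backend(unsigned event, unsigned long arg)
{
    (void)event;
    (void)arg;
    trace_count_hits++;
}
```

Swapping trace_count_backend for a function that writes to a log file or
forwards to another tracer would change where events go without touching
any tracepoint call site.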

Jan





Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Jan Kiszka
Anthony Liguori wrote:
 I'm not opposed to using a framework, but I'd rather have an equivalent
 to kvm_stat tomorrow than wait 3 years for LTTng to not get merged.
 
 So let's have a dirt-simple tracing mechanism and focus on adding useful
 trace points.  Then when we have a framework we can use, we can just
 convert the tracepoints to the new framework.

That could mean serializing the tracepoints to strings and dumping them
to our log file - no concerns.
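
For illustration, serializing a trace record to a text line could be as
small as the sketch below. The TraceRecord layout follows the struct in
the quoted patch; the function names and the output format are
assumptions made for this example only.

```c
#include <stdio.h>

/* Same fields as the TraceRecord in the quoted trace.c patch. */
typedef struct {
    unsigned long event;
    unsigned long x1;
    unsigned long x2;
    unsigned long x3;
    unsigned long x4;
    unsigned long x5;
} TraceRecord;

/* Format one record as a single text line (hypothetical format).
 * Returns what snprintf returns: the length the full line needs,
 * which is >= len if the output was truncated. */
static int trace_record_to_string(const TraceRecord *rec,
                                  char *buf, size_t len)
{
    return snprintf(buf, len,
                    "event=%lu 0x%lx 0x%lx 0x%lx 0x%lx 0x%lx\n",
                    rec->event, rec->x1, rec->x2, rec->x3,
                    rec->x4, rec->x5);
}

/* Append one record to an already opened log stream.
 * Returns 0 on success, -1 on formatting or write failure. */
static int trace_record_log(FILE *fp, const TraceRecord *rec)
{
    char line[160];
    int n = trace_record_to_string(rec, line, sizeof(line));

    if (n < 0 || (size_t)n >= sizeof(line)) {
        return -1;
    }
    return fputs(line, fp) == EOF ? -1 : 0;
}
```

Text output like this trades the compactness of the binary ring buffer
for lines that can be read with grep, which fits the "dirt-simple"
approach discussed here.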

Jan





Re: [Qemu-devel] [PATCH 1/2] trace: Add simple tracing support

2010-05-21 Thread Anthony Liguori

On 05/21/2010 04:41 PM, Jan Kiszka wrote:

Anthony Liguori wrote:
   

I'm not opposed to using a framework, but I'd rather have an equivalent
to kvm_stat tomorrow than wait 3 years for LTTng to not get merged.

So let's have a dirt-simple tracing mechanism and focus on adding useful
trace points.  Then when we have a framework we can use, we can just
convert the tracepoints to the new framework.
 

That could mean serializing the tracepoints to strings and dumping them
to our log file - no concerns.
   


Which I really don't mind.

Regards,

Anthony Liguori


Jan

   




Re: [PATCH 0/7] Consolidate vcpu ioctl locking

2010-05-21 Thread Carsten Otte

On 15.05.2010 10:26, Alexander Graf wrote:

On S390, I'm also still sceptical if the implementation we have really works. A 
device injects an S390_INTERRUPT with its address and on the next vcpu_run, a 
corresponding interrupt is issued. But what happens if two devices trigger an 
S390_INTERRUPT before the vcpu_run? We'd have lost an interrupt by then...

We're safe on that: the interrupt info fields in both struct kvm (for 
floating interrupts) and struct vcpu (for cpu local interrupts) have 
their own locking and can queue up interrupts.


cheers,
Carsten
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html