Re: Raw vs. tap (was: Re: [Qemu-devel] Re: Release plan for 0.12.0)

2009-10-15 Thread Mark McLoughlin
On Wed, 2009-10-14 at 17:53 -0500, Anthony Liguori wrote:

 So at this point, I think it's a mistake to include raw socket support.  
 If the goal is to improve networking usability such that it "just works" 
 as a root user, let's incorporate a default network script that creates 
 a bridge or something like that.  There are better ways to achieve that 
 goal.

FWIW, I haven't really played with the raw backend yet, but my initial
thought was also "what exactly does this gain us apart from yet more
confusion for users?".

So, I tend to agree, but I'm not so hung up on the user confusion
aspect - the users that I worry about confusing (e.g. virt-manager
users) would never even know the backend exists, even if qemu did
support it.

The one hope I had for raw is that it might allow us to get closer to
the NIC, get more details on the NIC tx queue and have more intelligent
tx mitigation. This is probably better explored in the context of
vhost-net, though.

Wrt configuring bridges, libvirt comes with a good default setup - a
bridge without any physical NICs connected, but NAT set up for access to
the outside world.

For bridging to a physical NIC, our plan continues to be that
NetworkManager will eventually make this trivial for users, but that's
still in progress. In the meantime, the config isn't all that complex:

  http://wiki.libvirt.org/page/Networking#Fedora.2FRHEL_Bridging
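For readers who want the short version without following the link, the wiki's Fedora/RHEL recipe boils down to two ifcfg files. A minimal sketch (the interface names, DHCP addressing and exact key set are assumptions here; the wiki page is authoritative):

```python
import os
import tempfile

# Sketch of the Fedora/RHEL bridge config described on the wiki page.
# Interface names (eth0/br0) and DHCP addressing are assumptions.
IFCFG_BR0 = """\
DEVICE=br0
TYPE=Bridge
BOOTPROTO=dhcp
ONBOOT=yes
DELAY=0
"""

IFCFG_ETH0 = """\
DEVICE=eth0
BRIDGE=br0
ONBOOT=yes
"""

def write_ifcfg(dirname):
    """Write both files into dirname -- point it at a scratch directory,
    not /etc/sysconfig/network-scripts, unless you mean it."""
    for name, body in (("ifcfg-br0", IFCFG_BR0), ("ifcfg-eth0", IFCFG_ETH0)):
        with open(os.path.join(dirname, name), "w") as f:
            f.write(body)
```

After dropping the real files in place you would restart the network service; NetworkManager of that era also had to be told to leave the bridged NIC alone.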

Cheers,
Mark.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM test: Add PCI pass through test

2009-10-15 Thread Yolkfull Chow
On Wed, Oct 14, 2009 at 09:08:00AM -0300, Lucas Meneghel Rodrigues wrote:
 Add a new PCI pass through test. It supports both SR-IOV virtual
 functions and physical NIC card pass through.
 
 Single Root I/O Virtualization (SR-IOV) allows a single PCI device to
 be shared amongst multiple virtual machines while retaining the
 performance benefit of assigning a PCI device to a virtual machine.
 A common example is where a single SR-IOV capable NIC - with perhaps
 only a single physical network port - might be shared with multiple
 virtual machines by assigning a virtual function to each VM.
 
 SR-IOV support is implemented in the kernel. The core implementation is
 contained in the PCI subsystem, but there must also be driver support
 for both the Physical Function (PF) and Virtual Function (VF) devices.
 With an SR-IOV capable device one can allocate VFs from a PF. The VFs
 surface as PCI devices which are backed on the physical PCI device by
 resources (queues, and register sets).
 
 Device support:
 
 In 2.6.30, the Intel® 82576 Gigabit Ethernet Controller is the only
 SR-IOV capable device supported. The igb driver has PF support and the
 igbvf has VF support.
 
 In 2.6.31 the Neterion® X3100™ is supported as well. This device uses
 the same vxge driver for the PF as well as the VFs.

Wow, a new NIC that supports SR-IOV...
At this rate, should we move the driver name and its parameters into the
config file, so that when a new NIC using a different driver is supported
in future, we can handle it without changing code?
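One way Yolkfull's suggestion could look as code (entirely hypothetical names; only the igb/igbvf and vxge pairings come from the commit message above):

```python
# Hypothetical sketch: keep per-NIC driver names in the test config
# instead of hard-coding them, so a newly supported SR-IOV NIC is just
# one more entry. Only the igb/igbvf and vxge pairings come from the
# commit message; the keys and the 'vfs_param' field are invented here.
SRIOV_DRIVERS = {
    # key: (PF driver, VF driver, module param that enables VFs)
    "intel-82576": ("igb", "igbvf", "max_vfs"),
    "neterion-x3100": ("vxge", "vxge", None),  # one driver for PF and VFs
}

def modprobe_command(device, nr_vfs):
    """Build the modprobe line that would allocate VFs for a device."""
    pf_driver, vf_driver, vfs_param = SRIOV_DRIVERS[device]
    if vfs_param is None:
        return "modprobe %s" % pf_driver
    return "modprobe %s %s=%d" % (pf_driver, vfs_param, nr_vfs)
```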

 
 In order to configure the test:
 
   * For SR-IOV virtual functions passthrough, we could specify the
 module parameter 'max_vfs' in config file.
   * For physical NIC card pass through, we should specify the device
 name(s).
 
 Signed-off-by: Yolkfull Chow yz...@redhat.com
 ---
  client/tests/kvm/kvm_tests.cfg.sample |   11 ++-
  client/tests/kvm/kvm_utils.py |  278 
 +
  client/tests/kvm/kvm_vm.py|   72 +
  3 files changed, 360 insertions(+), 1 deletions(-)
 
 diff --git a/client/tests/kvm/kvm_tests.cfg.sample 
 b/client/tests/kvm/kvm_tests.cfg.sample
 index cc3228a..1dad188 100644
 --- a/client/tests/kvm/kvm_tests.cfg.sample
 +++ b/client/tests/kvm/kvm_tests.cfg.sample
 @@ -786,13 +786,22 @@ variants:
          only default
          image_format = raw
  
 -
  variants:
      - @smallpages:
      - hugepages:
          pre_command = "/usr/bin/python scripts/hugepage.py /mnt/kvm_hugepage"
          extra_params += " -mem-path /mnt/kvm_hugepage"
  
 +variants:
 +    - @no_passthrough:
 +        pass_through = no
 +    - nic_passthrough:
 +        pass_through = pf
 +        passthrough_devs = eth1
 +    - vfs_passthrough:
 +        pass_through = vf
 +        max_vfs = 7
 +        vfs_count = 7
  
  variants:
      - @basic:
 diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
 index 53b664a..0e3398c 100644
 --- a/client/tests/kvm/kvm_utils.py
 +++ b/client/tests/kvm/kvm_utils.py
 @@ -788,3 +788,281 @@ def md5sum_file(filename, size=None):
          size -= len(data)
      f.close()
      return o.hexdigest()
 +
 +
 +def get_full_id(pci_id):
 +    """
 +    Get full PCI ID of pci_id.
 +    """
 +    cmd = "lspci -D | awk '/%s/ {print $1}'" % pci_id
 +    status, full_id = commands.getstatusoutput(cmd)
 +    if status != 0:
 +        return None
 +    return full_id
 +
 +
 +def get_vendor_id(pci_id):
 +    """
 +    Check out the device vendor ID according to PCI ID.
 +    """
 +    cmd = "lspci -n | awk '/%s/ {print $3}'" % pci_id
 +    return re.sub(":", " ", commands.getoutput(cmd))
 +
 +
 +def release_dev(pci_id, pci_dict):
 +    """
 +    Release a single PCI device.
 +
 +    @param pci_id: PCI ID of a given PCI device
 +    @param pci_dict: Dictionary with information about PCI devices
 +    """
 +    base_dir = "/sys/bus/pci"
 +    full_id = get_full_id(pci_id)
 +    vendor_id = get_vendor_id(pci_id)
 +    drv_path = os.path.join(base_dir, "devices/%s/driver" % full_id)
 +    if 'pci-stub' in os.readlink(drv_path):
 +        cmd = "echo '%s' > %s/new_id" % (vendor_id, drv_path)
 +        if os.system(cmd):
 +            return False
 +
 +        stub_path = os.path.join(base_dir, "drivers/pci-stub")
 +        cmd = "echo '%s' > %s/unbind" % (full_id, stub_path)
 +        if os.system(cmd):
 +            return False
 +
 +        prev_driver = pci_dict[pci_id]
 +        cmd = "echo '%s' > %s/bind" % (full_id, prev_driver)
 +        if os.system(cmd):
 +            return False
 +    return True
 +
 +
 +def release_pci_devs(pci_dict):
 +    """
 +    Release all PCI devices assigned to host.
 +
 +    @param pci_dict: Dictionary with information about PCI devices
 +    """
 +    for pci_id in pci_dict:
 +        if not release_dev(pci_id, pci_dict):
 +            logging.error("Failed to release device [%s] to host" % pci_id)
 +        else:
 +            logging.info("Released device [%s] successfully" % pci_id)
 +
 +
 +class PassThrough(object):
 

Re: [Autotest] [PATCH] Add pass through feature test (support SR-IOV)

2009-10-15 Thread Yolkfull Chow
On Wed, Oct 14, 2009 at 09:13:59AM -0300, Lucas Meneghel Rodrigues wrote:
 Yolkfull, I've studied about single root IO virtualization before
 reviewing your patch, the general approach here looks good. There were
 some stylistic points as far as code is concerned, so I have rebased
 your patch against the latest trunk, and added some explanation about
 the features being tested and referenced (extracted from a Fedora 12
 blueprint).
 
 Please let me know if you are OK with it, I guess I will review this
 patch a couple more times, as the code and the features being tested
 are fairly complex.
 
 Thanks!

Lucas, thank you very much for adding a detailed explanation and
improving this test. I have reviewed the new patch and some new
considerations came to mind. I have added them in this email, please
review. :)

 
 On Mon, Sep 14, 2009 at 11:20 PM, Yolkfull Chow yz...@redhat.com wrote:
  It supports both SR-IOV virtual functions' and physical NIC card pass 
  through.
   * For SR-IOV virtual functions passthrough, we could specify the module
     parameter 'max_vfs' in config file.
   * For physical NIC card pass through, we should specify the device name(s).
 
  Signed-off-by: Yolkfull Chow yz...@redhat.com
  ---
   client/tests/kvm/kvm_tests.cfg.sample |   12 ++
   client/tests/kvm/kvm_utils.py         |  248 
  -
   client/tests/kvm/kvm_vm.py            |   68 +-
   3 files changed, 326 insertions(+), 2 deletions(-)
 
  diff --git a/client/tests/kvm/kvm_tests.cfg.sample 
  b/client/tests/kvm/kvm_tests.cfg.sample
  index a83ef9b..c6037da 100644
  --- a/client/tests/kvm/kvm_tests.cfg.sample
  +++ b/client/tests/kvm/kvm_tests.cfg.sample
  @@ -627,6 +627,18 @@ variants:
 
 
   variants:
  +    - @no_passthrough:
  +        pass_through = no
  +    - nic_passthrough:
  +        pass_through = pf
  +        passthrough_devs = eth1
  +    - vfs_passthrough:
  +        pass_through = vf
  +        max_vfs = 7
  +        vfs_count = 7
  +
  +
  +variants:
      - @basic:
          only Fedora Windows
      - @full:
  diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
  index dfca938..1fe3b31 100644
  --- a/client/tests/kvm/kvm_utils.py
  +++ b/client/tests/kvm/kvm_utils.py
  @@ -1,5 +1,5 @@
   import md5, thread, subprocess, time, string, random, socket, os, signal, 
  pty
  -import select, re, logging, commands
  +import select, re, logging, commands, cPickle
   from autotest_lib.client.bin import utils
   from autotest_lib.client.common_lib import error
   import kvm_subprocess
  @@ -795,3 +795,249 @@ def md5sum_file(filename, size=None):
          size -= len(data)
      f.close()
      return o.hexdigest()
  +
  +
  +def get_full_id(pci_id):
  +    """
  +    Get full PCI ID of pci_id.
  +    """
  +    cmd = "lspci -D | awk '/%s/ {print $1}'" % pci_id
  +    status, full_id = commands.getstatusoutput(cmd)
  +    if status != 0:
  +        return None
  +    return full_id
  +
  +
  +def get_vendor_id(pci_id):
  +    """
  +    Check out the device vendor ID according to PCI ID.
  +    """
  +    cmd = "lspci -n | awk '/%s/ {print $3}'" % pci_id
  +    return re.sub(":", " ", commands.getoutput(cmd))
  +
  +
  +def release_pci_devs(dict):
  +    """
  +    Release assigned PCI devices to host.
  +    """
  +    def release_dev(pci_id):
  +        base_dir = "/sys/bus/pci"
  +        full_id = get_full_id(pci_id)
  +        vendor_id = get_vendor_id(pci_id)
  +        drv_path = os.path.join(base_dir, "devices/%s/driver" % full_id)
  +        if 'pci-stub' in os.readlink(drv_path):
  +            cmd = "echo '%s' > %s/new_id" % (vendor_id, drv_path)
  +            if os.system(cmd):
  +                return False
  +
  +            stub_path = os.path.join(base_dir, "drivers/pci-stub")
  +            cmd = "echo '%s' > %s/unbind" % (full_id, stub_path)
  +            if os.system(cmd):
  +                return False
  +
  +            prev_driver = self.dev_prev_drivers[pci_id]
  +            cmd = "echo '%s' > %s/bind" % (full_id, prev_driver)
  +            if os.system(cmd):
  +                return False
  +        return True
  +
  +    for pci_id in dict.keys():
  +        if not release_dev(pci_id):
  +            logging.error("Failed to release device [%s] to host" % pci_id)
  +        else:
  +            logging.info("Released device [%s] successfully" % pci_id)
  +
  +
  +class PassThrough:
  +    """
  +    Request passthroughable devices on host. It will check whether to request
  +    PF (physical NIC cards) or VF (Virtual Functions).
  +    """
  +    def __init__(self, type="nic_vf", max_vfs=None, names=None):
  +        """
  +        Initialize parameter 'type' which could be:
  +        nic_vf: Virtual Functions
  +        nic_pf: Physical NIC card
  +        mixed:  Both includes VFs and PFs
  +
  +        If pass through Physical NIC cards, we need to specify which devices
  +        to be assigned, e.g. 'eth1 eth2'.
  +
  +        If pass through Virtual Functions, we 

Re: Add qemu_send_raw() to vlan.

2009-10-15 Thread Mark McLoughlin
Hi Gleb,

On Tue, 2009-05-26 at 13:03 +0300, Gleb Natapov wrote:
 It gets a packet without the virtio header and adds it if needed.  Allows
 injecting packets into a vlan from outside, e.g. to send a gratuitous ARP.
...
 diff --git a/net.h b/net.h
 index 931133b..3d0b6f2 100644
 --- a/net.h
 +++ b/net.h
 ...
 @@ -63,6 +64,7 @@ int qemu_can_send_packet(VLANClientState *vc);
  ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov,
int iovcnt);
  int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
 +void qemu_send_packet_raw(VLANClientState *vc, const uint8_t *buf, int size);
  void qemu_format_nic_info_str(VLANClientState *vc, uint8_t macaddr[6]);
  void qemu_check_nic_model(NICInfo *nd, const char *model);
  void qemu_check_nic_model_list(NICInfo *nd, const char * const *models,

I've only just now noticed that we never actually made announce_self()
use this ... care to do that?

Cheers,
Mark.



Re: [PATCH] Xen PV-on-HVM guest support (v2)

2009-10-15 Thread Avi Kivity

On 10/15/2009 02:41 PM, Ed Swierk wrote:

Support for Xen PV-on-HVM guests can be implemented almost entirely in
userspace, except for handling one annoying MSR that maps a Xen
hypercall blob into guest address space.

A generic mechanism to delegate MSR writes to userspace seems overkill
and risks encouraging similar MSR abuse in the future.  Thus this patch
adds special support for the Xen HVM MSR.

I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
KVM which MSR the guest will write to, as well as the starting address
and size of the hypercall blobs (one each for 32-bit and 64-bit) that
userspace has loaded from files.  When the guest writes to the MSR, KVM
copies one page of the blob from userspace to the guest.
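As a sanity check on the mechanism, here is a small Python model of how the patch decodes the guest's MSR write (masks taken from the xen_hvm_config() code quoted below; the value packing itself is the Xen hypercall-page convention):

```python
PAGE_SIZE = 4096
PAGE_MASK = ~(PAGE_SIZE - 1)

def decode_xen_hvm_msr(data, long_mode):
    """Model of the decode in xen_hvm_config(): the value the guest
    writes packs a page-aligned guest-physical destination in the high
    bits and a blob page index in the low 12 bits; EFER.LME picks the
    32- or 64-bit blob."""
    blob = 1 if long_mode else 0
    pnum = data & ~PAGE_MASK   # which page of the hypercall blob
    paddr = data & PAGE_MASK   # guest-physical page to copy it to
    return blob, pnum, paddr
```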

I've tested this patch with a hacked-up version of Gerd's userspace
code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

v2: fix ioctl struct padding; renumber CAP and ioctl constants; check
kvm_write_guest() return value; change printks to KERN_DEBUG (I think
they're worth keeping for debugging userspace)



+#ifdef KVM_CAP_XEN_HVM
+struct kvm_xen_hvm_config {
+   __u32 msr;
+   __u8 pad[2];
+   __u8 blob_size[2];
+   __u64 blob_addr[2];
+};
+#endif
   


Please change the arrays to separate variables (e.g. blob_size_32, 
blob_size_64), so readers don't have to guess the meaning.


Also, reserve a bunch of space at the end in case we need more hackery.

Is the msr number really variable?  Isn't it an ABI?


   * ioctls for vcpu fds
Index: kvm-kmod/include/linux/kvm_host.h
===
--- kvm-kmod.orig/include/linux/kvm_host.h
+++ kvm-kmod/include/linux/kvm_host.h
@@ -236,6 +236,10 @@ struct kvm {
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
  #endif
+
+#ifdef KVM_CAP_XEN_HVM
+   struct kvm_xen_hvm_config xen_hvm_config;
+#endif
  };
   


struct kvm_arch is a better place for this.


  /* The guest did something we don't support. */
Index: kvm-kmod/x86/x86.c
===
--- kvm-kmod.orig/x86/x86.c
+++ kvm-kmod/x86/x86.c
@@ -875,6 +875,35 @@ static int set_msr_mce(struct kvm_vcpu *
return 0;
  }

+#ifdef KVM_CAP_XEN_HVM
   


No need for the ifdef - it will always be defined for x86.


+static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
+{
+   int blob = !!(vcpu->arch.shadow_efer & EFER_LME);
   


Can use is_long_mode() for this.


+   u32 pnum = data & ~PAGE_MASK;
+   u64 paddr = data & PAGE_MASK;
+   u8 *page;
+   int r = 1;
+
+   if (pnum >= vcpu->kvm->xen_hvm_config.blob_size[blob])
+   goto out;
+   page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+   if (!page)
+   goto out;
+   if (copy_from_user(page, (u8 *)vcpu->kvm->xen_hvm_config.blob_addr[blob]
+  + pnum * PAGE_SIZE, PAGE_SIZE))
+   goto out_free;
   


We want to return -EFAULT here (but make sure the entire code path 
allows this).



+   if (kvm_write_guest(vcpu->kvm, paddr, page, PAGE_SIZE))
+   goto out_free;
+   printk(KERN_DEBUG "kvm: copied xen hvm blob %d page %d to 0x%llx\n",
+  blob, pnum, paddr);
+   r = 0;
+out_free:
+   kfree(page);
+out:
+   return r;
+}
+#endif
+
  int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  {
switch (msr) {
@@ -990,6 +1019,10 @@ int kvm_set_msr_common(struct kvm_vcpu *
"0x%x data 0x%llx\n", msr, data);
break;
default:
+#ifdef KVM_CAP_XEN_HVM
+   if (msr && (msr == vcpu->kvm->xen_hvm_config.msr))
+   return xen_hvm_config(vcpu, data);
+#endif
   


Again, can skip the ifdef.


if (!ignore_msrs) {
pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
msr, data);
@@ -2453,6 +2486,17 @@ long kvm_arch_vm_ioctl(struct file *filp
r = 0;
break;
}
+#ifdef KVM_CAP_XEN_HVM
+   case KVM_XEN_HVM_CONFIG: {
+   r = -EFAULT;
+   if (copy_from_user(&kvm->xen_hvm_config, argp,
+  sizeof(struct kvm_xen_hvm_config)))
+   goto out;
+   printk(KERN_DEBUG "kvm: configured xen hvm\n");
+   r = 0;
+   break;
+   }
+#endif
default:
;
}
   


Do we need support for reading the msr?

IMO you can drop the debugging printk()s.  I don't see how they add much 
value.


Please submit the patch against a current kernel tree, not kvm-kmod.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


Re: [PATCH] kvm: Prevent kvm_init from corrupting debugfs structures

2009-10-15 Thread Avi Kivity

On 10/15/2009 08:21 AM, Darrick J. Wong wrote:

I'm seeing an oops condition when kvm-intel and kvm-amd are modprobe'd
during boot (say on an Intel system) and then rmmod'd:

# modprobe kvm-intel
  kvm_init()
  kvm_init_debug()
  kvm_arch_init()-- stores debugfs dentries internally
  (success, etc)

# modprobe kvm-amd
  kvm_init()
  kvm_init_debug()-- second initialization clobbers kvm's
   internal pointers to dentries
  kvm_arch_init()
  kvm_exit_debug()-- and frees them

# rmmod kvm-intel
  kvm_exit()
  kvm_exit_debug()-- double free of debugfs files!

  *BOOM*

If execution gets to the end of kvm_init(), then the calling module has been
established as the kvm provider.  Move the debugfs initialization to the end of
the function, and remove the now-unnecessary call to kvm_exit_debug() from the
error path.  That way we avoid trampling on the debugfs entries and freeing
them twice.
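The ordering bug and the fix can be modeled in a few lines of Python (a toy model of the call sequence above, not kernel code):

```python
class DoubleFree(Exception):
    pass

debugfs_dentry = None  # stands in for kvm's static debugfs dentry pointers

def debug_init():
    global debugfs_dentry
    debugfs_dentry = object()      # a second caller silently clobbers the first

def debug_free():
    global debugfs_dentry
    if debugfs_dentry is None:
        raise DoubleFree("debugfs files freed twice")
    debugfs_dentry = None

def kvm_init(arch_init_ok, fixed):
    """Model of kvm_init(); 'fixed' moves debugfs setup after arch init."""
    if not fixed:
        debug_init()               # old order: debugfs registered first
    if not arch_init_ok:           # kvm_arch_init() fails for the second module
        if not fixed:
            debug_free()           # old error path frees the shared entries
        return False
    if fixed:
        debug_init()               # new order: only the winning module registers
    return True

def kvm_exit():
    debug_free()                   # rmmod of the first module
```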

   


Looks good.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] Xen PV-on-HVM guest support (v2)

2009-10-15 Thread Jan Kiszka
Ed Swierk wrote:
 Support for Xen PV-on-HVM guests can be implemented almost entirely in
 userspace, except for handling one annoying MSR that maps a Xen
 hypercall blob into guest address space.
 
 A generic mechanism to delegate MSR writes to userspace seems overkill
 and risks encouraging similar MSR abuse in the future.  Thus this patch
 adds special support for the Xen HVM MSR.
 
 I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
 KVM which MSR the guest will write to, as well as the starting address
 and size of the hypercall blobs (one each for 32-bit and 64-bit) that
 userspace has loaded from files.  When the guest writes to the MSR, KVM
 copies one page of the blob from userspace to the guest.
 
 I've tested this patch with a hacked-up version of Gerd's userspace
 code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
 FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.
 
 v2: fix ioctl struct padding; renumber CAP and ioctl constants; check
 kvm_write_guest() return value; change printks to KERN_DEBUG (I think
 they're worth keeping for debugging userspace)

I disagree /wrt the print in the IOCTL path (missing configuration can
also be reported on access), and the guest-triggered path at least
requires a pr_debug conversion. Looks fine to me otherwise.

Jan

 
 Signed-off-by: Ed Swierk eswi...@aristanetworks.com
 
 ---
 Index: kvm-kmod/include/asm-x86/kvm.h
 ===
 --- kvm-kmod.orig/include/asm-x86/kvm.h
 +++ kvm-kmod/include/asm-x86/kvm.h
 @@ -59,6 +59,7 @@
  #define __KVM_HAVE_MSIX
  #define __KVM_HAVE_MCE
  #define __KVM_HAVE_PIT_STATE2
 +#define __KVM_HAVE_XEN_HVM
  
  /* Architectural interrupt line count. */
  #define KVM_NR_INTERRUPTS 256
 Index: kvm-kmod/include/linux/kvm.h
 ===
 --- kvm-kmod.orig/include/linux/kvm.h
 +++ kvm-kmod/include/linux/kvm.h
 @@ -476,6 +476,9 @@ struct kvm_ioeventfd {
  #endif
  #define KVM_CAP_IOEVENTFD 36
  #define KVM_CAP_SET_IDENTITY_MAP_ADDR 37
 +#ifdef __KVM_HAVE_XEN_HVM
 +#define KVM_CAP_XEN_HVM 38
 +#endif
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
 @@ -528,6 +531,15 @@ struct kvm_x86_mce {
  };
  #endif
  
 +#ifdef KVM_CAP_XEN_HVM
 +struct kvm_xen_hvm_config {
 + __u32 msr;
 + __u8 pad[2];
 + __u8 blob_size[2];
 + __u64 blob_addr[2];
 +};
 +#endif
 +
 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
  
  struct kvm_irqfd {
 @@ -586,6 +598,7 @@ struct kvm_irqfd {
  #define KVM_CREATE_PIT2 _IOW(KVMIO, 0x77, struct 
 kvm_pit_config)
 #define KVM_SET_BOOT_CPU_ID _IO(KVMIO, 0x78)
  #define KVM_IOEVENTFD _IOW(KVMIO, 0x79, struct kvm_ioeventfd)
+#define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct 
 kvm_xen_hvm_config)
  
  /*
   * ioctls for vcpu fds
 Index: kvm-kmod/include/linux/kvm_host.h
 ===
 --- kvm-kmod.orig/include/linux/kvm_host.h
 +++ kvm-kmod/include/linux/kvm_host.h
 @@ -236,6 +236,10 @@ struct kvm {
   unsigned long mmu_notifier_seq;
   long mmu_notifier_count;
  #endif
 +
 +#ifdef KVM_CAP_XEN_HVM
 + struct kvm_xen_hvm_config xen_hvm_config;
 +#endif
  };
  
  /* The guest did something we don't support. */
 Index: kvm-kmod/x86/x86.c
 ===
 --- kvm-kmod.orig/x86/x86.c
 +++ kvm-kmod/x86/x86.c
 @@ -875,6 +875,35 @@ static int set_msr_mce(struct kvm_vcpu *
   return 0;
  }
  
 +#ifdef KVM_CAP_XEN_HVM
 +static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
 +{
 + int blob = !!(vcpu->arch.shadow_efer & EFER_LME);
 + u32 pnum = data & ~PAGE_MASK;
 + u64 paddr = data & PAGE_MASK;
 + u8 *page;
 + int r = 1;
 +
 + if (pnum >= vcpu->kvm->xen_hvm_config.blob_size[blob])
 + goto out;
 + page = kzalloc(PAGE_SIZE, GFP_KERNEL);
 + if (!page)
 + goto out;
 + if (copy_from_user(page, (u8 *)vcpu->kvm->xen_hvm_config.blob_addr[blob]
 +    + pnum * PAGE_SIZE, PAGE_SIZE))
 + goto out_free;
 + if (kvm_write_guest(vcpu->kvm, paddr, page, PAGE_SIZE))
 + goto out_free;
 + printk(KERN_DEBUG "kvm: copied xen hvm blob %d page %d to 0x%llx\n",
 +blob, pnum, paddr);
 + r = 0;
 +out_free:
 + kfree(page);
 +out:
 + return r;
 +}
 +#endif
 +
  int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  {
   switch (msr) {
 @@ -990,6 +1019,10 @@ int kvm_set_msr_common(struct kvm_vcpu *
  "0x%x data 0x%llx\n", msr, data);
   break;
   default:
 +#ifdef KVM_CAP_XEN_HVM
 + if (msr && (msr == vcpu->kvm->xen_hvm_config.msr))
 + return xen_hvm_config(vcpu, data);
 +#endif
   if (!ignore_msrs) {
  pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
   msr, data);
 @@ -2453,6 

Re: [PATCH][RFC] Xen PV-on-HVM guest support

2009-10-15 Thread Gerd Hoffmann

On 10/15/09 09:17, Jan Kiszka wrote:

Ed Swierk wrote:

Overall it seems pretty solid for Linux PV-on-HVM guests. I think more
work is needed to support full PV guests, but I don't know how much.
Have folks been asking about PV-on-HVM or full PV?


Not all requests were that concrete /wrt technology, but some had
older setups and were definitely using full PV.


I had full pv working at some point as well, but I think those patches 
are almost a year old by now and need quite some work to make them work 
on today's master branch ...


cheers,
  Gerd


Re: [PATCH][RFC] Xen PV-on-HVM guest support

2009-10-15 Thread Jan Kiszka
Ed Swierk wrote:
 Thanks for the feedback; I'll post a new version shortly.
 
 On Tue, Oct 13, 2009 at 11:45 PM, Jan Kiszka jan.kis...@web.de wrote:
 Interesting stuff. How usable is your work at this point? I've no
 immediate demand, but the question if one could integrate Xen guests
 with KVM already popped up more than once @work.
 
 So far I've managed to boot CentOS 5.3 (both i386 and x86_64) and use
 the Xen PV block and net devices, with pretty good performance. I've
 also booted FreeBSD 8.0-RC1 (amd64 only) with a XENHVM kernel and used
 the Xen PV block and net devices, but the performance of the net
 device is significantly worse than with CentOS. Also some FreeBSD
 applications use a flag that's not yet implemented in the net device
 emulation, but I'm working on fixing that.
 
 Overall it seems pretty solid for Linux PV-on-HVM guests. I think more
 work is needed to support full PV guests, but I don't know how much.
 Have folks been asking about PV-on-HVM or full PV?

Not all requests were that concrete /wrt technology, but some had
older setups and were definitely using full PV.

Jan



signature.asc
Description: OpenPGP digital signature


Re: Add qemu_send_raw() to vlan.

2009-10-15 Thread Gleb Natapov
On Thu, Oct 15, 2009 at 08:04:45AM +0100, Mark McLoughlin wrote:
 Hi Gleb,
 
 On Tue, 2009-05-26 at 13:03 +0300, Gleb Natapov wrote:
  It gets a packet without the virtio header and adds it if needed.  Allows
  injecting packets into a vlan from outside, e.g. to send a gratuitous ARP.
 ...
  diff --git a/net.h b/net.h
  index 931133b..3d0b6f2 100644
  --- a/net.h
  +++ b/net.h
  ...
  @@ -63,6 +64,7 @@ int qemu_can_send_packet(VLANClientState *vc);
   ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov,
 int iovcnt);
   int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
  +void qemu_send_packet_raw(VLANClientState *vc, const uint8_t *buf, int 
  size);
   void qemu_format_nic_info_str(VLANClientState *vc, uint8_t macaddr[6]);
   void qemu_check_nic_model(NICInfo *nd, const char *model);
   void qemu_check_nic_model_list(NICInfo *nd, const char * const *models,
 
 I've only just now noticed that we never actually made announce_self()
 use this ... care to do that?
 
Something like this:

---
Use qemu_send_packet_raw to send gratuitous ARP. This will ensure that
the vnet header is handled properly.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/savevm.c b/savevm.c
index 7a363b6..8ea2daf 100644
--- a/savevm.c
+++ b/savevm.c
@@ -132,7 +132,7 @@ static void qemu_announce_self_once(void *opaque)
 len = announce_self_create(buf, nd_table[i].macaddr);
 vlan = nd_table[i].vlan;
    for(vc = vlan->first_client; vc != NULL; vc = vc->next) {
-        vc->receive(vc, buf, len);
+        qemu_send_packet_raw(vc, buf, len);
 }
 }
 if (count--) {
--
Gleb.


Re: Add qemu_send_raw() to vlan.

2009-10-15 Thread Mark McLoughlin
On Thu, 2009-10-15 at 09:33 +0200, Gleb Natapov wrote:
 On Thu, Oct 15, 2009 at 08:04:45AM +0100, Mark McLoughlin wrote:
  Hi Gleb,
  
  On Tue, 2009-05-26 at 13:03 +0300, Gleb Natapov wrote:
   It gets a packet without the virtio header and adds it if needed.  Allows
   injecting packets into a vlan from outside, e.g. to send a gratuitous ARP.
  ...
   diff --git a/net.h b/net.h
   index 931133b..3d0b6f2 100644
   --- a/net.h
   +++ b/net.h
   ...
   @@ -63,6 +64,7 @@ int qemu_can_send_packet(VLANClientState *vc);
ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov,
  int iovcnt);
int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
   +void qemu_send_packet_raw(VLANClientState *vc, const uint8_t *buf, int 
   size);
void qemu_format_nic_info_str(VLANClientState *vc, uint8_t macaddr[6]);
void qemu_check_nic_model(NICInfo *nd, const char *model);
void qemu_check_nic_model_list(NICInfo *nd, const char * const *models,
  
  I've only just now noticed that we never actually made announce_self()
  use this ... care to do that?
  
 Something like this:
 
 ---
 Use qemu_send_packet_raw to send gratuitous ARP. This will ensure that
 the vnet header is handled properly.
 
 Signed-off-by: Gleb Natapov g...@redhat.com

Acked-by: Mark McLoughlin mar...@redhat.com

 diff --git a/savevm.c b/savevm.c
 index 7a363b6..8ea2daf 100644
 --- a/savevm.c
 +++ b/savevm.c
 @@ -132,7 +132,7 @@ static void qemu_announce_self_once(void *opaque)
  len = announce_self_create(buf, nd_table[i].macaddr);
  vlan = nd_table[i].vlan;
     for(vc = vlan->first_client; vc != NULL; vc = vc->next) {
  -        vc->receive(vc, buf, len);
  +        qemu_send_packet_raw(vc, buf, len);

This makes things even more gratuitous because we're making every net
client send the packet rather than receive it, but it works fine in
practice.
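For reference, the announce packet in question is a gratuitous ARP: an ARP request broadcast with the sender's own address as both sender and target IP. A generic sketch (illustrative only -- not byte-for-byte what QEMU's announce_self_create() emits):

```python
import struct

def gratuitous_arp(mac, ip):
    """Build a generic gratuitous ARP request frame: Ethernet broadcast,
    sender and target protocol address both set to our own IP."""
    eth = b"\xff" * 6 + mac + struct.pack("!H", 0x0806)   # dst, src, EtherType=ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)       # htype, ptype, hlen, plen, op=request
    arp += mac + ip                                       # sender hardware/protocol address
    arp += b"\x00" * 6 + ip                               # target hw unknown, target IP = ours
    return eth + arp
```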

Cheers,
Mark.



Re: [Qemu-devel] [STABLE PATCH] hotplug: fix scsi hotplug.

2009-10-15 Thread Gerd Hoffmann

On 10/14/09 19:30, Dustin Kirkland wrote:

Also note that I did not replace the bios.bin, as it appears to me
that the qemu-kvm-0.11 bios.bin is working properly.


Yes, kvm has its own bios, only for vanilla upstream the bios must be 
replaced.


cheers,
  Gerd



Re: [Qemu-devel] Re: Release plan for 0.12.0

2009-10-15 Thread Michael S. Tsirkin
On Wed, Oct 14, 2009 at 02:10:00PM -0700, Sridhar Samudrala wrote:
 On Wed, 2009-10-14 at 17:50 +0200, Michael S. Tsirkin wrote:
  On Wed, Oct 14, 2009 at 04:19:17PM +0100, Jamie Lokier wrote:
   Michael S. Tsirkin wrote:
On Wed, Oct 14, 2009 at 09:17:15AM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 Looks like Or has abandoned it.  I have an updated version which 
 works
 with new APIs, etc.  Let me post it and we'll go from there.

   
 I'm generally inclined to oppose the functionality as I don't think 
 it  offers any advantages over the existing backends.
 

 I patch it in and use it all the time.  It's much easier to set up
 on a random machine than a bridged config.
   

 Having two things that do the same thing is just going to lead to 
 user  
 confusion.

They do not do the same thing. With a raw socket you can use Windows
Update without a bridge on the host; with tap you can't.
   
   On the other hand, with raw socket, guest Windows can't access files
   on the host's Samba share can it?  So it's not that useful even for
   Windows guests.
  
  I guess this depends on whether you use the same host for samba :)
  
 If the problem is tap is too hard to setup, we should try to  
 simplify tap configuration.

The problem is bridge is too hard to setup.
Simplifying that is a good idea, but outside the scope
of the qemu project.
   
   I venture it's important enough for qemu that it's worth working on
   that.  Something that looks like the raw socket but behaves like an
   automatically instantiated bridge attached to the bound interface
   would be a useful interface.
  
  I agree, that would be good to have.
 
 Can't we bind the raw socket to the tap interface instead of the
  physical interface and allow the bridge config to work?


We can, kind of (e.g. with veth) but what's the point then?

 Thanks
 Sridhar
 
 
  
   I don't have much time, but I'll help anybody who wants to do that.
   
   -- Jamie


Re: Raw vs. tap (was: Re: [Qemu-devel] Re: Release plan for 0.12.0)

2009-10-15 Thread Michael S. Tsirkin
On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote:
 I would be much more inclined to consider  
 taking raw and improving the performance long term if guest-host  
 networking worked.  This appears to be a fundamental limitation though  
 and I think it's something that will forever plague users if we include  
 this feature.

In fact, I think it's fixable with a raw socket bound to a macvlan.
Would that be enough?
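For reference, the macvlan device itself is a couple of iproute2 commands; a minimal sketch (interface names are assumptions, and the qemu side is omitted):

```shell
# Create a macvlan on top of the physical NIC; the raw socket would then
# be bound to macvlan0 instead of eth0.  In bridge mode, frames between
# macvlan devices on the same lower device are switched locally, which is
# what would make guest<->host traffic work (the host would talk through
# its own macvlan as well, not through eth0 directly).
ip link add link eth0 name macvlan0 type macvlan mode bridge
ip link set macvlan0 up
```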

-- 
MST


Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Avi Kivity

On 10/14/2009 01:06 AM, Jan Kiszka wrote:

Add a new IOCTL pair to retrieve or set the VCPU state in one chunk.
More precisely, the IOCTL is able to process a list of substates to be
read or written. This list is easily extensible without breaking the
existing ABI, thus we will no longer have to add new IOCTLs when we
discover a missing VCPU state field or want to support new hardware
features.

This patch establishes the generic infrastructure for KVM_GET/
SET_VCPU_STATE and adds support for the generic substates REGS, SREGS,
FPU, and MP. To avoid code duplication, the entry point for the
corresponding original IOCTLs are converted to make use of the new
infrastructure internally, too.



+/* for KVM_GET_VCPU_STATE and KVM_SET_VCPU_STATE */
+#define KVM_VCPU_REGS  0
+#define KVM_VCPU_SREGS 1
+#define KVM_VCPU_FPU   2
+#define KVM_VCPU_MP    3
   


KVM_VCPU_STATE_*, to avoid collisions.

Better to split sse from fpu since we already know it is about to be 
replaced.



+
+struct kvm_vcpu_substate {
+   __u32 type;
+   __u32 pad;
+   __s64 offset;
+};
+
+#define KVM_MAX_VCPU_SUBSTATES 64
+
+struct kvm_vcpu_state {
+   __u32 nsubstates; /* number of elements in substates */
+   __u32 nprocessed; /* return value: successfully processed substates */
+   struct kvm_vcpu_substate substates[0];
+};
+
   


Wouldn't having an ordinary struct with lots of reserved space be 
simpler?  If we add a bitmask, we can even selectively get/set the 
fields we want (important if new state extends old state: avx vs sse).



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Avi Kivity

On 10/14/2009 01:06 AM, Jan Kiszka wrote:

@@ -1586,6 +1719,7 @@ static long kvm_dev_ioctl_check_extension_generic(long arg)
case KVM_CAP_USER_MEMORY:
case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
+   case KVM_CAP_VCPU_STATE:
  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
case KVM_CAP_SET_BOOT_CPU_ID:
  #endif
   


This should be done only for the archs that implement the ioctl.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 4/4] KVM: x86: Add VCPU substate for NMI states

2009-10-15 Thread Avi Kivity

On 10/14/2009 01:06 AM, Jan Kiszka wrote:

This plugs an NMI-related hole in the VCPU synchronization between
kernel and user space. So far, neither pending NMIs nor the inhibit NMI
mask was properly read/set which was able to cause problems on
vmsave/restore, live migration and system reset. Fix it by making use
of the new VCPU substate interface.


+struct kvm_nmi_state {
+   __u8 pending;
+   __u8 masked;
+   __u8 pad1[2];
+};
   


Best to be conservative and use 64-bit alignment.  Who knows what we 
might put after this someday.

@@ -513,6 +513,8 @@ struct kvm_x86_ops {
unsigned char *hypercall_addr);
void (*set_irq)(struct kvm_vcpu *vcpu);
void (*set_nmi)(struct kvm_vcpu *vcpu);
+   int (*get_nmi_mask)(struct kvm_vcpu *vcpu);
+   void (*set_nmi_mask)(struct kvm_vcpu *vcpu, int masked);
   


Prefer bool for booleans, please.

Needs a KVM_CAP as well.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Avi Kivity

On 10/14/2009 01:06 AM, Jan Kiszka wrote:

Add a new IOCTL pair to retrieve or set the VCPU state in one chunk.
More precisely, the IOCTL is able to process a list of substates to be
read or written. This list is easily extensible without breaking the
existing ABI, thus we will no longer have to add new IOCTLs when we
discover a missing VCPU state field or want to support new hardware
features.

This patch establishes the generic infrastructure for KVM_GET/
SET_VCPU_STATE and adds support for the generic substates REGS, SREGS,
FPU, and MP. To avoid code duplication, the entry point for the
corresponding original IOCTLs are converted to make use of the new
infrastructure internally, too.

   


One last thing - Documentation/kvm/api.txt needs updating.  Glauber, 
this holds for your patches as well.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] Xen PV-on-HVM guest support (v2)

2009-10-15 Thread Gerd Hoffmann

  Hi,


Is the msr number really variable? Isn't it an ABI?


Yes, it is variable.  The guest gets the msr number via cpuid ...


Do we need support for reading the msr?


I don't think so.

cheers,
  Gerd



Re: [PATCH] Xen PV-on-HVM guest support (v2)

2009-10-15 Thread Avi Kivity

On 10/15/2009 05:11 PM, Gerd Hoffmann wrote:

  Hi,


Is the msr number really variable? Isn't it an ABI?


Yes, it is variable.  The guest gets the msr number via cpuid ...


Do we need support for reading the msr?


I don't think so.



Thanks.  So Ed, I think you're good to go, but please update 
Documentation/kvm/api.txt for your next round.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Jan Kiszka
Avi Kivity wrote:
 On 10/14/2009 01:06 AM, Jan Kiszka wrote:
 Add a new IOCTL pair to retrieve or set the VCPU state in one chunk.
 More precisely, the IOCTL is able to process a list of substates to be
 read or written. This list is easily extensible without breaking the
 existing ABI, thus we will no longer have to add new IOCTLs when we
 discover a missing VCPU state field or want to support new hardware
 features.

 This patch establishes the generic infrastructure for KVM_GET/
 SET_VCPU_STATE and adds support for the generic substates REGS, SREGS,
 FPU, and MP. To avoid code duplication, the entry point for the
 corresponding original IOCTLs are converted to make use of the new
 infrastructure internally, too.



 +/* for KVM_GET_VCPU_STATE and KVM_SET_VCPU_STATE */
 +#define KVM_VCPU_REGS   0
 +#define KVM_VCPU_SREGS  1
 +#define KVM_VCPU_FPU    2
 +#define KVM_VCPU_MP     3

 
 KVM_VCPU_STATE_*, to avoid collisions.

OK.

 
 Better to split sse from fpu since we already know it is about to be 
 replaced.

The idea is to reuse the existing state structures, including struct
kvm_fpu. This allows us to provide the above substates for all archs and
simplifies the migration (see my qemu conversion patch). I think, once
we need support for new/wider registers in x86, we can introduce new
KVM_X86_VCPU_STATE_FPU_WHATEVER substates that are able to replace the
old one.

 
 +
 +struct kvm_vcpu_substate {
 +__u32 type;
 +__u32 pad;
 +__s64 offset;
 +};
 +
 +#define KVM_MAX_VCPU_SUBSTATES  64
 +
 +struct kvm_vcpu_state {
 +__u32 nsubstates; /* number of elements in substates */
 +__u32 nprocessed; /* return value: successfully processed substates */
 +struct kvm_vcpu_substate substates[0];
 +};
 +

 
 Wouldn't having an ordinary struct with lots of reserved space be 
 simpler?  If we add a bitmask, we can even selectively get/set the 
 fields we want (important if new state extends old state: avx vs sse).

Simpler - hmm, maybe. But also less flexible. This would establish a
static order inside this constantly growing mega struct. And a user only
interested in something small at its end would still have to allocate
memory for the whole thing (maybe megabytes in the future, who knows?).
And this mega struct will always carry all the legacy substates, even if
they aren't used anymore in practice.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Jan Kiszka
Avi Kivity wrote:
 On 10/14/2009 01:06 AM, Jan Kiszka wrote:
 @@ -1586,6 +1719,7 @@ static long kvm_dev_ioctl_check_extension_generic(long arg)
  case KVM_CAP_USER_MEMORY:
  case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
  case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
 +case KVM_CAP_VCPU_STATE:
   #ifdef CONFIG_KVM_APIC_ARCHITECTURE
  case KVM_CAP_SET_BOOT_CPU_ID:
   #endif

 
 This should be done only for the archs that implement the ioctl.

All archs already implement the core substates.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


Re: [PATCH 4/4] KVM: x86: Add VCPU substate for NMI states

2009-10-15 Thread Jan Kiszka
Avi Kivity wrote:
 On 10/14/2009 01:06 AM, Jan Kiszka wrote:
 This plugs an NMI-related hole in the VCPU synchronization between
 kernel and user space. So far, neither pending NMIs nor the inhibit NMI
 mask was properly read/set which was able to cause problems on
 vmsave/restore, live migration and system reset. Fix it by making use
 of the new VCPU substate interface.


 +struct kvm_nmi_state {
 +   __u8 pending;
 +   __u8 masked;
 +   __u8 pad1[2];
 +};

 
 Best to be conservative and use 64-bit alignment.  Who knows what we 
 might put after this someday.

OK.

 @@ -513,6 +513,8 @@ struct kvm_x86_ops {
  unsigned char *hypercall_addr);
  void (*set_irq)(struct kvm_vcpu *vcpu);
  void (*set_nmi)(struct kvm_vcpu *vcpu);
 +int (*get_nmi_mask)(struct kvm_vcpu *vcpu);
 +void (*set_nmi_mask)(struct kvm_vcpu *vcpu, int masked);

 
 Prefer bool for booleans, please.

OK.

 
 Needs a KVM_CAP as well.

KVM_CAP_VCPU_STATE will imply KVM_CAP_NMI_STATE, so I skipped the latter
(user space code would use the former anyway to avoid yet another #ifdef
layer).

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Jan Kiszka
Avi Kivity wrote:
 On 10/14/2009 01:06 AM, Jan Kiszka wrote:
 Add a new IOCTL pair to retrieve or set the VCPU state in one chunk.
 More precisely, the IOCTL is able to process a list of substates to be
 read or written. This list is easily extensible without breaking the
 existing ABI, thus we will no longer have to add new IOCTLs when we
 discover a missing VCPU state field or want to support new hardware
 features.

 This patch establishes the generic infrastructure for KVM_GET/
 SET_VCPU_STATE and adds support for the generic substates REGS, SREGS,
 FPU, and MP. To avoid code duplication, the entry point for the
 corresponding original IOCTLs are converted to make use of the new
 infrastructure internally, too.


 
 One last thing - Documentation/kvm/api.txt needs updating.  Glauber, 
 this holds for your patches as well.

OK, will be done once the interface settled.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


Re: [PATCH 4/4] KVM: x86: Add VCPU substate for NMI states

2009-10-15 Thread Avi Kivity

On 10/15/2009 06:22 PM, Jan Kiszka wrote:

Needs a KVM_CAP as well.
 

KVM_CAP_VCPU_STATE will imply KVM_CAP_NMI_STATE, so I skipped the latter
(user space code would use the former anyway to avoid yet another #ifdef
layer).
   


OK.  New bits will need the KVM_CAP, though.

Perhaps it makes sense to query about individual states, including 
existing ones?  That will allow us to deprecate and then phase out 
broken states.  It's probably not worth it.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



RE: Can't make virtio block driver work on Windows 2003

2009-10-15 Thread Martin Maurer
Maybe you can find some useful hints in this thread:
http://www.proxmox.com/forum/showthread.php?t=1990

Best Regards,

Martin

 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Asdo
 Sent: Mittwoch, 14. Oktober 2009 19:52
 To: kvm@vger.kernel.org
 Subject: Can't make virtio block driver work on Windows 2003
 
 Hi all
 I have a new installation of Windows 2003 SBS server 32bit which I
 installed using IDE disk.
 KVM version is QEMU PC emulator version 0.10.50 (qemu-kvm-devel-86)
 compiled by myself on kernel 2.6.28-11-server.
 
 I have already moved networking from e1000 to virtio (e1000 was
 performing very sluggishly btw, probably was losing many packets,
 virtio
 seems to work)
 
 Now I want to move the disk to virtio...
 
 This is complex so I thought that first I wanted to see virtio
 installed
 and working on another drive.
 So I tried adding another drive, a virtio one, (a new 100MB file at
 host
 side) to the virtual machine and rebooting.
 
 A first problem is that Windows does not detect the new device upon
 boot
 or Add Hardware scan.
 
 Here is the kvm commandline (it's complex because it comes from
 libvirt):
 
  /usr/local/kvm/bin/qemu-system-x86_64 -S -M pc -m 4096 -smp 4 -name
 winserv2 -uuid  -monitor pty -boot
 c
 -drive
 file=/virtual_machines/kvm/nfsimport/winserv2.raw,if=ide,index=0,boot=o
 n
 -drive file=/virtual_machines/kvm/nfsimport/zerofile,if=virtio,index=1
 -net nic,macaddr=xx:xx:xx:xx:xx:xx,vlan=0,model=virtio -net
 tap,fd=25,vlan=0 -serial none -parallel none -usb -vnc 127.0.0.1:4
 
 Even if Windows couldn't detect the new device I tried to install the
  driver anyway. On Add Hardware I go through to "SCSI and RAID
  controllers" -> "Have Disk..." and point it to the location of viostor
 files (windows 2003 x86) downloaded from:
 
   http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers
   http://people.redhat.com/~yvugenfi/24.09.2009/viostor.zip
 
 Windows does install the driver, however at the end it says:
 
   The software for this device is now installed, but may not work
 correctly.
   This device cannot start. (Code 10)
 
 and the new device gets flagged with a yellow exclamation mark in
 Device
 Manager.
 
 I don't know if it's the same reason as before, that the device is not
 detected so the driver cannot work, or another reason.
 
 Any idea?
 
 Thanks for your help




Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Avi Kivity

On 10/15/2009 06:22 PM, Jan Kiszka wrote:

Better to split sse from fpu since we already know it is about to be
replaced.
 

The idea is to reuse the existing state structures, including struct
kvm_fpu. This allows us to provide the above substates for all archs and
simplifies the migration (see my qemu conversion patch). I think, once
we need support for new/wider registers in x86, we can introduce new
KVM_X86_VCPU_STATE_FPU_WHATEVER substates that are able to replace the
old one.
   


Makes sense, especially if we keep the list instead of the structure.


+
+struct kvm_vcpu_substate {
+   __u32 type;
+   __u32 pad;
+   __s64 offset;
+};
+
+#define KVM_MAX_VCPU_SUBSTATES 64
+
+struct kvm_vcpu_state {
+   __u32 nsubstates; /* number of elements in substates */
+   __u32 nprocessed; /* return value: successfully processed substates */
+   struct kvm_vcpu_substate substates[0];
+};
+

   

Wouldn't having an ordinary struct with lots of reserved space be
simpler?  If we add a bitmask, we can even selectively get/set the
fields we want (important if new state extends old state: avx vs sse).
 

Simpler - hmm, maybe. But also less flexible. This would establish a
static order inside this constantly growing mega struct. And a user only
interested in something small at its end would still have to allocate
memory for the whole thing (maybe megabytes in the future, who knows?).
And this mega struct will always carry all the legacy substates, even if
they aren't used anymore in practice.
   


I hope cpu state doesn't grow into megabytes, or we'll have problems 
live migrating them.  But I see your point.  The initial split assumed 
userspace would be interested in optimizing access (we used to have many 
more exits, and really old versions relied on qemu for emulation), that 
turned out not to be the case, but it's better to keep this capability 
for other possible userspaces.  So let's go ahead with the list.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[Autotest] [PATCH] Test 802.1Q vlan of nic

2009-10-15 Thread Amos Kong

Test 802.1Q vlan of nic, config it by vconfig command.
  1) Create two VMs
  2) Setup guests in different vlan by vconfig and test communication by ping
 using hard-coded ip address
  3) Setup guests in same vlan and test communication by ping
  4) Recover the vlan config
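In shell terms, the per-guest setup in steps 2-3 boils down to something like this (addresses match the hard-coded subnet used by the test):

```shell
vconfig add eth0 10                # create 802.1Q interface eth0.10
ifconfig eth0.10 192.168.123.11    # address it inside the test subnet
ping -c 2 192.168.123.12           # succeeds only if the peer is in VLAN 10
vconfig rem eth0.10                # cleanup, as done in the finally block
```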

Signed-off-by: Amos Kong ak...@redhat.com
---
 client/tests/kvm/kvm_tests.cfg.sample |6 +++
 client/tests/kvm/tests/vlan_tag.py|   73 +
 2 files changed, 79 insertions(+), 0 deletions(-)
 mode change 100644 = 100755 client/tests/kvm/scripts/qemu-ifup
 create mode 100644 client/tests/kvm/tests/vlan_tag.py

diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample
index 9ccc9b5..4e47767 100644
--- a/client/tests/kvm/kvm_tests.cfg.sample
+++ b/client/tests/kvm/kvm_tests.cfg.sample
@@ -166,6 +166,12 @@ variants:
 used_cpus = 5
 used_mem = 2560
 
+- vlan_tag:  install setup
+type = vlan_tag
+subnet2 = 192.168.123
+vlans = 10 20
+nic_mode = tap
+nic_model = e1000
 
 - autoit:   install setup
 type = autoit
diff --git a/client/tests/kvm/scripts/qemu-ifup b/client/tests/kvm/scripts/qemu-ifup
old mode 100644
new mode 100755
diff --git a/client/tests/kvm/tests/vlan_tag.py b/client/tests/kvm/tests/vlan_tag.py
new file mode 100644
index 000..15e763f
--- /dev/null
+++ b/client/tests/kvm/tests/vlan_tag.py
@@ -0,0 +1,73 @@
+import logging, time
+from autotest_lib.client.common_lib import error
+import kvm_subprocess, kvm_test_utils, kvm_utils
+
+def run_vlan_tag(test, params, env):
+    """
+    Test 802.1Q vlan of nic, config it by vconfig command.
+
+    1) Create two VMs
+    2) Setup guests in different vlan by vconfig and test communication by
+       ping using hard-coded ip address
+    3) Setup guests in same vlan and test communication by ping
+    4) Recover the vlan config
+
+    @param test: Kvm test object
+    @param params: Dictionary with the test parameters.
+    @param env: Dictionary with test environment.
+    """
+    vm = []
+    session = []
+    subnet2 = params.get("subnet2")
+    vlans = params.get("vlans").split()
+
+    vm.append(kvm_test_utils.get_living_vm(env, "%s" % params.get("main_vm")))
+
+    params_vm2 = params.copy()
+    params_vm2['image_snapshot'] = "yes"
+    params_vm2['kill_vm_gracefully'] = "no"
+    params_vm2["address_index"] = int(params.get("address_index", 0)) + 1
+    vm.append(vm[0].clone("vm2", params_vm2))
+    kvm_utils.env_register_vm(env, "vm2", vm[1])
+    if not vm[1].create():
+        raise error.TestError("VM 2 create failed")
+
+    for i in range(2):
+        session.append(kvm_test_utils.wait_for_login(vm[i]))
+
+    try:
+        vconfig_cmd = "vconfig add eth0 %s; ifconfig eth0.%s %s.%s"
+        # Attempt to configure IPs for the VMs and record the results in
+        # boolean variables
+        # Make vm1 and vm2 in the different vlan
+        ip_config_vm1_ok = (session[0].get_command_status(vconfig_cmd
+                            % (vlans[0], vlans[0], subnet2, 11)) == 0)
+        ip_config_vm2_ok = (session[1].get_command_status(vconfig_cmd
+                            % (vlans[1], vlans[1], subnet2, 12)) == 0)
+        if not ip_config_vm1_ok or not ip_config_vm2_ok:
+            raise error.TestError("Failed to config VMs' ip address")
+        ping_diff_vlan_ok = (session[0].get_command_status(
+                             "ping -c 2 %s.12" % subnet2) == 0)
+
+        if ping_diff_vlan_ok:
+            raise error.TestFail("VM 2 is unexpectedly pingable in different "
+                                 "vlan")
+        # Make vm2 in the same vlan with vm1
+        vlan_config_vm2_ok = (session[1].get_command_status(
+                              "vconfig rem eth0.%s; vconfig add eth0 %s; "
+                              "ifconfig eth0.%s %s.12" %
+                              (vlans[1], vlans[0], vlans[0], subnet2)) == 0)
+        if not vlan_config_vm2_ok:
+            raise error.TestError("Failed to config ip address of VM 2")
+
+        ping_same_vlan_ok = (session[0].get_command_status(
+                             "ping -c 2 %s.12" % subnet2) == 0)
+        if not ping_same_vlan_ok:
+            raise error.TestFail("Failed to ping the guest in same vlan")
+    finally:
+        # Clean the vlan config
+        for i in range(2):
+            session[i].sendline("vconfig rem eth0.%s" % vlans[0])
+            session[i].close()
-- 
1.5.5.6



[KVM-AUTOTEST PATCH 1/3] KVM test: VM.send_monitor_cmd() minor cleanup

2009-10-15 Thread Michael Goldish
- Move some of the code into a try..finally block
- Shut down the socket before closing it

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_vm.py |   45 ---
 1 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index 82f1eb4..a8d96ca 100755
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -465,7 +465,7 @@ class VM:
         end_time = time.time() + timeout
         while time.time() < end_time:
             try:
-                o += s.recv(16384)
+                o += s.recv(1024)
                 if o.splitlines()[-1].split()[-1] == "(qemu)":
                     return (True, o)
 except:
@@ -481,27 +481,32 @@ class VM:
         except:
             logging.debug("Could not connect to monitor socket")
             return (1, "")
-        status, data = read_up_to_qemu_prompt(s, timeout)
-        if not status:
-            s.close()
-            logging.debug("Could not find (qemu) prompt; output so far:" \
-                          + kvm_utils.format_str_for_message(data))
-            return (1, "")
-        # Send command
-        s.sendall(command + "\n")
-        # Receive command output
-        data = ""
-        if block:
-            status, data = read_up_to_qemu_prompt(s, timeout)
-            data = "\n".join(data.splitlines()[1:])
-            if not status:
-                s.close()
-                logging.debug("Could not find (qemu) prompt after command;"
-                              " output so far: %s",
-                              kvm_utils.format_str_for_message(data))
-                return (1, data)
-        s.close()
-        return (0, data)
+
+        # Send the command and get the resulting output
+        try:
+            status, data = read_up_to_qemu_prompt(s, timeout)
+            if not status:
+                logging.debug("Could not find (qemu) prompt; output so far:" +
+                              kvm_utils.format_str_for_message(data))
+                return (1, "")
+            # Send command
+            s.sendall(command + "\n")
+            # Receive command output
+            data = ""
+            if block:
+                status, data = read_up_to_qemu_prompt(s, timeout)
+                data = "\n".join(data.splitlines()[1:])
+                if not status:
+                    logging.debug("Could not find (qemu) prompt after command;"
+                                  " output so far: " +
+                                  kvm_utils.format_str_for_message(data))
+                    return (1, data)
+            return (0, data)
+
+        # Clean up before exiting
+        finally:
+            s.shutdown(socket.SHUT_RDWR)
+            s.close()
 
 
 def destroy(self, gracefully=True):
-- 
1.5.4.1



[KVM-AUTOTEST PATCH 3/3] KVM test: modify messages in kvm_test_utils.wait_for_login()

2009-10-15 Thread Michael Goldish
Use "logged into guest" instead of "logged in guest".
AFAIK this is more correct.

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_test_utils.py |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py
index e51520a..bf8aed2 100644
--- a/client/tests/kvm/kvm_test_utils.py
+++ b/client/tests/kvm/kvm_test_utils.py
@@ -52,12 +52,12 @@ def wait_for_login(vm, nic_index=0, timeout=240, start=0, step=2):
     @param timeout: Time to wait before giving up.
     @return: A shell session object.
     """
-    logging.info("Try to login to guest '%s'..." % vm.name)
+    logging.info("Trying to log into guest '%s'..." % vm.name)
     session = kvm_utils.wait_for(lambda: vm.remote_login(nic_index=nic_index),
                                  timeout, start, step)
     if not session:
         raise error.TestFail("Could not log into guest '%s'" % vm.name)
-    logging.info("Logged in '%s'" % vm.name)
+    logging.info("Logged into guest '%s'" % vm.name)
     return session
 
 
-- 
1.5.4.1



[KVM-AUTOTEST PATCH 2/3] KVM test: corrections to guest_s4

2009-10-15 Thread Michael Goldish
- Log into the guest again after it resumes from S4 ('session2' doesn't survive
  S4 in user networking mode).
- Use != 0 when checking the status returned by get_command_status(), because
  when things go wrong, it can also return None.
- Use time.sleep(float(params.get(...))) instead of time.sleep(params.get(...))
  (time.sleep() doesn't accept strings).
- Do not check that the VM is alive after vm.create() because vm.create()
  already does that.
- Use get_command_output(kill_test_s4_cmd) instead of sendline(kill_test_s4_cmd),
  because get_command_output() waits for the prompt to return, and allows us to
  be sure that the command got delivered to the guest.  This is especially
  important if the session is closed immediately after sending the command.
  In the case of the command that performs suspend to disk (set_s4_cmd), and the
  test_s4_cmd for Linux (nohup ...), it is OK to use sendline() because the
  prompt may never return.  In the latter case, sleep(5) after the sendline()
  call should ensure that the command got delivered before the test proceeds.
- Change the timeouts in the wait_for(vm.is_dead, ...) call.
- For Windows guests modify test_s4_cmd to 'start ping -t localhost'.
  Using "start /B" isn't safe because then ping's output is redirected to a dead
  shell session.
- For Windows guests modify check_s4_cmd to 'tasklist | find /I "ping.exe"'.
  (It's a little more specific than just "ping".)
- Run guest_s4 after autoit tests because it can leave Windows guests with an
  annoying welcome screen.
- Make the logging messages more consistent with the style of other tests.

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_tests.cfg.sample |   23 -
 client/tests/kvm/tests/guest_s4.py|   45 +++-
 2 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/client/tests/kvm/kvm_tests.cfg.sample b/client/tests/kvm/kvm_tests.cfg.sample
index 9ccc9b5..296449d 100644
--- a/client/tests/kvm/kvm_tests.cfg.sample
+++ b/client/tests/kvm/kvm_tests.cfg.sample
@@ -118,15 +118,6 @@ variants:
 - linux_s3: install setup
 type = linux_s3
 
-- guest_s4:
-type = guest_s4
-check_s4_support_cmd = grep -q disk /sys/power/state
-test_s4_cmd = cd /tmp/;nohup tcpdump -q -t ip host localhost
-check_s4_cmd = pgrep tcpdump
-set_s4_cmd = echo disk > /sys/power/state
-kill_test_s4_cmd = pkill tcpdump
-services_up_timeout = 30
-
 - timedrift:install setup
 extra_params +=  -rtc-td-hack
 variants:
@@ -166,7 +157,6 @@ variants:
 used_cpus = 5
 used_mem = 2560
 
-
 - autoit:   install setup
 type = autoit
 autoit_binary = D:\AutoIt3.exe
@@ -176,6 +166,15 @@ variants:
 - notepad:
 autoit_script = autoit/notepad1.au3
 
+- guest_s4:
+type = guest_s4
+check_s4_support_cmd = grep -q disk /sys/power/state
+test_s4_cmd = cd /tmp; nohup tcpdump -q -t ip host localhost
+check_s4_cmd = pgrep tcpdump
+set_s4_cmd = echo disk > /sys/power/state
+kill_test_s4_cmd = pkill tcpdump
+services_up_timeout = 30
+
 - nic_hotplug:   install setup
 type = pci_hotplug
 pci_type = nic
@@ -518,8 +517,8 @@ variants:
 host_load_instances = 8
 guest_s4:
 check_s4_support_cmd = powercfg /hibernate on
-test_s4_cmd = start /B ping -n 3000 localhost
-check_s4_cmd = tasklist | find /I "ping"
+test_s4_cmd = start ping -t localhost
+check_s4_cmd = tasklist | find /I "ping.exe"
 set_s4_cmd = rundll32.exe PowrProf.dll, SetSuspendState
 kill_test_s4_cmd = taskkill /IM ping.exe /F
 services_up_timeout = 30
diff --git a/client/tests/kvm/tests/guest_s4.py b/client/tests/kvm/tests/guest_s4.py
index 7147e3b..f08b9d2 100644
--- a/client/tests/kvm/tests/guest_s4.py
+++ b/client/tests/kvm/tests/guest_s4.py
@@ -5,7 +5,7 @@ import kvm_test_utils, kvm_utils
 
 def run_guest_s4(test, params, env):
 
-    Suspend guest to disk,supports both Linux & Windows OSes.
+    Suspend guest to disk, supports both Linux & Windows OSes.
 
 @param test: kvm test object.
 @param params: Dictionary with test parameters.
@@ -14,53 +14,62 @@ def run_guest_s4(test, params, env):
     vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
     session = kvm_test_utils.wait_for_login(vm)
 
-    logging.info("Checking whether guest OS supports suspend to disk (S4)")
+    logging.info("Checking whether guest OS supports suspend to disk (S4)...")
     status = session.get_command_status(params.get("check_s4_support_cmd"))
     if status is None:
         logging.error("Failed to check if guest OS supports S4")
     elif status != 0:
         raise error.TestFail("Guest OS does not support S4")
 
-    logging.info("Wait until all guest OS services are fully started")

Re: [PATCH 4/4] KVM: x86: Add VCPU substate for NMI states

2009-10-15 Thread Jan Kiszka
Avi Kivity wrote:
 On 10/15/2009 06:22 PM, Jan Kiszka wrote:
 Needs a KVM_CAP as well.
  
 KVM_CAP_VCPU_STATE will imply KVM_CAP_NMI_STATE, so I skipped the latter
 (user space code would use the former anyway to avoid yet another #ifdef
 layer).

 
 OK.  New bits will need the KVM_CAP, though.

For sure.

 
 Perhaps it makes sense to query about individual states, including 
 existing ones?  That will allow us to deprecate and then phase out 
 broken states.  It's probably not worth it.

You may do this already with the given design: set up a VCPU, then issue
KVM_GET_VCPU_STATE on the substate in question. You will get an error
code if the substate is unsupported, or 0 if it is supported. At least no
additional kernel code is required.
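The probing pattern Jan describes can be sketched generically. This is a minimal Python model of the idea, not the real ioctl path; `probe_substates`, `fake_ioctl`, and the substate names are invented stand-ins for KVM_GET_VCPU_STATE and its substates:

```python
import errno

def probe_substates(get_vcpu_state, substates):
    """Classify substates as supported/unsupported by probing, as
    described above: query each substate once and treat a return of 0
    as "supported" and an error as "unsupported"."""
    supported = {}
    for sub in substates:
        try:
            get_vcpu_state(sub)   # stands in for the KVM_GET_VCPU_STATE ioctl
            supported[sub] = True
        except OSError:
            supported[sub] = False
    return supported

# A stubbed "kernel" that only knows the NMI substate:
def fake_ioctl(substate):
    if substate != "NMI":
        raise OSError(errno.EINVAL, "unsupported substate")
    return 0
```

User space would run the probe once at setup time and cache the result, avoiding a per-substate capability bit.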

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Autotest] [PATCH] KVM test: Add PCI pass through test

2009-10-15 Thread Lucas Meneghel Rodrigues
On Thu, Oct 15, 2009 at 3:45 AM, Yolkfull Chow <yz...@redhat.com> wrote:
 On Wed, Oct 14, 2009 at 09:08:00AM -0300, Lucas Meneghel Rodrigues wrote:
 Add a new PCI pass through test. It supports both SR-IOV virtual
 functions and physical NIC card pass through.

 Single Root I/O Virtualization (SR-IOV) allows a single PCI device to
 be shared amongst multiple virtual machines while retaining the
 performance benefit of assigning a PCI device to a virtual machine.
 A common example is where a single SR-IOV capable NIC - with perhaps
 only a single physical network port - might be shared with multiple
 virtual machines by assigning a virtual function to each VM.

 SR-IOV support is implemented in the kernel. The core implementation is
 contained in the PCI subsystem, but there must also be driver support
 for both the Physical Function (PF) and Virtual Function (VF) devices.
 With an SR-IOV capable device one can allocate VFs from a PF. The VFs
 surface as PCI devices which are backed on the physical PCI device by
 resources (queues, and register sets).

 Device support:

 In 2.6.30, the Intel® 82576 Gigabit Ethernet Controller is the only
 SR-IOV capable device supported. The igb driver has PF support and the
 igbvf has VF support.

 In 2.6.31 the Neterion® X3100™ is supported as well. This device uses
 the same vxge driver for the PF as well as the VFs.

 Wow, new NIC card supports SR-IOV...
 At this rate, do we need to move the driver name and its parameters into
 the config file, so that if a new NIC card using a different driver is
 supported in the future, we could handle it without changing code?

Absolutely, it didn't occur to me at first, but yes, we ought to move
the driver name and parameters to the config file.
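For illustration, the vfs_passthrough variant might then carry the driver explicitly; the `driver` and `driver_option` key names below are hypothetical, not existing kvm_tests.cfg keys:

```
    - vfs_passthrough:
        pass_through = vf
        driver = igb
        driver_option = "max_vfs=7"
        vfs_count = 7
```

A new SR-IOV-capable card (e.g. one using vxge) would then only need a new variant, not a code change.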


 In order to configure the test:

   * For SR-IOV virtual functions passthrough, we could specify the
     module parameter 'max_vfs' in config file.
   * For physical NIC card pass through, we should specify the device
     name(s).
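The vfs_count check implied above could be done by counting the PF's `virtfnN` symlinks in sysfs (the kernel creates `/sys/bus/pci/devices/<PF>/virtfn0`, `virtfn1`, ... for each allocated VF). A sketch; the helper name is mine, not part of the patch:

```python
import os

def count_virtual_functions(pf_sysfs_dir):
    """Count the VFs a physical function exposes by counting its
    sysfs 'virtfnN' links under the PF's device directory."""
    try:
        entries = os.listdir(pf_sysfs_dir)
    except OSError:
        return 0    # device directory absent: report zero VFs
    return sum(1 for name in entries if name.startswith("virtfn"))
```

The test could then assert `count_virtual_functions(pf_dir) == int(params.get("vfs_count"))` after loading the driver with max_vfs set.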

 Signed-off-by: Yolkfull Chow <yz...@redhat.com>
 ---
  client/tests/kvm/kvm_tests.cfg.sample |   11 ++-
  client/tests/kvm/kvm_utils.py         |  278 
 +
  client/tests/kvm/kvm_vm.py            |   72 +
  3 files changed, 360 insertions(+), 1 deletions(-)

 diff --git a/client/tests/kvm/kvm_tests.cfg.sample 
 b/client/tests/kvm/kvm_tests.cfg.sample
 index cc3228a..1dad188 100644
 --- a/client/tests/kvm/kvm_tests.cfg.sample
 +++ b/client/tests/kvm/kvm_tests.cfg.sample
 @@ -786,13 +786,22 @@ variants:
          only default
          image_format = raw

 -
  variants:
      - @smallpages:
      - hugepages:
          pre_command = "/usr/bin/python scripts/hugepage.py /mnt/kvm_hugepage"
          extra_params += " -mem-path /mnt/kvm_hugepage"

 +variants:
 +    - @no_passthrough:
 +        pass_through = no
 +    - nic_passthrough:
 +        pass_through = pf
 +        passthrough_devs = eth1
 +    - vfs_passthrough:
 +        pass_through = vf
 +        max_vfs = 7
 +        vfs_count = 7

  variants:
      - @basic:
 diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
 index 53b664a..0e3398c 100644
 --- a/client/tests/kvm/kvm_utils.py
 +++ b/client/tests/kvm/kvm_utils.py
 @@ -788,3 +788,281 @@ def md5sum_file(filename, size=None):
          size -= len(data)
      f.close()
      return o.hexdigest()
 +
 +
 +def get_full_id(pci_id):
 +    """
 +    Get full PCI ID of pci_id.
 +    """
 +    cmd = "lspci -D | awk '/%s/ {print $1}'" % pci_id
 +    status, full_id = commands.getstatusoutput(cmd)
 +    if status != 0:
 +        return None
 +    return full_id
 +
 +
 +def get_vendor_id(pci_id):
 +    """
 +    Check out the device vendor ID according to PCI ID.
 +    """
 +    cmd = "lspci -n | awk '/%s/ {print $3}'" % pci_id
 +    return re.sub(":", " ", commands.getoutput(cmd))
 +
 +
 +def release_dev(pci_id, pci_dict):
 +    """
 +    Release a single PCI device.
 +
 +    @param pci_id: PCI ID of a given PCI device
 +    @param pci_dict: Dictionary with information about PCI devices
 +    """
 +    base_dir = "/sys/bus/pci"
 +    full_id = get_full_id(pci_id)
 +    vendor_id = get_vendor_id(pci_id)
 +    drv_path = os.path.join(base_dir, "devices/%s/driver" % full_id)
 +    if 'pci-stub' in os.readlink(drv_path):
 +        cmd = "echo '%s' > %s/new_id" % (vendor_id, drv_path)
 +        if os.system(cmd):
 +            return False
 +
 +        stub_path = os.path.join(base_dir, "drivers/pci-stub")
 +        cmd = "echo '%s' > %s/unbind" % (full_id, stub_path)
 +        if os.system(cmd):
 +            return False
 +
 +        prev_driver = pci_dict[pci_id]
 +        cmd = "echo '%s' > %s/bind" % (full_id, prev_driver)
 +        if os.system(cmd):
 +            return False
 +    return True
 +
 +
 +def release_pci_devs(pci_dict):
 +    """
 +    Release all PCI devices assigned to host.
 +
 +    @param pci_dict: Dictionary with information about PCI devices
 +    """
 +    for pci_id in pci_dict:
 +        if not release_dev(pci_id, pci_dict):
 +    

Re: Can't make virtio block driver work on Windows 2003

2009-10-15 Thread Asdo

Vadim Rozenfeld wrote:

On 10/14/2009 07:52 PM, Asdo wrote:

...
So I tried adding another drive, a virtio one, (a new 100MB file at 
host side) to the virtual machine and rebooting.


A first problem is that Windows does not detect the new device upon 
boot or Add Hardware scan.
Check PCI devices with "info pci". You must have "SCSI controller: PCI 
device 1af4:1001" device reported.


It's not there. Does this make it a KVM bug?

I'm attaching the PCI32.EXE output at the bottom of this email

BTW I would probably be able to switch to virtio anyway on this 
installation of Windows 2003, if I knew the way to insert the viostor 
driver into the windows boot image (windows's initrd), because if I set 
the first disk hda as virtio then kvm really makes it virtio (so maybe 
it's a configuration with one IDE and one virtio that does not work in 
KVM) and Windows bluescreens at boot. However I don't know how to insert 
the viostor driver in the windows boot image. Any suggestions?




Here is the kvm commandline (it's complex because it comes from 
libvirt):


/usr/local/kvm/bin/qemu-system-x86_64 -S -M pc -m 4096 -smp 4 -name 
winserv2 -uuid  -monitor pty 
-boot c -drive 
file=/virtual_machines/kvm/nfsimport/winserv2.raw,if=ide,index=0,boot=on 
-drive 
file=/virtual_machines/kvm/nfsimport/zerofile,if=virtio,index=1 -net 
nic,macaddr=xx:xx:xx:xx:xx:xx,vlan=0,model=virtio -net 
tap,fd=25,vlan=0 -serial none -parallel none -usb -vnc 127.0.0.1:4




Craig Hart's PCI+AGP bus sniffer, Version 1.6, freeware made in 1996-2005.

Searching for Devices using CFG Mechanism 1 [OS: Win 2003 Service Pack 1]


Bus 0 (PCI), Device Number 0, Device Function 0
Vendor 8086h Intel Corporation
Device 1237h 82441FX 440FX (Natoma) System Controller Rev 2 (SU053)
Command h (Bus Access Disabled!!)
Status h
Revision 02h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Bridge, type PCI to HOST
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown

Bus 0 (PCI), Device Number 1, Device Function 0
Vendor 8086h Intel Corporation
Device 7000h 82371SB PIIX3 ISA Bridge
Command 0007h (I/O Access, Memory Access, BusMaster)
Status 0200h (Medium Timing)
Revision 00h, Header Type 80h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Bridge, type PCI to ISA
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown

Bus 0 (PCI), Device Number 1, Device Function 1
Vendor 8086h Intel Corporation
Device 7010h 82371SB PIIX3 EIDE Controller
Command 0007h (I/O Access, Memory Access, BusMaster)
Status 0280h (Supports Back-To-Back Trans., Medium Timing)
Revision 00h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Storage, type IDE (ATA)
PCI EIDE Controller Features :
  BusMaster EIDE is supported
  Primary   Channel is at I/O Port 01F0h and IRQ 14
  Secondary Channel is at I/O Port 0170h and IRQ 15
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown
Address 4 is an I/O Port : C000h

Bus 0 (PCI), Device Number 1, Device Function 2
Vendor 8086h Intel Corporation
Device 7020h 82371SB PIIX3 USB Controller   Rev 1 (SU093)
Command 0007h (I/O Access, Memory Access, BusMaster)
Status h
Revision 01h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Serial, type USB (UHCI)
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown
Address 4 is an I/O Port : C020h
System IRQ 11, INT# D

Bus 0 (PCI), Device Number 1, Device Function 3
Vendor 8086h Intel Corporation
Device 7113h 82371MB PIIX4M Power Management Controller
Command h (Bus Access Disabled!!)
Status 0280h (Supports Back-To-Back Trans., Medium Timing)
Revision 03h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Bridge, type PCI to Other
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown
System IRQ 9, INT# A

Bus 0 (PCI), Device Number 2, Device Function 0
Vendor 1013h Cirrus Logic
Device 00B8h CL-GD5446 PCI
Command 0007h (I/O Access, Memory Access, BusMaster)
Status h
Revision 00h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Display, type VGA
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown
Address 0 is a Memory Address (anywhere in 0-4Gb, Prefetchable) : F000h
Address 1 is a Memory Address (anywhere in 0-4Gb) : F200h

Bus 0 (PCI), Device Number 3, Device Function 0
Vendor 1AF4h Unknown
Device 1000h Unknown
Command 0007h (I/O Access, Memory Access, BusMaster)
Status h
Revision 00h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Network, type Ethernet
Subsystem ID 00011AF4h Unknown
Subsystem Vendor 1AF4h Unknown
Address 0 is an I/O Port : C040h
System IRQ 10, INT# A

Bus 0 (PCI), Device Number 4, Device Function 0
Vendor 1AF4h Unknown
Device 1002h Unknown
Command 0001h (I/O Access)
Status h
Revision 00h, Header Type 

Re: sync guest calls made async on host - SQLite performance

2009-10-15 Thread Christoph Hellwig
On Wed, Oct 14, 2009 at 05:54:23PM -0500, Anthony Liguori wrote:
 Historically it didn't and the only safe way to use virtio was in
 cache=writethrough mode.
 
 Which should be the default on Ubuntu's kvm that this report is 
 concerned with so I'm a bit confused.

So can we please get the detailed setup where this happens, that is:

filesystem used in the guest
any volume manager / software raid used in the guest
kernel version in the guest
image format used
qemu command line including caching mode, using ide/scsi/virtio, etc
qemu/kvm version
filesystem used in the host
any volume manager / software raid used in the host
kernel version in the host


 Avi's patch is a performance optimization, not a correctness issue?

It could actually minimally degrade performance.  For the existing
filesystems as the upper layer it does not improve correctness either.



Re: Single memory slot

2009-10-15 Thread Alexander Graf


On 15.10.2009, at 09:33, Avi Kivity wrote:

One way to improve the gfn_to_pfn() memslot search is to register  
just one slot.  This can only work on 64-bit, since even the  
smallest guests need 4GB of physical address space.  Apart from  
speeding up gfn_to_page(), it would also speed up mmio which must  
iterate over all slots, so a lookup cache cannot help.


This would require quite a bunch of changes:
- modify gfn_to_pfn() to fail gracefully if the page is in the slot  
but unmapped (hole handling)

- modify qemu to reserve the guest physical address space
- modify qemu memory allocation to use MAP_FIXED to allocate memory
- some hack for the vga aliases (mmap an fd multiple times?)
- some hack for the vmx-specific pages (e.g. APIC-access page)

Not sure it's worthwhile, but something to keep in mind if a simple  
cache or sort by size is insufficient due to mmio.


One thing I've been wondering about for quite a while is that slot loop. Why  
do we loop over all possible slots? Couldn't we just remember the max  
entry (usually 1 or 2) and not loop MAX_SLOT_AMOUNT times?


That would be a really easy patch and give instant speed improvements  
for everyone.
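Alex's suggestion amounts to bounding the linear scan by the number of slots actually registered. A minimal Python model of the idea (the slot layout, tags, and helper name are made up for illustration; the real lookup is in C in the kernel):

```python
def gfn_to_slot(slots, nr_used, gfn):
    """Linear memslot lookup bounded by nr_used (the count of slots
    actually in use) instead of scanning all MAX_SLOT_AMOUNT entries.
    Each slot is a (base_gfn, npages, tag) tuple."""
    for base_gfn, npages, tag in slots[:nr_used]:
        if base_gfn <= gfn < base_gfn + npages:
            return tag
    return None

# 32 possible slots, only 2 in use -- the loop never visits the Nones.
slots = [(0x0, 0xa0, "below-vga"), (0x100, 0xf00, "main-ram")] + [None] * 30
```

The mmio case (a gfn matching no slot) still has to walk all used entries, which is why keeping `nr_used` small matters more than any per-slot cache.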


Alex



Re: sync guest calls made async on host - SQLite performance

2009-10-15 Thread Christoph Hellwig
On Thu, Oct 15, 2009 at 02:17:02PM +0200, Christoph Hellwig wrote:
 On Wed, Oct 14, 2009 at 05:54:23PM -0500, Anthony Liguori wrote:
  Historically it didn't and the only safe way to use virtio was in
  cache=writethrough mode.
  
  Which should be the default on Ubuntu's kvm that this report is 
  concerned with so I'm a bit confused.
 
 So can we please get the detailed setup where this happens, that is:
 
 filesystem used in the guest
 any volume manager / software raid used in the guest
 kernel version in the guest
 image format used
 qemu command line including caching mode, using ide/scsi/virtio, etc
 qemu/kvm version
 filesystem used in the host
 any volume manager / software raid used in the host
 kernel version in the host

And very important the mount options (/proc/self/mounts) of both host
and guest.
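Lines in /proc/self/mounts are whitespace-separated (device, mountpoint, fstype, comma-joined options, plus two dump/pass fields). A small helper to tabulate the information Christoph asks for might look like this (the helper and sample data are mine, just an illustration):

```python
def parse_mounts(text):
    """Split /proc/self/mounts-style lines into
    (device, mountpoint, fstype, option-list) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # fields[3] holds the mount options, comma-separated
            rows.append((fields[0], fields[1], fields[2],
                         fields[3].split(",")))
    return rows

sample = "/dev/vda1 / ext3 rw,barrier=1 0 0\nproc /proc proc rw 0 0\n"
```

Run against the real file on host and guest, this makes differences like a missing `barrier` option easy to spot.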



Re: Can't make virtio block driver work on Windows 2003

2009-10-15 Thread Vadim Rozenfeld

On 10/15/2009 01:42 PM, Asdo wrote:

Vadim Rozenfeld wrote:

On 10/14/2009 07:52 PM, Asdo wrote:

...
So I tried adding another drive, a virtio one, (a new 100MB file at 
host side) to the virtual machine and rebooting.


A first problem is that Windows does not detect the new device upon 
boot or Add Hardware scan.
Check PCI devices with "info pci". You must have "SCSI controller: 
PCI device 1af4:1001" device reported.


It's not there. Does this make it a KVM bug?
Looks like the virtio-blk device wasn't initialized. Otherwise I cannot 
explain why the 0x1100 device is here.

Try to start the block device without index=1.
Anyway, if you can, please send "info pci" output from the QEMU monitor console.

Thank you,
Vadim.


I'm attaching the PCI32.EXE output at the bottom of this email

BTW I would probably be able to switch to virtio anyway on this 
installation of Windows 2003, if I knew the way to insert the viostor 
driver into the windows boot image (windows's initrd), because if I 
set the first disk hda as virtio then kvm really makes it virtio (so 
maybe it's a configuration with one IDE and one virtio that does not 
work in KVM) and Windows bluescreens at boot. However I don't know how 
to insert the viostor driver in the windows boot image. Any suggestions?




Here is the kvm commandline (it's complex because it comes from 
libvirt):


/usr/local/kvm/bin/qemu-system-x86_64 -S -M pc -m 4096 -smp 4 -name 
winserv2 -uuid  -monitor pty 
-boot c -drive 
file=/virtual_machines/kvm/nfsimport/winserv2.raw,if=ide,index=0,boot=on 
-drive 
file=/virtual_machines/kvm/nfsimport/zerofile,if=virtio,index=1 -net 
nic,macaddr=xx:xx:xx:xx:xx:xx,vlan=0,model=virtio -net 
tap,fd=25,vlan=0 -serial none -parallel none -usb -vnc 127.0.0.1:4




Craig Hart's PCI+AGP bus sniffer, Version 1.6, freeware made in 
1996-2005.


Searching for Devices using CFG Mechanism 1 [OS: Win 2003 Service Pack 1]


Bus 0 (PCI), Device Number 0, Device Function 0
Vendor 8086h Intel Corporation
Device 1237h 82441FX 440FX (Natoma) System Controller Rev 2 (SU053)
Command h (Bus Access Disabled!!)
Status h
Revision 02h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Bridge, type PCI to HOST
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown

Bus 0 (PCI), Device Number 1, Device Function 0
Vendor 8086h Intel Corporation
Device 7000h 82371SB PIIX3 ISA Bridge
Command 0007h (I/O Access, Memory Access, BusMaster)
Status 0200h (Medium Timing)
Revision 00h, Header Type 80h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Bridge, type PCI to ISA
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown

Bus 0 (PCI), Device Number 1, Device Function 1
Vendor 8086h Intel Corporation
Device 7010h 82371SB PIIX3 EIDE Controller
Command 0007h (I/O Access, Memory Access, BusMaster)
Status 0280h (Supports Back-To-Back Trans., Medium Timing)
Revision 00h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Storage, type IDE (ATA)
PCI EIDE Controller Features :
  BusMaster EIDE is supported
  Primary   Channel is at I/O Port 01F0h and IRQ 14
  Secondary Channel is at I/O Port 0170h and IRQ 15
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown
Address 4 is an I/O Port : C000h

Bus 0 (PCI), Device Number 1, Device Function 2
Vendor 8086h Intel Corporation
Device 7020h 82371SB PIIX3 USB Controller   Rev 1 (SU093)
Command 0007h (I/O Access, Memory Access, BusMaster)
Status h
Revision 01h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Serial, type USB (UHCI)
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown
Address 4 is an I/O Port : C020h
System IRQ 11, INT# D

Bus 0 (PCI), Device Number 1, Device Function 3
Vendor 8086h Intel Corporation
Device 7113h 82371MB PIIX4M Power Management Controller
Command h (Bus Access Disabled!!)
Status 0280h (Supports Back-To-Back Trans., Medium Timing)
Revision 03h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Bridge, type PCI to Other
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown
System IRQ 9, INT# A

Bus 0 (PCI), Device Number 2, Device Function 0
Vendor 1013h Cirrus Logic
Device 00B8h CL-GD5446 PCI
Command 0007h (I/O Access, Memory Access, BusMaster)
Status h
Revision 00h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Display, type VGA
Subsystem ID 11001AF4h Unknown
Subsystem Vendor 1AF4h Unknown
Address 0 is a Memory Address (anywhere in 0-4Gb, Prefetchable) : 
F000h

Address 1 is a Memory Address (anywhere in 0-4Gb) : F200h

Bus 0 (PCI), Device Number 3, Device Function 0
Vendor 1AF4h Unknown
Device 1000h Unknown
Command 0007h (I/O Access, Memory Access, BusMaster)
Status h
Revision 00h, Header Type 00h, Bus Latency Timer 00h
Self test 00h (Self test not supported)
PCI Class Network, type 

Re: Raw vs. tap

2009-10-15 Thread Anthony Liguori

Michael S. Tsirkin wrote:

On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote:
  
I would be much more inclined to consider  
taking raw and improving the performance long term if guest-host  
networking worked.  This appears to be a fundamental limitation though  
and I think it's something that will forever plague users if we include  
this feature.



In fact, I think it's fixable with a raw socket bound to a macvlan.
Would that be enough?
  


What setup does that entail on the part of a user?  Wouldn't we be back 
to square one wrt users having to run archaic networking commands in 
order to set things up?


Regards,

Anthony Liguori




Re: Can't make virtio block driver work on Windows 2003

2009-10-15 Thread Asdo

Vadim Rozenfeld wrote:

On 10/15/2009 01:42 PM, Asdo wrote:

Vadim Rozenfeld wrote:

On 10/14/2009 07:52 PM, Asdo wrote:

...
So I tried adding another drive, a virtio one, (a new 100MB file at 
host side) to the virtual machine and rebooting.


A first problem is that Windows does not detect the new device upon 
boot or Add Hardware scan.
Check PCI devices with "info pci". You must have "SCSI controller: 
PCI device 1af4:1001" device reported.


It's not there. Does this make it a KVM bug?
Looks like the virtio-blk device wasn't initialized. Otherwise I cannot 
explain why the 0x1100 device is here.

Try to start the block device without index=1.
Anyway, if you can, please send "info pci" output from the QEMU monitor 
console.


Owh! OK, THAT was "info pci".
OK, I am copying it by hand before removing index=1:

(qemu) info pci
Bus 0, device 0, function 0:
   Host bridge: PCI device 8086:1237
Bus 0, device 1, function 0:
   ISA bridge: PCI device 8086:7000
Bus 0, device 1, function 1:
   IDE controller: PCI device 8086:7010
  BAR4: I/O at 0xc000 [0xc00f]
Bus 0, device 1, function 3:
   Bridge: PCI device 8086:7113
   IRQ 9
Bus 0, device 2, function 0:
   VGA controller: PCI device 1013:00b8
  BAR0: 32 bit memory at 0xf000 [0xf1ff]
  BAR1: 32 bit memory at 0xf200 [0xf2000fff]
Bus 0, device 3, function 0:
   Ethernet controller: PCI device 1af4:1000
  IRQ 11
  BAR0: I/O at 0xc020 [0xc03f]
Bus 0, device 4, function 0:
   RAM controller: PCI device 1af4:1002
   IRQ 11
   BAR0: I/O at 0xc040
(qemu)

so it's not there

Now I remove index=1:

WOW it's there now!
...
Bus 0 device 4 function 0:
   Storage controller: PCI device 1af4:1001
   IRQ 11
   BAR0: I/O at 0xc040 [0xc07f]

(just before the 1002 device)

So now windows sees it and I was able to install the viostor drivers 
(btw Windows was not happy with the previously installed viostor 
drivers, I had to reinstall those and I got two devices, and the 
previous one still had the yellow exclamation mark, so I had to 
uninstall that one. After the procedure I was able to boot on virtio 
too! Yeah!).


Great, so yes, I'd say you *DO* have a KVM bug: one has to remove index=1 
for the second disk to appear. How did you know that, Vadim? Is it a 
known issue with KVM? It's better to fix that, because libvirt puts 
index=n for all drives, so it's impossible to work around the problem if 
one uses libvirt. I had to launch it manually...


Thanks a lot Vadim.

Asdo


[PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst

2009-10-15 Thread oritw
From: Orit Wasserman <or...@il.ibm.com>

---
 arch/x86/kvm/vmx.c |  468 ++--
 arch/x86/kvm/x86.c |3 +-
 2 files changed, 459 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 411cbdb..8c186e0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -61,20 +61,168 @@ module_param_named(unrestricted_guest,
 static int __read_mostly emulate_invalid_guest_state = 0;
 module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 
+
+struct __attribute__ ((__packed__)) shadow_vmcs {
+   u32 revision_id;
+   u32 abort;
+   u16 virtual_processor_id;
+   u16 guest_es_selector;
+   u16 guest_cs_selector;
+   u16 guest_ss_selector;
+   u16 guest_ds_selector;
+   u16 guest_fs_selector;
+   u16 guest_gs_selector;
+   u16 guest_ldtr_selector;
+   u16 guest_tr_selector;
+   u16 host_es_selector;
+   u16 host_cs_selector;
+   u16 host_ss_selector;
+   u16 host_ds_selector;
+   u16 host_fs_selector;
+   u16 host_gs_selector;
+   u16 host_tr_selector;
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+   u64 vm_exit_msr_store_addr;
+   u64 vm_exit_msr_load_addr;
+   u64 vm_entry_msr_load_addr;
+   u64 tsc_offset;
+   u64 virtual_apic_page_addr;
+   u64 apic_access_addr;
+   u64 ept_pointer;
+   u64 guest_physical_address;
+   u64 vmcs_link_pointer;
+   u64 guest_ia32_debugctl;
+   u64 guest_ia32_pat;
+   u64 guest_pdptr0;
+   u64 guest_pdptr1;
+   u64 guest_pdptr2;
+   u64 guest_pdptr3;
+   u64 host_ia32_pat;
+   u32 pin_based_vm_exec_control;
+   u32 cpu_based_vm_exec_control;
+   u32 exception_bitmap;
+   u32 page_fault_error_code_mask;
+   u32 page_fault_error_code_match;
+   u32 cr3_target_count;
+   u32 vm_exit_controls;
+   u32 vm_exit_msr_store_count;
+   u32 vm_exit_msr_load_count;
+   u32 vm_entry_controls;
+   u32 vm_entry_msr_load_count;
+   u32 vm_entry_intr_info_field;
+   u32 vm_entry_exception_error_code;
+   u32 vm_entry_instruction_len;
+   u32 tpr_threshold;
+   u32 secondary_vm_exec_control;
+   u32 vm_instruction_error;
+   u32 vm_exit_reason;
+   u32 vm_exit_intr_info;
+   u32 vm_exit_intr_error_code;
+   u32 idt_vectoring_info_field;
+   u32 idt_vectoring_error_code;
+   u32 vm_exit_instruction_len;
+   u32 vmx_instruction_info;
+   u32 guest_es_limit;
+   u32 guest_cs_limit;
+   u32 guest_ss_limit;
+   u32 guest_ds_limit;
+   u32 guest_fs_limit;
+   u32 guest_gs_limit;
+   u32 guest_ldtr_limit;
+   u32 guest_tr_limit;
+   u32 guest_gdtr_limit;
+   u32 guest_idtr_limit;
+   u32 guest_es_ar_bytes;
+   u32 guest_cs_ar_bytes;
+   u32 guest_ss_ar_bytes;
+   u32 guest_ds_ar_bytes;
+   u32 guest_fs_ar_bytes;
+   u32 guest_gs_ar_bytes;
+   u32 guest_ldtr_ar_bytes;
+   u32 guest_tr_ar_bytes;
+   u32 guest_interruptibility_info;
+   u32 guest_activity_state;
+   u32 guest_sysenter_cs;
+   u32 host_ia32_sysenter_cs;
+   unsigned long cr0_guest_host_mask;
+   unsigned long cr4_guest_host_mask;
+   unsigned long cr0_read_shadow;
+   unsigned long cr4_read_shadow;
+   unsigned long cr3_target_value0;
+   unsigned long cr3_target_value1;
+   unsigned long cr3_target_value2;
+   unsigned long cr3_target_value3;
+   unsigned long exit_qualification;
+   unsigned long guest_linear_address;
+   unsigned long guest_cr0;
+   unsigned long guest_cr3;
+   unsigned long guest_cr4;
+   unsigned long guest_es_base;
+   unsigned long guest_cs_base;
+   unsigned long guest_ss_base;
+   unsigned long guest_ds_base;
+   unsigned long guest_fs_base;
+   unsigned long guest_gs_base;
+   unsigned long guest_ldtr_base;
+   unsigned long guest_tr_base;
+   unsigned long guest_gdtr_base;
+   unsigned long guest_idtr_base;
+   unsigned long guest_dr7;
+   unsigned long guest_rsp;
+   unsigned long guest_rip;
+   unsigned long guest_rflags;
+   unsigned long guest_pending_dbg_exceptions;
+   unsigned long guest_sysenter_esp;
+   unsigned long guest_sysenter_eip;
+   unsigned long host_cr0;
+   unsigned long host_cr3;
+   unsigned long host_cr4;
+   unsigned long host_fs_base;
+   unsigned long host_gs_base;
+   unsigned long host_tr_base;
+   unsigned long host_gdtr_base;
+   unsigned long host_idtr_base;
+   unsigned long host_ia32_sysenter_esp;
+   unsigned long host_ia32_sysenter_eip;
+   unsigned long host_rsp;
+   unsigned long host_rip;
+};
+
 struct __attribute__ ((__packed__)) level_state {
/* Has the level1 guest done vmclear? */
bool vmclear;
+   u16 vpid;
+   u64 shadow_efer;
+   unsigned long cr2;
+   

[PATCH 2/5] Nested VMX patch 2 implements vmclear

2009-10-15 Thread oritw
From: Orit Wasserman <or...@il.ibm.com>

---
 arch/x86/kvm/vmx.c |   70 ---
 1 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 71bd91a..411cbdb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -61,15 +61,26 @@ module_param_named(unrestricted_guest,
 static int __read_mostly emulate_invalid_guest_state = 0;
 module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 
-struct vmcs {
-   u32 revision_id;
-   u32 abort;
-   char data[0];
+struct __attribute__ ((__packed__)) level_state {
+   /* Has the level1 guest done vmclear? */
+   bool vmclear;
 };
 
 struct nested_vmx {
/* Has the level1 guest done vmxon? */
bool vmxon;
+
+   /*
+    * Level 2 state: includes vmcs, registers and
+* a copy of vmcs12 for vmread/vmwrite
+*/
+   struct level_state *l2_state;
+};
+
+struct vmcs {
+   u32 revision_id;
+   u32 abort;
+   char data[0];
 };
 
 struct vcpu_vmx {
@@ -186,6 +197,8 @@ static struct kvm_vmx_segment_field {
 
 static void ept_save_pdptrs(struct kvm_vcpu *vcpu);
 
+static int create_l2_state(struct kvm_vcpu *vcpu);
+
 /*
  * Keep MSR_K6_STAR at the end, as setup_msrs() will try to optimize it
  * away by decrementing the array size.
@@ -1293,6 +1306,30 @@ static void vmclear_local_vcpus(void)
__vcpu_clear(vmx);
 }
 
+struct level_state *create_state(void)
+{
+   struct level_state *state = NULL;
+
+   state = kzalloc(sizeof(struct level_state), GFP_KERNEL);
+   if (!state) {
+   printk(KERN_INFO "Error create level state\n");
+   return NULL;
+   }
+   return state;
+}
+
+int create_l2_state(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   if (!vmx->nested.l2_state) {
+   vmx->nested.l2_state = create_state();
+   if (!vmx->nested.l2_state)
+   return -ENOMEM;
+   }
+
+   return 0;
+}
 
 /* Just like cpu_vmxoff(), but with the __kvm_handle_fault_on_reboot()
  * tricks.
@@ -3261,6 +3298,27 @@ static int handle_vmx_insn(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
+{
+   unsigned long rflags;
+   rflags = vmx_get_rflags(vcpu);
+   rflags &= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF);
+   vmx_set_rflags(vcpu, rflags);
+}
+
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   to_vmx(vcpu)->nested.l2_state->vmclear = 1;
+
+   skip_emulated_instruction(vcpu);
+   clear_rflags_cf_zf(vcpu);
+
+   return 1;
+}
+
 static int handle_vmoff(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3310,6 +3368,8 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
 
vmx-nested.vmxon = 1;
 
+   create_l2_state(vcpu);
+
skip_emulated_instruction(vcpu);
return 1;
 }
@@ -3582,7 +3642,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu 
*vcpu) = {
[EXIT_REASON_HLT] = handle_halt,
[EXIT_REASON_INVLPG]  = handle_invlpg,
[EXIT_REASON_VMCALL]  = handle_vmcall,
-   [EXIT_REASON_VMCLEAR] = handle_vmx_insn,
+   [EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
[EXIT_REASON_VMPTRLD] = handle_vmx_insn,
[EXIT_REASON_VMPTRST] = handle_vmx_insn,
-- 
1.6.0.4



[PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite

2009-10-15 Thread oritw
From: Orit Wasserman <or...@il.ibm.com>

---
 arch/x86/kvm/vmx.c |  591 +++-
 1 files changed, 589 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8c186e0..6a4c252 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -225,6 +225,21 @@ struct nested_vmx {
struct level_state *l1_state;
 };
 
+enum vmcs_field_type {
+   VMCS_FIELD_TYPE_U16 = 0,
+   VMCS_FIELD_TYPE_U64 = 1,
+   VMCS_FIELD_TYPE_U32 = 2,
+   VMCS_FIELD_TYPE_ULONG = 3
+};
+
+#define VMCS_FIELD_LENGTH_OFFSET 13
+#define VMCS_FIELD_LENGTH_MASK 0x6000
+
+static inline int vmcs_field_length(unsigned long field)
+{
+   return (VMCS_FIELD_LENGTH_MASK & field) >> 13;
+}
+
 struct vmcs {
u32 revision_id;
u32 abort;
@@ -288,6 +303,404 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu 
*vcpu)
return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+#define SHADOW_VMCS_OFFSET(x) offsetof(struct shadow_vmcs, x)
+
+static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
+
+   [VIRTUAL_PROCESSOR_ID] =
+   SHADOW_VMCS_OFFSET(virtual_processor_id),
+   [GUEST_ES_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_es_selector),
+   [GUEST_CS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_cs_selector),
+   [GUEST_SS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_ss_selector),
+   [GUEST_DS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_ds_selector),
+   [GUEST_FS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_fs_selector),
+   [GUEST_GS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_gs_selector),
+   [GUEST_LDTR_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_ldtr_selector),
+   [GUEST_TR_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_tr_selector),
+   [HOST_ES_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_es_selector),
+   [HOST_CS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_cs_selector),
+   [HOST_SS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_ss_selector),
+   [HOST_DS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_ds_selector),
+   [HOST_FS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_fs_selector),
+   [HOST_GS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_gs_selector),
+   [HOST_TR_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_tr_selector),
+   [IO_BITMAP_A] =
+   SHADOW_VMCS_OFFSET(io_bitmap_a),
+   [IO_BITMAP_A_HIGH] =
+   SHADOW_VMCS_OFFSET(io_bitmap_a)+4,
+   [IO_BITMAP_B] =
+   SHADOW_VMCS_OFFSET(io_bitmap_b),
+   [IO_BITMAP_B_HIGH] =
+   SHADOW_VMCS_OFFSET(io_bitmap_b)+4,
+   [MSR_BITMAP] =
+   SHADOW_VMCS_OFFSET(msr_bitmap),
+   [MSR_BITMAP_HIGH] =
+   SHADOW_VMCS_OFFSET(msr_bitmap)+4,
+   [VM_EXIT_MSR_STORE_ADDR] =
+   SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr),
+   [VM_EXIT_MSR_STORE_ADDR_HIGH] =
+   SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr)+4,
+   [VM_EXIT_MSR_LOAD_ADDR] =
+   SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr),
+   [VM_EXIT_MSR_LOAD_ADDR_HIGH] =
+   SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr)+4,
+   [VM_ENTRY_MSR_LOAD_ADDR] =
+   SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr),
+   [VM_ENTRY_MSR_LOAD_ADDR_HIGH] =
+   SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr)+4,
+   [TSC_OFFSET] =
+   SHADOW_VMCS_OFFSET(tsc_offset),
+   [TSC_OFFSET_HIGH] =
+   SHADOW_VMCS_OFFSET(tsc_offset)+4,
+   [VIRTUAL_APIC_PAGE_ADDR] =
+   SHADOW_VMCS_OFFSET(virtual_apic_page_addr),
+   [VIRTUAL_APIC_PAGE_ADDR_HIGH] =
+   SHADOW_VMCS_OFFSET(virtual_apic_page_addr)+4,
+   [APIC_ACCESS_ADDR] =
+   SHADOW_VMCS_OFFSET(apic_access_addr),
+   [APIC_ACCESS_ADDR_HIGH] =
+   SHADOW_VMCS_OFFSET(apic_access_addr)+4,
+   [EPT_POINTER] =
+   SHADOW_VMCS_OFFSET(ept_pointer),
+   [EPT_POINTER_HIGH] =
+   SHADOW_VMCS_OFFSET(ept_pointer)+4,
+   [GUEST_PHYSICAL_ADDRESS] =
+   SHADOW_VMCS_OFFSET(guest_physical_address),
+   [GUEST_PHYSICAL_ADDRESS_HIGH] =
+   SHADOW_VMCS_OFFSET(guest_physical_address)+4,
+   [VMCS_LINK_POINTER] =
+   SHADOW_VMCS_OFFSET(vmcs_link_pointer),
+   [VMCS_LINK_POINTER_HIGH] =
+   SHADOW_VMCS_OFFSET(vmcs_link_pointer)+4,
+   [GUEST_IA32_DEBUGCTL] =
+   SHADOW_VMCS_OFFSET(guest_ia32_debugctl),
+   [GUEST_IA32_DEBUGCTL_HIGH] =
+   SHADOW_VMCS_OFFSET(guest_ia32_debugctl)+4,
+   [GUEST_IA32_PAT] =
+   SHADOW_VMCS_OFFSET(guest_ia32_pat),
+   [GUEST_IA32_PAT_HIGH] =
+   SHADOW_VMCS_OFFSET(guest_ia32_pat)+4,
+   [GUEST_PDPTR0] =
+   SHADOW_VMCS_OFFSET(guest_pdptr0),
+   [GUEST_PDPTR0_HIGH] =
+   

[PATCH 1/5] Nested VMX patch 1 implements vmon and vmoff

2009-10-15 Thread oritw
From: Orit Wasserman or...@il.ibm.com

---
 arch/x86/kvm/svm.c |3 -
 arch/x86/kvm/vmx.c |  217 +++-
 arch/x86/kvm/x86.c |6 +-
 arch/x86/kvm/x86.h |2 +
 4 files changed, 222 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2df9b45..3c1f22a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -124,9 +124,6 @@ static int npt = 1;
 
 module_param(npt, int, S_IRUGO);
 
-static int nested = 1;
-module_param(nested, int, S_IRUGO);
-
 static void svm_flush_tlb(struct kvm_vcpu *vcpu);
 static void svm_complete_interrupts(struct vcpu_svm *svm);
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 78101dd..71bd91a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -67,6 +67,11 @@ struct vmcs {
char data[0];
 };
 
+struct nested_vmx {
+   /* Has the level1 guest done vmxon? */
+   bool vmxon;
+};
+
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
struct list_head  local_vcpus_link;
@@ -114,6 +119,9 @@ struct vcpu_vmx {
ktime_t entry_time;
s64 vnmi_blocked_time;
u32 exit_reason;
+
+   /* Nested vmx */
+   struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -967,6 +975,95 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
 }
 
 /*
+ * Handles msr read for nested virtualization
+ */
+static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index,
+ u64 *pdata)
+{
+   u64 vmx_msr = 0;
+
+   switch (msr_index) {
+   case MSR_IA32_FEATURE_CONTROL:
+   *pdata = 0;
+   break;
+   case MSR_IA32_VMX_BASIC:
+   *pdata = 0;
+   rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+   *pdata = (vmx_msr & 0x00cf);
+   break;
+   case MSR_IA32_VMX_PINBASED_CTLS:
+   rdmsrl(MSR_IA32_VMX_PINBASED_CTLS, vmx_msr);
+   *pdata = (PIN_BASED_EXT_INTR_MASK & vmcs_config.pin_based_exec_ctrl) |
+   (PIN_BASED_NMI_EXITING & vmcs_config.pin_based_exec_ctrl) |
+   (PIN_BASED_VIRTUAL_NMIS & vmcs_config.pin_based_exec_ctrl);
+   break;
+   case MSR_IA32_VMX_PROCBASED_CTLS:
+   {
+   u32 vmx_msr_high, vmx_msr_low;
+   u32 control = CPU_BASED_HLT_EXITING |
+#ifdef CONFIG_X86_64
+   CPU_BASED_CR8_LOAD_EXITING |
+   CPU_BASED_CR8_STORE_EXITING |
+#endif
+   CPU_BASED_CR3_LOAD_EXITING |
+   CPU_BASED_CR3_STORE_EXITING |
+   CPU_BASED_USE_IO_BITMAPS |
+   CPU_BASED_MOV_DR_EXITING |
+   CPU_BASED_USE_TSC_OFFSETING |
+   CPU_BASED_INVLPG_EXITING |
+   CPU_BASED_TPR_SHADOW |
+   CPU_BASED_USE_MSR_BITMAPS |
+   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+
+   rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+
+   control &= vmx_msr_high; /* bit == 0 in high word == must be zero */
+   control |= vmx_msr_low;  /* bit == 1 in low word  == must be one  */
+
+   *pdata = (CPU_BASED_HLT_EXITING & control) |
+#ifdef CONFIG_X86_64
+   (CPU_BASED_CR8_LOAD_EXITING & control) |
+   (CPU_BASED_CR8_STORE_EXITING & control) |
+#endif
+   (CPU_BASED_CR3_LOAD_EXITING & control) |
+   (CPU_BASED_CR3_STORE_EXITING & control) |
+   (CPU_BASED_USE_IO_BITMAPS & control) |
+   (CPU_BASED_MOV_DR_EXITING & control) |
+   (CPU_BASED_USE_TSC_OFFSETING & control) |
+   (CPU_BASED_INVLPG_EXITING & control);
+
+   if (cpu_has_secondary_exec_ctrls())
+   *pdata |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+
+   if (vm_need_tpr_shadow(vcpu->kvm))
+   *pdata |= CPU_BASED_TPR_SHADOW;
+   break;
+   }
+   case MSR_IA32_VMX_EXIT_CTLS:
+   *pdata = 0;
+#ifdef CONFIG_X86_64
+   *pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
+#endif
+   break;
+   case MSR_IA32_VMX_ENTRY_CTLS:
+   *pdata = 0;
+   break;
+   case MSR_IA32_VMX_PROCBASED_CTLS2:
+   *pdata = 0;
+   if (vm_need_virtualize_apic_accesses(vcpu->kvm))
+   *pdata |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+   break;
+   case MSR_IA32_VMX_EPT_VPID_CAP:
+   *pdata = 0;
+   break;
+   default:
+   return 1;
+   }
+
+   return 0;
+}
+
+/*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
@@ -1005,6 +1102,9 @@ static int 

Nested VMX support v3

2009-10-15 Thread oritw
Avi,
We have addressed all of the comments, please apply.

The following patches implement nested VMX support. The patches enable a guest
to use the VMX APIs in order to run its own nested guest (i.e., enable running
other hypervisors which use VMX under KVM). The current patches support running
Linux under a nested KVM using shadow page table (with bypass_guest_pf
disabled). SMP support was fixed.  Reworking EPT support to mesh cleanly with
the current shadow paging design per Avi's comments is a work-in-progress.  

The current patches only support a single nested hypervisor, which can only run
a single guest (multiple guests are work in progress). Only 64-bit nested
hypervisors are supported.

Additional patches for running Windows under nested KVM, and Linux under nested
VMware server(!), are currently running in the lab. We are in the process of
forward-porting those patches to -tip.

These patches were written by:
 Orit Wasserman, or...@il.ibm.com
 Ben-Ami Yassor, ben...@il.ibm.com
 Abel Gordon, ab...@il.ibm.com
 Muli Ben-Yehuda, m...@il.ibm.com
 
With contributions by:
 Anthony Liguori, aligu...@us.ibm.com
 Mike Day, md...@us.ibm.com

This work was inspired by the nested SVM support by Alexander Graf and Joerg
Roedel.

Changes since v2:
Added check to nested_vmx_get_msr.
Static initialization of the vmcs_field_to_offset_table array.
Use the memory allocated by L1 for VMCS12 to store the shadow vmcs.
Some optimization to the prepare_vmcs_12 function.

vpid allocation will be updated with the multiguest support (work in progress).
We are working on fixing the cr0.TS handling; it works for nested KVM but not
for VMware Server.



Re: [PATCH] allow userspace to adjust kvmclock offset

2009-10-15 Thread Glauber Costa
On Thu, Oct 15, 2009 at 09:46:52AM +0900, Avi Kivity wrote:
 On 10/13/2009 09:46 PM, Glauber Costa wrote:
 On Tue, Oct 13, 2009 at 03:31:08PM +0300, Avi Kivity wrote:

 On 10/13/2009 03:28 PM, Glauber Costa wrote:
  

 Do we want an absolute or relative adjustment?

  
 What exactly do you mean?


 Absolute adjustment: clock = t
 Relative adjustment: clock += t
  
 The delta is absolute, but the adjustment in the clock is relative.

 So we pick the difference between what userspace is passing us and what
 we currently have, then relatively adds up so we can make sure we won't
 go back or suffer a too big skew.


 The motivation for relative adjustment is when you have a jitter  
 resistant place to gather timing information (like the kernel, which can  
 disable interrupts and preemption), then pass it on to kvm without  
 losing information due to scheduling.  For migration there is no such  
 place since it involves two hosts, but it makes sense to support  
 relative adjustments.
Since we added the padding you asked for, we could use that bit of information
to define whether it will be a relative or absolute adjustment, then. Right now,
I don't see the point of implementing a code path that will be completely 
untested.

I'd leave it this way until someone comes up with a need.


Re: Can't make virtio block driver work on Windows 2003

2009-10-15 Thread Vadim Rozenfeld

On 10/15/2009 04:23 PM, Asdo wrote:

Vadim Rozenfeld wrote:

On 10/15/2009 01:42 PM, Asdo wrote:

Vadim Rozenfeld wrote:

On 10/14/2009 07:52 PM, Asdo wrote:

...
So I tried adding another drive, a virtio one, (a new 100MB file 
at host side) to the virtual machine and rebooting.


A first problem is that Windows does not detect the new device 
upon boot or Add Hardware scan.
Check PCI devices with "info pci". You must have "SCSI controller: 
PCI device 1af4:1001" reported.


It's not there. Does this make it a KVM bug?
Looks like virtio-blk device wasn't initialized. Otherwise I cannot 
explain why 0x1100 device is here.

Try to start block device without index=1
Anyway, if you can, please send info pci output from QEMU monitor 
console.


Owh! Ok THAT was info pci
Ok I am copying by hand before removing index=1

(qemu) info pci
Bus 0, device 0, function 0:
   Host bridge: PCI device 8086:1237
Bus 0, device 1, function 0:
   ISA bridge: PCI device 8086:7000
Bus 0, device 1, function 1:
   IDE controller: PCI device 8086:7010
  BAR4: I/O at 0xc000 [0xc00f].
Bus 0, device 1, function 3:
   Bridge: PCI device 8086:7133
   IRQ 9
Bus 0, device 2, function 0:
   VGA controller: PCI device 1013:00b8
  BAR0: 32 bit memory at 0xf000 [0xf1ff]
  BAR1: 32 bit memory at 0xf200 [0xf2000fff]
Bus 0, device 3, function 0:
   Ethernet controller: PCI device 1af4:1000
  IRQ 11
  BAR0: I/O at 0xc020 [0xc03f]
Bus 0, device 4, function 0:
   RAM controller: PCI device 1af4:1002
   IRQ 11
   BAR0: I/O at 0xc040
(qemu)

so it's not there

Now I remove index=1:

WOW it's there now!
...
Bus 0 device 4 function 0:
   Storage controller: PCI device 1af4:1001
   IRQ 11
   BAR0: I/O at 0xc040 [0xc07f]

(just before the 1002 device)

So now windows sees it and I was able to install the viostor drivers 
(btw Windows was not happy with the previously installed viostor 
drivers, I had to reinstall those and I got two devices, and the 
previous one still had the yellow exclamation mark, so I had to 
uninstall that one. After the procedure I was able to boot on virtio 
too! Yeah!).


Great so yes, I'd say you *DO* have a KVM bug: one has to remove 
index=1 for the second disk to appear. How did you know that, Vadim, 
is it a known issue with kvm? 
I don't know. I think, I've seen it once or twice while debugging 
viostor on old qemu-kvm.

But it definitely works with the recent versions.
Regards,
Vadim
It's better to fix that because libvirt puts index=n for all drives 
so it's impossible to work around the problem if one uses libvirt. I 
had to launch manually...


Thanks a lot Vadim.

Asdo


Re: Raw vs. tap

2009-10-15 Thread Michael S. Tsirkin
On Thu, Oct 15, 2009 at 08:32:03AM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote:
   
 I would be much more inclined to consider  taking raw and improving 
 the performance long term if guest-host  networking worked.  This 
 appears to be a fundamental limitation though  and I think it's 
 something that will forever plague users if we include  this feature.
 

 In fact, I think it's fixable with a raw socket bound to a macvlan.
 Would that be enough?
   

 What setup does that entail on the part of a user?  Wouldn't we be back  
 to square one wrt users having to run archaic networking commands in  
 order to set things up?

Unlike bridge, qemu could set up macvlan without disrupting
host networking. The only issue would be cleanup if qemu
is killed.

 Regards,

 Anthony Liguori



Re: Raw vs. tap

2009-10-15 Thread Anthony Liguori

Michael S. Tsirkin wrote:

On Thu, Oct 15, 2009 at 08:32:03AM -0500, Anthony Liguori wrote:
  

Michael S. Tsirkin wrote:


On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote:
  
  
I would be much more inclined to consider  taking raw and improving 
the performance long term if guest-host  networking worked.  This 
appears to be a fundamental limitation though  and I think it's 
something that will forever plague users if we include  this feature.



In fact, I think it's fixable with a raw socket bound to a macvlan.
Would that be enough?
  
  
What setup does that entail on the part of a user?  Wouldn't we be back  
to square one wrt users having to run archaic networking commands in  
order to set things up?



Unlike bridge, qemu could set up macvlan without disrupting
host networking. The only issue would be cleanup if qemu
is killed.
  


But this would require additional features in macvlan, correct?

This also only works if a guest uses the mac address assigned to it, 
correct?  If a guest was bridging the virtual nic, this would all come 
apart?


Regards,

Anthony Liguori


Re: Raw vs. tap

2009-10-15 Thread Michael S. Tsirkin
On Thu, Oct 15, 2009 at 10:18:18AM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 On Thu, Oct 15, 2009 at 08:32:03AM -0500, Anthony Liguori wrote:
   
 Michael S. Tsirkin wrote:
 
 On Wed, Oct 14, 2009 at 05:53:56PM -0500, Anthony Liguori wrote:
 
 I would be much more inclined to consider  taking raw and 
 improving the performance long term if guest-host  networking 
 worked.  This appears to be a fundamental limitation though  and 
 I think it's something that will forever plague users if we 
 include  this feature.
 
 In fact, I think it's fixable with a raw socket bound to a macvlan.
 Would that be enough?
 
 What setup does that entail on the part of a user?  Wouldn't we be 
 back  to square one wrt users having to run archaic networking 
 commands in  order to set things up?
 

 Unlike bridge, qemu could set up macvlan without disrupting
 host networking. The only issue would be cleanup if qemu
 is killed.
   

 But this would require additional features in macvlan, correct?

Not sure: what is the "this" that you are talking about?
It can already be set up without disturbing host networking.

 This also only works if a guest uses the mac address assigned to it,  
 correct?  If a guest was bridging the virtual nic, this would all come  
 apart?

Hmm, you could enable promisc mode, but generally this is true:
if you require bridging, use a bridge.

 Regards,

 Anthony Liguori


Re: [PATCH 0/2] Switch pcbios to submodule

2009-10-15 Thread Marcelo Tosatti
On Mon, Oct 12, 2009 at 12:25:40PM +0200, Avi Kivity wrote:
 Instead of carrying pcbios as a subtree in kvm/bios/, switch to a
 submodule in roms/pcbios/.  The submodule contains all of the subtree
 history and is merge-compatible with qemu.git's pcbios submodule.

Applied, thanks.



Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Jan Kiszka
Glauber Costa wrote:
 On Thu, Oct 15, 2009 at 05:11:27PM +0900, Avi Kivity wrote:
 On 10/14/2009 01:06 AM, Jan Kiszka wrote:
 Add a new IOCTL pair to retrieve or set the VCPU state in one chunk.
 More precisely, the IOCTL is able to process a list of substates to be
 read or written. This list is easily extensible without breaking the
 existing ABI, thus we will no longer have to add new IOCTLs when we
 discover a missing VCPU state field or want to support new hardware
 features.

 This patch establishes the generic infrastructure for KVM_GET/
 SET_VCPU_STATE and adds support for the generic substates REGS, SREGS,
 FPU, and MP. To avoid code duplication, the entry point for the
 corresponding original IOCTLs are converted to make use of the new
 infrastructure internally, too.


 One last thing - Documentation/kvm/api.txt needs updating.  Glauber,  
 this holds for your patches as well.
 Now looking at it... you do realize that that file is terribly outdated, 
 right?

At least it's terribly incomplete. I just decided to add my stuff at the
bottom and wait for a bored soul to refactor, fix, extend, whatever this
thing. :)

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


Re: [PATCH 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Glauber Costa
On Thu, Oct 15, 2009 at 06:06:04PM +0200, Jan Kiszka wrote:
 Glauber Costa wrote:
  On Thu, Oct 15, 2009 at 05:11:27PM +0900, Avi Kivity wrote:
  On 10/14/2009 01:06 AM, Jan Kiszka wrote:
  Add a new IOCTL pair to retrieve or set the VCPU state in one chunk.
  More precisely, the IOCTL is able to process a list of substates to be
  read or written. This list is easily extensible without breaking the
  existing ABI, thus we will no longer have to add new IOCTLs when we
  discover a missing VCPU state field or want to support new hardware
  features.
 
  This patch establishes the generic infrastructure for KVM_GET/
  SET_VCPU_STATE and adds support for the generic substates REGS, SREGS,
  FPU, and MP. To avoid code duplication, the entry point for the
  corresponding original IOCTLs are converted to make use of the new
  infrastructure internally, too.
 
 
  One last thing - Documentation/kvm/api.txt needs updating.  Glauber,  
  this holds for your patches as well.
  Now looking at it... you do realize that that file is terribly outdated, 
  right?
 
 At least it's terribly incomplete. I just decided to add my stuff at the
 bottom and wait for a bored soul to refactor, fix, extend, whatever this
 thing. :)
We'll probably clash, then =p.




Re: [PATCH] kvm: fix MSR_COUNT for kvm_arch_save_regs()

2009-10-15 Thread Marcelo Tosatti
On Wed, Oct 14, 2009 at 03:02:27PM -0300, Eduardo Habkost wrote:
 
 A new register was added to the load/save list on commit
 d283d5a65a2bdcc570065267be21848bd6fe3d78, but MSR_COUNT was not updated,
 leading to potential stack corruption on kvm_arch_save_regs().
 
 The following registers are saved by kvm_arch_save_regs():
 
  1) MSR_IA32_SYSENTER_CS
  2) MSR_IA32_SYSENTER_ESP
  3) MSR_IA32_SYSENTER_EIP
  4) MSR_STAR
  5) MSR_IA32_TSC
  6) MSR_VM_HSAVE_PA
  7) MSR_CSTAR (x86_64 only)
  8) MSR_KERNELGSBASE (x86_64 only)
  9) MSR_FMASK (x86_64 only)
 10) MSR_LSTAR (x86_64 only)
 
 Signed-off-by: Eduardo Habkost ehabk...@redhat.com

Applied, thanks.



Re: [PATCH] kvm: Prevent kvm_init from corrupting debugfs structures

2009-10-15 Thread Marcelo Tosatti
On Wed, Oct 14, 2009 at 04:21:00PM -0700, Darrick J. Wong wrote:
 I'm seeing an oops condition when kvm-intel and kvm-amd are modprobe'd
 during boot (say on an Intel system) and then rmmod'd:
 
# modprobe kvm-intel
  kvm_init()
  kvm_init_debug()
  kvm_arch_init()  <-- stores debugfs dentries internally
  (success, etc)
 
# modprobe kvm-amd
  kvm_init()
  kvm_init_debug() <-- second initialization clobbers kvm's
   internal pointers to dentries
  kvm_arch_init()
  kvm_exit_debug() <-- and frees them
 
# rmmod kvm-intel
  kvm_exit()
  kvm_exit_debug() <-- double free of debugfs files!
 
  *BOOM*
 
 If execution gets to the end of kvm_init(), then the calling module has been
 established as the kvm provider.  Move the debugfs initialization to the end of
 the function, and remove the now-unnecessary call to kvm_exit_debug() from the
 error path.  That way we avoid trampling on the debugfs entries and freeing
 them twice.
 
 Signed-off-by: Darrick J. Wong djw...@us.ibm.com

Applied, thanks.



[PATCH v2 1/3] change function signatures so that they don't take a vcpu argument

2009-10-15 Thread Glauber Costa
At this point, vcpu arguments are passed only for the fd field.
We already provide that in env, as kvm_fd. Replace it.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 cpu-defs.h |1 -
 hw/apic.c  |4 +-
 kvm-tpr-opt.c  |   16 +-
 qemu-kvm-x86.c |   91 ++--
 qemu-kvm.c |   97 +++
 qemu-kvm.h |   74 ++-
 6 files changed, 134 insertions(+), 149 deletions(-)

diff --git a/cpu-defs.h b/cpu-defs.h
index 1f48267..cf502e9 100644
--- a/cpu-defs.h
+++ b/cpu-defs.h
@@ -141,7 +141,6 @@ struct qemu_work_item;
 struct KVMCPUState {
 pthread_t thread;
 int signalled;
-void *vcpu_ctx;
 struct qemu_work_item *queued_work_first, *queued_work_last;
 int regs_modified;
 };
diff --git a/hw/apic.c b/hw/apic.c
index b8fe529..9e707bd 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -900,7 +900,7 @@ static void kvm_kernel_lapic_save_to_user(APICState *s)
 struct kvm_lapic_state *kapic = apic;
 int i, v;
 
-kvm_get_lapic(s->cpu_env->kvm_cpu_state.vcpu_ctx, kapic);
+kvm_get_lapic(s->cpu_env, kapic);
 
s->id = kapic_reg(kapic, 0x2) >> 24;
s->tpr = kapic_reg(kapic, 0x8);
@@ -953,7 +953,7 @@ static void kvm_kernel_lapic_load_from_user(APICState *s)
 kapic_set_reg(klapic, 0x38, s-initial_count);
 kapic_set_reg(klapic, 0x3e, s-divide_conf);
 
-kvm_set_lapic(s->cpu_env->kvm_cpu_state.vcpu_ctx, klapic);
+kvm_set_lapic(s->cpu_env, klapic);
 }
 
 #endif
diff --git a/kvm-tpr-opt.c b/kvm-tpr-opt.c
index f7b6f3b..932b49b 100644
--- a/kvm-tpr-opt.c
+++ b/kvm-tpr-opt.c
@@ -70,7 +70,7 @@ static uint8_t read_byte_virt(CPUState *env, target_ulong 
virt)
 {
 struct kvm_sregs sregs;
 
-kvm_get_sregs(env->kvm_cpu_state.vcpu_ctx, &sregs);
+kvm_get_sregs(env, &sregs);
 return ldub_phys(map_addr(&sregs, virt, NULL));
 }
 
@@ -78,7 +78,7 @@ static void write_byte_virt(CPUState *env, target_ulong virt, 
uint8_t b)
 {
 struct kvm_sregs sregs;
 
-kvm_get_sregs(env->kvm_cpu_state.vcpu_ctx, &sregs);
+kvm_get_sregs(env, &sregs);
 stb_phys(map_addr(&sregs, virt, NULL), b);
 }
 
@@ -86,7 +86,7 @@ static __u64 kvm_rsp_read(CPUState *env)
 {
 struct kvm_regs regs;
 
-kvm_get_regs(env->kvm_cpu_state.vcpu_ctx, &regs);
+kvm_get_regs(env, &regs);
 return regs.rsp;
 }
 
@@ -192,7 +192,7 @@ static int bios_is_mapped(CPUState *env, uint64_t rip)
 if (bios_enabled)
return 1;
 
-kvm_get_sregs(env->kvm_cpu_state.vcpu_ctx, &sregs);
+kvm_get_sregs(env, &sregs);
 
 probe = (rip  0xf000) + 0xe;
 phys = map_addr(sregs, probe, perms);
@@ -240,7 +240,7 @@ static int enable_vapic(CPUState *env)
 if (pcr_cpu < 0)
return 0;
 
-kvm_enable_vapic(env->kvm_cpu_state.vcpu_ctx, vapic_phys + (pcr_cpu << 7));
+kvm_enable_vapic(env, vapic_phys + (pcr_cpu << 7));
 cpu_physical_memory_rw(vapic_phys + (pcr_cpu << 7) + 4, &one, 1, 1);
 bios_enabled = 1;
 
@@ -313,7 +313,7 @@ void kvm_tpr_access_report(CPUState *env, uint64_t rip, int 
is_write)
 
 void kvm_tpr_vcpu_start(CPUState *env)
 {
-kvm_enable_tpr_access_reporting(env->kvm_cpu_state.vcpu_ctx);
+kvm_enable_tpr_access_reporting(env);
 if (bios_enabled)
enable_vapic(env);
 }
@@ -363,7 +363,7 @@ static void vtpr_ioport_write(void *opaque, uint32_t addr, 
uint32_t val)
 struct kvm_sregs sregs;
 uint32_t rip;
 
-kvm_get_regs(env->kvm_cpu_state.vcpu_ctx, &regs);
+kvm_get_regs(env, &regs);
 rip = regs.rip - 2;
 write_byte_virt(env, rip, 0x66);
 write_byte_virt(env, rip + 1, 0x90);
@@ -371,7 +371,7 @@ static void vtpr_ioport_write(void *opaque, uint32_t addr, 
uint32_t val)
return;
 if (!bios_is_mapped(env, rip))
printf(bios not mapped?\n);
-kvm_get_sregs(env->kvm_cpu_state.vcpu_ctx, &sregs);
+kvm_get_sregs(env, &sregs);
 for (addr = 0xf000u; addr >= 0x8000u; addr -= 4096)
 if (map_addr(&sregs, addr, NULL) == 0xfee0u) {
real_tpr = addr + 0x80;
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index fffcfd8..8c4140d 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -172,14 +172,14 @@ static int kvm_handle_tpr_access(CPUState *env)
 }
 
 
-int kvm_enable_vapic(kvm_vcpu_context_t vcpu, uint64_t vapic)
+int kvm_enable_vapic(CPUState *env, uint64_t vapic)
 {
int r;
struct kvm_vapic_addr va = {
.vapic_addr = vapic,
};
 
-   r = ioctl(vcpu->fd, KVM_SET_VAPIC_ADDR, &va);
+   r = ioctl(env->kvm_fd, KVM_SET_VAPIC_ADDR, &va);
if (r == -1) {
r = -errno;
perror(kvm_enable_vapic);
@@ -281,12 +281,12 @@ int kvm_destroy_memory_alias(kvm_context_t kvm, uint64_t 
phys_start)
 
 #ifdef KVM_CAP_IRQCHIP
 
-int kvm_get_lapic(kvm_vcpu_context_t vcpu, struct kvm_lapic_state *s)
+int kvm_get_lapic(CPUState *env, struct kvm_lapic_state *s)
 {
int r;
if 

[PATCH v2 2/3] get rid of vcpu structure

2009-10-15 Thread Glauber Costa
We have no use for it anymore. The only trace of it was in vcpu_create.
Make it disappear.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 qemu-kvm.c |   11 +++
 qemu-kvm.h |5 -
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 700d030..7943281 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -440,16 +440,13 @@ static void kvm_create_vcpu(CPUState *env, int id)
 {
 long mmap_size;
 int r;
-kvm_vcpu_context_t vcpu_ctx = qemu_malloc(sizeof(struct kvm_vcpu_context));
 
 r = kvm_vm_ioctl(kvm_state, KVM_CREATE_VCPU, id);
 if (r < 0) {
 fprintf(stderr, kvm_create_vcpu: %m\n);
-goto err;
+return;
 }
 
-vcpu_ctx->fd = r;
-
 env->kvm_fd = r;
 env->kvm_state = kvm_state;
 
@@ -459,7 +456,7 @@ static void kvm_create_vcpu(CPUState *env, int id)
 goto err_fd;
 }
 env->kvm_run =
-mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu_ctx->fd,
+mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, env->kvm_fd,
  0);
 if (env->kvm_run == MAP_FAILED) {
 fprintf(stderr, mmap vcpu area: %m\n);
@@ -468,9 +465,7 @@ static void kvm_create_vcpu(CPUState *env, int id)
 
 return;
   err_fd:
-close(vcpu_ctx->fd);
-  err:
-free(vcpu_ctx);
+close(env->kvm_fd);
 }
 
 static int kvm_set_boot_vcpu_id(kvm_context_t kvm, uint32_t id)
diff --git a/qemu-kvm.h b/qemu-kvm.h
index abcb98d..588bc80 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -76,12 +76,7 @@ struct kvm_context {
 int max_gsi;
 };
 
-struct kvm_vcpu_context {
-int fd;
-};
-
 typedef struct kvm_context *kvm_context_t;
-typedef struct kvm_vcpu_context *kvm_vcpu_context_t;
 
 #include kvm.h
 int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
-- 
1.6.2.5



[PATCH v2 3/3] use upstream kvm_vcpu_ioctl

2009-10-15 Thread Glauber Costa
[v2: we already return -errno, so fix testers ]

Signed-off-by: Glauber Costa glom...@redhat.com
---
 kvm-all.c  |3 --
 qemu-kvm-x86.c |   57 +++
 qemu-kvm.c |   31 -
 qemu-kvm.h |1 +
 4 files changed, 26 insertions(+), 66 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 1356aa8..5ea999e 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -861,7 +861,6 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
 return ret;
 }
 
-#ifdef KVM_UPSTREAM
 int kvm_vcpu_ioctl(CPUState *env, int type, ...)
 {
 int ret;
@@ -879,8 +878,6 @@ int kvm_vcpu_ioctl(CPUState *env, int type, ...)
 return ret;
 }
 
-#endif
-
 int kvm_has_sync_mmu(void)
 {
 #ifdef KVM_CAP_SYNC_MMU
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 8c4140d..c8e37ed 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -174,18 +174,11 @@ static int kvm_handle_tpr_access(CPUState *env)
 
 int kvm_enable_vapic(CPUState *env, uint64_t vapic)
 {
-   int r;
struct kvm_vapic_addr va = {
.vapic_addr = vapic,
};
 
-   r = ioctl(env->kvm_fd, KVM_SET_VAPIC_ADDR, &va);
-   if (r == -1) {
-   r = -errno;
-   perror(kvm_enable_vapic);
-   return r;
-   }
-   return 0;
+   return kvm_vcpu_ioctl(env, KVM_SET_VAPIC_ADDR, &va);
 }
 
 #endif
@@ -283,28 +276,16 @@ int kvm_destroy_memory_alias(kvm_context_t kvm, uint64_t 
phys_start)
 
 int kvm_get_lapic(CPUState *env, struct kvm_lapic_state *s)
 {
-   int r;
if (!kvm_irqchip_in_kernel())
return 0;
-   r = ioctl(env->kvm_fd, KVM_GET_LAPIC, s);
-   if (r == -1) {
-   r = -errno;
-   perror(kvm_get_lapic);
-   }
-   return r;
+   return kvm_vcpu_ioctl(env, KVM_GET_LAPIC, s);
 }
 
 int kvm_set_lapic(CPUState *env, struct kvm_lapic_state *s)
 {
-   int r;
if (!kvm_irqchip_in_kernel())
return 0;
-   r = ioctl(env->kvm_fd, KVM_SET_LAPIC, s);
-   if (r == -1) {
-   r = -errno;
-   perror(kvm_set_lapic);
-   }
-   return r;
+   return kvm_vcpu_ioctl(env, KVM_SET_LAPIC, s);
 }
 
 #endif
@@ -420,29 +401,25 @@ struct kvm_msr_list *kvm_get_msr_list(kvm_context_t kvm)
 int kvm_get_msrs(CPUState *env, struct kvm_msr_entry *msrs, int n)
 {
 struct kvm_msrs *kmsrs = qemu_malloc(sizeof *kmsrs + n * sizeof *msrs);
-int r, e;
+int r;
 
 kmsrs->nmsrs = n;
 memcpy(kmsrs->entries, msrs, n * sizeof *msrs);
-r = ioctl(env->kvm_fd, KVM_GET_MSRS, kmsrs);
-e = errno;
+r = kvm_vcpu_ioctl(env, KVM_GET_MSRS, kmsrs);
 memcpy(msrs, kmsrs->entries, n * sizeof *msrs);
 free(kmsrs);
-errno = e;
 return r;
 }
 
 int kvm_set_msrs(CPUState *env, struct kvm_msr_entry *msrs, int n)
 {
 struct kvm_msrs *kmsrs = qemu_malloc(sizeof *kmsrs + n * sizeof *msrs);
-int r, e;
+int r;
 
 kmsrs->nmsrs = n;
 memcpy(kmsrs->entries, msrs, n * sizeof *msrs);
-r = ioctl(env->kvm_fd, KVM_SET_MSRS, kmsrs);
-e = errno;
+r = kvm_vcpu_ioctl(env, KVM_SET_MSRS, kmsrs);
 free(kmsrs);
-errno = e;
 return r;
 }
 
@@ -464,7 +441,7 @@ int kvm_get_mce_cap_supported(kvm_context_t kvm, uint64_t *mce_cap,
 int kvm_setup_mce(CPUState *env, uint64_t *mcg_cap)
 {
 #ifdef KVM_CAP_MCE
-return ioctl(env->kvm_fd, KVM_X86_SETUP_MCE, mcg_cap);
+return kvm_vcpu_ioctl(env, KVM_X86_SETUP_MCE, mcg_cap);
 #else
 return -ENOSYS;
 #endif
@@ -473,7 +450,7 @@ int kvm_setup_mce(CPUState *env, uint64_t *mcg_cap)
 int kvm_set_mce(CPUState *env, struct kvm_x86_mce *m)
 {
 #ifdef KVM_CAP_MCE
-return ioctl(env->kvm_fd, KVM_X86_SET_MCE, m);
+return kvm_vcpu_ioctl(env, KVM_X86_SET_MCE, m);
 #else
 return -ENOSYS;
 #endif
@@ -563,7 +540,7 @@ int kvm_setup_cpuid(CPUState *env, int nent,
 
cpuid->nent = nent;
memcpy(cpuid->entries, entries, nent * sizeof(*entries));
-   r = ioctl(env->kvm_fd, KVM_SET_CPUID, cpuid);
+   r = kvm_vcpu_ioctl(env, KVM_SET_CPUID, cpuid);
 
free(cpuid);
return r;
@@ -579,11 +556,7 @@ int kvm_setup_cpuid2(CPUState *env, int nent,
 
cpuid->nent = nent;
memcpy(cpuid->entries, entries, nent * sizeof(*entries));
-   r = ioctl(env->kvm_fd, KVM_SET_CPUID2, cpuid);
-   if (r == -1) {
-   fprintf(stderr, "kvm_setup_cpuid2: %m\n");
-   r = -errno;
-   }
+   r = kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid);
free(cpuid);
return r;
 }
@@ -634,13 +607,7 @@ static int tpr_access_reporting(CPUState *env, int enabled)
r = kvm_ioctl(kvm_state, KVM_CHECK_EXTENSION, KVM_CAP_VAPIC);
if (r <= 0)
return -ENOSYS;
-   r = ioctl(env->kvm_fd, KVM_TPR_ACCESS_REPORTING, &tac);
-   if (r == -1) {
-   r = -errno;
-   perror("KVM_TPR_ACCESS_REPORTING");
-   return r;
-   }
-   return 0;
+   return kvm_vcpu_ioctl(env, KVM_TPR_ACCESS_REPORTING, &tac);

[PATCH v2 3/4] KVM: x86: Add support for KVM_GET/SET_VCPU_STATE

2009-10-15 Thread Jan Kiszka
Add support for getting/setting MSRs, CPUID tree, and the LAPIC via the
new VCPU state interface. Also in this case we convert the existing
IOCTLs to use the new infrastructure internally.

The MSR interface has to be extended to pass back the number of
processed MSRs via the header structure instead of the return code as
the latter is not available with the new IOCTL. The semantic of the
original KVM_GET/SET_MSRS is not affected by this change.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

 Documentation/kvm/api.txt  |   18 
 arch/x86/include/asm/kvm.h |8 +-
 arch/x86/kvm/x86.c |  209 
 3 files changed, 156 insertions(+), 79 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index 7c0be8d..bee5bbd 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -830,3 +830,21 @@ Deprecates: KVM_GET/SET_MP_STATE
 struct kvm_mp_state {
__u32 mp_state; /* KVM_MP_STATE_* */
 };
+
+6.5 KVM_X86_VCPU_STATE_MSRS
+
+Architectures: x86
+Payload: struct kvm_msrs (see KVM_GET_MSRS)
+Deprecates: KVM_GET/SET_MSRS
+
+6.6 KVM_X86_VCPU_STATE_CPUID
+
+Architectures: x86
+Payload: struct kvm_cpuid2
+Deprecates: KVM_GET/SET_CPUID2
+
+6.7 KVM_X86_VCPU_STATE_LAPIC
+
+Architectures: x86
+Payload: struct kvm_lapic
+Deprecates: KVM_GET/SET_LAPIC
diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
index f02e87a..326615a 100644
--- a/arch/x86/include/asm/kvm.h
+++ b/arch/x86/include/asm/kvm.h
@@ -150,7 +150,7 @@ struct kvm_msr_entry {
 /* for KVM_GET_MSRS and KVM_SET_MSRS */
 struct kvm_msrs {
__u32 nmsrs; /* number of msrs in entries */
-   __u32 pad;
+   __u32 nprocessed; /* return value: successfully processed entries */
 
struct kvm_msr_entry entries[0];
 };
@@ -251,4 +251,10 @@ struct kvm_reinject_control {
__u8 pit_reinject;
__u8 reserved[31];
 };
+
+/* for KVM_GET/SET_VCPU_STATE */
+#define KVM_X86_VCPU_STATE_MSRS1000
+#define KVM_X86_VCPU_STATE_CPUID   1001
+#define KVM_X86_VCPU_STATE_LAPIC   1002
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 685215b..46fad88 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1182,11 +1182,11 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
 static int msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs __user *user_msrs,
  int (*do_msr)(struct kvm_vcpu *vcpu,
unsigned index, u64 *data),
- int writeback)
+ int writeback, int write_nprocessed)
 {
struct kvm_msrs msrs;
struct kvm_msr_entry *entries;
-   int r, n;
+   int r;
unsigned size;
 
r = -EFAULT;
@@ -1207,15 +1207,22 @@ static int msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs __user *user_msrs,
if (copy_from_user(entries, user_msrs->entries, size))
goto out_free;
 
-   r = n = __msr_io(vcpu, &msrs, entries, do_msr);
+   r = __msr_io(vcpu, &msrs, entries, do_msr);
if (r < 0)
goto out_free;
 
+   msrs.nprocessed = r;
+
r = -EFAULT;
+   if (write_nprocessed &&
+   copy_to_user(&user_msrs->nprocessed, &msrs.nprocessed,
+sizeof(msrs.nprocessed)))
+   goto out_free;
+
if (writeback && copy_to_user(user_msrs->entries, entries, size))
goto out_free;
 
-   r = n;
+   r = msrs.nprocessed;
 
 out_free:
vfree(entries);
@@ -1792,55 +1799,36 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 {
struct kvm_vcpu *vcpu = filp->private_data;
void __user *argp = (void __user *)arg;
+   struct kvm_vcpu_substate substate;
int r;
-   struct kvm_lapic_state *lapic = NULL;
 
switch (ioctl) {
-   case KVM_GET_LAPIC: {
-   lapic = kzalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL);
-
-   r = -ENOMEM;
-   if (!lapic)
-   goto out;
-   r = kvm_vcpu_ioctl_get_lapic(vcpu, lapic);
-   if (r)
-   goto out;
-   r = -EFAULT;
-   if (copy_to_user(argp, lapic, sizeof(struct kvm_lapic_state)))
-   goto out;
-   r = 0;
+   case KVM_GET_LAPIC:
+   substate.type = KVM_X86_VCPU_STATE_LAPIC;
+   substate.offset = 0;
r = kvm_arch_vcpu_get_substate(vcpu, argp, &substate);
break;
-   }
-   case KVM_SET_LAPIC: {
-   lapic = kmalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL);
-   r = -ENOMEM;
-   if (!lapic)
-   goto out;
-   r = -EFAULT;
-   if (copy_from_user(lapic, argp, sizeof(struct kvm_lapic_state)))
-   goto out;
-   r = kvm_vcpu_ioctl_set_lapic(vcpu, lapic);
-   if (r)
- 

[PATCH v2 1/4] KVM: Reorder IOCTLs in main kvm.h

2009-10-15 Thread Jan Kiszka
Obviously, people tend to extend this header at the bottom - more or
less blindly. Ensure that deprecated stuff gets its own corner again by
moving things to the top. Also add some comments and reindent IOCTLs to
make them more readable and reduce the risk of number collisions.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

 include/linux/kvm.h |  228 ++-
 1 files changed, 114 insertions(+), 114 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index f8f8900..7d8c382 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -14,12 +14,76 @@
 
 #define KVM_API_VERSION 12
 
-/* for KVM_TRACE_ENABLE, deprecated */
+/* *** Deprecated interfaces *** */
+
+#define KVM_TRC_SHIFT   16
+
+#define KVM_TRC_ENTRYEXIT   (1 << KVM_TRC_SHIFT)
+#define KVM_TRC_HANDLER (1 << (KVM_TRC_SHIFT + 1))
+
+#define KVM_TRC_VMENTRY (KVM_TRC_ENTRYEXIT + 0x01)
+#define KVM_TRC_VMEXIT  (KVM_TRC_ENTRYEXIT + 0x02)
+#define KVM_TRC_PAGE_FAULT  (KVM_TRC_HANDLER + 0x01)
+
+#define KVM_TRC_HEAD_SIZE   12
+#define KVM_TRC_CYCLE_SIZE  8
+#define KVM_TRC_EXTRA_MAX   7
+
+#define KVM_TRC_INJ_VIRQ (KVM_TRC_HANDLER + 0x02)
+#define KVM_TRC_REDELIVER_EVT(KVM_TRC_HANDLER + 0x03)
+#define KVM_TRC_PEND_INTR(KVM_TRC_HANDLER + 0x04)
+#define KVM_TRC_IO_READ  (KVM_TRC_HANDLER + 0x05)
+#define KVM_TRC_IO_WRITE (KVM_TRC_HANDLER + 0x06)
+#define KVM_TRC_CR_READ  (KVM_TRC_HANDLER + 0x07)
+#define KVM_TRC_CR_WRITE (KVM_TRC_HANDLER + 0x08)
+#define KVM_TRC_DR_READ  (KVM_TRC_HANDLER + 0x09)
+#define KVM_TRC_DR_WRITE (KVM_TRC_HANDLER + 0x0A)
+#define KVM_TRC_MSR_READ (KVM_TRC_HANDLER + 0x0B)
+#define KVM_TRC_MSR_WRITE(KVM_TRC_HANDLER + 0x0C)
+#define KVM_TRC_CPUID(KVM_TRC_HANDLER + 0x0D)
+#define KVM_TRC_INTR (KVM_TRC_HANDLER + 0x0E)
+#define KVM_TRC_NMI  (KVM_TRC_HANDLER + 0x0F)
+#define KVM_TRC_VMMCALL  (KVM_TRC_HANDLER + 0x10)
+#define KVM_TRC_HLT  (KVM_TRC_HANDLER + 0x11)
+#define KVM_TRC_CLTS (KVM_TRC_HANDLER + 0x12)
+#define KVM_TRC_LMSW (KVM_TRC_HANDLER + 0x13)
+#define KVM_TRC_APIC_ACCESS  (KVM_TRC_HANDLER + 0x14)
+#define KVM_TRC_TDP_FAULT(KVM_TRC_HANDLER + 0x15)
+#define KVM_TRC_GTLB_WRITE   (KVM_TRC_HANDLER + 0x16)
+#define KVM_TRC_STLB_WRITE   (KVM_TRC_HANDLER + 0x17)
+#define KVM_TRC_STLB_INVAL   (KVM_TRC_HANDLER + 0x18)
+#define KVM_TRC_PPC_INSTR(KVM_TRC_HANDLER + 0x19)
+
 struct kvm_user_trace_setup {
-   __u32 buf_size; /* sub_buffer size of each per-cpu */
-   __u32 buf_nr; /* the number of sub_buffers of each per-cpu */
+   __u32 buf_size;
+   __u32 buf_nr;
 };
 
+#define __KVM_DEPRECATED_MAIN_W_0x06 \
+   _IOW(KVMIO, 0x06, struct kvm_user_trace_setup)
+#define __KVM_DEPRECATED_MAIN_0x07 _IO(KVMIO, 0x07)
+#define __KVM_DEPRECATED_MAIN_0x08 _IO(KVMIO, 0x08)
+
+#define __KVM_DEPRECATED_VM_R_0x70 _IOR(KVMIO, 0x70, struct kvm_assigned_irq)
+
+struct kvm_breakpoint {
+   __u32 enabled;
+   __u32 padding;
+   __u64 address;
+};
+
+struct kvm_debug_guest {
+   __u32 enabled;
+   __u32 pad;
+   struct kvm_breakpoint breakpoints[4];
+   __u32 singlestep;
+};
+
+#define __KVM_DEPRECATED_VCPU_W_0x87 _IOW(KVMIO, 0x87, struct kvm_debug_guest)
+
+/* *** End of deprecated interfaces *** */
+
+
 /* for KVM_CREATE_MEMORY_REGION */
 struct kvm_memory_region {
__u32 slot;
@@ -329,24 +393,6 @@ struct kvm_ioeventfd {
__u8  pad[36];
 };
 
-#define KVM_TRC_SHIFT   16
-/*
- * kvm trace categories
- */
-#define KVM_TRC_ENTRYEXIT   (1 << KVM_TRC_SHIFT)
-#define KVM_TRC_HANDLER (1 << (KVM_TRC_SHIFT + 1)) /* only 12 bits */
-
-/*
- * kvm trace action
- */
-#define KVM_TRC_VMENTRY (KVM_TRC_ENTRYEXIT + 0x01)
-#define KVM_TRC_VMEXIT  (KVM_TRC_ENTRYEXIT + 0x02)
-#define KVM_TRC_PAGE_FAULT  (KVM_TRC_HANDLER + 0x01)
-
-#define KVM_TRC_HEAD_SIZE   12
-#define KVM_TRC_CYCLE_SIZE  8
-#define KVM_TRC_EXTRA_MAX   7
-
 #define KVMIO 0xAE
 
 /*
@@ -367,12 +413,10 @@ struct kvm_ioeventfd {
  */
#define KVM_GET_VCPU_MMAP_SIZE    _IO(KVMIO,   0x04) /* in bytes */
 #define KVM_GET_SUPPORTED_CPUID   _IOWR(KVMIO, 0x05, struct kvm_cpuid2)
-/*
- * ioctls for kvm trace
- */
-#define KVM_TRACE_ENABLE  _IOW(KVMIO, 0x06, struct kvm_user_trace_setup)
-#define KVM_TRACE_PAUSE   _IO(KVMIO,  0x07)
-#define KVM_TRACE_DISABLE _IO(KVMIO,  0x08)
+#define KVM_TRACE_ENABLE  __KVM_DEPRECATED_MAIN_W_0x06
+#define KVM_TRACE_PAUSE   __KVM_DEPRECATED_MAIN_0x07
+#define KVM_TRACE_DISABLE __KVM_DEPRECATED_MAIN_0x08
+
 /*
  * Extension capability list.
  */
@@ -500,52 +544,54 @@ struct kvm_irqfd {
 /*
  * ioctls for VM fds
  */
-#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 0x40, struct kvm_memory_region)
+#define KVM_SET_MEMORY_REGION   

[PATCH v2 2/4] KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL

2009-10-15 Thread Jan Kiszka
Add a new IOCTL pair to retrieve or set the VCPU state in one chunk.
More precisely, the IOCTL is able to process a list of substates to be
read or written. This list is easily extensible without breaking the
existing ABI, thus we will no longer have to add new IOCTLs when we
discover a missing VCPU state field or want to support new hardware
features.

This patch establishes the generic infrastructure for KVM_GET/
SET_VCPU_STATE and adds support for the generic substates REGS, SREGS,
FPU, and MP. To avoid code duplication, the entry point for the
corresponding original IOCTLs are converted to make use of the new
infrastructure internally, too.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

 Documentation/kvm/api.txt  |   73 ++
 arch/ia64/kvm/kvm-ia64.c   |   12 ++
 arch/powerpc/kvm/powerpc.c |   12 ++
 arch/s390/kvm/kvm-s390.c   |   12 ++
 arch/x86/kvm/x86.c |   12 ++
 include/linux/kvm.h|   24 +++
 include/linux/kvm_host.h   |5 +
 virt/kvm/kvm_main.c|  318 +++-
 8 files changed, 376 insertions(+), 92 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index 5a4bc8c..7c0be8d 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -593,6 +593,49 @@ struct kvm_irqchip {
} chip;
 };
 
+4.27 KVM_GET/SET_VCPU_STATE
+
+Capability: KVM_CAP_VCPU_STATE
+Architectures: all (substate support may vary across architectures)
+Type: vcpu ioctl
+Parameters: struct kvm_vcpu_state (in/out)
+Returns: 0 on success, -1 on error
+
+Reads or sets one or more vcpu substates.
+
+The data structures exchanged between user space and kernel are organized
+in two layers. Layer one is the header structure kvm_vcpu_state:
+
+struct kvm_vcpu_state {
+   __u32 nsubstates; /* number of elements in substates */
+   __u32 nprocessed; /* return value: successfully processed substates */
+   struct kvm_vcpu_substate substates[0];
+};
+
+The kernel accepts up to KVM_MAX_VCPU_SUBSTATES elements in the substates
+array. An element is described by kvm_vcpu_substate:
+
+struct kvm_vcpu_substate {
+   __u32 type; /* KVM_VCPU_STATE_* or KVM_$(ARCH)_VCPU_STATE_* */
+   __u32 pad;
+   __s64 offset;   /* payload offset to kvm_vcpu_state in bytes */
+};
+
+Layer two are the substate-specific payload structures. See section 6 for a
+list of supported substates and their payload format.
+
+Exemplary setup for a single-substate query via KVM_GET_VCPU_STATE:
+
+   struct {
+   struct kvm_vcpu_state header;
+   struct kvm_vcpu_substate substates[1];
+   } request;
+   struct kvm_regs regs;
+
+   request.header.nsubstates = 1;
+   request.header.substates[0].type = KVM_VCPU_STATE_REGS;
+   request.header.substates[0].offset = (size_t)&regs - (size_t)&request;
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
@@ -757,3 +800,33 @@ powerpc specific.
char padding[256];
};
 };
+
+6. Supported vcpu substates
+
+6.1 KVM_VCPU_STATE_REGS
+
+Architectures: all
+Payload: struct kvm_regs (see KVM_GET_REGS)
+Deprecates: KVM_GET/SET_REGS
+
+6.2 KVM_VCPU_STATE_SREGS
+
+Architectures: all
+Payload: struct kvm_sregs (see KVM_GET_SREGS)
+Deprecates: KVM_GET/SET_SREGS
+
+6.3 KVM_VCPU_STATE_FPU
+
+Architectures: all
+Payload: struct kvm_fpu (see KVM_GET_FPU)
+Deprecates: KVM_GET/SET_FPU
+
+6.4 KVM_VCPU_STATE_MP
+
+Architectures: x86, ia64
+Payload: struct kvm_mp_state
+Deprecates: KVM_GET/SET_MP_STATE
+
+struct kvm_mp_state {
+   __u32 mp_state; /* KVM_MP_STATE_* */
+};
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 5fdeec5..c3450a6 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1991,3 +1991,15 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
vcpu_put(vcpu);
return r;
 }
+
+int kvm_arch_vcpu_get_substate(struct kvm_vcpu *vcpu, uint8_t __user *arg_base,
+  struct kvm_vcpu_substate *substate)
+{
+   return -EINVAL;
+}
+
+int kvm_arch_vcpu_set_substate(struct kvm_vcpu *vcpu, uint8_t __user *arg_base,
+  struct kvm_vcpu_substate *substate)
+{
+   return -EINVAL;
+}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 5902bbc..3336ad5 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -436,3 +436,15 @@ int kvm_arch_init(void *opaque)
 void kvm_arch_exit(void)
 {
 }
+
+int kvm_arch_vcpu_get_substate(struct kvm_vcpu *vcpu, uint8_t __user *arg_base,
+  struct kvm_vcpu_substate *substate)
+{
+   return -EINVAL;
+}
+
+int kvm_arch_vcpu_set_substate(struct kvm_vcpu *vcpu, uint8_t __user *arg_base,
+  struct kvm_vcpu_substate *substate)
+{
+   return -EINVAL;
+}
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 5445058..978ed6c 100644
--- 

[PATCH v2 0/4] Extensible VCPU state IOCTL

2009-10-15 Thread Jan Kiszka
This version addresses the review comments:
 - rename KVM[_X86]_VCPU_* - KVM[_X86]_VCPU_STATE_*
 - more padding for kvm_nmi_state
 - use bool in get/set_nmi_mask
 - add basic documentation.

Find this series also at git://git.kiszka.org/linux-kvm.git queues/vcpu-state

Jan Kiszka (4):
  KVM: Reorder IOCTLs in main kvm.h
  KVM: Add unified KVM_GET/SET_VCPU_STATE IOCTL
  KVM: x86: Add support for KVM_GET/SET_VCPU_STATE
  KVM: x86: Add VCPU substate for NMI states

 Documentation/kvm/api.txt   |  103 +
 arch/ia64/kvm/kvm-ia64.c|   12 ++
 arch/powerpc/kvm/powerpc.c  |   12 ++
 arch/s390/kvm/kvm-s390.c|   12 ++
 arch/x86/include/asm/kvm.h  |   15 ++-
 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/kvm/svm.c  |   22 +++
 arch/x86/kvm/vmx.c  |   30 
 arch/x86/kvm/x86.c  |  243 -
 include/linux/kvm.h |  246 +--
 include/linux/kvm_host.h|5 +
 virt/kvm/kvm_main.c |  318 +++---
 12 files changed, 740 insertions(+), 280 deletions(-)


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raw vs. tap

2009-10-15 Thread Anthony Liguori

Michael S. Tsirkin wrote:

Not sure: what is the "this" that you are talking about.
  


I meant, fixing guest-host traffic.  Your former argument hinged on 
being able to have networking that Just Worked without adding new 
things.  If this doesn't work today with macvlan, then the argument is 
invalid.


Regards,

Anthony Liguori


Re: Single memory slot

2009-10-15 Thread Anthony Liguori

Avi Kivity wrote:
One way to improve the gfn_to_pfn() memslot search is to register just 
one slot.  This can only work on 64-bit, since even the smallest 
guests need 4GB of physical address space.  Apart from speeding up 
gfn_to_page(), it would also speed up mmio which must iterate over all 
slots, so a lookup cache cannot help.


This would require quite a bunch of changes:
- modify gfn_to_pfn() to fail gracefully if the page is in the slot 
but unmapped (hole handling)

- modify qemu to reserve the guest physical address space


It could potentially speed up qemu quite a lot too as we would return to 
a model where host va == fixed address + guest pa.  That makes things 
like stl_phys/ldl_phys trivial.


Regards,

Anthony Liguori





Re: Single memory slot

2009-10-15 Thread Marcelo Tosatti
On Thu, Oct 15, 2009 at 04:33:11PM +0900, Avi Kivity wrote:
 One way to improve the gfn_to_pfn() memslot search is to register just  
 one slot.  This can only work on 64-bit, since even the smallest guests  
 need 4GB of physical address space.  Apart from speeding up  
 gfn_to_page(), it would also speed up mmio which must iterate over all  
 slots, so a lookup cache cannot help.

 This would require quite a bunch of changes:
 - modify gfn_to_pfn() to fail gracefully if the page is in the slot but  
 unmapped (hole handling)
 - modify qemu to reserve the guest physical address space
 - modify qemu memory allocation to use MAP_FIXED to allocate memory
 - some hack for the vga aliases (mmap an fd multiple times?)
 - some hack for the vmx-specific pages (e.g. APIC-access page)

 Not sure it's worthwhile, but something to keep in mind if a simple  
 cache or sort by size is insufficient due to mmio.

Downside is you lose the ability to write protect a small slot only 
(could mprotect(MAP_READ) the desired area but get_log+write_protect 
must be atomic).

Also if you enable dirty log for the large slot largepages are disabled.




Re: Single memory slot

2009-10-15 Thread Marcelo Tosatti
On Thu, Oct 15, 2009 at 02:46:38PM +0200, Alexander Graf wrote:

 On 15.10.2009, at 09:33, Avi Kivity wrote:

 One way to improve the gfn_to_pfn() memslot search is to register just 
 one slot.  This can only work on 64-bit, since even the smallest guests 
 need 4GB of physical address space.  Apart from speeding up 
 gfn_to_page(), it would also speed up mmio which must iterate over all 
 slots, so a lookup cache cannot help.

 This would require quite a bunch of changes:
 - modify gfn_to_pfn() to fail gracefully if the page is in the slot  
 but unmapped (hole handling)
 - modify qemu to reserve the guest physical address space
 - modify qemu memory allocation to use MAP_FIXED to allocate memory
 - some hack for the vga aliases (mmap an fd multiple times?)
 - some hack for the vmx-specific pages (e.g. APIC-access page)

 Not sure it's worthwhile, but something to keep in mind if a simple  
 cache or sort by size is insufficient due to mmio.

 One thing I've been wondering for quite a while is that slot loop. Why  
 do we loop over all possible slots? Couldn't we just remember the max  
 entry (usually 1 or 2) and not loop MAX_SLOT_AMOUNT times?

 That would be a really easy patch and give instant speed improvements  
 for everyone.

gfn_to_memslot_unaliased uses kvm->nmemslots which is the max entry.

Oh, kvm_is_visible_gfn does not. It should just use
gfn_to_memslot_unaliased.



Re: [RFC] Do clock adjustments over migration

2009-10-15 Thread Marcelo Tosatti
On Thu, Oct 15, 2009 at 01:15:25PM -0400, Glauber Costa wrote:
 Hey,
 
 This patch is a proposal only. Among other things, it relies on a patch
 Juan is yet to send, and I also would want to give it a bit more testing.
 It shows my intended use of the new ioctl interface I've been proposing.
 
 First of all, we have to save the kvmclock msrs. This is per-cpu, and we
 were failing to do it so far.
 
 The ioctls are issued in pre-save and post-load sections of a new vmstate
 handler. I am not doing it in the cpu vmstate handler, because this has to
 be done once per VM, not cpu. What I basically do is to grab the time
 from GET ioctl, pass on through migration, and then do a SET on the other
 side. Should be straightforward.
 
 Please let me hear your thoughts. And don't get me started with this
 "you can't hear thoughts" thing!
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 ---
  kvm/include/linux/kvm.h |9 +
  qemu-kvm-x86.c  |6 ++
  qemu-kvm.c  |   29 +
  target-i386/cpu.h   |3 ++-
  target-i386/machine.c   |2 ++
  5 files changed, 48 insertions(+), 1 deletions(-)
 
 diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
 index fffcfd8..75e2ffd 100644
 --- a/qemu-kvm-x86.c
 +++ b/qemu-kvm-x86.c
 @@ -834,6 +834,9 @@ static int get_msr_entry(struct kvm_msr_entry *entry, CPUState *env)
  case MSR_VM_HSAVE_PA:
  env->vm_hsave = entry->data;
  break;
 +case MSR_KVM_SYSTEM_TIME:
 +env->system_time_msr = entry->data;
 +break;
  default:
  printf("Warning unknown msr index 0x%x\n", entry->index);
  return 1;
 @@ -1001,6 +1004,7 @@ void kvm_arch_load_regs(CPUState *env)
  set_msr_entry(&msrs[n++], MSR_LSTAR  ,   env->lstar);
  }
  #endif
 +set_msr_entry(&msrs[n++], MSR_KVM_SYSTEM_TIME,  env->system_time_msr);
  
  rc = kvm_set_msrs(env->kvm_cpu_state.vcpu_ctx, msrs, n);
  if (rc == -1)
 @@ -1179,6 +1183,8 @@ void kvm_arch_save_regs(CPUState *env)
  msrs[n++].index = MSR_LSTAR;
  }
  #endif
 +msrs[n++].index = MSR_KVM_SYSTEM_TIME;
 +

fix MSR_COUNT for kvm_arch_save_regs()

Otherwise looks good.



Re: kernel bug in kvm_intel

2009-10-15 Thread Andrew Theurer
On Thu, 2009-10-15 at 02:10 +0900, Avi Kivity wrote:
 On 10/13/2009 11:04 PM, Andrew Theurer wrote:
 
  Look at the address where vmx_vcpu_run starts, add 0x26d, and show the
  surrounding code.
 
  Thinking about it, it probably _is_ what you showed, due to module page
  alignment.  But please verify this; I can't reconcile the fault address
  (9fe9a2b) with %rsp at the time of the fault.
   
  Here is the start of the function:
 
 
  3884 <vmx_vcpu_run>:
   3884:   55  push   %rbp
   3885:   48 89 e5mov%rsp,%rbp
   
  and 0x26d later is 0x3af1:
 
 
   3ad2:   4c 8b b1 88 01 00 00mov0x188(%rcx),%r14
   3ad9:   4c 8b b9 90 01 00 00mov0x190(%rcx),%r15
   3ae0:   48 8b 89 20 01 00 00mov0x120(%rcx),%rcx
   3ae7:   75 05   jne    3aee <vmx_vcpu_run+0x26a>
   3ae9:   0f 01 c2vmlaunch
   3aec:   eb 03   jmp    3af1 <vmx_vcpu_run+0x26d>
   3aee:   0f 01 c3vmresume
   3af1:   48 87 0c 24 xchg   %rcx,(%rsp)
   3af5:   48 89 81 18 01 00 00mov%rax,0x118(%rcx)
   3afc:   48 89 99 30 01 00 00mov%rbx,0x130(%rcx)
   3b03:   ff 34 24pushq  (%rsp)
   3b06:   8f 81 20 01 00 00   popq   0x120(%rcx)
   
 
 
 Ok.  So it faults on the xchg instruction, rsp is 8806369ffc80 but 
 the fault address is 9fe9a2b4.  So it looks like the IDT is 
 corrupted.
 
 Can you check what's around 9fe9a2b4 in System.map?

85d85b24 B __bss_stop
85d86000 B __brk_base
85d96000 b .brk.dmi_alloc
85da6000 B __brk_limit
ff60 T vgettimeofday
ff600100 t vread_tsc
ff600130 t vread_hpet
ff600140 D __vsyscall_gtod_data
ff600400 T vtime

-Andrew




[PATCH] get rid of MSR_COUNT

2009-10-15 Thread Glauber Costa
qemu.git uses an array of 100 entries for the msr list, which is arguably large
enough (tm).

I propose we follow the same path, for two reasons:
 1) ease future merge.
 2) avoid stack overflow problems that had already begun to appear

Signed-off-by: Glauber Costa glom...@redhat.com
---
 qemu-kvm-x86.c |   10 ++
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index c8e37ed..350e5fd 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -808,12 +808,6 @@ static int get_msr_entry(struct kvm_msr_entry *entry, CPUState *env)
 return 0;
 }
 
-#ifdef TARGET_X86_64
-#define MSR_COUNT 9
-#else
-#define MSR_COUNT 5
-#endif
-
 static void set_v8086_seg(struct kvm_segment *lhs, const SegmentCache *rhs)
 {
 lhs->selector = rhs->selector;
@@ -868,7 +862,7 @@ void kvm_arch_load_regs(CPUState *env)
 struct kvm_regs regs;
 struct kvm_fpu fpu;
 struct kvm_sregs sregs;
-struct kvm_msr_entry msrs[MSR_COUNT];
+struct kvm_msr_entry msrs[100];
 int rc, n, i;
 
 regs.rax = env-regs[R_EAX];
@@ -1021,7 +1015,7 @@ void kvm_arch_save_regs(CPUState *env)
 struct kvm_regs regs;
 struct kvm_fpu fpu;
 struct kvm_sregs sregs;
-struct kvm_msr_entry msrs[MSR_COUNT];
+struct kvm_msr_entry msrs[100];
 uint32_t hflags;
 uint32_t i, n, rc;
 
-- 
1.6.2.5



[PATCH] Xen PV-on-HVM guest support (v3)

2009-10-15 Thread Ed Swierk
Support for Xen PV-on-HVM guests can be implemented almost entirely in
userspace, except for handling one annoying MSR that maps a Xen
hypercall blob into guest address space.

A generic mechanism to delegate MSR writes to userspace seems overkill
and risks encouraging similar MSR abuse in the future.  Thus this patch
adds special support for the Xen HVM MSR.

I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
KVM which MSR the guest will write to, as well as the starting address
and size of the hypercall blobs (one each for 32-bit and 64-bit) that
userspace has loaded from files.  When the guest writes to the MSR, KVM
copies one page of the blob from userspace to the guest.

I've tested this patch with a hacked-up version of Gerd's userspace
code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

v3: separate blob_{addr,size}_{32,64}; move xen_hvm_config to struct
kvm_arch; remove unneeded ifdefs; return -EFAULT, -E2BIG, etc. from
xen_hvm_config; use is_long_mode(); remove debug printks; document ioctl
in api.txt

Signed-off-by: Ed Swierk eswi...@aristanetworks.com

---
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index 5a4bc8c..5980113 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -593,6 +593,30 @@ struct kvm_irqchip {
} chip;
 };
 
+4.27 KVM_XEN_HVM_CONFIG
+
+Capability: KVM_CAP_XEN_HVM
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_xen_hvm_config (in)
+Returns: 0 on success, -1 on error
+
+Sets the MSR that the Xen HVM guest uses to initialize its hypercall
+page, and provides the starting address and size of the hypercall
+blobs in userspace.  When the guest writes the MSR, kvm copies one
+page of a blob (32- or 64-bit, depending on the vcpu mode) to guest
+memory.
+
+struct kvm_xen_hvm_config {
+   __u32 msr;
+   __u32 pad1;
+   __u64 blob_addr_32;
+   __u64 blob_addr_64;
+   __u8 blob_size_32;
+   __u8 blob_size_64;
+   __u8 pad2[30];
+};
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
index f02e87a..ef9b4b7 100644
--- a/arch/x86/include/asm/kvm.h
+++ b/arch/x86/include/asm/kvm.h
@@ -19,6 +19,7 @@
 #define __KVM_HAVE_MSIX
 #define __KVM_HAVE_MCE
 #define __KVM_HAVE_PIT_STATE2
+#define __KVM_HAVE_XEN_HVM
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff --git a/arch/x86/include/asm/kvm_host.h
b/arch/x86/include/asm/kvm_host.h
index 45226f0..aee95b2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -410,6 +410,8 @@ struct kvm_arch{
 
unsigned long irq_sources_bitmap;
u64 vm_init_tsc;
+
+   struct kvm_xen_hvm_config xen_hvm_config;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1d454d9..66149fa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -835,6 +835,37 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32
msr, u64 data)
return 0;
 }
 
+static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
+{
+   int lm = is_long_mode(vcpu);
+   u8 *blob_addr = lm ? (u8 *)vcpu->kvm->arch.xen_hvm_config.blob_addr_64
+   : (u8 *)vcpu->kvm->arch.xen_hvm_config.blob_addr_32;
+   u8 blob_size = lm ? vcpu->kvm->arch.xen_hvm_config.blob_size_64
+   : vcpu->kvm->arch.xen_hvm_config.blob_size_32;
+   u32 page_num = data & ~PAGE_MASK;
+   u64 page_addr = data & PAGE_MASK;
+   u8 *page;
+   int r;
+
+   r = -E2BIG;
+   if (page_num >= blob_size)
+   goto out;
+   r = -ENOMEM;
+   page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+   if (!page)
+   goto out;
+   r = -EFAULT;
+   if (copy_from_user(page, blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE))
+   goto out_free;
+   if (kvm_write_guest(vcpu->kvm, page_addr, page, PAGE_SIZE))
+   goto out_free;
+   r = 0;
+out_free:
+   kfree(page);
+out:
+   return r;
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
switch (msr) {
@@ -950,6 +981,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
"0x%x data 0x%llx\n", msr, data);
break;
default:
+   if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
+   return xen_hvm_config(vcpu, data);
if (!ignore_msrs) {
pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
msr, data);
@@ -2411,6 +2444,14 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_XEN_HVM_CONFIG: {
+   r = -EFAULT;
+   if (copy_from_user(&kvm->arch.xen_hvm_config, argp,
+  sizeof(struct 

Re: Raw vs. tap

2009-10-15 Thread Michael S. Tsirkin
On Thu, Oct 15, 2009 at 01:37:20PM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 Not sure: what is the "this" that you are talking about.
   

 I meant, fixing guest-host traffic.

This needs to be supported in networking core.

 Your former argument hinged on  
 being able to have networking that Just Worked without adding new  
 things.  If this doesn't work today with macvlan, then the argument is  
 invalid.

 Regards,

 Anthony Liguori
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][REPOST] Xen PV-on-HVM guest support (v3)

2009-10-15 Thread Ed Swierk
[Repost; the patch was garbled in my previous attempt.]

Support for Xen PV-on-HVM guests can be implemented almost entirely in
userspace, except for handling one annoying MSR that maps a Xen
hypercall blob into guest address space.

A generic mechanism to delegate MSR writes to userspace seems overkill
and risks encouraging similar MSR abuse in the future.  Thus this patch
adds special support for the Xen HVM MSR.

I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
KVM which MSR the guest will write to, as well as the starting address
and size of the hypercall blobs (one each for 32-bit and 64-bit) that
userspace has loaded from files.  When the guest writes to the MSR, KVM
copies one page of the blob from userspace to the guest.

I've tested this patch with a hacked-up version of Gerd's userspace
code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

v3: separate blob_{addr,size}_{32,64}; move xen_hvm_config to struct
kvm_arch; remove unneeded ifdefs; return -EFAULT, -E2BIG, etc. from
xen_hvm_config; use is_long_mode(); remove debug printks; document ioctl
in api.txt

Signed-off-by: Ed Swierk eswi...@aristanetworks.com

---
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index 5a4bc8c..5980113 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -593,6 +593,30 @@ struct kvm_irqchip {
} chip;
 };
 
+4.27 KVM_XEN_HVM_CONFIG
+
+Capability: KVM_CAP_XEN_HVM
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_xen_hvm_config (in)
+Returns: 0 on success, -1 on error
+
+Sets the MSR that the Xen HVM guest uses to initialize its hypercall
+page, and provides the starting address and size of the hypercall
+blobs in userspace.  When the guest writes the MSR, kvm copies one
+page of a blob (32- or 64-bit, depending on the vcpu mode) to guest
+memory.
+
+struct kvm_xen_hvm_config {
+   __u32 msr;
+   __u32 pad1;
+   __u64 blob_addr_32;
+   __u64 blob_addr_64;
+   __u8 blob_size_32;
+   __u8 blob_size_64;
+   __u8 pad2[30];
+};
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
index f02e87a..ef9b4b7 100644
--- a/arch/x86/include/asm/kvm.h
+++ b/arch/x86/include/asm/kvm.h
@@ -19,6 +19,7 @@
 #define __KVM_HAVE_MSIX
 #define __KVM_HAVE_MCE
 #define __KVM_HAVE_PIT_STATE2
+#define __KVM_HAVE_XEN_HVM
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 45226f0..aee95b2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -410,6 +410,8 @@ struct kvm_arch{
 
unsigned long irq_sources_bitmap;
u64 vm_init_tsc;
+
+   struct kvm_xen_hvm_config xen_hvm_config;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1d454d9..66149fa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -835,6 +835,37 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data)
return 0;
 }
 
+static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
+{
+   int lm = is_long_mode(vcpu);
+   u8 *blob_addr = lm ? (u8 *)vcpu->kvm->arch.xen_hvm_config.blob_addr_64
+   : (u8 *)vcpu->kvm->arch.xen_hvm_config.blob_addr_32;
+   u8 blob_size = lm ? vcpu->kvm->arch.xen_hvm_config.blob_size_64
+   : vcpu->kvm->arch.xen_hvm_config.blob_size_32;
+   u32 page_num = data & ~PAGE_MASK;
+   u64 page_addr = data & PAGE_MASK;
+   u8 *page;
+   int r;
+
+   r = -E2BIG;
+   if (page_num >= blob_size)
+   goto out;
+   r = -ENOMEM;
+   page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+   if (!page)
+   goto out;
+   r = -EFAULT;
+   if (copy_from_user(page, blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE))
+   goto out_free;
+   if (kvm_write_guest(vcpu-kvm, page_addr, page, PAGE_SIZE))
+   goto out_free;
+   r = 0;
+out_free:
+   kfree(page);
+out:
+   return r;
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
switch (msr) {
@@ -950,6 +981,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
		"0x%x data 0x%llx\n", msr, data);
break;
default:
+   if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
+   return xen_hvm_config(vcpu, data);
if (!ignore_msrs) {
pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
msr, data);
@@ -2411,6 +2444,14 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_XEN_HVM_CONFIG: {
+   r = -EFAULT;
+   if 

Re: Add qemu_send_raw() to vlan.

2009-10-15 Thread Marcelo Tosatti
On Thu, Oct 15, 2009 at 09:33:12AM +0200, Gleb Natapov wrote:
 On Thu, Oct 15, 2009 at 08:04:45AM +0100, Mark McLoughlin wrote:
  Hi Gleb,
  
  On Tue, 2009-05-26 at 13:03 +0300, Gleb Natapov wrote:
   It gets a packet without a virtio header and adds one if needed.  Allows
   injecting packets into a vlan from outside, e.g. to send gratuitous ARP.
  ...
   diff --git a/net.h b/net.h
   index 931133b..3d0b6f2 100644
   --- a/net.h
   +++ b/net.h
   ...
   @@ -63,6 +64,7 @@ int qemu_can_send_packet(VLANClientState *vc);
ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov,
  int iovcnt);
int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
   +void qemu_send_packet_raw(VLANClientState *vc, const uint8_t *buf, int size);
void qemu_format_nic_info_str(VLANClientState *vc, uint8_t macaddr[6]);
void qemu_check_nic_model(NICInfo *nd, const char *model);
void qemu_check_nic_model_list(NICInfo *nd, const char * const *models,
  
  I've only just now noticed that we never actually made announce_self()
  use this ... care to do that?
  
 Something like this:
 
 ---
 Use qemu_send_packet_raw to send gratuitous ARP. This will ensure that
 the vnet header is handled properly.

Applied, thanks.



Re: [RFC] Do clock adjustments over migration

2009-10-15 Thread Glauber Costa
On Thu, Oct 15, 2009 at 05:08:16PM -0300, Marcelo Tosatti wrote:
 On Thu, Oct 15, 2009 at 01:15:25PM -0400, Glauber Costa wrote:
  Hey,
  
  This patch is a proposal only. Among other things, it relies on a patch
  Juan is yet to send, and I also would want to give it a bit more testing.
  It shows my intended use of the new ioctl interface I've been proposing.
  
  First of all, we have to save the kvmclock msrs. This is per-cpu, and we
  were failing to do it so far.
  
  The ioctls are issued in pre-save and post-load sections of a new vmstate
  handler. I am not doing it in the cpu vmstate handler, because this has to
  be done once per VM, not cpu. What I basically do is to grab the time
  from GET ioctl, pass on through migration, and then do a SET on the other
  side. Should be straightforward.
  
  Please let me hear your thoughts. And don't get me started with this
  you can't hear thoughts thing!
  
 
 Otherwise looks good.
 
Also, note that, in the vmstate table for kvmclock, I am using U64 instead of
UINT64. This does not exist in current qemu; I had to patch it.

Juan said he's planning on sending it to qemu-devel today or tomorrow.



Re: Single memory slot

2009-10-15 Thread Avi Kivity

On 10/16/2009 03:51 AM, Anthony Liguori wrote:

Avi Kivity wrote:
One way to improve the gfn_to_pfn() memslot search is to register 
just one slot.  This can only work on 64-bit, since even the smallest 
guests need 4GB of physical address space.  Apart from speeding up 
gfn_to_page(), it would also speed up mmio which must iterate over 
all slots, so a lookup cache cannot help.


This would require quite a bunch of changes:
- modify gfn_to_pfn() to fail gracefully if the page is in the slot 
but unmapped (hole handling)

- modify qemu to reserve the guest physical address space


It could potentially speed up qemu quite a lot too as we would return 
to a model where host va == fixed address + guest pa.  That makes 
things like stl_phys/ldl_phys trivial.


This doesn't work on 32-bit, and you still need to perform a lookup for 
mmio.  It just shortens the loop.


Note qemu can't depend on mmio holes being unmapped (you could trap the 
SEGV, but that would be unbearably slow).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Single memory slot

2009-10-15 Thread Avi Kivity

On 10/16/2009 04:46 AM, Marcelo Tosatti wrote:

On Thu, Oct 15, 2009 at 04:33:11PM +0900, Avi Kivity wrote:
   

One way to improve the gfn_to_pfn() memslot search is to register just
one slot.  This can only work on 64-bit, since even the smallest guests
need 4GB of physical address space.  Apart from speeding up
gfn_to_page(), it would also speed up mmio which must iterate over all
slots, so a lookup cache cannot help.

This would require quite a bunch of changes:
- modify gfn_to_pfn() to fail gracefully if the page is in the slot but
unmapped (hole handling)
- modify qemu to reserve the guest physical address space
- modify qemu memory allocation to use MAP_FIXED to allocate memory
- some hack for the vga aliases (mmap an fd multiple times?)
- some hack for the vmx-specific pages (e.g. APIC-access page)

Not sure it's worthwhile, but something to keep in mind if a simple
cache or sort by size is insufficient due to mmio.
 

Downside is you lose the ability to write protect a small slot only
(could mprotect(MAP_READ) the desired area but get_log+write_protect
must be atomic).

Also if you enable dirty log for the large slot largepages are disabled.
   


I guess that shoots this idea down.  We could perhaps only enable it if 
a vnc client is not connected and we don't track vga updates.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[patch 4/5] kvmclock: account stolen time

2009-10-15 Thread Marcelo Tosatti
Which makes stolen time information available in procfs/vmstat.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kernel/kvmclock.c
===
--- kvm.orig/arch/x86/kernel/kvmclock.c
+++ kvm/arch/x86/kernel/kvmclock.c
@@ -22,11 +22,14 @@
 #include <asm/msr.h>
 #include <asm/apic.h>
 #include <linux/percpu.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
+#include <asm/cputime.h>
 
 #define KVM_SCALE 22
+#define NS_PER_TICK (1000000000LL / HZ)
 
 static int kvmclock = 1;
 
@@ -50,6 +53,29 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(str
 
 static struct pvclock_wall_clock wall_clock;
 
+static DEFINE_PER_CPU(u64, total_stolen);
+static DEFINE_PER_CPU(u64, residual_stolen);
+
+void kvm_account_steal_time(void)
+{
+   struct kvm_vcpu_runtime_info *rinfo;
+   cputime_t ticks;
+   u64 stolen_time, stolen_delta;
+
+   rinfo = &get_cpu_var(run_info);
+   stolen_time = rinfo->stolen_time;
+   stolen_delta = stolen_time - __get_cpu_var(total_stolen);
+
+   __get_cpu_var(total_stolen) = stolen_time;
+   put_cpu_var(rinfo);
+
+   stolen_delta += __get_cpu_var(residual_stolen);
+
+   ticks = iter_div_u64_rem(stolen_delta, NS_PER_TICK, &stolen_delta);
+   __get_cpu_var(residual_stolen) = stolen_delta;
+   account_steal_ticks(ticks);
+}
+
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
Index: kvm/kernel/sched.c
===
--- kvm.orig/kernel/sched.c
+++ kvm/kernel/sched.c
@@ -74,6 +74,9 @@
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
+#ifdef CONFIG_KVM_CLOCK
+#include <asm/kvm_para.h>
+#endif
 
 #include "sched_cpupri.h"
 
@@ -5102,6 +5105,9 @@ void account_process_tick(struct task_st
one_jiffy_scaled);
else
account_idle_time(cputime_one_jiffy);
+#ifdef CONFIG_KVM_CLOCK
+   kvm_account_steal_time();
+#endif
 }
 
 /*
Index: kvm/arch/x86/include/asm/kvm_para.h
===
--- kvm.orig/arch/x86/include/asm/kvm_para.h
+++ kvm/arch/x86/include/asm/kvm_para.h
@@ -58,6 +58,7 @@ struct kvm_vcpu_runtime_info {
 };
 
 extern void kvmclock_init(void);
+extern void kvm_account_steal_time(void);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a




[patch 5/5] qemu-kvm-x86: report pvclock runtime capability

2009-10-15 Thread Marcelo Tosatti
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index fffcfd8..0b8a858 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1223,6 +1223,9 @@ struct kvm_para_features {
 #ifdef KVM_CAP_CR3_CACHE
{ KVM_CAP_CR3_CACHE, KVM_FEATURE_CR3_CACHE },
 #endif
+#ifdef KVM_CAP_PVCLOCK_RUNTIME
+   { KVM_CAP_PVCLOCK_RUNTIME, KVM_FEATURE_RUNTIME_INFO },
+#endif
{ -1, -1 }
 };
 




[patch 0/5] report stolen time via pvclock

2009-10-15 Thread Marcelo Tosatti
Stolen time can be useful diagnostic information when available to
guests. Xen has provided it for some time, so recent vmstat versions 
already display it.

Also increases guests sched_clock accuracy.




[patch 2/5] pvclock: move code to pvclock.h

2009-10-15 Thread Marcelo Tosatti
To be used by kvmclock.c.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/pvclock.h
===
--- kvm.orig/arch/x86/include/asm/pvclock.h
+++ kvm/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,20 @@
 #include <linux/clocksource.h>
 #include <asm/pvclock-abi.h>
 
+/*
+ * These are perodically updated
+ *xen: magic shared_info page
+ *kvm: gpa registered via msr
+ * and then copied here.
+ */
+struct pvclock_shadow_time {
+   u64 tsc_timestamp; /* TSC at last update of time vals.  */
+   u64 system_timestamp;  /* Time, in nanosecs, since boot.*/
+   u32 tsc_to_nsec_mul;
+   int tsc_shift;
+   u32 version;
+};
+
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
@@ -11,4 +25,8 @@ void pvclock_read_wallclock(struct pvclo
struct pvclock_vcpu_time_info *vcpu,
struct timespec *ts);
 
+u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow);
+unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
+struct pvclock_vcpu_time_info *src);
+
 #endif /* _ASM_X86_PVCLOCK_H */
Index: kvm/arch/x86/kernel/pvclock.c
===
--- kvm.orig/arch/x86/kernel/pvclock.c
+++ kvm/arch/x86/kernel/pvclock.c
@@ -20,20 +20,6 @@
 #include asm/pvclock.h
 
 /*
- * These are perodically updated
- *xen: magic shared_info page
- *kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
-   u64 tsc_timestamp; /* TSC at last update of time vals.  */
-   u64 system_timestamp;  /* Time, in nanosecs, since boot.*/
-   u32 tsc_to_nsec_mul;
-   int tsc_shift;
-   u32 version;
-};
-
-/*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
  */
@@ -71,7 +57,7 @@ static inline u64 scale_delta(u64 delta,
return product;
 }
 
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
 {
	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
@@ -81,8 +67,8 @@ static u64 pvclock_get_nsec_offset(struc
  * Reads a consistent set of time-base values from hypervisor,
  * into a shadow data area.
  */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
-   struct pvclock_vcpu_time_info *src)
+unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
+struct pvclock_vcpu_time_info *src)
 {
do {
	dst->version = src->version;




[patch 1/5] KVM: x86: report stolen time

2009-10-15 Thread Marcelo Tosatti
Report stolen time (run_delay field from schedstat) to guests via
pvclock.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/kvm_para.h
===
--- kvm.orig/arch/x86/include/asm/kvm_para.h
+++ kvm/arch/x86/include/asm/kvm_para.h
@@ -15,9 +15,11 @@
 #define KVM_FEATURE_CLOCKSOURCE0
 #define KVM_FEATURE_NOP_IO_DELAY   1
 #define KVM_FEATURE_MMU_OP 2
+#define KVM_FEATURE_RUNTIME_INFO   3
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
+#define MSR_KVM_RUN_TIME    0x13
 
 #define KVM_MAX_MMU_OP_BATCH   32
 
@@ -50,6 +52,11 @@ struct kvm_mmu_op_release_pt {
 #ifdef __KERNEL__
 #include <asm/processor.h>
 
+struct kvm_vcpu_runtime_info {
+   u64 stolen_time;/* time spent starving */
+   u64 reserved[3];/* for future use */
+};
+
 extern void kvmclock_init(void);
 
 
Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -354,6 +354,10 @@ struct kvm_vcpu_arch {
unsigned int time_offset;
struct page *time_page;
 
+   bool stolen_time_enable;
+   struct kvm_vcpu_runtime_info stolen_time;
+   unsigned int stolen_time_offset;
+
bool singlestep; /* guest is single stepped by KVM */
bool nmi_pending;
bool nmi_injected;
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -507,9 +507,9 @@ static inline u32 bit(int bitno)
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	2
+#define KVM_SAVE_MSRS_BEGIN	3
 static u32 msrs_to_save[] = {
-   MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+   MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_RUN_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_K6_STAR,
 #ifdef CONFIG_X86_64
@@ -679,6 +679,7 @@ static void kvm_write_guest_time(struct 
struct kvm_vcpu_arch *vcpu = v-arch;
void *shared_kaddr;
unsigned long this_tsc_khz;
+   struct task_struct *task = current;
 
	if ((!vcpu->time_page))
return;
@@ -700,6 +701,9 @@ static void kvm_write_guest_time(struct 
 
	vcpu->hv_clock.system_time = ts.tv_nsec +
				     (NSEC_PER_SEC * (u64)ts.tv_sec);
+
+	vcpu->stolen_time.stolen_time = task->sched_info.run_delay;
+
/*
 * The interface expects us to write an even number signaling that the
 * update is finished. Since the guest won't see the intermediate
@@ -712,6 +716,10 @@ static void kvm_write_guest_time(struct 
	memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
	       sizeof(vcpu->hv_clock));
 
+	if (vcpu->stolen_time_enable)
+		memcpy(shared_kaddr + vcpu->stolen_time_offset,
+		       &vcpu->stolen_time, sizeof(vcpu->stolen_time));
+
kunmap_atomic(shared_kaddr, KM_USER0);
 
	mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
@@ -937,6 +945,35 @@ int kvm_set_msr_common(struct kvm_vcpu *
kvm_request_guest_time_update(vcpu);
break;
}
+   case MSR_KVM_RUN_TIME: {
+   struct page *page;
+   unsigned int stolen_time_offset;
+
+   if (!vcpu->arch.time_page)
+   return 1;
+
+   /* we verify if the enable bit is set... */
+   if (!(data & 1))
+   break;
+
+   /* ...but clean it before doing the actual write */
+   stolen_time_offset = data & ~(PAGE_MASK | 1);
+
+   /* that it matches the hvclock page */
+   page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
+   if (is_error_page(page)) {
+   kvm_release_page_clean(page);
+   return 1;
+   }
+   if (page != vcpu->arch.time_page) {
+   kvm_release_page_clean(page);
+   return 1;
+   }
+   kvm_release_page_clean(page);
+   vcpu->arch.stolen_time_offset = stolen_time_offset;
+   vcpu->arch.stolen_time_enable = 1;
+   break;
+   }
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1246,6 +1283,7 @@ int kvm_dev_ioctl_check_extension(long e
case KVM_CAP_PIT2:
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
+   case KVM_CAP_PVCLOCK_RUNTIME:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
Index: kvm/arch/x86/kvm/Kconfig

[patch 3/5] kvmclock: stolen time aware sched_clock

2009-10-15 Thread Marcelo Tosatti
sched_clock() should time the vcpu run time. Subtract stolen time from
realtime pvclock.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kernel/kvmclock.c
===
--- kvm.orig/arch/x86/kernel/kvmclock.c
+++ kvm/arch/x86/kernel/kvmclock.c
@@ -38,7 +38,16 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+struct time_info {
+   struct pvclock_vcpu_time_info hv_clock;
+   struct kvm_vcpu_runtime_info run_info;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct time_info, time_info);
+
+#define hv_clock time_info.hv_clock
+#define run_info time_info.run_info
+
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -84,6 +93,40 @@ static cycle_t kvm_clock_get_cycles(stru
return kvm_clock_read();
 }
 
+cycle_t kvm_runtime_read(struct pvclock_vcpu_time_info *src,
+struct kvm_vcpu_runtime_info *rinfo)
+{
+   struct pvclock_shadow_time shadow;
+   unsigned version;
+   cycle_t ret, offset;
+   unsigned long long stolen;
+
+   do {
+   version = pvclock_get_time_values(&shadow, src);
+   barrier();
+   offset = pvclock_get_nsec_offset(&shadow);
+   stolen = rinfo->stolen_time;
+   ret = shadow.system_timestamp + offset - stolen;
+   barrier();
+   } while (version != src->version);
+
+   return ret;
+}
+
+static cycle_t kvm_clock_read_unstolen(void)
+{
+   struct pvclock_vcpu_time_info *src;
+   struct kvm_vcpu_runtime_info *rinfo;
+   cycle_t ret;
+
+   src = &get_cpu_var(hv_clock);
+   rinfo = &get_cpu_var(run_info);
+   ret = kvm_runtime_read(src, rinfo);
+   put_cpu_var(run_info);
+   put_cpu_var(hv_clock);
+   return ret;
+}
+
 /*
  * If we don't do that, there is the possibility that the guest
  * will calibrate under heavy load - thus, getting a lower lpj -
@@ -133,14 +176,30 @@ static int kvm_register_clock(char *txt)
return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
 }
 
+static int kvm_register_run_info(char *txt)
+{
+   int cpu = smp_processor_id();
+   int low, high;
+
+   low = (int) __pa(per_cpu(run_info, cpu)) | 1;
+   high = ((u64)__pa(per_cpu(run_info, cpu)) >> 32);
+   printk(KERN_INFO "kvm-runtime-info: cpu %d, msr %x:%x, %s\n",
+  cpu, high, low, txt);
+   return native_write_msr_safe(MSR_KVM_RUN_TIME, low, high);
+}
+
 #ifdef CONFIG_X86_LOCAL_APIC
 static void __cpuinit kvm_setup_secondary_clock(void)
 {
+   char *txt = "secondary cpu clock";
+
/*
 * Now that the first cpu already had this clocksource initialized,
 * we shouldn't fail.
 */
-   WARN_ON(kvm_register_clock("secondary cpu clock"));
+   WARN_ON(kvm_register_clock(txt));
+   if (kvm_para_has_feature(KVM_FEATURE_RUNTIME_INFO))
+   kvm_register_run_info(txt);
/* ok, done with our trickery, call native */
setup_secondary_APIC_clock();
 }
@@ -149,7 +208,11 @@ static void __cpuinit kvm_setup_secondar
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
-   WARN_ON(kvm_register_clock("primary cpu clock"));
+   char *txt = "primary cpu clock";
+
+   WARN_ON(kvm_register_clock(txt));
+   if (kvm_para_has_feature(KVM_FEATURE_RUNTIME_INFO))
+   kvm_register_run_info(txt);
native_smp_prepare_boot_cpu();
 }
 #endif
@@ -204,4 +267,6 @@ void __init kvmclock_init(void)
pv_info.paravirt_enabled = 1;
	pv_info.name = "KVM";
}
+   if (kvm_para_has_feature(KVM_FEATURE_RUNTIME_INFO))
+   pv_time_ops.sched_clock = kvm_clock_read_unstolen;
 }

