Re: [PATCH v2 4/5] KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit

2015-08-19 Thread Avi Kivity

On 08/18/2015 10:57 PM, Paolo Bonzini wrote:


On 18/08/2015 11:30, Avi Kivity wrote:

KVM_USER_EXIT in practice should be so rare (at least with in-kernel
LAPIC) that I don't think this matters.  KVM_USER_EXIT is relatively
uninteresting, it only exists to provide an alternative to signals that
doesn't require expensive atomics on each and every KVM_RUN. :(

Ah, so the idea is to remove the cost of changing the signal mask?

Yes, it's explained in the cover letter.


Yes, although it looks like a thread-local operation, it takes a
process-wide lock.

IIRC the lock was only task-wide and uncontended.  Problem is, it's on
the node that created the thread rather than the node that is running
it, and inter-node atomics are really, really slow.


Cached inter-node atomics are (relatively) fast, but I think it really 
is a process-wide lock:


sigprocmask calls:

void __set_current_blocked(const sigset_t *newset)
{
	struct task_struct *tsk = current;

	spin_lock_irq(&tsk->sighand->siglock);
	__set_task_blocked(tsk, newset);
	spin_unlock_irq(&tsk->sighand->siglock);
}

struct sighand_struct {
	atomic_t		count;
	struct k_sigaction	action[_NSIG];
	spinlock_t		siglock;
	wait_queue_head_t	signalfd_wqh;
};

Since sigaction is usually process-wide, I conclude that tsk->sighand
will be too.
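
For context, a minimal sketch of the signal-based kick under discussion
(assuming an open vcpu fd; SIGUSR1 is an arbitrary choice and error handling
is omitted).  The kernel applies the KVM_SET_SIGNAL_MASK sigset with
sigprocmask() on every KVM_RUN entry and exit, which is exactly the siglock
path quoted above:

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void setup_signal_kick(int vcpu_fd)
{
	sigset_t blocked, in_run;
	struct kvm_signal_mask *kmask;

	/* keep SIGUSR1 blocked during normal userspace execution... */
	sigemptyset(&blocked);
	sigaddset(&blocked, SIGUSR1);
	pthread_sigmask(SIG_BLOCK, &blocked, NULL);

	/* ...but have KVM unblock it while KVM_RUN runs guest code, so
	 * pthread_kill(vcpu_thread, SIGUSR1) makes KVM_RUN return -EINTR */
	sigfillset(&in_run);
	sigdelset(&in_run, SIGUSR1);
	kmask = malloc(sizeof(*kmask) + 8);
	kmask->len = 8;			/* 64 signals, as the kernel expects */
	memcpy(kmask->sigset, &in_run, 8);
	ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, kmask);
	free(kmask);
}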






For guests spanning >1 host NUMA node it's not really practical to
ensure that the thread is created on the right node.  Even for guests
that fit into 1 host node, if you rely on AutoNUMA the VCPUs are created
too early for AutoNUMA to have any effect.  And newer machines have
frighteningly small nodes (two nodes per socket, so it's something like
7 pCPUs if you don't have hyper-threading enabled).  True, the NUMA
penalty within the same socket is not huge, but it still costs a few
thousand clock cycles on vmexit.flat and this feature sweeps it away
completely.


I expect most user wakeups are via irqfd, so indeed the performance of
KVM_USER_EXIT is uninteresting.

Yup, either irqfd or KVM_SET_SIGNAL_MSI.

Paolo




Re: [PATCH v2 4/5] KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit

2015-08-18 Thread Avi Kivity

On 08/17/2015 04:15 PM, Paolo Bonzini wrote:


On 16/08/2015 13:27, Avi Kivity wrote:

On 08/05/2015 07:33 PM, Radim Krčmář wrote:

The guest can use KVM_USER_EXIT instead of signal-based exits to
userspace.  Availability depends on KVM_CAP_USER_EXIT.
Only x86 is implemented so far.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
   v2:
* use vcpu ioctl instead of vm one [4/5]
* shrink kvm_user_exit from 64 to 32 bytes [4/5]

   Documentation/virtual/kvm/api.txt | 30 ++
   arch/x86/kvm/x86.c| 24 
   include/uapi/linux/kvm.h  |  7 +++
   virt/kvm/kvm_main.c   |  5 +++--
   4 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt
b/Documentation/virtual/kvm/api.txt
index 3c714d43a717..c5844f0b8e7c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error
 Queues an SMI on the thread's vcpu.
   +
+4.97 KVM_USER_EXIT
+
+Capability: KVM_CAP_USER_EXIT
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_user_exit (in)
+Returns: 0 on success,
+ -EFAULT if the parameter couldn't be read,
+ -EINVAL if 'reserved' is not zeroed,
+
+struct kvm_user_exit {
+__u8 reserved[32];
+};
+
+The ioctl is asynchronous to VCPU execution and can be issued from
all threads.
+format

This breaks an invariant of vcpu ioctls, and also forces a cacheline
bounce when we fget() the vcpu fd.

KVM_USER_EXIT in practice should be so rare (at least with in-kernel
LAPIC) that I don't think this matters.  KVM_USER_EXIT is relatively
uninteresting, it only exists to provide an alternative to signals that
doesn't require expensive atomics on each and every KVM_RUN. :(


Ah, so the idea is to remove the cost of changing the signal mask?

Yes, although it looks like a thread-local operation, it takes a 
process-wide lock.


I expect most user wakeups are via irqfd, so indeed the performance of 
KVM_USER_EXIT is uninteresting.
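
For completeness, a minimal sketch of the irqfd path referred to here (this
is existing API; GSI 5 is an arbitrary example and error handling is
omitted):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* wire an eventfd to a guest GSI; after this, any thread can inject
 * the interrupt with a plain write() -- no vcpu ioctl, no signals */
static int setup_irqfd(int vm_fd, uint32_t gsi)
{
	struct kvm_irqfd irqfd;
	int efd = eventfd(0, EFD_CLOEXEC);

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd = efd;
	irqfd.gsi = gsi;
	ioctl(vm_fd, KVM_IRQFD, &irqfd);
	return efd;
}

static void inject(int efd)
{
	uint64_t one = 1;

	write(efd, &one, sizeof(one));	/* injects the interrupt */
}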





Re: [PATCH v2 4/5] KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit

2015-08-16 Thread Avi Kivity

On 08/05/2015 07:33 PM, Radim Krčmář wrote:

The guest can use KVM_USER_EXIT instead of signal-based exits to
userspace.  Availability depends on KVM_CAP_USER_EXIT.
Only x86 is implemented so far.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
  v2:
   * use vcpu ioctl instead of vm one [4/5]
   * shrink kvm_user_exit from 64 to 32 bytes [4/5]

  Documentation/virtual/kvm/api.txt | 30 ++
  arch/x86/kvm/x86.c| 24 
  include/uapi/linux/kvm.h  |  7 +++
  virt/kvm/kvm_main.c   |  5 +++--
  4 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 3c714d43a717..c5844f0b8e7c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error
  
  Queues an SMI on the thread's vcpu.
  
+

+4.97 KVM_USER_EXIT
+
+Capability: KVM_CAP_USER_EXIT
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_user_exit (in)
+Returns: 0 on success,
+ -EFAULT if the parameter couldn't be read,
+ -EINVAL if 'reserved' is not zeroed,
+
+struct kvm_user_exit {
+   __u8 reserved[32];
+};
+
+The ioctl is asynchronous to VCPU execution and can be issued from all threads.
+format


This breaks an invariant of vcpu ioctls, and also forces a cacheline 
bounce when we fget() the vcpu fd.


We should really try to avoid this.  One options is to have a 
KVM_VCPU_MAKE_EXITFD vcpu ioctl, which returns an eventfd that you then 
write into.  You can make as many exitfds as you like, one for each 
waking thread, so they never cause cacheline conflicts.


Edit: I see the invariant was already broken.  But the other comment stands.
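
A sketch of what that would look like from userspace (KVM_VCPU_MAKE_EXITFD
is only the name proposed here, not an existing ioctl, so this cannot
compile against real headers):

#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* hypothetical: KVM_VCPU_MAKE_EXITFD does not exist in KVM */
static int make_exitfd(int vcpu_fd)
{
	return ioctl(vcpu_fd, KVM_VCPU_MAKE_EXITFD, 0);
}

/* each waking thread creates its own exitfd, so kicks never touch the
 * vcpu fd and never bounce a shared cache line */
static void kick(int exitfd)
{
	uint64_t one = 1;

	write(exitfd, &one, sizeof(one));	/* forces the vcpu out to userspace */
}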


+Make vcpu_id exit to userspace as soon as possible.  If the VCPU is not running
+in kernel at the time, it will exit early on the next call to KVM_RUN.
+If the VCPU was going to exit because of other reasons when KVM_USER_EXIT was
+issued, it will keep the original exit reason and not exit early on next
+KVM_RUN.
+If VCPU exited because of KVM_USER_EXIT, the exit reason is KVM_EXIT_REQUEST.
+
+This ioctl has very similar effect (same sans some races on userspace exit) as
+sending a signal (that is blocked in userspace and set in KVM_SET_SIGNAL_MASK)
+to the VCPU thread.
+
+
+
  5. The kvm_run structure
  
  
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c

index 3493457ad0a1..27d777eb34e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2466,6 +2466,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_ASSIGN_DEV_IRQ:
case KVM_CAP_PCI_2_3:
  #endif
+   case KVM_CAP_USER_EXIT:
r = 1;
break;
case KVM_CAP_X86_SMM:
@@ -3077,6 +3078,20 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
return 0;
  }
  
+int kvm_vcpu_ioctl_user_exit(struct kvm_vcpu *vcpu, struct kvm_user_exit *info)

+{
+   struct kvm_user_exit valid = {};
+   BUILD_BUG_ON(sizeof(valid) != 32);
+
+   if (memcmp(info, &valid, sizeof(valid)))
+   return -EINVAL;
+
+   kvm_make_request(KVM_REQ_EXIT, vcpu);
+   kvm_vcpu_kick(vcpu);
+
+   return 0;
+}
+
  long kvm_arch_vcpu_ioctl(struct file *filp,
 unsigned int ioctl, unsigned long arg)
  {
@@ -3341,6 +3356,15 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = kvm_set_guest_paused(vcpu);
goto out;
}
+   case KVM_USER_EXIT: {
+   struct kvm_user_exit info;
+
+   r = -EFAULT;
+   if (copy_from_user(&info, argp, sizeof(info)))
+   goto out;
+   r = kvm_vcpu_ioctl_user_exit(vcpu, &info);
+   break;
+   }
default:
r = -EINVAL;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d996a7cdb4d2..bc5a1abe9626 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -826,6 +826,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_X86_SMM 117
  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
  #define KVM_CAP_SPLIT_IRQCHIP 119
+#define KVM_CAP_USER_EXIT 120
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
@@ -1008,6 +1009,10 @@ struct kvm_device_attr {

__u64   addr;   /* userspace address of attr data */
  };
  
+struct kvm_user_exit {

+   __u8 reserved[32];
+};
+
  #define  KVM_DEV_VFIO_GROUP   1
  #define   KVM_DEV_VFIO_GROUP_ADD  1
  #define   KVM_DEV_VFIO_GROUP_DEL  2
@@ -1119,6 +1124,8 @@ struct kvm_s390_ucas_mapping {
  #define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO,  0xab, struct 
kvm_arm_device_addr)
  /* Available with KVM_CAP_PPC_RTAS */
  #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xac, struct 
kvm_rtas_token_args)
+/* Available with KVM_CAP_USER_EXIT */
+#define KVM_USER_EXIT  
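
The quoted patch is truncated at this point.  For illustration, a sketch of
how the proposed ioctl would have been used (the series was never merged, so
KVM_USER_EXIT and struct kvm_user_exit are the proposed API above, not
mainline):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* asynchronous kick, valid from any thread per the proposed semantics */
static void kick_vcpu(int vcpu_fd)
{
	struct kvm_user_exit exit_req = {};	/* 'reserved' must stay zeroed */

	if (ioctl(vcpu_fd, KVM_USER_EXIT, &exit_req) < 0)
		perror("KVM_USER_EXIT");
	/* the vcpu's KVM_RUN then returns with exit_reason == KVM_EXIT_REQUEST */
}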

Re: [v4 08/16] KVM: kvm-vfio: User API for IRQ forwarding

2015-06-12 Thread Avi Kivity

On 06/12/2015 06:41 PM, Alex Williamson wrote:

On Fri, 2015-06-12 at 00:23 +, Wu, Feng wrote:

-Original Message-
From: Avi Kivity [mailto:avi.kiv...@gmail.com]
Sent: Friday, June 12, 2015 3:59 AM
To: Wu, Feng; kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Cc: pbonz...@redhat.com; mtosa...@redhat.com;
alex.william...@redhat.com; eric.au...@linaro.org
Subject: Re: [v4 08/16] KVM: kvm-vfio: User API for IRQ forwarding

On 06/11/2015 01:51 PM, Feng Wu wrote:

From: Eric Auger eric.au...@linaro.org

This patch adds and documents a new KVM_DEV_VFIO_DEVICE group
and 2 device attributes: KVM_DEV_VFIO_DEVICE_FORWARD_IRQ,
KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ. The purpose is to be able
to set a VFIO device IRQ as forwarded or not forwarded.
the command takes as argument a handle to a new struct named
kvm_vfio_dev_irq.

Is there no way to do this automatically?  After all, vfio knows that a
device interrupt is forwarded to some eventfd, and kvm knows that some
eventfd is forwarded to a guest interrupt.  If they compare notes
through a central registry, they can figure out that the interrupt needs
to be forwarded.

Oh, just like Eric mentioned in his reply, this description is out of context of
this series, I will remove them in the next version.


I suspect Avi's question was more general.  While forward/unforward is
out of context for this series, it's very similar in nature to
enabling/disabling posted interrupts.  So I think the question remains
whether we really need userspace to participate in creating this
shortcut or if kvm and vfio can some how orchestrate figuring it out
automatically.

Personally I don't know how we could do it automatically.  We've always
relied on userspace to independently setup vfio and kvm such that
neither have any idea that the other is there and update each side
independently when anything changes.  So it seems consistent to continue
that here.  It doesn't seem like there's much to gain performance-wise
either, updates should be a relatively rare event I'd expect.

There's really no metadata associated with an eventfd, so comparing
notes automatically might imply some central registration entity.  That
immediately sounds like a much more complex solution, but maybe Avi has
some ideas to manage it.  Thanks,



The idea is to have a central registry maintained by a posted interrupts 
manager.  Both vfio and kvm pass the filp (along with extra information) 
to the posted interrupts manager, which, when it detects a filp match, 
tells each of them what to do.


The advantages are:
- old userspace gains the optimization without change
- a userspace API is more expensive to maintain than internal kernel 
interfaces (CVEs, documentation, maintaining backwards compatibility)
- if you can do it without a new interface, this indicates that all the 
information in the new interface is redundant.  That means you have to 
check it for consistency with the existing information, so it's extra 
work (likely, it's exactly what the posted interrupt manager would be 
doing anyway).
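
A sketch of the shape such a kernel-internal registry could take (all of
these names are invented for illustration; nothing like this existed in the
tree at the time):

#include <linux/fs.h>
#include <linux/list.h>

/* illustrative only -- none of these symbols exist in this form */
struct pi_participant {
	struct file *eventfd;			/* the filp handed in */
	void (*start_bypass)(struct pi_participant *self,
			     struct pi_participant *peer);
	void (*stop_bypass)(struct pi_participant *self);
	struct list_head node;
};

/* vfio registers a producer when it routes a device IRQ to an eventfd;
 * kvm registers a consumer when an irqfd targets a guest interrupt.
 * On a filp match the manager calls start_bypass() on both sides, so
 * neither subsystem ever learns the other exists. */
int pi_register_producer(struct pi_participant *p);
int pi_register_consumer(struct pi_participant *c);
void pi_unregister(struct pi_participant *p);

For what it's worth, an interface of roughly this shape was later merged as
the IRQ bypass manager (virt/lib/irqbypass.c), which matches producers and
consumers by eventfd token.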




Re: [v4 08/16] KVM: kvm-vfio: User API for IRQ forwarding

2015-06-11 Thread Avi Kivity

On 06/11/2015 01:51 PM, Feng Wu wrote:

From: Eric Auger eric.au...@linaro.org

This patch adds and documents a new KVM_DEV_VFIO_DEVICE group
and 2 device attributes: KVM_DEV_VFIO_DEVICE_FORWARD_IRQ,
KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ. The purpose is to be able
to set a VFIO device IRQ as forwarded or not forwarded.
the command takes as argument a handle to a new struct named
kvm_vfio_dev_irq.


Is there no way to do this automatically?  After all, vfio knows that a 
device interrupt is forwarded to some eventfd, and kvm knows that some 
eventfd is forwarded to a guest interrupt.  If they compare notes 
through a central registry, they can figure out that the interrupt needs 
to be forwarded.




Signed-off-by: Eric Auger eric.au...@linaro.org
---
  Documentation/virtual/kvm/devices/vfio.txt | 34 --
  include/uapi/linux/kvm.h   | 12 +++
  2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740..6186e6d 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -4,15 +4,20 @@ VFIO virtual device
  Device types supported:
KVM_DEV_TYPE_VFIO
  
-Only one VFIO instance may be created per VM.  The created device

-tracks VFIO groups in use by the VM and features of those groups
-important to the correctness and acceleration of the VM.  As groups
-are enabled and disabled for use by the VM, KVM should be updated
-about their presence.  When registered with KVM, a reference to the
-VFIO-group is held by KVM.
+Only one VFIO instance may be created per VM.
+
+The created device tracks VFIO groups in use by the VM and features
+of those groups important to the correctness and acceleration of
+the VM.  As groups are enabled and disabled for use by the VM, KVM
+should be updated about their presence.  When registered with KVM,
+a reference to the VFIO-group is held by KVM.
+
+The device also makes it possible to control some IRQ settings of VFIO
+devices: forwarding/posting.
  
  Groups:

KVM_DEV_VFIO_GROUP
+  KVM_DEV_VFIO_DEVICE
  
  KVM_DEV_VFIO_GROUP attributes:

KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
@@ -20,3 +25,20 @@ KVM_DEV_VFIO_GROUP attributes:
  
  For each, kvm_device_attr.addr points to an int32_t file descriptor

  for the VFIO group.
+
+KVM_DEV_VFIO_DEVICE attributes:
+  KVM_DEV_VFIO_DEVICE_FORWARD_IRQ: set a VFIO device IRQ as forwarded
+  KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ: set a VFIO device IRQ as not forwarded
+
+For each, kvm_device_attr.addr points to a kvm_vfio_dev_irq struct.
+
+When forwarded, a physical IRQ is completed by the guest and not by the
+host. This requires HW support in the interrupt controller.
+
+Forwarding can only be set when the corresponding VFIO IRQ is not masked
+(would it be through VFIO_DEVICE_SET_IRQS command or as a consequence of this
+IRQ being currently handled) or active at interrupt controller level.
+In such a situation, -EAGAIN is returned. It is advised to set the
+forwarding before the VFIO signaling is set up; this avoids trial and error.
+
+Unforwarding can happen at any time.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4b60056..798f3e4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -999,6 +999,9 @@ struct kvm_device_attr {
  #define  KVM_DEV_VFIO_GROUP   1
  #define   KVM_DEV_VFIO_GROUP_ADD  1
  #define   KVM_DEV_VFIO_GROUP_DEL  2
+#define  KVM_DEV_VFIO_DEVICE   2
+#define   KVM_DEV_VFIO_DEVICE_FORWARD_IRQ  1
+#define   KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ2
  
  enum kvm_device_type {

KVM_DEV_TYPE_FSL_MPIC_20= 1,
@@ -1018,6 +1021,15 @@ enum kvm_device_type {
KVM_DEV_TYPE_MAX,
  };
  
+struct kvm_vfio_dev_irq {

+   __u32   argsz;  /* structure length */
+   __u32   fd; /* file descriptor of the VFIO device */
+   __u32   index;  /* VFIO device IRQ index */
+   __u32   start;  /* start of subindex range */
+   __u32   count;  /* size of subindex range */
+   __u32   gsi[];  /* gsi, ie. virtual IRQ number */
+};
+
  /*
   * ioctls for VM fds
   */
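
For illustration, how userspace would flip an INTx line to forwarded with
the proposed attributes.  The series was not merged, so kvm_vfio_dev_irq and
the KVM_DEV_VFIO_DEVICE_* constants are the ones from the patch above;
KVM_SET_DEVICE_ATTR and VFIO_PCI_INTX_IRQ_INDEX are existing API:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

static int forward_intx(int kvm_vfio_dev_fd, int vfio_device_fd, __u32 gsi)
{
	size_t sz = sizeof(struct kvm_vfio_dev_irq) + sizeof(__u32);
	struct kvm_vfio_dev_irq *fwd = calloc(1, sz);
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_DEVICE,
		.attr  = KVM_DEV_VFIO_DEVICE_FORWARD_IRQ,
	};

	fwd->argsz = sz;
	fwd->fd    = vfio_device_fd;		/* an open VFIO device fd */
	fwd->index = VFIO_PCI_INTX_IRQ_INDEX;
	fwd->start = 0;
	fwd->count = 1;				/* one subindex */
	fwd->gsi[0] = gsi;
	attr.addr = (unsigned long)fwd;

	return ioctl(kvm_vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}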




Re: [PATCH v2 00/13] SMM implementation for KVM

2015-06-05 Thread Avi Kivity

On 05/27/2015 08:05 PM, Paolo Bonzini wrote:

This brings together the remaining parts of SMM.  For now I've left the
weird interaction between SMM and NMI blocking, and I'm using the same
format for the state save area (which is also the one used by QEMU) as
the RFC.

It builds on the previous cleanup patches, which (with the exception
of KVM: x86: pass kvm_mmu_page to gfn_to_rmap) are now in kvm/queue.
The first six patches are more or less the same as the previous version,
while the address spaces part hopefully touches all affected functions
now.

Patches 1-6 implement the SMM API and world switch; patches 7-12
implements the multiple address spaces; patch 13 ties the loose
ends and advertises the capability.

Tested with SeaBIOS and OVMF, where SMM provides the trusted base
for secure boot.



Nice work.  While I did not do a thorough review, the mmu bits look robust.




Thanks,

Paolo

Paolo Bonzini (13):
   KVM: x86: introduce num_emulated_msrs
   KVM: x86: pass host_initiated to functions that read MSRs
   KVM: x86: pass the whole hflags field to emulator and back
   KVM: x86: API changes for SMM support
   KVM: x86: stubs for SMM support
   KVM: x86: save/load state on SMM switch
   KVM: add vcpu-specific functions to read/write/translate GFNs
   KVM: implement multiple address spaces
   KVM: x86: pass kvm_mmu_page to gfn_to_rmap
   KVM: x86: use vcpu-specific functions to read/write/translate GFNs
   KVM: x86: work on all available address spaces
   KVM: x86: add SMM to the MMU role, support SMRAM address space
   KVM: x86: advertise KVM_CAP_X86_SMM

  Documentation/virtual/kvm/api.txt|  52 ++-
  arch/powerpc/include/asm/kvm_book3s_64.h |   2 +-
  arch/x86/include/asm/kvm_emulate.h   |   9 +-
  arch/x86/include/asm/kvm_host.h  |  44 ++-
  arch/x86/include/asm/vmx.h   |   1 +
  arch/x86/include/uapi/asm/kvm.h  |  11 +-
  arch/x86/kvm/cpuid.h |   8 +
  arch/x86/kvm/emulate.c   | 262 +-
  arch/x86/kvm/kvm_cache_regs.h|   5 +
  arch/x86/kvm/lapic.c |   4 +-
  arch/x86/kvm/mmu.c   | 171 +-
  arch/x86/kvm/mmu_audit.c |  16 +-
  arch/x86/kvm/paging_tmpl.h   |  18 +-
  arch/x86/kvm/svm.c   |  73 ++--
  arch/x86/kvm/trace.h |  22 ++
  arch/x86/kvm/vmx.c   | 106 +++---
  arch/x86/kvm/x86.c   | 562 ++-
  include/linux/kvm_host.h |  49 ++-
  include/uapi/linux/kvm.h |   6 +-
  virt/kvm/kvm_main.c  | 237 ++---
  20 files changed, 1337 insertions(+), 321 deletions(-)





Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-27 Thread Avi Kivity



On 05/27/2015 12:30 PM, Paolo Bonzini wrote:


On 26/05/2015 23:25, Christopher Covington wrote:

On 05/25/2015 08:53 AM, Paolo Bonzini wrote:

On 22/05/2015 13:12, Daniel P. Berrange wrote:

In
particular I don't see why we need to have a SATA controller and ISA/LPC
bridge in every virt machine - root PCI bus only should be possible, as you
can provide disks via virtio-blk or virtio-scsi and serial, parallel, mouse,
floppy via PCI devices and/or by adding a USB bus in the cases where you
really need one.

I think removing the ISA/LPC bridge is hard.  It includes the real-time
clock and fw_cfg, for example.

Could VirtIO specified replacements make sense for these peripherals?

Not really.  virtio is too heavyweight and you'd be reinventing the
wheel unnecessarily.

For example, ARM's -M virt uses a pl031 block for the RTC, and also
uses fw_cfg.  Another commonly used ISA device is the UART, for which
again -M virt uses a pl011.



The RTC can be replaced by kvmclock, the keyboard by virtio-console.  
Maybe we can provide an msr- or pci- based interface to fw_cfg.
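
For scale, the classic fw_cfg interface being discussed is already about as
light as an interface gets; a minimal sketch of probing it from an x86 guest
over the well-known I/O ports (selector 0x510, data 0x511; ioperm() needs
root, and this only does something inside a QEMU guest):

#include <stdio.h>
#include <sys/io.h>

#define FW_CFG_PORT_SEL   0x510	/* 16-bit selector */
#define FW_CFG_PORT_DATA  0x511	/* 8-bit data */
#define FW_CFG_SIGNATURE  0x0000

int main(void)
{
	char sig[5] = {};
	int i;

	if (ioperm(FW_CFG_PORT_SEL, 2, 1))
		return 1;
	outw(FW_CFG_SIGNATURE, FW_CFG_PORT_SEL);
	for (i = 0; i < 4; i++)
		sig[i] = inb(FW_CFG_PORT_DATA);
	printf("fw_cfg signature: %s\n", sig);	/* "QEMU" on QEMU */
	return 0;
}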



Re: [RFC PATCH 2/4] KVM: x86: Add KVM exit for IOAPIC EOIs

2015-05-26 Thread Avi Kivity

On 05/27/2015 05:06 AM, Steve Rutherford wrote:

On Sun, May 24, 2015 at 07:46:03PM +0300, Avi Kivity wrote:

On 05/13/2015 04:47 AM, Steve Rutherford wrote:

Adds KVM_EXIT_IOAPIC_EOI which passes the interrupt vector up to
userspace.

Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs
to be informed (which is identical to the EOI_EXIT_BITMAP field used
by modern x86 processors, but can also be used to elide kvm IOAPIC EOI
exits on older processors).

[Note: A prototype using ResampleFDs found that decoupling the EOI

from the VCPU's thread made it possible for the VCPU to not see a

recent EOI after reentering the guest. This does not match real
hardware.]

Compile tested for Intel x86.

Signed-off-by: Steve Rutherford srutherf...@google.com
---
  Documentation/virtual/kvm/api.txt | 10 ++
  arch/x86/include/asm/kvm_host.h   |  3 +++
  arch/x86/kvm/lapic.c  |  9 +
  arch/x86/kvm/x86.c| 11 +++
  include/linux/kvm_host.h  |  1 +
  include/uapi/linux/kvm.h  |  5 +
  6 files changed, 39 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 0744b4e..dd92996 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3285,6 +3285,16 @@ Valid values for 'type' are:
 */
__u64 kvm_valid_regs;
__u64 kvm_dirty_regs;
+
+   /* KVM_EXIT_IOAPIC_EOI */
+struct {
+  __u8 vector;
+} eoi;
+
+Indicates that an EOI of a level-triggered IOAPIC interrupt on 'vector' has
+occurred, which should be handled by the userspace IOAPIC. Triggers when
+the irqchip has been split between userspace and the kernel.
+

The ioapic is a global resource, so it doesn't make sense for
information about it to be returned in a per-vcpu structure

EOI exits are a per-vcpu behavior, so this doesn't seem all that strange.


(or to block the vcpu while it is being processed).

Blocking doesn't feel clean, but doesn't seem all that bad, given
that these operations are relatively rare on modern configurations.


Agree, maybe the realtime people have an interest here.


The way I'd model it is to emulate the APIC bus that connects local
APICs and the IOAPIC, using a socket pair.  When the user-space
ioapic wants to inject an interrupt, it sends a message to the local
APICs which then inject it, and when it's ack'ed the EOI is sent
back on the same bus.

Although I'm not certain about this, it sounds to me like this would
require a kernel thread to be waiting (in some way) on this socket, which
seems rather heavy handed.


It's been a while since I did kernel programming, but I think you can 
queue a callback to be called when an I/O is ready, and not require a 
thread.  IIRC we do that with irqfd to cause an interrupt to be injected.
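
Roughly, the pattern irqfd uses looks like this: a simplified sketch with
kernel-internal APIs of that era, where struct eoi_ctx and the handlers are
made up for illustration:

#include <linux/fs.h>
#include <linux/wait.h>
#include <linux/poll.h>
#include <linux/workqueue.h>

struct eoi_ctx {
	wait_queue_t wait;
	poll_table pt;
	struct work_struct work;
};

/* runs in the writer's context when the fd becomes readable; no
 * kernel thread sleeps on the socket */
static int eoi_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	struct eoi_ctx *ctx = container_of(wait, struct eoi_ctx, wait);

	schedule_work(&ctx->work);	/* or handle inline if cheap enough */
	return 0;
}

static void eoi_ptable_queue(struct file *file, wait_queue_head_t *wqh,
			     poll_table *pt)
{
	struct eoi_ctx *ctx = container_of(pt, struct eoi_ctx, pt);

	add_wait_queue(wqh, &ctx->wait);
}

/* setup: hook the callback instead of parking a thread in poll() */
static void eoi_attach(struct eoi_ctx *ctx, struct file *file)
{
	init_waitqueue_func_entry(&ctx->wait, eoi_wakeup);
	init_poll_funcptr(&ctx->pt, eoi_ptable_queue);
	file->f_op->poll(file, &ctx->pt);	/* registers eoi_wakeup on the fd */
}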




Re: [PATCH 11/12] KVM: x86: add KVM_MEM_X86_SMRAM memory slot flag

2015-05-26 Thread Avi Kivity

On 05/08/2015 02:20 PM, Paolo Bonzini wrote:

This adds an arch-specific memslot flag that hides slots unless the
VCPU is in system management mode.

Some care is needed in order to limit the overhead of x86_gfn_to_memslot
when compared with gfn_to_memslot.  Thankfully, we have __gfn_to_memslot
and search_memslots which are the same, so we can add some extra output
to search_memslots.  The compiler will optimize it as dead code in
__gfn_to_memslot, and will use it to thread jumps in x86_gfn_to_memslot.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
  Documentation/virtual/kvm/api.txt | 18 --
  arch/x86/include/uapi/asm/kvm.h   |  3 +++
  arch/x86/kvm/smram.c  | 25 +++--
  include/linux/kvm_host.h  | 14 ++
  virt/kvm/kvm_main.c   |  4 
  5 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 51523b70b6b2..2bc99ae040da 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -933,18 +933,24 @@ It is recommended that the lower 21 bits of 
guest_phys_addr and userspace_addr
  be identical.  This allows large pages in the guest to be backed by large
  pages in the host.
  
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and

-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The flags field supports two architecture-independent flags:
+KVM_MEM_LOG_DIRTY_PAGES and KVM_MEM_READONLY.  The former can be set
+to instruct KVM to keep track of writes to memory within the slot.
+See KVM_GET_DIRTY_LOG ioctl to know how to use it.  The latter can
+be set, if KVM_CAP_READONLY_MEM capability allows it, to make a new
+slot read-only.  In this case, writes to this memory will be posted to
+userspace as KVM_EXIT_MMIO exits.
  
  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of

  the memory region are automatically reflected into the guest.  For example, an
  mmap() that affects the region will be made visible immediately.  Another
  example is madvise(MADV_DROP).
  
+Each architecture can support other specific flags.  Right now there is

+only one defined.  On x86, if KVM_CAP_X86_SMM is available, the
+KVM_MEM_X86_SMRAM flag will hide the slot to VCPUs that are not
+in system management mode.


Is this generic enough?  For example, a system could configure itself so 
that an SMRAM region goes to mmio, hiding real RAM.


I see two alternatives:

- have three states: SMM, !SMM, both
- define two tables: SMM, !SMM, both spanning the entire address space

you should probably document how dirty bitmap handling happens in the 
presence of SMM.



+
  It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
  The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
  allocation and is deprecated.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 30100a3c1bed..46df15bc844f 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -45,6 +45,9 @@
  #define __KVM_HAVE_XCRS
  #define __KVM_HAVE_READONLY_MEM
  
+#define __KVM_ARCH_VALID_FLAGS		KVM_MEM_X86_SMRAM

+#define KVM_MEM_X86_SMRAM  (1 << 24)
+
  /* Architectural interrupt line count. */
  #define KVM_NR_INTERRUPTS 256
  
diff --git a/arch/x86/kvm/smram.c b/arch/x86/kvm/smram.c

index 73616edab631..e7dd933673a4 100644
--- a/arch/x86/kvm/smram.c
+++ b/arch/x86/kvm/smram.c
@@ -19,10 +19,23 @@
  
  #include <linux/module.h>

  #include <linux/kvm_host.h>
+#include "kvm_cache_regs.h"
  
  struct kvm_memory_slot *x86_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)

  {
-   struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
+   /* By using search_memslots directly the compiler can optimize away
+* the if (found) check below.
+ *
+* It cannot do the same for gfn_to_memslot because it is not inlined,
+* and it also cannot do the same for __gfn_to_memslot because the
+* kernel is compiled with -fno-delete-null-pointer-checks.
+*/
+   bool found;
+   struct kvm_memslots *memslots = kvm_memslots(vcpu->kvm);
+   struct kvm_memory_slot *slot = search_memslots(memslots, gfn, &found);
+
+   if (found && unlikely(slot->flags & KVM_MEM_X86_SMRAM) && !is_smm(vcpu))
+   return NULL;
  
  	return slot;

  }
@@ -112,7 +125,15 @@ EXPORT_SYMBOL_GPL(x86_read_guest);
  int x86_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
  gpa_t gpa, unsigned long len)
  {
-   return kvm_gfn_to_hva_cache_init(kvm, ghc, gpa, 

Re: [RFC PATCH 2/4] KVM: x86: Add KVM exit for IOAPIC EOIs

2015-05-24 Thread Avi Kivity

On 05/13/2015 04:47 AM, Steve Rutherford wrote:

Adds KVM_EXIT_IOAPIC_EOI which passes the interrupt vector up to
userspace.

Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs
to be informed (which is identical to the EOI_EXIT_BITMAP field used
by modern x86 processors, but can also be used to elide kvm IOAPIC EOI
exits on older processors).

[Note: A prototype using ResampleFDs found that decoupling the EOI
from the VCPU's thread made it possible for the VCPU to not see a
recent EOI after reentering the guest. This does not match real
hardware.]

Compile tested for Intel x86.

Signed-off-by: Steve Rutherford srutherf...@google.com
---
  Documentation/virtual/kvm/api.txt | 10 ++
  arch/x86/include/asm/kvm_host.h   |  3 +++
  arch/x86/kvm/lapic.c  |  9 +
  arch/x86/kvm/x86.c| 11 +++
  include/linux/kvm_host.h  |  1 +
  include/uapi/linux/kvm.h  |  5 +
  6 files changed, 39 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 0744b4e..dd92996 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3285,6 +3285,16 @@ Valid values for 'type' are:
 */
__u64 kvm_valid_regs;
__u64 kvm_dirty_regs;
+
+   /* KVM_EXIT_IOAPIC_EOI */
+struct {
+  __u8 vector;
+} eoi;
+
+Indicates that an EOI of a level-triggered IOAPIC interrupt on 'vector' has
+occurred, which should be handled by the userspace IOAPIC. Triggers when
+the irqchip has been split between userspace and the kernel.
+


The ioapic is a global resource, so it doesn't make sense for 
information about it to be returned in a per-vcpu structure (or to block 
the vcpu while it is being processed).


The way I'd model it is to emulate the APIC bus that connects local 
APICs and the IOAPIC, using a socket pair.  When the user-space ioapic 
wants to inject an interrupt, it sends a message to the local APICs 
which then inject it, and when it's ack'ed the EOI is sent back on the 
same bus.


It's true that the APIC bus no longer exists, but modern processors 
still pretend it does.
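
Concretely, a userspace sketch of that bus model (all names here are
invented for illustration; one SOCK_SEQPACKET pair per bus, created once
with socketpair(AF_UNIX, SOCK_SEQPACKET, 0, bus)):

#include <sys/socket.h>

enum apic_bus_op { APIC_BUS_DELIVER, APIC_BUS_EOI };

struct apic_bus_msg {
	enum apic_bus_op op;
	unsigned char vector;
	unsigned char dest;		/* destination APIC id */
};

/* bus[0] is the ioapic end, bus[1] the local-APIC end */
static void ioapic_inject_and_wait_eoi(int bus[2], unsigned char vector)
{
	struct apic_bus_msg m = { APIC_BUS_DELIVER, vector, 0 };

	send(bus[0], &m, sizeof(m), 0);
	do	/* guest acked: clear remote-IRR, re-inject if still asserted */
		recv(bus[0], &m, sizeof(m), 0);
	while (m.op != APIC_BUS_EOI || m.vector != vector);
}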




Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Avi Kivity

On 05/21/2015 07:21 PM, Paolo Bonzini wrote:


On 21/05/2015 17:48, Avi Kivity wrote:

Lovely!

Note you have memcpy.o instead of memcpy.c.

Doh, and it's not used anyway.  Check the repository, and let me know if
OSv boots with it (it probably needs ACPI; Linux doesn't boot virtio
without ACPI).



Yes, it requires ACPI.  We don't implement the pre-ACPI bootstrap methods.


Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Avi Kivity

On 05/21/2015 04:51 PM, Paolo Bonzini wrote:

Some of you may have heard about the Clear Containers initiative from
Intel, which couples KVM with various kernel tricks to create extremely
lightweight virtual machines.  The experimental Clear Containers setup
requires only 18-20 MB to launch a virtual machine, and needs about 60
ms to boot.

Now, as all of you probably know, QEMU is great for running Windows or
legacy Linux guests, but that flexibility comes at a hefty price. Not
only does all of the emulation consume memory, it also requires some
form of low-level firmware in the guest as well. All of this adds quite
a bit to virtual-machine startup times (500 to 700 milliseconds is not
unusual).

Right?  In fact, it's for this reason that Clear Containers uses kvmtool
instead of QEMU.

No, wrong!  In fact, reporting bad performance is pretty much the same
as throwing down the gauntlet.

Enter qboot, a minimal x86 firmware that runs on QEMU and, together with
a slimmed-down QEMU configuration, boots a virtual machine in 40
milliseconds[2] on an Ivy Bridge Core i7 processor.

qboot is available at git://github.com/bonzini/qboot.git.  In all the
glory of its 8KB of code, it brings together various existing open
source components:

* a minimal (really minimal) 16-bit BIOS runtime based on kvmtool's own BIOS

* a couple hardware initialization routines written mostly from scratch
but with good help from SeaBIOS source code

* a minimal 32-bit libc based on kvm-unit-tests

* the Linux loader from QEMU itself

The repository has more information on how to achieve fast boot times,
and examples of using qboot.  Right now there is a limit of 8 MB for
vmlinuz+initrd+cmdline, which however should be enough for initrd-less
containers.

The first commit to qboot is more or less 24 hours old, so there is
definitely more work to do, in particular to extract ACPI tables from
QEMU and present them to the guest.  This is probably another day of
work or so, and it will enable multiprocessor guests with little or no
impact on the boot times.  SMBIOS information is also available from QEMU.

On the QEMU side, there is no support yet for persistent memory and the
NFIT tables from ACPI 6.0.  Once that (and ACPI support) is added, qboot
will automatically start using it.

Happy hacking!



Lovely!

Note you have memcpy.o instead of memcpy.c.


Re: SVM: vmload/vmsave-free VM exits?

2015-04-13 Thread Avi Kivity

On 04/13/2015 08:35 PM, Jan Kiszka wrote:

On 2015-04-13 19:29, Avi Kivity wrote:

On 04/13/2015 10:01 AM, Jan Kiszka wrote:

On 2015-04-07 07:43, Jan Kiszka wrote:

On 2015-04-05 19:12, Valentine Sinitsyn wrote:

Hi Jan,

On 05.04.2015 13:31, Jan Kiszka wrote:

studying the VM exit logic of Jailhouse, I was wondering when AMD's
vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
these instructions unconditionally. However, I think both only need
GS.base, i.e. the per-cpu base address, to be saved and restored if no
user space exit or no CPU migration is involved (both is always
true for
Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also
still uses rsp-based per-cpu variables.

So the question boils down to what is generally faster:

A) vmload
  vmrun
  vmsave

B) wrmsrl(MSR_GS_BASE, guest_gs_base)
  vmrun
  rdmsrl(MSR_GS_BASE, guest_gs_base)

Of course, KVM also has to take into account that heavyweight exits
still require vmload/vmsave, thus become more expensive with B) due to
the additional MSR accesses.

Any thoughts or results of previous experiments?

That's a good question, I also thought about it when I was finalizing
Jailhouse AMD port. I tried lightweight exits with apic-demo but it
didn't seem to affect the latency in any noticeable way. That's why I
decided not to push the patch (in fact, I was even unable to find it
now).

Note however that how AMD chips store host state during VM switches is
implementation-specific. I did my quick experiments on one CPU only, so
your mileage may vary.

Regarding your question, I feel B will be faster anyways but again I'm
afraid that the gain could be within statistical error of the
experiment.

It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
towards 600 if they are colder (added some usleep to each loop in the
test).

I've tested via vmmcall from guest userspace under Jailhouse. KVM should
be adjustable in a similar way. Attached the benchmark, patch will be in
the Jailhouse next branch soon. We need to check more CPU types, though.

Avi, I found some preparatory patches of yours from 2010 [1]. Do you
happen to remember if it was never completed for a technical reason?

IIRC, I came to the conclusion that it was impossible.  Something about
TR.size not receiving a reasonable value.  Let me see.

To my understanding, TR doesn't play a role until we leave ring 0 again.
Or what could make the CPU look for any of the fields in the 64-bit TSS
before that?


Exceptions that utilize the IST.  I found a writeup [17] that describes 
this, but I think it's even more impossible than that writeup implies.


[17]  http://thread.gmane.org/gmane.comp.emulators.kvm.devel/26712/



Jan





Re: SVM: vmload/vmsave-free VM exits?

2015-04-13 Thread Avi Kivity

On 04/13/2015 08:41 PM, Avi Kivity wrote:

On 04/13/2015 08:35 PM, Jan Kiszka wrote:

On 2015-04-13 19:29, Avi Kivity wrote:

On 04/13/2015 10:01 AM, Jan Kiszka wrote:

On 2015-04-07 07:43, Jan Kiszka wrote:

On 2015-04-05 19:12, Valentine Sinitsyn wrote:

Hi Jan,

On 05.04.2015 13:31, Jan Kiszka wrote:

studying the VM exit logic of Jailhouse, I was wondering when AMD's
vmload/vmsave can be avoided. Jailhouse as well as KVM currently 
use

these instructions unconditionally. However, I think both only need
GS.base, i.e. the per-cpu base address, to be saved and restored 
if no

user space exit or no CPU migration is involved (both is always
true for
Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it 
also

still uses rsp-based per-cpu variables.

So the question boils down to what is generally faster:

A) vmload
  vmrun
  vmsave

B) wrmsrl(MSR_GS_BASE, guest_gs_base)
  vmrun
  rdmsrl(MSR_GS_BASE, guest_gs_base)

Of course, KVM also has to take into account that heavyweight exits
still require vmload/vmsave, thus become more expensive with B) 
due to

the additional MSR accesses.

Any thoughts or results of previous experiments?
That's a good question, I also thought about it when I was 
finalizing
Jailhouse AMD port. I tried lightweight exits with apic-demo 
but it
didn't seem to affect the latency in any noticeable way. That's 
why I

decided not to push the patch (in fact, I was even unable to find it
now).

Note however that how AMD chips store host state during VM 
switches is
implementation-specific. I did my quick experiments on one CPU 
only, so

your mileage may vary.

Regarding your question, I feel B will be faster anyways but 
again I'm

afraid that the gain could be within statistical error of the
experiment.
It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, 
more

towards 600 if they are colder (added some usleep to each loop in the
test).

I've tested via vmmcall from guest userspace under Jailhouse. KVM 
should
be adjustable in a similar way. Attached the benchmark, patch will 
be in
the Jailhouse next branch soon. We need to check more CPU types, 
though.

Avi, I found some preparatory patches of yours from 2010 [1]. Do you
happen to remember if it was never completed for a technical reason?

IIRC, I came to the conclusion that it was impossible. Something about
TR.size not receiving a reasonable value.  Let me see.

To my understanding, TR doesn't play a role until we leave ring 0 again.
Or what could make the CPU look for any of the fields in the 64-bit TSS
before that?


Exceptions that utilize the IST.  I found a writeup [17] that 
describes this, but I think it's even more impossible than that 
writeup implies.




I think that Xen does (or did) something along the lines of disabling 
IST usage (by playing with the descriptors in the IDT) and then 
re-enabling them when exiting to userspace.




[17] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/26712/



Jan







Re: SVM: vmload/vmsave-free VM exits?

2015-04-13 Thread Avi Kivity

On 04/13/2015 08:57 PM, Jan Kiszka wrote:

On 2015-04-13 19:48, Avi Kivity wrote:

I think that Xen does (or did) something along the lines of disabling
IST usage (by playing with the descriptors in the IDT) and then
re-enabling them when exiting to userspace.

So we would reuse that active stack for the current IST users until
then.


Yes.


But I bet there are subtle details that prevent a simple switch at
IDT level. Hmm, no low-hanging fruit it seems...



For sure. It's not insurmountable, but fairly hard.
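
For the curious, a hand-rolled sketch of the trick (none of this corresponds
to a mainline interface; the struct is just the architectural 64-bit IDT
gate, whose 3-bit IST field selects an IST stack, with 0 meaning "use the
current stack"):

#include <stdint.h>

struct idt_gate64 {
	uint16_t offset_low;
	uint16_t selector;
	uint16_t ist : 3, zero0 : 5, type : 4, zero1 : 1, dpl : 2, p : 1;
	uint16_t offset_mid;
	uint32_t offset_high;
	uint32_t reserved;
} __attribute__((packed));

static uint8_t saved_ist[256];

static void idt_disable_ist(struct idt_gate64 *idt, int vector)
{
	saved_ist[vector] = idt[vector].ist;
	idt[vector].ist = 0;	/* deliver on the current stack */
}

static void idt_restore_ist(struct idt_gate64 *idt, int vector)
{
	idt[vector].ist = saved_ist[vector];
}

/* before a lightweight VM exit:   idt_disable_ist(idt, 2);   (2 == NMI)
 * before returning to userspace:  idt_restore_ist(idt, 2); */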




[17] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/26712/

That thread proposed the complete IST removal. But, given that we still
have it 7 years later,


Well, it's not as if a crack team of kernel hackers was laboring night 
and day to remove it, but...



  I suppose that was not very welcome in general.


Simply removing it is impossible, or an NMI happening immediately after 
SYSCALL will hit user-provided %rsp.



Thanks,
Jan

PS: For the Jailhouse readers: we don't use IST.



You don't have userspace, yes?  Only guests?


Re: SVM: vmload/vmsave-free VM exits?

2015-04-13 Thread Avi Kivity

On 04/13/2015 10:01 AM, Jan Kiszka wrote:

On 2015-04-07 07:43, Jan Kiszka wrote:

On 2015-04-05 19:12, Valentine Sinitsyn wrote:

Hi Jan,

On 05.04.2015 13:31, Jan Kiszka wrote:

studying the VM exit logic of Jailhouse, I was wondering when AMD's
vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
these instructions unconditionally. However, I think both only need
GS.base, i.e. the per-cpu base address, to be saved and restored if no
user space exit or no CPU migration is involved (both is always true for
Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also
still uses rsp-based per-cpu variables.

So the question boils down to what is generally faster:

A) vmload
 vmrun
 vmsave

B) wrmsrl(MSR_GS_BASE, guest_gs_base)
 vmrun
 rdmsrl(MSR_GS_BASE, guest_gs_base)

Of course, KVM also has to take into account that heavyweight exits
still require vmload/vmsave, thus become more expensive with B) due to
the additional MSR accesses.

Any thoughts or results of previous experiments?

That's a good question, I also thought about it when I was finalizing
Jailhouse AMD port. I tried lightweight exits with apic-demo but it
didn't seem to affect the latency in any noticeable way. That's why I
decided not to push the patch (in fact, I was even unable to find it now).

Note however that how AMD chips store host state during VM switches is
implementation-specific. I did my quick experiments on one CPU only, so
your mileage may vary.

Regarding your question, I feel B will be faster anyways but again I'm
afraid that the gain could be within statistical error of the experiment.

It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
towards 600 if they are colder (added some usleep to each loop in the test).

I've tested via vmmcall from guest userspace under Jailhouse. KVM should
be adjustable in a similar way. Attached the benchmark, patch will be in
the Jailhouse next branch soon. We need to check more CPU types, though.

Avi, I found some preparatory patches of yours from 2010 [1]. Do you
happen to remember if it was never completed for a technical reason?


IIRC, I came to the conclusion that it was impossible.  Something about 
TR.size not receiving a reasonable value.  Let me see.



Joel, can you comment on the benefit of variant B) for the various AMD
CPUs? Is it always positive?

Thanks,
Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61455





Re: x86: Question regarding the reset value of LINT0

2015-04-09 Thread Avi Kivity

On 04/09/2015 09:21 PM, Nadav Amit wrote:

Bandan Das b...@redhat.com wrote:


Nadav Amit nadav.a...@gmail.com writes:


Jan Kiszka jan.kis...@siemens.com wrote:


On 2015-04-08 19:40, Nadav Amit wrote:

Jan Kiszka jan.kis...@siemens.com wrote:


On 2015-04-08 18:59, Nadav Amit wrote:

Jan Kiszka jan.kis...@siemens.com wrote:


On 2015-04-08 18:40, Nadav Amit wrote:

Hi,

I would appreciate it if someone could explain the reason for enabling LINT0
during APIC reset. This does not correspond with Intel SDM Figure 10-8 (“Local
Vector Table”), which says all LVT registers are reset to 0x10000.

In kvm_lapic_reset, I see:

apic_set_reg(apic, APIC_LVT0,
SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));

Which is actually pretty similar to QEMU’s apic_reset_common:

if (bsp) {
 /*
  * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization
  * time typically by BIOS, so PIC interrupt can be delivered to the
  * processor when local APIC is enabled.
  */
 s->lvt[APIC_LVT_LINT0] = 0x700;
}

Yet, in both cases, I miss the point - if it is typically done by the BIOS,
why does QEMU or KVM enable it?

BTW: KVM seems to run fine without it, and I think setting it causes me
problems in certain cases.

I suspect it has some historic BIOS backgrounds. Already tried to find
more information in the git logs of both code bases? Or something that
indicates of SeaBIOS or BochsBIOS once didn't do this initialization?

Thanks. I found no indication of such thing.

QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says:

  Don't route PIC interrupts through the local APIC if the local APIC
  config says so. By Ari Kivity.

Maybe Avi Kivity knows this guy.

ths? That should have been Thiemo Seufer (IIRC), but he just committed
the code back then (and is no longer with us, sadly).

Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke
about Avi knowing “Ari”).

Ah. No problem. My brain apparently fixed that typo up unnoticed.


But if that commit went in without any BIOS changes around it, QEMU
simply had to do the job of the latter to keep things working.

So should I leave it as is? Can I at least disable in KVM during INIT (and
leave it as is for RESET)?

No, I don't think there is a need to leave this inaccurate for QEMU if
our included BIOS gets it right. I don't know what the backward
bug-compatibility of KVM is, though. Maybe you can identify since when
our BIOS is fine so that we can discuss time frames.

I think that it was addressed in commit
19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 (Initialize the LINT LVTs on the
local APIC of the BSP.”) So it should be included in seabios 0.5.0, which
means qemu 0.12 - so we are talking about the end of 2009 or start of 2010.

The probability that someone will use a newer version of kernel with something
as old as 0.12 is probably minimal. I think it's ok to change it with a comment
indicating the reason. To be on the safe side, however, a user changeable switch
is something worth considering.

I don’t see any existing mechanism for KVM to be aware of its user type and
version. I do see another case of KVM hacks that are intended for fixing
very old QEMU bugs (see 3a624e29c75 changes in vmx_set_segment, which are
from pretty much the same time-frame of the issue I try to fix).

Since this is something which would follow around, please advise what would
be the format. A new ioctl that would supply the userspace “type” (according
to predefined constants) and version?


That would be madness.  KVM shouldn't even know that qemu exists, let 
alone track its versions.


Simply add a new toggle KVM_USE_STANDARD_LAPIC_LVT_INIT and document 
that userspace MUST use it.  Old userspace won't, and will get the old 
buggy behavior.
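
A sketch of what that would look like (the capability name is only the
hypothetical toggle suggested above; the plumbing, KVM_ENABLE_CAP, is
existing API):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void opt_in_to_standard_lvt(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_USE_STANDARD_LAPIC_LVT_INIT,	/* hypothetical */
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);	/* new userspace always opts in;
						 * old userspace keeps the quirk */
}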



Re: 2 CPU Conformance Issue in KVM/x86

2015-03-10 Thread Avi Kivity

On 03/10/2015 12:47 PM, Paolo Bonzini wrote:


On 09/03/2015 20:49, Avi Kivity wrote:

Yes, and it checked that MAXPHYADDR != 52 before.  If you want to set
only one bit, making that bit 51 makes sense anyway for simplicity, so
it is still 99.9% academic.  Once processors appear with MAXPHYADDR =
52, the remaining 0.1% will become more relevant.

The current limit is IIRC 46 or 48 (on Haswell Xeons).

It will be interesting to have processors with 52 bits of physical
address and 48 bits of virtual address. HIGHMEM for x86_64?  Or 5-level
page tables?

I wonder why Intel chose exactly 52...  HIGHMEM seems more likely than
5-level page tables.  Certainly it wouldn't need hacks like Ingo's 4G-4G.


My bitcoins are on 5-level page tables.  HIGHMEM is too much pain.


50 bits == 1 PiB.  That's quite an amount of RAM.

Not that 64 TiB is not quite an amount of RAM. :)




Depends on how many browser tabs you have open, I guess.


Re: 2 CPU Conformance Issue in KVM/x86

2015-03-09 Thread Avi Kivity

On 03/03/2015 11:52 AM, Paolo Bonzini wrote:

In this
case, the VM might expect exceptions when PTE bits which are higher than the
maximum (reported) address width are set, and it would not get such
exceptions. This problem can easily be experienced by a small change to the
existing KVM unit-tests.

There are many variants to this problem, and the only solution which I
consider complete is to report to the VM the maximum (52) physical address
width to the VM, configure the VM to exit on #PF with reserved-bit
error-codes, and then emulate these faulting instructions.

Not even that would be a definitive solution.  If the guest tries to map
RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR,
you would get EPT misconfiguration vmexits.

I think there is no way to emulate physical address width correctly,
except by disabling EPT.



Is the issue emulating a higher MAXPHYADDR on the guest than is 
available on the host? I don't think there's any need to support that.


Emulating a lower setting on the guest than is available on the host is, 
I think, desirable. Whether it would work depends on the relative 
priority of EPT misconfiguration exits vs. page table permission faults.



Re: 2 CPU Conformance Issue in KVM/x86

2015-03-09 Thread Avi Kivity

On 03/09/2015 07:51 PM, Nadav Amit wrote:

Avi Kivity avi.kiv...@gmail.com wrote:


On 03/03/2015 11:52 AM, Paolo Bonzini wrote:

In this
case, the VM might expect exceptions when PTE bits which are higher than the
maximum (reported) address width are set, and it would not get such
exceptions. This problem can easily be experienced by a small change to the
existing KVM unit-tests.

There are many variants to this problem, and the only solution which I
consider complete is to report to the VM the maximum (52) physical address
width to the VM, configure the VM to exit on #PF with reserved-bit
error-codes, and then emulate these faulting instructions.

Not even that would be a definitive solution.  If the guest tries to map
RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR,
you would get EPT misconfiguration vmexits.

I think there is no way to emulate physical address width correctly,
except by disabling EPT.

Is the issue emulating a higher MAXPHYADDR on the guest than is available
on the host? I don't think there's any need to support that.

Emulating a lower setting on the guest than is available on the host is, I
think, desirable. Whether it would work depends on the relative priority
of EPT misconfiguration exits vs. page table permission faults.

Thanks for the feedback.

Guest page-table permission faults take priority over EPT misconfiguration.
KVM can even be set to trap page-table permission faults, at least in VT-x.
Anyhow, I don’t think it is enough.


Why is it not enough? If you trap a permission fault, you can inject any 
exception error code you like.



  Here is an example

My machine has MAXPHYADDR of 46. I modified kvm-unit-tests access test to
set pte.45 instead of pte.51, which from the VM point-of-view should cause
the #PF error-code indicate the reserved bits are set (just as pte.51 does).
Here is one error from the log:

test pte.p pte.45 pde.p user: FAIL: error code 5 expected d
Dump mapping: address: 1234
--L4: 304b007
--L3: 304c007
--L2: 304d001
--L1: 2201


This is with an ept misconfig programmed into that address, yes?


As you can see, the #PF should have had two reasons: reserved bits, and user
access to supervisor only page. The error-code however does not indicate the
reserved-bits are set.

Note that KVM did not trap any exit on that faulting instruction, as
otherwise it would try to emulate the instruction and assuming it is
supported (and that the #PF was not on an instruction fetch), should be able
to emulate the #PF correctly.
[ The test actually crashes soon after this error due to these reasons. ]

Anyhow, that is the reason for me to assume that having the maximum
MAXPHYADDR is better.



Well, that doesn't work for the reasons Paolo noted.  The guest can have 
an ivshmem device attached, and map it above a host-supported physical 
address, and suddenly it goes slow.




Re: 2 CPU Conformance Issue in KVM/x86

2015-03-09 Thread Avi Kivity

On 03/09/2015 09:38 PM, Paolo Bonzini wrote:


On 09/03/2015 20:19, Avi Kivity wrote:

I can't think of one with reasonable performance either.  Perhaps the
maintainers could raise the issue with Intel.  It looks academic but it
can happen in real life -- KVM for example used to rely on reserved bits
faults (it set all bits in the PTE so it wouldn't have been caught by
this).

Yes, and it checked that MAXPHYADDR != 52 before.  If you want to set
only one bit, making that bit 51 makes sense anyway for simplicity, so
it is still 99.9% academic.  Once processors appear with MAXPHYADDR =
52, the remaining 0.1% will become more relevant.

The current limit is IIRC 46 or 48 (on Haswell Xeons).



It will be interesting to have processors with 52 bits of physical 
address and 48 bits of virtual address. HIGHMEM for x86_64?  Or 5-level 
page tables?


50 bits == 1 PiB.  That's quite an amount of RAM.


Re: 2 CPU Conformance Issue in KVM/x86

2015-03-09 Thread Avi Kivity

On 03/09/2015 09:07 PM, Nadav Amit wrote:

Avi Kivity avi.kiv...@gmail.com wrote:


On 03/09/2015 07:51 PM, Nadav Amit wrote:

Avi Kivity avi.kiv...@gmail.com wrote:


On 03/03/2015 11:52 AM, Paolo Bonzini wrote:

In this
case, the VM might expect exceptions when PTE bits which are higher than the
maximum (reported) address width are set, and it would not get such
exceptions. This problem can easily be experienced by a small change to the
existing KVM unit-tests.

There are many variants to this problem, and the only solution which I
consider complete is to report to the VM the maximum (52) physical address
width to the VM, configure the VM to exit on #PF with reserved-bit
error-codes, and then emulate these faulting instructions.

Not even that would be a definitive solution.  If the guest tries to map
RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR,
you would get EPT misconfiguration vmexits.

I think there is no way to emulate physical address width correctly,
except by disabling EPT.

Is the issue emulating a higher MAXPHYADDR on the guest than is available
on the host? I don't think there's any need to support that.

Emulating a lower setting on the guest than is available on the host is, I
think, desirable. Whether it would work depends on the relative priority
of EPT misconfiguration exits vs. page table permission faults.

Thanks for the feedback.

Guest page-table permission faults get priority over EPT misconfiguration.
KVM can even be set to trap page-table permission faults, at least in VT-x.
Anyhow, I don’t think it is enough.

Why is it not enough? If you trap a permission fault, you can inject any 
exception error code you like.

Because there is no real permission fault. In the following example, the VM
expects one (VM’s MAXPHYADDR=40), but there isn’t (Host’s MAXPHYADDR=46), so
the hypervisor cannot trap it. It can only trap all #PF, which is obviously
too intrusive.


There are three cases:

1) The guest has marked the page as not present.  In this case, no 
reserved bits are set and the guest should receive its #PF.
2) The page is present and the permissions are sufficient.  In this 
case, you will get an EPT misconfiguration and can proceed to inject a 
#PF with the reserved bit flag set.
3) The page is present but permissions are not sufficient.  In this case 
you can trap the fault via the PFEC_MASK register and inject a #PF to 
the guest.


So you can emulate it and only trap permission faults.  It's still too 
expensive though.
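
For illustration, a rough sketch of that three-case dispatch; the
helper names here (guest_pte_present, guest_perms_sufficient,
guest_pte_has_rsvd_bits, inject_pf) are hypothetical, not actual KVM
functions -- only PFERR_RSVD_MASK is real:

static void reflect_guest_fault(struct kvm_vcpu *vcpu, gpa_t addr, u32 pfec)
{
	if (!guest_pte_present(vcpu, addr)) {
		/* case 1: ordinary #PF, no reserved bits involved */
		inject_pf(vcpu, addr, pfec);
	} else if (guest_perms_sufficient(vcpu, addr, pfec)) {
		/* case 2: arrived as an EPT misconfiguration */
		inject_pf(vcpu, addr, pfec | PFERR_RSVD_MASK);
	} else {
		/* case 3: trapped via the PFEC_MASK/PFEC_MATCH fields */
		if (guest_pte_has_rsvd_bits(vcpu, addr))
			pfec |= PFERR_RSVD_MASK;
		inject_pf(vcpu, addr, pfec);
	}
}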




  Here is an example

My machine has MAXPHYADDR of 46. I modified kvm-unit-tests access test to
set pte.45 instead of pte.51, which from the VM point-of-view should cause
the #PF error-code indicate the reserved bits are set (just as pte.51 does).
Here is one error from the log:

test pte.p pte.45 pde.p user: FAIL: error code 5 expected d
Dump mapping: address: 1234
--L4: 304b007
--L3: 304c007
--L2: 304d001
--L1: 2201

This is with an ept misconfig programmed into that address, yes?

A reserved bit in the PTE is set - from the VM point-of-view. If there
wasn’t another cause for #PF, it would lead to EPT violation/misconfig.


As you can see, the #PF should have had two reasons: reserved bits, and user
access to supervisor only page. The error-code however does not indicate the
reserved-bits are set.

Note that KVM did not trap any exit on that faulting instruction, as
otherwise it would try to emulate the instruction and assuming it is
supported (and that the #PF was not on an instruction fetch), should be able
to emulate the #PF correctly.
[ The test actually crashes soon after this error due to these reasons. ]

Anyhow, that is the reason for me to assume that having the maximum
MAXPHYADDR is better.

Well, that doesn't work for the reasons Paolo noted.  The guest can have an 
ivshmem device attached, map it above the host-supported physical address range, 
and suddenly it goes slow.

I fully understand. That’s the reason I don’t have a reasonable solution.


I can't think of one with reasonable performance either.  Perhaps the 
maintainers could raise the issue with Intel.  It looks academic but it 
can happen in real life -- KVM for example used to rely on reserved bits 
faults (it set all bits in the PTE so it wouldn't have been caught by this).



Re: 2 CPU Conformance Issue in KVM/x86

2015-03-09 Thread Avi Kivity

On 03/09/2015 09:33 PM, Paolo Bonzini wrote:


On 09/03/2015 18:08, Avi Kivity wrote:

Is the issue emulating a higher MAXPHYADDR on the guest than is
available on the host? I don't think there's any need to support that.

No, indeed.  The only problem is that the failure mode is quite horrible
(you get a triple fault, possibly while the guest is running).


Can't qemu simply check for it?


If you need that, disable EPT. :)





Re: Seeking a KVM benchmark

2014-11-10 Thread Avi Kivity


On 11/10/2014 02:15 PM, Paolo Bonzini wrote:


On 10/11/2014 11:45, Gleb Natapov wrote:

I tried making also the other shared MSRs the same between guest and
host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier
has nothing to do.  That saves about 4-500 cycles on inl_from_qemu.  I
do want to dig out my old Core 2 and see how the new test fares, but it
really looks like your patch will be in 3.19.

Please test on wide variety of HW before final decision.

Yes, definitely.


Also it would
be nice to ask Intel what is expected overhead. It is awesome if they
manage to add EFER switching with non-measurable overhead, but also hard
to believe :)

So let's see what happens.  Sneak preview: the result is definitely worth
asking Intel about.

I ran these benchmarks with a stock 3.16.6 KVM.  Instead I patched
kvm-unit-tests to set EFER.SCE in enable_nx.  This makes it much simpler
for others to reproduce the results.  I only ran the inl_from_qemu test.

Perf stat reports that the processor goes from 0.46 to 0.66
instructions per cycle, which is consistent with the improvement from
19k to 12k cycles per iteration.

Unpatched KVM-unit-tests:

  3,385,586,563 cycles#3.189 GHz
 [83.25%]
  2,475,979,685 stalled-cycles-frontend   #   73.13% frontend cycles idle   
 [83.37%]
  2,083,556,270 stalled-cycles-backend#   61.54% backend  cycles idle   
 [66.71%]
  1,573,854,041 instructions  #0.46  insns per cycle
  #1.57  stalled cycles per 
insn [83.20%]
1.108486526 seconds time elapsed


Patched KVM-unit-tests:

  3,252,297,378 cycles#3.147 GHz
 [83.32%]
  2,010,266,184 stalled-cycles-frontend   #   61.81% frontend cycles idle   
 [83.36%]
  1,560,371,769 stalled-cycles-backend#   47.98% backend  cycles idle   
 [66.51%]
  2,133,698,018 instructions  #0.66  insns per cycle
  #0.94  stalled cycles per 
insn [83.45%]
1.072395697 seconds time elapsed

Playing with other events shows that the unpatched benchmark has an
awful load of TLB misses

Unpatched:

 30,311 iTLB-loads
464,641,844 dTLB-loads
 10,813,839 dTLB-load-misses  #2.33% of all dTLB cache hits
 20,436,027 iTLB-load-misses  #  67421.16% of all iTLB cache hits

Patched:

  1,440,033 iTLB-loads
640,970,836 dTLB-loads
  2,345,112 dTLB-load-misses  #0.37% of all dTLB cache hits
270,884 iTLB-load-misses  #   18.81% of all iTLB cache hits

This is 100% reproducible.  The meaning of the numbers is clearer if you
look up the raw event numbers in the Intel manuals:

- iTLB-loads is 85h/10h aka perf -e r1085: Number of cache load STLB 
[second-level
TLB] hits. No page walk.

- iTLB-load-misses is 85h/01h aka r185: Misses in all ITLB levels that
cause page walks.

So for example event 85h/04h aka r485 (Cycle PMH is busy with a walk.) and
friends show that the unpatched KVM wastes about 0.1 seconds more than
the patched KVM on page walks:

Unpatched:

 24,430,676 r449 (cycles on dTLB store miss page walks)
196,017,693 r408 (cycles on dTLB load miss page walks)
213,266,243 r485 (cycles on iTLB miss page walks)
-----------
433,714,612 total

Patched:

 22,583,440 r449 (cycles on dTLB store miss page walks)
 40,452,018 r408 (cycles on dTLB load miss page walks)
  2,115,981 r485 (cycles on iTLB miss page walks)
-----------
 65,151,439 total

These 0.1 seconds probably are all on instructions that would have been
fast, since the slow instructions responsible for the low IPC are the
microcoded instructions including VMX and other privileged stuff.
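
(As an aside, the rXXXX raw event codes used above are just the umask
in bits 15:8 and the event select in bits 7:0; a minimal sketch of the
encoding, nothing KVM-specific:)

#include <stdio.h>

int main(void)
{
	unsigned int event = 0x85, umask = 0x10;  /* iTLB load, STLB hit */

	/* umask in bits 15:8, event select in bits 7:0 => "perf -e r1085" */
	printf("perf -e r%x\n", (umask << 8) | event);
	return 0;
}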

Similarly, BDh/20h counts STLB flushes, which are 260k in unpatched KVM
and 3k in patched KVM.  Let's see where they come from:

Unpatched:

+  98.97%  qemu-kvm  [kernel.kallsyms]  [k] native_write_msr_safe
+   0.70%  qemu-kvm  [kernel.kallsyms]  [k] page_fault

It's expected that most TLB misses happen just before a page fault (there
are also events to count how many TLB misses do result in a page fault,
if you care about that), and thus are accounted to the first instruction of the
exception handler.

We do not know what causes second-level TLB _flushes_ but it's quite
expected that you'll have a TLB miss after them and possibly a page fault.
And anyway 98.97% of them coming from native_write_msr_safe is totally
anomalous.

A patched benchmark shows no second-level TLB flush occurs after a WRMSR:

+  72.41%  qemu-kvm  [kernel.kallsyms]  [k] page_fault
+   9.07%  qemu-kvm  [kvm_intel][k] vmx_flush_tlb
+   6.60%  qemu-kvm  [kernel.kallsyms]  [k] set_pte_vaddr_pud
+   5.68%  

Re: [PATCH 2/4] KVM: nSVM: propagate the NPF EXITINFO to the guest

2014-09-02 Thread Avi Kivity


On 09/02/2014 07:46 PM, Paolo Bonzini wrote:

*/

if (unlikely(real_gfn == UNMAPPED_GVA))
goto error;
@@ -1974,10 +1974,28 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
  {
struct vcpu_svm *svm = to_svm(vcpu);
  
-	svm->vmcb->control.exit_code = SVM_EXIT_NPF;
-	svm->vmcb->control.exit_code_hi = 0;
-	svm->vmcb->control.exit_info_1 = fault->error_code;
-	svm->vmcb->control.exit_info_2 = fault->address;
+	/*
+	 * We can keep the value that the processor stored in the VMCB,
+	 * but make up something sensible if we hit the WARN.
+	 */
+	if (WARN_ON(svm->vmcb->control.exit_code != SVM_EXIT_NPF)) {
+		svm->vmcb->control.exit_code = SVM_EXIT_NPF;
+		svm->vmcb->control.exit_code_hi = 0;
+		svm->vmcb->control.exit_info_1 = (1ULL << 32);
+		svm->vmcb->control.exit_info_2 = fault->address;
+	}

It's been a while since I looked into this, but is an injected NPF exit
always the result of a real NPF exit?

I think so, but that's why I CCed you. :)


It could always be the result of emulation into which L0 was tricked.  I 
don't think it's a safe assumption.



How about an io-port emulated on
L1 but passed through to L2 by the nested hypervisor. On emulation of
INS or OUTS, KVM would need to read/write to an L2 address space,

It would need to read/write to *L1* (that's where the VMCB's IOIO map
lies), which could result into a regular page fault injected into L1.

Paolo


maybe
causing NPF faults to be injected. In this case an IOIO exit would cause
an injected NPF exit for L1.



Re: [PATCH 2/5] KVM: x86: drop fpu_activate hook

2014-08-18 Thread Avi Kivity


On 08/18/2014 01:20 PM, Paolo Bonzini wrote:

On 18/08/2014 11:50, Wanpeng Li wrote:

fpu_activate hook is introduced by commit 6b52d186 (KVM: Activate fpu on
clts); however, there is no user currently, so this patch drops it.

Reviewed-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
---
  arch/x86/include/asm/kvm_host.h | 1 -
  arch/x86/kvm/svm.c  | 1 -
  arch/x86/kvm/vmx.c  | 1 -
  3 files changed, 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5724601..b68f3e5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -710,7 +710,6 @@ struct kvm_x86_ops {
void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
-   void (*fpu_activate)(struct kvm_vcpu *vcpu);
void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
  
  	void (*tlb_flush)(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ddf7427..1f49c86 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
.cache_reg = svm_cache_reg,
.get_rflags = svm_get_rflags,
.set_rflags = svm_set_rflags,
-   .fpu_activate = svm_fpu_activate,
.fpu_deactivate = svm_fpu_deactivate,
  
  	.tlb_flush = svm_flush_tlb,

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 71cbee5..2963303 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
.cache_reg = vmx_cache_reg,
.get_rflags = vmx_get_rflags,
.set_rflags = vmx_set_rflags,
-   .fpu_activate = vmx_fpu_activate,
.fpu_deactivate = vmx_fpu_deactivate,
  
  	.tlb_flush = vmx_flush_tlb,



Avi/Gleb, do you remember any particular reason for this?



IIRC (vaguely) if we expect the fpu to be used in the near future, we 
activate it eagerly so that we don't fault when it is used.


Prevents the sequence:

guest user: use fpu
#NM
host: reflect #NM to guest
guest kernel: CLTS
guest kernel: switch fpu state
#NM
host: switch fpu
guest kernel: switch fpu state (restarted)
guest user: use fpu (restarted)

Why was the user removed? Full-time eager fpu?
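
(For illustration only -- a sketch of what eager activation on CLTS
amounts to; guest_clear_cr0_ts and skip_insn are hypothetical names,
not the actual VMX handler:)

static int handle_clts_sketch(struct kvm_vcpu *vcpu)
{
	guest_clear_cr0_ts(vcpu);	/* emulate the CLTS itself */
	fpu_activate(vcpu);		/* load guest FPU state eagerly, so
					 * the guest's next FPU use does not
					 * fault back into the host */
	return skip_insn(vcpu);
}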


Re: [PATCH 2/5] KVM: x86: drop fpu_activate hook

2014-08-18 Thread Avi Kivity


On 08/18/2014 01:51 PM, Paolo Bonzini wrote:

Il 18/08/2014 12:26, Avi Kivity ha scritto:

On 08/18/2014 01:20 PM, Paolo Bonzini wrote:

On 18/08/2014 11:50, Wanpeng Li wrote:

fpu_activate hook is introduced by commit 6b52d186 (KVM: Activate fpu on
clts); however, there is no user currently, so this patch drops it.

Reviewed-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
---
   arch/x86/include/asm/kvm_host.h | 1 -
   arch/x86/kvm/svm.c  | 1 -
   arch/x86/kvm/vmx.c  | 1 -
   3 files changed, 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h
b/arch/x86/include/asm/kvm_host.h
index 5724601..b68f3e5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -710,7 +710,6 @@ struct kvm_x86_ops {
   void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
   unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
   void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
-void (*fpu_activate)(struct kvm_vcpu *vcpu);
   void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
 void (*tlb_flush)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ddf7427..1f49c86 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
   .cache_reg = svm_cache_reg,
   .get_rflags = svm_get_rflags,
   .set_rflags = svm_set_rflags,
-.fpu_activate = svm_fpu_activate,
   .fpu_deactivate = svm_fpu_deactivate,
 .tlb_flush = svm_flush_tlb,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 71cbee5..2963303 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
   .cache_reg = vmx_cache_reg,
   .get_rflags = vmx_get_rflags,
   .set_rflags = vmx_set_rflags,
-.fpu_activate = vmx_fpu_activate,
   .fpu_deactivate = vmx_fpu_deactivate,
 .tlb_flush = vmx_flush_tlb,


Avi/Gleb, do you remember any particular reason for this?


IIRC (vaguely) if we expect the fpu to be used in the near future, we
activate it eagerly so that we don't fault when it is used.

Prevents the sequence:

guest user: use fpu
#NM
host: reflect #NM to guest
guest kernel: CLTS
guest kernel: switch fpu state
#NM
host: switch fpu
guest kernel: switch fpu state (restarted)
guest user: use fpu (restarted)

Why was the user removed? Full-time eager fpu?

No, I mean any reason to keep the hooks.


If there are no callers, I can't think of any.


In the meanwhile I found it myself:

commit 2d04a05bd7e93c13f13a82ac40de4065a99d069b
Author: Avi Kivity a...@redhat.com
Date:   Wed Apr 20 15:32:49 2011 +0300

 KVM: x86 emulator: emulate CLTS internally
 
  Avoid using ctxt->vcpu; we can do everything with ->get_cr() and ->set_cr().
 
  A side effect is that we no longer activate the fpu on emulated CLTS; but that
  should be very rare.
 
 Signed-off-by: Avi Kivity a...@redhat.com


vmx_fpu_activate and svm_fpu_activate are still called on #NM and CLTS, but
never from common code after the above patch.

Activation on CLTS is currently VMX only; I guess on AMD we could check the
decode assists' CR_VALID bit and instruction length to detect CLTS.





Re: [PATCH] KVM: x86 emulator: emulate MOVNTDQ

2014-07-13 Thread Avi Kivity


On 07/11/2014 11:40 PM, Paolo Bonzini wrote:

Il 11/07/2014 22:05, Alex Williamson ha scritto:

Which will return 'true' for this whether I specify Aligned or not.  If
the standard convention is to make it explicit, I'm happy to add the
extra flag, but I think we already #GP on unaligned as implemented here.
Thanks,


We should still specify Aligned if the corresponding AVX instruction 
requires an aligned operand.  ISTR that this is not the case for 
MOVNTDQ, so your patch is correct.  I'll check the SDM more carefully 
next Monday.




The explicitly aligned/unaligned instructions have an A or a U to 
indicate this (e.g. MOVDQU = explicitly unaligned, MOVDQA = explicitly 
aligned, MOVNTDQ = default).




Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-25 Thread Avi Kivity


On 06/24/2014 07:45 PM, Marcelo Tosatti wrote:

On Sun, Jun 22, 2014 at 09:02:25PM +0200, Andi Kleen wrote:

First, it's not sufficient to pin the debug store area, you also
have to pin the guest page tables that are used to map the debug
store.  But even if you do that, as soon as the guest fork()s, it
will create a new pgd which the host will be free to swap out.  The
processor can then attempt a PEBS store to an unmapped address which
will fail, even though the guest is configured correctly.

That's a good point. You're right of course.

The only way I can think around it would be to intercept CR3 writes
while PEBS is active and always pin all the table pages leading
to the PEBS buffer. That's slow, but should be only needed
while PEBS is running.

-Andi

Suppose that can be done separately from the pinned spte patchset.
And it requires accounting into mlock limits as well, as noted.

One set of pagetables per pinned virtual address leading down to the
last translations is sufficient per-vcpu.


Or 4, and use the CR3 exit filter to prevent vmexits among the last 4 
LRU CR3 values.
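
(A sketch of that idea using the VMX CR3-target VMCS fields; the
CR3_TARGET_* encodings and vmcs_write* helpers are real, the lru_cr3
array is illustrative:)

	/* Guest writes of the 4 most recently used CR3 values then
	 * complete without a vmexit. */
	vmcs_write32(CR3_TARGET_COUNT, 4);
	vmcs_writel(CR3_TARGET_VALUE0, lru_cr3[0]);
	vmcs_writel(CR3_TARGET_VALUE1, lru_cr3[1]);
	vmcs_writel(CR3_TARGET_VALUE2, lru_cr3[2]);
	vmcs_writel(CR3_TARGET_VALUE3, lru_cr3[3]);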



Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)

2014-06-22 Thread Avi Kivity


On 06/19/2014 09:26 PM, Marcelo Tosatti wrote:

On Thu, Jun 19, 2014 at 11:01:06AM +0300, Avi Kivity wrote:

On 06/19/2014 02:12 AM, mtosa...@redhat.com wrote:

Allow vcpus to pin spte translations by:

1) Creating a per-vcpu list of pinned ranges.
2) On mmu reload request:
- Fault ranges.
- Mark sptes with a pinned bit.
- Mark shadow pages as pinned.

3) Then modify the following actions:
- Page age => skip spte flush.
- MMU notifiers => force mmu reload request (which kicks cpu out of
guest mode).
- GET_DIRTY_LOG => force mmu reload request.
- SLAB shrinker => skip shadow page deletion.

TDP-only.

+int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
+				  gfn_t base_gfn, unsigned long npages)
+{
+	struct kvm_pinned_page_range *p;
+
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		if (p->base_gfn == base_gfn && p->npages == npages) {
+			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+			return -EEXIST;
+		}
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+
+	if (vcpu->arch.nr_pinned_ranges >=
+	    KVM_MAX_PER_VCPU_PINNED_RANGE)
+		return -ENOSPC;
+
+	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	vcpu->arch.nr_pinned_ranges++;
+
+	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
+
+	INIT_LIST_HEAD(&p->link);
+	p->base_gfn = base_gfn;
+	p->npages = npages;
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+
+	return 0;
+}
+

What happens if ranges overlap (within a vcpu, cross-vcpu)?

The page(s) are faulted multiple times if ranges overlap within a vcpu.

I see no reason to disallow overlapping ranges. Do you?


Not really.  Just making sure nothing horrible happens.




Or if a range overflows and wraps around 0?

Pagefault fails on vm-entry -> KVM_REQ_TRIPLE_FAULT.

Will double check for overflows to make sure.


Will the loop terminate?


Looks like you're limiting the number of ranges, but not the number
of pages, so a guest can lock all of its memory.

Yes. The page pinning at get_page time can also lock all of
guest memory.


I'm sure that can't be good.  Maybe subject this pinning to the task 
mlock limit.
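
(A minimal sketch of such a check; rlimit(), mm->pinned_vm and
capable() are real kernel interfaces, the function itself is
illustrative and not part of the patch:)

static int account_pinned_sketch(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (mm->pinned_vm + npages > limit && !capable(CAP_IPC_LOCK))
		return -ENOMEM;
	mm->pinned_vm += npages;	/* caller must undo this on unpin */
	return 0;
}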





+
+/*
+ * Pin KVM MMU page translations. This guarantees, for valid
+ * addresses registered by kvm_mmu_register_pinned_range (valid address
+ * meaning address which posses sufficient information for fault to
+ * be resolved), valid translations exist while in guest mode and
+ * therefore no VM-exits due to faults will occur.
+ *
+ * Failure to instantiate pages will abort guest entry.
+ *
+ * Page frames should be pinned with get_page in advance.
+ *
+ * Pinning is not guaranteed while executing as L2 guest.

Does this undermine security?

PEBS writes should not be enabled when L2 guest is executing.


What prevents L1 from setting up PEBS MSRs for L2?


+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		gfn_t gfn_offset;
+
+		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
+			gfn_t gfn = p->base_gfn + gfn_offset;
+			int r;
+			bool pinned = false;
+
+			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
+						      PFERR_WRITE_MASK, false,
+						      true, &pinned);
+			/* MMU notifier sequence window: retry */
+			if (!r && !pinned)
+				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+			if (r) {
+				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+				break;
+			}
+		}
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+}
+
  int kvm_mmu_load(struct kvm_vcpu *vcpu)
  {
int r;
@@ -3916,6 +4101,7 @@
goto out;
/* set_cr3() should ensure TLB has been flushed */
	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
+   kvm_mmu_pin_pages(vcpu);
  out:
return r;
  }


I don't see where  you unpin pages, so even if you limit the number
of pinned pages, a guest can pin all of memory by iterating over all
of memory and pinning it a chunk at a time.

The caller should be responsible for limiting the number of pages pinned, since
it is pinning the struct pages?


The caller would be the debug store area MSR callbacks.  How would 
they know what the limit is?




And in that case, should remove any limiting

Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-22 Thread Avi Kivity


On 05/30/2014 04:12 AM, Andi Kleen wrote:

From: Andi Kleen a...@linux.intel.com

PEBS (Precise Event Based Sampling) profiling is very powerful,
allowing improved sampling precision and much additional information,
like address or TSX abort profiling. cycles:p and :pp uses PEBS.

This patch enables PEBS profiling in KVM guests.

PEBS writes profiling records to a virtual address in memory. Since
the guest controls the virtual address space the PEBS record
is directly delivered to the guest buffer. We set up the PEBS state
so that it works correctly. The CPU cannot handle any kind of fault during
these guest writes.

To avoid any problems with guest pages being swapped by the host we
pin the pages when the PEBS buffer is setup, by intercepting
that MSR.

Typically profilers only set up a single page, so pinning that is not
a big problem. The pinning is limited to 17 pages currently (64K+1)

In theory the guest can change its own page tables after the PEBS
setup. The host has no way to track that with EPT. But if a guest
would do that it could only crash itself. It's not expected
that normal profilers do that.




Talking a bit with Gleb about this, I think this is impossible.

First, it's not sufficient to pin the debug store area, you also have to 
pin the guest page tables that are used to map the debug store.  But 
even if you do that, as soon as the guest fork()s, it will create a new 
pgd which the host will be free to swap out.  The processor can then 
attempt a PEBS store to an unmapped address which will fail, even though 
the guest is configured correctly.




Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)

2014-06-19 Thread Avi Kivity


On 06/19/2014 02:12 AM, mtosa...@redhat.com wrote:

Allow vcpus to pin spte translations by:

1) Creating a per-vcpu list of pinned ranges.
2) On mmu reload request:
- Fault ranges.
- Mark sptes with a pinned bit.
- Mark shadow pages as pinned.

3) Then modify the following actions:
- Page age => skip spte flush.
- MMU notifiers => force mmu reload request (which kicks cpu out of
guest mode).
- GET_DIRTY_LOG => force mmu reload request.
- SLAB shrinker => skip shadow page deletion.

TDP-only.

  
+int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
+				  gfn_t base_gfn, unsigned long npages)
+{
+	struct kvm_pinned_page_range *p;
+
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		if (p->base_gfn == base_gfn && p->npages == npages) {
+			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+			return -EEXIST;
+		}
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+
+	if (vcpu->arch.nr_pinned_ranges >=
+	    KVM_MAX_PER_VCPU_PINNED_RANGE)
+		return -ENOSPC;
+
+	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	vcpu->arch.nr_pinned_ranges++;
+
+	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
+
+	INIT_LIST_HEAD(&p->link);
+	p->base_gfn = base_gfn;
+	p->npages = npages;
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+
+	return 0;
+}
+
+


What happens if ranges overlap (within a vcpu, cross-vcpu)? Or if a 
range overflows and wraps around 0?  Or if it does not refer to RAM?


Looks like you're limiting the number of ranges, but not the number of 
pages, so a guest can lock all of its memory.



+
+/*
+ * Pin KVM MMU page translations. This guarantees, for valid
+ * addresses registered by kvm_mmu_register_pinned_range (valid address
+ * meaning address which possesses sufficient information for fault to
+ * be resolved), valid translations exist while in guest mode and
+ * therefore no VM-exits due to faults will occur.
+ *
+ * Failure to instantiate pages will abort guest entry.
+ *
+ * Page frames should be pinned with get_page in advance.
+ *
+ * Pinning is not guaranteed while executing as L2 guest.


Does this undermine security?


+ *
+ */
+
+static void kvm_mmu_pin_pages(struct kvm_vcpu *vcpu)
+{
+   struct kvm_pinned_page_range *p;
+
+   if (is_guest_mode(vcpu))
+   return;
+
+	if (!vcpu->arch.mmu.direct_map)
+		return;
+
+	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);


Is the mutex actually needed? It seems it's only taken in vcpu context, 
so the vcpu mutex should be sufficient.



+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		gfn_t gfn_offset;
+
+		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
+			gfn_t gfn = p->base_gfn + gfn_offset;
+			int r;
+			bool pinned = false;
+
+			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
+						      PFERR_WRITE_MASK, false,
+						      true, &pinned);
+			/* MMU notifier sequence window: retry */
+			if (!r && !pinned)
+				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+			if (r) {
+				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+				break;
+			}
+		}
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+}
+
  int kvm_mmu_load(struct kvm_vcpu *vcpu)
  {
int r;
@@ -3916,6 +4101,7 @@
goto out;
/* set_cr3() should ensure TLB has been flushed */
	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
+   kvm_mmu_pin_pages(vcpu);
  out:
return r;
  }



I don't see where  you unpin pages, so even if you limit the number of 
pinned pages, a guest can pin all of memory by iterating over all of 
memory and pinning it a chunk at a time.


You might try something similar to guest MTRR handling.


Re: [Qemu-devel] KVM and variable-endianness guest CPUs

2014-01-28 Thread Avi Kivity

On 01/22/2014 12:22 PM, Peter Maydell wrote:

On 22 January 2014 05:39, Victor Kamensky victor.kamen...@linaro.org wrote:

Hi Guys,

Christoffer and I had a bit heated chat :) on this
subject last night. Christoffer, really appreciate
your time! We did not really reach agreement
during the chat and Christoffer asked me to follow
up on this thread.
Here it goes. Sorry, it is very long email.

I don't believe we can assign any endianity to
mmio.data[] byte array. I believe mmio.data[] and
mmio.len acts just memcpy and that is all. As
memcpy does not imply any endianity of underlying
data mmio.data[] should not either.

This email is about five times too long to be actually
useful, but the major issue here is that the data being
transferred is not just a bag of bytes. The data[]
array plus the size field are being (mis)used to indicate
that the memory transaction is one of:
  * an 8 bit access
  * a 16 bit access of some uint16_t value
  * a 32 bit access of some uint32_t value
  * a 64 bit access of some uint64_t value

exactly as a CPU hardware bus would do. It's
because the API is defined in this awkward way with
a uint8_t[] array that we need to specify how both
sides should go from the actual properties of the
memory transaction (value and size) to filling in the
array.


That is not how x86 hardware works.  Back when there was a bus, there 
were no address lines A0-A2; instead we had 8 byte enables BE0-BE7.  A 
memory transaction placed the qword address on the address lines and 
asserted the byte enables for the appropriate byte, word, dword, or 
qword, shifted for the low order bits of the address.


If you generated an unaligned access, the transaction was split into 
two, so an 8-byte write might appear as a 5-byte write followed by a 
3-byte write.  In fact, the two halves of the transaction might go to 
different devices, or one might go to a device and another to memory.


PCI works the same way.
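
(A little sketch of that byte-enable encoding; purely illustrative,
not from any real bus implementation:)

/* Byte enables asserted for a 'len'-byte access at 'addr': bit i
 * corresponds to BEi of the addressed qword. */
static unsigned char byte_enables(unsigned long addr, unsigned int len)
{
	return ((1u << len) - 1) << (addr & 7);
}

/* An 8-byte write at addr % 8 == 5 does not fit one qword, so it is
 * split: byte_enables(5, 3) == 0xe0, then byte_enables(8, 5) == 0x1f. */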





Furthermore, device endianness is entirely irrelevant
for deciding the properties of mmio.data[], because the
thing we're modelling here is essentially the CPU-bus
interface. In real hardware, the properties of individual
devices on the bus are irrelevant to how the CPU's
interface to the bus behaves, and similarly here the
properties of emulated devices don't affect how KVM's
interface to QEMU userspace needs to work.

MemoryRegion's 'endianness' field, incidentally, is
a dreadful mess that we should get rid of. It is attempting
to model the property that some buses/bridges have of
doing byte-lane-swaps on data that passes through as
a property of the device itself. It would be better if we
modelled it properly, with container regions having possible
byte-swapping and devices just being devices.



No, that is not what it is modelling.

Suppose a little endian cpu writes a dword 0x12345678 to address 0 of a 
device, and reads back a byte from address 0.  What value do you read back?


Some (most) devices will return 0x78, others will return 0x12. Other 
devices don't support mixed sizes at all, but many do.  PCI 
configuration space is an example; it is common to read both Device ID 
and Vendor ID with a single 32-bit transaction, but you can also read 
them separately with two 16-bit transactions.  Because PCI is 
little-endian, the Vendor ID at address 0 will be returned as the low 
word of a 32-bit read on a little-endian processor.
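
(Sketch of that PCI example as code; the vendor/device IDs 0x8086 and
0x100e are illustrative values, and config space is modelled simply as
a little-endian byte array:)

#include <stdint.h>
#include <stdio.h>

static const uint8_t cfg[4] = { 0x86, 0x80, 0x0e, 0x10 };

/* Compose a read of 'size' bytes at offset 'off'; PCI config space is
 * little-endian, so byte, word and dword reads agree with each other. */
static uint32_t cfg_read(unsigned int off, unsigned int size)
{
	uint32_t v = 0;

	for (unsigned int i = 0; i < size; i++)
		v |= (uint32_t)cfg[off + i] << (8 * i);
	return v;
}

int main(void)
{
	printf("vendor id, 16-bit read at 0: %04x\n", cfg_read(0, 2));
	printf("both ids, 32-bit read at 0:  %08x\n", cfg_read(0, 4));
	return 0;
}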


If you remove device endianness from memory regions, you have to pass 
the data as arrays of bytes (like the KVM interface) and let the device 
assemble words from those bytes itself, taking into consideration its 
own endianness.  What MemoryRegion's endianness does is let the device 
declare its endianness to the API and let it do all the work.



Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-28 Thread Avi Kivity

On 01/28/2014 01:27 AM, Benjamin Herrenschmidt wrote:

On Wed, 2014-01-22 at 17:29 +, Peter Maydell wrote:

Basically if it would be on real bus, get byte value
that corresponds to phys_addr + 0 address place
it into data[0], get byte value that corresponds to
phys_addr + 1 address place it into data[1], etc.

This just isn't how real buses work.

Actually it can be :-)


  There is no
address + 1, address + 2. There is a single address
for the memory transaction and a set of data on
data lines and some separate size information.
How the device at the far end of the bus chooses
to respond to 32 bit accesses to address X versus
8 bit accesses to addresses X through X+3 is entirely
its own business and unrelated to the CPU.

However the bus has a definition of what byte lane is the lowest in
address order. Byte order invariance is an important function of
all busses.

I think that trying to treat it any differently than an address
ordered series of bytes is going to turn into a complete and
inextricable mess.


I agree.

The two options are:

 (address, byte array, length)

and

 (address, value, word size, endianness)

the first is the KVM ABI, the second is how MemoryRegions work. Both are 
valid, but the first is more general (supports the 3-byte accesses 
sometimes generated on x86).
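
(Sketched as structs, with illustrative field names rather than the
actual kvm_run or MemoryRegion definitions:)

/* KVM-ABI style: an address-ordered bag of bytes. */
struct mmio_bytes {
	unsigned long long addr;
	unsigned char data[8];
	unsigned int len;		/* 1..8; odd sizes like 3 are fine */
};

/* MemoryRegion style: a sized, endian-tagged value. */
struct mmio_value {
	unsigned long long addr;
	unsigned long long value;
	unsigned int size;		/* 1, 2, 4 or 8 only */
	int big_endian;
};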






  (It would
be perfectly possible to have a device which when
you read from address X as 32 bits returned 0x12345678,
when you read from address X as 16 bits returned
0x9abc, returned 0x42 for an 8 bit read from X+1,
and so on. Having byte reads from X..X+3 return
values corresponding to parts of the 32 bit access
is purely a convention.)

Right, it's possible. It's also stupid and not how most modern devices
and busses work. Besides there is no reason why that can't be
implemented with Victor proposal anyway.


Right.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu()(Not consider the problem that Gleb pointed out 
temporarily) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).

  
Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get interrupt after step 2 on real HW,
but I cannot tell if guest can rely on it to be delivered exactly after
read instruction or it can be delayed by couple of instructions. Seems to me
it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.



Linux is safe, it does interrupt migration from within the interrupt 
handler.  If you do that before the device-specific EOI, you won't get 
another interrupt until programming the MSI is complete.
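
(Illustrative sketch of that pattern, with hypothetical helper names --
the point is only the ordering: reprogram the MSI before the EOI that
lets the device fire again:)

static irqreturn_t dev_handler_sketch(int irq, void *dev)
{
	if (migration_pending(irq))	/* hypothetical */
		reprogram_msi(irq);	/* safe: device cannot refire yet */
	do_device_work(dev);		/* hypothetical */
	device_eoi(dev);		/* from here the device may fire */
	return IRQ_HANDLED;
}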


Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code changes, 
consistently faster.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu()(Not consider the problem that Gleb pointed out 
temporarily) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get interrupt after step 2 on real HW,
but I cannot tell if guest can rely on it to be delivered exactly after
read instruction or it can be delayed by couple of instructions. Seems to me
it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?




Shouldn't it work with old userspace too? Maybe I misunderstood your intent.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 11:53 AM, Paolo Bonzini wrote:

On 28/11/2013 10:49, Avi Kivity wrote:

Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't get
another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code changes,
consistently faster.

call_rcu() has the problem of rate limiting, too.  It wasn't such a
great idea, I think.

The QRCU I linked would work great latency-wise (it has roughly the same
latency of an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.



I guess srcu would work.  Do you know what's the typical write-side 
latency there? (in terms of what it waits for, not nanoseconds).



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 12:40 PM, Paolo Bonzini wrote:

On 28/11/2013 11:16, Avi Kivity wrote:

The QRCU I linked would work great latency-wise (it has roughly the same
latency of an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.

I guess srcu would work.  Do you know what's the typical write-side
latency there? (in terms of what it waits for, not nanoseconds).

If there's no concurrent reader, it's zero if I read the code right.
Otherwise it depends:

- if there are many callbacks, only 10 of them are processed per
millisecond.  But unless there are concurrent synchronize_srcu calls
there should not be any callback at all.  If all VCPUs were to furiously
change the MSIs, the latency could go up to #vcpu/10 milliseconds.

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

So, given the very restricted usage this SRCU would have, we probably
can expect synchronize_srcu_expedited to finish its job in the
busy-looping phase, and 120 us should be the expected maximum
latency---more likely to be an order of magnitude smaller, and in very
rare cases higher.



Looks like it's a good fit then.
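
(For reference, a sketch of what the SRCU variant would look like; the
srcu_* primitives are the real API, irq_srcu and the surrounding code
are illustrative:)

DEFINE_SRCU(irq_srcu);

/* reader, on the delivery path: */
	struct kvm_irq_routing_table *irq_rt;
	int idx = srcu_read_lock(&irq_srcu);
	irq_rt = srcu_dereference(kvm->irq_routing, &irq_srcu);
	/* ... resolve the entry and deliver ... */
	srcu_read_unlock(&irq_srcu, idx);

/* writer, on routing update: */
	rcu_assign_pointer(kvm->irq_routing, new);
	synchronize_srcu_expedited(&irq_srcu);
	kfree(old);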


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:10 PM, Paolo Bonzini wrote:

On 28/11/2013 12:09, Gleb Natapov wrote:

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

Unless what ? :) Unless reader is scheduled out?

Yes.  Or unless my brain is scheduled out in the middle of a sentence.



You need a grace period.  Or just sleep().


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:02 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu()(Not consider the problem that Gleb pointed out 
temporarily) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get interrupt after step 2 on real HW,
but I cannot tell if guest can rely on it to be delivered exactly after
read instruction or it can be delayed by couple of instructions. Seems to me
it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?



Shouldn't it work with old userspace too? Maybe I misunderstood your intent.

Zhanghaoyu said that the problem mostly hurts in real-time telecom
environment, so I propose how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but kernel
fix will also require kernel update and usually updating userspace
is easier.




Isn't the latency due to interrupt migration causing long 
synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?


The problem occurs with assigned devices too AFAICT.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:22 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:

On 11/28/2013 01:02 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu()(Not consider the problem that Gleb pointed out 
temporarily) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get interrupt after step 2 on real HW,
but I cannot tell if guest can rely on it to be delivered exactly after
read instruction or it can be delayed by couple of instructions. Seems to me
it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?



Shouldn't it work with old userspace too? Maybe I misunderstood your intent.

Zhanghaoyu said that the problem mostly hurts in real-time telecom
environment, so I propose how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but kernel
fix will also require kernel update and usually updating userspace
is easier.



Isn't the latency due to interrupt migration causing long
synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?


If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in
the irq routing table, changing the MSI configuration should not cause an update to
the irq routing table (not saying this is what happens with current QEMU, but
theoretically there is no reason to update the routing table in this case).


I see.  That pushes the problem to userspace, which uses traditional 
locking, so the problem disappears until qemu starts using rcu too to 
manage this.


There is also irqfd, however.  We could also do a KVM_UPDATE_IRQFD to 
change the payload it delivers, but that has exactly the same problems.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:31 PM, Paolo Bonzini wrote:

On 28/11/2013 12:23, Gleb Natapov wrote:

Unless what ? :) Unless reader is scheduled out?

Yes.  Or unless my brain is scheduled out in the middle of a sentence.


So we will have to disable preemption in a reader to prevent big latencies for
a writer, no?

I don't think that's necessary.  The writer itself could also be
scheduled out, and the reader critical sections are really small.  Let's
wait for Zhang to try SRCU and report back.



I think readers have preemption disabled already in that context (irqfd 
writes).



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 04:46 PM, Paolo Bonzini wrote:

On 26/11/2013 15:36, Avi Kivity wrote:

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).


With synchronize_rcu(), you have the additional guarantee that any 
parallel accesses to the old routing table have completed.  Since we 
also trigger the irq from rcu context, you know that after 
synchronize_rcu() you won't get any interrupts to the old destination 
(see kvm_set_irq_inatomic()).


It's another question whether the hardware provides the same guarantee.


If you eliminate the synchronize_rcu, new interrupts would see the new
routing table, while interrupts already in flight will get a dangling
pointer.


Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
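
(A sketch of the call_rcu() variant; embedding the rcu_head in the
routing table and the callback are illustrative, not the actual KVM
structures:)

struct kvm_irq_routing_table {
	struct rcu_head rcu;		/* would need to be added */
	/* ... routing entries ... */
};

static void free_irq_routing(struct rcu_head *head)
{
	kfree(container_of(head, struct kvm_irq_routing_table, rcu));
}

/* update path: same as the snippet above, except the kfree moves into
 * the RCU callback:
 *
 *	mutex_lock(&kvm->irq_lock);
 *	old = kvm->irq_routing;
 *	kvm_irq_routing_update(kvm, new);
 *	mutex_unlock(&kvm->irq_lock);
 *	call_rcu(&old->rcu, free_irq_routing);
 */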


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:03 PM, Gleb Natapov wrote:

On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:

On 11/26/2013 04:46 PM, Paolo Bonzini wrote:

On 26/11/2013 15:36, Avi Kivity wrote:

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use outdated routing table while a vcpu
or iothread that performed table update sits in synchronize_rcu().



Consider this guest code:

  write msi entry, directing the interrupt away from this vcpu
  nop
  memset(idt, 0, sizeof(idt));

Currently, this code will never trigger a triple fault.  With the change 
to call_rcu(), it may.


Now it may be that the guest does not expect this to work (PCI writes 
are posted; and interrupts can be delayed indefinitely by the pci 
fabric), but we don't know if there's a path that guarantees the guest 
something that we're taking away with this change.






Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:20 PM, Paolo Bonzini wrote:

On 26/11/2013 16:03, Gleb Natapov wrote:

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use outdated routing table while a vcpu
or iothread that performed table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

 VCPU writes to routing table
e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
kvm_set_msi_irq(e, irq);
kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.


An alternative path is to convince ourselves that the hardware does not 
provide the guarantees that the current code provides, and so we can 
relax them.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:28 PM, Paolo Bonzini wrote:

On 26/11/2013 16:25, Avi Kivity wrote:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.



Why?


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:58 PM, Paolo Bonzini wrote:

Il 26/11/2013 16:35, Avi Kivity ha scritto:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.

Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.


It's indeed safe, but I think there's a nice win to be had if we drop 
the assumption.



(BTW, PCI memory writes are posted, but configuration writes are not).


MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush the 
write via a read.  But I think the fact that the interrupt messages 
themselves are posted proves that it is safe.  The fact that Linux does 
interrupt migration from within the interrupt handler also shows that 
someone else believes that it is the only safe place to do it.





Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote:

On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote:

On 11/26/2013 05:58 PM, Paolo Bonzini wrote:

Il 26/11/2013 16:35, Avi Kivity ha scritto:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.

Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.

It's indeed safe, but I think there's a nice win to be had if we
drop the assumption.

I'm not arguing with that, but a minor comment below:


(BTW, PCI memory writes are posted, but configuration writes are not).

MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush
the write via a read.  But I think the fact that the interrupt
messages themselves are posted proves that it is safe.

FYI, a PCI read flushes the interrupt itself in, too.



I guess that kills the optimization then.  Maybe you can do qrcu, 
whatever that is.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:24 PM, Gleb Natapov wrote:

On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:

Il 26/11/2013 16:03, Gleb Natapov ha scritto:

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a vcpu
or iothread that performed the table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

 VCPU writes to routing table
e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
kvm_set_msi_irq(e, irq);
kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
  e = entry from IRQ routing table
  kvm_set_msi_irq(e, irq);
  kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

   



Suppose the guest did not disable_irq() and enable_irq(), but instead 
had a pci read where you have the enable_irq().  After the read you 
cannot have a stale irq (assuming the read flushes the irq all the way 
to the APIC).




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:28 PM, Paolo Bonzini wrote:

Il 26/11/2013 17:24, Gleb Natapov ha scritto:

 VCPU writes to routing table
e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
kvm_set_msi_irq(e, irq);
kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
  e = entry from IRQ routing table
  kvm_set_msi_irq(e, irq);
  kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

Adding a disable/enable IRQs looks like a relatively big change.  But
perhaps it's not for some reason I'm missing.



Those are guest operations, which may not be there at all.


Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Avi Kivity
On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:

 On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

snip complicated stuff about parent_pte

I'm not really following, but note that parent_pte predates EPT (and
the use of rcu in kvm), so all the complexity that results from
trying to pack as many list entries as possible into a cache line can
be dropped.  Most setups now would have exactly one list entry, which
is handled specially anyway.

Alternatively, the trick of storing multiple entries in one list entry
can be moved to generic code, it may be useful to others.


Re: [PATCH] KVM: fix sil/dil/bpl/spl in the mod/rm fields

2013-06-03 Thread Avi Kivity
On Thu, May 30, 2013 at 7:34 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 30/05/2013 17:34, Paolo Bonzini ha scritto:
 Il 30/05/2013 16:35, Paolo Bonzini ha scritto:
 The x86-64 extended low-byte registers were fetched correctly from reg,
 but not from mod/rm.

 This fixes another bug in the boot of RHEL5.9 64-bit, but it is still
 not enough.

 Well, it is enough but it takes 2 minutes to reach the point where
 hardware virtualization is used.  It is doing a lot of stuff in
 emulation mode because FS and GS have leftovers from the A20 test:

 FS =   9300 DPL=0 DS16 [-WA]
 GS = 0000  9300 DPL=0 DS16 [-WA]

 0x000113be:  in     $0x92,%al
 0x000113c0:  or     $0x2,%al
 0x000113c2:  out    %al,$0x92
 0x000113c4:  xor    %ax,%ax
 0x000113c6:  mov    %ax,%fs
 0x000113c8:  dec    %ax
 0x000113c9:  mov    %ax,%gs
 0x000113cb:  inc    %ax
 0x000113cc:  mov    %ax,%fs:0x200
 0x000113d0:  cmp    %gs:0x210,%ax
 0x000113d5:  je     0x113cb

 The DPL vs. RPL consistency test fails.  Any ideas?  Should we introduce a new
 intermediate value for emulate_invalid_guest_state (0=none, 1=some, 2=full)?

 One idea could be to replace invalid descriptors with NULL ones.  Then
 you can intercept this in the #GP handler and trigger emulation for that
 instruction only.

Won't work, vmx won't let you enter in such a configuration.

Maybe you can detect the exact code sequence (%eip, some instructions,
register state) and clear %fs and %gs.


Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-26 Thread Avi Kivity
On Tue, Feb 26, 2013 at 10:12 AM, Gleb Natapov g...@redhat.com wrote:

 But I do not see how to implement it efficiently without an interface change.
 The idea is basically to register an ACK notifier for the RTC interrupt but
 terminate it in the kernel instead of reporting to userspace. The kernel
 should somehow know which GSI to register the ACK notifier for.

Right, my idea does not help at all.

 There are couple
 of solutions:
  1. New interface to tell what GSI to track.
  - Cons: KVM_IRQ_LINE_STATUS will stop working for old QEMUs that do
  not use it. New special purpose API.
  2. Register ACK notifier only for RTC.
  - Cons: KVM_IRQ_LINE_STATUS will work only for RTC. This is the only
  thing that use it currently, but such change will make it one
  trick pony officially.
  3. Register ACK notifiers for all edge interrupts in IOAPIC.
  - Cons: with APICv edge interrupt does not require EOI to do vmexit.
  Registering ACK notifiers for all of them will make them all
  do vmexit.
  4. Combinations of 1 and 3. Register ACK notifiers for all edge interrupts
 in IOAPIC and provide API to drop unneeded notifiers.
  - Cons: New special purpose API.


5. Make KVM_IRQ_LINE_STATUS register an ack notifier, document it as
more expensive than KVM_IRQ_LINE, and change qemu to use KVM_IRQ_LINE
when it is sufficient.

Pros: fully backwards compatible
Cons: to realize full performance, need to patch qemu
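
A rough sketch of what option 5 could look like on top of the existing
ack-notifier machinery. The tracker structure and the hook point are
illustrative assumptions, not code from this thread:

    struct line_status_tracker {
        struct kvm_irq_ack_notifier kian;
        bool acked;
    };

    static void line_status_acked(struct kvm_irq_ack_notifier *kian)
    {
        struct line_status_tracker *t =
            container_of(kian, struct line_status_tracker, kian);

        t->acked = true;  /* EOI seen: next injection is not coalesced */
    }

    /* On the first KVM_IRQ_LINE_STATUS for a given GSI: */
    static void track_gsi(struct kvm *kvm, struct line_status_tracker *t,
                          int gsi)
    {
        t->kian.gsi = gsi;
        t->kian.irq_acked = line_status_acked;
        kvm_register_irq_ack_notifier(kvm, &t->kian);
    }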


Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-25 Thread Avi Kivity

 I see a couple of possible solutions:
 1. Do what Avi said. Make KVM_IRQ_LINE_STATUS be synchronous. Cons:
 current QEMU uses KVM_IRQ_LINE_STATUS always and it means that it
 will be slow on newer kernels

You could backport the qemu change, verify that it builds, and push it
to stable branches.  It's hardly ideal but if nothing else comes up
then it's a solution.


 2. Make KVM_IRQ_LINE_STATUS report coalescing only when vcpu is not
 running during injection. This assumes that if vcpu is running and does
 not process interrupt it is guest fault and the same can happen on real
 HW too. Coalescing when vcpu is not running though is the result of CPU
 overcommit and should be reported. Cons interface definition is kind of
 murky.

You still have a window where the vcpu is scheduled out with guest
interrupts disabled, then scheduled back in and before it manages to
handle the interrupt, the second one hits.

It's worth testing though.

 3. Do not report KVM_IRQ_LINE_STATUS capability and move RTC to use EOI
 notifiers for interrupt reinjection. This requires us to add interface
 for reporting EOI to userspace. This is not in the scope of this
 patchset. Cons: need to introduce new interface (and the one that will
 not work on AMD BTW)

 Other ideas?

1. inject RTC interrupt
2. not see EOI
3. inject RTC interrupt
4. due to 2, report coalesced

AMD can still use IRR test-and-set.
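
In code, the four steps above reduce to a test-and-set on IRR. A simplified
sketch, assuming an apic_test_and_set_irr() style helper like the one in the
lapic code:

    static int inject_rtc_irq(struct kvm_lapic *apic, int vec)
    {
        /* steps 2+3: the vector still pending in IRR means no EOI was
         * seen between the two injections */
        if (apic_test_and_set_irr(vec, apic))
            return 0;       /* step 4: report coalesced */

        kvm_vcpu_kick(apic->vcpu);
        return 1;           /* delivered */
    }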


Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-25 Thread Avi Kivity
On Mon, Feb 25, 2013 at 7:43 PM, Gleb Natapov g...@redhat.com wrote:

  3. Do not report KVM_IRQ_LINE_STATUS capability and move RTC to use EOI
  notifiers for interrupt reinjection. This requires us to add interface
  for reporting EOI to userspace. This is not in the scope of this
  patchset. Cons: need to introduce new interface (and the one that will
  not work on AMD BTW)
 
  Other ideas?

 1. inject RTC interrupt
 2. not see EOI
 3. inject RTC interrupt
 4. due to 2, report coalesced

 That's the 3 in my list, no?


No, this idea doesn't involve user interface changes.  We still report
through KVM_IRQ_LINE_STATUS or whatever it's called.


Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-02-24 Thread Avi Kivity
On Sat, Feb 23, 2013 at 11:57 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2013-02-23 22:45, Nadav Har'El wrote:
 On Sat, Feb 23, 2013, Jan Kiszka wrote about [PATCH] KVM: nVMX: Replace 
 kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state:
-kvm_set_cr0(vcpu, vmcs12->host_cr0);
+vmx_set_cr0(vcpu, vmcs12->host_cr0);

 I don't remember now why I did this (and I'm not looking at the code),
 but this you'll need to really test carefully, including
 shadow-on-shadow mode (ept=0 in L0), to verify you're not missing any
 important side-effect of kvm_set_cr0.

 Also, if I remember correctly, during nVMX's review, Avi Kivity asked
 in several places that when I called vmx_set_cr0, I should instead call
 kvm_set_cr0(), because it does some extra stuff and does some extra
 checks. Hmm, see, see this:
   http://markmail.org/message/hhidqyhbo2mrgxxc

 where Avi asked for the reverse patch you're attempting now.

 At least, kvm_set_cr0 can't be used as it assumes an otherwise
 consistent guest state and an explicitly initiated transition - which is
 naturally not the case while emulating a vmexit.

We have the same problem in KVM_SET_SREGS.


Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-02-24 Thread Avi Kivity
On Sun, Feb 24, 2013 at 11:01 AM, Jan Kiszka jan.kis...@web.de wrote:
 On 2013-02-24 09:56, Avi Kivity wrote:
 On Sat, Feb 23, 2013 at 11:57 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2013-02-23 22:45, Nadav Har'El wrote:
 On Sat, Feb 23, 2013, Jan Kiszka wrote about [PATCH] KVM: nVMX: Replace 
 kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state:
-kvm_set_cr0(vcpu, vmcs12->host_cr0);
+vmx_set_cr0(vcpu, vmcs12->host_cr0);

 I don't remember now why I did this (and I'm not looking at the code),
 but this you'll need to really test carefully, including
 shadow-on-shadow mode (ept=0 in L0), to verify you're not missing any
 important side-effect of kvm_set_cr0.

 Also, if I remember correctly, during nVMX's review, Avi Kivity asked
 in several places that when I called vmx_set_cr0, I should instead call
 kvm_set_cr0(), because it does some extra stuff and does some extra
 checks. Hmm, see, see this:
   http://markmail.org/message/hhidqyhbo2mrgxxc

 where Avi asked for the reverse patch you're attempting now.

 At least, kvm_set_cr0 can't be used as it assumes an otherwise
 consistent guest state and an explicitly initiated transition - which is
 naturally not the case while emulating a vmexit.

 We have the same problem in KVM_SET_SREGS.

 I don't see the problem. kvm_arch_vcpu_ioctl_set_sregs open-codes the
 state update, not applying any transition checks.

That's the problem.  We have this open coding in three different
places (KVM_SET_SREGS, nvmx, nsvm).

It's not as if vmx_set_cr0() is defined as kvm_set_cr0() without the
transition checks.


Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-02-24 Thread Avi Kivity
On Sun, Feb 24, 2013 at 11:40 AM, Jan Kiszka jan.kis...@web.de wrote:
 We have the same problem in KVM_SET_SREGS.

 I don't see the problem. kvm_arch_vcpu_ioctl_set_sregs open-codes the
 state update, not applying any transition checks.

 That's the problem.  We have this open coding in three different
 places (KVM_SET_SREGS, nvmx, nsvm).

 It's not as if vmx_set_cr0() is defined as kvm_set_cr0() without the
 transition checks.

 ...and without mmu updates. The latter is done via or after the closing
 cr3 update. Interestingly, nsvm does not perform kvm_set_cr3 on vmexit
 when in npt mode. Seems things aren't that regular.

We do want the mmu updates.  Of course they can't be attached to
kvm_set_cr0_without_the_checks() since there are cross-register
dependencies.

Option 1 is kvm_set_multiple_cr().  This does the checks and updates,
but only after all registers are updated.
Option 2 is kvm_begin_cr_transaction()/kvm_commit_cr_transaction().
Prettier and more flexible, but more complicated to implement.
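
Option 1 might look something like this. The interface is hypothetical;
cr_state_is_consistent() and the *_nocheck() setters stand in for code that
does not exist:

    struct kvm_cr_set {
        unsigned long cr0, cr3, cr4;
        u64 efer;
    };

    static int kvm_set_multiple_cr(struct kvm_vcpu *vcpu,
                                   const struct kvm_cr_set *crs)
    {
        /* validate the final state, not each intermediate one */
        if (!cr_state_is_consistent(vcpu, crs))
            return 1;

        vmx_set_cr0_nocheck(vcpu, crs->cr0);
        vmx_set_cr4_nocheck(vcpu, crs->cr4);
        kvm_set_cr3_nocheck(vcpu, crs->cr3);
        kvm_mmu_reset_context(vcpu);
        return 0;
    }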


Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-02-24 Thread Avi Kivity
On Sun, Feb 24, 2013 at 12:49 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2013-02-24 11:11, Avi Kivity wrote:
 On Sun, Feb 24, 2013 at 11:40 AM, Jan Kiszka jan.kis...@web.de wrote:
 We have the same problem in KVM_SET_SREGS.

 I don't see the problem. kvm_arch_vcpu_ioctl_set_sregs open-codes the
 state update, not applying any transition checks.

 That's the problem.  We have this open coding in three different
 places (KVM_SET_SREGS, nvmx, nsvm).

 It's not as if vmx_set_cr0() is defined as kvm_set_cr0() without the
 transition checks.

 ...and without mmu updates. The latter is done via or after the closing
 cr3 update. Interestingly, nsvm does not perform kvm_set_cr3 on vmexit
 when in npt mode. Seems things aren't that regular.

 We do want the mmu updates.  Of course they can't be attached to
 kvm_set_cr0_without_the_checks() since there is cross-register
 dependencies.

 Option 1 is kvm_set_multiple_cr().  This does the checks and updates,
 but only after all registers are updated.
 Option 2 is kvm_begin_cr_transaction()/kvm_commit_cr_transaction().
 Prettier and more flexible, but a more complicated to implement.


 The only thing that these three use case truly have in common is the
 closing kvm_mmu_reset_context. Maybe nvmx and nsvm can share a bit more.
 But let's get nVMX right first, then think about sharing.

They all need consistency checks, otherwise userspace or the guest can
inject inconsistent values and perhaps exploit the host.


Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-24 Thread Avi Kivity
I didn't really follow, but is the root cause the need to keep track
of interrupt coalescing?  If so we can recommend that users use
KVM_IRQ_LINE when coalescing is unneeded, and move interrupt injection
with irq coalescing support to vcpu context.

It's not pleasant to cause a performance regression, so it should be a
last resort (or maybe do both if it helps).

On Sun, Feb 24, 2013 at 8:08 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Sun, Feb 24, 2013 at 04:19:17PM +0200, Gleb Natapov wrote:
 On Sun, Feb 24, 2013 at 01:55:07PM +, Zhang, Yang Z wrote:
   I do not think it fixes it. There is no guarantee that the IPI will be
   processed by the remote cpu while the sending cpu is still in the locked
   section, so the same race may happen regardless. As you say above there
   are 3 contexts, but only two use locks.
  See the following logic; I think the problem you mentioned will not happen
  with the current logic.
 
  get lock
  if test_pir (this will ensure there is no in-flight IPI for the same
  interrupt. Since we are taking the lock now, no IPI will be sent before we
  release the lock and no pir->irr transfer is performed by hardware for the
  same interrupt.)

 Does the PIR IPI interrupt return synchronously _after_ the PIR->IRR
 transfer is made? I don't think so.

 The PIR IPI interrupt returns after the remote CPU has acked interrupt
 receipt, not after the remote CPU has acked _and_ performed the PIR->IRR
 transfer.

 Right? (yes, right, see step 4. below).

 Should try to make it easier to drop the lock later, so depend on it as
 little as possible (also please document what it protects in detail).

 Also note:

 
 3. The processor clears the outstanding-notification bit in the
 posted-interrupt descriptor. This is done atomically
 so as to leave the remainder of the descriptor unmodified (e.g., with a
 locked AND operation).
 4. The processor writes zero to the EOI register in the local APIC; this
 dismisses the interrupt with the posted-interrupt notification vector from
 the local APIC.
 5. The logical processor performs a logical-OR of PIR into VIRR and
 clears PIR. No other agent can read or write a
 PIR bit (or group of bits) between the time it is read (to determine
 what to OR into VIRR) and when it is cleared.
 

 So checking the ON bit does not mean the HW has finished performing the
 PIR->VIRR transfer (which means the ON bit can only be used as an indication
 of whether to send the interrupt, not of whether the PIR->VIRR transfer is
 finished).

 So it's fine to do

 - atomic set pir
 - if (atomic_test_and_set(PIR ON bit))
 -  send IPI

 But HW can transfer two distinct bits, set by different serialized instances
 of kvm_set_irq (protected with a lock), in a single PIR->IRR pass.

 I do not see where those assumptions are coming from. Testing pir does
 not guarantee that the IPI is not being processed by the VCPU right now.

 iothread:   vcpu:
 send_irq()
 lock(pir)
 check pir and irr
 set pir
 send IPI (*)
 unlock(pir)

 send_irq()
 lock(pir)
  receive IPI (*)
  atomic {
pir_tmp = pir
pir = 0
  }
 check pir and irr
  irr = pir_tmp
 set pir
 send IPI
 unlock(pir)

 At this point both pir and irr are set and interrupt may be coalesced,
 but it is reported as delivered.

 s/set pir/injected = !t_a_s(pir)/

 So what prevents the scenario above from happening?

 --
   Gleb.
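
For reference, the sender-side ordering Marcelo sketches above (atomic set of
PIR, then test-and-set of the ON bit, then IPI) can be modelled like this.
An illustrative C11 sketch, not the KVM posted-interrupt code:

    #include <stdatomic.h>
    #include <stdbool.h>

    struct pi_desc {
        _Atomic unsigned long pir[4];   /* 256 pending-vector bits */
        atomic_bool on;                 /* outstanding-notification bit */
    };

    /* Returns true if the caller must send the notification IPI. */
    static bool post_interrupt(struct pi_desc *pi, int vec)
    {
        atomic_fetch_or(&pi->pir[vec / 64], 1UL << (vec % 64));

        /* If ON was already set, an IPI is in flight and is assumed to
         * pick the new bit up.  Gleb's interleaving shows why that
         * assumption fails: the CPU may have already copied PIR and be
         * about to clear ON, so the new bit rides the *next* pass. */
        return !atomic_exchange(&pi->on, true);
    }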


Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-02-24 Thread Avi Kivity
On Sun, Feb 24, 2013 at 9:15 PM, Jan Kiszka jan.kis...@web.de wrote:

 They all need consistency checks, otherwise userspace or the guest and
 inject inconsistent values and perhaps exploit the host.

 To my understanding, the hardware does this for us: If we try to enter
 the guest (L1, L2) with invalid CRx bits set or cleared, we get an
 error, at least on Intel. But I bet AMD does so as well - and, if not,
 it would make this test specific again.

The point is that kvm code may depend on this consistency, so we fail
before the hardware has made the check.


Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-21 Thread Avi Kivity
On Thu, Feb 21, 2013 at 10:58 AM, Zhang, Yang Z yang.z.zh...@intel.com wrote:
 Thanks. Here is code after changing, please review it:

 asm(
     "mov %0, %%" _ASM_DX " \n\t"
 #ifdef CONFIG_X86_64
     "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t"
     "and $0xfffffffffffffff0, %%" _ASM_SP " \n\t"
     "mov %%ss, %%" _ASM_AX " \n\t"
     "push %%" _ASM_AX " \n\t"
     "push %%" _ASM_BX " \n\t"
 #endif
     "pushf \n\t"
     "orl $0x200, (%%" _ASM_SP ") \n\t"
     __ASM_SIZE(push) " %c[cs] \n\t"
     "call *%%" _ASM_DX " \n\t"
     : : "m"(entry), [cs]"i"(__KERNEL_CS) :
 #ifdef CONFIG_X86_64
     "rax", "rbx", "rdx"
 #else
     "edx"
 #endif
     );

For cleanliness, you can provide more parameters via the constraints:

  "m"(entry) -> [entry]"r"(entry) (and then use "call *%[entry]")
  use "push %c[ss]" for ss
  and [sp]"=&r"(tmp) instead of rdx/edx as the temporary for rsp
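
Put together, the three suggestions would give something like the following
untested sketch ('tmp' as the scratch register and __KERNEL_DS as the ss
value are assumptions):

    asm volatile(
    #ifdef CONFIG_X86_64
        "mov %%" _ASM_SP ", %[sp] \n\t"
        "and $0xfffffffffffffff0, %%" _ASM_SP " \n\t"
        "push $%c[ss] \n\t"         /* known kernel SS, as an immediate */
        "push %[sp] \n\t"
    #endif
        "pushf \n\t"
        "orl $0x200, (%%" _ASM_SP ") \n\t"  /* set IF on the saved flags */
        __ASM_SIZE(push) " $%c[cs] \n\t"
        "call *%[entry] \n\t"       /* the constraint supplies the target */
        :
    #ifdef CONFIG_X86_64
        [sp] "=&r" (tmp)
    #endif
        : [entry] "r" (entry),
          [ss] "i" (__KERNEL_DS),
          [cs] "i" (__KERNEL_CS)
        : "cc", "memory");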


Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-20 Thread Avi Kivity
On Wed, Feb 20, 2013 at 4:46 AM, Zhang, Yang Z yang.z.zh...@intel.com wrote:

  +static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
  +{
  +   u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
  +
  +   /*
  +    * If external interrupt exists, IF bit is set in rflags/eflags on the
  +    * interrupt stack frame, and interrupt will be enabled on a return
  +    * from interrupt handler.
  +    */
  +   if ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK))
  +         == (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) {
  +       unsigned int vector;
  +       unsigned long entry;
  +       gate_desc *desc;
  +       struct vcpu_vmx *vmx = to_vmx(vcpu);
  +
  +       vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
  +#ifdef CONFIG_X86_64
  +       desc = (void *)vmx->host_idt_base + vector * 16;
  +#else
  +       desc = (void *)vmx->host_idt_base + vector * 8;
  +#endif
  +
  +       entry = gate_offset(*desc);
  +       asm(
  +           "mov %0, %%" _ASM_DX " \n\t"
  +#ifdef CONFIG_X86_64
  +           "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t"
  +           "and $0xfffffffffffffff0, %%" _ASM_SP " \n\t"
  +           "mov %%ss, %%" _ASM_AX " \n\t"
  +           "push %%" _ASM_AX " \n\t"
  +           "push %%" _ASM_BX " \n\t"
  +#endif

 Are we sure no interrupts are using the IST feature?  I guess it's unlikely.
 Linux uses IST for NMI, stack fault, machine-check, double fault and the
 debug interrupt.  No external interrupt will use it.
 This feature is only for external interrupts, so we don't need to check it
 here.

Ok, thanks for checking.



 +           "pushf \n\t"
 +           "pop %%" _ASM_AX " \n\t"
 +           "or $0x200, %%" _ASM_AX " \n\t"
 +           "push %%" _ASM_AX " \n\t"

 Can simplify to "pushf; orl $0x200, (%%rsp)".
 Sure.

 +           "mov %%cs, %%" _ASM_AX " \n\t"
 +           "push %%" _ASM_AX " \n\t"

 push %%cs
 push %%cs is invalid in x86_64.

Oops. 'push[lq] $__KERNEL_CS' then.


 +           "push intr_return \n\t"

 push $1f.  Or even combine with the next instruction, and call %rdx.
 Which is faster? jmp or call?

Actually it doesn't matter, the processor is clever enough to minimize
the difference.  But the code is simpler and shorter with 'call'.


Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-20 Thread Avi Kivity
On Wed, Feb 20, 2013 at 3:10 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote:

 push %%cs
 push %%cs is invalid in x86_64.

 Oops. 'push[lq] $__KERNEL_CS' then.
 Is this right? I just copied it from another file.

 #define __STR(X) #X
 #define STR(X) __STR(X)

 #ifdef CONFIG_X86_64
 "pushq $" STR(__KERNEL_CS) " \n\t"
 #else
 "pushl $" STR(__KERNEL_CS) " \n\t"
 #endif

 #undef STR
 #undef __STR


Use __ASM_SIZE and an immediate operand for __KERNEL_CS:

 asm ( ... : : [cs]"i"(__KERNEL_CS) );

and the code will be cleaner.


Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-19 Thread Avi Kivity
On Tue, Feb 19, 2013 at 3:39 PM, Yang Zhang yang.z.zh...@intel.com wrote:
 From: Yang Zhang yang.z.zh...@intel.com

 The acknowledge interrupt on exit feature controls processor behavior
 for external interrupt acknowledgement. When this control is set, the
 processor acknowledges the interrupt controller to acquire the
 interrupt vector on VM exit.

 After enabling this feature, an interrupt which arrived when the target cpu
 is running in vmx non-root mode will be handled by the vmx handler instead
 of the handler in the idt. Currently, the vmx handler only fakes an
 interrupt stack and jumps to the idt table to let the real handler handle
 it. Further, we will recognize the interrupt and only deliver, through the
 idt table, interrupts that do not belong to the current vcpu. Interrupts
 that belong to the current vcpu will be handled inside the vmx handler.
 This will reduce the interrupt handling cost of KVM.

 Also, the interrupt enable logic is changed if this feature is turned on:
 before this patch, the hypervisor called local_irq_enable() to enable
 interrupts directly. Now the IF bit is set on the interrupt stack frame,
 and interrupts will be enabled on a return from the interrupt handler if an
 external interrupt exists. If there is no external interrupt, we still call
 local_irq_enable().

 Refer to Intel SDM volume 3, chapter 33.2.


 +static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
 +{
 +   u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 +
 +   /*
 +    * If external interrupt exists, IF bit is set in rflags/eflags on the
 +    * interrupt stack frame, and interrupt will be enabled on a return
 +    * from interrupt handler.
 +    */
 +   if ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK))
 +       == (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) {
 +       unsigned int vector;
 +       unsigned long entry;
 +       gate_desc *desc;
 +       struct vcpu_vmx *vmx = to_vmx(vcpu);
 +
 +       vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
 +#ifdef CONFIG_X86_64
 +       desc = (void *)vmx->host_idt_base + vector * 16;
 +#else
 +       desc = (void *)vmx->host_idt_base + vector * 8;
 +#endif
 +
 +       entry = gate_offset(*desc);
 +       asm(
 +           "mov %0, %%" _ASM_DX " \n\t"
 +#ifdef CONFIG_X86_64
 +           "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t"
 +           "and $0xfffffffffffffff0, %%" _ASM_SP " \n\t"
 +           "mov %%ss, %%" _ASM_AX " \n\t"
 +           "push %%" _ASM_AX " \n\t"
 +           "push %%" _ASM_BX " \n\t"
 +#endif

Are we sure no interrupts are using the IST feature?  I guess it's unlikely.

 +           "pushf \n\t"
 +           "pop %%" _ASM_AX " \n\t"
 +           "or $0x200, %%" _ASM_AX " \n\t"
 +           "push %%" _ASM_AX " \n\t"

Can simplify to pushf; orl $0x200, %%rsp.

 +           "mov %%cs, %%" _ASM_AX " \n\t"
 +           "push %%" _ASM_AX " \n\t"

push %%cs

 +           "push intr_return \n\t"

push $1f.  Or even combine with the next instruction, and call %rdx.

 +           "jmp *%%" _ASM_DX " \n\t"
 +           "1: \n\t"
 +           ".pushsection .rodata \n\t"
 +           ".global intr_return \n\t"
 +           "intr_return: " _ASM_PTR " 1b \n\t"
 +           ".popsection \n\t"
 +           : : "m"(entry) :
 +#ifdef CONFIG_X86_64
 +           "rax", "rbx", "rdx"
 +#else
 +           "eax", "edx"
 +#endif
 +           );
 +   } else
 +       local_irq_enable();
 +}
 +


[PATCH 0/8] Convert MUL/DIV to fastop

2013-02-09 Thread Avi Kivity
This patchset converts MUL/DIV to fastop.  We depart from the plan, which
was to encode rdx:rax as a single operand, and instead encode it as two
separate operands.  This reduces register pressure on i386, and is simpler
besides.

As a bonus, XADD is converted as well.

The series results in a nice code size reduction:

  60147   0   0   60147eaf3 arch/x86/kvm/emulate.o.before
  56899   0   0   56899de43 arch/x86/kvm/emulate.o.after


Avi Kivity (8):
  KVM: x86 emulator: add support for writing back the source operand
  KVM: x86 emulator: decode extended accumulator explicity
  KVM: x86 emulator: switch MUL/DIV to DstXacc
  KVM: x86 emulator: Switch fastop src operand to RDX
  KVM: x86 emulator: convert single-operand MUL/IMUL to fastop
  KVM: x86 emulator: convert DIV/IDIV to fastop
  KVM: x86 emulator: drop unused old-style inline emulation
  KVM: x86 emulator: convert XADD to fastop

 arch/x86/kvm/emulate.c | 388 ++---
 1 file changed, 106 insertions(+), 282 deletions(-)

-- 
1.8.1.2



[PATCH 1/8] KVM: x86 emulator: add support for writing back the source operand

2013-02-09 Thread Avi Kivity
Some instructions write back the source operand, not just the destination.
Add support for doing this via the decode flags.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 47 ++-
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 2b11318..18c86b5 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -152,6 +152,7 @@
 #define Avx         ((u64)1 << 43)  /* Advanced Vector Extensions */
 #define Fastop      ((u64)1 << 44)  /* Use opcode::u.fastop */
 #define NoWrite     ((u64)1 << 45)  /* No writeback */
+#define SrcWrite    ((u64)1 << 46)  /* Write back src operand */
 
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
@@ -1708,45 +1709,42 @@ static void write_register_operand(struct operand *op)
}
 }
 
-static int writeback(struct x86_emulate_ctxt *ctxt)
+static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op)
 {
 	int rc;
 
-	if (ctxt->d & NoWrite)
-		return X86EMUL_CONTINUE;
-
-	switch (ctxt->dst.type) {
+	switch (op->type) {
 	case OP_REG:
-		write_register_operand(&ctxt->dst);
+		write_register_operand(op);
 		break;
 	case OP_MEM:
 		if (ctxt->lock_prefix)
 			rc = segmented_cmpxchg(ctxt,
-					       ctxt->dst.addr.mem,
-					       &ctxt->dst.orig_val,
-					       &ctxt->dst.val,
-					       ctxt->dst.bytes);
+					       op->addr.mem,
+					       &op->orig_val,
+					       &op->val,
+					       op->bytes);
 		else
 			rc = segmented_write(ctxt,
-					     ctxt->dst.addr.mem,
-					     &ctxt->dst.val,
-					     ctxt->dst.bytes);
+					     op->addr.mem,
+					     &op->val,
+					     op->bytes);
 		if (rc != X86EMUL_CONTINUE)
 			return rc;
 		break;
 	case OP_MEM_STR:
 		rc = segmented_write(ctxt,
-				     ctxt->dst.addr.mem,
-				     ctxt->dst.data,
-				     ctxt->dst.bytes * ctxt->dst.count);
+				     op->addr.mem,
+				     op->data,
+				     op->bytes * op->count);
 		if (rc != X86EMUL_CONTINUE)
 			return rc;
 		break;
 	case OP_XMM:
-		write_sse_reg(ctxt, &ctxt->dst.vec_val, ctxt->dst.addr.xmm);
+		write_sse_reg(ctxt, &op->vec_val, op->addr.xmm);
 		break;
 	case OP_MM:
-		write_mmx_reg(ctxt, &ctxt->dst.mm_val, ctxt->dst.addr.mm);
+		write_mmx_reg(ctxt, &op->mm_val, op->addr.mm);
 		break;
 	case OP_NONE:
 		/* no writeback */
@@ -4717,9 +4715,16 @@ special_insn:
 		goto done;
 
 writeback:
-	rc = writeback(ctxt);
-	if (rc != X86EMUL_CONTINUE)
-		goto done;
+	if (!(ctxt->d & NoWrite)) {
+		rc = writeback(ctxt, &ctxt->dst);
+		if (rc != X86EMUL_CONTINUE)
+			goto done;
+	}
+	if (ctxt->d & SrcWrite) {
+		rc = writeback(ctxt, &ctxt->src);
+		if (rc != X86EMUL_CONTINUE)
+			goto done;
+	}
 
/*
 * restore dst type in case the decoding will be reused
-- 
1.8.1.2



[PATCH 3/8] KVM: x86 emulator: switch MUL/DIV to DstXacc

2013-02-09 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index aa8516e..d51f6f4 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -139,6 +139,7 @@
 /* Source 2 operand type */
 #define Src2Shift   (30)
 #define Src2None    (OpNone << Src2Shift)
+#define Src2Mem     (OpMem << Src2Shift)
 #define Src2CL      (OpCL << Src2Shift)
 #define Src2ImmByte (OpImmByte << Src2Shift)
 #define Src2One     (OpOne << Src2Shift)
@@ -542,8 +543,8 @@ FOP_END;
 #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex)		\
 	do {							\
 		unsigned long _tmp;				\
-		ulong *rax = reg_rmw((ctxt), VCPU_REGS_RAX);	\
-		ulong *rdx = reg_rmw((ctxt), VCPU_REGS_RDX);	\
+		ulong *rax = &ctxt->dst.val;			\
+		ulong *rdx = &ctxt->src.val;			\
 								\
 		__asm__ __volatile__ (				\
 			_PRE_EFLAGS("0", "5", "1")		\
@@ -558,7 +559,7 @@ FOP_END;
 			_ASM_EXTABLE(1b, 3b)			\
 			: "=m" ((ctxt)->eflags), "=&r" (_tmp),	\
 			  "+a" (*rax), "+d" (*rdx), "+qm"(_ex)	\
-			: "i" (EFLAGS_MASK), "m" ((ctxt)->src.val)); \
+			: "i" (EFLAGS_MASK), "m" ((ctxt)->src2.val)); \
 	} while (0)
 
 /* instruction has only one source operand, destination is implicit (e.g. mul, 
div, imul, idiv) */
@@ -3701,10 +3702,10 @@ static const struct opcode group3[] = {
F(DstMem | SrcImm | NoWrite, em_test),
F(DstMem | SrcNone | Lock, em_not),
F(DstMem | SrcNone | Lock, em_neg),
-   I(SrcMem, em_mul_ex),
-   I(SrcMem, em_imul_ex),
-   I(SrcMem, em_div_ex),
-   I(SrcMem, em_idiv_ex),
+   I(DstXacc | Src2Mem, em_mul_ex),
+   I(DstXacc | Src2Mem, em_imul_ex),
+   I(DstXacc | Src2Mem, em_div_ex),
+   I(DstXacc | Src2Mem, em_idiv_ex),
 };
 
 static const struct opcode group4[] = {
-- 
1.8.1.2



[PATCH 6/8] KVM: x86 emulator: convert DIV/IDIV to fastop

2013-02-09 Thread Avi Kivity
Since DIV and IDIV can generate exceptions, we need an additional output
parameter indicating whether an exception has occurred.  To avoid increasing
register pressure on i386, we use %rsi, which is already allocated for
the fastop code pointer.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 51 +-
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0f0c15e..89f56bb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -180,6 +180,7 @@
  * src:rdx(in/out)
  * src2:   rcx(in)
  * flags:  rflags (in/out)
+ * ex: rsi(in:nonzero, out:zero if exception)
  *
  * Moreover, they are all exactly FASTOP_SIZE bytes long, so functions for
  * different operand sizes can be reached by calculation, rather than a jump
@@ -467,7 +468,10 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *));
 #define FOPNOP() FOP_ALIGN FOP_RET
 
 #define FOP1E(op,  dst) \
-	FOP_ALIGN #op " %" #dst " \n\t" FOP_RET
+	FOP_ALIGN "10: " #op " %" #dst " \n\t" FOP_RET
+
+#define FOP1EEX(op,  dst) \
+	FOP1E(op, dst) _ASM_EXTABLE(10b, kvm_fastop_exception)
 
 #define FASTOP1(op) \
FOP_START(op) \
@@ -486,6 +490,15 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *));
ON64(FOP1E(op, rcx)) \
FOP_END
 
+/* 1-operand, using src2 (for MUL/DIV r/m), with exceptions */
+#define FASTOP1SRC2EX(op, name) \
+   FOP_START(name) \
+   FOP1EEX(op, cl) \
+   FOP1EEX(op, cx) \
+   FOP1EEX(op, ecx) \
+   ON64(FOP1EEX(op, rcx)) \
+   FOP_END
+
 #define FOP2E(op,  dst, src)  \
 	FOP_ALIGN #op " %" #src ", %" #dst " \n\t" FOP_RET
 
@@ -530,6 +543,9 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *));
 /* Special case for SETcc - 1 instruction per cc */
 #define FOP_SETCC(op) ".align 4; " #op " %al; ret \n\t"
 
+asm(".global kvm_fastop_exception \n"
+    "kvm_fastop_exception: xor %esi, %esi; ret");
+
 FOP_START(setcc)
 FOP_SETCC(seto)
 FOP_SETCC(setno)
@@ -1001,6 +1017,8 @@ FASTOP2(test);
 
 FASTOP1SRC2(mul, mul_ex);
 FASTOP1SRC2(imul, imul_ex);
+FASTOP1SRC2EX(div, div_ex);
+FASTOP1SRC2EX(idiv, idiv_ex);
 
 FASTOP3WCL(shld);
 FASTOP3WCL(shrd);
@@ -2116,26 +2134,6 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_div_ex(struct x86_emulate_ctxt *ctxt)
-{
-	u8 de = 0;
-
-	emulate_1op_rax_rdx(ctxt, "div", de);
-	if (de)
-		return emulate_de(ctxt);
-	return X86EMUL_CONTINUE;
-}
-
-static int em_idiv_ex(struct x86_emulate_ctxt *ctxt)
-{
-	u8 de = 0;
-
-	emulate_1op_rax_rdx(ctxt, "idiv", de);
-	if (de)
-		return emulate_de(ctxt);
-	return X86EMUL_CONTINUE;
-}
-
 static int em_grp45(struct x86_emulate_ctxt *ctxt)
 {
int rc = X86EMUL_CONTINUE;
@@ -3700,8 +3698,8 @@ static const struct opcode group3[] = {
F(DstMem | SrcNone | Lock, em_neg),
F(DstXacc | Src2Mem, em_mul_ex),
F(DstXacc | Src2Mem, em_imul_ex),
-   I(DstXacc | Src2Mem, em_div_ex),
-   I(DstXacc | Src2Mem, em_idiv_ex),
+   F(DstXacc | Src2Mem, em_div_ex),
+   F(DstXacc | Src2Mem, em_idiv_ex),
 };
 
 static const struct opcode group4[] = {
@@ -4518,9 +4516,12 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *))
 	if (!(ctxt->d & ByteOp))
 		fop += __ffs(ctxt->dst.bytes) * FASTOP_SIZE;
 	asm("push %[flags]; popf; call *%[fastop]; pushf; pop %[flags]\n"
-	    : "+a"(ctxt->dst.val), "+d"(ctxt->src.val), [flags]"+D"(flags)
-	    : "c"(ctxt->src2.val), [fastop]"S"(fop));
+	    : "+a"(ctxt->dst.val), "+d"(ctxt->src.val), [flags]"+D"(flags),
+	      [fastop]"+S"(fop)
+	    : "c"(ctxt->src2.val));
 	ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
+	if (!fop)
+		return emulate_de(ctxt);
 	return X86EMUL_CONTINUE;
 }
 
-- 
1.8.1.2



[PATCH 8/8] KVM: x86 emulator: convert XADD to fastop

2013-02-09 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a706e52..2f895c2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -847,6 +847,8 @@ FASTOP2W(bts);
 FASTOP2W(btr);
 FASTOP2W(btc);
 
+FASTOP2(xadd);
+
 static u8 test_cc(unsigned int condition, unsigned long flags)
 {
u8 rc;
@@ -3824,7 +3826,7 @@ static const struct opcode twobyte_table[256] = {
F(DstReg | SrcMem | ModRM, em_bsf), F(DstReg | SrcMem | ModRM, em_bsr),
D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
/* 0xC0 - 0xC7 */
-   D2bv(DstMem | SrcReg | ModRM | Lock),
+   F2bv(DstMem | SrcReg | ModRM | SrcWrite | Lock, em_xadd),
N, D(DstMem | SrcReg | ModRM | Mov),
N, N, N, GD(0, group9),
/* 0xC8 - 0xCF */
@@ -4643,12 +4645,6 @@ twobyte_insn:
 		ctxt->dst.val = (ctxt->src.bytes == 1) ? (s8) ctxt->src.val :
 							 (s16) ctxt->src.val;
 		break;
-	case 0xc0 ... 0xc1:	/* xadd */
-		fastop(ctxt, em_add);
-		/* Write back the register source. */
-		ctxt->src.val = ctxt->dst.orig_val;
-		write_register_operand(&ctxt->src);
-		break;
 	case 0xc3:		/* movnti */
 		ctxt->dst.bytes = ctxt->op_bytes;
 		ctxt->dst.val = (ctxt->op_bytes == 4) ? (u32) ctxt->src.val :
-- 
1.8.1.2



[PATCH 5/8] KVM: x86 emulator: convert single-operand MUL/IMUL to fastop

2013-02-09 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 35 ---
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index fe91e70..0f0c15e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -477,6 +477,15 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *));
ON64(FOP1E(op##q, rax)) \
FOP_END
 
+/* 1-operand, using src2 (for MUL/DIV r/m) */
+#define FASTOP1SRC2(op, name) \
+   FOP_START(name) \
+   FOP1E(op, cl) \
+   FOP1E(op, cx) \
+   FOP1E(op, ecx) \
+   ON64(FOP1E(op, rcx)) \
+   FOP_END
+
 #define FOP2E(op,  dst, src)  \
 	FOP_ALIGN #op " %" #src ", %" #dst " \n\t" FOP_RET
 
@@ -990,6 +999,9 @@ FASTOP2(xor);
 FASTOP2(cmp);
 FASTOP2(test);
 
+FASTOP1SRC2(mul, mul_ex);
+FASTOP1SRC2(imul, imul_ex);
+
 FASTOP3WCL(shld);
 FASTOP3WCL(shrd);
 
@@ -2104,22 +2116,6 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
-{
-	u8 ex = 0;
-
-	emulate_1op_rax_rdx(ctxt, "mul", ex);
-	return X86EMUL_CONTINUE;
-}
-
-static int em_imul_ex(struct x86_emulate_ctxt *ctxt)
-{
-	u8 ex = 0;
-
-	emulate_1op_rax_rdx(ctxt, "imul", ex);
-	return X86EMUL_CONTINUE;
-}
-
 static int em_div_ex(struct x86_emulate_ctxt *ctxt)
 {
u8 de = 0;
@@ -3702,8 +3698,8 @@ static const struct opcode group3[] = {
F(DstMem | SrcImm | NoWrite, em_test),
F(DstMem | SrcNone | Lock, em_not),
F(DstMem | SrcNone | Lock, em_neg),
-   I(DstXacc | Src2Mem, em_mul_ex),
-   I(DstXacc | Src2Mem, em_imul_ex),
+   F(DstXacc | Src2Mem, em_mul_ex),
+   F(DstXacc | Src2Mem, em_imul_ex),
I(DstXacc | Src2Mem, em_div_ex),
I(DstXacc | Src2Mem, em_idiv_ex),
 };
@@ -4519,7 +4515,8 @@ static void fetch_possible_mmx_operand(struct 
x86_emulate_ctxt *ctxt,
 static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *))
 {
 	ulong flags = (ctxt->eflags & EFLAGS_MASK) | X86_EFLAGS_IF;
-	fop += __ffs(ctxt->dst.bytes) * FASTOP_SIZE;
+	if (!(ctxt->d & ByteOp))
+		fop += __ffs(ctxt->dst.bytes) * FASTOP_SIZE;
 	asm("push %[flags]; popf; call *%[fastop]; pushf; pop %[flags]\n"
 	    : "+a"(ctxt->dst.val), "+d"(ctxt->src.val), [flags]"+D"(flags)
 	    : "c"(ctxt->src2.val), [fastop]"S"(fop));
-- 
1.8.1.2



[PATCH 4/8] KVM: x86 emulator: Switch fastop src operand to RDX

2013-02-09 Thread Avi Kivity
This makes OpAccHi useful.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d51f6f4..fe91e70 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -176,8 +176,8 @@
 /*
  * fastop functions have a special calling convention:
  *
- * dst:[rdx]:rax  (in/out)
- * src:rbx(in/out)
+ * dst:rax(in/out)
+ * src:rdx(in/out)
  * src2:   rcx(in)
  * flags:  rflags (in/out)
  *
@@ -482,19 +482,19 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *));
 
 #define FASTOP2(op) \
FOP_START(op) \
-   FOP2E(op##b, al, bl) \
-   FOP2E(op##w, ax, bx) \
-   FOP2E(op##l, eax, ebx) \
-   ON64(FOP2E(op##q, rax, rbx)) \
+   FOP2E(op##b, al, dl) \
+   FOP2E(op##w, ax, dx) \
+   FOP2E(op##l, eax, edx) \
+   ON64(FOP2E(op##q, rax, rdx)) \
FOP_END
 
 /* 2 operand, word only */
 #define FASTOP2W(op) \
FOP_START(op) \
FOPNOP() \
-   FOP2E(op##w, ax, bx) \
-   FOP2E(op##l, eax, ebx) \
-   ON64(FOP2E(op##q, rax, rbx)) \
+   FOP2E(op##w, ax, dx) \
+   FOP2E(op##l, eax, edx) \
+   ON64(FOP2E(op##q, rax, rdx)) \
FOP_END
 
 /* 2 operand, src is CL */
@@ -513,9 +513,9 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *));
 #define FASTOP3WCL(op) \
FOP_START(op) \
FOPNOP() \
-   FOP3E(op##w, ax, bx, cl) \
-   FOP3E(op##l, eax, ebx, cl) \
-   ON64(FOP3E(op##q, rax, rbx, cl)) \
+   FOP3E(op##w, ax, dx, cl) \
+   FOP3E(op##l, eax, edx, cl) \
+   ON64(FOP3E(op##q, rax, rdx, cl)) \
FOP_END
 
 /* Special case for SETcc - 1 instruction per cc */
@@ -4521,7 +4521,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void 
(*fop)(struct fastop *))
 	ulong flags = (ctxt->eflags & EFLAGS_MASK) | X86_EFLAGS_IF;
 	fop += __ffs(ctxt->dst.bytes) * FASTOP_SIZE;
 	asm("push %[flags]; popf; call *%[fastop]; pushf; pop %[flags]\n"
-	    : "+a"(ctxt->dst.val), "+b"(ctxt->src.val), [flags]"+D"(flags)
+	    : "+a"(ctxt->dst.val), "+d"(ctxt->src.val), [flags]"+D"(flags)
 	    : "c"(ctxt->src2.val), [fastop]"S"(fop));
 	ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
return X86EMUL_CONTINUE;
-- 
1.8.1.2



[PATCH 7/8] KVM: x86 emulator: drop unused old-style inline emulation

2013-02-09 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 198 -
 1 file changed, 198 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 89f56bb..a706e52 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -282,174 +282,17 @@ static void invalidate_registers(struct x86_emulate_ctxt 
*ctxt)
 }
 
 /*
- * Instruction emulation:
- * Most instructions are emulated directly via a fragment of inline assembly
- * code. This allows us to save/restore EFLAGS and thus very easily pick up
- * any modified flags.
- */
-
-#if defined(CONFIG_X86_64)
-#define _LO32 "k"		/* force 32-bit operand */
-#define _STK  "%%rsp"		/* stack pointer */
-#elif defined(__i386__)
-#define _LO32 ""		/* force 32-bit operand */
-#define _STK  "%%esp"		/* stack pointer */
-#endif
-
-/*
  * These EFLAGS bits are restored from saved value during emulation, and
  * any changes are written back to the saved value after emulation.
  */
 #define EFLAGS_MASK (EFLG_OF|EFLG_SF|EFLG_ZF|EFLG_AF|EFLG_PF|EFLG_CF)
 
-/* Before executing instruction: restore necessary bits in EFLAGS. */
-#define _PRE_EFLAGS(_sav, _msk, _tmp)					\
-	/* EFLAGS = (_sav & _msk) | (EFLAGS & ~_msk); _sav &= ~_msk; */ \
-	"movl %"_sav",%"_LO32 _tmp"; "					\
-	"push %"_tmp"; "						\
-	"push %"_tmp"; "						\
-	"movl %"_msk",%"_LO32 _tmp"; "					\
-	"andl %"_LO32 _tmp",("_STK"); "					\
-	"pushf; "							\
-	"notl %"_LO32 _tmp"; "						\
-	"andl %"_LO32 _tmp",("_STK"); "					\
-	"andl %"_LO32 _tmp","__stringify(BITS_PER_LONG/4)"("_STK"); "	\
-	"pop  %"_tmp"; "						\
-	"orl  %"_LO32 _tmp",("_STK"); "					\
-	"popf; "							\
-	"pop  %"_sav"; "
-
-/* After executing instruction: write-back necessary bits in EFLAGS. */
-#define _POST_EFLAGS(_sav, _msk, _tmp)					\
-	/* _sav |= EFLAGS & _msk; */					\
-	"pushf; "							\
-	"pop  %"_tmp"; "						\
-	"andl %"_msk",%"_LO32 _tmp"; "					\
-	"orl  %"_LO32 _tmp",%"_sav"; "
-
 #ifdef CONFIG_X86_64
 #define ON64(x) x
 #else
 #define ON64(x)
 #endif
 
-#define emulate_2op(ctxt, _op, _x, _y, _suffix, _dsttype)	\
-	do {							\
-		__asm__ __volatile__ (				\
-			_PRE_EFLAGS("0", "4", "2")		\
-			_op _suffix " %"_x"3,%1; "		\
-			_POST_EFLAGS("0", "4", "2")		\
-			: "=m" ((ctxt)->eflags),		\
-			  "+q" (*(_dsttype*)&(ctxt)->dst.val),	\
-			  "=&r" (_tmp)				\
-			: _y ((ctxt)->src.val), "i" (EFLAGS_MASK)); \
-	} while (0)
-
-
-/* Raw emulation: instruction has two explicit operands. */
-#define __emulate_2op_nobyte(ctxt,_op,_wx,_wy,_lx,_ly,_qx,_qy)	\
-	do {							\
-		unsigned long _tmp;				\
-								\
-		switch ((ctxt)->dst.bytes) {			\
-		case 2:						\
-			emulate_2op(ctxt,_op,_wx,_wy,"w",u16);	\
-			break;					\
-		case 4:						\
-			emulate_2op(ctxt,_op,_lx,_ly,"l",u32);	\
-			break;					\
-		case 8:						\
-			ON64(emulate_2op(ctxt,_op,_qx,_qy,"q",u64)); \
-			break;					\
-		}						\
-	} while (0)
-
-#define __emulate_2op(ctxt,_op,_bx,_by,_wx,_wy,_lx,_ly,_qx,_qy)	\
-	do {							\
-		unsigned long _tmp;				\
-		switch ((ctxt)->dst.bytes) {			\
-		case 1:						\
-			emulate_2op(ctxt,_op,_bx,_by,"b",u8

[PATCH 2/8] KVM: x86 emulator: decode extended accumulator explicity

2013-02-09 Thread Avi Kivity
Single-operand MUL and DIV access an extended accumulator: AX for byte
instructions, and DX:AX, EDX:EAX, or RDX:RAX for larger-sized instructions.
Add support for fetching the extended accumulator.

In order not to change things too much, RDX is loaded into Src2, which is
already loaded by fastop().  This avoids increasing register pressure on
i386.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 18c86b5..aa8516e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -60,6 +60,8 @@
 #define OpGS  25ull  /* GS */
 #define OpMem8      26ull  /* 8-bit zero extended memory operand */
 #define OpImm64   27ull  /* Sign extended 16/32/64-bit immediate */
+#define OpAccLo   29ull  /* Low part of extended acc (AX/AX/EAX/RAX) */
+#define OpAccHi   30ull  /* High part of extended acc (-/DX/EDX/RDX) */
 
 #define OpBits 5  /* Width of operand field */
 #define OpMask ((1ull << OpBits) - 1)
@@ -85,6 +87,7 @@
 #define DstMem64    (OpMem64 << DstShift)
 #define DstImmUByte (OpImmUByte << DstShift)
 #define DstDX       (OpDX << DstShift)
+#define DstAccLo    (OpAccLo << DstShift)
 #define DstMask     (OpMask << DstShift)
 /* Source operand type. */
 #define SrcShift    6
@@ -106,6 +109,7 @@
 #define SrcImm64    (OpImm64 << SrcShift)
 #define SrcDX       (OpDX << SrcShift)
 #define SrcMem8     (OpMem8 << SrcShift)
+#define SrcAccHi    (OpAccHi << SrcShift)
 #define SrcMask     (OpMask << SrcShift)
 #define BitOp       (1 << 11)
 #define MemAbs      (1 << 12)  /* Memory operand is absolute displacement */
@@ -154,6 +158,8 @@
 #define NoWrite     ((u64)1 << 45)  /* No writeback */
 #define SrcWrite    ((u64)1 << 46)  /* Write back src operand */
 
+#define DstXacc (DstAccLo | SrcAccHi | SrcWrite)
+
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
 #define X4(x...) X2(x), X2(x)
@@ -4129,6 +4135,22 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
fetch_register_operand(op);
 		op->orig_val = op->val;
 		break;
+	case OpAccLo:
+		op->type = OP_REG;
+		op->bytes = (ctxt->d & ByteOp) ? 2 : ctxt->op_bytes;
+		op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX);
+		fetch_register_operand(op);
+		op->orig_val = op->val;
+		break;
+	case OpAccHi:
+		if (ctxt->d & ByteOp)
+			break;
+		op->type = OP_REG;
+		op->bytes = ctxt->op_bytes;
+		op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RDX);
+		fetch_register_operand(op);
+		op->orig_val = op->val;
+		break;
 	case OpDI:
 		op->type = OP_MEM;
 		op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
-- 
1.8.1.2



[PATCH] emulator: add MUL tests

2013-02-09 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 x86/emulator.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index a128e13..96576e5 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -583,9 +583,9 @@ static void test_imul(ulong *mem)
 report("imul rax, mem, imm", a == 0x1D950BDE1D950BC8L);
 }
 
-static void test_div(long *mem)
+static void test_muldiv(long *mem)
 {
-long a, d;
+long a, d, aa, dd;
 u8 ex = 1;
 
 *mem = 0; a = 1; d = 2;
@@ -598,6 +598,19 @@ static void test_div(long *mem)
     : "+a"(a), "+d"(d), "+q"(ex) : "m"(*mem));
 report("divq (1)",
        a == 0x1ffb1b963b33ul && d == 0x273ba4384ede2ul && !ex);
+    aa = 0x; dd = 0x;
+    *mem = 0x; a = aa; d = dd;
+    asm("mulb %2" : "+a"(a), "+d"(d) : "m"(*mem));
+    report("mulb mem", a == 0x0363 && d == dd);
+    *mem = 0x; a = aa; d = dd;
+    asm("mulw %2" : "+a"(a), "+d"(d) : "m"(*mem));
+    report("mulw mem", a == 0xc963 && d == 0x0369);
+    *mem = 0x; a = aa; d = dd;
+    asm("mull %2" : "+a"(a), "+d"(d) : "m"(*mem));
+    report("mull mem", a == 0x962fc963 && d == 0x369d036);
+    *mem = 0x; a = aa; d = dd;
+    asm("mulq %2" : "+a"(a), "+d"(d) : "m"(*mem));
+    report("mulq mem", a == 0x2fc962fc962fc963 && d == 0x369d0369d0369d0);
 }
 
 typedef unsigned __attribute__((vector_size(16))) sse128;
@@ -934,7 +947,7 @@ int main()
test_btc(mem);
test_bsfbsr(mem);
test_imul(mem);
-   test_div(mem);
+   test_muldiv(mem);
test_sse(mem);
test_mmx(mem);
test_rip_relative(mem, insn_ram);
-- 
1.8.1.2



[PATCH] KVM: x86 emulator: fix test_cc() build failure on i386

2013-01-26 Thread Avi Kivity
'pushq' doesn't exist on i386.  Replace with 'push', which should work
since the operand is a register.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e99fb72..2b11318 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1013,7 +1013,7 @@ static u8 test_cc(unsigned int condition, unsigned long flags)
 	void (*fop)(void) = (void *)em_setcc + 4 * (condition & 0xf);
 
 	flags = (flags & EFLAGS_MASK) | X86_EFLAGS_IF;
-	asm("pushq %[flags]; popf; call *%[fastop]"
+	asm("push %[flags]; popf; call *%[fastop]"
 	    : "=a"(rc) : [fastop]"r"(fop), [flags]"r"(flags));
return rc;
 }
-- 
1.8.1



Re: linux-next: Tree for Jan 25 (kvm)

2013-01-26 Thread Avi Kivity
On Sat, Jan 26, 2013 at 1:46 AM, Stephen Rothwell s...@canb.auug.org.au wrote:
 On Fri, 25 Jan 2013 08:53:58 -0800 Randy Dunlap rdun...@infradead.org wrote:

 Seeing lots of this error on i386:

 arch/x86/kvm/emulate.c:1016: Error: unsupported for `push'

 Caused by commit 9ae9febae950 (KVM: x86 emulator: covert SETCC to
 fastop) from the kvm tree.  cc's added.



I posted a fix.


Re: [PATCH v2 8/8] KVM: x86 emulator: convert a few freestanding emulations to fastop

2013-01-23 Thread Avi Kivity
On Wed, Jan 23, 2013 at 2:21 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 Missing signed off by.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com


[PATCH v2 0/8] More fastop patches

2013-01-19 Thread Avi Kivity
After this, only the difficult MUL and DIV case remains.

Changes from v1:
 - make SHLD/SHRD more consistent with the others
 - fix SETcc misordering

Avi Kivity (8):
  KVM: x86 emulator: Convert SHLD, SHRD to fastop
  KVM: x86 emulator: convert shift/rotate instructions to fastop
  KVM: x86 emulator: convert SETCC to fastop
  KVM: x86 emulator: convert INC/DEC to fastop
  KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop
  KVM: x86 emulator: convert 2-operand IMUL to fastop
  KVM: x86 emulator: rearrange fastop definitions
  KVM: x86 emulator: convert a few freestanding emulations to fastop

 arch/x86/kvm/emulate.c | 311 +
 1 file changed, 136 insertions(+), 175 deletions(-)

-- 
1.8.0.2



[PATCH v2 2/8] KVM: x86 emulator: convert shift/rotate instructions to fastop

2013-01-19 Thread Avi Kivity
SHL, SHR, ROL, ROR, RCL, RCR, SAR, SAL

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 72 ++
 1 file changed, 31 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a21773f..a94b1d7 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -478,6 +478,15 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
ON64(FOP2E(op##q, rax, rbx)) \
FOP_END
 
+/* 2 operand, src is CL */
+#define FASTOP2CL(op) \
+   FOP_START(op) \
+   FOP2E(op##b, al, cl) \
+   FOP2E(op##w, ax, cl) \
+   FOP2E(op##l, eax, cl) \
+   ON64(FOP2E(op##q, rax, cl)) \
+   FOP_END
+
 #define FOP3E(op,  dst, src, src2) \
 	FOP_ALIGN #op " %" #src2 ", %" #src ", %" #dst " \n\t" FOP_RET
 
@@ -2046,38 +2055,17 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_grp2(struct x86_emulate_ctxt *ctxt)
-{
-	switch (ctxt->modrm_reg) {
-	case 0:	/* rol */
-		emulate_2op_SrcB(ctxt, "rol");
-		break;
-	case 1:	/* ror */
-		emulate_2op_SrcB(ctxt, "ror");
-		break;
-	case 2:	/* rcl */
-		emulate_2op_SrcB(ctxt, "rcl");
-		break;
-	case 3:	/* rcr */
-		emulate_2op_SrcB(ctxt, "rcr");
-		break;
-	case 4:	/* sal/shl */
-	case 6:	/* sal/shl */
-		emulate_2op_SrcB(ctxt, "sal");
-		break;
-	case 5:	/* shr */
-		emulate_2op_SrcB(ctxt, "shr");
-		break;
-	case 7:	/* sar */
-		emulate_2op_SrcB(ctxt, "sar");
-		break;
-   }
-   return X86EMUL_CONTINUE;
-}
-
 FASTOP1(not);
 FASTOP1(neg);
 
+FASTOP2CL(rol);
+FASTOP2CL(ror);
+FASTOP2CL(rcl);
+FASTOP2CL(rcr);
+FASTOP2CL(shl);
+FASTOP2CL(shr);
+FASTOP2CL(sar);
+
 static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
 {
u8 ex = 0;
@@ -3726,6 +3714,17 @@ static const struct opcode group1A[] = {
I(DstMem | SrcNone | Mov | Stack, em_pop), N, N, N, N, N, N, N,
 };
 
+static const struct opcode group2[] = {
+   F(DstMem | ModRM, em_rol),
+   F(DstMem | ModRM, em_ror),
+   F(DstMem | ModRM, em_rcl),
+   F(DstMem | ModRM, em_rcr),
+   F(DstMem | ModRM, em_shl),
+   F(DstMem | ModRM, em_shr),
+   F(DstMem | ModRM, em_shl),
+   F(DstMem | ModRM, em_sar),
+};
+
 static const struct opcode group3[] = {
F(DstMem | SrcImm | NoWrite, em_test),
F(DstMem | SrcImm | NoWrite, em_test),
@@ -3949,7 +3948,7 @@ static const struct opcode opcode_table[256] = {
/* 0xB8 - 0xBF */
X8(I(DstReg | SrcImm64 | Mov, em_mov)),
/* 0xC0 - 0xC7 */
-   D2bv(DstMem | SrcImmByte | ModRM),
+   G(ByteOp | Src2ImmByte, group2), G(Src2ImmByte, group2),
I(ImplicitOps | Stack | SrcImmU16, em_ret_near_imm),
I(ImplicitOps | Stack, em_ret),
I(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES, em_lseg),
@@ -3961,7 +3960,8 @@ static const struct opcode opcode_table[256] = {
D(ImplicitOps), DI(SrcImmByte, intn),
D(ImplicitOps | No64), II(ImplicitOps, em_iret, iret),
/* 0xD0 - 0xD7 */
-   D2bv(DstMem | SrcOne | ModRM), D2bv(DstMem | ModRM),
+   G(Src2One | ByteOp, group2), G(Src2One, group2),
+   G(Src2CL | ByteOp, group2), G(Src2CL, group2),
N, I(DstAcc | SrcImmByte | No64, em_aad), N, N,
/* 0xD8 - 0xDF */
N, E(0, escape_d9), N, E(0, escape_db), N, E(0, escape_dd), N, N,
@@ -4713,9 +4713,6 @@ special_insn:
 		case 8: ctxt->dst.val = (s32)ctxt->dst.val; break;
}
break;
-   case 0xc0 ... 0xc1:
-   rc = em_grp2(ctxt);
-   break;
case 0xcc:  /* int3 */
rc = emulate_int(ctxt, 3);
break;
@@ -4726,13 +4723,6 @@ special_insn:
 		if (ctxt->eflags & EFLG_OF)
rc = emulate_int(ctxt, 4);
break;
-   case 0xd0 ... 0xd1: /* Grp2 */
-   rc = em_grp2(ctxt);
-   break;
-   case 0xd2 ... 0xd3: /* Grp2 */
-		ctxt->src.val = reg_read(ctxt, VCPU_REGS_RCX);
-   rc = em_grp2(ctxt);
-   break;
case 0xe9: /* jmp rel */
case 0xeb: /* jmp rel short */
 		jmp_rel(ctxt, ctxt->src.val);
-- 
1.8.0.2



[PATCH v2 3/8] KVM: x86 emulator: convert SETCC to fastop

2013-01-19 Thread Avi Kivity
This is a bit of a special case since we don't have the usual
byte/word/long/quad switch; instead we switch on the condition code embedded
in the instruction.
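
To make the indexing concrete, here is a stand-alone sketch (illustrative
only, not part of the patch); the table mirrors the FOP_SETCC order below,
and test_cc() reaches a stub with em_setcc + 4 * (condition & 0xf):

    #include <stdio.h>

    /* each stub is 4 bytes: .align 4; setCC %al; ret */
    static const char *setcc_names[16] = {
        "seto", "setno", "setc", "setnc", "setz", "setnz", "setbe", "setnbe",
        "sets", "setns", "setp", "setnp", "setl", "setnl", "setle", "setnle",
    };

    int main(void)
    {
        unsigned opcode = 0x94;        /* 0f 94 = setz r/m8 */
        unsigned cc = opcode & 0xf;    /* condition code from the opcode */

        printf("opcode 0x0f%02x -> em_setcc + %u (%s)\n",
               opcode, 4 * cc, setcc_names[cc]);
        return 0;
    }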

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 60 --
 1 file changed, 29 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a94b1d7..e13138d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -499,6 +499,28 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
ON64(FOP3E(op##q, rax, rbx, cl)) \
FOP_END
 
+/* Special case for SETcc - 1 instruction per cc */
+#define FOP_SETCC(op) ".align 4; " #op " %al; ret \n\t"
+
+FOP_START(setcc)
+FOP_SETCC(seto)
+FOP_SETCC(setno)
+FOP_SETCC(setc)
+FOP_SETCC(setnc)
+FOP_SETCC(setz)
+FOP_SETCC(setnz)
+FOP_SETCC(setbe)
+FOP_SETCC(setnbe)
+FOP_SETCC(sets)
+FOP_SETCC(setns)
+FOP_SETCC(setp)
+FOP_SETCC(setnp)
+FOP_SETCC(setl)
+FOP_SETCC(setnl)
+FOP_SETCC(setle)
+FOP_SETCC(setnle)
+FOP_END;
+
 #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \
do {\
unsigned long _tmp; \
@@ -939,39 +961,15 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt,
return rc;
 }
 
-static int test_cc(unsigned int condition, unsigned int flags)
+static u8 test_cc(unsigned int condition, unsigned long flags)
 {
-	int rc = 0;
+	u8 rc;
+	void (*fop)(void) = (void *)em_setcc + 4 * (condition & 0xf);
 
-	switch ((condition & 15) >> 1) {
-	case 0: /* o */
-		rc |= (flags & EFLG_OF);
-		break;
-	case 1: /* b/c/nae */
-		rc |= (flags & EFLG_CF);
-		break;
-	case 2: /* z/e */
-		rc |= (flags & EFLG_ZF);
-		break;
-	case 3: /* be/na */
-		rc |= (flags & (EFLG_CF|EFLG_ZF));
-		break;
-	case 4: /* s */
-		rc |= (flags & EFLG_SF);
-		break;
-	case 5: /* p/pe */
-		rc |= (flags & EFLG_PF);
-		break;
-	case 7: /* le/ng */
-		rc |= (flags & EFLG_ZF);
-		/* fall through */
-	case 6: /* l/nge */
-		rc |= (!(flags & EFLG_SF) != !(flags & EFLG_OF));
-		break;
-	}
-
-	/* Odd condition identifiers (lsb == 1) have inverted sense. */
-	return (!!rc ^ (condition & 1));
+	flags = (flags & EFLAGS_MASK) | X86_EFLAGS_IF;
+	asm("pushq %[flags]; popf; call *%[fastop]"
+	    : "=a"(rc) : [fastop]"r"(fop), [flags]"r"(flags));
+	return rc;
 }
 
 static void fetch_register_operand(struct operand *op)
-- 
1.8.0.2



[PATCH v2 4/8] KVM: x86 emulator: convert INC/DEC to fastop

2013-01-19 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e13138d..edb09e9c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2055,6 +2055,8 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
 
 FASTOP1(not);
 FASTOP1(neg);
+FASTOP1(inc);
+FASTOP1(dec);
 
 FASTOP2CL(rol);
 FASTOP2CL(ror);
@@ -2105,12 +2107,6 @@ static int em_grp45(struct x86_emulate_ctxt *ctxt)
int rc = X86EMUL_CONTINUE;
 
 	switch (ctxt->modrm_reg) {
-	case 0:	/* inc */
-		emulate_1op(ctxt, "inc");
-		break;
-	case 1:	/* dec */
-		emulate_1op(ctxt, "dec");
-		break;
 	case 2: /* call near abs */ {
 		long int old_eip;
 		old_eip = ctxt->_eip;
@@ -3735,14 +3731,14 @@ static const struct opcode group3[] = {
 };
 
 static const struct opcode group4[] = {
-   I(ByteOp | DstMem | SrcNone | Lock, em_grp45),
-   I(ByteOp | DstMem | SrcNone | Lock, em_grp45),
+   F(ByteOp | DstMem | SrcNone | Lock, em_inc),
+   F(ByteOp | DstMem | SrcNone | Lock, em_dec),
N, N, N, N, N, N,
 };
 
 static const struct opcode group5[] = {
-   I(DstMem | SrcNone | Lock,  em_grp45),
-   I(DstMem | SrcNone | Lock,  em_grp45),
+   F(DstMem | SrcNone | Lock,  em_inc),
+   F(DstMem | SrcNone | Lock,  em_dec),
I(SrcMem | Stack,   em_grp45),
I(SrcMemFAddr | ImplicitOps | Stack,em_call_far),
I(SrcMem | Stack,   em_grp45),
@@ -3891,7 +3887,7 @@ static const struct opcode opcode_table[256] = {
/* 0x38 - 0x3F */
F6ALU(NoWrite, em_cmp), N, N,
/* 0x40 - 0x4F */
-   X16(D(DstReg)),
+   X8(F(DstReg, em_inc)), X8(F(DstReg, em_dec)),
/* 0x50 - 0x57 */
X8(I(SrcReg | Stack, em_push)),
/* 0x58 - 0x5F */
@@ -4681,12 +4677,6 @@ special_insn:
goto twobyte_insn;
 
 	switch (ctxt->b) {
-	case 0x40 ... 0x47: /* inc r16/r32 */
-		emulate_1op(ctxt, "inc");
-		break;
-	case 0x48 ... 0x4f: /* dec r16/r32 */
-		emulate_1op(ctxt, "dec");
-		break;
 	case 0x63:		/* movsxd */
 		if (ctxt->mode != X86EMUL_MODE_PROT64)
 			goto cannot_emulate;
-- 
1.8.0.2



[PATCH v2 7/8] KVM: x86 emulator: rearrange fastop definitions

2013-01-19 Thread Avi Kivity
Make fastop opcodes usable in other emulations.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 70 +-
 1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 45ddec8..d06354d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -972,6 +972,41 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt,
return rc;
 }
 
+FASTOP2(add);
+FASTOP2(or);
+FASTOP2(adc);
+FASTOP2(sbb);
+FASTOP2(and);
+FASTOP2(sub);
+FASTOP2(xor);
+FASTOP2(cmp);
+FASTOP2(test);
+
+FASTOP3WCL(shld);
+FASTOP3WCL(shrd);
+
+FASTOP2W(imul);
+
+FASTOP1(not);
+FASTOP1(neg);
+FASTOP1(inc);
+FASTOP1(dec);
+
+FASTOP2CL(rol);
+FASTOP2CL(ror);
+FASTOP2CL(rcl);
+FASTOP2CL(rcr);
+FASTOP2CL(shl);
+FASTOP2CL(shr);
+FASTOP2CL(sar);
+
+FASTOP2W(bsf);
+FASTOP2W(bsr);
+FASTOP2W(bt);
+FASTOP2W(bts);
+FASTOP2W(btr);
+FASTOP2W(btc);
+
 static u8 test_cc(unsigned int condition, unsigned long flags)
 {
u8 rc;
@@ -2064,26 +2099,6 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-FASTOP1(not);
-FASTOP1(neg);
-FASTOP1(inc);
-FASTOP1(dec);
-
-FASTOP2CL(rol);
-FASTOP2CL(ror);
-FASTOP2CL(rcl);
-FASTOP2CL(rcr);
-FASTOP2CL(shl);
-FASTOP2CL(shr);
-FASTOP2CL(sar);
-
-FASTOP2W(bsf);
-FASTOP2W(bsr);
-FASTOP2W(bt);
-FASTOP2W(bts);
-FASTOP2W(btr);
-FASTOP2W(btc);
-
 static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
 {
u8 ex = 0;
@@ -3040,21 +3055,6 @@ static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-FASTOP2(add);
-FASTOP2(or);
-FASTOP2(adc);
-FASTOP2(sbb);
-FASTOP2(and);
-FASTOP2(sub);
-FASTOP2(xor);
-FASTOP2(cmp);
-FASTOP2(test);
-
-FASTOP3WCL(shld);
-FASTOP3WCL(shrd);
-
-FASTOP2W(imul);
-
 static int em_xchg(struct x86_emulate_ctxt *ctxt)
 {
/* Write back the register source. */
-- 
1.8.0.2



[PATCH v2 8/8] KVM: x86 emulator: convert a few freestanding emulations to fastop

2013-01-19 Thread Avi Kivity
---
 arch/x86/kvm/emulate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d06354d..e99fb72 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2209,7 +2209,7 @@ static int em_cmpxchg(struct x86_emulate_ctxt *ctxt)
/* Save real source value, then compare EAX against destination. */
 	ctxt->src.orig_val = ctxt->src.val;
 	ctxt->src.val = reg_read(ctxt, VCPU_REGS_RAX);
-	emulate_2op_SrcV(ctxt, "cmp");
+	fastop(ctxt, em_cmp);
 
 	if (ctxt->eflags & EFLG_ZF) {
/* Success: write back to memory. */
@@ -2977,7 +2977,7 @@ static int em_das(struct x86_emulate_ctxt *ctxt)
 	ctxt->src.type = OP_IMM;
 	ctxt->src.val = 0;
 	ctxt->src.bytes = 1;
-	emulate_2op_SrcV(ctxt, "or");
+	fastop(ctxt, em_or);
 	ctxt->eflags &= ~(X86_EFLAGS_AF | X86_EFLAGS_CF);
 	if (cf)
 		ctxt->eflags |= X86_EFLAGS_CF;
@@ -4816,7 +4816,7 @@ twobyte_insn:
 					   (s16) ctxt->src.val;
 		break;
 	case 0xc0 ... 0xc1:	/* xadd */
-		emulate_2op_SrcV(ctxt, "add");
+		fastop(ctxt, em_add);
 		/* Write back the register source. */
 		ctxt->src.val = ctxt->dst.orig_val;
 		write_register_operand(&ctxt->src);
-- 
1.8.0.2



[PATCH v2 6/8] KVM: x86 emulator: convert 2-operand IMUL to fastop

2013-01-19 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 62014dc..45ddec8 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -441,6 +441,8 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
}   \
} while (0)
 
+static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *));
+
 #define FOP_ALIGN ".align " __stringify(FASTOP_SIZE) " \n\t"
 #define FOP_RET   "ret \n\t"
 
@@ -3051,6 +3053,8 @@ FASTOP2(test);
 FASTOP3WCL(shld);
 FASTOP3WCL(shrd);
 
+FASTOP2W(imul);
+
 static int em_xchg(struct x86_emulate_ctxt *ctxt)
 {
/* Write back the register source. */
@@ -3063,16 +3067,10 @@ static int em_xchg(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_imul(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "imul");
-   return X86EMUL_CONTINUE;
-}
-
 static int em_imul_3op(struct x86_emulate_ctxt *ctxt)
 {
 	ctxt->dst.val = ctxt->src2.val;
-   return em_imul(ctxt);
+   return fastop(ctxt, em_imul);
 }
 
 static int em_cwd(struct x86_emulate_ctxt *ctxt)
@@ -4010,7 +4008,7 @@ static const struct opcode twobyte_table[256] = {
F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
-   D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
+   D(ModRM), F(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
-- 
1.8.0.2



[PATCH v2 5/8] KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop

2013-01-19 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 76 +-
 1 file changed, 26 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index edb09e9c..62014dc 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -478,6 +478,15 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
ON64(FOP2E(op##q, rax, rbx)) \
FOP_END
 
+/* 2 operand, word only */
+#define FASTOP2W(op) \
+   FOP_START(op) \
+   FOPNOP() \
+   FOP2E(op##w, ax, bx) \
+   FOP2E(op##l, eax, ebx) \
+   ON64(FOP2E(op##q, rax, rbx)) \
+   FOP_END
+
 /* 2 operand, src is CL */
 #define FASTOP2CL(op) \
FOP_START(op) \
@@ -2066,6 +2075,13 @@ FASTOP2CL(shl);
 FASTOP2CL(shr);
 FASTOP2CL(sar);
 
+FASTOP2W(bsf);
+FASTOP2W(bsr);
+FASTOP2W(bt);
+FASTOP2W(bts);
+FASTOP2W(btr);
+FASTOP2W(btc);
+
 static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
 {
u8 ex = 0;
@@ -3377,47 +3393,6 @@ static int em_sti(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_bt(struct x86_emulate_ctxt *ctxt)
-{
-	/* Disable writeback. */
-	ctxt->dst.type = OP_NONE;
-	/* only subword offset */
-	ctxt->src.val &= (ctxt->dst.bytes << 3) - 1;
-
-	emulate_2op_SrcV_nobyte(ctxt, "bt");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_bts(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "bts");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_btr(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "btr");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_btc(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "btc");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_bsf(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "bsf");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_bsr(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "bsr");
-	return X86EMUL_CONTINUE;
-}
-
 static int em_cpuid(struct x86_emulate_ctxt *ctxt)
 {
u32 eax, ebx, ecx, edx;
@@ -3773,10 +3748,10 @@ static const struct group_dual group7 = { {
 
 static const struct opcode group8[] = {
N, N, N, N,
-   I(DstMem | SrcImmByte,  em_bt),
-   I(DstMem | SrcImmByte | Lock | PageTable,   em_bts),
-   I(DstMem | SrcImmByte | Lock,   em_btr),
-   I(DstMem | SrcImmByte | Lock | PageTable,   em_btc),
+   F(DstMem | SrcImmByte | NoWrite,em_bt),
+   F(DstMem | SrcImmByte | Lock | PageTable,   em_bts),
+   F(DstMem | SrcImmByte | Lock,   em_btr),
+   F(DstMem | SrcImmByte | Lock | PageTable,   em_btc),
 };
 
 static const struct group_dual group9 = { {
@@ -4025,28 +4000,29 @@ static const struct opcode twobyte_table[256] = {
X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)),
/* 0xA0 - 0xA7 */
I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg),
-	II(ImplicitOps, em_cpuid, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt),
+   II(ImplicitOps, em_cpuid, cpuid),
+   F(DstMem | SrcReg | ModRM | BitOp | NoWrite, em_bt),
F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shld),
F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N,
/* 0xA8 - 0xAF */
I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
DI(ImplicitOps, rsm),
-   I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
+   F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
-   I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
+   F(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg),
D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
/* 0xB8 - 0xBF */
N, N,
G(BitOp, group8),
-   I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
-   I(DstReg | SrcMem | ModRM, em_bsf), I(DstReg | SrcMem | ModRM, em_bsr),
+   F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
+   F(DstReg | SrcMem | ModRM, em_bsf), F(DstReg | SrcMem | ModRM, em_bsr),
D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
/* 0xC0 - 0xC7 */
D2bv(DstMem | SrcReg | ModRM | Lock),
-- 
1.8.0.2


[PATCH 0/8] More fastop patches

2013-01-12 Thread Avi Kivity
After this, only the difficult MUL and DIV case remains.

Avi Kivity (8):
  KVM: x86 emulator: Streamline SHLD, SHRD
  KVM: x86 emulator: convert shift/rotate instructions to fastop
  KVM: x86 emulator: convert SETCC to fastop
  KVM: x86 emulator: convert INC/DEC to fastop
  KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop
  KVM: x86 emulator: convert 2-operand IMUL to fastop
  KVM: x86 emulator: rearrange fastop definitions
  KVM: x86 emulator: convert a few freestanding emulations to fastop

 arch/x86/kvm/emulate.c | 311 +
 1 file changed, 136 insertions(+), 175 deletions(-)

-- 
1.8.0.1



[PATCH 1/8] KVM: x86 emulator: Streamline SHLD, SHRD

2013-01-12 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 33 +
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 619a33d..2189c6a 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -454,6 +454,8 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
 #define FOP_END \
 	".popsection")
 
+#define FOPNOP() FOP_ALIGN FOP_RET
+
 #define FOP1E(op,  dst) \
+	FOP_ALIGN #op " %" #dst " \n\t" FOP_RET
 
@@ -476,6 +478,18 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
ON64(FOP2E(op##q, rax, rbx)) \
FOP_END
 
+#define FOP3E(op,  dst, src, src2) \
+	FOP_ALIGN #op " %" #src2 ", %" #src ", %" #dst " \n\t" FOP_RET
+
+/* 3-operand, word-only, src2=cl */
+#define FASTOP3WCL(op) \
+   FOP_START(op) \
+   FOPNOP() \
+   FOP3E(op, ax, bx, cl) \
+   FOP3E(op, eax, ebx, cl) \
+   ON64(FOP3E(op, rax, rbx, cl)) \
+   FOP_END
+
 #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \
do {\
unsigned long _tmp; \
@@ -3036,6 +3050,9 @@ FASTOP2(xor);
 FASTOP2(cmp);
 FASTOP2(test);
 
+FASTOP3WCL(shld);
+FASTOP3WCL(shrd);
+
 static int em_xchg(struct x86_emulate_ctxt *ctxt)
 {
/* Write back the register source. */
@@ -4015,14 +4032,14 @@ static const struct opcode twobyte_table[256] = {
/* 0xA0 - 0xA7 */
I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg),
 	II(ImplicitOps, em_cpuid, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt),
-   D(DstMem | SrcReg | Src2ImmByte | ModRM),
-   D(DstMem | SrcReg | Src2CL | ModRM), N, N,
+   F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shld),
+   F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N,
/* 0xA8 - 0xAF */
I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
DI(ImplicitOps, rsm),
I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
-   D(DstMem | SrcReg | Src2ImmByte | ModRM),
-   D(DstMem | SrcReg | Src2CL | ModRM),
+   F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
+   F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
@@ -4834,14 +4851,6 @@ twobyte_insn:
case 0x90 ... 0x9f: /* setcc r/m8 */
 		ctxt->dst.val = test_cc(ctxt->b, ctxt->eflags);
 		break;
-	case 0xa4: /* shld imm8, r, r/m */
-	case 0xa5: /* shld cl, r, r/m */
-		emulate_2op_cl(ctxt, "shld");
-		break;
-	case 0xac: /* shrd imm8, r, r/m */
-	case 0xad: /* shrd cl, r, r/m */
-		emulate_2op_cl(ctxt, "shrd");
-		break;
case 0xae:  /* clflush */
break;
case 0xb6 ... 0xb7: /* movzx */
-- 
1.8.0.1



[PATCH 2/8] KVM: x86 emulator: convert shift/rotate instructions to fastop

2013-01-12 Thread Avi Kivity
SHL, SHR, ROL, ROR, RCL, RCR, SAR, SAL

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 72 ++
 1 file changed, 31 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 2189c6a..d641178 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -478,6 +478,15 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
ON64(FOP2E(op##q, rax, rbx)) \
FOP_END
 
+/* 2 operand, src is CL */
+#define FASTOP2CL(op) \
+   FOP_START(op) \
+   FOP2E(op##b, al, cl) \
+   FOP2E(op##w, ax, cl) \
+   FOP2E(op##l, eax, cl) \
+   ON64(FOP2E(op##q, rax, cl)) \
+   FOP_END
+
 #define FOP3E(op,  dst, src, src2) \
 	FOP_ALIGN #op " %" #src2 ", %" #src ", %" #dst " \n\t" FOP_RET
 
@@ -2046,38 +2055,17 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_grp2(struct x86_emulate_ctxt *ctxt)
-{
-	switch (ctxt->modrm_reg) {
-	case 0:	/* rol */
-		emulate_2op_SrcB(ctxt, "rol");
-		break;
-	case 1:	/* ror */
-		emulate_2op_SrcB(ctxt, "ror");
-		break;
-	case 2:	/* rcl */
-		emulate_2op_SrcB(ctxt, "rcl");
-		break;
-	case 3:	/* rcr */
-		emulate_2op_SrcB(ctxt, "rcr");
-		break;
-	case 4:	/* sal/shl */
-	case 6:	/* sal/shl */
-		emulate_2op_SrcB(ctxt, "sal");
-		break;
-	case 5:	/* shr */
-		emulate_2op_SrcB(ctxt, "shr");
-		break;
-	case 7:	/* sar */
-		emulate_2op_SrcB(ctxt, "sar");
-		break;
-   }
-   return X86EMUL_CONTINUE;
-}
-
 FASTOP1(not);
 FASTOP1(neg);
 
+FASTOP2CL(rol);
+FASTOP2CL(ror);
+FASTOP2CL(rcl);
+FASTOP2CL(rcr);
+FASTOP2CL(shl);
+FASTOP2CL(shr);
+FASTOP2CL(sar);
+
 static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
 {
u8 ex = 0;
@@ -3726,6 +3714,17 @@ static const struct opcode group1A[] = {
I(DstMem | SrcNone | Mov | Stack, em_pop), N, N, N, N, N, N, N,
 };
 
+static const struct opcode group2[] = {
+   F(DstMem | ModRM, em_rol),
+   F(DstMem | ModRM, em_ror),
+   F(DstMem | ModRM, em_rcl),
+   F(DstMem | ModRM, em_rcr),
+   F(DstMem | ModRM, em_shl),
+   F(DstMem | ModRM, em_shr),
+   F(DstMem | ModRM, em_shl),
+   F(DstMem | ModRM, em_sar),
+};
+
 static const struct opcode group3[] = {
F(DstMem | SrcImm | NoWrite, em_test),
F(DstMem | SrcImm | NoWrite, em_test),
@@ -3949,7 +3948,7 @@ static const struct opcode opcode_table[256] = {
/* 0xB8 - 0xBF */
X8(I(DstReg | SrcImm64 | Mov, em_mov)),
/* 0xC0 - 0xC7 */
-   D2bv(DstMem | SrcImmByte | ModRM),
+   G(ByteOp | Src2ImmByte, group2), G(Src2ImmByte, group2),
I(ImplicitOps | Stack | SrcImmU16, em_ret_near_imm),
I(ImplicitOps | Stack, em_ret),
I(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES, em_lseg),
@@ -3961,7 +3960,8 @@ static const struct opcode opcode_table[256] = {
D(ImplicitOps), DI(SrcImmByte, intn),
D(ImplicitOps | No64), II(ImplicitOps, em_iret, iret),
/* 0xD0 - 0xD7 */
-   D2bv(DstMem | SrcOne | ModRM), D2bv(DstMem | ModRM),
+   G(Src2One | ByteOp, group2), G(Src2One, group2),
+   G(Src2CL | ByteOp, group2), G(Src2CL, group2),
N, I(DstAcc | SrcImmByte | No64, em_aad), N, N,
/* 0xD8 - 0xDF */
N, E(0, escape_d9), N, E(0, escape_db), N, E(0, escape_dd), N, N,
@@ -4713,9 +4713,6 @@ special_insn:
 		case 8: ctxt->dst.val = (s32)ctxt->dst.val; break;
}
break;
-   case 0xc0 ... 0xc1:
-   rc = em_grp2(ctxt);
-   break;
case 0xcc:  /* int3 */
rc = emulate_int(ctxt, 3);
break;
@@ -4726,13 +4723,6 @@ special_insn:
 		if (ctxt->eflags & EFLG_OF)
rc = emulate_int(ctxt, 4);
break;
-   case 0xd0 ... 0xd1: /* Grp2 */
-   rc = em_grp2(ctxt);
-   break;
-   case 0xd2 ... 0xd3: /* Grp2 */
-		ctxt->src.val = reg_read(ctxt, VCPU_REGS_RCX);
-   rc = em_grp2(ctxt);
-   break;
case 0xe9: /* jmp rel */
case 0xeb: /* jmp rel short */
 		jmp_rel(ctxt, ctxt->src.val);
-- 
1.8.0.1



[PATCH 4/8] KVM: x86 emulator: convert INC/DEC to fastop

2013-01-12 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f6f615e..d89e88f 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2055,6 +2055,8 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
 
 FASTOP1(not);
 FASTOP1(neg);
+FASTOP1(inc);
+FASTOP1(dec);
 
 FASTOP2CL(rol);
 FASTOP2CL(ror);
@@ -2105,12 +2107,6 @@ static int em_grp45(struct x86_emulate_ctxt *ctxt)
int rc = X86EMUL_CONTINUE;
 
 	switch (ctxt->modrm_reg) {
-	case 0:	/* inc */
-		emulate_1op(ctxt, "inc");
-		break;
-	case 1:	/* dec */
-		emulate_1op(ctxt, "dec");
-		break;
 	case 2: /* call near abs */ {
 		long int old_eip;
 		old_eip = ctxt->_eip;
@@ -3735,14 +3731,14 @@ static const struct opcode group3[] = {
 };
 
 static const struct opcode group4[] = {
-   I(ByteOp | DstMem | SrcNone | Lock, em_grp45),
-   I(ByteOp | DstMem | SrcNone | Lock, em_grp45),
+   F(ByteOp | DstMem | SrcNone | Lock, em_inc),
+   F(ByteOp | DstMem | SrcNone | Lock, em_dec),
N, N, N, N, N, N,
 };
 
 static const struct opcode group5[] = {
-   I(DstMem | SrcNone | Lock,  em_grp45),
-   I(DstMem | SrcNone | Lock,  em_grp45),
+   F(DstMem | SrcNone | Lock,  em_inc),
+   F(DstMem | SrcNone | Lock,  em_dec),
I(SrcMem | Stack,   em_grp45),
I(SrcMemFAddr | ImplicitOps | Stack,em_call_far),
I(SrcMem | Stack,   em_grp45),
@@ -3891,7 +3887,7 @@ static const struct opcode opcode_table[256] = {
/* 0x38 - 0x3F */
F6ALU(NoWrite, em_cmp), N, N,
/* 0x40 - 0x4F */
-   X16(D(DstReg)),
+   X8(F(DstReg, em_inc)), X8(F(DstReg, em_dec)),
/* 0x50 - 0x57 */
X8(I(SrcReg | Stack, em_push)),
/* 0x58 - 0x5F */
@@ -4681,12 +4677,6 @@ special_insn:
goto twobyte_insn;
 
 	switch (ctxt->b) {
-	case 0x40 ... 0x47: /* inc r16/r32 */
-		emulate_1op(ctxt, "inc");
-		break;
-	case 0x48 ... 0x4f: /* dec r16/r32 */
-		emulate_1op(ctxt, "dec");
-		break;
 	case 0x63:		/* movsxd */
 		if (ctxt->mode != X86EMUL_MODE_PROT64)
 			goto cannot_emulate;
-- 
1.8.0.1



[PATCH 8/8] KVM: x86 emulator: convert a few freestanding emulations to fastop

2013-01-12 Thread Avi Kivity
---
 arch/x86/kvm/emulate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index da2b903..1bb0af2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2209,7 +2209,7 @@ static int em_cmpxchg(struct x86_emulate_ctxt *ctxt)
/* Save real source value, then compare EAX against destination. */
 	ctxt->src.orig_val = ctxt->src.val;
 	ctxt->src.val = reg_read(ctxt, VCPU_REGS_RAX);
-	emulate_2op_SrcV(ctxt, "cmp");
+	fastop(ctxt, em_cmp);
 
 	if (ctxt->eflags & EFLG_ZF) {
/* Success: write back to memory. */
@@ -2977,7 +2977,7 @@ static int em_das(struct x86_emulate_ctxt *ctxt)
 	ctxt->src.type = OP_IMM;
 	ctxt->src.val = 0;
 	ctxt->src.bytes = 1;
-	emulate_2op_SrcV(ctxt, "or");
+	fastop(ctxt, em_or);
 	ctxt->eflags &= ~(X86_EFLAGS_AF | X86_EFLAGS_CF);
 	if (cf)
 		ctxt->eflags |= X86_EFLAGS_CF;
@@ -4816,7 +4816,7 @@ twobyte_insn:
 					   (s16) ctxt->src.val;
 		break;
 	case 0xc0 ... 0xc1:	/* xadd */
-		emulate_2op_SrcV(ctxt, "add");
+		fastop(ctxt, em_add);
 		/* Write back the register source. */
 		ctxt->src.val = ctxt->dst.orig_val;
 		write_register_operand(&ctxt->src);
-- 
1.8.0.1



[PATCH 3/8] KVM: x86 emulator: convert SETCC to fastop

2013-01-12 Thread Avi Kivity
This is a bit of a special case since we don't have the usual
byte/word/long/quad switch; instead we switch on the condition code embedded
in the instruction.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 60 --
 1 file changed, 29 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d641178..f6f615e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -499,6 +499,28 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
ON64(FOP3E(op, rax, rbx, cl)) \
FOP_END
 
+/* Special case for SETcc - 1 instruction per cc */
+#define FOP_SETCC(op) ".align 4; " #op " %al; ret \n\t"
+
+FOP_START(setcc)
+FOP_SETCC(seto)
+FOP_SETCC(setc)
+FOP_SETCC(setz)
+FOP_SETCC(setbe)
+FOP_SETCC(sets)
+FOP_SETCC(setp)
+FOP_SETCC(setl)
+FOP_SETCC(setle)
+FOP_SETCC(setno)
+FOP_SETCC(setnc)
+FOP_SETCC(setnz)
+FOP_SETCC(setnbe)
+FOP_SETCC(setns)
+FOP_SETCC(setnp)
+FOP_SETCC(setnl)
+FOP_SETCC(setnle)
+FOP_END;
+
 #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \
do {\
unsigned long _tmp; \
@@ -939,39 +961,15 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt,
return rc;
 }
 
-static int test_cc(unsigned int condition, unsigned int flags)
+static u8 test_cc(unsigned int condition, unsigned long flags)
 {
-	int rc = 0;
+	u8 rc;
+	void (*fop)(void) = (void *)em_setcc + 4 * (condition & 0xf);
 
-	switch ((condition & 15) >> 1) {
-	case 0: /* o */
-		rc |= (flags & EFLG_OF);
-		break;
-	case 1: /* b/c/nae */
-		rc |= (flags & EFLG_CF);
-		break;
-	case 2: /* z/e */
-		rc |= (flags & EFLG_ZF);
-		break;
-	case 3: /* be/na */
-		rc |= (flags & (EFLG_CF|EFLG_ZF));
-		break;
-	case 4: /* s */
-		rc |= (flags & EFLG_SF);
-		break;
-	case 5: /* p/pe */
-		rc |= (flags & EFLG_PF);
-		break;
-	case 7: /* le/ng */
-		rc |= (flags & EFLG_ZF);
-		/* fall through */
-	case 6: /* l/nge */
-		rc |= (!(flags & EFLG_SF) != !(flags & EFLG_OF));
-		break;
-	}
-
-	/* Odd condition identifiers (lsb == 1) have inverted sense. */
-	return (!!rc ^ (condition & 1));
+	flags = (flags & EFLAGS_MASK) | X86_EFLAGS_IF;
+	asm("pushq %[flags]; popf; call *%[fastop]"
+	    : "=a"(rc) : [fastop]"r"(fop), [flags]"r"(flags));
+	return rc;
 }
 
 static void fetch_register_operand(struct operand *op)
-- 
1.8.0.1



[PATCH 7/8] KVM: x86 emulator: rearrange fastop definitions

2013-01-12 Thread Avi Kivity
Make fastop opcodes usable in other emulations.

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 70 +-
 1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c7578d0..da2b903 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -972,6 +972,41 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt,
return rc;
 }
 
+FASTOP2(add);
+FASTOP2(or);
+FASTOP2(adc);
+FASTOP2(sbb);
+FASTOP2(and);
+FASTOP2(sub);
+FASTOP2(xor);
+FASTOP2(cmp);
+FASTOP2(test);
+
+FASTOP3WCL(shld);
+FASTOP3WCL(shrd);
+
+FASTOP2W(imul);
+
+FASTOP1(not);
+FASTOP1(neg);
+FASTOP1(inc);
+FASTOP1(dec);
+
+FASTOP2CL(rol);
+FASTOP2CL(ror);
+FASTOP2CL(rcl);
+FASTOP2CL(rcr);
+FASTOP2CL(shl);
+FASTOP2CL(shr);
+FASTOP2CL(sar);
+
+FASTOP2W(bsf);
+FASTOP2W(bsr);
+FASTOP2W(bt);
+FASTOP2W(bts);
+FASTOP2W(btr);
+FASTOP2W(btc);
+
 static u8 test_cc(unsigned int condition, unsigned long flags)
 {
u8 rc;
@@ -2064,26 +2099,6 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-FASTOP1(not);
-FASTOP1(neg);
-FASTOP1(inc);
-FASTOP1(dec);
-
-FASTOP2CL(rol);
-FASTOP2CL(ror);
-FASTOP2CL(rcl);
-FASTOP2CL(rcr);
-FASTOP2CL(shl);
-FASTOP2CL(shr);
-FASTOP2CL(sar);
-
-FASTOP2W(bsf);
-FASTOP2W(bsr);
-FASTOP2W(bt);
-FASTOP2W(bts);
-FASTOP2W(btr);
-FASTOP2W(btc);
-
 static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
 {
u8 ex = 0;
@@ -3040,21 +3055,6 @@ static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-FASTOP2(add);
-FASTOP2(or);
-FASTOP2(adc);
-FASTOP2(sbb);
-FASTOP2(and);
-FASTOP2(sub);
-FASTOP2(xor);
-FASTOP2(cmp);
-FASTOP2(test);
-
-FASTOP3WCL(shld);
-FASTOP3WCL(shrd);
-
-FASTOP2W(imul);
-
 static int em_xchg(struct x86_emulate_ctxt *ctxt)
 {
/* Write back the register source. */
-- 
1.8.0.1



[PATCH 6/8] KVM: x86 emulator: convert 2-operand IMUL to fastop

2013-01-12 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7ff83d9..c7578d0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -441,6 +441,8 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
}   \
} while (0)
 
+static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *));
+
 #define FOP_ALIGN ".align " __stringify(FASTOP_SIZE) " \n\t"
 #define FOP_RET   "ret \n\t"
 
@@ -3051,6 +3053,8 @@ FASTOP2(test);
 FASTOP3WCL(shld);
 FASTOP3WCL(shrd);
 
+FASTOP2W(imul);
+
 static int em_xchg(struct x86_emulate_ctxt *ctxt)
 {
/* Write back the register source. */
@@ -3063,16 +3067,10 @@ static int em_xchg(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_imul(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "imul");
-   return X86EMUL_CONTINUE;
-}
-
 static int em_imul_3op(struct x86_emulate_ctxt *ctxt)
 {
 	ctxt->dst.val = ctxt->src2.val;
-   return em_imul(ctxt);
+   return fastop(ctxt, em_imul);
 }
 
 static int em_cwd(struct x86_emulate_ctxt *ctxt)
@@ -4010,7 +4008,7 @@ static const struct opcode twobyte_table[256] = {
F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
-   D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
+   D(ModRM), F(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
-- 
1.8.0.1



[PATCH 5/8] KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop

2013-01-12 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 76 +-
 1 file changed, 26 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d89e88f..7ff83d9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -478,6 +478,15 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
ON64(FOP2E(op##q, rax, rbx)) \
FOP_END
 
+/* 2 operand, word only */
+#define FASTOP2W(op) \
+   FOP_START(op) \
+   FOPNOP() \
+   FOP2E(op##w, ax, bx) \
+   FOP2E(op##l, eax, ebx) \
+   ON64(FOP2E(op##q, rax, rbx)) \
+   FOP_END
+
 /* 2 operand, src is CL */
 #define FASTOP2CL(op) \
FOP_START(op) \
@@ -2066,6 +2075,13 @@ FASTOP2CL(shl);
 FASTOP2CL(shr);
 FASTOP2CL(sar);
 
+FASTOP2W(bsf);
+FASTOP2W(bsr);
+FASTOP2W(bt);
+FASTOP2W(bts);
+FASTOP2W(btr);
+FASTOP2W(btc);
+
 static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
 {
u8 ex = 0;
@@ -3377,47 +3393,6 @@ static int em_sti(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_bt(struct x86_emulate_ctxt *ctxt)
-{
-	/* Disable writeback. */
-	ctxt->dst.type = OP_NONE;
-	/* only subword offset */
-	ctxt->src.val &= (ctxt->dst.bytes << 3) - 1;
-
-	emulate_2op_SrcV_nobyte(ctxt, "bt");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_bts(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "bts");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_btr(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "btr");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_btc(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "btc");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_bsf(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "bsf");
-	return X86EMUL_CONTINUE;
-}
-
-static int em_bsr(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "bsr");
-	return X86EMUL_CONTINUE;
-}
-
 static int em_cpuid(struct x86_emulate_ctxt *ctxt)
 {
u32 eax, ebx, ecx, edx;
@@ -3773,10 +3748,10 @@ static const struct group_dual group7 = { {
 
 static const struct opcode group8[] = {
N, N, N, N,
-   I(DstMem | SrcImmByte,  em_bt),
-   I(DstMem | SrcImmByte | Lock | PageTable,   em_bts),
-   I(DstMem | SrcImmByte | Lock,   em_btr),
-   I(DstMem | SrcImmByte | Lock | PageTable,   em_btc),
+   F(DstMem | SrcImmByte | NoWrite,em_bt),
+   F(DstMem | SrcImmByte | Lock | PageTable,   em_bts),
+   F(DstMem | SrcImmByte | Lock,   em_btr),
+   F(DstMem | SrcImmByte | Lock | PageTable,   em_btc),
 };
 
 static const struct group_dual group9 = { {
@@ -4025,28 +4000,29 @@ static const struct opcode twobyte_table[256] = {
X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)),
/* 0xA0 - 0xA7 */
I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg),
-	II(ImplicitOps, em_cpuid, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt),
+   II(ImplicitOps, em_cpuid, cpuid),
+   F(DstMem | SrcReg | ModRM | BitOp | NoWrite, em_bt),
F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shld),
F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N,
/* 0xA8 - 0xAF */
I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
DI(ImplicitOps, rsm),
-   I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
+   F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
-   I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
+   F(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg),
D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
/* 0xB8 - 0xBF */
N, N,
G(BitOp, group8),
-   I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
-   I(DstReg | SrcMem | ModRM, em_bsf), I(DstReg | SrcMem | ModRM, em_bsr),
+   F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
+   F(DstReg | SrcMem | ModRM, em_bsf), F(DstReg | SrcMem | ModRM, em_bsr),
D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
/* 0xC0 - 0xC7 */
D2bv(DstMem | SrcReg | ModRM | Lock),
-- 
1.8.0.1


[PATCH v2 0/7] Streamline arithmetic instruction emulation

2013-01-04 Thread Avi Kivity
The current arithmetic instruction emulation is fairly clumsy: after
decode, each instruction gets a switch (size), and for every size
we fetch the operands, prepare flags, emulate the instruction, then store
back the flags and operands.

This patchset simplifies things by moving everything into common code
except the instruction itself.  All the pre- and post- processing is
coded just once.  The per-instruction code looks like:

  add %bl, %al
  ret

  add %bx, %ax
  ret

  add %ebx, %eax
  ret

  add %rbx, %rax
  ret

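To make the shared part concrete, here is a minimal user-space sketch (an
illustration, not the patchset's code) of what the common prologue and
epilogue around such a stub must do: install the emulated flags, run the
operation, and capture the resulting flags.  A plain "add" stands in for
the called stub:

    #include <stdio.h>

    int main(void)
    {
        unsigned long a = 1, b = 2;
        unsigned long flags = 0x202;       /* reserved bit 1 set, IF set */

        asm("push %[flags]; popf;"         /* load the emulated EFLAGS */
            "add %[b], %[a];"              /* the per-instruction part */
            "pushf; pop %[flags]"          /* read EFLAGS back */
            : [a] "+r" (a), [flags] "+r" (flags)
            : [b] "r" (b)
            : "cc");

        printf("a = %lu, flags = %#lx\n", a, flags);
        return 0;
    }
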
The savings in size, for the ten instructions converted in this patchset,
are fairly large:

   text    data     bss     dec     hex filename
  63724       0       0   63724    f8ec arch/x86/kvm/emulate.o.before
  61268       0       0   61268    ef54 arch/x86/kvm/emulate.o.after

- around 2500 bytes.


v2: rebased

Avi Kivity (7):
  KVM: x86 emulator: framework for streamlining arithmetic opcodes
  KVM: x86 emulator: Support for declaring single operand fastops
  KVM: x86 emulator: introduce NoWrite flag
  KVM: x86 emulator: mark CMP, CMPS, SCAS, TEST as NoWrite
  KVM: x86 emulator: convert NOT, NEG to fastop
  KVM: x86 emulator: add macros for defining 2-operand fastop emulation
  KVM: x86 emulator: convert basic ALU ops to fastop

 arch/x86/kvm/emulate.c | 215 +++--
 1 file changed, 120 insertions(+), 95 deletions(-)

-- 
1.8.0.1



[PATCH v2 1/7] KVM: x86 emulator: framework for streamlining arithmetic opcodes

2013-01-04 Thread Avi Kivity
We emulate arithmetic opcodes by executing a similar (same operation,
different operands) on the cpu.  This ensures accurate emulation, esp. wrt.
eflags.  However, the prologue and epilogue around the opcode is fairly long,
consisting of a switch (for the operand size) and code to load and save the
operands.  This is repeated for every opcode.

This patch introduces an alternative way to emulate arithmetic opcodes.
Instead of the above, we have four (three on i386) functions consisting
of just the opcode and a ret; one for each operand size.  For example:

   .align 8
   em_notb:
not %al
ret

   .align 8
   em_notw:
not %ax
ret

   .align 8
   em_notl:
not %eax
ret

   .align 8
   em_notq:
not %rax
ret

The prologue and epilogue are shared across all opcodes.  Note the functions
use a special calling convention; notably eflags is an input/output parameter
and is not clobbered.  Rather than dispatching the four functions through a
jump table, the functions are declared as a constant size (8) so their address
can be calculated.
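
As a rough illustration of that calculation (a sketch, not the kernel code;
__builtin_ctz stands in for the kernel's __ffs, and em_op is a made-up name):

    #include <stdio.h>

    #define FASTOP_SIZE 8

    int main(void)
    {
        /* operand sizes 1/2/4/8 map to stub indexes 0/1/2/3 */
        for (unsigned bytes = 1; bytes <= 8; bytes *= 2)
            printf("%u-byte operand -> em_op + %u\n",
                   bytes, __builtin_ctz(bytes) * FASTOP_SIZE);
        return 0;
    }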

Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 53c5ad6..dd71567 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -149,6 +149,7 @@
 #define Aligned     ((u64)1 << 41)  /* Explicitly aligned (e.g. MOVDQA) */
 #define Unaligned   ((u64)1 << 42)  /* Explicitly unaligned (e.g. MOVDQU) */
 #define Avx         ((u64)1 << 43)  /* Advanced Vector Extensions */
+#define Fastop      ((u64)1 << 44)  /* Use opcode::u.fastop */
 
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
@@ -159,6 +160,27 @@
 #define X8(x...) X4(x), X4(x)
 #define X16(x...) X8(x), X8(x)
 
+#define NR_FASTOP (ilog2(sizeof(ulong)) + 1)
+#define FASTOP_SIZE 8
+
+/*
+ * fastop functions have a special calling convention:
+ *
+ * dst:[rdx]:rax  (in/out)
+ * src:rbx(in/out)
+ * src2:   rcx(in)
+ * flags:  rflags (in/out)
+ *
+ * Moreover, they are all exactly FASTOP_SIZE bytes long, so functions for
+ * different operand sizes can be reached by calculation, rather than a jump
+ * table (which would be bigger than the code).
+ *
+ * fastop functions are declared as taking a never-defined fastop parameter,
+ * so they can't be called from C directly.
+ */
+
+struct fastop;
+
 struct opcode {
u64 flags : 56;
u64 intercept : 8;
@@ -168,6 +190,7 @@ struct opcode {
const struct group_dual *gdual;
const struct gprefix *gprefix;
const struct escape *esc;
+   void (*fastop)(struct fastop *fake);
} u;
int (*check_perm)(struct x86_emulate_ctxt *ctxt);
 };
@@ -3646,6 +3669,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 #define GD(_f, _g) { .flags = ((_f) | GroupDual | ModRM), .u.gdual = (_g) }
 #define E(_f, _e) { .flags = ((_f) | Escape | ModRM), .u.esc = (_e) }
 #define I(_f, _e) { .flags = (_f), .u.execute = (_e) }
+#define F(_f, _e) { .flags = (_f) | Fastop, .u.fastop = (_e) }
 #define II(_f, _e, _i) \
{ .flags = (_f), .u.execute = (_e), .intercept = x86_intercept_##_i }
 #define IIP(_f, _e, _i, _p) \
@@ -4502,6 +4526,16 @@ static void fetch_possible_mmx_operand(struct x86_emulate_ctxt *ctxt,
 		read_mmx_reg(ctxt, &op->mm_val, op->addr.mm);
 }
 
+static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *))
+{
+	ulong flags = (ctxt->eflags & EFLAGS_MASK) | X86_EFLAGS_IF;
+	fop += __ffs(ctxt->dst.bytes) * FASTOP_SIZE;
+	asm("push %[flags]; popf; call *%[fastop]; pushf; pop %[flags]\n"
+	    : "+a"(ctxt->dst.val), "+b"(ctxt->src.val), [flags]"+D"(flags)
+	    : "c"(ctxt->src2.val), [fastop]"S"(fop));
+	ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
+	return X86EMUL_CONTINUE;
+}
 
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 {
@@ -4631,6 +4665,13 @@ special_insn:
}
 
 	if (ctxt->execute) {
+		if (ctxt->d & Fastop) {
+			void (*fop)(struct fastop *) = (void *)ctxt->execute;
+			rc = fastop(ctxt, fop);
+			if (rc != X86EMUL_CONTINUE)
+				goto done;
+			goto writeback;
+		}
 		rc = ctxt->execute(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
-- 
1.8.0.1



[PATCH v2 2/7] KVM: x86 emulator: Support for declaring single operand fastops

2013-01-04 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index dd71567..42c53c8 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -24,6 +24,7 @@
 #include "kvm_cache_regs.h"
 #include <linux/module.h>
 #include <asm/kvm_emulate.h>
+#include <linux/stringify.h>
 
 #include "x86.h"
 #include "tss.h"
@@ -439,6 +440,30 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
}   \
} while (0)
 
+#define FOP_ALIGN ".align " __stringify(FASTOP_SIZE) " \n\t"
+#define FOP_RET   "ret \n\t"
+
+#define FOP_START(op) \
+	extern void em_##op(struct fastop *fake); \
+	asm(".pushsection .text, \"ax\" \n\t" \
+	    ".global em_" #op " \n\t" \
+	    FOP_ALIGN \
+	    "em_" #op ": \n\t"
+
+#define FOP_END \
+	".popsection")
+
+#define FOP1E(op,  dst) \
+	FOP_ALIGN #op " %" #dst " \n\t" FOP_RET
+
+#define FASTOP1(op) \
+   FOP_START(op) \
+   FOP1E(op##b, al) \
+   FOP1E(op##w, ax) \
+   FOP1E(op##l, eax) \
+   ON64(FOP1E(op##q, rax)) \
+   FOP_END
+
 #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \
do {\
unsigned long _tmp; \
-- 
1.8.0.1



[PATCH v2 4/7] KVM: x86 emulator: mark CMP, CMPS, SCAS, TEST as NoWrite

2013-01-04 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index fe113fb..2af0c44 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3069,16 +3069,12 @@ static int em_xor(struct x86_emulate_ctxt *ctxt)
 static int em_cmp(struct x86_emulate_ctxt *ctxt)
 {
 	emulate_2op_SrcV(ctxt, "cmp");
-	/* Disable writeback. */
-	ctxt->dst.type = OP_NONE;
 	return X86EMUL_CONTINUE;
 }
 
 static int em_test(struct x86_emulate_ctxt *ctxt)
 {
 	emulate_2op_SrcV(ctxt, "test");
-	/* Disable writeback. */
-	ctxt->dst.type = OP_NONE;
return X86EMUL_CONTINUE;
 }
 
@@ -3747,7 +3743,7 @@ static const struct opcode group1[] = {
I(Lock | PageTable, em_and),
I(Lock, em_sub),
I(Lock, em_xor),
-   I(0, em_cmp),
+   I(NoWrite, em_cmp),
 };
 
 static const struct opcode group1A[] = {
@@ -3755,8 +3751,8 @@ static const struct opcode group1A[] = {
 };
 
 static const struct opcode group3[] = {
-   I(DstMem | SrcImm, em_test),
-   I(DstMem | SrcImm, em_test),
+   I(DstMem | SrcImm | NoWrite, em_test),
+   I(DstMem | SrcImm | NoWrite, em_test),
I(DstMem | SrcNone | Lock, em_not),
I(DstMem | SrcNone | Lock, em_neg),
I(SrcMem, em_mul_ex),
@@ -3920,7 +3916,7 @@ static const struct opcode opcode_table[256] = {
/* 0x30 - 0x37 */
I6ALU(Lock, em_xor), N, N,
/* 0x38 - 0x3F */
-   I6ALU(0, em_cmp), N, N,
+   I6ALU(NoWrite, em_cmp), N, N,
/* 0x40 - 0x4F */
X16(D(DstReg)),
/* 0x50 - 0x57 */
@@ -3946,7 +3942,7 @@ static const struct opcode opcode_table[256] = {
G(DstMem | SrcImm, group1),
G(ByteOp | DstMem | SrcImm | No64, group1),
G(DstMem | SrcImmByte, group1),
-   I2bv(DstMem | SrcReg | ModRM, em_test),
+   I2bv(DstMem | SrcReg | ModRM | NoWrite, em_test),
I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_xchg),
/* 0x88 - 0x8F */
I2bv(DstMem | SrcReg | ModRM | Mov | PageTable, em_mov),
@@ -3966,12 +3962,12 @@ static const struct opcode opcode_table[256] = {
I2bv(DstAcc | SrcMem | Mov | MemAbs, em_mov),
I2bv(DstMem | SrcAcc | Mov | MemAbs | PageTable, em_mov),
I2bv(SrcSI | DstDI | Mov | String, em_mov),
-   I2bv(SrcSI | DstDI | String, em_cmp),
+   I2bv(SrcSI | DstDI | String | NoWrite, em_cmp),
/* 0xA8 - 0xAF */
-   I2bv(DstAcc | SrcImm, em_test),
+   I2bv(DstAcc | SrcImm | NoWrite, em_test),
I2bv(SrcAcc | DstDI | Mov | String, em_mov),
I2bv(SrcSI | DstAcc | Mov | String, em_mov),
-   I2bv(SrcAcc | DstDI | String, em_cmp),
+   I2bv(SrcAcc | DstDI | String | NoWrite, em_cmp),
/* 0xB0 - 0xB7 */
X8(I(ByteOp | DstReg | SrcImm | Mov, em_mov)),
/* 0xB8 - 0xBF */
-- 
1.8.0.1



[PATCH v2 5/7] KVM: x86 emulator: convert NOT, NEG to fastop

2013-01-04 Thread Avi Kivity
Signed-off-by: Avi Kivity avi.kiv...@gmail.com
---
 arch/x86/kvm/emulate.c | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 2af0c44..09dbdc5 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2050,17 +2050,8 @@ static int em_grp2(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_not(struct x86_emulate_ctxt *ctxt)
-{
-	ctxt->dst.val = ~ctxt->dst.val;
-   return X86EMUL_CONTINUE;
-}
-
-static int em_neg(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_1op(ctxt, "neg");
-   return X86EMUL_CONTINUE;
-}
+FASTOP1(not);
+FASTOP1(neg);
 
 static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
 {
@@ -3753,8 +3744,8 @@ static const struct opcode group1A[] = {
 static const struct opcode group3[] = {
I(DstMem | SrcImm | NoWrite, em_test),
I(DstMem | SrcImm | NoWrite, em_test),
-   I(DstMem | SrcNone | Lock, em_not),
-   I(DstMem | SrcNone | Lock, em_neg),
+   F(DstMem | SrcNone | Lock, em_not),
+   F(DstMem | SrcNone | Lock, em_neg),
I(SrcMem, em_mul_ex),
I(SrcMem, em_imul_ex),
I(SrcMem, em_div_ex),
-- 
1.8.0.1


