Re: Some more basic questions..

2014-05-30 Thread Zhang Haoyu
Thanks Zhang and Venkateshwara, some more follow-up questions below :)

1. Does -realtime mlock=on allocate all the memory upfront and keep it
for the VM, or does it just make sure the memory that is allocated
within the guest is not swapped out under host memory pressure?

“-realtime mlock=on” makes QEMU call mlockall(MCL_CURRENT | MCL_FUTURE), which locks ALL
of QEMU's memory. Because the VM's memory is part of QEMU's address space, this option
keeps the VM's memory from being swapped out.
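For reference, mlockall() can be exercised from any ordinary process; a minimal
stand-alone sketch (generic Linux behaviour, not QEMU code):

    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* Lock all current and future memory of this process into RAM,
         * the same call QEMU issues for -realtime mlock=on. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");   /* usually needs CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK */
            return 1;
        }

        /* With MCL_FUTURE, memory allocated from now on is faulted in and
         * locked at allocation time, so it cannot be swapped out later. */
        char *buf = malloc(64 * 1024 * 1024);
        memset(buf, 0, 64 * 1024 * 1024);
        printf("64 MB allocated and locked\n");
        free(buf);
        return 0;
    }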

2.  I notice that on a 4G guest on an 8G host, the guest allocates only about
1G initially, and the rest later as I start applications. Is there a
way for me to reserve ALL memory (4G in this case) upfront somehow,
without changes to the guest and without allocating it? It would have to be
something in the host OS or some component within the host OS. Isn't there
something to that effect? It seems odd that there isn't.

On Linux, a user process's memory is allocated on demand; the physical memory
is not allocated until the virtual memory is touched.
Because the VM's memory is part of QEMU's, ...
I guess the VM you mention above is a Linux guest. A Windows guest will memset its
memory during the boot period.

You can use -realtime mlock=on to reserve ALL of the VM's memory upfront.
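The demand-allocation behaviour can be observed on the host with a small
stand-alone program (generic Linux, nothing QEMU-specific):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static void print_rss(const char *when)
    {
        /* /proc/self/statm: field 2 is the resident page count of this process */
        FILE *f = fopen("/proc/self/statm", "r");
        long size, rss;
        if (f && fscanf(f, "%ld %ld", &size, &rss) == 2)
            printf("%s: RSS = %ld pages\n", when, rss);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        size_t len = 256 * 1024 * 1024;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        print_rss("after mmap (virtual only)");
        memset(p, 1, len);               /* touching the pages is what allocates them */
        print_rss("after touching all pages");
        return 0;
    }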


Thank you in advance.


Re: [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters

2014-05-30 Thread Peter Zijlstra
On Thu, May 29, 2014 at 06:12:05PM -0700, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 Currently perf unconditionally disables PEBS for guest.
 
 Now that we have the infrastructure in place to handle
 it we can allow it for KVM owned guest events. For
 that, perf needs to know that an event is owned by
 a guest. Add a new state bit in the perf_event for that.
 

This doesn't make sense; why does it need to be owned?


Re: [PATCH 3/4] perf: Handle guest PEBS events with a fake event

2014-05-30 Thread Peter Zijlstra
On Thu, May 29, 2014 at 06:12:06PM -0700, Andi Kleen wrote:

 Note: in very rare cases with exotic events this may lead to spurious PMIs
 in the guest.

Qualify that statement so that if someone runs into it we at least know
it is known/expected.


Re: Implement PEBS virtualization for Silvermont

2014-05-30 Thread Peter Zijlstra
On Thu, May 29, 2014 at 06:12:03PM -0700, Andi Kleen wrote:
 PEBS is very useful (e.g. enabling the more cycles:pp event or
 memory profiling). Unfortunately it didn't work in virtualization,
 which is becoming more and more common.
 
 This patch kit implements simple PEBS virtualization for KVM on Silvermont
 CPUs. Silvermont does not have the leak problems that prevented successful
 PEBS virtualization earlier.

Silvermont is such an underpowered thing, who in his right mind would
run anything virt on it to further reduce performance?


Re: powerpc/pseries: Use new defines when calling h_set_mode

2014-05-30 Thread Alexander Graf


On 29.05.14 23:52, Benjamin Herrenschmidt wrote:

On Thu, 2014-05-29 at 23:27 +0200, Alexander Graf wrote:

On 29.05.14 09:45, Michael Neuling wrote:

+/* Values for 2nd argument to H_SET_MODE */
+#define H_SET_MODE_RESOURCE_SET_CIABR		1
+#define H_SET_MODE_RESOURCE_SET_DAWR		2
+#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE	3
+#define H_SET_MODE_RESOURCE_LE			4

Much better, but I think you want to make use of these in non-kvm code too,
no? At least the LE one is definitely already implemented as call :)

Sure but that's a different patch below.

Ben, how would you like to handle these 2 patches? If you give me an ack
I can just put this patch into my kvm queue. Alternatively we could both
carry a patch that adds the H_SET_MODE header bits only and whoever hits
Linus' tree first wins ;).

No biggie. Worst case it's a trivial conflict.


Well, the way the patches are split right now it won't be a conflict, 
but a build failure on either side.



Alex



Re: [PATCH v2 00/23] MIPS: KVM: Fixes and guest timer rewrite

2014-05-30 Thread Paolo Bonzini

Il 29/05/2014 22:44, James Hogan ha scritto:

Yes, I agree with your analysis and had considered something like this,
although it doesn't particularly appeal to my sense of perfectionism :).


I can see that.  But I think the simplification of the code is worth it.

It is hard to explain why the invalid time-goes-backwards case can 
happen if env->count_save_time is overwritten with data from another 
machine.  I think the explanation is that (due to 
timers_state.cpu_ticks_enabled) the value of cpu_get_clock_at(now) - 
env->count_save_time does not depend on get_clock(), but the code 
doesn't have any comment for that.


Rather than adding comments, we might as well force it to be always zero 
and just write get_clock() to COUNT_RESUME.
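For illustration only (this is not part of the patchset): from the userspace
side, "writing get_clock() to COUNT_RESUME" would boil down to a
KVM_SET_ONE_REG call on the vcpu, roughly like the sketch below. The helper
name is made up; the register id is the KVM_REG_MIPS_COUNT_RESUME that patch
15 of the series introduces.

    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <time.h>

    /* Hypothetical helper: tell KVM that the guest timer resumed "now". */
    static int mips_count_resume_now(int vcpu_fd, uint64_t count_resume_reg_id)
    {
        struct timespec ts;
        struct kvm_one_reg reg;
        uint64_t now_ns;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        now_ns = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;

        reg.id   = count_resume_reg_id;   /* e.g. KVM_REG_MIPS_COUNT_RESUME */
        reg.addr = (uintptr_t)&now_ns;
        return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
    }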


Finally, having to serialize env->count_save_time makes it harder to 
support migration from TCG to KVM and back.



It would be race free though, and if you're stopping the VM at all you
expect to lose some time anyway.


Since you mentioned perfectionism, :) your code also loses some time; 
COUNT_RESUME is written based on when the CPU state becomes clean, not 
on when the CPU was restarted.


Paolo


Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-05-30 Thread Gleb Natapov
On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 PEBS (Precise Event Based Sampling) profiling is very powerful,
 allowing improved sampling precision and much additional information,
 like address or TSX abort profiling. cycles:p and :pp use PEBS.
 
 This patch enables PEBS profiling in KVM guests.
That sounds really cool!

 
 PEBS writes profiling records to a virtual address in memory. Since
 the guest controls the virtual address space, the PEBS record
 is directly delivered to the guest buffer. We set up the PEBS state
 so that it works correctly. The CPU cannot handle any kind of fault during
 these guest writes.
 
 To avoid any problems with guest pages being swapped by the host we
 pin the pages when the PEBS buffer is setup, by intercepting
 that MSR.
It will avoid the guest page being swapped, but the shadow paging code may still drop
shadow PT pages that build the mapping from the DS virtual address to the guest page.
With EPT it is less likely to happen (but still possible IIRC, depending on memory
pressure and how much memory the shadow paging code is allowed to use); without EPT
it will happen for sure.
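The pmu.c hunk that actually does the pinning is not quoted here (the diff
below is truncated), but the scheme described above — translate the DS area
the guest wrote to MSR_IA32_DS_AREA and pin the backing pages — could look
roughly like this sketch; the helper name and the choice of translation call
are assumptions, not the actual patch:

    /* Hypothetical sketch only: pin the guest pages backing the DS/PEBS
     * buffer so that hardware PEBS writes cannot hit a swapped-out page. */
    static int pin_pebs_buffer(struct kvm_vcpu *vcpu, u64 ds_area, int npages)
    {
            struct kvm_pmu *pmu = &vcpu->arch.pmu;
            int i;

            for (i = 0; i < npages && i < MAX_PINNED_PAGES; i++) {
                    gpa_t gpa = kvm_mmu_gva_to_gpa_system(vcpu,
                                        ds_area + i * PAGE_SIZE, NULL);
                    struct page *page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);

                    if (is_error_page(page))
                            return -EFAULT;   /* guest gave us a bad DS address */
                    pmu->pinned_pages[pmu->num_pinned_pages++] = page;
            }
            return 0;
    }

As Gleb points out, pinning the pages this way still says nothing about the
shadow page tables that map them, which is exactly the concern raised above.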

 
 Typically profilers only set up a single page, so pinning that is not
 a big problem. The pinning is limited to 17 pages currently (64K+1)
 
 In theory the guest can change its own page tables after the PEBS
 setup. The host has no way to track that with EPT. But if a guest
 would do that it could only crash itself. It's not expected
 that normal profilers do that.
Spec says:

 The following restrictions should be applied to the DS save area.
   • The three DS save area sections should be allocated from a
   non-paged pool, and marked accessed and dirty. It is the responsibility
   of the operating system to keep the pages that contain the buffer
   present and to mark them accessed and dirty. The implication is that
   the operating system cannot do “lazy” page-table entry propagation
   for these pages.

There is nothing, as far as I can see, that says what will happen if the
condition is not met. I always interpreted it as undefined behaviour, so
anything can happen, including the CPU dying completely.  You are saying above
on one hand that the CPU cannot handle any kind of fault during writes to
the DS area, but on the other hand that a guest could only crash itself. Is this
architecturally guaranteed?


 
 The patch also adds the basic glue to enable the PEBS CPUIDs
 and other PEBS MSRs, and asks perf to enable PEBS as needed.
 
 Due to various limitations it currently only works on Silvermont
 based systems.
 
 This patch doesn't implement the extended MSRs some CPUs support.
 For example latency profiling on SLM will not work at this point.
 
 Timing:
 
 The emulation is somewhat more expensive than a real PMU. This
 may trigger the expensive PMI detection in the guest.
 Usually this can be disabled with
 echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
 
 Migration:
 
 In theory it should be possible (as long as we migrate to
 a host with the same PEBS event and the same PEBS format), but I'm not
 sure the basic KVM PMU code supports it correctly: no code to
 save/restore state, unless I'm missing something. Once the PMU
 code grows proper migration support it should be straightforward
 to handle the PEBS state too.
 
 Signed-off-by: Andi Kleen a...@linux.intel.com
 ---
  arch/x86/include/asm/kvm_host.h   |   6 ++
  arch/x86/include/uapi/asm/msr-index.h |   4 +
  arch/x86/kvm/cpuid.c  |  10 +-
  arch/x86/kvm/pmu.c| 184 
 --
  arch/x86/kvm/vmx.c|   6 ++
  5 files changed, 196 insertions(+), 14 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 7de069af..d87cb66 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -319,6 +319,8 @@ struct kvm_pmc {
   struct kvm_vcpu *vcpu;
  };
  
 +#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
 +
  struct kvm_pmu {
   unsigned nr_arch_gp_counters;
   unsigned nr_arch_fixed_counters;
 @@ -335,6 +337,10 @@ struct kvm_pmu {
   struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
   struct irq_work irq_work;
   u64 reprogram_pmi;
 + u64 pebs_enable;
 + u64 ds_area;
 + struct page *pinned_pages[MAX_PINNED_PAGES];
 + unsigned num_pinned_pages;
  };
  
  enum {
 diff --git a/arch/x86/include/uapi/asm/msr-index.h 
 b/arch/x86/include/uapi/asm/msr-index.h
 index fcf2b3a..409a582 100644
 --- a/arch/x86/include/uapi/asm/msr-index.h
 +++ b/arch/x86/include/uapi/asm/msr-index.h
 @@ -72,6 +72,10 @@
  #define MSR_IA32_PEBS_ENABLE 0x03f1
  #define MSR_IA32_DS_AREA 0x0600
  #define MSR_IA32_PERF_CAPABILITIES   0x0345
 +#define PERF_CAP_PEBS_TRAP   (1U << 6)
 +#define PERF_CAP_ARCH_REG    (1U << 7)
 +#define PERF_CAP_PEBS_FORMAT (0xf << 8)
 +
  #define 

Re: [RFC] Implement Batched (group) ticket lock

2014-05-30 Thread Raghavendra K T

On 05/30/2014 04:15 AM, Waiman Long wrote:

On 05/28/2014 08:16 AM, Raghavendra K T wrote:

- we need an intelligent way to nullify the effect of batching for
baremetal
  (because extra cmpxchg is not required).


To do this, you will need to have 2 slightly different algorithms
depending on the paravirt_ticketlocks_enabled jump label.


Thanks for the hint Waiman.

[...]

+spin:
+	for (;;) {
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+		if (!(inc.head & TICKET_LOCK_HEAD_INC)) {
+			new.head = inc.head | TICKET_LOCK_HEAD_INC;
+			if (cmpxchg(&lock->tickets.head, inc.head, new.head)
+					== inc.head)
+				goto out;
+		}
+		cpu_relax();
+	}
+


It had taken me some time to figure out that the LSB of inc.head is used
as a bit lock for the contending tasks in the spin loop. I would suggest
adding some comment here to make it easier to follow.


Agree. I'll add a comment.
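For illustration, the comment could look something like this (my sketch, not
the actual follow-up patch):

    spin:
    	/*
    	 * The LSB of tickets.head (TICKET_LOCK_HEAD_INC) acts as a bit lock
    	 * for the batch of contending waiters: whoever flips it from 0 to 1
    	 * with cmpxchg() owns the lock; everyone else keeps spinning until
    	 * the owner clears it again on unlock.
    	 */
    	for (;;) {
    		inc.head = ACCESS_ONCE(lock->tickets.head);
    		if (!(inc.head & TICKET_LOCK_HEAD_INC)) {
    			new.head = inc.head | TICKET_LOCK_HEAD_INC;
    			if (cmpxchg(&lock->tickets.head, inc.head, new.head)
    					== inc.head)
    				goto out;
    		}
    		cpu_relax();
    	}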

[...]

+#define TICKET_BATCH	0x4 /* 4 waiters can contend simultaneously */
+#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH << TICKET_LOCK_INC_SHIFT) + \
+				  TICKET_LOCK_TAIL_INC - 1)


I don't think TAIL_INC has anything to do with setting the BATCH_MASK.
It works here because TAIL_INC is 2. I think it is clearer to define it
as either (~(TICKET_BATCH << TICKET_LOCK_INC_SHIFT) + 1) or
(~((TICKET_BATCH << TICKET_LOCK_INC_SHIFT) - 1)).


You are right.
Thanks for pointing it out. Your expression is simpler and clearer. I'll use
one of them.



Re: powerpc/pseries: Use new defines when calling h_set_mode

2014-05-30 Thread Michael Ellerman
On Thu, 2014-05-29 at 17:45 +1000, Michael Neuling wrote:
   +/* Values for 2nd argument to H_SET_MODE */
   +#define H_SET_MODE_RESOURCE_SET_CIABR		1
   +#define H_SET_MODE_RESOURCE_SET_DAWR		2
   +#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE	3
   +#define H_SET_MODE_RESOURCE_LE			4
  
  Much better, but I think you want to make use of these in non-kvm code too,
  no? At least the LE one is definitely already implemented as call :)
 
 powerpc/pseries: Use new defines when calling h_set_mode
 
 Now that we define these in the KVM code, use these defines when we call
 h_set_mode.  No functional change.
 
 Signed-off-by: Michael Neuling mi...@neuling.org
 --
 This depends on the KVM h_set_mode patches.
 
 diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
 b/arch/powerpc/include/asm/plpar_wrappers.h
 index 12c32c5..67859ed 100644
 --- a/arch/powerpc/include/asm/plpar_wrappers.h
 +++ b/arch/powerpc/include/asm/plpar_wrappers.h
 @@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, 
 unsigned long resource,
  static inline long enable_reloc_on_exceptions(void)
  {
   /* mflags = 3: Exceptions at 0xC0004000 */
 - return plpar_set_mode(3, 3, 0, 0);
 + return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
  }

Which header are these coming from, and why aren't we including it? And is it
going to still build with CONFIG_KVM=n?

cheers






[GIT PULL 2/6] KVM: s390: Enable DAT support for TPROT handler

2014-05-30 Thread Christian Borntraeger
From: Thomas Huth th...@linux.vnet.ibm.com

The TPROT instruction can be used to check the accessibility of storage
for any kind of logical addresses. So far, our handler only supported
real addresses. This patch now also enables support for addresses that
have to be translated via DAT first. And while we're at it, change the
code to use the common KVM function gfn_to_hva_prot() to check for the
validity and writability of the memory page.

Signed-off-by: Thomas Huth th...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/kvm/gaccess.c |  4 ++--
 arch/s390/kvm/gaccess.h |  2 ++
 arch/s390/kvm/priv.c| 56 +
 3 files changed, 37 insertions(+), 25 deletions(-)

diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 5f73826..4653ac6 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -292,7 +292,7 @@ static void ipte_unlock_siif(struct kvm_vcpu *vcpu)
	wake_up(&vcpu->kvm->arch.ipte_wq);
 }
 
-static void ipte_lock(struct kvm_vcpu *vcpu)
+void ipte_lock(struct kvm_vcpu *vcpu)
 {
	if (vcpu->arch.sie_block->eca & 1)
ipte_lock_siif(vcpu);
@@ -300,7 +300,7 @@ static void ipte_lock(struct kvm_vcpu *vcpu)
ipte_lock_simple(vcpu);
 }
 
-static void ipte_unlock(struct kvm_vcpu *vcpu)
+void ipte_unlock(struct kvm_vcpu *vcpu)
 {
	if (vcpu->arch.sie_block->eca & 1)
ipte_unlock_siif(vcpu);
diff --git a/arch/s390/kvm/gaccess.h b/arch/s390/kvm/gaccess.h
index 2d37a46..0149cf1 100644
--- a/arch/s390/kvm/gaccess.h
+++ b/arch/s390/kvm/gaccess.h
@@ -327,6 +327,8 @@ int read_guest_real(struct kvm_vcpu *vcpu, unsigned long 
gra, void *data,
return access_guest_real(vcpu, gra, data, len, 0);
 }
 
+void ipte_lock(struct kvm_vcpu *vcpu);
+void ipte_unlock(struct kvm_vcpu *vcpu);
 int ipte_lock_held(struct kvm_vcpu *vcpu);
 int kvm_s390_check_low_addr_protection(struct kvm_vcpu *vcpu, unsigned long 
ga);
 
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 6296159..f89c1cd 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -930,8 +930,9 @@ int kvm_s390_handle_eb(struct kvm_vcpu *vcpu)
 static int handle_tprot(struct kvm_vcpu *vcpu)
 {
u64 address1, address2;
-   struct vm_area_struct *vma;
-   unsigned long user_address;
+   unsigned long hva, gpa;
+   int ret = 0, cc = 0;
+   bool writable;
 
	vcpu->stat.instruction_tprot++;
 
@@ -942,32 +943,41 @@ static int handle_tprot(struct kvm_vcpu *vcpu)
 
/* we only handle the Linux memory detection case:
 * access key == 0
-* guest DAT == off
 * everything else goes to userspace. */
	if (address2 & 0xf0)
		return -EOPNOTSUPP;
	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_DAT)
-		return -EOPNOTSUPP;
-
-	down_read(&current->mm->mmap_sem);
-	user_address = __gmap_translate(address1, vcpu->arch.gmap);
-	if (IS_ERR_VALUE(user_address))
-		goto out_inject;
-	vma = find_vma(current->mm, user_address);
-	if (!vma)
-		goto out_inject;
-	vcpu->arch.sie_block->gpsw.mask &= ~(3ul << 44);
-	if (!(vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_READ))
-		vcpu->arch.sie_block->gpsw.mask |= (1ul << 44);
-	if (!(vma->vm_flags & VM_WRITE) && !(vma->vm_flags & VM_READ))
-		vcpu->arch.sie_block->gpsw.mask |= (2ul << 44);
-
-	up_read(&current->mm->mmap_sem);
-	return 0;
+		ipte_lock(vcpu);
+	ret = guest_translate_address(vcpu, address1, &gpa, 1);
+   if (ret == PGM_PROTECTION) {
+   /* Write protected? Try again with read-only... */
+   cc = 1;
+		ret = guest_translate_address(vcpu, address1, &gpa, 0);
+   }
+   if (ret) {
+   if (ret == PGM_ADDRESSING || ret == PGM_TRANSLATION_SPEC) {
+   ret = kvm_s390_inject_program_int(vcpu, ret);
+		} else if (ret > 0) {
+   /* Translation not available */
+   kvm_s390_set_psw_cc(vcpu, 3);
+   ret = 0;
+   }
+   goto out_unlock;
+   }
 
-out_inject:
-   up_read(current-mm-mmap_sem);
-   return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
+	hva = gfn_to_hva_prot(vcpu->kvm, gpa_to_gfn(gpa), &writable);
+   if (kvm_is_error_hva(hva)) {
+   ret = kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
+   } else {
+   if (!writable)
+			cc = 1; /* Write not permitted ==> read-only */
+   kvm_s390_set_psw_cc(vcpu, cc);
+   /* Note: CC2 only occurs for storage keys (not supported yet) */
+   }
+out_unlock:
+	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_DAT)
+   ipte_unlock(vcpu);
+   return ret;
 }
 
 int kvm_s390_handle_e5(struct kvm_vcpu *vcpu)
-- 
1.8.4.2


[GIT PULL 0/6] KVM: s390: Fixes and cleanups for 3.16

2014-05-30 Thread Christian Borntraeger
Paolo,

The following changes since commit 1f854112553a1d65363ab27d4ee3dfb4b27075fb:

  KVM: vmx: DR7 masking on task switch emulation is wrong (2014-05-22 17:47:18 
+0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git  
tags/kvm-s390-20140530

for you to fetch changes up to 5a5e65361f01b44caa51ba202e6720d458829fc5:

  KVM: s390: Intercept the tprot instruction (2014-05-30 09:39:40 +0200)


1. Several minor fixes and cleanups for KVM:
2. Fix flag check for gdb support
3. Remove unnecessary vcpu start
4. Remove code duplication for sigp interrupts
5. Better DAT handling for the TPROT instruction
6. Correct addressing exception for standby memory


David Hildenbrand (2):
  KVM: s390: check the given debug flags, not the set ones
  KVM: s390: a VCPU is already started when delivering interrupts

Jens Freimann (1):
  KVM: s390: clean up interrupt injection in sigp code

Matthew Rosato (1):
  KVM: s390: Intercept the tprot instruction

Thomas Huth (2):
  KVM: s390: Add a generic function for translating guest addresses
  KVM: s390: Enable DAT support for TPROT handler

 arch/s390/include/asm/kvm_host.h |  1 +
 arch/s390/kvm/gaccess.c  | 57 ++--
 arch/s390/kvm/gaccess.h  |  5 
 arch/s390/kvm/interrupt.c|  1 -
 arch/s390/kvm/kvm-s390.c |  6 +++--
 arch/s390/kvm/priv.c | 56 +++
 arch/s390/kvm/sigp.c | 56 +--
 7 files changed, 116 insertions(+), 66 deletions(-)



[GIT PULL 4/6] KVM: s390: check the given debug flags, not the set ones

2014-05-30 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

This patch fixes a minor bug when updating the guest debug settings.
We should check the given debug flags, not the already set ones.
Doesn't do any harm but too many (for now unused) flags could be set internally
without error.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/kvm-s390.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index e519860..06d1888 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -950,7 +950,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
*vcpu,
	vcpu->guest_debug = 0;
	kvm_s390_clear_bp_data(vcpu);
 
-	if (vcpu->guest_debug & ~VALID_GUESTDBG_FLAGS)
+	if (dbg->control & ~VALID_GUESTDBG_FLAGS)
		return -EINVAL;
 
	if (dbg->control & KVM_GUESTDBG_ENABLE) {
-- 
1.8.4.2



[GIT PULL 1/6] KVM: s390: Add a generic function for translating guest addresses

2014-05-30 Thread Christian Borntraeger
From: Thomas Huth th...@linux.vnet.ibm.com

This patch adds a function for translating logical guest addresses into
physical guest addresses without touching the memory at the given location.

Signed-off-by: Thomas Huth th...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/kvm/gaccess.c | 53 +
 arch/s390/kvm/gaccess.h |  3 +++
 2 files changed, 56 insertions(+)

diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index db608c3..5f73826 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -645,6 +645,59 @@ int access_guest_real(struct kvm_vcpu *vcpu, unsigned long 
gra,
 }
 
 /**
+ * guest_translate_address - translate guest logical into guest absolute 
address
+ *
+ * Parameter semantics are the same as the ones from guest_translate.
+ * The memory contents at the guest address are not changed.
+ *
+ * Note: The IPTE lock is not taken during this function, so the caller
+ * has to take care of this.
+ */
+int guest_translate_address(struct kvm_vcpu *vcpu, unsigned long gva,
+   unsigned long *gpa, int write)
+{
+	struct kvm_s390_pgm_info *pgm = &vcpu->arch.pgm;
+	psw_t *psw = &vcpu->arch.sie_block->gpsw;
+	struct trans_exc_code_bits *tec;
+	union asce asce;
+	int rc;
+
+	/* Access register mode is not supported yet. */
+	if (psw_bits(*psw).t && psw_bits(*psw).as == PSW_AS_ACCREG)
+		return -EOPNOTSUPP;
+
+	gva = kvm_s390_logical_to_effective(vcpu, gva);
+	memset(pgm, 0, sizeof(*pgm));
+	tec = (struct trans_exc_code_bits *)&pgm->trans_exc_code;
+	tec->as = psw_bits(*psw).as;
+	tec->fsi = write ? FSI_STORE : FSI_FETCH;
+	tec->addr = gva >> PAGE_SHIFT;
+	if (is_low_address(gva) && low_address_protection_enabled(vcpu)) {
+		if (write) {
+			rc = pgm->code = PGM_PROTECTION;
+			return rc;
+		}
+	}
+
+	asce.val = get_vcpu_asce(vcpu);
+	if (psw_bits(*psw).t && !asce.r) {	/* Use DAT? */
+		rc = guest_translate(vcpu, gva, gpa, write);
+		if (rc > 0) {
+			if (rc == PGM_PROTECTION)
+				tec->b61 = 1;
+			pgm->code = rc;
+		}
+	} else {
+		rc = 0;
+		*gpa = kvm_s390_real_to_abs(vcpu, gva);
+		if (kvm_is_error_gpa(vcpu->kvm, *gpa))
+			rc = pgm->code = PGM_ADDRESSING;
+   }
+
+   return rc;
+}
+
+/**
  * kvm_s390_check_low_addr_protection - check for low-address protection
  * @ga: Guest address
  *
diff --git a/arch/s390/kvm/gaccess.h b/arch/s390/kvm/gaccess.h
index a07ee08..2d37a46 100644
--- a/arch/s390/kvm/gaccess.h
+++ b/arch/s390/kvm/gaccess.h
@@ -155,6 +155,9 @@ int read_guest_lc(struct kvm_vcpu *vcpu, unsigned long gra, 
void *data,
	return kvm_read_guest(vcpu->kvm, gpa, data, len);
 }
 
+int guest_translate_address(struct kvm_vcpu *vcpu, unsigned long gva,
+   unsigned long *gpa, int write);
+
 int access_guest(struct kvm_vcpu *vcpu, unsigned long ga, void *data,
 unsigned long len, int write);
 
-- 
1.8.4.2



[GIT PULL 6/6] KVM: s390: Intercept the tprot instruction

2014-05-30 Thread Christian Borntraeger
From: Matthew Rosato mjros...@linux.vnet.ibm.com

Based on original patch from Jeng-fang (Nick) Wang

When standby memory is specified for a guest Linux, but no virtual memory has
been allocated on the Qemu host backing that guest, the guest memory detection
process encounters a memory access exception which is not thrown from the KVM
handle_tprot() instruction-handler function. The access exception comes from
sie64a returning EFAULT, which then passes an addressing exception to the guest.
Unfortunately this does not do the proper PSW fixup (nullifying vs.
suppressing), so the guest will get a fault for the wrong address.

Let's just intercept the tprot instruction all the time to do the right thing
and not go the page fault handler path for standby memory. tprot is only used
by Linux during startup so some exits should be ok.
Without this patch, standby memory cannot be used with KVM.

Signed-off-by: Nick Wang jfw...@us.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Tested-by: Matthew Rosato mjros...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/include/asm/kvm_host.h | 1 +
 arch/s390/kvm/kvm-s390.c | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index a27f500..4181d7b 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -110,6 +110,7 @@ struct kvm_s390_sie_block {
 #define ICTL_ISKE  0x4000
 #define ICTL_SSKE  0x2000
 #define ICTL_RRBE  0x1000
+#define ICTL_TPROT 0x0200
__u32   ictl;   /* 0x0048 */
__u32   eca;/* 0x004c */
 #define ICPT_INST  0x04
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 06d1888..43e191b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -637,7 +637,9 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
	if (sclp_has_siif())
		vcpu->arch.sie_block->eca |= 1;
	vcpu->arch.sie_block->fac   = (int) (long) vfacilities;
-	vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
+	vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE |
+				      ICTL_TPROT;
+
	if (kvm_s390_cmma_enabled(vcpu->kvm)) {
rc = kvm_s390_vcpu_setup_cmma(vcpu);
if (rc)
-- 
1.8.4.2



[GIT PULL 5/6] KVM: s390: a VCPU is already started when delivering interrupts

2014-05-30 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

This patch removes the start of a VCPU when delivering a RESTART interrupt.
Interrupt delivery is called from kvm_arch_vcpu_ioctl_run. So the VCPU is
already considered started - no need to call kvm_s390_vcpu_start. This function
will early exit anyway.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/interrupt.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index bf0d9bc..90c8de2 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -442,7 +442,6 @@ static void __do_deliver_interrupt(struct kvm_vcpu *vcpu,
rc |= read_guest_lc(vcpu, offsetof(struct _lowcore, 
restart_psw),
	&vcpu->arch.sie_block->gpsw,
sizeof(psw_t));
-   kvm_s390_vcpu_start(vcpu);
break;
case KVM_S390_PROGRAM_INT:
	VCPU_EVENT(vcpu, 4, "interrupt: pgm check code:%x, ilc:%x",
-- 
1.8.4.2



[GIT PULL 3/6] KVM: s390: clean up interrupt injection in sigp code

2014-05-30 Thread Christian Borntraeger
From: Jens Freimann jf...@linux.vnet.ibm.com

We have all the logic to inject interrupts available in
kvm_s390_inject_vcpu(), so let's use it instead of
injecting irqs manually to the list in sigp code.

SIGP stop is special because we have to check the
action_flags before injecting the interrupt. As
the action_flags are not available in kvm_s390_inject_vcpu()
we leave the code for the stop order code untouched for now.

Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com
Reviewed-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/kvm/sigp.c | 56 +---
 1 file changed, 18 insertions(+), 38 deletions(-)

diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index d0341d2..43079a4 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -54,33 +54,23 @@ static int __sigp_sense(struct kvm_vcpu *vcpu, u16 cpu_addr,
 
 static int __sigp_emergency(struct kvm_vcpu *vcpu, u16 cpu_addr)
 {
-   struct kvm_s390_local_interrupt *li;
-   struct kvm_s390_interrupt_info *inti;
+	struct kvm_s390_interrupt s390int = {
+		.type = KVM_S390_INT_EMERGENCY,
+		.parm = vcpu->vcpu_id,
+	};
 	struct kvm_vcpu *dst_vcpu = NULL;
+	int rc = 0;
 
 	if (cpu_addr < KVM_MAX_VCPUS)
 		dst_vcpu = kvm_get_vcpu(vcpu->kvm, cpu_addr);
 	if (!dst_vcpu)
 		return SIGP_CC_NOT_OPERATIONAL;
 
-	inti = kzalloc(sizeof(*inti), GFP_KERNEL);
-	if (!inti)
-		return -ENOMEM;
-
-	inti->type = KVM_S390_INT_EMERGENCY;
-	inti->emerg.code = vcpu->vcpu_id;
+	rc = kvm_s390_inject_vcpu(dst_vcpu, &s390int);
+	if (!rc)
+		VCPU_EVENT(vcpu, 4, "sent sigp emerg to cpu %x", cpu_addr);
 
-	li = &dst_vcpu->arch.local_int;
-	spin_lock_bh(&li->lock);
-	list_add_tail(&inti->list, &li->list);
-	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags);
-	if (waitqueue_active(li->wq))
-		wake_up_interruptible(li->wq);
-	spin_unlock_bh(&li->lock);
-	VCPU_EVENT(vcpu, 4, "sent sigp emerg to cpu %x", cpu_addr);
-
-   return SIGP_CC_ORDER_CODE_ACCEPTED;
+   return rc ? rc : SIGP_CC_ORDER_CODE_ACCEPTED;
 }
 
 static int __sigp_conditional_emergency(struct kvm_vcpu *vcpu, u16 cpu_addr,
@@ -116,33 +106,23 @@ static int __sigp_conditional_emergency(struct kvm_vcpu 
*vcpu, u16 cpu_addr,
 
 static int __sigp_external_call(struct kvm_vcpu *vcpu, u16 cpu_addr)
 {
-   struct kvm_s390_local_interrupt *li;
-   struct kvm_s390_interrupt_info *inti;
+	struct kvm_s390_interrupt s390int = {
+		.type = KVM_S390_INT_EXTERNAL_CALL,
+		.parm = vcpu->vcpu_id,
+	};
 	struct kvm_vcpu *dst_vcpu = NULL;
+	int rc;
 
 	if (cpu_addr < KVM_MAX_VCPUS)
 		dst_vcpu = kvm_get_vcpu(vcpu->kvm, cpu_addr);
 	if (!dst_vcpu)
 		return SIGP_CC_NOT_OPERATIONAL;
 
-	inti = kzalloc(sizeof(*inti), GFP_KERNEL);
-	if (!inti)
-		return -ENOMEM;
-
-	inti->type = KVM_S390_INT_EXTERNAL_CALL;
-	inti->extcall.code = vcpu->vcpu_id;
-
-	li = &dst_vcpu->arch.local_int;
-	spin_lock_bh(&li->lock);
-	list_add_tail(&inti->list, &li->list);
-	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags);
-	if (waitqueue_active(li->wq))
-		wake_up_interruptible(li->wq);
-	spin_unlock_bh(&li->lock);
-	VCPU_EVENT(vcpu, 4, "sent sigp ext call to cpu %x", cpu_addr);
+	rc = kvm_s390_inject_vcpu(dst_vcpu, &s390int);
+	if (!rc)
+		VCPU_EVENT(vcpu, 4, "sent sigp ext call to cpu %x", cpu_addr);
 
-   return SIGP_CC_ORDER_CODE_ACCEPTED;
+   return rc ? rc : SIGP_CC_ORDER_CODE_ACCEPTED;
 }
 
 static int __inject_sigp_stop(struct kvm_s390_local_interrupt *li, int action)
-- 
1.8.4.2



Re: powerpc/pseries: Use new defines when calling h_set_mode

2014-05-30 Thread Michael Neuling
On Fri, 2014-05-30 at 18:56 +1000, Michael Ellerman wrote:
 On Thu, 2014-05-29 at 17:45 +1000, Michael Neuling wrote:
    +/* Values for 2nd argument to H_SET_MODE */
    +#define H_SET_MODE_RESOURCE_SET_CIABR		1
    +#define H_SET_MODE_RESOURCE_SET_DAWR		2
    +#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE	3
    +#define H_SET_MODE_RESOURCE_LE			4
   
   Much better, but I think you want to make use of these in non-kvm code 
   too,
   no? At least the LE one is definitely already implemented as call :)
  
  powerpc/pseries: Use new defines when calling h_set_mode
  
  Now that we define these in the KVM code, use these defines when we call
  h_set_mode.  No functional change.
  
  Signed-off-by: Michael Neuling mi...@neuling.org
  --
  This depends on the KVM h_set_mode patches.
  
  diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
  b/arch/powerpc/include/asm/plpar_wrappers.h
  index 12c32c5..67859ed 100644
  --- a/arch/powerpc/include/asm/plpar_wrappers.h
  +++ b/arch/powerpc/include/asm/plpar_wrappers.h
  @@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, 
  unsigned long resource,
   static inline long enable_reloc_on_exceptions(void)
   {
  /* mflags = 3: Exceptions at 0xC0004000 */
  -   return plpar_set_mode(3, 3, 0, 0);
  +   return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
   }
 
 Which header are these coming from, and why aren't we including it? And is it
 going to still build with CONFIG_KVM=n?

From include/asm/hvcall.h in the h_set_mode patch set I sent before.

And yes it compiles with CONFIG_KVM=n fine.

Mikey


Re: powerpc/pseries: Use new defines when calling h_set_mode

2014-05-30 Thread Alexander Graf


On 30.05.14 11:10, Michael Neuling wrote:

On Fri, 2014-05-30 at 18:56 +1000, Michael Ellerman wrote:

On Thu, 2014-05-29 at 17:45 +1000, Michael Neuling wrote:

+/* Values for 2nd argument to H_SET_MODE */
+#define H_SET_MODE_RESOURCE_SET_CIABR		1
+#define H_SET_MODE_RESOURCE_SET_DAWR		2
+#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE	3
+#define H_SET_MODE_RESOURCE_LE			4

Much better, but I think you want to make use of these in non-kvm code too,
no? At least the LE one is definitely already implemented as call :)

powerpc/pseries: Use new defines when calling h_set_mode

Now that we define these in the KVM code, use these defines when we call
h_set_mode.  No functional change.

Signed-off-by: Michael Neuling mi...@neuling.org
--
This depends on the KVM h_set_mode patches.

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 12c32c5..67859ed 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, 
unsigned long resource,
  static inline long enable_reloc_on_exceptions(void)
  {
/* mflags = 3: Exceptions at 0xC0004000 */
-   return plpar_set_mode(3, 3, 0, 0);
+   return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
  }

Which header are these coming from, and why aren't we including it? And is it
going to still build with CONFIG_KVM=n?

 From include/asm/hvcall.h in the h_set_mode patch set I sent before.

And yes it compiles with CONFIG_KVM=n fine.


Please split that patch into one that adds the definitions and one that 
changes the KVM code to use those definitions. Both Ben and me can then 
apply the definition patch and our respective tree patch.



Alex



Re: powerpc/pseries: Use new defines when calling h_set_mode

2014-05-30 Thread Alexander Graf


On 30.05.14 11:44, Michael Neuling wrote:




 Which header are these coming from, and why aren't we including it? And is it
 going to still build with CONFIG_KVM=n?

  From include/asm/hvcall.h in the h_set_mode patch set I sent before.

 And yes it compiles with CONFIG_KVM=n fine.


 Please split that patch into one that adds the definitions and one 
that changes the KVM code to use those definitions. Both Ben and me 
can then apply the definition patch and our respective tree patch.



Why don't you just take the original h_set_mode patch and I'll repost 
this cleanup later to Ben when yours is upstream.  This cleanup patch 
is not critical to anything and it avoids more churn.




That works too, but please keep in mind that my path to upstream is much 
longer than what you're used to ;).



Alex



Re: [PATCH v2 02/23] MIPS: Export local_flush_icache_range for KVM

2014-05-30 Thread Ralf Baechle
On Thu, May 29, 2014 at 10:16:24AM +0100, James Hogan wrote:

 Export the local_flush_icache_range function pointer for GPL modules so
 that it can be used by KVM for syncing the icache after binary
 translation of trapping instructions.

Acked-by: Ralf Baechle r...@linux-mips.org

  Ralf


Re: [PATCH v2 14/23] MIPS: KVM: Override guest kernel timer frequency directly

2014-05-30 Thread Ralf Baechle
On Thu, May 29, 2014 at 10:16:36AM +0100, James Hogan wrote:

 The KVM_HOST_FREQ Kconfig symbol was used by KVM guest kernels to
 override the timer frequency calculation to a value based on the host
 frequency. Now that the KVM timer emulation is implemented independent
 of the host timer frequency and defaults to 100MHz, adjust the working
 of CONFIG_KVM_HOST_FREQ to match.
 
 The Kconfig symbol now specifies the guest timer frequency directly, and
 has been renamed accordingly to KVM_GUEST_TIMER_FREQ. It now defaults to
 100MHz too and the help text is updated to make it clear that a zero
 value will allow the normal timer frequency calculation to take place
 (based on the emulated RTC).

Acked-by: Ralf Baechle r...@linux-mips.org

  Ralf


Re: [PATCH v2 00/23] MIPS: KVM: Fixes and guest timer rewrite

2014-05-30 Thread Paolo Bonzini

Il 29/05/2014 11:16, James Hogan ha scritto:

Here are a range of MIPS KVM TE fixes, preferably for v3.16 but I know
it's probably a bit late now. Changes are pretty minimal though since
v1 so please consider. They can also be found on my kvm_mips_queue
branch (and the kvm_mips_timer_v2 tag) here:
git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/kvm-mips.git

They originally served to allow it to work better on Ingenic XBurst
cores which have some peculiarities which break non-portable assumptions
in the MIPS KVM implementation (patches 1-4, 13).

Fixing guest CP0_Count emulation to work without a running host
CP0_Count (patch 13) however required a rewrite of the timer emulation
code to use the kernel monotonic time instead, which needed doing anyway
since basing it directly off the host CP0_Count was broken. Various bugs
were fixed in the process (patches 10-12) and improvements made thanks to
valuable feedback from Paolo Bonzini for the last QEMU MIPS/KVM patchset
(patches 5-7, 15-16).

Finally there are some misc cleanups which I did along the way (patches
17-23).

Only the first patch (fixes MIPS KVM with 4K pages) is marked for
stable. For KVM to work on XBurst it needs the timer rework which is a
fairly complex change, so there's little point marking any of the XBurst
specific changes for stable.

All feedback welcome!

Patches 1-4:
Fix KVM/MIPS with 4K pages, missing RDHWR SYNCI (XBurst),
unmoving CP0_Random (XBurst).
Patches 5-9:
Add EPC, Count, Compare, UserLocal, HWREna guest CP0 registers
to KVM register ioctl interface.
Patches 10-12:
Fix a few potential races relating to timers.
Patches 13-14:
Rewrite guest timer emulation to use ktime_get().
Patches 15-16:
Add KVM virtual registers for controlling guest timer, including
master timer disable, and timer frequency.
Patches 17-23:
Cleanups.

Changes in v2 (tag:kvm_mips_timer_v2):
 Patchset:
 - Drop patch 4 MIPS: KVM: Fix CP0_EBASE KVM register id (David
   Daney).
 - Drop patch 14 MIPS: KVM: Add nanosecond count bias KVM register.
   The COUNT_CTL and COUNT_RESUME API is clean and sufficient.
 - Add missing access to UserLocal and HWREna guest CP0 registers
   (patches 15 and 16).
 - Add export of local_flush_icache_range (patch 2).
 Patch 12 MIPS: KVM: Migrate hrtimer to follow VCPU
 - Move kvm_mips_migrate_count() into kvm_tlb.c to fix a link error when
   KVM is built as a module, since kvm_tlb.c is built statically and
   cannot reference symbols in kvm_mips_emul.c.
 Patch 15 MIPS: KVM: Add master disable count interface
 - Make KVM_REG_MIPS_COUNT_RESUME writable too so that userland can
   control timer using master DC and without bias register. New values
   are rejected if they refer to a monotonic time in the future.
 - Expand on description of KVM_REG_MIPS_COUNT_RESUME about the effects
   of the register and that it can be written.

v1 (tag:kvm_mips_timer_v1):
 see http://marc.info/?l=kvm&m=139843936102657&w=2

James Hogan (23):
  MIPS: KVM: Allocate at least 16KB for exception handlers
  MIPS: Export local_flush_icache_range for KVM
  MIPS: KVM: Use local_flush_icache_range to fix RI on XBurst
  MIPS: KVM: Use tlb_write_random
  MIPS: KVM: Add CP0_EPC KVM register access
  MIPS: KVM: Move KVM_{GET,SET}_ONE_REG definitions into kvm_host.h
  MIPS: KVM: Add CP0_Count/Compare KVM register access
  MIPS: KVM: Add CP0_UserLocal KVM register access
  MIPS: KVM: Add CP0_HWREna KVM register access
  MIPS: KVM: Deliver guest interrupts after local_irq_disable()
  MIPS: KVM: Fix timer race modifying guest CP0_Cause
  MIPS: KVM: Migrate hrtimer to follow VCPU
  MIPS: KVM: Rewrite count/compare timer emulation
  MIPS: KVM: Override guest kernel timer frequency directly
  MIPS: KVM: Add master disable count interface
  MIPS: KVM: Add count frequency KVM register
  MIPS: KVM: Make kvm_mips_comparecount_{func,wakeup} static
  MIPS: KVM: Whitespace fixes in kvm_mips_callbacks
  MIPS: KVM: Fix kvm_debug bit-rottage
  MIPS: KVM: Remove ifdef DEBUG around kvm_debug
  MIPS: KVM: Quieten kvm_info() logging
  MIPS: KVM: Remove redundant NULL checks before kfree()
  MIPS: KVM: Remove redundant semicolon

 arch/mips/Kconfig |  12 +-
 arch/mips/include/asm/kvm_host.h  | 183 ++---
 arch/mips/include/uapi/asm/kvm.h  |  35 +++
 arch/mips/kvm/kvm_locore.S|  32 ---
 arch/mips/kvm/kvm_mips.c  | 140 +-
 arch/mips/kvm/kvm_mips_dyntrans.c |  15 +-
 arch/mips/kvm/kvm_mips_emul.c | 557 --
 arch/mips/kvm/kvm_tlb.c   |  77 +++---
 arch/mips/kvm/kvm_trap_emul.c |  86 +-
 arch/mips/mm/cache.c  |   1 +
 arch/mips/mti-malta/malta-time.c  |  14 +-
 11 files changed, 920 insertions(+), 232 deletions(-)

Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Gleb Natapov g...@kernel.org
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle r...@linux-mips.org
Cc: linux-m...@linux-mips.org
Cc: David Daney 

[PULL 01/41] KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR

2014-05-30 Thread Alexander Graf
The L1 instruction cache control register contains bits that indicate
that we're still handling a request. Mask those out when we set the SPR
so that a read doesn't assume we're still doing something.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/e500_emulate.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 89b7f82..95d886f 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -222,6 +222,7 @@ int kvmppc_core_emulate_mtspr_e500(struct kvm_vcpu *vcpu, 
int sprn, ulong spr_va
 		break;
 	case SPRN_L1CSR1:
 		vcpu_e500->l1csr1 = spr_val;
+		vcpu_e500->l1csr1 &= ~(L1CSR1_ICFI | L1CSR1_ICLFR);
 		break;
 	case SPRN_HID0:
 		vcpu_e500->hid0 = spr_val;
-- 
1.8.1.4



[PULL 02/41] KVM: PPC: E500: Add dcbtls emulation

2014-05-30 Thread Alexander Graf
The dcbtls instruction is able to lock data inside the L1 cache.

We don't want to give the guest actual access to hardware cache locks,
as that could influence other VMs on the same system. But we can tell
the guest that its locking attempt failed.

By implementing the instruction we at least don't give the guest a
program exception which it definitely does not expect.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/reg_booke.h |  1 +
 arch/powerpc/kvm/e500_emulate.c  | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/include/asm/reg_booke.h 
b/arch/powerpc/include/asm/reg_booke.h
index 163c3b0..464f108 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -583,6 +583,7 @@
 
 /* Bit definitions for L1CSR0. */
 #define L1CSR0_CPE 0x0001  /* Data Cache Parity Enable */
+#define L1CSR0_CUL 0x0400  /* Data Cache Unable to Lock */
 #define L1CSR0_CLFC0x0100  /* Cache Lock Bits Flash Clear */
 #define L1CSR0_DCFI0x0002  /* Data Cache Flash Invalidate */
 #define L1CSR0_CFI 0x0002  /* Cache Flash Invalidate */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 95d886f..002d517 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -19,6 +19,7 @@
 #include booke.h
 #include e500.h
 
+#define XOP_DCBTLS  166
 #define XOP_MSGSND  206
 #define XOP_MSGCLR  238
 #define XOP_TLBIVAX 786
@@ -103,6 +104,15 @@ static int kvmppc_e500_emul_ehpriv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
return emulated;
 }
 
+static int kvmppc_e500_emul_dcbtls(struct kvm_vcpu *vcpu)
+{
+   struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+
+   /* Always fail to lock the cache */
+	vcpu_e500->l1csr0 |= L1CSR0_CUL;
+   return EMULATE_DONE;
+}
+
 int kvmppc_core_emulate_op_e500(struct kvm_run *run, struct kvm_vcpu *vcpu,
unsigned int inst, int *advance)
 {
@@ -116,6 +126,10 @@ int kvmppc_core_emulate_op_e500(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
case 31:
switch (get_xop(inst)) {
 
+   case XOP_DCBTLS:
+   emulated = kvmppc_e500_emul_dcbtls(vcpu);
+   break;
+
 #ifdef CONFIG_KVM_E500MC
case XOP_MSGSND:
emulated = kvmppc_e500_emul_msgsnd(vcpu, rb);
-- 
1.8.1.4



[PULL 11/41] KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian

2014-05-30 Thread Alexander Graf
When the guest does an RTAS hypercall it keeps all RTAS variables inside a
big endian data structure.

To make sure we don't have to bother about endianness inside the actual RTAS
handlers, let's just convert the whole structure to host endian before we
call our RTAS handlers and back to big endian when we return to the guest.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_rtas.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_rtas.c b/arch/powerpc/kvm/book3s_rtas.c
index 7a05315..edb14ba 100644
--- a/arch/powerpc/kvm/book3s_rtas.c
+++ b/arch/powerpc/kvm/book3s_rtas.c
@@ -205,6 +205,32 @@ int kvm_vm_ioctl_rtas_define_token(struct kvm *kvm, void 
__user *argp)
return rc;
 }
 
+static void kvmppc_rtas_swap_endian_in(struct rtas_args *args)
+{
+#ifdef __LITTLE_ENDIAN__
+	int i;
+
+	args->token = be32_to_cpu(args->token);
+	args->nargs = be32_to_cpu(args->nargs);
+	args->nret = be32_to_cpu(args->nret);
+	for (i = 0; i < args->nargs; i++)
+		args->args[i] = be32_to_cpu(args->args[i]);
+#endif
+}
+
+static void kvmppc_rtas_swap_endian_out(struct rtas_args *args)
+{
+#ifdef __LITTLE_ENDIAN__
+	int i;
+
+	for (i = 0; i < args->nret; i++)
+		args->args[i] = cpu_to_be32(args->args[i]);
+	args->token = cpu_to_be32(args->token);
+	args->nargs = cpu_to_be32(args->nargs);
+	args->nret = cpu_to_be32(args->nret);
+#endif
+}
+
 int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu)
 {
struct rtas_token_definition *d;
@@ -223,6 +249,8 @@ int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu)
if (rc)
goto fail;
 
+	kvmppc_rtas_swap_endian_in(&args);
+
 	/*
 	 * args->rets is a pointer into args->args. Now that we've
 	 * copied args we need to fix it up to point into our copy,
@@ -247,6 +275,7 @@ int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu)
 
 	if (rc == 0) {
 		args.rets = orig_rets;
+		kvmppc_rtas_swap_endian_out(&args);
 		rc = kvm_write_guest(vcpu->kvm, args_phys, &args, sizeof(args));
if (rc)
goto fail;
-- 
1.8.1.4



[PULL 09/41] KVM: PPC: Book3S PR: Default to big endian guest

2014-05-30 Thread Alexander Graf
The default MSR when user space does not define anything should be identical
on little and big endian hosts, so remove MSR_LE from it.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_pr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 01a7156..d7b0ad2 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1216,7 +1216,7 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_pr(struct 
kvm *kvm,
 	kvmppc_set_pvr_pr(vcpu, vcpu->arch.pvr);
 	vcpu->arch.slb_nr = 64;
 
-	vcpu->arch.shadow_msr = MSR_USER64;
+	vcpu->arch.shadow_msr = MSR_USER64 & ~MSR_LE;
 
err = kvmppc_mmu_init(vcpu);
if (err  0)
-- 
1.8.1.4



[PULL 34/41] KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()

2014-05-30 Thread Alexander Graf
From: Paul Mackerras pau...@samba.org

The global_invalidates() function contains a check that is intended
to tell whether we are currently executing in the context of a hypercall
issued by the guest.  The reason is that the optimization of using a
local TLB invalidate instruction is only valid in that context.  The
check was testing local_paca->kvm_hstate.kvm_vcore, which gets set
when entering the guest but no longer gets cleared when exiting the
guest.  To fix this, we use the kvm_vcpu field instead, which does
get cleared when exiting the guest, by the kvmppc_release_hwthread()
calls inside kvmppc_run_core().

The effect of having the check wrong was that when kvmppc_do_h_remove()
got called from htab_write() on the destination machine during a
migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush.
This meant that when the guest started running in the destination VM,
it may miss out on doing a complete TLB flush, and therefore may end
up using stale TLB entries from a previous guest that used the same
LPID value.

This should make migration more reliable.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 1d6c56a..ac840c6 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -42,13 +42,14 @@ static int global_invalidates(struct kvm *kvm, unsigned 
long flags)
 
/*
 * If there is only one vcore, and it's currently running,
+	 * as indicated by local_paca->kvm_hstate.kvm_vcpu being set,
 * we can use tlbiel as long as we mark all other physical
 * cores as potentially having stale TLB entries for this lpid.
 * If we're not using MMU notifiers, we never take pages away
 * from the guest, so we can use tlbiel if requested.
 * Otherwise, don't use tlbiel.
 */
-	if (kvm->arch.online_vcores == 1 && local_paca->kvm_hstate.kvm_vcore)
+	if (kvm->arch.online_vcores == 1 && local_paca->kvm_hstate.kvm_vcpu)
		global = 0;
	else if (kvm->arch.using_mmu_notifiers)
global = 1;
-- 
1.8.1.4



[PULL 37/41] KVM: PPC: Book3S HV: Make sure we don't miss dirty pages

2014-05-30 Thread Alexander Graf
From: Paul Mackerras pau...@samba.org

Currently, when testing whether a page is dirty (when constructing the
bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit
in the HPT entries mapping the page, and if it is 0, we consider the
page to be clean.  However, the Power ISA doesn't require processors
to set the C bit to 1 immediately when writing to a page, and in fact
allows them to delay the writeback of the C bit until they receive a
TLB invalidation for the page.  Thus it is possible that the page
could be dirty and we miss it.

Now, if there are vcpus running, this is not serious since the
collection of the dirty log is racy already - some vcpu could dirty
the page just after we check it.  But if there are no vcpus running we
should return definitive results, in case we are in the final phase of
migrating the guest.

Also, if the permission bits in the HPTE don't allow writing, then we
know that no CPU can set C.  If the HPTE was previously writable and
the page was modified, any C bit writeback would have been flushed out
by the tlbie that we did when changing the HPTE to read-only.

Otherwise we need to do a TLB invalidation even if the C bit is 0, and
then check the C bit.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 47 +
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 96c9044..8056107 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1060,6 +1060,11 @@ void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long 
hva, pte_t pte)
kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
 }
 
+static int vcpus_running(struct kvm *kvm)
+{
+	return atomic_read(&kvm->arch.vcpus_running) != 0;
+}
+
 /*
  * Returns the number of system pages that are dirty.
  * This can be more than 1 if we find a huge-page HPTE.
@@ -1069,6 +1074,7 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, 
unsigned long *rmapp)
	struct revmap_entry *rev = kvm->arch.revmap;
unsigned long head, i, j;
unsigned long n;
+   unsigned long v, r;
unsigned long *hptep;
int npages_dirty = 0;
 
@@ -1088,7 +1094,22 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, 
unsigned long *rmapp)
	hptep = (unsigned long *) (kvm->arch.hpt_virt + (i << 4));
j = rev[i].forw;
 
-		if (!(hptep[1] & HPTE_R_C))
+   /*
+* Checking the C (changed) bit here is racy since there
+* is no guarantee about when the hardware writes it back.
+* If the HPTE is not writable then it is stable since the
+* page can't be written to, and we would have done a tlbie
+* (which forces the hardware to complete any writeback)
+* when making the HPTE read-only.
+* If vcpus are running then this call is racy anyway
+* since the page could get dirtied subsequently, so we
+* expect there to be a further call which would pick up
+* any delayed C bit writeback.
+* Otherwise we need to do the tlbie even if C==0 in
+* order to pick up any delayed writeback of C.
+*/
+		if (!(hptep[1] & HPTE_R_C) &&
+		    (!hpte_is_writable(hptep[1]) || vcpus_running(kvm)))
continue;
 
if (!try_lock_hpte(hptep, HPTE_V_HVLOCK)) {
@@ -1100,23 +1121,29 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, 
unsigned long *rmapp)
}
 
/* Now check and modify the HPTE */
-		if ((hptep[0] & HPTE_V_VALID) && (hptep[1] & HPTE_R_C)) {
-			/* need to make it temporarily absent to clear C */
-			hptep[0] |= HPTE_V_ABSENT;
-			kvmppc_invalidate_hpte(kvm, hptep, i);
-			hptep[1] &= ~HPTE_R_C;
-			eieio();
-			hptep[0] = (hptep[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;
+		if (!(hptep[0] & HPTE_V_VALID))
+			continue;
+
+		/* need to make it temporarily absent so C is stable */
+		hptep[0] |= HPTE_V_ABSENT;
+		kvmppc_invalidate_hpte(kvm, hptep, i);
+		v = hptep[0];
+		r = hptep[1];
+		if (r & HPTE_R_C) {
+			hptep[1] = r & ~HPTE_R_C;
 			if (!(rev[i].guest_rpte & HPTE_R_C)) {
rev[i].guest_rpte |= HPTE_R_C;
note_hpte_modification(kvm, rev[i]);
}
-   n = hpte_page_size(hptep[0], hptep[1]);
+   n = hpte_page_size(v, r);
n = (n + 

[PULL 39/41] KVM: PPC: Book3S HV: Fix machine check delivery to guest

2014-05-30 Thread Alexander Graf
From: Paul Mackerras pau...@samba.org

The code that delivered a machine check to the guest after handling
it in real mode failed to load up r11 before calling kvmppc_msr_interrupt,
which needs the old MSR value in r11 so it can see the transactional
state there.  This adds the missing load.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 60fe8ba..220aefb 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2144,6 +2144,7 @@ machine_check_realmode:
beq mc_cont
/* If not, deliver a machine check.  SRR0/1 are already set */
li  r10, BOOK3S_INTERRUPT_MACHINE_CHECK
+   ld  r11, VCPU_MSR(r9)
bl  kvmppc_msr_interrupt
b   fast_interrupt_c_return
 
-- 
1.8.1.4



[PULL 35/41] KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address

2014-05-30 Thread Alexander Graf
From: Paul Mackerras pau...@samba.org

Currently, when a huge page is faulted in for a guest, we select the
rmap chain to insert the HPTE into based on the guest physical address
that the guest tried to access.  Since there is an rmap chain for each
system page, there are many rmap chains for the area covered by a huge
page (e.g. 256 for 16MB pages when PAGE_SIZE = 64kB), and the huge-page
HPTE could end up in any one of them.

For consistency, and to make the huge-page HPTEs easier to find, we now
put huge-page HPTEs in the rmap chain corresponding to the base address
of the huge page.
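
As a rough illustration of the arithmetic (not from the patch; the fault
address is made up and 64kB system pages are assumed), a tiny user-space
program:

#include <stdio.h>

int main(void)
{
        unsigned long page_shift = 16;             /* 64kB system pages (assumed) */
        unsigned long psize      = 16UL << 20;     /* 16MB huge page */
        unsigned long gpa        = 0x11234567UL;   /* hypothetical faulting address */

        unsigned long gpa_base = gpa & ~(psize - 1);

        /* 256 rmap chains cover one 16MB page at 64kB granularity ... */
        printf("rmap chains per huge page: %lu\n", psize >> page_shift);
        /* ... and the HPTE now always goes into the one for the base. */
        printf("fault gfn 0x%lx -> rmap gfn 0x%lx\n",
               gpa >> page_shift, gpa_base >> page_shift);
        return 0;
}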

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index f32896f..4e22ecb 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -585,6 +585,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
struct kvm *kvm = vcpu-kvm;
unsigned long *hptep, hpte[3], r;
unsigned long mmu_seq, psize, pte_size;
+   unsigned long gpa_base, gfn_base;
unsigned long gpa, gfn, hva, pfn;
struct kvm_memory_slot *memslot;
unsigned long *rmap;
@@ -623,7 +624,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 
/* Translate the logical address and get the page */
psize = hpte_page_size(hpte[0], r);
-   gpa = (r  HPTE_R_RPN  ~(psize - 1)) | (ea  (psize - 1));
+   gpa_base = r  HPTE_R_RPN  ~(psize - 1);
+   gfn_base = gpa_base  PAGE_SHIFT;
+   gpa = gpa_base | (ea  (psize - 1));
gfn = gpa  PAGE_SHIFT;
memslot = gfn_to_memslot(kvm, gfn);
 
@@ -635,6 +638,13 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
if (!kvm-arch.using_mmu_notifiers)
return -EFAULT; /* should never get here */
 
+   /*
+* This should never happen, because of the slot_is_aligned()
+* check in kvmppc_do_h_enter().
+*/
+   if (gfn_base  memslot-base_gfn)
+   return -EFAULT;
+
/* used to check for invalidations in progress */
mmu_seq = kvm-mmu_notifier_seq;
smp_rmb();
@@ -727,7 +737,8 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
goto out_unlock;
hpte[0] = (hpte[0]  ~HPTE_V_ABSENT) | HPTE_V_VALID;
 
-   rmap = memslot-arch.rmap[gfn - memslot-base_gfn];
+   /* Always put the HPTE in the rmap chain for the page base address */
+   rmap = memslot-arch.rmap[gfn_base - memslot-base_gfn];
lock_rmap(rmap);
 
/* Check if we might have been invalidated; let the guest retry if so */
-- 
1.8.1.4



[PULL 12/41] KVM: PPC: PR: Fill pvinfo hcall instructions in big endian

2014-05-30 Thread Alexander Graf
We expose a blob of hypercall instructions to user space that it gives to
the guest via device tree again. That blob should contain a stream of
instructions necessary to do a hypercall in big endian, as it just gets
passed into the guest and old guests use them straight away.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/powerpc.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 3cf541a..a9bd0ff 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -1015,10 +1015,10 @@ static int kvm_vm_ioctl_get_pvinfo(struct 
kvm_ppc_pvinfo *pvinfo)
u32 inst_nop = 0x6000;
 #ifdef CONFIG_KVM_BOOKE_HV
u32 inst_sc1 = 0x4422;
-   pvinfo-hcall[0] = inst_sc1;
-   pvinfo-hcall[1] = inst_nop;
-   pvinfo-hcall[2] = inst_nop;
-   pvinfo-hcall[3] = inst_nop;
+   pvinfo-hcall[0] = cpu_to_be32(inst_sc1);
+   pvinfo-hcall[1] = cpu_to_be32(inst_nop);
+   pvinfo-hcall[2] = cpu_to_be32(inst_nop);
+   pvinfo-hcall[3] = cpu_to_be32(inst_nop);
 #else
u32 inst_lis = 0x3c00;
u32 inst_ori = 0x6000;
@@ -1034,10 +1034,10 @@ static int kvm_vm_ioctl_get_pvinfo(struct 
kvm_ppc_pvinfo *pvinfo)
 *sc
 *nop
 */
-   pvinfo-hcall[0] = inst_lis | ((KVM_SC_MAGIC_R0  16)  inst_imm_mask);
-   pvinfo-hcall[1] = inst_ori | (KVM_SC_MAGIC_R0  inst_imm_mask);
-   pvinfo-hcall[2] = inst_sc;
-   pvinfo-hcall[3] = inst_nop;
+   pvinfo-hcall[0] = cpu_to_be32(inst_lis | ((KVM_SC_MAGIC_R0  16)  
inst_imm_mask));
+   pvinfo-hcall[1] = cpu_to_be32(inst_ori | (KVM_SC_MAGIC_R0  
inst_imm_mask));
+   pvinfo-hcall[2] = cpu_to_be32(inst_sc);
+   pvinfo-hcall[3] = cpu_to_be32(inst_nop);
 #endif
 
pvinfo-flags = KVM_PPC_PVINFO_FLAGS_EV_IDLE;
-- 
1.8.1.4



[PULL 40/41] KVM: PPC: Book3S PR: Use SLB entry 0

2014-05-30 Thread Alexander Graf
We didn't make use of SLB entry 0 for ... no good reason. SLB entry 0
will always be used by the Linux linear SLB entry, so the fact that slbia
does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway.

Just enable use of SLB entry 0 for our shadow SLB code.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_64_mmu_host.c | 11 ---
 arch/powerpc/kvm/book3s_64_slb.S  |  3 ++-
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c 
b/arch/powerpc/kvm/book3s_64_mmu_host.c
index e2efb85..0ac9839 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -271,11 +271,8 @@ static int kvmppc_mmu_next_segment(struct kvm_vcpu *vcpu, 
ulong esid)
int found_inval = -1;
int r;
 
-   if (!svcpu-slb_max)
-   svcpu-slb_max = 1;
-
/* Are we overwriting? */
-   for (i = 1; i  svcpu-slb_max; i++) {
+   for (i = 0; i  svcpu-slb_max; i++) {
if (!(svcpu-slb[i].esid  SLB_ESID_V))
found_inval = i;
else if ((svcpu-slb[i].esid  ESID_MASK) == esid) {
@@ -285,7 +282,7 @@ static int kvmppc_mmu_next_segment(struct kvm_vcpu *vcpu, 
ulong esid)
}
 
/* Found a spare entry that was invalidated before */
-   if (found_inval  0) {
+   if (found_inval = 0) {
r = found_inval;
goto out;
}
@@ -359,7 +356,7 @@ void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong 
ea, ulong seg_size)
ulong seg_mask = -seg_size;
int i;
 
-   for (i = 1; i  svcpu-slb_max; i++) {
+   for (i = 0; i  svcpu-slb_max; i++) {
if ((svcpu-slb[i].esid  SLB_ESID_V) 
(svcpu-slb[i].esid  seg_mask) == ea) {
/* Invalidate this entry */
@@ -373,7 +370,7 @@ void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong 
ea, ulong seg_size)
 void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu)
 {
struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-   svcpu-slb_max = 1;
+   svcpu-slb_max = 0;
svcpu-slb[0].esid = 0;
svcpu_put(svcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S
index 596140e..84c52c6 100644
--- a/arch/powerpc/kvm/book3s_64_slb.S
+++ b/arch/powerpc/kvm/book3s_64_slb.S
@@ -138,7 +138,8 @@ slb_do_enter:
 
/* Restore bolted entries from the shadow and fix it along the way */
 
-   /* We don't store anything in entry 0, so we don't need to take care of 
it */
+   li  r0, r0
+   slbmte  r0, r0
slbia
isync
 
-- 
1.8.1.4



[PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Alexander Graf
From: Paul Mackerras pau...@samba.org

Commit b005255e12a3 (KVM: PPC: Book3S HV: Context-switch new POWER8
SPRs) added a definition of KVM_REG_PPC_WORT with the same register
number as the existing KVM_REG_PPC_VRSAVE (though in fact the
definitions are not identical because of the different register sizes.)

For clarity, this moves KVM_REG_PPC_WORT to the next unused number,
and also adds it to api.txt.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 Documentation/virtual/kvm/api.txt   | 1 +
 arch/powerpc/include/uapi/asm/kvm.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 9a95770..6b30290 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1873,6 +1873,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_PPR  | 64
  PPC   | KVM_REG_PPC_ARCH_COMPAT | 32
   PPC   | KVM_REG_PPC_DABRX | 32
+  PPC   | KVM_REG_PPC_WORT  | 64
   PPC   | KVM_REG_PPC_TM_GPR0  | 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31 | 64
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index a6665be..2bc4a94 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -545,7 +545,6 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TCSCR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb1)
 #define KVM_REG_PPC_PID(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb2)
 #define KVM_REG_PPC_ACOP   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3)
-#define KVM_REG_PPC_WORT   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb4)
 
 #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
 #define KVM_REG_PPC_LPCR   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
@@ -555,6 +554,7 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_ARCH_COMPAT(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb7)
 
 #define KVM_REG_PPC_DABRX  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb8)
+#define KVM_REG_PPC_WORT   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb9)
 
 /* Transactional Memory checkpointed state:
  * This is all GPRs, all VSX regs and a subset of SPRs
-- 
1.8.1.4



[PULL 38/41] KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs

2014-05-30 Thread Alexander Graf
From: Paul Mackerras pau...@samba.org

This adds workarounds for two hardware bugs in the POWER8 performance
monitor unit (PMU), both related to interrupt generation.  The effect
of these bugs is that PMU interrupts can get lost, leading to tools
such as perf reporting fewer counts and samples than they should.

The first bug relates to the PMAO (perf. mon. alert occurred) bit in
MMCR0; setting it should cause an interrupt, but doesn't.  The other
bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0.
Setting PMAE when a counter is negative and counter negative
conditions are enabled to cause alerts should cause an alert, but
doesn't.

The workaround for the first bug is to create conditions where a
counter will overflow, whenever we are about to restore a MMCR0
value that has PMAO set (and PMAO_SYNC clear).  The workaround for
the second bug is to freeze all counters using MMCR2 before reading
MMCR0.
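
The idea behind the first workaround, as a toy user-space illustration
only (the real fix is the kvmppc_fix_pmao assembly helper referenced in
the hunks below): PMCs are 32 bits wide and an alert fires when a
counter goes "negative", i.e. bit 31 becomes set, so arming a counter
just below that point makes the very next counted event deliver the
alert that the stale PMAO bit says is pending.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t pmc = 0x7fffffff;      /* one event away from "negative" */

        pmc += 1;                       /* the forced extra event */
        printf("counter negative now: %s\n",
               (pmc & 0x80000000u) ? "yes" : "no");
        return 0;
}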

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/reg.h  | 12 ---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 59 +++--
 2 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e5d2e0b..4852bcf 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -670,18 +670,20 @@
 #define   MMCR0_PROBLEM_DISABLE MMCR0_FCP
 #define   MMCR0_FCM1   0x1000UL /* freeze counters while MSR mark = 1 */
 #define   MMCR0_FCM0   0x0800UL /* freeze counters while MSR mark = 0 */
-#define   MMCR0_PMXE   0x0400UL /* performance monitor exception enable */
-#define   MMCR0_FCECE  0x0200UL /* freeze ctrs on enabled cond or event */
+#define   MMCR0_PMXE   ASM_CONST(0x0400) /* perf mon exception enable */
+#define   MMCR0_FCECE  ASM_CONST(0x0200) /* freeze ctrs on enabled cond or 
event */
 #define   MMCR0_TBEE   0x0040UL /* time base exception enable */
 #define   MMCR0_BHRBA  0x0020UL /* BHRB Access allowed in userspace */
 #define   MMCR0_EBE0x0010UL /* Event based branch enable */
 #define   MMCR0_PMCC   0x000cUL /* PMC control */
 #define   MMCR0_PMCC_U60x0008UL /* PMC1-6 are R/W by user (PR) */
 #define   MMCR0_PMC1CE 0x8000UL /* PMC1 count enable*/
-#define   MMCR0_PMCjCE 0x4000UL /* PMCj count enable*/
+#define   MMCR0_PMCjCE ASM_CONST(0x4000) /* PMCj count enable*/
 #define   MMCR0_TRIGGER0x2000UL /* TRIGGER enable */
-#define   MMCR0_PMAO_SYNC 0x0800UL /* PMU interrupt is synchronous */
-#define   MMCR0_PMAO   0x0080UL /* performance monitor alert has occurred, 
set to 0 after handling exception */
+#define   MMCR0_PMAO_SYNC ASM_CONST(0x0800) /* PMU intr is synchronous */
+#define   MMCR0_C56RUN ASM_CONST(0x0100) /* PMC5/6 count when RUN=0 */
+/* performance monitor alert has occurred, set to 0 after handling exception */
+#define   MMCR0_PMAO   ASM_CONST(0x0080)
 #define   MMCR0_SHRFC  0x0040UL /* SHRre freeze conditions between threads 
*/
 #define   MMCR0_FC56   0x0010UL /* freeze counters 5 and 6 */
 #define   MMCR0_FCTI   0x0008UL /* freeze counters in tags inactive mode */
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index ffbb871..60fe8ba 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -86,6 +86,12 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
lbz r4, LPPACA_PMCINUSE(r3)
cmpwi   r4, 0
beq 23f /* skip if not */
+BEGIN_FTR_SECTION
+   ld  r3, HSTATE_MMCR(r13)
+   andi.   r4, r3, MMCR0_PMAO_SYNC | MMCR0_PMAO
+   cmpwi   r4, MMCR0_PMAO
+   beqlkvmppc_fix_pmao
+END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG)
lwz r3, HSTATE_PMC(r13)
lwz r4, HSTATE_PMC + 4(r13)
lwz r5, HSTATE_PMC + 8(r13)
@@ -726,6 +732,12 @@ skip_tm:
sldir3, r3, 31  /* MMCR0_FC (freeze counters) bit */
mtspr   SPRN_MMCR0, r3  /* freeze all counters, disable ints */
isync
+BEGIN_FTR_SECTION
+   ld  r3, VCPU_MMCR(r4)
+   andi.   r5, r3, MMCR0_PMAO_SYNC | MMCR0_PMAO
+   cmpwi   r5, MMCR0_PMAO
+   beqlkvmppc_fix_pmao
+END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG)
lwz r3, VCPU_PMC(r4)/* always load up guest PMU registers */
lwz r5, VCPU_PMC + 4(r4)/* to prevent information leak */
lwz r6, VCPU_PMC + 8(r4)
@@ -1324,6 +1336,30 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 25:
/* Save PMU registers if requested */
/* r8 and cr0.eq are live here */
+BEGIN_FTR_SECTION
+   /*
+* POWER8 seems to have a hardware bug where setting
+* MMCR0[PMAE] along with MMCR0[PMC1CE] and/or MMCR0[PMCjCE]
+* when some counters are already negative doesn't seem
+* to cause a performance monitor 

[PULL 41/41] KVM: PPC: Book3S PR: Rework SLB switching code

2014-05-30 Thread Alexander Graf
On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.

Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.

Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.

While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.
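
For reference, a rough C sketch of the shadow SLB descriptor this
swizzles, loosely following the kernel's struct slb_shadow in lppaca.h;
field names and the bolted-entry count here are illustrative. The
valid-entry count is a big-endian 32-bit field, which is why a single
byte store at offset 3 (its least significant byte) is enough to declare
the buffer empty or full:

#include <stdint.h>

struct slb_shadow_sketch {
        uint32_t persistent;            /* number of valid bolted entries (BE) */
        uint32_t buffer_length;         /* total buffer length (BE) */
        uint64_t reserved;
        struct {
                uint64_t esid;          /* BE */
                uint64_t vsid;          /* BE */
        } save_area[3];                 /* SLB_NUM_BOLTED is currently 3 */
};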

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kernel/paca.c   |  3 ++
 arch/powerpc/kvm/book3s_64_slb.S | 83 ++--
 arch/powerpc/mm/slb.c|  2 +-
 3 files changed, 42 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index ad302f8..d6e195e 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -98,6 +98,9 @@ static inline void free_lppacas(void) { }
 /*
  * 3 persistent SLBs are registered here.  The buffer will be zero
  * initially, hence will all be invaild until we actually write them.
+ *
+ * If you make the number of persistent SLB entries dynamic, please also
+ * update PR KVM to flush and restore them accordingly.
  */
 static struct slb_shadow *slb_shadow;
 
diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S
index 84c52c6..3589c4e 100644
--- a/arch/powerpc/kvm/book3s_64_slb.S
+++ b/arch/powerpc/kvm/book3s_64_slb.S
@@ -17,29 +17,9 @@
  * Authors: Alexander Graf ag...@suse.de
  */
 
-#define SHADOW_SLB_ESID(num)   (SLBSHADOW_SAVEAREA + (num * 0x10))
-#define SHADOW_SLB_VSID(num)   (SLBSHADOW_SAVEAREA + (num * 0x10) + 0x8)
-#define UNBOLT_SLB_ENTRY(num) \
-   li  r11, SHADOW_SLB_ESID(num);  \
-   LDX_BE  r9, r12, r11;   \
-   /* Invalid? Skip. */;   \
-   rldicl. r0, r9, 37, 63; \
-   beq slb_entry_skip_ ## num; \
-   xoris   r9, r9, SLB_ESID_V@h;   \
-   STDX_BE r9, r12, r11;   \
-  slb_entry_skip_ ## num:
-
-#define REBOLT_SLB_ENTRY(num) \
-   li  r8, SHADOW_SLB_ESID(num);   \
-   li  r7, SHADOW_SLB_VSID(num);   \
-   LDX_BE  r10, r11, r8;   \
-   cmpdi   r10, 0; \
-   beq slb_exit_skip_ ## num;  \
-   orisr10, r10, SLB_ESID_V@h; \
-   LDX_BE  r9, r11, r7;\
-   slbmte  r9, r10;\
-   STDX_BE r10, r11, r8;   \
-slb_exit_skip_ ## num:
+#define SHADOW_SLB_ENTRY_LEN   0x10
+#define OFFSET_ESID(x) (SHADOW_SLB_ENTRY_LEN * x)
+#define OFFSET_VSID(x) ((SHADOW_SLB_ENTRY_LEN * x) + 8)
 
 /**
  **
@@ -63,20 +43,15 @@ slb_exit_skip_ ## num:
 * SVCPU[LR]  = guest LR
 */
 
-   /* Remove LPAR shadow entries */
+BEGIN_FW_FTR_SECTION
 
-#if SLB_NUM_BOLTED == 3
+   /* Declare SLB shadow as 0 entries big */
 
-   ld  r12, PACA_SLBSHADOWPTR(r13)
+   ld  r11, PACA_SLBSHADOWPTR(r13)
+   li  r8, 0
+   stb r8, 3(r11)
 
-   /* Remove bolted entries */
-   UNBOLT_SLB_ENTRY(0)
-   UNBOLT_SLB_ENTRY(1)
-   UNBOLT_SLB_ENTRY(2)
-   
-#else
-#error unknown number of bolted entries
-#endif
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_LPAR)
 
/* Flush SLB */
 
@@ -99,7 +74,7 @@ slb_loop_enter:
 
ld  r10, 0(r11)
 
-   rldicl. r0, r10, 37, 63
+   andis.  r9, r10, SLB_ESID_V@h
beq slb_loop_enter_skip
 
ld  r9, 8(r11)
@@ -136,24 +111,42 @@ slb_do_enter:
 *
 */
 
-   /* Restore bolted entries from the shadow and fix it along the way */
+   /* Remove all SLB entries that are in use. */
 
li  r0, r0
slbmte  r0, r0
slbia
-   isync
 
-#if SLB_NUM_BOLTED == 3
+   /* Restore bolted entries from the shadow */
 
ld  r11, PACA_SLBSHADOWPTR(r13)
 
-   REBOLT_SLB_ENTRY(0)
-   REBOLT_SLB_ENTRY(1)
-   REBOLT_SLB_ENTRY(2)
-   
-#else
-#error unknown number of bolted entries
-#endif
+BEGIN_FW_FTR_SECTION
+
+   /* Declare SLB shadow as SLB_NUM_BOLTED entries big */
+
+   li  r8, SLB_NUM_BOLTED
+   stb r8, 3(r11)
+
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_LPAR)
+
+   /* Manually load all entries from shadow SLB */
+
+   li  r8, SLBSHADOW_SAVEAREA
+   li  r7, SLBSHADOW_SAVEAREA + 8
+
+   .rept   SLB_NUM_BOLTED
+   LDX_BE  r10, r11, r8
+   cmpdi   r10, 0
+   beq 

[PULL 36/41] KVM: PPC: Book3S HV: Fix dirty map for hugepages

2014-05-30 Thread Alexander Graf
From: Alexey Kardashevskiy a...@ozlabs.ru

The dirty map that we construct for the KVM_GET_DIRTY_LOG ioctl has
one bit per system page (4K/64K).  Currently, we only set one bit in
the map for each HPT entry with the Change bit set, even if the HPT is
for a large page (e.g., 16MB).  Userspace then considers only the
first system page dirty, though in fact the guest may have modified
anywhere in the large page.

To fix this, we make kvm_test_clear_dirty() return the actual number
of pages that are dirty (and rename it to kvm_test_clear_dirty_npages()
to emphasize that that's what it returns).  In kvmppc_hv_get_dirty_log()
we then set that many bits in the dirty map.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 33 -
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 4e22ecb..96c9044 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1060,22 +1060,27 @@ void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long 
hva, pte_t pte)
kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
 }
 
-static int kvm_test_clear_dirty(struct kvm *kvm, unsigned long *rmapp)
+/*
+ * Returns the number of system pages that are dirty.
+ * This can be more than 1 if we find a huge-page HPTE.
+ */
+static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 {
struct revmap_entry *rev = kvm-arch.revmap;
unsigned long head, i, j;
+   unsigned long n;
unsigned long *hptep;
-   int ret = 0;
+   int npages_dirty = 0;
 
  retry:
lock_rmap(rmapp);
if (*rmapp  KVMPPC_RMAP_CHANGED) {
*rmapp = ~KVMPPC_RMAP_CHANGED;
-   ret = 1;
+   npages_dirty = 1;
}
if (!(*rmapp  KVMPPC_RMAP_PRESENT)) {
unlock_rmap(rmapp);
-   return ret;
+   return npages_dirty;
}
 
i = head = *rmapp  KVMPPC_RMAP_INDEX;
@@ -1106,13 +,16 @@ static int kvm_test_clear_dirty(struct kvm *kvm, 
unsigned long *rmapp)
rev[i].guest_rpte |= HPTE_R_C;
note_hpte_modification(kvm, rev[i]);
}
-   ret = 1;
+   n = hpte_page_size(hptep[0], hptep[1]);
+   n = (n + PAGE_SIZE - 1)  PAGE_SHIFT;
+   if (n  npages_dirty)
+   npages_dirty = n;
}
hptep[0] = ~HPTE_V_HVLOCK;
} while ((i = j) != head);
 
unlock_rmap(rmapp);
-   return ret;
+   return npages_dirty;
 }
 
 static void harvest_vpa_dirty(struct kvmppc_vpa *vpa,
@@ -1136,15 +1144,22 @@ static void harvest_vpa_dirty(struct kvmppc_vpa *vpa,
 long kvmppc_hv_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot,
 unsigned long *map)
 {
-   unsigned long i;
+   unsigned long i, j;
unsigned long *rmapp;
struct kvm_vcpu *vcpu;
 
preempt_disable();
rmapp = memslot-arch.rmap;
for (i = 0; i  memslot-npages; ++i) {
-   if (kvm_test_clear_dirty(kvm, rmapp)  map)
-   __set_bit_le(i, map);
+   int npages = kvm_test_clear_dirty_npages(kvm, rmapp);
+   /*
+* Note that if npages > 0 then i must be a multiple of npages,
+* since we always put huge-page HPTEs in the rmap chain
+* corresponding to their page base address.
+*/
+   if (npages && map)
+   for (j = i; npages; ++j, --npages)
+   __set_bit_le(j, map);
++rmapp;
}
 
-- 
1.8.1.4



[PULL 10/41] KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian

2014-05-30 Thread Alexander Graf
The HTAB on PPC is always in big endian. When we access it via hypercalls
on behalf of the guest and we're running on a little endian host, we need
to make sure we swap the bits accordingly.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_pr_papr.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_pr_papr.c 
b/arch/powerpc/kvm/book3s_pr_papr.c
index 5efa97b..255e5b1 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -57,7 +57,7 @@ static int kvmppc_h_pr_enter(struct kvm_vcpu *vcpu)
for (i = 0; ; ++i) {
if (i == 8)
goto done;
-   if ((*hpte  HPTE_V_VALID) == 0)
+   if ((be64_to_cpu(*hpte)  HPTE_V_VALID) == 0)
break;
hpte += 2;
}
@@ -67,8 +67,8 @@ static int kvmppc_h_pr_enter(struct kvm_vcpu *vcpu)
goto done;
}
 
-   hpte[0] = kvmppc_get_gpr(vcpu, 6);
-   hpte[1] = kvmppc_get_gpr(vcpu, 7);
+   hpte[0] = cpu_to_be64(kvmppc_get_gpr(vcpu, 6));
+   hpte[1] = cpu_to_be64(kvmppc_get_gpr(vcpu, 7));
pteg_addr += i * HPTE_SIZE;
copy_to_user((void __user *)pteg_addr, hpte, HPTE_SIZE);
kvmppc_set_gpr(vcpu, 4, pte_index | i);
@@ -93,6 +93,8 @@ static int kvmppc_h_pr_remove(struct kvm_vcpu *vcpu)
pteg = get_pteg_addr(vcpu, pte_index);
mutex_lock(vcpu-kvm-arch.hpt_mutex);
copy_from_user(pte, (void __user *)pteg, sizeof(pte));
+   pte[0] = be64_to_cpu(pte[0]);
+   pte[1] = be64_to_cpu(pte[1]);
 
ret = H_NOT_FOUND;
if ((pte[0]  HPTE_V_VALID) == 0 ||
@@ -169,6 +171,8 @@ static int kvmppc_h_pr_bulk_remove(struct kvm_vcpu *vcpu)
 
pteg = get_pteg_addr(vcpu, tsh  H_BULK_REMOVE_PTEX);
copy_from_user(pte, (void __user *)pteg, sizeof(pte));
+   pte[0] = be64_to_cpu(pte[0]);
+   pte[1] = be64_to_cpu(pte[1]);
 
/* tsl = AVPN */
flags = (tsh  H_BULK_REMOVE_FLAGS)  26;
@@ -207,6 +211,8 @@ static int kvmppc_h_pr_protect(struct kvm_vcpu *vcpu)
pteg = get_pteg_addr(vcpu, pte_index);
mutex_lock(vcpu-kvm-arch.hpt_mutex);
copy_from_user(pte, (void __user *)pteg, sizeof(pte));
+   pte[0] = be64_to_cpu(pte[0]);
+   pte[1] = be64_to_cpu(pte[1]);
 
ret = H_NOT_FOUND;
if ((pte[0]  HPTE_V_VALID) == 0 ||
@@ -225,6 +231,8 @@ static int kvmppc_h_pr_protect(struct kvm_vcpu *vcpu)
 
rb = compute_tlbie_rb(v, r, pte_index);
vcpu-arch.mmu.tlbie(vcpu, rb, rb  1 ? true : false);
+   pte[0] = cpu_to_be64(pte[0]);
+   pte[1] = cpu_to_be64(pte[1]);
copy_to_user((void __user *)pteg, pte, sizeof(pte));
ret = H_SUCCESS;
 
-- 
1.8.1.4



[PULL 31/41] KVM: PPC: Add CAP to indicate hcall fixes

2014-05-30 Thread Alexander Graf
We worked around some nasty KVM magic page hcall breakages:

  1) NX bit not honored, so ignore NX when we detect it
  2) LE guests swizzle hypercall instruction

Without these fixes in place, there's no way it would make sense to expose kvm
hypercalls to a guest. Chances are immensely high it would trip over and break.

So add a new CAP that gives user space a hint that we have workarounds for the
bugs above in place. It can use those as hint to disable PV hypercalls when
the guest CPU is anything POWER7 or higher and the host does not have fixes
in place.
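
A minimal sketch of how user space might probe the new capability before
deciding whether to expose PV hypercalls (assumes a linux/kvm.h new
enough to define the constant, hence the fallback define; the policy
comment is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_CAP_PPC_FIXUP_HCALL
#define KVM_CAP_PPC_FIXUP_HCALL 102     /* value added by this patch */
#endif

int main(void)
{
        int kvm = open("/dev/kvm", O_RDWR);
        if (kvm < 0) {
                perror("open /dev/kvm");
                return 1;
        }

        int fixed = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_PPC_FIXUP_HCALL);
        printf("hcall fixups %s\n", fixed > 0 ? "present" : "absent");
        /* Hypothetical policy: only advertise the magic page / PV hcalls
         * to POWER7-or-newer guests when fixed > 0. */
        return 0;
}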

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/powerpc.c | 1 +
 include/uapi/linux/kvm.h   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 154f352..bab20f4 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -416,6 +416,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_SPAPR_TCE:
case KVM_CAP_PPC_ALLOC_HTAB:
case KVM_CAP_PPC_RTAS:
+   case KVM_CAP_PPC_FIXUP_HCALL:
 #ifdef CONFIG_KVM_XICS
case KVM_CAP_IRQ_XICS:
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2b83cf3..16c923d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -748,6 +748,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_IRQCHIP 99
 #define KVM_CAP_IOEVENTFD_NO_LENGTH 100
 #define KVM_CAP_VM_ATTRIBUTES 101
+#define KVM_CAP_PPC_FIXUP_HCALL 102
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.8.1.4



[PULL 30/41] KVM: PPC: MPIC: Reset IRQ source private members

2014-05-30 Thread Alexander Graf
When we reset the in-kernel MPIC controller, we forget to reset some hidden
state such as destmask and output. This state is usually set when the guest
writes to the IDR register for a specific IRQ line.

To make sure we stay in sync and don't forget hidden state, treat reset of
the IDR register as a simple write of the IDR register. That automatically
updates all the hidden state as well.

Reported-by: Paul Janzen p...@pauljanzen.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/mpic.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index efbd996..b68d0dc 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -126,6 +126,8 @@ static int openpic_cpu_write_internal(void *opaque, gpa_t 
addr,
  u32 val, int idx);
 static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
 u32 *ptr, int idx);
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+   uint32_t val);
 
 enum irq_type {
IRQ_TYPE_NORMAL = 0,
@@ -528,7 +530,6 @@ static void openpic_reset(struct openpic *opp)
/* Initialise IRQ sources */
for (i = 0; i  opp-max_irq; i++) {
opp-src[i].ivpr = opp-ivpr_reset;
-   opp-src[i].idr = opp-idr_reset;
 
switch (opp-src[i].type) {
case IRQ_TYPE_NORMAL:
@@ -543,6 +544,8 @@ static void openpic_reset(struct openpic *opp)
case IRQ_TYPE_FSLSPECIAL:
break;
}
+
+   write_IRQreg_idr(opp, i, opp-idr_reset);
}
/* Initialise IRQ destinations */
for (i = 0; i  MAX_CPU; i++) {
-- 
1.8.1.4



[PULL 27/41] KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler

2014-05-30 Thread Alexander Graf
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

Use make_dsisr instead of open coding it. This also has the
added benefit of handling alignment interrupts on additional
instructions.

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/disassemble.h | 34 +
 arch/powerpc/kernel/align.c| 34 +
 arch/powerpc/kvm/book3s_emulate.c  | 39 +-
 3 files changed, 36 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/disassemble.h 
b/arch/powerpc/include/asm/disassemble.h
index 856f8de..6330a61 100644
--- a/arch/powerpc/include/asm/disassemble.h
+++ b/arch/powerpc/include/asm/disassemble.h
@@ -81,4 +81,38 @@ static inline unsigned int get_oc(u32 inst)
 {
return (inst  11)  0x7fff;
 }
+
+#define IS_XFORM(inst) (get_op(inst)  == 31)
+#define IS_DSFORM(inst)        (get_op(inst) >= 56)
+
+/*
+ * Create a DSISR value from the instruction
+ */
+static inline unsigned make_dsisr(unsigned instr)
+{
+   unsigned dsisr;
+
+
+   /* bits  6:15 -- 22:31 */
+   dsisr = (instr  0x03ff)  16;
+
+   if (IS_XFORM(instr)) {
+   /* bits 29:30 -- 15:16 */
+   dsisr |= (instr  0x0006)  14;
+   /* bit 25 --17 */
+   dsisr |= (instr  0x0040)  8;
+   /* bits 21:24 -- 18:21 */
+   dsisr |= (instr  0x0780)  3;
+   } else {
+   /* bit  5 --17 */
+   dsisr |= (instr  0x0400)  12;
+   /* bits  1: 4 -- 18:21 */
+   dsisr |= (instr  0x7800)  17;
+   /* bits 30:31 -- 12:13 */
+   if (IS_DSFORM(instr))
+   dsisr |= (instr  0x0003)  18;
+   }
+
+   return dsisr;
+}
 #endif /* __ASM_PPC_DISASSEMBLE_H__ */
diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index 94908af..34f5552 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -25,14 +25,13 @@
 #include asm/cputable.h
 #include asm/emulated_ops.h
 #include asm/switch_to.h
+#include asm/disassemble.h
 
 struct aligninfo {
unsigned char len;
unsigned char flags;
 };
 
-#define IS_XFORM(inst) (((inst) >> 26) == 31)
-#define IS_DSFORM(inst)        (((inst) >> 26) >= 56)
 
 #define INVALID{ 0, 0 }
 
@@ -192,37 +191,6 @@ static struct aligninfo aligninfo[128] = {
 };
 
 /*
- * Create a DSISR value from the instruction
- */
-static inline unsigned make_dsisr(unsigned instr)
-{
-   unsigned dsisr;
-
-
-   /* bits  6:15 -- 22:31 */
-   dsisr = (instr  0x03ff)  16;
-
-   if (IS_XFORM(instr)) {
-   /* bits 29:30 -- 15:16 */
-   dsisr |= (instr  0x0006)  14;
-   /* bit 25 --17 */
-   dsisr |= (instr  0x0040)  8;
-   /* bits 21:24 -- 18:21 */
-   dsisr |= (instr  0x0780)  3;
-   } else {
-   /* bit  5 --17 */
-   dsisr |= (instr  0x0400)  12;
-   /* bits  1: 4 -- 18:21 */
-   dsisr |= (instr  0x7800)  17;
-   /* bits 30:31 -- 12:13 */
-   if (IS_DSFORM(instr))
-   dsisr |= (instr  0x0003)  18;
-   }
-
-   return dsisr;
-}
-
-/*
  * The dcbz (data cache block zero) instruction
  * gives an alignment fault if used on non-cacheable
  * memory.  We handle the fault mainly for the
diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 61f38eb..c992447 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -634,44 +634,7 @@ unprivileged:
 
 u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned int inst)
 {
-   u32 dsisr = 0;
-
-   /*
-* This is what the spec says about DSISR bits (not mentioned = 0):
-*
-* 12:13[DS]Set to bits 30:31
-* 15:16[X] Set to bits 29:30
-* 17   [X] Set to bit 25
-*  [D/DS]  Set to bit 5
-* 18:21[X] Set to bits 21:24
-*  [D/DS]  Set to bits 1:4
-* 22:26Set to bits 6:10 (RT/RS/FRT/FRS)
-* 27:31Set to bits 11:15 (RA)
-*/
-
-   switch (get_op(inst)) {
-   /* D-form */
-   case OP_LFS:
-   case OP_LFD:
-   case OP_STFD:
-   case OP_STFS:
-   dsisr |= (inst  12)  0x4000; /* bit 17 */
-   dsisr |= (inst  17)  0x3c00; /* bits 18:21 */
-   break;
-   /* X-form */
-   case 31:
-   dsisr |= (inst  14)  0x18000; /* bits 15:16 */
-   dsisr |= (inst  8)   0x04000; /* bit 17 */
-   dsisr |= (inst  3)   0x03c00; /* bits 18:21 */

[PULL 28/41] PPC: ePAPR: Fix hypercall on LE guest

2014-05-30 Thread Alexander Graf
We get an array of instructions from the hypervisor via device tree that
we write into a buffer that gets executed whenever we want to make an
ePAPR compliant hypercall.

However, the hypervisor passes us these instructions in BE order which
we have to manually convert to LE when we want to run them in LE mode.

With this fixup in place, I can successfully run LE kernels with KVM
PV enabled on PR KVM.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kernel/epapr_paravirt.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/epapr_paravirt.c 
b/arch/powerpc/kernel/epapr_paravirt.c
index 7898be9..d9b7935 100644
--- a/arch/powerpc/kernel/epapr_paravirt.c
+++ b/arch/powerpc/kernel/epapr_paravirt.c
@@ -47,9 +47,10 @@ static int __init early_init_dt_scan_epapr(unsigned long 
node,
return -1;
 
for (i = 0; i  (len / 4); i++) {
-   patch_instruction(epapr_hypercall_start + i, insts[i]);
+   u32 inst = be32_to_cpu(insts[i]);
+   patch_instruction(epapr_hypercall_start + i, inst);
 #if !defined(CONFIG_64BIT) || defined(CONFIG_PPC_BOOK3E_64)
-   patch_instruction(epapr_ev_idle_start + i, insts[i]);
+   patch_instruction(epapr_ev_idle_start + i, inst);
 #endif
}
 
-- 
1.8.1.4



[PULL 32/41] KVM: PPC: Book3S: Add ONE_REG register names that were missed

2014-05-30 Thread Alexander Graf
From: Paul Mackerras pau...@samba.org

Commit 3b7834743f9 (KVM: PPC: Book3S HV: Reserve POWER8 space in 
get/set_one_reg) added definitions for several KVM_REG_PPC_* symbols
but missed adding some to api.txt.  This adds them.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 Documentation/virtual/kvm/api.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 0581f6c..9a95770 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1794,6 +1794,11 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_MMCR0 | 64
   PPC   | KVM_REG_PPC_MMCR1 | 64
   PPC   | KVM_REG_PPC_MMCRA | 64
+  PPC   | KVM_REG_PPC_MMCR2 | 64
+  PPC   | KVM_REG_PPC_MMCRS | 64
+  PPC   | KVM_REG_PPC_SIAR  | 64
+  PPC   | KVM_REG_PPC_SDAR  | 64
+  PPC   | KVM_REG_PPC_SIER  | 64
   PPC   | KVM_REG_PPC_PMC1  | 32
   PPC   | KVM_REG_PPC_PMC2  | 32
   PPC   | KVM_REG_PPC_PMC3  | 32
-- 
1.8.1.4



[PULL 07/41] KVM: PPC: Book3S_64 PR: Access HTAB in big endian

2014-05-30 Thread Alexander Graf
The HTAB is always big endian. We access the guest's HTAB using
copy_from/to_user, but don't yet take care of the fact that we might
be running on an LE host.

Wrap all accesses to the guest HTAB with big endian accessors.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_64_mmu.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c
index 171e5ca..b93c245 100644
--- a/arch/powerpc/kvm/book3s_64_mmu.c
+++ b/arch/powerpc/kvm/book3s_64_mmu.c
@@ -275,12 +275,15 @@ do_second:
key = 4;
 
for (i=0; i16; i+=2) {
+   u64 pte0 = be64_to_cpu(pteg[i]);
+   u64 pte1 = be64_to_cpu(pteg[i + 1]);
+
/* Check all relevant fields of 1st dword */
-   if ((pteg[i]  v_mask) == v_val) {
+   if ((pte0  v_mask) == v_val) {
/* If large page bit is set, check pgsize encoding */
if (slbe-large 
(vcpu-arch.hflags  BOOK3S_HFLAG_MULTI_PGSIZE)) {
-   pgsize = decode_pagesize(slbe, pteg[i+1]);
+   pgsize = decode_pagesize(slbe, pte1);
if (pgsize  0)
continue;
}
@@ -297,8 +300,8 @@ do_second:
goto do_second;
}
 
-   v = pteg[i];
-   r = pteg[i+1];
+   v = be64_to_cpu(pteg[i]);
+   r = be64_to_cpu(pteg[i+1]);
pp = (r  HPTE_R_PP) | key;
if (r  HPTE_R_PP0)
pp |= 8;
-- 
1.8.1.4



[PULL 15/41] KVM: PPC: Book3S: Move little endian conflict to HV KVM

2014-05-30 Thread Alexander Graf
With the previous patches applied, we can now successfully use PR KVM on
little endian hosts which means we can now allow users to select it.

However, HV KVM still needs some work, so let's keep the kconfig conflict
on that one.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 141b202..d6a53b9 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -6,7 +6,6 @@ source virt/kvm/Kconfig
 
 menuconfig VIRTUALIZATION
bool Virtualization
-   depends on !CPU_LITTLE_ENDIAN
---help---
  Say Y here to get to see options for using your Linux host to run
  other operating systems inside virtual machines (guests).
@@ -76,6 +75,7 @@ config KVM_BOOK3S_64
 config KVM_BOOK3S_64_HV
tristate KVM support for POWER7 and PPC970 using hypervisor mode in 
host
depends on KVM_BOOK3S_64
+   depends on !CPU_LITTLE_ENDIAN
select KVM_BOOK3S_HV_POSSIBLE
select MMU_NOTIFIER
select CMA
-- 
1.8.1.4



[PULL 19/41] KVM: PPC: Book3S PR: Expose TAR facility to guest

2014-05-30 Thread Alexander Graf
POWER8 implements a new register called TAR. This register has to be
enabled in FSCR and then from KVM's point of view is mere storage.

This patch enables the guest to use TAR.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_host.h |  2 ++
 arch/powerpc/kernel/asm-offsets.c   |  2 ++
 arch/powerpc/kvm/book3s.c   |  6 ++
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_pr.c| 18 ++
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 232ec5f..29fbb55 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -449,7 +449,9 @@ struct kvm_vcpu_arch {
ulong pc;
ulong ctr;
ulong lr;
+#ifdef CONFIG_PPC_BOOK3S
ulong tar;
+#endif
 
ulong xer;
u32 cr;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index e2b86b5..93e1465 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -446,7 +446,9 @@ int main(void)
DEFINE(VCPU_XER, offsetof(struct kvm_vcpu, arch.xer));
DEFINE(VCPU_CTR, offsetof(struct kvm_vcpu, arch.ctr));
DEFINE(VCPU_LR, offsetof(struct kvm_vcpu, arch.lr));
+#ifdef CONFIG_PPC_BOOK3S
DEFINE(VCPU_TAR, offsetof(struct kvm_vcpu, arch.tar));
+#endif
DEFINE(VCPU_CR, offsetof(struct kvm_vcpu, arch.cr));
DEFINE(VCPU_PC, offsetof(struct kvm_vcpu, arch.pc));
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 79cfa2d..4046a1a 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -634,6 +634,9 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, 
struct kvm_one_reg *reg)
case KVM_REG_PPC_FSCR:
val = get_reg_val(reg-id, vcpu-arch.fscr);
break;
+   case KVM_REG_PPC_TAR:
+   val = get_reg_val(reg-id, vcpu-arch.tar);
+   break;
default:
r = -EINVAL;
break;
@@ -726,6 +729,9 @@ int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, 
struct kvm_one_reg *reg)
case KVM_REG_PPC_FSCR:
vcpu-arch.fscr = set_reg_val(reg-id, val);
break;
+   case KVM_REG_PPC_TAR:
+   vcpu-arch.tar = set_reg_val(reg-id, val);
+   break;
default:
r = -EINVAL;
break;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0092e12..ee1d8ee 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -891,9 +891,6 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 
id,
case KVM_REG_PPC_BESCR:
*val = get_reg_val(id, vcpu-arch.bescr);
break;
-   case KVM_REG_PPC_TAR:
-   *val = get_reg_val(id, vcpu-arch.tar);
-   break;
case KVM_REG_PPC_DPDES:
*val = get_reg_val(id, vcpu-arch.vcore-dpdes);
break;
@@ -1100,9 +1097,6 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
case KVM_REG_PPC_BESCR:
vcpu-arch.bescr = set_reg_val(id, *val);
break;
-   case KVM_REG_PPC_TAR:
-   vcpu-arch.tar = set_reg_val(id, *val);
-   break;
case KVM_REG_PPC_DPDES:
vcpu-arch.vcore-dpdes = set_reg_val(id, *val);
break;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index ddc626e..7d27a95 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -90,6 +90,7 @@ static void kvmppc_core_vcpu_put_pr(struct kvm_vcpu *vcpu)
 #endif
 
kvmppc_giveup_ext(vcpu, MSR_FP | MSR_VEC | MSR_VSX);
+   kvmppc_giveup_fac(vcpu, FSCR_TAR_LG);
vcpu-cpu = -1;
 }
 
@@ -625,6 +626,14 @@ static void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong 
fac)
/* Facility not available to the guest, ignore giveup request*/
return;
}
+
+   switch (fac) {
+   case FSCR_TAR_LG:
+   vcpu-arch.tar = mfspr(SPRN_TAR);
+   mtspr(SPRN_TAR, current-thread.tar);
+   vcpu-arch.shadow_fscr = ~FSCR_TAR;
+   break;
+   }
 #endif
 }
 
@@ -794,6 +803,12 @@ static int kvmppc_handle_fac(struct kvm_vcpu *vcpu, ulong 
fac)
}
 
switch (fac) {
+   case FSCR_TAR_LG:
+   /* TAR switching isn't lazy in Linux yet */
+   current-thread.tar = mfspr(SPRN_TAR);
+   mtspr(SPRN_TAR, vcpu-arch.tar);
+   vcpu-arch.shadow_fscr |= FSCR_TAR;
+   break;
default:
kvmppc_emulate_fac(vcpu, fac);
 

[PULL 13/41] KVM: PPC: Make shared struct aka magic page guest endian

2014-05-30 Thread Alexander Graf
The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.

When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.

Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.

For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.

For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.
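
Roughly, each generated getter looks like the following sketch
(illustrative only; the real helpers come from the SHARED_WRAPPER_GET/SET
macros this patch adds to kvm_ppc.h, and the function name here is made
up):

static inline u64 kvmppc_get_msr_sketch(struct kvm_vcpu *vcpu)
{
        /* One byteswap decision per access, driven by how the page was
         * shared with the guest; the setters follow the same pattern. */
        if (kvmppc_shared_big_endian(vcpu))
                return be64_to_cpu(vcpu->arch.shared->msr);
        else
                return le64_to_cpu(vcpu->arch.shared->msr);
}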

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_book3s.h|  3 +-
 arch/powerpc/include/asm/kvm_booke.h |  5 --
 arch/powerpc/include/asm/kvm_host.h  |  3 +
 arch/powerpc/include/asm/kvm_ppc.h   | 80 +-
 arch/powerpc/kernel/asm-offsets.c|  4 ++
 arch/powerpc/kvm/book3s.c| 72 
 arch/powerpc/kvm/book3s_32_mmu.c | 21 +++
 arch/powerpc/kvm/book3s_32_mmu_host.c|  4 +-
 arch/powerpc/kvm/book3s_64_mmu.c | 19 ---
 arch/powerpc/kvm/book3s_64_mmu_host.c|  4 +-
 arch/powerpc/kvm/book3s_emulate.c| 28 -
 arch/powerpc/kvm/book3s_exports.c|  1 +
 arch/powerpc/kvm/book3s_hv.c | 11 
 arch/powerpc/kvm/book3s_interrupts.S | 23 +++-
 arch/powerpc/kvm/book3s_paired_singles.c | 16 +++---
 arch/powerpc/kvm/book3s_pr.c | 97 +++-
 arch/powerpc/kvm/book3s_pr_papr.c|  2 +-
 arch/powerpc/kvm/emulate.c   | 24 
 arch/powerpc/kvm/powerpc.c   | 33 ++-
 arch/powerpc/kvm/trace_pr.h  |  2 +-
 20 files changed, 309 insertions(+), 143 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bb1e38a..f52f656 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -268,9 +268,10 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu-arch.pc;
 }
 
+static inline u64 kvmppc_get_msr(struct kvm_vcpu *vcpu);
 static inline bool kvmppc_need_byteswap(struct kvm_vcpu *vcpu)
 {
-   return (vcpu->arch.shared->msr & MSR_LE) != (MSR_KERNEL & MSR_LE);
+   return (kvmppc_get_msr(vcpu) & MSR_LE) != (MSR_KERNEL & MSR_LE);
 }
 
 static inline u32 kvmppc_get_last_inst_internal(struct kvm_vcpu *vcpu, ulong 
pc)
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index 80d46b5..c7aed61 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -108,9 +108,4 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
 {
return vcpu-arch.fault_dear;
 }
-
-static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu)
-{
-   return vcpu-arch.shared-msr;
-}
 #endif /* __ASM_KVM_BOOKE_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index d342f8e..15f19d3 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -623,6 +623,9 @@ struct kvm_vcpu_arch {
wait_queue_head_t cpu_run;
 
struct kvm_vcpu_arch_shared *shared;
+#if defined(CONFIG_PPC_BOOK3S_64)  defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
+   bool shared_big_endian;
+#endif
unsigned long magic_page_pa; /* phys addr to map the magic page to */
unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 4096f16..4a7cc45 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -449,6 +449,84 @@ static inline void kvmppc_mmu_flush_icache(pfn_t pfn)
 }
 
 /*
+ * Shared struct helpers. The shared struct can be little or big endian,
+ * depending on the guest endianness. So expose helpers to all of them.
+ */
+static inline bool kvmppc_shared_big_endian(struct kvm_vcpu *vcpu)
+{
+#if defined(CONFIG_PPC_BOOK3S_64)  defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
+   /* Only Book3S_64 PR supports bi-endian for now */
+   return vcpu-arch.shared_big_endian;
+#elif defined(CONFIG_PPC_BOOK3S_64)  defined(__LITTLE_ENDIAN__)
+   /* Book3s_64 HV on little endian is always little endian */
+   return false;
+#else
+   return true;
+#endif
+}
+
+#define SHARED_WRAPPER_GET(reg, size)  \
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   if (kvmppc_shared_big_endian(vcpu))   

[PULL 14/41] KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions

2014-05-30 Thread Alexander Graf
When the host CPU we're running on doesn't support dcbz32 itself, but the
guest wants to have dcbz only clear 32 bytes of data, we loop through every
executable mapped page to search for dcbz instructions and patch them with
a special privileged instruction that we emulate as dcbz32.

The only guests that want to see dcbz act on 32 bytes are book3s_32 guests, so
we don't have to worry about little endian instruction ordering. So let's
just always search for big endian dcbz instructions, also when we're on a
little endian host.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 arch/powerpc/kvm/book3s_pr.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index 628d90e..93503bb 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -131,7 +131,7 @@ static hva_t kvmppc_mmu_book3s_32_get_pteg(struct kvm_vcpu 
*vcpu,
pteg = (vcpu_book3s-sdr1  0x) | hash;
 
dprintk(MMU: pc=0x%lx eaddr=0x%lx sdr1=0x%llx pteg=0x%x vsid=0x%x\n,
-   kvmppc_get_pc(vcpu_book3s-vcpu), eaddr, vcpu_book3s-sdr1, 
pteg,
+   kvmppc_get_pc(vcpu), eaddr, vcpu_book3s-sdr1, pteg,
sr_vsid(sre));
 
r = gfn_to_hva(vcpu-kvm, pteg  PAGE_SHIFT);
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index d424ca0..6e55934 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -428,8 +428,8 @@ static void kvmppc_patch_dcbz(struct kvm_vcpu *vcpu, struct 
kvmppc_pte *pte)
 
/* patch dcbz into reserved instruction, so we trap */
for (i=hpage_offset; i  hpage_offset + (HW_PAGE_SIZE / 4); i++)
-   if ((page[i]  0xff0007ff) == INS_DCBZ)
-   page[i] = 0xfff7;
+   if ((be32_to_cpu(page[i])  0xff0007ff) == INS_DCBZ)
+   page[i] = cpu_to_be32(0xfff7);
 
kunmap_atomic(page);
put_page(hpage);
-- 
1.8.1.4



[PULL 21/41] KVM: PPC: Book3S PR: Expose TM registers

2014-05-30 Thread Alexander Graf
POWER8 introduces transactional memory which brings along a number of new
registers and MSR bits.

Implementing all of those is a pretty big headache, so for now let's at least
emulate enough to make Linux's context switching code happy.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_emulate.c | 22 ++
 arch/powerpc/kvm/book3s_pr.c  | 20 +++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index e1165ba..9bdff15 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -451,6 +451,17 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong spr_val)
case SPRN_EBBRR:
vcpu-arch.ebbrr = spr_val;
break;
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   case SPRN_TFHAR:
+   vcpu-arch.tfhar = spr_val;
+   break;
+   case SPRN_TEXASR:
+   vcpu-arch.texasr = spr_val;
+   break;
+   case SPRN_TFIAR:
+   vcpu-arch.tfiar = spr_val;
+   break;
+#endif
 #endif
case SPRN_ICTC:
case SPRN_THRM1:
@@ -572,6 +583,17 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong *spr_val
case SPRN_EBBRR:
*spr_val = vcpu-arch.ebbrr;
break;
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   case SPRN_TFHAR:
+   *spr_val = vcpu-arch.tfhar;
+   break;
+   case SPRN_TEXASR:
+   *spr_val = vcpu-arch.texasr;
+   break;
+   case SPRN_TFIAR:
+   *spr_val = vcpu-arch.tfiar;
+   break;
+#endif
 #endif
case SPRN_THRM1:
case SPRN_THRM2:
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 7d27a95..23367a7 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -794,9 +794,27 @@ static void kvmppc_emulate_fac(struct kvm_vcpu *vcpu, 
ulong fac)
 /* Enable facilities (TAR, EBB, DSCR) for the guest */
 static int kvmppc_handle_fac(struct kvm_vcpu *vcpu, ulong fac)
 {
+   bool guest_fac_enabled;
BUG_ON(!cpu_has_feature(CPU_FTR_ARCH_207S));
 
-   if (!(vcpu-arch.fscr  (1ULL  fac))) {
+   /*
+* Not every facility is enabled by FSCR bits, check whether the
+* guest has this facility enabled at all.
+*/
+   switch (fac) {
+   case FSCR_TAR_LG:
+   case FSCR_EBB_LG:
+   guest_fac_enabled = (vcpu-arch.fscr  (1ULL  fac));
+   break;
+   case FSCR_TM_LG:
+   guest_fac_enabled = kvmppc_get_msr(vcpu)  MSR_TM;
+   break;
+   default:
+   guest_fac_enabled = false;
+   break;
+   }
+
+   if (!guest_fac_enabled) {
/* Facility not enabled by the guest */
kvmppc_trigger_fac_interrupt(vcpu, fac);
return RESUME_GUEST;
-- 
1.8.1.4



[PULL 16/41] KVM: PPC: Book3S PR: Ignore PMU SPRs

2014-05-30 Thread Alexander Graf
When we expose a POWER8 CPU into the guest, it will start accessing PMU SPRs
that we don't emulate. Just ignore accesses to them.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_emulate.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 45d0a80..52448ef 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -455,6 +455,13 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong spr_val)
case SPRN_WPAR_GEKKO:
case SPRN_MSSSR0:
case SPRN_DABR:
+#ifdef CONFIG_PPC_BOOK3S_64
+   case SPRN_MMCRS:
+   case SPRN_MMCRA:
+   case SPRN_MMCR0:
+   case SPRN_MMCR1:
+   case SPRN_MMCR2:
+#endif
break;
 unprivileged:
default:
@@ -553,6 +560,13 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong *spr_val
case SPRN_WPAR_GEKKO:
case SPRN_MSSSR0:
case SPRN_DABR:
+#ifdef CONFIG_PPC_BOOK3S_64
+   case SPRN_MMCRS:
+   case SPRN_MMCRA:
+   case SPRN_MMCR0:
+   case SPRN_MMCR1:
+   case SPRN_MMCR2:
+#endif
*spr_val = 0;
break;
default:
-- 
1.8.1.4



[PULL 25/41] PPC: KVM: Make NX bit available with magic page

2014-05-30 Thread Alexander Graf
Because old kernels enable the magic page and then choke on NXed trampoline
code we have to disable NX by default in KVM when we use the magic page.

However, since commit b18db0b8 we have successfully fixed that and can now
leave NX enabled, so tell the hypervisor about this.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kernel/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 6a01752..5e6f24f 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -417,7 +417,7 @@ static void kvm_map_magic_page(void *data)
ulong out[8];
 
in[0] = KVM_MAGIC_PAGE;
-   in[1] = KVM_MAGIC_PAGE;
+   in[1] = KVM_MAGIC_PAGE | MAGIC_PAGE_FLAG_NOT_MAPPED_NX;
 
epapr_hypercall(in, out, KVM_HCALL_TOKEN(KVM_HC_PPC_MAP_MAGIC_PAGE));
 
-- 
1.8.1.4



[PULL 23/41] KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest

2014-05-30 Thread Alexander Graf
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

On recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size.  Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page.  This
capability is called mixed page-size segment (MPSS).  With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry. [ Note that the actual page size is
always >= base page size ].

We use ibm,segment-page-sizes device tree node to advertise
the MPSS support to PAPR guest. The penc encoding indicates whether
we support a specific combination of base page size and actual
page size in the same segment. We also use the penc value in the
LP encoding of HPTE entry.

This patch exposes MPSS support to KVM guest by advertising the
feature via ibm,segment-page-sizes. It also adds the necessary changes
to decode the base page size and the actual page size correctly from the
HPTE entry.
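
To make the LP decoding below easier to follow, here is a small standalone
sketch of the same lookup for a single hypothetical combination (4K base
page, 64K actual page). The LP_SHIFT/LP_BITS values and the penc number are
illustrative assumptions for the sketch; the real table is filled from the
ibm,segment-page-sizes property.

/* Illustrative sketch only; penc and the page-size entry are made-up
 * stand-ins for mmu_psize_defs[]. */
#include <stdio.h>

#define LP_SHIFT 12     /* assumed position of LP in the HPTE second dword */
#define LP_BITS  8      /* assumed width of the LP field */

struct psize_def { int shift; int penc; };

/* Hypothetical entry: actual page shift 16 (64K), assumed penc of 0x1. */
static const struct psize_def example = { .shift = 16, .penc = 0x1 };

static int actual_shift(unsigned int lp)
{
        int shift = example.shift - LP_SHIFT;   /* LP bits that matter */
        unsigned int mask;

        if (shift > LP_BITS)
                shift = LP_BITS;
        mask = (1u << shift) - 1;
        if ((lp & mask) == (unsigned int)example.penc)
                return example.shift;           /* actual page size found */
        return -1;                              /* no match */
}

int main(void)
{
        printf("actual page shift: %d\n", actual_shift(0x1)); /* prints 16 */
        return 0;
}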

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 146 ++-
 arch/powerpc/kvm/book3s_hv.c |   7 ++
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 51388be..fddb72b 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -77,34 +77,122 @@ static inline long try_lock_hpte(unsigned long *hpte, 
unsigned long bits)
return old == 0;
 }
 
+static inline int __hpte_actual_psize(unsigned int lp, int psize)
+{
+   int i, shift;
+   unsigned int mask;
+
+   /* start from 1 ignoring MMU_PAGE_4K */
+   for (i = 1; i < MMU_PAGE_COUNT; i++) {
+
+   /* invalid penc */
+   if (mmu_psize_defs[psize].penc[i] == -1)
+   continue;
+   /*
+* encoding bits per actual page size
+*PTE LP actual page size
+* rrrz >=8KB
+* rrzz >=16KB
+* rzzz >=32KB
+* zzzz >=64KB
+* ...
+*/
+   shift = mmu_psize_defs[i].shift - LP_SHIFT;
+   if (shift > LP_BITS)
+   shift = LP_BITS;
+   mask = (1 << shift) - 1;
+   if ((lp & mask) == mmu_psize_defs[psize].penc[i])
+   return i;
+   }
+   return -1;
+}
+
 static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 unsigned long pte_index)
 {
-   unsigned long rb, va_low;
+   int b_psize, a_psize;
+   unsigned int penc;
+   unsigned long rb = 0, va_low, sllp;
+   unsigned int lp = (r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
+
+   if (!(v & HPTE_V_LARGE)) {
+   /* both base and actual psize is 4k */
+   b_psize = MMU_PAGE_4K;
+   a_psize = MMU_PAGE_4K;
+   } else {
+   for (b_psize = 0; b_psize < MMU_PAGE_COUNT; b_psize++) {
+
+   /* valid entries have a shift value */
+   if (!mmu_psize_defs[b_psize].shift)
+   continue;
 
+   a_psize = __hpte_actual_psize(lp, b_psize);
+   if (a_psize != -1)
+   break;
+   }
+   }
+   /*
+* Ignore the top 14 bits of va
+* v have top two bits covering segment size, hence move
+* by 16 bits, Also clear the lower HPTE_V_AVPN_SHIFT (7) bits.
+* AVA field in v also have the lower 23 bits ignored.
+* For base page size 4K we need 14 .. 65 bits (so need to
+* collect extra 11 bits)
+* For others we need 14..14+i
+*/
+   /* This covers 14..54 bits of va*/
rb = (v & ~0x7fUL) << 16;   /* AVA field */
+   /*
+* AVA in v had cleared lower 23 bits. We need to derive
+* that from pteg index
+*/
va_low = pte_index >> 3;
if (v & HPTE_V_SECONDARY)
va_low = ~va_low;
-   /* xor vsid from AVA */
+   /*
+* get the vpn bits from va_low using reverse of hashing.
+* In v we have va with 23 bits dropped and then left shifted
+* HPTE_V_AVPN_SHIFT (7) bits. Now to find vsid we need
+* right shift it with (SID_SHIFT - (23 - 7))
+*/
if (!(v & HPTE_V_1TB_SEG))
-   va_low ^= v >> 12;
+ 

[PULL 26/41] KVM: PPC: BOOK3S: Always use the saved DAR value

2014-05-30 Thread Alexander Graf
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

Although it's optional in the architecture, IBM POWER CPUs always have the DAR
value set on alignment interrupts, so don't try to compute these values.

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_emulate.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 9bdff15..61f38eb 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -676,6 +676,12 @@ u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned 
int inst)
 
 ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned int inst)
 {
+#ifdef CONFIG_PPC_BOOK3S_64
+   /*
+* Linux's fix_alignment() assumes that DAR is valid, so can we
+*/
+   return vcpu-arch.fault_dar;
+#else
ulong dar = 0;
ulong ra = get_ra(inst);
ulong rb = get_rb(inst);
@@ -700,4 +706,5 @@ ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned 
int inst)
}
 
return dar;
+#endif
 }
-- 
1.8.1.4



[PULL 18/41] KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR

2014-05-30 Thread Alexander Graf
POWER8 introduced a new interrupt type called the Facility Unavailable interrupt,
which reports its cause in a new register called FSCR.

Handle these exits and try to emulate instructions for unhandled facilities.
Follow-on patches enable KVM to expose specific facilities into the guest.
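
As a rough illustration of the check that drives this (not the actual KVM
code), the FSCR gate boils down to testing the facility's enable bit; the
FSCR_TAR_LG value below is an assumed bit number used only for the sketch.

/* Standalone sketch of the facility-enabled test described above. */
#include <stdio.h>

#define FSCR_TAR_LG 8                           /* assumed TAR enable bit */

struct vcpu_sketch { unsigned long long fscr; };

static int facility_enabled(const struct vcpu_sketch *vcpu, unsigned long fac)
{
        return (vcpu->fscr & (1ULL << fac)) != 0;
}

int main(void)
{
        struct vcpu_sketch vcpu = { .fscr = 1ULL << FSCR_TAR_LG };

        if (!facility_enabled(&vcpu, FSCR_TAR_LG))
                printf("reflect the Facility Unavailable interrupt to the guest\n");
        else
                printf("facility enabled by the guest, emulate the access\n");
        return 0;
}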

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_asm.h| 18 
 arch/powerpc/include/asm/kvm_book3s_asm.h |  2 +
 arch/powerpc/include/asm/kvm_host.h   |  1 +
 arch/powerpc/kernel/asm-offsets.c |  3 ++
 arch/powerpc/kvm/book3s.c | 10 +
 arch/powerpc/kvm/book3s_emulate.c |  6 +++
 arch/powerpc/kvm/book3s_hv.c  |  6 ---
 arch/powerpc/kvm/book3s_pr.c  | 68 +++
 arch/powerpc/kvm/book3s_segment.S | 25 
 9 files changed, 125 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index 19eb74a..9601741 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -102,6 +102,7 @@
 #define BOOK3S_INTERRUPT_PERFMON   0xf00
 #define BOOK3S_INTERRUPT_ALTIVEC   0xf20
 #define BOOK3S_INTERRUPT_VSX   0xf40
+#define BOOK3S_INTERRUPT_FAC_UNAVAIL   0xf60
 #define BOOK3S_INTERRUPT_H_FAC_UNAVAIL 0xf80
 
 #define BOOK3S_IRQPRIO_SYSTEM_RESET0
@@ -114,14 +115,15 @@
 #define BOOK3S_IRQPRIO_FP_UNAVAIL  7
 #define BOOK3S_IRQPRIO_ALTIVEC 8
 #define BOOK3S_IRQPRIO_VSX 9
-#define BOOK3S_IRQPRIO_SYSCALL 10
-#define BOOK3S_IRQPRIO_MACHINE_CHECK   11
-#define BOOK3S_IRQPRIO_DEBUG   12
-#define BOOK3S_IRQPRIO_EXTERNAL13
-#define BOOK3S_IRQPRIO_DECREMENTER 14
-#define BOOK3S_IRQPRIO_PERFORMANCE_MONITOR 15
-#define BOOK3S_IRQPRIO_EXTERNAL_LEVEL  16
-#define BOOK3S_IRQPRIO_MAX 17
+#define BOOK3S_IRQPRIO_FAC_UNAVAIL 10
+#define BOOK3S_IRQPRIO_SYSCALL 11
+#define BOOK3S_IRQPRIO_MACHINE_CHECK   12
+#define BOOK3S_IRQPRIO_DEBUG   13
+#define BOOK3S_IRQPRIO_EXTERNAL14
+#define BOOK3S_IRQPRIO_DECREMENTER 15
+#define BOOK3S_IRQPRIO_PERFORMANCE_MONITOR 16
+#define BOOK3S_IRQPRIO_EXTERNAL_LEVEL  17
+#define BOOK3S_IRQPRIO_MAX 18
 
 #define BOOK3S_HFLAG_DCBZ320x1
 #define BOOK3S_HFLAG_SLB   0x2
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 821725c..5bdfb5d 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -104,6 +104,7 @@ struct kvmppc_host_state {
 #ifdef CONFIG_PPC_BOOK3S_64
u64 cfar;
u64 ppr;
+   u64 host_fscr;
 #endif
 };
 
@@ -133,6 +134,7 @@ struct kvmppc_book3s_shadow_vcpu {
u64 esid;
u64 vsid;
} slb[64];  /* guest SLB */
+   u64 shadow_fscr;
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 15f19d3..232ec5f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -475,6 +475,7 @@ struct kvm_vcpu_arch {
ulong ppr;
ulong pspb;
ulong fscr;
+   ulong shadow_fscr;
ulong ebbhr;
ulong ebbrr;
ulong bescr;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index bbf3b9a..e2b86b5 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -537,6 +537,7 @@ int main(void)
DEFINE(VCPU_CFAR, offsetof(struct kvm_vcpu, arch.cfar));
DEFINE(VCPU_PPR, offsetof(struct kvm_vcpu, arch.ppr));
DEFINE(VCPU_FSCR, offsetof(struct kvm_vcpu, arch.fscr));
+   DEFINE(VCPU_SHADOW_FSCR, offsetof(struct kvm_vcpu, arch.shadow_fscr));
DEFINE(VCPU_PSPB, offsetof(struct kvm_vcpu, arch.pspb));
DEFINE(VCPU_EBBHR, offsetof(struct kvm_vcpu, arch.ebbhr));
DEFINE(VCPU_EBBRR, offsetof(struct kvm_vcpu, arch.ebbrr));
@@ -618,6 +619,7 @@ int main(void)
 #ifdef CONFIG_PPC64
SVCPU_FIELD(SVCPU_SLB, slb);
SVCPU_FIELD(SVCPU_SLB_MAX, slb_max);
+   SVCPU_FIELD(SVCPU_SHADOW_FSCR, shadow_fscr);
 #endif
 
HSTATE_FIELD(HSTATE_HOST_R1, host_r1);
@@ -653,6 +655,7 @@ int main(void)
 #ifdef CONFIG_PPC_BOOK3S_64
HSTATE_FIELD(HSTATE_CFAR, cfar);
HSTATE_FIELD(HSTATE_PPR, ppr);
+   HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 81abc5c..79cfa2d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -145,6 +145,7 @@ static int kvmppc_book3s_vec2irqprio(unsigned int vec)

[PULL 29/41] KVM: PPC: Graciously fail broken LE hypercalls

2014-05-30 Thread Alexander Graf
There are LE Linux guests out there that don't handle hypercalls correctly.
Instead of interpreting the instruction stream from the device tree as big
endian, they assume it's a little endian instruction stream and fail.

When we see an illegal instruction from such a byte-reversed instruction stream,
bail out graciously and just declare every hcall an error.
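
For reference, the detection relies on the plain "sc" encoding, 0x44000002 in
big endian: a guest that mis-reads the big-endian instruction stream executes
the byte-swapped word instead, which shows up as an illegal instruction with
primary opcode 0. A tiny standalone check of that value:

/* swab32(0x44000002) == 0x02000044, the word the broken guests trap on. */
#include <stdint.h>
#include <stdio.h>

static uint32_t swab32(uint32_t x)
{
        return (x >> 24) | ((x >> 8) & 0x0000ff00) |
               ((x << 8) & 0x00ff0000) | (x << 24);
}

int main(void)
{
        uint32_t inst_sc = 0x44000002;

        printf("byte-reversed sc: 0x%08x\n", swab32(inst_sc));
        return 0;
}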

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_emulate.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index c992447..3f29526 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -94,8 +94,25 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
int rs = get_rs(inst);
int ra = get_ra(inst);
int rb = get_rb(inst);
+   u32 inst_sc = 0x44000002;
 
switch (get_op(inst)) {
+   case 0:
+   emulated = EMULATE_FAIL;
+   if ((kvmppc_get_msr(vcpu) & MSR_LE) &&
+   (inst == swab32(inst_sc))) {
+   /*
+* This is the byte reversed syscall instruction of our
+* hypercall handler. Early versions of LE Linux didn't
+* swap the instructions correctly and ended up in
+* illegal instructions.
+* Just always fail hypercalls on these broken systems.
+*/
+   kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED);
+   kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
+   emulated = EMULATE_DONE;
+   }
+   break;
case 19:
switch (get_xop(inst)) {
case OP_19_XOP_RFID:
-- 
1.8.1.4



[PULL 20/41] KVM: PPC: Book3S PR: Expose EBB registers

2014-05-30 Thread Alexander Graf
POWER8 introduces a new facility called the Event Based Branch facility.
It consists of a few registers that indicate where a guest should branch to
when a defined event occurs while it's in PR mode.

We don't want to really enable EBB as it will create a big mess with !PR guest
mode while hardware is in PR and we don't really emulate the PMU anyway.

So instead, let's just leave it at emulation of all its registers.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s.c | 18 ++
 arch/powerpc/kvm/book3s_emulate.c | 22 ++
 arch/powerpc/kvm/book3s_hv.c  | 18 --
 3 files changed, 40 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 4046a1a..52c654d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -637,6 +637,15 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, 
struct kvm_one_reg *reg)
case KVM_REG_PPC_TAR:
val = get_reg_val(reg-id, vcpu-arch.tar);
break;
+   case KVM_REG_PPC_EBBHR:
+   val = get_reg_val(reg-id, vcpu-arch.ebbhr);
+   break;
+   case KVM_REG_PPC_EBBRR:
+   val = get_reg_val(reg-id, vcpu-arch.ebbrr);
+   break;
+   case KVM_REG_PPC_BESCR:
+   val = get_reg_val(reg-id, vcpu-arch.bescr);
+   break;
default:
r = -EINVAL;
break;
@@ -732,6 +741,15 @@ int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, 
struct kvm_one_reg *reg)
case KVM_REG_PPC_TAR:
vcpu-arch.tar = set_reg_val(reg-id, val);
break;
+   case KVM_REG_PPC_EBBHR:
+   vcpu-arch.ebbhr = set_reg_val(reg-id, val);
+   break;
+   case KVM_REG_PPC_EBBRR:
+   vcpu-arch.ebbrr = set_reg_val(reg-id, val);
+   break;
+   case KVM_REG_PPC_BESCR:
+   vcpu-arch.bescr = set_reg_val(reg-id, val);
+   break;
default:
r = -EINVAL;
break;
diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index e8133e5..e1165ba 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -441,6 +441,17 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong spr_val)
case SPRN_FSCR:
vcpu-arch.fscr = spr_val;
break;
+#ifdef CONFIG_PPC_BOOK3S_64
+   case SPRN_BESCR:
+   vcpu-arch.bescr = spr_val;
+   break;
+   case SPRN_EBBHR:
+   vcpu-arch.ebbhr = spr_val;
+   break;
+   case SPRN_EBBRR:
+   vcpu-arch.ebbrr = spr_val;
+   break;
+#endif
case SPRN_ICTC:
case SPRN_THRM1:
case SPRN_THRM2:
@@ -551,6 +562,17 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong *spr_val
case SPRN_FSCR:
*spr_val = vcpu-arch.fscr;
break;
+#ifdef CONFIG_PPC_BOOK3S_64
+   case SPRN_BESCR:
+   *spr_val = vcpu-arch.bescr;
+   break;
+   case SPRN_EBBHR:
+   *spr_val = vcpu-arch.ebbhr;
+   break;
+   case SPRN_EBBRR:
+   *spr_val = vcpu-arch.ebbrr;
+   break;
+#endif
case SPRN_THRM1:
case SPRN_THRM2:
case SPRN_THRM3:
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ee1d8ee..3a94561 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -882,15 +882,6 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
case KVM_REG_PPC_PSPB:
*val = get_reg_val(id, vcpu-arch.pspb);
break;
-   case KVM_REG_PPC_EBBHR:
-   *val = get_reg_val(id, vcpu-arch.ebbhr);
-   break;
-   case KVM_REG_PPC_EBBRR:
-   *val = get_reg_val(id, vcpu-arch.ebbrr);
-   break;
-   case KVM_REG_PPC_BESCR:
-   *val = get_reg_val(id, vcpu-arch.bescr);
-   break;
case KVM_REG_PPC_DPDES:
*val = get_reg_val(id, vcpu-arch.vcore-dpdes);
break;
@@ -1088,15 +1079,6 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
case KVM_REG_PPC_PSPB:
vcpu-arch.pspb = set_reg_val(id, *val);
break;
-   case KVM_REG_PPC_EBBHR:
-   vcpu-arch.ebbhr = set_reg_val(id, *val);
-   break;
-   case KVM_REG_PPC_EBBRR:
-   vcpu-arch.ebbrr = set_reg_val(id, *val);
-   break;
-   case KVM_REG_PPC_BESCR:
-   

[PULL 24/41] KVM: PPC: Disable NX for old magic page using guests

2014-05-30 Thread Alexander Graf
Old guests try to use the magic page, but map their trampoline code inside
of an NX region.

Since we can't fix those old kernels, try to detect whether the guest is sane
or not. If not, just disable NX functionality in KVM so that old guests at
least work at all. For newer guests, add a bit that we can set to keep NX
functionality available.

Signed-off-by: Alexander Graf ag...@suse.de
---
 Documentation/virtual/kvm/ppc-pv.txt | 14 ++
 arch/powerpc/include/asm/kvm_host.h  |  1 +
 arch/powerpc/include/uapi/asm/kvm_para.h |  6 ++
 arch/powerpc/kvm/book3s_64_mmu.c |  3 +++
 arch/powerpc/kvm/powerpc.c   | 14 --
 5 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/ppc-pv.txt 
b/Documentation/virtual/kvm/ppc-pv.txt
index 4643cde..3195606 100644
--- a/Documentation/virtual/kvm/ppc-pv.txt
+++ b/Documentation/virtual/kvm/ppc-pv.txt
@@ -94,10 +94,24 @@ a bitmap of available features inside the magic page.
 The following enhancements to the magic page are currently available:
 
   KVM_MAGIC_FEAT_SRMaps SR registers r/w in the magic page
+  KVM_MAGIC_FEAT_MAS0_TO_SPRG7 Maps MASn, ESR, PIR and high SPRGs
 
 For enhanced features in the magic page, please check for the existence of the
 feature before using them!
 
+Magic page flags
+
+
+In addition to features that indicate whether a host is capable of a particular
+feature we also have a channel for a guest to tell the host whether it's capable
+of something. This is what we call flags.
+
+Flags are passed to the host in the low 12 bits of the Effective Address.
+
+The following flags are currently available for a guest to expose:
+
+  MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly wrt magic page
+
 MSR bits
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 29fbb55..bb66d8b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -631,6 +631,7 @@ struct kvm_vcpu_arch {
 #endif
unsigned long magic_page_pa; /* phys addr to map the magic page to */
unsigned long magic_page_ea; /* effect. addr to map the magic page to */
+   bool disable_kernel_nx;
 
int irq_type;   /* one of KVM_IRQ_* */
int irq_cpu_id;
diff --git a/arch/powerpc/include/uapi/asm/kvm_para.h 
b/arch/powerpc/include/uapi/asm/kvm_para.h
index e3af328..91e42f0 100644
--- a/arch/powerpc/include/uapi/asm/kvm_para.h
+++ b/arch/powerpc/include/uapi/asm/kvm_para.h
@@ -82,10 +82,16 @@ struct kvm_vcpu_arch_shared {
 
 #define KVM_FEATURE_MAGIC_PAGE 1
 
+/* Magic page flags from host to guest */
+
 #define KVM_MAGIC_FEAT_SR  (1  0)
 
 /* MASn, ESR, PIR, and high SPRGs */
 #define KVM_MAGIC_FEAT_MAS0_TO_SPRG7   (1  1)
 
+/* Magic page flags from guest to host */
+
+#define MAGIC_PAGE_FLAG_NOT_MAPPED_NX  (1  0)
+
 
 #endif /* _UAPI__POWERPC_KVM_PARA_H__ */
diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c
index 278729f..774a253 100644
--- a/arch/powerpc/kvm/book3s_64_mmu.c
+++ b/arch/powerpc/kvm/book3s_64_mmu.c
@@ -313,6 +313,9 @@ do_second:
gpte-raddr = (r  HPTE_R_RPN  ~eaddr_mask) | (eaddr  eaddr_mask);
gpte-page_size = pgsize;
gpte-may_execute = ((r  HPTE_R_N) ? false : true);
+   if (unlikely(vcpu->arch.disable_kernel_nx) &&
+   !(kvmppc_get_msr(vcpu) & MSR_PR))
+   gpte-may_execute = true;
gpte-may_read = false;
gpte-may_write = false;
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b4e15bf..154f352 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -177,8 +177,18 @@ int kvmppc_kvm_pv(struct kvm_vcpu *vcpu)
vcpu-arch.shared_big_endian = shared_big_endian;
 #endif
 
-   vcpu-arch.magic_page_pa = param1;
-   vcpu-arch.magic_page_ea = param2;
+   if (!(param2 & MAGIC_PAGE_FLAG_NOT_MAPPED_NX)) {
+   /*
+* Older versions of the Linux magic page code had
+* a bug where they would map their trampoline code
+* NX. If that's the case, remove !PR NX capability.
+*/
+   vcpu-arch.disable_kernel_nx = true;
+   kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+   }
+
+   vcpu-arch.magic_page_pa = param1  ~0xfffULL;
+   vcpu-arch.magic_page_ea = param2  ~0xfffULL;
 
r2 = KVM_MAGIC_FEAT_SR | KVM_MAGIC_FEAT_MAS0_TO_SPRG7;
 
-- 
1.8.1.4



[PULL 22/41] KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation

2014-05-30 Thread Alexander Graf
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

Today when KVM tries to reserve memory for the hash page table it
allocates from the normal page allocator first. If that fails it
falls back to CMA's reserved region. One of the side effects of
this is that we could end up exhausting the page allocator and
get linux into OOM conditions while we still have plenty of space
available in CMA.

This patch addresses this issue by first trying hash page table
allocation from CMA's reserved region before falling back to the normal
page allocator. So if we run out of memory, we really are out of memory.
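
Sketched out (and slightly paraphrased, since only part of the function is
quoted below), the resulting allocation order is: CMA reserved pool first,
then the normal page allocator with successively smaller orders. The
PPC_MIN_HPT_ORDER lower bound in the fallback loop is assumed here.

/* Sketch of the new ordering in kvmppc_alloc_hpt(), kernel context assumed. */
unsigned long hpt = 0;
struct page *page;

page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));        /* CMA reserved pool */
if (page) {
        hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
        kvm->arch.hpt_cma_alloc = 1;
}

while (!hpt && order > PPC_MIN_HPT_ORDER) {             /* page allocator */
        hpt = __get_free_pages(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN,
                               order - PAGE_SHIFT);
        if (!hpt)
                --order;
}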

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index fb25ebc..f32896f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -52,7 +52,7 @@ static void kvmppc_rmap_reset(struct kvm *kvm);
 
 long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 {
-   unsigned long hpt;
+   unsigned long hpt = 0;
struct revmap_entry *rev;
struct page *page = NULL;
long order = KVM_DEFAULT_HPT_ORDER;
@@ -64,22 +64,11 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
}
 
kvm-arch.hpt_cma_alloc = 0;
-   /*
-* try first to allocate it from the kernel page allocator.
-* We keep the CMA reserved for failed allocation.
-*/
-   hpt = __get_free_pages(GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT |
-  __GFP_NOWARN, order - PAGE_SHIFT);
-
-   /* Next try to allocate from the preallocated pool */
-   if (!hpt) {
-   VM_BUG_ON(order  KVM_CMA_CHUNK_ORDER);
-   page = kvm_alloc_hpt(1  (order - PAGE_SHIFT));
-   if (page) {
-   hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
-   kvm-arch.hpt_cma_alloc = 1;
-   } else
-   --order;
+   VM_BUG_ON(order  KVM_CMA_CHUNK_ORDER);
+   page = kvm_alloc_hpt(1  (order - PAGE_SHIFT));
+   if (page) {
+   hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
+   kvm-arch.hpt_cma_alloc = 1;
}
 
/* Lastly try successively smaller sizes from the page allocator */
-- 
1.8.1.4



[PULL 17/41] KVM: PPC: Book3S PR: Emulate TIR register

2014-05-30 Thread Alexander Graf
In parallel to the Processor ID Register (PIR), threaded POWER8 also adds a
Thread ID Register (TIR). Since PR KVM doesn't emulate more than one thread
per core, we can just always expose 0 here.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_emulate.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 52448ef..0a1de29 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -566,6 +566,7 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, int 
sprn, ulong *spr_val
case SPRN_MMCR0:
case SPRN_MMCR1:
case SPRN_MMCR2:
+   case SPRN_TIR:
 #endif
*spr_val = 0;
break;
-- 
1.8.1.4



[PULL 04/41] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on

2014-05-30 Thread Alexander Graf
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

With the sleep-inside-atomic-section debug check enabled we get the below
WARN_ON during a PR KVM boot. This is because upstream now has PREEMPT_COUNT
enabled even if we have preemption disabled. Fix the warning by adding
preempt_disable/enable around the floating point and altivec enable.

WARNING: at arch/powerpc/kernel/process.c:156
Modules linked in: kvm_pr kvm
CPU: 1 PID: 3990 Comm: qemu-system-ppc Tainted: GW 3.15.0-rc1+ #4
task: c000eb85b3a0 ti: c000ec59c000 task.ti: c000ec59c000
NIP: c0015c84 LR: d3334644 CTR: c0015c00
REGS: c000ec59f140 TRAP: 0700   Tainted: GW  (3.15.0-rc1+)
MSR: 80029032 SF,EE,ME,IR,DR,RI  CR: 4224  XER: 2000
CFAR: c0015c24 SOFTE: 1
GPR00: d3334644 c000ec59f3c0 c0e2fa40 c000e2f8
GPR04: 0800 2000 0001 8000
GPR08: 0001 0001 2000 c0015c00
GPR12: d333da18 cfb80900  
GPR16:    3fffce4e0fa1
GPR20: 0010 0001 0002 100b9a38
GPR24: 0002   0013
GPR28:  c000eb85b3a0 2000 c000e2f8
NIP [c0015c84] .enable_kernel_fp+0x84/0x90
LR [d3334644] .kvmppc_handle_ext+0x134/0x190 [kvm_pr]
Call Trace:
[c000ec59f3c0] [0010] 0x10 (unreliable)
[c000ec59f430] [d3334644] .kvmppc_handle_ext+0x134/0x190 [kvm_pr]
[c000ec59f4c0] [d324b380] .kvmppc_set_msr+0x30/0x50 [kvm]
[c000ec59f530] [d3337cac] .kvmppc_core_emulate_op_pr+0x16c/0x5e0 
[kvm_pr]
[c000ec59f5f0] [d324a944] .kvmppc_emulate_instruction+0x284/0xa80 
[kvm]
[c000ec59f6c0] [d3336888] .kvmppc_handle_exit_pr+0x488/0xb70 
[kvm_pr]
[c000ec59f790] [d3338d34] kvm_start_lightweight+0xcc/0xdc [kvm_pr]
[c000ec59f960] [d3336288] .kvmppc_vcpu_run_pr+0xc8/0x190 [kvm_pr]
[c000ec59f9f0] [d324c880] .kvmppc_vcpu_run+0x30/0x50 [kvm]
[c000ec59fa60] [d3249e74] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0 [kvm]
[c000ec59faf0] [d3244948] .kvm_vcpu_ioctl+0x478/0x760 [kvm]
[c000ec59fcb0] [c0224e34] .do_vfs_ioctl+0x4d4/0x790
[c000ec59fd90] [c0225148] .SyS_ioctl+0x58/0xb0
[c000ec59fe30] [c000a1e4] syscall_exit+0x0/0x98

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_pr.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 8c05cb5..01a7156 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -683,16 +683,20 @@ static int kvmppc_handle_ext(struct kvm_vcpu *vcpu, 
unsigned int exit_nr,
 #endif
 
if (msr  MSR_FP) {
+   preempt_disable();
enable_kernel_fp();
load_fp_state(vcpu-arch.fp);
t-fp_save_area = vcpu-arch.fp;
+   preempt_enable();
}
 
if (msr  MSR_VEC) {
 #ifdef CONFIG_ALTIVEC
+   preempt_disable();
enable_kernel_altivec();
load_vr_state(vcpu-arch.vr);
t-vr_save_area = vcpu-arch.vr;
+   preempt_enable();
 #endif
}
 
@@ -716,13 +720,17 @@ static void kvmppc_handle_lost_ext(struct kvm_vcpu *vcpu)
return;
 
if (lost_ext  MSR_FP) {
+   preempt_disable();
enable_kernel_fp();
load_fp_state(vcpu-arch.fp);
+   preempt_enable();
}
 #ifdef CONFIG_ALTIVEC
if (lost_ext  MSR_VEC) {
+   preempt_disable();
enable_kernel_altivec();
load_vr_state(vcpu-arch.vr);
+   preempt_enable();
}
 #endif
current-thread.regs-msr |= lost_ext;
-- 
1.8.1.4



[PULL 00/41] ppc patch queue 2014-05-30

2014-05-30 Thread Alexander Graf
Hi Paolo / Marcelo,

This is my current patch queue for ppc.  Please pull.

Alex


The following changes since commit 1f854112553a1d65363ab27d4ee3dfb4b27075fb:

  KVM: vmx: DR7 masking on task switch emulation is wrong (2014-05-22 17:47:18 
+0200)

are available in the git repository at:

  git://github.com/agraf/linux-2.6.git tags/signed-kvm-ppc-next

for you to fetch changes up to d8d164a9850d486cc48081c18831680254688d0f:

  KVM: PPC: Book3S PR: Rework SLB switching code (2014-05-30 14:26:30 +0200)


Patch queue for ppc - 2014-05-30

In this round we have a few nice gems. PR KVM gains initial POWER8 support
as well as LE host awareness, the e500 targets can now properly run u-boot,
LE guests now work with PR KVM including KVM hypercalls and HV KVM guests
can now use huge pages.

On top of this there are some bug fixes.


Alexander Graf (27):
  KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR
  KVM: PPC: E500: Add dcbtls emulation
  KVM: PPC: Book3S: PR: Fix C/R bit setting
  KVM: PPC: Book3S_32: PR: Access HTAB in big endian
  KVM: PPC: Book3S_64 PR: Access HTAB in big endian
  KVM: PPC: Book3S_64 PR: Access shadow slb in big endian
  KVM: PPC: Book3S PR: Default to big endian guest
  KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian
  KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian
  KVM: PPC: PR: Fill pvinfo hcall instructions in big endian
  KVM: PPC: Make shared struct aka magic page guest endian
  KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions
  KVM: PPC: Book3S: Move little endian conflict to HV KVM
  KVM: PPC: Book3S PR: Ignore PMU SPRs
  KVM: PPC: Book3S PR: Emulate TIR register
  KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR
  KVM: PPC: Book3S PR: Expose TAR facility to guest
  KVM: PPC: Book3S PR: Expose EBB registers
  KVM: PPC: Book3S PR: Expose TM registers
  KVM: PPC: Disable NX for old magic page using guests
  PPC: KVM: Make NX bit available with magic page
  PPC: ePAPR: Fix hypercall on LE guest
  KVM: PPC: Graciously fail broken LE hypercalls
  KVM: PPC: MPIC: Reset IRQ source private members
  KVM: PPC: Add CAP to indicate hcall fixes
  KVM: PPC: Book3S PR: Use SLB entry 0
  KVM: PPC: Book3S PR: Rework SLB switching code

Alexey Kardashevskiy (1):
  KVM: PPC: Book3S HV: Fix dirty map for hugepages

Aneesh Kumar K.V (6):
  KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest
  KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
  KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation
  KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
  KVM: PPC: BOOK3S: Always use the saved DAR value
  KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler

Paul Mackerras (7):
  KVM: PPC: Book3S: Add ONE_REG register names that were missed
  KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
  KVM: PPC: Book3S HV: Fix check for running inside guest in 
global_invalidates()
  KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
  KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
  KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
  KVM: PPC: Book3S HV: Fix machine check delivery to guest

 Documentation/virtual/kvm/api.txt |   6 +
 Documentation/virtual/kvm/ppc-pv.txt  |  14 ++
 arch/powerpc/include/asm/disassemble.h|  34 +
 arch/powerpc/include/asm/kvm_asm.h|  18 ++-
 arch/powerpc/include/asm/kvm_book3s.h |   3 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  | 146 +++---
 arch/powerpc/include/asm/kvm_book3s_asm.h |   2 +
 arch/powerpc/include/asm/kvm_booke.h  |   5 -
 arch/powerpc/include/asm/kvm_host.h   |   9 +-
 arch/powerpc/include/asm/kvm_ppc.h|  80 +-
 arch/powerpc/include/asm/reg.h|  12 +-
 arch/powerpc/include/asm/reg_booke.h  |   1 +
 arch/powerpc/include/uapi/asm/kvm.h   |   2 +-
 arch/powerpc/include/uapi/asm/kvm_para.h  |   6 +
 arch/powerpc/kernel/align.c   |  34 +
 arch/powerpc/kernel/asm-offsets.c |  11 +-
 arch/powerpc/kernel/epapr_paravirt.c  |   5 +-
 arch/powerpc/kernel/kvm.c |   2 +-
 arch/powerpc/kernel/paca.c|   3 +
 arch/powerpc/kvm/Kconfig  |   2 +-
 arch/powerpc/kvm/book3s.c | 106 -
 arch/powerpc/kvm/book3s_32_mmu.c  |  41 ++---
 arch/powerpc/kvm/book3s_32_mmu_host.c |   4 +-
 arch/powerpc/kvm/book3s_64_mmu.c  |  39 +++--
 arch/powerpc/kvm/book3s_64_mmu_host.c |  15 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 116 ++-
 arch/powerpc/kvm/book3s_64_slb.S  |  87 +--
 arch/powerpc/kvm/book3s_emulate.c | 156 

[PULL 05/41] KVM: PPC: Book3S: PR: Fix C/R bit setting

2014-05-30 Thread Alexander Graf
Commit 9308ab8e2d made C/R HTAB updates go byte-wise into the target HTAB.
However, it didn't update the guest's copy of the HTAB, but instead the
host local copy of it.

Write to the guest's HTAB instead.
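
The byte offsets matter because the second doubleword of an HPTE is stored
big endian: C (0x80) lives in the lowest-order byte at offset 7 and R (0x100)
in the byte at offset 6, so single-byte writes at those offsets are enough.
A standalone sketch of the arithmetic (with a plain array as a stand-in for
the put_user accesses):

/* Byte-wise R/C update on a big-endian HPTE second doubleword. */
#include <stdio.h>
#include <string.h>

#define HPTE_R_R 0x100ULL       /* reference bit */
#define HPTE_R_C 0x080ULL       /* change bit */

int main(void)
{
        unsigned char hpte_r[8];        /* big-endian second doubleword */
        unsigned long long r = 0;

        memset(hpte_r, 0, sizeof(hpte_r));

        r |= HPTE_R_R;
        hpte_r[6] = (unsigned char)(r >> 8);    /* byte holding bits 8..15 */

        r |= HPTE_R_C;
        hpte_r[7] = (unsigned char)r;           /* byte holding bits 0..7 */

        printf("bytes 6,7 = %02x %02x\n", hpte_r[6], hpte_r[7]); /* 01 80 */
        return 0;
}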

Signed-off-by: Alexander Graf ag...@suse.de
CC: Paul Mackerras pau...@samba.org
Acked-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 arch/powerpc/kvm/book3s_64_mmu.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index 76a64ce..60fc3f4 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -270,7 +270,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu 
*vcpu, gva_t eaddr,
   page */
if (found) {
u32 pte_r = pteg[i+1];
-   char __user *addr = (char __user *) pteg[i+1];
+   char __user *addr = (char __user *) (ptegp + (i+1) * sizeof(u32));
 
/*
 * Use single-byte writes to update the HPTE, to
diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c
index 8231b83..171e5ca 100644
--- a/arch/powerpc/kvm/book3s_64_mmu.c
+++ b/arch/powerpc/kvm/book3s_64_mmu.c
@@ -342,14 +342,14 @@ do_second:
 * non-PAPR platforms such as mac99, and this is
 * what real hardware does.
 */
-   char __user *addr = (char __user *) pteg[i+1];
+char __user *addr = (char __user *) (ptegp + (i + 1) * sizeof(u64));
r |= HPTE_R_R;
put_user(r >> 8, addr + 6);
}
if (iswrite && gpte->may_write && !(r & HPTE_R_C)) {
/* Set the dirty flag */
/* Use a single byte write */
-   char __user *addr = (char __user *) pteg[i+1];
+char __user *addr = (char __user *) (ptegp + (i + 1) * sizeof(u64));
r |= HPTE_R_C;
put_user(r, addr + 7);
}
-- 
1.8.1.4



[PULL 03/41] KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest

2014-05-30 Thread Alexander Graf
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

This patch makes sure we inherit the LE bit correctly in the different cases
so that we can run a Little Endian distro in PR mode.

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kernel/asm-offsets.c   |  2 +-
 arch/powerpc/kvm/book3s_64_mmu.c|  2 +-
 arch/powerpc/kvm/book3s_pr.c| 23 ++-
 4 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1eaea2d..d342f8e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -562,6 +562,7 @@ struct kvm_vcpu_arch {
 #ifdef CONFIG_PPC_BOOK3S
ulong fault_dar;
u32 fault_dsisr;
+   unsigned long intr_msr;
 #endif
 
 #ifdef CONFIG_BOOKE
@@ -654,7 +655,6 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
-   unsigned long intr_msr;
 #endif
 };
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index dba8140..6a4b77d 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -493,7 +493,6 @@ int main(void)
DEFINE(VCPU_DAR, offsetof(struct kvm_vcpu, arch.shregs.dar));
DEFINE(VCPU_VPA, offsetof(struct kvm_vcpu, arch.vpa.pinned_addr));
DEFINE(VCPU_VPA_DIRTY, offsetof(struct kvm_vcpu, arch.vpa.dirty));
-   DEFINE(VCPU_INTR_MSR, offsetof(struct kvm_vcpu, arch.intr_msr));
 #endif
 #ifdef CONFIG_PPC_BOOK3S
DEFINE(VCPU_VCPUID, offsetof(struct kvm_vcpu, vcpu_id));
@@ -528,6 +527,7 @@ int main(void)
DEFINE(VCPU_SLB_NR, offsetof(struct kvm_vcpu, arch.slb_nr));
DEFINE(VCPU_FAULT_DSISR, offsetof(struct kvm_vcpu, arch.fault_dsisr));
DEFINE(VCPU_FAULT_DAR, offsetof(struct kvm_vcpu, arch.fault_dar));
+   DEFINE(VCPU_INTR_MSR, offsetof(struct kvm_vcpu, arch.intr_msr));
DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
DEFINE(VCPU_TRAP, offsetof(struct kvm_vcpu, arch.trap));
DEFINE(VCPU_CFAR, offsetof(struct kvm_vcpu, arch.cfar));
diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c
index 83da1f8..8231b83 100644
--- a/arch/powerpc/kvm/book3s_64_mmu.c
+++ b/arch/powerpc/kvm/book3s_64_mmu.c
@@ -38,7 +38,7 @@
 
 static void kvmppc_mmu_book3s_64_reset_msr(struct kvm_vcpu *vcpu)
 {
-   kvmppc_set_msr(vcpu, MSR_SF);
+   kvmppc_set_msr(vcpu, vcpu-arch.intr_msr);
 }
 
 static struct kvmppc_slb *kvmppc_mmu_book3s_64_find_slbe(
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index c5c052a..8c05cb5 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -249,7 +249,7 @@ static void kvmppc_recalc_shadow_msr(struct kvm_vcpu *vcpu)
ulong smsr = vcpu-arch.shared-msr;
 
/* Guest MSR values */
-   smsr = MSR_FE0 | MSR_FE1 | MSR_SF | MSR_SE | MSR_BE;
+   smsr = MSR_FE0 | MSR_FE1 | MSR_SF | MSR_SE | MSR_BE | MSR_LE;
/* Process MSR values */
smsr |= MSR_ME | MSR_RI | MSR_IR | MSR_DR | MSR_PR | MSR_EE;
/* External providers the guest reserved */
@@ -1110,6 +1110,15 @@ static int kvmppc_get_one_reg_pr(struct kvm_vcpu *vcpu, 
u64 id,
case KVM_REG_PPC_HIOR:
*val = get_reg_val(id, to_book3s(vcpu)-hior);
break;
+   case KVM_REG_PPC_LPCR:
+   /*
+* We are only interested in the LPCR_ILE bit
+*/
+   if (vcpu-arch.intr_msr  MSR_LE)
+   *val = get_reg_val(id, LPCR_ILE);
+   else
+   *val = get_reg_val(id, 0);
+   break;
default:
r = -EINVAL;
break;
@@ -1118,6 +1127,14 @@ static int kvmppc_get_one_reg_pr(struct kvm_vcpu *vcpu, 
u64 id,
return r;
 }
 
+static void kvmppc_set_lpcr_pr(struct kvm_vcpu *vcpu, u64 new_lpcr)
+{
+   if (new_lpcr  LPCR_ILE)
+   vcpu-arch.intr_msr |= MSR_LE;
+   else
+   vcpu-arch.intr_msr = ~MSR_LE;
+}
+
 static int kvmppc_set_one_reg_pr(struct kvm_vcpu *vcpu, u64 id,
 union kvmppc_one_reg *val)
 {
@@ -1128,6 +1145,9 @@ static int kvmppc_set_one_reg_pr(struct kvm_vcpu *vcpu, 
u64 id,
to_book3s(vcpu)-hior = set_reg_val(id, *val);
to_book3s(vcpu)-hior_explicit = true;
break;
+   case KVM_REG_PPC_LPCR:
+   kvmppc_set_lpcr_pr(vcpu, set_reg_val(id, *val));
+   break;
default:
r = -EINVAL;
break;
@@ -1180,6 +1200,7 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_pr(struct 
kvm *kvm,
vcpu-arch.pvr = 0x3C0301;
if (mmu_has_feature(MMU_FTR_1T_SEGMENT))
vcpu-arch.pvr = mfspr(SPRN_PVR);

[PULL 06/41] KVM: PPC: Book3S_32: PR: Access HTAB in big endian

2014-05-30 Thread Alexander Graf
The HTAB is always big endian. We access the guest's HTAB using
copy_from/to_user, but don't yet take care of the fact that we might
be running on an LE host.

Wrap all accesses to the guest HTAB with big endian accessors.
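
A quick standalone illustration of why the accessors are needed: after
copy_from_user() the HTAB words are still in big-endian byte order, so a raw
u32 load on an LE host comes out byte swapped and the compare against ptem
only works after a be32-to-CPU conversion. (The helper below is a portable
stand-in for the kernel's be32_to_cpu.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t be32_to_cpu_sketch(uint32_t x)
{
        unsigned char b[4];

        memcpy(b, &x, sizeof(b));               /* memory (big-endian) order */
        return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
               ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

int main(void)
{
        unsigned char htab_bytes[4] = { 0x80, 0x01, 0x23, 0x45 }; /* BE PTE word */
        uint32_t raw, pte0;

        memcpy(&raw, htab_bytes, sizeof(raw));  /* what copy_from_user leaves */
        pte0 = be32_to_cpu_sketch(raw);

        printf("raw  = 0x%08x\n", raw);         /* 0x45230180 on an LE host */
        printf("pte0 = 0x%08x\n", pte0);        /* 0x80012345 everywhere */
        return 0;
}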

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_32_mmu.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index 60fc3f4..0e42b16 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -208,6 +208,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu 
*vcpu, gva_t eaddr,
u32 sre;
hva_t ptegp;
u32 pteg[16];
+   u32 pte0, pte1;
u32 ptem = 0;
int i;
int found = 0;
@@ -233,11 +234,13 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu 
*vcpu, gva_t eaddr,
}
 
for (i=0; i16; i+=2) {
-   if (ptem == pteg[i]) {
+   pte0 = be32_to_cpu(pteg[i]);
+   pte1 = be32_to_cpu(pteg[i + 1]);
+   if (ptem == pte0) {
u8 pp;
 
-   pte-raddr = (pteg[i+1]  ~(0xFFFULL)) | (eaddr  
0xFFF);
-   pp = pteg[i+1]  3;
+   pte-raddr = (pte1  ~(0xFFFULL)) | (eaddr  0xFFF);
+   pp = pte1  3;
 
if ((sr_kp(sre)   (vcpu-arch.shared-msr  MSR_PR)) 
||
(sr_ks(sre)  !(vcpu-arch.shared-msr  MSR_PR)))
@@ -260,7 +263,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu 
*vcpu, gva_t eaddr,
}
 
dprintk_pte(MMU: Found PTE - %x %x - %x\n,
-   pteg[i], pteg[i+1], pp);
+   pte0, pte1, pp);
found = 1;
break;
}
@@ -269,7 +272,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu 
*vcpu, gva_t eaddr,
/* Update PTE C and A bits, so the guest's swapper knows we used the
   page */
if (found) {
-   u32 pte_r = pteg[i+1];
+   u32 pte_r = pte1;
char __user *addr = (char __user *) (ptegp + (i+1) * 
sizeof(u32));
 
/*
@@ -296,7 +299,8 @@ no_page_found:
to_book3s(vcpu)-sdr1, ptegp);
for (i=0; i16; i+=2) {
dprintk_pte(   %02d: 0x%x - 0x%x (0x%x)\n,
-   i, pteg[i], pteg[i+1], ptem);
+   i, be32_to_cpu(pteg[i]),
+   be32_to_cpu(pteg[i+1]), ptem);
}
}
 
-- 
1.8.1.4



[PULL 08/41] KVM: PPC: Book3S_64 PR: Access shadow slb in big endian

2014-05-30 Thread Alexander Graf
The shadow SLB in the PACA is shared with the hypervisor, so it has to
be big endian. We access the shadow SLB during world switch, so let's make
sure we access it in big endian even when we're on a little endian host.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_64_slb.S | 33 -
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S
index 4f12e8f..596140e 100644
--- a/arch/powerpc/kvm/book3s_64_slb.S
+++ b/arch/powerpc/kvm/book3s_64_slb.S
@@ -17,29 +17,28 @@
  * Authors: Alexander Graf ag...@suse.de
  */
 
-#ifdef __LITTLE_ENDIAN__
-#error Need to fix SLB shadow accesses in little endian mode
-#endif
-
 #define SHADOW_SLB_ESID(num)   (SLBSHADOW_SAVEAREA + (num * 0x10))
 #define SHADOW_SLB_VSID(num)   (SLBSHADOW_SAVEAREA + (num * 0x10) + 0x8)
 #define UNBOLT_SLB_ENTRY(num) \
-   ld  r9, SHADOW_SLB_ESID(num)(r12); \
-   /* Invalid? Skip. */; \
-   rldicl. r0, r9, 37, 63; \
-   beq slb_entry_skip_ ## num; \
-   xoris   r9, r9, SLB_ESID_V@h; \
-   std r9, SHADOW_SLB_ESID(num)(r12); \
+   li  r11, SHADOW_SLB_ESID(num);  \
+   LDX_BE  r9, r12, r11;   \
+   /* Invalid? Skip. */;   \
+   rldicl. r0, r9, 37, 63; \
+   beq slb_entry_skip_ ## num; \
+   xoris   r9, r9, SLB_ESID_V@h;   \
+   STDX_BE r9, r12, r11;   \
   slb_entry_skip_ ## num:
 
 #define REBOLT_SLB_ENTRY(num) \
-   ld  r10, SHADOW_SLB_ESID(num)(r11); \
-   cmpdi   r10, 0; \
-   beq slb_exit_skip_ ## num; \
-   orisr10, r10, SLB_ESID_V@h; \
-   ld  r9, SHADOW_SLB_VSID(num)(r11); \
-   slbmte  r9, r10; \
-   std r10, SHADOW_SLB_ESID(num)(r11); \
+   li  r8, SHADOW_SLB_ESID(num);   \
+   li  r7, SHADOW_SLB_VSID(num);   \
+   LDX_BE  r10, r11, r8;   \
+   cmpdi   r10, 0; \
+   beq slb_exit_skip_ ## num;  \
+   orisr10, r10, SLB_ESID_V@h; \
+   LDX_BE  r9, r11, r7;\
+   slbmte  r9, r10;\
+   STDX_BE r10, r11, r8;   \
 slb_exit_skip_ ## num:
 
 /**
-- 
1.8.1.4



Re: [PULL 00/41] ppc patch queue 2014-05-30

2014-05-30 Thread Paolo Bonzini

On 30/05/2014 14:42, Alexander Graf wrote:

Hi Paolo / Marcelo,

This is my current patch queue for ppc.  Please pull.

Alex


The following changes since commit 1f854112553a1d65363ab27d4ee3dfb4b27075fb:

  KVM: vmx: DR7 masking on task switch emulation is wrong (2014-05-22 17:47:18 
+0200)

are available in the git repository at:

  git://github.com/agraf/linux-2.6.git tags/signed-kvm-ppc-next

for you to fetch changes up to d8d164a9850d486cc48081c18831680254688d0f:

  KVM: PPC: Book3S PR: Rework SLB switching code (2014-05-30 14:26:30 +0200)


Patch queue for ppc - 2014-05-30

In this round we have a few nice gems. PR KVM gains initial POWER8 support
as well as LE host awareness, the e500 targets can now properly run u-boot,
LE guests now work with PR KVM including KVM hypercalls and HV KVM guests
can now use huge pages.

On top of this there are some bug fixes.


Thanks for sending the patches well before the merge window!

There is a conflict in capability numbers.  KVM_CAP_PPC_FIXUP_HCALL is 
102 on the branch, but will be 103 when I merge.


This will be a very large release for KVM, with over 200 patches 
scattered over all architectures except ia64 (~25 MIPS, ~20 ARM, ~40 
PPC, ~35 x86, ~80 s390).


Paolo



Alexander Graf (27):
  KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR
  KVM: PPC: E500: Add dcbtls emulation
  KVM: PPC: Book3S: PR: Fix C/R bit setting
  KVM: PPC: Book3S_32: PR: Access HTAB in big endian
  KVM: PPC: Book3S_64 PR: Access HTAB in big endian
  KVM: PPC: Book3S_64 PR: Access shadow slb in big endian
  KVM: PPC: Book3S PR: Default to big endian guest
  KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian
  KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian
  KVM: PPC: PR: Fill pvinfo hcall instructions in big endian
  KVM: PPC: Make shared struct aka magic page guest endian
  KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions
  KVM: PPC: Book3S: Move little endian conflict to HV KVM
  KVM: PPC: Book3S PR: Ignore PMU SPRs
  KVM: PPC: Book3S PR: Emulate TIR register
  KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR
  KVM: PPC: Book3S PR: Expose TAR facility to guest
  KVM: PPC: Book3S PR: Expose EBB registers
  KVM: PPC: Book3S PR: Expose TM registers
  KVM: PPC: Disable NX for old magic page using guests
  PPC: KVM: Make NX bit available with magic page
  PPC: ePAPR: Fix hypercall on LE guest
  KVM: PPC: Graciously fail broken LE hypercalls
  KVM: PPC: MPIC: Reset IRQ source private members
  KVM: PPC: Add CAP to indicate hcall fixes
  KVM: PPC: Book3S PR: Use SLB entry 0
  KVM: PPC: Book3S PR: Rework SLB switching code

Alexey Kardashevskiy (1):
  KVM: PPC: Book3S HV: Fix dirty map for hugepages

Aneesh Kumar K.V (6):
  KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest
  KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
  KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation
  KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
  KVM: PPC: BOOK3S: Always use the saved DAR value
  KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler

Paul Mackerras (7):
  KVM: PPC: Book3S: Add ONE_REG register names that were missed
  KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
  KVM: PPC: Book3S HV: Fix check for running inside guest in 
global_invalidates()
  KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
  KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
  KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
  KVM: PPC: Book3S HV: Fix machine check delivery to guest

 Documentation/virtual/kvm/api.txt |   6 +
 Documentation/virtual/kvm/ppc-pv.txt  |  14 ++
 arch/powerpc/include/asm/disassemble.h|  34 +
 arch/powerpc/include/asm/kvm_asm.h|  18 ++-
 arch/powerpc/include/asm/kvm_book3s.h |   3 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  | 146 +++---
 arch/powerpc/include/asm/kvm_book3s_asm.h |   2 +
 arch/powerpc/include/asm/kvm_booke.h  |   5 -
 arch/powerpc/include/asm/kvm_host.h   |   9 +-
 arch/powerpc/include/asm/kvm_ppc.h|  80 +-
 arch/powerpc/include/asm/reg.h|  12 +-
 arch/powerpc/include/asm/reg_booke.h  |   1 +
 arch/powerpc/include/uapi/asm/kvm.h   |   2 +-
 arch/powerpc/include/uapi/asm/kvm_para.h  |   6 +
 arch/powerpc/kernel/align.c   |  34 +
 arch/powerpc/kernel/asm-offsets.c |  11 +-
 arch/powerpc/kernel/epapr_paravirt.c  |   5 +-
 arch/powerpc/kernel/kvm.c |   2 +-
 arch/powerpc/kernel/paca.c|   3 +
 arch/powerpc/kvm/Kconfig  |   2 +-
 arch/powerpc/kvm/book3s.c | 106 

Re: [PULL 00/41] ppc patch queue 2014-05-30

2014-05-30 Thread Alexander Graf


 On 30.05.2014 at 14:58, Paolo Bonzini pbonz...@redhat.com wrote:
 
 On 30/05/2014 14:42, Alexander Graf wrote:
 Hi Paolo / Marcelo,
 
 This is my current patch queue for ppc.  Please pull.
 
 Alex
 
 
 The following changes since commit 1f854112553a1d65363ab27d4ee3dfb4b27075fb:
 
  KVM: vmx: DR7 masking on task switch emulation is wrong (2014-05-22 
 17:47:18 +0200)
 
 are available in the git repository at:
 
  git://github.com/agraf/linux-2.6.git tags/signed-kvm-ppc-next
 
 for you to fetch changes up to d8d164a9850d486cc48081c18831680254688d0f:
 
  KVM: PPC: Book3S PR: Rework SLB switching code (2014-05-30 14:26:30 +0200)
 
 
 Patch queue for ppc - 2014-05-30
 
 In this round we have a few nice gems. PR KVM gains initial POWER8 support
 as well as LE host awareness, the e500 targets can now properly run u-boot,
 LE guests now work with PR KVM including KVM hypercalls and HV KVM guests
 can now use huge pages.
 
 On top of this there are some bug fixes.
 
 Thanks for sending the patches well before the merge window!

Heh, I figured I'd be nice for a change. And my qemu queue is beyond 100 
patches already and waiting on this one ;).

 
 There is a conflict in capability numbers.  KVM_CAP_PPC_FIXUP_HCALL is 102 on 
 the branch, but will be 103 when I merge.

That's ok, I've waited for the consumer of this cap until now, so the wrong 
number will be unused. Thanks a lot for the heads-up though :). And thanks for 
merging!

 
 This will be a very large release for KVM, with over 200 patches scattered 
 over all architectures except ia64 (~25 MIPS, ~20 ARM, ~40 PPC, ~35 x86, ~80 
 s390).

Woot, nice!

Alex



does anybody still care about kvm-ia64?

2014-05-30 Thread Paolo Bonzini
I was thinking of removing it in Linux 3.17.  I'm not even sure it 
compiles right now, hasn't seen any action in years, and all open-source 
userspace code to use it has been dead for years.


If you disagree, please speak up loudly in the next month.

Paolo


[PATCH v11 07/16] qspinlock: Use a simple write to grab the lock, if applicable

2014-05-30 Thread Waiman Long
Currently, atomic_cmpxchg() is used to get the lock. However, this is
not really necessary if there is more than one task in the queue and
the queue head doesn't need to reset the queue code word. For that case,
a simple write to set the lock bit is enough, as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code ensures that the pending bit will not be set once the
queue code word (tail) in the lock is set.
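
Condensed into a sketch (not the verbatim kernel code), the queue head's
acquisition step after this change looks as follows; set_locked() is the new
helper added below, and _Q_LOCKED_PENDING_MASK is assumed to cover the locked
byte plus the pending bit as in the rest of the series.

/* Queue-head acquisition: plain store when others are queued behind us,
 * cmpxchg only when we also have to clear the tail. */
static void head_acquire_sketch(struct qspinlock *lock, u32 tail)
{
        u32 val;

        for (;;) {
                val = atomic_read(&lock->val);
                if (val & _Q_LOCKED_PENDING_MASK) {
                        cpu_relax();            /* owner or pending still active */
                        continue;
                }
                if (val != tail) {
                        /* Others are queued: the tail must survive, so a
                         * plain write of the locked byte is sufficient. */
                        set_locked(lock);
                        return;
                }
                /* We are the only queued CPU: take the lock and clear the
                 * tail in one shot, which still needs a cmpxchg. */
                if (atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL) == val)
                        return;
        }
}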

With that change, there is some slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasksBefore patchAfter patch %Change
  ----- --  ---
   3 2324/2321  2248/2265-3%/-2%
   4 2890/2896  2819/2831-2%/-2%
   5 3611/3595  3522/3512-2%/-2%
   6 4281/4276  4173/4160-3%/-3%
   7 5018/5001  4875/4861-3%/-3%
   8 5759/5750  5563/5568-3%/-3%

[Standalone/Embedded - different nodes]
  # of tasksBefore patchAfter patch %Change
  ----- --  ---
   312242/12237 12087/12093  -1%/-1%
   410688/10696 10507/10521  -2%/-2%

It was also found that this change produced a much bigger performance
improvement on the newer IvyBridge-EX chip and essentially closed
the performance gap between the ticket spinlock and the queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  ticketlock56782333.17   96.61   5.81
  qspinlock 57507993.13   94.83   5.97

AIM7 EXT4 Disk Test
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  ticketlock1114551   16.15  509.72   7.11
  qspinlock 21844668.24  232.99   6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.

The ebizzy -m test was also run with the following results:

  kernel   records/s  Real Time   Sys TimeUsr Time
  --  -   
  ticketlock 2075   10.00  216.35   3.49
  qspinlock  3023   10.00  198.20   4.80

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c |   62 ---
 1 files changed, 46 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7f10758..2c7abe7 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
union {
atomic_t val;
-   struct {
 #ifdef __LITTLE_ENDIAN
+   u8   locked;
+   struct {
u16 locked_pending;
u16 tail;
+   };
 #else
+   struct {
u16 tail;
u16 locked_pending;
-#endif
};
+   struct {
+   u8  reserved[3];
+   u8  locked;
+   };
+#endif
};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -197,6 +206,22 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * This routine should only be called when the caller is the only one
+ * entitled to acquire the lock.
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   barrier();
ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
+   barrier();
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
 

[PATCH v11 16/16] pvqspinlock, x86: Enable PV qspinlock for XEN

2014-05-30 Thread Waiman Long
This patch adds the necessary XEN specific code to allow XEN to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/xen/spinlock.c |  147 +--
 kernel/Kconfig.locks|2 +-
 2 files changed, 143 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d1b6a32..2a259bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,12 @@
 #include xen-ops.h
 #include debugfs.h
 
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifndef CONFIG_QUEUE_SPINLOCK
+
 enum xen_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -100,12 +106,9 @@ struct xen_lock_waiting {
__ticket_t want;
 };
 
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
 static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
 static cpumask_t waiting_cpus;
 
-static bool xen_pvspin = true;
 __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 {
int irq = __this_cpu_read(lock_kicker_irq);
@@ -213,6 +216,118 @@ static void xen_unlock_kick(struct arch_spinlock *lock, 
__ticket_t next)
}
 }
 
+#else /* CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_XEN_DEBUG_FS
+static u32 kick_nohlt_stats;   /* Kick but not halt count  */
+static u32 halt_qhead_stats;   /* Queue head halting count */
+static u32 halt_qnode_stats;   /* Queue node halting count */
+static u32 halt_abort_stats;   /* Halting abort count  */
+static u32 wake_kick_stats;/* Wakeup by kicking count  */
+static u32 wake_spur_stats;/* Spurious wakeup count*/
+static u64 time_blocked;   /* Total blocking time  */
+
+static inline void xen_halt_stats(enum pv_lock_stats type)
+{
+   if (type == PV_HALT_QHEAD)
+   add_smp(halt_qhead_stats, 1);
+   else if (type == PV_HALT_QNODE)
+   add_smp(halt_qnode_stats, 1);
+   else /* type == PV_HALT_ABORT */
+   add_smp(halt_abort_stats, 1);
+}
+
+static inline void xen_lock_stats(enum pv_lock_stats type)
+{
+   if (type == PV_WAKE_KICKED)
+   add_smp(wake_kick_stats, 1);
+   else if (type == PV_WAKE_SPURIOUS)
+   add_smp(wake_spur_stats, 1);
+   else /* type == PV_KICK_NOHALT */
+   add_smp(kick_nohlt_stats, 1);
+}
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u64 delta;
+
+   delta = sched_clock() - start;
+   add_smp(time_blocked, delta);
+}
+#else /* CONFIG_XEN_DEBUG_FS */
+static inline void xen_halt_stats(enum pv_lock_stats type)
+{
+}
+
+static inline void xen_lock_stats(enum pv_lock_stats type)
+{
+}
+
+static inline u64 spin_time_start(void)
+{
+   return 0;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+}
+#endif /* CONFIG_XEN_DEBUG_FS */
+
+static void xen_kick_cpu(int cpu)
+{
+   xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+}
+
+/*
+ * Halt the current CPU  release it back to the host
+ */
+static void xen_halt_cpu(enum pv_lock_stats type, s8 *state, s8 sval)
+{
+   int irq = __this_cpu_read(lock_kicker_irq);
+   unsigned long flags;
+   u64 start;
+
+   /* If kicker interrupts not initialized yet, just spin */
+   if (irq == -1)
+   return;
+
+   /*
+* Make sure an interrupt handler can't upset things in a
+* partially setup state.
+*/
+   local_irq_save(flags);
+   start = spin_time_start();
+
+   xen_halt_stats(type);
+   /* clear pending */
+   xen_clear_irq_pending(irq);
+
+   /* Allow interrupts while blocked */
+   local_irq_restore(flags);
+   /*
+* Don't halt if the CPU state has been changed.
+*/
+   if (ACCESS_ONCE(*state) != sval) {
+   xen_halt_stats(PV_HALT_ABORT);
+   return;
+   }
+   /*
+* If an interrupt happens here, it will leave the wakeup irq
+* pending, which will cause xen_poll_irq() to return
+* immediately.
+*/
+
+   /* Block until irq becomes pending (or perhaps a spurious wakeup) */
+   xen_poll_irq(irq);
+   spin_time_accum_blocked(start);
+}
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
BUG();
@@ -258,7 +373,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -275,8 +389,15 @@ void __init xen_init_spinlocks(void)
return;
}
printk(KERN_DEBUG 

[PATCH v11 15/16] pvqspinlock, x86: Enable PV qspinlock PV for KVM

2014-05-30 Thread Waiman Long
This patch adds the necessary KVM specific code to allow KVM to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Two KVM guests of 20 CPU cores (2 nodes) were created for performance
testing in one of the following two configurations:
 1) Only 1 VM is active
 2) Both VMs are active and they share the same 20 physical CPUs
   (200% overcommit)

The tests run included the disk workload of the AIM7 benchmark on both
ext4 and xfs RAM disks at 3000 users on a 3.15-rc7 based kernel. The
ebizzy -m test was also run and its performance data were
recorded.  With two VMs running, the idle=poll kernel option was
added to simulate a busy guest. The entry "unfair + PV qspinlock"
below means that both the unfair lock and PV spinlock configuration
options were turned on.

AIM7 XFS Disk Test (no overcommit)
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  PV ticketlock 25210087.24  101.02   5.24
  qspinlock 25714297.00   99.10   5.49
  PV qspinlock  25352117.10  100.32   5.45
  unfair qspinlock  25714297.00   99.25   5.40
  unfair + PV qspinlock 25495757.06   99.81   5.31

AIM7 XFS Disk Test (200% overcommit)
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  PV ticketlock 76890223.41  341.71   3.07
  qspinlock 78465622.94  346.22   2.90
  PV qspinlock  77386123.26  352.47   2.30
  unfair qspinlock  83565521.54  316.52   1.57
  unfair + PV qspinlock 79716522.58  323.95   3.58

AIM7 EXT4 Disk Test (no overcommit)
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  PV ticketlock 19565229.20  106.58   5.35
  qspinlock 19955659.02  103.19   5.37
  PV qspinlock  19586519.19  106.57   5.30
  unfair qspinlock  20224728.90  103.58   5.37
  unfair + PV qspinlock 19911509.04  104.41   5.46

AIM7 EXT4 Disk Test (200% overcommit)
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  PV ticketlock 57655331.22  407.44   1.51
  qspinlock 60955029.53  407.14   1.69
  PV qspinlock  59210530.40  410.51   1.67
  unfair qspinlock  67289726.75  359.78   1.66
  unfair + PV qspinlock 67039126.85  357.09   0.63

EBIZZY-M Test (no overcommit)
  kernelRec/s   Real Time   Sys TimeUsr Time
  - -   -   
  PV ticketlock 1328  10.00  82.821.46
  qspinlock 1679  10.00  65.371.80
  PV qspinlock  1470  10.00  75.541.54
  unfair qspinlock  1518  10.00  70.801.71
  unfair + PV qspinlock 1585  10.00  69.021.76

EBIZZY-M Test (200% overcommit)
  kernelRec/s   Real Time   Sys TimeUsr Time
  - -   -   
  PV ticketlock  453  10.00  77.110.00
  qspinlock  459  10.00  77.500.00
  PV qspinlock   402  10.00  91.550.00
  unfair qspinlock   570  10.00  62.980.00
  unfair + PV qspinlock  586  10.00  59.680.00

Signed-off-by: Waiman Long waiman.l...@hp.com
Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/kernel/kvm.c |  135 +
 kernel/Kconfig.locks  |2 +-
 2 files changed, 136 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7ab8ab3..eef427b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -567,6 +567,7 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -794,6 +795,134 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, 
__ticket_t ticket)
}
}
 }
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 kick_nohlt_stats;   /* Kick but not halt count  */
+static u32 halt_qhead_stats;   /* Queue head halting count */
+static u32 halt_qnode_stats;   /* Queue node halting count */
+static u32 halt_abort_stats;   /* Halting abort count  */
+static u32 wake_kick_stats;   

[PATCH v11 13/16] pvqspinlock: Enable coexistence with the unfair lock

2014-05-30 Thread Waiman Long
This patch enables the coexistence of both the PV qspinlock and
unfair lock.  When both are enabled, however, only the lock fastpath
will perform lock stealing whereas the slowpath will have that disabled
to get the best of both features.

We also need to transition a CPU spinning too long in the pending
bit code path back to the regular queuing code path so that it can
be properly halted by the PV qspinlock code.
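
As a rough stand-alone sketch of that bounded spin (the names, the trylock
body and the threshold value here are illustrative only, not the kernel code
in this patch):

#include <stdatomic.h>
#include <stdbool.h>

#define PSPIN_THRESHOLD	(1 << 10)	/* same order of magnitude as the patch */

static bool try_fast(atomic_int *lock)
{
	int expected = 0;
	return atomic_compare_exchange_strong(lock, &expected, 1);
}

static void queue_and_wait(atomic_int *lock)
{
	/* stand-in for the regular queuing slowpath, where the PV code
	 * can eventually halt the vCPU instead of spinning forever */
	int expected;

	do {
		expected = 0;
	} while (!atomic_compare_exchange_weak(lock, &expected, 1));
}

void lock_with_threshold(atomic_int *lock)
{
	for (int retry = PSPIN_THRESHOLD; retry > 0; retry--) {
		if (try_fast(lock))
			return;
	}
	queue_and_wait(lock);	/* spun too long: fall back to the queue */
}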

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c |   47 ---
 1 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 93c663a..8deedcf 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -57,12 +57,24 @@
 #include mcs_spinlock.h
 
 /*
+ * Check the pending bit spinning threshold only if PV qspinlock is enabled
+ */
+#define PSPIN_THRESHOLD(1  10)
+#define MAX_NODES  4
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define pv_qspinlock_enabled() static_key_false(&paravirt_spinlocks_enabled)
+#else
+#define pv_qspinlock_enabled() false
+#endif
+
+/*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
  * Exactly fits one cacheline.
  */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
 
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -265,6 +277,9 @@ static noinline void queue_spin_lock_slowerpath(struct 
qspinlock *lock,
ACCESS_ONCE(prev-next) = node;
 
arch_mcs_spin_lock_contended(node-locked);
+   } else {
+   /* Mark it as the queue head */
+   ACCESS_ONCE(node-locked) = true;
}
 
/*
@@ -344,14 +359,17 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
struct mcs_spinlock *node;
u32 new, old, tail;
int idx;
+   int retry = INT_MAX;/* Retry count, queue if <= 0 */
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
 #ifdef CONFIG_VIRT_UNFAIR_LOCKS
/*
 * A simple test and set unfair lock
+* Disable waiter lock stealing if PV spinlock is enabled
 */
-   if (static_key_false(virt_unfairlocks_enabled)) {
+   if (!pv_qspinlock_enabled() &&
+   static_key_false(virt_unfairlocks_enabled)) {
cpu_relax();/* Relax after a failed lock attempt */
while (!queue_spin_trylock(lock))
cpu_relax();
@@ -360,6 +378,14 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 #endif /* CONFIG_VIRT_UNFAIR_LOCKS */
 
/*
+* When PV qspinlock is enabled, exit the pending bit code path and
+* go back to the regular queuing path if the lock isn't available
+* within a certain threshold.
+*/
+   if (pv_qspinlock_enabled())
+   retry = PSPIN_THRESHOLD;
+
+   /*
 * trylock || pending
 *
 * 0,0,0 - 0,0,1 ; trylock
@@ -370,7 +396,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 * If we observe that the queue is not empty or both
 * the pending and lock bits are set, queue
 */
-   if ((val & _Q_TAIL_MASK) ||
+   if ((val & _Q_TAIL_MASK) || (retry-- <= 0) ||
(val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)))
goto queue;
 
@@ -413,8 +439,21 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 * sequentiality; this because not all clear_pending_set_locked()
 * implementations imply full barriers.
 */
-   while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
+   while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK) {
+   if (pv_qspinlock_enabled() && (retry-- <= 0)) {
+   /*
+* Clear the pending bit and queue
+*/
+   for (;;) {
+   new = val & ~_Q_PENDING_MASK;
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   goto queue;
+   val = old;
+   }
+   }
arch_mutex_cpu_relax();
+   }
 
/*
 * take ownership and clear the pending bit.
-- 
1.7.1



[PATCH v11 12/16] pvqspinlock, x86: Add PV data structure methods

2014-05-30 Thread Waiman Long
This patch modifies the para-virtualization (PV) infrastructure code
of the x86-64 architecture to support the PV queue spinlock. Three
new virtual methods are added to support PV qspinlock:

 1) kick_cpu - schedule in a virtual CPU
 2) halt_cpu - schedule out a virtual CPU
 3) lockstat - update statistical data for debugfs
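
Under the hood this is the usual paravirt pattern: a table of function
pointers that defaults to no-ops on bare metal and is overwritten by the
hypervisor backend at init time. A minimal stand-alone model of that pattern
(names and the printf stand-in are illustrative, not the kernel's pv_lock_ops;
only two of the three hooks are modelled):

#include <stdio.h>

struct pv_lock_ops_model {
	void (*kick_cpu)(int cpu);
	void (*halt_cpu)(void);
};

static void nop_kick(int cpu) { (void)cpu; }	/* plays the role of paravirt_nop */
static void nop_halt(void) { }

static struct pv_lock_ops_model ops = {
	.kick_cpu = nop_kick,
	.halt_cpu = nop_halt,
};

static void kvm_kick(int cpu) { printf("hypercall: kick vCPU %d\n", cpu); }

int main(void)
{
	ops.kick_cpu(0);	 /* bare metal: falls through to the no-op */
	ops.kick_cpu = kvm_kick; /* guest init installs the real hook */
	ops.kick_cpu(0);	 /* now routed through the hypervisor backend */
	return 0;
}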

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/paravirt.h   |   18 +-
 arch/x86/include/asm/paravirt_types.h |   17 +
 arch/x86/kernel/paravirt-spinlocks.c  |6 ++
 3 files changed, 40 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..d71e123 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -711,7 +711,23 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 }
 
 #if defined(CONFIG_SMP)  defined(CONFIG_PARAVIRT_SPINLOCKS)
+#ifdef CONFIG_QUEUE_SPINLOCK
+static __always_inline void __queue_kick_cpu(int cpu)
+{
+   PVOP_VCALL1(pv_lock_ops.kick_cpu, cpu);
+}
+
+static __always_inline void
+__queue_halt_cpu(enum pv_lock_stats type, s8 *state, s8 sval)
+{
+   PVOP_VCALL3(pv_lock_ops.halt_cpu, type, state, sval);
+}
 
+static __always_inline void __queue_lockstat(enum pv_lock_stats type)
+{
+   PVOP_VCALL1(pv_lock_ops.lockstat, type);
+}
+#else
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -723,7 +739,7 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
 {
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
-
+#endif
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..549b3a0 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,26 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+enum pv_lock_stats {
+   PV_HALT_QHEAD,  /* Queue head halting   */
+   PV_HALT_QNODE,  /* Other queue node halting */
+   PV_HALT_ABORT,  /* Halting aborted  */
+   PV_WAKE_KICKED, /* Wakeup by kicking*/
+   PV_WAKE_SPURIOUS,   /* Spurious wakeup  */
+   PV_KICK_NOHALT  /* Kick but CPU not halted  */
+};
+#endif
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+   void (*kick_cpu)(int cpu);
+   void (*halt_cpu)(enum pv_lock_stats type, s8 *state, s8 sval);
+   void (*lockstat)(enum pv_lock_stats type);
+#else
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index d9041c9..17435b7 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -11,9 +11,15 @@
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
+#ifdef CONFIG_QUEUE_SPINLOCK
+   .kick_cpu = paravirt_nop,
+   .halt_cpu = paravirt_nop,
+   .lockstat = paravirt_nop,
+#else
.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
.unlock_kick = paravirt_nop,
 #endif
+#endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-- 
1.7.1



[PATCH v11 14/16] pvqspinlock: Add qspinlock para-virtualization support

2014-05-30 Thread Waiman Long
This patch adds base para-virtualization support to the queue
spinlock in the same way as was done in the PV ticket lock code. In
essence, the lock waiters will spin for a specified number of times
(QSPIN_THRESHOLD = 2^14) and then halt themselves. The queue head waiter,
unlike the other waiters, will spin 2*QSPIN_THRESHOLD times before
halting itself.  Before being halted, the queue head waiter will set
a flag (_Q_LOCKED_SLOWPATH) in the lock byte to indicate that the
unlock slowpath has to be invoked.

In the unlock slowpath, the current lock holder will find the queue
head by following the previous node pointer links stored in the queue
node structure until it finds one that has the qhead flag turned
on. It then attempts to kick the CPU of the queue head.

After the queue head has acquired the lock, it will also check the status
of the next node and set the _Q_LOCKED_SLOWPATH flag if it has been halted.
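
For readers unfamiliar with the prev-pointer walk, here is a minimal
stand-alone model of the unlock-slowpath search for the queue head (the struct
layout and names are illustrative and do not match the pv_qnode/mcs_spinlock
definitions in this patch):

#include <stdio.h>
#include <stdbool.h>

struct pv_node {
	struct pv_node *prev;	/* link back toward the queue head */
	bool qhead;		/* set when this waiter becomes queue head */
	int cpu;		/* virtual CPU to kick if it is halted */
};

/* Walk from the queue tail toward the head and "kick" the head CPU. */
static void kick_queue_head(struct pv_node *tail)
{
	struct pv_node *node = tail;

	while (node && !node->qhead)
		node = node->prev;
	if (node)
		printf("kick vCPU %d (queue head)\n", node->cpu);
}

int main(void)
{
	struct pv_node head = { .prev = NULL,  .qhead = true,  .cpu = 2 };
	struct pv_node mid  = { .prev = &head, .qhead = false, .cpu = 5 };
	struct pv_node tail = { .prev = &mid,  .qhead = false, .cpu = 7 };

	kick_queue_head(&tail);	/* prints: kick vCPU 2 (queue head) */
	return 0;
}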

Enabling the PV code does have a performance impact on spinlock
acquisitions and releases. The following table shows the execution
time (in ms) of a spinlock micro-benchmark that does lock/unlock
operations 5M times for each task versus the number of contending
tasks on a Westmere-EX system.

  # ofTicket lockQueue lock
  tasks   PV off/PV on/%ChangePV off/PV on/%Change
  --     -
1135/  179/+33%  137/  168/+23%
2   1045/ 1103/ +6% 1161/ 1248/ +7%
3   1827/ 2683/+47% 2357/ 2600/+10%
4   2689/ 4191/+56% 2882/ 3115/ +8%
5   3736/ 5830/+56% 3493/ 3571/ +2%
6   4942/ 7609/+54% 4239/ 4198/ -1%
7   6304/ 9570/+52% 4931/ 4895/ -1%
8   7736/11323/+46% 5632/ 5588/ -1%

It can be seen that the ticket lock PV code has a fairly big decrease
in performance when there are 3 or more contending tasks. The queue
spinlock PV code, on the other hand, only has a relatively minor
drop in performance with 1-4 contending tasks. With 5 or more
contending tasks, there is practically no difference in performance.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/pvqspinlock.h |  359 
 arch/x86/include/asm/qspinlock.h   |   33 
 kernel/locking/qspinlock.c |   72 +++-
 3 files changed, 458 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h

diff --git a/arch/x86/include/asm/pvqspinlock.h 
b/arch/x86/include/asm/pvqspinlock.h
new file mode 100644
index 000..af00eda
--- /dev/null
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -0,0 +1,359 @@
+#ifndef _ASM_X86_PVQSPINLOCK_H
+#define _ASM_X86_PVQSPINLOCK_H
+
+/*
+ * Queue Spinlock Para-Virtualization (PV) Support
+ *
+ *  +--++--++--+
+ *   pv_qnode   |Queue |  prev  |  |  prev  |Queue |
+ *  | Head |---| Node |  -| Tail |
+ *  +--++--++--+
+ * |   |   |
+ * V   V   V
+ *  +--++--++--+
+ * mcs_spinlock |locked||locked||locked|
+ *  | = 1  |---| = 0  |-  | = 0  |
+ *  +--+  next  +--+  next  +--+
+ *
+ * The PV support code for queue spinlock is roughly the same as that
+ * of the ticket spinlock. Each CPU waiting for the lock will spin until it
+ * reaches a threshold. When that happens, it will put itself to halt so
+ * that the hypervisor can reuse the CPU cycles in some other guests as well
+ * as returning other hold-up CPUs faster.
+ *
+ * Auxiliary fields in the pv_qnode structure are used to hold information
+ * relevant to the PV support so that it won't impact the behavior and
+ * performance of the bare metal code. The structure contains a prev pointer
+ * so that a lock holder can find out the queue head from the queue tail
+ * following the prev pointers.
+ *
+ * A major difference between the two versions of PV spinlock is the fact
+ * that the spin threshold of the queue spinlock is half of that of the
+ * ticket spinlock. However, the queue head will spin twice as long as the
+ * other nodes before it puts itself to halt. The reason for that is to
+ * increase halting chance of heavily contended locks to favor lightly
+ * contended locks (queue depth of 1 or less).
+ *
+ * There are 2 places where races can happen:
+ *  1) Halting of the queue head CPU (in pv_head_spin_check) and the CPU
+ * kicking by the lock holder in the unlock path (in pv_kick_node).
+ *  2) Halting of the queue node CPU (in pv_queue_spin_check) and the
+ * the status check by the previous queue head (in pv_halt_check).
+ * See the comments on those functions to see how the races are being
+ * addressed.
+ */
+
+/*
+ * Spin threshold for queue spinlock
+ */
+#define QSPIN_THRESHOLD (1U << 14)

Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Paolo Bonzini

Il 30/05/2014 14:42, Alexander Graf ha scritto:

From: Paul Mackerras pau...@samba.org

Commit b005255e12a3 (KVM: PPC: Book3S HV: Context-switch new POWER8
SPRs) added a definition of KVM_REG_PPC_WORT with the same register
number as the existing KVM_REG_PPC_VRSAVE (though in fact the
definitions are not identical because of the different register sizes.)

For clarity, this moves KVM_REG_PPC_WORT to the next unused number,
and also adds it to api.txt.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 Documentation/virtual/kvm/api.txt   | 1 +
 arch/powerpc/include/uapi/asm/kvm.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 9a95770..6b30290 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1873,6 +1873,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_PPR  | 64
   PPC   | KVM_REG_PPC_ARCH_COMPAT 32
   PPC   | KVM_REG_PPC_DABRX | 32
+  PPC   | KVM_REG_PPC_WORT  | 64
   PPC   | KVM_REG_PPC_TM_GPR0  | 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31 | 64
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index a6665be..2bc4a94 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -545,7 +545,6 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TCSCR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb1)
 #define KVM_REG_PPC_PID(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb2)
 #define KVM_REG_PPC_ACOP   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3)
-#define KVM_REG_PPC_WORT   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb4)

 #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
 #define KVM_REG_PPC_LPCR   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
@@ -555,6 +554,7 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_ARCH_COMPAT(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb7)

 #define KVM_REG_PPC_DABRX  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb8)
+#define KVM_REG_PPC_WORT   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb9)


This is an ABI break, this symbol was added in 3.14.  I think I should 
revert this.  Can you convince me otherwise?


Paolo



[PATCH v11 11/16] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled

2014-05-30 Thread Waiman Long
This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/spinlock.h  |4 ++--
 arch/x86/kernel/kvm.c|2 +-
 arch/x86/kernel/paravirt-spinlocks.c |4 ++--
 arch/x86/xen/spinlock.c  |2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 958d20f..428d0d1 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -39,7 +39,7 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD (1  15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
 #ifdef CONFIG_QUEUE_SPINLOCK
@@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t 
*lock,
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
if (TICKET_SLOWPATH_FLAG 
-   static_key_false(paravirt_ticketlocks_enabled)) {
+   static_key_false(paravirt_spinlocks_enabled)) {
arch_spinlock_t prev;
 
prev = *lock;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 0331cb3..7ab8ab3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -817,7 +817,7 @@ static __init int kvm_spinlock_init_jump(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return 0;
 
-   static_key_slow_inc(paravirt_ticketlocks_enabled);
+   static_key_slow_inc(paravirt_spinlocks_enabled);
printk(KERN_INFO KVM setup paravirtual spinlock\n);
 
return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 69ed806..d9041c9 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
 #endif
 
 #ifdef CONFIG_VIRT_UNFAIR_LOCKS
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 0ba5f3b..d1b6a32 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jump(void)
if (!xen_domain())
return 0;
 
-   static_key_slow_inc(paravirt_ticketlocks_enabled);
+   static_key_slow_inc(paravirt_spinlocks_enabled);
return 0;
 }
 early_initcall(xen_init_spinlocks_jump);
-- 
1.7.1



[PATCH v11 08/16] qspinlock: Prepare for unfair lock support

2014-05-30 Thread Waiman Long
If unfair lock is supported, the lock acquisition loop at the end of
the queue_spin_lock_slowpath() function may need to detect the fact
that the lock can be stolen. Code is added for the stolen lock detection.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c |   26 ++
 1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 2c7abe7..ae1b19d 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,7 +94,7 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
  *
- * This internal structure is also used by the set_locked function which
+ * This internal structure is also used by the try_set_locked function which
  * is not restricted to _Q_PENDING_BITS == 8.
  */
 struct __qspinlock {
@@ -206,19 +206,21 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
- * set_locked - Set the lock bit and own the lock
- * @lock: Pointer to queue spinlock structure
+ * try_set_locked - Try to set the lock bit and own the lock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 otherwise
  *
  * This routine should only be called when the caller is the only one
  * entitled to acquire the lock.
  */
-static __always_inline void set_locked(struct qspinlock *lock)
+static __always_inline int try_set_locked(struct qspinlock *lock)
 {
struct __qspinlock *l = (void *)lock;
 
barrier();
ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL;
barrier();
+   return 1;
 }
 
 /**
@@ -357,11 +359,12 @@ queue:
/*
 * we're at the head of the waitqueue, wait for the owner & pending to
 * go away.
-* Load-acquired is used here because the set_locked()
+* Load-acquired is used here because the try_set_locked()
 * function below may not be a full memory barrier.
 *
 * *,x,y - *,0,0
 */
+retry_queue_wait:
while ((val = smp_load_acquire(&lock->val.counter)) &
_Q_LOCKED_PENDING_MASK)
arch_mutex_cpu_relax();
@@ -378,13 +381,20 @@ queue:
 */
for (;;) {
if (val != tail) {
-   set_locked(lock);
-   break;
+   /*
+* The try_set_locked function will only fail if the
+* lock was stolen.
+*/
+   if (try_set_locked(lock))
+   break;
+   else
+   goto  retry_queue_wait;
}
old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
if (old == val)
goto release;   /* No contention */
-
+   else if (old & _Q_LOCKED_MASK)
+   goto retry_queue_wait;
val = old;
}
 
-- 
1.7.1



[PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-05-30 Thread Waiman Long
Locking is always an issue in a virtualized environment because of 2
different types of problems:
 1) Lock holder preemption
 2) Lock waiter preemption

One solution to the lock waiter preemption problem is to allow unfair
lock in a virtualized environment. In this case, a new lock acquirer
can come and steal the lock if the next-in-line CPU to get the lock
is scheduled out.

A simple unfair queue spinlock can be implemented by allowing lock
stealing in the fast path. The slowpath will also be modified to run a
simple queue_spin_trylock() loop. A simple test-and-set lock like that
does have the problem that the constant spinning on the lock word
puts a lot of cacheline contention traffic on the affected cacheline,
thus slowing tasks that need to access the cacheline.
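
For illustration, a user-space model of such a test-and-set fast path using
C11 atomics (only a sketch of the idea; the actual kernel code added by this
patch is the queue_spin_trylock_unfair()/queue_spin_lock_unfair() pair in the
diff below):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_uchar locked; } unfair_lock_t;

/* Any new arrival may grab a just-released lock, regardless of who was
 * waiting first - that is what makes the lock "unfair". */
static bool unfair_trylock(unfair_lock_t *l)
{
	unsigned char expected = 0;

	return atomic_load_explicit(&l->locked, memory_order_relaxed) == 0 &&
	       atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

void unfair_lock(unfair_lock_t *l)
{
	while (!unfair_trylock(l))
		;	/* this spinning is the cacheline traffic noted above */
}

void unfair_unlock(unfair_lock_t *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}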

Unfair lock in a native environment is generally not a good idea as
there is a possibility of lock starvation for a heavily contended lock.

This patch adds a new configuration option for the x86 architecture
to enable the use of unfair queue spinlock (VIRT_UNFAIR_LOCKS)
in a virtual guest. A jump label (virt_unfairlocks_enabled) is
used to switch between a fair and an unfair version of the spinlock
code. This jump label will only be enabled in a virtual guest where
the X86_FEATURE_HYPERVISOR feature bit is set.

Enabling this configuration feature causes a slight decrease in the
performance of an uncontended lock-unlock operation by about 1-2%,
mainly due to the use of a static key. However, uncontended lock-unlock
operations are really just a tiny percentage of a real workload. So
there should be no noticeable change in application performance.

With the unfair locking activated on bare metal 4-socket Westmere-EX
box, the execution times (in ms) of a spinlock micro-benchmark were
as follows:

  # ofTicket   Fair Unfair
  taskslock queue lockqueue lock
  --  ---   ----
1   135135   137
2   890   1082   613
3  1932   2248  1211
4  2829   2819  1720
5  3834   3522  2461
6  4963   4173  3715
7  6299   4875  3749
8  7691   5563  4194

Executing one task per node, the performance data were:

  # ofTicket   Fair Unfair
  nodeslock queue lockqueue lock
  --  ---   ----
1135135  137
2   4603   1034 1458
3  10940  12087 2562
4  21555  10507 4793

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/Kconfig |   11 +
 arch/x86/include/asm/qspinlock.h |   79 ++
 arch/x86/kernel/Makefile |1 +
 arch/x86/kernel/paravirt-spinlocks.c |   26 +++
 kernel/locking/qspinlock.c   |   20 +
 5 files changed, 137 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 95c9c4e..961f43a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -585,6 +585,17 @@ config PARAVIRT_SPINLOCKS
 
  If you are unsure how to answer this question, answer Y.
 
+config VIRT_UNFAIR_LOCKS
+   bool Enable unfair locks in a virtual guest
+   depends on SMP  QUEUE_SPINLOCK
+   depends on !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE
+   ---help---
+ This changes the kernel to use unfair locks in a virtual
+ guest. This will help performance in most cases. However,
+ there is a possibility of lock starvation on a heavily
+ contended lock especially in a large guest with many
+ virtual CPUs.
+
 source arch/x86/xen/Kconfig
 
 config KVM_GUEST
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index e4a4f5d..448de8b 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -5,6 +5,10 @@
 
 #if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
 
+#ifdef CONFIG_VIRT_UNFAIR_LOCKS
+extern struct static_key virt_unfairlocks_enabled;
+#endif
+
 #define queue_spin_unlock queue_spin_unlock
 /**
  * queue_spin_unlock - release a queue spinlock
@@ -26,4 +30,79 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 
 #include asm-generic/qspinlock.h
 
+union arch_qspinlock {
+   atomic_t val;
+   u8   locked;
+};
+
+#ifdef CONFIG_VIRT_UNFAIR_LOCKS
+/**
+ * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+   union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+   if (!qlock->locked && (cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+/**
+ * queue_spin_lock_unfair - acquire a queue spinlock 

[PATCH v11 10/16] qspinlock: Split the MCS queuing code into a separate slowerpath

2014-05-30 Thread Waiman Long
With the pending addition of more code to support PV spinlock, the
complexity of the slowpath function increases to the point that the
number of scratch-pad registers in the x86-64 architecture is not
enough, so additional non-scratch-pad registers will need to be
used. This has the downside of requiring saving and restoring of
those registers in the prolog and epilog of the slowpath function,
slowing down the nominally faster pending bit and trylock code path
at the beginning of the slowpath function.

This patch separates out the actual MCS queuing code into a slowerpath
function. This avoids the slow down of the pending bit and trylock
code path at the expense of a little bit of additional overhead to
the MCS queuing code path.
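
The shape of that split, reduced to a stand-alone sketch (GCC-style attribute
syntax; the bodies are placeholders, not the qspinlock code):

/* Keeping the rarely taken, register-hungry part out of line means the
 * common path does not pay for the extra register saves and restores. */
static __attribute__((noinline)) void slowerpath(void)
{
	/* the heavy MCS queuing work would live here */
}

void fastpath(int contended)
{
	if (!contended)
		return;		/* pending-bit/trylock path stays cheap */
	slowerpath();		/* out-of-line call pays the prologue cost */
}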

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c |  162 ---
 1 files changed, 90 insertions(+), 72 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 3723c83..93c663a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -232,6 +232,93 @@ static __always_inline int try_set_locked(struct qspinlock 
*lock)
 }
 
 /**
+ * queue_spin_lock_slowerpath - a slower path for acquiring queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ * @node: Pointer to the queue node
+ * @tail: The tail code
+ *
+ * The reason for splitting a slowerpath from slowpath is to avoid the
+ * unnecessary overhead of non-scratch pad register pushing and popping
+ * due to increased complexity with unfair and PV spinlock from slowing
+ * down the nominally faster pending bit and trylock code path. So this
+ * function is not inlined.
+ */
+static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
+   struct mcs_spinlock *node, u32 tail)
+{
+   struct mcs_spinlock *prev, *next;
+   u32 val, old;
+
+   /*
+* we already touched the queueing cacheline; don't bother with pending
+* stuff.
+*
+* p,*,* - n,*,*
+*/
+   old = xchg_tail(lock, tail);
+
+   /*
+* if there was a previous node; link it and wait.
+*/
+   if (old & _Q_TAIL_MASK) {
+   prev = decode_tail(old);
+   ACCESS_ONCE(prev->next) = node;
+
+   arch_mcs_spin_lock_contended(node-locked);
+   }
+
+   /*
+* we're at the head of the waitqueue, wait for the owner & pending to
+* go away.
+* Load-acquired is used here because the try_set_locked()
+* function below may not be a full memory barrier.
+*
+* *,x,y - *,0,0
+*/
+retry_queue_wait:
+   while ((val = smp_load_acquire(&lock->val.counter)) &
+   _Q_LOCKED_PENDING_MASK)
+   arch_mutex_cpu_relax();
+
+   /*
+* claim the lock:
+*
+* n,0,0 - 0,0,1 : lock, uncontended
+* *,0,0 - *,0,1 : lock, contended
+*
+* If the queue head is the only one in the queue (lock value == tail),
+* clear the tail code and grab the lock. Otherwise, we only need
+* to grab the lock.
+*/
+   for (;;) {
+   if (val != tail) {
+   /*
+* The try_set_locked function will only fail if the
+* lock was stolen.
+*/
+   if (try_set_locked(lock))
+   break;
+   else
+   goto  retry_queue_wait;
+   }
+   old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+   if (old == val)
+   return; /* No contention */
+   else if (old & _Q_LOCKED_MASK)
+   goto retry_queue_wait;
+   val = old;
+   }
+
+   /*
+* contended path; wait for next
+*/
+   while (!(next = ACCESS_ONCE(node->next)))
+   arch_mutex_cpu_relax();
+
+   arch_mcs_spin_unlock_contended(next->locked);
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -254,7 +341,7 @@ static __always_inline int try_set_locked(struct qspinlock 
*lock)
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
-   struct mcs_spinlock *prev, *next, *node;
+   struct mcs_spinlock *node;
u32 new, old, tail;
int idx;
 
@@ -355,78 +442,9 @@ queue:
 * attempt the trylock once more in the hope someone let go while we
 * weren't watching.
 */
-   if (queue_spin_trylock(lock))
-   goto release;
-
-   /*
-* we already touched the queueing cacheline; don't bother with pending
-* stuff.
-*
-* p,*,* - n,*,*
-*/
-   old = xchg_tail(lock, tail);
-
-   /*

[PATCH v11 06/16] qspinlock: prolong the stay in the pending bit path

2014-05-30 Thread Waiman Long
There is a problem in the current pending bit spinning code.  When the
lock is free, but the pending bit holder hasn't grabbed the lock and
cleared the pending bit yet, the spinning code will not be run.
As a result, the regular queuing code path might be used most of
the time even when there is only 2 tasks contending for the lock.
Assuming that the pending bit holder is going to get the lock and
clear the pending bit soon, it is actually better to wait than to be
queued up which has a higher overhead.

The following tables show the before-patch execution time (in ms)
of a micro-benchmark where 5M iterations of the lock/unlock cycles
were run on a 10-core Westmere-EX x86-64 CPU with 2 different types of
loads - standalone (lock and protected data in different cachelines)
and embedded (lock and protected data in the same cacheline).

  [Standalone/Embedded - same node]
  # of tasksTicket lock Queue lock   %Change
  ----- --   ---
   1  135/ 111   135/ 101  0%/  -9%
   2  890/ 779  1885/1990   +112%/+156%
   3 1932/1859  2333/2341+21%/ +26%
   4 2829/2726  2900/2923 +3%/  +7%
   5 3834/3761  3655/3648 -5%/  -3%
   6 4963/4976  4336/4326-13%/ -13%
   7 6299/6269  5057/5064-20%/ -19%
   8 7691/7569  5786/5798-25%/ -23%

Of course, the results will vary depending on what kind of test
machine is used.

With 1 task per NUMA node, the execution times are:

[Standalone - different nodes]
  # of nodesTicket lock Queue lock  %Change
  ----- --  ---
   1   135135 0%
   2  4604   5087   +10%
   3 10940  12224   +12%
   4 21555  10555   -51%

It can be seen that the queue spinlock is slower than the ticket
spinlock when there are 2 or 3 contending tasks. In all the other cases,
the queue spinlock is either equal to or faster than the ticket spinlock.

With this patch, the performance data for 2 contending tasks are:

  [Standalone/Embedded]
  # of tasksTicket lock Queue lock  %Change
  ----- --  ---
   2  890/779984/871+11%/+12%

[Standalone - different nodes]
  # of nodesTicket lock Queue lock  %Change
  ----- --  ---
   2  4604 1364   -70%

It can be seen that the queue spinlock performance for 2 contending
tasks is now comparable to ticket spinlock on the same node, but much
faster when in different nodes. With 3 contending tasks, however,
the ticket spinlock is still quite a bit faster.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c |   18 --
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc7fd8c..7f10758 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -233,11 +233,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 */
for (;;) {
/*
-* If we observe any contention; queue.
+* If we observe that the queue is not empty or both
+* the pending and lock bits are set, queue
 */
-   if (val & ~_Q_LOCKED_MASK)
+   if ((val & _Q_TAIL_MASK) ||
+   (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)))
goto queue;
 
+   if (val == _Q_PENDING_VAL) {
+   /*
+* Pending bit is set, but not the lock bit.
+* Assuming that the pending bit holder is going to
+* set the lock bit and clear the pending bit soon,
+* it is better to wait than to exit at this point.
+*/
+   cpu_relax();
+   val = atomic_read(&lock->val);
+   continue;
+   }
+
new = _Q_LOCKED_VAL;
if (val == new)
new |= _Q_PENDING_VAL;
-- 
1.7.1



[PATCH v11 05/16] qspinlock: Optimize for smaller NR_CPUS

2014-05-30 Thread Waiman Long
From: Peter Zijlstra pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the repeated compxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).
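
To make the 16-bit tail layout concrete, here is a small stand-alone program
using the offsets implied by the layout comment in the hunk below; the
encode_tail() helper is written out here only for illustration (a helper of
this shape lives in qspinlock.c):

#include <stdio.h>
#include <stdint.h>

#define _Q_TAIL_IDX_OFFSET	16	/* bits 16-17: tail index */
#define _Q_TAIL_CPU_OFFSET	18	/* bits 18-31: tail cpu + 1 */

static uint32_t encode_tail(int cpu, int idx)
{
	return ((uint32_t)(cpu + 1) << _Q_TAIL_CPU_OFFSET) |
	       ((uint32_t)idx << _Q_TAIL_IDX_OFFSET);
}

int main(void)
{
	uint32_t tail = encode_tail(3, 1);	/* cpu 3, nesting level 1 */

	printf("tail halfword: 0x%04x\n", (unsigned)(tail >> 16));
	printf("cpu = %u, idx = %u\n",
	       (unsigned)((tail >> _Q_TAIL_CPU_OFFSET) - 1),
	       (unsigned)((tail >> _Q_TAIL_IDX_OFFSET) & 3));
	return 0;
}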

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   13 
 kernel/locking/qspinlock.c|  103 +---
 2 files changed, 106 insertions(+), 10 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index ed5d89a..4914abe 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -38,6 +38,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -50,7 +58,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS8
+#else
 #define _Q_PENDING_BITS1
+#endif
 #define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -61,6 +73,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 41594a1..fc7fd8c 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -22,6 +22,7 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/mutex.h
+#include asm/byteorder.h
 #include asm/qspinlock.h
 
 /*
@@ -48,6 +49,9 @@
  * We can further change the first spinner to spin on a bit in the lock word
  * instead of its node; whereby avoiding the need to carry a node from lock to
  * unlock, and preserving API.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
  */
 
 #include mcs_spinlock.h
@@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 - *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* - n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 - *,0,1
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   u32 new, old;
+
+

[PATCH v11 04/16] qspinlock: Extract out the exchange of tail code word

2014-05-30 Thread Waiman Long
This patch extracts the logic for the exchange of new and previous tail
code words into a new xchg_tail() function which can be optimized in a
later patch.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |2 +
 kernel/locking/qspinlock.c|   58 
 2 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index bd25081..ed5d89a 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -61,6 +61,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL (1U  _Q_PENDING_OFFSET)
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 1e93c6a..41594a1 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* - n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(&lock->val);
+
+   for (;;) {
+   new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -182,36 +207,25 @@ queue:
node-next = NULL;
 
/*
-* we already touched the queueing cacheline; don't bother with pending
-* stuff.
-*
-* trylock || xchg(lock, node)
-*
-* 0,0,0 - 0,0,1 ; trylock
-* p,y,x - n,y,x ; prev = xchg(lock, node)
+* We touched a (possibly) cold cacheline in the per-cpu queue node;
+* attempt the trylock once more in the hope someone let go while we
+* weren't watching.
 */
-   for (;;) {
-   new = _Q_LOCKED_VAL;
-   if (val)
-   new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   if (queue_spin_trylock(lock))
+   goto release;
 
/*
-* we won the trylock; forget about queueing.
+* we already touched the queueing cacheline; don't bother with pending
+* stuff.
+*
+* p,*,* - n,*,*
 */
-   if (new == _Q_LOCKED_VAL)
-   goto release;
+   old = xchg_tail(lock, tail);
 
/*
 * if there was a previous node; link it and wait.
 */
-   if (old & ~_Q_LOCKED_PENDING_MASK) {
+   if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
ACCESS_ONCE(prev->next) = node;
 
-- 
1.7.1



[PATCH v11 00/16] qspinlock: a 4-byte queue spinlock with PV support

2014-05-30 Thread Waiman Long
v10-v11:
  - Use a simple test-and-set unfair lock to simplify the code,
but performance may suffer a bit for large guests with many CPUs.
  - Take out Raghavendra KT's test results as the unfair lock changes
may render some of his results invalid.
  - Add PV support without increasing the size of the core queue node
structure.
  - Other minor changes to address some of the feedback comments.

v9-v10:
  - Make some minor changes to qspinlock.c to accommodate review feedback.
  - Change author to PeterZ for 2 of the patches.
  - Include Raghavendra KT's test results in patch 18.

v8-v9:
  - Integrate PeterZ's version of the queue spinlock patch with some
modification:
http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
  - Break the more complex patches into smaller ones to ease review effort.
  - Fix a racing condition in the PV qspinlock code.

v7-v8:
  - Remove one unneeded atomic operation from the slowpath, thus
improving performance.
  - Simplify some of the codes and add more comments.
  - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
unfair lock.
  - Reduce unfair lock slowpath lock stealing frequency depending
on its distance from the queue head.
  - Add performance data for IvyBridge-EX CPU.

v6-v7:
  - Remove an atomic operation from the 2-task contending code
  - Shorten the names of some macros
  - Make the queue waiter to attempt to steal lock when unfair lock is
enabled.
  - Remove lock holder kick from the PV code and fix a race condition
  - Run the unfair lock  PV code on overcommitted KVM guests to collect
performance data.

v5-v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV
   support code.

v4-v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.

v3-v4:
 - Remove debugging code and fix a configuration error
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better
 - Add an x86 version of asm/qspinlock.h for holding x86 specific
   optimization.
 - Add an optimized x86 code path for 2 contending tasks to improve
   low contention performance.

v2-v3:
 - Simplify the code by using numerous mode only without an unfair option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1-v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of queue head CPU at unlock time for non-numerous
   CPU mode.

This patch set has 3 different sections:
 1) Patches 1-7: Introduces a queue-based spinlock implementation that
can replace the default ticket spinlock without increasing the
size of the spinlock data structure. As a result, critical kernel
data structures that embed spinlock won't increase in size and
break data alignments.
 2) Patches 8-9: Enables the use of unfair queue spinlock in a
virtual guest. This can resolve some of the locking related
performance issues due to the fact that the next CPU to get the
lock may have been scheduled out for a period of time.
 3) Patches 10-16: Enable qspinlock para-virtualization support
by halting the waiting CPUs after spinning for a certain amount of
time. The unlock code will detect the a sleeping waiter and wake it
up. This is essentially the same logic as the PV ticketlock code.

The queue spinlock has slightly better performance than the ticket
spinlock in uncontended case. Its performance can be much better
with moderate to heavy contention.  This patch has the potential of
improving the performance of all the workloads that have moderate to
heavy spinlock contention.

The queue spinlock is especially suitable for NUMA machines with at
least 2 sockets, though noticeable performance benefit probably won't
show up in machines with less than 4 sockets.

The purpose of this patch set is not to solve any particular spinlock
contention problems. Those need to be solved by refactoring the code
to make more efficient use of the lock or finer granularity ones. The
main purpose is to make the lock contention problems more 

[PATCH v11 02/16] qspinlock, x86: Enable x86-64 to use queue spinlock

2014-05-30 Thread Waiman Long
This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent, so it is not enabled there.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/qspinlock.h  |   29 +
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 
 4 files changed, 39 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..95c9c4e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_INT128 if X86_64
select ARCH_WANTS_PROT_NUMA_PROT_NONE
+   select ARCH_USE_QUEUE_SPINLOCK
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_PCSPKR_PLATFORM
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 000..e4a4f5d
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,29 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include asm-generic/qspinlock_types.h
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that the clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   barrier();
+   ACCESS_ONCE(*(u8 *)lock) = 0;
+   barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include asm-generic/qspinlock.h
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0f62f54..958d20f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include asm/qspinlock.h
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -180,6 +184,7 @@ static __always_inline void 
arch_spin_lock_flags(arch_spinlock_t *lock,
 {
arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..7960268 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include asm-generic/qspinlock_types.h
+#else
 typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include asm/rwlock.h
 
-- 
1.7.1



Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Alexander Graf


On 30.05.14 17:50, Paolo Bonzini wrote:

Il 30/05/2014 14:42, Alexander Graf ha scritto:

From: Paul Mackerras pau...@samba.org

Commit b005255e12a3 (KVM: PPC: Book3S HV: Context-switch new POWER8
SPRs) added a definition of KVM_REG_PPC_WORT with the same register
number as the existing KVM_REG_PPC_VRSAVE (though in fact the
definitions are not identical because of the different register sizes.)

For clarity, this moves KVM_REG_PPC_WORT to the next unused number,
and also adds it to api.txt.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 Documentation/virtual/kvm/api.txt   | 1 +
 arch/powerpc/include/uapi/asm/kvm.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt

index 9a95770..6b30290 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1873,6 +1873,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_PPR| 64
   PPC   | KVM_REG_PPC_ARCH_COMPAT 32
   PPC   | KVM_REG_PPC_DABRX | 32
+  PPC   | KVM_REG_PPC_WORT  | 64
   PPC   | KVM_REG_PPC_TM_GPR0| 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31| 64
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h

index a6665be..2bc4a94 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -545,7 +545,6 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TCSCR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb1)
 #define KVM_REG_PPC_PID		(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb2)
 #define KVM_REG_PPC_ACOP	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3)
-#define KVM_REG_PPC_WORT	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb4)
 
 #define KVM_REG_PPC_VRSAVE	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
 #define KVM_REG_PPC_LPCR	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
@@ -555,6 +554,7 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_ARCH_COMPAT	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb7)
 
 #define KVM_REG_PPC_DABRX	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb8)
+#define KVM_REG_PPC_WORT	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb9)


This is an ABI break, this symbol was added in 3.14.  I think I should 
revert this.  Can you convince me otherwise?


There's nothing bad happening with the change. Newer user space won't be 
able to read WORT on older kernels, but there were more things broken 
than just WORT for POWER8 support there ;).


And user space built with new headers running on an old kernel won't 
find the register, which is OK.


I couldn't find any combination where it's really a problem.


Alex
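
For reference, a minimal sketch of how userspace in this situation typically 
probes a ONE_REG id and simply skips it when the running kernel does not know 
it. This is not QEMU code; it assumes a valid vCPU fd and a powerpc build 
whose uapi headers define KVM_REG_PPC_WORT.

#include <stdint.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>      /* struct kvm_one_reg, KVM_GET_ONE_REG */

static int try_get_wort(int vcpu_fd, uint64_t *val)
{
	struct kvm_one_reg reg = {
		.id   = KVM_REG_PPC_WORT,   /* whatever id the headers in use define */
		.addr = (uintptr_t)val,
	};

	if (ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg) < 0) {
		/* Older kernel (or renumbered id): just skip the register. */
		fprintf(stderr, "WORT not available: %s\n", strerror(errno));
		return -errno;
	}
	return 0;
}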



[PATCH v11 03/16] qspinlock: Add pending bit

2014-05-30 Thread Waiman Long
From: Peter Zijlstra pet...@infradead.org

Because the qspinlock needs to touch a second cacheline, add a pending
bit and allow a single in-word spinner before we punt to the second
cacheline.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   12 +++-
 kernel/locking/qspinlock.c|  109 ++--
 2 files changed, 97 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index f66f845..bd25081 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -39,8 +39,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define _Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
				      << _Q_ ## type ## _OFFSET)
@@ -48,7 +49,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS1
+#define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -57,5 +62,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b97a1ad..1e93c6a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -83,24 +83,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
	return per_cpu_ptr(&mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock bit)
- *
- *              fast      :    slow                                  :    unlock
- *                        :                                          :
- * uncontended  (0,0)   --:--> (0,1) --------------------------------:--> (*,0)
- *                        :       | ^--------.                    /  :
- *                        :       v           \                   |  :
- * uncontended            :    (n,x) --+--> (n,0)                 |  :
- *   queue                :       | ^--'                          |  :
- *                        :       v                               |  :
- * contended              :    (*,x) --+--> (*,0) -----> (*,1) ---'  :
- *   queue                :         ^--'                             :
+ * (queue tail, pending bit, lock bit)
  *
+ *              fast     :    slow                                  :    unlock
+ *                       :                                          :
+ * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ *                       :       | ^--------.------.             /  :
+ *                       :       v           \      \            |  :
+ * pending               :    (0,1,1) +--> (0,1,0)   \            |  :
+ *                       :       | ^--'              |            |  :
+ *                       :       v                   |            |  :
+ * uncontended           :    (n,x,y) +--> (n,0,0) --'            |  :
+ *   queue               :       | ^--'                           |  :
+ *                       :       v                                |  :
+ * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
+ *   queue               :         ^--'                             :
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
@@ -110,6 +114,65 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   /*
+* trylock || pending
+*
+	 * 0,0,0 -> 0,0,1 ; trylock
+	 * 0,0,1 -> 0,1,1 ; pending
+*/
+   for (;;) {
+   /*
+* If we observe any contention; queue.
+*/
+		if (val & ~_Q_LOCKED_MASK)
+   goto queue;
+
+   new = _Q_LOCKED_VAL;
+   if (val == new)
+   new |= _Q_PENDING_VAL;
+
+		old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+
+   /*
+* we won the 
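
To make the trylock-or-pending fast path above easier to follow outside the 
kernel tree, here is a rough userspace rendering of the same loop using C11 
atomics. It is a sketch only; the constants mirror the bit layout quoted 
above, and the real code is the diff itself.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define Q_LOCKED_VAL	(1u << 0)	/* locked byte, as in _Q_LOCKED_VAL  */
#define Q_PENDING_VAL	(1u << 8)	/* pending bit, as in _Q_PENDING_VAL */
#define Q_LOCKED_MASK	0xffu

static bool trylock_or_set_pending(atomic_uint *lock)
{
	unsigned int val = atomic_load(lock);

	for (;;) {
		unsigned int new;

		/* Any pending or tail bits already set: the caller has to queue. */
		if (val & ~Q_LOCKED_MASK)
			return false;

		new = Q_LOCKED_VAL;		/* 0,0,0 -> 0,0,1 : trylock */
		if (val == new)
			new |= Q_PENDING_VAL;	/* 0,0,1 -> 0,1,1 : pending */

		/* On failure val is reloaded and we retry. */
		if (atomic_compare_exchange_weak(lock, &val, new))
			return true;
	}
}

int main(void)
{
	atomic_uint lock = 0;

	printf("first caller:  %d\n", trylock_or_set_pending(&lock)); /* takes the lock   */
	printf("second caller: %d\n", trylock_or_set_pending(&lock)); /* takes pending    */
	printf("third caller:  %d\n", trylock_or_set_pending(&lock)); /* must queue -> 0  */
	return 0;
}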

[PATCH v11 01/16] qspinlock: A simple generic 4-byte queue spinlock

2014-05-30 Thread Waiman Long
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.
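
As an illustration of the 24-bit encoding described above, here is a 
standalone sketch using the (pre-pending-bit) layout from qspinlock_types.h 
in this series: 2 bits of per-cpu node index and the remaining bits for 
cpu + 1, so that a tail of 0 means "no waiter". The helpers are illustrative, 
not the exact kernel ones.

#include <stdio.h>

#define Q_TAIL_IDX_OFFSET	8	/* after the 8-bit locked byte */
#define Q_TAIL_IDX_BITS		2	/* 4 nodes: task/softirq/hardirq/nmi */
#define Q_TAIL_CPU_OFFSET	(Q_TAIL_IDX_OFFSET + Q_TAIL_IDX_BITS)

static unsigned int encode_tail(int cpu, int idx)
{
	/* cpu is stored as cpu + 1 so that tail == 0 means "no waiter" */
	return ((unsigned int)(cpu + 1) << Q_TAIL_CPU_OFFSET) |
	       ((unsigned int)idx << Q_TAIL_IDX_OFFSET);
}

static void decode_tail(unsigned int tail, int *cpu, int *idx)
{
	*cpu = (int)(tail >> Q_TAIL_CPU_OFFSET) - 1;
	*idx = (tail >> Q_TAIL_IDX_OFFSET) & ((1u << Q_TAIL_IDX_BITS) - 1);
}

int main(void)
{
	int cpu, idx;
	unsigned int tail = encode_tail(5, 2);	/* CPU 5, third nesting level */

	decode_tail(tail, &cpu, &idx);
	printf("tail=%#x cpu=%d idx=%d\n", tail, cpu, idx);
	return 0;
}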

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock.h   |  118 
 include/asm-generic/qspinlock_types.h |   61 ++
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 +
 kernel/locking/mcs_spinlock.h |1 +
 kernel/locking/qspinlock.c|  197 +
 6 files changed, 385 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 000..e8a7ae8
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,118 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->val) &&
+	    (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+   u32 val;
+
+ 

Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Paolo Bonzini

Il 30/05/2014 17:53, Alexander Graf ha scritto:

This is an ABI break, this symbol was added in 3.14.  I think I should
revert this.  Can you convince me otherwise?


There's nothing bad happening with the change. Newer user space won't be
able to read WORT on older kernels, but there were more things broken
that just WORT for POWER8 support there ;).

And user space build with new headers running on an old kernel won't
find the register, which is OK.


Would new userspace with old kernel be able to detect that POWER8 
support isn't quite complete?


Paolo


Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Alexander Graf


On 30.05.14 17:55, Paolo Bonzini wrote:

Il 30/05/2014 17:53, Alexander Graf ha scritto:

This is an ABI break, this symbol was added in 3.14.  I think I should
revert this.  Can you convince me otherwise?


There's nothing bad happening with the change. Newer user space won't be
able to read WORT on older kernels, but there were more things broken
that just WORT for POWER8 support there ;).

And user space build with new headers running on an old kernel won't
find the register, which is OK.


Would new userspace with old kernel be able to detect that POWER8 
support isn't quite complete?


It couldn't, no. It would try to run a guest - if it happens to work 
we're lucky ;). Even then the only thing that would remotely be affected 
by that one_reg rename is live migration (which just got a few more 
fixes in this pull request).



Alex



Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Paolo Bonzini

Il 30/05/2014 17:58, Alexander Graf ha scritto:


Would new userspace with old kernel be able to detect that POWER8
support isn't quite complete?


It couldn't, no. It would try to run a guest - if it happens to work
we're lucky ;).


That's why I'm considering a revert.


Even then the only thing that would remotely be affected
by that one_reg rename is live migration (which just got a few more
fixes in this pull request).


Doesn't "info cpus" also do get/set one_reg?  What happens if it returns 
EINVAL?  Also, reset should certainly try to write all registers; what 
happens if one is missed?


Beyond the particular case of WORT, I'd just like to point out that 
uapi/ changes need even more scrutiny from maintainers than usual.  I 
don't know exactly what checks Linus makes in my pull requests, but 
uapi/ is at the top of the list of things he might look at, right after 
the diffstat. :)


Paolo


Re: [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters

2014-05-30 Thread Andi Kleen
On Fri, May 30, 2014 at 09:31:53AM +0200, Peter Zijlstra wrote:
 On Thu, May 29, 2014 at 06:12:05PM -0700, Andi Kleen wrote:
  From: Andi Kleen a...@linux.intel.com
  
  Currently perf unconditionally disables PEBS for guest.
  
  Now that we have the infrastructure in place to handle
  it we can allow it for KVM owned guest events. For
  the perf needs to know that a event is owned by
  a guest. Add a new state bit in the perf_event for that.
  
 
 This doesn't make sense; why does it need to be owned?

Please read the complete patch kit

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Alexander Graf


On 30.05.14 18:03, Paolo Bonzini wrote:

Il 30/05/2014 17:58, Alexander Graf ha scritto:


Would new userspace with old kernel be able to detect that POWER8
support isn't quite complete?


It couldn't, no. It would try to run a guest - if it happens to work
we're lucky ;).


That's why I'm considering a revert.


Even then the only thing that would remotely be affected
by that one_reg rename is live migration (which just got a few more
fixes in this pull request).


Doesn't info cpus also do get/set one_reg?


Yeah, but WORT is not important enough to get listed.

What happens if it returns EINVAL?  Also, reset should certainly try 
to write all registers, what happens if one is missed.


If it returns EINVAL we just ignore the register.



Beyond the particular case of WORT, I'd just like to point out that 
uapi/ changes need even more scrutiny from maintainers than usual.  I 
don't know exactly what checks Linus makes in my pull requests, but 
uapi/ is at the top of the list of things he might look at, right 
after the diffstat. :)


Consider that ONE_REG flagged as experimental :). Really, I am as 
concerned as you are about ABI breakages, but in this case it's not worth 
it. I'm not even sure any guest uses WORT at all. Linux doesn't seem to.



Alex



Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Paolo Bonzini

Il 30/05/2014 18:08, Alexander Graf ha scritto:

What happens if it returns EINVAL? Also, reset should certainly
try  to write all registers, what happens if one is missed.


If it returns EINVAL we just ignore the register.


I wonder if it's the right thing to do.  You remember how you were 
bitten by less-than-sensible error handling in the x86 
kvm_arch_put_registers.



Beyond the particular case of WORT, I'd just like to point out that
uapi/ changes need even more scrutiny from maintainers than usual.  I
don't know exactly what checks Linus makes in my pull requests, but
uapi/ is at the top of the list of things he might look at, right
after the diffstat. :)


Consider that ONE_REG as experimental flagged :). Really, I am as
concerned as you are on ABI breakages, but in this case it's not worth
it. I'm not even sure any guest uses WORT at all. Linux doesn't seem to.


Fair enough... for this time only!...

Paolo


Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number

2014-05-30 Thread Alexander Graf


On 30.05.14 18:11, Paolo Bonzini wrote:

Il 30/05/2014 18:08, Alexander Graf ha scritto:

What happens if it returns EINVAL? Also, reset should certainly
try  to write all registers, what happens if one is missed.


If it returns EINVAL we just ignore the register.


I wonder if it's the right thing to do.  You remember how you were 
bitten by less-than-sensible error handling in the x86 
kvm_arch_put_registers.


Yeah, I think we'll have to do a flag day as of which we declare PPC as 
a first class supported citizen. At that point we'll have compat machine 
types and start to do more intense error checking.



Alex



Re: [PATCH v5 2/4] live migration support for initial write protect of VM

2014-05-30 Thread Mario Smarduch

  
 +static inline void kvm_set_s2pte_readonly(pte_t *pte)
 +{
 +pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
 
 This relies on the pte already having been set as RDONLY or RDWR, if you
 are creating a new pte and calling this function it could be easy to
 miss that distinction, I would prefer:
 
 pte_val(*pte) = L_PTE_S2_RDWR;
 pte_val(*pte) |= L_PTE_S2_RDONLY;
 

I'm confused by this comment. This appears to just add the read-only
permission, but will it leave the other permission bits intact and clear
out the rest of the pte?

- Mario
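
For readers following along, a small standalone illustration of the 
reviewer's point that the XOR-mask form only behaves as expected when the 
pte already holds one of the two access encodings. The HAP bit values below 
are the usual ARM LPAE stage-2 ones and are assumed here for illustration 
only.

#include <stdio.h>
#include <stdint.h>

/* Assumed values, matching the usual LPAE stage-2 HAP[2:1] encoding. */
#define L_PTE_S2_RDONLY	(UINT64_C(1) << 6)
#define L_PTE_S2_RDWR	(UINT64_C(3) << 6)

/* The form used in the patch: clear only the bit that differs. */
static uint64_t make_readonly_xor(uint64_t pte)
{
	return pte & ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
}

int main(void)
{
	uint64_t rdwr_pte  = L_PTE_S2_RDWR | 0x3;	/* RDWR plus some low bits */
	uint64_t fresh_pte = 0x3;			/* neither RDONLY nor RDWR */

	/* Works: 0xc3 becomes 0x43, i.e. RDONLY with the low bits kept. */
	printf("from RDWR : %#llx\n",
	       (unsigned long long)make_readonly_xor(rdwr_pte));

	/* Does not add RDONLY: a pte without HAP bits stays at 0x3. */
	printf("from fresh: %#llx\n",
	       (unsigned long long)make_readonly_xor(fresh_pte));
	return 0;
}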



Re: does anybody still care about kvm-ia64?

2014-05-30 Thread Jes Sorensen
On 05/30/14 15:45, Paolo Bonzini wrote:
 I was thinking of removing it in Linux 3.17.  I'm not even sure it
 compiles right now, hasn't seen any action in years, and all open-source
 userspace code to use it has been dead for years.
 
 If you disagree, please speak up loudly in the next month.

I'd say take it out back and show it mercy!

Jes



Re: [PATCH v3 00/12] kvm tools: Misc patches (mips support)

2014-05-30 Thread Pekka Enberg

On 05/28/2014 11:08 PM, Andreas Herrmann wrote:

Hi,

This is v3 of my patch set to run lkvm on MIPS.

It's rebased on v3.13-rc1-1436-g1fc83c5 of
git://github.com/penberg/linux-kvm.git



Applied, thanks!


[PATCH] kvm: Ensure negative return value on kvm_init() error handling path

2014-05-30 Thread Eduardo Habkost
We need to ensure ret < 0 when going through the error path, or QEMU may
try to run the half-initialized VM and crash.
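
A toy example (not QEMU code) of the failure mode being fixed: a caller that
treats any non-negative return as success will happily keep going after an
error path that forgot to make ret negative.

#include <stdio.h>

/* Deliberately buggy: the error path can return 0. */
static int broken_init(int reject_option)
{
    int ret = 0;

    if (reject_option) {
        fprintf(stderr, "Invalid argument\n");
        goto err;   /* bug: ret is still 0 here */
    }
    return 0;
err:
    return ret;
}

int main(void)
{
    if (broken_init(1) < 0) {
        printf("falling back / bailing out\n");
    } else {
        printf("continuing with a half-initialized state\n");
    }
    return 0;
}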

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
---
 kvm-all.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kvm-all.c b/kvm-all.c
index 721a390..4e19eff 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1410,7 +1410,7 @@ int kvm_init(MachineClass *mc)
 
     ret = kvm_ioctl(s, KVM_GET_API_VERSION, 0);
     if (ret < KVM_API_VERSION) {
-        if (ret > 0) {
+        if (ret >= 0) {
             ret = -EINVAL;
         }
         fprintf(stderr, "kvm version too old\n");
@@ -1461,6 +1461,7 @@ int kvm_init(MachineClass *mc)
     if (mc->kvm_type) {
         type = mc->kvm_type(kvm_type);
     } else if (kvm_type) {
+        ret = -EINVAL;
         fprintf(stderr, "Invalid argument kvm-type=%s\n", kvm_type);
         goto err;
     }
@@ -1561,6 +1562,7 @@ int kvm_init(MachineClass *mc)
     return 0;
 
 err:
+    assert(ret < 0);
     if (s->vmfd >= 0) {
         close(s->vmfd);
     }
-- 
1.9.0



Re: [PATCH] kvm: Ensure negative return value on kvm_init() error handling path

2014-05-30 Thread Paolo Bonzini

Il 30/05/2014 22:26, Eduardo Habkost ha scritto:

We need to ensure ret < 0 when going through the error path, or QEMU may
try to run the half-initialized VM and crash.

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
---
 kvm-all.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kvm-all.c b/kvm-all.c
index 721a390..4e19eff 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1410,7 +1410,7 @@ int kvm_init(MachineClass *mc)

     ret = kvm_ioctl(s, KVM_GET_API_VERSION, 0);
     if (ret < KVM_API_VERSION) {
-        if (ret > 0) {
+        if (ret >= 0) {
             ret = -EINVAL;
         }
         fprintf(stderr, "kvm version too old\n");
@@ -1461,6 +1461,7 @@ int kvm_init(MachineClass *mc)
     if (mc->kvm_type) {
         type = mc->kvm_type(kvm_type);
     } else if (kvm_type) {
+        ret = -EINVAL;
         fprintf(stderr, "Invalid argument kvm-type=%s\n", kvm_type);
         goto err;
     }
@@ -1561,6 +1562,7 @@ int kvm_init(MachineClass *mc)
     return 0;
 
 err:
+    assert(ret < 0);
     if (s->vmfd >= 0) {
         close(s->vmfd);
     }



Applied, thanks.

Paolo


[PATCH] machine: Add kvm-type property

2014-05-30 Thread Eduardo Habkost
The kvm-type machine option was left out when MachineState was
introduced, preventing the kvm-type option from being used. Add the
missing property.

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
Cc: Andreas Färber afaer...@suse.de
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Cc: Alexander Graf ag...@suse.de
Cc: Marcel Apfelbaum marce...@redhat.com
---
Tested in a x86 machine only. Help would be welcome to test it on a PPC
machine using -machine spapr and KVM.

Before this patch:

$ qemu-system-x86_64 -machine pc,kvm-type=hv,accel=kvm
qemu-system-x86_64: Property '.kvm-type' not found

(This means the option won't work even for sPAPR machines.)

After applying this patch:

$ qemu-system-x86_64 -machine pc,kvm-type=hv,accel=kvm
Invalid argument kvm-type=hv

(This means the x86 KVM init code is seeing (and rejecting) the option,
and the sPAPR code can use it.)

Note that qemu-system-x86_64 will segfault with the above command-line
unless an additional fix (submitted today) is applied (kvm: Ensure
negative return value on kvm_init() error handling path).
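
For context, a hedged sketch (not the actual sPAPR code) of how a board's
MachineClass::kvm_type hook might consume the string carried by this
property; the accepted names and the numeric VM types returned are
assumptions for illustration.

#include <stdio.h>
#include <string.h>

/* Hypothetical mapping; real boards define their own VM type values. */
static int example_kvm_type(const char *vm_type)
{
    if (!vm_type || !strcmp(vm_type, "HV")) {
        return 1;   /* assumed: hypervisor-mode virtualization */
    }
    if (!strcmp(vm_type, "PR")) {
        return 2;   /* assumed: problem-state (PR) virtualization */
    }
    fprintf(stderr, "Unknown kvm-type '%s'\n", vm_type);
    return -1;
}

int main(void)
{
    printf("kvm-type=HV -> %d\n", example_kvm_type("HV"));
    printf("kvm-type=hv -> %d\n", example_kvm_type("hv")); /* rejected, as in the test output above */
    return 0;
}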
---
 hw/core/machine.c   | 17 +
 include/hw/boards.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index cbba679..ed47b3a 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -235,6 +235,21 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
     ms->firmware = g_strdup(value);
 }
 
+static char *machine_get_kvm_type(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return g_strdup(ms->kvm_type);
+}
+
+static void machine_set_kvm_type(Object *obj, const char *value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    g_free(ms->kvm_type);
+    ms->kvm_type = g_strdup(value);
+}
+
 static void machine_initfn(Object *obj)
 {
     object_property_add_str(obj, "accel",
@@ -274,6 +289,8 @@ static void machine_initfn(Object *obj)
     object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
     object_property_add_str(obj, "firmware",
                             machine_get_firmware, machine_set_firmware, NULL);
+    object_property_add_str(obj, "kvm-type",
+                            machine_get_kvm_type, machine_set_kvm_type, NULL);
 }
 
 static void machine_finalize(Object *obj)
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 2d2e2be..44956d6 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -111,6 +111,7 @@ struct MachineState {
 bool mem_merge;
 bool usb;
 char *firmware;
+char *kvm_type;
 
 ram_addr_t ram_size;
 const char *boot_order;
-- 
1.9.0



Re: [PATCH] machine: Add kvm-type property

2014-05-30 Thread Paolo Bonzini

Il 30/05/2014 22:41, Eduardo Habkost ha scritto:

diff --git a/include/hw/boards.h b/include/hw/boards.h
index 2d2e2be..44956d6 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -111,6 +111,7 @@ struct MachineState {
 bool mem_merge;
 bool usb;
 char *firmware;
+char *kvm_type;

 ram_addr_t ram_size;
 const char *boot_order;



Can you add it only to the pseries machine instead?  This is one of the 
first reasons why I wanted to have per-machine properties.  :)


Thanks!

Paolo


  1   2   >