Re: Some more basic questions..
Thanks Zhang and Venkateshwara, some more follow-up questions below :)

1. Does "-realtime mlock=on" allocate all the memory upfront and keep it for the VM, or does it just make sure that memory allocated within the guest is not swapped out under host memory pressure?

"-realtime mlock=on" makes QEMU call mlockall(MCL_CURRENT | MCL_FUTURE), which locks ALL of QEMU's memory. Since the VM's memory is part of QEMU's address space, this option keeps the VM's memory from being swapped out.

2. I notice that for a 4G guest on an 8G host, the guest allocates only about 1G initially, and the rest later as I start applications. Is there a way for me to reserve ALL memory (4G in this case) upfront, without changes to the guest and without the guest itself allocating it? It would have to be the host OS or some component within the host OS. Isn't there something to that effect? It seems odd that there isn't.

On Linux, a user process's memory is allocated on demand: physical memory is not allocated until the virtual memory is touched. Because the VM's memory is part of QEMU's, the same applies to the guest. (I assume the VM you mention is a Linux guest; a Windows guest will memset its memory during boot.) You can use -realtime mlock=on to reserve ALL of the VM's memory upfront.

Thank you in advance.
Re: [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters
On Thu, May 29, 2014 at 06:12:05PM -0700, Andi Kleen wrote: From: Andi Kleen a...@linux.intel.com Currently perf unconditionally disables PEBS for guests. Now that we have the infrastructure in place to handle it, we can allow it for KVM-owned guest events. For that, perf needs to know that an event is owned by a guest, so add a new state bit in the perf_event. This doesn't make sense; why does it need to be owned? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] perf: Handle guest PEBS events with a fake event
On Thu, May 29, 2014 at 06:12:06PM -0700, Andi Kleen wrote: Note: in very rare cases with exotic events this may lead to spurious PMIs in the guest. Please qualify that statement, so that if someone runs into it we at least know it is known/expected.
Re: Implement PEBS virtualization for Silvermont
On Thu, May 29, 2014 at 06:12:03PM -0700, Andi Kleen wrote: PEBS is very useful (e.g. enabling the more precise cycles:pp event, or memory profiling). Unfortunately it didn't work under virtualization, which is becoming more and more common. This patch kit implements simple PEBS virtualization for KVM on Silvermont CPUs. Silvermont does not have the leak problems that prevented successful PEBS virtualization earlier. Silvermont is such an underpowered thing, who in his right mind would run anything virt on it to further reduce performance?
Re: powerpc/pseries: Use new defines when calling h_set_mode
On 29.05.14 23:52, Benjamin Herrenschmidt wrote: On Thu, 2014-05-29 at 23:27 +0200, Alexander Graf wrote: On 29.05.14 09:45, Michael Neuling wrote:

+/* Values for 2nd argument to H_SET_MODE */
+#define H_SET_MODE_RESOURCE_SET_CIABR		1
+#define H_SET_MODE_RESOURCE_SET_DAWR		2
+#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE	3
+#define H_SET_MODE_RESOURCE_LE			4

Much better, but I think you want to make use of these in non-kvm code too, no? At least the LE one is definitely already implemented as a call :) Sure, but that's a different patch below. Ben, how would you like to handle these 2 patches? If you give me an ack I can just put this patch into my kvm queue. Alternatively we could both carry a patch that adds the H_SET_MODE header bits only, and whoever hits Linus' tree first wins ;). No biggie. Worst case it's a trivial conflict. Well, the way the patches are split right now it won't be a conflict, but a build failure on either side. Alex
Re: [PATCH v2 00/23] MIPS: KVM: Fixes and guest timer rewrite
On 29/05/2014 22:44, James Hogan wrote: Yes, I agree with your analysis and had considered something like this, although it doesn't particularly appeal to my sense of perfectionism :). I can see that. But I think the simplification of the code is worth it. It is hard to explain why the invalid time-goes-backwards case can happen if env->count_save_time is overwritten with data from another machine. I think the explanation is that (due to timers_state.cpu_ticks_enabled) the value of cpu_get_clock_at(now) - env->count_save_time does not depend on get_clock(), but the code doesn't have any comment saying so. Rather than adding comments, we might as well force it to be always zero and just write get_clock() to COUNT_RESUME. Finally, having to serialize env->count_save_time makes it harder to support migration from TCG to KVM and back. It would be race-free though, and if you're stopping the VM at all you expect to lose some time anyway. Since you mentioned perfectionism :) your code also loses some time; COUNT_RESUME is written based on when the CPU state becomes clean, not on when the CPU was restarted. Paolo
Re: [PATCH 4/4] kvm: Implement PEBS virtualization
On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote: From: Andi Kleen a...@linux.intel.com PEBS (Precise Event Based Sampling) profiling is very powerful, allowing improved sampling precision and much additional information, like address or TSX abort profiling. cycles:p and :pp use PEBS. This patch enables PEBS profiling in KVM guests. That sounds really cool! PEBS writes profiling records to a virtual address in memory. Since the guest controls the virtual address space, the PEBS record is delivered directly to the guest buffer. We set up the PEBS state so that it works correctly. The CPU cannot handle any kind of fault during these guest writes. To avoid any problems with guest pages being swapped out by the host, we pin the pages when the PEBS buffer is set up, by intercepting that MSR. That will keep the guest pages from being swapped out, but shadow paging code may still drop shadow PT pages that build a mapping from the DS virtual address to the guest page. With EPT it is less likely to happen (but still possible IIRC, depending on memory pressure and how much memory the shadow paging code is allowed to use); without EPT it will happen for sure. Typically profilers only set up a single page, so pinning that is not a big problem. The pinning is currently limited to 17 pages (64K buffer + 1). In theory the guest can change its own page tables after the PEBS setup. The host has no way to track that with EPT. But if a guest did that, it could only crash itself. Normal profilers are not expected to do that. The spec says:

The following restrictions should be applied to the DS save area.
• The three DS save area sections should be allocated from a non-paged pool, and marked accessed and dirty. It is the responsibility of the operating system to keep the pages that contain the buffer present and to mark them accessed and dirty. The implication is that the operating system cannot do "lazy" page-table entry propagation for these pages.
There is nothing, as far as I can see, that says what will happen if the condition is not met. I always interpreted it as undefined behaviour, so anything can happen, including the CPU dying completely. You are saying above, on the one hand, that the CPU cannot handle any kind of fault during writes to the DS area, but on the other hand that a guest could only crash itself. Is this architecturally guaranteed? The patch also adds the basic glue to enable the PEBS CPUIDs and other PEBS MSRs, and to ask perf to enable PEBS as needed. Due to various limitations it currently only works on Silvermont-based systems. This patch doesn't implement the extended MSRs some CPUs support. For example, latency profiling on SLM will not work at this point. Timing: the emulation is somewhat more expensive than a real PMU. This may trigger the expensive-PMI detection in the guest. Usually this can be disabled with echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent Migration: in theory it should be possible (as long as we migrate to a host with the same PEBS event and the same PEBS format), but I'm not sure the basic KVM PMU code supports it correctly: there is no code to save/restore state, unless I'm missing something. Once the PMU code grows proper migration support, it should be straightforward to handle the PEBS state too.
Signed-off-by: Andi Kleen a...@linux.intel.com
---
 arch/x86/include/asm/kvm_host.h       |   6 ++
 arch/x86/include/uapi/asm/msr-index.h |   4 +
 arch/x86/kvm/cpuid.c                  |  10 +-
 arch/x86/kvm/pmu.c                    | 184 ++--
 arch/x86/kvm/vmx.c                    |   6 ++
 5 files changed, 196 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7de069af..d87cb66 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -319,6 +319,8 @@ struct kvm_pmc {
 	struct kvm_vcpu *vcpu;
 };

+#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
+
 struct kvm_pmu {
 	unsigned nr_arch_gp_counters;
 	unsigned nr_arch_fixed_counters;
@@ -335,6 +337,10 @@ struct kvm_pmu {
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 	struct irq_work irq_work;
 	u64 reprogram_pmi;
+	u64 pebs_enable;
+	u64 ds_area;
+	struct page *pinned_pages[MAX_PINNED_PAGES];
+	unsigned num_pinned_pages;
 };

 enum {
diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index fcf2b3a..409a582 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -72,6 +72,10 @@
 #define MSR_IA32_PEBS_ENABLE		0x03f1
 #define MSR_IA32_DS_AREA		0x0600
 #define MSR_IA32_PERF_CAPABILITIES	0x0345
+#define PERF_CAP_PEBS_TRAP		(1U << 6)
+#define PERF_CAP_ARCH_REG		(1U << 7)
+#define PERF_CAP_PEBS_FORMAT		(0xf << 8)
+
 #define
Re: [RFC] Implement Batched (group) ticket lock
On 05/30/2014 04:15 AM, Waiman Long wrote: On 05/28/2014 08:16 AM, Raghavendra K T wrote: - we need an intelligent way to nullify the effect of batching for baremetal (because the extra cmpxchg is not required). To do this, you will need two slightly different algorithms depending on the paravirt_ticketlocks_enabled jump label. Thanks for the hint, Waiman. [...]

+spin:
+	for (;;) {
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+		if (!(inc.head & TICKET_LOCK_HEAD_INC)) {
+			new.head = inc.head | TICKET_LOCK_HEAD_INC;
+			if (cmpxchg(&lock->tickets.head, inc.head, new.head)
+					== inc.head)
+				goto out;
+		}
+		cpu_relax();
+	}
+

It had taken me some time to figure out that the LSB of inc.head is used as a bit lock for the contending tasks in the spin loop. I would suggest adding some comment here to make it easier to follow. Agree, I'll add a comment. [...]

+#define TICKET_BATCH	0x4 /* 4 waiters can contend simultaneously */
+#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH << TICKET_LOCK_INC_SHIFT) + \
+				TICKET_LOCK_TAIL_INC - 1)

I don't think TAIL_INC has anything to do with setting the BATCH_MASK. It works here because TAIL_INC is 2. I think it is clearer to define it as either (~(TICKET_BATCH << TICKET_LOCK_INC_SHIFT) + 1) or (~((TICKET_BATCH << TICKET_LOCK_INC_SHIFT) - 1)). You are right, thanks for pointing that out. Your expression is simpler and clearer; I'll use one of them.
Re: powerpc/pseries: Use new defines when calling h_set_mode
On Thu, 2014-05-29 at 17:45 +1000, Michael Neuling wrote:

+/* Values for 2nd argument to H_SET_MODE */
+#define H_SET_MODE_RESOURCE_SET_CIABR		1
+#define H_SET_MODE_RESOURCE_SET_DAWR		2
+#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE	3
+#define H_SET_MODE_RESOURCE_LE			4

Much better, but I think you want to make use of these in non-kvm code too, no? At least the LE one is definitely already implemented as a call :)

powerpc/pseries: Use new defines when calling h_set_mode

Now that we define these in the KVM code, use these defines when we call h_set_mode. No functional change. Signed-off-by: Michael Neuling mi...@neuling.org -- This depends on the KVM h_set_mode patches.

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index 12c32c5..67859ed 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, unsigned long resource,
 static inline long enable_reloc_on_exceptions(void)
 {
 	/* mflags = 3: Exceptions at 0xC0004000 */
-	return plpar_set_mode(3, 3, 0, 0);
+	return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
 }

Which header are these coming from, and why aren't we including it? And is it going to still build with CONFIG_KVM=n? cheers
[GIT PULL 2/6] KVM: s390: Enable DAT support for TPROT handler
From: Thomas Huth th...@linux.vnet.ibm.com

The TPROT instruction can be used to check the accessibility of storage for any kind of logical addresses. So far, our handler only supported real addresses. This patch now also enables support for addresses that have to be translated via DAT first. And while we're at it, change the code to use the common KVM function gfn_to_hva_prot() to check for the validity and writability of the memory page.

Signed-off-by: Thomas Huth th...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/kvm/gaccess.c |  4 ++--
 arch/s390/kvm/gaccess.h |  2 ++
 arch/s390/kvm/priv.c    | 56 +++++++++++++++++++++++++++----------------------
 3 files changed, 37 insertions(+), 25 deletions(-)

diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 5f73826..4653ac6 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -292,7 +292,7 @@ static void ipte_unlock_siif(struct kvm_vcpu *vcpu)
 	wake_up(&vcpu->kvm->arch.ipte_wq);
 }

-static void ipte_lock(struct kvm_vcpu *vcpu)
+void ipte_lock(struct kvm_vcpu *vcpu)
 {
 	if (vcpu->arch.sie_block->eca & 1)
 		ipte_lock_siif(vcpu);
@@ -300,7 +300,7 @@ static void ipte_lock(struct kvm_vcpu *vcpu)
 		ipte_lock_simple(vcpu);
 }

-static void ipte_unlock(struct kvm_vcpu *vcpu)
+void ipte_unlock(struct kvm_vcpu *vcpu)
 {
 	if (vcpu->arch.sie_block->eca & 1)
 		ipte_unlock_siif(vcpu);
diff --git a/arch/s390/kvm/gaccess.h b/arch/s390/kvm/gaccess.h
index 2d37a46..0149cf1 100644
--- a/arch/s390/kvm/gaccess.h
+++ b/arch/s390/kvm/gaccess.h
@@ -327,6 +327,8 @@ int read_guest_real(struct kvm_vcpu *vcpu, unsigned long gra, void *data,
 	return access_guest_real(vcpu, gra, data, len, 0);
 }

+void ipte_lock(struct kvm_vcpu *vcpu);
+void ipte_unlock(struct kvm_vcpu *vcpu);
 int ipte_lock_held(struct kvm_vcpu *vcpu);
 int kvm_s390_check_low_addr_protection(struct kvm_vcpu *vcpu, unsigned long ga);
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 6296159..f89c1cd 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -930,8 +930,9 @@ int kvm_s390_handle_eb(struct kvm_vcpu *vcpu)
 static int handle_tprot(struct kvm_vcpu *vcpu)
 {
 	u64 address1, address2;
-	struct vm_area_struct *vma;
-	unsigned long user_address;
+	unsigned long hva, gpa;
+	int ret = 0, cc = 0;
+	bool writable;

 	vcpu->stat.instruction_tprot++;
@@ -942,32 +943,41 @@ static int handle_tprot(struct kvm_vcpu *vcpu)
 	/* we only handle the Linux memory detection case:
 	 * access key == 0
-	 * guest DAT == off
 	 * everything else goes to userspace. */
 	if (address2 & 0xf0)
 		return -EOPNOTSUPP;
 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_DAT)
-		return -EOPNOTSUPP;
-
-	down_read(&current->mm->mmap_sem);
-	user_address = __gmap_translate(address1, vcpu->arch.gmap);
-	if (IS_ERR_VALUE(user_address))
-		goto out_inject;
-	vma = find_vma(current->mm, user_address);
-	if (!vma)
-		goto out_inject;
-	vcpu->arch.sie_block->gpsw.mask &= ~(3ul << 44);
-	if (!(vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_READ))
-		vcpu->arch.sie_block->gpsw.mask |= (1ul << 44);
-	if (!(vma->vm_flags & VM_WRITE) && !(vma->vm_flags & VM_READ))
-		vcpu->arch.sie_block->gpsw.mask |= (2ul << 44);
-
-	up_read(&current->mm->mmap_sem);
-	return 0;
+		ipte_lock(vcpu);
+	ret = guest_translate_address(vcpu, address1, &gpa, 1);
+	if (ret == PGM_PROTECTION) {
+		/* Write protected? Try again with read-only... */
+		cc = 1;
+		ret = guest_translate_address(vcpu, address1, &gpa, 0);
+	}
+	if (ret) {
+		if (ret == PGM_ADDRESSING || ret == PGM_TRANSLATION_SPEC) {
+			ret = kvm_s390_inject_program_int(vcpu, ret);
+		} else if (ret > 0) {
+			/* Translation not available */
+			kvm_s390_set_psw_cc(vcpu, 3);
+			ret = 0;
+		}
+		goto out_unlock;
+	}

-out_inject:
-	up_read(&current->mm->mmap_sem);
-	return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
+	hva = gfn_to_hva_prot(vcpu->kvm, gpa_to_gfn(gpa), &writable);
+	if (kvm_is_error_hva(hva)) {
+		ret = kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
+	} else {
+		if (!writable)
+			cc = 1;		/* Write not permitted ==> read-only */
+		kvm_s390_set_psw_cc(vcpu, cc);
+		/* Note: CC2 only occurs for storage keys (not supported yet) */
+	}
+out_unlock:
+	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_DAT)
+		ipte_unlock(vcpu);
+	return ret;
 }

 int kvm_s390_handle_e5(struct kvm_vcpu *vcpu)
--
1.8.4.2
[GIT PULL 0/6] KVM: s390: Fixes and cleanups for 3.16
Paolo,

The following changes since commit 1f854112553a1d65363ab27d4ee3dfb4b27075fb:

  KVM: vmx: DR7 masking on task switch emulation is wrong (2014-05-22 17:47:18 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git tags/kvm-s390-20140530

for you to fetch changes up to 5a5e65361f01b44caa51ba202e6720d458829fc5:

  KVM: s390: Intercept the tprot instruction (2014-05-30 09:39:40 +0200)

Several minor fixes and cleanups for KVM:
1. Fix flag check for gdb support
2. Remove unnecessary vcpu start
3. Remove code duplication for sigp interrupts
4. Better DAT handling for the TPROT instruction
5. Correct addressing exception for standby memory

David Hildenbrand (2):
      KVM: s390: check the given debug flags, not the set ones
      KVM: s390: a VCPU is already started when delivering interrupts

Jens Freimann (1):
      KVM: s390: clean up interrupt injection in sigp code

Matthew Rosato (1):
      KVM: s390: Intercept the tprot instruction

Thomas Huth (2):
      KVM: s390: Add a generic function for translating guest addresses
      KVM: s390: Enable DAT support for TPROT handler

 arch/s390/include/asm/kvm_host.h |  1 +
 arch/s390/kvm/gaccess.c          | 57 ++--
 arch/s390/kvm/gaccess.h          |  5 +
 arch/s390/kvm/interrupt.c        |  1 -
 arch/s390/kvm/kvm-s390.c         |  6 +++--
 arch/s390/kvm/priv.c             | 56 +++
 arch/s390/kvm/sigp.c             | 56 +--
 7 files changed, 116 insertions(+), 66 deletions(-)
[GIT PULL 4/6] KVM: s390: check the given debug flags, not the set ones
From: David Hildenbrand d...@linux.vnet.ibm.com

This patch fixes a minor bug when updating the guest debug settings: we should check the given debug flags, not the already set ones. It doesn't do any harm, but too many (for now unused) flags could be set internally without an error.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/kvm-s390.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index e519860..06d1888 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -950,7 +950,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 	vcpu->guest_debug = 0;
 	kvm_s390_clear_bp_data(vcpu);

-	if (vcpu->guest_debug & ~VALID_GUESTDBG_FLAGS)
+	if (dbg->control & ~VALID_GUESTDBG_FLAGS)
 		return -EINVAL;

 	if (dbg->control & KVM_GUESTDBG_ENABLE) {
--
1.8.4.2
[GIT PULL 1/6] KVM: s390: Add a generic function for translating guest addresses
From: Thomas Huth th...@linux.vnet.ibm.com

This patch adds a function for translating logical guest addresses into physical guest addresses without touching the memory at the given location.

Signed-off-by: Thomas Huth th...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/kvm/gaccess.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
 arch/s390/kvm/gaccess.h |  3 +++
 2 files changed, 56 insertions(+)

diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index db608c3..5f73826 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -645,6 +645,59 @@ int access_guest_real(struct kvm_vcpu *vcpu, unsigned long gra,
 }

 /**
+ * guest_translate_address - translate guest logical into guest absolute address
+ *
+ * Parameter semantics are the same as the ones from guest_translate.
+ * The memory contents at the guest address are not changed.
+ *
+ * Note: The IPTE lock is not taken during this function, so the caller
+ * has to take care of this.
+ */
+int guest_translate_address(struct kvm_vcpu *vcpu, unsigned long gva,
+			    unsigned long *gpa, int write)
+{
+	struct kvm_s390_pgm_info *pgm = &vcpu->arch.pgm;
+	psw_t *psw = &vcpu->arch.sie_block->gpsw;
+	struct trans_exc_code_bits *tec;
+	union asce asce;
+	int rc;
+
+	/* Access register mode is not supported yet. */
+	if (psw_bits(*psw).t && psw_bits(*psw).as == PSW_AS_ACCREG)
+		return -EOPNOTSUPP;
+
+	gva = kvm_s390_logical_to_effective(vcpu, gva);
+	memset(pgm, 0, sizeof(*pgm));
+	tec = (struct trans_exc_code_bits *)&pgm->trans_exc_code;
+	tec->as = psw_bits(*psw).as;
+	tec->fsi = write ? FSI_STORE : FSI_FETCH;
+	tec->addr = gva >> PAGE_SHIFT;
+	if (is_low_address(gva) && low_address_protection_enabled(vcpu)) {
+		if (write) {
+			rc = pgm->code = PGM_PROTECTION;
+			return rc;
+		}
+	}
+
+	asce.val = get_vcpu_asce(vcpu);
+	if (psw_bits(*psw).t && !asce.r) {	/* Use DAT? */
+		rc = guest_translate(vcpu, gva, gpa, write);
+		if (rc > 0) {
+			if (rc == PGM_PROTECTION)
+				tec->b61 = 1;
+			pgm->code = rc;
+		}
+	} else {
+		rc = 0;
+		*gpa = kvm_s390_real_to_abs(vcpu, gva);
+		if (kvm_is_error_gpa(vcpu->kvm, *gpa))
+			rc = pgm->code = PGM_ADDRESSING;
+	}
+
+	return rc;
+}
+
+/**
 * kvm_s390_check_low_addr_protection - check for low-address protection
 * @ga: Guest address
 *
diff --git a/arch/s390/kvm/gaccess.h b/arch/s390/kvm/gaccess.h
index a07ee08..2d37a46 100644
--- a/arch/s390/kvm/gaccess.h
+++ b/arch/s390/kvm/gaccess.h
@@ -155,6 +155,9 @@ int read_guest_lc(struct kvm_vcpu *vcpu, unsigned long gra, void *data,
 	return kvm_read_guest(vcpu->kvm, gpa, data, len);
 }

+int guest_translate_address(struct kvm_vcpu *vcpu, unsigned long gva,
+			    unsigned long *gpa, int write);
+
 int access_guest(struct kvm_vcpu *vcpu, unsigned long ga, void *data,
 		 unsigned long len, int write);
--
1.8.4.2
[GIT PULL 6/6] KVM: s390: Intercept the tprot instruction
From: Matthew Rosato mjros...@linux.vnet.ibm.com

Based on an original patch from Jeng-fang (Nick) Wang. When standby memory is specified for a guest Linux, but no virtual memory has been allocated on the Qemu host backing that guest, the guest memory detection process encounters a memory access exception which is not thrown from the KVM handle_tprot() instruction handler. The access exception comes from sie64a returning -EFAULT, which then passes an addressing exception to the guest. Unfortunately this does not do the proper PSW fixup (nullifying vs. suppressing), so the guest will get a fault for the wrong address. Let's just intercept the tprot instruction all the time, to do the right thing and not go down the page-fault-handler path for standby memory. tprot is only used by Linux during startup, so a few extra exits should be OK. Without this patch, standby memory cannot be used with KVM.

Signed-off-by: Nick Wang jfw...@us.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Tested-by: Matthew Rosato mjros...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/include/asm/kvm_host.h | 1 +
 arch/s390/kvm/kvm-s390.c         | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index a27f500..4181d7b 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -110,6 +110,7 @@ struct kvm_s390_sie_block {
 #define ICTL_ISKE	0x40000000
 #define ICTL_SSKE	0x20000000
 #define ICTL_RRBE	0x10000000
+#define ICTL_TPROT	0x00000200
 	__u32	ictl;		/* 0x0048 */
 	__u32	eca;		/* 0x004c */
 #define ICPT_INST	0x04
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 06d1888..43e191b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -637,7 +637,9 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 	if (sclp_has_siif())
 		vcpu->arch.sie_block->eca |= 1;
 	vcpu->arch.sie_block->fac = (int) (long) vfacilities;
-	vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
+	vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE |
+				      ICTL_TPROT;
+
 	if (kvm_s390_cmma_enabled(vcpu->kvm)) {
 		rc = kvm_s390_vcpu_setup_cmma(vcpu);
 		if (rc)
--
1.8.4.2
[GIT PULL 5/6] KVM: s390: a VCPU is already started when delivering interrupts
From: David Hildenbrand d...@linux.vnet.ibm.com

This patch removes the start of a VCPU when delivering a RESTART interrupt. Interrupt delivery is called from kvm_arch_vcpu_ioctl_run, so the VCPU is already considered started; there is no need to call kvm_s390_vcpu_start, which would early-exit anyway.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/interrupt.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index bf0d9bc..90c8de2 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -442,7 +442,6 @@ static void __do_deliver_interrupt(struct kvm_vcpu *vcpu,
 		rc |= read_guest_lc(vcpu, offsetof(struct _lowcore, restart_psw),
 				    &vcpu->arch.sie_block->gpsw, sizeof(psw_t));
-		kvm_s390_vcpu_start(vcpu);
 		break;
 	case KVM_S390_PROGRAM_INT:
 		VCPU_EVENT(vcpu, 4, "interrupt: pgm check code:%x, ilc:%x",
--
1.8.4.2
[GIT PULL 3/6] KVM: s390: clean up interrupt injection in sigp code
From: Jens Freimann jf...@linux.vnet.ibm.com

We have all the logic to inject interrupts available in kvm_s390_inject_vcpu(), so let's use it instead of injecting irqs manually to the list in the sigp code. SIGP stop is special because we have to check the action_flags before injecting the interrupt. As the action_flags are not available in kvm_s390_inject_vcpu(), we leave the code for the stop order code untouched for now.

Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com
Reviewed-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/kvm/sigp.c | 56 +---
 1 file changed, 18 insertions(+), 38 deletions(-)

diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index d0341d2..43079a4 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -54,33 +54,23 @@ static int __sigp_sense(struct kvm_vcpu *vcpu, u16 cpu_addr,

 static int __sigp_emergency(struct kvm_vcpu *vcpu, u16 cpu_addr)
 {
-	struct kvm_s390_local_interrupt *li;
-	struct kvm_s390_interrupt_info *inti;
+	struct kvm_s390_interrupt s390int = {
+		.type = KVM_S390_INT_EMERGENCY,
+		.parm = vcpu->vcpu_id,
+	};
 	struct kvm_vcpu *dst_vcpu = NULL;
+	int rc = 0;

 	if (cpu_addr < KVM_MAX_VCPUS)
 		dst_vcpu = kvm_get_vcpu(vcpu->kvm, cpu_addr);
 	if (!dst_vcpu)
 		return SIGP_CC_NOT_OPERATIONAL;

-	inti = kzalloc(sizeof(*inti), GFP_KERNEL);
-	if (!inti)
-		return -ENOMEM;
-
-	inti->type = KVM_S390_INT_EMERGENCY;
-	inti->emerg.code = vcpu->vcpu_id;
+	rc = kvm_s390_inject_vcpu(dst_vcpu, &s390int);
+	if (!rc)
+		VCPU_EVENT(vcpu, 4, "sent sigp emerg to cpu %x", cpu_addr);

-	li = &dst_vcpu->arch.local_int;
-	spin_lock_bh(&li->lock);
-	list_add_tail(&inti->list, &li->list);
-	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags);
-	if (waitqueue_active(&li->wq))
-		wake_up_interruptible(&li->wq);
-	spin_unlock_bh(&li->lock);
-	VCPU_EVENT(vcpu, 4, "sent sigp emerg to cpu %x", cpu_addr);
-
-	return SIGP_CC_ORDER_CODE_ACCEPTED;
+	return rc ? rc : SIGP_CC_ORDER_CODE_ACCEPTED;
 }

 static int __sigp_conditional_emergency(struct kvm_vcpu *vcpu, u16 cpu_addr,
@@ -116,33 +106,23 @@ static int __sigp_conditional_emergency(struct kvm_vcpu *vcpu, u16 cpu_addr,

 static int __sigp_external_call(struct kvm_vcpu *vcpu, u16 cpu_addr)
 {
-	struct kvm_s390_local_interrupt *li;
-	struct kvm_s390_interrupt_info *inti;
+	struct kvm_s390_interrupt s390int = {
+		.type = KVM_S390_INT_EXTERNAL_CALL,
+		.parm = vcpu->vcpu_id,
+	};
 	struct kvm_vcpu *dst_vcpu = NULL;
+	int rc;

 	if (cpu_addr < KVM_MAX_VCPUS)
 		dst_vcpu = kvm_get_vcpu(vcpu->kvm, cpu_addr);
 	if (!dst_vcpu)
 		return SIGP_CC_NOT_OPERATIONAL;

-	inti = kzalloc(sizeof(*inti), GFP_KERNEL);
-	if (!inti)
-		return -ENOMEM;
-
-	inti->type = KVM_S390_INT_EXTERNAL_CALL;
-	inti->extcall.code = vcpu->vcpu_id;
-
-	li = &dst_vcpu->arch.local_int;
-	spin_lock_bh(&li->lock);
-	list_add_tail(&inti->list, &li->list);
-	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_EXT_INT, li->cpuflags);
-	if (waitqueue_active(&li->wq))
-		wake_up_interruptible(&li->wq);
-	spin_unlock_bh(&li->lock);
-	VCPU_EVENT(vcpu, 4, "sent sigp ext call to cpu %x", cpu_addr);
+	rc = kvm_s390_inject_vcpu(dst_vcpu, &s390int);
+	if (!rc)
+		VCPU_EVENT(vcpu, 4, "sent sigp ext call to cpu %x", cpu_addr);

-	return SIGP_CC_ORDER_CODE_ACCEPTED;
+	return rc ? rc : SIGP_CC_ORDER_CODE_ACCEPTED;
 }

 static int __inject_sigp_stop(struct kvm_s390_local_interrupt *li, int action)
--
1.8.4.2
Re: powerpc/pseries: Use new defines when calling h_set_mode
On Fri, 2014-05-30 at 18:56 +1000, Michael Ellerman wrote: On Thu, 2014-05-29 at 17:45 +1000, Michael Neuling wrote:

+/* Values for 2nd argument to H_SET_MODE */
+#define H_SET_MODE_RESOURCE_SET_CIABR		1
+#define H_SET_MODE_RESOURCE_SET_DAWR		2
+#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE	3
+#define H_SET_MODE_RESOURCE_LE			4

Much better, but I think you want to make use of these in non-kvm code too, no? At least the LE one is definitely already implemented as a call :)

powerpc/pseries: Use new defines when calling h_set_mode

Now that we define these in the KVM code, use these defines when we call h_set_mode. No functional change. Signed-off-by: Michael Neuling mi...@neuling.org -- This depends on the KVM h_set_mode patches.

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index 12c32c5..67859ed 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, unsigned long resource,
 static inline long enable_reloc_on_exceptions(void)
 {
 	/* mflags = 3: Exceptions at 0xC0004000 */
-	return plpar_set_mode(3, 3, 0, 0);
+	return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
 }

Which header are these coming from, and why aren't we including it? And is it going to still build with CONFIG_KVM=n?

From include/asm/hvcall.h, in the h_set_mode patch set I sent before. And yes, it compiles fine with CONFIG_KVM=n.

Mikey
Re: powerpc/pseries: Use new defines when calling h_set_mode
On 30.05.14 11:10, Michael Neuling wrote: On Fri, 2014-05-30 at 18:56 +1000, Michael Ellerman wrote: On Thu, 2014-05-29 at 17:45 +1000, Michael Neuling wrote: +/* Values for 2nd argument to H_SET_MODE */ +#define H_SET_MODE_RESOURCE_SET_CIABR 1 +#define H_SET_MODE_RESOURCE_SET_DAWR 2 +#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE 3 +#define H_SET_MODE_RESOURCE_LE 4 Much better, but I think you want to make use of these in non-kvm code too, no? At least the LE one is definitely already implemented as a call :) powerpc/pseries: Use new defines when calling h_set_mode Now that we define these in the KVM code, use these defines when we call h_set_mode. No functional change. Signed-off-by: Michael Neuling mi...@neuling.org -- This depends on the KVM h_set_mode patches. diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h index 12c32c5..67859ed 100644 --- a/arch/powerpc/include/asm/plpar_wrappers.h +++ b/arch/powerpc/include/asm/plpar_wrappers.h @@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, unsigned long resource, static inline long enable_reloc_on_exceptions(void) { /* mflags = 3: Exceptions at 0xC000_0000_0000_4000 */ - return plpar_set_mode(3, 3, 0, 0); + return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0); } Which header are these coming from, and why aren't we including it? And is it going to still build with CONFIG_KVM=n? From include/asm/hvcall.h in the h_set_mode patch set I sent before. And yes it compiles with CONFIG_KVM=n fine. Please split that patch into one that adds the definitions and one that changes the KVM code to use those definitions. Both Ben and me can then apply the definition patch and our respective tree patch. Alex
Re: powerpc/pseries: Use new defines when calling h_set_mode
On 30.05.14 11:44, Michael Neuling wrote: Which header are these coming from, and why aren't we including it? And is it going to still build with CONFIG_KVM=n? From include/asm/hvcall.h in the h_set_mode patch set I sent before. And yes it compiles with CONFIG_KVM=n fine. Please split that patch into one that adds the definitions and one that changes the KVM code to use those definitions. Both Ben and me can then apply the definition patch and our respective tree patch. Why don't you just take the original h_set_mode patch and I'll repost this cleanup later to Ben when yours is upstream. This cleanup patch is not critical to anything and it avoids more churn. That works too, but please keep in mind that my path to upstream is much longer than what you're used to ;). Alex
Re: [PATCH v2 02/23] MIPS: Export local_flush_icache_range for KVM
On Thu, May 29, 2014 at 10:16:24AM +0100, James Hogan wrote: Export the local_flush_icache_range function pointer for GPL modules so that it can be used by KVM for syncing the icache after binary translation of trapping instructions. Acked-by: Ralf Baechle r...@linux-mips.org Ralf
Re: [PATCH v2 14/23] MIPS: KVM: Override guest kernel timer frequency directly
On Thu, May 29, 2014 at 10:16:36AM +0100, James Hogan wrote: The KVM_HOST_FREQ Kconfig symbol was used by KVM guest kernels to override the timer frequency calculation to a value based on the host frequency. Now that the KVM timer emulation is implemented independent of the host timer frequency and defaults to 100MHz, adjust the working of CONFIG_KVM_HOST_FREQ to match. The Kconfig symbol now specifies the guest timer frequency directly, and has been renamed accordingly to KVM_GUEST_TIMER_FREQ. It now defaults to 100MHz too and the help text is updated to make it clear that a zero value will allow the normal timer frequency calculation to take place (based on the emulated RTC). Acked-by: Ralf Baechle r...@linux-mips.org Ralf
Re: [PATCH v2 00/23] MIPS: KVM: Fixes and guest timer rewrite
Il 29/05/2014 11:16, James Hogan ha scritto: Here are a range of MIPS KVM TE fixes, preferably for v3.16 but I know it's probably a bit late now. Changes are pretty minimal though since v1 so please consider. They can also be found on my kvm_mips_queue branch (and the kvm_mips_timer_v2 tag) here: git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/kvm-mips.git They originally served to allow it to work better on Ingenic XBurst cores which have some peculiarities which break non-portable assumptions in the MIPS KVM implementation (patches 1-4, 13). Fixing guest CP0_Count emulation to work without a running host CP0_Count (patch 13) however required a rewrite of the timer emulation code to use the kernel monotonic time instead, which needed doing anyway since basing it directly off the host CP0_Count was broken. Various bugs were fixed in the process (patches 10-12) and improvements made thanks to valuable feedback from Paolo Bonzini for the last QEMU MIPS/KVM patchset (patches 5-7, 15-16). Finally there are some misc cleanups which I did along the way (patches 17-23). Only the first patch (fixes MIPS KVM with 4K pages) is marked for stable. For KVM to work on XBurst it needs the timer rework which is a fairly complex change, so there's little point marking any of the XBurst specific changes for stable. All feedback welcome! Patches 1-4: Fix KVM/MIPS with 4K pages, missing RDHWR SYNCI (XBurst), unmoving CP0_Random (XBurst). Patches 5-9: Add EPC, Count, Compare, UserLocal, HWREna guest CP0 registers to KVM register ioctl interface. Patches 10-12: Fix a few potential races relating to timers. Patches 13-14: Rewrite guest timer emulation to use ktime_get(). Patches 15-16: Add KVM virtual registers for controlling guest timer, including master timer disable, and timer frequency. Patches 17-23: Cleanups. Changes in v2 (tag:kvm_mips_timer_v2): Patchset: - Drop patch 4 MIPS: KVM: Fix CP0_EBASE KVM register id (David Daney). 
- Drop patch 14 MIPS: KVM: Add nanosecond count bias KVM register. The COUNT_CTL and COUNT_RESUME API is clean and sufficient. - Add missing access to UserLocal and HWREna guest CP0 registers (patches 15 and 16). - Add export of local_flush_icache_range (patch 2). Patch 12 MIPS: KVM: Migrate hrtimer to follow VCPU - Move kvm_mips_migrate_count() into kvm_tlb.c to fix a link error when KVM is built as a module, since kvm_tlb.c is built statically and cannot reference symbols in kvm_mips_emul.c. Patch 15 MIPS: KVM: Add master disable count interface - Make KVM_REG_MIPS_COUNT_RESUME writable too so that userland can control timer using master DC and without bias register. New values are rejected if they refer to a monotonic time in the future. - Expand on description of KVM_REG_MIPS_COUNT_RESUME about the effects of the register and that it can be written. v1 (tag:kvm_mips_timer_v1): see http://marc.info/?l=kvmm=139843936102657w=2 James Hogan (23): MIPS: KVM: Allocate at least 16KB for exception handlers MIPS: Export local_flush_icache_range for KVM MIPS: KVM: Use local_flush_icache_range to fix RI on XBurst MIPS: KVM: Use tlb_write_random MIPS: KVM: Add CP0_EPC KVM register access MIPS: KVM: Move KVM_{GET,SET}_ONE_REG definitions into kvm_host.h MIPS: KVM: Add CP0_Count/Compare KVM register access MIPS: KVM: Add CP0_UserLocal KVM register access MIPS: KVM: Add CP0_HWREna KVM register access MIPS: KVM: Deliver guest interrupts after local_irq_disable() MIPS: KVM: Fix timer race modifying guest CP0_Cause MIPS: KVM: Migrate hrtimer to follow VCPU MIPS: KVM: Rewrite count/compare timer emulation MIPS: KVM: Override guest kernel timer frequency directly MIPS: KVM: Add master disable count interface MIPS: KVM: Add count frequency KVM register MIPS: KVM: Make kvm_mips_comparecount_{func,wakeup} static MIPS: KVM: Whitespace fixes in kvm_mips_callbacks MIPS: KVM: Fix kvm_debug bit-rottage MIPS: KVM: Remove ifdef DEBUG around kvm_debug MIPS: KVM: Quieten kvm_info() logging 
MIPS: KVM: Remove redundant NULL checks before kfree() MIPS: KVM: Remove redundant semicolon arch/mips/Kconfig | 12 +- arch/mips/include/asm/kvm_host.h | 183 ++--- arch/mips/include/uapi/asm/kvm.h | 35 +++ arch/mips/kvm/kvm_locore.S| 32 --- arch/mips/kvm/kvm_mips.c | 140 +- arch/mips/kvm/kvm_mips_dyntrans.c | 15 +- arch/mips/kvm/kvm_mips_emul.c | 557 -- arch/mips/kvm/kvm_tlb.c | 77 +++--- arch/mips/kvm/kvm_trap_emul.c | 86 +- arch/mips/mm/cache.c | 1 + arch/mips/mti-malta/malta-time.c | 14 +- 11 files changed, 920 insertions(+), 232 deletions(-) Cc: Paolo Bonzini pbonz...@redhat.com Cc: Gleb Natapov g...@kernel.org Cc: kvm@vger.kernel.org Cc: Ralf Baechle r...@linux-mips.org Cc: linux-m...@linux-mips.org Cc: David Daney
[PULL 01/41] KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR
The L1 instruction cache control register contains bits that indicate that we're still handling a request. Mask those out when we set the SPR so that a read doesn't assume we're still doing something. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/e500_emulate.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c index 89b7f82..95d886f 100644 --- a/arch/powerpc/kvm/e500_emulate.c +++ b/arch/powerpc/kvm/e500_emulate.c @@ -222,6 +222,7 @@ int kvmppc_core_emulate_mtspr_e500(struct kvm_vcpu *vcpu, int sprn, ulong spr_val) break; case SPRN_L1CSR1: vcpu_e500->l1csr1 = spr_val; + vcpu_e500->l1csr1 &= ~(L1CSR1_ICFI | L1CSR1_ICLFR); break; case SPRN_HID0: vcpu_e500->hid0 = spr_val; -- 1.8.1.4
[PULL 02/41] KVM: PPC: E500: Add dcbtls emulation
The dcbtls instruction is able to lock data inside the L1 cache. We don't want to give the guest actual access to hardware cache locks, as that could influence other VMs on the same system. But we can tell the guest that its locking attempt failed. By implementing the instruction we at least don't give the guest a program exception which it definitely does not expect. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/reg_booke.h | 1 + arch/powerpc/kvm/e500_emulate.c | 14 ++ 2 files changed, 15 insertions(+) diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h index 163c3b0..464f108 100644 --- a/arch/powerpc/include/asm/reg_booke.h +++ b/arch/powerpc/include/asm/reg_booke.h @@ -583,6 +583,7 @@ /* Bit definitions for L1CSR0. */ #define L1CSR0_CPE 0x00010000 /* Data Cache Parity Enable */ +#define L1CSR0_CUL 0x00000400 /* Data Cache Unable to Lock */ #define L1CSR0_CLFC 0x00000100 /* Cache Lock Bits Flash Clear */ #define L1CSR0_DCFI 0x00000002 /* Data Cache Flash Invalidate */ #define L1CSR0_CFI 0x00000002 /* Cache Flash Invalidate */ diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c index 95d886f..002d517 100644 --- a/arch/powerpc/kvm/e500_emulate.c +++ b/arch/powerpc/kvm/e500_emulate.c @@ -19,6 +19,7 @@ #include "booke.h" #include "e500.h" +#define XOP_DCBTLS 166 #define XOP_MSGSND 206 #define XOP_MSGCLR 238 #define XOP_TLBIVAX 786 @@ -103,6 +104,15 @@ static int kvmppc_e500_emul_ehpriv(struct kvm_run *run, struct kvm_vcpu *vcpu, return emulated; } +static int kvmppc_e500_emul_dcbtls(struct kvm_vcpu *vcpu) +{ + struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu); + + /* Always fail to lock the cache */ + vcpu_e500->l1csr0 |= L1CSR0_CUL; + return EMULATE_DONE; +} + int kvmppc_core_emulate_op_e500(struct kvm_run *run, struct kvm_vcpu *vcpu, unsigned int inst, int *advance) { @@ -116,6 +126,10 @@ int kvmppc_core_emulate_op_e500(struct kvm_run *run, struct kvm_vcpu *vcpu, case 31: switch (get_xop(inst)) { + case XOP_DCBTLS: + emulated = kvmppc_e500_emul_dcbtls(vcpu); + break; + #ifdef CONFIG_KVM_E500MC case XOP_MSGSND: emulated = kvmppc_e500_emul_msgsnd(vcpu, rb); -- 1.8.1.4
[PULL 11/41] KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian
When the guest does an RTAS hypercall it keeps all RTAS variables inside a big endian data structure. To make sure we don't have to bother about endianness inside the actual RTAS handlers, let's just convert the whole structure to host endian before we call our RTAS handlers and back to big endian when we return to the guest. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_rtas.c | 29 + 1 file changed, 29 insertions(+) diff --git a/arch/powerpc/kvm/book3s_rtas.c b/arch/powerpc/kvm/book3s_rtas.c index 7a05315..edb14ba 100644 --- a/arch/powerpc/kvm/book3s_rtas.c +++ b/arch/powerpc/kvm/book3s_rtas.c @@ -205,6 +205,32 @@ int kvm_vm_ioctl_rtas_define_token(struct kvm *kvm, void __user *argp) return rc; } +static void kvmppc_rtas_swap_endian_in(struct rtas_args *args) +{ +#ifdef __LITTLE_ENDIAN__ + int i; + + args->token = be32_to_cpu(args->token); + args->nargs = be32_to_cpu(args->nargs); + args->nret = be32_to_cpu(args->nret); + for (i = 0; i < args->nargs; i++) + args->args[i] = be32_to_cpu(args->args[i]); +#endif +} + +static void kvmppc_rtas_swap_endian_out(struct rtas_args *args) +{ +#ifdef __LITTLE_ENDIAN__ + int i; + + for (i = 0; i < args->nret; i++) + args->args[i] = cpu_to_be32(args->args[i]); + args->token = cpu_to_be32(args->token); + args->nargs = cpu_to_be32(args->nargs); + args->nret = cpu_to_be32(args->nret); +#endif +} + int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu) { struct rtas_token_definition *d; @@ -223,6 +249,8 @@ int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu) if (rc) goto fail; + kvmppc_rtas_swap_endian_in(&args); + /* * args->rets is a pointer into args->args. Now that we've * copied args we need to fix it up to point into our copy, */ @@ -247,6 +275,7 @@ int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu) if (rc == 0) { args.rets = orig_rets; + kvmppc_rtas_swap_endian_out(&args); rc = kvm_write_guest(vcpu->kvm, args_phys, &args, sizeof(args)); if (rc) goto fail; -- 1.8.1.4
[PULL 09/41] KVM: PPC: Book3S PR: Default to big endian guest
The default MSR when user space does not define anything should be identical on little and big endian hosts, so remove MSR_LE from it. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_pr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 01a7156..d7b0ad2 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -1216,7 +1216,7 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_pr(struct kvm *kvm, kvmppc_set_pvr_pr(vcpu, vcpu->arch.pvr); vcpu->arch.slb_nr = 64; - vcpu->arch.shadow_msr = MSR_USER64; + vcpu->arch.shadow_msr = MSR_USER64 & ~MSR_LE; err = kvmppc_mmu_init(vcpu); if (err < 0) -- 1.8.1.4
[PULL 34/41] KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
From: Paul Mackerras pau...@samba.org The global_invalidates() function contains a check that is intended to tell whether we are currently executing in the context of a hypercall issued by the guest. The reason is that the optimization of using a local TLB invalidate instruction is only valid in that context. The check was testing local_paca->kvm_hstate.kvm_vcore, which gets set when entering the guest but no longer gets cleared when exiting the guest. To fix this, we use the kvm_vcpu field instead, which does get cleared when exiting the guest, by the kvmppc_release_hwthread() calls inside kvmppc_run_core(). The effect of having the check wrong was that when kvmppc_do_h_remove() got called from htab_write() on the destination machine during a migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush. This meant that when the guest started running in the destination VM, it may miss out on doing a complete TLB flush, and therefore may end up using stale TLB entries from a previous guest that used the same LPID value. This should make migration more reliable. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 1d6c56a..ac840c6 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -42,13 +42,14 @@ static int global_invalidates(struct kvm *kvm, unsigned long flags) /* * If there is only one vcore, and it's currently running, +* as indicated by local_paca->kvm_hstate.kvm_vcpu being set, * we can use tlbiel as long as we mark all other physical * cores as potentially having stale TLB entries for this lpid. * If we're not using MMU notifiers, we never take pages away * from the guest, so we can use tlbiel if requested. * Otherwise, don't use tlbiel. */ - if (kvm->arch.online_vcores == 1 && local_paca->kvm_hstate.kvm_vcore) + if (kvm->arch.online_vcores == 1 && local_paca->kvm_hstate.kvm_vcpu) global = 0; else if (kvm->arch.using_mmu_notifiers) global = 1; -- 1.8.1.4
[PULL 37/41] KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
From: Paul Mackerras pau...@samba.org Currently, when testing whether a page is dirty (when constructing the bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit in the HPT entries mapping the page, and if it is 0, we consider the page to be clean. However, the Power ISA doesn't require processors to set the C bit to 1 immediately when writing to a page, and in fact allows them to delay the writeback of the C bit until they receive a TLB invalidation for the page. Thus it is possible that the page could be dirty and we miss it. Now, if there are vcpus running, this is not serious since the collection of the dirty log is racy already - some vcpu could dirty the page just after we check it. But if there are no vcpus running we should return definitive results, in case we are in the final phase of migrating the guest. Also, if the permission bits in the HPTE don't allow writing, then we know that no CPU can set C. If the HPTE was previously writable and the page was modified, any C bit writeback would have been flushed out by the tlbie that we did when changing the HPTE to read-only. Otherwise we need to do a TLB invalidation even if the C bit is 0, and then check the C bit. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 47 + 1 file changed, 37 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 96c9044..8056107 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -1060,6 +1060,11 @@ void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte) kvm_handle_hva(kvm, hva, kvm_unmap_rmapp); } +static int vcpus_running(struct kvm *kvm) +{ + return atomic_read(&kvm->arch.vcpus_running) != 0; +} + /* * Returns the number of system pages that are dirty. * This can be more than 1 if we find a huge-page HPTE. @@ -1069,6 +1074,7 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) struct revmap_entry *rev = kvm->arch.revmap; unsigned long head, i, j; unsigned long n; + unsigned long v, r; unsigned long *hptep; int npages_dirty = 0; @@ -1088,7 +1094,22 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) hptep = (unsigned long *) (kvm->arch.hpt_virt + (i << 4)); j = rev[i].forw; - if (!(hptep[1] & HPTE_R_C)) + /* +* Checking the C (changed) bit here is racy since there +* is no guarantee about when the hardware writes it back. +* If the HPTE is not writable then it is stable since the +* page can't be written to, and we would have done a tlbie +* (which forces the hardware to complete any writeback) +* when making the HPTE read-only. +* If vcpus are running then this call is racy anyway +* since the page could get dirtied subsequently, so we +* expect there to be a further call which would pick up +* any delayed C bit writeback. +* Otherwise we need to do the tlbie even if C==0 in +* order to pick up any delayed writeback of C. +*/ + if (!(hptep[1] & HPTE_R_C) && + (!hpte_is_writable(hptep[1]) || vcpus_running(kvm))) continue; if (!try_lock_hpte(hptep, HPTE_V_HVLOCK)) { @@ -1100,23 +1121,29 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) } /* Now check and modify the HPTE */ - if ((hptep[0] & HPTE_V_VALID) && (hptep[1] & HPTE_R_C)) { - /* need to make it temporarily absent to clear C */ - hptep[0] |= HPTE_V_ABSENT; - kvmppc_invalidate_hpte(kvm, hptep, i); - hptep[1] &= ~HPTE_R_C; - eieio(); - hptep[0] = (hptep[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID; + if (!(hptep[0] & HPTE_V_VALID)) + continue; + + /* need to make it temporarily absent so C is stable */ + hptep[0] |= HPTE_V_ABSENT; + kvmppc_invalidate_hpte(kvm, hptep, i); + v = hptep[0]; + r = hptep[1]; + if (r & HPTE_R_C) { + hptep[1] = r & ~HPTE_R_C; if (!(rev[i].guest_rpte & HPTE_R_C)) { rev[i].guest_rpte |= HPTE_R_C; note_hpte_modification(kvm, &rev[i]); } - n = hpte_page_size(hptep[0], hptep[1]); + n = hpte_page_size(v, r); n = (n +
[PULL 39/41] KVM: PPC: Book3S HV: Fix machine check delivery to guest
From: Paul Mackerras pau...@samba.org The code that delivered a machine check to the guest after handling it in real mode failed to load up r11 before calling kvmppc_msr_interrupt, which needs the old MSR value in r11 so it can see the transactional state there. This adds the missing load. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 60fe8ba..220aefb 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -2144,6 +2144,7 @@ machine_check_realmode: beq mc_cont /* If not, deliver a machine check. SRR0/1 are already set */ li r10, BOOK3S_INTERRUPT_MACHINE_CHECK + ld r11, VCPU_MSR(r9) bl kvmppc_msr_interrupt b fast_interrupt_c_return -- 1.8.1.4
[PULL 35/41] KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
From: Paul Mackerras pau...@samba.org Currently, when a huge page is faulted in for a guest, we select the rmap chain to insert the HPTE into based on the guest physical address that the guest tried to access. Since there is an rmap chain for each system page, there are many rmap chains for the area covered by a huge page (e.g. 256 for 16MB pages when PAGE_SIZE = 64kB), and the huge-page HPTE could end up in any one of them. For consistency, and to make the huge-page HPTEs easier to find, we now put huge-page HPTEs in the rmap chain corresponding to the base address of the huge page. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index f32896f..4e22ecb 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -585,6 +585,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, struct kvm *kvm = vcpu->kvm; unsigned long *hptep, hpte[3], r; unsigned long mmu_seq, psize, pte_size; + unsigned long gpa_base, gfn_base; unsigned long gpa, gfn, hva, pfn; struct kvm_memory_slot *memslot; unsigned long *rmap; @@ -623,7 +624,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, /* Translate the logical address and get the page */ psize = hpte_page_size(hpte[0], r); - gpa = (r & HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1)); + gpa_base = r & HPTE_R_RPN & ~(psize - 1); + gfn_base = gpa_base >> PAGE_SHIFT; + gpa = gpa_base | (ea & (psize - 1)); gfn = gpa >> PAGE_SHIFT; memslot = gfn_to_memslot(kvm, gfn); @@ -635,6 +638,13 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, if (!kvm->arch.using_mmu_notifiers) return -EFAULT; /* should never get here */ + /* +* This should never happen, because of the slot_is_aligned() +* check in kvmppc_do_h_enter(). +*/ + if (gfn_base < memslot->base_gfn) + return -EFAULT; + /* used to check for invalidations in progress */ mmu_seq = kvm->mmu_notifier_seq; smp_rmb(); @@ -727,7 +737,8 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, goto out_unlock; hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID; - rmap = &memslot->arch.rmap[gfn - memslot->base_gfn]; + /* Always put the HPTE in the rmap chain for the page base address */ + rmap = &memslot->arch.rmap[gfn_base - memslot->base_gfn]; lock_rmap(rmap); /* Check if we might have been invalidated; let the guest retry if so */ -- 1.8.1.4
[PULL 12/41] KVM: PPC: PR: Fill pvinfo hcall instructions in big endian
We expose a blob of hypercall instructions to user space that it gives to the guest via device tree again. That blob should contain a stream of instructions necessary to do a hypercall in big endian, as it just gets passed into the guest and old guests use them straight away. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/powerpc.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 3cf541a..a9bd0ff 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -1015,10 +1015,10 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo) u32 inst_nop = 0x60000000; #ifdef CONFIG_KVM_BOOKE_HV u32 inst_sc1 = 0x44000022; - pvinfo->hcall[0] = inst_sc1; - pvinfo->hcall[1] = inst_nop; - pvinfo->hcall[2] = inst_nop; - pvinfo->hcall[3] = inst_nop; + pvinfo->hcall[0] = cpu_to_be32(inst_sc1); + pvinfo->hcall[1] = cpu_to_be32(inst_nop); + pvinfo->hcall[2] = cpu_to_be32(inst_nop); + pvinfo->hcall[3] = cpu_to_be32(inst_nop); #else u32 inst_lis = 0x3c000000; u32 inst_ori = 0x60000000; @@ -1034,10 +1034,10 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo) *sc *nop */ - pvinfo->hcall[0] = inst_lis | ((KVM_SC_MAGIC_R0 >> 16) & inst_imm_mask); - pvinfo->hcall[1] = inst_ori | (KVM_SC_MAGIC_R0 & inst_imm_mask); - pvinfo->hcall[2] = inst_sc; - pvinfo->hcall[3] = inst_nop; + pvinfo->hcall[0] = cpu_to_be32(inst_lis | ((KVM_SC_MAGIC_R0 >> 16) & inst_imm_mask)); + pvinfo->hcall[1] = cpu_to_be32(inst_ori | (KVM_SC_MAGIC_R0 & inst_imm_mask)); + pvinfo->hcall[2] = cpu_to_be32(inst_sc); + pvinfo->hcall[3] = cpu_to_be32(inst_nop); #endif pvinfo->flags = KVM_PPC_PVINFO_FLAGS_EV_IDLE; -- 1.8.1.4
[PULL 40/41] KVM: PPC: Book3S PR: Use SLB entry 0
We didn't make use of SLB entry 0 because ... of no good reason. SLB entry 0 will always be used by the Linux linear SLB entry, so the fact that slbia does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway. Just enable use of SLB entry 0 for our shadow SLB code. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_64_mmu_host.c | 11 --- arch/powerpc/kvm/book3s_64_slb.S | 3 ++- 2 files changed, 6 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c index e2efb85..0ac9839 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_host.c +++ b/arch/powerpc/kvm/book3s_64_mmu_host.c @@ -271,11 +271,8 @@ static int kvmppc_mmu_next_segment(struct kvm_vcpu *vcpu, ulong esid) int found_inval = -1; int r; - if (!svcpu->slb_max) - svcpu->slb_max = 1; - /* Are we overwriting? */ - for (i = 1; i < svcpu->slb_max; i++) { + for (i = 0; i < svcpu->slb_max; i++) { if (!(svcpu->slb[i].esid & SLB_ESID_V)) found_inval = i; else if ((svcpu->slb[i].esid & ESID_MASK) == esid) { @@ -285,7 +282,7 @@ static int kvmppc_mmu_next_segment(struct kvm_vcpu *vcpu, ulong esid) } /* Found a spare entry that was invalidated before */ - if (found_inval < 0) { + if (found_inval >= 0) { r = found_inval; goto out; } @@ -359,7 +356,7 @@ void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong ea, ulong seg_size) ulong seg_mask = -seg_size; int i; - for (i = 1; i < svcpu->slb_max; i++) { + for (i = 0; i < svcpu->slb_max; i++) { if ((svcpu->slb[i].esid & SLB_ESID_V) && (svcpu->slb[i].esid & seg_mask) == ea) { /* Invalidate this entry */ @@ -373,7 +370,7 @@ void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong ea, ulong seg_size) void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu) { struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); - svcpu->slb_max = 1; + svcpu->slb_max = 0; svcpu->slb[0].esid = 0; svcpu_put(svcpu); } diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S index 596140e..84c52c6 100644 --- a/arch/powerpc/kvm/book3s_64_slb.S +++ b/arch/powerpc/kvm/book3s_64_slb.S @@ -138,7 +138,8 @@ slb_do_enter: /* Restore bolted entries from the shadow and fix it along the way */ - /* We don't store anything in entry 0, so we don't need to take care of it */ + li r0, r0 + slbmte r0, r0 slbia isync -- 1.8.1.4
[PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
From: Paul Mackerras pau...@samba.org Commit b005255e12a3 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs") added a definition of KVM_REG_PPC_WORT with the same register number as the existing KVM_REG_PPC_VRSAVE (though in fact the definitions are not identical because of the different register sizes.) For clarity, this moves KVM_REG_PPC_WORT to the next unused number, and also adds it to api.txt. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- Documentation/virtual/kvm/api.txt | 1 + arch/powerpc/include/uapi/asm/kvm.h | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 9a95770..6b30290 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1873,6 +1873,7 @@ registers, find a list below: PPC | KVM_REG_PPC_PPR | 64 PPC | KVM_REG_PPC_ARCH_COMPAT | 32 PPC | KVM_REG_PPC_DABRX | 32 + PPC | KVM_REG_PPC_WORT | 64 PPC | KVM_REG_PPC_TM_GPR0 | 64 ... PPC | KVM_REG_PPC_TM_GPR31 | 64 diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index a6665be..2bc4a94 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -545,7 +545,6 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_TCSCR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb1) #define KVM_REG_PPC_PID (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb2) #define KVM_REG_PPC_ACOP (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3) -#define KVM_REG_PPC_WORT (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb4) #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4) #define KVM_REG_PPC_LPCR (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5) @@ -555,6 +554,7 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_ARCH_COMPAT (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb7) #define KVM_REG_PPC_DABRX (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb8) +#define KVM_REG_PPC_WORT (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb9) /* Transactional Memory checkpointed state: * This is all GPRs, all VSX regs and a subset of SPRs -- 1.8.1.4
[PULL 38/41] KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
From: Paul Mackerras pau...@samba.org This adds workarounds for two hardware bugs in the POWER8 performance monitor unit (PMU), both related to interrupt generation. The effect of these bugs is that PMU interrupts can get lost, leading to tools such as perf reporting fewer counts and samples than they should. The first bug relates to the PMAO (perf. mon. alert occurred) bit in MMCR0; setting it should cause an interrupt, but doesn't. The other bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0. Setting PMAE when a counter is negative and counter negative conditions are enabled to cause alerts should cause an alert, but doesn't. The workaround for the first bug is to create conditions where a counter will overflow, whenever we are about to restore a MMCR0 value that has PMAO set (and PMAO_SYNC clear). The workaround for the second bug is to freeze all counters using MMCR2 before reading MMCR0. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/reg.h | 12 --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 59 +++-- 2 files changed, 64 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index e5d2e0b..4852bcf 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -670,18 +670,20 @@ #define MMCR0_PROBLEM_DISABLE MMCR0_FCP #define MMCR0_FCM1 0x1000UL /* freeze counters while MSR mark = 1 */ #define MMCR0_FCM0 0x0800UL /* freeze counters while MSR mark = 0 */ -#define MMCR0_PMXE 0x0400UL /* performance monitor exception enable */ -#define MMCR0_FCECE 0x0200UL /* freeze ctrs on enabled cond or event */ +#define MMCR0_PMXE ASM_CONST(0x0400) /* perf mon exception enable */ +#define MMCR0_FCECE ASM_CONST(0x0200) /* freeze ctrs on enabled cond or event */ #define MMCR0_TBEE 0x0040UL /* time base exception enable */ #define MMCR0_BHRBA 0x0020UL /* BHRB Access allowed in userspace */ #define MMCR0_EBE0x0010UL /* Event 
based branch enable */ #define MMCR0_PMCC 0x000cUL /* PMC control */ #define MMCR0_PMCC_U60x0008UL /* PMC1-6 are R/W by user (PR) */ #define MMCR0_PMC1CE 0x8000UL /* PMC1 count enable*/ -#define MMCR0_PMCjCE 0x4000UL /* PMCj count enable*/ +#define MMCR0_PMCjCE ASM_CONST(0x4000) /* PMCj count enable*/ #define MMCR0_TRIGGER0x2000UL /* TRIGGER enable */ -#define MMCR0_PMAO_SYNC 0x0800UL /* PMU interrupt is synchronous */ -#define MMCR0_PMAO 0x0080UL /* performance monitor alert has occurred, set to 0 after handling exception */ +#define MMCR0_PMAO_SYNC ASM_CONST(0x0800) /* PMU intr is synchronous */ +#define MMCR0_C56RUN ASM_CONST(0x0100) /* PMC5/6 count when RUN=0 */ +/* performance monitor alert has occurred, set to 0 after handling exception */ +#define MMCR0_PMAO ASM_CONST(0x0080) #define MMCR0_SHRFC 0x0040UL /* SHRre freeze conditions between threads */ #define MMCR0_FC56 0x0010UL /* freeze counters 5 and 6 */ #define MMCR0_FCTI 0x0008UL /* freeze counters in tags inactive mode */ diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index ffbb871..60fe8ba 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -86,6 +86,12 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S) lbz r4, LPPACA_PMCINUSE(r3) cmpwi r4, 0 beq 23f /* skip if not */ +BEGIN_FTR_SECTION + ld r3, HSTATE_MMCR(r13) + andi. r4, r3, MMCR0_PMAO_SYNC | MMCR0_PMAO + cmpwi r4, MMCR0_PMAO + beqlkvmppc_fix_pmao +END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG) lwz r3, HSTATE_PMC(r13) lwz r4, HSTATE_PMC + 4(r13) lwz r5, HSTATE_PMC + 8(r13) @@ -726,6 +732,12 @@ skip_tm: sldir3, r3, 31 /* MMCR0_FC (freeze counters) bit */ mtspr SPRN_MMCR0, r3 /* freeze all counters, disable ints */ isync +BEGIN_FTR_SECTION + ld r3, VCPU_MMCR(r4) + andi. 
r5, r3, MMCR0_PMAO_SYNC | MMCR0_PMAO + cmpwi r5, MMCR0_PMAO + beqlkvmppc_fix_pmao +END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG) lwz r3, VCPU_PMC(r4)/* always load up guest PMU registers */ lwz r5, VCPU_PMC + 4(r4)/* to prevent information leak */ lwz r6, VCPU_PMC + 8(r4) @@ -1324,6 +1336,30 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206) 25: /* Save PMU registers if requested */ /* r8 and cr0.eq are live here */ +BEGIN_FTR_SECTION + /* +* POWER8 seems to have a hardware bug where setting +* MMCR0[PMAE] along with MMCR0[PMC1CE] and/or MMCR0[PMCjCE] +* when some counters are already negative doesn't seem +* to cause a performance monitor
[PULL 41/41] KVM: PPC: Book3S PR: Rework SLB switching code
On LPAR guest systems Linux enables the shadow SLB to indicate to the hypervisor a number of SLB entries that always have to be available. Today we go through this shadow SLB and disable all ESID's valid bits. However, pHyp doesn't like this approach very much and honors us with fancy machine checks. Fortunately the shadow SLB descriptor also has an entry that indicates the number of valid entries following. During the lifetime of a guest we can just swap that value to 0 and don't have to worry about the SLB restoration magic. While we're touching the code, let's also make it more readable (get rid of rldicl), allow it to deal with a dynamic number of bolted SLB entries and only do shadow SLB swizzling on LPAR systems. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kernel/paca.c | 3 ++ arch/powerpc/kvm/book3s_64_slb.S | 83 ++-- arch/powerpc/mm/slb.c| 2 +- 3 files changed, 42 insertions(+), 46 deletions(-) diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index ad302f8..d6e195e 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -98,6 +98,9 @@ static inline void free_lppacas(void) { } /* * 3 persistent SLBs are registered here. The buffer will be zero * initially, hence will all be invaild until we actually write them. + * + * If you make the number of persistent SLB entries dynamic, please also + * update PR KVM to flush and restore them accordingly. */ static struct slb_shadow *slb_shadow; diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S index 84c52c6..3589c4e 100644 --- a/arch/powerpc/kvm/book3s_64_slb.S +++ b/arch/powerpc/kvm/book3s_64_slb.S @@ -17,29 +17,9 @@ * Authors: Alexander Graf ag...@suse.de */ -#define SHADOW_SLB_ESID(num) (SLBSHADOW_SAVEAREA + (num * 0x10)) -#define SHADOW_SLB_VSID(num) (SLBSHADOW_SAVEAREA + (num * 0x10) + 0x8) -#define UNBOLT_SLB_ENTRY(num) \ - li r11, SHADOW_SLB_ESID(num); \ - LDX_BE r9, r12, r11; \ - /* Invalid? Skip. */; \ - rldicl. 
r0, r9, 37, 63; \ - beq slb_entry_skip_ ## num; \ - xoris r9, r9, SLB_ESID_V@h; \ - STDX_BE r9, r12, r11; \ - slb_entry_skip_ ## num: - -#define REBOLT_SLB_ENTRY(num) \ - li r8, SHADOW_SLB_ESID(num); \ - li r7, SHADOW_SLB_VSID(num); \ - LDX_BE r10, r11, r8; \ - cmpdi r10, 0; \ - beq slb_exit_skip_ ## num; \ - orisr10, r10, SLB_ESID_V@h; \ - LDX_BE r9, r11, r7;\ - slbmte r9, r10;\ - STDX_BE r10, r11, r8; \ -slb_exit_skip_ ## num: +#define SHADOW_SLB_ENTRY_LEN 0x10 +#define OFFSET_ESID(x) (SHADOW_SLB_ENTRY_LEN * x) +#define OFFSET_VSID(x) ((SHADOW_SLB_ENTRY_LEN * x) + 8) /** ** @@ -63,20 +43,15 @@ slb_exit_skip_ ## num: * SVCPU[LR] = guest LR */ - /* Remove LPAR shadow entries */ +BEGIN_FW_FTR_SECTION -#if SLB_NUM_BOLTED == 3 + /* Declare SLB shadow as 0 entries big */ - ld r12, PACA_SLBSHADOWPTR(r13) + ld r11, PACA_SLBSHADOWPTR(r13) + li r8, 0 + stb r8, 3(r11) - /* Remove bolted entries */ - UNBOLT_SLB_ENTRY(0) - UNBOLT_SLB_ENTRY(1) - UNBOLT_SLB_ENTRY(2) - -#else -#error unknown number of bolted entries -#endif +END_FW_FTR_SECTION_IFSET(FW_FEATURE_LPAR) /* Flush SLB */ @@ -99,7 +74,7 @@ slb_loop_enter: ld r10, 0(r11) - rldicl. r0, r10, 37, 63 + andis. r9, r10, SLB_ESID_V@h beq slb_loop_enter_skip ld r9, 8(r11) @@ -136,24 +111,42 @@ slb_do_enter: * */ - /* Restore bolted entries from the shadow and fix it along the way */ + /* Remove all SLB entries that are in use. 
*/ li r0, r0 slbmte r0, r0 slbia - isync -#if SLB_NUM_BOLTED == 3 + /* Restore bolted entries from the shadow */ ld r11, PACA_SLBSHADOWPTR(r13) - REBOLT_SLB_ENTRY(0) - REBOLT_SLB_ENTRY(1) - REBOLT_SLB_ENTRY(2) - -#else -#error unknown number of bolted entries -#endif +BEGIN_FW_FTR_SECTION + + /* Declare SLB shadow as SLB_NUM_BOLTED entries big */ + + li r8, SLB_NUM_BOLTED + stb r8, 3(r11) + +END_FW_FTR_SECTION_IFSET(FW_FEATURE_LPAR) + + /* Manually load all entries from shadow SLB */ + + li r8, SLBSHADOW_SAVEAREA + li r7, SLBSHADOW_SAVEAREA + 8 + + .rept SLB_NUM_BOLTED + LDX_BE r10, r11, r8 + cmpdi r10, 0 + beq
[PULL 36/41] KVM: PPC: Book3S HV: Fix dirty map for hugepages
From: Alexey Kardashevskiy a...@ozlabs.ru The dirty map that we construct for the KVM_GET_DIRTY_LOG ioctl has one bit per system page (4K/64K). Currently, we only set one bit in the map for each HPT entry with the Change bit set, even if the HPT is for a large page (e.g., 16MB). Userspace then considers only the first system page dirty, though in fact the guest may have modified anywhere in the large page. To fix this, we make kvm_test_clear_dirty() return the actual number of pages that are dirty (and rename it to kvm_test_clear_dirty_npages() to emphasize that that's what it returns). In kvmppc_hv_get_dirty_log() we then set that many bits in the dirty map. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 33 - 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 4e22ecb..96c9044 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -1060,22 +1060,27 @@ void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte) kvm_handle_hva(kvm, hva, kvm_unmap_rmapp); } -static int kvm_test_clear_dirty(struct kvm *kvm, unsigned long *rmapp) +/* + * Returns the number of system pages that are dirty. + * This can be more than 1 if we find a huge-page HPTE. 
+ */ +static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) { struct revmap_entry *rev = kvm-arch.revmap; unsigned long head, i, j; + unsigned long n; unsigned long *hptep; - int ret = 0; + int npages_dirty = 0; retry: lock_rmap(rmapp); if (*rmapp KVMPPC_RMAP_CHANGED) { *rmapp = ~KVMPPC_RMAP_CHANGED; - ret = 1; + npages_dirty = 1; } if (!(*rmapp KVMPPC_RMAP_PRESENT)) { unlock_rmap(rmapp); - return ret; + return npages_dirty; } i = head = *rmapp KVMPPC_RMAP_INDEX; @@ -1106,13 +,16 @@ static int kvm_test_clear_dirty(struct kvm *kvm, unsigned long *rmapp) rev[i].guest_rpte |= HPTE_R_C; note_hpte_modification(kvm, rev[i]); } - ret = 1; + n = hpte_page_size(hptep[0], hptep[1]); + n = (n + PAGE_SIZE - 1) PAGE_SHIFT; + if (n npages_dirty) + npages_dirty = n; } hptep[0] = ~HPTE_V_HVLOCK; } while ((i = j) != head); unlock_rmap(rmapp); - return ret; + return npages_dirty; } static void harvest_vpa_dirty(struct kvmppc_vpa *vpa, @@ -1136,15 +1144,22 @@ static void harvest_vpa_dirty(struct kvmppc_vpa *vpa, long kvmppc_hv_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot, unsigned long *map) { - unsigned long i; + unsigned long i, j; unsigned long *rmapp; struct kvm_vcpu *vcpu; preempt_disable(); rmapp = memslot-arch.rmap; for (i = 0; i memslot-npages; ++i) { - if (kvm_test_clear_dirty(kvm, rmapp) map) - __set_bit_le(i, map); + int npages = kvm_test_clear_dirty_npages(kvm, rmapp); + /* +* Note that if npages 0 then i must be a multiple of npages, +* since we always put huge-page HPTEs in the rmap chain +* corresponding to their page base address. +*/ + if (npages map) + for (j = i; npages; ++j, --npages) + __set_bit_le(j, map); ++rmapp; } -- 1.8.1.4
[PULL 10/41] KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian
The HTAB on PPC is always in big endian. When we access it via hypercalls on behalf of the guest and we're running on a little endian host, we need to make sure we swap the bits accordingly. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_pr_papr.c | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c index 5efa97b..255e5b1 100644 --- a/arch/powerpc/kvm/book3s_pr_papr.c +++ b/arch/powerpc/kvm/book3s_pr_papr.c @@ -57,7 +57,7 @@ static int kvmppc_h_pr_enter(struct kvm_vcpu *vcpu) for (i = 0; ; ++i) { if (i == 8) goto done; - if ((*hpte HPTE_V_VALID) == 0) + if ((be64_to_cpu(*hpte) HPTE_V_VALID) == 0) break; hpte += 2; } @@ -67,8 +67,8 @@ static int kvmppc_h_pr_enter(struct kvm_vcpu *vcpu) goto done; } - hpte[0] = kvmppc_get_gpr(vcpu, 6); - hpte[1] = kvmppc_get_gpr(vcpu, 7); + hpte[0] = cpu_to_be64(kvmppc_get_gpr(vcpu, 6)); + hpte[1] = cpu_to_be64(kvmppc_get_gpr(vcpu, 7)); pteg_addr += i * HPTE_SIZE; copy_to_user((void __user *)pteg_addr, hpte, HPTE_SIZE); kvmppc_set_gpr(vcpu, 4, pte_index | i); @@ -93,6 +93,8 @@ static int kvmppc_h_pr_remove(struct kvm_vcpu *vcpu) pteg = get_pteg_addr(vcpu, pte_index); mutex_lock(vcpu-kvm-arch.hpt_mutex); copy_from_user(pte, (void __user *)pteg, sizeof(pte)); + pte[0] = be64_to_cpu(pte[0]); + pte[1] = be64_to_cpu(pte[1]); ret = H_NOT_FOUND; if ((pte[0] HPTE_V_VALID) == 0 || @@ -169,6 +171,8 @@ static int kvmppc_h_pr_bulk_remove(struct kvm_vcpu *vcpu) pteg = get_pteg_addr(vcpu, tsh H_BULK_REMOVE_PTEX); copy_from_user(pte, (void __user *)pteg, sizeof(pte)); + pte[0] = be64_to_cpu(pte[0]); + pte[1] = be64_to_cpu(pte[1]); /* tsl = AVPN */ flags = (tsh H_BULK_REMOVE_FLAGS) 26; @@ -207,6 +211,8 @@ static int kvmppc_h_pr_protect(struct kvm_vcpu *vcpu) pteg = get_pteg_addr(vcpu, pte_index); mutex_lock(vcpu-kvm-arch.hpt_mutex); copy_from_user(pte, (void __user *)pteg, sizeof(pte)); + pte[0] = be64_to_cpu(pte[0]); + pte[1] = 
be64_to_cpu(pte[1]); ret = H_NOT_FOUND; if ((pte[0] HPTE_V_VALID) == 0 || @@ -225,6 +231,8 @@ static int kvmppc_h_pr_protect(struct kvm_vcpu *vcpu) rb = compute_tlbie_rb(v, r, pte_index); vcpu-arch.mmu.tlbie(vcpu, rb, rb 1 ? true : false); + pte[0] = cpu_to_be64(pte[0]); + pte[1] = cpu_to_be64(pte[1]); copy_to_user((void __user *)pteg, pte, sizeof(pte)); ret = H_SUCCESS; -- 1.8.1.4
[PULL 31/41] KVM: PPC: Add CAP to indicate hcall fixes
We worked around some nasty KVM magic page hcall breakages: 1) NX bit not honored, so ignore NX when we detect it 2) LE guests swizzle hypercall instruction Without these fixes in place, there's no way it would make sense to expose kvm hypercalls to a guest. Chances are immensely high it would trip over and break. So add a new CAP that gives user space a hint that we have workarounds for the bugs above in place. It can use those as a hint to disable PV hypercalls when the guest CPU is anything POWER7 or higher and the host does not have fixes in place. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/powerpc.c | 1 + include/uapi/linux/kvm.h | 1 + 2 files changed, 2 insertions(+) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 154f352..bab20f4 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -416,6 +416,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_SPAPR_TCE: case KVM_CAP_PPC_ALLOC_HTAB: case KVM_CAP_PPC_RTAS: + case KVM_CAP_PPC_FIXUP_HCALL: #ifdef CONFIG_KVM_XICS case KVM_CAP_IRQ_XICS: #endif diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 2b83cf3..16c923d 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -748,6 +748,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_S390_IRQCHIP 99 #define KVM_CAP_IOEVENTFD_NO_LENGTH 100 #define KVM_CAP_VM_ATTRIBUTES 101 +#define KVM_CAP_PPC_FIXUP_HCALL 102 #ifdef KVM_CAP_IRQ_ROUTING -- 1.8.1.4
[PULL 30/41] KVM: PPC: MPIC: Reset IRQ source private members
When we reset the in-kernel MPIC controller, we forget to reset some hidden state such as destmask and output. This state is usually set when the guest writes to the IDR register for a specific IRQ line. To make sure we stay in sync and don't forget hidden state, treat reset of the IDR register as a simple write of the IDR register. That automatically updates all the hidden state as well. Reported-by: Paul Janzen p...@pauljanzen.org Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/mpic.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c index efbd996..b68d0dc 100644 --- a/arch/powerpc/kvm/mpic.c +++ b/arch/powerpc/kvm/mpic.c @@ -126,6 +126,8 @@ static int openpic_cpu_write_internal(void *opaque, gpa_t addr, u32 val, int idx); static int openpic_cpu_read_internal(void *opaque, gpa_t addr, u32 *ptr, int idx); +static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ, + uint32_t val); enum irq_type { IRQ_TYPE_NORMAL = 0, @@ -528,7 +530,6 @@ static void openpic_reset(struct openpic *opp) /* Initialise IRQ sources */ for (i = 0; i opp-max_irq; i++) { opp-src[i].ivpr = opp-ivpr_reset; - opp-src[i].idr = opp-idr_reset; switch (opp-src[i].type) { case IRQ_TYPE_NORMAL: @@ -543,6 +544,8 @@ static void openpic_reset(struct openpic *opp) case IRQ_TYPE_FSLSPECIAL: break; } + + write_IRQreg_idr(opp, i, opp-idr_reset); } /* Initialise IRQ destinations */ for (i = 0; i MAX_CPU; i++) { -- 1.8.1.4
[PULL 27/41] KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Use make_dsisr instead of open coding it. This also have the added benefit of handling alignment interrupt on additional instructions. Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/disassemble.h | 34 + arch/powerpc/kernel/align.c| 34 + arch/powerpc/kvm/book3s_emulate.c | 39 +- 3 files changed, 36 insertions(+), 71 deletions(-) diff --git a/arch/powerpc/include/asm/disassemble.h b/arch/powerpc/include/asm/disassemble.h index 856f8de..6330a61 100644 --- a/arch/powerpc/include/asm/disassemble.h +++ b/arch/powerpc/include/asm/disassemble.h @@ -81,4 +81,38 @@ static inline unsigned int get_oc(u32 inst) { return (inst 11) 0x7fff; } + +#define IS_XFORM(inst) (get_op(inst) == 31) +#define IS_DSFORM(inst)(get_op(inst) = 56) + +/* + * Create a DSISR value from the instruction + */ +static inline unsigned make_dsisr(unsigned instr) +{ + unsigned dsisr; + + + /* bits 6:15 -- 22:31 */ + dsisr = (instr 0x03ff) 16; + + if (IS_XFORM(instr)) { + /* bits 29:30 -- 15:16 */ + dsisr |= (instr 0x0006) 14; + /* bit 25 --17 */ + dsisr |= (instr 0x0040) 8; + /* bits 21:24 -- 18:21 */ + dsisr |= (instr 0x0780) 3; + } else { + /* bit 5 --17 */ + dsisr |= (instr 0x0400) 12; + /* bits 1: 4 -- 18:21 */ + dsisr |= (instr 0x7800) 17; + /* bits 30:31 -- 12:13 */ + if (IS_DSFORM(instr)) + dsisr |= (instr 0x0003) 18; + } + + return dsisr; +} #endif /* __ASM_PPC_DISASSEMBLE_H__ */ diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c index 94908af..34f5552 100644 --- a/arch/powerpc/kernel/align.c +++ b/arch/powerpc/kernel/align.c @@ -25,14 +25,13 @@ #include asm/cputable.h #include asm/emulated_ops.h #include asm/switch_to.h +#include asm/disassemble.h struct aligninfo { unsigned char len; unsigned char flags; }; -#define IS_XFORM(inst) (((inst) 26) == 31) -#define IS_DSFORM(inst)(((inst) 26) = 56) #define INVALID{ 0, 0 } @@ -192,37 +191,6 
@@ static struct aligninfo aligninfo[128] = { }; /* - * Create a DSISR value from the instruction - */ -static inline unsigned make_dsisr(unsigned instr) -{ - unsigned dsisr; - - - /* bits 6:15 -- 22:31 */ - dsisr = (instr 0x03ff) 16; - - if (IS_XFORM(instr)) { - /* bits 29:30 -- 15:16 */ - dsisr |= (instr 0x0006) 14; - /* bit 25 --17 */ - dsisr |= (instr 0x0040) 8; - /* bits 21:24 -- 18:21 */ - dsisr |= (instr 0x0780) 3; - } else { - /* bit 5 --17 */ - dsisr |= (instr 0x0400) 12; - /* bits 1: 4 -- 18:21 */ - dsisr |= (instr 0x7800) 17; - /* bits 30:31 -- 12:13 */ - if (IS_DSFORM(instr)) - dsisr |= (instr 0x0003) 18; - } - - return dsisr; -} - -/* * The dcbz (data cache block zero) instruction * gives an alignment fault if used on non-cacheable * memory. We handle the fault mainly for the diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index 61f38eb..c992447 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -634,44 +634,7 @@ unprivileged: u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned int inst) { - u32 dsisr = 0; - - /* -* This is what the spec says about DSISR bits (not mentioned = 0): -* -* 12:13[DS]Set to bits 30:31 -* 15:16[X] Set to bits 29:30 -* 17 [X] Set to bit 25 -* [D/DS] Set to bit 5 -* 18:21[X] Set to bits 21:24 -* [D/DS] Set to bits 1:4 -* 22:26Set to bits 6:10 (RT/RS/FRT/FRS) -* 27:31Set to bits 11:15 (RA) -*/ - - switch (get_op(inst)) { - /* D-form */ - case OP_LFS: - case OP_LFD: - case OP_STFD: - case OP_STFS: - dsisr |= (inst 12) 0x4000; /* bit 17 */ - dsisr |= (inst 17) 0x3c00; /* bits 18:21 */ - break; - /* X-form */ - case 31: - dsisr |= (inst 14) 0x18000; /* bits 15:16 */ - dsisr |= (inst 8) 0x04000; /* bit 17 */ - dsisr |= (inst 3) 0x03c00; /* bits 18:21 */
[PULL 28/41] PPC: ePAPR: Fix hypercall on LE guest
We get an array of instructions from the hypervisor via device tree that we write into a buffer that gets executed whenever we want to make an ePAPR compliant hypercall. However, the hypervisor passes us these instructions in BE order which we have to manually convert to LE when we want to run them in LE mode. With this fixup in place, I can successfully run LE kernels with KVM PV enabled on PR KVM. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kernel/epapr_paravirt.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/epapr_paravirt.c b/arch/powerpc/kernel/epapr_paravirt.c index 7898be9..d9b7935 100644 --- a/arch/powerpc/kernel/epapr_paravirt.c +++ b/arch/powerpc/kernel/epapr_paravirt.c @@ -47,9 +47,10 @@ static int __init early_init_dt_scan_epapr(unsigned long node, return -1; for (i = 0; i (len / 4); i++) { - patch_instruction(epapr_hypercall_start + i, insts[i]); + u32 inst = be32_to_cpu(insts[i]); + patch_instruction(epapr_hypercall_start + i, inst); #if !defined(CONFIG_64BIT) || defined(CONFIG_PPC_BOOK3E_64) - patch_instruction(epapr_ev_idle_start + i, insts[i]); + patch_instruction(epapr_ev_idle_start + i, inst); #endif } -- 1.8.1.4
[PULL 32/41] KVM: PPC: Book3S: Add ONE_REG register names that were missed
From: Paul Mackerras pau...@samba.org Commit 3b7834743f9 (KVM: PPC: Book3S HV: Reserve POWER8 space in get/set_one_reg) added definitions for several KVM_REG_PPC_* symbols but missed adding some to api.txt. This adds them. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 0581f6c..9a95770 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1794,6 +1794,11 @@ registers, find a list below: PPC | KVM_REG_PPC_MMCR0 | 64 PPC | KVM_REG_PPC_MMCR1 | 64 PPC | KVM_REG_PPC_MMCRA | 64 + PPC | KVM_REG_PPC_MMCR2 | 64 + PPC | KVM_REG_PPC_MMCRS | 64 + PPC | KVM_REG_PPC_SIAR | 64 + PPC | KVM_REG_PPC_SDAR | 64 + PPC | KVM_REG_PPC_SIER | 64 PPC | KVM_REG_PPC_PMC1 | 32 PPC | KVM_REG_PPC_PMC2 | 32 PPC | KVM_REG_PPC_PMC3 | 32 -- 1.8.1.4
[PULL 07/41] KVM: PPC: Book3S_64 PR: Access HTAB in big endian
The HTAB is always big endian. We access the guest's HTAB using copy_from/to_user, but don't yet take care of the fact that we might be running on an LE host. Wrap all accesses to the guest HTAB with big endian accessors. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_64_mmu.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c index 171e5ca..b93c245 100644 --- a/arch/powerpc/kvm/book3s_64_mmu.c +++ b/arch/powerpc/kvm/book3s_64_mmu.c @@ -275,12 +275,15 @@ do_second: key = 4; for (i=0; i16; i+=2) { + u64 pte0 = be64_to_cpu(pteg[i]); + u64 pte1 = be64_to_cpu(pteg[i + 1]); + /* Check all relevant fields of 1st dword */ - if ((pteg[i] v_mask) == v_val) { + if ((pte0 v_mask) == v_val) { /* If large page bit is set, check pgsize encoding */ if (slbe-large (vcpu-arch.hflags BOOK3S_HFLAG_MULTI_PGSIZE)) { - pgsize = decode_pagesize(slbe, pteg[i+1]); + pgsize = decode_pagesize(slbe, pte1); if (pgsize 0) continue; } @@ -297,8 +300,8 @@ do_second: goto do_second; } - v = pteg[i]; - r = pteg[i+1]; + v = be64_to_cpu(pteg[i]); + r = be64_to_cpu(pteg[i+1]); pp = (r HPTE_R_PP) | key; if (r HPTE_R_PP0) pp |= 8; -- 1.8.1.4
[PULL 15/41] KVM: PPC: Book3S: Move little endian conflict to HV KVM
With the previous patches applied, we can now successfully use PR KVM on little endian hosts which means we can now allow users to select it. However, HV KVM still needs some work, so let's keep the kconfig conflict on that one. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index 141b202..d6a53b9 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -6,7 +6,6 @@ source virt/kvm/Kconfig menuconfig VIRTUALIZATION bool Virtualization - depends on !CPU_LITTLE_ENDIAN ---help--- Say Y here to get to see options for using your Linux host to run other operating systems inside virtual machines (guests). @@ -76,6 +75,7 @@ config KVM_BOOK3S_64 config KVM_BOOK3S_64_HV tristate KVM support for POWER7 and PPC970 using hypervisor mode in host depends on KVM_BOOK3S_64 + depends on !CPU_LITTLE_ENDIAN select KVM_BOOK3S_HV_POSSIBLE select MMU_NOTIFIER select CMA -- 1.8.1.4
[PULL 19/41] KVM: PPC: Book3S PR: Expose TAR facility to guest
POWER8 implements a new register called TAR. This register has to be enabled in FSCR and then from KVM's point of view is mere storage. This patch enables the guest to use TAR. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/kvm_host.h | 2 ++ arch/powerpc/kernel/asm-offsets.c | 2 ++ arch/powerpc/kvm/book3s.c | 6 ++ arch/powerpc/kvm/book3s_hv.c| 6 -- arch/powerpc/kvm/book3s_pr.c| 18 ++ 5 files changed, 28 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 232ec5f..29fbb55 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -449,7 +449,9 @@ struct kvm_vcpu_arch { ulong pc; ulong ctr; ulong lr; +#ifdef CONFIG_PPC_BOOK3S ulong tar; +#endif ulong xer; u32 cr; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index e2b86b5..93e1465 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -446,7 +446,9 @@ int main(void) DEFINE(VCPU_XER, offsetof(struct kvm_vcpu, arch.xer)); DEFINE(VCPU_CTR, offsetof(struct kvm_vcpu, arch.ctr)); DEFINE(VCPU_LR, offsetof(struct kvm_vcpu, arch.lr)); +#ifdef CONFIG_PPC_BOOK3S DEFINE(VCPU_TAR, offsetof(struct kvm_vcpu, arch.tar)); +#endif DEFINE(VCPU_CR, offsetof(struct kvm_vcpu, arch.cr)); DEFINE(VCPU_PC, offsetof(struct kvm_vcpu, arch.pc)); #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 79cfa2d..4046a1a 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -634,6 +634,9 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg) case KVM_REG_PPC_FSCR: val = get_reg_val(reg-id, vcpu-arch.fscr); break; + case KVM_REG_PPC_TAR: + val = get_reg_val(reg-id, vcpu-arch.tar); + break; default: r = -EINVAL; break; @@ -726,6 +729,9 @@ int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg) case KVM_REG_PPC_FSCR: 
vcpu-arch.fscr = set_reg_val(reg-id, val); break; + case KVM_REG_PPC_TAR: + vcpu-arch.tar = set_reg_val(reg-id, val); + break; default: r = -EINVAL; break; diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 0092e12..ee1d8ee 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -891,9 +891,6 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id, case KVM_REG_PPC_BESCR: *val = get_reg_val(id, vcpu-arch.bescr); break; - case KVM_REG_PPC_TAR: - *val = get_reg_val(id, vcpu-arch.tar); - break; case KVM_REG_PPC_DPDES: *val = get_reg_val(id, vcpu-arch.vcore-dpdes); break; @@ -1100,9 +1097,6 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id, case KVM_REG_PPC_BESCR: vcpu-arch.bescr = set_reg_val(id, *val); break; - case KVM_REG_PPC_TAR: - vcpu-arch.tar = set_reg_val(id, *val); - break; case KVM_REG_PPC_DPDES: vcpu-arch.vcore-dpdes = set_reg_val(id, *val); break; diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index ddc626e..7d27a95 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -90,6 +90,7 @@ static void kvmppc_core_vcpu_put_pr(struct kvm_vcpu *vcpu) #endif kvmppc_giveup_ext(vcpu, MSR_FP | MSR_VEC | MSR_VSX); + kvmppc_giveup_fac(vcpu, FSCR_TAR_LG); vcpu-cpu = -1; } @@ -625,6 +626,14 @@ static void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac) /* Facility not available to the guest, ignore giveup request*/ return; } + + switch (fac) { + case FSCR_TAR_LG: + vcpu-arch.tar = mfspr(SPRN_TAR); + mtspr(SPRN_TAR, current-thread.tar); + vcpu-arch.shadow_fscr = ~FSCR_TAR; + break; + } #endif } @@ -794,6 +803,12 @@ static int kvmppc_handle_fac(struct kvm_vcpu *vcpu, ulong fac) } switch (fac) { + case FSCR_TAR_LG: + /* TAR switching isn't lazy in Linux yet */ + current-thread.tar = mfspr(SPRN_TAR); + mtspr(SPRN_TAR, vcpu-arch.tar); + vcpu-arch.shadow_fscr |= FSCR_TAR; + break; default: kvmppc_emulate_fac(vcpu, fac);
[PULL 13/41] KVM: PPC: Make shared struct aka magic page guest endian
The shared (magic) page is a data structure that contains often used supervisor privileged SPRs accessible via memory to the user to reduce the number of exits we have to take to read/write them. When we actually share this structure with the guest we have to maintain it in guest endianness, because some of the patch tricks only work with native endian load/store operations. Since we only share the structure with either host or guest in little endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv. For booke, the shared struct stays big endian. For book3s_64 hv we maintain the struct in host native endian, since it never gets shared with the guest. For book3s_64 pr we introduce a variable that tells us which endianness the shared struct is in and route every access to it through helper inline functions that evaluate this variable. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/kvm_book3s.h| 3 +- arch/powerpc/include/asm/kvm_booke.h | 5 -- arch/powerpc/include/asm/kvm_host.h | 3 + arch/powerpc/include/asm/kvm_ppc.h | 80 +- arch/powerpc/kernel/asm-offsets.c| 4 ++ arch/powerpc/kvm/book3s.c| 72 arch/powerpc/kvm/book3s_32_mmu.c | 21 +++ arch/powerpc/kvm/book3s_32_mmu_host.c| 4 +- arch/powerpc/kvm/book3s_64_mmu.c | 19 --- arch/powerpc/kvm/book3s_64_mmu_host.c| 4 +- arch/powerpc/kvm/book3s_emulate.c| 28 - arch/powerpc/kvm/book3s_exports.c| 1 + arch/powerpc/kvm/book3s_hv.c | 11 arch/powerpc/kvm/book3s_interrupts.S | 23 +++- arch/powerpc/kvm/book3s_paired_singles.c | 16 +++--- arch/powerpc/kvm/book3s_pr.c | 97 +++- arch/powerpc/kvm/book3s_pr_papr.c| 2 +- arch/powerpc/kvm/emulate.c | 24 arch/powerpc/kvm/powerpc.c | 33 ++- arch/powerpc/kvm/trace_pr.h | 2 +- 20 files changed, 309 insertions(+), 143 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index bb1e38a..f52f656 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -268,9 
+268,10 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu) return vcpu->arch.pc; } +static inline u64 kvmppc_get_msr(struct kvm_vcpu *vcpu); static inline bool kvmppc_need_byteswap(struct kvm_vcpu *vcpu) { - return (vcpu->arch.shared->msr & MSR_LE) != (MSR_KERNEL & MSR_LE); + return (kvmppc_get_msr(vcpu) & MSR_LE) != (MSR_KERNEL & MSR_LE); } static inline u32 kvmppc_get_last_inst_internal(struct kvm_vcpu *vcpu, ulong pc) diff --git a/arch/powerpc/include/asm/kvm_booke.h b/arch/powerpc/include/asm/kvm_booke.h index 80d46b5..c7aed61 100644 --- a/arch/powerpc/include/asm/kvm_booke.h +++ b/arch/powerpc/include/asm/kvm_booke.h @@ -108,9 +108,4 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu) { return vcpu->arch.fault_dear; } - -static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu) -{ - return vcpu->arch.shared->msr; -} #endif /* __ASM_KVM_BOOKE_H__ */ diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index d342f8e..15f19d3 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -623,6 +623,9 @@ struct kvm_vcpu_arch { wait_queue_head_t cpu_run; struct kvm_vcpu_arch_shared *shared; +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE) + bool shared_big_endian; +#endif unsigned long magic_page_pa; /* phys addr to map the magic page to */ unsigned long magic_page_ea; /* effect. addr to map the magic page to */ diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 4096f16..4a7cc45 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -449,6 +449,84 @@ static inline void kvmppc_mmu_flush_icache(pfn_t pfn) } /* + * Shared struct helpers. The shared struct can be little or big endian, + * depending on the guest endianness. So expose helpers to all of them.
+ */ +static inline bool kvmppc_shared_big_endian(struct kvm_vcpu *vcpu) +{ +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE) + /* Only Book3S_64 PR supports bi-endian for now */ + return vcpu->arch.shared_big_endian; +#elif defined(CONFIG_PPC_BOOK3S_64) && defined(__LITTLE_ENDIAN__) + /* Book3s_64 HV on little endian is always little endian */ + return false; +#else + return true; +#endif +} + +#define SHARED_WRAPPER_GET(reg, size) \ +static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu) \ +{ \ + if (kvmppc_shared_big_endian(vcpu))
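The accessor pattern the patch builds with the SHARED_WRAPPER macros can be sketched outside the kernel like this. The struct names, the `swap64` helper, and the explicit `host_big_endian` field are stand-ins for illustration (the kernel instead compiles for a known host endianness and uses its `swab` helpers); only the idea is taken from the patch: swap on access iff the shared page's endianness differs from the accessor's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's byteswap helper. */
static inline uint64_t swap64(uint64_t v) { return __builtin_bswap64(v); }

struct shared_page { uint64_t msr; };

struct vcpu {
    struct shared_page *shared;
    bool shared_big_endian;   /* mirrors vcpu->arch.shared_big_endian */
    bool host_big_endian;     /* modeled explicitly here for clarity */
};

/* Read the MSR from the shared page, swapping iff guest and host differ. */
static inline uint64_t get_shared_msr(const struct vcpu *v)
{
    uint64_t raw = v->shared->msr;
    return (v->shared_big_endian == v->host_big_endian) ? raw : swap64(raw);
}

/* Write the MSR into the shared page in the guest's byte order. */
static inline void set_shared_msr(struct vcpu *v, uint64_t val)
{
    v->shared->msr =
        (v->shared_big_endian == v->host_big_endian) ? val : swap64(val);
}
```

Because every access funnels through one pair of helpers, flipping `shared_big_endian` at guest-endianness switch time is enough; no caller needs to know about byte order.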
[PULL 14/41] KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions
When the host CPU we're running on doesn't support dcbz32 itself, but the guest wants dcbz to only clear 32 bytes of data, we loop through every executable mapped page to search for dcbz instructions and patch them with a special privileged instruction that we emulate as dcbz32. The only guests that want to see dcbz act on 32 bytes are book3s_32 guests, so we don't have to worry about little endian instruction ordering. So let's just always search for big endian dcbz instructions, even when we're on a little endian host. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_32_mmu.c | 2 +- arch/powerpc/kvm/book3s_pr.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c index 628d90e..93503bb 100644 --- a/arch/powerpc/kvm/book3s_32_mmu.c +++ b/arch/powerpc/kvm/book3s_32_mmu.c @@ -131,7 +131,7 @@ static hva_t kvmppc_mmu_book3s_32_get_pteg(struct kvm_vcpu *vcpu, pteg = (vcpu_book3s->sdr1 & 0x) | hash; dprintk("MMU: pc=0x%lx eaddr=0x%lx sdr1=0x%llx pteg=0x%x vsid=0x%x\n", - kvmppc_get_pc(vcpu_book3s->vcpu), eaddr, vcpu_book3s->sdr1, pteg, + kvmppc_get_pc(vcpu), eaddr, vcpu_book3s->sdr1, pteg, sr_vsid(sre)); r = gfn_to_hva(vcpu->kvm, pteg >> PAGE_SHIFT); diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index d424ca0..6e55934 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -428,8 +428,8 @@ static void kvmppc_patch_dcbz(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte) /* patch dcbz into reserved instruction, so we trap */ for (i = hpage_offset; i < hpage_offset + (HW_PAGE_SIZE / 4); i++) - if ((page[i] & 0xff0007ff) == INS_DCBZ) - page[i] = 0xfff7; + if ((be32_to_cpu(page[i]) & 0xff0007ff) == INS_DCBZ) + page[i] = cpu_to_be32(0xfff7); kunmap_atomic(page); put_page(hpage); -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at
http://vger.kernel.org/majordomo-info.html
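The fixed-byte-order scan the patch does with `be32_to_cpu()` can be sketched in userspace as follows. `INS_DCBZ` (0x7c0007ec) and the 0xff0007ff mask match the values in the patch hunk; `be32_load` stands in for `be32_to_cpu()` and works the same on hosts of either endianness because it assembles the word byte by byte.

```c
#include <stddef.h>
#include <stdint.h>

#define INS_DCBZ  0x7c0007ecU   /* dcbz with register fields masked out */
#define DCBZ_MASK 0xff0007ffU   /* keep opcode + extended opcode bits   */

/* Interpret a word as big endian regardless of host byte order. */
static inline uint32_t be32_load(const uint32_t *p)
{
    const uint8_t *b = (const uint8_t *)p;
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Count instructions in `page` (n words) that match dcbz under the mask. */
static size_t count_dcbz(const uint32_t *page, size_t n)
{
    size_t i, hits = 0;
    for (i = 0; i < n; i++)
        if ((be32_load(&page[i]) & DCBZ_MASK) == INS_DCBZ)
            hits++;
    return hits;
}
```

Since book3s_32 guests only ever emit big-endian instruction streams, a single fixed-endian comparison like this is sufficient; no host-endianness special case is needed.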
[PULL 21/41] KVM: PPC: Book3S PR: Expose TM registers
POWER8 introduces transactional memory which brings along a number of new registers and MSR bits. Implementing all of those is a pretty big headache, so for now let's at least emulate enough to make Linux's context switching code happy. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_emulate.c | 22 ++ arch/powerpc/kvm/book3s_pr.c | 20 +++- 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index e1165ba..9bdff15 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -451,6 +451,17 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong spr_val) case SPRN_EBBRR: vcpu->arch.ebbrr = spr_val; break; +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM + case SPRN_TFHAR: + vcpu->arch.tfhar = spr_val; + break; + case SPRN_TEXASR: + vcpu->arch.texasr = spr_val; + break; + case SPRN_TFIAR: + vcpu->arch.tfiar = spr_val; + break; +#endif #endif case SPRN_ICTC: case SPRN_THRM1: @@ -572,6 +583,17 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val case SPRN_EBBRR: *spr_val = vcpu->arch.ebbrr; break; +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM + case SPRN_TFHAR: + *spr_val = vcpu->arch.tfhar; + break; + case SPRN_TEXASR: + *spr_val = vcpu->arch.texasr; + break; + case SPRN_TFIAR: + *spr_val = vcpu->arch.tfiar; + break; +#endif #endif case SPRN_THRM1: case SPRN_THRM2: diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 7d27a95..23367a7 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -794,9 +794,27 @@ static void kvmppc_emulate_fac(struct kvm_vcpu *vcpu, ulong fac) /* Enable facilities (TAR, EBB, DSCR) for the guest */ static int kvmppc_handle_fac(struct kvm_vcpu *vcpu, ulong fac) { + bool guest_fac_enabled; BUG_ON(!cpu_has_feature(CPU_FTR_ARCH_207S)); - if (!(vcpu->arch.fscr & (1ULL << fac))) { + /* +* Not every facility is enabled by FSCR bits; check whether the +* guest has this facility enabled at all. +*/ + switch (fac) { + case FSCR_TAR_LG: + case FSCR_EBB_LG: + guest_fac_enabled = (vcpu->arch.fscr & (1ULL << fac)); + break; + case FSCR_TM_LG: + guest_fac_enabled = kvmppc_get_msr(vcpu) & MSR_TM; + break; + default: + guest_fac_enabled = false; + break; + } + + if (!guest_fac_enabled) { /* Facility not enabled by the guest */ kvmppc_trigger_fac_interrupt(vcpu, fac); return RESUME_GUEST; -- 1.8.1.4
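The dispatch in kvmppc_handle_fac can be sketched as a pure function. The bit positions below are hypothetical stand-ins (the real values live in the kernel's asm/reg.h); what the sketch keeps from the patch is the split: TAR and EBB are gated by an FSCR bit, while TM is gated by an MSR bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions; the real values live in asm/reg.h. */
#define FSCR_TAR_LG 8
#define FSCR_EBB_LG 7
#define FSCR_TM_LG  5
#define MSR_TM      (1ULL << 32)   /* placeholder for the real MSR[TM] bit */

/* Decide whether the guest has enabled facility `fac`:
 * TAR/EBB are controlled by an FSCR bit, TM by an MSR bit. */
static bool guest_fac_enabled(uint64_t fscr, uint64_t msr, unsigned fac)
{
    switch (fac) {
    case FSCR_TAR_LG:
    case FSCR_EBB_LG:
        return (fscr & (1ULL << fac)) != 0;
    case FSCR_TM_LG:
        return (msr & MSR_TM) != 0;
    default:
        /* unknown facility: treat as disabled, let the guest trap */
        return false;
    }
}
```

The default-to-disabled arm matters: any facility the hypervisor doesn't recognize results in an interrupt being reflected to the guest rather than silently succeeding.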
[PULL 16/41] KVM: PPC: Book3S PR: Ignore PMU SPRs
When we expose a POWER8 CPU into the guest, it will start accessing PMU SPRs that we don't emulate. Just ignore accesses to them. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_emulate.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index 45d0a80..52448ef 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -455,6 +455,13 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong spr_val) case SPRN_WPAR_GEKKO: case SPRN_MSSSR0: case SPRN_DABR: +#ifdef CONFIG_PPC_BOOK3S_64 + case SPRN_MMCRS: + case SPRN_MMCRA: + case SPRN_MMCR0: + case SPRN_MMCR1: + case SPRN_MMCR2: +#endif break; unprivileged: default: @@ -553,6 +560,13 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val case SPRN_WPAR_GEKKO: case SPRN_MSSSR0: case SPRN_DABR: +#ifdef CONFIG_PPC_BOOK3S_64 + case SPRN_MMCRS: + case SPRN_MMCRA: + case SPRN_MMCR0: + case SPRN_MMCR1: + case SPRN_MMCR2: +#endif *spr_val = 0; break; default: -- 1.8.1.4
[PULL 25/41] PPC: KVM: Make NX bit available with magic page
Because old kernels enable the magic page and then choke on NXed trampoline code, we have to disable NX by default in KVM when we use the magic page. However, since commit b18db0b8 we have successfully fixed that and can now leave NX enabled, so tell the hypervisor about this. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kernel/kvm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c index 6a01752..5e6f24f 100644 --- a/arch/powerpc/kernel/kvm.c +++ b/arch/powerpc/kernel/kvm.c @@ -417,7 +417,7 @@ static void kvm_map_magic_page(void *data) ulong out[8]; in[0] = KVM_MAGIC_PAGE; - in[1] = KVM_MAGIC_PAGE; + in[1] = KVM_MAGIC_PAGE | MAGIC_PAGE_FLAG_NOT_MAPPED_NX; epapr_hypercall(in, out, KVM_HCALL_TOKEN(KVM_HC_PPC_MAP_MAGIC_PAGE)); -- 1.8.1.4
[PULL 23/41] KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com On recent IBM Power CPUs, while the hashed page table is looked up using the page size from the segmentation hardware (i.e. the SLB), it is possible to have the HPT entry indicate a larger page size. Thus for example it is possible to put a 16MB page in a 64kB segment, but since the hash lookup is done using a 64kB page size, it may be necessary to put multiple entries in the HPT for a single 16MB page. This capability is called mixed page-size segment (MPSS). With MPSS, there are two relevant page sizes: the base page size, which is the size used in searching the HPT, and the actual page size, which is the size indicated in the HPT entry. [ Note that the actual page size is always >= base page size ]. We use the ibm,segment-page-sizes device tree node to advertise the MPSS support to the PAPR guest. The penc encoding indicates whether we support a specific combination of base page size and actual page size in the same segment. We also use the penc value in the LP encoding of the HPTE entry. This patch exposes MPSS support to the KVM guest by advertising the feature via ibm,segment-page-sizes. It also adds the necessary changes to decode the base page size and the actual page size correctly from the HPTE entry.
Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/kvm_book3s_64.h | 146 ++- arch/powerpc/kvm/book3s_hv.c | 7 ++ 2 files changed, 130 insertions(+), 23 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 51388be..fddb72b 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -77,34 +77,122 @@ static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits) return old == 0; } +static inline int __hpte_actual_psize(unsigned int lp, int psize) +{ + int i, shift; + unsigned int mask; + + /* start from 1 ignoring MMU_PAGE_4K */ + for (i = 1; i MMU_PAGE_COUNT; i++) { + + /* invalid penc */ + if (mmu_psize_defs[psize].penc[i] == -1) + continue; + /* +* encoding bits per actual page size +*PTE LP actual page size +* rrrz =8KB +* rrzz =16KB +* rzzz =32KB +* =64KB +* ... +*/ + shift = mmu_psize_defs[i].shift - LP_SHIFT; + if (shift LP_BITS) + shift = LP_BITS; + mask = (1 shift) - 1; + if ((lp mask) == mmu_psize_defs[psize].penc[i]) + return i; + } + return -1; +} + static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r, unsigned long pte_index) { - unsigned long rb, va_low; + int b_psize, a_psize; + unsigned int penc; + unsigned long rb = 0, va_low, sllp; + unsigned int lp = (r LP_SHIFT) ((1 LP_BITS) - 1); + + if (!(v HPTE_V_LARGE)) { + /* both base and actual psize is 4k */ + b_psize = MMU_PAGE_4K; + a_psize = MMU_PAGE_4K; + } else { + for (b_psize = 0; b_psize MMU_PAGE_COUNT; b_psize++) { + + /* valid entries have a shift value */ + if (!mmu_psize_defs[b_psize].shift) + continue; + a_psize = __hpte_actual_psize(lp, b_psize); + if (a_psize != -1) + break; + } + } + /* +* Ignore the top 14 bits of va +* v have top two bits covering segment size, hence move +* by 16 bits, Also clear the lower HPTE_V_AVPN_SHIFT (7) bits. 
+* AVA field in v also has the lower 23 bits ignored. +* For base page size 4K we need 14 .. 65 bits (so need to +* collect extra 11 bits) +* For others we need 14..14+i +*/ + /* This covers 14..54 bits of va*/ rb = (v & ~0x7fUL) << 16; /* AVA field */ + /* +* AVA in v had cleared lower 23 bits. We need to derive +* that from pteg index +*/ va_low = pte_index >> 3; if (v & HPTE_V_SECONDARY) va_low = ~va_low; - /* xor vsid from AVA */ + /* +* get the vpn bits from va_low using reverse of hashing. +* In v we have va with 23 bits dropped and then left shifted +* HPTE_V_AVPN_SHIFT (7) bits. Now to find vsid we need +* right shift it with (SID_SHIFT - (23 - 7)) +*/ if (!(v & HPTE_V_1TB_SEG)) - va_low ^= v >> 12; +
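The `__hpte_actual_psize` lookup the patch adds can be sketched with a toy page-size table. The `LP_SHIFT`/`LP_BITS` constants match the patch; the `penc` values in the table below are illustrative placeholders, not the real POWER8 encodings. The structure of the search is the same: for each candidate actual size, mask the LP bits down to that size's encoding width and compare against the penc registered for this (base, actual) pair.

```c
#include <stdint.h>

#define LP_SHIFT 12
#define LP_BITS  8
#define MMU_PAGE_4K    0
#define MMU_PAGE_64K   1
#define MMU_PAGE_16M   2
#define MMU_PAGE_COUNT 3

/* Toy page-size table: shift is log2(size); penc[actual] is the LP
 * encoding used when a page of `actual` size sits in a segment of this
 * base size. Values are illustrative, not the real ISA encodings. */
struct psize_def { unsigned shift; int penc[MMU_PAGE_COUNT]; };

static const struct psize_def psize_defs[MMU_PAGE_COUNT] = {
    [MMU_PAGE_4K]  = { 12, { -1, -1,   -1   } },
    [MMU_PAGE_64K] = { 16, { -1, 0x01, 0x38 } },
    [MMU_PAGE_16M] = { 24, { -1, -1,   0x00 } },
};

/* Find the actual page size encoded by LP bits `lp` for base size `base`;
 * -1 means no valid combination. Mirrors __hpte_actual_psize above. */
static int hpte_actual_psize(unsigned lp, int base)
{
    for (int i = 1; i < MMU_PAGE_COUNT; i++) {   /* skip 4K, as the patch does */
        if (psize_defs[base].penc[i] == -1)
            continue;
        unsigned shift = psize_defs[i].shift - LP_SHIFT;
        if (shift > LP_BITS)
            shift = LP_BITS;
        unsigned mask = (1u << shift) - 1;
        if ((lp & mask) == (unsigned)psize_defs[base].penc[i])
            return i;
    }
    return -1;
}
```

Note how the comparison width grows with the candidate's size: a bigger actual page leaves more LP bits free for its encoding, which is exactly the rrrz/rrzz/rzzz pattern described in the patch comment.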
[PULL 26/41] KVM: PPC: BOOK3S: Always use the saved DAR value
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Although it's optional, IBM POWER CPUs have always had the DAR value set on alignment interrupts. So don't try to compute these values. Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_emulate.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index 9bdff15..61f38eb 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -676,6 +676,12 @@ u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned int inst) ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned int inst) { +#ifdef CONFIG_PPC_BOOK3S_64 + /* +* Linux's fix_alignment() assumes that DAR is valid, so can we +*/ + return vcpu->arch.fault_dar; +#else ulong dar = 0; ulong ra = get_ra(inst); ulong rb = get_rb(inst); @@ -700,4 +706,5 @@ ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned int inst) } return dar; +#endif } -- 1.8.1.4
[PULL 18/41] KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR
POWER8 introduced a new interrupt type called Facility unavailable interrupt which contains its status message in a new register called FSCR. Handle these exits and try to emulate instructions for unhandled facilities. Follow-on patches enable KVM to expose specific facilities into the guest. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/kvm_asm.h| 18 arch/powerpc/include/asm/kvm_book3s_asm.h | 2 + arch/powerpc/include/asm/kvm_host.h | 1 + arch/powerpc/kernel/asm-offsets.c | 3 ++ arch/powerpc/kvm/book3s.c | 10 + arch/powerpc/kvm/book3s_emulate.c | 6 +++ arch/powerpc/kvm/book3s_hv.c | 6 --- arch/powerpc/kvm/book3s_pr.c | 68 +++ arch/powerpc/kvm/book3s_segment.S | 25 9 files changed, 125 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h index 19eb74a..9601741 100644 --- a/arch/powerpc/include/asm/kvm_asm.h +++ b/arch/powerpc/include/asm/kvm_asm.h @@ -102,6 +102,7 @@ #define BOOK3S_INTERRUPT_PERFMON 0xf00 #define BOOK3S_INTERRUPT_ALTIVEC 0xf20 #define BOOK3S_INTERRUPT_VSX 0xf40 +#define BOOK3S_INTERRUPT_FAC_UNAVAIL 0xf60 #define BOOK3S_INTERRUPT_H_FAC_UNAVAIL 0xf80 #define BOOK3S_IRQPRIO_SYSTEM_RESET0 @@ -114,14 +115,15 @@ #define BOOK3S_IRQPRIO_FP_UNAVAIL 7 #define BOOK3S_IRQPRIO_ALTIVEC 8 #define BOOK3S_IRQPRIO_VSX 9 -#define BOOK3S_IRQPRIO_SYSCALL 10 -#define BOOK3S_IRQPRIO_MACHINE_CHECK 11 -#define BOOK3S_IRQPRIO_DEBUG 12 -#define BOOK3S_IRQPRIO_EXTERNAL13 -#define BOOK3S_IRQPRIO_DECREMENTER 14 -#define BOOK3S_IRQPRIO_PERFORMANCE_MONITOR 15 -#define BOOK3S_IRQPRIO_EXTERNAL_LEVEL 16 -#define BOOK3S_IRQPRIO_MAX 17 +#define BOOK3S_IRQPRIO_FAC_UNAVAIL 10 +#define BOOK3S_IRQPRIO_SYSCALL 11 +#define BOOK3S_IRQPRIO_MACHINE_CHECK 12 +#define BOOK3S_IRQPRIO_DEBUG 13 +#define BOOK3S_IRQPRIO_EXTERNAL14 +#define BOOK3S_IRQPRIO_DECREMENTER 15 +#define BOOK3S_IRQPRIO_PERFORMANCE_MONITOR 16 +#define BOOK3S_IRQPRIO_EXTERNAL_LEVEL 17 +#define BOOK3S_IRQPRIO_MAX 18 #define 
BOOK3S_HFLAG_DCBZ320x1 #define BOOK3S_HFLAG_SLB 0x2 diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h index 821725c..5bdfb5d 100644 --- a/arch/powerpc/include/asm/kvm_book3s_asm.h +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h @@ -104,6 +104,7 @@ struct kvmppc_host_state { #ifdef CONFIG_PPC_BOOK3S_64 u64 cfar; u64 ppr; + u64 host_fscr; #endif }; @@ -133,6 +134,7 @@ struct kvmppc_book3s_shadow_vcpu { u64 esid; u64 vsid; } slb[64]; /* guest SLB */ + u64 shadow_fscr; #endif }; diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 15f19d3..232ec5f 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -475,6 +475,7 @@ struct kvm_vcpu_arch { ulong ppr; ulong pspb; ulong fscr; + ulong shadow_fscr; ulong ebbhr; ulong ebbrr; ulong bescr; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index bbf3b9a..e2b86b5 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -537,6 +537,7 @@ int main(void) DEFINE(VCPU_CFAR, offsetof(struct kvm_vcpu, arch.cfar)); DEFINE(VCPU_PPR, offsetof(struct kvm_vcpu, arch.ppr)); DEFINE(VCPU_FSCR, offsetof(struct kvm_vcpu, arch.fscr)); + DEFINE(VCPU_SHADOW_FSCR, offsetof(struct kvm_vcpu, arch.shadow_fscr)); DEFINE(VCPU_PSPB, offsetof(struct kvm_vcpu, arch.pspb)); DEFINE(VCPU_EBBHR, offsetof(struct kvm_vcpu, arch.ebbhr)); DEFINE(VCPU_EBBRR, offsetof(struct kvm_vcpu, arch.ebbrr)); @@ -618,6 +619,7 @@ int main(void) #ifdef CONFIG_PPC64 SVCPU_FIELD(SVCPU_SLB, slb); SVCPU_FIELD(SVCPU_SLB_MAX, slb_max); + SVCPU_FIELD(SVCPU_SHADOW_FSCR, shadow_fscr); #endif HSTATE_FIELD(HSTATE_HOST_R1, host_r1); @@ -653,6 +655,7 @@ int main(void) #ifdef CONFIG_PPC_BOOK3S_64 HSTATE_FIELD(HSTATE_CFAR, cfar); HSTATE_FIELD(HSTATE_PPR, ppr); + HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr); #endif /* CONFIG_PPC_BOOK3S_64 */ #else /* CONFIG_PPC_BOOK3S */ diff --git 
a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 81abc5c..79cfa2d 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -145,6 +145,7 @@ static int kvmppc_book3s_vec2irqprio(unsigned int vec)
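On a Facility Unavailable interrupt, the facility that trapped is reported in a field of the FSCR, which the handler extracts before dispatching (the patch's later hunks do this with a right shift). A small sketch of that encode/decode, assuming the cause field occupies the top byte of the 64-bit register (the usual FSCR[IC] layout; treat the shift value as an assumption here):

```c
#include <stdint.h>

/* Assumed position of the interrupt-cause (IC) field: top byte of FSCR. */
#define FSCR_IC_SHIFT 56

/* Extract which facility caused the 0xf60 interrupt. */
static unsigned fscr_interrupt_cause(uint64_t fscr)
{
    return (unsigned)(fscr >> FSCR_IC_SHIFT);
}

/* Compose an FSCR image carrying cause `ic`, preserving the enable bits,
 * roughly what a hypervisor would present before reflecting the trap. */
static uint64_t fscr_with_cause(uint64_t fscr, unsigned ic)
{
    fscr &= (1ULL << FSCR_IC_SHIFT) - 1;   /* clear any stale IC value */
    return fscr | ((uint64_t)ic << FSCR_IC_SHIFT);
}
```

Keeping the enable bits and the cause field in one register is what lets the handler both identify the trapping facility and decide whether the guest had actually enabled it.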
[PULL 29/41] KVM: PPC: Graciously fail broken LE hypercalls
There are LE Linux guests out there that don't handle hypercalls correctly. Instead of interpreting the instruction stream from the device tree as big endian, they assume it's a little endian instruction stream and fail. When we see an illegal instruction from such a byte reversed instruction stream, bail out graciously and just declare every hcall an error. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_emulate.c | 17 + 1 file changed, 17 insertions(+) diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index c992447..3f29526 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -94,8 +94,25 @@ int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct kvm_vcpu *vcpu, int rs = get_rs(inst); int ra = get_ra(inst); int rb = get_rb(inst); + u32 inst_sc = 0x4402; switch (get_op(inst)) { + case 0: + emulated = EMULATE_FAIL; + if ((kvmppc_get_msr(vcpu) & MSR_LE) && + (inst == swab32(inst_sc))) { + /* +* This is the byte reversed syscall instruction of our +* hypercall handler. Early versions of LE Linux didn't +* swap the instructions correctly and ended up in +* illegal instructions. +* Just always fail hypercalls on these broken systems. +*/ + kvmppc_set_gpr(vcpu, 3, EV_UNIMPLEMENTED); + kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4); + emulated = EMULATE_DONE; + } + break; case 19: switch (get_xop(inst)) { case OP_19_XOP_RFID: -- 1.8.1.4
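The detection trick boils down to one comparison: a broken LE guest traps with the byte-swapped image of the hypercall instruction. A sketch of that check, where the full `sc` encoding is an assumption on our side (the constant in the patch text above is truncated):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed full encoding of the PPC 'sc' instruction; the patch text
 * above shows it truncated, so treat this value as illustrative. */
#define INST_SC 0x44000002U

static inline uint32_t swap32(uint32_t v) { return __builtin_bswap32(v); }

/* An LE guest whose early kernel forgot to byte-swap its hypercall stream
 * traps with the byte-reversed image of 'sc' as an illegal instruction. */
static bool is_reversed_hypercall(uint32_t inst, bool guest_le)
{
    return guest_le && inst == swap32(INST_SC);
}
```

When the check fires, the handler doesn't try to fix the guest up; it just returns EV_UNIMPLEMENTED and advances the PC, so every hcall from such a guest fails cleanly instead of killing it with an illegal-instruction trap.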
[PULL 20/41] KVM: PPC: Book3S PR: Expose EBB registers
POWER8 introduces a new facility called the Event Based Branch facility. It consists of a few registers that indicate where a guest should branch to when a defined event occurs and it's in PR mode. We don't want to really enable EBB as it will create a big mess with !PR guest mode while hardware is in PR, and we don't really emulate the PMU anyway. So instead, let's just leave it at emulation of all its registers. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s.c | 18 ++ arch/powerpc/kvm/book3s_emulate.c | 22 ++ arch/powerpc/kvm/book3s_hv.c | 18 -- 3 files changed, 40 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 4046a1a..52c654d 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -637,6 +637,15 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg) case KVM_REG_PPC_TAR: val = get_reg_val(reg->id, vcpu->arch.tar); break; + case KVM_REG_PPC_EBBHR: + val = get_reg_val(reg->id, vcpu->arch.ebbhr); + break; + case KVM_REG_PPC_EBBRR: + val = get_reg_val(reg->id, vcpu->arch.ebbrr); + break; + case KVM_REG_PPC_BESCR: + val = get_reg_val(reg->id, vcpu->arch.bescr); + break; default: r = -EINVAL; break; @@ -732,6 +741,15 @@ int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg) case KVM_REG_PPC_TAR: vcpu->arch.tar = set_reg_val(reg->id, val); break; + case KVM_REG_PPC_EBBHR: + vcpu->arch.ebbhr = set_reg_val(reg->id, val); + break; + case KVM_REG_PPC_EBBRR: + vcpu->arch.ebbrr = set_reg_val(reg->id, val); + break; + case KVM_REG_PPC_BESCR: + vcpu->arch.bescr = set_reg_val(reg->id, val); + break; default: r = -EINVAL; break; diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index e8133e5..e1165ba 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -441,6 +441,17 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong spr_val) case SPRN_FSCR:
vcpu-arch.fscr = spr_val; break; +#ifdef CONFIG_PPC_BOOK3S_64 + case SPRN_BESCR: + vcpu-arch.bescr = spr_val; + break; + case SPRN_EBBHR: + vcpu-arch.ebbhr = spr_val; + break; + case SPRN_EBBRR: + vcpu-arch.ebbrr = spr_val; + break; +#endif case SPRN_ICTC: case SPRN_THRM1: case SPRN_THRM2: @@ -551,6 +562,17 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val case SPRN_FSCR: *spr_val = vcpu-arch.fscr; break; +#ifdef CONFIG_PPC_BOOK3S_64 + case SPRN_BESCR: + *spr_val = vcpu-arch.bescr; + break; + case SPRN_EBBHR: + *spr_val = vcpu-arch.ebbhr; + break; + case SPRN_EBBRR: + *spr_val = vcpu-arch.ebbrr; + break; +#endif case SPRN_THRM1: case SPRN_THRM2: case SPRN_THRM3: diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index ee1d8ee..3a94561 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -882,15 +882,6 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id, case KVM_REG_PPC_PSPB: *val = get_reg_val(id, vcpu-arch.pspb); break; - case KVM_REG_PPC_EBBHR: - *val = get_reg_val(id, vcpu-arch.ebbhr); - break; - case KVM_REG_PPC_EBBRR: - *val = get_reg_val(id, vcpu-arch.ebbrr); - break; - case KVM_REG_PPC_BESCR: - *val = get_reg_val(id, vcpu-arch.bescr); - break; case KVM_REG_PPC_DPDES: *val = get_reg_val(id, vcpu-arch.vcore-dpdes); break; @@ -1088,15 +1079,6 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id, case KVM_REG_PPC_PSPB: vcpu-arch.pspb = set_reg_val(id, *val); break; - case KVM_REG_PPC_EBBHR: - vcpu-arch.ebbhr = set_reg_val(id, *val); - break; - case KVM_REG_PPC_EBBRR: - vcpu-arch.ebbrr = set_reg_val(id, *val); - break; - case KVM_REG_PPC_BESCR: -
[PULL 24/41] KVM: PPC: Disable NX for old magic page using guests
Old guests try to use the magic page, but map their trampoline code inside of an NX region. Since we can't fix those old kernels, try to detect whether the guest is sane or not. If not, just disable NX functionality in KVM so that old guests at least work at all. For newer guests, add a bit that we can set to keep NX functionality available. Signed-off-by: Alexander Graf ag...@suse.de --- Documentation/virtual/kvm/ppc-pv.txt | 14 ++ arch/powerpc/include/asm/kvm_host.h | 1 + arch/powerpc/include/uapi/asm/kvm_para.h | 6 ++ arch/powerpc/kvm/book3s_64_mmu.c | 3 +++ arch/powerpc/kvm/powerpc.c | 14 -- 5 files changed, 36 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/ppc-pv.txt b/Documentation/virtual/kvm/ppc-pv.txt index 4643cde..3195606 100644 --- a/Documentation/virtual/kvm/ppc-pv.txt +++ b/Documentation/virtual/kvm/ppc-pv.txt @@ -94,10 +94,24 @@ a bitmap of available features inside the magic page. The following enhancements to the magic page are currently available: KVM_MAGIC_FEAT_SRMaps SR registers r/w in the magic page + KVM_MAGIC_FEAT_MAS0_TO_SPRG7 Maps MASn, ESR, PIR and high SPRGs For enhanced features in the magic page, please check for the existence of the feature before using them! +Magic page flags + + +In addition to features that indicate whether a host is capable of a particular +feature, we also have a channel for a guest to tell the host whether it's capable +of something. This is what we call flags. + +Flags are passed to the host in the low 12 bits of the Effective Address.
+ +The following flags are currently available for a guest to expose: + + MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly wrt magic page + MSR bits diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 29fbb55..bb66d8b 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -631,6 +631,7 @@ struct kvm_vcpu_arch { #endif unsigned long magic_page_pa; /* phys addr to map the magic page to */ unsigned long magic_page_ea; /* effect. addr to map the magic page to */ + bool disable_kernel_nx; int irq_type; /* one of KVM_IRQ_* */ int irq_cpu_id; diff --git a/arch/powerpc/include/uapi/asm/kvm_para.h b/arch/powerpc/include/uapi/asm/kvm_para.h index e3af328..91e42f0 100644 --- a/arch/powerpc/include/uapi/asm/kvm_para.h +++ b/arch/powerpc/include/uapi/asm/kvm_para.h @@ -82,10 +82,16 @@ struct kvm_vcpu_arch_shared { #define KVM_FEATURE_MAGIC_PAGE 1 +/* Magic page flags from host to guest */ + #define KVM_MAGIC_FEAT_SR (1 << 0) /* MASn, ESR, PIR, and high SPRGs */ #define KVM_MAGIC_FEAT_MAS0_TO_SPRG7 (1 << 1) +/* Magic page flags from guest to host */ + +#define MAGIC_PAGE_FLAG_NOT_MAPPED_NX (1 << 0) + #endif /* _UAPI__POWERPC_KVM_PARA_H__ */ diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c index 278729f..774a253 100644 --- a/arch/powerpc/kvm/book3s_64_mmu.c +++ b/arch/powerpc/kvm/book3s_64_mmu.c @@ -313,6 +313,9 @@ do_second: gpte->raddr = (r & HPTE_R_RPN & ~eaddr_mask) | (eaddr & eaddr_mask); gpte->page_size = pgsize; gpte->may_execute = ((r & HPTE_R_N) ?
false : true); + if (unlikely(vcpu->arch.disable_kernel_nx) && + !(kvmppc_get_msr(vcpu) & MSR_PR)) + gpte->may_execute = true; gpte->may_read = false; gpte->may_write = false; diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index b4e15bf..154f352 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -177,8 +177,18 @@ int kvmppc_kvm_pv(struct kvm_vcpu *vcpu) vcpu->arch.shared_big_endian = shared_big_endian; #endif - vcpu->arch.magic_page_pa = param1; - vcpu->arch.magic_page_ea = param2; + if (!(param2 & MAGIC_PAGE_FLAG_NOT_MAPPED_NX)) { + /* +* Older versions of the Linux magic page code had +* a bug where they would map their trampoline code +* NX. If that's the case, remove !PR NX capability. +*/ + vcpu->arch.disable_kernel_nx = true; + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); + } + + vcpu->arch.magic_page_pa = param1 & ~0xfffULL; + vcpu->arch.magic_page_ea = param2 & ~0xfffULL; r2 = KVM_MAGIC_FEAT_SR | KVM_MAGIC_FEAT_MAS0_TO_SPRG7; -- 1.8.1.4
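The flag channel works because the magic-page hypercall arguments are page-aligned addresses, leaving the low 12 bits free. A minimal sketch of the pack/unpack the host side does (the flag value matches the patch; the helper names are ours):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAGIC_PAGE_FLAG_NOT_MAPPED_NX (1ULL << 0)

/* The hypercall parameter is a page-aligned address, so the low 12 bits
 * are free to carry guest-capability flags. */
static inline uint64_t magic_page_addr(uint64_t param)
{
    return param & ~0xfffULL;              /* address part only */
}

static inline bool magic_page_flag(uint64_t param, uint64_t flag)
{
    return (param & 0xfffULL & flag) != 0; /* flag part only */
}
```

Masking the low bits off before storing `magic_page_pa`/`magic_page_ea`, as the patch does, is what keeps old hosts and new guests compatible: an old host that ignores the flag bits still sees a valid page address.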
[PULL 22/41] KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Today when KVM tries to reserve memory for the hash page table it allocates from the normal page allocator first. If that fails it falls back to CMA's reserved region. One of the side effects of this is that we could end up exhausting the page allocator and get Linux into OOM conditions while we still have plenty of space available in CMA. This patch addresses this issue by first trying hash page table allocation from CMA's reserved region before falling back to the normal page allocator. So if we run out of memory, we really are out of memory. Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 23 ++- 1 file changed, 6 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index fb25ebc..f32896f 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -52,7 +52,7 @@ static void kvmppc_rmap_reset(struct kvm *kvm); long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp) { - unsigned long hpt; + unsigned long hpt = 0; struct revmap_entry *rev; struct page *page = NULL; long order = KVM_DEFAULT_HPT_ORDER; @@ -64,22 +64,11 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp) } kvm->arch.hpt_cma_alloc = 0; - /* -* try first to allocate it from the kernel page allocator. -* We keep the CMA reserved for failed allocation.
-*/ - hpt = __get_free_pages(GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT | - __GFP_NOWARN, order - PAGE_SHIFT); - - /* Next try to allocate from the preallocated pool */ - if (!hpt) { - VM_BUG_ON(order KVM_CMA_CHUNK_ORDER); - page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT)); - if (page) { - hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page)); - kvm->arch.hpt_cma_alloc = 1; - } else - --order; + VM_BUG_ON(order KVM_CMA_CHUNK_ORDER); + page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT)); + if (page) { + hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page)); + kvm->arch.hpt_cma_alloc = 1; } /* Lastly try successively smaller sizes from the page allocator */ -- 1.8.1.4
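The allocation policy after the patch — preferred pool first, then the general allocator at successively smaller orders — can be sketched generically. The allocator stubs here are placeholders standing in for the CMA pool and the page allocator; only the fallback/shrink ordering is taken from the patch:

```c
#include <stddef.h>

/* Stand-ins for the kernel allocators: return NULL on failure. */
typedef void *(*alloc_fn)(unsigned order);

/* Try the preferred allocator at the requested order, then fall back to
 * the second allocator at successively smaller orders (down to min_order,
 * which must be >= 1 to avoid unsigned wraparound). On success, *order
 * reports the order actually obtained. */
static void *alloc_with_fallback(alloc_fn preferred, alloc_fn fallback,
                                 unsigned *order, unsigned min_order)
{
    void *p = preferred(*order);
    if (p)
        return p;
    for (; *order >= min_order; (*order)--) {
        p = fallback(*order);
        if (p)
            return p;
    }
    return NULL;
}

/* Example stubs: the "CMA" pool only has small chunks, the "buddy"
 * allocator can satisfy anything up to order 2. */
static void *cma_stub(unsigned order)   { return order <= 1 ? (void *)0xC : NULL; }
static void *buddy_stub(unsigned order) { return order <= 2 ? (void *)0xB : NULL; }
```

The point of trying the reserved pool first is exactly the commit message's argument: memory set aside for the HPT is useless to the rest of the system, so it should be consumed before the general allocator is pressured toward OOM.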
[PULL 17/41] KVM: PPC: Book3S PR: Emulate TIR register
In parallel to the Processor ID Register (PIR), threaded POWER8 also adds a Thread ID Register (TIR). Since PR KVM doesn't emulate more than one thread per core, we can just always expose 0 here. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_emulate.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index 52448ef..0a1de29 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -566,6 +566,7 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val case SPRN_MMCR0: case SPRN_MMCR1: case SPRN_MMCR2: + case SPRN_TIR: #endif *spr_val = 0; break; -- 1.8.1.4
[PULL 04/41] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com With debug option sleep inside atomic section checking enabled we get the below WARN_ON during a PR KVM boot. This is because upstream now have PREEMPT_COUNT enabled even if we have preempt disabled. Fix the warning by adding preempt_disable/enable around floating point and altivec enable. WARNING: at arch/powerpc/kernel/process.c:156 Modules linked in: kvm_pr kvm CPU: 1 PID: 3990 Comm: qemu-system-ppc Tainted: GW 3.15.0-rc1+ #4 task: c000eb85b3a0 ti: c000ec59c000 task.ti: c000ec59c000 NIP: c0015c84 LR: d3334644 CTR: c0015c00 REGS: c000ec59f140 TRAP: 0700 Tainted: GW (3.15.0-rc1+) MSR: 80029032 SF,EE,ME,IR,DR,RI CR: 4224 XER: 2000 CFAR: c0015c24 SOFTE: 1 GPR00: d3334644 c000ec59f3c0 c0e2fa40 c000e2f8 GPR04: 0800 2000 0001 8000 GPR08: 0001 0001 2000 c0015c00 GPR12: d333da18 cfb80900 GPR16: 3fffce4e0fa1 GPR20: 0010 0001 0002 100b9a38 GPR24: 0002 0013 GPR28: c000eb85b3a0 2000 c000e2f8 NIP [c0015c84] .enable_kernel_fp+0x84/0x90 LR [d3334644] .kvmppc_handle_ext+0x134/0x190 [kvm_pr] Call Trace: [c000ec59f3c0] [0010] 0x10 (unreliable) [c000ec59f430] [d3334644] .kvmppc_handle_ext+0x134/0x190 [kvm_pr] [c000ec59f4c0] [d324b380] .kvmppc_set_msr+0x30/0x50 [kvm] [c000ec59f530] [d3337cac] .kvmppc_core_emulate_op_pr+0x16c/0x5e0 [kvm_pr] [c000ec59f5f0] [d324a944] .kvmppc_emulate_instruction+0x284/0xa80 [kvm] [c000ec59f6c0] [d3336888] .kvmppc_handle_exit_pr+0x488/0xb70 [kvm_pr] [c000ec59f790] [d3338d34] kvm_start_lightweight+0xcc/0xdc [kvm_pr] [c000ec59f960] [d3336288] .kvmppc_vcpu_run_pr+0xc8/0x190 [kvm_pr] [c000ec59f9f0] [d324c880] .kvmppc_vcpu_run+0x30/0x50 [kvm] [c000ec59fa60] [d3249e74] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0 [kvm] [c000ec59faf0] [d3244948] .kvm_vcpu_ioctl+0x478/0x760 [kvm] [c000ec59fcb0] [c0224e34] .do_vfs_ioctl+0x4d4/0x790 [c000ec59fd90] [c0225148] .SyS_ioctl+0x58/0xb0 [c000ec59fe30] [c000a1e4] syscall_exit+0x0/0x98 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Alexander Graf 
ag...@suse.de --- arch/powerpc/kvm/book3s_pr.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 8c05cb5..01a7156 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -683,16 +683,20 @@ static int kvmppc_handle_ext(struct kvm_vcpu *vcpu, unsigned int exit_nr, #endif if (msr MSR_FP) { + preempt_disable(); enable_kernel_fp(); load_fp_state(vcpu-arch.fp); t-fp_save_area = vcpu-arch.fp; + preempt_enable(); } if (msr MSR_VEC) { #ifdef CONFIG_ALTIVEC + preempt_disable(); enable_kernel_altivec(); load_vr_state(vcpu-arch.vr); t-vr_save_area = vcpu-arch.vr; + preempt_enable(); #endif } @@ -716,13 +720,17 @@ static void kvmppc_handle_lost_ext(struct kvm_vcpu *vcpu) return; if (lost_ext MSR_FP) { + preempt_disable(); enable_kernel_fp(); load_fp_state(vcpu-arch.fp); + preempt_enable(); } #ifdef CONFIG_ALTIVEC if (lost_ext MSR_VEC) { + preempt_disable(); enable_kernel_altivec(); load_vr_state(vcpu-arch.vr); + preempt_enable(); } #endif current-thread.regs-msr |= lost_ext; -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL 00/41] ppc patch queue 2014-05-30
Hi Paolo / Marcelo, This is my current patch queue for ppc. Please pull. Alex The following changes since commit 1f854112553a1d65363ab27d4ee3dfb4b27075fb: KVM: vmx: DR7 masking on task switch emulation is wrong (2014-05-22 17:47:18 +0200) are available in the git repository at: git://github.com/agraf/linux-2.6.git tags/signed-kvm-ppc-next for you to fetch changes up to d8d164a9850d486cc48081c18831680254688d0f: KVM: PPC: Book3S PR: Rework SLB switching code (2014-05-30 14:26:30 +0200) Patch queue for ppc - 2014-05-30 In this round we have a few nice gems. PR KVM gains initial POWER8 support as well as LE host awareness, the e500 targets can now properly run u-boot, LE guests now work with PR KVM including KVM hypercalls and HV KVM guests can now use huge pages. On top of this there are some bug fixes. Alexander Graf (27): KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR KVM: PPC: E500: Add dcbtls emulation KVM: PPC: Book3S: PR: Fix C/R bit setting KVM: PPC: Book3S_32: PR: Access HTAB in big endian KVM: PPC: Book3S_64 PR: Access HTAB in big endian KVM: PPC: Book3S_64 PR: Access shadow slb in big endian KVM: PPC: Book3S PR: Default to big endian guest KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian KVM: PPC: PR: Fill pvinfo hcall instructions in big endian KVM: PPC: Make shared struct aka magic page guest endian KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions KVM: PPC: Book3S: Move little endian conflict to HV KVM KVM: PPC: Book3S PR: Ignore PMU SPRs KVM: PPC: Book3S PR: Emulate TIR register KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR KVM: PPC: Book3S PR: Expose TAR facility to guest KVM: PPC: Book3S PR: Expose EBB registers KVM: PPC: Book3S PR: Expose TM registers KVM: PPC: Disable NX for old magic page using guests PPC: KVM: Make NX bit available with magic page PPC: ePAPR: Fix hypercall on LE guest KVM: PPC: Graciously fail broken LE hypercalls KVM: PPC: MPIC: Reset IRQ source
private members KVM: PPC: Add CAP to indicate hcall fixes KVM: PPC: Book3S PR: Use SLB entry 0 KVM: PPC: Book3S PR: Rework SLB switching code Alexey Kardashevskiy (1): KVM: PPC: Book3S HV: Fix dirty map for hugepages Aneesh Kumar K.V (6): KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest KVM: PPC: BOOK3S: Always use the saved DAR value KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler Paul Mackerras (7): KVM: PPC: Book3S: Add ONE_REG register names that were missed KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates() KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address KVM: PPC: Book3S HV: Make sure we don't miss dirty pages KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs KVM: PPC: Book3S HV: Fix machine check delivery to guest Documentation/virtual/kvm/api.txt | 6 + Documentation/virtual/kvm/ppc-pv.txt | 14 ++ arch/powerpc/include/asm/disassemble.h| 34 + arch/powerpc/include/asm/kvm_asm.h| 18 ++- arch/powerpc/include/asm/kvm_book3s.h | 3 +- arch/powerpc/include/asm/kvm_book3s_64.h | 146 +++--- arch/powerpc/include/asm/kvm_book3s_asm.h | 2 + arch/powerpc/include/asm/kvm_booke.h | 5 - arch/powerpc/include/asm/kvm_host.h | 9 +- arch/powerpc/include/asm/kvm_ppc.h| 80 +- arch/powerpc/include/asm/reg.h| 12 +- arch/powerpc/include/asm/reg_booke.h | 1 + arch/powerpc/include/uapi/asm/kvm.h | 2 +- arch/powerpc/include/uapi/asm/kvm_para.h | 6 + arch/powerpc/kernel/align.c | 34 + arch/powerpc/kernel/asm-offsets.c | 11 +- arch/powerpc/kernel/epapr_paravirt.c | 5 +- arch/powerpc/kernel/kvm.c | 2 +- arch/powerpc/kernel/paca.c| 3 + arch/powerpc/kvm/Kconfig | 2 +- arch/powerpc/kvm/book3s.c | 106 - arch/powerpc/kvm/book3s_32_mmu.c | 41 ++--- 
arch/powerpc/kvm/book3s_32_mmu_host.c | 4 +- arch/powerpc/kvm/book3s_64_mmu.c | 39 +++-- arch/powerpc/kvm/book3s_64_mmu_host.c | 15 +- arch/powerpc/kvm/book3s_64_mmu_hv.c | 116 ++- arch/powerpc/kvm/book3s_64_slb.S | 87 +-- arch/powerpc/kvm/book3s_emulate.c | 156
[PULL 05/41] KVM: PPC: Book3S: PR: Fix C/R bit setting
Commit 9308ab8e2d made C/R HTAB updates go byte-wise into the target HTAB. However, it didn't update the guest's copy of the HTAB, but instead the host local copy of it. Write to the guest's HTAB instead. Signed-off-by: Alexander Graf ag...@suse.de CC: Paul Mackerras pau...@samba.org Acked-by: Paul Mackerras pau...@samba.org --- arch/powerpc/kvm/book3s_32_mmu.c | 2 +- arch/powerpc/kvm/book3s_64_mmu.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c index 76a64ce..60fc3f4 100644 --- a/arch/powerpc/kvm/book3s_32_mmu.c +++ b/arch/powerpc/kvm/book3s_32_mmu.c @@ -270,7 +270,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu *vcpu, gva_t eaddr, page */ if (found) { u32 pte_r = pteg[i+1]; - char __user *addr = (char __user *) pteg[i+1]; + char __user *addr = (char __user *) (ptegp + (i+1) * sizeof(u32)); /* * Use single-byte writes to update the HPTE, to diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c index 8231b83..171e5ca 100644 --- a/arch/powerpc/kvm/book3s_64_mmu.c +++ b/arch/powerpc/kvm/book3s_64_mmu.c @@ -342,14 +342,14 @@ do_second: * non-PAPR platforms such as mac99, and this is * what real hardware does. */ - char __user *addr = (char __user *) pteg[i+1]; +char __user *addr = (char __user *) (ptegp + (i + 1) * sizeof(u64)); r |= HPTE_R_R; put_user(r 8, addr + 6); } if (iswrite gpte-may_write !(r HPTE_R_C)) { /* Set the dirty flag */ /* Use a single byte write */ - char __user *addr = (char __user *) pteg[i+1]; +char __user *addr = (char __user *) (ptegp + (i + 1) * sizeof(u64)); r |= HPTE_R_C; put_user(r, addr + 7); } -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL 03/41] KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest
From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com This patch make sure we inherit the LE bit correctly in different case so that we can run Little Endian distro in PR mode Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/include/asm/kvm_host.h | 2 +- arch/powerpc/kernel/asm-offsets.c | 2 +- arch/powerpc/kvm/book3s_64_mmu.c| 2 +- arch/powerpc/kvm/book3s_pr.c| 23 ++- 4 files changed, 25 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 1eaea2d..d342f8e 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -562,6 +562,7 @@ struct kvm_vcpu_arch { #ifdef CONFIG_PPC_BOOK3S ulong fault_dar; u32 fault_dsisr; + unsigned long intr_msr; #endif #ifdef CONFIG_BOOKE @@ -654,7 +655,6 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; - unsigned long intr_msr; #endif }; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index dba8140..6a4b77d 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -493,7 +493,6 @@ int main(void) DEFINE(VCPU_DAR, offsetof(struct kvm_vcpu, arch.shregs.dar)); DEFINE(VCPU_VPA, offsetof(struct kvm_vcpu, arch.vpa.pinned_addr)); DEFINE(VCPU_VPA_DIRTY, offsetof(struct kvm_vcpu, arch.vpa.dirty)); - DEFINE(VCPU_INTR_MSR, offsetof(struct kvm_vcpu, arch.intr_msr)); #endif #ifdef CONFIG_PPC_BOOK3S DEFINE(VCPU_VCPUID, offsetof(struct kvm_vcpu, vcpu_id)); @@ -528,6 +527,7 @@ int main(void) DEFINE(VCPU_SLB_NR, offsetof(struct kvm_vcpu, arch.slb_nr)); DEFINE(VCPU_FAULT_DSISR, offsetof(struct kvm_vcpu, arch.fault_dsisr)); DEFINE(VCPU_FAULT_DAR, offsetof(struct kvm_vcpu, arch.fault_dar)); + DEFINE(VCPU_INTR_MSR, offsetof(struct kvm_vcpu, arch.intr_msr)); DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst)); DEFINE(VCPU_TRAP, offsetof(struct kvm_vcpu, arch.trap)); 
DEFINE(VCPU_CFAR, offsetof(struct kvm_vcpu, arch.cfar)); diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c index 83da1f8..8231b83 100644 --- a/arch/powerpc/kvm/book3s_64_mmu.c +++ b/arch/powerpc/kvm/book3s_64_mmu.c @@ -38,7 +38,7 @@ static void kvmppc_mmu_book3s_64_reset_msr(struct kvm_vcpu *vcpu) { - kvmppc_set_msr(vcpu, MSR_SF); + kvmppc_set_msr(vcpu, vcpu-arch.intr_msr); } static struct kvmppc_slb *kvmppc_mmu_book3s_64_find_slbe( diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index c5c052a..8c05cb5 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -249,7 +249,7 @@ static void kvmppc_recalc_shadow_msr(struct kvm_vcpu *vcpu) ulong smsr = vcpu-arch.shared-msr; /* Guest MSR values */ - smsr = MSR_FE0 | MSR_FE1 | MSR_SF | MSR_SE | MSR_BE; + smsr = MSR_FE0 | MSR_FE1 | MSR_SF | MSR_SE | MSR_BE | MSR_LE; /* Process MSR values */ smsr |= MSR_ME | MSR_RI | MSR_IR | MSR_DR | MSR_PR | MSR_EE; /* External providers the guest reserved */ @@ -1110,6 +1110,15 @@ static int kvmppc_get_one_reg_pr(struct kvm_vcpu *vcpu, u64 id, case KVM_REG_PPC_HIOR: *val = get_reg_val(id, to_book3s(vcpu)-hior); break; + case KVM_REG_PPC_LPCR: + /* +* We are only interested in the LPCR_ILE bit +*/ + if (vcpu-arch.intr_msr MSR_LE) + *val = get_reg_val(id, LPCR_ILE); + else + *val = get_reg_val(id, 0); + break; default: r = -EINVAL; break; @@ -1118,6 +1127,14 @@ static int kvmppc_get_one_reg_pr(struct kvm_vcpu *vcpu, u64 id, return r; } +static void kvmppc_set_lpcr_pr(struct kvm_vcpu *vcpu, u64 new_lpcr) +{ + if (new_lpcr LPCR_ILE) + vcpu-arch.intr_msr |= MSR_LE; + else + vcpu-arch.intr_msr = ~MSR_LE; +} + static int kvmppc_set_one_reg_pr(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *val) { @@ -1128,6 +1145,9 @@ static int kvmppc_set_one_reg_pr(struct kvm_vcpu *vcpu, u64 id, to_book3s(vcpu)-hior = set_reg_val(id, *val); to_book3s(vcpu)-hior_explicit = true; break; + case KVM_REG_PPC_LPCR: + 
kvmppc_set_lpcr_pr(vcpu, set_reg_val(id, *val)); + break; default: r = -EINVAL; break; @@ -1180,6 +1200,7 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_pr(struct kvm *kvm, vcpu-arch.pvr = 0x3C0301; if (mmu_has_feature(MMU_FTR_1T_SEGMENT)) vcpu-arch.pvr = mfspr(SPRN_PVR);
[PULL 06/41] KVM: PPC: Book3S_32: PR: Access HTAB in big endian
The HTAB is always big endian. We access the guest's HTAB using copy_from/to_user, but don't yet take care of the fact that we might be running on an LE host. Wrap all accesses to the guest HTAB with big endian accessors. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_32_mmu.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c index 60fc3f4..0e42b16 100644 --- a/arch/powerpc/kvm/book3s_32_mmu.c +++ b/arch/powerpc/kvm/book3s_32_mmu.c @@ -208,6 +208,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu *vcpu, gva_t eaddr, u32 sre; hva_t ptegp; u32 pteg[16]; + u32 pte0, pte1; u32 ptem = 0; int i; int found = 0; @@ -233,11 +234,13 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu *vcpu, gva_t eaddr, } for (i=0; i16; i+=2) { - if (ptem == pteg[i]) { + pte0 = be32_to_cpu(pteg[i]); + pte1 = be32_to_cpu(pteg[i + 1]); + if (ptem == pte0) { u8 pp; - pte-raddr = (pteg[i+1] ~(0xFFFULL)) | (eaddr 0xFFF); - pp = pteg[i+1] 3; + pte-raddr = (pte1 ~(0xFFFULL)) | (eaddr 0xFFF); + pp = pte1 3; if ((sr_kp(sre) (vcpu-arch.shared-msr MSR_PR)) || (sr_ks(sre) !(vcpu-arch.shared-msr MSR_PR))) @@ -260,7 +263,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu *vcpu, gva_t eaddr, } dprintk_pte(MMU: Found PTE - %x %x - %x\n, - pteg[i], pteg[i+1], pp); + pte0, pte1, pp); found = 1; break; } @@ -269,7 +272,7 @@ static int kvmppc_mmu_book3s_32_xlate_pte(struct kvm_vcpu *vcpu, gva_t eaddr, /* Update PTE C and A bits, so the guest's swapper knows we used the page */ if (found) { - u32 pte_r = pteg[i+1]; + u32 pte_r = pte1; char __user *addr = (char __user *) (ptegp + (i+1) * sizeof(u32)); /* @@ -296,7 +299,8 @@ no_page_found: to_book3s(vcpu)-sdr1, ptegp); for (i=0; i16; i+=2) { dprintk_pte( %02d: 0x%x - 0x%x (0x%x)\n, - i, pteg[i], pteg[i+1], ptem); + i, be32_to_cpu(pteg[i]), + be32_to_cpu(pteg[i+1]), ptem); } } -- 1.8.1.4 -- To unsubscribe from this 
list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL 08/41] KVM: PPC: Book3S_64 PR: Access shadow slb in big endian
The shadow SLB in the PACA is shared with the hypervisor, so it has to be big endian. We access the shadow SLB during world switch, so let's make sure we access it in big endian even when we're on a little endian host. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/book3s_64_slb.S | 33 - 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S index 4f12e8f..596140e 100644 --- a/arch/powerpc/kvm/book3s_64_slb.S +++ b/arch/powerpc/kvm/book3s_64_slb.S @@ -17,29 +17,28 @@ * Authors: Alexander Graf ag...@suse.de */ -#ifdef __LITTLE_ENDIAN__ -#error Need to fix SLB shadow accesses in little endian mode -#endif - #define SHADOW_SLB_ESID(num) (SLBSHADOW_SAVEAREA + (num * 0x10)) #define SHADOW_SLB_VSID(num) (SLBSHADOW_SAVEAREA + (num * 0x10) + 0x8) #define UNBOLT_SLB_ENTRY(num) \ - ld r9, SHADOW_SLB_ESID(num)(r12); \ - /* Invalid? Skip. */; \ - rldicl. r0, r9, 37, 63; \ - beq slb_entry_skip_ ## num; \ - xoris r9, r9, SLB_ESID_V@h; \ - std r9, SHADOW_SLB_ESID(num)(r12); \ + li r11, SHADOW_SLB_ESID(num); \ + LDX_BE r9, r12, r11; \ + /* Invalid? Skip. */; \ + rldicl. 
r0, r9, 37, 63; \ + beq slb_entry_skip_ ## num; \ + xoris r9, r9, SLB_ESID_V@h; \ + STDX_BE r9, r12, r11; \ slb_entry_skip_ ## num: #define REBOLT_SLB_ENTRY(num) \ - ld r10, SHADOW_SLB_ESID(num)(r11); \ - cmpdi r10, 0; \ - beq slb_exit_skip_ ## num; \ - orisr10, r10, SLB_ESID_V@h; \ - ld r9, SHADOW_SLB_VSID(num)(r11); \ - slbmte r9, r10; \ - std r10, SHADOW_SLB_ESID(num)(r11); \ + li r8, SHADOW_SLB_ESID(num); \ + li r7, SHADOW_SLB_VSID(num); \ + LDX_BE r10, r11, r8; \ + cmpdi r10, 0; \ + beq slb_exit_skip_ ## num; \ + orisr10, r10, SLB_ESID_V@h; \ + LDX_BE r9, r11, r7;\ + slbmte r9, r10;\ + STDX_BE r10, r11, r8; \ slb_exit_skip_ ## num: /** -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PULL 00/41] ppc patch queue 2014-05-30
On 30/05/2014 14:42, Alexander Graf wrote: Hi Paolo / Marcelo, This is my current patch queue for ppc. Please pull. Alex The following changes since commit 1f854112553a1d65363ab27d4ee3dfb4b27075fb: KVM: vmx: DR7 masking on task switch emulation is wrong (2014-05-22 17:47:18 +0200) are available in the git repository at: git://github.com/agraf/linux-2.6.git tags/signed-kvm-ppc-next for you to fetch changes up to d8d164a9850d486cc48081c18831680254688d0f: KVM: PPC: Book3S PR: Rework SLB switching code (2014-05-30 14:26:30 +0200) Patch queue for ppc - 2014-05-30 In this round we have a few nice gems. PR KVM gains initial POWER8 support as well as LE host awareness, the e500 targets can now properly run u-boot, LE guests now work with PR KVM including KVM hypercalls and HV KVM guests can now use huge pages. On top of this there are some bug fixes. Thanks for sending the patches well before the merge window! There is a conflict in capability numbers. KVM_CAP_PPC_FIXUP_HCALL is 102 on the branch, but will be 103 when I merge. This will be a very large release for KVM, with over 200 patches scattered over all architectures except ia64 (~25 MIPS, ~20 ARM, ~40 PPC, ~35 x86, ~80 s390). 
Paolo Alexander Graf (27): KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR KVM: PPC: E500: Add dcbtls emulation KVM: PPC: Book3S: PR: Fix C/R bit setting KVM: PPC: Book3S_32: PR: Access HTAB in big endian KVM: PPC: Book3S_64 PR: Access HTAB in big endian KVM: PPC: Book3S_64 PR: Access shadow slb in big endian KVM: PPC: Book3S PR: Default to big endian guest KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian KVM: PPC: PR: Fill pvinfo hcall instructions in big endian KVM: PPC: Make shared struct aka magic page guest endian KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions KVM: PPC: Book3S: Move little endian conflict to HV KVM KVM: PPC: Book3S PR: Ignore PMU SPRs KVM: PPC: Book3S PR: Emulate TIR register KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR KVM: PPC: Book3S PR: Expose TAR facility to guest KVM: PPC: Book3S PR: Expose EBB registers KVM: PPC: Book3S PR: Expose TM registers KVM: PPC: Disable NX for old magic page using guests PPC: KVM: Make NX bit available with magic page PPC: ePAPR: Fix hypercall on LE guest KVM: PPC: Graciously fail broken LE hypercalls KVM: PPC: MPIC: Reset IRQ source private members KVM: PPC: Add CAP to indicate hcall fixes KVM: PPC: Book3S PR: Use SLB entry 0 KVM: PPC: Book3S PR: Rework SLB switching code Alexey Kardashevskiy (1): KVM: PPC: Book3S HV: Fix dirty map for hugepages Aneesh Kumar K.V (6): KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest KVM: PPC: BOOK3S: Always use the saved DAR value KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler Paul Mackerras (7): KVM: PPC: Book3S: Add ONE_REG register names that were missed KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number KVM: PPC: Book3S HV: Fix check for running inside guest in 
global_invalidates() KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address KVM: PPC: Book3S HV: Make sure we don't miss dirty pages KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs KVM: PPC: Book3S HV: Fix machine check delivery to guest Documentation/virtual/kvm/api.txt | 6 + Documentation/virtual/kvm/ppc-pv.txt | 14 ++ arch/powerpc/include/asm/disassemble.h| 34 + arch/powerpc/include/asm/kvm_asm.h| 18 ++- arch/powerpc/include/asm/kvm_book3s.h | 3 +- arch/powerpc/include/asm/kvm_book3s_64.h | 146 +++--- arch/powerpc/include/asm/kvm_book3s_asm.h | 2 + arch/powerpc/include/asm/kvm_booke.h | 5 - arch/powerpc/include/asm/kvm_host.h | 9 +- arch/powerpc/include/asm/kvm_ppc.h| 80 +- arch/powerpc/include/asm/reg.h| 12 +- arch/powerpc/include/asm/reg_booke.h | 1 + arch/powerpc/include/uapi/asm/kvm.h | 2 +- arch/powerpc/include/uapi/asm/kvm_para.h | 6 + arch/powerpc/kernel/align.c | 34 + arch/powerpc/kernel/asm-offsets.c | 11 +- arch/powerpc/kernel/epapr_paravirt.c | 5 +- arch/powerpc/kernel/kvm.c | 2 +- arch/powerpc/kernel/paca.c| 3 + arch/powerpc/kvm/Kconfig | 2 +- arch/powerpc/kvm/book3s.c | 106
Re: [PULL 00/41] ppc patch queue 2014-05-30
On 30.05.2014, at 14:58, Paolo Bonzini pbonz...@redhat.com wrote: On 30/05/2014 14:42, Alexander Graf wrote: Hi Paolo / Marcelo, This is my current patch queue for ppc. Please pull. Alex The following changes since commit 1f854112553a1d65363ab27d4ee3dfb4b27075fb: KVM: vmx: DR7 masking on task switch emulation is wrong (2014-05-22 17:47:18 +0200) are available in the git repository at: git://github.com/agraf/linux-2.6.git tags/signed-kvm-ppc-next for you to fetch changes up to d8d164a9850d486cc48081c18831680254688d0f: KVM: PPC: Book3S PR: Rework SLB switching code (2014-05-30 14:26:30 +0200) Patch queue for ppc - 2014-05-30 In this round we have a few nice gems. PR KVM gains initial POWER8 support as well as LE host awareness, the e500 targets can now properly run u-boot, LE guests now work with PR KVM including KVM hypercalls and HV KVM guests can now use huge pages. On top of this there are some bug fixes. Thanks for sending the patches well before the merge window! Heh, I figured I'd be nice for a change. And my qemu queue is beyond 100 patches already and waiting on this one ;). There is a conflict in capability numbers. KVM_CAP_PPC_FIXUP_HCALL is 102 on the branch, but will be 103 when I merge. That's ok, I've waited for the consumer of this cap until now, so the wrong number will be unused. Thanks a lot for the heads-up though :). And thanks for merging! This will be a very large release for KVM, with over 200 patches scattered over all architectures except ia64 (~25 MIPS, ~20 ARM, ~40 PPC, ~35 x86, ~80 s390). Woot, nice! Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
does anybody still care about kvm-ia64?
I was thinking of removing it in Linux 3.17. I'm not even sure it compiles right now, it hasn't seen any action in years, and all open-source userspace code to use it has been dead for years. If you disagree, please speak up loudly in the next month. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 07/16] qspinlock: Use a simple write to grab the lock, if applicable
Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head doesn't need to reset the queue code word. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the queue code word (tail) in the lock is set.

With that change, there is some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westmere-EX machine, as shown in the tables below.

[Standalone/Embedded - same node]
# of tasks   Before patch   After patch   %Change
----------   ------------   -----------   -------
3            2324/2321      2248/2265     -3%/-2%
4            2890/2896      2819/2831     -2%/-2%
5            3611/3595      3522/3512     -2%/-2%
6            4281/4276      4173/4160     -3%/-3%
7            5018/5001      4875/4861     -3%/-3%
8            5759/5750      5563/5568     -3%/-3%

[Standalone/Embedded - different nodes]
# of tasks   Before patch    After patch    %Change
----------   ------------    -----------    -------
3            12242/12237     12087/12093    -1%/-1%
4            10688/10696     10507/10521    -2%/-2%

It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and essentially closed the performance gap between the ticket spinlock and the queue spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
kernel       JPM       Real Time   Sys Time   Usr Time
------       ---       ---------   --------   --------
ticketlock   5678233   3.17        96.61      5.81
qspinlock    5750799   3.13        94.83      5.97

AIM7 EXT4 Disk Test
kernel       JPM       Real Time   Sys Time   Usr Time
------       ---       ---------   --------   --------
ticketlock   1114551   16.15       509.72     7.11
qspinlock    2184466   8.24        232.99     6.01

The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. 
The ebizzy -m test was also run with the following results: kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 62 --- 1 files changed, 46 insertions(+), 16 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 7f10758..2c7abe7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) * By using the whole 2nd least significant byte for the pending bit, we * can allow better optimization of the lock acquisition for the pending * bit holder. + * + * This internal structure is also used by the set_locked function which + * is not restricted to _Q_PENDING_BITS == 8. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + u8 locked; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[3]; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -197,6 +206,22 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) #endif /* _Q_PENDING_BITS == 8 */ /** + * set_locked - Set the lock bit and own the lock + * @lock: Pointer to queue spinlock structure + * + * This routine should only be called when the caller is the only one + * entitled to acquire the lock. + */ +static __always_inline void set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + barrier(); + ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL; + barrier(); +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock
[PATCH v11 16/16] pvqspinlock, x86: Enable PV qspinlock for XEN
This patch adds the necessary XEN specific code to allow XEN to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/xen/spinlock.c | 147 +-- kernel/Kconfig.locks|2 +- 2 files changed, 143 insertions(+), 6 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index d1b6a32..2a259bb 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -17,6 +17,12 @@ #include xen-ops.h #include debugfs.h +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; +static DEFINE_PER_CPU(char *, irq_name); +static bool xen_pvspin = true; + +#ifndef CONFIG_QUEUE_SPINLOCK + enum xen_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -100,12 +106,9 @@ struct xen_lock_waiting { __ticket_t want; }; -static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; -static DEFINE_PER_CPU(char *, irq_name); static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting); static cpumask_t waiting_cpus; -static bool xen_pvspin = true; __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want) { int irq = __this_cpu_read(lock_kicker_irq); @@ -213,6 +216,118 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next) } } +#else /* CONFIG_QUEUE_SPINLOCK */ + +#ifdef CONFIG_XEN_DEBUG_FS +static u32 kick_nohlt_stats; /* Kick but not halt count */ +static u32 halt_qhead_stats; /* Queue head halting count */ +static u32 halt_qnode_stats; /* Queue node halting count */ +static u32 halt_abort_stats; /* Halting abort count */ +static u32 wake_kick_stats;/* Wakeup by kicking count */ +static u32 wake_spur_stats;/* Spurious wakeup count*/ +static u64 time_blocked; /* Total blocking time */ + +static inline void xen_halt_stats(enum pv_lock_stats type) +{ + if (type == PV_HALT_QHEAD) + add_smp(halt_qhead_stats, 1); + else if (type == PV_HALT_QNODE) + add_smp(halt_qnode_stats, 1); + else /* type == PV_HALT_ABORT */ + add_smp(halt_abort_stats, 1); +} + +static inline void 
xen_lock_stats(enum pv_lock_stats type) +{ + if (type == PV_WAKE_KICKED) + add_smp(wake_kick_stats, 1); + else if (type == PV_WAKE_SPURIOUS) + add_smp(wake_spur_stats, 1); + else /* type == PV_KICK_NOHALT */ + add_smp(kick_nohlt_stats, 1); +} + +static inline u64 spin_time_start(void) +{ + return sched_clock(); +} + +static inline void spin_time_accum_blocked(u64 start) +{ + u64 delta; + + delta = sched_clock() - start; + add_smp(time_blocked, delta); +} +#else /* CONFIG_XEN_DEBUG_FS */ +static inline void xen_halt_stats(enum pv_lock_stats type) +{ +} + +static inline void xen_lock_stats(enum pv_lock_stats type) +{ +} + +static inline u64 spin_time_start(void) +{ + return 0; +} + +static inline void spin_time_accum_blocked(u64 start) +{ +} +#endif /* CONFIG_XEN_DEBUG_FS */ + +static void xen_kick_cpu(int cpu) +{ + xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR); +} + +/* + * Halt the current CPU release it back to the host + */ +static void xen_halt_cpu(enum pv_lock_stats type, s8 *state, s8 sval) +{ + int irq = __this_cpu_read(lock_kicker_irq); + unsigned long flags; + u64 start; + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return; + + /* +* Make sure an interrupt handler can't upset things in a +* partially setup state. +*/ + local_irq_save(flags); + start = spin_time_start(); + + xen_halt_stats(type); + /* clear pending */ + xen_clear_irq_pending(irq); + + /* Allow interrupts while blocked */ + local_irq_restore(flags); + /* +* Don't halt if the CPU state has been changed. +*/ + if (ACCESS_ONCE(*state) != sval) { + xen_halt_stats(PV_HALT_ABORT); + return; + } + /* +* If an interrupt happens here, it will leave the wakeup irq +* pending, which will cause xen_poll_irq() to return +* immediately. 
+*/ + + /* Block until irq becomes pending (or perhaps a spurious wakeup) */ + xen_poll_irq(irq); + spin_time_accum_blocked(start); +} +#endif /* CONFIG_QUEUE_SPINLOCK */ + static irqreturn_t dummy_handler(int irq, void *dev_id) { BUG(); @@ -258,7 +373,6 @@ void xen_uninit_lock_cpu(int cpu) per_cpu(irq_name, cpu) = NULL; } - /* * Our init of PV spinlocks is split in two init functions due to us * using paravirt patching and jump labels patching and having to do @@ -275,8 +389,15 @@ void __init xen_init_spinlocks(void) return; } printk(KERN_DEBUG
[PATCH v11 15/16] pvqspinlock, x86: Enable PV qspinlock for KVM
This patch adds the necessary KVM specific code to allow KVM to support the CPU halting and kicking operations needed by the queue spinlock PV code. Two KVM guests of 20 CPU cores (2 nodes) were created for performance testing in one of the following three configurations:
 1) Only 1 VM is active
 2) Both VMs are active and they share the same 20 physical CPUs
    (200% overcommit)
The tests run included the disk workload of the AIM7 benchmark on both ext4 and xfs RAM disks at 3000 users on a 3.15-rc7 based kernel. The ebizzy -m test was also run and its performance data were recorded. With two VMs running, the idle=poll kernel option was added to simulate a busy guest. The entry "unfair + PV qspinlock" below means that both the unfair lock and PV spinlock configuration options were turned on.

  AIM7 XFS Disk Test (no overcommit)
  kernel                    JPM     Real Time  Sys Time  Usr Time
  ------                    ---     ---------  --------  --------
  PV ticketlock           2521008     7.24      101.02     5.24
  qspinlock               2571429     7.00       99.10     5.49
  PV qspinlock            2535211     7.10      100.32     5.45
  unfair qspinlock        2571429     7.00       99.25     5.40
  unfair + PV qspinlock   2549575     7.06       99.81     5.31

  AIM7 XFS Disk Test (200% overcommit)
  kernel                    JPM     Real Time  Sys Time  Usr Time
  ------                    ---     ---------  --------  --------
  PV ticketlock            768902    23.41      341.71     3.07
  qspinlock                784656    22.94      346.22     2.90
  PV qspinlock             773861    23.26      352.47     2.30
  unfair qspinlock         835655    21.54      316.52     1.57
  unfair + PV qspinlock    797165    22.58      323.95     3.58

  AIM7 EXT4 Disk Test (no overcommit)
  kernel                    JPM     Real Time  Sys Time  Usr Time
  ------                    ---     ---------  --------  --------
  PV ticketlock           1956522     9.20      106.58     5.35
  qspinlock               1995565     9.02      103.19     5.37
  PV qspinlock            1958651     9.19      106.57     5.30
  unfair qspinlock        2022472     8.90      103.58     5.37
  unfair + PV qspinlock   1991150     9.04      104.41     5.46

  AIM7 EXT4 Disk Test (200% overcommit)
  kernel                    JPM     Real Time  Sys Time  Usr Time
  ------                    ---     ---------  --------  --------
  PV ticketlock            576553    31.22      407.44     1.51
  qspinlock                609550    29.53      407.14     1.69
  PV qspinlock             592105    30.40      410.51     1.67
  unfair qspinlock         672897    26.75      359.78     1.66
  unfair + PV qspinlock    670391    26.85      357.09     0.63

  EBIZZY-M Test (no overcommit)
  kernel                  Rec/s   Real Time  Sys Time  Usr Time
  ------                  -----   ---------  --------  --------
  PV ticketlock            1328     10.00      82.82     1.46
  qspinlock                1679     10.00      65.37     1.80
  PV qspinlock             1470     10.00      75.54     1.54
  unfair qspinlock         1518     10.00      70.80     1.71
  unfair + PV qspinlock    1585     10.00      69.02     1.76

  EBIZZY-M Test (200% overcommit)
  kernel                  Rec/s   Real Time  Sys Time  Usr Time
  ------                  -----   ---------  --------  --------
  PV ticketlock             453     10.00      77.11     0.00
  qspinlock                 459     10.00      77.50     0.00
  PV qspinlock              402     10.00      91.55     0.00
  unfair qspinlock          570     10.00      62.98     0.00
  unfair + PV qspinlock     586     10.00      59.68     0.00

Signed-off-by: Waiman Long waiman.l...@hp.com
Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/kernel/kvm.c | 135 +
 kernel/Kconfig.locks  |   2 +-
 2 files changed, 136 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7ab8ab3..eef427b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -567,6 +567,7 @@ static void kvm_kick_cpu(int cpu)
 	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
 	TAKEN_SLOW,
 	TAKEN_SLOW_PICKUP,
@@ -794,6 +795,134 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 		}
 	}
 }
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 kick_nohlt_stats;	/* Kick but not halt count */
+static u32 halt_qhead_stats;	/* Queue head halting count */
+static u32 halt_qnode_stats;	/* Queue node halting count */
+static u32 halt_abort_stats;	/* Halting abort count */
+static u32 wake_kick_stats;
[PATCH v11 13/16] pvqspinlock: Enable coexistence with the unfair lock
This patch enables the coexistence of both the PV qspinlock and unfair lock. When both are enabled, however, only the lock fastpath will perform lock stealing whereas the slowpath will have that disabled to get the best of both features. We also need to transition a CPU spinning too long in the pending bit code path back to the regular queuing code path so that it can be properly halted by the PV qspinlock code. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 47 --- 1 files changed, 43 insertions(+), 4 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 93c663a..8deedcf 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -57,12 +57,24 @@ #include mcs_spinlock.h /* + * Check the pending bit spinning threshold only if PV qspinlock is enabled + */ +#define PSPIN_THRESHOLD(1 10) +#define MAX_NODES 4 + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define pv_qspinlock_enabled() static_key_false(paravirt_spinlocks_enabled) +#else +#define pv_qspinlock_enabled() false +#endif + +/* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * * Exactly fits one cacheline. 
*/ -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]); +static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]); /* * We must be able to distinguish between no-tail and the tail at 0:0, @@ -265,6 +277,9 @@ static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock, ACCESS_ONCE(prev-next) = node; arch_mcs_spin_lock_contended(node-locked); + } else { + /* Mark it as the queue head */ + ACCESS_ONCE(node-locked) = true; } /* @@ -344,14 +359,17 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) struct mcs_spinlock *node; u32 new, old, tail; int idx; + int retry = INT_MAX;/* Retry count, queue if = 0 */ BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); #ifdef CONFIG_VIRT_UNFAIR_LOCKS /* * A simple test and set unfair lock +* Disable waiter lock stealing if PV spinlock is enabled */ - if (static_key_false(virt_unfairlocks_enabled)) { + if (!pv_qspinlock_enabled() + static_key_false(virt_unfairlocks_enabled)) { cpu_relax();/* Relax after a failed lock attempt */ while (!queue_spin_trylock(lock)) cpu_relax(); @@ -360,6 +378,14 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) #endif /* CONFIG_VIRT_UNFAIR_LOCKS */ /* +* When PV qspinlock is enabled, exit the pending bit code path and +* go back to the regular queuing path if the lock isn't available +* within a certain threshold. +*/ + if (pv_qspinlock_enabled()) + retry = PSPIN_THRESHOLD; + + /* * trylock || pending * * 0,0,0 - 0,0,1 ; trylock @@ -370,7 +396,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) * If we observe that the queue is not empty or both * the pending and lock bits are set, queue */ - if ((val _Q_TAIL_MASK) || + if ((val _Q_TAIL_MASK) || (retry-- = 0) || (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL))) goto queue; @@ -413,8 +439,21 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) * sequentiality; this because not all clear_pending_set_locked() * implementations imply full barriers. 
*/ - while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_MASK) + while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_MASK) { + if (pv_qspinlock_enabled() (retry-- = 0)) { + /* +* Clear the pending bit and queue +*/ + for (;;) { + new = val ~_Q_PENDING_MASK; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + goto queue; + val = old; + } + } arch_mutex_cpu_relax(); + } /* * take ownership and clear the pending bit. -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 12/16] pvqspinlock, x86: Add PV data structure methods
This patch modifies the para-virtualization (PV) infrastructure code of the x86-64 architecture to support the PV queue spinlock. Three new virtual methods are added to support PV qspinlock: 1) kick_cpu - schedule in a virtual CPU 2) halt_cpu - schedule out a virtual CPU 3) lockstat - update statistical data for debugfs Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/paravirt.h | 18 +- arch/x86/include/asm/paravirt_types.h | 17 + arch/x86/kernel/paravirt-spinlocks.c |6 ++ 3 files changed, 40 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index cd6e161..d71e123 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -711,7 +711,23 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, } #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUE_SPINLOCK +static __always_inline void __queue_kick_cpu(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick_cpu, cpu); +} + +static __always_inline void +__queue_halt_cpu(enum pv_lock_stats type, s8 *state, s8 sval) +{ + PVOP_VCALL3(pv_lock_ops.halt_cpu, type, state, sval); +} +static __always_inline void __queue_lockstat(enum pv_lock_stats type) +{ + PVOP_VCALL1(pv_lock_ops.lockstat, type); +} +#else static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -723,7 +739,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, { PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } - +#endif #endif #ifdef CONFIG_X86_32 diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..549b3a0 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,26 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +#ifdef CONFIG_QUEUE_SPINLOCK +enum pv_lock_stats { + PV_HALT_QHEAD, /* Queue head halting */ + PV_HALT_QNODE, /* Other 
queue node halting */ + PV_HALT_ABORT, /* Halting aborted */ + PV_WAKE_KICKED, /* Wakeup by kicking*/ + PV_WAKE_SPURIOUS, /* Spurious wakeup */ + PV_KICK_NOHALT /* Kick but CPU not halted */ +}; +#endif + struct pv_lock_ops { +#ifdef CONFIG_QUEUE_SPINLOCK + void (*kick_cpu)(int cpu); + void (*halt_cpu)(enum pv_lock_stats type, s8 *state, s8 sval); + void (*lockstat)(enum pv_lock_stats type); +#else struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket); +#endif }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index d9041c9..17435b7 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -11,9 +11,15 @@ #ifdef CONFIG_PARAVIRT_SPINLOCKS struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP +#ifdef CONFIG_QUEUE_SPINLOCK + .kick_cpu = paravirt_nop, + .halt_cpu = paravirt_nop, + .lockstat = paravirt_nop, +#else .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop), .unlock_kick = paravirt_nop, #endif +#endif }; EXPORT_SYMBOL(pv_lock_ops); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 14/16] pvqspinlock: Add qspinlock para-virtualization support
This patch adds base para-virtualization support to the queue spinlock in the same way as was done in the PV ticket lock code. In essence, the lock waiters will spin for a specified number of times (QSPIN_THRESHOLD = 2^14) and then halt themselves. The queue head waiter, unlike the other waiters, will spin 2*QSPIN_THRESHOLD times before halting itself. Before being halted, the queue head waiter will set a flag (_Q_LOCKED_SLOWPATH) in the lock byte to indicate that the unlock slowpath has to be invoked. In the unlock slowpath, the current lock holder will find the queue head by following the previous-node pointer links stored in the queue node structure until it finds one that has the qhead flag turned on. It then attempts to kick the CPU of the queue head. After the queue head has acquired the lock, it will also check the status of the next node and set the _Q_LOCKED_SLOWPATH flag if that node has been halted. Enabling the PV code does have a performance impact on spinlock acquisitions and releases. The following table shows the execution time (in ms) of a spinlock micro-benchmark that does 5M lock/unlock operations for each task versus the number of contending tasks on a Westmere-EX system.

  # of       Ticket lock             Queue lock
  tasks   PV off/PV on/%Change    PV off/PV on/%Change
  -----   --------------------    --------------------
    1       135/  179/ +33%         137/  168/ +23%
    2      1045/ 1103/  +6%        1161/ 1248/  +7%
    3      1827/ 2683/ +47%        2357/ 2600/ +10%
    4      2689/ 4191/ +56%        2882/ 3115/  +8%
    5      3736/ 5830/ +56%        3493/ 3571/  +2%
    6      4942/ 7609/ +54%        4239/ 4198/  -1%
    7      6304/ 9570/ +52%        4931/ 4895/  -1%
    8      7736/11323/ +46%        5632/ 5588/  -1%

It can be seen that the ticket lock PV code has a fairly big decrease in performance when there are 3 or more contending tasks. The queue spinlock PV code, on the other hand, only has a relatively minor drop in performance with 1-4 contending tasks. With 5 or more contending tasks, there is practically no difference in performance.
Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/pvqspinlock.h | 359 arch/x86/include/asm/qspinlock.h | 33 kernel/locking/qspinlock.c | 72 +++- 3 files changed, 458 insertions(+), 6 deletions(-) create mode 100644 arch/x86/include/asm/pvqspinlock.h diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h new file mode 100644 index 000..af00eda --- /dev/null +++ b/arch/x86/include/asm/pvqspinlock.h @@ -0,0 +1,359 @@ +#ifndef _ASM_X86_PVQSPINLOCK_H +#define _ASM_X86_PVQSPINLOCK_H + +/* + * Queue Spinlock Para-Virtualization (PV) Support + * + * +--++--++--+ + * pv_qnode |Queue | prev | | prev |Queue | + * | Head |---| Node | -| Tail | + * +--++--++--+ + * | | | + * V V V + * +--++--++--+ + * mcs_spinlock |locked||locked||locked| + * | = 1 |---| = 0 |- | = 0 | + * +--+ next +--+ next +--+ + * + * The PV support code for queue spinlock is roughly the same as that + * of the ticket spinlock. Each CPU waiting for the lock will spin until it + * reaches a threshold. When that happens, it will put itself to halt so + * that the hypervisor can reuse the CPU cycles in some other guests as well + * as returning other hold-up CPUs faster. + * + * Auxillary fields in the pv_qnode structure are used to hold information + * relevant to the PV support so that it won't impact on the behavior and + * performance of the bare metal code. The structure contains a prev pointer + * so that a lock holder can find out the queue head from the queue tail + * following the prev pointers. + * + * A major difference between the two versions of PV spinlock is the fact + * that the spin threshold of the queue spinlock is half of that of the + * ticket spinlock. However, the queue head will spin twice as long as the + * other nodes before it puts itself to halt. The reason for that is to + * increase halting chance of heavily contended locks to favor lightly + * contended locks (queue depth of 1 or less). 
+ *
+ * There are 2 places where races can happen:
+ * 1) Halting of the queue head CPU (in pv_head_spin_check) and the CPU
+ *    kicking by the lock holder in the unlock path (in pv_kick_node).
+ * 2) Halting of the queue node CPU (in pv_queue_spin_check) and the
+ *    status check by the previous queue head (in pv_halt_check).
+ * See the comments on those functions to see how the races are being
+ * addressed.
+ */
+
+/*
+ * Spin threshold for queue spinlock
+ */
+#define QSPIN_THRESHOLD	(1U << 14)
Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
Il 30/05/2014 14:42, Alexander Graf ha scritto: From: Paul Mackerras pau...@samba.org Commit b005255e12a3 (KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs) added a definition of KVM_REG_PPC_WORT with the same register number as the existing KVM_REG_PPC_VRSAVE (though in fact the definitions are not identical because of the different register sizes.) For clarity, this moves KVM_REG_PPC_WORT to the next unused number, and also adds it to api.txt. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexander Graf ag...@suse.de --- Documentation/virtual/kvm/api.txt | 1 + arch/powerpc/include/uapi/asm/kvm.h | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 9a95770..6b30290 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1873,6 +1873,7 @@ registers, find a list below: PPC | KVM_REG_PPC_PPR | 64 PPC | KVM_REG_PPC_ARCH_COMPAT 32 PPC | KVM_REG_PPC_DABRX | 32 + PPC | KVM_REG_PPC_WORT | 64 PPC | KVM_REG_PPC_TM_GPR0 | 64 ... 
PPC | KVM_REG_PPC_TM_GPR31 | 64 diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index a6665be..2bc4a94 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -545,7 +545,6 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_TCSCR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb1) #define KVM_REG_PPC_PID(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb2) #define KVM_REG_PPC_ACOP (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3) -#define KVM_REG_PPC_WORT (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb4) #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4) #define KVM_REG_PPC_LPCR (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5) @@ -555,6 +554,7 @@ struct kvm_get_htab_header { #define KVM_REG_PPC_ARCH_COMPAT(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb7) #define KVM_REG_PPC_DABRX (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb8) +#define KVM_REG_PPC_WORT (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb9) This is an ABI break, this symbol was added in 3.14. I think I should revert this. Can you convince me otherwise? Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 11/16] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
This patch renames the paravirt_ticketlocks_enabled static key to a more generic paravirt_spinlocks_enabled name. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/spinlock.h |4 ++-- arch/x86/kernel/kvm.c|2 +- arch/x86/kernel/paravirt-spinlocks.c |4 ++-- arch/x86/xen/spinlock.c |2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 958d20f..428d0d1 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -39,7 +39,7 @@ /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 15) -extern struct static_key paravirt_ticketlocks_enabled; +extern struct static_key paravirt_spinlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); #ifdef CONFIG_QUEUE_SPINLOCK @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { if (TICKET_SLOWPATH_FLAG - static_key_false(paravirt_ticketlocks_enabled)) { + static_key_false(paravirt_spinlocks_enabled)) { arch_spinlock_t prev; prev = *lock; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 0331cb3..7ab8ab3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -817,7 +817,7 @@ static __init int kvm_spinlock_init_jump(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); printk(KERN_INFO KVM setup paravirtual spinlock\n); return 0; diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 69ed806..d9041c9 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = { }; EXPORT_SYMBOL(pv_lock_ops); -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; 
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled); +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE; +EXPORT_SYMBOL(paravirt_spinlocks_enabled); #endif #ifdef CONFIG_VIRT_UNFAIR_LOCKS diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 0ba5f3b..d1b6a32 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jump(void) if (!xen_domain()) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); return 0; } early_initcall(xen_init_spinlocks_jump); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 08/16] qspinlock: Prepare for unfair lock support
If unfair lock is supported, the lock acquisition loop at the end of the queue_spin_lock_slowpath() function may need to detect the fact that the lock can be stolen. Code is added for the stolen-lock detection.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c | 26 ++
 1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 2c7abe7..ae1b19d 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,7 +94,7 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
  *
- * This internal structure is also used by the set_locked function which
+ * This internal structure is also used by the try_set_locked function which
  * is not restricted to _Q_PENDING_BITS == 8.
  */
 struct __qspinlock {
@@ -206,19 +206,21 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
- * set_locked - Set the lock bit and own the lock
- * @lock: Pointer to queue spinlock structure
+ * try_set_locked - Try to set the lock bit and own the lock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 otherwise
  *
  * This routine should only be called when the caller is the only one
  * entitled to acquire the lock.
  */
-static __always_inline void set_locked(struct qspinlock *lock)
+static __always_inline int try_set_locked(struct qspinlock *lock)
 {
 	struct __qspinlock *l = (void *)lock;
 
 	barrier();
 	ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
 	barrier();
+	return 1;
 }
 
 /**
@@ -357,11 +359,12 @@ queue:
 	/*
 	 * we're at the head of the waitqueue, wait for the owner & pending to
	 * go away.
-	 * Load-acquired is used here because the set_locked()
+	 * Load-acquired is used here because the try_set_locked()
 	 * function below may not be a full memory barrier.
 	 *
 	 * *,x,y -> *,0,0
 	 */
+retry_queue_wait:
 	while ((val = smp_load_acquire(&lock->val.counter)) &
 			_Q_LOCKED_PENDING_MASK)
 		arch_mutex_cpu_relax();
 
@@ -378,13 +381,20 @@ queue:
 	 */
 	for (;;) {
 		if (val != tail) {
-			set_locked(lock);
-			break;
+			/*
+			 * The try_set_locked function will only fail if the
+			 * lock was stolen.
+			 */
+			if (try_set_locked(lock))
+				break;
+			else
+				goto retry_queue_wait;
 		}
 		old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
 			goto release;	/* No contention */
-
+		else if (old & _Q_LOCKED_MASK)
+			goto retry_queue_wait;
 		val = old;
 	}
-- 
1.7.1
[PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest
Locking is always an issue in a virtualized environment because of 2 different types of problems:
 1) Lock holder preemption
 2) Lock waiter preemption

One solution to the lock waiter preemption problem is to allow unfair lock in a virtualized environment. In this case, a new lock acquirer can come in and steal the lock if the next-in-line CPU to get the lock is scheduled out. A simple unfair queue spinlock can be implemented by allowing lock stealing in the fast path. The slowpath will also be modified to run a simple queue_spin_trylock() loop. A simple test and set lock like that does have the problem that the constant spinning on the lock word puts a lot of cacheline contention traffic on the affected cacheline, thus slowing down tasks that need to access the cacheline. Unfair lock in a native environment is generally not a good idea, as there is a possibility of lock starvation for a heavily contended lock. This patch adds a new configuration option for the x86 architecture to enable the use of unfair queue spinlock (VIRT_UNFAIR_LOCKS) in a virtual guest. A jump label (virt_unfairlocks_enabled) is used to switch between a fair and an unfair version of the spinlock code. This jump label will only be enabled in a virtual guest where the X86_FEATURE_HYPERVISOR feature bit is set. Enabling this configuration feature causes a slight decrease in the performance of an uncontended lock-unlock operation of about 1-2%, mainly due to the use of a static key. However, uncontended lock-unlock operations are really just a tiny percentage of a real workload. So there should be no noticeable change in application performance.
With the unfair locking activated on a bare metal 4-socket Westmere-EX box, the execution times (in ms) of a spinlock micro-benchmark were as follows:

  # of    Ticket    Fair         Unfair
  tasks   lock      queue lock   queue lock
  -----   ------    ----------   ----------
    1       135         135          137
    2       890        1082          613
    3      1932        2248         1211
    4      2829        2819         1720
    5      3834        3522         2461
    6      4963        4173         3715
    7      6299        4875         3749
    8      7691        5563         4194

Executing one task per node, the performance data were:

  # of    Ticket    Fair         Unfair
  nodes   lock      queue lock   queue lock
  -----   ------    ----------   ----------
    1        135        135          137
    2       4603       1034         1458
    3      10940      12087         2562
    4      21555      10507         4793

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/Kconfig                     | 11 +
 arch/x86/include/asm/qspinlock.h     | 79 ++
 arch/x86/kernel/Makefile             |  1 +
 arch/x86/kernel/paravirt-spinlocks.c | 26 +++
 kernel/locking/qspinlock.c           | 20 +
 5 files changed, 137 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 95c9c4e..961f43a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -585,6 +585,17 @@ config PARAVIRT_SPINLOCKS
 	  If you are unsure how to answer this question, answer Y.
 
+config VIRT_UNFAIR_LOCKS
+	bool "Enable unfair locks in a virtual guest"
+	depends on SMP && QUEUE_SPINLOCK
+	depends on !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE
+	---help---
+	  This changes the kernel to use unfair locks in a virtual
+	  guest. This will help performance in most cases. However,
+	  there is a possibility of lock starvation on a heavily
+	  contended lock especially in a large guest with many
+	  virtual CPUs.
+ source arch/x86/xen/Kconfig config KVM_GUEST diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index e4a4f5d..448de8b 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -5,6 +5,10 @@ #if !defined(CONFIG_X86_OOSTORE) !defined(CONFIG_X86_PPRO_FENCE) +#ifdef CONFIG_VIRT_UNFAIR_LOCKS +extern struct static_key virt_unfairlocks_enabled; +#endif + #definequeue_spin_unlock queue_spin_unlock /** * queue_spin_unlock - release a queue spinlock @@ -26,4 +30,79 @@ static inline void queue_spin_unlock(struct qspinlock *lock) #include asm-generic/qspinlock.h +union arch_qspinlock { + atomic_t val; + u8 locked; +}; + +#ifdef CONFIG_VIRT_UNFAIR_LOCKS +/** + * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + if (!qlock-locked (cmpxchg(qlock-locked, 0, _Q_LOCKED_VAL) == 0)) + return 1; + return 0; +} + +/** + * queue_spin_lock_unfair - acquire a queue spinlock
[PATCH v11 10/16] qspinlock: Split the MCS queuing code into a separate slowerpath
With the pending addition of more code to support PV spinlock, the complexity of the slowpath function increases to the point that the number of scratch-pad registers in the x86-64 architecture is not enough, so additional non-scratch-pad registers will need to be used. This has the downside of requiring those registers to be saved and restored in the prolog and epilog of the slowpath function, slowing down the nominally faster pending-bit and trylock code path at the beginning of the slowpath function. This patch separates out the actual MCS queuing code into a slowerpath function. This avoids slowing down the pending-bit and trylock code path at the expense of a little bit of additional overhead in the MCS queuing code path.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c | 162 ---
 1 files changed, 90 insertions(+), 72 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 3723c83..93c663a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -232,6 +232,93 @@ static __always_inline int try_set_locked(struct qspinlock *lock)
 }
 
 /**
+ * queue_spin_lock_slowerpath - a slower path for acquiring queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ * @node: Pointer to the queue node
+ * @tail: The tail code
+ *
+ * The reason for splitting a slowerpath from slowpath is to avoid the
+ * unnecessary overhead of non-scratch pad register pushing and popping
+ * due to increased complexity with unfair and PV spinlock from slowing
+ * down the nominally faster pending bit and trylock code path. So this
+ * function is not inlined.
+ */
+static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
+					struct mcs_spinlock *node, u32 tail)
+{
+	struct mcs_spinlock *prev, *next;
+	u32 val, old;
+
+	/*
+	 * we already touched the queueing cacheline; don't bother with pending
+	 * stuff.
+	 *
+	 * p,*,* -> n,*,*
+	 */
+	old = xchg_tail(lock, tail);
+
+	/*
+	 * if there was a previous node; link it and wait.
+	 */
+	if (old & _Q_TAIL_MASK) {
+		prev = decode_tail(old);
+		ACCESS_ONCE(prev->next) = node;
+
+		arch_mcs_spin_lock_contended(&node->locked);
+	}
+
+	/*
+	 * we're at the head of the waitqueue, wait for the owner & pending to
+	 * go away.
+	 * Load-acquire is used here because the try_set_locked()
+	 * function below may not be a full memory barrier.
+	 *
+	 * *,x,y -> *,0,0
+	 */
+retry_queue_wait:
+	while ((val = smp_load_acquire(&lock->val.counter))
+				       & _Q_LOCKED_PENDING_MASK)
+		arch_mutex_cpu_relax();
+
+	/*
+	 * claim the lock:
+	 *
+	 * n,0,0 -> 0,0,1 : lock, uncontended
+	 * *,0,0 -> *,0,1 : lock, contended
+	 *
+	 * If the queue head is the only one in the queue (lock value == tail),
+	 * clear the tail code and grab the lock. Otherwise, we only need
+	 * to grab the lock.
+	 */
+	for (;;) {
+		if (val != tail) {
+			/*
+			 * The try_set_locked function will only fail if the
+			 * lock was stolen.
+			 */
+			if (try_set_locked(lock))
+				break;
+			else
+				goto retry_queue_wait;
+		}
+		old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+		if (old == val)
+			return;	/* No contention */
+		else if (old & _Q_LOCKED_MASK)
+			goto retry_queue_wait;
+		val = old;
+	}
+
+	/*
+	 * contended path; wait for next
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+
+	arch_mcs_spin_unlock_contended(&next->locked);
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -254,7 +341,7 @@ static __always_inline int try_set_locked(struct qspinlock *lock)
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
-	struct mcs_spinlock *prev, *next, *node;
+	struct mcs_spinlock *node;
 	u32 new, old, tail;
 	int idx;
@@ -355,78 +442,9 @@ queue:
 	 * attempt the trylock once more in the hope someone let go while we
 	 * weren't watching.
 	 */
-	if (queue_spin_trylock(lock))
-		goto release;
-
-	/*
-	 * we already touched the queueing cacheline; don't bother with pending
-	 * stuff.
-	 *
-	 * p,*,* -> n,*,*
-	 */
-	old = xchg_tail(lock, tail);
-
-	/*
[PATCH v11 06/16] qspinlock: prolong the stay in the pending bit path
There is a problem in the current pending bit spinning code. When the lock is free, but the pending bit holder hasn't grabbed the lock and cleared the pending bit yet, the spinning code will not be run. As a result, the regular queuing code path might be used most of the time even when there are only 2 tasks contending for the lock. Assuming that the pending bit holder is going to get the lock and clear the pending bit soon, it is actually better to wait than to be queued up, which has a higher overhead.

The following tables show the before-patch execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX x86-64 CPU with 2 different types of loads - standalone (lock and protected data in different cachelines) and embedded (lock and protected data in the same cacheline).

		[Standalone/Embedded - same node]
  # of tasks	Ticket lock	Queue lock	 %Change
  ----------	-----------	----------	 -------
       1	 135/ 111	 135/ 101	   0%/  -9%
       2	 890/ 779	1885/1990	+112%/+156%
       3	1932/1859	2333/2341	 +21%/ +26%
       4	2829/2726	2900/2923	  +3%/  +7%
       5	3834/3761	3655/3648	  -5%/  -3%
       6	4963/4976	4336/4326	 -13%/ -13%
       7	6299/6269	5057/5064	 -20%/ -19%
       8	7691/7569	5786/5798	 -25%/ -23%

Of course, the results will vary depending on what kind of test machine is used. With 1 task per NUMA node, the execution times are:

		[Standalone - different nodes]
  # of nodes	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  135		  135		  0%
       2	 4604		 5087		+10%
       3	10940		12224		+12%
       4	21555		10555		-51%

It can be seen that the queue spinlock is slower than the ticket spinlock when there are 2 or 3 contending tasks. In all the other cases, the queue spinlock is either equal to or faster than the ticket spinlock.
With this patch, the performance data for 2 contending tasks are:

		[Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       2	890/779		984/871		+11%/+12%

		[Standalone - different nodes]
  # of nodes	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       2	4604		1364		-70%

It can be seen that the queue spinlock performance for 2 contending tasks is now comparable to the ticket spinlock on the same node, but much faster when on different nodes. With 3 contending tasks, however, the ticket spinlock is still quite a bit faster.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c | 18 --
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc7fd8c..7f10758 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -233,11 +233,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 */
 	for (;;) {
 		/*
-		 * If we observe any contention; queue.
+		 * If we observe that the queue is not empty or both
+		 * the pending and lock bits are set, queue
 		 */
-		if (val & ~_Q_LOCKED_MASK)
+		if ((val & _Q_TAIL_MASK) ||
+		    (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)))
 			goto queue;
 
+		if (val == _Q_PENDING_VAL) {
+			/*
+			 * Pending bit is set, but not the lock bit.
+			 * Assuming that the pending bit holder is going to
+			 * set the lock bit and clear the pending bit soon,
+			 * it is better to wait than to exit at this point.
+			 */
+			cpu_relax();
+			val = atomic_read(&lock->val);
+			continue;
+		}
+
 		new = _Q_LOCKED_VAL;
 		if (val == new)
 			new |= _Q_PENDING_VAL;
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 05/16] qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated cmpxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores).

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |  13
 kernel/locking/qspinlock.c            | 103 +---
 2 files changed, 106 insertions(+), 10 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index ed5d89a..4914abe 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -38,6 +38,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ *     8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  *     8: pending
  *  9-10: tail index
@@ -50,7 +58,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS		8
+#else
 #define _Q_PENDING_BITS		1
+#endif
 #define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -61,6 +73,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET		_Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK		(_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 41594a1..fc7fd8c 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -22,6 +22,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/mutex.h>
+#include <asm/byteorder.h>
 #include <asm/qspinlock.h>
 
 /*
@@ -48,6 +49,9 @@
  * We can further change the first spinner to spin on a bit in the lock word
  * instead of its node; whereby avoiding the need to carry a node from lock to
  * unlock, and preserving API.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ * atomic operations on smaller 8-bit and 16-bit data types.
  */
 
 #include "mcs_spinlock.h"
@@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK	(_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+	union {
+		atomic_t val;
+		struct {
+#ifdef __LITTLE_ENDIAN
+			u16	locked_pending;
+			u16	tail;
+#else
+			u16	tail;
+			u16	locked_pending;
+#endif
+		};
+	};
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+	u32 new, old;
+
+
[PATCH v11 04/16] qspinlock: Extract out the exchange of tail code word
This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |  2 +
 kernel/locking/qspinlock.c            | 58
 2 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index bd25081..ed5d89a 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -61,6 +61,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK		(_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 1e93c6a..41594a1 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 #define _Q_LOCKED_PENDING_MASK	(_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+	u32 old, new, val = atomic_read(&lock->val);
+
+	for (;;) {
+		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+		old = atomic_cmpxchg(&lock->val, val, new);
+		if (old == val)
+			break;
+
+		val = old;
+	}
+	return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -182,36 +207,25 @@ queue:
 	node->next = NULL;
 
 	/*
-	 * we already touched the queueing cacheline; don't bother with pending
-	 * stuff.
-	 *
-	 * trylock || xchg(lock, node)
-	 *
-	 * 0,0,0 -> 0,0,1 ; trylock
-	 * p,y,x -> n,y,x ; prev = xchg(lock, node)
+	 * We touched a (possibly) cold cacheline in the per-cpu queue node;
+	 * attempt the trylock once more in the hope someone let go while we
+	 * weren't watching.
 	 */
-	for (;;) {
-		new = _Q_LOCKED_VAL;
-		if (val)
-			new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
-		old = atomic_cmpxchg(&lock->val, val, new);
-		if (old == val)
-			break;
-
-		val = old;
-	}
+	if (queue_spin_trylock(lock))
+		goto release;
 
 	/*
-	 * we won the trylock; forget about queueing.
+	 * we already touched the queueing cacheline; don't bother with pending
+	 * stuff.
+	 *
+	 * p,*,* -> n,*,*
 	 */
-	if (new == _Q_LOCKED_VAL)
-		goto release;
+	old = xchg_tail(lock, tail);
 
 	/*
 	 * if there was a previous node; link it and wait.
 	 */
-	if (old & ~_Q_LOCKED_PENDING_MASK) {
+	if (old & _Q_TAIL_MASK) {
 		prev = decode_tail(old);
 		ACCESS_ONCE(prev->next) = node;
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 00/16] qspinlock: a 4-byte queue spinlock with PV support
v10->v11:
 - Use a simple test-and-set unfair lock to simplify the code, but performance may suffer a bit for large guests with many CPUs.
 - Take out Raghavendra KT's test results as the unfair lock changes may render some of his results invalid.
 - Add PV support without increasing the size of the core queue node structure.
 - Other minor changes to address some of the feedback comments.

v9->v10:
 - Make some minor changes to qspinlock.c to accommodate review feedback.
 - Change author to PeterZ for 2 of the patches.
 - Include Raghavendra KT's test results in patch 18.

v8->v9:
 - Integrate PeterZ's version of the queue spinlock patch with some modification: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
 - Break the more complex patches into smaller ones to ease review effort.
 - Fix a race condition in the PV qspinlock code.

v7->v8:
 - Remove one unneeded atomic operation from the slowpath, thus improving performance.
 - Simplify some of the code and add more comments.
 - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable unfair lock.
 - Reduce unfair lock slowpath lock stealing frequency depending on its distance from the queue head.
 - Add performance data for IvyBridge-EX CPU.

v6->v7:
 - Remove an atomic operation from the 2-task contending code
 - Shorten the names of some macros
 - Make the queue waiter to attempt to steal lock when unfair lock is enabled.
 - Remove lock holder kick from the PV code and fix a race condition
 - Run the unfair lock & PV code on overcommitted KVM guests to collect performance data.

v5->v6:
 - Change the optimized 2-task contending code to make it fairer at the expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV ticketlock.
 - Add performance data for the unfair lock as well as the PV support code.
v4->v5:
 - Move the optimized 2-task contending code to the generic file to enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring that the lock holder and queue head stay alive as much as possible.

v3->v4:
 - Remove debugging code and fix a configuration error
 - Simplify the qspinlock structure and streamline the code to make it perform a bit better
 - Add an x86 version of asm/qspinlock.h for holding x86 specific optimization.
 - Add an optimized x86 code path for 2 contending tasks to improve low contention performance.

v2->v3:
 - Simplify the code by using numerous mode only without an unfair option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user configuration.
 - Additional performance tuning.

v1->v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs
 - Add a configuration option to allow lock stealing which can further improve performance in many cases.
 - Enable wakeup of queue head CPU at unlock time for non-numerous CPU mode.

This patch set has 3 different sections:
 1) Patches 1-7: Introduces a queue-based spinlock implementation that can replace the default ticket spinlock without increasing the size of the spinlock data structure. As a result, critical kernel data structures that embed spinlock won't increase in size and break data alignments.
 2) Patches 8-9: Enables the use of unfair queue spinlock in a virtual guest. This can resolve some of the locking related performance issues due to the fact that the next CPU to get the lock may have been scheduled out for a period of time.
 3) Patches 10-16: Enable qspinlock para-virtualization support by halting the waiting CPUs after spinning for a certain amount of time. The unlock code will detect a sleeping waiter and wake it up. This is essentially the same logic as the PV ticketlock code.

The queue spinlock has slightly better performance than the ticket spinlock in the uncontended case. Its performance can be much better with moderate to heavy contention. This patch has the potential of improving the performance of all the workloads that have moderate to heavy spinlock contention. The queue spinlock is especially suitable for NUMA machines with at least 2 sockets, though noticeable performance benefit probably won't show up in machines with fewer than 4 sockets. The purpose of this patch set is not to solve any particular spinlock contention problems. Those need to be solved by refactoring the code to make more efficient use of the lock or by using finer-granularity locks. The main purpose is to make the lock contention problems more
[PATCH v11 02/16] qspinlock, x86: Enable x86-64 to use queue spinlock
This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket, the benefit of queue spinlock may not be apparent, so queue spinlock is not enabled there. Currently, there are some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/Kconfig                      |  1 +
 arch/x86/include/asm/qspinlock.h      | 29 +
 arch/x86/include/asm/spinlock.h       |  5 +
 arch/x86/include/asm/spinlock_types.h |  4
 4 files changed, 39 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..95c9c4e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_INT128 if X86_64
 	select ARCH_WANTS_PROT_NUMA_PROT_NONE
+	select ARCH_USE_QUEUE_SPINLOCK
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_PCSPKR_PLATFORM

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 000..e4a4f5d
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,29 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+#define	queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	barrier();
+	ACCESS_ONCE(*(u8 *)lock) = 0;
+	barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0f62f54..958d20f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 {
 	arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {

diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..7960268 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
 	union {
 		__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm/rwlock.h>
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
On 30.05.14 17:50, Paolo Bonzini wrote:
Il 30/05/2014 14:42, Alexander Graf ha scritto:
From: Paul Mackerras pau...@samba.org

Commit b005255e12a3 (KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs) added a definition of KVM_REG_PPC_WORT with the same register number as the existing KVM_REG_PPC_VRSAVE (though in fact the definitions are not identical because of the different register sizes.) For clarity, this moves KVM_REG_PPC_WORT to the next unused number, and also adds it to api.txt.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexander Graf ag...@suse.de
---
 Documentation/virtual/kvm/api.txt   | 1 +
 arch/powerpc/include/uapi/asm/kvm.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 9a95770..6b30290 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1873,6 +1873,7 @@ registers, find a list below:
   PPC  | KVM_REG_PPC_PPR           | 64
   PPC  | KVM_REG_PPC_ARCH_COMPAT   | 32
   PPC  | KVM_REG_PPC_DABRX         | 32
+  PPC  | KVM_REG_PPC_WORT          | 64
   PPC  | KVM_REG_PPC_TM_GPR0       | 64
           ...
   PPC  | KVM_REG_PPC_TM_GPR31      | 64

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index a6665be..2bc4a94 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -545,7 +545,6 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TCSCR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb1)
 #define KVM_REG_PPC_PID		(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb2)
 #define KVM_REG_PPC_ACOP	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb3)
-#define KVM_REG_PPC_WORT	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb4)
 
 #define KVM_REG_PPC_VRSAVE	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
 #define KVM_REG_PPC_LPCR	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
@@ -555,6 +554,7 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_ARCH_COMPAT	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb7)
 
 #define KVM_REG_PPC_DABRX	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb8)
+#define KVM_REG_PPC_WORT	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb9)

This is an ABI break, this symbol was added in 3.14. I think I should revert this. Can you convince me otherwise?

There's nothing bad happening with the change. Newer user space won't be able to read WORT on older kernels, but there were more things broken than just WORT for POWER8 support there ;). And user space built with new headers running on an old kernel won't find the register, which is OK. I couldn't find any combination where it's really a problem.

Alex
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 03/16] qspinlock: Add pending bit
From: Peter Zijlstra pet...@infradead.org

Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |  12 +++-
 kernel/locking/qspinlock.c            | 109 ++--
 2 files changed, 97 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index f66f845..bd25081 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -39,8 +39,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ *     8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
				      << _Q_ ## type ## _OFFSET)
@@ -48,7 +49,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS		8
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS		1
+#define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS	2
 #define _Q_TAIL_IDX_MASK	_Q_SET_MASK(TAIL_IDX)
 
@@ -57,5 +62,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b97a1ad..1e93c6a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -83,24 +83,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 	return per_cpu_ptr(&mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK	(_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock bit)
- *
- *              fast     :    slow                                  :    unlock
- *                       :                                          :
- * uncontended  (0,0)  --:--> (0,1) --------------------------------:--> (*,0)
- *                       :       | ^--------.                    /  :
- *                       :       v           \                   |  :
- * uncontended           :    (n,x) --+--> (n,0)                 |  :
- *   queue               :       | ^--'                          |  :
- *                       :       v                               |  :
- * contended             :    (*,x) --+--> (*,0) -----> (*,1) ---'  :
- *   queue               :         ^--'                             :
+ * (queue tail, pending bit, lock bit)
  *
+ *              fast     :    slow                                  :    unlock
+ *                       :                                          :
+ * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ *                       :       | ^--------.------.             /  :
+ *                       :       v           \      \            |  :
+ * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
+ *                       :       | ^--'              |           |  :
+ *                       :       v                   |           |  :
+ * uncontended           :    (n,x,y) +--> (n,0,0) --'           |  :
+ *   queue               :       | ^--'                          |  :
+ *                       :       v                               |  :
+ * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
+ *   queue               :         ^--'                             :
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
@@ -110,6 +114,65 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+	/*
+	 * trylock || pending
+	 *
+	 * 0,0,0 -> 0,0,1 ; trylock
+	 * 0,0,1 -> 0,1,1 ; pending
+	 */
+	for (;;) {
+		/*
+		 * If we observe any contention; queue.
+		 */
+		if (val & ~_Q_LOCKED_MASK)
+			goto queue;
+
+		new = _Q_LOCKED_VAL;
+		if (val == new)
+			new |= _Q_PENDING_VAL;
+
+		old = atomic_cmpxchg(&lock->val, val, new);
+		if (old == val)
+			break;
+
+		val = old;
+	}
+
+	/*
+	 * we won the
[PATCH v11 01/16] qspinlock: A simple generic 4-byte queue spinlock
This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair. It has about the same speed in single-thread and it can be much faster in high contention situations, especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention, where the average queue depth is around 1-3, will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suited to NUMA machines with a large number of cores, as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. Due to the fact that spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending on one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and using them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit size (including CPU number and queue node index), leaving one byte for the lock. Please note that the queue node is only needed when waiting for the lock. Once the lock is acquired, the queue node can be released to be used later.
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock.h       | 118
 include/asm-generic/qspinlock_types.h |  61 ++
 kernel/Kconfig.locks                  |   7 +
 kernel/locking/Makefile               |   1 +
 kernel/locking/mcs_spinlock.h         |   1 +
 kernel/locking/qspinlock.c            | 197 +
 6 files changed, 385 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 000..e8a7ae8
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,118 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *      locked wrt the lockref code to avoid lock stealing by the lockref
+ *      code and change things underneath the lock. This also allows some
+ *      optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->val) &&
+	    (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+		return 1;
+	return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	u32 val;
+
Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
Il 30/05/2014 17:53, Alexander Graf ha scritto: This is an ABI break, this symbol was added in 3.14. I think I should revert this. Can you convince me otherwise? There's nothing bad happening with the change. Newer user space won't be able to read WORT on older kernels, but there were more things broken that just WORT for POWER8 support there ;). And user space build with new headers running on an old kernel won't find the register, which is OK. Would new userspace with old kernel be able to detect that POWER8 support isn't quite complete? Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
On 30.05.14 17:55, Paolo Bonzini wrote: On 30/05/2014 17:53, Alexander Graf wrote: This is an ABI break, this symbol was added in 3.14. I think I should revert this. Can you convince me otherwise? There's nothing bad happening with the change. Newer user space won't be able to read WORT on older kernels, but there were more things broken than just WORT for POWER8 support there ;). And user space built with new headers running on an old kernel won't find the register, which is OK. Would new userspace with old kernel be able to detect that POWER8 support isn't quite complete? It couldn't, no. It would try to run a guest - if it happens to work we're lucky ;). Even then the only thing that would remotely be affected by that one_reg rename is live migration (which just got a few more fixes in this pull request). Alex
Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
On 30/05/2014 17:58, Alexander Graf wrote: Would new userspace with old kernel be able to detect that POWER8 support isn't quite complete? It couldn't, no. It would try to run a guest - if it happens to work we're lucky ;). That's why I'm considering a revert. Even then the only thing that would remotely be affected by that one_reg rename is live migration (which just got a few more fixes in this pull request). Doesn't info cpus also do get/set one_reg? What happens if it returns EINVAL? Also, reset should certainly try to write all registers; what happens if one is missed? Beyond the particular case of WORT, I'd just like to point out that uapi/ changes need even more scrutiny from maintainers than usual. I don't know exactly what checks Linus makes in my pull requests, but uapi/ is at the top of the list of things he might look at, right after the diffstat. :) Paolo
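The question Paolo raises — what userspace should do when a ONE_REG write fails on an older kernel — can be sketched as a small policy helper. The function name and the `optional` flag below are hypothetical, not QEMU code; they only illustrate the distinction between "kernel predates this register" (EINVAL/ENOENT, tolerable for optional registers) and a genuine failure that a reset path must not swallow.

```c
#include <errno.h>

/* Hypothetical helper: classify the result of a KVM_SET_ONE_REG-style
 * call when restoring a register set. ret/err model the ioctl return
 * value and errno; optional marks registers an old kernel may lack. */
static int one_reg_restore_status(int ret, int err, int optional)
{
    if (ret == 0)
        return 0;                       /* register written */
    if (optional && (err == EINVAL || err == ENOENT))
        return 0;                       /* old kernel: skip silently */
    return -err;                        /* real error: fail the restore */
}
```

The design point is that ignoring EINVAL is only safe when the register is known to be optional; applying that policy to every register is exactly the "less-than-sensible error handling" risk mentioned later in the thread.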
Re: [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters
On Fri, May 30, 2014 at 09:31:53AM +0200, Peter Zijlstra wrote: On Thu, May 29, 2014 at 06:12:05PM -0700, Andi Kleen wrote: From: Andi Kleen a...@linux.intel.com Currently perf unconditionally disables PEBS for guest. Now that we have the infrastructure in place to handle it we can allow it for KVM owned guest events. For the perf needs to know that a event is owned by a guest. Add a new state bit in the perf_event for that. This doesn't make sense; why does it need to be owned? Please read the complete patch kit -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
On 30.05.14 18:03, Paolo Bonzini wrote: On 30/05/2014 17:58, Alexander Graf wrote: Would new userspace with old kernel be able to detect that POWER8 support isn't quite complete? It couldn't, no. It would try to run a guest - if it happens to work we're lucky ;). That's why I'm considering a revert. Even then the only thing that would remotely be affected by that one_reg rename is live migration (which just got a few more fixes in this pull request). Doesn't info cpus also do get/set one_reg? Yeah, but WORT is not important enough to get listed. What happens if it returns EINVAL? Also, reset should certainly try to write all registers; what happens if one is missed? If it returns EINVAL we just ignore the register. Beyond the particular case of WORT, I'd just like to point out that uapi/ changes need even more scrutiny from maintainers than usual. I don't know exactly what checks Linus makes in my pull requests, but uapi/ is at the top of the list of things he might look at, right after the diffstat. :) Consider that ONE_REG flagged as experimental :). Really, I am as concerned as you are about ABI breakages, but in this case it's not worth it. I'm not even sure any guest uses WORT at all. Linux doesn't seem to. Alex
Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
On 30/05/2014 18:08, Alexander Graf wrote: What happens if it returns EINVAL? Also, reset should certainly try to write all registers; what happens if one is missed? If it returns EINVAL we just ignore the register. I wonder if it's the right thing to do. You remember how you were bitten by less-than-sensible error handling in the x86 kvm_arch_put_registers. Beyond the particular case of WORT, I'd just like to point out that uapi/ changes need even more scrutiny from maintainers than usual. I don't know exactly what checks Linus makes in my pull requests, but uapi/ is at the top of the list of things he might look at, right after the diffstat. :) Consider that ONE_REG flagged as experimental :). Really, I am as concerned as you are about ABI breakages, but in this case it's not worth it. I'm not even sure any guest uses WORT at all. Linux doesn't seem to. Fair enough... for this time only!... Paolo
Re: [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
On 30.05.14 18:11, Paolo Bonzini wrote: On 30/05/2014 18:08, Alexander Graf wrote: What happens if it returns EINVAL? Also, reset should certainly try to write all registers; what happens if one is missed? If it returns EINVAL we just ignore the register. I wonder if it's the right thing to do. You remember how you were bitten by less-than-sensible error handling in the x86 kvm_arch_put_registers. Yeah, I think we'll have to do a flag day as of which we declare PPC as a first-class supported citizen. At that point we'll have compat machine types and start to do more intense error checking. Alex
Re: [PATCH v5 2/4] live migration support for initial write protect of VM
+static inline void kvm_set_s2pte_readonly(pte_t *pte) +{ +pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR); This relies on the pte already having been set as RDONLY or RDWR; if you are creating a new pte and calling this function it could be easy to miss that distinction. I would prefer: pte_val(*pte) &= ~L_PTE_S2_RDWR; pte_val(*pte) |= L_PTE_S2_RDONLY; Confused on this comment: this appears to just add the read-only permission, but will it leave other permission bits intact and clear out the rest of the pte? - Mario
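The preferred sequence in the review above — clear both permission bits, then set the read-only encoding — can be illustrated with stand-in bit values. The `MY_S2_*` constants below are illustrative only; the real L_PTE_S2_* definitions live in the ARM pgtable headers. The key fact is that on ARM the stage-2 read-only encoding is a subset of the read-write encoding (HAP bit 1 vs. both HAP bits).

```c
#include <stdint.h>

/* Illustrative values: RDONLY is one HAP bit, RDWR is both, so the
 * RDONLY bits are a strict subset of the RDWR bits. */
#define MY_S2_RDONLY (1ULL << 6)
#define MY_S2_RDWR   (3ULL << 6)

/* Sketch of the suggested kvm_set_s2pte_readonly(): correct whether the
 * pte started as RDWR, RDONLY, or freshly created with no permissions. */
static uint64_t set_s2_readonly(uint64_t pte)
{
    pte &= ~MY_S2_RDWR;   /* clear both permission bits */
    pte |= MY_S2_RDONLY;  /* set the read-only encoding */
    return pte;
}
```

By contrast, the original XOR form, `pte &= ~(RDONLY ^ RDWR)`, only clears the write bit: applied to a freshly zeroed pte it leaves the permission field zero rather than read-only, which is exactly the distinction the reviewer is flagging.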
Re: does anybody still care about kvm-ia64?
On 05/30/14 15:45, Paolo Bonzini wrote: I was thinking of removing it in Linux 3.17. I'm not even sure it compiles right now, hasn't seen any action in years, and all open-source userspace code to use it has been dead for years. If you disagree, please speak up loudly in the next month. I'd say take it out back and show it mercy! Jes
Re: [PATCH v3 00/12] kvm tools: Misc patches (mips support)
On 05/28/2014 11:08 PM, Andreas Herrmann wrote: Hi, This is v3 of my patch set to run lkvm on MIPS. It's rebased on v3.13-rc1-1436-g1fc83c5 of git://github.com/penberg/linux-kvm.git Applied, thanks!
[PATCH] kvm: Ensure negative return value on kvm_init() error handling path
We need to ensure ret < 0 when going through the error path, or QEMU may try to run the half-initialized VM and crash. Signed-off-by: Eduardo Habkost ehabk...@redhat.com --- kvm-all.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kvm-all.c b/kvm-all.c index 721a390..4e19eff 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1410,7 +1410,7 @@ int kvm_init(MachineClass *mc) ret = kvm_ioctl(s, KVM_GET_API_VERSION, 0); if (ret < KVM_API_VERSION) { -if (ret > 0) { +if (ret >= 0) { ret = -EINVAL; } fprintf(stderr, "kvm version too old\n"); @@ -1461,6 +1461,7 @@ int kvm_init(MachineClass *mc) if (mc->kvm_type) { type = mc->kvm_type(kvm_type); } else if (kvm_type) { +ret = -EINVAL; fprintf(stderr, "Invalid argument kvm-type=%s\n", kvm_type); goto err; } @@ -1561,6 +1562,7 @@ int kvm_init(MachineClass *mc) return 0; err: +assert(ret < 0); if (s->vmfd >= 0) { close(s->vmfd); } -- 1.9.0
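The invariant this patch enforces — every path to the err label leaves ret negative, checked once by an assert — can be sketched in miniature. The function and the constants below are illustrative, not QEMU's kvm_init(); errno values stand in for the real failure modes.

```c
#include <errno.h>

#define FAKE_API_VERSION 12   /* stand-in for KVM_API_VERSION */

/* Sketch of the patch's pattern: set ret to a negative errno before
 * every goto err, covering the case where an ioctl "succeeded" (ret
 * >= 0) but returned an unusable value. */
static int init_thing(int api_version, int bad_type)
{
    int ret = api_version;          /* pretend this is the ioctl result */

    if (ret < FAKE_API_VERSION) {
        if (ret >= 0)
            ret = -EINVAL;          /* non-negative but wrong: force error */
        goto err;
    }
    if (bad_type) {
        ret = -EINVAL;              /* set before jumping, as the patch adds */
        goto err;
    }
    return 0;
err:
    /* the real patch places assert(ret < 0) here */
    return ret;
}
```

Without the `ret >= 0` fix, an API-version ioctl returning a small non-negative value would reach the error label with ret >= 0, and the caller would mistake the failure for success — the half-initialized-VM crash the commit message describes.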
Re: [PATCH] kvm: Ensure negative return value on kvm_init() error handling path
On 30/05/2014 22:26, Eduardo Habkost wrote: We need to ensure ret < 0 when going through the error path, or QEMU may try to run the half-initialized VM and crash. Signed-off-by: Eduardo Habkost ehabk...@redhat.com --- kvm-all.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kvm-all.c b/kvm-all.c index 721a390..4e19eff 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1410,7 +1410,7 @@ int kvm_init(MachineClass *mc) ret = kvm_ioctl(s, KVM_GET_API_VERSION, 0); if (ret < KVM_API_VERSION) { -if (ret > 0) { +if (ret >= 0) { ret = -EINVAL; } fprintf(stderr, "kvm version too old\n"); @@ -1461,6 +1461,7 @@ int kvm_init(MachineClass *mc) if (mc->kvm_type) { type = mc->kvm_type(kvm_type); } else if (kvm_type) { +ret = -EINVAL; fprintf(stderr, "Invalid argument kvm-type=%s\n", kvm_type); goto err; } @@ -1561,6 +1562,7 @@ int kvm_init(MachineClass *mc) return 0; err: +assert(ret < 0); if (s->vmfd >= 0) { close(s->vmfd); } Applied, thanks. Paolo
[PATCH] machine: Add kvm-type property
The kvm-type machine option was left out when MachineState was introduced, preventing the kvm-type option from being used. Add the missing property. Signed-off-by: Eduardo Habkost ehabk...@redhat.com Cc: Andreas Färber afaer...@suse.de Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Cc: Alexander Graf ag...@suse.de Cc: Marcel Apfelbaum marce...@redhat.com --- Tested on an x86 machine only. Help would be welcome to test it on a PPC machine using -machine spapr and KVM. Before this patch: $ qemu-system-x86_64 -machine pc,kvm-type=hv,accel=kvm qemu-system-x86_64: Property '.kvm-type' not found (This means the option won't work even for sPAPR machines.) After applying this patch: $ qemu-system-x86_64 -machine pc,kvm-type=hv,accel=kvm Invalid argument kvm-type=hv (This means the x86 KVM init code is seeing (and rejecting) the option, and the sPAPR code can use it.) Note that qemu-system-x86_64 will segfault with the above command line unless an additional fix (submitted today) is applied (kvm: Ensure negative return value on kvm_init() error handling path).
--- hw/core/machine.c | 17 + include/hw/boards.h | 1 + 2 files changed, 18 insertions(+) diff --git a/hw/core/machine.c b/hw/core/machine.c index cbba679..ed47b3a 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -235,6 +235,21 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp) ms->firmware = g_strdup(value); } +static char *machine_get_kvm_type(Object *obj, Error **errp) +{ +MachineState *ms = MACHINE(obj); + +return g_strdup(ms->kvm_type); +} + +static void machine_set_kvm_type(Object *obj, const char *value, Error **errp) +{ +MachineState *ms = MACHINE(obj); + +g_free(ms->kvm_type); +ms->kvm_type = g_strdup(value); +} + static void machine_initfn(Object *obj) { object_property_add_str(obj, "accel", @@ -274,6 +289,8 @@ static void machine_initfn(Object *obj) object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL); object_property_add_str(obj, "firmware", machine_get_firmware, machine_set_firmware, NULL); +object_property_add_str(obj, "kvm-type", +machine_get_kvm_type, machine_set_kvm_type, NULL); } static void machine_finalize(Object *obj) diff --git a/include/hw/boards.h b/include/hw/boards.h index 2d2e2be..44956d6 100644 --- a/include/hw/boards.h +++ b/include/hw/boards.h @@ -111,6 +111,7 @@ struct MachineState { bool mem_merge; bool usb; char *firmware; +char *kvm_type; ram_addr_t ram_size; const char *boot_order; -- 1.9.0
Re: [PATCH] machine: Add kvm-type property
On 30/05/2014 22:41, Eduardo Habkost wrote: diff --git a/include/hw/boards.h b/include/hw/boards.h index 2d2e2be..44956d6 100644 --- a/include/hw/boards.h +++ b/include/hw/boards.h @@ -111,6 +111,7 @@ struct MachineState { bool mem_merge; bool usb; char *firmware; +char *kvm_type; ram_addr_t ram_size; const char *boot_order; Can you add it only to the pseries machine instead? This is one of the first reasons why I wanted to have per-machine properties. :) Thanks! Paolo