Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging
On Thu, 30 Oct 2014 12:19:00 -0700 Mario Smarduch m.smard...@samsung.com wrote:

On 10/30/2014 05:14 AM, Cornelia Huck wrote:

On Wed, 22 Oct 2014 15:34:07 -0700 Mario Smarduch m.smard...@samsung.com wrote:

This patch defines KVM_GENERIC_DIRTYLOG and moves the dirty log read function to its own file, virt/kvm/dirtylog.c. x86 is updated to use the generic dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and Makefile. No other architectures are affected; each uses its own version. This changed from the previous patch revision, where non-generic architectures were modified. A subsequent patch does the same thing for armv7. All other architectures continue to use the architecture-defined version.

Hm. How about:

"The x86-specific version of dirty page logging is generic enough to be used by other architectures, notably ARMv7. So let's move the x86 code under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other architectures continue to use their own implementations."

?

I'll update the descriptions for both patches with the more concise wording.

I don't think it's so generic. In particular, the relationship between mmu_lock and TLB flushing has been changed a few times to optimize the x86 mmu code, and it may change again in the future. Since how mmu_lock is used depends on each arch, it's not so simple to make the function generic IMO.

Thanks,
Takuya

Thanks.

Signed-off-by: Mario Smarduch m.smard...@samsung.com
---
 arch/x86/include/asm/kvm_host.h |   3 --
 arch/x86/kvm/Kconfig            |   1 +
 arch/x86/kvm/Makefile           |   1 +
 arch/x86/kvm/x86.c              |  86 --
 include/linux/kvm_host.h        |   4 ++
 virt/kvm/Kconfig                |   3 ++
 virt/kvm/dirtylog.c             | 112 +++
 7 files changed, 121 insertions(+), 89 deletions(-)
 create mode 100644 virt/kvm/dirtylog.c

diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c
new file mode 100644
index 000..67a
--- /dev/null
+++ b/virt/kvm/dirtylog.c
@@ -0,0 +1,112 @@
+/*
+ * kvm generic dirty logging support, used by architectures that share
+ * comman dirty page logging implementation.
s/comman/common/

The approach looks sane to me, especially as it does not change other architectures needlessly.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Takuya Yoshikawa takuya.yoshik...@gmail.com
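The flow being made generic in this patch, KVM_GET_DIRTY_LOG's snapshot-and-clear of a memslot's dirty bitmap followed by a TLB flush if anything was dirty, can be sketched as a small user-space model. This is a toy illustration, not the kernel code: all names and sizes (toy_memslot, SLOT_PAGES, and so on) are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOT_PAGES   128
#define BITMAP_LONGS (SLOT_PAGES / 64)

struct toy_memslot {
	uint64_t dirty_bitmap[BITMAP_LONGS];
};

/* Snapshot the dirty bitmap into the caller's buffer and clear the
 * live copy; returns true if any page was dirty, in which case the
 * caller would flush TLBs.  In the kernel the clearing is done under
 * mmu_lock, which is exactly the arch-dependent part Takuya points at. */
static bool toy_get_dirty_log(struct toy_memslot *slot,
			      uint64_t snapshot[BITMAP_LONGS])
{
	bool any_dirty = false;
	int i;

	for (i = 0; i < BITMAP_LONGS; i++) {
		snapshot[i] = slot->dirty_bitmap[i];
		if (snapshot[i])
			any_dirty = true;
		slot->dirty_bitmap[i] = 0;
	}
	return any_dirty;
}
```

A second call right after the first returns false, since the live bitmap was cleared by the snapshot.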
[PATCH 1/5] KVM: PPC: Book3S HV: Fix computation of tlbie operand
The B (segment size) field in the RB operand for the tlbie instruction is two bits, which we get from the top two bits of the first doubleword of the HPT entry to be invalidated. These bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM bit numbering).

The compute_tlbie_rb() function gets these bits as v >> (62 - 8), which is not correct as it will bring in the top 10 bits, not just the top two. These extra bits could corrupt the AP, AVAL and L fields in the RB value. To fix this we shift right 62 bits and then shift left 8 bits, so we only get the two bits of the B field.

The first doubleword of the HPT entry is under the control of the guest kernel. In fact, Linux guests will always put zeroes in bits 54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing this.

Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 0aa8179..a37f1a4 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -148,7 +148,7 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,

 	/* This covers 14..54 bits of va*/
 	rb = (v & ~0x7fUL) << 16;		/* AVA field */
-	rb |= v >> (62 - 8);			/* B field */
+	rb |= (v >> HPTE_V_SSIZE_SHIFT) << 8;	/* B field */
 	/*
 	 * AVA in v had cleared lower 23 bits. We need to derive
 	 * that from pteg index
--
2.1.1
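The bug and its fix can be demonstrated in isolation with a user-space sketch. The only value taken from the real headers is HPTE_V_SSIZE_SHIFT (62); the function names are made up for the example.

```c
#include <stdint.h>

/* B (segment size) occupies the top 2 bits of v; SSIZE shift is 62. */
#define HPTE_V_SSIZE_SHIFT 62

/* Buggy form: one combined shift brings in the top 10 bits of v
 * (bits 54..63), so guest-controlled bits 54..61 leak into the
 * AP/AVAL/L area of the RB value. */
static uint64_t b_field_buggy(uint64_t v)
{
	return v >> (62 - 8);
}

/* Fixed form: shift right 62 then left 8, so only the 2-bit B field
 * lands in RB bits 8 and 9. */
static uint64_t b_field_fixed(uint64_t v)
{
	return (v >> HPTE_V_SSIZE_SHIFT) << 8;
}
```

With bits 54..61 all zero the two forms agree; once the guest sets any of those bits, only the fixed form still yields a clean B field.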
[PATCH 0/5] Some fixes for HV KVM on PPC
Here are fixes for five bugs which were found in the testing of our PowerKVM product. The bugs range from guest performance issues to guest crashes and memory corruption. Please apply.

Paul.

---
 arch/powerpc/include/asm/kvm_book3s_64.h |  2 +-
 arch/powerpc/kvm/book3s_hv.c             | 22
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      | 44
 arch/powerpc/kvm/book3s_hv_rm_xics.c     | 36 --
 arch/powerpc/kvm/book3s_xics.c           | 30 +-
 arch/powerpc/kvm/book3s_xics.h           |  1 +
 6 files changed, 92 insertions(+), 43 deletions(-)
[PATCH 4/5] KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
From: Suresh E. Warrier warr...@linux.vnet.ibm.com

This fixes some inaccuracies in the state machine for the virtualized ICP when implementing the H_IPI hcall (Set_MFRR and related states):

1. The old code wipes out any pending interrupts when the new MFRR is more favored than the CPPR but less favored than a pending interrupt (by always modifying xisr and the pending_pri). This can cause us to lose a pending external interrupt. The correct behaviour is to modify the pending_pri and xisr in the ICP only if the MFRR is equal to or more favored than the current pending pri (since in this case, it is guaranteed that there cannot be a pending external interrupt). The code changes are required in both kvmppc_rm_h_ipi and kvmppc_h_ipi.

2. Again, in both kvmppc_rm_h_ipi and kvmppc_h_ipi, there is a check for whether the MFRR is being made less favored AND, further, whether the new MFRR is also less favored than the current CPPR, in which case we check for any resends pending in the ICP. These checks look like they are designed to cover the case where, if the MFRR is being made less favored, we opportunistically trigger a resend of any interrupts that had been previously rejected. Although this is not a state described by PAPR, it is an action we actually need to take, especially if the CPPR is already at 0xFF: in that case the resend bit will stay on until another ICP state change, which may be a long time coming, and the interrupt stays pending until then. The current code, which checks for MFRR > CPPR, is broken when CPPR is 0xFF, since it will never trigger in that case. Ideally, we would want to do a resend only if

	prio(pending_interrupt) < mfrr &&
	prio(pending_interrupt) < cppr

where the pending interrupt is the one that was rejected. But we don't have the priority of the pending interrupt saved, so we simply trigger a resend whenever the MFRR is made less favored.

3.
In kvmppc_rm_h_ipi, where we save state to pass resends to the virtual mode, we also need to save the ICP whose need_resend we reset, since this need not be my ICP (vcpu->arch.icp) as is incorrectly assumed by the current code. A new field rm_resend_icp is added to the kvmppc_icp structure for this purpose.

Signed-off-by: Suresh Warrier warr...@linux.vnet.ibm.com
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 36 ++--
 arch/powerpc/kvm/book3s_xics.c       | 30 +++---
 arch/powerpc/kvm/book3s_xics.h       |  1 +
 3 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 3ee38e6..7b066f6 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -183,8 +183,10 @@ static void icp_rm_down_cppr(struct kvmppc_xics *xics, struct kvmppc_icp *icp,
 	 * state update in HW (ie bus transactions) so we can handle them
 	 * separately here as well.
 	 */
-	if (resend)
+	if (resend) {
 		icp->rm_action |= XICS_RM_CHECK_RESEND;
+		icp->rm_resend_icp = icp;
+	}
 }

@@ -254,10 +256,25 @@ int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
 	 * nothing needs to be done as there can be no XISR to
 	 * reject.
 	 *
+	 * ICP state: Check_IPI
+	 *
 	 * If the CPPR is less favored, then we might be replacing
-	 * an interrupt, and thus need to possibly reject it as in
+	 * an interrupt, and thus need to possibly reject it.
 	 *
-	 * ICP state: Check_IPI
+	 * ICP State: IPI
+	 *
+	 * Besides rejecting any pending interrupts, we also
+	 * update XISR and pending_pri to mark IPI as pending.
+	 *
+	 * PAPR does not describe this state, but if the MFRR is being
+	 * made less favored than its earlier value, there might be
+	 * a previously-rejected interrupt needing to be resent.
+	 * Ideally, we would want to resend only if
+	 *	prio(pending_interrupt) < mfrr &&
+	 *	prio(pending_interrupt) < cppr
+	 * where pending interrupt is the one that was rejected. But
+	 * we don't have that state, so we simply trigger a resend
+	 * whenever the MFRR is made less favored.
 	 */
 	do {
 		old_state = new_state = ACCESS_ONCE(icp->state);
@@ -270,13 +287,14 @@ int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server,

 		resend = false;
 		if (mfrr < new_state.cppr) {
 			/* Reject a pending interrupt if not an IPI */
-			if (mfrr <= new_state.pending_pri)
+			if (mfrr <= new_state.pending_pri) {
 				reject = new_state.xisr;
-			new_state.pending_pri = mfrr;
-
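The rules from points 1 and 2 above can be modeled in a few lines of user-space C. This is a toy sketch, not the kernel code: the struct layout, TOY_XICS_IPI constant, and the simplified resend condition are all illustrative, and the model ignores the atomic compare-and-swap loop the real code needs.

```c
#include <stdbool.h>
#include <stdint.h>

/* In XICS, a lower numeric priority is more favored; 0xff means idle. */
#define TOY_XICS_IPI 2	/* hypothetical IPI source number */

struct toy_icp {
	uint8_t  cppr;		/* current processor priority */
	uint8_t  mfrr;		/* most favored request register */
	uint8_t  pending_pri;	/* priority of the pending interrupt */
	uint32_t xisr;		/* pending interrupt source, 0 = none */
};

/* Returns the XISR of a rejected interrupt (0 if none); *resend is set
 * whenever the MFRR became less favored, per fix #2 above. */
static uint32_t toy_h_ipi(struct toy_icp *icp, uint8_t mfrr, bool *resend)
{
	uint32_t reject = 0;

	*resend = mfrr > icp->mfrr;	/* less favored than before */
	icp->mfrr = mfrr;

	if (mfrr < icp->cppr) {
		/* Fix #1: only displace the pending interrupt when the
		 * IPI is equal to or more favored than it. */
		if (mfrr <= icp->pending_pri) {
			reject = icp->xisr;
			icp->pending_pri = mfrr;
			icp->xisr = TOY_XICS_IPI;
		}
	}
	return reject;
}
```

The first assertion below is exactly the bug in point 1: an IPI less favored than the pending external interrupt must leave it in place rather than wiping the xisr.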
[PATCH 3/5] KVM: PPC: Book3S HV: Fix KSM memory corruption
Testing with KSM active in the host showed occasional corruption of guest memory. Typically a page that should have contained zeroes would contain values that look like the contents of a user process stack (values such as 0x_3fff__xxx). Code inspection in kvmppc_h_protect revealed that there was a race condition with the possibility of granting write access to a page which is read-only in the host page tables. The code attempts to keep the host mapping read-only if the host userspace PTE is read-only, but if that PTE had been temporarily made invalid for any reason, the read-only check would not trigger and the host HPTE could end up read-write. Examination of the guest HPT in the failure situation revealed that there were indeed shared pages which should have been read-only that were mapped read-write. To close this race, we don't let a page go from being read-only to being read-write, as far as the real HPTE mapping the page is concerned (the guest view can go to read-write, but the actual mapping stays read-only). When the guest tries to write to the page, we take an HDSI and let kvmppc_book3s_hv_page_fault take care of providing a writable HPTE for the page. This eliminates the occasional corruption of shared pages that was previously seen with KSM active. 
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 44 ++---
 1 file changed, 17 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 084ad54..411720f 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -667,40 +667,30 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
 		rev->guest_rpte = r;
 		note_hpte_modification(kvm, rev);
 	}
-	r = (be64_to_cpu(hpte[1]) & ~mask) | bits;

 	/* Update HPTE */
 	if (v & HPTE_V_VALID) {
-		rb = compute_tlbie_rb(v, r, pte_index);
-		hpte[0] = cpu_to_be64(v & ~HPTE_V_VALID);
-		do_tlbies(kvm, &rb, 1, global_invalidates(kvm, flags), true);
 		/*
-		 * If the host has this page as readonly but the guest
-		 * wants to make it read/write, reduce the permissions.
-		 * Checking the host permissions involves finding the
-		 * memslot and then the Linux PTE for the page.
+		 * If the page is valid, don't let it transition from
+		 * readonly to writable. If it should be writable, we'll
+		 * take a trap and let the page fault code sort it out.
 		 */
-		if (hpte_is_writable(r) && kvm->arch.using_mmu_notifiers) {
-			unsigned long psize, gfn, hva;
-			struct kvm_memory_slot *memslot;
-			pgd_t *pgdir = vcpu->arch.pgdir;
-			pte_t pte;
-
-			psize = hpte_page_size(v, r);
-			gfn = ((r & HPTE_R_RPN) & ~(psize - 1)) >> PAGE_SHIFT;
-			memslot = __gfn_to_memslot(kvm_memslots_raw(kvm), gfn);
-			if (memslot) {
-				hva = __gfn_to_hva_memslot(memslot, gfn);
-				pte = lookup_linux_pte_and_update(pgdir, hva,
-								  1, &psize);
-				if (pte_present(pte) && !pte_write(pte))
-					r = hpte_make_readonly(r);
-			}
-		}
+		pte = be64_to_cpu(hpte[1]);
+		r = (pte & ~mask) | bits;
+		if (hpte_is_writable(r) && kvm->arch.using_mmu_notifiers &&
+		    !hpte_is_writable(pte))
+			r = hpte_make_readonly(r);
+		/* If the PTE is changing, invalidate it first */
+		if (r != pte) {
+			rb = compute_tlbie_rb(v, r, pte_index);
+			hpte[0] = cpu_to_be64((v & ~HPTE_V_VALID) |
+					      HPTE_V_ABSENT);
+			do_tlbies(kvm, &rb, 1, global_invalidates(kvm, flags),
+				  true);
+			hpte[1] = cpu_to_be64(r);
+		}
 	}
-	hpte[1] = cpu_to_be64(r);
-	eieio();
-	hpte[0] = cpu_to_be64(v & ~HPTE_V_HVLOCK);
+	unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
 	asm volatile("ptesync" : : : "memory");
 	return H_SUCCESS;
 }
--
2.1.1
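The invariant this patch enforces, never upgrading the real mapping from read-only to read-write in place, can be modeled in a few lines. This is a toy sketch, not the kernel code: the TOY_PP_RO encoding is hypothetical (the real HPTE PP encoding is more involved), and the fault-path upgrade is elided.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PP encoding: treat PP == 0x3 as read-only, anything
 * else as writable. */
#define TOY_PP_RO 0x3ULL

static bool toy_writable(uint64_t r)
{
	return (r & TOY_PP_RO) != TOY_PP_RO;
}

static uint64_t toy_make_ro(uint64_t r)
{
	return r | TOY_PP_RO;
}

/* The rule enforced by the patch: an H_PROTECT update may not upgrade
 * the current mapping (cur_r) from read-only to read-write.  The
 * guest's RW request is downgraded here and satisfied later by the
 * page-fault path, which can safely check the host permissions. */
static uint64_t toy_h_protect(uint64_t cur_r, uint64_t new_r)
{
	if (toy_writable(new_r) && !toy_writable(cur_r))
		new_r = toy_make_ro(new_r);
	return new_r;
}
```

The point of the design is that the race window disappears: the decision to grant write access is deferred to a path that holds the right locks, instead of being made against a host PTE that may be transiently invalid.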
[PATCH 2/5] KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
From: Mahesh Salgaonkar mah...@linux.vnet.ibm.com

When we get an HMI (hypervisor maintenance interrupt) while in a guest, we see that the guest enters the paused state. The reason is that kvmppc_handle_exit_hv falls through the default path and returns to the host instead of resuming the guest, which causes the guest to enter the paused state. HMI is a hypervisor-only interrupt and it is safe to resume the guest since the host has handled it already. This patch adds a switch case to resume the guest.

Without this patch we see the guest entering the paused state with the following console messages:

[ 3003.329351] Severe Hypervisor Maintenance interrupt [Recovered]
[ 3003.329356] Error detail: Timer facility experienced an error
[ 3003.329359] HMER: 0840
[ 3003.329360] TFMR: 4a12000980a84000
[ 3003.329366] vcpu c007c35094c0 (40):
[ 3003.329368] pc = c00c2ba0 msr = 80009032 trap = e60
[ 3003.329370] r 0 = c021ddc0 r16 = 0046
[ 3003.329372] r 1 = c0007a02bbd0 r17 = 327d5d98
[ 3003.329375] r 2 = c10980b8 r18 = 1fc9a0b0
[ 3003.329377] r 3 = c142d6b8 r19 = c142d6b8
[ 3003.329379] r 4 = 0002 r20 =
[ 3003.329381] r 5 = c524a110 r21 =
[ 3003.329383] r 6 = 0001 r22 =
[ 3003.329386] r 7 = r23 = c524a110
[ 3003.329388] r 8 = r24 = 0001
[ 3003.329391] r 9 = 0001 r25 = c0007c31da38
[ 3003.329393] r10 = c14280b8 r26 = 0002
[ 3003.329395] r11 = 746f6f6c2f68656c r27 = c524a110
[ 3003.329397] r12 = 28004484 r28 = c0007c31da38
[ 3003.329399] r13 = cfe01400 r29 = 0002
[ 3003.329401] r14 = 0046 r30 = c3011e00
[ 3003.329403] r15 = ffba r31 = 0002
[ 3003.329404] ctr = c041a670 lr = c0272520
[ 3003.329405] srr0 = c007e8d8 srr1 = 90001002
[ 3003.329406] sprg0 = sprg1 = cfe01400
[ 3003.329407] sprg2 = cfe01400 sprg3 = 0005
[ 3003.329408] cr = 48004482 xer = 2000 dsisr = 4200
[ 3003.329409] dar = 010015020048
[ 3003.329410] fault dar = 010015020048 dsisr = 4200
[ 3003.329411] SLB (8 entries):
[ 3003.329412] ESID = c800 VSID = 40016e7779000510
[ 3003.329413] ESID = d801 VSID = 400142add1000510
[ 3003.329414] ESID = f804 VSID = 4000eb1a81000510
[ 3003.329415] ESID = 1f00080b VSID = 40004fda0a000d90
[ 3003.329416] ESID = 3f00080c VSID = 400039f536000d90
[ 3003.329417] ESID = 180d VSID = 0001251b35150d90
[ 3003.329417] ESID = 0100080e VSID = 4001e4609d90
[ 3003.329418] ESID = d8000819 VSID = 40013d349c000400
[ 3003.329419] lpcr = c04881847001 sdr1 = 001b1906 last_inst =
[ 3003.329421] trap=0xe60 | pc=0xc00c2ba0 | msr=0x80009032
[ 3003.329524] Severe Hypervisor Maintenance interrupt [Recovered]
[ 3003.329526] Error detail: Timer facility experienced an error
[ 3003.329527] HMER: 0840
[ 3003.329527] TFMR: 4a12000980a94000
[ 3006.359786] Severe Hypervisor Maintenance interrupt [Recovered]
[ 3006.359792] Error detail: Timer facility experienced an error
[ 3006.359795] HMER: 0840
[ 3006.359797] TFMR: 4a12000980a84000

 Id    Name      State
-----------------------
 2     guest2    running
 3     guest3    paused
 4     guest4    running

Signed-off-by: Mahesh Salgaonkar mah...@linux.vnet.ibm.com
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e63587d..cd7e030 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -769,6 +769,8 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		vcpu->stat.ext_intr_exits++;
 		r = RESUME_GUEST;
 		break;
+	/* HMI is hypervisor interrupt and host has handled it. Resume guest.*/
+	case BOOK3S_INTERRUPT_HMI:
 	case BOOK3S_INTERRUPT_PERFMON:
 		r = RESUME_GUEST;
 		break;
--
2.1.1
Re: [PATCH 1/5] KVM: PPC: Book3S HV: Fix computation of tlbie operand
Paul Mackerras pau...@samba.org writes:

The B (segment size) field in the RB operand for the tlbie instruction is two bits, which we get from the top two bits of the first doubleword of the HPT entry to be invalidated. These bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM bit numbering). The compute_tlbie_rb() function gets these bits as v >> (62 - 8), which is not correct as it will bring in the top 10 bits, not just the top two. These extra bits could corrupt the AP, AVAL and L fields in the RB value. To fix this we shift right 62 bits and then shift left 8 bits, so we only get the two bits of the B field.

Good catch.

The first doubleword of the HPT entry is under the control of the guest kernel. In fact, Linux guests will always put zeroes in bits 54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing this.

Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Paul Mackerras pau...@samba.org

Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

---
 arch/powerpc/include/asm/kvm_book3s_64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 0aa8179..a37f1a4 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -148,7 +148,7 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,

 	/* This covers 14..54 bits of va*/
 	rb = (v & ~0x7fUL) << 16;		/* AVA field */
-	rb |= v >> (62 - 8);			/* B field */
+	rb |= (v >> HPTE_V_SSIZE_SHIFT) << 8;	/* B field */
 	/*
 	 * AVA in v had cleared lower 23 bits. We need to derive
 	 * that from pteg index
--
2.1.1
Re: [PATCH 1/5] KVM: PPC: Book3S HV: Fix computation of tlbie operand
Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com writes:

Paul Mackerras pau...@samba.org writes:

The B (segment size) field in the RB operand for the tlbie instruction is two bits, which we get from the top two bits of the first doubleword of the HPT entry to be invalidated. These bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM bit numbering). The compute_tlbie_rb() function gets these bits as v >> (62 - 8), which is not correct as it will bring in the top 10 bits, not just the top two. These extra bits could corrupt the AP, AVAL and L fields in the RB value. To fix this we shift right 62 bits and then shift left 8 bits, so we only get the two bits of the B field.

Good catch.

The first doubleword of the HPT entry is under the control of the guest kernel. In fact, Linux guests will always put zeroes in bits 54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing this.

Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Paul Mackerras pau...@samba.org

Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

---
 arch/powerpc/include/asm/kvm_book3s_64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 0aa8179..a37f1a4 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -148,7 +148,7 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,

 	/* This covers 14..54 bits of va*/
 	rb = (v & ~0x7fUL) << 16;		/* AVA field */
-	rb |= v >> (62 - 8);			/* B field */
+	rb |= (v >> HPTE_V_SSIZE_SHIFT) << 8;	/* B field */

Or should we do the below? I guess it is closer to what we have in the rest of the code:

	rb |= ((v >> (HPTE_V_SSIZE_SHIFT - 8)) & ~0xffUL);

 	/*
 	 * AVA in v had cleared lower 23 bits. We need to derive
 	 * that from pteg index
--
2.1.1