On Fri, Nov 02, 2018 at 10:26:37AM +0000, Suzuki K Poulose wrote:
> Hi Christoffer,
> 
> On 02/11/18 07:53, Christoffer Dall wrote:
> >There are two things we need to take care of when we create block
> >mappings in the stage 2 page tables:
> >
> >   (1) The alignment within a PMD between the host address range and the
> >   guest IPA range must be the same, since otherwise we end up mapping
> >   pages with the wrong offset.
> >
> >   (2) The head and tail of a memory slot may not cover a full block
> >   size, and we have to take care to not map those with block
> >   descriptors, since we could expose memory to the guest that the host
> >   did not intend to expose.
> >
> >So far, we have been taking care of (1), but not (2), and our commentary
> >describing (1) was somewhat confusing.
> >
> >This commit attempts to factor out the checks of both into a common
> >function, and if we don't pass the check, we won't attempt any PMD
> >mappings for either hugetlbfs or THP.
> >
> >Note that we used to only check the alignment for THP, not for
> >hugetlbfs, but as far as I can tell the check needs to be applied to
> >both scenarios.
> >
> >Cc: Ralph Palutke <[email protected]>
> >Cc: Lukas Braun <[email protected]>
> >Reported-by: Lukas Braun <[email protected]>
> >Signed-off-by: Christoffer Dall <[email protected]>
> >---
> >  virt/kvm/arm/mmu.c | 79 +++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 64 insertions(+), 15 deletions(-)
> >
> >diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >index 5eca48bdb1a6..4e7572656b5c 100644
> >--- a/virt/kvm/arm/mmu.c
> >+++ b/virt/kvm/arm/mmu.c
> >@@ -1470,6 +1470,63 @@ static void kvm_send_hwpoison_signal(unsigned long address,
> >     send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
> >  }
> >+static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot,
> >+                                           unsigned long hva)
> >+{
> >+    gpa_t gpa_start, gpa_end;
> >+    hva_t uaddr_start, uaddr_end;
> >+    size_t size;
> >+
> >+    size = memslot->npages * PAGE_SIZE;
> >+
> >+    gpa_start = memslot->base_gfn << PAGE_SHIFT;
> >+    gpa_end = gpa_start + size;
> >+
> >+    uaddr_start = memslot->userspace_addr;
> >+    uaddr_end = uaddr_start + size;
> >+
> >+    /*
> >+     * Pages belonging to memslots that don't have the same alignment
> >+     * within a PMD for userspace and IPA cannot be mapped with stage-2
> >+     * PMD entries, because we'll end up mapping the wrong pages.
> >+     *
> >+     * Consider a layout like the following:
> >+     *
> >+     *    memslot->userspace_addr:
> >+     *    +-----+--------------------+--------------------+---+
> >+     *    |abcde|fgh  Stage-1 PMD    |    Stage-1 PMD   tv|xyz|
> >+     *    +-----+--------------------+--------------------+---+
> >+     *
> >+     *    memslot->base_gfn << PAGE_SIZE:
> >+     *      +---+--------------------+--------------------+-----+
> >+     *      |abc|def  Stage-2 PMD    |    Stage-2 PMD     |tvxyz|
> >+     *      +---+--------------------+--------------------+-----+
> >+     *
> >+     * If we create those stage-2 PMDs, we'll end up with this incorrect
> >+     * mapping:
> >+     *   d -> f
> >+     *   e -> g
> >+     *   f -> h
> >+     */
> 
> Nice ! The comment makes it so easy to understand the problem.
> 
> >+    if ((gpa_start & ~S2_PMD_MASK) != (uaddr_start & ~S2_PMD_MASK))
> >+            return false;
> >+
> >+    /*
> >+     * Next, let's make sure we're not trying to map anything not covered
> >+     * by the memslot. This means we have to prohibit PMD size mappings
> >+     * for the beginning and end of a non-PMD aligned and non-PMD sized
> >+     * memory slot (illustrated by the head and tail parts of the
> >+     * userspace view above containing pages 'abcde' and 'xyz',
> >+     * respectively).
> >+     *
> >+     * Note that it doesn't matter if we do the check using the
> >+     * userspace_addr or the base_gfn, as both are equally aligned (per
> >+     * the check above) and equally sized.
> >+     */
> >+    return (hva & S2_PMD_MASK) >= uaddr_start &&
> >+           (hva & S2_PMD_MASK) + S2_PMD_SIZE <= uaddr_end;
> 
> Shouldn't this be :
> 
>               (hva & S2_PMD_MASK) + S2_PMD_SIZE < uaddr_end;
> 
>               or
> 
>               (hva & S2_PMD_MASK) + S2_PMD_SIZE <= (uaddr_end - 1);

Let's consider an example.  Imagine a 4MB region (1024 4K pages) starting
at 0, where we want to ensure that the 2MB block containing a given address
lies entirely within the region.

That means, size will be:

  size = 1024 * 4096 = 4194304
  uaddr_start = 0
  uaddr_end = 0 + 4194304 = 4194304

and the comparison check for anything in the first 2MB region becomes

      return 0 >= 0 && 2097152 <= 4194304;

which is what we want.

The comparison check for anything in the second 2MB region becomes

      return 2097152 >= 0 && 4194304 <= 4194304;

which is also what we want.

If we had used a strictly-less-than comparison or subtracted one from
uaddr_end, this check would fail, which we don't want.

Am I missing something?
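
If it helps, here is a quick standalone sketch of that comparison with the
numbers above (plain userspace C; BLOCK_SIZE/BLOCK_MASK and block_fits() are
just stand-ins for S2_PMD_SIZE/S2_PMD_MASK and the real check, not the
kernel code):

#include <stdbool.h>
#include <stdio.h>

#define BLOCK_SIZE	(2UL * 1024 * 1024)	/* stands in for S2_PMD_SIZE */
#define BLOCK_MASK	(~(BLOCK_SIZE - 1))	/* stands in for S2_PMD_MASK */

static bool block_fits(unsigned long hva, unsigned long uaddr_start,
		       unsigned long uaddr_end)
{
	/*
	 * Same shape as the check in the patch: the whole 2MB block
	 * containing hva must lie inside [uaddr_start, uaddr_end).
	 */
	return (hva & BLOCK_MASK) >= uaddr_start &&
	       (hva & BLOCK_MASK) + BLOCK_SIZE <= uaddr_end;
}

int main(void)
{
	unsigned long uaddr_start = 0;
	unsigned long uaddr_end = uaddr_start + 1024 * 4096;	/* 4MB slot */

	/* first block:  0 >= 0       && 2097152 <= 4194304 -> 1 */
	printf("%d\n", block_fits(0x1000, uaddr_start, uaddr_end));
	/* second block: 2097152 >= 0 && 4194304 <= 4194304 -> 1 */
	printf("%d\n", block_fits(0x201000, uaddr_start, uaddr_end));
	return 0;
}

Both calls print 1; the second one would start failing if we used a strict
less-than or subtracted one from uaddr_end.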

> 
> >+}
> >+
> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                       struct kvm_memory_slot *memslot, unsigned long hva,
> >                       unsigned long fault_status)
> >@@ -1495,6 +1552,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >             return -EFAULT;
> >     }
> >+    if (!fault_supports_stage2_pmd_mappings(memslot, hva))
> >+            force_pte = true;
> >+
> >+    if (logging_active)
> >+            force_pte = true;
> 
> super minor nit: If we combine the checks, maybe we could skip checking
> the alignments when logging is active.
> 
>       if (logging_active ||
>           !fault_supports_stage2_pmd_mappings(memslot, hva))
>               force_pte = true;
> 

I prefer the way the code is written now, because logging being active is a
logically very separate concept from checking the boundaries, and this is
really the slow path anyway, so I'm not sure there's any measurable
benefit.

More generally, I'd like user_mem_abort() to become more like:

if (check1())
   condition1 = value;
if (check2())
   condition2 = value;
if (check3())
   condition3 = value;

manipulate_page_tables(condition1, condition2, condition3, ...);

And I'll have a go at that following Punit's PUD huge pages series.
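
Concretely, in user_mem_abort() terms (stage2_apply_fault() is a made-up
name for whatever ends up doing the actual mapping, this is just the shape
I have in mind), something like:

	if (!fault_supports_stage2_pmd_mappings(memslot, hva))
		force_pte = true;
	if (logging_active)
		force_pte = true;
	if (vma_kernel_pagesize(vma) == PMD_SIZE && !force_pte)
		hugetlb = true;
	...
	ret = stage2_apply_fault(vcpu, fault_ipa, pfn, hugetlb, force_pte, flags);

so all the decisions are taken up front and the page table manipulation
just consumes them.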

> >+
> >     /* Let's check if we will get back a huge page backed by hugetlbfs */
> >     down_read(&current->mm->mmap_sem);
> >     vma = find_vma_intersection(current->mm, hva, hva + 1);
> >@@ -1504,22 +1567,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >             return -EFAULT;
> >     }
> >-    if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> >+    if (vma_kernel_pagesize(vma) == PMD_SIZE && !force_pte) {
> >             hugetlb = true;
> >             gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> >-    } else {
> >-            /*
> >-             * Pages belonging to memslots that don't have the same
> >-             * alignment for userspace and IPA cannot be mapped using
> >-             * block descriptors even if the pages belong to a THP for
> >-             * the process, because the stage-2 block descriptor will
> >-             * cover more than a single THP and we loose atomicity for
> >-             * unmapping, updates, and splits of the THP or other pages
> >-             * in the stage-2 block range.
> >-             */
> >-            if ((memslot->userspace_addr & ~PMD_MASK) !=
> >-                ((memslot->base_gfn << PAGE_SHIFT) & ~PMD_MASK))
> >-                    force_pte = true;
> >     }
> >     up_read(&current->mm->mmap_sem);
> >@@ -1558,7 +1608,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >              * should not be mapped with huge pages (it introduces churn
> >              * and performance degradation), so force a pte mapping.
> >              */
> >-            force_pte = true;
> >             flags |= KVM_S2_FLAG_LOGGING_ACTIVE;
> 
> Otherwise looks good to me.
> 

Thanks!

    Christoffer