On Thu, Jan 31, 2008 at 07:44:52AM +0200, Avi Kivity wrote:
> Joerg Roedel wrote:
> > On Tue, Jan 29, 2008 at 07:20:12PM +0200, Avi Kivity wrote:
> >
> >   
> >> Here's a rough sketch of my proposal:
> >>
> >> - For every memory slot, allocate an array containing one int for every 
> >> potential large page included within that memory slot.  Each entry in 
> >> the array contains the number of write-protected 4KB pages within the 
> >> large page frame corresponding to that entry.
> >>
> >> For example, if we have a memory slot for gpas 1MB-1GB, we'd have an 
> >> array of size 511, corresponding to the 511 2MB pages from 2MB upwards.  
> >> If we shadow a pagetable at address 4MB+8KB, we'd increment the entry 
> >> corresponding to the large page at 4MB.  When we unshadow that page, 
> >> decrement the entry.
> >>     
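
(Side note to make the thread easier to follow: the bookkeeping described here is
what account_shadowed()/unaccount_shadowed() and slot_largepage_idx() implement in
the patch at the bottom; a stripped-down sketch, names as in the patch:

	/* one int per potential 2MB frame in the slot, all starting at 0 */
	static void account_shadowed(struct kvm *kvm, gfn_t gfn)
	{
		/* one more write-protected 4K page inside this 2MB frame */
		*slot_largepage_idx(gfn, gfn_to_memslot(kvm, gfn)) += 1;
	}

	static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
	{
		*slot_largepage_idx(gfn, gfn_to_memslot(kvm, gfn)) -= 1;
	}

	/* a large spte may only be installed while the frame's counter is 0 */
)
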
> >
> > You need to take care that the 2MB gpa is aligned to 2MB host-physical
> > to be able to map it correctly with a large pte. So maybe we need two
> > memslots for 1MB-1GB: one for 1MB-2MB using normal 4KB pages and one
> > for 2MB-1GB which can be allocated using HugeTLBfs.
> >
> >   
> 
> Another option is to allocate all memory starting from address zero
> using hugetlbfs, and pass 0-640K as one memslot and 1MB+ as another. In
> any case the kernel needs to support both methods (e.g. it must handle
> a memslot that starts in the middle of a large page).
> 
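
(That unaligned-slot case is handled by HPAGE_ALIGN_OFFSET()/slot_largepage_idx()
in the patch below; a worked example of the index arithmetic, numbers purely
illustrative:

	/*
	 * slot base gpa = 1MB, HPAGE_SIZE = 2MB:
	 *   HPAGE_ALIGN_OFFSET(1MB) = 2MB - 1MB = 1MB
	 * gfn at gpa 4MB+8KB (Avi's example above):
	 *   idx = ((4MB+8KB - 1MB) + 1MB) / 2MB = 2
	 * i.e. entry 0 covers the unaligned head 1MB-2MB, entry 1 covers
	 * 2MB-4MB, entry 2 covers 4MB-6MB, and shadowing a pagetable at
	 * 4MB+8KB bumps entry 2.
	 */
)
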
> >> - If we attempt to shadow a large page (either a guest pse pte, or a 
> >> real-mode pseudo pte), we check if the host page is a large page.  If 
> >> so, we also check the write-protect count array.  If the result is zero, 
> >> we create a shadow pse pte.
> >>
> >> - Whenever we write-protect a page, also zap any large-page mappings for 
> >> that page.  This means rmap will need some extension to handle pde rmaps 
> >> in addition to pte rmaps.
> >>     
> >
> > This sounds straightforward to me. All you need is a short value for
> > every potential large page, initialized to -1 if the host page is a
> > large page and to 0 otherwise. Whenever this value is back at -1 we
> > can map the page with a large pte (and the guest maps it with a large
> > pte).
> >
> >   
> 
> You don't know in advance whether the host page is a large page. It
> needs to be checked at page fault time.
> 
> >> - qemu is extended to have a command-line option to use large pages to 
> >> back guest memory.
> >>
> >> Large pages should improve performance significantly, both with 
> >> traditional shadow and npt/ept.
> >>     
> >
> > Yes, I think so too. But with shadow paging it really depends on the
> > guest whether the performance improvement holds up long-term. In a
> > Linux guest, for example, the direct-mapped memory will become
> > fragmented over time (together with the location of the page tables),
> > so the number of potential large page mappings will likely decrease
> > over time.
> >
> >   
> 
> Yes, that's why it is important to be able to fail fast when checking 
> whether we can use a large spte.

Ok, how does the following look? I still need to plug in large page
creation in the nonpaging case, but this should be enough for comments.

One drawback is that the hugepage follow_page() path uses a single
mm->page_table_lock spinlock, whereas 4k pages use split locks, one per
PTE page. This is noticeable (SMP guests are slower because of it), but
it's a separate problem.

This gives a 2% improvement in kernel compilations and large memory copies.

Attached is the qemu part; it's obviously just a hack, in case someone
is interested in testing...
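
To actually exercise this, hugetlbfs has to be mounted at /mnt (that is where
the hack below creates its backing file), e.g. via "mount -t hugetlbfs none
/mnt", and enough 2MB pages need to be reserved through
/proc/sys/vm/nr_hugepages (512 of them cover 1GB of guest RAM).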

Index: linux-2.6-x86-kvm/arch/x86/kvm/mmu.c
===================================================================
--- linux-2.6-x86-kvm.orig/arch/x86/kvm/mmu.c
+++ linux-2.6-x86-kvm/arch/x86/kvm/mmu.c
@@ -27,6 +27,7 @@
 #include <linux/highmem.h>
 #include <linux/module.h>
 #include <linux/swap.h>
+#include <linux/hugetlb.h>
 
 #include <asm/page.h>
 #include <asm/cmpxchg.h>
@@ -205,6 +206,11 @@ static int is_shadow_present_pte(u64 pte
                && pte != shadow_notrap_nonpresent_pte;
 }
 
+static int is_large_pte(u64 pte)
+{
+       return pte & PT_PAGE_SIZE_MASK;
+}
+
 static int is_writeble_pte(unsigned long pte)
 {
        return pte & PT_WRITABLE_MASK;
@@ -349,6 +355,107 @@ static void mmu_free_rmap_desc(struct kv
        kfree(rd);
 }
 
+#define HPAGE_ALIGN_OFFSET(x) ((((x)+HPAGE_SIZE-1)&HPAGE_MASK) - (x))
+/*
+ * Return the offset inside the memslot largepage integer map for a given
+ * gfn, handling slots that are not large page aligned.
+ */
+int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
+{
+       unsigned long long idx;
+       unsigned long memslot_align;
+
+       memslot_align = HPAGE_ALIGN_OFFSET(slot->base_gfn << PAGE_SHIFT);
+       idx = ((gfn - slot->base_gfn) << PAGE_SHIFT) + memslot_align;
+       idx /= HPAGE_SIZE;
+       return &slot->largepage[idx];
+}
+
+static void account_shadowed(struct kvm *kvm, gfn_t gfn)
+{
+       int *largepage_idx;
+
+       largepage_idx = slot_largepage_idx(gfn, gfn_to_memslot(kvm, gfn));
+       *largepage_idx += 1;
+       WARN_ON(*largepage_idx > (HPAGE_SIZE/PAGE_SIZE));
+}
+
+static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
+{
+       int *largepage_idx;
+
+       largepage_idx = slot_largepage_idx(gfn, gfn_to_memslot(kvm, gfn));
+       *largepage_idx -= 1;
+       WARN_ON(*largepage_idx < 0);
+}
+
+static int has_wrprotected_page(struct kvm *kvm, gfn_t gfn)
+{
+       struct kvm_memory_slot *slot;
+
+       slot = gfn_to_memslot(kvm, gfn);
+       if (slot) {
+               int *largepage_idx;
+               int end_gfn;
+
+               largepage_idx = slot_largepage_idx(gfn, slot);
+               /* check if the largepage crosses a memslot */
+               end_gfn = slot->base_gfn + slot->npages;
+               if (gfn + (HPAGE_SIZE/PAGE_SIZE) >= end_gfn)
+                       return 1;
+               else
+                       return *largepage_idx;
+       }
+
+       return 1;
+}
+
+extern unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
+static int host_largepage_backed(struct kvm *kvm, gfn_t gfn)
+{
+       struct vm_area_struct *vma;
+       unsigned long addr;
+
+       addr = gfn_to_hva(kvm, gfn);
+       if (kvm_is_error_hva(addr))
+               return 0;
+
+       vma = find_vma(current->mm, addr);
+       if (vma && is_vm_hugetlb_page(vma))
+               return 1;
+
+       return 0;
+}
+
+static int is_largepage_backed(struct kvm_vcpu *vcpu, gfn_t large_gfn)
+{
+       if (has_wrprotected_page(vcpu->kvm, large_gfn))
+               return 0;
+
+       if (!host_largepage_backed(vcpu->kvm, large_gfn))
+               return 0;
+
+       if ((large_gfn << PAGE_SHIFT) & (HPAGE_SIZE-1))
+               return 0;
+
+       /* guest has 4M pages */
+       if (!is_pae(vcpu))
+               return 0;
+
+       return 1;
+}
+
+static int is_physical_memory(struct kvm *kvm, gfn_t gfn)
+{
+       unsigned long addr;
+
+       addr = gfn_to_hva(kvm, gfn);
+       if (kvm_is_error_hva(addr))
+               return 0;
+
+       return 1;
+}
+
 /*
  * Take gfn and return the reverse mapping to it.
  * Note: gfn must be unaliased before this function get called
@@ -362,6 +469,20 @@ static unsigned long *gfn_to_rmap(struct
        return &slot->rmap[gfn - slot->base_gfn];
 }
 
+static unsigned long *gfn_to_rmap_pde(struct kvm *kvm, gfn_t gfn)
+{
+       struct kvm_memory_slot *slot;
+       unsigned long memslot_align;
+       unsigned long long idx;
+
+       slot = gfn_to_memslot(kvm, gfn);
+       memslot_align = HPAGE_ALIGN_OFFSET(slot->base_gfn << PAGE_SHIFT);
+
+       idx = ((gfn - slot->base_gfn) << PAGE_SHIFT) + memslot_align;
+       idx /= HPAGE_SIZE;
+       return &slot->rmap_pde[idx];
+}
+
 /*
  * Reverse mapping data structures:
  *
@@ -371,7 +492,7 @@ static unsigned long *gfn_to_rmap(struct
  * If rmapp bit zero is one, (then rmap & ~1) points to a struct kvm_rmap_desc
  * containing more mappings.
  */
-static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
+static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, int hpage)
 {
        struct kvm_mmu_page *sp;
        struct kvm_rmap_desc *desc;
@@ -383,7 +504,10 @@ static void rmap_add(struct kvm_vcpu *vc
        gfn = unalias_gfn(vcpu->kvm, gfn);
        sp = page_header(__pa(spte));
        sp->gfns[spte - sp->spt] = gfn;
-       rmapp = gfn_to_rmap(vcpu->kvm, gfn);
+       if (!hpage)
+               rmapp = gfn_to_rmap(vcpu->kvm, gfn);
+       else
+               rmapp = gfn_to_rmap_pde(vcpu->kvm, gfn);
        if (!*rmapp) {
                rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
                *rmapp = (unsigned long)spte;
@@ -449,7 +573,10 @@ static void rmap_remove(struct kvm *kvm,
                kvm_release_page_dirty(page);
        else
                kvm_release_page_clean(page);
-       rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
+       if (is_large_pte(*spte))
+               rmapp = gfn_to_rmap_pde(kvm, sp->gfns[spte - sp->spt]);
+       else
+               rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
        if (!*rmapp) {
                printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
                BUG();
@@ -528,8 +655,27 @@ static void rmap_write_protect(struct kv
                }
                spte = rmap_next(kvm, rmapp, spte);
        }
+       /* check for huge page mappings */
+       rmapp = gfn_to_rmap_pde(kvm, gfn);
+       spte = rmap_next(kvm, rmapp, NULL);
+       while (spte) {
+               BUG_ON(!spte);
+               BUG_ON(!(*spte & PT_PRESENT_MASK));
+               BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
+               pgprintk("rmap_write_protect(large): spte %p %llx %d\n", spte, *spte, gfn);
+               if (is_writeble_pte(*spte)) {
+                       rmap_remove(kvm, spte);
+                       --kvm->stat.lpages;
+                       set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+                       write_protected = 1;
+               }
+               spte = rmap_next(kvm, rmapp, spte);
+       }
+
        if (write_protected)
                kvm_flush_remote_tlbs(kvm);
+
+       account_shadowed(kvm, gfn);
 }
 
 #ifdef MMU_DEBUG
@@ -749,11 +895,19 @@ static void kvm_mmu_page_unlink_children
        for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
                ent = pt[i];
 
+               if (is_shadow_present_pte(pt[i]) && is_large_pte(pt[i])) {
+                       if (is_writeble_pte(pt[i]))
+                               --kvm->stat.lpages;
+                       rmap_remove(kvm, &pt[i]);
+               }
+
                pt[i] = shadow_trap_nonpresent_pte;
                if (!is_shadow_present_pte(ent))
                        continue;
-               ent &= PT64_BASE_ADDR_MASK;
-               mmu_page_remove_parent_pte(page_header(ent), &pt[i]);
+               if (!is_large_pte(ent)) {
+                       ent &= PT64_BASE_ADDR_MASK;
+                       mmu_page_remove_parent_pte(page_header(ent), &pt[i]);
+               }
        }
        kvm_flush_remote_tlbs(kvm);
 }
@@ -793,6 +947,8 @@ static void kvm_mmu_zap_page(struct kvm 
        }
        kvm_mmu_page_unlink_children(kvm, sp);
        if (!sp->root_count) {
+               if (!sp->role.metaphysical)
+                       unaccount_shadowed(kvm, sp->gfn);
                hlist_del(&sp->hash_link);
                kvm_mmu_free_page(kvm, sp);
        } else
@@ -886,12 +1042,21 @@ struct page *gva_to_page(struct kvm_vcpu
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
                         unsigned pt_access, unsigned pte_access,
                         int user_fault, int write_fault, int dirty,
-                        int *ptwrite, gfn_t gfn, struct page *page)
+                        int *ptwrite, int largepage, gfn_t gfn,
+                        struct page *page)
 {
        u64 spte;
        int was_rmapped = is_rmap_pte(*shadow_pte);
        int was_writeble = is_writeble_pte(*shadow_pte);
 
+       /*
+        * If its a largepage mapping, there could be a previous
+        * pointer to a PTE page hanging there, which will falsely
+        * set was_rmapped.
+        */
+       if (largepage)
+               was_rmapped = is_large_pte(*shadow_pte);
+
        pgprintk("%s: spte %llx access %x write_fault %d"
                 " user_fault %d gfn %lx\n",
                 __FUNCTION__, *shadow_pte, pt_access,
@@ -911,6 +1076,8 @@ static void mmu_set_spte(struct kvm_vcpu
        spte |= PT_PRESENT_MASK;
        if (pte_access & ACC_USER_MASK)
                spte |= PT_USER_MASK;
+       if (largepage)
+               spte |= PT_PAGE_SIZE_MASK;
 
        if (is_error_page(page)) {
                set_shadow_pte(shadow_pte,
@@ -932,7 +1099,8 @@ static void mmu_set_spte(struct kvm_vcpu
                }
 
                shadow = kvm_mmu_lookup_page(vcpu->kvm, gfn);
-               if (shadow) {
+               if (shadow ||
+                  (largepage && has_wrprotected_page(vcpu->kvm, gfn))) {
                        pgprintk("%s: found shadow page for %lx, marking ro\n",
                                 __FUNCTION__, gfn);
                        pte_access &= ~ACC_WRITE_MASK;
@@ -951,10 +1119,17 @@ unshadowed:
                mark_page_dirty(vcpu->kvm, gfn);
 
        pgprintk("%s: setting spte %llx\n", __FUNCTION__, spte);
+       pgprintk("instantiating %s PTE (%s) at %d (%llx)\n",
+                (spte&PT_PAGE_SIZE_MASK)? "2MB" : "4kB",
+                (spte&PT_WRITABLE_MASK)?"RW":"R", gfn, spte);
        set_shadow_pte(shadow_pte, spte);
+       if (!was_rmapped && (spte & (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
+               == (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
+               ++vcpu->kvm->stat.lpages;
+
        page_header_update_slot(vcpu->kvm, shadow_pte, gfn);
        if (!was_rmapped) {
-               rmap_add(vcpu, shadow_pte, gfn);
+               rmap_add(vcpu, shadow_pte, gfn, largepage);
                if (!is_rmap_pte(*shadow_pte))
                        kvm_release_page_clean(page);
        } else {
@@ -987,7 +1162,7 @@ static int __nonpaging_map(struct kvm_vc
 
                if (level == 1) {
                        mmu_set_spte(vcpu, &table[index], ACC_ALL, ACC_ALL,
-                                    0, write, 1, &pt_write, gfn, page);
+                                    0, write, 1, &pt_write, 0, gfn, page);
                        return pt_write || is_io_pte(table[index]);
                }
 
@@ -1300,7 +1475,8 @@ static void mmu_pte_write_zap_pte(struct
 
        pte = *spte;
        if (is_shadow_present_pte(pte)) {
-               if (sp->role.level == PT_PAGE_TABLE_LEVEL)
+               if (sp->role.level == PT_PAGE_TABLE_LEVEL ||
+                   is_large_pte(pte))
                        rmap_remove(vcpu->kvm, spte);
                else {
                        child = page_header(pte & PT64_BASE_ADDR_MASK);
@@ -1308,14 +1484,18 @@ static void mmu_pte_write_zap_pte(struct
                }
        }
        set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+       if (is_large_pte(pte) && is_writeble_pte(pte))
+               --vcpu->kvm->stat.lpages;
 }
 
 static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
                                  struct kvm_mmu_page *sp,
                                  u64 *spte,
-                                 const void *new)
+                                 const void *new,
+                                 u64 old)
 {
-       if (sp->role.level != PT_PAGE_TABLE_LEVEL) {
+       if ((sp->role.level != PT_PAGE_TABLE_LEVEL)
+           && !vcpu->arch.update_pte.largepage) {
                ++vcpu->kvm->stat.mmu_pde_zapped;
                return;
        }
@@ -1390,6 +1570,10 @@ static void mmu_guess_page_from_pte_writ
        gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
        vcpu->arch.update_pte.gfn = gfn;
        vcpu->arch.update_pte.page = gfn_to_page(vcpu->kvm, gfn);
+       if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn))
+               vcpu->arch.update_pte.largepage = 1;
+       else
+               vcpu->arch.update_pte.largepage = 0;
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1487,7 +1671,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *
                        entry = *spte;
                        mmu_pte_write_zap_pte(vcpu, sp, spte);
                        if (new)
-                               mmu_pte_write_new_pte(vcpu, sp, spte, new);
+                               mmu_pte_write_new_pte(vcpu, sp, spte, new, entry);
                        mmu_pte_write_flush_tlb(vcpu, entry, *spte);
                        ++spte;
                }
Index: linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
===================================================================
--- linux-2.6-x86-kvm.orig/arch/x86/kvm/paging_tmpl.h
+++ linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
@@ -71,6 +71,7 @@ struct guest_walker {
        unsigned pte_access;
        gfn_t gfn;
        u32 error_code;
+       int largepage_backed;
 };
 
 static gfn_t gpte_to_gfn(pt_element_t gpte)
@@ -120,7 +121,8 @@ static unsigned FNAME(gpte_access)(struc
  */
 static int FNAME(walk_addr)(struct guest_walker *walker,
                            struct kvm_vcpu *vcpu, gva_t addr,
-                           int write_fault, int user_fault, int fetch_fault)
+                           int write_fault, int user_fault, int fetch_fault,
+                           int faulting)
 {
        pt_element_t pte;
        gfn_t table_gfn;
@@ -130,6 +132,7 @@ static int FNAME(walk_addr)(struct guest
        pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
 walk:
        walker->level = vcpu->arch.mmu.root_level;
+       walker->largepage_backed = 0;
        pte = vcpu->arch.cr3;
 #if PTTYPE == 64
        if (!is_long_mode(vcpu)) {
@@ -192,10 +195,19 @@ walk:
                if (walker->level == PT_DIRECTORY_LEVEL
                    && (pte & PT_PAGE_SIZE_MASK)
                    && (PTTYPE == 64 || is_pse(vcpu))) {
-                       walker->gfn = gpte_to_gfn_pde(pte);
+                       gfn_t gfn = gpte_to_gfn_pde(pte);
+                       walker->gfn = gfn;
+
                        walker->gfn += PT_INDEX(addr, PT_PAGE_TABLE_LEVEL);
                        if (PTTYPE == 32 && is_cpuid_PSE36())
                                walker->gfn += pse36_gfn_delta(pte);
+
+                       if (faulting
+                           && is_largepage_backed(vcpu, gfn)
+                           && is_physical_memory(vcpu->kvm, walker->gfn)) {
+                               walker->largepage_backed = 1;
+                               walker->gfn = gfn;
+                       }
                        break;
                }
 
@@ -245,6 +257,7 @@ static void FNAME(update_pte)(struct kvm
        pt_element_t gpte;
        unsigned pte_access;
        struct page *npage;
+       int largepage = vcpu->arch.update_pte.largepage;
 
        gpte = *(const pt_element_t *)pte;
        if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) {
@@ -261,7 +274,8 @@ static void FNAME(update_pte)(struct kvm
                return;
        get_page(npage);
        mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
-                    gpte & PT_DIRTY_MASK, NULL, gpte_to_gfn(gpte), npage);
+                    gpte & PT_DIRTY_MASK, NULL, largepage, gpte_to_gfn(gpte),
+                    npage);
 }
 
 /*
@@ -299,6 +313,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
                shadow_ent = ((u64 *)__va(shadow_addr)) + index;
                if (level == PT_PAGE_TABLE_LEVEL)
                        break;
+               if (level == PT_DIRECTORY_LEVEL && walker->largepage_backed)
+                       break;
+
                if (is_shadow_present_pte(*shadow_ent)) {
                        shadow_addr = *shadow_ent & PT64_BASE_ADDR_MASK;
                        continue;
@@ -337,7 +354,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
        mmu_set_spte(vcpu, shadow_ent, access, walker->pte_access & access,
                     user_fault, write_fault,
                     walker->ptes[walker->level-1] & PT_DIRTY_MASK,
-                    ptwrite, walker->gfn, page);
+                    ptwrite, walker->largepage_backed, walker->gfn, page);
 
        return shadow_ent;
 }
@@ -380,7 +397,7 @@ static int FNAME(page_fault)(struct kvm_
         * Look up the shadow pte for the faulting address.
         */
        r = FNAME(walk_addr)(&walker, vcpu, addr, write_fault, user_fault,
-                            fetch_fault);
+                            fetch_fault, 1);
 
        /*
         * The page is not mapped by the guest.  Let the guest handle it.
@@ -395,6 +412,13 @@ static int FNAME(page_fault)(struct kvm_
 
        page = gfn_to_page(vcpu->kvm, walker.gfn);
 
+       /* shortcut non-RAM accesses to avoid walking over a 2MB PMD entry */
+       if (is_error_page(page)) {
+               kvm_release_page_clean(page);
+               up_read(&current->mm->mmap_sem);
+               return 1;
+       }
+
        spin_lock(&vcpu->kvm->mmu_lock);
        kvm_mmu_free_some_pages(vcpu);
        shadow_pte = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
@@ -428,7 +452,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kv
        gpa_t gpa = UNMAPPED_GVA;
        int r;
 
-       r = FNAME(walk_addr)(&walker, vcpu, vaddr, 0, 0, 0);
+       r = FNAME(walk_addr)(&walker, vcpu, vaddr, 0, 0, 0, 0);
 
        if (r) {
                gpa = gfn_to_gpa(walker.gfn);
Index: linux-2.6-x86-kvm/include/asm-x86/kvm_host.h
===================================================================
--- linux-2.6-x86-kvm.orig/include/asm-x86/kvm_host.h
+++ linux-2.6-x86-kvm/include/asm-x86/kvm_host.h
@@ -228,6 +228,7 @@ struct kvm_vcpu_arch {
        struct {
                gfn_t gfn;          /* presumed gfn during guest pte update */
                struct page *page;  /* page corresponding to that gfn */
+               int largepage;
        } update_pte;
 
        struct i387_fxsave_struct host_fx_image;
@@ -298,6 +299,7 @@ struct kvm_vm_stat {
        u32 mmu_recycled;
        u32 mmu_cache_miss;
        u32 remote_tlb_flush;
+       u32 lpages;
 };
 
 struct kvm_vcpu_stat {
Index: linux-2.6-x86-kvm/include/linux/kvm_host.h
===================================================================
--- linux-2.6-x86-kvm.orig/include/linux/kvm_host.h
+++ linux-2.6-x86-kvm/include/linux/kvm_host.h
@@ -99,7 +99,9 @@ struct kvm_memory_slot {
        unsigned long npages;
        unsigned long flags;
        unsigned long *rmap;
+       unsigned long *rmap_pde;
        unsigned long *dirty_bitmap;
+       int *largepage;
        unsigned long userspace_addr;
        int user_alloc;
 };
Index: linux-2.6-x86-kvm/virt/kvm/kvm_main.c
===================================================================
--- linux-2.6-x86-kvm.orig/virt/kvm/kvm_main.c
+++ linux-2.6-x86-kvm/virt/kvm/kvm_main.c
@@ -188,9 +188,17 @@ static void kvm_free_physmem_slot(struct
        if (!dont || free->dirty_bitmap != dont->dirty_bitmap)
                vfree(free->dirty_bitmap);
 
+       if (!dont || free->rmap_pde != dont->rmap_pde)
+               vfree(free->rmap_pde);
+
+       if (!dont || free->largepage != dont->largepage)
+               kfree(free->largepage);
+
        free->npages = 0;
        free->dirty_bitmap = NULL;
        free->rmap = NULL;
+       free->rmap_pde = NULL;
+       free->largepage = NULL;
 }
 
 void kvm_free_physmem(struct kvm *kvm)
@@ -300,6 +308,28 @@ int __kvm_set_memory_region(struct kvm *
                new.user_alloc = user_alloc;
                new.userspace_addr = mem->userspace_addr;
        }
+       if (npages && !new.rmap_pde) {
+               int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
+               if (npages % (HPAGE_SIZE/PAGE_SIZE))
+                       largepages++;
+               new.rmap_pde = vmalloc(largepages * sizeof(struct page *));
+
+               if (!new.rmap_pde)
+                       goto out_free;
+
+               memset(new.rmap_pde, 0, largepages * sizeof(struct page *));
+       }
+       if (npages && !new.largepage) {
+               int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
+               if (npages % (HPAGE_SIZE/PAGE_SIZE))
+                       largepages++;
+               new.largepage = kmalloc(largepages * sizeof(int), GFP_KERNEL);
+
+               if (!new.largepage)
+                       goto out;
+
+               memset(new.largepage, 0, largepages * sizeof(int));
+       }
 
        /* Allocate page dirty bitmap if needed */
        if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
@@ -443,7 +473,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, 
 }
 EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);
 
-static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
+unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 {
        struct kvm_memory_slot *slot;
 
@@ -453,6 +483,7 @@ static unsigned long gfn_to_hva(struct k
                return bad_hva();
        return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE);
 }
+EXPORT_SYMBOL(gfn_to_hva);
 
 /*
  * Requires current->mm->mmap_sem to be held
Index: linux-2.6-x86-kvm/arch/x86/kvm/x86.c
===================================================================
--- linux-2.6-x86-kvm.orig/arch/x86/kvm/x86.c
+++ linux-2.6-x86-kvm/arch/x86/kvm/x86.c
@@ -75,6 +75,7 @@ struct kvm_stats_debugfs_item debugfs_en
        { "mmu_recycled", VM_STAT(mmu_recycled) },
        { "mmu_cache_miss", VM_STAT(mmu_cache_miss) },
        { "remote_tlb_flush", VM_STAT(remote_tlb_flush) },
+       { "lpages", VM_STAT(lpages) },
        { NULL }
 };
 
Index: kvm-userspace/qemu/vl.c
===================================================================
--- kvm-userspace.orig/qemu/vl.c
+++ kvm-userspace/qemu/vl.c
@@ -8501,6 +8501,31 @@ void qemu_get_launch_info(int *argc, cha
     *opt_incoming = incoming;
 }
 
+#define HPAGE_SIZE 2*1024*1024
+
+void *alloc_huge_area(unsigned long memory)
+{
+       void *area;
+       int fd;
+       char path[] = "/mnt/kvm.XXXXXX";
+
+       mkstemp(path);
+       fd = open(path, O_RDWR);
+       if (fd < 0) {
+               perror("open");
+               exit(0);
+       }
+       memory = (memory+HPAGE_SIZE-1) & ~(HPAGE_SIZE-1);
+
+       area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
+       if (area == MAP_FAILED) {
+               perror("mmap");
+               exit(0);
+       }
+
+       return area;
+}
+
 int main(int argc, char **argv)
 {
 #ifdef CONFIG_GDBSTUB
@@ -9330,9 +9355,9 @@ int main(int argc, char **argv)
 
             ret = kvm_qemu_check_extension(KVM_CAP_USER_MEMORY);
             if (ret) {
-               printf("allocating %d MB\n", phys_ram_size/1024/1024);
-                phys_ram_base = qemu_vmalloc(phys_ram_size);
-               if (!phys_ram_base) {
+                //phys_ram_base = qemu_vmalloc(phys_ram_size);
+               phys_ram_base = alloc_huge_area(phys_ram_size);
+               if (!phys_ram_base) {
                        fprintf(stderr, "Could not allocate physical memory\n");
                        exit(1);
                }