On Tue, Feb 12, 2008 at 01:55:54PM +0200, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> >Ok, how does the following look. Still need to plug in large page
> >creation in the nonpaging case, but this should be enough for comments.
> >  
> 
> Most of the comments are cosmetic, but a couple have some meat.
> 
> > 
> >+#define HPAGE_ALIGN_OFFSET(x) ((((x)+HPAGE_SIZE-1)&HPAGE_MASK) - (x))
> >  
> 
> A function please.

Done.

> Can probably be reshuffled to avoid the long long.  The compiler may 
> want to call helpers on i386...

Reworked it to operate on frame numbers instead of addresses, which gets rid
of the long long (and it's also neater).
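
For reference, a minimal sketch of what the frame-number version looks like
(the helper name and the open-coded pages-per-hpage are illustrative, not
the exact code in the patch):

    /*
     * Number of 4k frames from gfn up to the next large-frame boundary.
     * Working on gfn_t avoids the 64-bit address arithmetic entirely.
     */
    static int hpage_align_offset(gfn_t gfn)
    {
            unsigned long pages_per_hpage = HPAGE_SIZE / PAGE_SIZE;

            return (pages_per_hpage - (gfn % pages_per_hpage)) % pages_per_hpage;
    }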

> This bit can go into a function
> 
> >     if (!*rmapp) {
> >             rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
> >             *rmapp = (unsigned long)spte;
> >@@ -449,7 +573,10 @@ static void rmap_remove(struct kvm *kvm,
> >             kvm_release_page_dirty(page);
> >     else
> >             kvm_release_page_clean(page);
> >-    rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
> >+    if (is_large_pte(*spte))
> >+            rmapp = gfn_to_rmap_pde(kvm, sp->gfns[spte - sp->spt]);
> >+    else
> >+            rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt]);
> >  
> 
> As it is reused here.

Ok, gfn_to_rmap_pde() has been merged into gfn_to_rmap(), which now takes an
"lpage" parameter that selects which array (and array offset) to use.

> >+    /*
> >+     * If its a largepage mapping, there could be a previous
> >+     * pointer to a PTE page hanging there, which will falsely
> >+     * set was_rmapped.
> >+     */
> >+    if (largepage)
> >+            was_rmapped = is_large_pte(*shadow_pte);
> >+
> >  
> 
> But that pte page will have its parent_pte chain pointing to shadow_pte, 
> no?  Either this can't happen, or we need to unlink that pte page first.

This can happen: say you have a large page frame with shadowed pages in it,
and it is therefore mapped with 4k entries. When all the shadowed pages are
removed, it will be mapped with a 2M page, overwriting the pointer to the
(metaphysical) PTE page.

So when you say "need to unlink that pte page first", you mean simply
removing the parent pointer, which now points at the 2M PMD entry, right?
That keeps a zap_mmu_page() on the metaphysical page from nuking the
(unrelated) 2M PMD entry.
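
Something along these lines in mmu_set_spte(), before installing the large
spte (only a sketch of the idea, reusing the existing page_header() and
mmu_page_remove_parent_pte() helpers; exact placement is up for discussion):

    if (largepage && is_shadow_present_pte(*shadow_pte)
        && !is_large_pte(*shadow_pte)) {
            struct kvm_mmu_page *child;

            /*
             * shadow_pte currently points at a (metaphysical) PTE page;
             * drop that child's back-pointer so a later zap of the child
             * does not clobber the unrelated 2M mapping installed here.
             */
            child = page_header(*shadow_pte & PT64_BASE_ADDR_MASK);
            mmu_page_remove_parent_pte(child, shadow_pte);
    }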

Nuking the metaphysical page (zap_page) seems like overkill and
unnecessary... it will eventually be recycled, or reused if the large
mapping is nuked.

It doesn't appear necessary to flush the TLB when a 2MB PMD entry
overwrites a PMD-to-PTE-page pointer.

In the worst case other CPUs will use the cached 4k translations for a
while. If there are permission changes on those translations, the OS is
responsible for flushing the TLB anyway.

> >@@ -951,10 +1119,17 @@ unshadowed:
> >             mark_page_dirty(vcpu->kvm, gfn);
> > 
> >     pgprintk("%s: setting spte %llx\n", __FUNCTION__, spte);
> >+    pgprintk("instantiating %s PTE (%s) at %d (%llx)\n",
> >+             (spte&PT_PAGE_SIZE_MASK)? "2MB" : "4kB",
> >+             (spte&PT_WRITABLE_MASK)?"RW":"R", gfn, spte);
> >     set_shadow_pte(shadow_pte, spte);
> >+    if (!was_rmapped && (spte & (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
> >+            == (PT_PAGE_SIZE_MASK|PT_WRITABLE_MASK))
> >+            ++vcpu->kvm->stat.lpages;
> >+
> >  
> 
> Why do you only account for writable large pages?

No particular reason. Will fix.
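
I.e. something like this instead, counting every newly instantiated large
spte regardless of writability:

    if (!was_rmapped && is_large_pte(spte))
            ++vcpu->kvm->stat.lpages;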

> >Index: linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
> >===================================================================
> >--- linux-2.6-x86-kvm.orig/arch/x86/kvm/paging_tmpl.h
> >+++ linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
> >@@ -71,6 +71,7 @@ struct guest_walker {
> >     unsigned pte_access;
> >     gfn_t gfn;
> >     u32 error_code;
> >+    int largepage_backed;
> > };
> > 
> > static gfn_t gpte_to_gfn(pt_element_t gpte)
> >@@ -120,7 +121,8 @@ static unsigned FNAME(gpte_access)(struc
> >  */
> > static int FNAME(walk_addr)(struct guest_walker *walker,
> >                         struct kvm_vcpu *vcpu, gva_t addr,
> >-                        int write_fault, int user_fault, int fetch_fault)
> >+                        int write_fault, int user_fault, int fetch_fault,
> >+                        int faulting)
> > {
> >     pt_element_t pte;
> >     gfn_t table_gfn;
> >@@ -130,6 +132,7 @@ static int FNAME(walk_addr)(struct guest
> >     pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
> > walk:
> >     walker->level = vcpu->arch.mmu.root_level;
> >+    walker->largepage_backed = 0;
> >     pte = vcpu->arch.cr3;
> > #if PTTYPE == 64
> >     if (!is_long_mode(vcpu)) {
> >@@ -192,10 +195,19 @@ walk:
> >             if (walker->level == PT_DIRECTORY_LEVEL
> >                 && (pte & PT_PAGE_SIZE_MASK)
> >                 && (PTTYPE == 64 || is_pse(vcpu))) {
> >-                    walker->gfn = gpte_to_gfn_pde(pte);
> >+                    gfn_t gfn = gpte_to_gfn_pde(pte);
> >+                    walker->gfn = gfn;
> >+
> >                     walker->gfn += PT_INDEX(addr, PT_PAGE_TABLE_LEVEL);
> >                     if (PTTYPE == 32 && is_cpuid_PSE36())
> >                             walker->gfn += pse36_gfn_delta(pte);
> >+
> >+                    if (faulting
> >+                        && is_largepage_backed(vcpu, gfn)
> >+                        && is_physical_memory(vcpu->kvm, walker->gfn)) {
> >+                            walker->largepage_backed = 1;
> >+                            walker->gfn = gfn;
> >+                    }
> >  
> 
> I don't like this bit.  So far the walker has been independent of the 
> host state, and only depended on guest data.  We can set 
> largepage_backed unconditionally on a large guest pte, leave gfn 
> containing the low-order bits, and leave the host checks for the actual 
> mapping logic.
> 
> > 
> > /*
> >@@ -299,6 +313,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
> >             shadow_ent = ((u64 *)__va(shadow_addr)) + index;
> >             if (level == PT_PAGE_TABLE_LEVEL)
> >                     break;
> >+            if (level == PT_DIRECTORY_LEVEL && walker->largepage_backed)
> >+                    break;
> >+
> >  
> 
> (here)

gfn_to_page() needs to grab the struct page corresponding to the large
page, not the struct page at the offset of the faulting 4k address within
the large frame. Since gfn_to_page() can sleep, there is no way to do that
in the mapping logic, which runs under mmu_lock protection. And we don't
want to grab the large frame's struct page unless the is_largepage_backed()
checks succeed.

The checks could be done in page_fault(), when walker->level == 2, before
gfn_to_page()... But I don't see much difference between that and doing
it inside walk_addr(). What do you say?
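
For context, the ordering in page_fault() is roughly this (simplified,
error handling and mmap_sem locking elided; the final argument to
walk_addr() is the new "faulting" flag from the hunk above):

    r = FNAME(walk_addr)(&walker, vcpu, addr, write_fault, user_fault,
                         fetch_fault, 1);
    /*
     * walk_addr() decides whether the fault can be backed by a large page
     * and, if so, leaves walker.gfn pointing at the large frame.
     * gfn_to_page() may sleep, so it (and therefore the choice of which
     * gfn to pin) must happen before mmu_lock is taken.
     */
    page = gfn_to_page(vcpu->kvm, walker.gfn);

    spin_lock(&vcpu->kvm->mmu_lock);
    /* FNAME(fetch) builds the shadow mapping here; it cannot sleep. */
    spin_unlock(&vcpu->kvm->mmu_lock);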

> 
> >@@ -395,6 +412,13 @@ static int FNAME(page_fault)(struct kvm_
> > 
> >     page = gfn_to_page(vcpu->kvm, walker.gfn);
> > 
> >+    /* shortcut non-RAM accesses to avoid walking over a 2MB PMD entry */
> >+    if (is_error_page(page)) {
> >+            kvm_release_page_clean(page);
> >+            up_read(&current->mm->mmap_sem);
> >+            return 1;
> >+    }
> >+
> >  
> 
> This bit is already in kvm.git?

Yeah, great.

> > 
> > struct kvm_vcpu_stat {
> >Index: linux-2.6-x86-kvm/include/linux/kvm_host.h
> >===================================================================
> >--- linux-2.6-x86-kvm.orig/include/linux/kvm_host.h
> >+++ linux-2.6-x86-kvm/include/linux/kvm_host.h
> >@@ -99,7 +99,9 @@ struct kvm_memory_slot {
> >     unsigned long npages;
> >     unsigned long flags;
> >     unsigned long *rmap;
> >+    unsigned long *rmap_pde;
> >     unsigned long *dirty_bitmap;
> >+    int *largepage;
> >  
> 
> We can combine rmap_pde and largepage into an array of struct { int 
> write_count, unsigned long rmap_pde; } and save some cacheline accesses.

Done. Fixed the other comments too...
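
The combined version ends up looking roughly like this (type and field
names are mine, just to show the layout):

    struct kvm_lpage_info {
            unsigned long rmap_pde;
            int write_count;        /* 4k-shadowed pages in this large frame */
    };

    /* in struct kvm_memory_slot, replacing rmap_pde and largepage: */
    struct kvm_lpage_info *lpage_info;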

Thanks

> >     
> > 
> > void kvm_free_physmem(struct kvm *kvm)
> >@@ -300,6 +308,28 @@ int __kvm_set_memory_region(struct kvm *
> >             new.user_alloc = user_alloc;
> >             new.userspace_addr = mem->userspace_addr;
> >     }
> >+    if (npages && !new.rmap_pde) {
> >+            int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
> >+            if (npages % (HPAGE_SIZE/PAGE_SIZE))
> >+                    largepages++;
> >+            new.rmap_pde = vmalloc(largepages * sizeof(struct page *));
> >+
> >+            if (!new.rmap_pde)
> >+                    goto out_free;
> >+
> >+            memset(new.rmap_pde, 0, largepages * sizeof(struct page *));
> >+    }
> >+    if (npages && !new.largepage) {
> >+            int largepages = npages / (HPAGE_SIZE/PAGE_SIZE);
> >+            if (npages % (HPAGE_SIZE/PAGE_SIZE))
> >+                    largepages++;
> >+            new.largepage = kmalloc(largepages * sizeof(int), GFP_KERNEL);
> >+
> >+    
> 
> vmalloc() here too, for very large guests.
> 
> Please test the changes with user/test/x86/access.flat, with both normal 
> and large pages backing.
> 
> -- 
> error compiling committee.c: too many arguments to function
