Marcelo Tosatti wrote:
>>> +   /*
>>> +    * If it's a largepage mapping, there could be a previous
>>> +    * pointer to a PTE page hanging there, which will falsely
>>> +    * set was_rmapped.
>>> +    */
>>> +   if (largepage)
>>> +           was_rmapped = is_large_pte(*shadow_pte);
>>> +
>>>  
>>>       
>> But that pte page will have its parent_pte chain pointing to shadow_pte, 
>> no?  Either this can't happen, or we need to unlink that pte page first.
>>     
>
> This can happen: say you have a large page frame with shadowed pages
> in it, and therefore mapped with 4k ptes. When all the shadowed pages
> are removed, it will be mapped with a 2M page, overwriting the pointer
> to the (metaphysical) PTE page.
>
> So when you say "need to unlink that pte page first", you mean simply
> removing the parent pointer, which now points to the 2M PMD entry, right?
> This prevents a zap_mmu_page() on that metaphysical page from nuking the
> (unrelated) 2M PMD entry.
>   

Yes. We need to keep that tangle of pointers consistent.
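
Something like this in mmu_set_spte() should do it (untested sketch;
mmu_page_remove_parent_pte() and page_header() are the existing mmu.c
helpers, is_large_pte() is from your patch):

    /*
     * Sketch: if a 2M spte is about to overwrite a pointer to a shadow
     * pte page, drop that pte page's back-pointer first, so a later
     * zap of the metaphysical page can't clobber the (now unrelated)
     * 2M PMD entry.
     */
    if (largepage && is_shadow_present_pte(*shadow_pte)
        && !is_large_pte(*shadow_pte)) {
            struct kvm_mmu_page *child;

            child = page_header(*shadow_pte & PT64_BASE_ADDR_MASK);
            mmu_page_remove_parent_pte(child, shadow_pte);
    }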

> Nuking the metaphysical page (zap_page) seems overkill and
> unnecessary... it will eventually be recycled, or reused if the large
> mapping is nuked.
>   

Yes.

> It doesn't appear to be necessary to flush the TLB whenever a 2MB PMD
> overwrites a PMD-to-PTE-page pointer.
>   

Since they're pointing to the same underlying memory, no, though we may 
need to transfer the dirty bits from the ptes to the struct page.
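
Roughly like this, before the old pte page goes away (untested sketch;
'sp' is assumed to be the shadowed pte page, and SetPageDirty()/
PageReserved() mirror what kvm_release_page_dirty() does):

    int i;

    for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
            u64 spte = sp->spt[i];
            struct page *page;

            /* fold the spte dirty bit into the backing struct page */
            if (!is_shadow_present_pte(spte) || !(spte & PT_DIRTY_MASK))
                    continue;

            page = pfn_to_page((spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
            if (!PageReserved(page))
                    SetPageDirty(page);
    }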

> In the worst case other CPUs will use the cached 4k translations for a
> while. If there are permission changes on these translations, the OS is
> responsible for flushing the TLB anyway.
>   
>   
>>> Index: linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
>>> ===================================================================
>>> --- linux-2.6-x86-kvm.orig/arch/x86/kvm/paging_tmpl.h
>>> +++ linux-2.6-x86-kvm/arch/x86/kvm/paging_tmpl.h
>>> @@ -71,6 +71,7 @@ struct guest_walker {
>>>     unsigned pte_access;
>>>     gfn_t gfn;
>>>     u32 error_code;
>>> +   int largepage_backed;
>>> };
>>>
>>> static gfn_t gpte_to_gfn(pt_element_t gpte)
>>> @@ -120,7 +121,8 @@ static unsigned FNAME(gpte_access)(struc
>>>  */
>>> static int FNAME(walk_addr)(struct guest_walker *walker,
>>>                         struct kvm_vcpu *vcpu, gva_t addr,
>>> -                       int write_fault, int user_fault, int fetch_fault)
>>> +                       int write_fault, int user_fault, int fetch_fault,
>>> +                       int faulting)
>>> {
>>>     pt_element_t pte;
>>>     gfn_t table_gfn;
>>> @@ -130,6 +132,7 @@ static int FNAME(walk_addr)(struct guest
>>>     pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
>>> walk:
>>>     walker->level = vcpu->arch.mmu.root_level;
>>> +   walker->largepage_backed = 0;
>>>     pte = vcpu->arch.cr3;
>>> #if PTTYPE == 64
>>>     if (!is_long_mode(vcpu)) {
>>> @@ -192,10 +195,19 @@ walk:
>>>             if (walker->level == PT_DIRECTORY_LEVEL
>>>                 && (pte & PT_PAGE_SIZE_MASK)
>>>                 && (PTTYPE == 64 || is_pse(vcpu))) {
>>> -                   walker->gfn = gpte_to_gfn_pde(pte);
>>> +                   gfn_t gfn = gpte_to_gfn_pde(pte);
>>> +                   walker->gfn = gfn;
>>> +
>>>                     walker->gfn += PT_INDEX(addr, PT_PAGE_TABLE_LEVEL);
>>>                     if (PTTYPE == 32 && is_cpuid_PSE36())
>>>                             walker->gfn += pse36_gfn_delta(pte);
>>> +
>>> +                   if (faulting
>>> +                       && is_largepage_backed(vcpu, gfn)
>>> +                       && is_physical_memory(vcpu->kvm, walker->gfn)) {
>>> +                           walker->largepage_backed = 1;
>>> +                           walker->gfn = gfn;
>>> +                   }
>>>  
>>>       
>> I don't like this bit.  So far the walker has been independent of the 
>> host state, and only depended on guest data.  We can set 
>> largepage_backed unconditionally on a large guest pte, leave gfn 
>> containing the low-order bits, and leave the host checks for the actual 
>> mapping logic.
>>
>>     
>>> /*
>>> @@ -299,6 +313,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>>>             shadow_ent = ((u64 *)__va(shadow_addr)) + index;
>>>             if (level == PT_PAGE_TABLE_LEVEL)
>>>                     break;
>>> +           if (level == PT_DIRECTORY_LEVEL && walker->largepage_backed)
>>> +                   break;
>>> +
>>>  
>>>       
>> (here)
>>     
>
> gfn_to_page() needs to grab the struct page corresponding to the large
> page, not the offset struct page for the faulting 4k address within
> the large frame. Since gfn_to_page can sleep, there is no way to do
> that in the mapping logic, which happens under mmu_lock protection.
> We don't want to grab the large page frame "struct page" unless the
> is_largepage_backed() checks are successful.
>
> The checks could be done in page_fault() if walker->level == 2, before
> gfn_to_page()... But I don't see much difference between that and doing
> it inside walk_addr(). What do you say?
>
>   

I'd like to keep walk_addr() independent of the rest of the mmu (i.e. 
walk_addr() is 100% guest-oriented). Also, the issue you point out is 
shared by direct_map, which doesn't call walk_addr().
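
Concretely, the host-side checks could sit in FNAME(page_fault) (and in
the equivalent spot of the direct map path), before gfn_to_page() and
outside mmu_lock. Untested sketch, assuming a KVM_PAGES_PER_HPAGE-style
constant and the is_largepage_backed() check from your patch:

    /*
     * walk_addr() only records that the guest pte was large; the
     * host-side decision is made here, where we may still sleep,
     * before pinning the page and taking mmu_lock.
     */
    largepage = 0;
    if (walker.level == PT_DIRECTORY_LEVEL) {
            gfn_t large_gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE - 1);

            if (is_largepage_backed(vcpu, large_gfn)) {
                    walker.gfn = large_gfn;
                    largepage = 1;
            }
    }
    page = gfn_to_page(vcpu->kvm, walker.gfn);

    spin_lock(&vcpu->kvm->mmu_lock);
    /* ... FNAME(fetch) etc. run under the lock as before ... */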

An unrelated issue (pointed out by Jun Nakajima) is that this kills 
dirty log tracking (needed for migration). It could be solved simply by 
not using large page backing if dirty log tracking is enabled for that slot.
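
For instance, the largepage check could just refuse when the slot has a
dirty bitmap. Untested sketch; slot_allows_largepage() is a made-up
helper name, while gfn_to_memslot() and slot->dirty_bitmap are the
existing memslot interfaces:

    /* hypothetical helper, called from is_largepage_backed() */
    static int slot_allows_largepage(struct kvm *kvm, gfn_t large_gfn)
    {
            struct kvm_memory_slot *slot = gfn_to_memslot(kvm, large_gfn);

            /* dirty log tracking needs 4k granularity */
            if (slot && slot->dirty_bitmap)
                    return 0;
            return 1;
    }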

-- 
Any sufficiently difficult bug is indistinguishable from a feature.

