On Fri, Mar 21, 2008 at 04:56:50PM +0100, Andrea Arcangeli wrote:
> On Fri, Mar 21, 2008 at 10:37:00AM -0300, Marcelo Tosatti wrote:
> > This is not the final put_page().
> > 
> > Remote TLB's are flushed here, after rmap_remove:
> > 
> > +       if (nuked)
> > +               kvm_flush_remote_tlbs(kvm);
> > 
> > This ioctl is called before zap_page_range() is executed through
> > sys_madvise(MADV_DONTNEED) to remove the page in question.
> > 
> > We know that the guest will not attempt to fault in the gfn because
> > the virtio balloon driver is synchronous (it will only attempt to
> > release that page back to the guest OS once rmap_nuke+zap_page_range has
> > finished).
> > 
> > Can you be more verbose?
> 
> Sure.
> 
> 1) even if you run madvise(MADV_DONTNEED) after KVM_ZAP_GFN, the anon
>    page can be released by the VM at any time without any kvm-aware
>    lock (there will be a swap reference to it but no page_count
>    references anymore, leading to memory corruption in the host in
>    presence of memory pressure). This is purely theoretical of course;
>    not sure if timings or probabilities allow for reproducing this in
>    real life.

If there are any active shadow mappings to a page, there is a guarantee
that a valid linux pte still maps it. So page_count == 1 + nr_sptes.
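
To make that explicit, something like the following check should hold.
This is an illustrative sketch only, not actual mmu.c code; it just
walks the rmap chain the same way the loop quoted below does:

/*
 * Sketch of the invariant above: each live spte pins the host page
 * (the reference was taken via get_user_pages() at shadow fault time),
 * plus one reference held through the linux pte.
 */
static void check_gfn_refcounts(struct kvm *kvm, unsigned long *rmapp,
                                struct page *page)
{
        int nr_sptes = 0;
        u64 *spte = rmap_next(kvm, rmapp, NULL);

        while (spte) {
                nr_sptes++;
                spte = rmap_next(kvm, rmapp, spte);
        }
        BUG_ON(page_count(page) != 1 + nr_sptes);
}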

So the theoretical race you're talking about is:


    CPU0                                                        CPU1

    spte = rmap_next(kvm, rmapp, NULL);
    while (spte) {
            BUG_ON(!spte);
            BUG_ON(!(*spte & PT_PRESENT_MASK));
            rmap_printk("rmap_nuke: spte %p %llx\n", spte, *spte);
            rmap_remove(kvm, spte);
            set_shadow_pte(spte, shadow_trap_nonpresent_pte);
            nuked = 1;
            spte = rmap_next(kvm, rmapp, spte);
    }
                                                        <---------- try_to_unmap_one()
                                                                    page is now free
                                                                    page allocated for
                                                                    other purposes
    if (nuked)
            kvm_flush_remote_tlbs(kvm);

And some other VCPU, with the stale translation still cached in its TLB,
writes to the now freed (and possibly reallocated for another purpose)
page.

This case is safe because the path that clears a pte and subsequently
frees the page takes care of flushing the TLB of any remote CPUs that
might still have the translation cached, before the page is actually
freed (ptep_clear_flush->flush_tlb_page).
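
The relevant part of try_to_unmap_one() looks roughly like this
(simplified sketch; swap entry setup and error handling omitted, and
the exact arguments vary between kernel versions):

        /* inside try_to_unmap_one(), once the pte has been located: */
        flush_cache_page(vma, address, page_to_pfn(page));
        pteval = ptep_clear_flush(vma, address, pte);   /* clears pte + flushes remote TLBs */
        if (pte_dirty(pteval))
                set_page_dirty(page);
        page_remove_rmap(page, vma);
        page_cache_release(page);                       /* only now can the page be freed */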

Am I missing something?

> 2) not sure what you mean by synchronous, do you mean single
>    threaded? I can't see how it can be single threaded (does
>    ballooning stop all other vcpus?). 

No, I mean synchronous in the sense that no other vcpu will attempt to
fault in that _particular gfn_ in between KVM_ZAP_GFN and madvise.
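
So the userspace side does, in effect, the following. This is only a
sketch to show the ordering; the KVM_ZAP_GFN argument layout here is an
assumption, not the real interface:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
/* KVM_ZAP_GFN would come from <linux/kvm.h> with the patch applied */

/* sketch: release one ballooned guest page back to the host */
static int balloon_release_page(int vm_fd, uint64_t gfn, void *hva,
                                size_t page_size)
{
        /* 1) zap the shadow ptes for this gfn (and flush remote TLBs) */
        if (ioctl(vm_fd, KVM_ZAP_GFN, &gfn) < 0)       /* argument form assumed */
                return -1;
        /* 2) only then drop the backing anonymous page */
        return madvise(hva, page_size, MADV_DONTNEED);
}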

>  Why are you taking the mmu_lock around rmap_nuke if no other vcpu
>  can take any page fault and call into get_user_pages in between
>  KVM_ZAP_GFN and madvise?

Other vcpus can take page faults and call into get_user_pages, but not
for the gfn KVM_ZAP_GFN is operating on, because that page has been
allocated by the balloon driver inside the guest.

So we need mmu_lock to protect against concurrent shadow page and rmap
operations.
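
So the ioctl side ends up looking like this (sketch only; rmap_nuke()
is the function containing the loop quoted above, and its exact
signature here is my shorthand):

/* sketch: the zap path must hold mmu_lock, the same lock the shadow
 * page fault path takes, so the rmap walk cannot race with concurrent
 * faults on other gfns */
static void kvm_zap_gfn_locked(struct kvm *kvm, gfn_t gfn)
{
        spin_lock(&kvm->mmu_lock);
        rmap_nuke(kvm, gfn);    /* the loop quoted above: clears each spte
                                   and does kvm_flush_remote_tlbs() if any
                                   spte was actually removed */
        spin_unlock(&kvm->mmu_lock);
}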

> As far as I
>    can tell the only possible safe ordering is madvise; KVM_ZAP_GFN,
>    which is emulating the mmu notifier behavior incidentally.
> 
> Note that the rmap_remove smp race (also note here smp race means
> smp-host race, it will trigger even if guest is UP) might be a generic
> issue with the rmap_remove logic. I didn't analyze all the possible
> rmap_remove callers yet (this was in my todo list), I just made sure
> that my code would be smp safe.

As detailed above, we have a guarantee that there is a live linux pte
by the time rmap_remove() nukes a shadow pte.

> > By the way, I don't see invalidate_begin/invalidate_end hooks in the KVM 
> > part of MMU notifiers V9 patch? (meaning that zap_page_range will not zap
> > the spte's for the pages in question).
> 
> range_begin isn't needed. range_begin is needed only by secondary mmu
> drivers that aren't reference counting the pages. The _end callback is
> below. It could be improved to skip the whole range in a single browse
> of the memslots instead of browsing it for each page in the range. The
> mmu notifiers aren't merged and this code may still require changes in
> terms of API if EMM is merged instead of #v9 (hope not), so I tried to
> keep it simple.

Oh, I missed that. Nice.
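
For reference, the shape of that _end callback in the KVM part of the
#v9 patch is roughly the following (sketch; names follow the patch but
the exact signatures are from memory):

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
                                                  struct mm_struct *mm,
                                                  unsigned long start,
                                                  unsigned long end)
{
        struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);
        unsigned long addr;

        /* one memslot lookup per page in the range; as you say, this
         * could be improved to browse the memslots once for the whole
         * range */
        for (addr = start; addr < end; addr += PAGE_SIZE)
                kvm_unmap_hva(kvm, addr);
}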
