On Fri, Mar 21, 2008 at 06:23:41PM -0300, Marcelo Tosatti wrote:
> If there are any active shadow mappings to a page there is a guarantee
> that there is a valid linux pte mapping pointing at it. So page_count ==
             ^^ was
> 1 + nr_sptes.

Yes.

> So the theoretical race you're talking about is:
>
>         CPU0                                    CPU1
>
> spte = rmap_next(kvm, rmapp, NULL);
> while (spte) {
>       BUG_ON(!spte);
>       BUG_ON(!(*spte & PT_PRESENT_MASK));
>       rmap_printk("rmap_nuke: spte %p %llx\n", spte, *spte);
>       rmap_remove(kvm, spte);
>       set_shadow_pte(spte, shadow_trap_nonpresent_pte);
>       nuked = 1;
>       spte = rmap_next(kvm, rmapp, spte);
> }
>                       <----------
>                                               try_to_unmap_one()
>                                               page is now free
>                                               page allocated for other purposes
> if (nuked)
>       kvm_flush_remote_tlbs(kvm);

I'd more accurately describe the race as this:

        CPU0                                    CPU1

    spte = rmap_next(kvm, rmapp, NULL);
    while (spte) {
          BUG_ON(!spte);
          BUG_ON(!(*spte & PT_PRESENT_MASK));
          rmap_printk("rmap_nuke: spte %p %llx\n", spte, *spte);
          page_count = 2
                        <----------
                                                try_to_unmap_one()
                                                page_count = 1
          rmap_remove(kvm, spte); <- page is freed, race window opened
          set_shadow_pte(spte, shadow_trap_nonpresent_pte);
          nuked = 1;
          spte = rmap_next(kvm, rmapp, spte);
    }
    if (nuked)
          kvm_flush_remote_tlbs(kvm); <- race window closed

> And some other VCPU with the TLB cached writes to the now freed (and
> possibly allocated to another purpose) page.

The other VCPU could also read, not just write. If it reads, the guest
will crash; if it writes, the host will crash (host fs corruption
etc...) too.

> This case is safe because the path that frees a pte and subsequently
> a page will take care of flushing the TLB of any remote CPU's that
> possibly have it cached (before freeing the page, of course).
> ptep_clear_flush->flush_tlb_page.
>
> Am I missing something?

flush_tlb_page only takes care of accesses to the page through the qemu
task when outside vmx/svm mode. If the other physical cpu is running
code of some guest vcpu in vmx/svm mode after the race window is opened
and before it is closed, the race will trigger.

> > 2) not sure what you mean with synchronous, do you mean single
> >    threaded? I can't see how it can be single threaded (does
> >    ballooning stops all other vcpus?).
>
> No, I mean synchronous as in that no other vcpu will attempt to fault
> that _particular gfn_ in between KVM_ZAP_GFN and madvise.

The other vcpu can't fault any gfn; it will fault only _after_ the
kvm_flush_remote_tlbs. Otherwise there would be no problem. The fix is
exactly to make sure the other vcpu faults on that particular gfn
before the page is in the freelist, and to do that the tlb flush must
happen _before_ rmap_remove calls put_page. And madvise should happen
before KVM_ZAP_GFN to prevent a gfn_to_page in a parallel kvm page
fault, triggered by some other vcpu, from re-instantiating the spte
after KVM_ZAP_GFN has run.

> > Why are you taking the mmu_lock around rmap_nuke if no other vcpu
> > can take any page fault and call into get_user_pages in between
> > KVM_ZAP_GFN and madvise?
>
> Other vcpu's can take page faults and call into get_user_pages, but not
> for the gfn KVM_ZAP_GFN is operating on, because it has been allocated
> by the balloon driver.

With regard to the ordering of ioctl vs madvise, we'd better not trust
the guest. If the guest tries to access ram ranges that have been
inflated into the balloon, qemu should behave as if the guest were
accessing ram outside the virtualized e820 map, instead of leaking ram
to the buggy guest. There is no difference in the ordering, other than
doing madvise first being more robust in detecting a buggy guest. I
can't see any benefit in doing the ioctl first. The memory corrupting
race window will happen regardless, because the host VM rmap code is
the one releasing the page_count.
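To make that ordering concrete, here is a rough qemu-side sketch
(untested, purely illustrative: KVM_ZAP_GFN isn't in mainline, so the
ioctl argument struct and the balloon_release_page()/vm_fd/hva names
below are placeholders for whatever the actual patch and the qemu code
end up using):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
/* plus the header from the patch that defines KVM_ZAP_GFN */

/* hypothetical ioctl argument layout, for illustration only */
struct kvm_zap_gfn_param {
	uint64_t gfn;
};

static void balloon_release_page(int vm_fd, void *hva, uint64_t gfn)
{
	struct kvm_zap_gfn_param param = { .gfn = gfn };
	long page_size = sysconf(_SC_PAGESIZE);

	/*
	 * madvise first: the host VM rmap is what ends up releasing the
	 * page_count, and zapping only afterwards means a parallel
	 * gfn_to_page() from another vcpu fault can't re-instantiate
	 * the spte once KVM_ZAP_GFN has run.
	 */
	if (madvise(hva, page_size, MADV_DONTNEED) < 0)
		perror("madvise(MADV_DONTNEED)");

	/* then zap whatever shadow mappings are left for this gfn */
	if (ioctl(vm_fd, KVM_ZAP_GFN, &param) < 0)
		perror("ioctl(KVM_ZAP_GFN)");
}

The only point of the sketch is the order: drop the host page with
madvise first, zap the sptes second.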
> So we need mmu_lock to protect against concurrent shadow page and rmap
> operations.

Ok.

> As detailed above, we have a guarantee that there is a live linux pte
> by the time rmap_remove() nukes a shadow pte.

I can't see how. I mean right now with current kvm.git, not with some
patch floating around. Which is the whole point of adding KVM_ZAP_GFN,
because the mainline folks didn't merge the mmu notifiers in time for
.25. My mmu notifier code already takes care of doing all of this race
free, but I sure agree with this approach for the short term; the less
we have to wait the better.

Thanks!
Andrea