Andrea Arcangeli wrote:
> I thought some more about this.
>
> BTW, for completeness: normally (with the exception of vm_destroy)
> the put_page run by rmap_remove won't be the last one, but the page
> can still go into the freelist a moment after put_page runs (leading
> to the same problem). The VM is prevented from freeing the page while
> it's pinned, but the VM can do the final free on the page before
> rmap_remove returns. And without mmu notifiers there's no
> serialization that makes the core VM stop on mmu_lock and wait for
> the tlb flush to run before the VM finally executes the last free of
> the page. mmu notifiers fix this race for regular swapping, as the
> core VM will block on mmu_lock waiting for the tlb flush (for this to
> work the tlb flush must always happen inside mmu_lock, unless the
> order is exactly "spte = nonpresent; tlbflush; put_page"). A
> VM_LOCKED on the vmas backing the anonymous memory will fix this for
> regular swapping too (I did something like this in a patch at the end
> as a band-aid).
>
> But thinking about it more: the moment we intend to allow anybody to
> randomly __munmap__ any part of the guest physical address space,
> like for ballooning while the guest runs (think of an unprivileged
> user owning /dev/kvm and running munmap at will), not even VM_LOCKED
> (ignored by munmap) and not even the mmu notifiers can prevent the
> page from being queued in the kernel freelists immediately after
> rmap_remove returns. This is because rmap_remove may run on a
> different host cpu in between unmap_vmas and invalidate_range_end.
>
> Running the ioctl before munmap won't help to prevent the race, as
> the guest can still re-instantiate the sptes with page faults between
> the ioctl and munmap.
>
> However, we have invalidate_range_begin. If we invalidate all sptes
> in invalidate_range_begin and we hold off the page faults in between
> _begin/_end, then we can fix this with the mmu notifiers.
>

This can be done by taking mmu_lock in _begin and releasing it in _end,
unless there's a lock dependency issue.
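Very roughly, and assuming mmu_lock can really be held across that
window, the callbacks could look something like the sketch below;
kvm_zap_hva_range() is a made-up name for whatever ends up zapping the
sptes for an hva range, and the embedded kvm->mmu_notifier field is
likewise an assumption, not the actual patch:

static void kvm_mmu_notifier_invalidate_range_begin(struct mmu_notifier *mn,
						    struct mm_struct *mm,
						    unsigned long start,
						    unsigned long end)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	/*
	 * Hold mmu_lock until _end: the page fault path also takes
	 * mmu_lock before instantiating sptes, so it cannot re-create
	 * mappings for pages the core VM is about to free.
	 */
	spin_lock(&kvm->mmu_lock);
	kvm_zap_hva_range(kvm, start, end);	/* spte = nonpresent + rmap_remove */
	kvm_flush_remote_tlbs(kvm);
}

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
						  struct mm_struct *mm,
						  unsigned long start,
						  unsigned long end)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	/* only now may the core VM free the pages in [start, end) */
	spin_unlock(&kvm->mmu_lock);
}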
> So I think I can allow munmap safely (to an unprivileged user too) by
> using _range_begin somehow. For this to work, any relevant tlb flush
> must happen inside the _same_ mmu_lock critical section where
> spte=nonpresent and rmap_remove run too (thanks to the mmu_lock, the
> ordering of those operations won't matter anymore, and no queue will
> be needed).
>

I don't understand your conclusion: you prove that mlock() is not good
enough, then post a patch to do it?

I'll take another shot at fixing rmap_remove(); I don't like to cripple
swapping for 2.6.25 (though it will only be really dependable in .26).

-- 
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
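For reference, the ordering constraint Andrea describes (spte =
nonpresent, rmap_remove and the tlb flush all in one mmu_lock critical
section) amounts to roughly the following sketch; spte_to_page(),
set_spte_nonpresent() and unlink_rmap() are illustrative names, not the
actual mmu.c code, which does the put_page inside rmap_remove() itself:

static void zap_spte(struct kvm *kvm, u64 *sptep)
{
	struct page *page;

	spin_lock(&kvm->mmu_lock);

	page = spte_to_page(*sptep);	/* illustrative lookup */
	set_spte_nonpresent(sptep);	/* spte = nonpresent */
	unlink_rmap(kvm, sptep);	/* rmap_remove minus the put_page */

	/*
	 * The flush runs in the same mmu_lock section as the two steps
	 * above, so invalidate_range_begin (which also takes mmu_lock)
	 * cannot complete, and the page cannot be freed, while a stale
	 * tlb entry might still exist.
	 */
	kvm_flush_remote_tlbs(kvm);
	put_page(page);

	spin_unlock(&kvm->mmu_lock);
}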
