I thought some more about this. BTW, for completeness: normally (with the exception of vm_destroy) the put_page run by rmap_remove won't be the last one, but the page can still go into the freelist a moment after put_page runs (leading to the same problem). The VM is prevented from freeing the page while it's pinned, but the VM can do the final free on the page before rmap_remove returns. And without mmu notifiers there's no serialization that makes the core VM stop on the mmu_lock and wait for the tlb flush to run before the VM finally executes the last free of the page. The mmu notifiers fix this race for regular swapping, as the core VM will block on the mmu_lock waiting for the tlb flush (for this to work the tlb flush must always happen inside the mmu_lock, unless the order is exactly "spte = nonpresent; tlbflush; put_page"). A VM_LOCKED on the vmas backing the anonymous memory will fix this for regular swapping too (I did something like this in the patch at the end as a band-aid).
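To be explicit about the two orderings, something like this (just a sketch, not the actual mmu.c code; spte_to_page() and set_spte_nonpresent() are placeholders for what rmap_remove and its callers really do):

/* Racy: the final free can happen while stale tlb entries still
 * point at the page on other vcpus. */
static void zap_spte_racy(struct kvm *kvm, u64 *sptep)
{
	struct page *page = spte_to_page(*sptep);	/* placeholder */

	set_spte_nonpresent(sptep);			/* placeholder */
	put_page(page);			/* page may hit the freelist here... */
	kvm_flush_remote_tlbs(kvm);	/* ...before the stale translations are gone */
}

/* Safe ordering: spte = nonpresent; tlbflush; put_page */
static void zap_spte_safe(struct kvm *kvm, u64 *sptep)
{
	struct page *page = spte_to_page(*sptep);	/* placeholder */

	set_spte_nonpresent(sptep);			/* placeholder */
	kvm_flush_remote_tlbs(kvm);	/* no vcpu can reach the page anymore */
	put_page(page);			/* the final free is now harmless */
}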
But thinking about it more: the moment we want to allow anybody to randomly __munmap__ any part of the guest physical address space, like for ballooning while the guest runs (think of an unprivileged user owning /dev/kvm and running munmap at will), neither VM_LOCKED (which munmap ignores) nor the mmu notifiers can prevent the page from being queued in the kernel freelists immediately after rmap_remove returns. This is because rmap_remove may run on a different host cpu in between unmap_vmas and invalidate_range_end. Running the ioctl before munmap won't help to prevent the race either, as the guest can still re-instantiate the sptes with page faults between the ioctl and the munmap. However, we have invalidate_range_begin. If we invalidate all sptes in invalidate_range_begin and hold off the page faults in between _begin/_end, then we can fix this with the mmu notifiers. So I think I can allow munmap safely (for unprivileged users too) by using _range_begin somehow. For this to work, any relevant tlb flush must happen inside the _same_ mmu_lock critical section where spte=nonpresent and rmap_remove run too (thanks to the mmu_lock the ordering of those operations won't matter anymore, and no queue will be needed).
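Roughly what I have in mind, only as an illustration (the invalidate_count field and kvm_zap_sptes_range() are made up here, and the real notifier callbacks have different signatures):

static void kvm_mmu_notifier_range_begin(struct kvm *kvm,
					 unsigned long start,
					 unsigned long end)
{
	spin_lock(&kvm->mmu_lock);
	kvm->invalidate_count++;		/* made-up field */
	/* spte=nonpresent + rmap_remove + tlb flush, all under mmu_lock */
	kvm_zap_sptes_range(kvm, start, end);	/* made-up helper */
	kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

static void kvm_mmu_notifier_range_end(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
	kvm->invalidate_count--;		/* made-up field */
	spin_unlock(&kvm->mmu_lock);
}

/* and the shadow page fault path, under mmu_lock, refuses to
 * re-instantiate sptes while a _begin is in flight */
static int kvm_may_instantiate_spte(struct kvm *kvm)
{
	return kvm->invalidate_count == 0;	/* made-up field */
}

With that, a page can't be refaulted in between unmap_vmas and _range_end, so by the time the core VM can do the final free the sptes are gone and the tlb is flushed.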
-------

Temporary band-aid to prevent the VM from doing the final free on the page immediately after put_page runs inside rmap_remove, and before the tlb flush (the mmu_lock shall always be released after the tlb flush, for future approaches to be safe). This still can't allow safe munmap (safe as in the guest not being able to memory-corrupt the host) in the middle of the guest physical memory. Only a combination of both the _range_begin and _range_end mmu notifiers will allow that, unless all rmap_remove callers are changed to the safe "spte=nonpresent; tlbflush; put_page" ordering; then it'll always be safe, but it'll be slower as there will be more tlb flushes than needed.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index caa9f94..5343216 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3887,8 +3887,11 @@ int kvm_arch_set_memory_region(struct kvm *kvm,
 		memslot->userspace_addr = do_mmap(NULL, 0,
 						  npages * PAGE_SIZE,
 						  PROT_READ | PROT_WRITE,
-						  MAP_SHARED | MAP_ANONYMOUS,
-						  0);
+						  MAP_SHARED | MAP_ANONYMOUS
+#ifndef CONFIG_MMU_NOTIFIER
+						  | MAP_LOCKED
+#endif
+						  , 0);
 		up_write(&current->mm->mmap_sem);
 
 		if (IS_ERR((void *)memslot->userspace_addr))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 30bf832..ed65e29 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -250,6 +250,27 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+#ifndef CONFIG_MMU_NOTIFIER
+static int memslot_mlock(unsigned long start, unsigned long len)
+{
+	struct vm_area_struct *vma;
+	unsigned long end = start + len;
+
+	vma = find_vma(current->mm, start);
+	if (!vma || vma->vm_start > start)
+		return -ENOMEM;
+
+	/* go simple and don't split vmas */
+	for (; vma && vma->vm_start < end; vma = vma->vm_next) {
+		vma->vm_flags |= VM_LOCKED;
+		start = vma->vm_end;
+	}
+	if (start < end)
+		return -ENOMEM;
+	return 0;
+}
+#endif
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -271,8 +292,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 	r = -EINVAL;
 	/* General sanity checks */
+	if (mem->userspace_addr & (PAGE_SIZE - 1))
+		goto out;
 	if (mem->memory_size & (PAGE_SIZE - 1))
 		goto out;
+	if (mem->userspace_addr + mem->memory_size < mem->userspace_addr)
+		goto out;
 	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
 		goto out;
 	if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
@@ -318,6 +343,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 	/* Allocate if a slot is being created */
 	if (npages && !new.rmap) {
+#ifndef CONFIG_MMU_NOTIFIER
+		if (user_alloc &&
+		    memslot_mlock(mem->userspace_addr, mem->memory_size))
+			goto out_free;
+#endif
+
 		new.rmap = vmalloc(npages * sizeof(struct page *));
 
 		if (!new.rmap)