I thought some more about this.

BTW, for completeness: normally (with the exception of vm_destroy) the
put_page run by rmap_remove won't be the last one, but the page can
still go in the freelist a moment after put_page runs (leading to the
same problem). The VM is prevented from freeing the page while it's
pinned, but it can do the final free on the page before rmap_remove
returns. And w/o mmu notifiers there's no serialization that makes the
core VM stop on the mmu_lock and wait for the tlb flush to run before
it finally executes the last free of the page. mmu notifiers fix this
race for regular swapping, as the core VM will block on the mmu_lock
waiting for the tlb flush (for this to work the tlb flush must always
happen inside the mmu_lock, unless the order is exactly "spte =
nonpresent; tlbflush; put_page"). Setting VM_LOCKED on the vmas backing
the anonymous memory will fix this for regular swapping too (I did
something like that in the patch at the end as a band-aid).
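
To make that ordering concrete, here's a minimal sketch (not the
in-tree rmap_remove()) of the "spte = nonpresent; tlbflush; put_page"
sequence with the flush kept inside the mmu_lock; the plain store
clearing the spte stands in for whatever the arch code really uses
(e.g. set_shadow_pte()), the rest (mmu_lock, kvm_flush_remote_tlbs,
put_page) is the usual kvm/core API:

#include <linux/kvm_host.h>	/* struct kvm, kvm_flush_remote_tlbs() */
#include <linux/mm.h>		/* put_page() */

static void zap_spte_and_release(struct kvm *kvm, u64 *sptep,
				 struct page *page)
{
	spin_lock(&kvm->mmu_lock);

	/* 1) no new hardware walk can go through this spte anymore */
	*sptep = 0;

	/* 2) no cpu can still hold a stale translation in its tlb */
	kvm_flush_remote_tlbs(kvm);

	/*
	 * 3) only now may this be the last reference: even if the page
	 *    hits the freelist immediately, no guest access can reach it.
	 */
	put_page(page);

	spin_unlock(&kvm->mmu_lock);
}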

But thinking about it more: the moment we intend to allow anybody to
randomly __munmap__ any part of the guest physical address space while
the guest runs, like for ballooning (think of an unprivileged user
owning /dev/kvm and running munmap at will), neither VM_LOCKED (which
munmap ignores) nor the mmu notifiers can prevent the page from being
queued in the kernel freelists immediately after rmap_remove returns.
That's because rmap_remove may run on a different host cpu in between
unmap_vmas and invalidate_range_end.

Running the ioctl before munmap won't help prevent the race either, as
the guest can still re-instantiate the sptes with page faults between
the ioctl and the munmap.

However, we have invalidate_range_begin. If we invalidate all sptes in
invalidate_range_begin and hold off the page faults in between
_begin/_end, then we can fix this with the mmu notifiers.

So I think I can allow munmap safely (for unprivileged users too) by
using _range_begin somehow. For this to work, any relevant tlb flush
must happen inside the _same_ mmu_lock critical section where
spte=nonpresent and rmap_remove run (thanks to the mmu_lock the
ordering of those operations won't matter anymore, and no queue will
be needed).
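
Just to illustrate the idea (this is only a sketch, not existing kvm
code: the in_range_invalidate field and the kvm_zap_sptes_in_range()
helper are made-up names for what struct kvm and the mmu code would
have to grow):

static void kvm_range_invalidate_begin(struct kvm *kvm,
				       unsigned long start,
				       unsigned long end)
{
	spin_lock(&kvm->mmu_lock);
	kvm->in_range_invalidate++;	/* hypothetical field */
	/* drop every spte backed by [start, end) */
	kvm_zap_sptes_in_range(kvm, start, end); /* hypothetical helper */
	/*
	 * Flush before releasing mmu_lock: this is why the ordering of
	 * spte=nonpresent/rmap_remove vs the flush stops mattering.
	 */
	kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

static void kvm_range_invalidate_end(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
	kvm->in_range_invalidate--;
	spin_unlock(&kvm->mmu_lock);
}

/*
 * In the page fault path, under mmu_lock, before installing a new
 * spte: the pfn looked up before _begin may already be stale, so back
 * off and retry the fault while a range invalidate is in progress.
 */
static int spte_install_allowed(struct kvm *kvm)
{
	return !kvm->in_range_invalidate;
}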

-------

Temporary band-aid to prevent the VM from doing the final free of the
page immediately after the put_page run inside rmap_remove, and before
the tlb flush (the mmu_lock shall always be released after the tlb
flush for future approaches to be safe). This still can't allow safe
munmap (safe meaning the guest can't corrupt host memory) in the
middle of the guest physical memory. Only a combination of both the
_range_begin and _range_end mmu notifiers will allow that;
alternatively, if all rmap_remove callers are changed to the safe
"spte=nonpresent; tlbflush; put_page" ordering it will always be safe,
but slower, as there will be more tlb flushes than needed.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index caa9f94..5343216 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3887,8 +3887,11 @@ int kvm_arch_set_memory_region(struct kvm *kvm,
                        memslot->userspace_addr = do_mmap(NULL, 0,
                                                     npages * PAGE_SIZE,
                                                     PROT_READ | PROT_WRITE,
-                                                    MAP_SHARED | MAP_ANONYMOUS,
-                                                    0);
+                                                    MAP_SHARED | MAP_ANONYMOUS
+#ifndef CONFIG_MMU_NOTIFIER
+                                                    |MAP_LOCKED
+#endif
+                                                    , 0);
                        up_write(&current->mm->mmap_sem);
 
                        if (IS_ERR((void *)memslot->userspace_addr))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 30bf832..ed65e29 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -250,6 +250,27 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
        return 0;
 }
 
+#ifndef CONFIG_MMU_NOTIFIER
+static int memslot_mlock(unsigned long start, unsigned long len)
+{
+       struct vm_area_struct * vma;
+       unsigned long end = start+len;
+
+       vma = find_vma(current->mm, start);
+       if (!vma || vma->vm_start > start)
+               return -ENOMEM;
+
+       /* go simple and don't split vmas */
+       for (;vma && vma->vm_start < end; vma = vma->vm_next) {
+               vma->vm_flags |= VM_LOCKED;
+               start = vma->vm_end;
+       }
+       if (start < end)
+               return -ENOMEM;
+       return 0;
+}
+#endif
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -271,8 +292,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
        r = -EINVAL;
        /* General sanity checks */
+       if (mem->userspace_addr & (PAGE_SIZE - 1))
+               goto out;
        if (mem->memory_size & (PAGE_SIZE - 1))
                goto out;
+       if (mem->userspace_addr + mem->memory_size < mem->userspace_addr)
+               goto out;
        if (mem->guest_phys_addr & (PAGE_SIZE - 1))
                goto out;
        if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
@@ -318,6 +343,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
        /* Allocate if a slot is being created */
        if (npages && !new.rmap) {
+#ifndef CONFIG_MMU_NOTIFIER
+               if (user_alloc &&
+                   memslot_mlock(mem->userspace_addr, mem->memory_size))
+                       goto out_free;
+#endif
+
                new.rmap = vmalloc(npages * sizeof(struct page *));
 
                if (!new.rmap)
