On 1/14/20 5:00 AM, Jason Gunthorpe wrote:
On Mon, Jan 13, 2020 at 02:47:02PM -0800, Ralph Campbell wrote:void nouveau_svmm_fini(struct nouveau_svmm **psvmm) { struct nouveau_svmm *svmm = *psvmm; + struct mmu_interval_notifier *mni; + if (svmm) { mutex_lock(&svmm->mutex); + while (true) { + mni = mmu_interval_notifier_find(svmm->mm, + &nouveau_svm_mni_ops, 0UL, ~0UL); + if (!mni) + break; + mmu_interval_notifier_put(mni);Oh, now I really don't like the name 'put'. It looks like mni is refcounted here, and it isn't. put should be called 'remove_deferred'
OK.
And then you also need a way to barrier this scheme on driver unload.
Good point. I can add something like void mmu_interval_notifier_synchronize(struct mm_struct *mm) that waits for deferred operations to complete similar to mmu_interval_read_begin().
+ } svmm->vmm = NULL; mutex_unlock(&svmm->mutex); - mmu_notifier_put(&svmm->notifier);While here it was actually a refcount.+static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni, + const struct mmu_notifier_range *range) +{ + struct svmm_interval *smi = + container_of(mni, struct svmm_interval, notifier); + struct nouveau_svmm *svmm = smi->svmm; + unsigned long start = mmu_interval_notifier_start(mni); + unsigned long last = mmu_interval_notifier_last(mni);This whole algorithm only works if it is protected by the read side of the interval tree lock. Deserves at least a comment if not an assertion too.
This is called from the invalidate() callback and while holding the driver page table lock so the struct mmu_interval_notifier and the interval tree can't change. I will add comments for v7.
static int nouveau_range_fault(struct nouveau_svmm *svmm, struct nouveau_drm *drm, void *data, u32 size, - u64 *pfns, struct svm_notifier *notifier) + u64 *pfns, u64 start, u64 end) { unsigned long timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); /* Have HMM fault pages within the fault window to the GPU. */ struct hmm_range range = { - .notifier = ¬ifier->notifier, - .start = notifier->notifier.interval_tree.start, - .end = notifier->notifier.interval_tree.last + 1, + .start = start, + .end = end, .pfns = pfns, .flags = nouveau_svm_pfn_flags, .values = nouveau_svm_pfn_values, + .default_flags = 0, + .pfn_flags_mask = ~0UL, .pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT, }; - struct mm_struct *mm = notifier->notifier.mm; + struct mm_struct *mm = svmm->mm; long ret;while (true) {if (time_after(jiffies, timeout)) return -EBUSY;- range.notifier_seq = mmu_interval_read_begin(range.notifier);- range.default_flags = 0; - range.pfn_flags_mask = -1UL; down_read(&mm->mmap_sem);mmap sem doesn't have to be held for the interval search, and again we have lifetime issues with the membership here.
I agree mmap_sem isn't needed for the interval search, it is needed if the search doesn't find a registered interval and one needs to be created to cover the underlying VMA. If an arbitrary size interval was created instead, then mmap_sem wouldn't be needed. I don't understand the lifetime/membership issue. The driver is the only thing that allocates, inserts, or removes struct mmu_interval_notifier and thus completely controls the lifetime.
+ ret = nouveau_svmm_interval_find(svmm, &range); + if (ret) { + up_read(&mm->mmap_sem); + return ret; + } + range.notifier_seq = mmu_interval_read_begin(range.notifier); ret = hmm_range_fault(&range, 0); up_read(&mm->mmap_sem); if (ret <= 0) {I'm still not sure this is a better approach than what ODP does. It looks very expensive on the fault path.. Jason
ODP doesn't have this problem because users have to call ib_reg_mr() before any I/O can happen to the process address space. That is when mmu_interval_notifier_insert() / mmu_interval_notifier_remove() can be called and the driver doesn't have to worry about the interval changing sizes or being removed while I/O is happening. For GPU like devices, I'm trying to allow hardware access to any user level address without pre-registering it. That means inserting mmu interval notifiers for the ranges the GPU page faults on and updating the intervals as munmap() calls remove parts of the address space. I don't want to register an interval per page so the logical range is the underlying VMA. It isn't that expensive, there is an extra driver lock/unlock as part of the lookup and possibly a find_vma() and kmalloc(GFP_ATOMIC) for new intervals. Also, the deferred interval updates for munmap(). Compared to the cost of updating PTEs in the device and GPU fault handling, this is minimal overhead. _______________________________________________ Nouveau mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/nouveau
