On 2018-08-15 03:02 AM, Christian König wrote:
> Hi Felix,
>
> yeah, you pretty much nailed it.
>
> The problem is that the array itself is RCU protected. This means that
> you can only copy the whole structure when you want to update it.
>
> The exception is reservation_object_add_shared(), which only works
> because we replace either an already signaled fence or a fence from the
> same context with a later sequence number.
>
> This also explains why this is only a band-aid and why the whole
> approach of removing fences doesn't work in general: at any time
> somebody could have taken an RCU reference to the old array, created a
> copy of it, and still be using that copy.
>
> The only 100% correct solution would be to mark the existing fence as
> signaled and replace it everywhere else.
Depends on what you're trying to achieve.

I think the problem you see is that some reader may still be using the
old reservation_object_list copy with the fence still in it. But the
remaining lifetime of that copy is limited. If we wanted to be sure that
no more copies with the old fence exist, all we'd need to do is call
synchronize_rcu. Maybe we need to do that before releasing the fence
references, or release the fence references in an RCU callback to be
safe.

Regards,
  Felix

>
> Going to fix the copy&paste error I made with the sequence number and
> send it out again.
>
> Regards,
> Christian.
>
> On 14.08.2018 at 22:17, Felix Kuehling wrote:
>> [+Harish]
>>
>> I think this looks good for the most part. See one comment inline below.
>>
>> But bear with me while I'm trying to understand what was wrong with the
>> old code. Please correct me if I get this wrong or point out anything
>> I'm missing.
>>
>> The reservation_object_list looks to be protected by a combination of
>> three mechanisms:
>>
>> * Holding the reservation object
>> * RCU
>> * seqcount
>>
>> Updating the fence list requires holding the reservation object. But
>> there are some readers that can be called without holding that lock
>> (reservation_object_copy_fences, reservation_object_get_fences_rcu,
>> reservation_object_wait_timeout_rcu,
>> reservation_object_test_signaled_rcu). They rely on RCU to work on a
>> copy and on the seqcount to make sure they have the most up-to-date
>> information. So any function updating the fence lists needs to handle
>> RCU and the seqcount correctly to avoid breaking those readers.
>>
>> As I understand it, RCU with seqcount retry just means that readers
>> will spin retrying while there are writers, and writers are never
>> blocked by readers. Writers are blocked by each other, because they
>> need to hold the reservation.
>>
>> The code you added looks a lot like
>> reservation_object_add_shared_replace, which removes fences that have
>> signaled, and atomically replaces obj->fences with a new
>> reservation_object_list. That atomicity is important because each
>> pointer in the obj->fences->shared array is separately protected by
>> RCU, but not the array as a whole or the shared_count.
>>
>> See one comment inline.
>>
>> Regards,
>>   Felix
>>
>> On 2018-08-14 03:39 AM, Christian König wrote:
>>> Fix quite a number of bugs here. Unfortunately only compile tested.
>>>
>>> Signed-off-by: Christian König <christian.koe...@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 103 ++++++++++-------------
>>>  1 file changed, 46 insertions(+), 57 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index fa38a960ce00..dece31516dc4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -206,11 +206,9 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
>>>  			struct amdgpu_amdkfd_fence ***ef_list,
>>>  			unsigned int *ef_count)
>>>  {
>>> -	struct reservation_object_list *fobj;
>>> -	struct reservation_object *resv;
>>> -	unsigned int i = 0, j = 0, k = 0, shared_count;
>>> -	unsigned int count = 0;
>>> -	struct amdgpu_amdkfd_fence **fence_list;
>>> +	struct reservation_object *resv = bo->tbo.resv;
>>> +	struct reservation_object_list *old, *new;
>>> +	unsigned int i, j, k;
>>>
>>>  	if (!ef && !ef_list)
>>>  		return -EINVAL;
>>>
>>> @@ -220,76 +218,67 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
>>>  		*ef_count = 0;
>>>  	}
>>>
>>> -	resv = bo->tbo.resv;
>>> -	fobj = reservation_object_get_list(resv);
>>> -
>>> -	if (!fobj)
>>> +	old = reservation_object_get_list(resv);
>>> +	if (!old)
>>>  		return 0;
>>>
>>> -	preempt_disable();
>>> -	write_seqcount_begin(&resv->seq);
>>> +	new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
>>> +		      GFP_KERNEL);
>>> +	if (!new)
>>> +		return -ENOMEM;
>>>
>>> -	/* Go through all the shared fences in the resevation object. If
>>> -	 * ef is specified and it exists in the list, remove it and reduce the
>>> -	 * count. If ef is not specified, then get the count of eviction fences
>>> -	 * present.
>>> +	/* Go through all the shared fences in the resevation object and sort
>>> +	 * the interesting ones to the end of the list.
>>>  	 */
>>> -	shared_count = fobj->shared_count;
>>> -	for (i = 0; i < shared_count; ++i) {
>>> +	for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
>>>  		struct dma_fence *f;
>>>
>>> -		f = rcu_dereference_protected(fobj->shared[i],
>>> +		f = rcu_dereference_protected(old->shared[i],
>>>  					      reservation_object_held(resv));
>>>
>>> -		if (ef) {
>>> -			if (f->context == ef->base.context) {
>>> -				dma_fence_put(f);
>>> -				fobj->shared_count--;
>>> -			} else {
>>> -				RCU_INIT_POINTER(fobj->shared[j++], f);
>>> -			}
>>> -		} else if (to_amdgpu_amdkfd_fence(f))
>>> -			count++;
>>> +		if ((ef && f->context == ef->base.context) ||
>>> +		    (!ef && to_amdgpu_amdkfd_fence(f)))
>>> +			RCU_INIT_POINTER(new->shared[--j], f);
>>> +		else
>>> +			RCU_INIT_POINTER(new->shared[k++], f);
>>>  	}
>>>
>>> -	write_seqcount_end(&resv->seq);
>>> -	preempt_enable();
>>> +	new->shared_max = old->shared_max;
>>> +	new->shared_count = k;
>>>
>>> -	if (ef || !count)
>>> -		return 0;
>>> +	if (!ef) {
>>> +		unsigned int count = old->shared_count - j;
>>>
>>> -	/* Alloc memory for count number of eviction fence pointers. Fill the
>>> -	 * ef_list array and ef_count
>>> -	 */
>>> -	fence_list = kcalloc(count, sizeof(struct amdgpu_amdkfd_fence *),
>>> -			     GFP_KERNEL);
>>> -	if (!fence_list)
>>> -		return -ENOMEM;
>>> +		/* Alloc memory for count number of eviction fence pointers.
>>> +		 * Fill the ef_list array and ef_count
>>> +		 */
>>> +		*ef_list = kcalloc(count, sizeof(**ef_list), GFP_KERNEL);
>>> +		*ef_count = count;
>>> +		if (!*ef_list) {
>>> +			kfree(new);
>>> +			return -ENOMEM;
>>> +		}
>>> +	}
>>> +
>>> +	/* Install the new fence list, seqcount provides the barriers */
>>> +	preempt_disable();
>>> +	write_seqcount_begin(&resv->seq);
>>> +	RCU_INIT_POINTER(resv->fence, new);
>>>  	preempt_disable();
>>>  	write_seqcount_begin(&resv->seq);
>> You're disabling preemption and calling write_seqcount_begin twice. I
>> think this is meant to be
>>
>> write_seqcount_end(&resv->seq);
>> preempt_enable();
>>
>>> -	j = 0;
>>> -	for (i = 0; i < shared_count; ++i) {
>>> +	/* Drop the references to the removed fences or move them to ef_list */
>>> +	for (i = j, k = 0; i < old->shared_count; ++i) {
>>>  		struct dma_fence *f;
>>> -		struct amdgpu_amdkfd_fence *efence;
>>>
>>> -		f = rcu_dereference_protected(fobj->shared[i],
>>> -					      reservation_object_held(resv));
>>> -
>>> -		efence = to_amdgpu_amdkfd_fence(f);
>>> -		if (efence) {
>>> -			fence_list[k++] = efence;
>>> -			fobj->shared_count--;
>>> -		} else {
>>> -			RCU_INIT_POINTER(fobj->shared[j++], f);
>>> -		}
>>> +		f = rcu_dereference_protected(new->shared[i],
>>> +					      reservation_object_held(resv));
>>> +		if (!ef)
>>> +			(*ef_list)[k++] = to_amdgpu_amdkfd_fence(f);
>>> +		else
>>> +			dma_fence_put(f);
>>>  	}
>>> -
>>> -	write_seqcount_end(&resv->seq);
>>> -	preempt_enable();
>>> -
>>> -	*ef_list = fence_list;
>>> -	*ef_count = k;
>>> +	kfree_rcu(old, rcu);
>>>
>>>  	return 0;
>>>  }
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx