Hi Chris,

On 14.11.2016 at 09:31, Chris Wilson wrote:
> The primary operation on the shared fence arrays is insertion and
> retrieval. Retrieval is reasonably fast, as we just copy the array, but
> insertion into the shared fence array is slow as we must iterate over all
> current fences to discard an older fence. (Note also that since the
> array is not autopruning, we cannot hook in the fence signal callback due
> to locking constraints, thus it can grow very, very large.) This, coupled
> with the atomics to acquire and release the shared fence, causes a severe
> performance degradation and scaling bottleneck, as felt by i915.ko in
> commit d07f0e59b2c7 ("drm/i915: Move GEM activity tracking into a common
> struct reservation_object"). To speed up insertion, we want a direct
> replacement of the per-context slot. Enter the radix tree.
>
> The kernel already has a couple of options for integer radix trees, but
> neither is suitable for us, for a couple of reasons. First, we need
> per-tree preallocation in order to ensure that adding a fence cannot
> fail. Both radixtree and idr only offer preallocation inside a
> preempt-disabled critical section, which is unsuitable for our existing
> users. idr has an internal __idr_pre_get that could be exported - but
> idr does not yet support inserting entries at an arbitrary location.
> Secondly, neither tree supports full u64 integers as used by the fence
> context identifiers (though radixtree would support that on 64bit
> platforms). And finally, even if we ignore the API shortcomings,
> radixtree still appears too high in the profiles!
>
> So what are our requirements for the shared fence array?
>
> - guaranteed insertion
> - fast insertion
> - RCU-safe, fast traversal
> - 64bit identifiers
>
> To guarantee insertion, we need to preallocate enough storage to satisfy
> a later insertion. The reservation object API requires every
> add_shared_fence to be preceded by a call to reserve_shared_fence. For an
> uncompressed radix tree we would need to preallocate enough layers to
> cover the full 64bit height, but with a compressed radix tree we only
> require a maximum of 2 spare layers.
>
> The compressed tree also offers a useful property with respect to the
> pattern of fences used. The fence contexts are allocated as a pure
> cyclic atomic64, i.e. the id space is sparse and ever incrementing.
> However, timelines tend to cluster (i.e. they will allocate a cluster
> of fence contexts for themselves and primarily use those). So we may
> see a mixture of low fence contexts and a group of very high contexts
> (e.g. a group at id<16 and a group at id>65536). The compressed tree
> avoids having to allocate the intermediate layers (reducing the pointer
> dancing on lookup and traversal) - still not as dense as the current
> compact array, but under typical conditions just as good. (However, the
> current array requires contiguous allocations - a scarce resource for
> an ever expanding array!) The density could be improved by switching
> from a simple, dangerously wrapping atomic64 to an ida for fence
> context allocation.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Yeah, I've been thinking about a similar change for a while now, because we clearly see problems with this on amdgpu as well.

That's also the reason amdgpu internally uses a separate container for fences. That implementation is based on a fixed-size hash table and performs well enough for our use cases. It might be a good idea to check that one out as well, since it doesn't introduce a new container object just for this use case.

In general, for this kind of change I prefer that the implementation of the container code is added separately and the existing code is then changed to use it. That makes reviewing the changes much easier.

Regards,
Christian.