Hi Chris,

On 14.11.2016 at 09:31, Chris Wilson wrote:
> The primary operation on the shared fence arrays is insertion and
> retrieval. Retrieval is reasonably fast, as we just copy the array, but
> insertion into the shared fence array is slow as we must iterate over all
> current fences to discard an older fence. (Note also that since the
> array is not autopruning, we cannot hook in the fence signal callback due
> to locking constraints, thus it can grow very, very large.) This, coupled
> with the atomics to acquire and release the shared fence, causes a severe
> performance degradation and scaling bottleneck, as felt by i915.ko in
> commit d07f0e59b2c7 ("drm/i915: Move GEM activity tracking into a common
> struct reservation_object"). To speed up insertion, we want a direct
> replacement of the per-context slot. Enter the radix tree.
>
> The kernel already has a couple of options for integer radix trees, but
> neither is suitable for us, for a couple of reasons. First, we need
> per-tree preallocation in order to ensure that adding a fence cannot
> fail. Both radixtree and idr only offer preallocation inside a
> preempt-disabled critical section, which is unsuitable for our existing
> users. idr has an internal __idr_pre_get that could be exported - but
> idr does not yet support inserting entries at an arbitrary location.
> Secondly, neither tree supports full u64 integers as used by the fence
> context identifiers (though radixtree would support that on 64bit
> platforms). And finally, even if we ignore the API shortcomings,
> radixtree still appears too high in the profiles!
>
> So what are our requirements for the shared fence array?
>
> - guaranteed insertion
> - fast insertion
> - RCU-safe, fast traversal
> - 64bit identifiers
>
> To guarantee insertion, we need to preallocate enough storage to satisfy
> a later insertion. The reservation object API requires every
> add_shared_fence to be preceded by a call to reserve_shared_fence. For an
> uncompressed radix tree we would need to preallocate enough layers to
> cover the full 64bit height, but with a compressed radix tree we only
> require a maximum of 2 spare layers.
>
> The compressed tree also offers a useful property with respect to the
> pattern of fences used. The fence contexts are allocated as a pure
> cyclic atomic64, i.e. the id space is sparse and ever incrementing.
> However, timelines tend to cluster (i.e. they will allocate a cluster
> of fence contexts for themselves and primarily use those). So we may
> see a mixture of low fence contexts and a group of very high contexts
> (e.g. a group at id<16 and a group at id>65536). The compressed tree
> avoids having to allocate the intermediate layers (reducing the pointer
> dancing on lookup and traversal) - still not as dense as the current
> compact array, but under typical conditions just as good. (However, the
> current array requires contiguous allocations - a scarce resource for
> an ever expanding array!) The density could be improved by switching
> from a simple, dangerously wrapping atomic64 to an ida for fence
> context allocation.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Yeah, I've been thinking about a similar change for a while now, because we clearly see problems with this on amdgpu as well.

That's also the reason amdgpu internally uses a separate container for fences. That implementation is based on a fixed-size hash table and performs well enough for our use cases. It might be a good idea to check that one out as well, since it doesn't introduce a new container object just for this use case.

In general, for this kind of change I prefer that the implementation of the container code is added separately and the existing code is then changed to use it. That makes reviewing the changes much easier.

Regards,
Christian.