Hi Chris,

On 14.11.2016 09:31, Chris Wilson wrote:
> The primary operation on the shared fence arrays is insertion and
> retrieval.  Retrieval is reasonably fast, as we just copy the array, but
> insertion into the shared fence array is slow as we must iterate over all
> current fences to discard an older fence.  (Note also that since the
> array is not autopruning, we cannot hook in the fence signal callback due
> to locking constraints, thus it can grow very, very large.)  This, coupled
> with the atomics to acquire and release the shared fence, causes a severe
> performance degradation and scaling bottleneck, as felt by i915.ko in
> commit d07f0e59b2c7 ("drm/i915: Move GEM activity tracking into a common
> struct reservation_object").  To speed up insertion, we want a direct
> replacement of the per-context slot. Enter the radix tree.
>
> The kernel already has a couple of options for integer radix trees, but
> both are not suitable for us for a couple of reasons.  First we need a
> per-tree preallocation in order to ensure that adding a fence does not
> fail.  Both radixtree and idr only offer preallocation under a preempt
> disabled critical section, unsuitable for our existing users.  idr has
> an internal __idr_pre_get that could be exported - but idr does not yet
> support inserting entries at an arbitrary location.  Secondly, neither
> tree supports full u64 integers as used by the fence context identifiers
> (though radixtree would support that on 64bit platforms).  And finally,
> if we ignore the API shortcomings, radixtree still appears too high in
> the profiles!
>
> So what are our requirements for the shared fence array?
>
>       - guaranteed insertion
>       - fast insertion
>       - RCU-safe, fast traversal
>       - 64bit identifiers
>
> To guarantee insertion, we need to preallocate enough storage to satisfy
> a later insertion.  The reservation object API requires every
> add_shared_fence to be preceded by a call to reserve_shared_fence.  For an
> uncompressed radix tree we would need to preallocate enough layers to cover
> the full 64bit height, but with a compressed radix tree we only require a
> maximum of 2 spare layers.
>
> The compressed tree also offers a useful property wrt the pattern of
> fences used.  The fence contexts are allocated as a pure cyclic atomic64,
> i.e. it is sparse and ever incrementing.  However, timelines tend to
> cluster (i.e. they will allocate a cluster of fence contexts for
> themselves and primarily use these).  So we may see that we have a
> mixture of low fence contexts and a group of very high contexts (e.g.
> a group at id<16 and a group at id>65536).  The compressed tree avoids
> having to allocate the intermediate layers (reducing the pointer dancing
> on lookup and traversal) - still not as dense as the current compact
> array, but under typical conditions nearly as good.  (However, the current
> array requires contiguous allocations - a scarce resource for the ever
> expanding array!)  The density could be improved by switching from a
> simple dangerously wrapping atomic64 to an ida for fence context
> allocation.
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>

Yeah, I was thinking about a similar change for a while now because we 
clearly see problems with that on amdgpu as well.

That's also the reason amdgpu internally uses a separate container for 
fences. That implementation is based on a fixed-size hash table and 
performs well enough for our use cases. It might be a good idea to look 
at that one as well, because it doesn't introduce a new container object 
just for this use case.

In general, for this kind of change I prefer that the container 
implementation is added separately and the existing code is then changed 
to use it. That makes reviewing the changes much easier.

Regards,
Christian.
