On 07/05/2025 13:54, Christian König wrote:
On 5/7/25 14:28, Tvrtko Ursulin wrote:
On 28/04/2025 14:15, Christian König wrote:
On 4/24/25 09:07, Tvrtko Ursulin wrote:
On 23/04/2025 14:12, Christian König wrote:
On 4/18/25 18:42, Tvrtko Ursulin wrote:
Hi all,
Recently I mentioned some fence lifetime issues to Danilo, so here is a
rough series, more than anything intended to start the discussion.
Most of the problem statement can be found in the first patch, but to briefly
summarise - because the scheduler fence can outlive the scheduler, we can
trivially engineer a use-after-free with xe and possibly other drivers. All
that is needed is to convert a syncobj into a sync_file behind the driver's
back, and I don't see what the driver can do about it.
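For illustration, the escape path from userspace is roughly this (a minimal
sketch using the stock syncobj uAPI; error handling omitted):

  #include <string.h>
  #include <sys/ioctl.h>
  #include <drm/drm.h>

  /* Export the fence currently held by a syncobj as a sync_file fd.
   * Once userspace holds this fd, the underlying dma_fence can outlive
   * whatever the driver assumed about its lifetime.
   */
  static int syncobj_to_sync_file(int drm_fd, __u32 syncobj_handle)
  {
          struct drm_syncobj_handle args;

          memset(&args, 0, sizeof(args));
          args.handle = syncobj_handle;
          args.flags = DRM_SYNCOBJ_HANDLE_TO_FD_FLAGS_EXPORT_SYNC_FILE;

          if (ioctl(drm_fd, DRM_IOCTL_SYNCOBJ_HANDLE_TO_FD, &args))
                  return -1;

          return args.fd; /* sync_file fd pinning the dma_fence */
  }

Nothing in there goes through the driver, so it gets no opportunity to object
or to take a longer-lived reference.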
Yeah that topic again :) The problem here is that this is not a bug, it is a
feature!
IIRC it was Alex who pointed that issue out on the very first fence patch set,
and we already discussed what to do back then.
The problem with grabbing module references for fences is that you trivially
end up with circular references, basically always preventing the module from
unloading.
Where "always" is only "while there are active objects from that module", no?
The problem is that dma_fences stay around after they are signaled. And
basically all drivers keep some dma_fence around for their resource management.
E.g. amdgpu for the VMIDs.
This means that some dma_fence is referenced by the module and the module is
referenced by some dma_fence. I.e. you are never able to unload the module.
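To make the cycle concrete, imagine dma_fence_init() pinned the module owning
the ops (a hypothetical sketch, not actual amdgpu code; my_fence/my_ctx are
made up and error handling is omitted):

  #include <linux/dma-fence.h>
  #include <linux/slab.h>

  struct my_fence {
          struct dma_fence base;
  };

  struct my_ctx {
          spinlock_t lock;
          u64 context, seqno;
          struct dma_fence *last_fence;
  };

  static const struct dma_fence_ops my_fence_ops; /* .get_driver_name etc. elided */

  struct dma_fence *my_fence_create(struct my_ctx *ctx)
  {
          struct my_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

          dma_fence_init(&f->base, &my_fence_ops, &ctx->lock,
                         ctx->context, ++ctx->seqno);
          /* imagine: try_module_get(THIS_MODULE) here, so the fence
           * pins the module...
           */

          /* ...while the driver keeps the fence around for resource
           * management (like amdgpu does for VMIDs), even after it
           * signaled, so the module pins the fence:
           */
          dma_fence_put(ctx->last_fence);
          ctx->last_fence = dma_fence_get(&f->base);

          return &f->base;
  }

The module reference is only dropped when the fence is released, and the fence
is only released when the module drops it. Circle closed.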
Are you thinking truly never, or for as long as someone has a reference?
Truly never. It's simply a circular dependency you can never break up.
In other words the module references the fence and the fence references the
module.
Even past the fences being signaled? How?
For example, while userspace holds a reference to a dma_fence via a sync_file,
the fence-owning module would not be unloadable. One would have to terminate
the process before the driver could be unloaded, and granted, it wouldn't be
easy to see which process is preventing the unload.
For the foreign fences kept around in kernel space, that would be solvable by
some periodic housekeeping at worst.
Also, about the use cases for module unload. Since you and Brost especially
seem to be expressing a hard no to module references, what are the use cases
you are concerned about?
The decision was made to postpone this and live with the potential
use-after-free on module unload until somebody has time to fix it. Well, that
was 10+ years ago :)
I discussed this with Sima again last year and we came to the conclusion that
the easiest way forward would be to decouple the dma_fence implementation from
the driver or component issuing the fence.
I then came up with the following steps to allow this:
1. Decouple the lock used for protecting the dma_fence callback list from the
caller.
2. Stop calling enable_signaling with the lock held.
3. Nuke all those kmem_cache implementations and force drivers to always
allocate fences using kvmalloc().
4. Nuke the release callback (or maybe move it directly after signaling) and set
fence->ops to NULL after signaling the fence.
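The end state I have in mind for #1 and #4 looks roughly like this (a sketch,
not the actual patches; the embedded_lock field is hypothetical):

  /* Signaling runs the callbacks under the fence's own lock and then
   * detaches the fence from the issuing module by dropping the ops.
   */
  int dma_fence_signal_sketch(struct dma_fence *fence)
  {
          unsigned long flags;

          spin_lock_irqsave(&fence->embedded_lock, flags);
          if (test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
                               &fence->flags)) {
                  spin_unlock_irqrestore(&fence->embedded_lock, flags);
                  return -EINVAL;
          }
          /* ... run the callback list here ... */
          fence->ops = NULL; /* nothing driver-specific needed from now on */
          spin_unlock_irqrestore(&fence->embedded_lock, flags);

          return 0;
  }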
I already sent patches out for #1 and #2, but don't have enough time to
actually finish the work.
If you want, take a look at nuking all those kmem_cache implementations for
allocating the fence memory. I think that can be completed completely
separately from everything else.
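The conversion itself should be mostly mechanical, something like this
(my_fence_slab is a made-up driver slab):

  /* Before: fence memory from a driver-private slab, which must not
   * outlive the module owning the kmem_cache.
   */
  fence = kmem_cache_zalloc(my_fence_slab, GFP_KERNEL);

  /* After: plain allocator memory, which the dma_fence core can free
   * on its own long after the module is gone.
   */
  fence = kvzalloc(sizeof(*fence), GFP_KERNEL);

plus switching the matching release paths from kmem_cache_free() to kvfree().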
So enabling dma_fence "revoke", so to speak.
Just to check we are on the same page: it is not just about the module
references, but also about use-after-frees, which can happen even if the
module is still loaded but any memory reachable via dma_fence entry points has
been freed.
Yeah, that came much later when people started to use the scheduler
dynamically. Basically the sched pointer in the drm_sched_fence implementation
becomes invalid as soon as the fence signals.
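You can see it directly in the scheduler fence ops, which look roughly like
this:

  /* From drivers/gpu/drm/scheduler/sched_fence.c (roughly): the ops
   * dereference fence->sched, but nothing keeps the scheduler alive
   * once the fence has signaled.
   */
  static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
  {
          struct drm_sched_fence *fence = to_drm_sched_fence(f);

          return (const char *)fence->sched->name; /* UAF if sched is gone */
  }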
In that case, as Matt has already asked, if you could dig up your unfinished
work it would be interesting to see.
This is what I already sent out:
https://gitlab.freedesktop.org/ckoenig/linux-drm/-/commits/dma-fence-rework-enable-signaling
A bunch of the cleanup patches in that branch have already been applied, only
the last one is missing IIRC.
And here is a WIP patch to decouple the lock which I wrote half a year ago or so:
https://gitlab.freedesktop.org/ckoenig/linux-drm/-/commits/dma-fence-rework-locking
Thanks!
My concern here is that the whole premise appears to be to leave fences
dangling in memory and somehow make them safe for importers to access.
As soon as you unload the last module using them, the fences will
automatically be released. So I don't see the problem.
For starters this can create permanent memory leaks. Or at least leaks for the
same window of time during which the exporter would not be unloadable under
the reference-counting alternative. So we would need a strong argument for why
poorly bounded memory leaks are better than poorly bounded unloadable modules.
When the module unloads, it drops its references to the fences, ultimately
freeing them.
The only issue is that modules can reference both their own as well as foreign
fences. So what can happen is that you have module A which references fences
A1, A2 and B1, and module B which references B1, B2 and A2.
Now you can't unload either module first because they cross reference their
fences and unloading one would leave the other module with fences which can't
be released without crashing.
So what we need is for the dma_fence framework to guarantee that you need
neither the fence->ops nor the fence->lock pointer any more after the fence
has signaled.
With this option, would it mean guarding all entry points with the embedded
lock, or did you have something different in mind? Going simply by the
signaled bit looks impossible to make safe.
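To spell out what I mean by guarding (a sketch assuming the embedded lock from
your locking branch; the embedded_lock field is hypothetical):

  /* Every entry point would have to re-check ops under the lock,
   * because testing the signaled bit alone races with the signaler
   * clearing fence->ops.
   */
  const char *dma_fence_driver_name_sketch(struct dma_fence *fence)
  {
          const char *name = "detached";
          unsigned long flags;

          spin_lock_irqsave(&fence->embedded_lock, flags);
          if (fence->ops)
                  name = fence->ops->get_driver_name(fence);
          spin_unlock_irqrestore(&fence->embedded_lock, flags);

          return name;
  }

And that adds locking to paths which today are mostly lockless.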
There is also the question of how to "revoke" fences safely (race-free). It
sounds hard to me. It does not seem you got to this last problem in the above
branches, so I don't know whether you had some elegant ideas for it.
Maybe first to ask: is anyone aware of a precedent where something in the
kernel already uses this design pattern?
Offhand I don't know of any, but the problem sounds rather common to me.
Uf, I don't know. It feels very atypical to me, but I would be very glad to be
told otherwise.
Regards,
Tvrtko