On Wed, Apr 23, 2025 at 03:12:27PM +0200, Christian König wrote: > On 4/18/25 18:42, Tvrtko Ursulin wrote: > > Hi all, > > > > Recently I mentioned to Danilo about some fence lifetime issues so here is a > > rough series, more than anything intended to start the discussion. > > > > Most of the problem statement can be found in the first patch but to briefly > > summarise - because sched fence can outlive the scheduler, we can trivially > > engineer an use after free with xe and possibly other drivers. All that is > > needed is to convert a syncobj into a sync file behind drivers back, and I > > don't > > see what the driver can do about it. > > > Yeah that topic again :) The problem here is that this is not a bug, it is a > feature! >
Really more unforeseen design flaw IMO - it happens. > IIRC it was Alex who pointed that issue out on the very first fence patch > set, and we already discussed what to do back then. > > The problem with grabbing module references for fences is that you get > trivially into circle references and so basically always preventing the > module from unloading. > Yea, the Xe patch holding module ref is a no-go. > The decision was made to postpone this and live with the potential use after > free on module unload until somebody has time to fix it. Well that was +10 > years ago :) > > I discussed this with Sima again last year and we came to the conclusion that > the easiest way forward would be to decouple the dma_fence implementation > from the driver or component issuing the fence. > > I then came up with the following steps to allow this: > 1. Decouple the lock used for protecting the dma_fence callback list from the > caller. > 2. Stop calling enable_signaling with the lock held. > 3. Nuke all those kmem_cache implementations and force drivers to always > allocate fences using kvmalloc(). Let's document this too. Xe is an offender, I'll post a fix tomorrow. > 4. Nuke the release callback (or maybe move it directly after signaling) and > set fence->ops to NULL after signaling the fence. > > I already send patches out for #1 and #2, but don't have enough time to > actually finish the work. > Link? This has been lingering for a while perhaps the community can pick this up. Matt > If you want take a look at nuking all those kmem_cache implementations for > allocating the fence memory. I think that can be completed completely > separate to everything else. > > Regards, > Christian. > > > > > > IGT that exploits the problem: > > https://patchwork.freedesktop.org/patch/642709/?series=146211&rev=2 > > > > Different flavour of the problem space is if we had a close(drm_fd) in that > > test > > before the sleep. In that case we can even unload xe.ko and gpu-sched.ko for > > even more fun. Last two patches in the series close that gap. > > > > But first two patches are just shrinking the race window. They are not > > proper > > fixes. This is what I want to discuss since I understand reference counting > > all > > the involved objects has been rejected in the past. And since the problem > > probably expands to all dma fences it certainly isn't easy. > > > > To be clear once more - lets not focus on how this does not fix it fully - > > I am > > primarily trying to start the conversation. > > > > Cc: Christian König <christian.koe...@amd.com> > > Cc: Danilo Krummrich <d...@kernel.org> > > Cc: Lucas De Marchi <lucas.demar...@intel.com> > > Cc: Matthew Brost <matthew.br...@intel.com> > > Cc: Philipp Stanner <pha...@kernel.org> > > Cc: Rodrigo Vivi <rodrigo.v...@intel.com> > > > > Tvrtko Ursulin (4): > > sync_file: Weakly paper over one use-after-free resulting race > > dma-fence: Slightly safer dma_fence_set_deadline > > drm/sched: Keep module reference while there are active fences > > drm/xe: Keep module reference while there are active fences > > > > drivers/dma-buf/dma-fence.c | 2 +- > > drivers/dma-buf/sync_file.c | 29 ++++++++++++++++++++----- > > drivers/gpu/drm/scheduler/sched_fence.c | 12 ++++++++-- > > drivers/gpu/drm/xe/xe_hw_fence.c | 13 ++++++++++- > > 4 files changed, 47 insertions(+), 9 deletions(-) > > >