On 3/9/26 09:16, Boris Brezillon wrote:
> On Fri, 06 Mar 2026 20:02:58 +0100
> Philipp Stanner <[email protected]> wrote:
> 
>> On Fri, 2026-03-06 at 16:43 +0100, Boris Brezillon wrote:
>>> On Fri, 6 Mar 2026 16:25:25 +0100
>>> Boris Brezillon <[email protected]> wrote:
>>> 
>>>> On Fri, 6 Mar 2026 15:37:31 +0100
>>>> Christian König <[email protected]> wrote:
>>>> 
>>>>> On 3/6/26 14:36, Philipp Stanner wrote:
>>>>>>>>>> In other words, not signaling a fence can leave the system in a
>>>>>>>>>> deadlock state, but signaling it incorrectly usually results in
>>>>>>>>>> random data corruption.
>>>>>>>> 
>>>>>>>> Well, not signaling it results in a potential deadlock of the
>>>>>>>> whole kernel, whereas wrongly signaling it is "only" a functional
>>>>>>>> bug.
>>>>>>> 
>>>>>>> No, that results in random memory corruption, which is easily an
>>>>>>> order of magnitude worse than just a kernel deadlock.
>>>>>>> 
>>>>>>> We have seen such bugs numerous times with suspend and resume on
>>>>>>> laptops in different subsystems, i.e. not only GPU. And because of
>>>>>>> that experience I'm absolutely clearly rejecting any attempt to
>>>>>>> signal DMA fences when an object runs out of scope.
>>>>>> 
>>>>>> But you're aware that both in C and Rust you could experience UAF
>>>>>> bugs if fences drop unsignaled and the driver unloads?
>>>>>> 
>>>>>> Though I tentatively agree that memory corruptions on a large scale
>>>>>> are probably the worst error you can have on a computer.
>>>>> 
>>>>> Yeah, of course I'm aware of the UAF issue we have.
>>>>> 
>>>>> But those are relatively harmless compared to the random memory
>>>>> corruption issues.
>>>>> 
>>>>> Linux has the unfortunate habit of re-using memory directly after
>>>>> freeing it, because that means caches are usually hotter.
>>>>> 
>>>>> So rogue DMA operations have the tendency to end up anywhere, and
>>>>> tools like KASAN can't find anything wrong.
>>>>> 
>>>>> The only protection you sometimes have is the IOMMU, but that is
>>>>> usually not able to catch everything either.
>>>>>>> 
>>>>>>>>> It all stands and falls with the question whether a fence can
>>>>>>>>> drop by accident in Rust, or if it will only ever drop when the
>>>>>>>>> hw-ring is closed.
>>>>>>>>> 
>>>>>>>>> What do you believe is the right thing to do when a driver
>>>>>>>>> unloads?
>>>>>>>> 
>>>>>>>> The fence has to be signaled -- ideally after shutting down all
>>>>>>>> queues, but it has to be signaled.
>>>>>>> 
>>>>>>> Yeah, well, this shutting down of all queues (and making sure that
>>>>>>> no write operation is pending in caches etc...) is what people
>>>>>>> usually don't get right.
>>>>>>> 
>>>>>>> What you can do is things like setting your timeout to zero,
>>>>>>> immediately terminating the HW operation and resetting the
>>>>>>> device.
>>>>>>> 
>>>>>>> This will then use the same code path as the mandatory timeout
>>>>>>> functionality for DMA operations, and that usually works reliably.
>>>>>> 
>>>>>> Why is all that even an issue? The driver controls the hardware and
>>>>>> must "switch it off" before it unloads. Then the hardware will not
>>>>>> access any memory anymore for sure.
>>>>> 
>>>>> Well, exactly that is usually really complicated. Let me try to
>>>>> explain that with a HW example.
>>>>> 
>>>>> Between a shader and the actual system memory you usually have
>>>>> different IP blocks or stages that a memory access needs to go
>>>>> through:
>>>>> 
>>>>> Shader -> device VM -> device cache -> PCI bus -> CPU cache -> memory
>>>>> 
>>>>> Now when you want to terminate some shader, or make some memory
>>>>> inaccessible because it is freed, drivers update their page tables
>>>>> and issue the equivalent of a TLB invalidation on a CPU.
>>>>> 
>>>>> The problem is that this will only invalidate the translation in
>>>>> the device VM.
>>>>> It doesn't affect the device cache, nor any ongoing
>>>>> memory transaction on the bus which waits to snoop the CPU cache.
>>>>> 
>>>>> To make sure that you don't corrupt system memory, you actually need
>>>>> to wait for a cache flush event to be signaled, and *not* just
>>>>> update the VM page tables and tell the HW to invalidate its TLB.
>>>>> 
>>>>> So what is needed is usually a fence operation, in other words a
>>>>> memory value written over the PCIe bus into system memory. The
>>>>> background is that memory writes are ordered, and this one comes
>>>>> after all previous PCIe memory writes of the device and so is in
>>>>> the correct order.
>>>>> 
>>>>> Only when the CPU sees this memory write can you be sure that your
>>>>> operation has completed.
>>>>> 
>>>>> This memory write is then often implemented by using an MSI
>>>>> interrupt, which in turn signals the DMA fence.
>>>>> 
>>>>> So the right thing to do is to wait for the DMA fence to signal
>>>>> through its normal signaling path, which includes both HW and SW
>>>>> functionality, and *not* just tell the HW to stop some ring and
>>>>> then assume that this is also sufficient to signal the DMA fence
>>>>> associated with the HW operation.
>>>> 
>>>> Ultimately, this
>>>> "stop-HW-and-make-sure-all-outcomes-are-visible-even-for-partially-executed-jobs"
>>>> is something you'll have to do no matter what. But I'm not so sure it
>>>> has to lead to waiting for each pending fence. What about the case
>>>> where jobs/ops further away in the HW ring were not even considered
>>>> for execution by the HW, because the STOP operation prevented them
>>>> from being dequeued? I'd expect that the only event we'll get for
>>>> those is "HW queue is properly stopped now". So at this point it's a
>>>> matter of signalling everything that's left, no? I mean, that's
>>>> basically what Panthor does:
>>>> 
>>>> 1. it stops
>>>> 2. waits for all executing ops to land (with all the cache
>>>>    maintenance, etc., you described)
>>>> 3. signal(ECANCELED) what's left (things not picked by the HW by
>>>>    the time the STOP was effective).
>>>> 
>>>> It's currently done manually, but does it have to be?
>>> 
>>> All this being said, I'm also a pragmatic guy, so if you tell us "no
>>> way!" even after these arguments, I'd rather give up on this
>>> auto-signaling feature and have Rust drivers be forced to manually
>>> signal stuff than have the whole Fence abstraction blocked. We can
>>> add a warn_on!(!is_signaled()) in the DriverFence::drop() path for
>>> the time being, so we can at least catch cases where the driver
>>> didn't signal the fence before dropping the signal-able object.
>> 
>> 
>> In past discussions with my team members we were also not very
>> decided on whether we want auto-signaling or not.
>> 
>> The only thing I'm concerned about is the RCU vs. UAF feature. We
>> already invested a lot of time and pain into that feature, so we most
>> probably want it.
> 
> Right, there's this UAF situation too. I guess if auto-signaling is
> not an option, we could add an _ORPHAN_BIT (or _DEAD_BIT) flag and
> have it tested alongside the _SIGNALED_BIT one in paths where we
> check whether dma_fence::ops is usable.
You mean to protect against driver unload? Yeah, that's a pretty good
point. But I think I have an idea how to tackle that sanely, even on
the C side. Give me a day to go over the rest of my mails and hack
something together. I have another DMA-fence cleanup I wanted to run
by Tvrtko and you anyway.

Regards,
Christian.
