On Fri, 06 Mar 2026 20:02:58 +0100 Philipp Stanner <[email protected]> wrote:
> On Fri, 2026-03-06 at 16:43 +0100, Boris Brezillon wrote:
> > On Fri, 6 Mar 2026 16:25:25 +0100
> > Boris Brezillon <[email protected]> wrote:
> > 
> > > On Fri, 6 Mar 2026 15:37:31 +0100
> > > Christian König <[email protected]> wrote:
> > > 
> > > > On 3/6/26 14:36, Philipp Stanner wrote:
> > > > > > > > > In other words not signaling a fence can leave the system
> > > > > > > > > in a deadlock state, but signaling it incorrectly usually
> > > > > > > > > results in random data corruption.
> > > > > > > 
> > > > > > > Well, not signaling it results in a potential deadlock of the
> > > > > > > whole kernel, whereas wrongly signaling it is "only" a
> > > > > > > functional bug.
> > > > > > 
> > > > > > No, that results in random memory corruption, which is easily a
> > > > > > magnitude worse than just a kernel deadlock.
> > > > > > 
> > > > > > We have seen such bugs numerous times with suspend and resume on
> > > > > > laptops, in different subsystems, i.e. not only the GPU. And I'm
> > > > > > absolutely clearly rejecting any attempt to signal DMA fences
> > > > > > when an object runs out of scope because of that experience.
> > > > > 
> > > > > But you're aware that both in C and Rust you could experience UAF
> > > > > bugs if fences drop unsignaled and the driver unloads?
> > > > > 
> > > > > Though I tentatively agree that memory corruption on a large scale
> > > > > is probably the worst error you can have on a computer.
> > > > 
> > > > Yeah, of course I'm aware of the UAF issue we have.
> > > > 
> > > > But those are relatively harmless compared to the random memory
> > > > corruption issues.
> > > > 
> > > > Linux has the unfortunate habit of re-using memory directly after
> > > > freeing it, because that means caches are usually hotter.
> > > > 
> > > > So rogue DMA operations have the tendency to end up anywhere, and
> > > > tools like KASAN can't find anything wrong.
> > > > 
> > > > The only protection you sometimes have is the IOMMU, but that is
> > > > usually not able to catch everything either.
> > > > 
> > > > > > > > It all stands and falls with the question whether a fence
> > > > > > > > can drop by accident in Rust, or if it will only ever drop
> > > > > > > > when the hw-ring is closed.
> > > > > > > > 
> > > > > > > > What do you believe is the right thing to do when a driver
> > > > > > > > unloads?
> > > > > > > 
> > > > > > > The fence has to be signaled -- ideally after shutting down
> > > > > > > all queues, but it has to be signaled.
> > > > > > 
> > > > > > Yeah, well, this shutting down of all queues (and making sure
> > > > > > that no write operation is pending in caches etc.) is what
> > > > > > people usually don't get right.
> > > > > > 
> > > > > > What you can do is things like setting your timeout to zero,
> > > > > > immediately terminating the HW operation and resetting the
> > > > > > device.
> > > > > > 
> > > > > > This will then use the same code path as the mandatory timeout
> > > > > > functionality for DMA operations, and that usually works
> > > > > > reliably.
> > > > > 
> > > > > Why is all that even an issue? The driver controls the hardware
> > > > > and must "switch it off" before it unloads. Then the hardware will
> > > > > not access any memory anymore, for sure.
> > > > 
> > > > Well, exactly that is usually really complicated. Let me try to
> > > > explain it with a HW example.
> > > > 
> > > > Between a shader and the actual system memory you usually have
> > > > different IP blocks or stages a memory access needs to go through:
> > > > 
> > > > Shader -> device VM -> device cache -> PCI bus -> CPU cache -> memory
> > > > 
> > > > Now when you want to terminate some shader, or make some memory
> > > > inaccessible because it is freed, drivers update their page tables
> > > > and issue the equivalent of a TLB invalidation on a CPU.
> > > > 
> > > > The problem is now that this will only invalidate the translation
> > > > in the device VM. It doesn't affect the device cache, nor any
> > > > ongoing memory transaction on the bus which waits to snoop the CPU
> > > > cache.
> > > > 
> > > > To make sure that you don't corrupt system memory you actually need
> > > > to wait for a cache flush event to be signaled, and *not* just
> > > > update the VM page tables and tell the HW to invalidate its TLB.
> > > > 
> > > > So what is needed is usually a fence operation, in other words a
> > > > memory value written over the PCIe bus into system memory. The
> > > > background is that memory writes are ordered, so this one comes
> > > > after all previous PCIe memory writes of the device and is
> > > > therefore in the correct order.
> > > > 
> > > > Only when the CPU sees this memory write can you be sure that the
> > > > operation is completed.
> > > > 
> > > > This memory write is then often implemented by using an MSI
> > > > interrupt, which in turn signals the DMA fence.
> > > > 
> > > > So the right thing to do is to wait for the DMA fence to signal
> > > > through its normal signaling path, which includes both HW and SW
> > > > functionality, and *not* just tell the HW to stop some ring and
> > > > then assume that this is also sufficient to signal the DMA fence
> > > > associated with the HW operation.
> > > 
> > > Ultimately this
> > > "stop-HW-and-make-sure-all-outcomes-are-visible-even-for-partially-executed-jobs"
> > > is something you'll have to do, no matter what. But whether it leads
> > > to having to wait for each pending fence, I'm not so sure. What about
> > > the case where jobs/ops further down the HW ring were not even
> > > considered for execution by the HW, because the STOP operation
> > > prevented them from being dequeued? I'd expect that the only event
> > > we'll get for those is "HW queue is properly stopped now". So at this
> > > point it's a matter of signalling everything that's left, no?
> > > I mean, that's basically what Panthor does:
> > > 
> > > 1. it stops the queue
> > > 2. waits for all executing ops to land (with all the cache
> > >    maintenance, etc., you described)
> > > 3. signals what's left with ECANCELED (things not picked up by the
> > >    HW by the time the STOP was effective)
> > > 
> > > It's currently done manually, but does it have to be?
> > 
> > All this being said, I'm also a pragmatic guy, so if you tell us "no
> > way!" even after these arguments, I'd rather give up on this
> > auto-signaling feature and have Rust drivers be forced to manually
> > signal stuff than have the whole Fence abstraction blocked. We can add
> > a warn_on!(!is_signaled()) in the DriverFence::drop() path for the
> > time being, so we can at least catch cases where the driver didn't
> > signal the fence before dropping the signal-able object.
> 
> In past discussions with my team members we were also undecided on
> whether we want auto-signaling or not.
> 
> The only thing I'm concerned about is the RCU vs. UAF feature. We
> already invested a lot of time and pain into that feature, so we most
> probably want it.

Right, there's this UAF situation too. I guess if auto-signaling is not
an option, we could add an _ORPHAN_BIT (or _DEAD_BIT) flag, and have it
tested along with the _SIGNALED_BIT one in the paths where we check
whether dma_fence::ops is usable.
