All,

After about six hours of debugging, I found an issue, triggered by a fairly aggressive test case, in the DRM scheduler function drm_sched_entity_push_job. The problem is that spsc_queue_push does not correctly return 'first' on a job push, so the queue fails to run even though it has work ready.
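
For anyone not familiar with the contract involved: as I understand it, the push is expected to report whether the queue was empty beforehand, and the producer only kicks the consumer in that case. Below is a simplified, single-threaded userspace model of that contract. It is not the kernel's spsc_queue code, it has no atomics or barriers, and every model_* name is made up purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; NOT the kernel's spsc_queue. */
struct model_node {
	struct model_node *next;
};

struct model_queue {
	struct model_node *head;
	struct model_node **tail;	/* slot that the next push will fill */
	unsigned int wakeups;		/* how often the consumer was kicked */
};

static void model_queue_init(struct model_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
	q->wakeups = 0;
}

/* Returns true when the queue was empty before this push ("first"). */
static bool model_queue_push(struct model_queue *q, struct model_node *node)
{
	struct model_node **prev_tail = q->tail;

	assert(node != NULL);
	node->next = NULL;
	q->tail = &node->next;
	*prev_tail = node;

	return prev_tail == &q->head;
}

/* Removes and returns the oldest node, or NULL if the queue is empty. */
static struct model_node *model_queue_pop(struct model_queue *q)
{
	struct model_node *node = q->head;

	if (!node)
		return NULL;

	q->head = node->next;
	if (!q->head)
		q->tail = &q->head;	/* queue drained, reset the tail */

	return node;
}

/* Producer side: the consumer is only kicked on the first push. */
static void model_push_job(struct model_queue *q, struct model_node *node)
{
	if (model_queue_push(q, node))
		q->wakeups++;
}
```

In this model, if model_queue_push ever returned false for a push into an empty queue, wakeups would not be incremented and nothing would ever drain the queue. That is the shape of the hang described above.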

I know this sounds a bit insane, but I assure you it's happening and is quite reproducible. I'm working off a pull of drm-tip from a few days ago plus some local changes to Xe's memory management, with a Kconfig that has no debug options enabled. I'm not sure whether there's a bug somewhere in the kernel related to barriers or atomics in recent drm-tip. That seems unlikely, but it seems just as unlikely that this bug has existed for a while without being triggered until now.

I've verified the hang in several ways: using printks, adding a debugfs entry to manually kick the DRM scheduler queue when it's stuck (which gets it unstuck), and replacing the SPSC queue with one guarded by a spinlock (which completely fixes the issue).

That last point raises a big question: why are we using a convoluted lockless algorithm here instead of a simple spinlock? This isn't a critical path, and even if it were, how much performance benefit are we actually getting from the lockless design? Probably very little.

Any objections to me rewriting this around a spinlock-based design? My head hurts from chasing this bug, and I feel like this is the best way forward rather than wasting more time here.

Matt