On Mon, Jun 16, 2025 at 09:24:38AM +0200, Christian König wrote:
> On 6/13/25 21:11, Matthew Brost wrote:
> > On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> >> On 6/13/25 19:01, Matthew Brost wrote:
> >>> All,
> >>>
> >>> After about six hours of debugging, I found an issue in a fairly
> >>> aggressive test case involving the DRM scheduler function
> >>> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> >>> correctly return "first" on a job push, causing the queue to fail to run
> >>> even though it is ready.
> >>>
> >>> I know this sounds a bit insane, but I assure you it's happening and is
> >>> quite reproducible. I'm working off a pull of drm-tip from a few days
> >>> ago + some local changes to Xe's memory management, with a Kconfig that
> >>> has no debug options enabled. I'm not sure if there's a bug somewhere in
> >>> the kernel related to barriers or atomics in the recent drm-tip. That
> >>> seems unlikely - but just as unlikely is that this bug has existed for a
> >>> while without being triggered until now.
> >>>
> >>> I've verified the hang in several ways: using printks, adding a debugfs
> >>> entry to manually kick the DRM scheduler queue when it's stuck (which
> >>> gets it unstuck), and replacing the SPSC queue with one guarded by a
> >>> spinlock (which completely fixes the issue).
> >>>
> >>> That last point raises a big question: why are we using a convoluted
> >>> lockless algorithm here instead of a simple spinlock? This isn't a
> >>> critical path - and even if it were, how much performance benefit are we
> >>> actually getting from the lockless design? Probably very little.
> >>>
> >>> Any objections to me rewriting this around a spinlock-based design? My
> >>> head hurts from chasing this bug, and I feel like this is the best way
> >>> forward rather than wasting more time here.
> >>
> >> Well, the spsc queue is some standard code I used in previous projects,
> >> and we have never experienced any issue with it.
> >>
> >
> > I can quite clearly see this not working on my test setup. I can (kinda)
> > explain it a bit more.
> >
> > Look at this code:
> >
> >  65 static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> >  66 {
> >  67         struct spsc_node **tail;
> >  68
> >  69         node->next = NULL;
> >  70
> >  71         preempt_disable();
> >  72
> >  73         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> >  74         WRITE_ONCE(*tail, node);
> >  75         atomic_inc(&queue->job_count);
> >  76
> >  77         /*
> >  78          * In case of first element verify new node will be visible to the consumer
> >  79          * thread when we ping the kernel thread that there is new work to do.
> >  80          */
> >  81         smp_wmb();
> >  82
> >  83         preempt_enable();
> >  84
> >  85         return tail == &queue->head;
> >  86 }
> >
> > Between the execution of atomic_long_xchg and atomic_inc, the submission
> > worker could dequeue the previous last job, reducing job_count to zero,
> > run the job, observe that job_count == 0 (drm_sched_entity_is_ready),
> > and then go to sleep. This push has exchanged against the previous tail,
> > so it returns false (i.e., not the first job), and the caller therefore
> > does not requeue the submit worker.
> >
> > The race window here is tiny, and I would think preempt_disable would
> > make it impossible (or preempt_disable is broken in drm-tip from a few
> > days ago), so I'm still a bit perplexed by it. But again, I assure you
> > this is, in fact, happening on my test setup.
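To make the interleaving explicit, here is a hand-drawn timeline of the
scenario described above (not a trace; the consumer column is a paraphrase
of the scheduler worker side):

  producer (spsc_queue_push)                consumer (scheduler worker)
  --------------------------                ---------------------------
  tail = xchg(&queue->tail, &node->next)
  WRITE_ONCE(*tail, node)
                                            pops what was the last job
                                            atomic_dec(&queue->job_count) -> 0
                                            runs the job
                                            drm_sched_entity_is_ready():
                                              job_count == 0 -> not ready
                                            worker goes idle
  atomic_inc(&queue->job_count) -> 1
  return tail == &queue->head -> false
  ("not first", so the caller never queues
   the worker again; the new job sits in
   the queue until something kicks the
   entity)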
> > My test case is an SVM one, which involves all sorts of CPU/GPU faults,
> > migrations, etc. Not sure if that contributes. I can show this race
> > occurring in dmesg if you need proof.
> >
> > The change below fixes the problem. I'm going to post it to unblock
> > myself.
> >
> > diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
> > index 125f096c88cb..ee9df8cc67b7 100644
> > --- a/include/drm/spsc_queue.h
> > +++ b/include/drm/spsc_queue.h
> > @@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *n
> >
> >  	preempt_disable();
> >
> > +	atomic_inc(&queue->job_count);
> > +	smp_mb__after_atomic();
> > +
> >  	tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> >  	WRITE_ONCE(*tail, node);
> > -	atomic_inc(&queue->job_count);
> >
> >  	/*
> >  	 * In case of first element verify new node will be visible to the consumer
>
> I need to double check the code path once more, but it could certainly be
> that we got this wrong.
>
> The code originated in userspace, and atomics are also memory barriers
> there. Sima had to point out that we have to manually add smp_mb()s here
> to make it work.
>
> >> This is a massively performance critical code path and we need to make
> >> sure that we move as few cache lines as possible between the producer and
> >> consumer side.
> >>
> >
> > Well, I can't say I buy this argument. If you can show me any real
> > workload where using a spinlock here vs. going lockless makes a
> > measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
> > violates the DRM locking guidelines we're all supposed to follow… But
> > anyway, I'll go ahead with the fix above.
>
> I probably have to take that back, see another comment below.
> >
> > Matt
> >
> >> That was the reason why we replaced the spinlock with the spsc queue
> >> before.
>
> Well, we replaced spinlock+kfifo with the spsc queue for the round-robin
> peeking implementation.
>
> Background is that, because of the spinlock, even a "peek" transfers the
> cache line as a write to the scheduler thread, and when you do the "peek"
> even on idle entities, that starts to not scale at all.
>
> Since we now have the FIFO implementation and avoid peeking into the job
> queues of idle entities all the time, that might not suck so badly any more.
>
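Concretely, something like this is roughly what I mean below by peeking at
the queue itself rather than job_count (hypothetical helper, name made up,
untested, and glossing over the barrier pairing it would need):

static inline bool spsc_queue_is_empty(struct spsc_queue *queue)
{
	/*
	 * The queue stops looking empty as soon as the producer's
	 * atomic_long_xchg() in spsc_queue_push() has moved the tail,
	 * i.e. before job_count is incremented, so a consumer keyed off
	 * this check would not hit the window described above.
	 */
	return atomic_long_read(&queue->tail) == (long)&queue->head;
}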
I think if the consumer (scheduler) always tried to peek - e.g., by checking
the SPSC tail rather than the job count - the current SPSC code would work
as is.

Matt

> Regards,
> Christian.
>
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Matt
> >>
>