On 12/4/25 17:04, Alex Deucher wrote:
> On Wed, Dec 3, 2025 at 4:24 AM Philipp Stanner <[email protected]> wrote:
>>
>> +Cc Alex, Christian, Danilo
>>
>>
>> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
>>> Stop open coding the pending job list walk in drivers. Add a pending
>>> job list iterator which safely walks the DRM scheduler's pending list,
>>> asserting that the DRM scheduler is stopped.
>>>
>>> v2:
>>>  - Fix checkpatch (CI)
>>> v3:
>>>  - Drop locked version (Christian)
>>> v4:
>>>  - Reorder patch (Niranjana)
>>
>> Same with the changelog.
>>
>>>
>>> Signed-off-by: Matthew Brost <[email protected]>
>>> Reviewed-by: Niranjana Vishwanathapura <[email protected]>
>>> ---
>>>  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 50 insertions(+)
>>>
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 385bf34e76fe..9d228513d06c 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
>>>               dma_fence_is_signaled(&s_fence->finished);
>>>  }
>>>
>>> +/**
>>> + * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
>>> + * @sched: DRM scheduler associated with pending job iterator
>>> + */
>>> +struct drm_sched_pending_job_iter {
>>> +     struct drm_gpu_scheduler *sched;
>>> +};
>>> +
>>> +/* Drivers should never call this directly */
>>> +static inline struct drm_sched_pending_job_iter
>>> +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
>>> +{
>>> +     struct drm_sched_pending_job_iter iter = {
>>> +             .sched = sched,
>>> +     };
>>> +
>>> +     WARN_ON(!drm_sched_is_stopped(sched));
>>> +     return iter;
>>> +}
>>> +
>>> +/* Drivers should never call this directly */
>>> +static inline void
>>> +__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
>>> +{
>>> +     WARN_ON(!drm_sched_is_stopped(iter.sched));
>>> +}
>>> +
>>> +DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
>>> +          __drm_sched_pending_job_iter_end(_T),
>>> +          __drm_sched_pending_job_iter_begin(__sched),
>>> +          struct drm_gpu_scheduler *__sched);
>>> +static inline void *
>>> +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
>>> +{ return _T; }
>>> +#define class_drm_sched_pending_job_iter_is_conditional false
>>> +
>>> +/**
>>> + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
>>> + * @__job: Current pending job being iterated over
>>> + * @__sched: DRM scheduler to iterate over pending jobs
>>> + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
>>> + *
>>> + * Iterates over each pending job in the scheduler, optionally filtering on an
>>> + * entity, and enforces that the scheduler is fully stopped.
>>> + */
>>> +#define drm_sched_for_each_pending_job(__job, __sched, __entity)                \
>>> +     scoped_guard(drm_sched_pending_job_iter, (__sched))                        \
>>> +             list_for_each_entry((__job), &(__sched)->pending_list, list)       \
>>> +                     for_each_if(!(__entity) || (__job)->entity == (__entity))
>>> +
>>>  #endif
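For illustration, a minimal usage sketch of the macro above (not part of the
patch; the function name and loop body are hypothetical):

    static void my_walk_pending(struct drm_gpu_scheduler *sched,
                                struct drm_sched_entity *entity)
    {
            struct drm_sched_job *job;

            /* Must only be called while the scheduler is stopped; the scoped
             * guard WARNs via drm_sched_is_stopped() otherwise. */
            drm_sched_for_each_pending_job(job, sched, entity) {
                    /* inspect or re-emit the job; passing NULL as entity
                     * visits every pending job instead */
            }
    }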
>>
>>
>> See my comments in the first patch. The documentation doesn't mention at
>> all why this new functionality exists, or when and why users would be
>> expected to use it.
>>
>> As far as I remember from XDC, both AMD and Intel overwrite a timed-out
>> job's buffer data in the rings on GPU reset. To do so, the driver needs
>> the timed-out job (passed through the timedout_job() callback) and then
>> needs all the pending non-broken jobs.
>>
>> AFAICS your patch provides a generic iterator over the entire
>> pending_list. How is a driver then supposed to determine which are the
>> non-broken jobs (just asking, but that needs to be documented)?
>>
>> Could it make sense to use a different iterator which only returns jobs
>> not belonging to the same context as the timed-out one?
>>
>> Those are important questions that need to be addressed before this can
>> be merged.
>>
>> And if this works canonically (i.e., for basically everyone), it needs
>> to be documented in drm_sched_resubmit_jobs() that this iterator is now
>> the canonical way of handling timeouts.
>>
>> Moreover, btw, just yesterday I added an entry to the DRM todo list
>> which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
>> would have to be removed, too.
>>
>>
>> @AMD: Would the code Matthew provides work for you? Please give your
>> input. This is very important common infrastructure.
> 
> I don't think drm_sched_resubmit_jobs() can work for us without major
> rework.  For our kernel queues, we have a single queue on which jobs
> for different clients are scheduled.  When we reset the queue, we lose
> all jobs on the queue and have to re-emit the non-guilty ones.  We do
> this at the ring level, i.e., we save the packets directly from the
> ring and then re-emit the packets for the non-guilty contexts to the
> freshly reset ring.  This avoids running run_job() again which would
> issue new fences and race with memory management, etc.
> 
> I think the following would be workable:
> 1. driver timedout_job() callback flags the job as bad, resets the bad
> queue, and calls drm_sched_resubmit_jobs()
> 2. drm_sched_resubmit_jobs() walks the pending list and calls
> run_job() for every job

Calling run_job() multiple times was one of the worst ideas I have ever seen.

The problem here is that you need a transactional approach to the internal 
driver state which is modified by ->run_job().

So for example if you have:
->run_job(A)
->run_job(B)
->run_job(C)

And if, after a reset, you find that you need to re-submit only job B while A & C
are filtered out, then your driver state needs to be rolled back to where it was
before job A was run.

> 3. driver run_job() callback looks to see if we already ran this job
> and uses the original fence rather than allocating a new one

Nope, the problem is that *all* drivers *must* use the original fence.
Otherwise you always run into trouble.

We should not promote a driver interface which makes it extremely easy to shoot 
down the whole system.

> 4. driver run_job() callback checks to see if the job is guilty or
> from the same context and if so, sets an error on the fences and
> submits only the fence packet to the queue so that any follow up jobs
> will properly synchronize if they need to wait on the fence from the
> bad job.
> 5. driver run_job() callback will submit the full packet stream for
> non-guilty contexts
> 
> I guess we could use the iterator and implement that logic in the
> driver directly rather than using drm_sched_resubmit_jobs().

Yeah, exactly that's the way to go.
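A rough sketch of what that could look like on the driver side, following the
ring-level re-emission approach described earlier in the thread (all my_*
helpers are hypothetical driver code; the guilty check reuses the same entity
comparison the iterator's filter uses):

    /* Called from the driver's timedout_job() callback, after the scheduler
     * has been stopped and the queue/ring has been reset. */
    static void my_replay_pending_jobs(struct drm_gpu_scheduler *sched,
                                       struct drm_sched_job *bad)
    {
            struct drm_sched_job *job;

            drm_sched_for_each_pending_job(job, sched, NULL) {
                    if (job->entity == bad->entity) {
                            /* guilty context: emit only the fence packet with
                             * an error set, so follow-up jobs still sync */
                            my_emit_fence_with_error(job, -ECANCELED);
                    } else {
                            /* innocent context: replay the saved ring packets,
                             * reusing the original hardware fence */
                            my_reemit_packets(job);
                    }
            }
    }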

Christian.

> 
> Alex
