From: "Lin.Cao" <linca...@amd.com>

[ Upstream commit 471db2c2d4f80ee94225a1ef246e4f5011733e50 ]

When an entity from application B is killed, drm_sched_entity_kill()
removes all jobs belonging to that entity through
drm_sched_entity_kill_jobs_work(). If application A's job depends on a
scheduled fence from application B's job, and that fence is not properly
signaled during the killing process, application A's dependency cannot be
cleared.

This leads to application A hanging indefinitely while waiting for a
dependency that will never be resolved. Fix this issue by ensuring that
scheduled fences are properly signaled when an entity is killed, allowing
dependent applications to continue execution.

Signed-off-by: Lin.Cao <linca...@amd.com>
Reviewed-by: Philipp Stanner <pha...@kernel.org>
Signed-off-by: Christian König <christian.koe...@amd.com>
Link: https://lore.kernel.org/r/20250515020713.1110476-1-linca...@amd.com
Signed-off-by: Sasha Levin <sas...@kernel.org>
---

**YES**

This commit should be backported to stable kernel trees. Here's my
extensive analysis:

## Analysis of the Commit

### 1. **This is a Critical Bug Fix**

The commit fixes a serious bug where applications can hang indefinitely
when they depend on scheduled fences from killed jobs. The specific change
adds a single line:

```c
+ drm_sched_fence_scheduled(job->s_fence, NULL);
```

This line signals the scheduled fence when killing a job. The signal was
previously missing, which caused the hang.

### 2. **Clear User-Visible Impact**

From the commit message:
- Application A's job depends on a scheduled fence from Application B's job
- When Application B is killed, the fence was not properly signaled
- This causes Application A to hang indefinitely waiting for a dependency
  that will never resolve

This is a real-world scenario that affects system stability and user
experience.

### 3. **Minimal and Safe Change**

The fix is:
- **One line addition** - extremely minimal change
- **Well-contained** - only affects the job kill path
- **No API/ABI changes** - just adds a missing fence signal
- **No new features** - purely fixes existing broken behavior

### 4. **Comparison with Similar Commits**

Looking at the historical commits provided:

**Similar Commit #1 (Backported: YES)** - "drm/scheduler: fix fence ref
counting"
- Fixed dependency fence leaks when processes were killed
- Added proper fence reference management
- Similar in nature: fixing fence handling in the kill path

This commit shares key characteristics:
- Fixes fence handling bugs in entity kill path
- Prevents resource leaks/hangs
- Minimal, targeted fix

The other similar commits (NOT backported) were either:
- Large architectural changes (Commit #3 - complete rework of dependency
  handling)
- Feature additions (Commit #4 - new error handling logic)
- Less critical fixes (Commit #5 - memleak in uncommon path)

### 5. **Risk Assessment**

**Low Risk**:
- The change only affects the error/kill path, not normal operation
- Signaling the scheduled fence with a NULL parent is a valid operation
  indicating no hardware execution
- The `drm_sched_fence_finished()` call immediately after still signals
  completion with error
- No changes to data structures or algorithms

### 6. **Follows Stable Kernel Rules**

According to stable kernel rules, this commit qualifies because it:
- ✓ Fixes a real bug that affects users (application hangs)
- ✓ Is already in Linus's tree (upstream commit 471db2c2d4f8 is cited)
- ✓ Is small and self-contained (1 line)
- ✓ Has been reviewed (Reviewed-by: tag present)
- ✓ Doesn't introduce new features
- ✓ Has clear problem description and solution

### 7. **Technical Details**

The fix ensures proper fence signaling order:
1. `drm_sched_fence_scheduled(job->s_fence, NULL)` - signals that the job
   was "scheduled" (even though it won't run)
2. `drm_sched_fence_finished(job->s_fence, -ESRCH)` - signals completion
   with error

This allows dependent jobs waiting on the scheduled fence to proceed and
handle the error appropriately, preventing indefinite hangs.

## Conclusion

This commit is an ideal candidate for stable backporting. It fixes a
serious user-visible bug (application hangs) with a minimal,
well-understood change that follows the established fence signaling
pattern in the DRM scheduler. The fix is similar in nature to previous
commits that were successfully backported, and the risk of regression is
very low since it only affects the error handling path.

 drivers/gpu/drm/scheduler/sched_entity.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index da00572d7d42e..2f0fe8df7becc 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -189,6 +189,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
 
+	drm_sched_fence_scheduled(job->s_fence, NULL);
 	drm_sched_fence_finished(job->s_fence, -ESRCH);
 	WARN_ON(job->s_fence->parent);
 	job->sched->ops->free_job(job);
-- 
2.39.5
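
For illustration only, the signaling order described under "Technical Details" can be modeled in user space. This is a minimal sketch, not kernel code: every `toy_*` name is hypothetical, the struct is a simplified stand-in for the scheduler's fence pair, and it assumes a dependent job waits solely on the "scheduled" state.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler fence pair (NOT struct drm_sched_fence). */
struct toy_sched_fence {
	bool scheduled_signaled; /* models the "scheduled" fence */
	bool finished_signaled;  /* models the "finished" fence */
	int finished_error;      /* error attached when finished is signaled */
};

/* Models signaling the scheduled fence. */
static void toy_fence_scheduled(struct toy_sched_fence *f)
{
	f->scheduled_signaled = true;
}

/* Models signaling the finished fence with an error code. */
static void toy_fence_finished(struct toy_sched_fence *f, int error)
{
	f->finished_error = error;
	f->finished_signaled = true;
}

/*
 * Models the kill path. signal_scheduled == true mirrors the patched
 * order (scheduled, then finished); false mirrors the old behavior.
 */
static void toy_kill_job(struct toy_sched_fence *f, bool signal_scheduled)
{
	assert(f != NULL);
	if (signal_scheduled)
		toy_fence_scheduled(f);
	toy_fence_finished(f, -ESRCH);
}

/* Assumption: a dependent job is unblocked only by the scheduled fence. */
static bool toy_waiter_unblocked_after_kill(bool signal_scheduled)
{
	struct toy_sched_fence fence = { false, false, 0 };

	toy_kill_job(&fence, signal_scheduled);
	return fence.scheduled_signaled;
}
```

In this model, `toy_waiter_unblocked_after_kill(false)` returns `false` (the waiter stays blocked, as in the reported hang), while `toy_waiter_unblocked_after_kill(true)` returns `true`.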