On 18-03-2026 03:02 pm, Zhang, Jesse(Jie) wrote:
[AMD Official Use Only - AMD Internal Distribution Only]

-----Original Message-----
From: Khatri, Sunil <[email protected]>
Sent: Wednesday, March 18, 2026 4:22 PM
To: Koenig, Christian <[email protected]>; Khatri, Sunil
<[email protected]>; Deucher, Alexander <[email protected]>
Cc: [email protected]; Zhang, Jesse(Jie) <[email protected]>
Subject: Re: [PATCH] Revert "drm/amdgpu: harden SIGNAL/WAIT ioctl argument
validation"


On 18-03-2026 01:29 pm, Christian König wrote:
On 3/18/26 08:47, Sunil Khatri wrote:
This reverts commit 0cdff8eb31c139dde4716e4aa37198c16364629e.

The patch has caused regression for userqueues where user is stuck
and is waiting for fences and a gpu reset is triggered in kernel.
Also for any of the parameters when count is zero, the driver does
not read from the pointer and having that check is overkill.

Application:
MESA: error: amdgpu: getting wait num_fences failed
MESA: error: amdgpu: getting wait fences failed
MESA: error: amdgpu: getting wait num_fences failed
MESA: error: amdgpu: getting wait fences failed
After I reverted this patch, the error still occurs when running glxgears.
Does it work fine on your end if you don't apply this patch?

amdgpu: getting wait fences failed
amdgpu: getting wait fences failed
amdgpu: getting wait fences failed

Yes, it works. You might need to update mesa too. I am using the latest mesa with ubuntu and i dont see those error. with your patch they do show.

Regards
Sunil Khatri

Thanks
Jesse


Dmesg:
[  122.668493] amdgpu 0000:0a:00.0: sq_intr: error, detail
0x00000000, type 1, sh 1, priv 0, wave_id 0, simd_id 0, wgp_id 0 [
122.668504] amdgpu 0000:0a:00.0: sq_intr: error, detail 0x00000000,
type 1, sh 1, priv 0, wave_id 0, simd_id 0, wgp_id 0 [  124.687518]
amdgpu 0000:0a:00.0: Dumping IP State [  124.688351] amdgpu
0000:0a:00.0: Dumping IP State Completed [  124.688355] amdgpu
0000:0a:00.0: [drm] AMDGPU device coredump file has been created [
124.688357] amdgpu 0000:0a:00.0: [drm] Check your
/sys/class/drm/card0/device/devcoredump/data
[  124.688361] amdgpu 0000:0a:00.0: ring gfx_0.0.0 timeout, signaled
seq=569, emitted seq=571 [  124.688366] amdgpu 0000:0a:00.0:  Process
Xwayland pid 3471 thread Xwayland:cs0 pid 3479 [  124.688369] amdgpu
0000:0a:00.0: Starting gfx_0.0.0 ring reset [  126.560451] amdgpu
0000:0a:00.0: MES(0) failed to respond to msg=RESET [  126.560456]
amdgpu 0000:0a:00.0: failed to detect and reset [  126.560460] amdgpu
0000:0a:00.0: Failed to detect and reset queues, err (-110) [
128.789840] amdgpu 0000:0a:00.0: Ring gfx_0.0.0 reset failed [
128.789848] amdgpu 0000:0a:00.0: GPU reset begin!. Source:  1 [
128.790161] amdgpu 0000:0a:00.0: Guilty job already signaled, skipping HW
reset [  128.790174] amdgpu 0000:0a:00.0: GPU reset(1) succeeded!
[  128.804538] amdgpu 0000:0a:00.0: [drm] device wedged, but
recovered through reset [  128.804574] amdgpu 0000:0a:00.0: GPU reset
begin!. Source:  6 [  128.816663] amdgpu 0000:0a:00.0: Dumping IP
State [  128.817458] amdgpu 0000:0a:00.0: Dumping IP State Completed
[  130.963939] amdgpu 0000:0a:00.0: MES(1) failed to respond to
msg=REMOVE_QUEUE [  130.963949] amdgpu 0000:0a:00.0: failed to unmap
legacy queue

Cc: Jesse Zhang <[email protected]>
Signed-off-by: Sunil Khatri <[email protected]>
---
   .../gpu/drm/amd/amdgpu/amdgpu_userq_fence.c   | 29 -------------------
   1 file changed, 29 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
index 3fcd70a38374..0d9a13081f2f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
@@ -484,16 +484,6 @@ int amdgpu_userq_signal_ioctl(struct drm_device
*dev, void *data,
        args->num_bo_read_handles > AMDGPU_USERQ_MAX_HANDLES)
            return -EINVAL;

-  /* Reject non-NULL pointers paired with a zero count. */
-  if (!args->num_syncobj_handles && args->syncobj_handles)
-          return -EINVAL;
-
-  if (!args->num_bo_read_handles && args->bo_read_handles)
-          return -EINVAL;
-
-  if (!args->num_bo_write_handles && args->bo_write_handles)
-          return -EINVAL;
-
    num_syncobj_handles = args->num_syncobj_handles;
    syncobj_handles = memdup_array_user(u64_to_user_ptr(args-
syncobj_handles),
                                        num_syncobj_handles, sizeof(u32)); @@ -
950,25 +940,6 @@
int amdgpu_userq_wait_ioctl(struct drm_device *dev, void *data,
        wait_info->num_bo_read_handles >
AMDGPU_USERQ_MAX_HANDLES)
            return -EINVAL;

-  /* Reject non-NULL pointers paired with a zero count: the pointer
-   * is meaningless and indicates inconsistent input from userspace.
-   */
-  if (!wait_info->num_syncobj_handles && wait_info->syncobj_handles)
-          return -EINVAL;
-
-  if (!wait_info->num_syncobj_timeline_handles &&
-      (wait_info->syncobj_timeline_handles || wait_info-
syncobj_timeline_points))
-          return -EINVAL;
-
-  if (!wait_info->num_bo_read_handles && wait_info->bo_read_handles)
-          return -EINVAL;
-
-  if (!wait_info->num_bo_write_handles && wait_info->bo_write_handles)
-          return -EINVAL;
-
-  if (!wait_info->num_fences && wait_info->out_fences)
-          return -EINVAL;
-
Mhm, in general such checks look valid to me.

My educated guess is that userspace sets num_fences = 0 to query if it needs to
resize the pointer out_fences or not.
If you have time please double check which check fails here.
Sure, i will check on that but for now i have pushed this revert.

regards

sunil khatri

Apart from that Reviewed-by: Christian König <[email protected]>.

Regards,
Christian.

    num_syncobj = wait_info->num_syncobj_handles;
    ptr = u64_to_user_ptr(wait_info->syncobj_handles);
    syncobj_handles = memdup_array_user(ptr, num_syncobj,
sizeof(u32));

Reply via email to