Hi Christian,
The messages printed on timeout are errors, not just warnings, although we did
not see any real problem (in the dgemm special case). That's why we call them
confusing.
I suppose you want a fix like my previous patch (see attachment).
Regards,
Evan
> -----Original Message-----
> From: Christian König [mailto:[email protected]]
> Sent: Monday, March 19, 2018 5:42 PM
> To: Quan, Evan <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
>
> On 19.03.2018 at 07:08, Evan Quan wrote:
> > Under some heavy compute workloads (the dgemm test), it takes the
> > ASIC more than 10 seconds to finish a single dispatched job, which
> > triggers the timeout. This is quite confusing, although it does not
> > seem to cause any real problems.
> > As a quick workaround, we choose to disable the timeout when GPU reset
> > is disabled.
>
> NAK, I enabled those warnings intentionally, even when GPU recovery is
> disabled, to have a hint in the logs about what goes wrong.
>
> Please only increase the timeout for the compute queues and/or add a
> separate timeout for them.
>
> Regards,
> Christian.
>
>
> >
> > Change-Id: I3a95d856ba4993094dc7b6269649e470c5b053d2
> > Signed-off-by: Evan Quan <[email protected]>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 8bd9c3f..9d6a775 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -861,6 +861,13 @@ static void amdgpu_device_check_arguments(struct amdgpu_device *adev)
> > amdgpu_lockup_timeout = 10000;
> > }
> >
> > + /*
> > + * Disable timeout when GPU reset is disabled to avoid confusing
> > + * timeout messages in the kernel log.
> > + */
> > + if (amdgpu_gpu_recovery == 0 || amdgpu_gpu_recovery == -1)
> > + amdgpu_lockup_timeout = INT_MAX;
> > +
> > adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
> > }
> >
--- Begin Message ---
Under some heavy compute tests (dgemm), it may take
the ASIC more than 50 seconds to finish a single dispatched job,
which triggers the timeout. This is quite annoying, although it
does not seem to cause any real problems.
As a quick workaround, we choose not to enforce the timeout
setting on compute queues.
Change-Id: I210011a90898617367e897a90e9f8fb2639281a3
Signed-off-by: Evan Quan <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 008e198..455a81e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -435,7 +435,9 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
if (ring->funcs->type != AMDGPU_RING_TYPE_KIQ) {
r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
num_hw_submission, amdgpu_job_hang_limit,
- msecs_to_jiffies(amdgpu_lockup_timeout), ring->name);
+ (ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE) ?
+ MAX_SCHEDULE_TIMEOUT : msecs_to_jiffies(amdgpu_lockup_timeout),
+ ring->name);
if (r) {
DRM_ERROR("Failed to create scheduler on ring %s.\n",
ring->name);
--
2.7.4
--- End Message ---
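
For reference, the two options discussed in this thread (no timeout on compute
rings, as in the attached patch, or a separate compute timeout, as Christian
suggests) can be modelled by one small selection function. This is only a
standalone userspace sketch, not driver code: the enum, the NO_TIMEOUT_MS
macro, the function name ring_timeout_ms and the compute_timeout_ms parameter
are all hypothetical, and no such separate compute knob appears in either
patch above.

```c
#include <limits.h>

/* Hypothetical stand-ins for the driver's ring types; not the kernel's enum. */
enum ring_type { RING_GFX, RING_COMPUTE, RING_SDMA };

/*
 * Stand-in for "never time out". The attached patch uses the kernel's
 * MAX_SCHEDULE_TIMEOUT (in jiffies); this sketch stays in milliseconds.
 */
#define NO_TIMEOUT_MS LONG_MAX

/*
 * Choose the job timeout (in ms) for a ring.
 *
 * lockup_timeout_ms:  the general timeout, modelling amdgpu_lockup_timeout.
 * compute_timeout_ms: hypothetical separate knob for compute rings;
 *                     a negative value means "do not enforce a timeout",
 *                     which mirrors the behaviour of the attached patch.
 */
static long ring_timeout_ms(enum ring_type type, long lockup_timeout_ms,
                            long compute_timeout_ms)
{
	/* Non-compute rings keep the general timeout unchanged. */
	if (type != RING_COMPUTE)
		return lockup_timeout_ms;

	/* Compute rings: either no timeout at all, or their own value. */
	return compute_timeout_ms < 0 ? NO_TIMEOUT_MS : compute_timeout_ms;
}
```

With these assumptions, ring_timeout_ms(RING_COMPUTE, 10000, -1) models the
attached patch, while ring_timeout_ms(RING_COMPUTE, 10000, 60000) models a
longer, separate compute timeout; in both cases GFX and SDMA rings keep the
10000 ms value and still report timeouts in the log.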