On 2/10/20 4:50 PM, Luben Tuikov wrote:
Thank you for bringing awareness of this issue, publicly.
As soon as this patch showed up back in November of 2019,
I objected to it, privately.
I didn't find this objection in my mail actually
I suggested to instead use a _list_ to store the "state" of
all jobs of the same state. Then, at any time, timeout interrupt
or whatever, we can atomically (irq spinlock) move the timeout/bad
job to the timedout/cleanup/bad job list, and wake someone up
to deal with that list asynchronously, and return from the interrupt/etc.
Sounds a good idea to me, i think enough for us to have 2 lists, timeout
list for jobs scheduled to HW and not yet completed (completion fence
signaled) and cleanup list for those that did complete. This should give
alternative solution to the race condition this patch was addressing
without causing the break the Lucas reported. If no one objects I think
i can try implement it.
Then in due time, if any more interrupts or whatnot take place,
the job will either be in the timeout list or not. If it it,
then the instigator backs off as someone else (the list handler) will/is
awake and handling it (obviously a state variable may be kept as well).
This draws somewhat from my days with iSCSI, SCSI and SAS, 15 years ago,
where a device can complete a job (task) at anytime regardless
of what the SCSI layer "thinks" the task's state is: timed-out, aborted,
whatever. It is a very simple and elegant solution which generalizes
On 2020-02-10 11:55 a.m., Andrey Grodzovsky wrote:
Lucas - Ping on my question and also I attached this temporary solution for
etnaviv to clarify my point. If that something acceptable for now at least i
can do the same for v3d where it requires a bit more code changes.
On 2/6/20 10:49 AM, Andrey Grodzovsky wrote:
Well a revert would break our driver.
The real solution is that somebody needs to sit down, gather ALL the
requirements and then come up with a solution which is clean and works for
I can to take on this as indeed our general design on this becomes more and
more entangled as GPU reset scenarios grow in complexity (at least in AMD
driver). Currently I am on a high priority internal task which should take me
around a week or 2 to finish and after that I can get to it.
Regarding temporary solution - I looked into v3d and etnaviv use cases and we
in AMD actually face the same scenario where we decide to skip HW reset if the
guilty job did finish by the time we are processing the timeout (see
amdgpu_device_gpu_recover and skip_hw_reset goto) - the difference is we always
call drm_sched_stop/start irrespectively of whether we are going to actually HW
reset or not (same as extend timeout). I wonder if something like this can be
done also for ve3 and etnaviv ?
amd-gfx mailing list
amd-gfx mailing list