On 04.06.2018 at 09:52, Huang Rui wrote:
On Sat, Jun 02, 2018 at 03:01:57AM +0800, Kuehling, Felix wrote:
On 2018-06-01 06:09 AM, Christian König wrote:
On 01.06.2018 at 11:29, Huang Rui wrote:
On Fri, Jun 01, 2018 at 05:13:49PM +0800, Christian König wrote:
On 01.06.2018 at 08:41, Huang Rui wrote:
After deferring the execution of the gfx/compute IB tests, by the time
they run the gfx block has already gone into the "mid state" of gfxoff.

PWR_MISC_CNTL_STATUS: PWR_GFXOFF_STATUS field (bits 2:1)
0 = GFXOFF.
1 = Transition out of GFXOFF state.
2 = Not in GFXOFF.
3 = Transition into GFXOFF.
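
For illustration only, the field layout above could be decoded roughly as follows. This is a minimal sketch: the macro, enum, and function names are hypothetical and are not taken from the amdgpu driver; only the bit position (2:1) and the four values come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the field position (bits 2:1) is from the text above. */
#define PWR_GFXOFF_STATUS_SHIFT 1u
#define PWR_GFXOFF_STATUS_MASK  (0x3u << PWR_GFXOFF_STATUS_SHIFT)

enum gfxoff_status {
    GFXOFF_STATUS_OFF      = 0, /* GFXOFF */
    GFXOFF_STATUS_LEAVING  = 1, /* transition out of GFXOFF */
    GFXOFF_STATUS_ON       = 2, /* not in GFXOFF */
    GFXOFF_STATUS_ENTERING = 3, /* transition into GFXOFF */
};

/* Extract the 2-bit status field from a raw PWR_MISC_CNTL_STATUS value. */
static inline enum gfxoff_status gfxoff_status_decode(uint32_t reg)
{
    return (enum gfxoff_status)((reg & PWR_GFXOFF_STATUS_MASK) >>
                                PWR_GFXOFF_STATUS_SHIFT);
}

/* "Mid state" means 1 or 3; the stable states are 0 and 2. */
static inline bool gfxoff_status_is_stable(enum gfxoff_status s)
{
    return s == GFXOFF_STATUS_OFF || s == GFXOFF_STATUS_ON;
}
```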

If we hit the mid state (1 or 3), the doorbell write interrupt cannot
wake the gfx block up successfully. The field value is 1 when we issue
the IB test, so we get the hang. This is the root cause of the issue we
encountered.

Meanwhile, we cannot set GFX clockgating once gfx is already in the
"off" state. So we should move the enabling of gfx powergating and
gfxoff to the end of initialization, after the IB test and clockgating.
Mhm, that still looks like an only half-baked solution:

1. What prevents this bug from happening during "normal" IB submission
from userspace?

2. Shouldn't we poll the PWR_MISC_CNTL_STATUS register to make sure we
are not in any transition phase instead?

Yes, right. How about also adding polling of PWR_MISC_CNTL_STATUS in
amdgpu_ring_commit(), after set_wptr, to confirm that the status is "0"
or "2"?
You could add an end_use() callback for that, but I think we rather
need to do this in gfx_v9_0_ring_set_wptr_gfx() before we write the
doorbell.
Isn't testing the status like this a potential race condition?

Well, it could be when we use both GFX and compute at the same time.


Having to do this at all is contrary to the documentation that I've
read. Writing a doorbell should wake up the GFX engine. Are we sure that
we understand the cause of the problem correctly? Does the IB test use
any MMIO? Maybe it's doing an HDP flush using MMIO for a ring that
doesn't support HDP flushing.

Felix, thanks for the reminder. I suppose you mention MMIO because we
need to avoid runtime gfx register access, right? Our IB test uses a
WRITE_DATA packet to write a specific pattern value into GART memory. It
doesn't use any gfx registers. gfxoff is only supported on Raven, and we
don't emit an HDP flush on APUs. Actually, I also wondered whether it is
caused by a race condition. But the hang happens when we only modprobe
the amdgpu module, without running startx at that time, so there are no
other commands from user space.

Felix is perfectly right that this doesn't sound like a complete solution to the problem.

The IB test only brings the issue to the surface, working around it by delaying enabling gfxoff would only hide the real problem.

I'm pretty sure that the exact same race can happen with startx or other command submissions as well.

+ Morris, who works on the Raven SMC firmware.
After discussing this with him, he suggested that we had better confirm
that GFXOFF_STATUS is 0 or 2 before writing the doorbell, because if
GFXOFF_STATUS is 1 or 3, the mid state (transition in progress), the SMC
will drop the doorbell interrupt. When the GFXOFF status is 0 or 2, i.e.
already in the target state, the SMC can respond to the interrupt at
once. Morris, please correct me if I am wrong.
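
To make the suggestion concrete, a bounded poll on the status field before the doorbell write might look roughly like the sketch below. Everything here is an assumption for illustration: the register read is abstracted behind a hypothetical callback, the names are invented, and a fake register stands in for the hardware. Note that the field can still change between the last read and the doorbell write, so a check like this narrows the window but does not close it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accessor returning the raw PWR_MISC_CNTL_STATUS value. */
typedef uint32_t (*read_status_fn)(void *ctx);

/*
 * Poll until PWR_GFXOFF_STATUS (bits 2:1) reads 0 or 2, or give up after
 * max_polls reads. Returns true if a stable state was observed.
 */
static bool wait_gfxoff_stable(read_status_fn read_reg, void *ctx,
                               unsigned int max_polls)
{
    for (unsigned int i = 0; i < max_polls; i++) {
        uint32_t field = (read_reg(ctx) >> 1) & 0x3u;

        if (field == 0u || field == 2u)
            return true;
    }
    return false; /* still in transition; caller must not assume wake-up */
}

/* Stand-in for the hardware: replays a fixed sequence of register values. */
struct fake_reg {
    const uint32_t *values;
    unsigned int count;
    unsigned int pos;
};

static uint32_t fake_read(void *ctx)
{
    struct fake_reg *r = ctx;
    uint32_t v = r->values[r->pos];

    if (r->pos + 1 < r->count)
        r->pos++; /* stay on the last value once the sequence is exhausted */
    return v;
}
```

A real driver would presumably insert a short delay between reads and would have to decide what to do on timeout; neither is shown here.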

That will be rather hard to guarantee and would completely circumvent the idea behind gfxoff, e.g. the driver would need to assist the firmware again turning things on/off.

Regards,
Christian.


Thanks,
Ray

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
