Re: Regression on gfx8 with ring init

Andrey Grodzovsky Fri, 21 Sep 2018 10:57:02 -0700

No worries, I will just revert locally until then to clear the extra errors during my investigation of the current GPU reset status and issues.


Andrey


On 09/21/2018 01:53 PM, Christian König wrote:

I unfortunately don't have a Polaris to test this myself.

But please give me time until Monday so that I can at least try one more thing to fix it.


Christian.

On 21.09.2018 at 19:11, Andrey Grodzovsky wrote:


Ping...


Andrey


On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:

What's the status with this error and the suggested patch to fix it? It impacts GPU reset on Polaris11.

Do we want to investigate why the original patch breaks it, or just disable the test with the proposed patch?

P.S. Suspend/resume also stopped working on the latest branch - will bisect it later today or tomorrow.



Andrey


On 09/18/2018 11:00 AM, Christian König wrote:

Tom,

can you check whether the following makes it work again?

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index b6160de70d12..d65f5ba92fc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c

@@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
        return r;
 }

+static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
+{
+       return 0;
+}

 static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
 {

@@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
        .emit_ib = gfx_v8_0_ring_emit_ib_compute,
        .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
        .test_ring = gfx_v8_0_ring_test_ring,
-       .test_ib = gfx_v8_0_ring_test_ib,
+       .test_ib = gfx_v8_0_kiq_ring_test_ib,
        .insert_nop = amdgpu_ring_insert_nop,
        .pad_ib = amdgpu_ring_generic_pad_ib,
        .emit_rreg = gfx_v8_0_ring_emit_rreg,


Thanks,
Christian.

On 18.09.2018 at 16:41, Christian König wrote:

CRTC and GFX interrupts seem to be working perfectly fine.

The problem here seems to be that only the EOP interrupts from the compute queue are not handled correctly.


Most likely a bug somewhere in gfx_v8_0_eop_irq().

Christian.

On 18.09.2018 at 16:36, Deucher, Alexander wrote:

FWIW, a number of consumer Raven boards have bad IVRS tables (Windows doesn't use interrupt remapping, so they are sometimes wrong and probably not validated). There are a number of workarounds to manually override the IVRS tables to make interrupts work. I think specifying pci=noacpi is also a possible workaround.



Alex

------------------------------------------------------------------------

*From:* amd-gfx <[email protected]> on behalf of Christian König <[email protected]>
*Sent:* Tuesday, September 18, 2018 10:31:16 AM
*To:* StDenis, Tom; amd-gfx mailing list; Zhou, David (ChunMing)
*Subject:* Re: Regression on gfx8 with ring init

Well, it looks like interrupt processing is working perfectly fine.

But looking at the error message once more I see that this actually
affects ring number 9 and not the GFX ring.

Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
number?

That must be one of the compute rings.

Thanks,
Christian.

On 18.09.2018 at 16:20, Tom St Denis wrote:
> On 2018-09-18 10:13 a.m., Christian König wrote:
>> Mhm, there is no failed IB test in there any more, is there?
>
> Oh sorry, I thought you wanted to test HEAD~ ... Attached is a log from
> the tip of drm-next
>
> Tom
>
>>
>> Christian.
>>
>> On 18.09.2018 at 16:09, Tom St Denis wrote:
>>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>>>
>>> Here's the log.
>>>
>>> Tom
>>>
>>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:

>>>> Odd, I couldn't even boot my system with the dGPU as primary after
>>>> rebuilding the kernel. It got hung up in the IOMMU driver (loads
>>>> of AMD-Vi IOMMU errors) which I wasn't able to capture because it
>>>> panic'ed before loading the network stack.
>>>>
>>>> Bizarre.
>>>>
>>>> I'll keep trying.
>>>>
>>>> Tom
>>>>
>>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>>>> On 18.09.2018 at 15:32, Tom St Denis wrote:
>>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>>>>>> Great, not sure if that is good or bad news.
>>>>>>>
>>>>>>> Anyway, going to revert the change for now. Does anybody
>>>>>>> volunteer to figure out why interrupts sometimes don't work
>>>>>>> correctly on Raven?
>>>>>>
>>>>>> What does "doesn't work correctly" mean? My workstation is a Raven1
>>>>>> (Ryzen 2400G) and other than the TTM bulk move issue has been
>>>>>> perfectly stable (through suspend/resumes too I might add).
>>>>>>
>>>>>> Anything I could test with my devel raven?
>>>>>
>>>>> The problem seems to be that on some boards IH handling doesn't
>>>>> work as it should.
>>>>>
>>>>> Can you try to disable the onboard graphics and try again?
>>>>>
>>>>> If that still doesn't work there is a DRM_DEBUG in
>>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 18.09.2018 at 15:27, Tom St Denis wrote:
>>>>>>>> This commit:
>>>>>>>>
>>>>>>>> [root@raven linux]# git bisect good
>>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>>>>>>>> Author: Christian König <[email protected]>
>>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>>>>>>>>
>>>>>>>>     drm/amdgpu: remove fence fallback
>>>>>>>>
>>>>>>>>     DC doesn't seem to have a fallback path either.
>>>>>>>>
>>>>>>>>     So when interrupts doesn't work any more we are pretty much
>>>>>>>>     busted no
>>>>>>>>     matter what.
>>>>>>>>
>>>>>>>> Signed-off-by: Christian König <[email protected]>
>>>>>>>> Reviewed-by: Chunming Zhou <[email protected]>
>>>>>>>>
>>>>>>>> Results in this:
>>>>>>>>
>>>>>>>> [   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for
>>>>>>>> 0000:07:00.0 on minor 1
>>>>>>>> [   24.335674] modprobe (3895) used greatest stack depth: 12600
>>>>>>>> bytes left
>>>>>>>> [   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR*
>>>>>>>> amdgpu: IB test timed out.
>>>>>>>> [   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR*
>>>>>>>> amdgpu: failed testing IB on ring 9 (-110).
>>>>>>>> [   26.407885] [drm:process_one_work] *ERROR* ib ring test
>>>>>>>> failed (-110).
>>>>>>>> [   28.506708] fuse init (API version 7.27)
>>>>>>>>
>>>>>>>> On init with my polaris/raven1 system.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Tom
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
