Re: Regression on gfx8 with ring init
BTW, this also seems to be what breaks suspend/resume.

Andrey

On 09/21/2018 01:56 PM, Andrey Grodzovsky wrote:
> No worries, I will just revert locally until then to clear the extra errors during my investigation of the current GPU reset status and issues.
>
> Andrey
Re: Regression on gfx8 with ring init
No worries, I will just revert locally until then to clear the extra errors during my investigation of the current GPU reset status and issues.

Andrey

On 09/21/2018 01:53 PM, Christian König wrote:
> I unfortunately don't have a Polaris to test this myself. But please give me time until Monday so that I can at least try one more thing to fix it.
>
> Christian.
Re: Regression on gfx8 with ring init
I unfortunately don't have a Polaris to test this myself. But please give me time until Monday so that I can at least try one more thing to fix it.

Christian.

On 21.09.2018 at 19:11, Andrey Grodzovsky wrote:
> Ping...
>
> Andrey
Re: Regression on gfx8 with ring init
Ping...

Andrey

On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:
> What's the status of this error and the suggested patch to fix it? It impacts GPU reset on Polaris11. Do we want to investigate why the original patch breaks it, or just disable it with the proposed patch?
>
> P.S. Suspend/resume also stopped working on the latest branch; I will bisect it later today or tomorrow.
>
> Andrey
Re: Regression on gfx8 with ring init
What's the status of this error and the suggested patch to fix it? It impacts GPU reset on Polaris11. Do we want to investigate why the original patch breaks it, or just disable it with the proposed patch?

P.S. Suspend/resume also stopped working on the latest branch; I will bisect it later today or tomorrow.

Andrey

On 09/18/2018 11:00 AM, Christian König wrote:
> Tom, can you check whether the following makes it work again?
Re: Regression on gfx8 with ring init
Tom, can you check whether the following makes it work again?

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index b6160de70d12..d65f5ba92fc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
 	return r;
 }
 
+static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
+{
+	return 0;
+}
 
 static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
 {
@@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
 	.emit_ib = gfx_v8_0_ring_emit_ib_compute,
 	.emit_fence = gfx_v8_0_ring_emit_fence_kiq,
 	.test_ring = gfx_v8_0_ring_test_ring,
-	.test_ib = gfx_v8_0_ring_test_ib,
+	.test_ib = gfx_v8_0_kiq_ring_test_ib,
 	.insert_nop = amdgpu_ring_insert_nop,
 	.pad_ib = amdgpu_ring_generic_pad_ib,
 	.emit_rreg = gfx_v8_0_ring_emit_rreg,

Thanks,
Christian.

On 18.09.2018 at 16:41, Christian König wrote:
> CRTC and GFX interrupts seem to be working perfectly fine.
>
> It looks like the problem here is that only EOP interrupts from the compute queue are not handled correctly. Most likely a bug somewhere in gfx_v8_0_eop_irq().
>
> Christian.
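To make the effect of the proposed change concrete, here is a minimal userspace model of a per-ring callback table. The types and functions (model_ring, model_ring_funcs, model_test_ib_real(), model_kiq_test_ib_stub(), model_run_ib_test()) are hypothetical stand-ins, not the real struct amdgpu_ring_funcs API. The sketch only shows that pointing .test_ib at a function that returns 0 makes the KIQ ring pass its IB test without submitting anything, so the timeout is masked rather than explained.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for amdgpu's ring structures. */
struct model_ring;

struct model_ring_funcs {
	/* Returns 0 on success or a negative errno, like the kernel callbacks. */
	int (*test_ib)(struct model_ring *ring, long timeout);
};

struct model_ring {
	const char *name;
	bool fence_irq_works;	/* models whether the EOP interrupt arrives */
	const struct model_ring_funcs *funcs;
};

/* Models a real IB test: it can only pass if the fence interrupt arrives. */
static int model_test_ib_real(struct model_ring *ring, long timeout)
{
	(void)timeout;
	return ring->fence_irq_works ? 0 : -ETIMEDOUT;
}

/* Models the proposed KIQ stub: report success without submitting an IB. */
static int model_kiq_test_ib_stub(struct model_ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return 0;
}

static const struct model_ring_funcs model_funcs_compute = {
	.test_ib = model_test_ib_real,
};

static const struct model_ring_funcs model_funcs_kiq = {
	.test_ib = model_kiq_test_ib_stub,
};

/* Generic caller, similar in spirit to a loop over all rings' IB tests. */
static int model_run_ib_test(struct model_ring *ring)
{
	if (!ring->funcs->test_ib)
		return 0;
	return ring->funcs->test_ib(ring, 1000);
}
```

In this model both rings have a broken interrupt path, yet only the compute ring reports -ETIMEDOUT. That is the trade-off behind Andrey's later question of whether to investigate the original patch or just disable the test.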
Re: Regression on gfx8 with ring init
CRTC and GFX interrupts seem to be working perfectly fine.

It looks like the problem here is that only EOP interrupts from the compute queue are not handled correctly. Most likely a bug somewhere in gfx_v8_0_eop_irq().

Christian.

On 18.09.2018 at 16:36, Deucher, Alexander wrote:
> FWIW, a number of consumer Raven boards have bad IVRS tables (Windows doesn't use interrupt remapping, so they are sometimes wrong and probably not validated). There are a number of workarounds to manually override the IVRS tables to make interrupts work. I think specifying pci=noacpi is also a possible workaround.
>
> Alex
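Christian's hypothesis is that compute-queue EOP interrupts arrive but are not routed to the ring waiting for them. The sketch below is a hypothetical userspace model of that kind of routing, not the driver's gfx_v8_0_eop_irq(). It assumes a ring_id layout (pipe in bits 0-1, ME in bits 2-3, queue in bits 4-6), and that ME 0 is the GFX ring while ME 1 and 2 are looked up only in the regular compute-ring array; treat both as assumptions. Under them, an interrupt for a queue that is not in that array, such as one addressed as ME 2, pipe 1, queue 0 (the numbering suggested by the kiq_2.1.0 name in Tom's later log), is silently dropped. This illustrates the failure mode; it does not establish that this is the actual bug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring descriptor: ME/pipe/queue triple and a fence counter. */
struct model_ring {
	const char *name;
	uint8_t me, pipe, queue;
	unsigned int fences_processed;
};

/* Assumed ring_id layout: pipe bits 0-1, ME bits 2-3, queue bits 4-6. */
static uint8_t model_me(uint8_t ring_id)    { return (ring_id & 0x0c) >> 2; }
static uint8_t model_pipe(uint8_t ring_id)  { return ring_id & 0x03; }
static uint8_t model_queue(uint8_t ring_id) { return (ring_id & 0x70) >> 4; }

/* Encode a triple with the same assumed layout, to build test payloads. */
static uint8_t model_ring_id(uint8_t me, uint8_t pipe, uint8_t queue)
{
	return (uint8_t)(((queue & 0x7) << 4) | ((me & 0x3) << 2) | (pipe & 0x3));
}

/*
 * Route an end-of-pipe interrupt. Returns the number of rings whose fences
 * were processed; 0 means the interrupt was silently dropped.
 */
static int model_eop_irq(uint8_t ring_id, struct model_ring *gfx,
			 struct model_ring *compute, size_t num_compute)
{
	uint8_t me = model_me(ring_id);
	uint8_t pipe = model_pipe(ring_id);
	uint8_t queue = model_queue(ring_id);
	int handled = 0;

	if (me == 0) {
		/* ME 0: a single GFX ring in this model. */
		gfx->fences_processed++;
		return 1;
	}
	if (me != 1 && me != 2)
		return 0;

	/* ME 1/2: only rings present in the compute array can match. */
	for (size_t i = 0; i < num_compute; i++) {
		struct model_ring *r = &compute[i];

		if (r->me == me && r->pipe == pipe && r->queue == queue) {
			r->fences_processed++;
			handled++;
		}
	}
	return handled;
}
```

The design point is that such a lookup fails quietly: nothing is logged when no ring matches, and the waiter simply times out later.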
Re: Regression on gfx8 with ring init
On 2018-09-18 10:31 a.m., Christian König wrote:
> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the number? That must be one of the compute rings.

That's a bingo.

[ 32.231734] [drm] Initialized amdgpu 3.27.0 20150101 for :01:00.0 on minor 0
[ 32.233803] modprobe (3816) used greatest stack depth: 12464 bytes left
[ 35.266007] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
[ 35.266373] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring (kiq_2.1.0) 9 (-110).
[ 35.403034] [drm:process_one_work] *ERROR* ib ring test failed (-110).

I should point out that kfd still has the old fence logic:

[root@raven amd]# git grep enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: * nofity when the BO is free to move. fence_add_callback --> enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: * --> amdgpu_amdkfd_fence.enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: * amdgpu_amdkfd_fence.enable_signaling - Start a work item that will quiesce
amdgpu/amdgpu_amdkfd_fence.c: * amdkfd_fence_enable_signaling - This gets called when TTM wants to evict
amdgpu/amdgpu_amdkfd_fence.c:static bool amdkfd_fence_enable_signaling(struct dma_fence *f)
amdgpu/amdgpu_amdkfd_fence.c: .enable_signaling = amdkfd_fence_enable_signaling,

Tom
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
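The comment in the `git grep` output (`fence_add_callback --> enable_signaling`) describes lazy signaling: the fence backend is only asked to start watching for completion when the first waiter registers a callback. Below is a minimal userspace sketch of that flow. The `toy_*` types and functions are hypothetical and only loosely mirror `struct dma_fence` / `dma_fence_ops`; this is not the kernel API. The `-ENOENT` return for an already-signaled fence follows what `dma_fence_add_callback()` does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fence;

/* Hypothetical stand-in for dma_fence_ops, reduced to the hook discussed here. */
struct toy_fence_ops {
	/* Called once, when the first waiter arrives. Returning false means
	 * "already done, don't arm anything". */
	bool (*enable_signaling)(struct toy_fence *f);
};

typedef void (*toy_fence_cb)(struct toy_fence *f, void *data);

struct toy_fence {
	const struct toy_fence_ops *ops;
	bool signaled;
	bool signaling_enabled;
	toy_fence_cb cb;	/* one callback slot keeps the sketch short */
	void *cb_data;
};

/* Lazy flow from the comment: add_callback --> enable_signaling.
 * Returns 0 if the callback is armed, -ENOENT if the fence already signaled. */
static int toy_fence_add_callback(struct toy_fence *f, toy_fence_cb cb, void *data)
{
	assert(f != NULL && f->ops != NULL && cb != NULL);

	if (f->signaled)
		return -ENOENT;

	if (!f->signaling_enabled) {
		f->signaling_enabled = true;
		if (!f->ops->enable_signaling(f)) {
			f->signaled = true;
			return -ENOENT;
		}
	}

	f->cb = cb;
	f->cb_data = data;
	return 0;
}

/* Called by whoever observes completion, e.g. an interrupt handler. */
static void toy_fence_signal(struct toy_fence *f)
{
	toy_fence_cb cb = f->cb;

	if (f->signaled)
		return;
	f->signaled = true;
	f->cb = NULL;		/* callbacks fire at most once */
	if (cb)
		cb(f, f->cb_data);
}

/* Example backend and callback for trying the flow out. */
static unsigned int toy_enable_calls;

static bool toy_count_enable(struct toy_fence *f)
{
	(void)f;
	toy_enable_calls++;	/* a real backend would start watching the HW here */
	return true;
}

static void toy_mark_done(struct toy_fence *f, void *data)
{
	(void)f;
	*(bool *)data = true;
}
```

When you register a callback on a fresh fence, `toy_count_enable()` runs exactly once. A second registration after `toy_fence_signal()` returns `-ENOENT` and does not call the backend again.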
Re: Regression on gfx8 with ring init
FWIW, a number of consumer Raven boards have bad IVRS tables. Windows doesn't use interrupt remapping, so the tables are sometimes wrong and probably not validated. There are a number of workarounds that manually override the IVRS tables to make interrupts work. I think specifying pci=noacpi is also a possible workaround.

Alex

From: amd-gfx on behalf of Christian König
Sent: Tuesday, September 18, 2018 10:31:16 AM
To: StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
Subject: Re: Regression on gfx8 with ring init

> Well, looks like interrupt processing is working perfectly fine.
Re: Regression on gfx8 with ring init
Well, looks like interrupt processing is working perfectly fine.

But looking at the error message once more, I see that this actually affects ring number 9 and not the GFX ring.

Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the number? That must be one of the compute rings.

Thanks,
Christian.
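Here is a userspace sketch of the requested change: the error line prints the ring name next to the index. The thread does not include Tom's exact patch. This sketch only reproduces the shape of the line in his follow-up log (`failed testing IB on ring (kiq_2.1.0) 9 (-110)`). `struct toy_ring` and the helper functions are hypothetical. The `kiq_<me>.<pipe>.<queue>` naming is an assumption chosen to match `kiq_2.1.0`. The `-110` is a negated errno, which is `-ETIMEDOUT` on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct amdgpu_ring: only the fields used here. */
struct toy_ring {
	char name[16];
	unsigned int me, pipe, queue;
};

/* Assumed naming scheme "kiq_<me>.<pipe>.<queue>", matching "kiq_2.1.0". */
static void toy_kiq_ring_init_name(struct toy_ring *ring)
{
	int n = snprintf(ring->name, sizeof(ring->name), "kiq_%u.%u.%u",
			 ring->me, ring->pipe, ring->queue);

	assert(n > 0 && (size_t)n < sizeof(ring->name));	/* no truncation */
}

/* Distinguish the KIQ from other rings by its name prefix. */
static int toy_ring_is_kiq(const struct toy_ring *ring)
{
	return strncmp(ring->name, "kiq_", 4) == 0;
}

/* Error line with the ring name next to the index; r is a negative errno,
 * e.g. -ETIMEDOUT (-110 on Linux) when the IB test times out. */
static int toy_format_ib_error(char *buf, size_t len,
			       const struct toy_ring *ring, unsigned int idx, int r)
{
	return snprintf(buf, len, "amdgpu: failed testing IB on ring (%s) %u (%d).",
			ring->name, idx, r);
}
```

With `me = 2`, `pipe = 1`, `queue = 0`, index 9 and `-ETIMEDOUT`, the formatted line matches the one in Tom's log on Linux. That log line identifies the failing ring as the KIQ rather than a regular compute ring.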
Re: Regression on gfx8 with ring init
On 2018-09-18 10:13 a.m., Christian König wrote:
> Mhm, there is no more failed IB test in there, is there?

Oh sorry, I thought you wanted to test HEAD~ ... Attached is a log from the tip of drm-next.

Tom

Attachment: amdgpu_ih_process2.log.gz (application/gzip)
Re: Regression on gfx8 with ring init
On 18.09.2018 at 16:09, Tom St Denis wrote:
> Disabling IOMMU in the BIOS resulted in a correct boot-up...
>
> Here's the log.

Mhm, there is no more failed IB test in there, is there?

Christian.
Re: Regression on gfx8 with ring init
Disabling IOMMU in the BIOS resulted in a correct boot-up...

Here's the log.

Tom

Attachment: amdgpu_ih_process.log.gz (application/gzip)
Re: Regression on gfx8 with ring init
Odd, I couldn't even boot my system with the dGPU as primary after rebuilding the kernel. It got hung up in the IOMMU driver (loads of AMD-Vi IOMMU errors), which I wasn't able to capture because it panicked before loading the network stack.

Bizarre.

I'll keep trying.

Tom
Re: Regression on gfx8 with ring init
On 18.09.2018 at 15:32, Tom St Denis wrote:
> What does "doesn't work correctly" mean? My workstation is a Raven1 (Ryzen 2400G) and, other than the TTM bulk move issue, it has been perfectly stable (through suspend/resumes too, I might add).
>
> Anything I could test with my devel raven?

The problem seems to be that on some boards IH handling doesn't work as it should.

Can you try to disable the onboard graphics and try again?

If that still doesn't work, there is a DRM_DEBUG in amdgpu_ih_process(); make that a DRM_ERROR and send me the resulting dmesg of loading amdgpu (but don't start any UMD).

Thanks,
Christian.
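Christian's debugging suggestion targets the loop that drains the IH (interrupt handler) ring. Below is a rough model of that loop. It assumes the usual producer/consumer ring layout, where the hardware advances a write pointer and the driver advances a read pointer. The real entry format, entry size and register accesses are not modeled. Every `toy_*` name and `TOY_SRC_EOP` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_IH_ENTRIES	8u	/* power of two so masking wraps the index */
#define TOY_IH_MASK	(TOY_IH_ENTRIES - 1u)
#define TOY_SRC_EOP	1u	/* hypothetical source id for end-of-pipe events */

/* Hypothetical IH ring: the "GPU" writes entries and advances wptr,
 * the driver consumes from rptr up to wptr. Pointers are free-running. */
struct toy_ih_ring {
	uint32_t entries[TOY_IH_ENTRIES];	/* one source id per entry */
	uint32_t rptr;
	uint32_t wptr;
};

/* Producer side, standing in for the GPU posting an interrupt vector. */
static void toy_ih_push(struct toy_ih_ring *ih, uint32_t src_id)
{
	assert(ih->wptr - ih->rptr < TOY_IH_ENTRIES);	/* ring must not overflow */
	ih->entries[ih->wptr & TOY_IH_MASK] = src_id;
	ih->wptr++;
}

/* Consumer side: drain rptr..wptr, print the pointers the way a debug
 * message in the processing loop might, and return the number handled. */
static unsigned int toy_ih_process(struct toy_ih_ring *ih,
				   void (*handle)(uint32_t src_id))
{
	unsigned int handled = 0;

	fprintf(stderr, "toy_ih_process: rptr %u, wptr %u\n",
		(unsigned int)ih->rptr, (unsigned int)ih->wptr);
	while (ih->rptr != ih->wptr) {
		handle(ih->entries[ih->rptr & TOY_IH_MASK]);
		ih->rptr++;
		handled++;
	}
	return handled;
}

/* Example handler: count end-of-pipe events, the ones that complete fences. */
static unsigned int toy_eop_count;

static void toy_count_eop(uint32_t src_id)
{
	if (src_id == TOY_SRC_EOP)
		toy_eop_count++;
}
```

Suppose end-of-pipe entries never reach the ring, or the interrupt that triggers the drain never fires. Then the EOP counter stays at zero and anything waiting on those fences times out. That is consistent with the IB test failures reported in this thread.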
Re: Regression on gfx8 with ring init
On 2018-09-18 9:30 a.m., Christian König wrote:
> Great, not sure if that is good or bad news.
>
> Anyway, I'm going to revert the change for now. Does anybody volunteer to figure out why interrupts sometimes don't work correctly on Raven?

What does "doesn't work correctly" mean? My workstation is a Raven1 (Ryzen 2400G) and, other than the TTM bulk move issue, it has been perfectly stable (through suspend/resumes too, I might add).

Anything I could test with my devel raven?

Tom
Re: Regression on gfx8 with ring init
Great, not sure if that is good or bad news.

Anyway, I'm going to revert the change for now. Does anybody volunteer to figure out why interrupts sometimes don't work correctly on Raven?

Christian.
Regression on gfx8 with ring init
This commit:

[root@raven linux]# git bisect good
9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
Author: Christian König
Date: Tue Sep 18 10:38:09 2018 +0200

    drm/amdgpu: remove fence fallback

    DC doesn't seem to have a fallback path either.

    So when interrupts doesn't work any more we are pretty much busted no
    matter what.

    Signed-off-by: Christian König
    Reviewed-by: Chunming Zhou

Results in this:

[ 24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for :07:00.0 on minor 1
[ 24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
[ 26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
[ 26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
[ 26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
[ 28.506708] fuse init (API version 7.27)

On init with my polaris/raven1 system.

Cheers,
Tom
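The bisected commit removes amdgpu's fence fallback. My understanding, which this thread does not state, is that the fallback was a timer that periodically re-checked the fence sequence number written to memory. With it, a lost end-of-pipe interrupt only delayed signaling instead of stalling it. The deterministic userspace model below shows that difference. All `toy_*` names are hypothetical, and "ticks" stand in for real time.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one ring's fence bookkeeping. */
struct toy_fence_drv {
	uint32_t hw_seq;	/* last seqno the "GPU" wrote to fence memory */
	uint32_t signaled_seq;	/* last seqno the driver has processed */
	bool irq_works;		/* false models a board that loses EOP interrupts */
};

/* What the interrupt handler (or a fallback poll) does: read the seqno
 * from memory and treat everything up to it as signaled. */
static void toy_fence_process(struct toy_fence_drv *drv)
{
	drv->signaled_seq = drv->hw_seq;
}

/* The "GPU" finishes job seq: the seqno always lands in memory, but the
 * interrupt only reaches the driver when irq_works is true. */
static void toy_gpu_complete(struct toy_fence_drv *drv, uint32_t seq)
{
	drv->hw_seq = seq;
	if (drv->irq_works)
		toy_fence_process(drv);
}

static bool toy_seq_done(const struct toy_fence_drv *drv, uint32_t seq)
{
	/* Wrap-safe comparison of free-running sequence numbers. */
	return (int32_t)(drv->signaled_seq - seq) >= 0;
}

/* Wait up to timeout_ticks for seq. With use_fallback, every tick also
 * polls fence memory, as a periodic fallback timer would. Returns 0 or
 * -ETIMEDOUT (-110 on Linux, the value in the log above). */
static int toy_fence_wait(struct toy_fence_drv *drv, uint32_t seq,
			  unsigned int timeout_ticks, bool use_fallback)
{
	assert(timeout_ticks > 0);

	for (unsigned int tick = 0; tick < timeout_ticks; tick++) {
		if (toy_seq_done(drv, seq))
			return 0;
		if (use_fallback)
			toy_fence_process(drv);
	}
	return toy_seq_done(drv, seq) ? 0 : -ETIMEDOUT;
}
```

Consider a driver with `irq_works = false`. A wait with the fallback still succeeds because the poll picks up the seqno from memory. A wait without it times out, even though the GPU finished the job. This is why, as the commit message says, a broken interrupt path leaves nothing to recover with once the fallback is gone.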