Re: Regression on gfx8 with ring init

2018-09-21 Thread Andrey Grodzovsky

BTW, this also seems to be what breaks suspend/resume.


Andrey


On 09/21/2018 01:56 PM, Andrey Grodzovsky wrote:


No worries, I will just revert locally until then to clear the extra 
errors during my investigation of current GPU reset status and issues.



Andrey


On 09/21/2018 01:53 PM, Christian König wrote:

I unfortunately don't have a Polaris to test this myself.

But please give me time till Monday so that I can at least try one 
more thing to fix it.


Christian.

Am 21.09.2018 um 19:11 schrieb Andrey Grodzovsky:


Ping...


Andrey


On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:


What's the status of this error and the suggested patch to fix it? 
It impacts GPU reset on Polaris11.


Do we want to investigate why the original patch breaks it, or just 
disable the IB test with the proposed patch?



P.S. Suspend/resume also stopped working on the latest branch - will 
bisect it later today or tomorrow.



Andrey


On 09/18/2018 11:00 AM, Christian König wrote:

Tom,

can you check whether the following makes it work again?

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index b6160de70d12..d65f5ba92fc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
    return r;
 }

+static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
+{
+   return 0;
+}

 static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
 {
@@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
    .emit_ib = gfx_v8_0_ring_emit_ib_compute,
    .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
    .test_ring = gfx_v8_0_ring_test_ring,
-   .test_ib = gfx_v8_0_ring_test_ib,
+   .test_ib = gfx_v8_0_kiq_ring_test_ib,
    .insert_nop = amdgpu_ring_insert_nop,
    .pad_ib = amdgpu_ring_generic_pad_ib,
    .emit_rreg = gfx_v8_0_ring_emit_rreg,


Thanks,
Christian.

Am 18.09.2018 um 16:41 schrieb Christian König:

CRTC and GFX interrupts seem to be working perfectly fine.

The problem here looks to be that only the EOP interrupts from the 
compute queue are not handled correctly.


Most likely a bug somewhere in gfx_v8_0_eop_irq().

Christian.

Am 18.09.2018 um 16:36 schrieb Deucher, Alexander:


FWIW, a number of consumer Raven boards have bad IVRS tables 
(Windows doesn't use interrupt remapping, so they are sometimes 
wrong and probably not validated).  There are a number of 
workarounds to manually override the IVRS tables to make 
interrupts work.  I think specifying pci=noacpi is also a 
possible workaround.



Alex


*From:* amd-gfx on behalf of Christian König

*Sent:* Tuesday, September 18, 2018 10:31:16 AM
*To:* StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
*Subject:* Re: Regression on gfx8 with ring init
Well looks like interrupt processing is working perfectly fine.

But looking at the error message once more I see that this actually
affects ring number 9 and not the GFX ring.

Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
number?

That must be some of the compute rings.

Thanks,
Christian.

Am 18.09.2018 um 16:20 schrieb Tom St Denis:
> On 2018-09-18 10:13 a.m., Christian König wrote:
>> Mhm, there is no failed IB test in there any more, is there?
>
> Oh sorry, I thought you wanted to test HEAD~ ... Attached is a log from
> the tip of drm-next
>
> Tom
>
>>
>> Christian.
>>
>> Am 18.09.2018 um 16:09 schrieb Tom St Denis:
>>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>>>
>>> Here's the log.
>>>
>>> Tom
>>>
>>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>>>> Odd I couldn't even boot my system with the dGPU as primary after
>>>> rebuilding the kernel.  It got hung up in the IOMMU driver (loads
>>>> of AMD-Vi IOMMU errors) which I wasn't able to capture because it
>>>> panic'ed before loading the network stack.
>>>>
>>>> Bizarre.
>>>>
>>>> I'll keep trying.
>>>>
>>>> Tom
>>>>
>>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>>>> Am 18.09.2018 um 15:32 schrieb Tom St Denis:
>>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>>>>>> Great, not sure if that is good or bad news.
>>>>>>>
>>>>>>> Anyway going to revert the change for now. Does anybody
>>>>>>> volunteer to figure out why interrupts sometimes don't work
>>>>>>> correctly on Raven?
>>>>>>
>>>>>> What does "doesn't work correctl

Re: Regression on gfx8 with ring init

2018-09-21 Thread Andrey Grodzovsky
No worries, I will just revert locally until then to clear the extra 
errors during my investigation of current GPU reset status and issues.



Andrey



Re: Regression on gfx8 with ring init

2018-09-21 Thread Christian König

I unfortunately don't have a Polaris to test this myself.

But please give me time till Monday so that I can at least try one more 
thing to fix it.


Christian.


Re: Regression on gfx8 with ring init

2018-09-21 Thread Andrey Grodzovsky

Ping...


Andrey



Re: Regression on gfx8 with ring init

2018-09-20 Thread Andrey Grodzovsky
What's the status of this error and the suggested patch to fix it? It 
impacts GPU reset on Polaris11.


Do we want to investigate why the original patch breaks it, or just 
disable the IB test with the proposed patch?



P.S. Suspend/resume also stopped working on the latest branch - will 
bisect it later today or tomorrow.



Andrey



Re: Regression on gfx8 with ring init

2018-09-18 Thread Christian König

Tom,

can you check whether the following makes it work again?

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index b6160de70d12..d65f5ba92fc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
    return r;
 }

+static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
+{
+   return 0;
+}

 static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
 {
@@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
    .emit_ib = gfx_v8_0_ring_emit_ib_compute,
    .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
    .test_ring = gfx_v8_0_ring_test_ring,
-   .test_ib = gfx_v8_0_ring_test_ib,
+   .test_ib = gfx_v8_0_kiq_ring_test_ib,
    .insert_nop = amdgpu_ring_insert_nop,
    .pad_ib = amdgpu_ring_generic_pad_ib,
    .emit_rreg = gfx_v8_0_ring_emit_rreg,


Thanks,
Christian.
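
For readers unfamiliar with the pattern, the patch above works by swapping a single callback in a per-ring function table, so that the IB test of one ring type becomes a no-op while the other ring types keep the real test. The following is only a minimal user-space sketch of that idea. The struct names, the irq_works flag, SKETCH_ETIMEDOUT and run_ib_test() are hypothetical stand-ins and not the driver's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value standing in for the kernel's timeout error code. */
#define SKETCH_ETIMEDOUT 110

struct ring;

/* Hypothetical, cut-down callback table (the real one has many more hooks). */
struct ring_funcs {
    int (*test_ib)(struct ring *ring, long timeout);
};

struct ring {
    const char *name;
    const struct ring_funcs *funcs;
    int irq_works; /* simulated: does the completion interrupt arrive? */
};

/* Generic test: succeeds only if the completion interrupt is seen in time. */
static int generic_test_ib(struct ring *ring, long timeout)
{
    (void)timeout;
    return ring->irq_works ? 0 : -SKETCH_ETIMEDOUT;
}

/* Stubbed test, mirroring the "return 0" added by the patch. */
static int kiq_test_ib(struct ring *ring, long timeout)
{
    (void)ring;
    (void)timeout;
    return 0;
}

static const struct ring_funcs compute_funcs = { .test_ib = generic_test_ib };
static const struct ring_funcs kiq_funcs = { .test_ib = kiq_test_ib };

/* Callers never name a concrete test; they always go through the table. */
static int run_ib_test(struct ring *ring, long timeout)
{
    assert(ring != NULL && ring->funcs != NULL && ring->funcs->test_ib != NULL);
    return ring->funcs->test_ib(ring, timeout);
}
```

With this layout a ring that uses kiq_funcs always reports success, which hides the timeout rather than explaining it. That is the trade-off raised later in the thread (investigate the cause, or just disable the test).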


Re: Regression on gfx8 with ring init

2018-09-18 Thread Christian König

CRTC and GFX interrupts seem to be working perfectly fine.

The problem here looks to be that only the EOP interrupts from the 
compute queue are not handled correctly.


Most likely a bug somewhere in gfx_v8_0_eop_irq().

Christian.
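
One general way an end-of-pipe interrupt can end up "not correctly handled" is that the handler decodes the interrupt source and then finds no registered ring that matches it. The sketch below illustrates only that general pattern. It is not taken from gfx_v8_0_eop_irq(): the bit layout, the struct and both function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring identity: micro engine (me), pipe and queue. */
struct ring_id {
    unsigned int me;
    unsigned int pipe;
    unsigned int queue;
};

/*
 * ASSUMED bit layout of the ring identifier carried by an interrupt
 * entry: bits 0-1 pipe, bits 2-3 me, bits 4-6 queue. Illustrative only.
 */
static struct ring_id decode_ring_id(unsigned int raw)
{
    struct ring_id id;

    id.pipe = raw & 0x03;
    id.me = (raw & 0x0c) >> 2;
    id.queue = (raw & 0x70) >> 4;
    return id;
}

/*
 * Return the index of the registered ring that matches the interrupt
 * source, or -1 when no ring is registered for it. In the -1 case a
 * handler has nobody to notify, so a waiter on that ring would only
 * ever see its timeout expire.
 */
static int find_ring_for_irq(const struct ring_id *rings, size_t count,
                             unsigned int raw)
{
    struct ring_id src = decode_ring_id(raw);
    size_t i;

    assert(rings != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (rings[i].me == src.me && rings[i].pipe == src.pipe &&
            rings[i].queue == src.queue)
            return (int)i;
    }
    return -1;
}
```

If a ring's (me, pipe, queue) triple is missing from the table that the handler walks, its interrupts are silently dropped even though interrupt delivery as such works. Whether that is what happens here is not established in this thread.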


Re: Regression on gfx8 with ring init

2018-09-18 Thread Tom St Denis

On 2018-09-18 10:31 a.m., Christian König wrote:

Well looks like interrupt processing is working perfectly fine.

But looking at the error message once more I see that this actually 
affects ring number 9 and not the GFX ring.


Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the 
number?


That must be some of the compute rings.


That's a bingo.

[   32.231734] [drm] Initialized amdgpu 3.27.0 20150101 for :01:00.0 on minor 0
[   32.233803] modprobe (3816) used greatest stack depth: 12464 bytes left
[   35.266007] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
[   35.266373] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring (kiq_2.1.0) 9 (-110).
[   35.403034] [drm:process_one_work] *ERROR* ib ring test failed (-110).
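
The log above shows the ring name next to the numeric index, which is what made it possible to identify the failing ring as the KIQ. A minimal user-space sketch of such a message format follows. The struct ring_info type and the format_ib_test_failure() helper are hypothetical; only the message text is modelled on the log line above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, cut-down ring descriptor; the driver's ring struct is far larger. */
struct ring_info {
    const char *name;  /* e.g. "kiq_2.1.0" */
    unsigned int idx;  /* numeric ring index */
};

/*
 * Format an IB test failure so that the ring name appears next to the
 * index. Returns 0 on success, -1 if the buffer was too small.
 */
static int format_ib_test_failure(char *buf, size_t len,
                                  const struct ring_info *ring, int err)
{
    const char *name;
    int n;

    assert(buf != NULL && ring != NULL);

    /* Fall back to a placeholder when no name has been set. */
    name = (ring->name != NULL && strlen(ring->name) > 0) ? ring->name
                                                          : "unnamed";

    n = snprintf(buf, len, "failed testing IB on ring (%s) %u (%d).",
                 name, ring->idx, err);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Printing the name as well as the index costs nothing and removes the need to map a bare ring number back to a ring by hand.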

Should point out that kfd still has the old fence logic:

[root@raven amd]# git grep enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: *  nofity when the BO is free to move. fence_add_callback --> enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: *  --> amdgpu_amdkfd_fence.enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: * amdgpu_amdkfd_fence.enable_signaling - Start a work item that will quiesce
amdgpu/amdgpu_amdkfd_fence.c: * amdkfd_fence_enable_signaling - This gets called when TTM wants to evict
amdgpu/amdgpu_amdkfd_fence.c:static bool amdkfd_fence_enable_signaling(struct dma_fence *f)
amdgpu/amdgpu_amdkfd_fence.c:   .enable_signaling = amdkfd_fence_enable_signaling,



Tom



Thanks,
Christian.

Am 18.09.2018 um 16:20 schrieb Tom St Denis:

On 2018-09-18 10:13 a.m., Christian König wrote:

Mhm, there is no failed IB test in there any more, is there?


Oh sorry, I thought you wanted to test HEAD~ ... Attached is a log from 
the tip of drm-next


Tom



Christian.

Am 18.09.2018 um 16:09 schrieb Tom St Denis:

Disabling IOMMU in the BIOS resulted in a correct boot up...

Here's the log.

Tom

On 2018-09-18 9:58 a.m., Tom St Denis wrote:
Odd I couldn't even boot my system with the dGPU as primary after 
rebuilding the kernel.  It got hung up in the IOMMU driver (loads 
of AMD-Vi IOMMU errors) which I wasn't able to capture because it 
panic'ed before loading the network stack.


Bizarre.

I'll keep trying.

Tom

On 2018-09-18 9:35 a.m., Christian König wrote:

Am 18.09.2018 um 15:32 schrieb Tom St Denis:

On 2018-09-18 9:30 a.m., Christian König wrote:

Great, not sure if that is a good or a bad news.

Anyway going to revert the change for now. Does anybody 
volunteer to figure out why interrupts sometimes doesn't work 
correctly on Raven?


What does "doesn't work correctly?"  My workstation is a Raven1 
(Ryzen 2400G) and other than the TTM bulk move issue has been 
perfectly stable (through suspend/resumes too I might add).


Anything I could test with my devel raven?


The problem seems to be that on some boards IH handling doesn't 
work as it should.


Can you try to disable the onboard graphics and try again?

If that still doesn't work there is a DRM_DEBUG in 
amdgpu_ih_process(), make that a DRM_ERROR and send me the 
resulting dmesg of loading amdgpu (but don't start any UMD).


Thanks,
Christian.




Tom



Christian.

Am 18.09.2018 um 15:27 schrieb Tom St Denis:

This commit:

[root@raven linux]# git bisect good
9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
Author: Christian König 
Date:   Tue Sep 18 10:38:09 2018 +0200

    drm/amdgpu: remove fence fallback

    DC doesn't seem to have a fallback path either.

    So when interrupts doesn't work any more we are pretty much 
busted no

    matter what.

    Signed-off-by: Christian König 
    Reviewed-by: Chunming Zhou 

Results in this:

[   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 
:07:00.0 on minor 1
[   24.335674] modprobe (3895) used greatest stack depth: 12600 
bytes left
[   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* 
amdgpu: IB test timed out.
[   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* 
amdgpu: failed testing IB on ring 9 (-110).
[   26.407885] [drm:process_one_work] *ERROR* ib ring test 
failed (-110).

[   28.506708] fuse init (API version 7.27)

On init with my polaris/raven1 system.

Cheers,
Tom
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


















___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Regression on gfx8 with ring init

2018-09-18 Thread Deucher, Alexander
FWIW, a number of consumer Raven boards have bad IVRS tables (Windows doesn't 
use interrupt remapping, so they are sometimes wrong and probably not 
validated).  There are a number of workarounds to manually override the IVRS 
tables to make interrupts work.  I think specifying pci=noacpi is also a 
possible workaround.


Alex




Re: Regression on gfx8 with ring init

2018-09-18 Thread Christian König

Well looks like interrupt processing is working perfectly fine.

But looking at the error message once more I see that this actually 
affects ring number 9 and not the GFX ring.


Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the 
number?


That must be some of the compute rings.

Thanks,
Christian.
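
A note for readers of the archive: the change requested above amounts to putting the ring's name into the failure message. The sketch below is a user-space mock, not the driver code. `struct mock_ring` and `format_ib_test_failure()` are hypothetical names, and the message layout simply copies the line Tom posted in his reply (`ring (kiq_2.1.0) 9 (-110)`).

```c
#include <stdio.h>

/* Hypothetical stand-in for the driver's ring structure; only the
 * two fields needed for the message are modelled. */
struct mock_ring {
    const char *name; /* e.g. "kiq_2.1.0" */
    int idx;          /* position of the ring in the device's ring array */
};

/* Build the IB test failure text with the ring name in parentheses,
 * followed by the ring index and the (negative) error code.
 * Returns the number of characters snprintf() wanted to write. */
static int format_ib_test_failure(char *buf, size_t len,
                                  const struct mock_ring *ring, int err)
{
    return snprintf(buf, len,
                    "amdgpu: failed testing IB on ring (%s) %d (%d).",
                    ring->name, ring->idx, err);
}
```

With the name in the message, ring 9 can be identified without counting through the ring array by hand.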



Re: Regression on gfx8 with ring init

2018-09-18 Thread Tom St Denis

On 2018-09-18 10:13 a.m., Christian König wrote:
> Mhm, there is no more failed IB-test in there, is there?

Oh sorry, I thought you wanted to test HEAD~ ... Attached is a log from 
the tip of drm-next.

Tom



amdgpu_ih_process2.log.gz
Description: application/gzip


Re: Regression on gfx8 with ring init

2018-09-18 Thread Christian König

Mhm, there is no more failed IB-test in there, is there?

Christian.



Re: Regression on gfx8 with ring init

2018-09-18 Thread Tom St Denis

Disabling IOMMU in the BIOS resulted in a correct boot up...

Here's the log.

Tom

amdgpu_ih_process.log.gz
Description: application/gzip


Re: Regression on gfx8 with ring init

2018-09-18 Thread Tom St Denis

Odd, I couldn't even boot my system with the dGPU as primary after 
rebuilding the kernel.  It got hung up in the IOMMU driver (loads of 
AMD-Vi IOMMU errors), which I wasn't able to capture because it panicked 
before loading the network stack.


Bizarre.

I'll keep trying.

Tom



Re: Regression on gfx8 with ring init

2018-09-18 Thread Christian König

On 18.09.2018 at 15:32, Tom St Denis wrote:
> On 2018-09-18 9:30 a.m., Christian König wrote:
>> Great, not sure if that is good or bad news.
>>
>> Anyway going to revert the change for now. Does anybody volunteer to
>> figure out why interrupts sometimes don't work correctly on Raven?
>
> What does "doesn't work correctly" mean?  My workstation is a Raven1
> (Ryzen 2400G) and other than the TTM bulk move issue has been perfectly
> stable (through suspend/resumes too I might add).
>
> Anything I could test with my devel raven?

The problem seems to be that on some boards IH handling doesn't work as 
it should.

Can you try to disable the onboard graphics and try again?

If that still doesn't work there is a DRM_DEBUG in amdgpu_ih_process(), 
make that a DRM_ERROR and send me the resulting dmesg of loading amdgpu 
(but don't start any UMD).

Thanks,
Christian.
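
A note for readers of the archive on why the request is to turn the DRM_DEBUG into a DRM_ERROR: debug-level DRM messages are only printed when DRM debugging has been switched on, while error-level messages always reach dmesg. The sketch below is a user-space model of that difference only. `mock_debug_mask`, `mock_debug()` and `mock_error()` are hypothetical stand-ins, not the kernel macros.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the "is DRM debugging enabled" setting. */
static unsigned int mock_debug_mask;

/* Error level: always formatted. Returns the number of characters written. */
static int mock_error(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    return n;
}

/* Debug level: silently dropped unless debugging is enabled. */
static int mock_debug(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (!mock_debug_mask) {
        if (len)
            buf[0] = '\0'; /* nothing is logged */
        return 0;
    }
    va_start(ap, fmt);
    n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    return n;
}
```

Promoting the one message of interest to error level makes it show up in a default dmesg without also enabling every other debug message.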






Re: Regression on gfx8 with ring init

2018-09-18 Thread Tom St Denis

On 2018-09-18 9:30 a.m., Christian König wrote:
> Great, not sure if that is good or bad news.
>
> Anyway going to revert the change for now. Does anybody volunteer to
> figure out why interrupts sometimes don't work correctly on Raven?

What does "doesn't work correctly" mean?  My workstation is a Raven1 (Ryzen 
2400G) and other than the TTM bulk move issue has been perfectly stable 
(through suspend/resumes too I might add).

Anything I could test with my devel raven?

Tom





Re: Regression on gfx8 with ring init

2018-09-18 Thread Christian König

Great, not sure if that is good or bad news.

Anyway, going to revert the change for now. Does anybody volunteer to 
figure out why interrupts sometimes don't work correctly on Raven?


Christian.
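
A note for readers of the archive: the bisected commit removes a fence fallback. The toy model below shows the general idea of such a fallback, assuming it is a periodic re-check of the hardware sequence number for the case where the completion interrupt never arrives. It is not the amdgpu implementation; every name in it (`mock_fence_drv`, `mock_hw_complete`, `mock_fence_wait`) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fence state: hardware writes hw_seq when a job is done,
 * the CPU only notices once mock_fence_process() has run. */
struct mock_fence_drv {
    uint32_t hw_seq;        /* last sequence number written by hardware */
    uint32_t signaled_seq;  /* last sequence number the CPU has seen */
    bool irq_works;         /* false models a missing interrupt */
    bool fallback_enabled;  /* true models the behaviour before the commit */
};

/* Normally run from the interrupt handler. */
static void mock_fence_process(struct mock_fence_drv *drv)
{
    drv->signaled_seq = drv->hw_seq;
}

/* Hardware finishes job `seq`; the handler runs only if delivery works. */
static void mock_hw_complete(struct mock_fence_drv *drv, uint32_t seq)
{
    drv->hw_seq = seq;
    if (drv->irq_works)
        mock_fence_process(drv);
}

/* Wait up to `ticks` timer ticks for `seq`.
 * Returns true if signalled, false on timeout. */
static bool mock_fence_wait(struct mock_fence_drv *drv, uint32_t seq, int ticks)
{
    for (int t = 0; t < ticks; t++) {
        if (drv->signaled_seq >= seq)
            return true;
        /* Fallback: poll the hardware value instead of trusting the IRQ. */
        if (drv->fallback_enabled)
            mock_fence_process(drv);
    }
    return drv->signaled_seq >= seq;
}
```

In this model a wait on a ring that never raises an interrupt succeeds only while the fallback is enabled, and times out once it is removed. That matches the shape of the reported failure, though the thread itself has not yet established the cause at this point.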



Regression on gfx8 with ring init

2018-09-18 Thread Tom St Denis

This commit:

[root@raven linux]# git bisect good
9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
Author: Christian König 
Date:   Tue Sep 18 10:38:09 2018 +0200

    drm/amdgpu: remove fence fallback

    DC doesn't seem to have a fallback path either.

    So when interrupts doesn't work any more we are pretty much busted no
    matter what.

    Signed-off-by: Christian König 
    Reviewed-by: Chunming Zhou 

Results in this:

[   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for :07:00.0 on minor 1
[   24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
[   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
[   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
[   26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
[   28.506708] fuse init (API version 7.27)

On init with my polaris/raven1 system.

Cheers,
Tom
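
A note for readers of the archive on the `(-110)` in the log: kernel functions report failures as negative errno values, and on Linux/x86 errno 110 is ETIMEDOUT, which agrees with the "IB test timed out" line just before it. The helpers below are a hypothetical user-space sketch for decoding such a value; they are not part of the driver.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if a kernel-style return value means
 * "timed out". Kernel-style values are negative errno codes. */
static bool ret_is_timeout(int ret)
{
    return ret == -ETIMEDOUT;
}

/* Hypothetical helper: libc's description for a negative errno value. */
static const char *ret_describe(int ret)
{
    return strerror(-ret);
}
```

Calling `ret_describe(-110)` on a Linux system returns the C library's text for a timeout; the exact wording depends on the libc in use.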