Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-20 Thread Mikhail Gavrilov
On Wed, 20 Feb 2019 at 20:39, Grodzovsky, Andrey
 wrote:
> No, we only fixed the original deadlock with display driver during GPU
> reset. I still didn't have time to go over your captures for the GPU
> page fault.
>
> The deadlock we see here is another deadlock, different from the one
> already fixed. I suggest you open a bugzilla ticket for this and add me
> there so we can track it and take care of it.
>
> Andrey
>

OK, I have filed a bug report about the new deadlock:
https://bugs.freedesktop.org/show_bug.cgi?id=109692

--
Best Regards,
Mike Gavrilov.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-20 Thread Grodzovsky, Andrey

On 2/20/19 12:28 AM, Mikhail Gavrilov wrote:
> On Tue, 19 Feb 2019 at 20:24, Grodzovsky, Andrey
>  wrote:
>> Just pull in latest drm-next from here -
>> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
>>
>> Andrey
> I tested this kernel and the result is not good for me.
> 1) "amdgpu :0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0070113C"
> happens again. I thought this would be fixed.

No, we only fixed the original deadlock with the display driver during GPU
reset. I still haven't had time to go over your captures for the GPU
page fault.

The deadlock we see here is another deadlock, different from the one 
already fixed. I suggest you open a bugzilla ticket for this and add me 
there so we can track it and take care of it.

Andrey


>
> 2) After it "WARNING: possible circular locking dependency detected" happens.
>
> [  302.266337] ==
> [  302.266338] WARNING: possible circular locking dependency detected
> [  302.266340] 5.0.0-rc1-drm-next-kernel+ #1 Tainted: G C
> [  302.266341] --
> [  302.266343] kworker/5:2/871 is trying to acquire lock:
> [  302.266345] 0abbb16a
> (&(>fence_drv.lock)->rlock){-.-.}, at:
> dma_fence_remove_callback+0x1a/0x60
> [  302.266352]
> but task is already holding lock:
> [  302.266353] 6e32ba38
> (&(>job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x140
> [gpu_sched]
> [  302.266358]
> which lock already depends on the new lock.
>
> [  302.266360]
> the existing dependency chain (in reverse order) is:
> [  302.266361]
> -> #1 (&(>job_list_lock)->rlock){-.-.}:
> [  302.266366]drm_sched_process_job+0x4d/0x180 [gpu_sched]
> [  302.266368]dma_fence_signal+0x111/0x1a0
> [  302.266414]amdgpu_fence_process+0xa3/0x100 [amdgpu]
> [  302.266470]sdma_v4_0_process_trap_irq+0x6e/0xa0 [amdgpu]
> [  302.266523]amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
> [  302.266576]amdgpu_ih_process+0x84/0xf0 [amdgpu]
> [  302.266628]amdgpu_irq_handler+0x1b/0x50 [amdgpu]
> [  302.266632]__handle_irq_event_percpu+0x3f/0x290
> [  302.266635]handle_irq_event_percpu+0x31/0x80
> [  302.266637]handle_irq_event+0x34/0x51
> [  302.266639]handle_edge_irq+0x7c/0x1a0
> [  302.266643]handle_irq+0xbf/0x100
> [  302.266646]do_IRQ+0x61/0x120
> [  302.266648]ret_from_intr+0x0/0x22
> [  302.266651]cpuidle_enter_state+0xbf/0x470
> [  302.266654]do_idle+0x1ec/0x280
> [  302.266657]cpu_startup_entry+0x19/0x20
> [  302.20]start_secondary+0x1b3/0x200
> [  302.23]secondary_startup_64+0xa4/0xb0
> [  302.24]
> -> #0 (&(>fence_drv.lock)->rlock){-.-.}:
> [  302.28]_raw_spin_lock_irqsave+0x49/0x83
> [  302.266670]dma_fence_remove_callback+0x1a/0x60
> [  302.266673]drm_sched_stop+0x59/0x140 [gpu_sched]
> [  302.266717]amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
> [  302.266761]amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
> [  302.266822]amdgpu_job_timedout+0x109/0x130 [amdgpu]
> [  302.266827]drm_sched_job_timedout+0x40/0x70 [gpu_sched]
> [  302.266831]process_one_work+0x272/0x5d0
> [  302.266834]worker_thread+0x50/0x3b0
> [  302.266836]kthread+0x108/0x140
> [  302.266839]ret_from_fork+0x27/0x50
> [  302.266840]
> other info that might help us debug this:
>
> [  302.266841]  Possible unsafe locking scenario:
>
> [  302.266842]        CPU0                    CPU1
> [  302.266843]        ----                    ----
> [  302.266844]   lock(&(>job_list_lock)->rlock);
> [  302.266846]                                lock(&(>fence_drv.lock)->rlock);
> [  302.266847]                                lock(&(>job_list_lock)->rlock);
> [  302.266849]   lock(&(>fence_drv.lock)->rlock);
> [  302.266850]
>  *** DEADLOCK ***
>
> [  302.266852] 5 locks held by kworker/5:2/871:
> [  302.266853]  #0: d133fb6e ((wq_completion)"events"){+.+.},
> at: process_one_work+0x1e9/0x5d0
> [  302.266857]  #1: 8a5c3f7e
> ((work_completion)(&(>work_tdr)->work)){+.+.}, at:
> process_one_work+0x1e9/0x5d0
> [  302.266862]  #2: b9b2c76f (>lock_reset){+.+.}, at:
> amdgpu_device_lock_adev+0x17/0x40 [amdgpu]
> [  302.266908]  #3: ac637728 (>lock_hidden){+.+.}, at:
> kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
> [  302.266965]  #4: 6e32ba38
> (&(>job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x140
> [gpu_sched]
> [  302.266971]
> stack backtrace:
> [  302.266975] CPU: 5 PID: 871 Comm: kworker/5:2 Tainted: G C
>5.0.0-rc1-drm-next-kernel+ #1
> [  302.266976] Hardware name: System manufacturer System Product
> Name/ROG STRIX X470-I GAMING, BIOS 1103 11/16/2018
> [  302.266980] Workqueue: events drm_sched_job_timedout [gpu_sched]
> [  302.266982] Call 
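
The lock ordering that lockdep complains about in the report above can be illustrated with a small userspace model. This is only a sketch in Python, not kernel code and not lockdep's real algorithm: the class `LockOrderGraph` and its methods are hypothetical names, and the two recorded orderings are read off the two dependency chains (#1 and #0) in the log.

```python
from collections import defaultdict


class LockOrderGraph:
    """Toy lock-order tracker, loosely in the spirit of lockdep (hypothetical)."""

    def __init__(self):
        # edges[a] contains b when some path acquired b while already holding a
        self.edges = defaultdict(set)

    def record(self, held, acquiring):
        """Record 'acquiring' taken while 'held' is held.

        Returns True if the new ordering closes a cycle with earlier ones.
        """
        closes_cycle = self._reachable(acquiring, held)
        self.edges[held].add(acquiring)
        return closes_cycle

    def _reachable(self, start, target):
        # Iterative depth-first search over the recorded ordering edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False


graph = LockOrderGraph()

# Chain #1 in the report: the fence IRQ path (dma_fence_signal ->
# drm_sched_process_job) takes job_list_lock while fence_drv.lock is held.
irq_path_inverts = graph.record(held="fence_drv.lock", acquiring="job_list_lock")

# Chain #0 in the report: the timeout path (drm_sched_stop ->
# dma_fence_remove_callback) takes fence_drv.lock while job_list_lock is held.
tdr_path_inverts = graph.record(held="job_list_lock", acquiring="fence_drv.lock")
```

In this model the first ordering is accepted and the second one closes the cycle, which corresponds to the "possible circular locking dependency" warning being raised from the timeout worker rather than from the interrupt path.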

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-19 Thread Grodzovsky, Andrey
Just pull in latest drm-next from here - 
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

Andrey

On 2/14/19 11:18 PM, Mikhail Gavrilov wrote:
> On Thu, 14 Feb 2019 at 20:51, Grodzovsky, Andrey
>  wrote:
>> Got it.
>>
>> Andrey
>>
> Cool, please don't forget to give me the patch for testing.
>
>
> --
> Best Regards,
> Mike Gavrilov.

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-15 Thread Mikhail Gavrilov via amd-gfx
On Thu, 14 Feb 2019 at 20:51, Grodzovsky, Andrey
 wrote:
>
> Got it.
>
> Andrey
>

Cool, please don't forget to give me the patch for testing.


--
Best Regards,
Mike Gavrilov.

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-14 Thread Grodzovsky, Andrey
Got it.

Andrey

On 2/14/19 4:32 AM, Christian König wrote:
Hey Andrey,

this is on Vega10, so the ASIC always stops after it sees the first fault.

I'm actually working on implementing that it should continue without 
interruption.

Regards,
Christian.

On 13.02.19 at 22:47, Grodzovsky, Andrey wrote:

Looks like you are still running this without the latest hang fix since i see 
the deadlock again, but actually what i forgot to ask you is to load amdgpu 
with vm_fault_stop=2 to freeze the ASIC once VM_FAULT is encountered - sorry 
about that. So please retest with amdgpu.vm_fault_stop=2 parameter in GRUB 
loader.

Andrey

On 2/13/19 3:08 PM, Mikhail Gavrilov wrote:
On Wed, 13 Feb 2019 at 23:40, Grodzovsky, Andrey wrote:
>
> Regarding the original VM_FAULT, we can try to debug that a bit too.
> Enable this from trace-cmd:
>
> sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"
>
> and when the hang happens, as root:
>
> cd /sys/kernel/debug/tracing && cat trace > event_dump
>
> As usual, it would also be nice to have the relevant wave dump and
> registers from UMR, plus dmesg.
>
> Andrey
gfx.tar.xz


Just in case, I duplicated all the files on the file sharing service Mega:
https://mega.nz/#F!pgYCjYrS!NkeTFIja_qwmxqLoSEUyzA


--
Best Regards,
Mike Gavrilov.




Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-14 Thread Christian König via amd-gfx

Hey Andrey,

this is on Vega10, so the ASIC always stops after it sees the first fault.

I'm actually working on making it continue without
interruption.


Regards,
Christian.

On 13.02.19 at 22:47, Grodzovsky, Andrey wrote:


Looks like you are still running this without the latest hang fix 
since i see the deadlock again, but actually what i forgot to ask you 
is to load amdgpu with vm_fault_stop=2 to freeze the ASIC once 
VM_FAULT is encountered - sorry about that. So please retest with 
amdgpu.vm_fault_stop=2 parameter in GRUB loader.


Andrey

On 2/13/19 3:08 PM, Mikhail Gavrilov wrote:
On Wed, 13 Feb 2019 at 23:40, Grodzovsky, Andrey wrote:
>
> Regarding the original VM_FAULT, we can try to debug that a bit too.
> Enable this from trace-cmd:
>
> sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"
>
> and when the hang happens, as root:
>
> cd /sys/kernel/debug/tracing && cat trace > event_dump
>
> As usual, it would also be nice to have the relevant wave dump and
> registers from UMR, plus dmesg.
>
> Andrey
gfx.tar.xz




Just in case, I duplicated all the files on the file sharing service Mega:

https://mega.nz/#F!pgYCjYrS!NkeTFIja_qwmxqLoSEUyzA


--
Best Regards,
Mike Gavrilov.



Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-14 Thread Michel Dänzer

[ Puts on list administrator hat ]

On 2019-02-14 5:16 a.m., Mikhail Gavrilov via amd-gfx wrote:
> 
> Just in case, I duplicated all the files on the file sharing service Mega:
> https://mega.nz/#F!pgYCjYrS!NkeTFIja_qwmxqLoSEUyzA

Please only share such large files via an external service, don't send
huge e-mails to the mailing list.

Also, please subscribe your e-mail address to the mailing list, so that
it'll be clear to list moderators when your posts are too large (but if
they're not, your posts will go through without a moderator needing to
approve them).


To everybody:

Please trim quoted text when following up to another post. There have
been instances with a single paragraph of text, fitting in a few KB,
followed by multiple MBs of quoted text it didn't reference. If gmail
makes trimming quoted text too hard, please consider using a better tool.

Thanks for your cooperation.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-13 Thread Grodzovsky, Andrey
It looks like you are still running this without the latest hang fix, since I
see the deadlock again. But what I actually forgot to ask you is to load amdgpu
with vm_fault_stop=2 to freeze the ASIC once a VM_FAULT is encountered - sorry
about that. So please retest with the amdgpu.vm_fault_stop=2 parameter on the
kernel command line in GRUB.
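
After rebooting, it can be worth confirming that the parameter really reached the driver before reproducing the hang. The following is a minimal sketch, assuming the usual procfs and sysfs locations (`/proc/cmdline` and `/sys/module/amdgpu/parameters/vm_fault_stop`); the helper names `cmdline_value` and `report` are made up for this example, and the parser does not handle quoted values containing spaces.

```python
from pathlib import Path

# Assumption: standard procfs/sysfs paths on a Linux system with amdgpu loaded.
PROC_CMDLINE = Path("/proc/cmdline")
SYSFS_PARAM = Path("/sys/module/amdgpu/parameters/vm_fault_stop")


def cmdline_value(cmdline, key):
    """Return the value of the last 'key=value' token in a command line, or None."""
    value = None
    for token in cmdline.split():
        name, sep, rest = token.partition("=")
        if sep and name == key:
            value = rest
    return value


def report():
    """Print what the kernel was booted with and what the module reports now."""
    if PROC_CMDLINE.exists():
        booted = cmdline_value(PROC_CMDLINE.read_text(), "amdgpu.vm_fault_stop")
        print("cmdline:", booted)
    if SYSFS_PARAM.exists():
        print("module :", SYSFS_PARAM.read_text().strip())
```

Calling `report()` on the test machine should show `2` in both places if the GRUB entry was edited correctly; if the sysfs file is missing, amdgpu is probably not loaded.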

Andrey

On 2/13/19 3:08 PM, Mikhail Gavrilov wrote:
On Wed, 13 Feb 2019 at 23:40, Grodzovsky, Andrey wrote:
>
> Regarding the original VM_FAULT, we can try to debug that a bit too.
> Enable this from trace-cmd:
>
> sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"
>
> and when the hang happens, as root:
>
> cd /sys/kernel/debug/tracing && cat trace > event_dump
>
> As usual, it would also be nice to have the relevant wave dump and
> registers from UMR, plus dmesg.
>
> Andrey
gfx.tar.xz


Just in case, I duplicated all the files on the file sharing service Mega:
https://mega.nz/#F!pgYCjYrS!NkeTFIja_qwmxqLoSEUyzA


--
Best Regards,
Mike Gavrilov.

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-13 Thread Grodzovsky, Andrey
OK, just apply the following to your amdgpu_dm_do_flip function and see 
if GPU reset does proceed after you experience the hang.

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index d59bafc..586301f 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -4809,7 +4809,7 @@ static void amdgpu_dm_do_flip(struct drm_atomic_state *state,
 
 	/* Wait for all fences on this FB */
 	WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
-						    MAX_SCHEDULE_TIMEOUT) < 0);
+						    msecs_to_jiffies(5000)) < 0);
 
 	amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
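
One detail to keep in mind when testing this change: according to the kernel-doc for reservation_object_wait_timeout_rcu, the function returns a negative error code on failure (for example -ERESTARTSYS if an interruptible wait is interrupted), 0 if the wait timed out, and a value greater than zero on success. The small Python model below is only an illustration of those three cases; `classify_wait_result` and `warn_on_fires` are hypothetical helper names, not kernel functions.

```python
ERESTARTSYS = 512  # kernel-internal errno value, used here only for the example


def classify_wait_result(ret):
    """Map the wait's return value to an outcome, following the kernel-doc."""
    if ret < 0:
        return "error"      # e.g. -ERESTARTSYS for an interrupted wait
    if ret == 0:
        return "timed_out"  # fences did not signal within the timeout
    return "signaled"       # remaining jiffies (> 0): the fences signaled


def warn_on_fires(ret):
    """Mirror the check in the patch: WARN_ON(ret < 0)."""
    return ret < 0
```

Under this reading, a 5 second timeout expiring yields 0, so the `WARN_ON(... < 0)` stays silent and the flip code simply carries on. A missing warning in dmesg therefore does not by itself mean that the fences signaled.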

Andrey

On 2/13/19 8:59 AM, Mikhail Gavrilov wrote:
> On Wed, 13 Feb 2019 at 00:44, Grodzovsky, Andrey
>  wrote:
>> Sorry, for your kernel this particular set of prints should go in 
>> amdgpu_dm_do_flip
>>
>
> Kernel logs became very weird after yesterday's patch.
> There are too many messages, even without reproducing the issue which
> causes "ring gfx timeout".
>
> --
> Best Regards,
> Mike Gavrilov.

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-12 Thread Mikhail Gavrilov via amd-gfx
On Tue, 12 Feb 2019 at 20:23, Grodzovsky, Andrey
 wrote:
>
> It should recover you - so this looks like a bug. I noticed in one of
> the call traces this - drm_atomic_helper_suspend which points to system
> going into sleep mode, is it what happened, did it hang when system
> tried to sleep ?
>

It's weird, because the computer did not enter sleep mode. I am sure.
Steps to reproduce:
1. Launch Shadow of the Tomb Raider on Proton
2. Wait some time until the mouse stops responding
3. Dump gfx, waves and all other dumps including dmesg

And of course the power button (the button which enters sleep mode) was
not pressed.

So do the new dumps have any new useful info? Or are they pointless?
--
Best Regards,
Mike Gavrilov.

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-12 Thread Grodzovsky, Andrey
Sure, that would probably be the solution. One missing detail here
(besides confirming with the debug prints that this is the scenario we
are hitting) is WHY we even get stuck in
reservation_object_wait_timeout_rcu. In amdgpu_device_pre_asic_reset
(during GPU reset) we first force completion of all outstanding HW fences
through amdgpu_fence_driver_force_completion BEFORE proceeding to
suspend the IP blocks in amdgpu_device_ip_suspend. One possible
explanation would be that the fence attached to the BO is a scheduler
fence (SW fence) and not the backing HW fence. I will be able to verify
this with some fence traces after confirming that the deadlock is indeed
the one I described.

Andrey

On 2/12/19 1:29 PM, Kazlauskas, Nicholas wrote:
> The MAX_SCHEDULE_TIMEOUT is probably not a good idea on the wait in DM.
>
> I wonder if we could just do shorter wait and skip the FB
> update/programming if it fails after some reasonable amount of time.
>
> This would still allow recovery to happen at least even if the display
> isn't showing the right buffer.
>
> Nicholas Kazlauskas
>
> On 2/12/19 12:46 PM, Grodzovsky, Andrey wrote:
>> I suspect the issue is that amdgpu_dm_do_flip is holding the BO reserved
>> and then stuck waiting for fences to signal in
>> reservation_object_wait_timeout_rcu (which won't signal because there
>> was a VM_FAULT). Then when we try to shutdown display block during reset
>> recovery from drm_atomic_helper_suspend we also try to reserve the BO,
>> probably from dm_plane_helper_cleanup_fb ending in deadlock.
>>
>> To confirm i am attaching some printks around the BO reservation -
>> please apply and rerun.
>>
>> Also, probably a good idea to open FDO ticket on this instead of using
>> amd-gfx.
>>
>> Andrey
>>
>>
>> On 2/12/19 10:49 AM, Mikhail Gavrilov wrote:
>>> On Tue, 12 Feb 2019 at 20:23, Grodzovsky, Andrey
>>>  wrote:
>>>> It should recover you - so this looks like a bug. I noticed in one of
>>>> the call traces this - drm_atomic_helper_suspend which points to system
>>>> going into sleep mode, is it what happened, did it hang when system
>>>> tried to sleep ?
>>>>
>>> It's weird because the computer was not enter in sleep mode. I am sure.
>>> Steps for reproduce:
>>> 1. Launch Shadow of The tomb Rider on Proton
>>> 2. Wait some time until mouse stop respond
>>> 3. Dump gfx, waves and all other dumps including dmesg
>>>
>>> And of course the power button (button which enter in sleep mode) was
>>> not pressed.
>>>
>>> So the new dumps has any new useful info? Or they are pointless?
>>> --
>>> Best Regards,
>>> Mike Gavrilov.
>>>

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-12 Thread Nicholas Kazlauskas
MAX_SCHEDULE_TIMEOUT is probably not a good idea for the wait in DM.

I wonder if we could just do a shorter wait and skip the FB
update/programming if it fails after some reasonable amount of time.

This would at least still allow recovery to happen, even if the display
isn't showing the right buffer.
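
The idea of a bounded wait that skips the framebuffer update on timeout can be sketched as a small userspace model. Everything here is hypothetical and written in Python rather than kernel C: `Fence`, `wait_fences`, `do_flip` and the tick-based timeout are illustration-only names, not DM or DRM APIs. The return convention of `wait_fences` (0 on timeout, remaining ticks otherwise) is chosen to resemble the kernel wait helpers.

```python
class Fence:
    """Stand-in for a fence; signal_at is the tick at which it signals (None = never)."""

    def __init__(self, signal_at):
        self.signal_at = signal_at


def wait_fences(fences, timeout_ticks):
    """Return remaining ticks (> 0) if all fences signal in time, else 0."""
    latest = 0
    for fence in fences:
        if fence.signal_at is None or fence.signal_at >= timeout_ticks:
            return 0  # at least one fence would not signal before the timeout
        latest = max(latest, fence.signal_at)
    return timeout_ticks - latest


def do_flip(fences, timeout_ticks, program_fb):
    """Wait a bounded time and program the new buffer only if the wait succeeded."""
    if wait_fences(fences, timeout_ticks) == 0:
        # Skip the FB update so the caller can drop its reservation
        # and a pending reset is not blocked behind this flip.
        return False
    program_fb()
    return True
```

With a fence that never signals (as after a VM fault), `do_flip` returns `False` without calling `program_fb`, so the display may keep showing a stale buffer but the caller is free to release the buffer and let recovery continue.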

Nicholas Kazlauskas

On 2/12/19 12:46 PM, Grodzovsky, Andrey wrote:
> I suspect the issue is that amdgpu_dm_do_flip is holding the BO reserved
> and then stuck waiting for fences to signal in
> reservation_object_wait_timeout_rcu (which won't signal because there
> was a VM_FAULT). Then when we try to shutdown display block during reset
> recovery from drm_atomic_helper_suspend we also try to reserve the BO,
> probably from dm_plane_helper_cleanup_fb ending in deadlock.
> 
> To confirm i am attaching some printks around the BO reservation -
> please apply and rerun.
> 
> Also, probably a good idea to open FDO ticket on this instead of using
> amd-gfx.
> 
> Andrey
> 
> 
> On 2/12/19 10:49 AM, Mikhail Gavrilov wrote:
>> On Tue, 12 Feb 2019 at 20:23, Grodzovsky, Andrey
>>  wrote:
>>> It should recover you - so this looks like a bug. I noticed in one of
>>> the call traces this - drm_atomic_helper_suspend which points to system
>>> going into sleep mode, is it what happened, did it hang when system
>>> tried to sleep ?
>>>
>> It's weird because the computer was not enter in sleep mode. I am sure.
>> Steps for reproduce:
>> 1. Launch Shadow of The tomb Rider on Proton
>> 2. Wait some time until mouse stop respond
>> 3. Dump gfx, waves and all other dumps including dmesg
>>
>> And of course the power button (button which enter in sleep mode) was
>> not pressed.
>>
>> So the new dumps has any new useful info? Or they are pointless?
>> --
>> Best Regards,
>> Mike Gavrilov.
>>

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-12 Thread Grodzovsky, Andrey

On 2/12/19 7:34 AM, Mikhail Gavrilov wrote:
> Hi folks. Sorry for the noise.
> But I really don't know whether it is enough to send my logs or not.
> As I understand it, different sequences may cause "ring gfx timeout".
> I also have not heard which version I need to wait for, or which patch
> I need to apply before testing again.
> But today I again reproduced "ring gfx timeout" on the latest kernel and mesa.
> Kernel: 5.0.0-0.rc6
> Mesa: 19.0.0-rc2
> LLVM: 7.0.1
>
> I also saw the line "amdgpu :0b:00.0: GPU reset begin!" in the kernel
> log, but it does not prevent the computer from hanging.
>
> I want to ask about my expectations. Is it right that after "GPU reset
> begin!" the GPU should finally resume work? Or are my expectations in vain?


It should recover, so this looks like a bug. In one of the call traces I
noticed drm_atomic_helper_suspend, which points to the system going into
sleep mode. Is that what happened? Did it hang when the system tried to
sleep?

Andrey


>
> --
> Best Regards,
> Mike Gavrilov.

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-12 Thread Grodzovsky, Andrey
They are useful. I am gonna take a look later.

Andrey

On 2/12/19 10:49 AM, Mikhail Gavrilov wrote:
> On Tue, 12 Feb 2019 at 20:23, Grodzovsky, Andrey
>  wrote:
>> It should recover you - so this looks like a bug. I noticed in one of
>> the call traces this - drm_atomic_helper_suspend which points to system
>> going into sleep mode, is it what happened, did it hang when system
>> tried to sleep ?
>>
> It's weird because the computer was not enter in sleep mode. I am sure.
> Steps for reproduce:
> 1. Launch Shadow of The tomb Rider on Proton
> 2. Wait some time until mouse stop respond
> 3. Dump gfx, waves and all other dumps including dmesg
>
> And of course the power button (button which enter in sleep mode) was
> not pressed.
>
> So the new dumps has any new useful info? Or they are pointless?
> --
> Best Regards,
> Mike Gavrilov.

Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

2019-02-12 Thread Grodzovsky, Andrey
I suspect the issue is that amdgpu_dm_do_flip is holding the BO reserved
and is then stuck waiting for fences to signal in
reservation_object_wait_timeout_rcu (they won't signal because there
was a VM_FAULT). Then, when we try to shut down the display block during
reset recovery from drm_atomic_helper_suspend, we also try to reserve the
BO, probably from dm_plane_helper_cleanup_fb, ending in a deadlock.

To confirm, I am attaching some printks around the BO reservation -
please apply and rerun.

Also, it is probably a good idea to open an FDO ticket on this instead of
using amd-gfx.

Andrey


On 2/12/19 10:49 AM, Mikhail Gavrilov wrote:
> On Tue, 12 Feb 2019 at 20:23, Grodzovsky, Andrey
>  wrote:
>> It should recover you - so this looks like a bug. I noticed in one of
>> the call traces this - drm_atomic_helper_suspend which points to system
>> going into sleep mode, is it what happened, did it hang when system
>> tried to sleep ?
>>
> It's weird because the computer was not enter in sleep mode. I am sure.
> Steps for reproduce:
> 1. Launch Shadow of The tomb Rider on Proton
> 2. Wait some time until mouse stop respond
> 3. Dump gfx, waves and all other dumps including dmesg
>
> And of course the power button (button which enter in sleep mode) was
> not pressed.
>
> So the new dumps has any new useful info? Or they are pointless?
> --
> Best Regards,
> Mike Gavrilov.
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index d59bafc..e15cd3c 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2353,6 +2353,8 @@ static int get_fb_info(const struct amdgpu_framebuffer *amdgpu_fb,
   uint64_t *tiling_flags)
 {
struct amdgpu_bo *rbo = gem_to_amdgpu_bo(amdgpu_fb->base.obj[0]);
+
+   DRM_ERROR("Before %p\n",rbo);
int r = amdgpu_bo_reserve(rbo, false);
 
if (unlikely(r)) {
@@ -2362,6 +2364,8 @@ static int get_fb_info(const struct amdgpu_framebuffer *amdgpu_fb,
return r;
}
 
+   DRM_ERROR("After %p\n",rbo);
+
if (tiling_flags)
amdgpu_bo_get_tiling_flags(rbo, tiling_flags);
 
@@ -3715,9 +3719,11 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
obj = new_state->fb->obj[0];
rbo = gem_to_amdgpu_bo(obj);
adev = amdgpu_ttm_adev(rbo->tbo.bdev);
+   DRM_ERROR("Before %p\n",rbo);
r = amdgpu_bo_reserve(rbo, false);
if (unlikely(r != 0))
return r;
+   DRM_ERROR("After %p\n",rbo);
 
if (plane->type != DRM_PLANE_TYPE_CURSOR)
domain = amdgpu_display_supported_domains(adev);
@@ -3790,11 +3796,13 @@ static void dm_plane_helper_cleanup_fb(struct drm_plane *plane,
return;
 
rbo = gem_to_amdgpu_bo(old_state->fb->obj[0]);
+   DRM_ERROR("Before %d\n",__LINE__);
r = amdgpu_bo_reserve(rbo, false);
if (unlikely(r)) {
DRM_ERROR("failed to reserve rbo before unpin\n");
return;
}
+   DRM_ERROR("After %d\n",__LINE__);
 
amdgpu_bo_unpin(rbo);
amdgpu_bo_unreserve(rbo);
@@ -4801,15 +4809,17 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 * blocking commit to as per framework helpers
 */
abo = gem_to_amdgpu_bo(fb->obj[0]);
+   DRM_ERROR("Before %p\n",abo);
r = amdgpu_bo_reserve(abo, true);
if (unlikely(r != 0)) {
DRM_ERROR("failed to reserve buffer before flip\n");
WARN_ON(1);
}
-
+   DRM_ERROR("After %p\n",abo);
/* Wait for all fences on this FB */
WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
-   MAX_SCHEDULE_TIMEOUT) < 0);
+   msecs_to_jiffies(5000)) < 0);
+   DRM_ERROR("After  reservation_object_wait_timeout_rcu %p\n",abo);
 
amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
