Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-28 Thread Felix Kuehling
I see, thanks for clarifying. So this is happening because we unmap the 
HIQ with direct MMIO register writes instead of using the KIQ.



I'm OK with this patch as a workaround, but as a proper fix, we should 
probably add a hiq_hqd_destroy function that uses KIQ, similar to how we 
have hiq_mqd_load functions that use KIQ to map the HIQ.



Regards,
  Felix



On 2022-01-27 at 21:34, Yin, Tianci (Rico) wrote:


[AMD Official Use Only]


The error message comes from the HIQ dequeue procedure, not from the HCQ, so 
no doorbell is written.


Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072]  dump_stack+0x7d/0x9c
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651]  hqd_destroy_v10_3+0x58/0x254 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.48]  destroy_mqd+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884]  kernel_queue_uninit+0xcf/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985]  pm_uninit+0x1a/0x30 [amdgpu] #kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue == HIQ
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127]  stop_cpsch+0x98/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242]  kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338]  kgd2kfd_suspend+0x1b/0x20 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433]  amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529]  amdgpu_device_fini_hw+0x182/0x335 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655]  amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732]  amdgpu_pci_remove+0x27/0x40 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806]  pci_device_remove+0x3e/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809]  device_release_driver_internal+0x103/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813]  driver_detach+0x4c/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814]  bus_remove_driver+0x5c/0xd0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815]  driver_unregister+0x31/0x50
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817]  pci_unregister_driver+0x40/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818]  amdgpu_exit+0x15/0x2d1 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942]  __x64_sys_delete_module+0x147/0x260
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944]  ? exit_to_user_mode_prepare+0x41/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946]  ? ksys_write+0x67/0xe0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948]  do_syscall_64+0x40/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951]  entry_SYSCALL_64_after_hwframe+0x44/0xae


Regards,
Rico

*From:* Kuehling, Felix
*Sent:* Thursday, January 27, 2022 23:28
*To:* Yin, Tianci (Rico); Wang, Yang(Kevin); amd-gfx@lists.freedesktop.org
*Cc:* Grodzovsky, Andrey; Chen, Guchun
*Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
The hang you're seeing is the result of a command submission of an
UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a
doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up
the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this
doorbell not trigger gfxoff exit during rmmod?


Regards,
   Felix



On 2022-01-26 at 22:38, Yin, Tianci (Rico) wrote:
>
> [AMD Official Use Only]
>
>
> The rmmod ops has prerequisite multi-user target and blacklist amdgpu,
> which is IGT requirement so that IGT can make itself DRM master to
> test KMS.
> igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload
>
> From my understanding, the KFD process belongs to the regular way of
> gfxoff exit, which doorbell writing triggers gfxoff exit. For example,
> KFD maps HCQ thru cmd on HIQ or KIQ ring, or UMD commits jobs on HCQ,
> these both trigger doorbell writing(pls refer to
> gfx_v10_0_ring_set_wptr_compute()).
>
> As to the IGT reload test, the dequeue request is not thru a cmd on a
> ring, it directly writes CP registers, so GFX core remains in gfxoff.
>
> Thanks,
> Rico
>
> 
> *From:* Kuehling, Felix 
> *Sent:* Wednesday, January 26, 2022 23:08
> *To:* Yin, Tianci (Rico) ; Wang, Yang(Kevin)
> ; amd-gfx@lists.freedesktop.org
> 
> *Cc:* Grodzovsky, Andrey ; Chen, Guchun
> 
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> My question is, why is this problem only seen during module unload? Why
> aren't we seeing HWS hangs due to GFX_OFF all the time in normal
> operations? For example when the GPU is idle and a new KFD process is
> started, creating a new runlist. Are we just getting lucky because the
> process first has to alloca

Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-27 Thread Yin, Tianci (Rico)
[AMD Official Use Only]

The error message comes from the HIQ dequeue procedure, not from the HCQ, so no 
doorbell is written.

Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072]  dump_stack+0x7d/0x9c
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651]  hqd_destroy_v10_3+0x58/0x254 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.48]  destroy_mqd+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884]  kernel_queue_uninit+0xcf/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985]  pm_uninit+0x1a/0x30 [amdgpu] #kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue == HIQ
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127]  stop_cpsch+0x98/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242]  kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338]  kgd2kfd_suspend+0x1b/0x20 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433]  amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529]  amdgpu_device_fini_hw+0x182/0x335 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655]  amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732]  amdgpu_pci_remove+0x27/0x40 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806]  pci_device_remove+0x3e/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809]  device_release_driver_internal+0x103/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813]  driver_detach+0x4c/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814]  bus_remove_driver+0x5c/0xd0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815]  driver_unregister+0x31/0x50
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817]  pci_unregister_driver+0x40/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818]  amdgpu_exit+0x15/0x2d1 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942]  __x64_sys_delete_module+0x147/0x260
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944]  ? exit_to_user_mode_prepare+0x41/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946]  ? ksys_write+0x67/0xe0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948]  do_syscall_64+0x40/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Regards,
Rico

From: Kuehling, Felix
Sent: Thursday, January 27, 2022 23:28
To: Yin, Tianci (Rico); Wang, Yang(Kevin); amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey; Chen, Guchun
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

The hang you're seeing is the result of a command submission of an
UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a
doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up
the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this
doorbell not trigger gfxoff exit during rmmod?


Regards,
   Felix



On 2022-01-26 at 22:38, Yin, Tianci (Rico) wrote:
>
> [AMD Official Use Only]
>
>
> The rmmod ops has prerequisite multi-user target and blacklist amdgpu,
> which is IGT requirement so that IGT can make itself DRM master to
> test KMS.
> igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload
>
> From my understanding, the KFD process belongs to the regular way of
> gfxoff exit, which doorbell writing triggers gfxoff exit. For example,
> KFD maps HCQ thru cmd on HIQ or KIQ ring, or UMD commits jobs on HCQ,
> these both trigger doorbell writing(pls refer to
> gfx_v10_0_ring_set_wptr_compute()).
>
> As to the IGT reload test, the dequeue request is not thru a cmd on a
> ring, it directly writes CP registers, so GFX core remains in gfxoff.
>
> Thanks,
> Rico
>
> 
> *From:* Kuehling, Felix 
> *Sent:* Wednesday, January 26, 2022 23:08
> *To:* Yin, Tianci (Rico) ; Wang, Yang(Kevin)
> ; amd-gfx@lists.freedesktop.org
> 
> *Cc:* Grodzovsky, Andrey ; Chen, Guchun
> 
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> My question is, why is this problem only seen during module unload? Why
> aren't we seeing HWS hangs due to GFX_OFF all the time in normal
> operations? For example when the GPU is idle and a new KFD process is
> started, creating a new runlist. Are we just getting lucky because the
> process first has to allocate some memory, which maybe makes some HW
> access (flushing TLBs etc.) that wakes up the GPU?
>
>
> Regards,
>    Felix
>
>
>
> On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > Thanks Kevin and Felix!
> >
> > In gfxoff state, the dequeue request(by cp register writing) can't
> > make gfxoff exit, actually the cp is powered off and the cp register
> > writi

Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-27 Thread Felix Kuehling
The hang you're seeing is the result of a command submission of an 
UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a 
doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up 
the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this 
doorbell not trigger gfxoff exit during rmmod?



Regards,
  Felix



On 2022-01-26 at 22:38, Yin, Tianci (Rico) wrote:


[AMD Official Use Only]


The rmmod operation requires the multi-user target and a blacklisted amdgpu 
module; this is an IGT requirement, so that IGT can make itself DRM master 
to test KMS.

igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload

From my understanding, the KFD process uses the regular way of exiting 
gfxoff, in which a doorbell write triggers the gfxoff exit. For example, 
KFD mapping an HCQ through a command on the HIQ or KIQ ring, or the UMD 
committing jobs on an HCQ, both trigger a doorbell write (please refer to 
gfx_v10_0_ring_set_wptr_compute()).


As for the IGT reload test, the dequeue request does not go through a 
command on a ring; it writes CP registers directly, so the GFX core remains 
in gfxoff.


Thanks,
Rico


*From:* Kuehling, Felix
*Sent:* Wednesday, January 26, 2022 23:08
*To:* Yin, Tianci (Rico); Wang, Yang(Kevin); amd-gfx@lists.freedesktop.org
*Cc:* Grodzovsky, Andrey; Chen, Guchun
*Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
My question is, why is this problem only seen during module unload? Why
aren't we seeing HWS hangs due to GFX_OFF all the time in normal
operations? For example when the GPU is idle and a new KFD process is
started, creating a new runlist. Are we just getting lucky because the
process first has to allocate some memory, which maybe makes some HW
access (flushing TLBs etc.) that wakes up the GPU?


Regards,
   Felix



On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:
>
> [AMD Official Use Only]
>
>
> Thanks Kevin and Felix!
>
> In gfxoff state, the dequeue request(by cp register writing) can't
> make gfxoff exit, actually the cp is powered off and the cp register
> writing is invalid, doorbell registers writing(regular way) or
> directly request smu to disable gfx powergate(by invoking
> amdgpu_gfx_off_ctrl) can trigger gfxoff exit.
>
> I have also tried
> amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false),
> but it has no effect.
>
> [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
> [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0x
> [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0x
> [10386.162297] amdgpu: mmCP_STAT : 0x
> [10386.162303] amdgpu: mmCP_BUSY_STAT : 0x
> [10386.162308] amdgpu: mmRLC_STAT : 0x
> [10386.162314] amdgpu: mmGRBM_STATUS : 0x
> [10386.162320] amdgpu: mmGRBM_STATUS2: 0x
>
> Thanks again!
> Rico
> 
> *From:* Kuehling, Felix 
> *Sent:* Tuesday, January 25, 2022 23:31
> *To:* Wang, Yang(Kevin) ; Yin, Tianci (Rico)
> ; amd-gfx@lists.freedesktop.org
> 
> *Cc:* Grodzovsky, Andrey ; Chen, Guchun
> 
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> I have no objection to the change. It restores the sequence that was
> used before e9669fb78262. But I don't understand why GFX_OFF is causing
> a preemption error during module unload, but not when KFD is in normal
> use. Maybe it's because of the compute power profile that's normally set
> by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.
>
>
> Either way, the patch is
>
> Acked-by: Felix Kuehling 
>
>
>
> On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > [AMD Official Use Only]
> >
> >
> > the issue is introduced in following patch, so add following
> > information is better.
> > /fixes: (e9669fb78262) drm/amdgpu: Add early fini callback/
> > /
> > /
> > Reviewed-by: Yang Wang 
> > /
> > /
> > Best Regards,
> > Kevin
> >
> > --------------------
> > *From:* amd-gfx  on behalf of
> > Tianci Yin 
> > *Sent:* Tuesday, January 25, 2022 6:03 PM
> > *To:* amd-gfx@lists.freedesktop.org 
> > *Cc:* Grodzovsky, Andrey ; Yin, Tianci
> > (Rico) ; Chen, Guchun 
> > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> > From: "Tianci.Yin" 
> >
> > [why]
> > In rmmod procedure, kfd sends cp a dequeue request, but the
> > request does not get response, then an error message "cp
> > queue pipe 4 queue 0 preemption failed" printed.
> >
> > [how]
> > Performing kfd suspending 

Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-26 Thread Yin, Tianci (Rico)
[AMD Official Use Only]

The rmmod operation requires the multi-user target and a blacklisted amdgpu 
module; this is an IGT requirement, so that IGT can make itself DRM master to 
test KMS.
igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload

From my understanding, the KFD process uses the regular way of exiting gfxoff, 
in which a doorbell write triggers the gfxoff exit. For example, KFD mapping an 
HCQ through a command on the HIQ or KIQ ring, or the UMD committing jobs on an 
HCQ, both trigger a doorbell write (please refer to 
gfx_v10_0_ring_set_wptr_compute()).

As for the IGT reload test, the dequeue request does not go through a command 
on a ring; it writes CP registers directly, so the GFX core remains in gfxoff.

Thanks,
Rico


From: Kuehling, Felix
Sent: Wednesday, January 26, 2022 23:08
To: Yin, Tianci (Rico); Wang, Yang(Kevin); amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey; Chen, Guchun
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

My question is, why is this problem only seen during module unload? Why
aren't we seeing HWS hangs due to GFX_OFF all the time in normal
operations? For example when the GPU is idle and a new KFD process is
started, creating a new runlist. Are we just getting lucky because the
process first has to allocate some memory, which maybe makes some HW
access (flushing TLBs etc.) that wakes up the GPU?


Regards,
   Felix



On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:
>
> [AMD Official Use Only]
>
>
> Thanks Kevin and Felix!
>
> In gfxoff state, the dequeue request(by cp register writing) can't
> make gfxoff exit, actually the cp is powered off and the cp register
> writing is invalid, doorbell registers writing(regular way) or
> directly request smu to disable gfx powergate(by invoking
> amdgpu_gfx_off_ctrl) can trigger gfxoff exit.
>
> I have also tried
> amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false),
> but it has no effect.
>
> [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
> [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0x
> [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0x
> [10386.162297] amdgpu: mmCP_STAT : 0x
> [10386.162303] amdgpu: mmCP_BUSY_STAT : 0x
> [10386.162308] amdgpu: mmRLC_STAT : 0x
> [10386.162314] amdgpu: mmGRBM_STATUS : 0x
> [10386.162320] amdgpu: mmGRBM_STATUS2: 0x
>
> Thanks again!
> Rico
> 
> *From:* Kuehling, Felix 
> *Sent:* Tuesday, January 25, 2022 23:31
> *To:* Wang, Yang(Kevin) ; Yin, Tianci (Rico)
> ; amd-gfx@lists.freedesktop.org
> 
> *Cc:* Grodzovsky, Andrey ; Chen, Guchun
> 
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> I have no objection to the change. It restores the sequence that was
> used before e9669fb78262. But I don't understand why GFX_OFF is causing
> a preemption error during module unload, but not when KFD is in normal
> use. Maybe it's because of the compute power profile that's normally set
> by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.
>
>
> Either way, the patch is
>
> Acked-by: Felix Kuehling 
>
>
>
> On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > [AMD Official Use Only]
> >
> >
> > the issue is introduced in following patch, so add following
> > information is better.
> > /fixes: (e9669fb78262) drm/amdgpu: Add early fini callback/
> > /
> > /
> > Reviewed-by: Yang Wang 
> > /
> > /
> > Best Regards,
> > Kevin
> >
> > --------------------
> > *From:* amd-gfx  on behalf of
> > Tianci Yin 
> > *Sent:* Tuesday, January 25, 2022 6:03 PM
> > *To:* amd-gfx@lists.freedesktop.org 
> > *Cc:* Grodzovsky, Andrey ; Yin, Tianci
> > (Rico) ; Chen, Guchun 
> > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> > From: "Tianci.Yin" 
> >
> > [why]
> > In rmmod procedure, kfd sends cp a dequeue request, but the
> > request does not get response, then an error message "cp
> > queue pipe 4 queue 0 preemption failed" printed.
> >
> > [how]
> > Performing kfd suspending after disabling gfxoff can fix it.
> >
> > Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> > Signed-off-by: Tianci.Yin 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index b75d67f644e5..77

Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-26 Thread Felix Kuehling
My question is, why is this problem only seen during module unload? Why 
aren't we seeing HWS hangs due to GFX_OFF all the time in normal 
operations? For example when the GPU is idle and a new KFD process is 
started, creating a new runlist. Are we just getting lucky because the 
process first has to allocate some memory, which maybe makes some HW 
access (flushing TLBs etc.) that wakes up the GPU?



Regards,
  Felix



On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:


[AMD Official Use Only]


Thanks Kevin and Felix!

In the gfxoff state, the dequeue request (made by writing CP registers) 
cannot trigger a gfxoff exit; the CP is actually powered off, so the CP 
register writes are invalid. Writing the doorbell registers (the regular 
way) or directly requesting the SMU to disable GFX powergating (by invoking 
amdgpu_gfx_off_ctrl) can trigger the gfxoff exit.


I have also tried 
amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false), 
but it has no effect.


[10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
[10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0x
[10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0x
[10386.162297] amdgpu: mmCP_STAT : 0x
[10386.162303] amdgpu: mmCP_BUSY_STAT : 0x
[10386.162308] amdgpu: mmRLC_STAT : 0x
[10386.162314] amdgpu: mmGRBM_STATUS : 0x
[10386.162320] amdgpu: mmGRBM_STATUS2: 0x

Thanks again!
Rico

*From:* Kuehling, Felix
*Sent:* Tuesday, January 25, 2022 23:31
*To:* Wang, Yang(Kevin); Yin, Tianci (Rico); amd-gfx@lists.freedesktop.org
*Cc:* Grodzovsky, Andrey; Chen, Guchun
*Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
I have no objection to the change. It restores the sequence that was
used before e9669fb78262. But I don't understand why GFX_OFF is causing
a preemption error during module unload, but not when KFD is in normal
use. Maybe it's because of the compute power profile that's normally set
by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.


Either way, the patch is

Acked-by: Felix Kuehling 



On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:
>
> [AMD Official Use Only]
>
>
> [AMD Official Use Only]
>
>
> the issue is introduced in following patch, so add following
> information is better.
> /fixes: (e9669fb78262) drm/amdgpu: Add early fini callback/
> /
> /
> Reviewed-by: Yang Wang 
> /
> /
> Best Regards,
> Kevin
>
> 
> *From:* amd-gfx  on behalf of
> Tianci Yin 
> *Sent:* Tuesday, January 25, 2022 6:03 PM
> *To:* amd-gfx@lists.freedesktop.org 
> *Cc:* Grodzovsky, Andrey ; Yin, Tianci
> (Rico) ; Chen, Guchun 
> *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> From: "Tianci.Yin" 
>
> [why]
> In rmmod procedure, kfd sends cp a dequeue request, but the
> request does not get response, then an error message "cp
> queue pipe 4 queue 0 preemption failed" printed.
>
> [how]
> Performing kfd suspending after disabling gfxoff can fix it.
>
> Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> Signed-off-by: Tianci.Yin 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b75d67f644e5..77e9837ba342 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct
> amdgpu_device *adev)
>  }
>  }
>
> -   amdgpu_amdkfd_suspend(adev, false);
> -
>  amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>  amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>
> +   amdgpu_amdkfd_suspend(adev, false);
> +
>  /* Workaroud for ASICs need to disable SMC first */
>  amdgpu_device_smu_fini_early(adev);
>
> --
> 2.25.1
>


Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-25 Thread Yin, Tianci (Rico)
[AMD Official Use Only]

Thanks Kevin and Felix!

In the gfxoff state, the dequeue request (made by writing CP registers) cannot 
trigger a gfxoff exit; the CP is actually powered off, so the CP register 
writes are invalid. Writing the doorbell registers (the regular way) or directly 
requesting the SMU to disable GFX powergating (by invoking amdgpu_gfx_off_ctrl) 
can trigger the gfxoff exit.
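
As a rough illustration of the paths described above, here is a toy C model. 
It is not driver code: struct toy_gfx and the toy_* helpers are hypothetical 
names invented for this sketch, and it only encodes the behaviour stated in 
this mail (register writes are lost while in gfxoff, whereas a doorbell write 
or an explicit request such as amdgpu_gfx_off_ctrl leaves gfxoff).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy power state; hypothetical names, not the amdgpu implementation. */
struct toy_gfx {
	bool in_gfxoff;       /* true while the GFX block is powered down */
	unsigned int cp_reg;  /* stands in for a single CP register */
};

/* Direct register write: lost while in gfxoff, and it does not wake GFX. */
static bool toy_mmio_write(struct toy_gfx *g, unsigned int val)
{
	assert(g != NULL);
	if (g->in_gfxoff)
		return false;
	g->cp_reg = val;
	return true;
}

/* Doorbell write: modelled as the regular wake-up path. */
static void toy_ring_doorbell(struct toy_gfx *g)
{
	assert(g != NULL);
	g->in_gfxoff = false;
}

/* Explicit request to leave gfxoff, standing in for amdgpu_gfx_off_ctrl. */
static void toy_gfx_off_disable(struct toy_gfx *g)
{
	assert(g != NULL);
	g->in_gfxoff = false;
}
```

In this model a toy_mmio_write() issued while in_gfxoff is set returns false 
and leaves cp_reg untouched; the same write succeeds once either 
toy_ring_doorbell() or toy_gfx_off_disable() has been called.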

I have also tried 
amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false), but 
it has no effect.

[10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
[10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0x
[10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0x
[10386.162297] amdgpu: mmCP_STAT : 0x
[10386.162303] amdgpu: mmCP_BUSY_STAT : 0x
[10386.162308] amdgpu: mmRLC_STAT : 0x
[10386.162314] amdgpu: mmGRBM_STATUS : 0x
[10386.162320] amdgpu: mmGRBM_STATUS2: 0x

Thanks again!
Rico

From: Kuehling, Felix
Sent: Tuesday, January 25, 2022 23:31
To: Wang, Yang(Kevin); Yin, Tianci (Rico); amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey; Chen, Guchun
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

I have no objection to the change. It restores the sequence that was
used before e9669fb78262. But I don't understand why GFX_OFF is causing
a preemption error during module unload, but not when KFD is in normal
use. Maybe it's because of the compute power profile that's normally set
by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.


Either way, the patch is

Acked-by: Felix Kuehling 



On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:
>
> [AMD Official Use Only]
>
>
> [AMD Official Use Only]
>
>
> the issue is introduced in following patch, so add following
> information is better.
> /fixes: (e9669fb78262) drm/amdgpu: Add early fini callback/
> /
> /
> Reviewed-by: Yang Wang 
> /
> /
> Best Regards,
> Kevin
>
> 
> *From:* amd-gfx  on behalf of
> Tianci Yin 
> *Sent:* Tuesday, January 25, 2022 6:03 PM
> *To:* amd-gfx@lists.freedesktop.org 
> *Cc:* Grodzovsky, Andrey ; Yin, Tianci
> (Rico) ; Chen, Guchun 
> *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> From: "Tianci.Yin" 
>
> [why]
> In rmmod procedure, kfd sends cp a dequeue request, but the
> request does not get response, then an error message "cp
> queue pipe 4 queue 0 preemption failed" printed.
>
> [how]
> Performing kfd suspending after disabling gfxoff can fix it.
>
> Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> Signed-off-by: Tianci.Yin 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b75d67f644e5..77e9837ba342 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct
> amdgpu_device *adev)
>  }
>  }
>
> -   amdgpu_amdkfd_suspend(adev, false);
> -
>  amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>  amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>
> +   amdgpu_amdkfd_suspend(adev, false);
> +
>  /* Workaroud for ASICs need to disable SMC first */
>  amdgpu_device_smu_fini_early(adev);
>
> --
> 2.25.1
>


Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-25 Thread Felix Kuehling
I have no objection to the change. It restores the sequence that was 
used before e9669fb78262. But I don't understand why GFX_OFF is causing 
a preemption error during module unload, but not when KFD is in normal 
use. Maybe it's because of the compute power profile that's normally set 
by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.



Either way, the patch is

Acked-by: Felix Kuehling 



On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:


[AMD Official Use Only]


The issue was introduced by the following patch, so it would be better to 
add the following information.

fixes: (e9669fb78262) drm/amdgpu: Add early fini callback

Reviewed-by: Yang Wang
Best Regards,
Kevin


*From:* amd-gfx on behalf of Tianci Yin
*Sent:* Tuesday, January 25, 2022 6:03 PM
*To:* amd-gfx@lists.freedesktop.org
*Cc:* Grodzovsky, Andrey; Yin, Tianci (Rico); Chen, Guchun
*Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
From: "Tianci.Yin" 

[why]
In rmmod procedure, kfd sends cp a dequeue request, but the
request does not get response, then an error message "cp
queue pipe 4 queue 0 preemption failed" printed.

[how]
Performing kfd suspending after disabling gfxoff can fix it.

Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
Signed-off-by: Tianci.Yin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b75d67f644e5..77e9837ba342 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 		}
 	}
 
-	amdgpu_amdkfd_suspend(adev, false);
-
 	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
 	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
 
+	amdgpu_amdkfd_suspend(adev, false);
+
 	/* Workaroud for ASICs need to disable SMC first */
 	amdgpu_device_smu_fini_early(adev);
 
--
2.25.1



Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-25 Thread Wang, Yang(Kevin)
[AMD Official Use Only]

The issue was introduced by the following patch, so it would be better to add 
the following information.
fixes: (e9669fb78262) drm/amdgpu: Add early fini callback

Reviewed-by: Yang Wang 

Best Regards,
Kevin


From: amd-gfx on behalf of Tianci Yin
Sent: Tuesday, January 25, 2022 6:03 PM
To: amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey; Yin, Tianci (Rico); Chen, Guchun
Subject: [PATCH] drm/amdgpu: Fix an error message in rmmod

From: "Tianci.Yin" 

[why]
In rmmod procedure, kfd sends cp a dequeue request, but the
request does not get response, then an error message "cp
queue pipe 4 queue 0 preemption failed" printed.

[how]
Performing kfd suspending after disabling gfxoff can fix it.

Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
Signed-off-by: Tianci.Yin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b75d67f644e5..77e9837ba342 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 		}
 	}
 
-	amdgpu_amdkfd_suspend(adev, false);
-
 	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
 	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
 
+	amdgpu_amdkfd_suspend(adev, false);
+
 	/* Workaroud for ASICs need to disable SMC first */
 	amdgpu_device_smu_fini_early(adev);
 
--
2.25.1



[PATCH] drm/amdgpu: Fix an error message in rmmod

2022-01-25 Thread Tianci Yin
From: "Tianci.Yin" 

[why]
In the rmmod procedure, kfd sends the cp a dequeue request, but the
request gets no response, so the error message "cp
queue pipe 4 queue 0 preemption failed" is printed.

[how]
Suspending kfd after disabling gfxoff fixes it.

Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
Signed-off-by: Tianci.Yin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b75d67f644e5..77e9837ba342 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 		}
 	}
 
-	amdgpu_amdkfd_suspend(adev, false);
-
 	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
 	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
 
+	amdgpu_amdkfd_suspend(adev, false);
+
 	/* Workaroud for ASICs need to disable SMC first */
 	amdgpu_device_smu_fini_early(adev);
 
-- 
2.25.1
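
The effect of the reordering can be illustrated with a small stand-alone C 
model. This is only a sketch under the assumption discussed in this thread 
(the HIQ dequeue is done by direct CP register writes, which have no effect 
while the GFX block is powered down). struct toy_gpu and all toy_* functions 
are hypothetical names made up for the example; none of this is amdgpu code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device state; every name here is hypothetical, not amdgpu code. */
struct toy_gpu {
	bool gfx_powered;  /* false while the GFX block sits in gfxoff */
	bool hiq_active;   /* true while the HIQ is still mapped */
};

/* Models ungating power: gfxoff is left, so CP registers respond again. */
static void toy_ungate_power(struct toy_gpu *gpu)
{
	assert(gpu != NULL);
	gpu->gfx_powered = true;
}

/*
 * Models the kfd suspend step: the HIQ is dequeued by writing CP
 * registers directly. With GFX powered down the write has no effect,
 * so the queue stays active (the "preemption failed" case).
 * Returns true when the dequeue took effect.
 */
static bool toy_kfd_suspend(struct toy_gpu *gpu)
{
	assert(gpu != NULL);
	if (!gpu->gfx_powered)
		return false;
	gpu->hiq_active = false;
	return true;
}

/* Order before the patch: suspend first, ungate afterwards. */
static bool toy_fini_early_old(struct toy_gpu *gpu)
{
	bool ok = toy_kfd_suspend(gpu);

	toy_ungate_power(gpu);
	return ok;
}

/* Order after the patch: ungate first, then suspend. */
static bool toy_fini_early_new(struct toy_gpu *gpu)
{
	toy_ungate_power(gpu);
	return toy_kfd_suspend(gpu);
}
```

Starting from a state with gfx_powered == false and hiq_active == true, 
toy_fini_early_old() returns false and leaves the queue active, while 
toy_fini_early_new() returns true and clears hiq_active.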