RE: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag

2024-05-26 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -Original Message-
> From: Yang, Stanley 
> Sent: Thursday, May 23, 2024 9:57 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: RE: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag
>
> > -Original Message-
> > From: amd-gfx  On Behalf Of Tao
> > Zhou
> > Sent: Thursday, May 23, 2024 6:02 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhou1, Tao 
> > Subject: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag
> >
> > Set the flag to true if bad page number reaches threshold.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c|  7 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h|  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |  3 +--
> >  4 files changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index ecce022c657b..934dfb2bf9e5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2940,7 +2940,6 @@ int amdgpu_ras_recovery_init(struct
> > amdgpu_device
> > *adev)
> >   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >   struct ras_err_handler_data **data;
> >   u32  max_eeprom_records_count = 0;
> > - bool exc_err_limit = false;
> >   int ret;
> >
> >   if (!con || amdgpu_sriov_vf(adev))
> > @@ -2977,12 +2976,12 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >*/
> >   if (adev->gmc.xgmi.pending_reset)
> >   return 0;
> > - ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &exc_err_limit);
> > + ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
> >   /*
> >* This calling fails when exc_err_limit is true or
> >* ret != 0.
> >*/
> > - if (exc_err_limit || ret)
> > + if (con->is_rma || ret)
> >   goto free;
> >
> >   if (con->eeprom_control.ras_num_recs) {
> > @@ -3033,7 +3032,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >* Except error threshold exceeding case, other failure cases in this
> >* function would not fail amdgpu driver init.
> >*/
> > - if (!exc_err_limit)
> > + if (!con->is_rma)
> >   ret = 0;
> >   else
> >   ret = -EINVAL;
>
> [Stanley]: Should we stop device service if the device enters RMA at run time?
> The amdgpu_ras_recovery_init function is only called while the driver is
> loading.

[Tao] Yes, I plan to stop service in the resume stage after a mode-1 reset if
run-time RMA is reported. I have no environment to verify the design right now,
so this is a TODO for the time being.

>
> Regards,
> Stanley
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index d06c01b978cd..437c58c85639 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -521,6 +521,7 @@ struct amdgpu_ras {
> >   bool update_channel_flag;
> >   /* Record status of smu mca debug mode */
> >   bool is_aca_debug_mode;
> > + bool is_rma;
> >
> >   /* Record special requirements of gpu reset caller */
> >   uint32_t  gpu_reset_flags;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 9b789dcc2bd1..eae0a555df3c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -750,6 +750,9 @@ amdgpu_ras_eeprom_update_header(struct
> > amdgpu_ras_eeprom_control *control)
> >   control->tbl_rai.health_percent = 0;
> >   }
> >
> > + if (amdgpu_bad_page_threshold != -1)
> > + ras->is_rma = true;
> > +
> >   /* ignore the -ENOTSUPP return value */
> >   amdgpu_dpm_send_rma_reason(adev);
> >   }
> > @@ -1321,8 +1324,7 @@ static int __read_table_ras_info(struct
> > amdgpu_ras_eeprom_control *control)
> >   return res == RAS_TABLE_V2_1_INFO_SIZE ? 0 : res;
> >  }
> >
> > -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> > -bool *exceed_err_limit)
> > +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
> >  {
> >   struct amdgpu_device *adev = to_amdgpu_device(control);
> >   unsigned char buf[RAS_TABLE_HEADER_SIZE] = { 0 };
> > @@ -1330,7 +1332,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> >   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> >   int res;
> >
> > - *exceed_err_limit = false;
> > + ras->is_rma = false;
> >
> >   if (!__is_ras_eeprom_supported(adev))
> >   return 0;
>
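
For readers following the thread, here is a minimal standalone C sketch of the pattern the patch moves to: the EEPROM init routine records the RMA state in a context flag instead of returning it through an out-parameter, and the caller reads that flag. The names ras_ctx, eeprom_init, recovery_init and the bad_pages field are hypothetical stand-ins, not the driver's API, and the threshold comparison is an assumption based on the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct amdgpu_ras; only the new flag is modelled. */
struct ras_ctx {
	bool is_rma;     /* set when the bad page count reaches the threshold */
	int bad_pages;   /* assumed counter, for illustration only */
	int threshold;   /* -1 disables the check, mirroring amdgpu_bad_page_threshold */
};

/* Stand-in for amdgpu_ras_eeprom_init(): no out-parameter any more. */
static int eeprom_init(struct ras_ctx *ctx)
{
	ctx->is_rma = false;
	if (ctx->threshold != -1 && ctx->bad_pages >= ctx->threshold)
		ctx->is_rma = true;
	return 0;
}

/* Stand-in for amdgpu_ras_recovery_init(): reads the flag from the context. */
static int recovery_init(struct ras_ctx *ctx)
{
	int ret = eeprom_init(ctx);

	/* As in the patch, only the RMA case fails init; other errors are dropped. */
	ret = ctx->is_rma ? -EINVAL : 0;
	return ret;
}
```

Because the flag lives in the context, a later run-time path (such as the resume-stage handling mentioned above as a TODO) could read the same field without a new parameter being threaded through.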

RE: [PATCH] Revert "drm/amdkfd: fix gfx_target_version for certain 11.0.3 devices"

2024-05-26 Thread Xu, Feifei
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Feifei Xu 

-Original Message-
From: Alex Deucher 
Sent: Friday, May 24, 2024 2:44 AM
To: Deucher, Alexander 
Cc: amd-gfx@lists.freedesktop.org; Xu, Feifei 
Subject: Re: [PATCH] Revert "drm/amdkfd: fix gfx_target_version for certain 
11.0.3 devices"

Ping?

On Mon, May 20, 2024 at 2:52 PM Alex Deucher  wrote:
>
> This reverts commit 28ebbb4981cb1fad12e0b1227dbecc88810b1ee8.
>
> Revert this commit as apparently the LLVM code to take advantage of
> this never landed.
>
> Signed-off-by: Alex Deucher 
> Cc: Feifei Xu 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 11 ++-
>  1 file changed, 2 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 6b15e55811b69..fba9b9a258a50 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -426,15 +426,8 @@ struct kfd_dev *kgd2kfd_probe(struct amdgpu_device 
> *adev, bool vf)
> f2g = &gfx_v11_kfd2kgd;
> break;
> case IP_VERSION(11, 0, 3):
> -   if ((adev->pdev->device == 0x7460 &&
> -adev->pdev->revision == 0x00) ||
> -   (adev->pdev->device == 0x7461 &&
> -adev->pdev->revision == 0x00))
> -   /* Note: Compiler version is 11.0.5 while HW 
> version is 11.0.3 */
> -   gfx_target_version = 110005;
> -   else
> -   /* Note: Compiler version is 11.0.1 while HW 
> version is 11.0.3 */
> -   gfx_target_version = 110001;
> +   /* Note: Compiler version is 11.0.1 while HW version 
> is 11.0.3 */
> +   gfx_target_version = 110001;
> f2g = &gfx_v11_kfd2kgd;
> break;
> case IP_VERSION(11, 5, 0):
> --
> 2.45.1
>
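
The effect of the revert on the IP_VERSION(11, 0, 3) case can be sketched in standalone C as follows. The function names are hypothetical and only this one case is modelled; the device IDs and version numbers are taken from the diff above.

```c
#include <stdint.h>

/* Before the revert: two device IDs at revision 0 reported 11.0.5. */
static uint32_t target_before_revert(uint16_t device, uint8_t revision)
{
	if ((device == 0x7460 && revision == 0x00) ||
	    (device == 0x7461 && revision == 0x00))
		return 110005;
	return 110001;
}

/* After the revert: every 11.0.3 device reports 11.0.1 to the compiler. */
static uint32_t target_after_revert(uint16_t device, uint8_t revision)
{
	(void)device;
	(void)revision;
	return 110001;
}
```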


6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-05-26 Thread Mikhail Gavrilov
Hi,
The day before yesterday I replaced my 7900XTX with a 6900XT to find out in
which kernel the warning message "DMA-API: amdgpu
:0f:00.0: cacheline tracking EEXIST, overlapping mappings aren't
supported" first appeared.
Kernel 6.3 and older won't boot on a computer with a Radeon 7900XTX.
When I booted the system with the 6900XT I saw a green flashing bar at the top
of the screen whenever I typed commands in a GNOME Terminal window that was
maximized to full screen.
Demonstration: https://youtu.be/tTvwQ_5pRkk
To reproduce, you need a Radeon 6900XT GPU connected to a 120Hz OLED TV over HDMI.

I bisected the issue, and the first bad commit I found was 6d4279cb99ac.
commit 6d4279cb99ac4f51d10409501d29969f687ac8dc (HEAD)
Author: Rodrigo Siqueira 
Date:   Tue Mar 26 10:42:05 2024 -0600

drm/amd/display: Drop legacy code

This commit removes code that are not used by display anymore.

Acked-by: Hamza Mahfooz 
Signed-off-by: Rodrigo Siqueira 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/display/dc/inc/hw/stream_encoder.h |  4 
 drivers/gpu/drm/amd/display/dc/inc/resource.h  |  7 ---
 drivers/gpu/drm/amd/display/dc/optc/dcn20/dcn20_optc.c | 10 --
 drivers/gpu/drm/amd/display/dc/resource/dcn21/dcn21_resource.c | 33 +
 4 files changed, 1 insertion(+), 53 deletions(-)

After bisecting, I usually confirm that I found the right commit by building
the kernel with the bad commit reverted.
This time, however, I still observed the issue on a kernel built
without commit 6d4279cb99ac,
so I decided to look for a second bad commit.
The second bad commit turned out to be bc87d666c05.
commit bc87d666c05a13e6d4ae1ddce41fc43d2567b9a2 (HEAD)
Author: Rodrigo Siqueira 
Date:   Tue Mar 26 11:55:19 2024 -0600

drm/amd/display: Add fallback configuration for set DRR in DCN10

Set OTG/OPTC parameters to 0 if something goes wrong on DCN10.

Acked-by: Hamza Mahfooz 
Signed-off-by: Rodrigo Siqueira 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

After reverting both of these commits on top of 54f71b0369c9, the issue is gone.

I have also attached the build config.

My hardware specs: https://linux-hardware.org/?probe=f25a873c5e

Rodrigo, or anyone else from the AMD team, could you please take a look?

-- 
Best Regards,
Mike Gavrilov.


.config.zip
Description: Zip archive


RE: [PATCH] drm/amd: Fix shutdown (again) on some SMU v13.0.4/11 platforms

2024-05-26 Thread Huang, Tim
[Public]

This patch is,

Reviewed-by: Tim Huang 



> -Original Message-
> From: amd-gfx  On Behalf Of Mario
> Limonciello
> Sent: Sunday, May 26, 2024 8:59 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Limonciello, Mario ; lectrode
> ; sta...@vger.kernel.org;
> regressi...@lists.linux.dev
> Subject: [PATCH] drm/amd: Fix shutdown (again) on some SMU
> v13.0.4/11 platforms
>
> commit cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown
> for some SMU 13.0.4/13.0.11 users") attempted to fix shutdown issues that
> were reported since commit 31729e8c21ec ("drm/amd/pm: fixes a random
> hang in S4 for SMU v13.0.4/11") but caused issues for some people.
>
> Adjust the workaround flow to properly only apply in the S4 case:
> -> For shutdown go through SMU_MSG_PrepareMp1ForUnload
> -> For S4 go through SMU_MSG_GfxDeviceDriverReset and
>    SMU_MSG_PrepareMp1ForUnload
>
> Reported-and-tested-by: lectrode 
> Closes: https://github.com/void-linux/void-packages/issues/50417
> Cc: sta...@vger.kernel.org
> Fixes: cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown for
> some SMU 13.0.4/13.0.11 users")
> Signed-off-by: Mario Limonciello 
> ---
> Cc: regressi...@lists.linux.dev
> ---
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c  | 20 ++
> -
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
> index 4abfcd32747d..c7ab0d7027d9 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
> @@ -226,15 +226,17 @@ static int
> smu_v13_0_4_system_features_control(struct smu_context *smu, bool en)
>   struct amdgpu_device *adev = smu->adev;
>   int ret = 0;
>
> - if (!en && adev->in_s4) {
> - /* Adds a GFX reset as workaround just before sending the
> -  * MP1_UNLOAD message to prevent GC/RLC/PMFW from
> entering
> -  * an invalid state.
> -  */
> - ret = smu_cmn_send_smc_msg_with_param(smu,
> SMU_MSG_GfxDeviceDriverReset,
> -   SMU_RESET_MODE_2,
> NULL);
> - if (ret)
> - return ret;
> + if (!en && !adev->in_s0ix) {
> + if (adev->in_s4) {
> + /* Adds a GFX reset as workaround just before
> sending the
> +  * MP1_UNLOAD message to prevent GC/RLC/PMFW
> from entering
> +  * an invalid state.
> +  */
> + ret = smu_cmn_send_smc_msg_with_param(smu,
> SMU_MSG_GfxDeviceDriverReset,
> +
>   SMU_RESET_MODE_2, NULL);
> + if (ret)
> + return ret;
> + }
>
>   ret = smu_cmn_send_smc_msg(smu,
> SMU_MSG_PrepareMp1ForUnload, NULL);
>   }
> --
> 2.43.0



[PATCH] drm/amd: Fix shutdown (again) on some SMU v13.0.4/11 platforms

2024-05-26 Thread Mario Limonciello
commit cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown for
some SMU 13.0.4/13.0.11 users") attempted to fix shutdown issues
that were reported since commit 31729e8c21ec ("drm/amd/pm: fixes a
random hang in S4 for SMU v13.0.4/11") but caused issues for some
people.

Adjust the workaround flow to properly only apply in the S4 case:
-> For shutdown go through SMU_MSG_PrepareMp1ForUnload
-> For S4 go through SMU_MSG_GfxDeviceDriverReset and
   SMU_MSG_PrepareMp1ForUnload

Reported-and-tested-by: lectrode 
Closes: https://github.com/void-linux/void-packages/issues/50417
Cc: sta...@vger.kernel.org
Fixes: cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown for some SMU 
13.0.4/13.0.11 users")
Signed-off-by: Mario Limonciello 
---
Cc: regressi...@lists.linux.dev
---
 .../drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c  | 20 ++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
index 4abfcd32747d..c7ab0d7027d9 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
@@ -226,15 +226,17 @@ static int smu_v13_0_4_system_features_control(struct 
smu_context *smu, bool en)
struct amdgpu_device *adev = smu->adev;
int ret = 0;
 
-   if (!en && adev->in_s4) {
-   /* Adds a GFX reset as workaround just before sending the
-* MP1_UNLOAD message to prevent GC/RLC/PMFW from entering
-* an invalid state.
-*/
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_GfxDeviceDriverReset,
- SMU_RESET_MODE_2, NULL);
-   if (ret)
-   return ret;
+   if (!en && !adev->in_s0ix) {
+   if (adev->in_s4) {
+   /* Adds a GFX reset as workaround just before sending 
the
+* MP1_UNLOAD message to prevent GC/RLC/PMFW from 
entering
+* an invalid state.
+*/
+   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_GfxDeviceDriverReset,
+   SMU_RESET_MODE_2, NULL);
+   if (ret)
+   return ret;
+   }
 
ret = smu_cmn_send_smc_msg(smu, SMU_MSG_PrepareMp1ForUnload, 
NULL);
}
-- 
2.43.0
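
The message flow described in the commit message can be sketched as a standalone C function that lists which messages would be sent and in what order. This is only an illustration under stated assumptions: pm_state, the shortened enum names and messages_on_disable are hypothetical, and the early return on a failed reset in the real function is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Shortened stand-ins for the two SMU messages named in the patch. */
enum smu_msg {
	MSG_GFX_DEVICE_DRIVER_RESET,
	MSG_PREPARE_MP1_FOR_UNLOAD,
};

/* Hypothetical subset of the device state that the patched condition reads. */
struct pm_state {
	bool in_s0ix;
	bool in_s4;
};

/* Fills out[] with the messages in send order and returns how many there are. */
static size_t messages_on_disable(const struct pm_state *st, bool en,
				  enum smu_msg out[2])
{
	size_t n = 0;

	if (!en && !st->in_s0ix) {
		/* The GFX reset workaround applies to S4 only. */
		if (st->in_s4)
			out[n++] = MSG_GFX_DEVICE_DRIVER_RESET;
		/* Shutdown and S4 both prepare MP1 for unload. */
		out[n++] = MSG_PREPARE_MP1_FOR_UNLOAD;
	}
	return n;
}
```

With this shape, a plain shutdown yields one message, S4 yields two with the reset first, and s0ix or an enable request yields none.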