RE: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag
[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Yang, Stanley
> Sent: Thursday, May 23, 2024 9:57 PM
> To: Zhou1, Tao; amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao
> Subject: RE: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: amd-gfx On Behalf Of Tao Zhou
> > Sent: Thursday, May 23, 2024 6:02 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhou1, Tao
> > Subject: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag
> >
> > Set the flag to true if bad page number reaches threshold.
> >
> > Signed-off-by: Tao Zhou
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c        |  7 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h        |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |  3 +--
> >  4 files changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index ecce022c657b..934dfb2bf9e5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2940,7 +2940,6 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >  	struct ras_err_handler_data **data;
> >  	u32 max_eeprom_records_count = 0;
> > -	bool exc_err_limit = false;
> >  	int ret;
> >
> >  	if (!con || amdgpu_sriov_vf(adev))
> > @@ -2977,12 +2976,12 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >  	 */
> >  	if (adev->gmc.xgmi.pending_reset)
> >  		return 0;
> > -	ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &exc_err_limit);
> > +	ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
> >  	/*
> >  	 * This calling fails when exc_err_limit is true or
> >  	 * ret != 0.
> >  	 */
> > -	if (exc_err_limit || ret)
> > +	if (con->is_rma || ret)
> >  		goto free;
> >
> >  	if (con->eeprom_control.ras_num_recs) {
> > @@ -3033,7 +3032,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >  	 * Except error threshold exceeding case, other failure cases in this
> >  	 * function would not fail amdgpu driver init.
> >  	 */
> > -	if (!exc_err_limit)
> > +	if (!con->is_rma)
> >  		ret = 0;
> >  	else
> >  		ret = -EINVAL;
>
> [Stanley]: Should stop device service if device is under RMA during running? the
> amdgpu_ras_recovery_init function only be called during the process of loading
> driver.

[Tao] yes, I plan to stop service in resume stage after mode-1 if run-time RMA is reported. But I have no environment to verify the design right now, so this is TODO temporarily.

>
> Regards,
> Stanley
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index d06c01b978cd..437c58c85639 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -521,6 +521,7 @@ struct amdgpu_ras {
> >  	bool update_channel_flag;
> >  	/* Record status of smu mca debug mode */
> >  	bool is_aca_debug_mode;
> > +	bool is_rma;
> >
> >  	/* Record special requirements of gpu reset caller */
> >  	uint32_t gpu_reset_flags;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 9b789dcc2bd1..eae0a555df3c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -750,6 +750,9 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
> >  			control->tbl_rai.health_percent = 0;
> >  		}
> >
> > +		if (amdgpu_bad_page_threshold != -1)
> > +			ras->is_rma = true;
> > +
> >  		/* ignore the -ENOTSUPP return value */
> >  		amdgpu_dpm_send_rma_reason(adev);
> >  	}
> > @@ -1321,8 +1324,7 @@ static int __read_table_ras_info(struct amdgpu_ras_eeprom_control *control)
> >  	return res == RAS_TABLE_V2_1_INFO_SIZE ? 0 : res;
> >  }
> >
> > -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> > -			   bool *exceed_err_limit)
> > +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
> >  {
> >  	struct amdgpu_device *adev = to_amdgpu_device(control);
> >  	unsigned char buf[RAS_TABLE_HEADER_SIZE] = { 0 };
> > @@ -1330,7 +1332,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> >  	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> >  	int res;
> >
> > -	*exceed_err_limit = false;
> > +	ras->is_rma = false;
> >
> >  	if (!__is_ras_eeprom_supported(adev))
> >  		return 0;
>
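
For readers following the thread, the patch above moves the "threshold exceeded" result from an out-parameter into the RAS context. Below is a minimal standalone sketch of that pattern, not the driver code. The names `sketch_ras`, `sketch_eeprom_init` and `sketch_recovery_init` are hypothetical, and the threshold comparison is a simplification of whatever check the driver performs before it sets `is_rma`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct amdgpu_ras; only the new flag is modeled. */
struct sketch_ras {
	bool is_rma;            /* true once the bad page count reaches the threshold */
	int bad_page_count;     /* assumed input for this sketch */
	int bad_page_threshold; /* -1 disables the check, like amdgpu_bad_page_threshold */
};

/* Models amdgpu_ras_eeprom_init() after the patch: no bool out-parameter. */
static int sketch_eeprom_init(struct sketch_ras *ras)
{
	assert(ras != NULL);

	/* The flag is reset on entry, as "ras->is_rma = false" does in the diff. */
	ras->is_rma = false;

	/* Simplified threshold check; the flag is only set when the threshold is enabled. */
	if (ras->bad_page_threshold != -1 &&
	    ras->bad_page_count >= ras->bad_page_threshold)
		ras->is_rma = true;

	return 0;
}

/* Models the end of amdgpu_ras_recovery_init(): only the RMA case fails init. */
static int sketch_recovery_init(struct sketch_ras *ras)
{
	int ret = sketch_eeprom_init(ras);

	(void)ret; /* other failures are deliberately not propagated */
	return ras->is_rma ? -EINVAL : 0;
}
```

Callers read the state from the context afterwards (`ras->is_rma`) instead of passing a pointer in, which is what would let a later run-time path, such as the resume-stage handling Tao mentions as a TODO, consult the same flag.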
RE: [PATCH] Revert "drm/amdkfd: fix gfx_target_version for certain 11.0.3 devices"
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Feifei Xu

-----Original Message-----
From: Alex Deucher
Sent: Friday, May 24, 2024 2:44 AM
To: Deucher, Alexander
Cc: amd-gfx@lists.freedesktop.org; Xu, Feifei
Subject: Re: [PATCH] Revert "drm/amdkfd: fix gfx_target_version for certain 11.0.3 devices"

Ping?

On Mon, May 20, 2024 at 2:52 PM Alex Deucher wrote:
>
> This reverts commit 28ebbb4981cb1fad12e0b1227dbecc88810b1ee8.
>
> Revert this commit as apparently the LLVM code to take advantage of
> this never landed.
>
> Signed-off-by: Alex Deucher
> Cc: Feifei Xu
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 11 ++-
>  1 file changed, 2 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 6b15e55811b69..fba9b9a258a50 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -426,15 +426,8 @@ struct kfd_dev *kgd2kfd_probe(struct amdgpu_device *adev, bool vf)
>  			f2g = &gfx_v11_kfd2kgd;
>  			break;
>  		case IP_VERSION(11, 0, 3):
> -			if ((adev->pdev->device == 0x7460 &&
> -			     adev->pdev->revision == 0x00) ||
> -			    (adev->pdev->device == 0x7461 &&
> -			     adev->pdev->revision == 0x00))
> -				/* Note: Compiler version is 11.0.5 while HW version is 11.0.3 */
> -				gfx_target_version = 110005;
> -			else
> -				/* Note: Compiler version is 11.0.1 while HW version is 11.0.3 */
> -				gfx_target_version = 110001;
> +			/* Note: Compiler version is 11.0.1 while HW version is 11.0.3 */
> +			gfx_target_version = 110001;
>  			f2g = &gfx_v11_kfd2kgd;
>  			break;
>  		case IP_VERSION(11, 5, 0):
> --
> 2.45.1
>
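
To make the effect of the revert concrete, here is a small standalone sketch of the selection logic as it reads after the patch. It is an illustration only: `SKETCH_IP_VERSION` and `sketch_gfx_target_version` are hypothetical names, the bit packing is an assumption local to this sketch rather than the kernel's `IP_VERSION` macro, and only the 11.0.3 case from the diff is modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local encoding of an IP version, used only by this sketch. */
#define SKETCH_IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/*
 * Returns the compiler-facing target version for a GC IP version, or 0 for
 * versions this sketch does not model.
 */
static uint32_t sketch_gfx_target_version(uint32_t ip_version,
					  uint16_t pci_device,
					  uint8_t pci_revision)
{
	/* After the revert the PCI device ID and revision are no longer consulted. */
	(void)pci_device;
	(void)pci_revision;

	/* 0 is reserved as the "not modeled" result, so reject it as an input. */
	assert(ip_version != 0);

	switch (ip_version) {
	case SKETCH_IP_VERSION(11, 0, 3):
		/* Compiler version is 11.0.1 while HW version is 11.0.3 */
		return 110001;
	default:
		return 0;
	}
}
```

With the reverted code, devices 0x7460 and 0x7461 at revision 0x00 got 110005; in this sketch, as in the diff, every 11.0.3 device gets 110001.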
6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause a green flashing bar to appear at the top of the screen on Radeon 6900XT at 120Hz
Hi,

The day before yesterday I replaced my 7900XTX with a 6900XT to find out in which kernel the warning message "DMA-API: amdgpu :0f:00.0: cacheline tracking EEXIST, overlapping mappings aren't supported" first appeared. Kernel 6.3 and older won't boot on a computer with a Radeon 7900XTX.

When I booted the system with the 6900XT, I saw a green flashing bar at the top of the screen while typing commands in a maximized GNOME Terminal window.

Demonstration: https://youtu.be/tTvwQ_5pRkk

To reproduce it you need a Radeon 6900XT GPU connected to a 120Hz OLED TV over HDMI.

I bisected the issue, and the first bad commit I found was 6d4279cb99ac.

commit 6d4279cb99ac4f51d10409501d29969f687ac8dc (HEAD)
Author: Rodrigo Siqueira
Date: Tue Mar 26 10:42:05 2024 -0600

    drm/amd/display: Drop legacy code

    This commit removes code that are not used by display anymore.

    Acked-by: Hamza Mahfooz
    Signed-off-by: Rodrigo Siqueira
    Signed-off-by: Alex Deucher

 drivers/gpu/drm/amd/display/dc/inc/hw/stream_encoder.h         |  4
 drivers/gpu/drm/amd/display/dc/inc/resource.h                  |  7 ---
 drivers/gpu/drm/amd/display/dc/optc/dcn20/dcn20_optc.c         | 10 --
 drivers/gpu/drm/amd/display/dc/resource/dcn21/dcn21_resource.c | 33 +
 4 files changed, 1 insertion(+), 53 deletions(-)

After bisecting, I usually confirm that I found the right commit by building a kernel with the bad commit reverted. This time, however, I still observed the issue on a kernel built without commit 6d4279cb99ac, so I went looking for a second bad commit. It turned out to be bc87d666c05.

commit bc87d666c05a13e6d4ae1ddce41fc43d2567b9a2 (HEAD)
Author: Rodrigo Siqueira
Date: Tue Mar 26 11:55:19 2024 -0600

    drm/amd/display: Add fallback configuration for set DRR in DCN10

    Set OTG/OPTC parameters to 0 if something goes wrong on DCN10.

    Acked-by: Hamza Mahfooz
    Signed-off-by: Rodrigo Siqueira
    Signed-off-by: Alex Deucher

 drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

After reverting both of these commits on top of 54f71b0369c9, the issue is gone.

I have also attached the build config.
My hardware specs: https://linux-hardware.org/?probe=f25a873c5e

Rodrigo, or anyone else from the AMD team, could you please take a look?

--
Best Regards,
Mike Gavrilov.

Attachment: .config.zip (Zip archive)
RE: [PATCH] drm/amd: Fix shutdown (again) on some SMU v13.0.4/11 platforms
[Public]

This patch is,
Reviewed-by: Tim Huang

> -----Original Message-----
> From: amd-gfx On Behalf Of Mario Limonciello
> Sent: Sunday, May 26, 2024 8:59 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Limonciello, Mario; lectrode; sta...@vger.kernel.org; regressi...@lists.linux.dev
> Subject: [PATCH] drm/amd: Fix shutdown (again) on some SMU v13.0.4/11 platforms
>
> commit cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown
> for some SMU 13.0.4/13.0.11 users") attempted to fix shutdown issues that
> were reported since commit 31729e8c21ec ("drm/amd/pm: fixes a random
> hang in S4 for SMU v13.0.4/11") but caused issues for some people.
>
> Adjust the workaround flow to properly only apply in the S4 case:
> -> For shutdown go through SMU_MSG_PrepareMp1ForUnload
> -> For S4 go through SMU_MSG_GfxDeviceDriverReset and
>    SMU_MSG_PrepareMp1ForUnload
>
> Reported-and-tested-by: lectrode
> Closes: https://github.com/void-linux/void-packages/issues/50417
> Cc: sta...@vger.kernel.org
> Fixes: cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown for some SMU 13.0.4/13.0.11 users")
> Signed-off-by: Mario Limonciello
> ---
> Cc: regressi...@lists.linux.dev
> ---
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c | 20 ++-
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
> index 4abfcd32747d..c7ab0d7027d9 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
> @@ -226,15 +226,17 @@ static int smu_v13_0_4_system_features_control(struct smu_context *smu, bool en)
>  	struct amdgpu_device *adev = smu->adev;
>  	int ret = 0;
>
> -	if (!en && adev->in_s4) {
> -		/* Adds a GFX reset as workaround just before sending the
> -		 * MP1_UNLOAD message to prevent GC/RLC/PMFW from entering
> -		 * an invalid state.
> -		 */
> -		ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_GfxDeviceDriverReset,
> -						      SMU_RESET_MODE_2, NULL);
> -		if (ret)
> -			return ret;
> +	if (!en && !adev->in_s0ix) {
> +		if (adev->in_s4) {
> +			/* Adds a GFX reset as workaround just before sending the
> +			 * MP1_UNLOAD message to prevent GC/RLC/PMFW from entering
> +			 * an invalid state.
> +			 */
> +			ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_GfxDeviceDriverReset,
> +							      SMU_RESET_MODE_2, NULL);
> +			if (ret)
> +				return ret;
> +		}
>
>  		ret = smu_cmn_send_smc_msg(smu, SMU_MSG_PrepareMp1ForUnload, NULL);
>  	}
> --
> 2.43.0
[PATCH] drm/amd: Fix shutdown (again) on some SMU v13.0.4/11 platforms
commit cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown for
some SMU 13.0.4/13.0.11 users") attempted to fix shutdown issues
that were reported since commit 31729e8c21ec ("drm/amd/pm: fixes a
random hang in S4 for SMU v13.0.4/11") but caused issues for some
people.

Adjust the workaround flow to properly only apply in the S4 case:
-> For shutdown go through SMU_MSG_PrepareMp1ForUnload
-> For S4 go through SMU_MSG_GfxDeviceDriverReset and
   SMU_MSG_PrepareMp1ForUnload

Reported-and-tested-by: lectrode
Closes: https://github.com/void-linux/void-packages/issues/50417
Cc: sta...@vger.kernel.org
Fixes: cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown for some SMU 13.0.4/13.0.11 users")
Signed-off-by: Mario Limonciello
---
Cc: regressi...@lists.linux.dev
---
 .../drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c | 20 ++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
index 4abfcd32747d..c7ab0d7027d9 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c
@@ -226,15 +226,17 @@ static int smu_v13_0_4_system_features_control(struct smu_context *smu, bool en)
 	struct amdgpu_device *adev = smu->adev;
 	int ret = 0;

-	if (!en && adev->in_s4) {
-		/* Adds a GFX reset as workaround just before sending the
-		 * MP1_UNLOAD message to prevent GC/RLC/PMFW from entering
-		 * an invalid state.
-		 */
-		ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_GfxDeviceDriverReset,
-						      SMU_RESET_MODE_2, NULL);
-		if (ret)
-			return ret;
+	if (!en && !adev->in_s0ix) {
+		if (adev->in_s4) {
+			/* Adds a GFX reset as workaround just before sending the
+			 * MP1_UNLOAD message to prevent GC/RLC/PMFW from entering
+			 * an invalid state.
+			 */
+			ret = smu_cmn_send_smc_msg_with_param(smu, SMU_MSG_GfxDeviceDriverReset,
+							      SMU_RESET_MODE_2, NULL);
+			if (ret)
+				return ret;
+		}

 		ret = smu_cmn_send_smc_msg(smu, SMU_MSG_PrepareMp1ForUnload, NULL);
 	}
--
2.43.0
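
The control flow in the hunk above can be checked in isolation. The sketch below is a hypothetical model, not driver code: it records which SMU messages would be sent, in order, for a given combination of `en`, `in_s0ix` and `in_s4`. The enum, struct and function names are invented for illustration, and it covers only the branch shown in the diff, assuming every message send succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the two SMU messages named in the patch. */
enum sketch_msg {
	SKETCH_MSG_GFX_DEVICE_DRIVER_RESET,
	SKETCH_MSG_PREPARE_MP1_FOR_UNLOAD,
};

/* Hypothetical subset of the device state the branch looks at. */
struct sketch_state {
	bool in_s0ix;
	bool in_s4;
};

/*
 * Writes the messages the patched branch would send into out[] (at most two)
 * and returns how many were written.
 */
static size_t sketch_unload_msgs(const struct sketch_state *st, bool en,
				 enum sketch_msg out[2])
{
	size_t n = 0;

	assert(st != NULL && out != NULL);

	if (!en && !st->in_s0ix) {
		/* The GFX reset workaround is applied only in the S4 case. */
		if (st->in_s4)
			out[n++] = SKETCH_MSG_GFX_DEVICE_DRIVER_RESET;

		/* Shutdown and S4 both prepare MP1 for unload. */
		out[n++] = SKETCH_MSG_PREPARE_MP1_FOR_UNLOAD;
	}

	return n;
}
```

Under this model a plain shutdown (`en` false, neither flag set) yields only the unload message, S4 yields the reset followed by the unload message, and s0ix or `en` true yields nothing from this branch, which matches the two "->" cases in the commit message.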