[AMD Official Use Only - AMD Internal Distribution Only] Thanks for the clarification. I have seen both terms used differently by different people, so I changed my original patch to reflect that. I'll fix those 3 mistakes and push it out.
Kent > -----Original Message----- > From: Zhou1, Tao <[email protected]> > Sent: Wednesday, February 4, 2026 10:34 PM > To: Russell, Kent <[email protected]>; [email protected] > Cc: Liu, Xiang(Dean) <[email protected]> > Subject: RE: [PATCH] drm/amdgpu: Send applicable RMA CPERs at end of RAS init > > [AMD Official Use Only - AMD Internal Distribution Only] > > > -----Original Message----- > > From: Russell, Kent <[email protected]> > > Sent: Thursday, February 5, 2026 1:52 AM > > To: [email protected] > > Cc: Liu, Xiang(Dean) <[email protected]>; Zhou1, Tao > <[email protected]>; > > Russell, Kent <[email protected]> > > Subject: [PATCH] drm/amdgpu: Send applicable RMA CPERs at end of RAS init > > > > Firmware and monitoring tools may not be ready to receive a CPER when we > > read > > the bad pages, so send the CPERs at the end of RAs initialization to ensure > > that > the > > [Tao] RAs -> RAS > > > FW is ready to receive and process the CPER. This removes the previous CPER > > submission that was added during bad page load, and sends both in-band and > > out- > > of-band at the same time. > > > > Signed-off-by: Kent Russell <[email protected]> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 ++ > > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 27 ++++++++++++++++--- > > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 2 ++ > > 3 files changed, 27 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > index b28fcf932f7e..856b1bf83533 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > @@ -4650,6 +4650,8 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev) > > amdgpu_ras_block_late_init_default(adev, > > &obj->ras_comm); > > } > > > > + amdgpu_ras_check_bad_page_status(adev); > > + > > return 0; > > } > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c > > index 469d04a39d7d..91de4085a66d 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c > > @@ -1712,10 +1712,6 @@ int amdgpu_ras_eeprom_check(struct > > amdgpu_ras_eeprom_control *control) > > dev_warn(adev->dev, "RAS records:%u exceeds 90%% of > > threshold:%d", > > control->ras_num_bad_pages, > > ras->bad_page_cnt_threshold); > > - if (amdgpu_bad_page_threshold != 0 && > > - control->ras_num_bad_pages >= ras- > > >bad_page_cnt_threshold) > > - amdgpu_dpm_send_rma_reason(adev); > > - > > } else if (hdr->header == RAS_TABLE_HDR_BAD && > > amdgpu_bad_page_threshold != 0) { > > if (hdr->version >= RAS_TABLE_VER_V2_1) { @@ -1932,3 > > +1928,26 @@ int amdgpu_ras_smu_erase_ras_table(struct amdgpu_device > *adev, > > > > result); > > return -EOPNOTSUPP; > > } > > + > > +void amdgpu_ras_check_bad_page_status(struct amdgpu_device *adev) { > > + struct amdgpu_ras *ras = amdgpu_ras_get_context(adev); > > + struct amdgpu_ras_eeprom_control *control = ras ? &ras->eeprom_control > > +: NULL; > > + > > + if (!control || amdgpu_bad_page_threshold == 0) > > + return; > > + > > + if (control->ras_num_bad_pages >= ras->bad_page_cnt_threshold) { > > + if (amdgpu_dpm_send_rma_reason(adev)) > > + dev_warn(adev->dev, "Unable to send in-band RMA > > CPER"); > > [Tao] this is oob cper and the following one is ib cper. > > With my concerns fixed, the patch is: Reviewed-by: Tao Zhou > <[email protected]> > > > + else > > + dev_dbg(adev->dev, "Sent in-band RMA CPER"); > > + > > + if (adev->cper.enabled && !amdgpu_uniras_enabled(adev)) { > > + if (amdgpu_cper_generate_bp_threshold_record(adev)) > > + dev_warn(adev->dev, "Unable to send > > out-of-band > > RMA CPER"); > > + else > > + dev_dbg(adev->dev, "Sent out-of-band RMA > > CPER"); > > + } > > + } > > +} > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h > > index 2e5d63957e71..a62114800a92 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h > > @@ -193,6 +193,8 @@ int amdgpu_ras_eeprom_read_idx(struct > > amdgpu_ras_eeprom_control *control, > > > > int amdgpu_ras_eeprom_update_record_num(struct amdgpu_ras_eeprom_control > > *control); > > > > +void amdgpu_ras_check_bad_page_status(struct amdgpu_device *adev); > > + > > extern const struct file_operations amdgpu_ras_debugfs_eeprom_size_ops; > > extern const struct file_operations amdgpu_ras_debugfs_eeprom_table_ops; > > > > -- > > 2.43.0 >
