[AMD Official Use Only - General] OK, I'll add a new function to do the check.
Tao > -----Original Message----- > From: Zhang, Hawking <hawking.zh...@amd.com> > Sent: Tuesday, February 14, 2023 6:03 PM > To: Yang, Stanley <stanley.y...@amd.com>; Zhou1, Tao > <tao.zh...@amd.com>; amd-gfx@lists.freedesktop.org; Chai, Thomas > <yipeng.c...@amd.com>; Li, Candice <candice...@amd.com> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new > bad page > > [AMD Official Use Only - General] > > -EINVAL looks better than EEXIST, but it still doesn't fit the case? > Alternatively, > Can we compare the count before and after amdgpu_ras_add_bad_pages to > decide whether reset ue_count to 0 or not. That could be straightforward than > struggling for returning which error code in this specific case. > > Regards, > Hawking > > -----Original Message----- > From: Yang, Stanley <stanley.y...@amd.com> > Sent: Tuesday, February 14, 2023 10:42 > To: Zhou1, Tao <tao.zh...@amd.com>; Zhang, Hawking > <hawking.zh...@amd.com>; amd-gfx@lists.freedesktop.org; Chai, Thomas > <yipeng.c...@amd.com>; Li, Candice <candice...@amd.com> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new > bad page > > [AMD Official Use Only - General] > > > > > -----Original Message----- > > From: Zhou1, Tao <tao.zh...@amd.com> > > Sent: Monday, February 13, 2023 4:25 PM > > To: Zhang, Hawking <hawking.zh...@amd.com>; amd- > > g...@lists.freedesktop.org; Yang, Stanley <stanley.y...@amd.com>; Chai, > > Thomas <yipeng.c...@amd.com>; Li, Candice <candice...@amd.com> > > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no > > new bad page > > > > [AMD Official Use Only - General] > > > > > -----Original Message----- > > > From: Zhang, Hawking <hawking.zh...@amd.com> > > > Sent: Friday, February 10, 2023 11:02 PM > > > To: Zhou1, Tao <tao.zh...@amd.com>; amd-gfx@lists.freedesktop.org; > > > Yang, Stanley <stanley.y...@amd.com>; Chai, Thomas > > > <yipeng.c...@amd.com>; Li, Candice <candice...@amd.com> > > > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if > > > no new bad page > > > > > > [AMD Official Use Only - General] > > > > > > + /* if no new bad page is found, no need to > > > + increase ue count > > */ > > > + if (ret == -EEXIST) > > > + err_data->ue_count = 0; > > > > > > Returning EEXIST in such case is not reasonable. Might consider > > > return a bool for > > > amdgpu_ras_add_bad_pages: true means it does add some new bad page; > > > false means it doesn't change anything. > > > > > > Regards, > > > Hawking > > > > [Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and > > amdgpu_ras_recovery_init also need to check the return value. I'd like > > to keep the type of return value unchanged. > > How about -EINVAL? > > Stanley: How about return -EALREADY? > > Regards, > Stanley > > > > > > > > -----Original Message----- > > > From: Zhou1, Tao <tao.zh...@amd.com> > > > Sent: Friday, February 10, 2023 16:45 > > > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking > > > <hawking.zh...@amd.com>; Yang, Stanley <stanley.y...@amd.com>; > > Chai, > > > Thomas <yipeng.c...@amd.com>; Li, Candice <candice...@amd.com> > > > Cc: Zhou1, Tao <tao.zh...@amd.com> > > > Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no > > new > > > bad page > > > > > > If a UMC bad page is reserved but not freed by an application, the > > > application may trigger uncorrectable error repeatly by accessing the > > > page. > > > > > > Signed-off-by: Tao Zhou <tao.zh...@amd.com> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++- > > > 2 files changed, 13 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > index e85c4689ce2c..eafe01a24349 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct > > > amdgpu_device *adev, { > > > struct amdgpu_ras *con = amdgpu_ras_get_context(adev); > > > struct ras_err_handler_data *data; > > > - int ret = 0; > > > + int ret = 0, old_cnt; > > > uint32_t i; > > > > > > if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 > > > +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device > > *adev, > > > if (!data) > > > goto out; > > > > > > + old_cnt = data->count; > > > + > > > for (i = 0; i < pages; i++) { > > > if (amdgpu_ras_check_bad_page_unlock(con, > > > bps[i].retired_page << > > > AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6 > > > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device > > *adev, > > > data->count++; > > > data->space_left--; > > > } > > > + > > > + /* all pages have been reserved before, no new bad page */ > > > + if (old_cnt == data->count) > > > + ret = -EEXIST; > > > + > > > out: > > > mutex_unlock(&con->recovery_lock); > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > > > index 1c7fcb4f2380..772c431e4065 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > > > @@ -145,8 +145,12 @@ static int > > amdgpu_umc_do_page_retirement(struct > > > amdgpu_device *adev, > > > > > > if ((amdgpu_bad_page_threshold != 0) && > > > err_data->err_addr_cnt) { > > > - amdgpu_ras_add_bad_pages(adev, err_data->err_addr, > > > + ret = amdgpu_ras_add_bad_pages(adev, > > > + err_data->err_addr, > > > > > > err_data->err_addr_cnt); > > > + /* if no new bad page is found, no need to > > > + increase ue count > > */ > > > + if (ret == -EEXIST) > > > + err_data->ue_count = 0; > > > + > > > amdgpu_ras_save_bad_pages(adev); > > > > > > amdgpu_dpm_send_hbm_bad_pages_num(adev, con- > > > >eeprom_control.ras_num_recs); > > > -- > > > 2.35.1 > > > > >