Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
Pre-existing issues: - [High] `amdgpu_umc_lookup_bad_pages_in_a_row()` incorrectly returns 0 and drops error pages if `retire_unit` exceeds the array length. - [High] Infinite overcounting of `de_count` and duplicate bad pages if `amdgpu_umc_fill_error_record()` fails. -- commit dbc37c594d1c3e76008f0f9e176701a44797cd01 Author: Ruoyu Wang <[email protected]> drm/amdgpu: initialize ret before UMC error record loop The commit initializes ret to 0 before iterating over the bad pages returned by amdgpu_umc_lookup_bad_pages_in_a_row() to prevent returning uninitialized data when no pages are found. > diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c > b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c > index a6df6a778f50b..6d3c0f495e0f2 100644 > --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c > @@ -690,6 +690,7 @@ static int umc_v12_0_fill_error_record(struct > amdgpu_device *adev, > count = amdgpu_umc_lookup_bad_pages_in_a_row(adev, > ecc_err->pa_pfn << AMDGPU_GPU_PAGE_SHIFT, > page_pfn, ARRAY_SIZE(page_pfn)); [Severity: High] This is a pre-existing issue, but does amdgpu_umc_lookup_bad_pages_in_a_row() incorrectly return 0 and drop error pages if the retire_unit exceeds the array length? Looking at amdgpu_umc_lookup_bad_pages_in_a_row() in drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c, if adev->umc.retire_unit is greater than len, the loop terminates early via goto out: for (i = 0; i < adev->umc.retire_unit; i++) { if (i >= len) goto out; pfns[i] = err_data.err_addr[i].retired_page; } ret = i; adev->umc.err_addr_cnt = err_data.err_addr_cnt; out: kfree(err_data.err_addr); return ret; Since ret is 0 from the previously successful call to amdgpu_umc_pages_in_a_row(), this bypasses the updates to ret and adev->umc.err_addr_cnt. Does this cause the function to incorrectly return 0 instead of len, dropping the successfully retrieved bad pages? > + ret = 0; > > for (i = 0; i < count; i++) { > ret = amdgpu_umc_fill_error_record(err_data, [Severity: High] This is a pre-existing issue, but can this create an infinite loop of overcounting if amdgpu_umc_fill_error_record() fails? If amdgpu_umc_fill_error_record() returns an error, the loop in umc_v12_0_fill_error_record() breaks, err_data->de_count is unconditionally incremented, and the error code is returned. The caller umc_v12_0_query_ras_ecc_err_addr() detects this failure: ret = umc_v12_0_fill_error_record(adev, entries[i], ras_error_status); if (ret) { dev_err(adev->dev, "Fail to fill umc error record, ret:%d\n", ret); break; } radix_tree_tag_clear(ecc_tree, entries[i]->pa_pfn, UMC_ECC_NEW_DETECTED_TAG); Because it breaks out of its iteration without clearing UMC_ECC_NEW_DETECTED_TAG, won't any subsequent query (like a periodic retirement worker) find the exact same entry, leading to the exact same failure sequence and infinitely incrementing the counters? -- Sashiko AI review · https://sashiko.dev/#/patchset/[email protected]?part=1
