Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:

Pre-existing issues:
- [High] `amdgpu_umc_lookup_bad_pages_in_a_row()` incorrectly returns 0 and 
drops error pages if `retire_unit` exceeds the array length.
- [High] Infinite overcounting of `de_count` and duplicate bad pages if 
`amdgpu_umc_fill_error_record()` fails.
--

commit dbc37c594d1c3e76008f0f9e176701a44797cd01
Author: Ruoyu Wang <[email protected]>

drm/amdgpu: initialize ret before UMC error record loop

The commit initializes ret to 0 before iterating over the bad pages returned
by amdgpu_umc_lookup_bad_pages_in_a_row() to prevent returning uninitialized
data when no pages are found.

> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c 
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index a6df6a778f50b..6d3c0f495e0f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -690,6 +690,7 @@ static int umc_v12_0_fill_error_record(struct 
> amdgpu_device *adev,
>       count = amdgpu_umc_lookup_bad_pages_in_a_row(adev,
>                               ecc_err->pa_pfn << AMDGPU_GPU_PAGE_SHIFT,
>                               page_pfn, ARRAY_SIZE(page_pfn));

[Severity: High]
This is a pre-existing issue, but does
amdgpu_umc_lookup_bad_pages_in_a_row() incorrectly return 0 and drop error
pages if the retire_unit exceeds the array length?

Looking at amdgpu_umc_lookup_bad_pages_in_a_row() in
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c, if adev->umc.retire_unit is greater
than len, the loop terminates early via goto out:

    for (i = 0; i < adev->umc.retire_unit; i++) {
        if (i >= len)
            goto out;

        pfns[i] = err_data.err_addr[i].retired_page;
    }
    ret = i;
    adev->umc.err_addr_cnt = err_data.err_addr_cnt;

out:
    kfree(err_data.err_addr);
    return ret;

Since ret is 0 from the previously successful call to
amdgpu_umc_pages_in_a_row(), this bypasses the updates to ret and
adev->umc.err_addr_cnt. Does this cause the function to incorrectly return 0
instead of len, dropping the successfully retrieved bad pages?

> +     ret = 0;
>  
>       for (i = 0; i < count; i++) {
>               ret = amdgpu_umc_fill_error_record(err_data,

[Severity: High]
This is a pre-existing issue, but can this create an infinite loop of
overcounting if amdgpu_umc_fill_error_record() fails?

If amdgpu_umc_fill_error_record() returns an error, the loop in
umc_v12_0_fill_error_record() breaks, err_data->de_count is unconditionally
incremented, and the error code is returned.

The caller umc_v12_0_query_ras_ecc_err_addr() detects this failure:

    ret = umc_v12_0_fill_error_record(adev, entries[i], ras_error_status);
    if (ret) {
        dev_err(adev->dev, "Fail to fill umc error record, ret:%d\n", ret);
        break;
    }
    radix_tree_tag_clear(ecc_tree,
            entries[i]->pa_pfn, UMC_ECC_NEW_DETECTED_TAG);

Because it breaks out of its iteration without clearing
UMC_ECC_NEW_DETECTED_TAG, won't any subsequent query (like a periodic
retirement worker) find the exact same entry, leading to the exact same
failure sequence and infinitely incrementing the counters?

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=1

Reply via email to