On 10/21/2021 7:26 PM, Russell, Kent wrote:
[AMD Official Use Only]



-----Original Message-----
From: Lazar, Lijo <lijo.la...@amd.com>
Sent: Thursday, October 21, 2021 1:25 AM
To: Russell, Kent <kent.russ...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben <luben.tui...@amd.com>; Joshi, Mukul <mukul.jo...@amd.com>
Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case



On 10/20/2021 10:05 PM, Kent Russell wrote:
If the bad_page_threshold kernel parameter is set to -2,
continue to post the GPU. Print a warning to dmesg that this action has
been done, and that page retirement will obviously not work for said GPU

Cc: Luben Tuikov <luben.tui...@amd.com>
Cc: Mukul Joshi <mukul.jo...@amd.com>
Signed-off-by: Kent Russell <kent.russ...@amd.com>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
   1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 1ede0f0d6f55..31852330c1db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct
amdgpu_ras_eeprom_control *control,
                        res = amdgpu_ras_eeprom_correct_header_tag(control,
                                                                   
RAS_TABLE_HDR_VAL);
                } else {
-                       *exceed_err_limit = true;
-                       dev_err(adev->dev,
-                               "RAS records:%d exceed threshold:%d, "
-                               "GPU will not be initialized. Replace this GPU 
or increase the
threshold",
+                       dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
                                control->ras_num_recs, 
ras->bad_page_cnt_threshold);
+                       if (amdgpu_bad_page_threshold == -2) {
+                               dev_warn(adev->dev, "GPU will be initialized 
due to
bad_page_threshold = -2.");
+                               dev_warn(adev->dev, "Page retirement will not 
work for
this GPU in this state.");

Now, this looks as good as booting with 0 = disable bad page retirement.
I thought page retirement will work as long as EEPROM has space, but it
won't bother about the threshold. If the intent is to ignore bad page
retirement, then 0 is good enough and -2 is not required.

Also, when user passes threshold=-2, what is the threshold being
compared against to say *exceed_err_limit = true?

My thought on having the -2 option is that we'll still enable page retirement, 
we just won't shut the GPU down when it hits the threshold. The bad pages will 
still be retired as we hit them, it will just never disable the GPU. The 
comment about retirement not working is definitely incorrect now (leftover from 
previous local patches), so I'll remove that. In this case, I don't think we'd 
ever exceed the error limit. exceed_err_limit is only really used when we are 
disabling the GPU, so we wouldn't want to set that to true. Otherwise we 
wouldn't be loading the bad pages in gpu_recovery_init, and we'll still return 
0 from gpu_recovery_init.


Probably, this is too late as this patch is superseded. Replying to this to keep the context.

I realized that the question went in the wrong direction because of *exceed_err_limit = true. Saw it in the earlier line but didn't realize that it was removed.

Anyway, main question I wanted to ask was -
        
"When user passes threshold=-2, what is the threshold being compared against?"

Looked it up today and saw this piece of code.
                if (amdgpu_bad_page_threshold < 0) {
                u64 val = adev->gmc.mc_vram_size;

                do_div(val, RAS_BAD_PAGE_COVER);
                con->bad_page_cnt_threshold = min(lower_32_bits(val),
                                                  max_count);


For a user who has been specifying say bad_page_threshold=100 in the command line and unable to boot, now this is doing something else.

The comparison in next boot is done against a different threshold. In fact, it won't even come to this comparison logic in the next boot as the threshold is raised by default when this parameter is specified.

The only case this parameter takes effect is when a user is unable to boot because bad page retirements have exceeded the "default threshold". In other cases it just resets the comparison limit. Hope that is the intention.

Thanks,
Lijo


  Kent

Thanks,
Lijo

+                               res = 0;
+                       } else {
+                               *exceed_err_limit = true;
+                               dev_err(adev->dev, "GPU will not be 
initialized. Replace this
GPU or increase the threshold.");
+                       }
                }
        } else {
                DRM_INFO("Creating a new EEPROM table");

Reply via email to