On Tue, 11 Nov 2025 20:55:17 +1000
Gavin Shan <[email protected]> wrote:

> Hi Jonathan,
> 
> On 11/11/25 8:07 PM, Jonathan Cameron wrote:
> > On Tue, 11 Nov 2025 14:08:13 +1000
> > Gavin Shan <[email protected]> wrote:  
> >> On 11/11/25 12:49 AM, Igor Mammedov wrote:  
> >>> On Thu, 6 Nov 2025 13:15:52 +1000
> >>> Gavin Shan <[email protected]> wrote:  
> >>>> On 11/6/25 12:14 AM, Jonathan Cameron wrote:  
> >>>>> On Wed,  5 Nov 2025 21:44:49 +1000
> >>>>> Gavin Shan <[email protected]> wrote:
> >>>>>         
> >>>>>> In the situation where host and guest has 64KiB and 4KiB page sizes,
> >>>>>> one problematic host page affects 16 guest pages. we need to send 16
> >>>>>> consective errors in this specific case.
> >>>>>>
> >>>>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> >>>>>> hunk of code to generate the GHES error status is pulled out from
> >>>>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> >>>>>> generic error status block is also updated accordingly if multiple
> >>>>>> error data entries are contained in the generic error status block.
> >>>>>>
> >>>>>> Signed-off-by: Gavin Shan <[email protected]>  
> >>>>> Hi Gavin,
> >>>>>
> >>>>> Mostly fine, but a few comments on the defines added and a
> >>>>> question on what the multiple things are meant to mean?
> >>>>>         
> >>>>
> >>>> Thanks for your review and comments, replies as below.
> >>>>     
> >>>>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> >>>>>> index a9c08e73c0..527b85c8d8 100644
> >>>>>> --- a/hw/acpi/ghes.c
> >>>>>> +++ b/hw/acpi/ghes.c
> >>>>>> @@ -57,8 +57,12 @@
> >>>>>>     /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error 
> >>>>>> Section */
> >>>>>>     #define ACPI_GHES_MEM_CPER_LENGTH           80
> >>>>>>     
> >>>>>> -/* Masks for block_status flags */
> >>>>>> -#define ACPI_GEBS_UNCORRECTABLE         1
> >>>>>> +/* Bits for block_status flags */
> >>>>>> +#define ACPI_GEBS_UNCORRECTABLE           0
> >>>>>> +#define ACPI_GEBS_CORRECTABLE             1
> >>>>>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
> >>>>>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3  
> >>>>>
> >>>>> So this maps to the bits in block status.
> >>>>>
> >>>>> I'm not actually sure what these multiple variants are meant to tell us.
> >>>>> The multiple error blocks example referred to by the spec is a way to 
> >>>>> represent
> >>>>> the same error applying to multiple places.  So that's one error, many 
> >>>>> blocks.
> >>>>> I have no idea if we set these bits in that case.
> >>>>>
> >>>>> Based on a quick look I don't think linux even takes any notice.  THere
> >>>>> are defines in actbl1.h but I'm not seeing any use made of them.
> >>>>>         
> >>>>
> >>>> I hope Igor can confirm since it was suggested by him.
> >>>>
> >>>> It's hard to understand how exactly these multiple variants are used 
> >>>> from the
> >>>> spec. In ACPI 6.5 Table 18.11, it's explained as below.
> >>>>
> >>>> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that 
> >>>> more
> >>>> than one uncorrectable errors have been detected.
> >>>>
> >>>> I don't see those multiple variants have been used by Linux. So I think 
> >>>> it's
> >>>> safe to drop them.  
> >>>
> >>> even though example describes 'same' error at different components,
> >>> the bit fields descriptions doesn't set any limits on what 'more than 
> >>> one' means.
> >>>
> >>> Also from guest POV it's multiple different pages that we are reporting 
> >>> here
> >>> as multiple CPERs.
> >>> It seems to me that setting *_MULTIPLE_* here is correct thing to do.
> >>>      
> >>
> >> I don't have strong opinions. Lets keep to set _MULTIPLE_ flag if Jonathan
> >> is fine. Again, this field isn't used by Linux guest.  
> > I don't care strongly.  Maybe we should ask for a spec clarification as I 
> > doubt
> > implementations will be consistent on this given the vague description and 
> > that
> > Linux ignores it today.
> >   
> 
> Google Gemini has the following question. If it can be trusted, it should be
> set when @num_of_addresses is larger than 1.
> 
> Quota from Google Gemini:
> 
> The system firmware sets this bit to indicate to the Operating System Power 
> Management (OSPM)
> that more than one correctable error condition has been detected and logged 
> for the associated
> hardware component since the last time the status was cleared by the 
> software. This is crucial
> because a high frequency of correctable errors often indicates a potential 
> underlying hardware
> issue that could lead to uncorrectable (and potentially fatal) errors if not 
> addressed (e.g.,
> in memory, where multiple correctable errors might trigger a spare memory 
> operation).
> 
> >>  
> >>>>>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4  
> >>>>>
> >>>>> This is bits 4-13 and the define isn't used. I'd drop it.
> >>>>>         
> >>>>
> >>>> The definition is used in acpi_ghes_memory_errors() of this patch. 
> >>>> However,
> >>>> I don't see it has been used by Linux. This field isn't used by Linux to 
> >>>> determine
> >>>> the total number of error entries. So I think I can drop it either if 
> >>>> Igor is ok.
> >>>>     
> >>
> >> Lets keep this field either in next revision if Jonathan is fine.  
> > 
> > I'm fine with the field, but not the value.  As far as I can tell form the 
> > spec, it should
> > be a mask, not a single bit.
> >   
> 
> Agreed, lets keep ACPI_HEST_ERROR_ENTRY_COUNT as zero in next revision.

I'm even more confused now.  The GEBS Error Data entry count should be field 
from 13:4
and the value taken should be the number of entries in the record, so 1, 4, 16 
depending
on the page size.

So that define of the value 4 is garbage. If it were DATA_ENTRIES_SHIFT then 
I'd be much happier.


> 
> Thanks,
> Gavin
> 
> 


Reply via email to