On Mon, Sep 16, 2019, Luck, Tony wrote:
>On Mon, Sep 16, 2019 at 11:37:18AM +0000, Tony W Wang-oc wrote:
>> Zhaoxin newer CPUs support LMCE that compatible with Intel's
>> "Machine-Check Architecture", so add support for Zhaoxin LMCE
>> in mce/core.c.
>>
>> Signed-off-by: Tony W Wang-oc <tonywwang...@zhaoxin.com>
>> ---
>>  arch/x86/kernel/cpu/mce/core.c | 35 +++++++++++++++++++++++++++++++++--
>>  1 file changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 65c5a1f..acdd76b 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -1132,6 +1132,27 @@ static bool __mc_check_crashing_cpu(int cpu)
>>              u64 mcgstatus;
>>
>>              mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
>> +
>> +            if (boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN) {
>> +                    if (mcgstatus & MCG_STATUS_LMCES)
>> +                            return false;
>> +
>> +                    if (!(mcgstatus & MCG_STATUS_LMCES)) {
>
>Don't really need this test ... you already did "return false" if
>the LMCES bit was set ... so this test is redundant (and you can avoid
>indenting the next dozen lines).

Got it, thank you.

But I have a question about the code below:
        if (mcgstatus & MCG_STATUS_RIPV) {
                mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
                return true;
        }
This seems to require that every #MC exception sets MCG_STATUS_RIPV = 1
for the CPU to skip synchronization, which is what "return true;" does here.

As the Intel SDM shows, "Recoverable-not-continuable SRAR Type" errors may
set MCG_STATUS_RIPV = 0, PCC = 0. When such #MC errors are broadcast
to an offline CPU, they may cause a kernel panic with a synchronization
timeout (the offline CPU can't skip synchronization in this case).

Could "return true;" be moved outside the if block?
        if (mcgstatus & MCG_STATUS_RIPV) {
                mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
        } 
        return true; 

Sincerely
TonyWWang-oc
