On Mon, Sep 16, 2019, Luck, Tony wrote:
>On Mon, Sep 16, 2019 at 11:37:18AM +0000, Tony W Wang-oc wrote:
>> Zhaoxin newer CPUs support LMCE that compatible with Intel's
>> "Machine-Check Architecture", so add support for Zhaoxin LMCE
>> in mce/core.c.
>>
>> Signed-off-by: Tony W Wang-oc <tonywwang...@zhaoxin.com>
>> ---
>>  arch/x86/kernel/cpu/mce/core.c | 35 +++++++++++++++++++++++++++++++++--
>>  1 file changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 65c5a1f..acdd76b 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -1132,6 +1132,27 @@ static bool __mc_check_crashing_cpu(int cpu)
>>  		u64 mcgstatus;
>>
>>  		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
>> +
>> +		if (boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN) {
>> +			if (mcgstatus & MCG_STATUS_LMCES)
>> +				return false;
>> +
>> +			if (!(mcgstatus & MCG_STATUS_LMCES)) {
>
>Don't really need this test ... you already did "return false" if
>the LMCES bit was set ... so this test is redundant (and you can avoid
>indenting the next dozen lines.

Got it, thank you.

But I have a question about the code below:

	if (mcgstatus & MCG_STATUS_RIPV) {
		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
		return true;
	}

This seems to require every #MC exception to set MCG_STATUS_RIPV = 1
in order to skip synchronization, which is what "return true;" does here.
As the Intel SDM shows, "Recoverable-not-continuable SRAR Type" errors may
set MCG_STATUS_RIPV = 0, PCC = 0. When such an #MC is broadcast to an
offline CPU, it may cause a kernel panic with a synchronization timeout
(the offline CPU can't skip synchronization in this case).

Could "return true;" be moved outside the if block?

	if (mcgstatus & MCG_STATUS_RIPV) {
		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
	}
	return true;

Sincerely
TonyWWang-oc