On Thu, Jan 07, 2021 at 12:26:19AM +0000, Luck, Tony wrote:
> > Please see below for an updated patch.
>
> Yes. That worked:
>
> [ 78.946069] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
> [ 78.946151] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
> [ 78.946153] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
>
> I guess that more than one CPU hit the timeout and so your new message was
> printed twice before the panic code took over?

Could well be.  It would be easy to add a flag that allowed only one
CPU to print the message.  Does that make sense?  (See off-the-cuff
probably-broken delta patch below for one approach.)

> Once again, the whole of socket 1 is MIA rather than just the pair of threads
> on one of the cores there.
> But that's a useful improvement (eliminating the other three sockets on this
> system).
>
> Tested-by: Tony Luck <tony.l...@intel.com>

Thank you very much!  I will apply this.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7a6e1f3..b46ac56 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -882,6 +882,7 @@ static atomic_t mce_callin;
  */
 static cpumask_t mce_present_cpus;
 static cpumask_t mce_missing_cpus;
+static atomic_t mce_missing_cpus_gate;
 
 /*
  * Check if a timeout waiting for other CPUs happened.
@@ -900,7 +901,7 @@ static int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
-		if (mca_cfg.tolerant <= 1) {
+		if (mca_cfg.tolerant <= 1 && !atomic_xchg(&mce_missing_cpus_gate, 1)) {
 			if (cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
 				pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
 					__func__, cpumask_pr_args(&mce_missing_cpus));
@@ -1017,6 +1018,7 @@ static int mce_start(int *no_way_out)
 	 */
 	order = atomic_inc_return(&mce_callin);
 	cpumask_set_cpu(smp_processor_id(), &mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 
 	/*
 	 * Wait for everyone.
@@ -1126,6 +1128,7 @@ static int mce_end(int order)
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
 	cpumask_clear(&mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 	barrier();
 
 	/*
@@ -2725,6 +2728,7 @@ static void mce_reset(void)
 	atomic_set(&mce_callin, 0);
 	atomic_set(&global_nwo, 0);
 	cpumask_clear(&mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 }
 
 static int fake_panic_get(void *data, u64 *val)
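
For anyone who wants to poke at the gating idea outside the kernel, here is a rough userspace sketch using C11 atomics. It is only an illustration of the exchange-based "first caller wins" pattern that the atomic_xchg() in the delta patch relies on, not the kernel code itself. The names report_gate, report_gate_try_claim(), report_gate_reset() and report_timeout() are made up for this example, and atomic_exchange() from <stdatomic.h> stands in for the kernel's atomic_xchg().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for mce_missing_cpus_gate. */
static atomic_int report_gate;

/*
 * Returns true for exactly one caller until the gate is reset.
 * atomic_exchange() stores 1 and returns the previous value, so only
 * the caller that observed 0 is allowed to print.
 */
static bool report_gate_try_claim(void)
{
	return atomic_exchange(&report_gate, 1) == 0;
}

/* Re-arm the gate, analogous to the atomic_set(..., 0) calls in the patch. */
static void report_gate_reset(void)
{
	atomic_store(&report_gate, 0);
}

/*
 * Hypothetical caller: prints the message only if this caller claimed
 * the gate.  Returns 1 if it printed, 0 if another caller got there first.
 */
static int report_timeout(const char *who)
{
	if (!report_gate_try_claim())
		return 0;
	printf("%s: timeout reported\n", who);
	return 1;
}
```

Calling report_timeout() twice in a row prints once; after report_gate_reset() the next caller prints again. Note that the sketch resets the gate only at an explicit point chosen by the caller, so it says nothing about whether resetting in mce_start() is safe against late-arriving CPUs.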