On 04/10/15 at 12:49am, Naoya Horiguchi wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > >
> > > Let's step back a few feet and look at the big picture. There are three
> > > main classes of machine check that we might see while trying to run
> > > kdump - and remember that all machine checks are currently broadcast,
> > > so all cpus whether online or offline will see them
> > >
> > > 1) Fatal
> > > We have to crash - lose the dump. Having a new machine check handler
> > > will make things a bit easier to see what happened because we won't
> > > have any synchronization failed messages from the offline cpus.
> >
> > But this should not be a problem if kdump path keeps cpu_online_mask
> > up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> >
> > IMHO, of course.
>
> Sorry, I misread you. With clearing cpu_online_mask in shootdown (not done
> yet), raising tolerance should work without timeout message.
> So I think you are right.

Hi Naoya,

Thanks for the great efforts you have made on this issue. I am trying to
catch up, and have read the mails in this thread. Please also add me to the
CC list when you post a new version. I would like to review it.

Thanks
Baoquan

> > > 2) Execution path recoverable (SRAR in SDM parlance).
> > > Also going to be fatal (kdump is all running in ring0, and we can't
> > > recover from errors in ring 0). Cleaner messages as above. Potentially
> > > in the future we might be able to make the kdump machine check handler
> > > actually recover by just skipping a page - if the location of the error
> > > was in the old kernel image.
> > >
> > > 3) Non-execution path recoverable (SRAO in SDM)
> > > We ought to be able to keep kdump running if this happens - the "AO"
> > > stands for "action optional", so we are going to choose to not take an
> > > action. Wherever the error was, it won't affect correctness of
> > > execution of the current context.
> >
> > Those could be simply made to go to dmesg during kdump, i.e. decouple
> > any MCE consumers. And we do that now anyway, i.e. box without mcelog or
> > some other ras daemon running.
> >
> > So we could reuse the normal handler - we just need to do some tweaking
> > first... AFAICT, of course. I believe in that endeavor, the devil will
> > be in the detail.
>
> OK, I'll try this approach with updating cpu_online_mask.
>
> Thanks,
> Naoya Horiguchi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
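
For anyone catching up on the thread, the three classes of machine check discussed above (fatal, SRAR, SRAO) can be summarized as a small decision table. This is only a minimal sketch in plain userspace C, not kernel code: the enum and function names (kdump_mce_class, kdump_mce_action, kdump_mce_policy) are hypothetical and do not exist in the kernel, and the sketch does not model how a real handler derives severity from the machine check status registers.

```c
/*
 * Hypothetical illustration only: none of these names exist in the kernel.
 * It encodes the policy described in the thread for a machine check that
 * arrives while kdump is running.
 */
enum kdump_mce_class {
	KDUMP_MCE_FATAL,	/* 1) fatal */
	KDUMP_MCE_SRAR,		/* 2) execution path recoverable */
	KDUMP_MCE_SRAO,		/* 3) non-execution path recoverable */
};

enum kdump_mce_action {
	KDUMP_ACT_PANIC,		/* crash, the dump is lost */
	KDUMP_ACT_LOG_AND_CONTINUE,	/* report to dmesg, keep dumping */
};

static enum kdump_mce_action kdump_mce_policy(enum kdump_mce_class class)
{
	switch (class) {
	case KDUMP_MCE_SRAO:
		/* "Action optional": take no action, kdump keeps running. */
		return KDUMP_ACT_LOG_AND_CONTINUE;
	case KDUMP_MCE_SRAR:
		/* kdump runs entirely in ring 0, so this is fatal for now. */
	case KDUMP_MCE_FATAL:
	default:
		return KDUMP_ACT_PANIC;
	}
}
```

The SRAR case deliberately falls through to the fatal case. If skipping a page in the old kernel image ever becomes possible, as suggested in the thread, that is the branch that would change.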