"Aneesh Kumar K.V" <aneesh.ku...@linux.ibm.com> writes:
> On 08/08/2018 08:26 PM, Michael Ellerman wrote:
>> Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
>>>
>>> Introduce recovery action for recovered memory errors (MCEs). There are
>>> soft memory errors like SLB Multihit, which can be a result of a bad
>>> hardware OR software BUG. Kernel can easily recover from these soft errors
>>> by flushing SLB contents. After the recovery kernel can still continue to
>>> function without any issue. But in some scenario's we may keep getting
>>> these soft errors until the root cause is fixed. To be able to analyze and
>>> find the root cause, best way is to gather enough data and system state at
>>> the time of MCE. Hence this patch introduces a sysctl knob where user can
>>> decide either to continue after recovery or panic the kernel to capture the
>>> dump.
>>
>> I'm not convinced we want this.
>>
>> As we've discovered it's often not possible to reconstruct what happened
>> based on a dump anyway.
>>
>> The key thing you need is the content of the SLB and that's not included
>> in a dump.
>>
>> So I think we should dump the SLB content when we get the MCE (which
>> this series does) and any other useful info, and then if we can recover
>> we should.
>
> The reasoning there is what if we got multi-hit due to some corruption
> in slb_cache_ptr. ie. some part of kernel is wrongly updating the paca
> data structure due to wrong pointer. Now that is far fetched, but then
> possible right?. Hence the idea that, if we don't have much insight into
> why a slb multi-hit occur from the dmesg which include slb content,
> slb_cache contents etc, there should be an easy way to force a dump that
> might assist in further debug.

If you're debugging something complex that you can't determine from the
SLB dump then you should be running a debug kernel anyway. And if
anything you want to drop into xmon and sit there, preserving the most
state, rather than taking a dump.

The last SLB multi-hit I debugged was this:

  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=db7130d63fd8

Which took quite a while to track down, including a bunch of tracing and
so on. A dump would not have helped in the slightest.

cheers