On Thu, 9 Aug 2018 12:26:46 +0200 Michal Suchánek <msucha...@suse.de> wrote:
> On Thu, 9 Aug 2018 18:33:33 +1000
> Nicholas Piggin <npig...@gmail.com> wrote:
>
> > On Thu, 9 Aug 2018 13:39:45 +0530
> > Ananth N Mavinakayanahalli <ana...@linux.vnet.ibm.com> wrote:
> >
> > > On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> > > > On Thu, 09 Aug 2018 16:34:07 +1000
> > > > Michael Ellerman <m...@ellerman.id.au> wrote:
> > > >
> > > > > "Aneesh Kumar K.V" <aneesh.ku...@linux.ibm.com> writes:
> > > > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:
> > > > > >> Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> writes:
> > > > > >>> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
> > > > > >>>
> > > > > >>> Introduce recovery action for recovered memory errors
> > > > > >>> (MCEs). There are soft memory errors like SLB Multihit,
> > > > > >>> which can be a result of a bad hardware OR software BUG.
> > > > > >>> Kernel can easily recover from these soft errors by
> > > > > >>> flushing SLB contents. After the recovery kernel can still
> > > > > >>> continue to function without any issue. But in some
> > > > > >>> scenario's we may keep getting these soft errors until the
> > > > > >>> root cause is fixed. To be able to analyze and find the
> > > > > >>> root cause, best way is to gather enough data and system
> > > > > >>> state at the time of MCE. Hence this patch introduces a
> > > > > >>> sysctl knob where user can decide either to continue after
> > > > > >>> recovery or panic the kernel to capture the dump.
> > > > > >>
> > > > > >> I'm not convinced we want this.
> > > > > >>
> > > > > >> As we've discovered it's often not possible to reconstruct
> > > > > >> what happened based on a dump anyway.
> > > > > >>
> > > > > >> The key thing you need is the content of the SLB and that's
> > > > > >> not included in a dump.
> > > > > >>
> > > > > >> So I think we should dump the SLB content when we get the
> > > > > >> MCE (which this series does) and any other useful info, and
> > > > > >> then if we can recover we should.
> > > > > >
> > > > > > The reasoning there is what if we got multi-hit due to some
> > > > > > corruption in slb_cache_ptr. ie. some part of kernel is
> > > > > > wrongly updating the paca data structure due to wrong
> > > > > > pointer. Now that is far fetched, but then possible right?.
> > > > > > Hence the idea that, if we don't have much insight into why a
> > > > > > slb multi-hit occur from the dmesg which include slb content,
> > > > > > slb_cache contents etc, there should be an easy way to force
> > > > > > a dump that might assist in further debug.
> > > > > >
> > > > > If you're debugging something complex that you can't determine
> > > > > from the SLB dump then you should be running a debug kernel
> > > > > anyway. And if anything you want to drop into xmon and sit
> > > > > there, preserving the most state, rather than taking a dump.
> > > >
> > > > I'm not saying for a dump specifically, just some form of crash.
> > > > And we really should have an option to xmon on panic, but that's
> > > > another story.
> > >
> > > That's fine during development or in a lab, not something we could
> > > enforce in a customer environment, could we?
> >
> > xmon on panic? Not something to enforce but IMO (without thinking
> > about it too much but having encountered it several times) it should
> > probably be tied xmon on BUG option.
>
> You should get that with this patch and xmon=on or am I missing
> something?

Oh yeah, I just got a bit side tracked and added something not very
relevant -- a panic() call should drop to xmon if we have xmon=on. It
doesn't today (or last I looked), but that's nothing to do with this
patch.

Thanks,
Nick