On Tue, Sep 17, 2019 at 11:04 AM Sam Bobroff <sbobr...@linux.ibm.com> wrote: > > On Tue, Sep 03, 2019 at 08:15:56PM +1000, Oliver O'Halloran wrote: > > Currently we print a stack trace in the event handler to help with > > debugging EEH issues. In the case of suprise hot-unplug this is unneeded, > > so we want to prevent printing the stack trace unless we know it's due to > > an actual device error. To accomplish this, we can save a stack trace at > > the point of detection and only print it once the EEH recovery handler has > > determined the freeze was due to an actual error. > > > > Since the whole point of this is to prevent spurious EEH output we also > > move a few prints out of the detection thread, or mark them as pr_debug > > so anyone interested can get output from the eeh_check_dev_failure() > > if they want. > > > > Signed-off-by: Oliver O'Halloran <ooh...@gmail.com> > > I think this is a good change, and even in the normal case it will place > the stacktrace closer to the rest of the recovery information. > > But, I think it would make more sense to put the stacktrace into the > struct eeh_event, rather than the struct eeh_pe. Is there some reason > we can't do that? (It would save a fair bit of memory!)
Two reasons: 1) the eeh_event structures are allocated with GFP_ATOMIC since eeh_dev_check_failure() can be called from any context. Minimising the number of atomic allocations we do is a good idea as a matter of course. 2) We don't pass the eeh_event structure to the event handler function. I guess we could, but... eh I don't see the memory saving as hugely significant either. There's always fewer eeh_pe structures than there are PCI devices since some will share PEs (e.g. switches, multifunction cards) so you'd be saving a dozen KB at most. root@zaius1:~# lspci | wc -l 59 root@zaius1:~# echo $(( $(lspci | wc -l) * 64 * 8)) 30208 I think we'll live... > > Cheers, > Sam.