On Tue, Feb 20, 2018 at 17:04:20 -0800, Jakub Kicinski wrote:
> On Tue, 20 Feb 2018 16:51:03 -0800, Florian Fainelli wrote:
> > On 02/20/2018 04:43 PM, Jakub Kicinski wrote:
> > > On Mon, 19 Feb 2018 18:04:17 +0530, Rahul Lakkireddy wrote:  
> > >> Our requirement is to analyze the state of firmware/hardware at the
> > >> time of kernel panic.   
> > > 
> > > I was wondering about this since you posted the patch and I can't come
> > > up with any specific scenario where kernel crash would correlate
> > > clearly with device state in non-trivial way.
> > > 
> > > Perhaps there is something about cxgb4 HW/FW that makes this useful.
> > > Could you explain?  Could you give a real life example of a bug?  
> > > Is it related to the TOE-looking TLS offload Atul is posting?
> > > 
> > > Is the panic you're targeting here real or manually triggered from user
> > > space to get a full dump of kernel and FW?
> > > 
> > > That's me trying to guess what you're doing.. :)
> > 

This is not related to the TLS offload that Atul posted.  This is
related to general field diagnostics.

When a kernel panic happens on critical production servers, the issue
may not be reproducible again, and the servers may not have downtime
available for debugging.

Currently, the vmcore generated after a panic has only a snapshot of
the driver state, not the hardware/firmware state at the time of the
kernel panic. If the complete state and logs of the underlying NIC
hardware/firmware (in fact, of all hardware components) are collected
as well, it will be very helpful for post-mortem analysis.

For example, hardware memory may get incorrectly programmed by the
driver due to a race condition, which indirectly causes a kernel
panic. A dump of hardware memory collected at the time of the kernel
panic will definitely help to root-cause and fix such an issue. A
rough sketch of the idea follows below.
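
To illustrate the idea (this is not the actual cxgb4 patch), here is a
minimal sketch of a driver hooking the kernel panic notifier chain so
that adapter state is copied into a pre-allocated kernel buffer, which
then becomes part of the vmcore. The collect_adapter_dump() helper,
the dump size, and the module names are hypothetical placeholders; a
real driver would read its registers and firmware logs there.

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/vmalloc.h>

#define DUMP_LEN (2 * 1024 * 1024)	/* hypothetical dump size */

static void *dump_buf;			/* pre-allocated so it ends up in vmcore */

/* Hypothetical helper: a real driver would read adapter registers and
 * firmware logs over its debug interface here. */
static void collect_adapter_dump(void *buf, size_t len)
{
}

static int adapter_dump_on_panic(struct notifier_block *nb,
				 unsigned long event, void *unused)
{
	/* Panic context: no sleeping, no memory allocation. */
	if (dump_buf)
		collect_adapter_dump(dump_buf, DUMP_LEN);
	return NOTIFY_DONE;
}

static struct notifier_block adapter_dump_nb = {
	.notifier_call = adapter_dump_on_panic,
};

static int __init adapter_dump_init(void)
{
	dump_buf = vzalloc(DUMP_LEN);
	if (!dump_buf)
		return -ENOMEM;
	return atomic_notifier_chain_register(&panic_notifier_list,
					      &adapter_dump_nb);
}

static void __exit adapter_dump_exit(void)
{
	atomic_notifier_chain_unregister(&panic_notifier_list,
					 &adapter_dump_nb);
	vfree(dump_buf);
}

module_init(adapter_dump_init);
module_exit(adapter_dump_exit);
MODULE_LICENSE("GPL");

Pre-allocating the buffer is what makes this usable: in panic context
the handler can only copy into memory that already exists, and the
dump is then pulled out of the vmcore during post-mortem analysis.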

> > One case where this might be helpful is if you are chasing down DMA
> > corruption and you would like to get a nearly instant capture of both
> > the kernel's memory and the adapter which may be responsible for that.
> > This is probably not 100% foolproof because there is a timing window during
> > which the dumps of both contexts are going to happen, and that alone
> > might be influencing the captured memory view. Just guessing of course.
> 
> Perhaps this is what you mean with the timing window - but with random
> corruptions by the time kernel hits the corrupted memory 40/100Gb
> adapter has likely forgotten all about those DMAs..  And IOMMUs are
> pretty good at catching corruptions on big iron CPUs (i.e. it's easy to
> catch them in testing, even if production environment runs iommu=pt).
> At least that's my gut feeling/experience ;)

Thanks,
Rahul
