On production servers running a variety of workloads over time, a kernel
panic can happen sporadically after days or even months. It is
important to collect as many debug logs as possible to root cause
and fix a problem that may not be easy to reproduce. A snapshot of the
underlying hardware/firmware state (register dump, firmware
logs, adapter memory, etc.) at the time of the kernel panic is very
helpful when debugging the culprit device driver.
This patch series adds a new generic framework that enables device
drivers to collect a device-specific snapshot of the hardware/firmware
state of the underlying device at the time of a kernel panic. The
collected logs are appended to vmcore along with details, such as the
start address and length of the logs, which are required for
extraction during post-analysis.
Device drivers can use crash_driver_dump_register() to register a
callback that collects the underlying device-specific hardware/firmware
logs during a kernel panic (i.e. before booting into the second kernel).
Drivers can unregister with crash_driver_dump_unregister().
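
The register/collect/unregister flow described above can be sketched in
user space as follows. This is only an illustration of the pattern, not
the code from this series: the real signatures of
crash_driver_dump_register() and crash_driver_dump_unregister() are
defined in patch 1, and every name below (dump_cb_t, struct dump_entry,
mock_dump_register(), mock_dump_unregister(), mock_collect_all(),
example_fill()) is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: fill buf with up to len bytes of device
 * state and return the number of bytes written, or a negative error. */
typedef int (*dump_cb_t)(void *priv, void *buf, size_t len);

/* Hypothetical per-driver registration record. */
struct dump_entry {
	const char *name;        /* e.g. "cxgb4_0000:02:00.4" */
	dump_cb_t cb;            /* collection callback */
	void *priv;              /* driver private data passed to cb */
	size_t size;             /* bytes the driver wants to dump */
	struct dump_entry *next; /* list linkage, owned by the mock core */
};

static struct dump_entry *dump_list;

/* Mock registration: reject invalid or duplicate entries, else link in. */
int mock_dump_register(struct dump_entry *e)
{
	struct dump_entry *it;

	if (!e || !e->name || !e->cb || e->size == 0)
		return -1;
	for (it = dump_list; it; it = it->next)
		if (strcmp(it->name, e->name) == 0)
			return -1;
	e->next = dump_list;
	dump_list = e;
	return 0;
}

/* Mock unregistration: unlink the entry if it is on the list. */
int mock_dump_unregister(struct dump_entry *e)
{
	struct dump_entry **pp;

	for (pp = &dump_list; *pp; pp = &(*pp)->next) {
		if (*pp == e) {
			*pp = e->next;
			e->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Mock panic path: call every registered callback and append its output
 * to out. Entries that do not fit in the remaining space are skipped.
 * Returns the total number of bytes collected. */
size_t mock_collect_all(unsigned char *out, size_t cap)
{
	struct dump_entry *it;
	size_t used = 0;

	for (it = dump_list; it; it = it->next) {
		int n;

		if (it->size > cap - used)
			continue;
		n = it->cb(it->priv, out + used, it->size);
		if (n <= 0)
			continue;
		if ((size_t)n > it->size)
			n = (int)it->size; /* never trust more than requested */
		used += (size_t)n;
	}
	return used;
}

/* Example callback: fills the buffer with one byte taken from priv,
 * standing in for a real register or firmware log dump. */
int example_fill(void *priv, void *buf, size_t len)
{
	memset(buf, *(unsigned char *)priv, len);
	return (int)len;
}
```

In this sketch a driver would fill in a struct dump_entry at probe time,
call mock_dump_register(), and call mock_dump_unregister() on removal;
mock_collect_all() stands in for the step that runs during the panic.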
To extract the device-specific hardware/firmware logs using the crash utility:
crash> help -D | grep DRIVERDUMP
DRIVERDUMP=(cxgb4_0000:02:00.4, ffffb131090bd000, 37782968)
crash> rd ffffb131090bd000 37782968 -r hardware.log
37782968 bytes copied from 0xffffb131090bd000 to hardware.log
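
For scripted post-analysis, the DRIVERDUMP line shown above could be
split into its three fields and turned into the matching rd command.
The following is a minimal sketch that assumes the fields are
(name, hexadecimal start address, decimal length in bytes), as the
example session suggests; struct driverdump_info, parse_driverdump()
and format_rd_command() are hypothetical helpers, not part of crash or
of this series.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one DRIVERDUMP=(name, addr, len) record. */
struct driverdump_info {
	char name[64];           /* e.g. "cxgb4_0000:02:00.4" */
	unsigned long long addr; /* start address, printed in hex */
	unsigned long long len;  /* length in bytes, printed in decimal */
};

/* Parse a line containing "DRIVERDUMP=(name, addr, len)".
 * Returns 0 on success, -1 if the line does not match. */
int parse_driverdump(const char *line, struct driverdump_info *out)
{
	const char *p = strstr(line, "DRIVERDUMP=(");

	if (!p)
		return -1;
	if (sscanf(p, "DRIVERDUMP=(%63[^,], %llx, %llu",
		   out->name, &out->addr, &out->len) != 3)
		return -1;
	return 0;
}

/* Build the crash "rd" command that copies the dump to outfile.
 * Returns 0 on success, -1 if buf is too small. */
int format_rd_command(const struct driverdump_info *info,
		      const char *outfile, char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "rd %llx %llu -r %s",
			 info->addr, info->len, outfile);

	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Feeding the DRIVERDUMP line from the session above through
parse_driverdump() and then format_rd_command() with "hardware.log"
reproduces the rd command used there.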
Patch 1 adds an API that allows drivers to register a callback to
collect device-specific hardware/firmware logs.
Patch 2 shows a cxgb4 driver example that uses the API to collect
hardware/firmware logs during a kernel panic.
Suggestions and feedback will be much appreciated.
Rahul Lakkireddy (2):
  kernel/crash_core: add API to collect hardware dump in kernel panic
  cxgb4: collect hardware dump in kernel panic
drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 6 ++
drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 95 +++++++++++++++++++++++-
drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h | 4 +
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 12 +++
include/linux/crash_core.h | 33 ++++++++
kernel/crash_core.c | 83 ++++++++++++++++++++-
kernel/kexec_core.c | 1 +
7 files changed, 229 insertions(+), 5 deletions(-)