[ndctl PATCH v2 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"

marc . herbert Wed, 14 May 2025 19:18:58 -0700

From: Marc Herbert <marc.herb...@linux.intel.com>

While a "Call Trace" is usually a bad omen, the show_trace_log_lvl()
function supports any log level. So a "Call Trace" is not a reliable
indication of a failure. More importantly: any other WARNING or ERROR
during a test should make a test FAIL. Before this commit, it does not.


So, leverage log levels for the PASS/FAIL decision.  This catches all
issues and not just the ones printing Call Traces.

Add a simple way to exclude expected warnings and errors, either on a
per-test basis or globally.

Add a way for negative tests to fail when if some expected errors are
missing.

Stop relying on the magic and convenient but inaccurate $SECONDS bash
variable because its precision is (surprise!) about 1 second. In the
first version of this patch, $SECONDS was kept and working but it
required a 1++ second long "cooldown" between tests to isolate their
logs from each other. After dropping $SECONDS from journalctl, no
cooldown delay is required.

As a good example (which initiated this), the test feedback when hitting
bug https://github.com/pmem/ndctl/issues/278, where the cxl_test module
errors at load, is completely changed by this. Instead of only half the
tests failing with a fairly cryptic and late "Numerical result out of
range" error from user space, now all tests are failing early and more
consistently; displaying the same, earlier and more relevant error.

This simple log-level based approach has been successfully used for
years in the CI of https://github.com/thesofproject and caught
countless firmware and kernel bugs.

Note: the popular message "possible circular locking ..." recently fixed
by revert v6.15-rc1-4-gdc1771f71854 is at the WARNING level, including
its Call Trace.

Signed-off-by: Marc Herbert <marc.herb...@linux.intel.com>
---
 test/common            | 137 +++++++++++++++++++++++++++++++++++++++--
 test/cxl-events.sh     |   2 +
 test/cxl-poison.sh     |   7 +++
 test/cxl-xor-region.sh |   2 +
 test/dax.sh            |   6 ++
 5 files changed, 149 insertions(+), 5 deletions(-)

diff --git a/test/common b/test/common
index 74e74dd4fff9..67dfd03f9693 100644
--- a/test/common
+++ b/test/common
@@ -15,6 +15,28 @@ err()
        exit "$rc"
 }
 
+time_init()
+{
+       test "$SECONDS" -le 1 || err 'test/common must be included first!'
+       # ... otherwise NDTEST_START is inaccurate
+
+       NDTEST_START=$(LC_TIME=C date '+%F %T.%3N')
+
+       # Log anchor, especially useful when running tests back to back
+       printf "<5>%s@%ds: sourcing test/common: NDTEST_START=%s\n" \
+               "$test_basename" "$SECONDS" "$NDTEST_START" > /dev/kmsg
+
+       # Default value, can be overriden by the environment
+       : "${NDTEST_LOG_DBG:=false}"
+
+       if "$NDTEST_LOG_DBG"; then
+               local _cd_dbg_msg="NDTEST_LOG_DBG: $test_basename early msg 
must be found by check_dmesg()"
+               printf '<3>%s: %s\n' "$test_basename" "$_cd_dbg_msg" > /dev/kmsg
+               kmsg_fail_if_missing+=("$_cd_dbg_msg")
+       fi
+}
+time_init
+
 # Global variables
 
 # NDCTL
@@ -143,15 +165,120 @@ json2var()
        sed -e "s/[{}\",]//g; s/\[//g; s/\]//g; s/:/=/g"
 }
 
-# check_dmesg
+# check_dmesg() performs two actions controlled by two bash arrays:
+# "kmsg_no_fail_on" and "kmsg_fail_if_missing". These list of extended
+# regular expressions (grep '-E') have default values here that can
+# be customized by each test.
+#
+# 1. check_dmesg() first checks the output of `journalctl -k -p warning`
+# and makes the invoking test FAIL if any unexpected kernel error or
+# warning occurred during the test. Messages in either the
+# "kmsg_no_fail_on" or the "kmsg_fail_if_missing" arrays are expected
+# and do NOT cause a test failure. All other errors and warnings cause a
+# test failure.
+#
+# 2.1 Then, check_dmesg() makes sure at least one line in the logs
+# matches each regular expression in the "kmsg_fail_if_missing" array. If
+# any of these expected messages was never issued during the test, then
+# the test fails. This is especially useful for "negative" tests
+# triggering expected errors; but not just. Unlike 1., all log levels
+# are searched. Avoid relying on "optional" messages (e.g.: dyndbg) in
+# "kmsg_fail_if_missing".
+#
+# 2.2 to make sure "something" happened during the test, check_dmesg()
+# provides a default, non-empty kmsg_fail_if_missing value that searches
+# for either "nfit_test" or pmem" or "cxl_". These are not searched if
+# the test already provides some value(s) in "kmsg_fail_if_missing".
+# While not recommended, a test could use check_dmesg() and opt out of
+# "kmsg_fail_if_missing" with a pointless regular expression like '.'
+
+# Always append with '+=' to give any test the freedom to source this
+# file before or after adding exclusions.
+# kmsg_no_fail_on+=('this array cannot be empty otherwise grep -v fails')
+
+kmsg_no_fail_on+=('cxl_core: loading out-of-tree module taints kernel')
+kmsg_no_fail_on+=('cxl_mock_mem.*: CXL MCE unsupported')
+kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.*: Extended linear cache calculation 
failed rc:-2')
+
+# 'modprobe nfit_test' prints these every time it's not already loaded
+kmsg_no_fail_on+=(
+    'nd_pmem namespace.*: unable to guarantee persistence of writes'
+    'nfit_test nfit_test.*: failed to evaluate _FIT'
+    'nfit_test nfit_test.*: Error found in NVDIMM nmem.* flags: save_fail 
restore_fail flush_fail not_armed'
+    'nfit_test nfit_test.1: Error found in NVDIMM nmem.* flags: map_fail'
+)
+
+# notice level to give some information without flooding the (single!)
+# testlog.txt file
+journalctl_notice()
+{
+       ( set +x;
+         printf ' ------------ More verbose logs at t=%ds ----------\n' 
"$SECONDS" )
+       journalctl -b --no-pager -o short-precise -p notice --since 
"-$((SECONDS*1000 + 1000)) ms"
+}
+
 # $1: line number where this is called
 check_dmesg()
 {
-       # validate no WARN or lockdep report during the run
+       local _e_kmsg_no_fail_on=()
+       for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
+               _e_kmsg_no_fail_on+=('-e' "$re")
+       done
+
+       # Give some time for a complete kmsg->journalctl flush + any delayed 
test effect.
        sleep 1
-       log=$(journalctl -r -k --since "-$((SECONDS+1))s")
-       grep -q "Call Trace" <<< $log && err $1
-       true
+
+       if "$NDTEST_LOG_DBG"; then
+               journalctl -q -b --since "$NDTEST_START" \
+                       -o short-precise > journal-"$(basename "$0")".log
+       fi
+       # After enabling, check the timings in:
+       #    head -n 7 $(ls -1t build/journal-*.log | tac)
+       #    journalctl --since='- 5 min' -o short-precise -g 'test/common'
+
+       { # Redirect to stderr so this is all at the _bottom_ in the log file
+
+       # Fail on kernel WARNING or ERROR.
+       if journalctl -q -o short-precise -p warning -k --since "$NDTEST_START" 
|
+               grep -E -v "${_e_kmsg_no_fail_on[@]}"; then
+                       journalctl_notice
+                       err "$1"
+       fi
+
+       # Sanity check: make sure "something" has run
+       if [ "${#kmsg_fail_if_missing[@]}" = 0 ]; then
+               kmsg_fail_if_missing+=( '(nfit_test)|(pmem)|(cxl_)' )
+       fi
+
+       local expected_re
+       for expected_re in "${kmsg_fail_if_missing[@]}"; do
+               journalctl -q -k -p 7 --since "$NDTEST_START" |
+                       grep -q -E -e "${expected_re}" || {
+                               printf 'FAIL: expected error not found: %s\n' 
"$expected_re"
+                               journalctl_notice
+                               err "$1"
+               }
+       done
+       } >&2
+
+       # Log anchor, especially useful when running tests back to back
+       printf "<5>%s@%ds: test/common: check_dmesg() OK\n" "$test_basename" 
"$SECONDS" > /dev/kmsg
+
+       if "$NDTEST_LOG_DBG"; then
+           log_stress from_check_dmesg
+       fi
+}
+# Many tests don't use check_dmesg() (yet?) so double down here. Also, this
+# runs later which is better. But before using this make sure there is
+# still no test defining its own EXIT trap.
+if "$NDTEST_LOG_DBG"; then
+    trap 'log_stress from_trap' EXIT
+fi
+
+log_stress()
+{
+       printf '<3>%s@%ds: NDTEST_LOG_DBG; trying to break the next 
check_dmesg() %s\n' \
+               "$test_basename" "$SECONDS" "$1" > /dev/kmsg
 }
 
 # CXL COMMON
diff --git a/test/cxl-events.sh b/test/cxl-events.sh
index c216d6aa9148..1461b487e208 100644
--- a/test/cxl-events.sh
+++ b/test/cxl-events.sh
@@ -25,6 +25,8 @@ rc=1
 dev_path="/sys/bus/platform/devices"
 trace_path="/sys/kernel/tracing"
 
+kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.* no CXL window for range')
+
 test_region_info()
 {
        # Trigger a memdev in the cxl_test autodiscovered region
diff --git a/test/cxl-poison.sh b/test/cxl-poison.sh
index 2caf092db460..3985c817914b 100644
--- a/test/cxl-poison.sh
+++ b/test/cxl-poison.sh
@@ -8,6 +8,13 @@ rc=77
 
 set -ex
 
+# FIXME: this should be in "kmsg_fail_if_missing" but this test seems to
+# work only once. Cleanup/reset issue?
+kmsg_no_fail_on+=(
+    'cxl_mock_mem cxl_mem.*: poison inject dpa:0x'
+    'cxl_mock_mem cxl_mem.*: poison clear dpa:0x'
+)
+
 trap 'err $LINENO' ERR
 
 check_prereq "jq"
diff --git a/test/cxl-xor-region.sh b/test/cxl-xor-region.sh
index b9e1d79212d3..f5e0db98b67f 100644
--- a/test/cxl-xor-region.sh
+++ b/test/cxl-xor-region.sh
@@ -17,6 +17,8 @@ modprobe cxl_test interleave_arithmetic=1
 udevadm settle
 rc=1
 
+kmsg_fail_if_missing+=('cxl_mock_mem cxl_mem.* no CXL window for range')
+
 # THEORY OF OPERATION: Create x1,2,3,4 regions to exercise the XOR math
 # option of the CXL driver. As with other cxl_test tests, changes to the
 # CXL topology in tools/testing/cxl/test/cxl.c may require an update here.
diff --git a/test/dax.sh b/test/dax.sh
index 3ffbc8079eba..0589f0d053ec 100755
--- a/test/dax.sh
+++ b/test/dax.sh
@@ -118,6 +118,12 @@ else
        run_xfs
 fi
 
+kmsg_fail_if_missing+=(
+    'nd_pmem pfn.*: unable to guarantee persistence of writes'
+    'Memory failure: .*: Sending SIGBUS to dax-pmd:.* due to hardware memory 
corruption'
+    'Memory failure: .*: recovery action for dax page: Recovered'
+)
+
 check_dmesg "$LINENO"
 
 exit 0
-- 
2.49.0

[ndctl PATCH v2 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"

Reply via email to