Re: [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"

dan.j.williams Fri, 11 Jul 2025 17:33:07 -0700

marc.herbert@ wrote:
> From: Marc Herbert <marc.herb...@linux.intel.com>
> 
> While a "Call Trace" is usually a bad omen, the show_trace_log_lvl()
> function supports any log level. So a "Call Trace" is not a reliable
> indication of a failure. More importantly: any other WARNING or ERROR
> during a test should make a test FAIL. Before this commit, it does not.
> 
> So, leverage log levels for the PASS/FAIL decision.  This catches all
> issues and not just the ones printing Call Traces.
> 
> Add a simple way to exclude expected warnings and errors, either on a
> per-test basis or globally.
> 
> Add a way for negative tests to fail when if some expected errors are
> missing.


Each subsequent "Add" in a changelog is usually an opportunity to split
the patch into smaller digestable pieces.

> 
> Stop relying on the magic and convenient but inaccurate $SECONDS bash
> variable because its precision is (surprise!) about 1 second. In the
> first version of this patch, $SECONDS was kept and working but it
> required a 1++ second long "cooldown" between tests to isolate their
> logs from each other. After dropping $SECONDS from journalctl, no
> cooldown delay is required.
> 
> As a good example (which initiated this), the test feedback when hitting
> bug https://github.com/pmem/ndctl/issues/278, where the cxl_test module
> errors at load, is completely changed by this. Instead of only half the
> tests failing with a fairly cryptic and late "Numerical result out of
> range" error from user space, now all tests are failing early and more
> consistently; displaying the same, earlier and more relevant error.
> 
> This simple log-level based approach has been successfully used for
> years in the CI of https://github.com/thesofproject and caught
> countless firmware and kernel bugs.
> 
> Note: the popular message "possible circular locking ..." recently fixed
> by revert v6.15-rc1-4-gdc1771f71854 is at the WARNING level, including
> its Call Trace.

This looks promising to me, some comments below.

> 
> Signed-off-by: Marc Herbert <marc.herb...@linux.intel.com>
> ---
>  test/common            | 137 +++++++++++++++++++++++++++++++++++++++--
>  test/cxl-events.sh     |   2 +
>  test/cxl-poison.sh     |   7 +++
>  test/cxl-xor-region.sh |   2 +
>  test/dax.sh            |   6 ++
>  5 files changed, 149 insertions(+), 5 deletions(-)
> 
> diff --git a/test/common b/test/common
> index 74e74dd4fff9..8709932ffbbd 100644
> --- a/test/common
> +++ b/test/common
> @@ -15,6 +15,28 @@ err()
>       exit "$rc"
>  }
>  
> +time_init()
> +{
> +     test "$SECONDS" -le 1 || err 'test/common must be included first!'
> +     # ... otherwise NDTEST_START is inaccurate

What is this protecting against... that the test makes sure that
NDTEST_START happens before any error might have been produced?

I think the proposed anchors make this easily debuggable, there are no
tests that include test/common late, as evidenced by no fixups for this
in this patch.

> +     NDTEST_START=$(LC_TIME=C date '+%F %T.%3N')
> +
> +     # Log anchor, especially useful when running tests back to back
> +     printf "<5>%s@%ds: sourcing test/common: NDTEST_START=%s\n" \
> +             "$test_basename" "$SECONDS" "$NDTEST_START" > /dev/kmsg

Why is SECONDS here?

Note that there are some non-shell tests in the suite, not for CXL, but
might want to make this consistent by following on with wrapping those
tests in a script.

> +
> +     # Default value, can be overriden by the environment
> +     : "${NDTEST_LOG_DBG:=false}"
> +
> +     if "$NDTEST_LOG_DBG"; then
> +             local _cd_dbg_msg="NDTEST_LOG_DBG: $test_basename early msg 
> must be found by check_dmesg()"
> +             printf '<3>%s: %s\n' "$test_basename" "$_cd_dbg_msg" > /dev/kmsg
> +             kmsg_fail_if_missing+=("$_cd_dbg_msg")
> +     fi
> +}
> +time_init
> +
>  # Global variables
>  
>  # NDCTL
> @@ -143,15 +165,120 @@ json2var()
>       sed -e "s/[{}\",]//g; s/\[//g; s/\]//g; s/:/=/g"
>  }
>  
> -# check_dmesg
> +# check_dmesg() performs two actions controlled by two bash arrays:
> +# "kmsg_no_fail_on" and "kmsg_fail_if_missing". These list of extended
> +# regular expressions (grep '-E') have default values here that can
> +# be customized by each test.
> +#
> +# 1. check_dmesg() first checks the output of `journalctl -k -p warning`
> +# and makes the invoking test FAIL if any unexpected kernel error or
> +# warning occurred during the test. Messages in either the
> +# "kmsg_no_fail_on" or the "kmsg_fail_if_missing" arrays are expected
> +# and do NOT cause a test failure. All other errors and warnings cause a
> +# test failure.

Is there a mechanism to opt-out of errors and warnings. Sometimes
upstream gets overzealous with chatty dmesg and it would be nice to
quickly check if tests otherwise pass with ignoring messages. Then go
spend the time to track down new messages tripping up the test.

Otherwise I do like that this causes people to be careful in the
messages they add as new warnings and errors are things that distros
debug for customers on a consistent basis.

> +#
> +# 2.1 Then, check_dmesg() makes sure at least one line in the logs
> +# matches each regular expression in the "kmsg_fail_if_missing" array. If
> +# any of these expected messages was never issued during the test, then
> +# the test fails. This is especially useful for "negative" tests
> +# triggering expected errors; but not just. Unlike 1., all log levels
> +# are searched. Avoid relying on "optional" messages (e.g.: dyndbg) in
> +# "kmsg_fail_if_missing".

This feature feels pedantic, not grokking the immediate value. Perhaps
stage this at the end of the series.

> +#
> +# 2.2 to make sure "something" happened during the test, check_dmesg()
> +# provides a default, non-empty kmsg_fail_if_missing value that searches
> +# for either "nfit_test" or pmem" or "cxl_". These are not searched if
> +# the test already provides some value(s) in "kmsg_fail_if_missing".
> +# While not recommended, a test could use check_dmesg() and opt out of
> +# "kmsg_fail_if_missing" with a pointless regular expression like '.'

Is this not captured by modprobe failures?

> +
> +# Always append with '+=' to give any test the freedom to source this
> +# file before or after adding exclusions.
> +# kmsg_no_fail_on+=('this array cannot be empty otherwise grep -v fails')
> +
> +kmsg_no_fail_on+=('cxl_core: loading out-of-tree module taints kernel')
> +kmsg_no_fail_on+=('cxl_mock_mem.*: CXL MCE unsupported')
> +kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.*: Extended linear cache calculation 
> failed')
> +
> +# 'modprobe nfit_test' prints these every time it's not already loaded
> +kmsg_no_fail_on+=(
> +    'nd_pmem namespace.*: unable to guarantee persistence of writes'
> +    'nfit_test nfit_test.*: failed to evaluate _FIT'
> +    'nfit_test nfit_test.*: Error found in NVDIMM nmem.* flags: save_fail 
> restore_fail flush_fail not_armed'
> +    'nfit_test nfit_test.1: Error found in NVDIMM nmem.* flags: map_fail'
> +)
> +
> +# notice level to give some information without flooding the (single!)
> +# testlog.txt file
> +journalctl_notice()
> +{
> +     ( set +x;
> +       printf ' ------------ More verbose logs at t=%ds ----------\n' 
> "$SECONDS" )
> +     journalctl -b --no-pager -o short-precise -p notice --since 
> "-$((SECONDS*1000 + 1000)) ms"

Why is SECONDS here and not NDTEST_START?

> +}
> +
>  # $1: line number where this is called
>  check_dmesg()
>  {
> -     # validate no WARN or lockdep report during the run
> +     local _e_kmsg_no_fail_on=()
> +     for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
> +             _e_kmsg_no_fail_on+=('-e' "$re")
> +     done
> +
> +     # Give some time for a complete kmsg->journalctl flush + any delayed 
> test effect.
>       sleep 1

Feels magical. The sleep was only there to make sure that SECONDS rolls
over. If a test has delayed effects there is no hard guarantee that 1
second is sufficient.

> -     log=$(journalctl -r -k --since "-$((SECONDS+1))s")
> -     grep -q "Call Trace" <<< $log && err $1
> -     true
> +
> +     if "$NDTEST_LOG_DBG"; then
> +             journalctl -q -b --since "$NDTEST_START" \
> +                     -o short-precise > journal-"$(basename "$0")".log
> +     fi
> +     # After enabling, check the timings in:
> +     #    head -n 7 $(ls -1t build/journal-*.log | tac)
> +     #    journalctl --since='- 5 min' -o short-precise -g 'test/common'
> +
> +     { # Redirect to stderr so this is all at the _bottom_ in the log file
> +
> +     # Fail on kernel WARNING or ERROR.
> +     if journalctl -q -o short-precise -p warning -k --since "$NDTEST_START" 
> |
> +             grep -E -v "${_e_kmsg_no_fail_on[@]}"; then
> +                     journalctl_notice
> +                     err "$1"
> +     fi

Per above, add an opt-out for this.

> +
> +     # Sanity check: make sure "something" has run
> +     if [ "${#kmsg_fail_if_missing[@]}" = 0 ]; then
> +             kmsg_fail_if_missing+=( '(nfit_test)|(pmem)|(cxl_)' )
> +     fi
> +
> +     local expected_re
> +     for expected_re in "${kmsg_fail_if_missing[@]}"; do
> +             journalctl -q -k -p 7 --since "$NDTEST_START" |
> +                     grep -q -E -e "${expected_re}" || {
> +                             printf 'FAIL: expected error not found: %s\n' 
> "$expected_re"
> +                             journalctl_notice
> +                             err "$1"
> +             }
> +     done
> +     } >&2
> +
> +     # Log anchor, especially useful when running tests back to back

This comment is going to go stale, I do not think it helps with
understanding the implementation.

> +     printf "<5>%s@%ds: test/common: check_dmesg() OK\n" "$test_basename" 
> "$SECONDS" > /dev/kmsg

I like this, I have long wanted anchors in the log for coordinating
messages.

> +
> +     if "$NDTEST_LOG_DBG"; then
> +         log_stress from_check_dmesg
> +     fi
> +}
> +# Many tests don't use check_dmesg() (yet?) so double down here. Also, this

A comment like this belongs in the changelog, not the code. I don't want
review later patches fixing up the "(yet?)", a comment should help
understand the present, not comment on some future state.

> +# runs later which is better. But before using this make sure there is
> +# still no test defining its own EXIT trap.

> +if "$NDTEST_LOG_DBG"; then
> +    trap 'log_stress from_trap' EXIT
> +fi
> +
> +log_stress()
> +{
> +     printf '<3>%s@%ds: NDTEST_LOG_DBG; trying to break the next 
> check_dmesg() %s\n' \
> +             "$test_basename" "$SECONDS" "$1" > /dev/kmsg
>  }
>  
>  # CXL COMMON
> diff --git a/test/cxl-events.sh b/test/cxl-events.sh
> index 7326eb7447ee..29afd86b8bf8 100644
> --- a/test/cxl-events.sh
> +++ b/test/cxl-events.sh
> @@ -26,6 +26,8 @@ rc=1
>  dev_path="/sys/bus/platform/devices"
>  trace_path="/sys/kernel/tracing"
>  
> +kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.* no CXL window for range')

I want to see the other benefits from this patch before debugging
whether this exclusion needs a fixup somewhere else. So lets break out
the kmsg_no_fail_on mechanism separately from the base dmesg time,
anchors, and err / warn infrastructure.

> +
>  test_region_info()
>  {
>       # Trigger a memdev in the cxl_test autodiscovered region
> diff --git a/test/cxl-poison.sh b/test/cxl-poison.sh
> index 6ed890bc666c..8b38cb7960b0 100644
> --- a/test/cxl-poison.sh
> +++ b/test/cxl-poison.sh
> @@ -9,6 +9,13 @@ rc=77
>  set -ex
>  [ -d "/sys/kernel/tracing" ] || do_skip "test requires CONFIG_TRACING"
>  
> +# FIXME: this should be in "kmsg_fail_if_missing" but this test seems to
> +# work only once. Cleanup/reset issue?
> +kmsg_no_fail_on+=(
> +    'cxl_mock_mem cxl_mem.*: poison inject dpa:0x'
> +    'cxl_mock_mem cxl_mem.*: poison clear dpa:0x'
> +)

Yeah another fixup that should be done independently from the base test
infra.

> +
>  trap 'err $LINENO' ERR
>  
>  check_prereq "jq"
> diff --git a/test/cxl-xor-region.sh b/test/cxl-xor-region.sh
> index fb4f9a0a1515..d1a58023a61e 100644
> --- a/test/cxl-xor-region.sh
> +++ b/test/cxl-xor-region.sh
> @@ -16,6 +16,8 @@ modprobe -r cxl_test
>  modprobe cxl_test interleave_arithmetic=1
>  rc=1
>  
> +kmsg_fail_if_missing+=('cxl_mock_mem cxl_mem.* no CXL window for range')

Not sure I want to fail this test if we fixup this message.

> +
>  # THEORY OF OPERATION: Create x1,2,3,4 regions to exercise the XOR math
>  # option of the CXL driver. As with other cxl_test tests, changes to the
>  # CXL topology in tools/testing/cxl/test/cxl.c may require an update here.
> diff --git a/test/dax.sh b/test/dax.sh
> index 3ffbc8079eba..0589f0d053ec 100755
> --- a/test/dax.sh
> +++ b/test/dax.sh
> @@ -118,6 +118,12 @@ else
>       run_xfs
>  fi
>  
> +kmsg_fail_if_missing+=(
> +    'nd_pmem pfn.*: unable to guarantee persistence of writes'

This one is not valuable for this test.

> +    'Memory failure: .*: Sending SIGBUS to dax-pmd:.* due to hardware memory 
> corruption'
> +    'Memory failure: .*: recovery action for dax page: Recovered'

These are useful.

> +)
> +
>  check_dmesg "$LINENO"
>  
>  exit 0
> -- 
> 2.49.0
>

Re: [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"

Reply via email to