common: stop relying on bash $SECONDS in check_dmesg()

Alison Schofield Thu, 21 Aug 2025 17:11:02 -0700

> 
> On 7/24/25 3:00 PM, marc.herb...@linux.intel.com wrote:
> > From: Marc Herbert <marc.herb...@linux.intel.com>
> > 
> > Stop relying on the imprecise bash $SECONDS variable as a test start
> > timestamp when scanning the logs for issues.
> > 
> > $SECONDS was convenient because it came "for free" and did not require
> > any time_init(). But it was not fine-grained enough and its rounding
> > process is not even documented. Keep using $SECONDS in log messages
> > where it is easy to use and more user-friendly than bare timestamps, but
> > switch the critical journalctl scan to a new, absolute NDTEST_START
> > timestamp initialized when test/common is sourced. Use a SECONDS-based,
> > rough sanity check in time_init() to make sure test/common is always
> > sourced early.


This patch has one fix, plus 4 while-we-are-at-it's, with only one of the
latter mentioned in the commit message. 

Limit this patch to fixing the known problem - replacing $SECONDS with an
initialized NDTEST_START for accurate log scanning. You could reasonably
include the log anchor additions since those help folks who view the dmesg
log in one go and tie directly to the timing fix. Then, if there is still
justification for the other debug functionality after we've tightened up
the timing, add them in a separate patch with proper explanation of what
problem they solve.

The commit message is explaining the drawbacks of seconds, but not
mentioning why that is a problem for ndctl. This previous text, captured
the issue:

>> Using the bash variable 'SECONDS' plus 1 for searching the
>> dmesg log sometimes led to one test picking up error messages
>> from the previous test when run as a suite. SECONDS alone may
>> miss some logs, but SECONDS + 1 is just as often too great.

Then state what the patch does to resolve it.
Initialize and use a timestamp with ms granularity with journalctl.

a few comments on the implementation...

> > 
> > Signed-off-by: Marc Herbert <marc.herb...@linux.intel.com>
> > ---
> >  test/common | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 50 insertions(+), 2 deletions(-)
> > 
> > diff --git a/test/common b/test/common
> > index 2d076402ef7c..0214c6aaed5f 100644
> > --- a/test/common
> > +++ b/test/common
> > @@ -15,6 +15,25 @@ err()
> >     exit "$rc"
> >  }
> >  
> > +# Initialize the NDTEST_START timestamp used to scan the logs.
> > +# Insert an anchor/bookmark in the logs to quickly locate the start of any 
> > test.
> > +time_init()
> > +{
> > +   # Refuse to run if anything lasted for too long before this point
> > +   # because that would make NDTEST_START incorrect.
> > +   test "$SECONDS" -le 1 || err 'test/common must be included first!'

Agree w previous reviewer comments, this is not needed.

> > +
> > +   NDTEST_START=$(LC_TIME=C date '+%F %T.%3N')
> > +
> > +   # Log anchor, especially useful when running tests back to back
> > +   printf "<5>%s@%ds: sourcing test/common: NDTEST_START=%s\n" \
> > +           "$test_basename" "$SECONDS" "$NDTEST_START" > /dev/kmsg
> > +
> > +   # Default value, can be overridden by the environment
> > +   : "${NDTEST_LOG_DBG:=false}"
> > +}
> > +time_init
> > +
> >  # Global variables
> >  
> >  # NDCTL
> > @@ -147,11 +166,40 @@ json2var()
> >  # $1: line number where this is called
> >  check_dmesg()
> >  {
> > +   if "$NDTEST_LOG_DBG"; then
> > +           # Keep a record of which log lines we scanned
> > +           journalctl -q -b --since "$NDTEST_START" \
> > +                   -o short-precise > journal-"$(basename "$0")".log
> > +   fi
> > +   # After enabling with `NDTEST_LOG_DBG=true meson test`, inspect with:
> > +   #    head -n 7 $(ls -1t build/journal-*.log | tac)
> > +   #    journalctl --since='- 5 min' -o short-precise -g 'test/common'
> > +
> >     # validate no WARN or lockdep report during the run
> > -   sleep 1
> > -   log=$(journalctl -r -k --since "-$((SECONDS+1))s")
> > +   log=$(journalctl -r -k --since "$NDTEST_START")
> >     grep -q "Call Trace" <<< "$log" && err "$1"
> >     true
> > +
> > +   # Log anchor, especially useful when running tests back to back
> > +   printf "<5>%s@%ds: test/common: check_dmesg() OK\n" "$test_basename" 
> > "$SECONDS" > /dev/kmsg
> > +
> > +   if "$NDTEST_LOG_DBG"; then
> > +       log_stress from_check_dmesg
> > +   fi
> > +}
> > +
> > +# While they should, many tests don't use check_dmesg().

I suspect we get more value by directly going after the outliers like
scripts not checking dmesg, or unit tests not wrapped in scripts.
(wrapping unit tests in scripts would be a new requirement)

Although not clear what the below code has to do with tests that
don't check_dmesg().

> > So double down here. Also, this
> > +# runs later which is better.
> > +# Before enabling NDTEST_LOG_DBG=true, make sure no test started defining 
> > its own
> > +# EXIT trap.
> > +if "$NDTEST_LOG_DBG"; then
> > +    trap 'log_stress from_trap' EXIT
> > +fi
> > +
> > +log_stress()
> > +{
> > +   printf '<3>%s@%ds: NDTEST_LOG_DBG Call Trace; trying to break the next 
> > check_dmesg() %s\n' \
> > +           "$test_basename" "$SECONDS" "$1" > /dev/kmsg
> >  }
> >  
> >  # CXL COMMON
>

Re: [ndctl PATCH 3/3] test/common: stop relying on bash $SECONDS in check_dmesg()

Reply via email to