On Wed, 2026-01-14 at 13:35 +0100, Tomas Glozar wrote:
> A kernel panic was observed in the timerlat tracer with the following
> reproducer:
> 
>    #!/bin/bash
>    while true; do
>       rtla timerlat hist -u -d 5s & PID=$!
>       sleep 2
>       echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
>       rtla timerlat hist -k -d 1s
>    done
> 
> The kernel first displays several WARN traces with the following pattern:
> 
>    WARNING: CPU: 1 PID: 1822 at kernel/trace/trace_osnoise.c:1959 stop_kthread+0xb7/0xc0

The line number doesn't match up for me; is this the first or second
WARN_ON in that function?

> and finally a null pointer reference BUG:
> 
>    BUG: kernel NULL pointer dereference, address: 0000000000000030
>    ...
>    CPU: 1 UID: 0 PID: 2155 Comm: timerlatu/1
>    ...
>    Call Trace:
>      ...
>      ? timerlat_fd_read+0xf2/0x370
>      ? timerlat_fd_read+0xee/0x370
>      vfs_read+0xe8/0x370
>      ksys_read+0x6d/0xf0
>      do_syscall_64+0x7d/0x160
>      ...
>      entry_SYSCALL_64_after_hwframe+0x76/0x7e

What's the actual fault location?  And those ? lines in the call trace
are "considered to be additional clues" rather than actual unwound
frames; what was in the ... above them?

>  static int osnoise_options_open(struct inode *inode, struct file *file)
>  {
>       return seq_open(file, &osnoise_options_seq_ops);
> @@ -2229,6 +2254,10 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf,
>       if (option < 0)
>               return -EINVAL;
>  
> +     retval = osnoise_validate_option(option, enable);
> +     if (retval != 0)
> +             return retval;
> +
>       /*
>        * trace_types_lock is taken to avoid concurrency on start/stop.
>        */

Shouldn't this be done under interface_lock to avoid concurrent
timerlat_fd_open()?  FWIW, your test script doesn't appear to cover the
case of option setting racing with timerlat starting (due to the
2-second delay).

Of course, this is complicated by stop_per_cpu_kthreads() happening
before interface_lock is acquired.  Do we know why that happens outside
the lock?  That might even be the actual cause of this bug.

Though even in the non-race case, we might still want to return -EBUSY
rather than just killing the thread (which might still have races since
we don't wait for the user thread to die).
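
To make the -EBUSY idea concrete, here is a rough user-space model, not
kernel code.  Every name below is made up for illustration and nothing is
taken from trace_osnoise.c; a pthread mutex stands in for interface_lock
and a counter stands in for open timerlat fds.  The only point is that the
check and the option update happen under the same lock the open path takes.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these exist in the kernel. */
static pthread_mutex_t model_interface_lock = PTHREAD_MUTEX_INITIALIZER;
static int model_user_threads;	/* open timerlat fds, guarded by the lock */
static bool model_workload_opt;	/* models the OSNOISE_WORKLOAD option */

/* Models the open path: register a user-space timerlat thread. */
static void model_fd_open(void)
{
	pthread_mutex_lock(&model_interface_lock);
	model_user_threads++;
	pthread_mutex_unlock(&model_interface_lock);
}

/* Models the release path: the user-space thread has gone away. */
static void model_fd_release(void)
{
	pthread_mutex_lock(&model_interface_lock);
	model_user_threads--;
	pthread_mutex_unlock(&model_interface_lock);
}

/*
 * Models the option write: refuse to enable the kernel workload while a
 * user-space thread is registered, instead of killing that thread.
 * Check and update are done under one lock so an open cannot slip in
 * between them.
 */
static int model_set_workload(bool enable)
{
	int ret = 0;

	pthread_mutex_lock(&model_interface_lock);
	if (enable && model_user_threads > 0)
		ret = -EBUSY;
	else
		model_workload_opt = enable;
	pthread_mutex_unlock(&model_interface_lock);

	return ret;
}
```

With this shape the writer gets an error back and the running user thread
is left alone; whether the real code can take the lock at that point
depends on the stop_per_cpu_kthreads() ordering question above.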

-Crystal

