Re: kernel diagnostic assertion "!_kernel_lock_held()" failed

Alexander Bluhm Thu, 06 Jul 2023 09:18:31 -0700

On Thu, Jul 06, 2023 at 02:14:09PM +0000, Valdrin MUJA wrote:
> I've applied your patch but crashed again. Here it is:
> ddb{1}> show panic
> *cpu1: kernel diagnostic assertion "refcnt_read(&rt->rt_refcnt) >= 2" failed: file "/usr/src/sys/net/rtable.c", line 828


The KASSERT I added seems to be wrong.  I copied it from above
without thinking enough.  Just remove it; an updated diff is below.
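
Going by the comments in the diff, the idea is that an_rtlist holds
its own reference on rt for as long as rt is linked: take it on
insert, drop it on remove.  A minimal userland sketch of that rule
follows.  All names in it (struct entry, struct list, entry_ref,
entry_rele, list_insert, list_remove) are hypothetical stand-ins and
not the kernel's rtentry, refcnt or SRPL interfaces, and it leaves
out locking entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for rtentry and an_rtlist; not kernel types. */
struct entry {
	unsigned int	 refcnt;
	struct entry	*next;
};

struct list {
	struct entry	*head;
};

/* Take one reference. */
static void
entry_ref(struct entry *e)
{
	e->refcnt++;
}

/* Drop one reference; returns 1 when the last reference is gone. */
static int
entry_rele(struct entry *e)
{
	assert(e->refcnt > 0);
	return (--e->refcnt == 0);
}

/* Linking takes a reference on behalf of the list. */
static void
list_insert(struct list *l, struct entry *e)
{
	entry_ref(e);
	e->next = l->head;
	l->head = e;
}

/*
 * Unlinking gives the list's reference back.  Returns -1 if e is not
 * on the list, otherwise the result of entry_rele().
 */
static int
list_remove(struct list *l, struct entry *e)
{
	struct entry **pp;

	for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == e) {
			*pp = e->next;
			e->next = NULL;
			return (entry_rele(e));
		}
	}
	return (-1);
}
```

With a caller that already holds one reference, the count is 2 while
the entry is linked and back to 1 after list_remove().  A ">= 2"
check before removal is only valid where the caller is known to hold
its own reference.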

I compared your crash 3 and 4 output:

TEST1> uvm_fault(0xfffffd826717bcc0, 0x8, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      srp_get_locked+0x11:    movq    0(%rdi),%rax
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*225335  47125      0           0          0    1  bgpd
 231752  78299     73   0x1100010          0    3  syslogd
 344909   6421      0     0x14000      0x200    2  wg_handshake
 361415  98860      0     0x14000      0x200    0  reaper

SPOKE1> uvm_fault(0xfffffd81d5995878, 0x8, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      srp_get_locked+0x11:    movq    0(%rdi),%rax
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 448769  98731      0    0x100002          0    3  sh
 350289  69698     73   0x1100010          0    0  syslogd
*114462  84824      0           0          0    1  bgpd
 256495  50081      0     0x14000      0x200    2  wg_handshake

It is interesting that bgpd and wireguard are running in both cases
when it crashes.  Unfortunately your mail does not include this
output for crash 1 and 2.  It is printed immediately when the machine
crashes.  Do you have it in some console history?

I see a lot of different workloads on your machine.  That makes it
harder to identify the subsystem that has the bug.  I see bgpd(8)
and wg(4) doing things with network and routing.  Is there something
else?

What has changed to make these crashes happen?  New workload?  New
machine?  Upgrade to 7.3?  Was it stable with 7.2?  ...

Thanks for testing.

bluhm

Index: net/rtable.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtable.c,v
retrieving revision 1.82
diff -u -p -r1.82 rtable.c
--- net/rtable.c        19 Apr 2023 17:42:47 -0000      1.82
+++ net/rtable.c        6 Jul 2023 15:56:04 -0000
@@ -604,6 +604,11 @@ rtable_insert(unsigned int rtableid, str
        SRPL_INSERT_HEAD_LOCKED(&rt_rc, &an->an_rtlist, rt, rt_next);
 
        prev = art_insert(ar, an, addr, plen);
+       if (prev == an) {
+               rw_exit_write(&ar->ar_lock);
+               /* keep the refcount for rt while it is in an_rtlist */
+               return (0);
+       }
        if (prev != an) {
                SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry,
                    rt_next);
@@ -689,9 +694,10 @@ rtable_delete(unsigned int rtableid, str
                npaths++;
 
        if (npaths > 1) {
-               KASSERT(refcnt_read(&rt->rt_refcnt) >= 1);
+               KASSERT(refcnt_read(&rt->rt_refcnt) >= 2);
                SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry,
                    rt_next);
+               rtfree(rt);
 
                mrt = SRPL_FIRST_LOCKED(&an->an_rtlist);
                if (npaths == 2)
@@ -703,8 +709,9 @@ rtable_delete(unsigned int rtableid, str
        if (art_delete(ar, an, addr, plen) == NULL)
                panic("art_delete failed to find node %p", an);
 
-       KASSERT(refcnt_read(&rt->rt_refcnt) >= 1);
+       KASSERT(refcnt_read(&rt->rt_refcnt) >= 2);
        SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry, rt_next);
+       rtfree(rt);
        art_put(an);
 
 leave:
@@ -821,12 +828,10 @@ rtable_mpath_reprio(unsigned int rtablei
                 */
                rt->rt_priority = prio;
        } else {
-               rtref(rt); /* keep rt alive in between remove and insert */
                SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist,
                    rt, rtentry, rt_next);
                rt->rt_priority = prio;
                rtable_mpath_insert(an, rt);
-               rtfree(rt);
                error = EAGAIN;
        }
        rw_exit_write(&ar->ar_lock);
@@ -839,6 +844,9 @@ rtable_mpath_insert(struct art_node *an,
 {
        struct rtentry                  *mrt, *prt = NULL;
        uint8_t                          prio = rt->rt_priority;
+
+       /* increment the refcount for rt while it is in an_rtlist */
+       rtref(rt);
 
        if ((mrt = SRPL_FIRST_LOCKED(&an->an_rtlist)) == NULL) {
                SRPL_INSERT_HEAD_LOCKED(&rt_rc, &an->an_rtlist, rt, rt_next);
