On Thu, Jul 06, 2023 at 02:14:09PM +0000, Valdrin MUJA wrote:
> I've applied your patch but crashed again. Here it is:
> ddb{1}> show panic
> *cpu1: kernel diagnostic assertion "refcnt_read(&rt->rt_refcnt) >= 2" failed:
> f
> ile "/usr/src/sys/net/rtable.c", line 828
This kassert I added seems to be wrong. I copied it from above
without thinking enough. Just remove it, updated diff below.
I compared your crash 3 and 4 output:
TEST1> uvm_fault(0xfffffd826717bcc0, 0x8, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at srp_get_locked+0x11: movq 0(%rdi),%rax
TID PID UID PRFLAGS PFLAGS CPU COMMAND
*225335 47125 0 0 0 1 bgpd
231752 78299 73 0x1100010 0 3 syslogd
344909 6421 0 0x14000 0x200 2 wg_handshake
361415 98860 0 0x14000 0x200 0 reaper
SPOKE1> uvm_fault(0xfffffd81d5995878, 0x8, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at srp_get_locked+0x11: movq 0(%rdi),%rax
TID PID UID PRFLAGS PFLAGS CPU COMMAND
448769 98731 0 0x100002 0 3 sh
350289 69698 73 0x1100010 0 0 syslogd
*114462 84824 0 0 0 1 bgpd
256495 50081 0 0x14000 0x200 2 wg_handshake
It is interesting that bgpd and wireguard are running in both cases
when it crashes. Unfortunately you mail does not include this
output for crash 1 and 2. It is printed immediately when the machine
crashes. Do you have it in some console history?
I see a lot of different workload on your machine. That makes it
harder to identify the subsystem that has the bug. I see bgpd(8)
and wg(2) doing things with network and routing. Is there something
else?
What has changed to make these crashes happen? New workload? New
machine? Upgrade to 7.3? Was it stable with 7.2? ...
Thanks for testing.
bluhm
Index: net/rtable.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtable.c,v
retrieving revision 1.82
diff -u -p -r1.82 rtable.c
--- net/rtable.c 19 Apr 2023 17:42:47 -0000 1.82
+++ net/rtable.c 6 Jul 2023 15:56:04 -0000
@@ -604,6 +604,11 @@ rtable_insert(unsigned int rtableid, str
SRPL_INSERT_HEAD_LOCKED(&rt_rc, &an->an_rtlist, rt, rt_next);
prev = art_insert(ar, an, addr, plen);
+ if (prev == an) {
+ rw_exit_write(&ar->ar_lock);
+ /* keep the refcount for rt while it is in an_rtlist */
+ return (0);
+ }
if (prev != an) {
SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry,
rt_next);
@@ -689,9 +694,10 @@ rtable_delete(unsigned int rtableid, str
npaths++;
if (npaths > 1) {
- KASSERT(refcnt_read(&rt->rt_refcnt) >= 1);
+ KASSERT(refcnt_read(&rt->rt_refcnt) >= 2);
SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry,
rt_next);
+ rtfree(rt);
mrt = SRPL_FIRST_LOCKED(&an->an_rtlist);
if (npaths == 2)
@@ -703,8 +709,9 @@ rtable_delete(unsigned int rtableid, str
if (art_delete(ar, an, addr, plen) == NULL)
panic("art_delete failed to find node %p", an);
- KASSERT(refcnt_read(&rt->rt_refcnt) >= 1);
+ KASSERT(refcnt_read(&rt->rt_refcnt) >= 2);
SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry, rt_next);
+ rtfree(rt);
art_put(an);
leave:
@@ -821,12 +828,10 @@ rtable_mpath_reprio(unsigned int rtablei
*/
rt->rt_priority = prio;
} else {
- rtref(rt); /* keep rt alive in between remove and insert */
SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist,
rt, rtentry, rt_next);
rt->rt_priority = prio;
rtable_mpath_insert(an, rt);
- rtfree(rt);
error = EAGAIN;
}
rw_exit_write(&ar->ar_lock);
@@ -839,6 +844,9 @@ rtable_mpath_insert(struct art_node *an,
{
struct rtentry *mrt, *prt = NULL;
uint8_t prio = rt->rt_priority;
+
+ /* increment the refcount for rt while it is in an_rtlist */
+ rtref(rt);
if ((mrt = SRPL_FIRST_LOCKED(&an->an_rtlist)) == NULL) {
SRPL_INSERT_HEAD_LOCKED(&rt_rc, &an->an_rtlist, rt, rt_next);