Hi Zhenlei,

Thanks for the suggestion. But is there a reason not to trust the core dump? Especially since the sum of all `mbuf` lengths matches the value shown in the stack frame exactly.
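To spell out why I trust the numbers: the unsigned total over the mbuf chain computed in the kgdb session below, 2149804648, truncates to exactly the len = -2145162648 seen in frame #4 once it goes through the `(int)sb->sb_ccc` cast discussed further down. Here is a minimal userland sketch of that truncation (the value comes straight from the dump; the wrap-around assumes the usual two's-complement behaviour):

```
#include <stdio.h>

int
main(void)
{
	/* Sum of m_len over the socket buffer's mbuf chain, as computed
	 * in the kgdb session quoted below; it exceeds INT_MAX. */
	unsigned int total = 2149804648u;

	/* The same kind of cast as (int)sb->sb_ccc: on the usual
	 * two's-complement targets the value wraps to a negative int. */
	int len = (int)total;

	printf("total = %u, (int)total = %d\n", total, len);
	/* Prints: total = 2149804648, (int)total = -2145162648 */
	return (0);
}
```

(A simplified model of the resulting flush-loop stall is appended after the quoted thread at the bottom of this mail.)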

> On Aug 29, 2025, at 5:08 PM, Paul <de...@ukr.net> wrote:
> 
> > Hi!
> > 
> > We have finally managed to reproduce this issue with the help of iperf3.
> > 
> > We triggered a kernel panic with `sysctl debug.kdb.panic=1` to collect a
> > core dump once the iperf3 process had entered the infinite loop.
> > 
> > Here is the basic analysis, please ask for more if required:
> > 
> > (kgdb) bt
> > #0  cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1530
> > #1  0xffffffff808deec8 in ipi_nmi_handler () at /usr/src/sys/x86/x86/mp_x86.c:1487
> > #2  0xffffffff8090c7af in trap (frame=0xfffffe03edeb8f30) at /usr/src/sys/amd64/amd64/trap.c:248
> > #3  <signal handler called>
> > #4  0xffffffff80640e30 in sbcut_internal (sb=sb@entry=0xfffff801b0ec6e00, len=-2145162648) at /usr/src/sys/kern/uipc_sockbuf.c:1585
> > #5  0xffffffff80640d78 in sbflush_internal (sb=<optimized out>) at /usr/src/sys/kern/uipc_sockbuf.c:1547
> > #6  sbflush_locked (sb=<optimized out>) at /usr/src/sys/kern/uipc_sockbuf.c:1559
> > #7  sbflush (sb=sb@entry=0xfffff801b0ec6e00) at /usr/src/sys/kern/uipc_sockbuf.c:1567
> > #8  0xffffffff807488f3 in tcp_disconnect (tp=0xfffff8034a572a80) at /usr/src/sys/netinet/tcp_usrreq.c:2702
> > #9  0xffffffff80743897 in tcp_usr_disconnect (so=<optimized out>) at /usr/src/sys/netinet/tcp_usrreq.c:704
> > #10 0xffffffff80643655 in sodisconnect (so=0xfffff801b0ec6c00) at /usr/src/sys/kern/uipc_socket.c:2085
> > #11 soclose (so=0xfffff801b0ec6c00) at /usr/src/sys/kern/uipc_socket.c:1920
> > #12 0xffffffff8053e921 in fo_close (fp=0xfffff801b0ec6e00, fp@entry=0xfffff801a51ab410, td=0x80236a68, td@entry=0xfffff801a51ab410) at /usr/src/sys/sys/file.h:397
> > #13 _fdrop (fp=0xfffff801b0ec6e00, fp@entry=0xfffff801a51ab410, td=0x80236a68, td@entry=0xfffff80276bcd740) at /usr/src/sys/kern/kern_descrip.c:3756
> > #14 0xffffffff80541aca in closef (fp=0xfffff801a51ab410, td=0xfffff80276bcd740) at /usr/src/sys/kern/kern_descrip.c:2851
> > #15 0xffffffff80545e08 in closefp_impl (fdp=<optimized out>, fd=<optimized out>, fp=<optimized out>, td=<optimized out>, audit=<optimized out>) at /usr/src/sys/kern/kern_descrip.c:1324
> > #16 0xffffffff8090de97 in syscallenter (td=0xfffff80276bcd740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
> > #17 amd64_syscall (td=0xfffff80276bcd740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
> > #18 <signal handler called>
> > #19 0x000000082510c87a in ?? ()
> > Backtrace stopped: Cannot access memory at address 0x820dd0058
> > 
> > (kgdb) fr 4
> > #4  0xffffffff80640e30 in sbcut_internal (sb=sb@entry=0xfffff801b0ec6e00, len=-2145162648) at /usr/src/sys/kern/uipc_sockbuf.c:1585
> > 1585            next = (m = sb->sb_mb) ? m->m_nextpkt : 0;
> > (kgdb) p len
> > $33 = -2145162648
> > (kgdb) set $total=(unsigned int)0
> > (kgdb) set $count=(unsigned int)0
> > (kgdb) set $next=(struct mbuf*)sb->sb_mb
> > (kgdb) while ($next != 0)
> >> set $total=$total+$next.m_len
> >> set $count=$count+1
> >> set $next=$next.m_next
> >> end
> > (kgdb) p $total
> > $34 = 2149804648
> > (kgdb) p (int)$total
> > $35 = -2145162648
> > (kgdb) p $count
> > $36 = 1484679
> > 
> > As mentioned before, the problem occurs when the socket is being closed.
> > Now we know why.
> > Because of a cast here:
> > 
> >     m_freem(sbcut_internal(sb, (int)sb->sb_ccc));
> > 
> > When `sb->sb_ccc` grows above the maximum value that can be stored in an
> > `int`, this cast leads to an infinite loop within this function, since a
> > `len` smaller than 0 is essentially equivalent to 0 in `sbcut_internal()`.
> 
> Just a note: there's a KASSERT in sbcut_internal() that checks the parameter len,
> 
> ```
> static struct mbuf *
> sbcut_internal(struct sockbuf *sb, int len)
> {
> 	struct mbuf *m, *next, *mfree;
> 	bool is_tls;
> 
> 	KASSERT(len >= 0, ("%s: len is %d but it is supposed to be >= 0",
> 	    __func__, len));
> 	...
> }
> ```
> 
> so you can retest with a kernel built with `options INVARIANTS` to verify that, if the overflow occurs.
> 
> > But that's just part of the problem. Why does the buffer grow this large?
> > Our limit is:
> > 
> > kern.ipc.maxsockbuf=157286400
> > 
> > Is it expected to grow so far beyond this limit?
> > 
> > The way we managed to reproduce the issue is to simply spam one host with
> > traffic from another host:
> > 
> > Client:
> > 
> > iperf3 --parallel 8 --time 10 --bidir --client <server-IP>
> > 
> > Server (where the bug occurs):
> > 
> > iperf3 --server
> > 
> > My guess is that the limit is not applied on a per-packet basis but at some
> > other trigger points, and when there is a burst we accumulate so many
> > packets that their total size becomes > 2147483647. The fact that this is a
> > 100GbE card makes it much more likely.
> > 
> >> Hi!
> >> This is the 4th time now that our server has had to be hard-rebooted, the
> >> last two within the span of two hours. It had only been a week since the
> >> server went into production.
> >> 
> >> ...
> >> 
> 
> Best regards,
> Zhenlei
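
P.S. For anyone skimming the archive, here is a simplified, userland-only model of the stall described above. It is not the uipc_sockbuf.c code: `toy_sockbuf` and `toy_sbcut()` are invented for illustration, and the only behaviour carried over from the discussion is that a trim request with len <= 0 removes nothing.

```
#include <stdio.h>

/* Toy stand-in for struct sockbuf: only the byte count matters here. */
struct toy_sockbuf {
	unsigned int ccc;	/* bytes buffered, like the kernel's u_int sb_ccc */
};

/* Toy stand-in for sbcut_internal(): len <= 0 trims nothing. */
static unsigned int
toy_sbcut(struct toy_sockbuf *sb, int len)
{
	unsigned int cut;

	if (len <= 0)
		return (0);
	cut = (unsigned int)len < sb->ccc ? (unsigned int)len : sb->ccc;
	sb->ccc -= cut;
	return (cut);
}

int
main(void)
{
	/* The byte count observed in the core dump: larger than INT_MAX. */
	struct toy_sockbuf sb = { .ccc = 2149804648u };
	unsigned int removed;
	int pass;

	/*
	 * Toy stand-in for the flush loop: the (int) cast turns the count
	 * negative, nothing is ever removed, so the real loop would spin
	 * forever; here we give up after three passes.
	 */
	for (pass = 1; sb.ccc != 0 && pass <= 3; pass++) {
		removed = toy_sbcut(&sb, (int)sb.ccc);
		printf("pass %d: (int)ccc = %d, removed %u, remaining %u\n",
		    pass, (int)sb.ccc, removed, sb.ccc);
	}
	return (0);
}
```

Every pass hands a negative len to the trim routine, nothing is removed, and the byte count never reaches zero, which matches the hang we see at soclose() time.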