> On Aug 30, 2025, at 2:33 AM, Paul <de...@ukr.net> wrote:
> 
> Hi Zhenlei,
> 
> Thanks for the suggestion.
> 
> But is there a reason not to trust a core dump?
> Especially when the sum of all `mbuf`s matches the value shown in the frame
> stack exactly.

Yes, you can trust the core dump.

For RELEASE kernels that is almost the only method, since users normally do not
run a debug kernel in production.
Developers can also fetch the same kernel and debug symbols and use addr2line
to diagnose.

For the stable branches and CURRENT, that may vary. If users run a custom
config or compile the kernel themselves,
then it is not easy for developers to obtain the same kernel and debug symbols
to diagnose.
In that case `options INVARIANTS` is much more straightforward, assuming the
user can recompile.

Anyway, either should be fine.

Best regards,
Zhenlei

> 
>> 
>> 
>>> On Aug 29, 2025, at 5:08 PM, Paul <de...@ukr.net> wrote:
>>> 
>>> 
>>> Hi!
>>> 
>>> 
>>> We have finally managed to reproduce this issue with the help of iperf3.
>>> 
>>> We triggered a kernel panic with `sysctl debug.kdb.panic=1` to collect a
>>> core dump once the iperf3 process had entered the infinite loop.
>>> 
>>> Here is the basic analysis, please ask for more if required:
>>> 
>>> (kgdb) bt
>>> #0  cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1530
>>> #1  0xffffffff808deec8 in ipi_nmi_handler () at 
>>> /usr/src/sys/x86/x86/mp_x86.c:1487
>>> #2  0xffffffff8090c7af in trap (frame=0xfffffe03edeb8f30) at 
>>> /usr/src/sys/amd64/amd64/trap.c:248
>>> #3  <signal handler called>
>>> #4  0xffffffff80640e30 in sbcut_internal (sb=sb@entry=0xfffff801b0ec6e00, 
>>> len=-2145162648) at /usr/src/sys/kern/uipc_sockbuf.c:1585
>>> #5  0xffffffff80640d78 in sbflush_internal (sb=<optimized out>) at 
>>> /usr/src/sys/kern/uipc_sockbuf.c:1547
>>> #6  sbflush_locked (sb=<optimized out>) at 
>>> /usr/src/sys/kern/uipc_sockbuf.c:1559
>>> #7  sbflush (sb=sb@entry=0xfffff801b0ec6e00) at 
>>> /usr/src/sys/kern/uipc_sockbuf.c:1567
>>> #8  0xffffffff807488f3 in tcp_disconnect (tp=0xfffff8034a572a80) at 
>>> /usr/src/sys/netinet/tcp_usrreq.c:2702
>>> #9  0xffffffff80743897 in tcp_usr_disconnect (so=<optimized out>) at 
>>> /usr/src/sys/netinet/tcp_usrreq.c:704
>>> #10 0xffffffff80643655 in sodisconnect (so=0xfffff801b0ec6c00) at 
>>> /usr/src/sys/kern/uipc_socket.c:2085
>>> #11 soclose (so=0xfffff801b0ec6c00) at /usr/src/sys/kern/uipc_socket.c:1920
>>> #12 0xffffffff8053e921 in fo_close (fp=0xfffff801b0ec6e00, 
>>> fp@entry=0xfffff801a51ab410, td=0x80236a68, td@entry=0xfffff801a51ab410) at 
>>> /usr/src/sys/sys/file.h:397
>>> #13 _fdrop (fp=0xfffff801b0ec6e00, fp@entry=0xfffff801a51ab410, 
>>> td=0x80236a68, td@entry=0xfffff80276bcd740) at 
>>> /usr/src/sys/kern/kern_descrip.c:3756
>>> #14 0xffffffff80541aca in closef (fp=0xfffff801a51ab410, 
>>> td=0xfffff80276bcd740) at /usr/src/sys/kern/kern_descrip.c:2851
>>> #15 0xffffffff80545e08 in closefp_impl (fdp=<optimized out>, fd=<optimized 
>>> out>, fp=<optimized out>, td=<optimized out>, audit=<optimized out>) at 
>>> /usr/src/sys/kern/kern_descrip.c:1324
>>> #16 0xffffffff8090de97 in syscallenter (td=0xfffff80276bcd740) at 
>>> /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
>>> #17 amd64_syscall (td=0xfffff80276bcd740, traced=0) at 
>>> /usr/src/sys/amd64/amd64/trap.c:1241
>>> #18 <signal handler called>
>>> #19 0x000000082510c87a in ?? ()
>>> Backtrace stopped: Cannot access memory at address 0x820dd0058
>>> (kgdb) fr 4
>>> #4  0xffffffff80640e30 in sbcut_internal (sb=sb@entry=0xfffff801b0ec6e00, 
>>> len=-2145162648) at /usr/src/sys/kern/uipc_sockbuf.c:1585
>>> 1585                next = (m = sb->sb_mb) ? m->m_nextpkt : 0;
>>> (kgdb) p len
>>> $33 = -2145162648
>>> (kgdb) set $total=(unsigned int)0
>>> (kgdb) set $count=(unsigned int)0
>>> (kgdb) set $next=(struct mbuf*)sb->sb_mb
>>> (kgdb) while ($next != 0)
>>>> set $total=$total+$next.m_len
>>>> set $count=$count+1
>>>> set $next=$next.m_next
>>>> end
>>> (kgdb) p $total
>>> $34 = 2149804648
>>> (kgdb) p (int)$total
>>> $35 = -2145162648
>>> (kgdb) p $count
>>> $36 = 1484679
>>> 
>>> 
>>> As mentioned before, the problem occurs when the socket is being closed.
>>> Now we know why: because of a cast here:
>>> 
>>> m_freem(sbcut_internal(sb, (int)sb->sb_ccc));
>>> 
>>> When `sb->sb_ccc` grows above INT_MAX (2147483647), this cast produces a
>>> negative value, which leads to an infinite loop: a `len` smaller than 0 is
>>> effectively equivalent to 0 in `sbcut_internal()`, so nothing is removed
>>> from the buffer and the flush loop never makes progress.
>> 
>> Just a note. There's KASSERT in sbcut_internal() to check parameter len,
>> 
>> ```
>> static struct mbuf *
>> sbcut_internal(struct sockbuf *sb, int len)
>> {
>>        struct mbuf *m, *next, *mfree;
>>        bool is_tls;
>> 
>>        KASSERT(len >= 0, ("%s: len is %d but it is supposed to be >= 0",
>>            __func__, len));
>> ...
>> }
>> ```
>> 
>> so you can retest with `options INVARIANTS` enabled in the kernel to verify
>> whether the overflow occurs.
>> 
>>> 
>>> But that's just a part of a problem. Why does the buffer grow this large? 
>>> Our limit is:
>>> 
>>> kern.ipc.maxsockbuf=157286400
>>> 
>>> Is it expected to grow so far beyond this limit?
>>> 
>>> 
>>> The way we managed to reproduce the issue is to simply spam one host with
>>> traffic from another host:
>>> 
>>> Client:
>>> 
>>> iperf3 --parallel 8 --time 10 --bidir --client <server-IP>
>>> 
>>> Server (where bug occurs):
>>> 
>>> iperf3 --server
>>> 
>>> 
>>> My guess is that the limit is not enforced on a per-packet basis, but
>>> rather at some other trigger points.
>>> When there is a burst, we manage to accumulate so many packets that their
>>> total size exceeds 2147483647.
>>> The fact that this is a 100GbE card makes this much more likely.
>>> 
>>>> Hi!
>>>> It is now the fourth time that our server has had to be hard-rebooted.
>>>> The last two were within a span of two hours.
>>>> It had only been a week since the server went into production.
>>>> 
>>>> 
>>>> ...
>>>> 
>>> 
>> 
>> 
>> Best regards,
>> Zhenlei
>> 



