BTW I have run more tests to see any performance gains and I failed to
notice any significant differences. So maybe atomics are not that expensive
in this case. But I think my proposal still stands.
I also noticed that if I bump the number of iterations to over 100K at some
point the client fails to connect with EADDRNOTAVAIL on both x64 and
aarch64:
in_pcb.cc: EADDRNOTAVAIL: 0, last - first:55535
tst-tcp1: client: connect() failed after i=111840!
[tst-tcp1: client]: Address not available
I pinpointed it to this fragment of in_pcb_lport() in
bsd/sys/netinet/in_pcb.cc (see the lines 605 and 606 in bold):
598 if (dorandom)
599 *lastport = first + (arc4random() % (last - first));
600
601 count = last - first;
602
603 do {
604 if (count-- < 0) /* completely used? */ {
*605 printf("in_pcb.cc: EADDRNOTAVAIL: 0, last -
first:%ld\n", last - first); 606 return
(EADDRNOTAVAIL);*
607 }
608 ++*lastport;
609 if (*lastport < first || *lastport > last)
610 *lastport = first;
611 lport = htons(*lastport);
612
613 #ifdef INET6
614 if ((inp->inp_vflag & INP_IPV6) != 0)
615 tmpinp = in6_pcblookup_local(pcbinfo,
616 &inp->in6p_laddr, lport, lookupflags,
cred);
617 #endif
618 #if defined(INET) && defined(INET6)
619 else
620 #endif
621 #ifdef INET
622 tmpinp = in_pcblookup_local(pcbinfo, laddr,
623 lport, lookupflags, cred);
624 #endif
625 } while (tmpinp != NULL);
626
627 #ifdef INET
628 if ((inp->inp_vflag & (INP_IPV4|INP_IPV6)) == INP_IPV4)
629 laddrp->s_addr = laddr.s_addr;
630 #endif
631 *lportp = lport;
632
633 return (0);
634 }
I wonder if we somehow ran out of available ports because maybe we do not
clean/make them available fast enough?
On Monday, March 28, 2022 at 9:09:04 PM UTC-4 Waldek Kozaczuk wrote:
> Hi,
>
> As I was working on a patch to address the issue
> https://github.com/cloudius-systems/osv/issues/1087, I discovered that
> occasionally it would crash or hang while running a test
> tst-bsd-tcp1-zsndrcv.cc. The crash would look like this:
>
> page fault outside application, addr: 0x0000000000000000
> [registers]
> PC: 0x0000000fc011ca5c <ifa_ifwithnet+152>
> X00: 0x0000000000000012 X01: 0xffff90004088f4a0 X02: 0x0000000000000000
> X03: 0xffff800040f668c0 X04: 0xffffa00040e778e8 X05: 0xffff800040f66950
> X06: 0x0000000000000000 X07: 0x0000000000000000 X08: 0x0000000000000000
> X09: 0x0000000000000000 X10: 0x0000000000000000 X11: 0x0000000000000000
> X12: 0x0000000000004000 X13: 0x0000000000000000 X14: 0x00000000000000b0
> X15: 0x0000000000000000 X16: 0x0000000000000001 X17: 0x0000000000000001
> X18: 0x0000000055697ee1 X19: 0xffffa0004088e800 X20: 0xffff800040f66840
> X21: 0xffff90004088f000 X22: 0x0000000fc0672838 X23: 0x0000000000000000
> X24: 0x0000000000000000 X25: 0xffff90004088f4a0 X26: 0x0000000000000001
> X27: 0x00000000ffffffff X28: 0x0000000000000002 X29: 0xffff800040f667a0
> X30: 0x0000000fc011ca2c SP: 0xffff800040f667a0 ESR: 0x0000000096000006
> PSTATE: 0x0000000020000345
> Aborted
>
> [backtrace]
> 0x0000000fc014c2e4 <mmu::vm_fault(unsigned long, exception_frame*)+724>
> 0x0000000fc017e4c8 <page_fault+100>
> 0x0000000fc017e254 <???+-1072176556>
> 0x0000000fc01234dc <???+-1072548644>
> 0x0000000fc0124a80 <in_pcbconnect_setup+192>
> 0x0000000fc0145920 <???+-1072408288>
> 0x0000000fc0145afc <???+-1072407812>
> 0x0000000fc0100bb8 <kern_connect+120>
> 0x0000000fc010de7c <linux_connect+60>
> 0x0000000fc010233c <connect+12>
> 0x00001000000376a0 <???+226976>
> 0x00001000000377e0 <???+227296>
> 0x0000000fc0257440 <thread_main_c+32>
> 0x0000000fc017d784 <???+-1072179324>
>
> ---------
> (gdb) bt
> #0 processor::wfi () at arch/aarch64/processor.hh:52
> #1 processor::halt_no_interrupts () at arch/aarch64/processor.hh:52
> #2 arch::halt_no_interrupts () at arch/aarch64/arch.hh:48
> #3 osv::halt () at arch/aarch64/power.cc:24
> #4 0x0000000fc00e9fcc in abort (fmt=fmt@entry=0xfc0481190 "Aborted\n") at
> runtime.cc:142
> #5 0x0000000fc00a0654 in abort () at runtime.cc:106
> #6 0x0000000fc014c2e8 in mmu::vm_sigsegv (ef=0xffff800040f66680, addr=0)
> at core/mmu.cc:1420
> #7 mmu::vm_sigsegv (ef=0xffff800040f66680, addr=0) at core/mmu.cc:1414
> #8 mmu::vm_fault (addr=0, addr@entry=1, ef=ef@entry=0xffff800040f66680)
> at core/mmu.cc:1442
> #9 0x0000000fc017e4cc in page_fault (ef=0xffff800040f66680) at
> arch/aarch64/mmu.cc:50
> #10 <signal handler called>
> #11 0x0000000fc011ca5c in ifa_ifwithnet
> (addr=addr@entry=0xffff800040f66840, ignore_ptp=0) at bsd/sys/net/if.cc:1434
> #12 0x0000000fc01234e0 in in_pcbladdr (inp=inp@entry=0xffffa000411c1600,
> faddr=faddr@entry=0xffff800040f669d8, laddr=laddr@entry=0xffff800040f669d0,
> cred=0x0)
> at bsd/sys/netinet/in_pcb.cc:969
> #13 0x0000000fc0124a84 in in_pcbconnect_setup
> (inp=inp@entry=0xffffa000411c1600, nam=nam@entry=0xffffa00040b046d0,
> laddrp=laddrp@entry=0xffff800040f66a30,
> lportp=lportp@entry=0xffff800040f66a2e,
> faddrp=faddrp@entry=0xffffa000411c1664,
> fportp=fportp@entry=0xffffa000411c1654,
> oinpp=oinpp@entry=0xffff800040f66a38,
> cred=cred@entry=0x0) at bsd/sys/netinet/in_pcb.cc:1056
> #14 0x0000000fc0145924 in tcp_connect (tp=tp@entry=0xffffa00040e5e400,
> nam=nam@entry=0xffffa00040b046d0, td=<optimized out>) at
> bsd/sys/netinet/tcp_usrreq.cc:1089
> #15 0x0000000fc0145b00 in tcp_usr_connect (td=<optimized out>,
> nam=0xffffa00040b046d0, so=<optimized out>) at
> bsd/sys/netinet/tcp_usrreq.cc:463
> #16 tcp_usr_connect (so=<optimized out>, nam=0xffffa00040b046d0,
> td=<optimized out>) at bsd/sys/netinet/tcp_usrreq.cc:436
> #17 0x0000000fc0100bbc in kern_connect (fd=fd@entry=4,
> sa=0xffffa00040b046d0) at bsd/sys/kern/uipc_syscalls.cc:377
> #18 0x0000000fc010de80 in linux_connect (s=s@entry=4,
> name=name@entry=0xffff800040f66be8, namelen=namelen@entry=16) at
> bsd/sys/compat/linux/linux_socket.cc:684
> #19 0x0000000fc0102340 in connect (fd=fd@entry=4,
> addr=addr@entry=0xffff800040f66be8, len=len@entry=16) at
> bsd/sys/kern/uipc_syscalls_wrap.cc:128
> #20 0x00001000000376a4 in test_bsd_tcp1::tcp_client (this=0x2000001ffa98)
> at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zrcv.cc:169
> #21 0x00001000000377e4 in test_bsd_tcp1::run()::{lambda()#2}::operator()()
> const (__closure=<optimized out>)
> at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zrcv.cc:230
> #22 std::__invoke_impl<void,
> test_bsd_tcp1::run()::{lambda()#2}&>(std::__invoke_other,
> test_bsd_tcp1::run()::{lambda()#2}&) (__f=...)
> at /usr/include/c++/10/bits/invoke.h:60
> #23 std::__invoke_r<void,
> test_bsd_tcp1::run()::{lambda()#2}&>(std::__is_invocable&&,
> (test_bsd_tcp1::run()::{lambda()#2}&)...) (__fn=...)
> at /usr/include/c++/10/bits/invoke.h:153
> #24 std::_Function_handler<void (),
> test_bsd_tcp1::run()::{lambda()#2}>::_M_invoke(std::_Any_data const&)
> (__functor=...)
> at /usr/include/c++/10/bits/std_function.h:291
> #25 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040c72e00) at
> core/sched.cc:1267
> #26 sched::thread_main_c (t=0xffffa00040c72e00) at
> arch/aarch64/arch-switch.hh:162
> #27 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
> (gdb) frame 11
> #11 0x0000000fc011ca5c in ifa_ifwithnet
> (addr=addr@entry=0xffff800040f66840, ignore_ptp=0) at bsd/sys/net/if.cc:1434
> 1434 if (ifa->ifa_addr->sa_family != af)
> (gdb) p ifa->ifa_addr
> $1 = (bsd_sockaddr *) 0xffff80004088e800
> (gdb) p ifa->ifa_addr->sa_family
> $2 = 232 '\350'
> (gdb) p af
> $3 = 2
>
> As you can see the fault happened in line 1434 of if.cc:
> *1434 if (ifa->ifa_addr->sa_family != af)*
> 1435 next: continue;
> when dereferencing the field ifa_addr of the ifa pointer.
>
> In this case, gdb actually showed *ifa_addr* with a valid non-zero
> address (which is interesting) but most of the time when I connected I
> would see *ifa_addr* equal to 0.
>
> Sometimes this test would "hang", which I discovered was a tcp server
> thread left listening on a socket after tcp client thread terminated when
> its loop broke due to an error after calling connect() on a socket:
>
> for (int i=0; i < ITERATIONS; i++) {
> ...
> dbg_d("client: connecting to 127.0.0.1...");
> if ( connect(s, (struct sockaddr *)&raddr, sizeof(raddr)) < 0
> ) {
> dbg_d("client: connect() failed!");
> return -1;
> }
> }
>
> Initially, I thought that these "hangs" or crashes were not really
> specific to TCP stack, but instead just random manifestations of some bug
> in my patch fixing issue 1087. The patch was about changing preboot.S,
> boot.S, and core/mmu.cc to move aarch64 kernel code to a different much
> higher place in virtual memory so that the same loader.elf/kernel.elf could
> run correctly on both QEMU and firecracker. I speculated that somehow some
> of virt/phys or phys/virt translation routines have bugs, then possibly
> something wrong in the new assembly code updating early boot mapping page
> tables, or finally some dirty cache lines and some missing *isb* or *dsb*
> instruction. Nothing was helping and at some point, I found that strangely
> adding a nop instruction to premain() in loader.cc shifting the address of
> premain would somehow make this crash go away (or most likely I did not see
> it as often). Then I started debugging the TCP stack and code around the
> crash and "hang" to understand why exactly this was misbehaving. It had to
> do with some memory mapping bug I thought.
>
> To make this long story short I eventually took a step back and tried to
> reproduce the exact same crashes/"hangs" on the original version of the
> code before the patch as it is on master now. Also during the research, I
> realized that the other 3 tst-bsd-tcp1-*cc tests were failing in a similar
> way.
>
> tst-bsd-tcp1.cc
> tst-bsd-tcp1-zrcv.cc
> tst-bsd-tcp1-zsnd.cc
>
> All of these tests exercise basic socket API fundamentals where one thread
> acts as a TCP server listening on some port and another thread acting a
> client in a loop of 400 iterations connecting to the server, sending some
> data and reading it back as an echo, and closing the socket. The 3 -z* ones
> exercise the zero-copy API to send and/or receive data.
>
> After I increased the number of loops to something much higher like 100K I
> started seeing the exact same errors after 5-10K of iterations with the
> 0-address faults being more dominant. So clearly the issue was not caused
> by my kernel memory layout changes but due to some old bug and somehow my
> patch changes made these more frequent. I also tried to reproduce it on
> x86_64 with the same number of iterations but to no avail.
>
> Now a little by luck and little by deduction, I managed to realize that
> the page fault might be caused by something writing over/erasing the memory
> (stack overflow?) where the ifa_addr field of the *bsd_ifaddr* struct
> (ifa variable in the example above). Here is the struct defined in
> the bsd/sys/net/if_var.h (it will be quite important in this long email
> below):
>
> /*
> * The bsd_ifaddr structure contains information about one address
> * of an interface. They are maintained by the different address families,
> * are allocated and attached when an address is set, and are linked
> * together so all addresses for an interface can be located.
> *
> * NOTE: a 'struct bsd_ifaddr' is always at the beginning of a larger
> * chunk of malloc'ed memory, where we store the three addresses
> * (ifa_addr, ifa_dstaddr and ifa_netmask) referenced here.
> */
> struct bsd_ifaddr {
> struct bsd_sockaddr *ifa_addr; /* address of interface */
> struct bsd_sockaddr *ifa_dstaddr; /* other end of p-to-p
> link */
> #define ifa_broadaddr ifa_dstaddr /* broadcast address interface */
> struct bsd_sockaddr *ifa_netmask; /* used to determine
> subnet */
> struct if_data if_data; /* not all members are meaningful
> */
> struct ifnet *ifa_ifp; /* back-pointer to interface */
> TAILQ_ENTRY(bsd_ifaddr) ifa_link; /* queue macro glue */
> void (*ifa_rtrequest) /* check or clean routes (+ or
> -)'d */
> (int, struct rtentry *, struct rt_addrinfo *);
> u_short ifa_flags; /* mostly rt_flags for cloning */
> *u_int ifa_refcnt;* /* references to this structure
> */
> int ifa_metric; /* cost of going out this
> interface */
> int (*ifa_claim_addr) /* check if an addr goes to this
> if */
> (struct bsd_ifaddr *, struct bsd_sockaddr *);
> struct mtx ifa_mtx;
> };
>
> Obviously, finding what possibly erases this field would be like looking
> for a needle in the stack of hay. Then I theorized, that the memory
> allocated to hold the bsd_ifaddr may be freed somehow and used again for
> the next malloc and end up being written to by something completely else.
>
> To that end, I captured the address of the bsd_ifaddr (for example
> 0xffffa00040882800) and then put a breakpoint or printf in
> core/mempool.cc:free() to check if something frees that address. And indeed
> before the crash or hang it would be caught in this check. Eventually, I
> managed to capture what calls the free():
>
> #9 0x0000000fc0243e10 in memory::pool::free (this=0xffffa00040e94800,
> object=object@entry=0xffff80004088e800) at core/mempool.cc:334
> #10 0x0000000fc0244010 in free (object=0xffff80004088e800) at
> core/mempool.cc:1897
> #11 free (object=object@entry=0xffffa0004088e800) at core/mempool.cc:1871
> #12 0x0000000fc011c568 in *ifa_free* (ifa=0xffffa0004088e800) at
> bsd/sys/net/if.cc:1204
> #13 ifa_free (ifa=ifa@entry=0xffffa0004088e800) at bsd/sys/net/if.cc:1200
> #14 0x0000000fc012c328 in ip_preprocess_packet (m=<optimized out>,
> protocol=@0xffff800040a47f5b: 192 '\300', hlen=@0xffff800040a47f5c: 20)
> at bsd/sys/netinet/ip_input.cc:717
> #15 0x0000000fc012c734 in ip_input (m=<optimized out>) at
> bsd/sys/netinet/ip_input.cc:770
> #16 0x0000000fc0117c24 in netisr_process_workstream_proto (proto=1,
> nwsp=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr.cc:576
> #17 swi_net (arg=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr.cc:615
> #18 0x0000000fc0118f9c in netisr_osv_thread_wrapper (handler=0xfc0117b10
> <swi_net(void*)>, arg=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr1.cc:29
> #19 0x0000000fc0119010 in operator() (__closure=<optimized out>,
> __closure=<optimized out>) at bsd/sys/net/netisr1.cc:39
> #20 std::__invoke_impl<void, netisr_osv_start_thread(netisr_osv_handler_t,
> void*)::<lambda()>&> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
> #21 std::__invoke_r<void, netisr_osv_start_thread(netisr_osv_handler_t,
> void*)::<lambda()>&> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
> #22 std::_Function_handler<void(),
> netisr_osv_start_thread(netisr_osv_handler_t, void*)::<lambda()>
> >::_M_invoke(const std::_Any_data &) (__functor=...)
> at /usr/include/c++/10/bits/std_function.h:291
> #23 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860200) at
> core/sched.cc:1267
> #24 sched::thread_main_c (t=0xffffa00040860200) at
> arch/aarch64/arch-switch.hh:162
> #25 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
>
> And here is where the most critical part of freeBSD networking code in
> bsd/sys/net/if.cc comes:
>
> 1180 /*
> 1181 * Reference count functions for ifaddrs.
> 1182 */
> 1183 void
> 1184 *ifa_init*(struct bsd_ifaddr *ifa)
> 1185 {
> 1186
> 1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
> 1188 refcount_init(&ifa->ifa_refcnt, 1);
> 1189 }
> 1190
> 1191 void
> 1192 *ifa_ref*(struct bsd_ifaddr *ifa)
> 1193 {
> 1194
> 1195 refcount_acquire(&ifa->ifa_refcnt);
> 1196 }
> 1197
> 1198 void
> 1199 *ifa_free*(struct bsd_ifaddr *ifa)
> 1200 {
> 1201
> 1202 if (refcount_release(&ifa->ifa_refcnt)) {
> 1203 mtx_destroy(&ifa->ifa_mtx);
> 1204 free(ifa);
> 1205 }
> 1206 }
>
> It seems that *ifa_free*() calls mtx_destroy() and free() when the
> *ifa_refcnt* field of bsd_ifaddr is 0 after refcount_release() is called.
> See code of refcount_release in bsd/sys/sys/refcount.h:
>
> static __inline int
> refcount_release(volatile u_int *count)
> {
> u_int old;
>
> /* XXX: Should this have a rel membar? */
> old = atomic_fetchadd_int(count, -1);
> KASSERT(old > 0, ("negative refcount %p", count));
> return (old == 1);
> }
>
> BTW the XXX comment made me a little bit suspicious. But before we go
> there let me try to explain what I figured out about the ifa_refcnt field.
> The comment from the bsd_ifaddr definition states this:
>
> *u_int ifa_refcnt;* /* references to this structure */
>
> So it seems to be some sort of reference count field. But what purpose
> does it serve? The simple grep of *ifa_refcnt *to see where it is used
> gives this list:
>
> ./bsd/sys/net/if_var.h: u_int ifa_refcnt; /*
> references to this structure */
> ./bsd/sys/net/if.cc: refcount_init(&ifa->ifa_refcnt, 1);
> ./bsd/sys/net/if.cc: refcount_acquire(&ifa->ifa_refcnt);
> ./bsd/sys/net/if.cc: if (refcount_release(&ifa->ifa_refcnt)) {
>
> which in essence are the 3 functions from bsd/sys/net/if.cc I put couple
> of paragraphs above - ifa_init(), ifa_ref() and ifa_free(). I do not know
> for sure of the intention behind ifa_refcnt (I did study some FreeBSD
> commits) but based on its use by these 3 functions, it seems to act as a
> reference count to implement some sort of garbage collection mechanism
> around *bsd_ifaddr* structures to make sure that it is eventually getting
> free()-ed after no longer used/referenced. From what I figured the
> ifa_init() is called during this struct allocation and initialization and
> there are two places in bsd subtree:
>
> - if_attach_internal() function in bsd/sys/net/if.cc
> - in_control() when the SIOCSIFADDR, SIOCSIFNETMASK or SIOCSIFDSTADDR cmd
> passed in bsd/sys/netinet/in.cc.
>
> In essence, the ifa_init() sets the ifa_mtx field and sets ifa_refcnt of
> the *bsd_ifaddr *to 1.
> It turns out that the ifa_ref() is then called in 33 places and ifa_free()
> is called in 55 places in the bsd subtree. The former increments and the
> latter decrements the ifa_refcnt counter.
>
> So I started studying any changes to how these functions may have changed
> on OSv side of bsd tree and I have found this commit:
>
>
> https://github.com/cloudius-systems/osv/commit/8ad9f53c5dbcacf5632ef878ab9cf93b0d860b10
>
> - which is backport of the commit on freebsd side to remove some redundant
> (?) occurrences of ifa_ref() and ifa_free() from
> bsd/sys/netinet/ip_output.cc
> <https://github.com/cloudius-systems/osv/commit/8ad9f53c5dbcacf5632ef878ab9cf93b0d860b10#diff-172218361051c20090e57b94b614925569c4eb942a49c136923d4251dcb72f53>
> ,
>
> Then, I found this patch in FreeBSD codebase around refcount_release -
> https://github.com/freebsd/freebsd-src/commit/157d5e6df35748cdb589f743bfeccc04bf9af93e
>
> - which changes refcount_release() to look like this:
>
> static __inline int
> refcount_release(volatile u_int *count)
> {
> u_int old;
>
> /* XXX: Should this have a rel membar? */
> atomic_thread_fence_rel();
> old = atomic_fetchadd_int(count, -1);
> KASSERT(old > 0, ("negative refcount %p", count));
> return (old == 1);
> if (old > 1)
> return (0);
>
> /*
> * Last reference. Signal the user to call the destructor.
> *
> * Ensure that the destructor sees all updates. The fence_rel
> * at the start of the function synchronized with this fence.
> */
> atomic_thread_fence_acq();
> return (1);
> }
>
> And then some other changes adding and removing ifa_ref() and ifa_free():
>
> -
>
> https://github.com/freebsd/freebsd-src/commit/1ebec5faf41f65a793b4f4217d7a67125c877510
>
> - fixes missed refs
> -
>
> https://github.com/freebsd/freebsd-src/commit/2a4bd982d0530863c7fd2aa1ff2b8de27b8896c7
>
> - introduces new macro NET_EPOCH_CALL in some related places
>
> On top of that the ifa_ref() and ifa_free() have been changed and enhanced
> with new related functions and look like this in the latest version (see
> https://github.com/freebsd/freebsd-src/blob/main/sys/net/if.c#L1759-L1795
> ):
>
> void
> ifa_ref(struct ifaddr *ifa)
> {
> u_int old __diagused;
>
> old = refcount_acquire(&ifa->ifa_refcnt);
> KASSERT(old > 0, ("%s: ifa %p has 0 refs", __func__, ifa));
> }
>
> int
> ifa_try_ref(struct ifaddr *ifa)
> {
>
> NET_EPOCH_ASSERT();
> return (refcount_acquire_if_not_zero(&ifa->ifa_refcnt));
> }
>
> static void
> ifa_destroy(epoch_context_t ctx)
> {
> struct ifaddr *ifa;
>
> ifa = __containerof(ctx, struct ifaddr, ifa_epoch_ctx);
> counter_u64_free(ifa->ifa_opackets);
> counter_u64_free(ifa->ifa_ipackets);
> counter_u64_free(ifa->ifa_obytes);
> counter_u64_free(ifa->ifa_ibytes);
> free(ifa, M_IFADDR);
> }
>
> void
> ifa_free(struct ifaddr *ifa)
> {
>
> if (refcount_release(&ifa->ifa_refcnt))
> NET_EPOCH_CALL(ifa_destroy, &ifa->ifa_epoch_ctx);
> }
>
> The struct ifaddr has changed as well.
>
> Why am I listing all these changes? Well, my theory (I think more than
> that) is that there is some old bug in the reference counting logic around
> ifa_refcnt in the bsd substree - either un-balanced (too many, too few)
> number of calls to ifa_ref/ifa_free or buggy refcount_release() function
> (see TODO comment) or combination of those that somehow only shows its ugly
> face on aarch64. Or possibly there is some race condition that makes calls
> to ifa_free() be executed before ifa_ref() even if they are balanced. *Please
> note that this bug happens on 1 or more CPUs so it is not SMP-specific by
> any means.*
>
> If that is the case then ideally we would want to fix this bug by
> back-porting all these commits and possibly many more from the current
> FreeBSD TCP code. And I have actually tried to do that - 1st by backporting
> the refcount_release() patch which did not help. Then apply some extra
> changes to add/remove some calls to ifa_ref() and ifa_free(). This led to
> some new crashes now even on x86_64. All of this has made me think that
> bringing back all necessary changes from FreeBSD tree to fix this bug would
> be a very tedious and risky effort. We might make the situation even worse.
>
> So instead I realized there is a simpler solution to this problem -
> eliminate buggy reference counting code like this:
>
> diff --git a/bsd/sys/net/if.cc b/bsd/sys/net/if.cc
> index f717da37..fb8f604c 100644
> --- a/bsd/sys/net/if.cc
> +++ b/bsd/sys/net/if.cc
> @@ -1192,17 +1192,17 @@ void
> ifa_ref(struct bsd_ifaddr *ifa)
> {
>
> - refcount_acquire(&ifa->ifa_refcnt);
> + //refcount_acquire(&ifa->ifa_refcnt);
> }
>
> void
> ifa_free(struct bsd_ifaddr *ifa)
> {
>
> - if (refcount_release(&ifa->ifa_refcnt)) {
> - mtx_destroy(&ifa->ifa_mtx);
> - free(ifa);
> - }
> + //if (refcount_release(&ifa->ifa_refcnt)) {
> + // mtx_destroy(&ifa->ifa_mtx);
> + // free(ifa);
> + //}
> }
>
> The bsd_ifaddr is allocated only a handful of times (=5) and destroyed
> once during OSv runtime. Here is the list of places that I captured from
> gdb:
>
> (gdb) hbr bsd/sys/net/if.cc:1187
> Hardware assisted breakpoint 1 at 0xfc011bddc: bsd/sys/net/if.cc:1187. (2
> locations)
> (gdb) c
> Continuing.
>
> Breakpoint 1, if_attach_internal (vmove=0, ifp=0xffffa0004088e400) at
> bsd/sys/net/if.cc:1187
> 1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
> (gdb) bt
> #0 if_attach_internal (vmove=0, ifp=0xffffa0004088e400) at
> bsd/sys/net/if.cc:1187
> #1 if_attach (ifp=ifp@entry=0xffff90004088f000) at bsd/sys/net/if.cc:519
> #2 0x0000000fc011a390 in lo_clone_create (ifc=0xfc061d908 <lo_cloner>,
> unit=0, params=<optimized out>) at bsd/sys/net/if_loop.cc:116
> #3 0x0000000fc011a0d0 in ifc_simple_create (ifc=0xfc061d908 <lo_cloner>,
> name=0xffff8000407a2d50 "lo0", len=16, params=0x0) at
> bsd/sys/net/if_clone.cc:572
> #4 0x0000000fc0119360 in if_clone_createif (ifc=ifc@entry=0xfc061d908
> <lo_cloner>, name=name@entry=0xffff8000407a2d50 "lo0", len=len@entry=16,
> params=params@entry=0x0) at bsd/sys/net/if_clone.cc:171
> #5 0x0000000fc0119458 in if_clone_createif (params=0x0, len=16,
> name=0xffff8000407a2d50 "lo0", ifc=0xfc061d908 <lo_cloner>) at
> bsd/sys/net/if_clone.cc:523
> #6 ifc_simple_attach (ifc=0xfc061d908 <lo_cloner>) at
> bsd/sys/net/if_clone.cc:525
> #7 0x0000000fc0119908 in if_clone_attach (ifc=ifc@entry=0xfc061d908
> <lo_cloner>) at bsd/sys/net/if_clone.cc:306
> #8 0x0000000fc011a420 in vnet_loif_init () at bsd/sys/net/if_loop.cc:125
> #9 0x0000000fc00ec2dc in net_init () at bsd/net.cc:64
> #10 0x0000000fc00dc99c in main_cont (loader_argc=<optimized out>,
> loader_argv=<optimized out>) at loader.cc:729
> #11 0x0000000fc0257444 in sched::thread::main (this=0xfc0663e28) at
> core/sched.cc:1267
> #12 sched::thread_main_c (t=0xfc0663e28) at arch/aarch64/arch-switch.hh:162
> #13 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
> (gdb) c
> Continuing.
>
> Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa0004088e800) at
> bsd/sys/net/if.cc:1187
> 1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
> (gdb) bt
> #0 ifa_init (ifa=ifa@entry=0xffffa0004088e800) at bsd/sys/net/if.cc:1187
> #1 0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034,
> data=data@entry=0xffff8000407a2d50 "lo0", ifp=ifp@entry=0xffff90004088f000,
> td=td@entry=0x0)
> at bsd/sys/netinet/in.cc:388
> #2 0x0000000fc0107554 in osv::start_if (if_name="lo0",
> ip_addr="127.0.0.1", mask_addr="255.0.0.0") at bsd/porting/networking.cc:113
> #3 0x0000000fc00ec350 in net_init () at
> /usr/include/c++/10/bits/char_traits.h:322
> #4 0x0000000fc00dc99c in main_cont (loader_argc=<optimized out>,
> loader_argv=<optimized out>) at loader.cc:729
> #5 0x0000000fc0257444 in sched::thread::main (this=0xfc0663e28) at
> core/sched.cc:1267
> #6 sched::thread_main_c (t=0xfc0663e28) at arch/aarch64/arch-switch.hh:162
> #7 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
> (gdb) c
> Continuing.
>
> Breakpoint 1, if_attach_internal (vmove=0, ifp=0xffffa0004088ec00) at
> bsd/sys/net/if.cc:1187
> 1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
> (gdb) bt
> #0 if_attach_internal (vmove=0, ifp=0xffffa0004088ec00) at
> bsd/sys/net/if.cc:1187
> #1 if_attach (ifp=ifp@entry=0xffff900040961000) at bsd/sys/net/if.cc:519
> #2 0x0000000fc0110314 in ether_ifattach (ifp=0xffff900040961000,
> lla=lla@entry=0xffff90004096a248 "RT") at bsd/sys/net/if_ethersubr.cc:902
> #3 0x0000000fc017790c in virtio::net::net (this=0xffff90004096a000,
> dev=...) at drivers/virtio-net.cc:304
> #4 0x0000000fc0177cdc in aligned_new<virtio::net, virtio::virtio_device&>
> () at /usr/include/c++/10/new:175
> #5 virtio::net::probe (dev=<optimized out>) at drivers/virtio-net.cc:931
> #6 0x0000000fc016cfd8 in std::function<hw::hw_driver*
> (hw::hw_device*)>::operator()(hw::hw_device*) const (__args#0=<optimized
> out>, this=<optimized out>)
> at drivers/driver.cc:39
> #7 operator() (dev=0xffffa0004096f300, __closure=0x200000100b10) at
> drivers/driver.cc:40
> #8 std::__invoke_impl<void,
> hw::driver_manager::load_all()::<lambda(hw::hw_device*)>&, hw::hw_device*>
> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
> #9 std::__invoke_r<void,
> hw::driver_manager::load_all()::<lambda(hw::hw_device*)>&, hw::hw_device*>
> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
> #10 std::_Function_handler<void(hw::hw_device*),
> hw::driver_manager::load_all()::<lambda(hw::hw_device*)> >::_M_invoke(const
> std::_Any_data &, hw::hw_device *&&) (
> __functor=..., __args#0=<optimized out>) at
> /usr/include/c++/10/bits/std_function.h:291
> #11 0x0000000fc0169658 in std::function<void
> (hw::hw_device*)>::operator()(hw::hw_device*) const (__args#0=<optimized
> out>, this=0x200000100b10)
> at /usr/include/c++/10/bits/std_function.h:617
> #12 hw::device_manager::for_each_device(std::function<void
> (hw::hw_device*)>) (this=this@entry=0xffffa00040854b40, func=...) at
> drivers/device.cc:67
> #13 0x0000000fc016cd18 in hw::driver_manager::load_all
> (this=this@entry=0xffffa00040854f80) at /usr/include/c++/10/new:175
> #14 0x0000000fc017b7e8 in arch_init_drivers () at
> arch/aarch64/arch-setup.cc:240
> #15 0x0000000fc00dcf7c in do_main_thread (_main_args=0xffffa000406fb7f0)
> at loader.cc:416
> #16 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at
> libc/pthread.cc:116
> #17 std::__invoke_impl<void, pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()>&> (__f=...)
> at /usr/include/c++/10/bits/invoke.h:60
> #18 std::__invoke_r<void, pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
> at /usr/include/c++/10/bits/invoke.h:153
> #19 std::_Function_handler<void(), pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const
> std::_Any_data &) (__functor=...) at
> /usr/include/c++/10/bits/std_function.h:291
> #20 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at
> core/sched.cc:1267
> #21 sched::thread_main_c (t=0xffffa00040860c00) at
> arch/aarch64/arch-switch.hh:162
> #22 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
> (gdb) c
> Continuing.
>
> Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at
> bsd/sys/net/if.cc:1187
> 1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
> (gdb) bt
> #0 ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at bsd/sys/net/if.cc:1187
> #1 0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034,
> data=data@entry=0x200000100a70 "eth0", ifp=ifp@entry=0xffff900040961000,
> td=td@entry=0x0)
> at bsd/sys/netinet/in.cc:388
> #2 0x0000000fc0107554 in osv::start_if (if_name="eth0",
> ip_addr="0.0.0.0", mask_addr="255.255.255.0") at
> bsd/porting/networking.cc:113
> #3 0x0000000fc00da234 in operator() (if_name="eth0",
> __closure=0x200000100db0) at /usr/include/c++/10/ext/new_allocator.h:79
> #4 std::__invoke_impl<void,
> do_main_thread(void*)::<lambda(std::string)>&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > > (__f=...)
> at /usr/include/c++/10/bits/invoke.h:60
> #5 std::__invoke_r<void, do_main_thread(void*)::<lambda(std::string)>&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > > (__fn=...)
> at /usr/include/c++/10/bits/invoke.h:153
> #6 std::_Function_handler<void(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >),
> do_main_thread(void*)::<lambda(std::string)> >::_M_invoke(const
> std::_Any_data &, std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > &&) (__functor=..., __args#0=...)
> at /usr/include/c++/10/bits/std_function.h:291
> #7 0x0000000fc0107274 in std::function<void
> (std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >)>::operator()(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >) const (__args#0="",
> this=0x200000100db0) at /usr/include/c++/10/bits/std_function.h:617
> #8 osv::for_each_if(std::function<void (std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >)>) (func=...) at
> bsd/porting/networking.cc:32
> #9 0x0000000fc00dcfd4 in do_main_thread (_main_args=0xffffa000406fb7f0)
> at /usr/include/c++/10/new:175
> #10 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at
> libc/pthread.cc:116
> #11 std::__invoke_impl<void, pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()>&> (__f=...)
> at /usr/include/c++/10/bits/invoke.h:60
> #12 std::__invoke_r<void, pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
> at /usr/include/c++/10/bits/invoke.h:153
> #13 std::_Function_handler<void(), pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const
> std::_Any_data &) (__functor=...) at
> /usr/include/c++/10/bits/std_function.h:291
> #14 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at
> core/sched.cc:1267
> #15 sched::thread_main_c (t=0xffffa00040860c00) at
> arch/aarch64/arch-switch.hh:162
> #16 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
> (gdb) c
> Continuing.
>
> Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at
> bsd/sys/net/if.cc:1187
> 1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
> (gdb) bt
> #0 ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at bsd/sys/net/if.cc:1187
> #1 0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034,
> data=data@entry=0x200000100a70 "eth0", ifp=ifp@entry=0xffff900040961000,
> td=td@entry=0x0)
> at bsd/sys/netinet/in.cc:388
> #2 0x0000000fc0107554 in osv::start_if (if_name="eth0",
> ip_addr="0.0.0.0", mask_addr="255.255.255.0") at
> bsd/porting/networking.cc:113
> #3 0x0000000fc00da234 in operator() (if_name="eth0",
> __closure=0x200000100db0) at /usr/include/c++/10/ext/new_allocator.h:79
> #4 std::__invoke_impl<void,
> do_main_thread(void*)::<lambda(std::string)>&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > > (__f=...)
> at /usr/include/c++/10/bits/invoke.h:60
> #5 std::__invoke_r<void, do_main_thread(void*)::<lambda(std::string)>&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > > (__fn=...)
> at /usr/include/c++/10/bits/invoke.h:153
> #6 std::_Function_handler<void(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >),
> do_main_thread(void*)::<lambda(std::string)> >::_M_invoke(const
> std::_Any_data &, std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > &&) (__functor=..., __args#0=...)
> at /usr/include/c++/10/bits/std_function.h:291
> #7 0x0000000fc0107274 in std::function<void
> (std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >)>::operator()(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >) const (__args#0="",
> this=0x200000100db0) at /usr/include/c++/10/bits/std_function.h:617
> #8 osv::for_each_if(std::function<void (std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >)>) (func=...) at
> bsd/porting/networking.cc:32
> #9 0x0000000fc00dcfd4 in do_main_thread (_main_args=0xffffa000406fb7f0)
> at /usr/include/c++/10/new:175
> #10 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at
> libc/pthread.cc:116
> #11 std::__invoke_impl<void, pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()>&> (__f=...)
> at /usr/include/c++/10/bits/invoke.h:60
> #12 std::__invoke_r<void, pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
> at /usr/include/c++/10/bits/invoke.h:153
> #13 std::_Function_handler<void(), pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const
> std::_Any_data &) (__functor=...) at
> /usr/include/c++/10/bits/std_function.h:291
> #14 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at
> core/sched.cc:1267
> #15 sched::thread_main_c (t=0xffffa00040860c00) at
> arch/aarch64/arch-switch.hh:162
> #16 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
> (gdb) c
> Continuing.
>
> Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040960400) at
> bsd/sys/net/if.cc:1187
> 1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
> (gdb) bt
> #0 ifa_init (ifa=ifa@entry=0xffffa00040960400) at bsd/sys/net/if.cc:1187
> #1 0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034,
> data=data@entry=0xffff800040d52d90 "eth0",
> ifp=ifp@entry=0xffff900040961000, td=td@entry=0x0)
> at bsd/sys/netinet/in.cc:388
> #2 0x0000000fc0107554 in osv::start_if (if_name="eth0",
> ip_addr="192.168.122.15", mask_addr="255.255.255.0") at
> bsd/porting/networking.cc:113
> #3 0x0000000fc027b05c in dhcp::dhcp_interface_state::state_request
> (this=this@entry=0xffffa0004096bc40, dm=...) at
> /usr/include/c++/10/bits/char_traits.h:322
> #4 0x0000000fc027c3d0 in dhcp::dhcp_interface_state::process_packet
> (this=0xffffa0004096bc40, m=m@entry=0xffffa00040c7c100) at core/dhcp.cc:633
> #5 0x0000000fc027c540 in dhcp::dhcp_worker::dhcp_worker_fn
> (this=0xfc06aab30 <net_dhcp_worker>) at core/dhcp.cc:843
> #6 0x0000000fc027c5e0 in operator() (__closure=<optimized out>,
> __closure=<optimized out>) at core/dhcp.cc:773
> #7 std::__invoke_impl<void, dhcp::dhcp_worker::init()::<lambda()>&>
> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
> #8 std::__invoke_r<void, dhcp::dhcp_worker::init()::<lambda()>&>
> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
> #9 std::_Function_handler<void(), dhcp::dhcp_worker::init()::<lambda()>
> >::_M_invoke(const std::_Any_data &) (__functor=...)
> at /usr/include/c++/10/bits/std_function.h:291
> #10 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040b08800) at
> core/sched.cc:1267
> #11 sched::thread_main_c (t=0xffffa00040b08800) at
> arch/aarch64/arch-switch.hh:162
> #12 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
>
> Ideally, as a "good citizen" we would want to free what we allocated but
> given that even now these 5 calls to allocate bsd_ifaddr (mostly from
> osv::start_if()) are not balanced with corresponding ones to free
> bsd_ifaddr, possibly because we lack all necessary calls to osv::stop_if()
> (only one in core/dhcp.cc). But given there is only one occurrence of
> freeing bsd_ifaddr(), disabling reference counting logic would only lead to
> a single leak of memory (roughly 1024 bytes). These structures are
> long-lived and map to a handful of IPs/interfaces (lo, eth0, ...) OSv would
> have.
>
> There is also extra benefit: the refcount_release()/refcount_acquire()
> functions called by ifa_free()/ifa_ref() are not free - they operate on
> atomics and probably lead to some contention in SMP case. I have not made
> any measurements but I would assume the tcp stack could become a little bit
> faster by not using atomics at all (there are 33 calls to ifa_ref() and 55
> to ifa_free() in bsd subtree). For example, the refcount_acquire() calls
> this inline assembly:
>
> static __inline u_int atomic_fetchadd_int(volatile u_int *p, u_int val)
> {
> u_int result;
> u_int status;
> __asm __volatile("1: ldaxr %w0, %1 ; "
> " add %w2, %w2, %w0 ; "
> " stlxr %w3, %w2, %1 ; "
> " cbnz %w3, 1b ; "
> : "=&r"(result), "+Q"(*p), "+r"(val), "=&r"(status));
>
> return result;
> }
>
> This is actually a loop likely to happen in SMP contentious cases. See
> this article - https://cpufun.substack.com/p/atomics-in-aarch64?s=r.
>
> After applying the patch (see above) to disable ifa_refcnt counting
> (ifa_ref, ifa_free do nothing) and running all 4 tst-bsd-tcp-* tests with
> 100K iterations on 1, 2, and 4 CPUs on both aarch64 and x86_64 I could not
> reproduce these crashes and "hang" scenarios.
>
> What do you think about my idea?
>
> Waldek
>
>
--
You received this message because you are subscribed to the Google Groups "OSv
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/osv-dev/8a8cb526-6d52-410a-81f3-91169dd1d6c3n%40googlegroups.com.