Hi,

As I was working on a patch to address the 
issue https://github.com/cloudius-systems/osv/issues/1087, I discovered 
that occasionally it would crash or hang while running a test 
tst-bsd-tcp1-zsndrcv.cc. The crash would look like this:

page fault outside application, addr: 0x0000000000000000
[registers]
PC: 0x0000000fc011ca5c <ifa_ifwithnet+152>
X00: 0x0000000000000012 X01: 0xffff90004088f4a0 X02: 0x0000000000000000
X03: 0xffff800040f668c0 X04: 0xffffa00040e778e8 X05: 0xffff800040f66950
X06: 0x0000000000000000 X07: 0x0000000000000000 X08: 0x0000000000000000
X09: 0x0000000000000000 X10: 0x0000000000000000 X11: 0x0000000000000000
X12: 0x0000000000004000 X13: 0x0000000000000000 X14: 0x00000000000000b0
X15: 0x0000000000000000 X16: 0x0000000000000001 X17: 0x0000000000000001
X18: 0x0000000055697ee1 X19: 0xffffa0004088e800 X20: 0xffff800040f66840
X21: 0xffff90004088f000 X22: 0x0000000fc0672838 X23: 0x0000000000000000
X24: 0x0000000000000000 X25: 0xffff90004088f4a0 X26: 0x0000000000000001
X27: 0x00000000ffffffff X28: 0x0000000000000002 X29: 0xffff800040f667a0
X30: 0x0000000fc011ca2c SP:  0xffff800040f667a0 ESR: 0x0000000096000006
PSTATE: 0x0000000020000345
Aborted

[backtrace]
0x0000000fc014c2e4 <mmu::vm_fault(unsigned long, exception_frame*)+724>
0x0000000fc017e4c8 <page_fault+100>
0x0000000fc017e254 <???+-1072176556>
0x0000000fc01234dc <???+-1072548644>
0x0000000fc0124a80 <in_pcbconnect_setup+192>
0x0000000fc0145920 <???+-1072408288>
0x0000000fc0145afc <???+-1072407812>
0x0000000fc0100bb8 <kern_connect+120>
0x0000000fc010de7c <linux_connect+60>
0x0000000fc010233c <connect+12>
0x00001000000376a0 <???+226976>
0x00001000000377e0 <???+227296>
0x0000000fc0257440 <thread_main_c+32>
0x0000000fc017d784 <???+-1072179324>

---------
(gdb) bt
#0  processor::wfi () at arch/aarch64/processor.hh:52
#1  processor::halt_no_interrupts () at arch/aarch64/processor.hh:52
#2  arch::halt_no_interrupts () at arch/aarch64/arch.hh:48
#3  osv::halt () at arch/aarch64/power.cc:24
#4  0x0000000fc00e9fcc in abort (fmt=fmt@entry=0xfc0481190 "Aborted\n") at 
runtime.cc:142
#5  0x0000000fc00a0654 in abort () at runtime.cc:106
#6  0x0000000fc014c2e8 in mmu::vm_sigsegv (ef=0xffff800040f66680, addr=0) 
at core/mmu.cc:1420
#7  mmu::vm_sigsegv (ef=0xffff800040f66680, addr=0) at core/mmu.cc:1414
#8  mmu::vm_fault (addr=0, addr@entry=1, ef=ef@entry=0xffff800040f66680) at 
core/mmu.cc:1442
#9  0x0000000fc017e4cc in page_fault (ef=0xffff800040f66680) at 
arch/aarch64/mmu.cc:50
#10 <signal handler called>
#11 0x0000000fc011ca5c in ifa_ifwithnet 
(addr=addr@entry=0xffff800040f66840, ignore_ptp=0) at bsd/sys/net/if.cc:1434
#12 0x0000000fc01234e0 in in_pcbladdr (inp=inp@entry=0xffffa000411c1600, 
faddr=faddr@entry=0xffff800040f669d8, laddr=laddr@entry=0xffff800040f669d0, 
cred=0x0)
    at bsd/sys/netinet/in_pcb.cc:969
#13 0x0000000fc0124a84 in in_pcbconnect_setup 
(inp=inp@entry=0xffffa000411c1600, nam=nam@entry=0xffffa00040b046d0, 
laddrp=laddrp@entry=0xffff800040f66a30, 
    lportp=lportp@entry=0xffff800040f66a2e, 
faddrp=faddrp@entry=0xffffa000411c1664, 
fportp=fportp@entry=0xffffa000411c1654, 
oinpp=oinpp@entry=0xffff800040f66a38, 
    cred=cred@entry=0x0) at bsd/sys/netinet/in_pcb.cc:1056
#14 0x0000000fc0145924 in tcp_connect (tp=tp@entry=0xffffa00040e5e400, 
nam=nam@entry=0xffffa00040b046d0, td=<optimized out>) at 
bsd/sys/netinet/tcp_usrreq.cc:1089
#15 0x0000000fc0145b00 in tcp_usr_connect (td=<optimized out>, 
nam=0xffffa00040b046d0, so=<optimized out>) at 
bsd/sys/netinet/tcp_usrreq.cc:463
#16 tcp_usr_connect (so=<optimized out>, nam=0xffffa00040b046d0, 
td=<optimized out>) at bsd/sys/netinet/tcp_usrreq.cc:436
#17 0x0000000fc0100bbc in kern_connect (fd=fd@entry=4, 
sa=0xffffa00040b046d0) at bsd/sys/kern/uipc_syscalls.cc:377
#18 0x0000000fc010de80 in linux_connect (s=s@entry=4, 
name=name@entry=0xffff800040f66be8, namelen=namelen@entry=16) at 
bsd/sys/compat/linux/linux_socket.cc:684
#19 0x0000000fc0102340 in connect (fd=fd@entry=4, 
addr=addr@entry=0xffff800040f66be8, len=len@entry=16) at 
bsd/sys/kern/uipc_syscalls_wrap.cc:128
#20 0x00001000000376a4 in test_bsd_tcp1::tcp_client (this=0x2000001ffa98) 
at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zrcv.cc:169
#21 0x00001000000377e4 in test_bsd_tcp1::run()::{lambda()#2}::operator()() 
const (__closure=<optimized out>)
    at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zrcv.cc:230
#22 std::__invoke_impl<void, 
test_bsd_tcp1::run()::{lambda()#2}&>(std::__invoke_other, 
test_bsd_tcp1::run()::{lambda()#2}&) (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#23 std::__invoke_r<void, 
test_bsd_tcp1::run()::{lambda()#2}&>(std::__is_invocable&&, 
(test_bsd_tcp1::run()::{lambda()#2}&)...) (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:153
#24 std::_Function_handler<void (), 
test_bsd_tcp1::run()::{lambda()#2}>::_M_invoke(std::_Any_data const&) 
(__functor=...)
    at /usr/include/c++/10/bits/std_function.h:291
#25 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040c72e00) at 
core/sched.cc:1267
#26 sched::thread_main_c (t=0xffffa00040c72e00) at 
arch/aarch64/arch-switch.hh:162
#27 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) frame 11
#11 0x0000000fc011ca5c in ifa_ifwithnet 
(addr=addr@entry=0xffff800040f66840, ignore_ptp=0) at bsd/sys/net/if.cc:1434
1434                            if (ifa->ifa_addr->sa_family != af)
(gdb) p ifa->ifa_addr
$1 = (bsd_sockaddr *) 0xffff80004088e800
(gdb) p ifa->ifa_addr->sa_family
$2 = 232 '\350'
(gdb) p af
$3 = 2

As you can see the fault happened in line 1434 of if.cc:
*1434                         if (ifa->ifa_addr->sa_family != af)*
1435 next:                           continue;
when dereferencing the field ifa_addr of the ifa pointer.

In this case, gdb actually showed *ifa_addr* with a valid non-zero address 
(which is interesting) but most of the time when I connected I would see 
*ifa_addr* equal to 0.

Sometimes this test would "hang", which I discovered was a tcp server 
thread left listening on a socket after tcp client thread terminated when 
its loop broke due to an error after calling connect() on a socket:

for (int i=0; i < ITERATIONS; i++) {
...
             dbg_d("client: connecting to 127.0.0.1...");
             if ( connect(s, (struct sockaddr *)&raddr, sizeof(raddr)) < 0 
) {
                dbg_d("client: connect() failed!");
                 return -1;
             }
}

Initially, I thought that these "hangs" or crashes were not really specific 
to TCP stack, but instead just random manifestations of some bug in my 
patch fixing issue 1087. The patch was about changing preboot.S, boot.S, 
and core/mmu.cc to move aarch64 kernel code to a different much higher 
place in virtual memory so that the same loader.elf/kernel.elf could run 
correctly on both QEMU and firecracker. I speculated that somehow some of 
virt/phys or phys/virt translation routines have bugs, then possibly 
something wrong in the new assembly code updating early boot mapping page 
tables, or finally some dirty cache lines and some missing *isb* or *dsb* 
instruction. Nothing was helping and at some point, I found that strangely 
adding a nop instruction to premain() in loader.cc shifting the address of 
premain would somehow make this crash go away (or most likely I did not see 
it as often). Then I started debugging the TCP stack and code around the 
crash and "hang" to understand why exactly this was misbehaving. It had to 
do with some memory mapping bug I thought. 
 
To make this long story short I eventually took a step back and tried to 
reproduce the exact same crashes/"hangs" on the original version of the 
code before the patch as it is on master now. Also during the research, I 
realized that the other 3 tst-bsd-tcp1-*cc tests were failing in a similar 
way.

tst-bsd-tcp1.cc
tst-bsd-tcp1-zrcv.cc
tst-bsd-tcp1-zsnd.cc

All of these tests exercise basic socket API fundamentals where one thread 
acts as a TCP server listening on some port and another thread acting a 
client in a loop of 400 iterations connecting to the server, sending some 
data and reading it back as an echo, and closing the socket. The 3 -z* ones 
exercise the zero-copy API to send and/or receive data. 

After I increased the number of loops to something much higher like 100K I 
started seeing the exact same errors after 5-10K of iterations with the 
0-address faults being more dominant. So clearly the issue was not caused 
by my kernel memory layout changes but due to some old bug and somehow my 
patch changes made these more frequent. I also tried to reproduce it on 
x86_64 with the same number of iterations but to no avail. 

Now a little by luck and little by deduction, I managed to realize that the 
page fault might be caused by something writing over/erasing the memory 
(stack overflow?) where the ifa_addr field of the *bsd_ifaddr* struct (ifa 
variable in the example above). Here is the struct defined in 
the bsd/sys/net/if_var.h (it will be quite important in this long email 
below):

/*
 * The bsd_ifaddr structure contains information about one address
 * of an interface.  They are maintained by the different address families,
 * are allocated and attached when an address is set, and are linked
 * together so all addresses for an interface can be located.
 *
 * NOTE: a 'struct bsd_ifaddr' is always at the beginning of a larger
 * chunk of malloc'ed memory, where we store the three addresses
 * (ifa_addr, ifa_dstaddr and ifa_netmask) referenced here.
 */
struct bsd_ifaddr {
        struct  bsd_sockaddr *ifa_addr; /* address of interface */
        struct  bsd_sockaddr *ifa_dstaddr;      /* other end of p-to-p link 
*/
#define ifa_broadaddr   ifa_dstaddr     /* broadcast address interface */
        struct  bsd_sockaddr *ifa_netmask;      /* used to determine subnet 
*/
        struct  if_data if_data;        /* not all members are meaningful */
        struct  ifnet *ifa_ifp;         /* back-pointer to interface */
        TAILQ_ENTRY(bsd_ifaddr) ifa_link;       /* queue macro glue */
        void    (*ifa_rtrequest)        /* check or clean routes (+ or -)'d 
*/
                (int, struct rtentry *, struct rt_addrinfo *);
        u_short ifa_flags;              /* mostly rt_flags for cloning */
        *u_int   ifa_refcnt;*             /* references to this structure */
        int     ifa_metric;             /* cost of going out this interface 
*/
        int (*ifa_claim_addr)           /* check if an addr goes to this if 
*/
                (struct bsd_ifaddr *, struct bsd_sockaddr *);
        struct mtx ifa_mtx;
};

Obviously, finding what possibly erases this field would be like looking 
for a needle in the stack of hay. Then I theorized, that the memory 
allocated to hold the bsd_ifaddr may be freed somehow and used again for 
the next malloc and end up being written to by something completely else.

To that end, I captured the address of the bsd_ifaddr (for example 
0xffffa00040882800) and then put a breakpoint or printf in 
core/mempool.cc:free() to check if something frees that address. And indeed 
before the crash or hang it would be caught in this check. Eventually, I 
managed to capture what calls the free():

#9  0x0000000fc0243e10 in memory::pool::free (this=0xffffa00040e94800, 
object=object@entry=0xffff80004088e800) at core/mempool.cc:334
#10 0x0000000fc0244010 in free (object=0xffff80004088e800) at 
core/mempool.cc:1897
#11 free (object=object@entry=0xffffa0004088e800) at core/mempool.cc:1871
#12 0x0000000fc011c568 in *ifa_free* (ifa=0xffffa0004088e800) at 
bsd/sys/net/if.cc:1204
#13 ifa_free (ifa=ifa@entry=0xffffa0004088e800) at bsd/sys/net/if.cc:1200
#14 0x0000000fc012c328 in ip_preprocess_packet (m=<optimized out>, 
protocol=@0xffff800040a47f5b: 192 '\300', hlen=@0xffff800040a47f5c: 20)
    at bsd/sys/netinet/ip_input.cc:717
#15 0x0000000fc012c734 in ip_input (m=<optimized out>) at 
bsd/sys/netinet/ip_input.cc:770
#16 0x0000000fc0117c24 in netisr_process_workstream_proto (proto=1, 
nwsp=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr.cc:576
#17 swi_net (arg=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr.cc:615
#18 0x0000000fc0118f9c in netisr_osv_thread_wrapper (handler=0xfc0117b10 
<swi_net(void*)>, arg=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr1.cc:29
#19 0x0000000fc0119010 in operator() (__closure=<optimized out>, 
__closure=<optimized out>) at bsd/sys/net/netisr1.cc:39
#20 std::__invoke_impl<void, netisr_osv_start_thread(netisr_osv_handler_t, 
void*)::<lambda()>&> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#21 std::__invoke_r<void, netisr_osv_start_thread(netisr_osv_handler_t, 
void*)::<lambda()>&> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#22 std::_Function_handler<void(), 
netisr_osv_start_thread(netisr_osv_handler_t, void*)::<lambda()> 
>::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/10/bits/std_function.h:291
#23 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860200) at 
core/sched.cc:1267
#24 sched::thread_main_c (t=0xffffa00040860200) at 
arch/aarch64/arch-switch.hh:162
#25 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116

And here is where the most critical part of freeBSD networking code in 
bsd/sys/net/if.cc comes:

1180 /*
1181  * Reference count functions for ifaddrs.
1182  */
1183 void
1184 *ifa_init*(struct bsd_ifaddr *ifa)
1185 {
1186 
1187         mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
1188         refcount_init(&ifa->ifa_refcnt, 1);
1189 }
1190 
1191 void
1192 *ifa_ref*(struct bsd_ifaddr *ifa)
1193 {
1194 
1195         refcount_acquire(&ifa->ifa_refcnt);
1196 }
1197 
1198 void
1199 *ifa_free*(struct bsd_ifaddr *ifa)
1200 {
1201 
1202         if (refcount_release(&ifa->ifa_refcnt)) {
1203                 mtx_destroy(&ifa->ifa_mtx);
1204                 free(ifa);
1205         }
1206 }

It seems that *ifa_free*() calls mtx_destroy() and free() when the 
*ifa_refcnt* field of bsd_ifaddr is 0 after refcount_release() is called. 
See code of refcount_release in bsd/sys/sys/refcount.h:

static __inline int
refcount_release(volatile u_int *count)
{
        u_int old;

        /* XXX: Should this have a rel membar? */
        old = atomic_fetchadd_int(count, -1);
        KASSERT(old > 0, ("negative refcount %p", count));
        return (old == 1);
}

BTW the XXX comment made me a little bit suspicious. But before we go there 
let me try to explain what I figured out about the ifa_refcnt field. The 
comment from the bsd_ifaddr definition states this:

  *u_int   ifa_refcnt;*             /* references to this structure */

So it seems to be some sort of reference count field. But what purpose does 
it serve? The simple grep of *ifa_refcnt *to see where it is used gives 
this list:

./bsd/sys/net/if_var.h:        u_int        ifa_refcnt;                /* 
references to this structure */
./bsd/sys/net/if.cc:        refcount_init(&ifa->ifa_refcnt, 1);
./bsd/sys/net/if.cc:        refcount_acquire(&ifa->ifa_refcnt);
./bsd/sys/net/if.cc:        if (refcount_release(&ifa->ifa_refcnt)) {

which in essence are the 3 functions from bsd/sys/net/if.cc I put couple of 
paragraphs above - ifa_init(), ifa_ref() and ifa_free(). I do not know for 
sure of the intention behind ifa_refcnt (I did study some FreeBSD commits) 
but based on its use by these 3 functions, it seems to act as a reference 
count to implement some sort of garbage collection mechanism around 
*bsd_ifaddr* structures to make sure that it is eventually getting 
free()-ed after no longer used/referenced. From what I figured the 
ifa_init() is called during this struct allocation and initialization and 
there are two places in bsd subtree:

- if_attach_internal() function in bsd/sys/net/if.cc
- in_control() when the SIOCSIFADDR, SIOCSIFNETMASK or SIOCSIFDSTADDR cmd 
passed in bsd/sys/netinet/in.cc.

In essence, the ifa_init() sets the ifa_mtx field and sets ifa_refcnt of 
the *bsd_ifaddr *to 1.
It turns out that the ifa_ref() is then called in 33 places and ifa_free() 
is called in 55 places in the bsd subtree. The former increments and the 
latter decrements the ifa_refcnt counter.

So I started studying any changes to how these functions may have changed 
on OSv side of bsd tree and I have found this commit:

https://github.com/cloudius-systems/osv/commit/8ad9f53c5dbcacf5632ef878ab9cf93b0d860b10
 
- which is backport of the commit on freebsd side to remove some redundant 
(?) occurrences of ifa_ref() and ifa_free() from 
bsd/sys/netinet/ip_output.cc 
<https://github.com/cloudius-systems/osv/commit/8ad9f53c5dbcacf5632ef878ab9cf93b0d860b10#diff-172218361051c20090e57b94b614925569c4eb942a49c136923d4251dcb72f53>
,

Then, I found this patch in FreeBSD codebase around refcount_release 
- 
https://github.com/freebsd/freebsd-src/commit/157d5e6df35748cdb589f743bfeccc04bf9af93e
 
- which changes refcount_release() to look like this:

static __inline int
refcount_release(volatile u_int *count)
{
        u_int old;

        /* XXX: Should this have a rel membar? */
        atomic_thread_fence_rel();
        old = atomic_fetchadd_int(count, -1);
        KASSERT(old > 0, ("negative refcount %p", count));
        return (old == 1);
        if (old > 1)
                return (0);

        /*
         * Last reference.  Signal the user to call the destructor.
         *
         * Ensure that the destructor sees all updates.  The fence_rel
         * at the start of the function synchronized with this fence.
         */
        atomic_thread_fence_acq();
        return (1);
}

And then some other changes adding and removing  ifa_ref() and ifa_free():

   - 
https://github.com/freebsd/freebsd-src/commit/1ebec5faf41f65a793b4f4217d7a67125c877510
 
   - fixes missed refs
   - 
https://github.com/freebsd/freebsd-src/commit/2a4bd982d0530863c7fd2aa1ff2b8de27b8896c7
 
   - introduces new macro NET_EPOCH_CALL in some related places

On top of that the ifa_ref() and ifa_free() have been changed and enhanced 
with new related functions and look like this in the latest version (see 
https://github.com/freebsd/freebsd-src/blob/main/sys/net/if.c#L1759-L1795):

void
ifa_ref(struct ifaddr *ifa)
{
        u_int old __diagused;

        old = refcount_acquire(&ifa->ifa_refcnt);
        KASSERT(old > 0, ("%s: ifa %p has 0 refs", __func__, ifa));
}

int
ifa_try_ref(struct ifaddr *ifa)
{

        NET_EPOCH_ASSERT();
        return (refcount_acquire_if_not_zero(&ifa->ifa_refcnt));
}

static void
ifa_destroy(epoch_context_t ctx)
{
        struct ifaddr *ifa;

        ifa = __containerof(ctx, struct ifaddr, ifa_epoch_ctx);
        counter_u64_free(ifa->ifa_opackets);
        counter_u64_free(ifa->ifa_ipackets);
        counter_u64_free(ifa->ifa_obytes);
        counter_u64_free(ifa->ifa_ibytes);
        free(ifa, M_IFADDR);
}

void
ifa_free(struct ifaddr *ifa)
{

        if (refcount_release(&ifa->ifa_refcnt))
                NET_EPOCH_CALL(ifa_destroy, &ifa->ifa_epoch_ctx);
}

The struct ifaddr has changed as well.

Why am I listing all these changes? Well, my theory (I think more than 
that) is that there is some old bug in the reference counting logic around 
ifa_refcnt in the bsd substree - either un-balanced (too many, too few) 
number of calls to ifa_ref/ifa_free or buggy refcount_release() function 
(see TODO comment) or combination of those that somehow only shows its ugly 
face on aarch64. Or possibly there is some race condition that makes calls 
to ifa_free() be executed before ifa_ref() even if they are balanced. *Please 
note that this bug happens on 1 or more CPUs so it is not SMP-specific by 
any means.*

If that is the case then ideally we would want to fix this bug by 
back-porting all these commits and possibly many more from the current 
FreeBSD TCP code. And I have actually tried to do that - 1st by backporting 
the refcount_release() patch which did not help. Then apply some extra 
changes to add/remove some calls to ifa_ref() and ifa_free(). This led to 
some new crashes now even on x86_64. All of this has made me think that 
bringing back all necessary changes from FreeBSD tree to fix this bug would 
be a very tedious and risky effort. We might make the situation even worse.

So instead I realized there is a simpler solution to this problem - 
eliminate buggy reference counting code like this:

diff --git a/bsd/sys/net/if.cc b/bsd/sys/net/if.cc
index f717da37..fb8f604c 100644
--- a/bsd/sys/net/if.cc
+++ b/bsd/sys/net/if.cc
@@ -1192,17 +1192,17 @@ void
 ifa_ref(struct bsd_ifaddr *ifa)
 {
 
-        refcount_acquire(&ifa->ifa_refcnt);
+        //refcount_acquire(&ifa->ifa_refcnt);
 }
 
 void
 ifa_free(struct bsd_ifaddr *ifa)
 {
 
-        if (refcount_release(&ifa->ifa_refcnt)) {
-                mtx_destroy(&ifa->ifa_mtx);
-                free(ifa);
-        }
+        //if (refcount_release(&ifa->ifa_refcnt)) {
+        //        mtx_destroy(&ifa->ifa_mtx);
+        //        free(ifa);
+        //}
 }

The  bsd_ifaddr is allocated only a handful of times (=5) and destroyed 
once during OSv runtime. Here is the list of places that I captured from 
gdb:

(gdb) hbr bsd/sys/net/if.cc:1187
Hardware assisted breakpoint 1 at 0xfc011bddc: bsd/sys/net/if.cc:1187. (2 
locations)
(gdb) c
Continuing.

Breakpoint 1, if_attach_internal (vmove=0, ifp=0xffffa0004088e400) at 
bsd/sys/net/if.cc:1187
1187                mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0  if_attach_internal (vmove=0, ifp=0xffffa0004088e400) at 
bsd/sys/net/if.cc:1187
#1  if_attach (ifp=ifp@entry=0xffff90004088f000) at bsd/sys/net/if.cc:519
#2  0x0000000fc011a390 in lo_clone_create (ifc=0xfc061d908 <lo_cloner>, 
unit=0, params=<optimized out>) at bsd/sys/net/if_loop.cc:116
#3  0x0000000fc011a0d0 in ifc_simple_create (ifc=0xfc061d908 <lo_cloner>, 
name=0xffff8000407a2d50 "lo0", len=16, params=0x0) at 
bsd/sys/net/if_clone.cc:572
#4  0x0000000fc0119360 in if_clone_createif (ifc=ifc@entry=0xfc061d908 
<lo_cloner>, name=name@entry=0xffff8000407a2d50 "lo0", len=len@entry=16, 
    params=params@entry=0x0) at bsd/sys/net/if_clone.cc:171
#5  0x0000000fc0119458 in if_clone_createif (params=0x0, len=16, 
name=0xffff8000407a2d50 "lo0", ifc=0xfc061d908 <lo_cloner>) at 
bsd/sys/net/if_clone.cc:523
#6  ifc_simple_attach (ifc=0xfc061d908 <lo_cloner>) at 
bsd/sys/net/if_clone.cc:525
#7  0x0000000fc0119908 in if_clone_attach (ifc=ifc@entry=0xfc061d908 
<lo_cloner>) at bsd/sys/net/if_clone.cc:306
#8  0x0000000fc011a420 in vnet_loif_init () at bsd/sys/net/if_loop.cc:125
#9  0x0000000fc00ec2dc in net_init () at bsd/net.cc:64
#10 0x0000000fc00dc99c in main_cont (loader_argc=<optimized out>, 
loader_argv=<optimized out>) at loader.cc:729
#11 0x0000000fc0257444 in sched::thread::main (this=0xfc0663e28) at 
core/sched.cc:1267
#12 sched::thread_main_c (t=0xfc0663e28) at arch/aarch64/arch-switch.hh:162
#13 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.

Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa0004088e800) at 
bsd/sys/net/if.cc:1187
1187                mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0  ifa_init (ifa=ifa@entry=0xffffa0004088e800) at bsd/sys/net/if.cc:1187
#1  0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034, 
data=data@entry=0xffff8000407a2d50 "lo0", ifp=ifp@entry=0xffff90004088f000, 
td=td@entry=0x0)
    at bsd/sys/netinet/in.cc:388
#2  0x0000000fc0107554 in osv::start_if (if_name="lo0", 
ip_addr="127.0.0.1", mask_addr="255.0.0.0") at bsd/porting/networking.cc:113
#3  0x0000000fc00ec350 in net_init () at 
/usr/include/c++/10/bits/char_traits.h:322
#4  0x0000000fc00dc99c in main_cont (loader_argc=<optimized out>, 
loader_argv=<optimized out>) at loader.cc:729
#5  0x0000000fc0257444 in sched::thread::main (this=0xfc0663e28) at 
core/sched.cc:1267
#6  sched::thread_main_c (t=0xfc0663e28) at arch/aarch64/arch-switch.hh:162
#7  0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.

Breakpoint 1, if_attach_internal (vmove=0, ifp=0xffffa0004088ec00) at 
bsd/sys/net/if.cc:1187
1187                mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0  if_attach_internal (vmove=0, ifp=0xffffa0004088ec00) at 
bsd/sys/net/if.cc:1187
#1  if_attach (ifp=ifp@entry=0xffff900040961000) at bsd/sys/net/if.cc:519
#2  0x0000000fc0110314 in ether_ifattach (ifp=0xffff900040961000, 
lla=lla@entry=0xffff90004096a248 "RT") at bsd/sys/net/if_ethersubr.cc:902
#3  0x0000000fc017790c in virtio::net::net (this=0xffff90004096a000, 
dev=...) at drivers/virtio-net.cc:304
#4  0x0000000fc0177cdc in aligned_new<virtio::net, virtio::virtio_device&> 
() at /usr/include/c++/10/new:175
#5  virtio::net::probe (dev=<optimized out>) at drivers/virtio-net.cc:931
#6  0x0000000fc016cfd8 in std::function<hw::hw_driver* 
(hw::hw_device*)>::operator()(hw::hw_device*) const (__args#0=<optimized 
out>, this=<optimized out>)
    at drivers/driver.cc:39
#7  operator() (dev=0xffffa0004096f300, __closure=0x200000100b10) at 
drivers/driver.cc:40
#8  std::__invoke_impl<void, 
hw::driver_manager::load_all()::<lambda(hw::hw_device*)>&, hw::hw_device*> 
(__f=...) at /usr/include/c++/10/bits/invoke.h:60
#9  std::__invoke_r<void, 
hw::driver_manager::load_all()::<lambda(hw::hw_device*)>&, hw::hw_device*> 
(__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#10 std::_Function_handler<void(hw::hw_device*), 
hw::driver_manager::load_all()::<lambda(hw::hw_device*)> >::_M_invoke(const 
std::_Any_data &, hw::hw_device *&&) (
    __functor=..., __args#0=<optimized out>) at 
/usr/include/c++/10/bits/std_function.h:291
#11 0x0000000fc0169658 in std::function<void 
(hw::hw_device*)>::operator()(hw::hw_device*) const (__args#0=<optimized 
out>, this=0x200000100b10)
    at /usr/include/c++/10/bits/std_function.h:617
#12 hw::device_manager::for_each_device(std::function<void 
(hw::hw_device*)>) (this=this@entry=0xffffa00040854b40, func=...) at 
drivers/device.cc:67
#13 0x0000000fc016cd18 in hw::driver_manager::load_all 
(this=this@entry=0xffffa00040854f80) at /usr/include/c++/10/new:175
#14 0x0000000fc017b7e8 in arch_init_drivers () at 
arch/aarch64/arch-setup.cc:240
#15 0x0000000fc00dcf7c in do_main_thread (_main_args=0xffffa000406fb7f0) at 
loader.cc:416
#16 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at 
libc/pthread.cc:116
#17 std::__invoke_impl<void, pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()>&> (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#18 std::__invoke_r<void, pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:153
#19 std::_Function_handler<void(), pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const 
std::_Any_data &) (__functor=...) at 
/usr/include/c++/10/bits/std_function.h:291
#20 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at 
core/sched.cc:1267
#21 sched::thread_main_c (t=0xffffa00040860c00) at 
arch/aarch64/arch-switch.hh:162
#22 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.

Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at 
bsd/sys/net/if.cc:1187
1187                mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0  ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at bsd/sys/net/if.cc:1187
#1  0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034, 
data=data@entry=0x200000100a70 "eth0", ifp=ifp@entry=0xffff900040961000, 
td=td@entry=0x0)
    at bsd/sys/netinet/in.cc:388
#2  0x0000000fc0107554 in osv::start_if (if_name="eth0", ip_addr="0.0.0.0", 
mask_addr="255.255.255.0") at bsd/porting/networking.cc:113
#3  0x0000000fc00da234 in operator() (if_name="eth0", 
__closure=0x200000100db0) at /usr/include/c++/10/ext/new_allocator.h:79
#4  std::__invoke_impl<void, do_main_thread(void*)::<lambda(std::string)>&, 
std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > > (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#5  std::__invoke_r<void, do_main_thread(void*)::<lambda(std::string)>&, 
std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > > (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:153
#6  std::_Function_handler<void(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >), 
do_main_thread(void*)::<lambda(std::string)> >::_M_invoke(const 
std::_Any_data &, std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > &&) (__functor=..., __args#0=...)
    at /usr/include/c++/10/bits/std_function.h:291
#7  0x0000000fc0107274 in std::function<void 
(std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >)>::operator()(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >) const (__args#0="", 
this=0x200000100db0) at /usr/include/c++/10/bits/std_function.h:617
#8  osv::for_each_if(std::function<void (std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >)>) (func=...) at 
bsd/porting/networking.cc:32
#9  0x0000000fc00dcfd4 in do_main_thread (_main_args=0xffffa000406fb7f0) at 
/usr/include/c++/10/new:175
#10 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at 
libc/pthread.cc:116
#11 std::__invoke_impl<void, pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()>&> (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#12 std::__invoke_r<void, pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:153
#13 std::_Function_handler<void(), pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const 
std::_Any_data &) (__functor=...) at 
/usr/include/c++/10/bits/std_function.h:291
#14 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at 
core/sched.cc:1267
#15 sched::thread_main_c (t=0xffffa00040860c00) at 
arch/aarch64/arch-switch.hh:162
#16 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.

Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at 
bsd/sys/net/if.cc:1187
1187                mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0  ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at bsd/sys/net/if.cc:1187
#1  0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034, 
data=data@entry=0x200000100a70 "eth0", ifp=ifp@entry=0xffff900040961000, 
td=td@entry=0x0)
    at bsd/sys/netinet/in.cc:388
#2  0x0000000fc0107554 in osv::start_if (if_name="eth0", ip_addr="0.0.0.0", 
mask_addr="255.255.255.0") at bsd/porting/networking.cc:113
#3  0x0000000fc00da234 in operator() (if_name="eth0", 
__closure=0x200000100db0) at /usr/include/c++/10/ext/new_allocator.h:79
#4  std::__invoke_impl<void, do_main_thread(void*)::<lambda(std::string)>&, 
std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > > (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#5  std::__invoke_r<void, do_main_thread(void*)::<lambda(std::string)>&, 
std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > > (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:153
#6  std::_Function_handler<void(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >), 
do_main_thread(void*)::<lambda(std::string)> >::_M_invoke(const 
std::_Any_data &, std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > &&) (__functor=..., __args#0=...)
    at /usr/include/c++/10/bits/std_function.h:291
#7  0x0000000fc0107274 in std::function<void 
(std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >)>::operator()(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >) const (__args#0="", 
this=0x200000100db0) at /usr/include/c++/10/bits/std_function.h:617
#8  osv::for_each_if(std::function<void (std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >)>) (func=...) at 
bsd/porting/networking.cc:32
#9  0x0000000fc00dcfd4 in do_main_thread (_main_args=0xffffa000406fb7f0) at 
/usr/include/c++/10/new:175
#10 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at 
libc/pthread.cc:116
#11 std::__invoke_impl<void, pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()>&> (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#12 std::__invoke_r<void, pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:153
#13 std::_Function_handler<void(), pthread_private::pthread::pthread(void* 
(*)(void*), void*, sigset_t, const 
pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const 
std::_Any_data &) (__functor=...) at 
/usr/include/c++/10/bits/std_function.h:291
#14 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at 
core/sched.cc:1267
#15 sched::thread_main_c (t=0xffffa00040860c00) at 
arch/aarch64/arch-switch.hh:162
#16 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.

Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040960400) at 
bsd/sys/net/if.cc:1187
1187                mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0  ifa_init (ifa=ifa@entry=0xffffa00040960400) at bsd/sys/net/if.cc:1187
#1  0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034, 
data=data@entry=0xffff800040d52d90 "eth0", 
ifp=ifp@entry=0xffff900040961000, td=td@entry=0x0)
    at bsd/sys/netinet/in.cc:388
#2  0x0000000fc0107554 in osv::start_if (if_name="eth0", 
ip_addr="192.168.122.15", mask_addr="255.255.255.0") at 
bsd/porting/networking.cc:113
#3  0x0000000fc027b05c in dhcp::dhcp_interface_state::state_request 
(this=this@entry=0xffffa0004096bc40, dm=...) at 
/usr/include/c++/10/bits/char_traits.h:322
#4  0x0000000fc027c3d0 in dhcp::dhcp_interface_state::process_packet 
(this=0xffffa0004096bc40, m=m@entry=0xffffa00040c7c100) at core/dhcp.cc:633
#5  0x0000000fc027c540 in dhcp::dhcp_worker::dhcp_worker_fn 
(this=0xfc06aab30 <net_dhcp_worker>) at core/dhcp.cc:843
#6  0x0000000fc027c5e0 in operator() (__closure=<optimized out>, 
__closure=<optimized out>) at core/dhcp.cc:773
#7  std::__invoke_impl<void, dhcp::dhcp_worker::init()::<lambda()>&> 
(__f=...) at /usr/include/c++/10/bits/invoke.h:60
#8  std::__invoke_r<void, dhcp::dhcp_worker::init()::<lambda()>&> 
(__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#9  std::_Function_handler<void(), dhcp::dhcp_worker::init()::<lambda()> 
>::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/10/bits/std_function.h:291
#10 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040b08800) at 
core/sched.cc:1267
#11 sched::thread_main_c (t=0xffffa00040b08800) at 
arch/aarch64/arch-switch.hh:162
#12 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116

Ideally, as a "good citizen" we would want to free what we allocated but 
given that even now these 5 calls to allocate bsd_ifaddr (mostly from 
osv::start_if()) are not balanced with corresponding ones to free 
bsd_ifaddr, possibly because we lack all necessary calls to osv::stop_if() 
(only one in core/dhcp.cc). But given there is only one occurrence of 
freeing bsd_ifaddr(), disabling reference counting logic would only lead to 
a single leak of memory (roughly 1024 bytes). These structures are 
long-lived and map to a handful of IPs/interfaces (lo, eth0, ...) OSv would 
have.

There is also extra benefit: the refcount_release()/refcount_acquire() 
functions called by ifa_free()/ifa_ref() are not free - they operate on 
atomics and probably lead to some contention in SMP case. I have not made 
any measurements but I would assume the tcp stack could become a little bit 
faster by not using atomics at all (there are 33 calls to ifa_ref() and 55 
to ifa_free() in bsd subtree). For example, the refcount_acquire() calls 
this inline assembly:

static __inline u_int atomic_fetchadd_int(volatile u_int *p, u_int val)
{
    u_int result;
    u_int status;
    __asm __volatile("1: ldaxr %w0, %1 ; "
                     "   add   %w2, %w2, %w0 ; "
                     "   stlxr %w3, %w2, %1 ; "
                     "   cbnz  %w3, 1b ; "
                     : "=&r"(result), "+Q"(*p), "+r"(val), "=&r"(status));

    return result;
}

This is actually a loop likely to happen in SMP contentious cases. See this 
article - https://cpufun.substack.com/p/atomics-in-aarch64?s=r.

After applying the patch (see above) to disable ifa_refcnt counting 
(ifa_ref, ifa_free do nothing) and running all 4 tst-bsd-tcp-* tests with 
100K iterations on 1, 2, and 4 CPUs on both aarch64 and x86_64 I could not 
reproduce these crashes and "hang" scenarios.

What do you think about my idea?

Waldek

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/3d6aee22-e544-4266-a386-391f2c27fa63n%40googlegroups.com.

Reply via email to