Hi All,

I recently ran into a crash caused by an assertion failure in a live environment (an OpenStack neutron-openvswitch deployment). It was running an old version, 2.9.2-0ubuntu0.18.04.3 from Ubuntu 18.04 (Bionic), but my analysis suggests that the issue could still exist in master. I was hoping for some assistance in understanding how to reproduce it, or in understanding the specific situation/netlink packet that led to the crash.

The general issue appears to be that the udpif_revalidator thread tried to expand a stack-allocated ofpbuf to fit a netlink reply of 3204 bytes, but the buffer is only 2048 bytes. This intentionally raises an assertion, as we can't expand memory on the stack. I've included the full backtrace at the end of this e-mail.

The relevant source tree can be found here:

git clone -b applied/2.9.2-0ubuntu0.18.04.3 https://git.launchpad.net/ubuntu/+source/openvswitch
https://git.launchpad.net/ubuntu/+source/openvswitch/tree/?h=applied/2.9.2-0ubuntu0.18.04.3

The crash in ofpbuf_resize__ appears to be due to OVS_NOT_REACHED() being called because b->source == OFPBUF_STACK (the line number indicates the default: case, but this appears to be an optimiser quirk; b->source is OFPBUF_STACK). We can't realloc() the buffer memory if it's allocated on the stack.

This buffer is provided in frame #7, nl_sock_transact_multiple__, during the call to nl_sock_recv__, where it is specified as buf_txn->reply. In this specific case it seems we found transactions[0] available and so used that rather than tmp_txn. The original source of transactions (it's passed through most of the function calls) appears to be op_auxdata, allocated on the stack at the top of the dpif_netlink_operate__ function (dpif-netlink.c:1875).

The size of this particular message was 3204 bytes, so 2048 bytes went into the buffer and the remaining 1156 bytes went into the tail iovec set up inside nl_sock_recv__, which then tried to expand the ofpbuf to hold them.

Various nl_sock_* functions have comments about the buffer ideally being the right size for optimal performance (I guess to avoid the reallocation), but it seems like a possible oversight in the dpif_netlink_operate__ workflow that the nl_sock_* functions may ultimately try to expand that buffer and then fail because of the stack allocation.
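
To make the failure mode above concrete, here is a minimal standalone model of it in C. This is only a sketch under assumptions: mini_buf, mini_buf_reserve and mini_recv are hypothetical names invented for illustration and are not the OVS API, and where the real code aborts via OVS_NOT_REACHED() this model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the ofpbuf source types; not the OVS
 * definitions. */
enum buf_source { BUF_MALLOC, BUF_STACK };

struct mini_buf {
    void *base;             /* Start of the backing memory. */
    size_t size;            /* Bytes currently in use. */
    size_t allocated;       /* Capacity of the backing memory. */
    enum buf_source source; /* Where the backing memory came from. */
};

/* Ensures there is room for 'extra' more bytes.  Stack-backed memory
 * cannot be realloc()ed, so growth is refused in that case (the crash
 * described above is an abort at the equivalent point). */
static bool
mini_buf_reserve(struct mini_buf *b, size_t extra)
{
    assert(b->size <= b->allocated);
    if (b->allocated - b->size >= extra) {
        return true;
    }
    if (b->source == BUF_STACK) {
        return false;
    }
    void *p = realloc(b->base, b->size + extra);
    if (!p) {
        return false;
    }
    b->base = p;
    b->allocated = b->size + extra;
    return true;
}

/* Models the receive path: the first 'allocated' bytes of the message
 * land directly in the buffer, and any remainder (the "tail") has to be
 * appended afterwards, which requires growing the buffer. */
static bool
mini_recv(struct mini_buf *b, const unsigned char *msg, size_t msg_len)
{
    size_t head = msg_len < b->allocated ? msg_len : b->allocated;
    size_t tail = msg_len - head;

    memcpy(b->base, msg, head);
    b->size = head;
    if (tail == 0) {
        return true;
    }
    if (!mini_buf_reserve(b, tail)) {
        return false;
    }
    memcpy((unsigned char *) b->base + b->size, msg + head, tail);
    b->size += tail;
    return true;
}
```

With a 2048-byte stack-backed mini_buf and a 3204-byte message, mini_recv() copies the first 2048 bytes and then fails when asked to reserve the remaining 1156; the same call succeeds when the buffer is malloc()-backed.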

What I am having difficulty figuring out from the core file, and where I was hoping for some help, is the actual content of this particular netlink message. Knowing that would explain why it was larger than 2048 bytes, and may also give a hint as to how to reproduce the issue and/or justify increasing the buffer size or similar. We have only hit this issue once that I know of, and it does not seem to recur easily. I also had a hunt and couldn't find any obvious commits/changes touching these areas since that version.

I had trouble decoding the netlink message, the dpif_op structure, etc. to figure out its actual contents, including with the GDB helpers. I suspect I am missing something obvious on how to do that, and am hoping someone may have a suggestion. I would appreciate any input there, or if the issue seems obvious otherwise from my description I am all ears :)

Backtrace below.

Regards,
Trent

Thread 1 (Thread 0x7f3e0ffff700 (LWP 1539131)):
#0 0x00007f3ed30c8428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f3ed30ca02a in __GI_abort () at abort.c:89
#2 0x00000000004e5035 in ofpbuf_resize__ (b=b@entry=0x7f3e0fffb050, new_headroom=<optimized out>, new_tailroom=new_tailroom@entry=1156) at ../lib/ofpbuf.c:262
#3 0x00000000004e5338 in ofpbuf_prealloc_tailroom (b=b@entry=0x7f3e0fffb050, size=size@entry=1156) at ../lib/ofpbuf.c:291
#4 0x00000000004e54e5 in ofpbuf_put_uninit (size=size@entry=1156, b=b@entry=0x7f3e0fffb050) at ../lib/ofpbuf.c:365
#5 ofpbuf_put (b=b@entry=0x7f3e0fffb050, p=p@entry=0x7f3e0ffcf0a0, size=size@entry=1156) at ../lib/ofpbuf.c:388
#6 0x00000000005392a6 in nl_sock_recv__ (sock=sock@entry=0x7f3e50009150, buf=0x7f3e0fffb050, wait=wait@entry=false) at ../lib/netlink-socket.c:705
#7 0x0000000000539474 in nl_sock_transact_multiple__ (sock=sock@entry=0x7f3e50009150, transactions=transactions@entry=0x7f3e0ffdff20, n=1, done=done@entry=0x7f3e0ffdfe10) at ../lib/netlink-socket.c:824
#8 0x000000000053980a in nl_sock_transact_multiple (sock=0x7f3e50009150, transactions=transactions@entry=0x7f3e0ffdff20, n=n@entry=1) at ../lib/netlink-socket.c:1009
#9 0x000000000053aa1b in nl_sock_transact_multiple (n=1, transactions=0x7f3e0ffdff20, sock=<optimized out>) at ../lib/netlink-socket.c:1765
#10 nl_transact_multiple (protocol=protocol@entry=16, transactions=transactions@entry=0x7f3e0ffdff20, n=n@entry=1) at ../lib/netlink-socket.c:1764
#11 0x0000000000528b01 in dpif_netlink_operate__ (dpif=dpif@entry=0x25a6150, ops=ops@entry=0x7f3e0fffaf28, n_ops=n_ops@entry=1) at ../lib/dpif-netlink.c:1964
#12 0x0000000000529956 in dpif_netlink_operate_chunks (n_ops=1, ops=0x7f3e0fffaf28, dpif=<optimized out>) at ../lib/dpif-netlink.c:2243
#13 dpif_netlink_operate (dpif_=0x25a6150, ops=<optimized out>, n_ops=<optimized out>) at ../lib/dpif-netlink.c:2279
#14 0x00000000004756de in dpif_operate (dpif=0x25a6150, ops=<optimized out>, ops@entry=0x7f3e0fffaf28, n_ops=n_ops@entry=1) at ../lib/dpif.c:1359
#15 0x00000000004758e7 in dpif_flow_get (dpif=<optimized out>, key=<optimized out>, key_len=<optimized out>, ufid=<optimized out>, pmd_id=<optimized out>, buf=buf@entry=0x7f3e0fffb050, flow=<optimized out>) at ../lib/dpif.c:1014
#16 0x000000000043f662 in ukey_create_from_dpif_flow (udpif=0x229cbf0, udpif=0x229cbf0, ukey=<synthetic pointer>, flow=0x7f3e0fffc790) at ../ofproto/ofproto-dpif-upcall.c:1709
#17 ukey_acquire (error=<synthetic pointer>, result=<synthetic pointer>, flow=0x7f3e0fffc790, udpif=0x229cbf0) at ../ofproto/ofproto-dpif-upcall.c:1914
#18 revalidate (revalidator=0x250eaa8) at ../ofproto/ofproto-dpif-upcall.c:2473
#19 0x000000000043f816 in udpif_revalidator (arg=0x250eaa8) at ../ofproto/ofproto-dpif-upcall.c:913
#20 0x00000000004ea4b4 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:348
#21 0x00007f3ed39756ba in start_thread (arg=0x7f3e0ffff700) at pthread_create.c:333
#22 0x00007f3ed319a41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
