Hi,

We recently upgraded OVS from one 2.10 build to another 2.10 build. On some hypervisors (HVs) the upgrade took the HV down while running force-reload-kmod. In the ovs-ctl log, killing ovs-vswitchd failed, but the script still went on to reload the kernel modules:

```
ovsdb-server is running with pid 2431
ovs-vswitchd is running with pid 2507
Thu Aug 22 23:13:49 UTC 2019:stop
2019-08-22T23:13:50Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Alarm clock
2019-08-22T23:13:51Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Alarm clock
 * Exiting ovs-vswitchd (2507)
 * Killing ovs-vswitchd (2507)
 * Killing ovs-vswitchd (2507) with SIGKILL
 * Killing ovs-vswitchd (2507) failed
 * Exiting ovsdb-server (2431)
Thu Aug 22 23:14:58 UTC 2019:load-kmod
Thu Aug 22 23:14:58 UTC 2019:start --system-id=random --no-full-hostname
/usr/share/openvswitch/scripts/ovs-ctl: unknown option "--no-full-hostname" (use --help for help)
 * Starting ovsdb-server
 * Configuring Open vSwitch system IDs
 * ovs-vswitchd is already running
 * Enabling remote OVSDB managers
ovsdb-server is running with pid 3860447
ovs-vswitchd is running with pid 2507
ovsdb-server is running with pid 3860447
ovs-vswitchd is running with pid 2507
Thu Aug 22 23:15:09 UTC 2019:load-kmod
Thu Aug 22 23:15:09 UTC 2019:force-reload-kmod --system-id=random --no-full-hostname
/usr/share/openvswitch/scripts/ovs-ctl: unknown option "--no-full-hostname" (use --help for help)
 * Detected internal interfaces: br-int
Thu Aug 22 23:37:08 UTC 2019:stop
2019-08-22T23:37:09Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Alarm clock
2019-08-22T23:37:10Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Alarm clock
 * Exiting ovs-vswitchd (2507)
 * Killing ovs-vswitchd (2507)
 * Killing ovs-vswitchd (2507) with SIGKILL
 * Killing ovs-vswitchd (2507) failed
 * Exiting ovsdb-server (3860447)
Thu Aug 22 23:40:42 UTC 2019:load-kmod
 * Inserting openvswitch module
Thu Aug 22 23:40:42 UTC 2019:start --system-id=random --no-full-hostname
/usr/share/openvswitch/scripts/ovs-ctl: unknown option "--no-full-hostname" (use --help for help)
 * Starting ovsdb-server
 * Configuring Open vSwitch system IDs
 * Starting ovs-vswitchd
 * Enabling remote OVSDB managers
ovsdb-server is running with pid 2399
ovs-vswitchd is running with pid 2440
ovsdb-server is running with pid 2399
ovs-vswitchd is running with pid 2440
Thu Aug 22 23:46:18 UTC 2019:load-kmod
Thu Aug 22 23:46:18 UTC 2019:force-reload-kmod --system-id=random --no-full-hostname
/usr/share/openvswitch/scripts/ovs-ctl: unknown option "--no-full-hostname" (use --help for help)
 * Detected internal interfaces: br-int br0
 * Saving flows
 * Exiting ovsdb-server (2399)
 * Starting ovsdb-server
 * Configuring Open vSwitch system IDs
 * Flush old conntrack entries
 * Exiting ovs-vswitchd (2440)
 * Saving interface configuration
 * Removing datapath: system@ovs-system
 * Removing openvswitch module
rmmod: ERROR: Module vxlan is in use by: i40e
 * Forcing removal of vxlan module
 * Inserting openvswitch module
 * Starting ovs-vswitchd
 * Restoring saved flows
 * Enabling remote OVSDB managers
 * Restoring interface configuration
```
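What worries me is that the script keeps going even though the SIGKILL failed. I would have expected a guard roughly like the one below before the kernel module is touched; this is only a minimal sketch in plain shell to illustrate the idea, not the actual ovs-ctl code:

```
#!/bin/sh
# Hypothetical guard (not what ovs-ctl does today): refuse to reload the
# kernel module while the old ovs-vswitchd is still alive after "stop".
pid=$(pidof ovs-vswitchd)
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "ovs-vswitchd (pid $pid) did not exit; aborting force-reload-kmod" >&2
    exit 1
fi

# Only reached once the old daemon is really gone.
rmmod openvswitch
modprobe openvswitch
/usr/share/openvswitch/scripts/ovs-ctl start --system-id=random
```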
But in kern.log we see the messages below: ovs-vswitchd could not exit because it was waiting for br0 to be released, and the SIGTERM and `kill -9` that ovs-ctl sent had no effect because the kernel was stuck in an endless loop. ovs-ctl then went on to save the flows, and while saving the flows a core dump happened in the kernel. The HV stayed down until it was restarted.

```
Aug 22 16:13:45 slx11c-9gjm kernel: [21177057.998961] device br0 left promiscuous mode
Aug 22 16:13:55 slx11c-9gjm kernel: [21177068.044859] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:05 slx11c-9gjm kernel: [21177078.068429] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:15 slx11c-9gjm kernel: [21177088.312001] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:25 slx11c-9gjm kernel: [21177098.359556] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:35 slx11c-9gjm kernel: [21177108.603058] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:45 slx11c-9gjm kernel: [21177118.658643] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:55 slx11c-9gjm kernel: [21177128.678211] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:05 slx11c-9gjm kernel: [21177138.721732] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:15 slx11c-9gjm kernel: [21177148.817317] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:25 slx11c-9gjm kernel: [21177158.828853] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:35 slx11c-9gjm kernel: [21177168.860459] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:46 slx11c-9gjm kernel: [21177179.104006] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:56 slx11c-9gjm kernel: [21177189.143518] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:06 slx11c-9gjm kernel: [21177199.183075] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:16 slx11c-9gjm kernel: [21177209.262617] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:26 slx11c-9gjm kernel: [21177219.510207] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:36 slx11c-9gjm kernel: [21177229.521757] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:46 slx11c-9gjm kernel: [21177239.765278] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:57 slx11c-9gjm kernel: [21177250.008826] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.016400] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.256456] INFO: task revalidator221:3859855 blocked for more than 120 seconds.
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.264121]       Tainted: G           OE   4.4.0-104-generic #127-Ubuntu
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271179] revalidator221  D ffff8826844eba48     0 3859855   2506 0x00000084
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271180]  ffff8826844eba48 0000000000000000 ffff884fc2e69540 ffff882758688000
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271181]  ffff8826844ec000 ffffffffc0463564 ffff882758688000 00000000ffffffff
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271182]  ffffffffc0463568 ffff8826844eba60 ffffffff81840765 ffffffffc0463560
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271182] Call Trace:
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271191]  [<ffffffff81840765>] schedule+0x35/0x80
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271192]  [<ffffffff81840a0e>] schedule_preempt_disabled+0xe/0x10
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271193]  [<ffffffff81842649>] __mutex_lock_slowpath+0xb9/0x130
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271195]  [<ffffffff818426df>] mutex_lock+0x1f/0x30
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271207]  [<ffffffffc042ef56>] ovs_dp_cmd_get+0x46/0xf0 [openvswitch]
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271211]  [<ffffffff8176d287>] genl_family_rcv_msg+0x1e7/0x3c0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271213]  [<ffffffff8176d460>] ? genl_family_rcv_msg+0x3c0/0x3c0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271214]  [<ffffffff8176d4d6>] genl_rcv_msg+0x76/0xb0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271215]  [<ffffffff8176a97c>] ? netlink_lookup+0xdc/0x140
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271217]  [<ffffffff8176c9a4>] netlink_rcv_skb+0xa4/0xc0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271218]  [<ffffffff8176d088>] genl_rcv+0x28/0x40
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271219]  [<ffffffff8176c37a>] netlink_unicast+0x18a/0x240
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271221]  [<ffffffff8176c72b>] netlink_sendmsg+0x2fb/0x3a0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271225]  [<ffffffff813a1c11>] ? aa_sock_msg_perm+0x61/0x150
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271227]  [<ffffffff8171bad8>] sock_sendmsg+0x38/0x50
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271228]  [<ffffffff8171c581>] ___sys_sendmsg+0x281/0x290
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271231]  [<ffffffff810badf7>] ? should_numa_migrate_memory+0x57/0x130
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271233]  [<ffffffff811c2713>] ? handle_mm_fault+0xbc3/0x1820
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271235]  [<ffffffff812101b4>] ? new_sync_read+0x94/0xd0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271237]  [<ffffffff8122e9e5>] ? __fget_light+0x25/0x60
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271238]  [<ffffffff8171ced1>] __sys_sendmsg+0x51/0x90
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271239]  [<ffffffff8171cf22>] SyS_sendmsg+0x12/0x20
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271240]  [<ffffffff81844872>] entry_SYSCALL_64_fastpath+0x16/0x71
```

My questions are:

1. Why do we need to stop the openvswitch service before backing up the flows?
2. If openvswitch is in the middle of exiting, why can it still receive requests?
3. Why did it take br0 so long to be released?

I think this is a bug in the upgrade path that needs to be fixed.

Thanks,
L.J.
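P.S. For question 1, by "backing up the flows" I mean something roughly equivalent to the sketch below (simplified; the real /usr/share/openvswitch/scripts/ovs-save script filters the dump further and handles every bridge, and the file path here is just for illustration):

```
# Rough illustration of the "Saving flows" / "Restoring saved flows" steps.
# Drop the reply header and the age fields that add-flows cannot parse back.
ovs-ofctl dump-flows br0 \
    | sed -e '/NXST_FLOW/d' -e '/OFPST_FLOW/d' \
          -e 's/\(idle\|hard\)_age=[0-9]*, //g' > /tmp/br0.flows

# ... openvswitch kernel module is removed and re-inserted here ...

ovs-ofctl add-flows br0 /tmp/br0.flows
```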
