On 2/1/22 23:15, Vladislav Odintsov wrote:
> Hi OVN and OVS folks,
>
> during an OVN upgrade from version 20.06 to 21.12 I’ve hit quite a long
> datapath downtime (48-50 seconds).
> I suppose this downtime was triggered by ovn-controller installing new
> openflow rules to ovs.
> There are relatively few flows (13-14k).
> openvswitch version is 2.13.4.
> ovn-openflow-probe-interval is configured to 30.
>
> The upgrade procedure was:
> 1. Set in ovs external_ids an option to force matching of the ovn-northd
>    internal version (ovn-match-northd-version=true)
> 2. Upgrade ovn-controller (to first get an ovn-controller which supports the
>    version match between controller and northd, and not break the datapath
>    after upgrading ovn-northd)
> 3. Upgrade the ovn-northd cluster
>
> The network outage timeframe matches the ovs-vswitchd WARN logs. Please see:
>
> 2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for main to quiesce
> 2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce
> 2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for main to quiesce
> 2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for main to quiesce
> 2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37 flow_mods 13 s ago (36 adds, 1 deletes)
> 2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll interval (0ms user, 12833ms system)
> 2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major
> 2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary, 18 involuntary
> 2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=9105faf6:
> 2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand 0.0/sec 0.000/sec 0.0000/sec total: 1129
> 2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink 0.0/sec 11.933/sec 0.1989/sec total: 2363
> 2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand 0.0/sec 0.833/sec 0.8811/sec total: 11218023
> 2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink 8.2/sec 11.950/sec 1.0736/sec total: 11217475
> 2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add 0.0/sec 0.000/sec 0.0000/sec total: 229
> 2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del 0.0/sec 0.000/sec 0.0000/sec total: 144
> 2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush 0.0/sec 0.000/sec 0.0000/sec total: 2
> 2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get 0.0/sec 0.000/sec 0.0000/sec total: 23
> 2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put 83.0/sec 97.900/sec 80.8306/sec total: 849051748
> 2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del 147.8/sec 91.733/sec 78.2797/sec total: 810797137
> 2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute 66.6/sec 83.767/sec 87.8397/sec total: 958697586
> 2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help 0.0/sec 4.250/sec 4.9892/sec total: 70713908
> 2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract 169.0/sec 247.017/sec 333.1194/sec total: 7990486004
> 2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc 0.0/sec 0.183/sec 0.7983/sec total: 5871853
> 2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological 0.0/sec 0.000/sec 0.0000/sec total: 78
> 2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand 0.0/sec 0.000/sec 0.0000/sec total: 11
> 2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological 0.0/sec 0.633/sec 0.9331/sec total: 16638117
> 2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand 0.0/sec 114.867/sec 139.8006/sec total: 2184263485
> 2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned 0.0/sec 0.000/sec 0.0003/sec total: 299
> 2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired 0.0/sec 0.000/sec 0.0006/sec total: 117
> 2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved 0.0/sec 0.000/sec 0.0000/sec total: 87
> 2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats 0.0/sec 36.850/sec 43.3306/sec total: 690673778
> 2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged 0.0/sec 0.100/sec 0.1614/sec total: 2705864
> 2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete 0.0/sec 0.183/sec 0.2239/sec total: 3688261
> 2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success 0.0/sec 0.167/sec 0.1953/sec total: 3236818
> 2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node 3930.6/sec 3941.350/sec 5012.2572/sec total: 115959738162
> 2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout 159.4/sec 111.417/sec 103.4078/sec total: 1181551227
> 2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued 0.0/sec 0.050/sec 0.0817/sec total: 1279558
> 2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent 0.0/sec 0.050/sec 0.0817/sec total: 1279558
> 2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change 2205.2/sec 2502.783/sec 2874.6258/sec total: 57335390998
> 2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open 0.0/sec 0.000/sec 0.0000/sec total: 5
> 2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open 0.0/sec 0.000/sec 0.0000/sec total: 1
> 2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received 0.0/sec 0.000/sec 0.0000/sec total: 189
> 2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied 0.0/sec 0.000/sec 0.0000/sec total: 189
> 2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc 7935.8/sec 15234.350/sec 18372.4347/sec total: 345308271208
> 2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received 0.0/sec 0.083/sec 0.3556/sec total: 2018930
> 2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent 0.0/sec 0.067/sec 0.0983/sec total: 1552067
> 2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing 0.0/sec 0.000/sec 0.0000/sec total: 623
> 2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex 0.0/sec 0.100/sec 0.1636/sec total: 2713928
> 2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr 0.0/sec 0.000/sec 0.0000/sec total: 2
> 2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool 0.0/sec 0.000/sec 0.0000/sec total: 1073
> 2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool 0.0/sec 0.000/sec 0.0000/sec total: 225
> 2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed 0.0/sec 0.050/sec 0.0492/sec total: 819614
> 2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received 346.6/sec 472.133/sec 575.4597/sec total: 11288088779
> 2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo 18.4/sec 113.250/sec 175.1189/sec total: 3935643810
> 2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent 308.4/sec 378.133/sec 380.4267/sec total: 4753114969
> 2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush 0.0/sec 0.000/sec 0.0000/sec total: 1
> 2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow 0.0/sec 0.067/sec 0.3389/sec total: 1746449
> 2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port 0.0/sec 0.000/sec 0.0000/sec total: 814
> 2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow 0.0/sec 0.000/sec 0.0000/sec total: 282520
> 2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure 0.0/sec 0.017/sec 0.0325/sec total: 539904
> 2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled 0.0/sec 0.000/sec 0.0000/sec total: 27
> 2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table 0.0/sec 0.000/sec 0.0106/sec total: 10876
> 2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning 0.0/sec 0.000/sec 0.0006/sec total: 312
> 2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow 0.0/sec 0.000/sec 0.0000/sec total: 39204
> 2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow 0.0/sec 0.000/sec 0.0000/sec total: 7177
> 2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall 0.4/sec 1.833/sec 2.2664/sec total: 34762798
> 2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention 0.0/sec 0.000/sec 0.0000/sec total: 9
> 2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace 0.2/sec 0.050/sec 0.0447/sec total: 7927221
> 2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow 0.0/sec 0.000/sec 0.0000/sec total: 1118
> 2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions 123.6/sec 273.183/sec 233.8275/sec total: 2110727223
> 2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure 0.0/sec 0.017/sec 0.0325/sec total: 539902
> 2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit
> 2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in last 1058229 seconds (most recently, 1058228 seconds ago) due to excessive rate
> 2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU usage)
> 2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for main to quiesce
> 2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce
> 2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting for main to quiesce
> 2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for main to quiesce
> 2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll interval (0ms user, 10643ms system)
> 2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major
> 2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary, 4 involuntary
> 2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate event coverage for hash=9105faf6
> 2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting for main to quiesce
> 2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce
> 2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for main to quiesce
> 2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for main to quiesce
> 2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll interval (0ms user, 10115ms system)
> 2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major
> 2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary, 27 involuntary
> 2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate event coverage for hash=9105faf6
> 2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU usage)
> 2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd 235 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU usage)
> 2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for main to quiesce
> 2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce
> 2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for main to quiesce
> 2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for main to quiesce
> 2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection dropped (Broken pipe)
> 2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for main to quiesce
> 2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce
> 2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for main to quiesce
> 2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll interval (0ms user, 17720ms system)
> 2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major
> 2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary, 24 involuntary
> 2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589 flow_mods 18 s ago (13509 adds, 80 modifications)
> 2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response to inactivity probe after 60 seconds, disconnecting
>
>
> In the ovn-controller log:
>
> 2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt: connecting to switch
> 2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connecting...
> 2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt: connecting to switch
> 2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connecting...
> 2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connected
> 2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connected
> 2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force recompute.
> 2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force recompute next time.
> 2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642: connection closed by peer
> 2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting...
> 2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected
> 2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity (last active 30001 ms ago): trimming cache
> 2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport> ovn-installed in OVS
> 2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in Southbound
> <there are 80 port bindings...>
> 2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN> ovn-installed in OVS
> 2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in Southbound
> 2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
>
>
> I’ve got some questions:
> 1. What is OVS doing while `waiting for main to quiesce`? I noticed this
>    message appears each time downtime occurs.
'Unreasonably long 17764ms poll interval (0ms user, 17720ms system)' is very
concerning.  It means that the main thread spent almost 18 seconds inside the
kernel executing some syscall, maybe several syscalls.  The main thread is not
responsible for flow installation to the kernel, IIRC, so flow updates should
not be the cause here.  You need to figure out what these syscalls are and why
they take so long to execute inside the kernel.  Maybe you have some
suspicious kernel logs at the same time?  You may also try 'perf' or some
other tracing tools to figure out what is going on.

> 2. Is there any hint to minimize such downtime while reinstalling OF?

Installation of rules is an unlikely root cause here.

> 3. Could there be a situation where ovn-northd after the upgrade is still
>    generating new logical flows, ovn-controller constantly gets updates from
>    the SB and triggers updates to OF in a loop before it has the final
>    lflows "view"?  So, could it start installing new OF while having
>    in-memory lflows which are not consistent at that moment?

northd generates all updates in a single transaction, so the SB is updated
atomically.  The SB will send this update as a single update message, so this
is also atomic.  ovn-controller will install all OF rules as an OF bundle, so
that is also atomic.

> 4. What does the `br-int<->unix#272675: connection dropped (Broken pipe)`
>    error mean?  What type of connection is broken?  Is it normal?

This looks like a controller connection, i.e. the connection between
ovn-controller and ovs-vswitchd.  It doesn't look very good, but the
disconnection seems to be caused by inactivity due to overload on one of the
sides of the connection.

> 5. Do you see anything else abnormal in the logs?

'0ms user, 17720ms system' is the most abnormal, AFAICT.

Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
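The 'perf' / tracing suggestion above can be sketched as a couple of shell
helpers for finding which syscalls eat the main thread's system time.  This is
only an illustrative sketch: the exact perf and strace option sets are
assumptions to adapt, and both commands need root on the affected host.

```shell
# Sketch: summarize syscall activity of ovs-vswitchd for 30 seconds.
# A syscall accounting for most of the ~18 s poll interval should
# dominate either summary.

# perf variant: --summary prints per-syscall counts and total time.
trace_vswitchd_perf() {
    pid=$(pidof ovs-vswitchd) || return 1
    timeout 30 perf trace --summary -p "$pid"
}

# strace variant: -c gives a per-syscall time summary, -f follows
# all threads of the process.
trace_vswitchd_strace() {
    pid=$(pidof ovs-vswitchd) || return 1
    timeout 30 strace -c -f -p "$pid"
}
```

Correlating the dominant syscall in the summary with the timestamps of the
'Unreasonably long ... poll interval' warnings should narrow down the cause.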

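The atomic OF-bundle behavior described in the answer to question 3 can also
be exercised by hand, since ovs-ofctl exposes the same bundle mechanism via
its --bundle flag.  A minimal sketch, where the flow file name is
hypothetical:

```shell
# Sketch: apply a batch of flow mods as one atomic OpenFlow bundle,
# the same mechanism ovn-controller uses when installing its rules.
# "br-int" is the usual OVN integration bridge; "flows.txt" is a
# hypothetical file with one flow per line.
apply_flows_atomically() {
    bridge=${1:-br-int}
    flow_file=${2:-flows.txt}
    # --bundle wraps all the flow mods in a single OpenFlow bundle:
    # the switch applies either all of them or none.
    ovs-ofctl --bundle add-flows "$bridge" "$flow_file"
}
```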