Hi OVN and OVS folks,
during OVN upgrade from 20.06 to 21.12 version I’ve faced with a quite long
datapath downtime (48-50 seconds).
I suppose this downtime was triggered by ovn-controller installing new openflow
rules to ovs.
There are relatively not so many flows (13-14k).
openvswitch version is 2.13.4.
ovn-openflow-probe-interval is configured to 30.
The upgrade procedure was:
1. Set to ovs external_ids an option to force matching ovn-northd internal
version (ovn-match-northd-version=true)
2. Upgrade ovn-controller (to first get ovn-controller, which supports version
match between controller and northd and not to break datapath after upgrading
ovn-northd)
3. Upgrade ovn-northd cluster
Network outage timeframe matches to ovs-vswitchd WARN logs. Please see next:
2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for
main to quiesce
2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for
main to quiesce
2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for
main to quiesce
2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for
main to quiesce
2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37 flow_mods
13 s ago (36 adds, 1 deletes)
2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll
interval (0ms user, 12833ms system)
2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major
2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary, 18
involuntary
2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over
last: 5 seconds, last minute, last hour, hash=9105faf6:
2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand 0.0/sec
0.000/sec 0.0000/sec total: 1129
2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink 0.0/sec
11.933/sec 0.1989/sec total: 2363
2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand 0.0/sec
0.833/sec 0.8811/sec total: 11218023
2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink 8.2/sec
11.950/sec 1.0736/sec total: 11217475
2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add 0.0/sec
0.000/sec 0.0000/sec total: 229
2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del 0.0/sec
0.000/sec 0.0000/sec total: 144
2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush 0.0/sec
0.000/sec 0.0000/sec total: 2
2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get 0.0/sec
0.000/sec 0.0000/sec total: 23
2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put 83.0/sec
97.900/sec 80.8306/sec total: 849051748
2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del 147.8/sec
91.733/sec 78.2797/sec total: 810797137
2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute 66.6/sec
83.767/sec 87.8397/sec total: 958697586
2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help 0.0/sec
4.250/sec 4.9892/sec total: 70713908
2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract 169.0/sec
247.017/sec 333.1194/sec total: 7990486004
2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc 0.0/sec
0.183/sec 0.7983/sec total: 5871853
2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological 0.0/sec
0.000/sec 0.0000/sec total: 78
2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand 0.0/sec
0.000/sec 0.0000/sec total: 11
2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological 0.0/sec
0.633/sec 0.9331/sec total: 16638117
2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand 0.0/sec
114.867/sec 139.8006/sec total: 2184263485
2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned 0.0/sec
0.000/sec 0.0003/sec total: 299
2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired 0.0/sec
0.000/sec 0.0006/sec total: 117
2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved 0.0/sec
0.000/sec 0.0000/sec total: 87
2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats 0.0/sec
36.850/sec 43.3306/sec total: 690673778
2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged 0.0/sec
0.100/sec 0.1614/sec total: 2705864
2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete 0.0/sec
0.183/sec 0.2239/sec total: 3688261
2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success 0.0/sec
0.167/sec 0.1953/sec total: 3236818
2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node
3930.6/sec 3941.350/sec 5012.2572/sec total: 115959738162
2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout 159.4/sec
111.417/sec 103.4078/sec total: 1181551227
2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued 0.0/sec
0.050/sec 0.0817/sec total: 1279558
2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent 0.0/sec
0.050/sec 0.0817/sec total: 1279558
2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change
2205.2/sec 2502.783/sec 2874.6258/sec total: 57335390998
2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open 0.0/sec
0.000/sec 0.0000/sec total: 5
2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open 0.0/sec
0.000/sec 0.0000/sec total: 1
2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received 0.0/sec
0.000/sec 0.0000/sec total: 189
2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied 0.0/sec
0.000/sec 0.0000/sec total: 189
2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc
7935.8/sec 15234.350/sec 18372.4347/sec total: 345308271208
2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received 0.0/sec
0.083/sec 0.3556/sec total: 2018930
2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent 0.0/sec
0.067/sec 0.0983/sec total: 1552067
2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing 0.0/sec
0.000/sec 0.0000/sec total: 623
2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex 0.0/sec
0.100/sec 0.1636/sec total: 2713928
2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr 0.0/sec
0.000/sec 0.0000/sec total: 2
2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool 0.0/sec
0.000/sec 0.0000/sec total: 1073
2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool 0.0/sec
0.000/sec 0.0000/sec total: 225
2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed 0.0/sec
0.050/sec 0.0492/sec total: 819614
2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received 346.6/sec
472.133/sec 575.4597/sec total: 11288088779
2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo 18.4/sec
113.250/sec 175.1189/sec total: 3935643810
2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent 308.4/sec
378.133/sec 380.4267/sec total: 4753114969
2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush 0.0/sec
0.000/sec 0.0000/sec total: 1
2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow 0.0/sec
0.067/sec 0.3389/sec total: 1746449
2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port 0.0/sec
0.000/sec 0.0000/sec total: 814
2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow 0.0/sec
0.000/sec 0.0000/sec total: 282520
2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure 0.0/sec
0.017/sec 0.0325/sec total: 539904
2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled 0.0/sec
0.000/sec 0.0000/sec total: 27
2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table 0.0/sec
0.000/sec 0.0106/sec total: 10876
2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning 0.0/sec
0.000/sec 0.0006/sec total: 312
2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow 0.0/sec
0.000/sec 0.0000/sec total: 39204
2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow 0.0/sec
0.000/sec 0.0000/sec total: 7177
2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall 0.4/sec
1.833/sec 2.2664/sec total: 34762798
2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention 0.0/sec
0.000/sec 0.0000/sec total: 9
2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace 0.2/sec
0.050/sec 0.0447/sec total: 7927221
2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow
0.0/sec 0.000/sec 0.0000/sec total: 1118
2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions 123.6/sec
273.183/sec 233.8275/sec total: 2110727223
2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure 0.0/sec
0.017/sec 0.0325/sec total: 539902
2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit
2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in last
1058229 seconds (most recently, 1058228 seconds ago) due to excessive rate
2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU usage)
2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for
main to quiesce
2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for
main to quiesce
2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting for
main to quiesce
2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for
main to quiesce
2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll
interval (0ms user, 10643ms system)
2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major
2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary, 4
involuntary
2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate
event coverage for hash=9105faf6
2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd 15
(<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting for
main to quiesce
2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for
main to quiesce
2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for
main to quiesce
2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for
main to quiesce
2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll
interval (0ms user, 10115ms system)
2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major
2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary, 27
involuntary
2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate
event coverage for hash=9105faf6
2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd 14
(NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU usage)
2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd 235
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU usage)
2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for
main to quiesce
2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for
main to quiesce
2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for
main to quiesce
2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for
main to quiesce
2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection
dropped (Broken pipe)
2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for
main to quiesce
2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for
main to quiesce
2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for
main to quiesce
2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll
interval (0ms user, 17720ms system)
2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major
2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary, 24
involuntary
2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd 15
(<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd 230
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589
flow_mods 18 s ago (13509 adds, 80 modifications)
2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response to
inactivity probe after 60 seconds, disconnecting
In ovn-controller log:
2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting to switch
2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting...
2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting to switch
2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting...
2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connected
2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connected
2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force
recompute.
2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force recompute
next time.
2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642: connection
closed by peer
2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting...
2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected
2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity (last
active 30001 ms ago): trimming cache
2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport> ovn-installed
in OVS
2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in
Southbound
<there are 80 port bindings...>
2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN>
ovn-installed in OVS
2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in
Southbound
2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity (last
active 30004 ms ago): trimming cache
I’ve got some questions:
1. What OVS is doing while `waiting for main to quiesce` ? I noticed this
message appeares each time downtime occurs.
2. Is there any hint to minimize such downtime while reinstalling OF?
3. Could there be a situation where ovn-northd after upgrade still generating
new logical flows, ovn-controller constantly gets updates from SB and triggers
update to OF in a loop while doesn’t finish getting final lflows "view"? So,
could it start installing new OF while having in memory lflows which are not
consistent at that moment?
4. What does `br-int<->unix#272675: connection dropped (Broken pipe)` error
means? What type of connection is broken? Is it normal?
5. Do you see anything else abnormal in the logs?
Let me know if I have to give more inputs...
Thanks.
Regards,
Vladislav Odintsov
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev