Hi,
I’m testing OVN stateless ACL rules that use the `$port_group_ipVERSION`
address set in the match portion.
There’s a strange behaviour: with certain configurations, the transport nodes
where the logical switch ports reside are effectively killed. The ovs-vswitchd
and ovn-controller processes each utilise 100% of one CPU core, and
ovs-vswitchd consumes all free memory and is repeatedly killed by the
OOM killer. It consumes 5 GB of memory in 5-10 seconds!
I originally reproduced this with OVS 2.13.4 and OVN main, but the problem
still reproduces with the current OVS master branch.
Below are steps to reproduce:
* It’s better to reproduce this problem with ovs-vswitchd limited in memory
via cgroups or something similar *
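For example, on a systemd-based host a memory cap can be applied roughly like this (a sketch only; the service name, binary path, and 1G limit are assumptions to adjust for your distribution):

```shell
# Cap the packaged ovs-vswitchd service (service name may differ per distro):
systemctl set-property ovs-vswitchd.service MemoryMax=1G

# Or, for a manually started daemon, run it in a transient memory-limited scope:
systemd-run --scope -p MemoryMax=1G /usr/sbin/ovs-vswitchd --pidfile --detach
```

With a cap in place the OOM killer only terminates ovs-vswitchd instead of taking down the whole node.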
1. ovn main branch; openvswitch master branch.
2. deploy a one-node OVN installation with ovn-controller
3. create ovn topology:
ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 lsp1
ovn-nbctl lsp-add ls1 lsp2
ovn-nbctl lsp-add ls1 lsp3
ovn-nbctl lsp-add ls1 lsp4
ovn-nbctl lsp-set-addresses lsp1 '0a:01:bf:a1:81:e1 172.31.16.6'
ovn-nbctl lsp-set-addresses lsp2 '0a:01:e7:98:59:61 172.31.16.8'
ovn-nbctl lsp-set-addresses lsp3 '0a:01:85:bf:9d:a1 172.31.16.9'
ovn-nbctl lsp-set-addresses lsp4 '0a:01:a1:d7:e9:01 172.31.16.7'
ovn-nbctl pg-add pg1 lsp1 lsp2 lsp3 lsp4
ovn-nbctl acl-add pg1 from-lport 2000 'ip4.src != $pg1_ip4' allow-stateless
# if ovs-vswitchd is not limited with cgroups, it’s recommended to do so, or to
stop ovn-controller here, so as not to get stuck with exhausted system resources.
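As a side note, the address set that `$pg1_ip4` refers to can be inspected in the Southbound DB to confirm it was populated as expected (a sketch; assumes ovn-sbctl can reach the SB database):

```shell
# Show the auto-generated address sets created for the port group;
# pg1_ip4 should list the four addresses assigned to lsp1..lsp4.
ovn-sbctl --columns=name,addresses list Address_Set
```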
4. then plug the LSPs into OVS:
ovs-vsctl add-port br-int p1 -- set interface p1 external-ids:iface-id=lsp1
ovs-vsctl add-port br-int p2 -- set interface p2 external-ids:iface-id=lsp2
ovs-vsctl add-port br-int p3 -- set interface p3 external-ids:iface-id=lsp3
ovs-vsctl add-port br-int p4 -- set interface p4 external-ids:iface-id=lsp4
5. carefully start ovn-controller; watch ovs-vswitchd memory consumption and
the CPU consumption of ovn-controller and ovs-vswitchd; stop ovn-controller in
time so as not to kill your machine.
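To watch the resource blow-up in step 5 without losing the machine, a simple loop like the following can be used (a rough sketch):

```shell
# Print command name, RSS (kB) and %CPU for both daemons once per second;
# stop ovn-controller as soon as ovs-vswitchd RSS starts climbing uncontrollably.
while sleep 1; do
    ps -o comm=,rss=,pcpu= -C ovs-vswitchd,ovn-controller
done
```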
There are no abnormal messages in the OVS logs:
2021-11-19T18:09:36.876Z|00039|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.16.90
2021-11-19T18:10:07.423Z|00001|vlog|INFO|opened log file
/var/log/openvswitch/ovs-vswitchd.log
2021-11-19T18:10:07.429Z|00002|ovs_numa|INFO|Discovered 16 CPU cores on NUMA
node 0
2021-11-19T18:10:07.429Z|00003|ovs_numa|INFO|Discovered 16 CPU cores on NUMA
node 1
2021-11-19T18:10:07.429Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 32 CPU
cores
2021-11-19T18:10:07.429Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock:
connecting...
2021-11-19T18:10:07.429Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock:
connected
2021-11-19T18:10:07.448Z|00007|ofproto_dpif|INFO|system@ovs-system: Datapath
supports recirculation
2021-11-19T18:10:07.449Z|00008|ofproto_dpif|INFO|system@ovs-system: VLAN header
stack length probed as 2
2021-11-19T18:10:07.449Z|00009|ofproto_dpif|INFO|system@ovs-system: MPLS label
stack length probed as 3
2021-11-19T18:10:07.449Z|00010|ofproto_dpif|INFO|system@ovs-system: Datapath
supports truncate action
2021-11-19T18:10:07.449Z|00011|ofproto_dpif|INFO|system@ovs-system: Datapath
supports unique flow ids
2021-11-19T18:10:07.449Z|00012|ofproto_dpif|INFO|system@ovs-system: Datapath
supports clone action
2021-11-19T18:10:07.449Z|00013|ofproto_dpif|INFO|system@ovs-system: Max sample
nesting level probed as 10
2021-11-19T18:10:07.449Z|00014|ofproto_dpif|INFO|system@ovs-system: Datapath
supports eventmask in conntrack action
2021-11-19T18:10:07.449Z|00015|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_clear action
2021-11-19T18:10:07.449Z|00016|ofproto_dpif|INFO|system@ovs-system: Max dp_hash
algorithm probed to be 0
2021-11-19T18:10:07.449Z|00017|ofproto_dpif|INFO|system@ovs-system: Datapath
supports check_pkt_len action
2021-11-19T18:10:07.449Z|00018|ofproto_dpif|INFO|system@ovs-system: Datapath
supports timeout policy in conntrack action
2021-11-19T18:10:07.449Z|00019|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_zero_snat
2021-11-19T18:10:07.449Z|00020|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_state
2021-11-19T18:10:07.449Z|00021|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_zone
2021-11-19T18:10:07.449Z|00022|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_mark
2021-11-19T18:10:07.449Z|00023|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_label
2021-11-19T18:10:07.449Z|00024|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_state_nat
2021-11-19T18:10:07.449Z|00025|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_orig_tuple
2021-11-19T18:10:07.449Z|00026|ofproto_dpif|INFO|system@ovs-system: Datapath
supports ct_orig_tuple6
2021-11-19T18:10:07.449Z|00027|ofproto_dpif|INFO|system@ovs-system: Datapath
does not support IPv6 ND Extensions
2021-11-19T18:10:07.450Z|00028|ofproto_dpif_upcall|INFO|Overriding
n-handler-threads to 32, setting n-revalidator-threads to 9
2021-11-19T18:10:07.450Z|00029|ofproto_dpif_upcall|INFO|Starting 41 threads
2021-11-19T18:10:07.508Z|00030|bridge|INFO|bridge br-int: added interface
ovn-ai10-0 on port 4
2021-11-19T18:10:07.508Z|00031|bridge|INFO|bridge br-int: added interface
br-int on port 65534
2021-11-19T18:10:07.509Z|00032|bridge|INFO|bridge br-int: added interface
ovn-ai07-0 on port 5
2021-11-19T18:10:07.509Z|00033|bridge|INFO|bridge br-int: added interface
ovn-ai05-0 on port 6
2021-11-19T18:10:07.509Z|00034|bridge|INFO|bridge br-int: added interface
ovn-ai08-0 on port 7
2021-11-19T18:10:07.509Z|00035|bridge|INFO|bridge br-int: added interface
vif0-DA9875C1 on port 3
2021-11-19T18:10:07.510Z|00036|bridge|INFO|bridge br-int: added interface
ovn-ai09-0 on port 8
2021-11-19T18:10:07.511Z|00037|bridge|INFO|bridge br-int: using datapath ID
000026923f220147
2021-11-19T18:10:07.511Z|00038|connmgr|INFO|br-int: added service controller
"punix:/var/run/openvswitch/br-int.mgmt"
2021-11-19T18:10:07.517Z|00039|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.16.90
And in the ovn-controller logs:
2021-11-19T18:10:07.597Z|00003|reconnect|INFO|unix:/run/openvswitch/db.sock:
connected
2021-11-19T18:10:07.602Z|00004|main|INFO|OVN internal version is :
[21.09.90-20.21.0-56.3]
2021-11-19T18:10:07.602Z|00005|main|INFO|OVS IDL reconnected, force recompute.
2021-11-19T18:10:07.606Z|00006|reconnect|INFO|ssl:172.20.34.104:6642:
connecting...
2021-11-19T18:10:07.606Z|00007|main|INFO|OVNSB IDL reconnected, force recompute.
2021-11-19T18:10:07.606Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting to switch
2021-11-19T18:10:07.606Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting...
2021-11-19T18:10:07.606Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt:
connected
2021-11-19T18:10:07.658Z|00008|reconnect|INFO|ssl:172.20.34.104:6642: connected
2021-11-19T18:10:07.666Z|00009|features|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting to switch
2021-11-19T18:10:07.666Z|00010|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting...
2021-11-19T18:10:07.666Z|00011|features|INFO|OVS Feature: ct_zero_snat, state:
supported
2021-11-19T18:10:07.666Z|00012|main|INFO|OVS feature set changed, force
recompute.
2021-11-19T18:10:07.666Z|00013|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting to switch
2021-11-19T18:10:07.666Z|00014|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting...
2021-11-19T18:10:07.667Z|00015|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connected
2021-11-19T18:10:07.682Z|00016|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
connected
2021-11-19T18:10:07.683Z|00017|main|INFO|OVS feature set changed, force
recompute.
2021-11-19T18:10:09.012Z|00018|timeval|WARN|Unreasonably long 1326ms poll
interval (1140ms user, 185ms system)
2021-11-19T18:10:09.012Z|00019|timeval|WARN|faults: 138008 minor, 0 major
2021-11-19T18:10:09.012Z|00020|timeval|WARN|context switches: 1 voluntary, 2
involuntary
2021-11-19T18:10:09.013Z|00021|coverage|INFO|Event coverage, avg rate over
last: 5 seconds, last minute, last hour, hash=81c8f620:
2021-11-19T18:10:09.013Z|00022|coverage|INFO|lflow_run 0.0/sec
0.000/sec 0.0000/sec total: 2
2021-11-19T18:10:09.013Z|00023|coverage|INFO|lflow_cache_add_expr 0.0/sec
0.000/sec 0.0000/sec total: 24
2021-11-19T18:10:09.013Z|00024|coverage|INFO|lflow_cache_add_matches 0.0/sec
0.000/sec 0.0000/sec total: 86
2021-11-19T18:10:09.013Z|00025|coverage|INFO|lflow_cache_add 0.0/sec
0.000/sec 0.0000/sec total: 110
2021-11-19T18:10:09.013Z|00026|coverage|INFO|lflow_cache_miss 0.0/sec
0.000/sec 0.0000/sec total: 131
2021-11-19T18:10:09.013Z|00027|coverage|INFO|physical_run 0.0/sec
0.000/sec 0.0000/sec total: 3
2021-11-19T18:10:09.013Z|00028|coverage|INFO|miniflow_malloc 0.0/sec
0.000/sec 0.0000/sec total: 257413
2021-11-19T18:10:09.013Z|00029|coverage|INFO|hmap_pathological 0.0/sec
0.000/sec 0.0000/sec total: 65
2021-11-19T18:10:09.013Z|00030|coverage|INFO|hmap_expand 0.0/sec
0.000/sec 0.0000/sec total: 1515
2021-11-19T18:10:09.013Z|00031|coverage|INFO|txn_unchanged 0.0/sec
0.000/sec 0.0000/sec total: 19
2021-11-19T18:10:09.013Z|00032|coverage|INFO|txn_incomplete 0.0/sec
0.000/sec 0.0000/sec total: 13
2021-11-19T18:10:09.013Z|00033|coverage|INFO|txn_success 0.0/sec
0.000/sec 0.0000/sec total: 2
2021-11-19T18:10:09.013Z|00034|coverage|INFO|poll_create_node 0.0/sec
0.000/sec 0.0000/sec total: 173
2021-11-19T18:10:09.013Z|00035|coverage|INFO|poll_zero_timeout 0.0/sec
0.000/sec 0.0000/sec total: 5
2021-11-19T18:10:09.013Z|00036|coverage|INFO|rconn_queued 0.0/sec
0.000/sec 0.0000/sec total: 128717
2021-11-19T18:10:09.013Z|00037|coverage|INFO|rconn_sent 0.0/sec
0.000/sec 0.0000/sec total: 32380
2021-11-19T18:10:09.013Z|00038|coverage|INFO|seq_change 0.0/sec
0.000/sec 0.0000/sec total: 67
2021-11-19T18:10:09.013Z|00039|coverage|INFO|pstream_open 0.0/sec
0.000/sec 0.0000/sec total: 1
2021-11-19T18:10:09.013Z|00040|coverage|INFO|stream_open 0.0/sec
0.000/sec 0.0000/sec total: 5
2021-11-19T18:10:09.013Z|00041|coverage|INFO|util_xalloc 0.0/sec
0.000/sec 0.0000/sec total: 2111348
2021-11-19T18:10:09.013Z|00042|coverage|INFO|vconn_open 0.0/sec
0.000/sec 0.0000/sec total: 3
2021-11-19T18:10:09.013Z|00043|coverage|INFO|vconn_received 0.0/sec
0.000/sec 0.0000/sec total: 13
2021-11-19T18:10:09.013Z|00044|coverage|INFO|vconn_sent 0.0/sec
0.000/sec 0.0000/sec total: 32384
2021-11-19T18:10:09.013Z|00045|coverage|INFO|netlink_received 0.0/sec
0.000/sec 0.0000/sec total: 16
2021-11-19T18:10:09.013Z|00046|coverage|INFO|netlink_recv_jumbo 0.0/sec
0.000/sec 0.0000/sec total: 3
2021-11-19T18:10:09.013Z|00047|coverage|INFO|netlink_sent 0.0/sec
0.000/sec 0.0000/sec total: 16
2021-11-19T18:10:09.013Z|00048|coverage|INFO|cmap_expand 0.0/sec
0.000/sec 0.0000/sec total: 3
2021-11-19T18:10:09.013Z|00049|coverage|INFO|107 events never hit
2021-11-19T18:10:31.001Z|00004|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt:
connection closed by peer
2021-11-19T18:10:32.039Z|00050|memory|INFO|293212 kB peak resident set size
after 23.5 seconds
2021-11-19T18:10:32.059Z|00005|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt:
connecting...
2021-11-19T18:10:32.059Z|00006|rconn(ovn_pinctrl0)|WARN|unix:/run/openvswitch/br-int.mgmt:
connection failed (Connection refused)
2021-11-19T18:10:32.059Z|00007|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt:
waiting 2 seconds before reconnect
I couldn’t pinpoint the source of the problem beyond finding the steps to
reproduce it. Can somebody please take a look at this?
It looks like a potentially serious problem for OVN transport nodes.
Thanks.
Regards,
Vladislav Odintsov
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev