Re: [ovs-discuss] ovn-controller consuming lots of CPU

Mark Michelson Wed, 13 Dec 2017 07:55:47 -0800

On 12/12/2017 03:26 PM, Kevin Lin wrote:

Hi again,
We’re trying to scale up our OVN deployment and we’re seeing someworrying log messages.The topology is 32 containers connected to another 32 containers on 10different ports. This is running on 17 machines (one machine runsovn-northd and ovsdb-server, the other 16 run ovn-controller,ovs-vswitchd, and ovsdb-server). We’re using an address set for thesource group, but not the destination group. We’re also creating adifferent ACL for each port. So the ACLs look like:
One address set for { container1, container2, … container32 }
addressSet -> container1 on port 80
addressSet -> container1 on port 81
…
addressSet -> container1 on port 90
addressSet -> container2 on port 80
…
addressSet -> container32 on port 90

The ovn-controller log:
2017-12-12T20:14:49Z|11878|timeval|WARN|Unreasonably long 1843ms pollinterval (1840ms user, 0ms system)
2017-12-12T20:14:49Z|11879|timeval|WARN|disk: 0 reads, 16 writes
2017-12-12T20:14:49Z|11880|timeval|WARN|context switches: 0 voluntary,21 involuntary2017-12-12T20:14:49Z|11881|poll_loop|DBG|wakeup due to [POLLIN] on fd 9(172.31.11.193:48460<->172.31.2.181:6640) at lib/stream-fd.c:157 (36%CPU usage)2017-12-12T20:14:49Z|11882|poll_loop|DBG|wakeup due to [POLLIN] on fd 12(<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (36% CPU usage)2017-12-12T20:14:49Z|11883|jsonrpc|DBG|tcp:172.31.2.181:6640: receivedreply, result=[], id="echo"2017-12-12T20:14:49Z|11884|netlink_socket|DBG|nl_sock_transact_multiple__ (Success):nl(len:36, type=38(family-defined), flags=9[REQUEST][ECHO], seq=b11,pid=22684528762017-12-12T20:14:49Z|11885|netlink_socket|DBG|nl_sock_recv__ (Success):nl(len:136, type=36(family-defined), flags=0, seq=b11, pid=22684528762017-12-12T20:14:49Z|11886|vconn|DBG|unix:/var/run/openvswitch/br-int.mgmt:received: OFPT_ECHO_REQUEST (OF1.3) (xid=0x0): 0 bytes of payload2017-12-12T20:14:49Z|11887|vconn|DBG|unix:/var/run/openvswitch/br-int.mgmt:sent (Success): OFPT_ECHO_REPLY (OF1.3) (xid=0x0): 0 bytes of payload2017-12-12T20:14:51Z|11888|timeval|WARN|Unreasonably long 1851ms pollinterval (1844ms user, 8ms system)2017-12-12T20:14:51Z|11889|timeval|WARN|context switches: 0 voluntary,11 involuntary2017-12-12T20:14:52Z|11890|poll_loop|DBG|wakeup due to [POLLIN] on fd 9(172.31.11.193:48460<->172.31.2.181:6640) at lib/stream-fd.c:157 (73%CPU usage)2017-12-12T20:14:52Z|11891|jsonrpc|DBG|tcp:172.31.2.181:6640: receivedrequest, method="echo", params=[], id="echo"2017-12-12T20:14:52Z|11892|jsonrpc|DBG|tcp:172.31.2.181:6640: sendreply, result=[], id="echo"2017-12-12T20:14:52Z|11893|netlink_socket|DBG|nl_sock_transact_multiple__ (Success):nl(len:36, type=38(family-defined), flags=9[REQUEST][ECHO], seq=b12,pid=22684528762017-12-12T20:14:52Z|11894|netlink_socket|DBG|nl_sock_recv__ (Success):nl(len:136, type=36(family-defined), flags=0, seq=b12, pid=22684528762017-12-12T20:14:52Z|11895|netdev_linux|DBG|Dropped 18 log messages inlast 56 seconds (most recently, 3 seconds ago) due to excessive rate
2017-12-12T20:14:52Z|11896|netdev_linux|DBG|unknown qdisc "mq"
2017-12-12T20:14:54Z|11897|hmap|DBG|Dropped 15511 log messages in last 6seconds (most recently, 0 seconds ago) due to excessive rate2017-12-12T20:14:54Z|11898|hmap|DBG|ovn/lib/expr.c:2644: 6 nodes inbucket (128 nodes, 64 buckets)2017-12-12T20:14:54Z|11899|timeval|WARN|Unreasonably long 1831ms pollinterval (1828ms user, 4ms system)2017-12-12T20:14:54Z|11900|timeval|WARN|context switches: 0 voluntary,12 involuntary
The log messages show up continuously. The logs appear even when thenetwork isn’t being used.
I poked around with Ethan Jackson and he noted that the hmap countersseem unusually high:
root@ip-172-31-11-193:/# ovs-appctl -t ovn-controller coverage/show
Event coverage, avg rate over last: 5 seconds, last minute, last hour,hash=d6ee5804:hmap_pathological 2323.6/sec 2662.467/sec 2514.0069/sectotal: 9407536hmap_expand 3596.8/sec 4121.283/sec 3890.8833/sectotal: 14604479txn_unchanged 0.8/sec 0.917/sec 0.8658/sectotal: 5659txn_incomplete 0.0/sec 0.000/sec 0.0008/sectotal: 33txn_success 0.0/sec 0.000/sec 0.0006/sectotal: 24poll_create_node 2.4/sec 2.750/sec 2.5986/sectotal: 18218poll_zero_timeout 0.0/sec 0.000/sec 0.0100/sectotal: 71rconn_queued 0.0/sec 0.050/sec 0.0531/sectotal: 252570rconn_sent 0.0/sec 0.050/sec 0.0531/sectotal: 252570seq_change 1.2/sec 1.383/sec 1.2992/sectotal: 8500pstream_open 0.0/sec 0.000/sec 0.0000/sectotal: 1stream_open 0.0/sec 0.000/sec 0.0000/sectotal: 6unixctl_received 0.0/sec 0.000/sec 0.0019/sectotal: 7unixctl_replied 0.0/sec 0.000/sec 0.0019/sectotal: 7util_xalloc 2731550.2/sec 3129900.483/sec 569276.5414/sec total: 11057381035vconn_open 0.0/sec 0.000/sec 0.0000/sectotal: 4vconn_received 0.0/sec 0.050/sec 0.0444/sectotal: 201vconn_sent 0.0/sec 0.000/sec 0.0144/sectotal: 253535netdev_get_ifindex 0.4/sec 0.467/sec 0.4328/sectotal: 2822netlink_received 0.4/sec 0.467/sec 0.4328/sectotal: 2822netlink_sent 0.4/sec 0.467/sec 0.4328/sectotal: 2822cmap_expand 0.0/sec 0.000/sec 0.0000/sectotal: 2
47 events never hit
I’ve also attached the output of ovs-bugtool run from the machinerunning ovn-northd, and one of the machines running ovn-controller andovs-vswitchd.
Thanks,
—Kevin


Hi Kevin,

In short, I agree with Ethan's assessment that the hmap numbers appearlarge. The ACLs, combined with ovn-controller's algorithms, are causingthat. The ACLs result in ~250000 openflow flows being generated.ovn-controller re-creates this flow table from scratch each time it runsits main loop.

These are unsolved problems as yet in ovn-controller. The two high-levelthoughts on the matter are:

1) See if we can reduce the number of generated flows when ACLs are used.

2) See if we can perform incremental changes on the flow table ratherthan re-creating it from scratch every time.


There are two commits that may be of interest to you.

* As Miguel mentioned in his response, Russell Bryant submitted a change[1] within the past couple of months to help towards (1) a bit. The ideawas that if we can detect that a flow will never be needed on a specificmachine, then don't generate it. It's not going to cover everything, butit could potentially help you. That change is present in 2.7.3 and 2.8.1

* A change [2] by Han Zhou fixed an issue where we were generating theflow table and then just throwing it away since vswitchd was not readyfor us to submit a new flow table to it. This change is present in 2.7.3and 2.8.0.

These two commits aren't silver bullets but if you're running a versionof OVS that does not have these commits in them already, you may seesome improvement with an upgrade.


Mark

[1]https://mail.openvswitch.org/pipermail/ovs-dev/2017-September/338847.html

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331286.html
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Re: [ovs-discuss] ovn-controller consuming lots of CPU

Reply via email to