On 02/08/2018 07:55 PM, Han Zhou wrote:
On Wed, Feb 7, 2018 at 12:47 PM, Han Zhou <[email protected]
<mailto:[email protected]>> wrote:
>
> When doing scale testing for OVN (using
https://github.com/openvswitch/ovn-scale-test), we had some interesting
findings, and need some help here.
>
> We ran the test "create and bind lports" against branch 2.9 and
branch 2.6, and we found that 2.6 was must faster. With some analysis,
we found out the reason is not because of OVN gets slower in 2.9, but
because the bottleneck of this test in branch 2.9 is ovs-vswitchd.
>
> The testing was run in an environment with 20 farm nodes, each has 50
sandbox HVs (I will just mention them as HVs in short). Before the test,
there are already 9500 lports bound in 950 HVs on 19 farm nodes. The
test run against the last farm node to bind the lport on the 50 HVs
there. The steps in the test scenario are:
>
> 1. Create 5 new LSs in NB (so that the LSs will not be shared with
any of HVs on other farm nodes)
> 2. create 100 lports in NB on a LS
> 3. bind these lports on HVs, 2 for each HV. They are bound
sequentially on each HV, and for each HV the 2 ports are bound using one
command together: ovs-vsctl add-port <port 1> -- set Interface
external-ids:... -- add-port <port 2> -- set Interface external-ids:...
(the script didn't set type to internal, but I hope it is not an issue
for this test).
> 4. wait the port stated changed to up in NB for all the 100 lports
(with a single ovn-nbctl command)
>
> These steps are repeated for 5 times, one for each LS. So in the end
we got 500 more lports created and bound (the total scale is then 1k HVs
and 10k lports).
>
> When running with 2.6, the ovn-controllers were taking most of the
CPU time. However, with 2.9, the CPU of ovn-controllers spikes but there
is always ovs-vswitchd on the top with 100% CPU. It means the
ovs-vswitchd is the bottleneck in this testing. There is only one
ovs-vswitchd with 100% at the same time and different ovs-vswitchd will
spike one after another, since the ports are bound sequentially on each
HV. From the rally log, each 2 ports binding takes around 4 - 5 seconds.
This is just the ovs-vsctl command execution time. The 100% CPU of
ovs-vswitchd explains the slowness.
>
> So, based on this result, we can not using the total time to evaluate
the efficiency of OVN, instead we can evaluate by CPU cost of
ovn-controller processes. In fact, 2.9 ovn-controller costs around 70%
less CPU than 2.6, which I think is due to some optimization we made
earlier. (With my work-in-progress patch it saves much more, and I will
post later as RFC).
>
> However, I cannot explain why ovs-vswitchd is getting slower than 2.6
when doing port-binding. We need expert suggestions here, for what could
be the possible reason of this slowness. We can do more testing with
different versions between 2.6 and 2.9 to find out related change, but
with some pointers it might save some effort. Below are some logs of
ovs-vswitchd when port binding is happening:
>
> ======
> 2018-02-07T00:12:54.558Z|01767|bridge|INFO|bridge br-int: added
interface lport_bc65cd_QFOU3v on port 1028
> 2018-02-07T00:12:55.629Z|01768|timeval|WARN|Unreasonably long 1112ms
poll interval (1016ms user, 4ms system)
> 2018-02-07T00:12:55.629Z|01769|timeval|WARN|faults: 336 minor, 0 major
> 2018-02-07T00:12:55.629Z|01770|timeval|WARN|context switches: 0
voluntary, 13 involuntary
> 2018-02-07T00:12:55.629Z|01771|coverage|INFO|Event coverage, avg rate
over last: 5 seconds, last minute, last hour, hash=b256889c:
> 2018-02-07T00:12:55.629Z|01772|coverage|INFO|bridge_reconfigure
0.0/sec 0.000/sec 0.0056/sec total: 29
> 2018-02-07T00:12:55.629Z|01773|coverage|INFO|ofproto_flush
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:55.629Z|01774|coverage|INFO|ofproto_packet_out
0.0/sec 0.000/sec 0.0111/sec total: 90
> 2018-02-07T00:12:55.629Z|01775|coverage|INFO|ofproto_recv_openflow
0.2/sec 0.033/sec 0.4858/sec total: 6673
> 2018-02-07T00:12:55.629Z|01776|coverage|INFO|ofproto_update_port
0.0/sec 0.000/sec 5.5883/sec total: 28258
> 2018-02-07T00:12:55.629Z|01777|coverage|INFO|rev_reconfigure
0.0/sec 0.000/sec 0.0056/sec total: 32
> 2018-02-07T00:12:55.629Z|01778|coverage|INFO|rev_port_toggled
0.0/sec 0.000/sec 0.0011/sec total: 6
> 2018-02-07T00:12:55.629Z|01779|coverage|INFO|rev_flow_table
0.0/sec 0.000/sec 0.0078/sec total: 124
> 2018-02-07T00:12:55.629Z|01780|coverage|INFO|rev_mac_learning
0.0/sec 0.000/sec 0.0036/sec total: 13
> 2018-02-07T00:12:55.629Z|01781|coverage|INFO|xlate_actions
0.0/sec 0.000/sec 0.0111/sec total: 90
> 2018-02-07T00:12:55.629Z|01782|coverage|INFO|ccmap_expand
0.0/sec 0.000/sec 0.0050/sec total: 36
> 2018-02-07T00:12:55.629Z|01783|coverage|INFO|ccmap_shrink
0.0/sec 0.000/sec 0.0000/sec total: 49
> 2018-02-07T00:12:55.629Z|01784|coverage|INFO|cmap_expand
0.0/sec 0.000/sec 0.0217/sec total: 257
> 2018-02-07T00:12:55.629Z|01785|coverage|INFO|cmap_shrink
0.0/sec 0.000/sec 0.0000/sec total: 141
> 2018-02-07T00:12:55.629Z|01786|coverage|INFO|dpif_port_add
0.0/sec 0.000/sec 0.0022/sec total: 1020
> 2018-02-07T00:12:55.629Z|01787|coverage|INFO|dpif_port_del
0.0/sec 0.000/sec 0.0000/sec total: 40
> 2018-02-07T00:12:55.629Z|01788|coverage|INFO|dpif_flow_flush
0.0/sec 0.000/sec 0.0000/sec total: 2
> 2018-02-07T00:12:55.629Z|01789|coverage|INFO|dpif_flow_get
0.0/sec 0.000/sec 0.0000/sec total: 20
> 2018-02-07T00:12:55.629Z|01790|coverage|INFO|dpif_flow_put
0.0/sec 0.000/sec 0.0000/sec total: 21
> 2018-02-07T00:12:55.629Z|01791|coverage|INFO|dpif_flow_del
0.0/sec 0.000/sec 0.0000/sec total: 20
> 2018-02-07T00:12:55.629Z|01792|coverage|INFO|dpif_execute
0.0/sec 0.000/sec 0.0100/sec total: 71
> 2018-02-07T00:12:55.629Z|01793|coverage|INFO|flow_extract
0.0/sec 0.000/sec 0.0000/sec total: 4
> 2018-02-07T00:12:55.629Z|01794|coverage|INFO|miniflow_malloc
0.0/sec 0.000/sec 0.4361/sec total: 6435
> 2018-02-07T00:12:55.629Z|01795|coverage|INFO|hindex_pathological
0.0/sec 0.000/sec 0.0000/sec total: 17
> 2018-02-07T00:12:55.629Z|01796|coverage|INFO|hindex_expand
0.0/sec 0.000/sec 0.0000/sec total: 9
> 2018-02-07T00:12:55.629Z|01797|coverage|INFO|hmap_pathological
0.0/sec 0.000/sec 0.8131/sec total: 13752
> 2018-02-07T00:12:55.630Z|01798|coverage|INFO|hmap_expand
21.0/sec 18.333/sec 11253.0167/sec total: 55905271
> 2018-02-07T00:12:55.630Z|01799|coverage|INFO|hmap_shrink
0.0/sec 0.000/sec 0.0022/sec total: 1029
> 2018-02-07T00:12:55.630Z|01800|coverage|INFO|mac_learning_learned
0.0/sec 0.000/sec 0.0022/sec total: 18
> 2018-02-07T00:12:55.630Z|01801|coverage|INFO|mac_learning_expired
0.0/sec 0.000/sec 0.0017/sec total: 6
> 2018-02-07T00:12:55.630Z|01802|coverage|INFO|netdev_sent
0.0/sec 0.000/sec 0.0400/sec total: 350
> 2018-02-07T00:12:55.630Z|01803|coverage|INFO|netdev_get_stats
203.4/sec 203.400/sec 198.4978/sec total: 798522
> 2018-02-07T00:12:55.630Z|01804|coverage|INFO|txn_unchanged
0.2/sec 0.200/sec 0.1911/sec total: 776
> 2018-02-07T00:12:55.630Z|01805|coverage|INFO|txn_incomplete
0.0/sec 0.000/sec 0.0167/sec total: 106
> 2018-02-07T00:12:55.630Z|01806|coverage|INFO|txn_success
0.0/sec 0.000/sec 0.0111/sec total: 48
> 2018-02-07T00:12:55.630Z|01807|coverage|INFO|poll_create_node
162.6/sec 157.267/sec 156.3664/sec total: 636465
> 2018-02-07T00:12:55.630Z|01808|coverage|INFO|poll_zero_timeout
0.2/sec 0.250/sec 0.4564/sec total: 1936
> 2018-02-07T00:12:55.630Z|01809|coverage|INFO|rconn_queued
0.2/sec 0.033/sec 0.0453/sec total: 1214
> 2018-02-07T00:12:55.630Z|01810|coverage|INFO|rconn_sent
0.2/sec 0.033/sec 0.0453/sec total: 1214
> 2018-02-07T00:12:55.630Z|01811|coverage|INFO|seq_change
1098.8/sec 1099.167/sec 1097.4869/sec total: 4468823
> 2018-02-07T00:12:55.630Z|01812|coverage|INFO|pstream_open
0.0/sec 0.000/sec 0.0000/sec total: 5
> 2018-02-07T00:12:55.630Z|01813|coverage|INFO|stream_open
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:55.630Z|01814|coverage|INFO|util_xalloc
7223.8/sec 6527.317/sec 51569.9900/sec total: 251999059
> 2018-02-07T00:12:55.630Z|01815|coverage|INFO|vconn_received
0.2/sec 0.033/sec 0.4858/sec total: 6675
> 2018-02-07T00:12:55.630Z|01816|coverage|INFO|vconn_sent
0.2/sec 0.033/sec 0.0453/sec total: 1216
> 2018-02-07T00:12:55.630Z|01817|coverage|INFO|netlink_received
0.0/sec 0.000/sec 0.0000/sec total: 3
> 2018-02-07T00:12:55.630Z|01818|coverage|INFO|netlink_sent
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:55.630Z|01819|coverage|INFO|57 events never hit
> 2018-02-07T00:12:56.761Z|01820|timeval|WARN|Unreasonably long 1112ms
poll interval (948ms user, 4ms system)
> 2018-02-07T00:12:56.761Z|01821|timeval|WARN|faults: 7 minor, 0 major
> 2018-02-07T00:12:56.761Z|01822|timeval|WARN|context switches: 2
voluntary, 17 involuntary
> 2018-02-07T00:12:56.761Z|01823|coverage|INFO|Event coverage, avg rate
over last: 5 seconds, last minute, last hour, hash=fcc57021:
> 2018-02-07T00:12:56.761Z|01824|coverage|INFO|bridge_reconfigure
0.0/sec 0.000/sec 0.0056/sec total: 30
> 2018-02-07T00:12:56.761Z|01825|coverage|INFO|ofproto_flush
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:56.761Z|01826|coverage|INFO|ofproto_packet_out
0.0/sec 0.000/sec 0.0111/sec total: 90
> 2018-02-07T00:12:56.761Z|01827|coverage|INFO|ofproto_recv_openflow
0.2/sec 0.033/sec 0.4858/sec total: 6683
> 2018-02-07T00:12:56.761Z|01828|coverage|INFO|ofproto_update_port
0.0/sec 0.000/sec 5.5883/sec total: 29268
> 2018-02-07T00:12:56.761Z|01829|coverage|INFO|rev_reconfigure
0.0/sec 0.000/sec 0.0056/sec total: 33
> 2018-02-07T00:12:56.761Z|01830|coverage|INFO|rev_port_toggled
0.0/sec 0.000/sec 0.0011/sec total: 6
> 2018-02-07T00:12:56.761Z|01831|coverage|INFO|rev_flow_table
0.0/sec 0.000/sec 0.0078/sec total: 124
> 2018-02-07T00:12:56.761Z|01832|coverage|INFO|rev_mac_learning
0.0/sec 0.000/sec 0.0036/sec total: 13
> 2018-02-07T00:12:56.761Z|01833|coverage|INFO|xlate_actions
0.0/sec 0.000/sec 0.0111/sec total: 90
> 2018-02-07T00:12:56.761Z|01834|coverage|INFO|ccmap_expand
0.0/sec 0.000/sec 0.0050/sec total: 36
> 2018-02-07T00:12:56.761Z|01835|coverage|INFO|ccmap_shrink
0.0/sec 0.000/sec 0.0000/sec total: 49
> 2018-02-07T00:12:56.761Z|01836|coverage|INFO|cmap_expand
0.0/sec 0.000/sec 0.0217/sec total: 257
> 2018-02-07T00:12:56.761Z|01837|coverage|INFO|cmap_shrink
0.0/sec 0.000/sec 0.0000/sec total: 141
> 2018-02-07T00:12:56.761Z|01838|coverage|INFO|dpif_port_add
0.0/sec 0.000/sec 0.0022/sec total: 1020
> 2018-02-07T00:12:56.761Z|01839|coverage|INFO|dpif_port_del
0.0/sec 0.000/sec 0.0000/sec total: 40
> 2018-02-07T00:12:56.761Z|01840|coverage|INFO|dpif_flow_flush
0.0/sec 0.000/sec 0.0000/sec total: 2
> 2018-02-07T00:12:56.761Z|01841|coverage|INFO|dpif_flow_get
0.0/sec 0.000/sec 0.0000/sec total: 20
> 2018-02-07T00:12:56.761Z|01842|coverage|INFO|dpif_flow_put
0.0/sec 0.000/sec 0.0000/sec total: 21
> 2018-02-07T00:12:56.761Z|01843|coverage|INFO|dpif_flow_del
0.0/sec 0.000/sec 0.0000/sec total: 20
> 2018-02-07T00:12:56.761Z|01844|coverage|INFO|dpif_execute
0.0/sec 0.000/sec 0.0100/sec total: 71
> 2018-02-07T00:12:56.761Z|01845|coverage|INFO|flow_extract
0.0/sec 0.000/sec 0.0000/sec total: 4
> 2018-02-07T00:12:56.761Z|01846|coverage|INFO|miniflow_malloc
0.0/sec 0.000/sec 0.4361/sec total: 6440
> 2018-02-07T00:12:56.761Z|01847|coverage|INFO|hindex_pathological
0.0/sec 0.000/sec 0.0000/sec total: 17
> 2018-02-07T00:12:56.761Z|01848|coverage|INFO|hindex_expand
0.0/sec 0.000/sec 0.0000/sec total: 9
> 2018-02-07T00:12:56.761Z|01849|coverage|INFO|hmap_pathological
0.0/sec 0.000/sec 0.8131/sec total: 13825
> 2018-02-07T00:12:56.761Z|01850|coverage|INFO|hmap_expand
21.0/sec 18.333/sec 11253.0167/sec total: 57949607
> 2018-02-07T00:12:56.761Z|01851|coverage|INFO|hmap_shrink
0.0/sec 0.000/sec 0.0022/sec total: 1029
> 2018-02-07T00:12:56.761Z|01852|coverage|INFO|mac_learning_learned
0.0/sec 0.000/sec 0.0022/sec total: 18
> 2018-02-07T00:12:56.761Z|01853|coverage|INFO|mac_learning_expired
0.0/sec 0.000/sec 0.0017/sec total: 6
> 2018-02-07T00:12:56.761Z|01854|coverage|INFO|netdev_sent
0.0/sec 0.000/sec 0.0400/sec total: 350
> 2018-02-07T00:12:56.761Z|01855|coverage|INFO|netdev_get_stats
203.4/sec 203.400/sec 198.4978/sec total: 799540
> 2018-02-07T00:12:56.761Z|01856|coverage|INFO|txn_unchanged
0.2/sec 0.200/sec 0.1911/sec total: 777
> 2018-02-07T00:12:56.761Z|01857|coverage|INFO|txn_incomplete
0.0/sec 0.000/sec 0.0167/sec total: 108
> 2018-02-07T00:12:56.761Z|01858|coverage|INFO|txn_success
0.0/sec 0.000/sec 0.0111/sec total: 49
> 2018-02-07T00:12:56.761Z|01859|coverage|INFO|poll_create_node
162.6/sec 157.267/sec 156.3664/sec total: 636681
> 2018-02-07T00:12:56.761Z|01860|coverage|INFO|poll_zero_timeout
0.2/sec 0.250/sec 0.4564/sec total: 1937
> 2018-02-07T00:12:56.761Z|01861|coverage|INFO|rconn_queued
0.2/sec 0.033/sec 0.0453/sec total: 1216
> 2018-02-07T00:12:56.761Z|01862|coverage|INFO|rconn_sent
0.2/sec 0.033/sec 0.0453/sec total: 1216
> 2018-02-07T00:12:56.761Z|01863|coverage|INFO|seq_change
1098.8/sec 1099.167/sec 1097.4869/sec total: 4471434
> 2018-02-07T00:12:56.761Z|01864|coverage|INFO|pstream_open
0.0/sec 0.000/sec 0.0000/sec total: 5
> 2018-02-07T00:12:56.761Z|01865|coverage|INFO|stream_open
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:56.761Z|01866|coverage|INFO|util_xalloc
7223.8/sec 6527.317/sec 51569.9900/sec total: 260206006
> 2018-02-07T00:12:56.761Z|01867|coverage|INFO|vconn_received
0.2/sec 0.033/sec 0.4858/sec total: 6685
> 2018-02-07T00:12:56.761Z|01868|coverage|INFO|vconn_sent
0.2/sec 0.033/sec 0.0453/sec total: 1218
> 2018-02-07T00:12:56.761Z|01869|coverage|INFO|netlink_received
0.0/sec 0.000/sec 0.0000/sec total: 3
> 2018-02-07T00:12:56.761Z|01870|coverage|INFO|netlink_sent
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:56.761Z|01871|coverage|INFO|57 events never hit
> 2018-02-07T00:12:57.054Z|01872|bridge|INFO|bridge br-int: added
interface lport_bc65cd_lykjnm on port 1029
> 2018-02-07T00:12:58.202Z|01873|timeval|WARN|Unreasonably long 1179ms
poll interval (1032ms user, 0ms system)
> 2018-02-07T00:12:58.202Z|01874|timeval|WARN|faults: 41 minor, 0 major
> 2018-02-07T00:12:58.202Z|01875|timeval|WARN|context switches: 0
voluntary, 19 involuntary
> 2018-02-07T00:12:58.203Z|01876|coverage|INFO|Event coverage, avg rate
over last: 5 seconds, last minute, last hour, hash=668121b0:
> 2018-02-07T00:12:58.203Z|01877|coverage|INFO|bridge_reconfigure
0.0/sec 0.000/sec 0.0056/sec total: 31
> 2018-02-07T00:12:58.203Z|01878|coverage|INFO|ofproto_flush
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:58.203Z|01879|coverage|INFO|ofproto_packet_out
0.0/sec 0.000/sec 0.0111/sec total: 90
> 2018-02-07T00:12:58.203Z|01880|coverage|INFO|ofproto_recv_openflow
0.2/sec 0.033/sec 0.4858/sec total: 6733
> 2018-02-07T00:12:58.203Z|01881|coverage|INFO|ofproto_update_port
0.0/sec 0.000/sec 5.5883/sec total: 30279
> 2018-02-07T00:12:58.203Z|01882|coverage|INFO|rev_reconfigure
0.0/sec 0.000/sec 0.0056/sec total: 34
> 2018-02-07T00:12:58.203Z|01883|coverage|INFO|rev_port_toggled
0.0/sec 0.000/sec 0.0011/sec total: 6
> 2018-02-07T00:12:58.203Z|01884|coverage|INFO|rev_flow_table
0.0/sec 0.000/sec 0.0078/sec total: 125
> 2018-02-07T00:12:58.203Z|01885|coverage|INFO|rev_mac_learning
0.0/sec 0.000/sec 0.0036/sec total: 13
> 2018-02-07T00:12:58.203Z|01886|coverage|INFO|xlate_actions
0.0/sec 0.000/sec 0.0111/sec total: 90
> 2018-02-07T00:12:58.203Z|01887|coverage|INFO|ccmap_expand
0.0/sec 0.000/sec 0.0050/sec total: 36
> 2018-02-07T00:12:58.203Z|01888|coverage|INFO|ccmap_shrink
0.0/sec 0.000/sec 0.0000/sec total: 49
> 2018-02-07T00:12:58.203Z|01889|coverage|INFO|cmap_expand
0.0/sec 0.000/sec 0.0217/sec total: 260
> 2018-02-07T00:12:58.203Z|01890|coverage|INFO|cmap_shrink
0.0/sec 0.000/sec 0.0000/sec total: 141
> 2018-02-07T00:12:58.203Z|01891|coverage|INFO|dpif_port_add
0.0/sec 0.000/sec 0.0022/sec total: 1021
> 2018-02-07T00:12:58.203Z|01892|coverage|INFO|dpif_port_del
0.0/sec 0.000/sec 0.0000/sec total: 40
> 2018-02-07T00:12:58.203Z|01893|coverage|INFO|dpif_flow_flush
0.0/sec 0.000/sec 0.0000/sec total: 2
> 2018-02-07T00:12:58.203Z|01894|coverage|INFO|dpif_flow_get
0.0/sec 0.000/sec 0.0000/sec total: 20
> 2018-02-07T00:12:58.203Z|01895|coverage|INFO|dpif_flow_put
0.0/sec 0.000/sec 0.0000/sec total: 21
> 2018-02-07T00:12:58.203Z|01896|coverage|INFO|dpif_flow_del
0.0/sec 0.000/sec 0.0000/sec total: 20
> 2018-02-07T00:12:58.203Z|01897|coverage|INFO|dpif_execute
0.0/sec 0.000/sec 0.0100/sec total: 71
> 2018-02-07T00:12:58.203Z|01898|coverage|INFO|flow_extract
0.0/sec 0.000/sec 0.0000/sec total: 4
> 2018-02-07T00:12:58.203Z|01899|coverage|INFO|miniflow_malloc
0.0/sec 0.000/sec 0.4361/sec total: 6489
> 2018-02-07T00:12:58.203Z|01900|coverage|INFO|hindex_pathological
0.0/sec 0.000/sec 0.0000/sec total: 17
> 2018-02-07T00:12:58.203Z|01901|coverage|INFO|hindex_expand
0.0/sec 0.000/sec 0.0000/sec total: 9
> 2018-02-07T00:12:58.203Z|01902|coverage|INFO|hmap_pathological
0.0/sec 0.000/sec 0.8131/sec total: 13945
> 2018-02-07T00:12:58.203Z|01903|coverage|INFO|hmap_expand
21.0/sec 18.333/sec 11253.0167/sec total: 59998044
> 2018-02-07T00:12:58.203Z|01904|coverage|INFO|hmap_shrink
0.0/sec 0.000/sec 0.0022/sec total: 1030
> 2018-02-07T00:12:58.203Z|01905|coverage|INFO|mac_learning_learned
0.0/sec 0.000/sec 0.0022/sec total: 18
> 2018-02-07T00:12:58.203Z|01906|coverage|INFO|mac_learning_expired
0.0/sec 0.000/sec 0.0017/sec total: 6
> 2018-02-07T00:12:58.203Z|01907|coverage|INFO|netdev_sent
0.0/sec 0.000/sec 0.0400/sec total: 350
> 2018-02-07T00:12:58.203Z|01908|coverage|INFO|netdev_get_stats
203.4/sec 203.400/sec 198.4978/sec total: 799541
> 2018-02-07T00:12:58.203Z|01909|coverage|INFO|txn_unchanged
0.2/sec 0.200/sec 0.1911/sec total: 777
> 2018-02-07T00:12:58.203Z|01910|coverage|INFO|txn_incomplete
0.0/sec 0.000/sec 0.0167/sec total: 110
> 2018-02-07T00:12:58.203Z|01911|coverage|INFO|txn_success
0.0/sec 0.000/sec 0.0111/sec total: 50
> 2018-02-07T00:12:58.203Z|01912|coverage|INFO|poll_create_node
162.6/sec 157.267/sec 156.3664/sec total: 636974
> 2018-02-07T00:12:58.203Z|01913|coverage|INFO|poll_zero_timeout
0.2/sec 0.250/sec 0.4564/sec total: 1939
> 2018-02-07T00:12:58.203Z|01914|coverage|INFO|rconn_queued
0.2/sec 0.033/sec 0.0453/sec total: 1217
> 2018-02-07T00:12:58.203Z|01915|coverage|INFO|rconn_sent
0.2/sec 0.033/sec 0.0453/sec total: 1217
> 2018-02-07T00:12:58.203Z|01916|coverage|INFO|seq_change
1098.8/sec 1099.167/sec 1097.4869/sec total: 4474605
> 2018-02-07T00:12:58.203Z|01917|coverage|INFO|pstream_open
0.0/sec 0.000/sec 0.0000/sec total: 5
> 2018-02-07T00:12:58.203Z|01918|coverage|INFO|stream_open
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:58.203Z|01919|coverage|INFO|util_xalloc
7223.8/sec 6527.317/sec 51569.9900/sec total: 268435993
> 2018-02-07T00:12:58.203Z|01920|coverage|INFO|vconn_received
0.2/sec 0.033/sec 0.4858/sec total: 6735
> 2018-02-07T00:12:58.203Z|01921|coverage|INFO|vconn_sent
0.2/sec 0.033/sec 0.0453/sec total: 1219
> 2018-02-07T00:12:58.203Z|01922|coverage|INFO|netlink_received
0.0/sec 0.000/sec 0.0000/sec total: 3
> 2018-02-07T00:12:58.203Z|01923|coverage|INFO|netlink_sent
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:12:58.203Z|01924|coverage|INFO|57 events never hit
> 2018-02-07T00:12:58.221Z|01925|bridge|INFO|bridge br-int: added
interface patch-br-int-to-provnet-lswitch_bc65cd_Y7sqEw on port 1030
> 2018-02-07T00:12:58.221Z|01926|bridge|INFO|bridge br0: added
interface patch-provnet-lswitch_bc65cd_Y7sqEw-to-br-int on port 15
> 2018-02-07T00:12:59.313Z|01927|timeval|WARN|Unreasonably long 1111ms
poll interval (1092ms user, 0ms system)
> 2018-02-07T00:12:59.313Z|01928|timeval|WARN|faults: 45 minor, 0 major
> 2018-02-07T00:12:59.313Z|01929|timeval|WARN|context switches: 6
voluntary, 59 involuntary
> 2018-02-07T00:12:59.313Z|01930|coverage|INFO|Skipping details of
duplicate event coverage for hash=668121b0
> 2018-02-07T00:13:00.430Z|01931|timeval|WARN|Unreasonably long 1117ms
poll interval (1112ms user, 0ms system)
> 2018-02-07T00:13:00.430Z|01932|timeval|WARN|faults: 34 minor, 0 major
> 2018-02-07T00:13:00.430Z|01933|timeval|WARN|context switches: 2
voluntary, 11 involuntary
> 2018-02-07T00:13:00.431Z|01934|coverage|INFO|Event coverage, avg rate
over last: 5 seconds, last minute, last hour, hash=a612a6e0:
> 2018-02-07T00:13:00.431Z|01935|coverage|INFO|bridge_reconfigure
0.8/sec 0.067/sec 0.0067/sec total: 33
> 2018-02-07T00:13:00.431Z|01936|coverage|INFO|ofproto_flush
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:13:00.431Z|01937|coverage|INFO|ofproto_packet_out
0.2/sec 0.017/sec 0.0114/sec total: 92
> 2018-02-07T00:13:00.431Z|01938|coverage|INFO|ofproto_recv_openflow
32.2/sec 2.717/sec 0.5306/sec total: 6935
> 2018-02-07T00:13:00.431Z|01939|coverage|INFO|ofproto_update_port
808.8/sec 67.400/sec 6.7117/sec total: 32303
> 2018-02-07T00:13:00.431Z|01940|coverage|INFO|rev_reconfigure
0.8/sec 0.067/sec 0.0067/sec total: 36
> 2018-02-07T00:13:00.431Z|01941|coverage|INFO|rev_port_toggled
0.0/sec 0.000/sec 0.0011/sec total: 7
> 2018-02-07T00:13:00.431Z|01942|coverage|INFO|rev_flow_table
0.4/sec 0.033/sec 0.0083/sec total: 126
> 2018-02-07T00:13:00.431Z|01943|coverage|INFO|rev_mac_learning
0.0/sec 0.000/sec 0.0036/sec total: 13
> 2018-02-07T00:13:00.431Z|01944|coverage|INFO|xlate_actions
0.2/sec 0.017/sec 0.0114/sec total: 92
> 2018-02-07T00:13:00.431Z|01945|coverage|INFO|ccmap_expand
0.0/sec 0.000/sec 0.0050/sec total: 36
> 2018-02-07T00:13:00.431Z|01946|coverage|INFO|ccmap_shrink
0.0/sec 0.000/sec 0.0000/sec total: 49
> 2018-02-07T00:13:00.431Z|01947|coverage|INFO|cmap_expand
2.0/sec 0.167/sec 0.0244/sec total: 271
> 2018-02-07T00:13:00.431Z|01948|coverage|INFO|cmap_shrink
0.0/sec 0.000/sec 0.0000/sec total: 141
> 2018-02-07T00:13:00.431Z|01949|coverage|INFO|dpif_port_add
0.4/sec 0.033/sec 0.0028/sec total: 1021
> 2018-02-07T00:13:00.431Z|01950|coverage|INFO|dpif_port_del
0.0/sec 0.000/sec 0.0000/sec total: 40
> 2018-02-07T00:13:00.431Z|01951|coverage|INFO|dpif_flow_flush
0.0/sec 0.000/sec 0.0000/sec total: 2
> 2018-02-07T00:13:00.431Z|01952|coverage|INFO|dpif_flow_get
0.0/sec 0.000/sec 0.0000/sec total: 20
> 2018-02-07T00:13:00.431Z|01953|coverage|INFO|dpif_flow_put
0.0/sec 0.000/sec 0.0000/sec total: 21
> 2018-02-07T00:13:00.431Z|01954|coverage|INFO|dpif_flow_del
0.0/sec 0.000/sec 0.0000/sec total: 20
> 2018-02-07T00:13:00.431Z|01955|coverage|INFO|dpif_execute
0.0/sec 0.000/sec 0.0100/sec total: 72
> 2018-02-07T00:13:00.431Z|01956|coverage|INFO|flow_extract
0.0/sec 0.000/sec 0.0000/sec total: 4
> 2018-02-07T00:13:00.431Z|01957|coverage|INFO|miniflow_malloc
30.8/sec 2.567/sec 0.4789/sec total: 6689
> 2018-02-07T00:13:00.431Z|01958|coverage|INFO|hindex_pathological
0.0/sec 0.000/sec 0.0000/sec total: 17
> 2018-02-07T00:13:00.431Z|01959|coverage|INFO|hindex_expand
0.0/sec 0.000/sec 0.0000/sec total: 9
> 2018-02-07T00:13:00.431Z|01960|coverage|INFO|hmap_pathological
76.8/sec 6.400/sec 0.9197/sec total: 14149
> 2018-02-07T00:13:00.431Z|01961|coverage|INFO|hmap_expand
1637509.8/sec 136476.267/sec 13527.3358/sec total: 64098893
> 2018-02-07T00:13:00.431Z|01962|coverage|INFO|hmap_shrink
0.4/sec 0.033/sec 0.0028/sec total: 1030
> 2018-02-07T00:13:00.431Z|01963|coverage|INFO|mac_learning_learned
0.0/sec 0.000/sec 0.0022/sec total: 19
> 2018-02-07T00:13:00.431Z|01964|coverage|INFO|mac_learning_expired
0.0/sec 0.000/sec 0.0017/sec total: 6
> 2018-02-07T00:13:00.431Z|01965|coverage|INFO|netdev_sent
0.0/sec 0.000/sec 0.0400/sec total: 359
> 2018-02-07T00:13:00.431Z|01966|coverage|INFO|netdev_get_stats
204.4/sec 203.483/sec 198.7817/sec total: 799543
> 2018-02-07T00:13:00.431Z|01967|coverage|INFO|txn_unchanged
0.2/sec 0.200/sec 0.1914/sec total: 777
> 2018-02-07T00:13:00.431Z|01968|coverage|INFO|txn_incomplete
1.4/sec 0.117/sec 0.0186/sec total: 113
> 2018-02-07T00:13:00.431Z|01969|coverage|INFO|txn_success
0.6/sec 0.050/sec 0.0119/sec total: 51
> 2018-02-07T00:13:00.431Z|01970|coverage|INFO|poll_create_node
241.8/sec 164.833/sec 156.7022/sec total: 637792
> 2018-02-07T00:13:00.431Z|01971|coverage|INFO|poll_zero_timeout
1.8/sec 0.367/sec 0.4589/sec total: 1944
> 2018-02-07T00:13:00.431Z|01972|coverage|INFO|rconn_queued
1.2/sec 0.133/sec 0.0469/sec total: 1219
> 2018-02-07T00:13:00.431Z|01973|coverage|INFO|rconn_sent
1.2/sec 0.133/sec 0.0469/sec total: 1219
> 2018-02-07T00:13:00.431Z|01974|coverage|INFO|seq_change
2679.2/sec 1230.183/sec 1101.2081/sec total: 4481090
> 2018-02-07T00:13:00.431Z|01975|coverage|INFO|pstream_open
0.0/sec 0.000/sec 0.0000/sec total: 5
> 2018-02-07T00:13:00.431Z|01976|coverage|INFO|stream_open
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:13:00.431Z|01977|coverage|INFO|util_xalloc
6573732.0/sec 553875.117/sec 60700.1733/sec total: 284890464
> 2018-02-07T00:13:00.431Z|01978|coverage|INFO|vconn_received
32.2/sec 2.717/sec 0.5306/sec total: 6937
> 2018-02-07T00:13:00.431Z|01979|coverage|INFO|vconn_sent
1.2/sec 0.133/sec 0.0469/sec total: 1221
> 2018-02-07T00:13:00.431Z|01980|coverage|INFO|netlink_received
0.0/sec 0.000/sec 0.0000/sec total: 3
> 2018-02-07T00:13:00.431Z|01981|coverage|INFO|netlink_sent
0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-02-07T00:13:00.431Z|01982|coverage|INFO|57 events never hit
> 2018-02-07T00:13:00.431Z|01983|poll_loop|INFO|Dropped 6 log messages
in last 217 seconds (most recently, 216 seconds ago) due to excessive rate
> 2018-02-07T00:13:00.431Z|01984|poll_loop|INFO|wakeup due to [POLLIN]
on fd 112 (/root/sandbox-192.168.64.0/br-int.mgmt<->) at
../lib/stream-fd.c:157 (90% CPU usage)
>
> I managed to capture a perf record when a ovs-vswitchd is spiking,
not sure if it helps:
> Samples: 6K of event 'cycles:pp', Event count (approx.): 118132588562209
> Children Self Command Shared Object Symbol
◆
> + 66.01% 0.00% ovs-vswitchd ovs-vswitchd [.] _start
▒
> + 66.01% 0.00% ovs-vswitchd libc-2.23.so <http://libc-2.23.so>
[.] __libc_start_main
▒
> + 66.01% 0.00% ovs-vswitchd ovs-vswitchd [.] main
▒
> + 66.01% 0.00% ovs-vswitchd ovs-vswitchd [.]
bridge_run
▒
> + 66.01% 0.00% ovs-vswitchd ovs-vswitchd [.]
bridge_run__
▒
> + 41.88% 0.00% ovs-vswitchd ovs-vswitchd [.]
ofproto_type_run
▒
> + 41.88% 0.00% ovs-vswitchd ovs-vswitchd [.] type_run
▒
> + 24.13% 0.00% ovs-vswitchd ovs-vswitchd [.]
bridge_reconfigure
▒
> + 24.13% 0.00% ovs-vswitchd ovs-vswitchd [.]
ofproto_run
▒
> + 24.13% 0.00% ovs-vswitchd ovs-vswitchd [.]
update_port
▒
> + 23.31% 0.00% ovs-vswitchd ovs-vswitchd [.]
update_mtu
▒
> + 23.31% 0.00% ovs-vswitchd ovs-vswitchd [.]
update_mtu_ofproto
▒
> + 23.31% 0.00% ovs-vswitchd ovs-vswitchd [.]
ofport_is_mtu_overridden
▒
> + 23.27% 0.00% ovs-vswitchd ovs-vswitchd [.]
xlate_txn_start
▒
> + 23.27% 0.00% ovs-vswitchd ovs-vswitchd [.]
xlate_xport_copy
▒
> + 23.27% 0.00% ovs-vswitchd libjemalloc.so.1 [.] calloc
▒
> + 23.27% 0.00% ovs-vswitchd ovs-vswitchd [.] xcalloc
▒
> + 23.27% 23.27% ovs-vswitchd libc-2.23.so <http://libc-2.23.so>
[.] 0x000000000008f294
▒
> + 22.38% 0.00% ovs-vswitchd ovs-vswitchd [.]
ofproto_class_find__
▒
> + 22.38% 0.00% ovs-vswitchd ovs-vswitchd [.]
ofproto_port_open_type
▒
> + 20.96% 0.00% ovs-vswitchd ovs-vswitchd [.]
dp_enumerate_types
▒
> + 20.14% 20.14% ovs-vswitchd libjemalloc.so.1 [.] malloc
▒
> + 19.43% 0.81% ovs-vswitchd ovs-vswitchd [.]
ovs_mutex_lock_at
▒
> + 19.33% 0.00% ovs-vswitchd ovs-vswitchd [.]
sset_add__
▒
> + 19.33% 0.00% ovs-vswitchd ovs-vswitchd [.] xmalloc
▒
> + 18.62% 18.62% ovs-vswitchd libpthread-2.23.so
<http://libpthread-2.23.so> [.] pthread_mutex_lock
▒
> + 18.61% 0.00% ovs-vswitchd ovs-vswitchd [.]
dpif_netdev_run
▒
> + 18.61% 0.00% ovs-vswitchd ovs-vswitchd [.]
dp_netdev_process_rxq_port
▒
> + 18.61% 0.00% ovs-vswitchd ovs-vswitchd [.]
netdev_rxq_recv
▒
> + 18.61% 0.00% ovs-vswitchd ovs-vswitchd [.]
netdev_dummy_rxq_recv
▒
> + 13.03% 0.00% revalidator66 [kernel.kallsyms] [k]
entry_SYSCALL_64_fastpath
▒
> + 13.03% 0.00% revalidator66 libc-2.23.so
<http://libc-2.23.so> [.] __clone
▒
> + 13.03% 0.00% revalidator66 libpthread-2.23.so
<http://libpthread-2.23.so> [.] start_thread
▒
>
> Thanks,
> Han
Interestingly, I located the change that slows down the port-binding:
===
commit 2d4beba14ae1075a7baf34fa57424edaf4eb7697
Author: Numan Siddique <[email protected] <mailto:[email protected]>>
Date: Tue Sep 12 14:22:03 2017 +0530
ofproto: Include patch ports in mtu overriden check
When a patch port is deleted from the bridge (with no other ports
in the bridge) and if the bridge was set to an MTU by the user
earlier, the
MTU of the bridge is overriden to 1500. Please see the below link
for the
steps to reproduce the issue.
This patch fixes this issue.
Reported-at:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-September/338665.html
Signed-off-by: Numan Siddique <[email protected]
<mailto:[email protected]>>
Signed-off-by: Ben Pfaff <[email protected] <mailto:[email protected]>>
===
It perfectly matches the perf output shared above when ovs-vswitchd is
running hot.
> + 24.13% 0.00% ovs-vswitchd ovs-vswitchd [.]
ofproto_run ▒
> + 24.13% 0.00% ovs-vswitchd ovs-vswitchd [.] update_port
▒
> + 23.31% 0.00% ovs-vswitchd ovs-vswitchd [.] update_mtu
▒
> + 23.31% 0.00% ovs-vswitchd ovs-vswitchd [.]
update_mtu_ofproto ▒
> + 23.31% 0.00% ovs-vswitchd ovs-vswitchd [.]
ofport_is_mtu_overridden ▒
...
I will study more on the details of how it is triggered and look at how
to optimize it. Any suggestions are welcome.
cc Numan & Ben.
Thanks,
Han
Hi Han,
If you investigate the patch you referenced, it's fairly small. It adds
an extra strcmp() call and an extra call to ofproto_port_open_type().
It's most likely the extra call to ofproto_port_open_type() is causing
the extra CPU usage. Most of the processing done by calling it a second
time is redundant.
Patch one to optimize the behavior would be to factor out the common
parts of ofproto_port_open_type() so that it may be called once. Then
call the parts that differ for each port type. This should be simple. It
would be worth testing again with this patch alone to see how much of an
impact this makes.
Patch two would be to start caching results of ofproto_port_open_type()
locally since the result is not likely to differ between repeated calls.
This would be a bit trickier to make 100% correct because you would need
to be able to reliably detect when your cached value is "dirty". This
patch would be worth pursuing if patch one does not do enough on its own.
Mark!
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss