UPD:
I’ve checked ovn-controller dbg logs and checked against its code.
Not sure, but looks like some bug in first processing of ct_zones. This logfile
is from 20.06 version, but it has similar behaviour on v21.12:
start ovn-controller
2022-02-03T16:09:50.500Z|00032|jsonrpc|DBG|unix:/run/openvswitch/db.sock:
received reply, result={}, id=3
2022-02-03T16:09:50.500Z|00033|ovsdb_idl|DBG|unix:/run/openvswitch/db.sock:
received unexpected reply message: {"error":null,"id":3,"result":{}}
2022-02-03T16:09:50.500Z|00034|main|DBG|restoring ct zone 25 for
'vnet37798-gateway-vif'
2022-02-03T16:09:50.500Z|00035|main|DBG|restoring ct zone 8 for
'5ae22181-33a5-4596-8143-e5e396fbc17e_dnat'
2022-02-03T16:09:50.500Z|00036|main|DBG|restoring ct zone 57 for
'3210cc4f-d055-4375-8425-aef0bd4db5fc_dnat'
2022-02-03T16:09:50.500Z|00037|main|DBG|restoring ct zone 10 for
'vnet38045-gateway-vif'
2022-02-03T16:09:50.500Z|00038|main|DBG|restoring ct zone 60 for
'910c751a-3d86-420a-9606-3f751de4e8bf_dnat'
2022-02-03T16:09:50.500Z|00039|main|DBG|restoring ct zone 146 for
'30fda34a-7614-487a-9b17-258781c06e31_snat'
2022-02-03T16:09:50.500Z|00040|main|DBG|restoring ct zone 171 for
'de72957f-b877-4c13-8b3d-9fe9b3abcc47_snat'
2022-02-03T16:09:50.500Z|00041|main|DBG|restoring ct zone 184 for
'vnet36210-gateway-vif'
2022-02-03T16:09:50.500Z|00042|main|DBG|restoring ct zone 120 for
'fdc23a91-fbcc-4fc4-9edd-e2001a827a2a_snat'
2022-02-03T16:09:50.500Z|00043|main|DBG|restoring ct zone 102 for
'2c9ea3e6-7ba1-445e-b707-1873d9a85e00_dnat'
2022-02-03T16:09:50.500Z|00044|main|DBG|restoring ct zone 152 for
'fa746f3c-1d4e-450a-80c9-e6cd39676bbc_snat'
2022-02-03T16:09:50.500Z|00045|main|DBG|restoring ct zone 92 for
'8e81ebd1-f43d-4d90-85d9-dc9e85690f58_dnat'
2022-02-03T16:09:50.500Z|00046|main|DBG|restoring ct zone 135 for
'ad815732-f341-4a6d-bdce-184dc6c53edf_snat'
2022-02-03T16:09:50.500Z|00047|main|DBG|restoring ct zone 100 for
'vnet35091-gateway-vif'
2022-02-03T16:09:50.500Z|00048|main|DBG|restoring ct zone 162 for
'1078be52-1141-4060-b98d-23d96464b8bf_dnat'
2022-02-03T16:09:50.500Z|00049|main|DBG|restoring ct zone 30 for
'1078be52-1141-4060-b98d-23d96464b8bf_snat'
2022-02-03T16:09:50.500Z|00050|main|DBG|restoring ct zone 130 for
'605f08c3-f0be-4c0b-a721-5955bf2852d9_snat'
2022-02-03T16:09:50.500Z|00051|main|DBG|restoring ct zone 121 for
'27e28d02-ff81-4faa-b79b-8b6174e7ceea_dnat'
2022-02-03T16:09:50.500Z|00052|main|DBG|restoring ct zone 70 for
'vnet35515-gateway-vif'
2022-02-03T16:09:50.500Z|00053|main|DBG|restoring ct zone 26 for
'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat'
2022-02-03T16:09:50.500Z|00054|main|DBG|restoring ct zone 114 for
'c7f1776b-0a1f-4434-bac5-db2b59f3363b_snat'
…
2022-02-03T16:09:50.501Z|00226|inc_proc_eng|DBG|Initializing new run
…
2022-02-03T16:09:50.730Z|00287|inc_proc_eng|DBG|Initializing new run
…
2022-02-03T16:09:50.739Z|00311|inc_proc_eng|DBG|Initializing new run
<some updates from SB DB>
…
2022-02-03T16:09:50.754Z|00347|inc_proc_eng|DBG|Initializing new run
…
2022-02-03T16:09:50.779Z|00408|main|DBG|removing ct zone 2 for
'5601bacd-d246-42f8-aa69-c196f74ca57c_snat'
2022-02-03T16:09:50.779Z|00409|main|DBG|removing ct zone 3 for
'96b6fa7d-d974-4260-a58e-c78342fd097d_snat'
2022-02-03T16:09:50.779Z|00410|main|DBG|removing ct zone 5 for
'a8e55770-885f-464e-bfa9-9c80eca0ce62_snat'
2022-02-03T16:09:50.779Z|00411|main|DBG|removing ct zone 6 for
'2969a6a5-012d-4504-83bf-be5192c4af87_dnat'
2022-02-03T16:09:50.779Z|00412|main|DBG|removing ct zone 9 for
'03175d56-14c2-4ea9-b671-e427fe6b62e6_dnat'
2022-02-03T16:09:50.779Z|00413|main|DBG|removing ct zone 8 for
'5ae22181-33a5-4596-8143-e5e396fbc17e_dnat'
2022-02-03T16:09:50.779Z|00414|main|DBG|removing ct zone 12 for
'11ecffe1-c6f9-4373-948a-fff26113af73_snat'
2022-02-03T16:09:50.779Z|00415|main|DBG|removing ct zone 15 for
'791f9a0b-81e4-4a9e-8f38-7d56332b6237_snat'
2022-02-03T16:09:50.779Z|00416|main|DBG|removing ct zone 14 for
'0408d0d9-e7d9-49f7-a0f2-32d611b56686_snat'
2022-02-03T16:09:50.779Z|00417|main|DBG|removing ct zone 13 for
'd8cd2a0c-bd9e-4472-920e-5443e2a94429_dnat'
2022-02-03T16:09:50.779Z|00418|main|DBG|removing ct zone 17 for
'4e59bb4f-1ec6-4375-bb1c-63e03d1133ce_snat'
2022-02-03T16:09:50.779Z|00419|main|DBG|removing ct zone 18 for
'3210cc4f-d055-4375-8425-aef0bd4db5fc_snat'
2022-02-03T16:09:50.779Z|00420|main|DBG|removing ct zone 20 for
'cfcb7e97-6107-435e-8247-771314968be9_dnat'
2022-02-03T16:09:50.779Z|00421|main|DBG|removing ct zone 23 for
'aba29611-9a7c-497d-9dbb-2d18314d517b_snat'
2022-02-03T16:09:50.779Z|00422|main|DBG|removing ct zone 21 for
'57e00565-1250-4fa7-9c11-a7c1c602bf32_snat'
2022-02-03T16:09:50.779Z|00423|main|DBG|removing ct zone 26 for
'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat'
2022-02-03T16:09:50.779Z|00424|main|DBG|removing ct zone 24 for
'2486c8ff-9eb4-4326-8601-a2e93343164e_dnat'
2022-02-03T16:09:50.779Z|00425|main|DBG|removing ct zone 27 for
'278af800-e548-488a-9f5a-4f965718b56d_snat'
2022-02-03T16:09:50.779Z|00426|main|DBG|removing ct zone 30 for
'1078be52-1141-4060-b98d-23d96464b8bf_snat'
…
...
2022-02-03T16:09:50.780Z|00546|inc_proc_eng|DBG|node: flow_output, recompute
(forced)
2022-02-03T16:09:50.780Z|00547|ofctrl|DBG|ofctrl_add_flow flow: cookie=0,
table_id=0, priority=100, in_port=22,
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00548|ofctrl|DBG|ofctrl_add_flow flow: cookie=0,
table_id=0, priority=100, in_port=9,
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00549|ofctrl|DBG|ofctrl_add_flow flow: cookie=0,
table_id=0, priority=100, in_port=60,
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00550|ofctrl|DBG|ofctrl_add_flow flow: cookie=0,
table_id=0, priority=100, in_port=18,
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00551|ofctrl|DBG|ofctrl_add_flow flow: cookie=0,
table_id=0, priority=100, in_port=37,
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00552|ofctrl|DBG|ofctrl_add_flow flow: cookie=0,
table_id=0, priority=100, in_port=1,
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
…
After some run iterations:
2022-02-03T16:09:50.792Z|00923|inc_proc_eng|DBG|node: ct_zones, recompute
(triggered)
2022-02-03T16:09:50.792Z|00924|main|DBG|assigning ct zone 2 to
'5601bacd-d246-42f8-aa69-c196f74ca57c_snat'
2022-02-03T16:09:50.792Z|00925|main|DBG|assigning ct zone 3 to
'96b6fa7d-d974-4260-a58e-c78342fd097d_snat'
2022-02-03T16:09:50.792Z|00926|main|DBG|assigning ct zone 5 to
'a8e55770-885f-464e-bfa9-9c80eca0ce62_snat'
2022-02-03T16:09:50.792Z|00927|main|DBG|assigning ct zone 6 to
'2969a6a5-012d-4504-83bf-be5192c4af87_dnat'
2022-02-03T16:09:50.792Z|00928|main|DBG|assigning ct zone 8 to
'5ae22181-33a5-4596-8143-e5e396fbc17e_dnat'
2022-02-03T16:09:50.792Z|00929|main|DBG|assigning ct zone 9 to
'03175d56-14c2-4ea9-b671-e427fe6b62e6_dnat'
2022-02-03T16:09:50.792Z|00930|main|DBG|assigning ct zone 12 to
'11ecffe1-c6f9-4373-948a-fff26113af73_snat'
2022-02-03T16:09:50.792Z|00931|main|DBG|assigning ct zone 13 to
'd8cd2a0c-bd9e-4472-920e-5443e2a94429_dnat'
2022-02-03T16:09:50.792Z|00932|main|DBG|assigning ct zone 14 to
'0408d0d9-e7d9-49f7-a0f2-32d611b56686_snat'
2022-02-03T16:09:50.792Z|00933|main|DBG|assigning ct zone 15 to
'791f9a0b-81e4-4a9e-8f38-7d56332b6237_snat'
2022-02-03T16:09:50.792Z|00934|main|DBG|assigning ct zone 17 to
'4e59bb4f-1ec6-4375-bb1c-63e03d1133ce_snat'
2022-02-03T16:09:50.792Z|00935|main|DBG|assigning ct zone 18 to
'3210cc4f-d055-4375-8425-aef0bd4db5fc_snat'
2022-02-03T16:09:50.792Z|00936|main|DBG|assigning ct zone 20 to
'cfcb7e97-6107-435e-8247-771314968be9_dnat'
2022-02-03T16:09:50.792Z|00937|main|DBG|assigning ct zone 21 to
'57e00565-1250-4fa7-9c11-a7c1c602bf32_snat'
2022-02-03T16:09:50.792Z|00938|main|DBG|assigning ct zone 23 to
'aba29611-9a7c-497d-9dbb-2d18314d517b_snat'
2022-02-03T16:09:50.792Z|00939|main|DBG|assigning ct zone 24 to
'2486c8ff-9eb4-4326-8601-a2e93343164e_dnat'
2022-02-03T16:09:50.792Z|00940|main|DBG|assigning ct zone 26 to
'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat'
...
In my understanding the reason for this could be an empty binding_lports here
[0] and local_datapaths here [1]
So, this triggers zones removal and on next iteration the required zones are
created back.
What do you think?
0: https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L651
1: https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L658
Regards,
Vladislav Odintsov
> On 3 Feb 2022, at 14:19, Vladislav Odintsov <[email protected]> wrote:
>
> Hi Ilya,
>
> thanks for detailed answer.
> I’ve tried to just "soft-restart" ovn-controller and it seems that I found
> the problem, that could block OVS.
>
> What I’ve done:
> 1. ovs-appctl -t ovs-vswitchd vlog/set file:vconn:dbg
> 2. ovn-appctl -t ovn-controller exit --restart
> 3. systemctl restart ovn-controller
>
> In OVS log I saw a lot of flow-mod lines, and CT_FLUSH_ZONE:
>
> 2022-02-03T10:43:18.956Z|01574|vconn|DBG|unix#264996: sent (Success):
> OFPT_HELLO (OF1.5) (xid=0x40a6b):
> version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
> 2022-02-03T10:43:18.956Z|01575|vconn|DBG|unix#264996: received: OFPT_HELLO
> (OF1.5) (xid=0x1):
> version bitmap: 0x06
> 2022-02-03T10:43:18.956Z|01576|vconn|DBG|unix#264996: negotiated OpenFlow
> version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
> 2022-02-03T10:43:18.956Z|01577|vconn|DBG|unix#264996: received:
> OFPT_GET_CONFIG_REQUEST (OF1.5) (xid=0x2):
> 2022-02-03T10:43:18.956Z|01578|vconn|DBG|unix#264996: sent (Success):
> OFPT_GET_CONFIG_REPLY (OF1.5) (xid=0x2): frags=normal miss_send_len=0
> 2022-02-03T10:43:18.956Z|01579|vconn|DBG|unix#264996: received:
> NXT_SET_PACKET_IN_FORMAT (OF1.5) (xid=0x3): format=nxt_packet_in2
> 2022-02-03T10:43:18.956Z|01580|vconn|DBG|unix#264996: received:
> OFPT_SET_CONFIG (OF1.5) (xid=0x4): frags=normal miss_send_len=65535
> 2022-02-03T10:43:19.123Z|01581|vconn|DBG|unix#264997: sent (Success):
> OFPT_HELLO (OF1.5) (xid=0x40a6c):
> version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
> 2022-02-03T10:43:19.123Z|01582|vconn|DBG|unix#264997: received: OFPT_HELLO
> (OF1.5) (xid=0x5):
> version bitmap: 0x06
> 2022-02-03T10:43:19.123Z|01583|vconn|DBG|unix#264997: negotiated OpenFlow
> version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
> 2022-02-03T10:43:19.123Z|01584|vconn|DBG|unix#264998: sent (Success):
> OFPT_HELLO (OF1.5) (xid=0x40a6d):
> version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
> 2022-02-03T10:43:19.123Z|01585|vconn|DBG|unix#264998: received: OFPT_HELLO
> (OF1.5) (xid=0x6):
> version bitmap: 0x06
> 2022-02-03T10:43:19.123Z|01586|vconn|DBG|unix#264998: negotiated OpenFlow
> version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
> 2022-02-03T10:43:19.124Z|01587|vconn|DBG|unix#264998: received:
> NXT_TLV_TABLE_REQUEST (OF1.5) (xid=0x7):
> 2022-02-03T10:43:19.124Z|01588|vconn|DBG|unix#264998: sent (Success):
> NXT_TLV_TABLE_REPLY (OF1.5) (xid=0x7):
> max option space=256 max fields=64
> allocated option space=4
>
> mapping table:
> class type length match field
> ------ ---- ------ --------------
> 0x102 0x80 4 tun_metadata0
> 2022-02-03T10:43:19.124Z|01589|vconn|DBG|unix#264998: received: OFPT_FLOW_MOD
> (OF1.5) (xid=0x8): DEL table:255 priority=0 actions=drop
> 2022-02-03T10:43:19.126Z|01590|vconn|DBG|unix#264998: received:
> OFPT_GROUP_MOD (OF1.5) (xid=0x9):
> DEL group_id=4294967292,type=all
> 2022-02-03T10:43:19.126Z|01591|vconn|DBG|unix#264998: received:
> OFPT_METER_MOD (OF1.5) (xid=0xa): DEL meter=all bands=
> 2022-02-03T10:43:19.126Z|01592|vconn|DBG|unix#264998: received:
> OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb):
> bundle_id=0 type=OPEN_REQUEST flags=atomic ordered
> 2022-02-03T10:43:19.126Z|01593|vconn|DBG|unix#264998: sent (Success):
> OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb):
> bundle_id=0 type=OPEN_REPLY flags=0
> 2022-02-03T10:43:19.126Z|01594|vconn|DBG|unix#264998: received:
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xc):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0xc): ADD priority=100,in_port=98
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01595|vconn|DBG|unix#264998: received:
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xd):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0xd): ADD priority=100,in_port=96
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01596|vconn|DBG|unix#264998: received:
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xe):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0xe): ADD priority=100,in_port=7
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01597|vconn|DBG|unix#264998: received:
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xf):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0xf): ADD priority=100,in_port=6
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01598|vconn|DBG|unix#264998: received:
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x10):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0x10): ADD priority=100,in_port=55
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01599|vconn|DBG|unix#264998: received:
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x11):
> bundle_id=0 flags=atomic ordered
> ….
>
>
> 2022-02-03T10:43:26.293Z|01758|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6b): zone_id=52
> 2022-02-03T10:43:26.424Z|01759|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6c): zone_id=55
> 2022-02-03T10:43:26.554Z|01760|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6d): zone_id=54
> 2022-02-03T10:43:26.685Z|01761|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6e): zone_id=58
> 2022-02-03T10:43:26.816Z|01762|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6f): zone_id=57
> 2022-02-03T10:43:26.946Z|01763|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x70): zone_id=56
> 2022-02-03T10:43:27.031Z|00013|ovs_rcu(urcu2)|WARN|blocked 1000 ms waiting
> for main to quiesce
> 2022-02-03T10:43:27.076Z|01764|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x71): zone_id=59
> 2022-02-03T10:43:27.206Z|01765|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x72): zone_id=61
> 2022-02-03T10:43:27.336Z|01766|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x73): zone_id=60
> 2022-02-03T10:43:27.465Z|01767|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x74): zone_id=62
> 2022-02-03T10:43:27.594Z|01768|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x75): zone_id=66
> 2022-02-03T10:43:27.726Z|01769|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x76): zone_id=65
> 2022-02-03T10:43:27.854Z|01770|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x77): zone_id=64
> 2022-02-03T10:43:27.984Z|01771|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x78): zone_id=63
> 2022-02-03T10:43:28.031Z|00014|ovs_rcu(urcu2)|WARN|blocked 2000 ms waiting
> for main to quiesce
> 2022-02-03T10:43:28.111Z|01772|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x79): zone_id=67
> 2022-02-03T10:43:28.238Z|01773|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7a): zone_id=68
> 2022-02-03T10:43:28.371Z|01774|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7b): zone_id=69
> 2022-02-03T10:43:28.496Z|01775|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7c): zone_id=72
> 2022-02-03T10:43:28.622Z|01776|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7d): zone_id=71
> 2022-02-03T10:43:28.746Z|01777|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7e): zone_id=70
> 2022-02-03T10:43:28.871Z|01778|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7f): zone_id=75
> 2022-02-03T10:43:29.000Z|01779|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x80): zone_id=74
> 2022-02-03T10:43:29.123Z|01780|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x81): zone_id=73
> 2022-02-03T10:43:29.246Z|01781|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x82): zone_id=76
> 2022-02-03T10:43:29.371Z|01782|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x83): zone_id=78
> 2022-02-03T10:43:29.495Z|01783|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x84): zone_id=77
> 2022-02-03T10:43:29.618Z|01784|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x85): zone_id=81
> 2022-02-03T10:43:29.746Z|01785|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x86): zone_id=80
> 2022-02-03T10:43:29.867Z|01786|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x87): zone_id=79
> 2022-02-03T10:43:29.988Z|01787|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x88): zone_id=84
> 2022-02-03T10:43:30.031Z|00015|ovs_rcu(urcu2)|WARN|blocked 4000 ms waiting
> for main to quiesce
> 2022-02-03T10:43:30.108Z|01788|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x89): zone_id=83
> 2022-02-03T10:43:30.229Z|01789|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8a): zone_id=82
> 2022-02-03T10:43:30.349Z|01790|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8b): zone_id=86
> 2022-02-03T10:43:30.470Z|01791|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8c): zone_id=85
> 2022-02-03T10:43:30.591Z|01792|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8d): zone_id=89
> 2022-02-03T10:43:30.712Z|01793|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8e): zone_id=88
> 2022-02-03T10:43:30.833Z|01794|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8f): zone_id=87
> 2022-02-03T10:43:30.953Z|01795|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x90): zone_id=90
> 2022-02-03T10:43:31.073Z|01796|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x91): zone_id=91
> 2022-02-03T10:43:31.193Z|01797|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x92): zone_id=94
> 2022-02-03T10:43:31.313Z|01798|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x93): zone_id=93
> 2022-02-03T10:43:31.433Z|01799|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x94): zone_id=92
> 2022-02-03T10:43:31.554Z|01800|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x95): zone_id=95
> 2022-02-03T10:43:31.674Z|01801|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x96): zone_id=96
> 2022-02-03T10:43:31.795Z|01802|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x97): zone_id=97
> 2022-02-03T10:43:31.915Z|01803|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x98): zone_id=99
> 2022-02-03T10:43:32.035Z|01804|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x99): zone_id=98
> 2022-02-03T10:43:32.156Z|01805|vconn|DBG|unix#264998: received:
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x9a): zone_id=100
> 2022-02-03T10:43:32.275Z|01806|connmgr|INFO|br-int<->unix#264998: 37
> flow_mods 13 s ago (36 adds, 1 deletes)
> 2022-02-03T10:43:32.276Z|01807|timeval|WARN|Unreasonably long 6245ms poll
> interval (826ms user, 5403ms system)
> 2022-02-03T10:43:32.276Z|01808|timeval|WARN|faults: 2 minor, 0 major
> 2022-02-03T10:43:32.276Z|01809|timeval|WARN|context switches: 0 voluntary, 21
> involuntary
> 2022-02-03T10:43:32.276Z|01810|coverage|INFO|Skipping details of duplicate
> event coverage for hash=4321e6a4
> 2022-02-03T10:43:32.276Z|01811|poll_loop|INFO|wakeup due to [POLLIN] on fd 15
> (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-03T10:43:32.276Z|01812|poll_loop|INFO|wakeup due to [POLLIN] on fd
> 228 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
> usage)
>
>
> Please note the time for each NXT_CT_FLUSH_ZONE line took. It flushes just ~8
> zones per second. Maybe, it’s due to quite big conntrack size:
> I’ve noticed that on the system there were a huge amount of the conntrack
> records in host’s zone=0.
> There were connections generated from overlay tunnel packets, that were not
> configured in iptables to be NOTRACK - there was a configuration management
> issue.
> After I’ve applied iptables rules to not to track geneve, conntrack
> connection count decreased from ~60k to 3k and CT_ZONE flush became much
> faster (about 100 zones per 1 second).
>
> Thanks for pointing to another to openflows problem!
>
> But my question is why ovn-controller flushes conntrack zones and modifies
> OFs even if nothing has changed? I mean, a "soft-restart" condition (when
> ovn-controller was stopped through unixctl socket with "--restart" option.
> Should I start ovn-controller somehow special?
> When I trigger "soft-restart", I expect to have a minimal impact on the
> datapath.
>
> Regards,
> Vladislav Odintsov
>
>> On 2 Feb 2022, at 18:20, Ilya Maximets <[email protected]> wrote:
>>
>> On 2/1/22 23:15, Vladislav Odintsov wrote:
>>> Hi OVN and OVS folks,
>>>
>>> during OVN upgrade from 20.06 to 21.12 version I’ve faced with a quite long
>>> datapath downtime (48-50 seconds).
>>> I suppose this downtime was triggered by ovn-controller installing new
>>> openflow rules to ovs.
>>> There are relatively not so many flows (13-14k).
>>> openvswitch version is 2.13.4.
>>> ovn-openflow-probe-interval is configured to 30.
>>>
>>> The upgrade procedure was:
>>> 1. Set to ovs external_ids an option to force matching ovn-northd internal
>>> version (ovn-match-northd-version=true)
>>> 2. Upgrade ovn-controller (to first get ovn-controller, which supports
>>> version match between controller and northd and not to break datapath after
>>> upgrading ovn-northd)
>>> 3. Upgrade ovn-northd cluster
>>>
>>> Network outage timeframe matches to ovs-vswitchd WARN logs. Please see next:
>>>
>>> 2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37
>>> flow_mods 13 s ago (36 adds, 1 deletes)
>>> 2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll
>>> interval (0ms user, 12833ms system)
>>> 2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major
>>> 2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary,
>>> 18 involuntary
>>> 2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over
>>> last: 5 seconds, last minute, last hour, hash=9105faf6:
>>> 2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand
>>> 0.0/sec 0.000/sec 0.0000/sec total: 1129
>>> 2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink
>>> 0.0/sec 11.933/sec 0.1989/sec total: 2363
>>> 2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand
>>> 0.0/sec 0.833/sec 0.8811/sec total: 11218023
>>> 2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink
>>> 8.2/sec 11.950/sec 1.0736/sec total: 11217475
>>> 2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add
>>> 0.0/sec 0.000/sec 0.0000/sec total: 229
>>> 2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del
>>> 0.0/sec 0.000/sec 0.0000/sec total: 144
>>> 2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush
>>> 0.0/sec 0.000/sec 0.0000/sec total: 2
>>> 2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get
>>> 0.0/sec 0.000/sec 0.0000/sec total: 23
>>> 2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put
>>> 83.0/sec 97.900/sec 80.8306/sec total: 849051748
>>> 2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del
>>> 147.8/sec 91.733/sec 78.2797/sec total: 810797137
>>> 2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute
>>> 66.6/sec 83.767/sec 87.8397/sec total: 958697586
>>> 2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help
>>> 0.0/sec 4.250/sec 4.9892/sec total: 70713908
>>> 2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract
>>> 169.0/sec 247.017/sec 333.1194/sec total: 7990486004
>>> 2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc
>>> 0.0/sec 0.183/sec 0.7983/sec total: 5871853
>>> 2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological
>>> 0.0/sec 0.000/sec 0.0000/sec total: 78
>>> 2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand
>>> 0.0/sec 0.000/sec 0.0000/sec total: 11
>>> 2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological
>>> 0.0/sec 0.633/sec 0.9331/sec total: 16638117
>>> 2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand
>>> 0.0/sec 114.867/sec 139.8006/sec total: 2184263485
>>> 2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned
>>> 0.0/sec 0.000/sec 0.0003/sec total: 299
>>> 2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired
>>> 0.0/sec 0.000/sec 0.0006/sec total: 117
>>> 2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved
>>> 0.0/sec 0.000/sec 0.0000/sec total: 87
>>> 2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats
>>> 0.0/sec 36.850/sec 43.3306/sec total: 690673778
>>> 2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged
>>> 0.0/sec 0.100/sec 0.1614/sec total: 2705864
>>> 2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete
>>> 0.0/sec 0.183/sec 0.2239/sec total: 3688261
>>> 2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success
>>> 0.0/sec 0.167/sec 0.1953/sec total: 3236818
>>> 2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node
>>> 3930.6/sec 3941.350/sec 5012.2572/sec total: 115959738162
>>> 2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout
>>> 159.4/sec 111.417/sec 103.4078/sec total: 1181551227
>>> 2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued
>>> 0.0/sec 0.050/sec 0.0817/sec total: 1279558
>>> 2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent
>>> 0.0/sec 0.050/sec 0.0817/sec total: 1279558
>>> 2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change
>>> 2205.2/sec 2502.783/sec 2874.6258/sec total: 57335390998
>>> 2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open
>>> 0.0/sec 0.000/sec 0.0000/sec total: 5
>>> 2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open
>>> 0.0/sec 0.000/sec 0.0000/sec total: 1
>>> 2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received
>>> 0.0/sec 0.000/sec 0.0000/sec total: 189
>>> 2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied
>>> 0.0/sec 0.000/sec 0.0000/sec total: 189
>>> 2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc
>>> 7935.8/sec 15234.350/sec 18372.4347/sec total: 345308271208
>>> 2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received
>>> 0.0/sec 0.083/sec 0.3556/sec total: 2018930
>>> 2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent
>>> 0.0/sec 0.067/sec 0.0983/sec total: 1552067
>>> 2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing
>>> 0.0/sec 0.000/sec 0.0000/sec total: 623
>>> 2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex
>>> 0.0/sec 0.100/sec 0.1636/sec total: 2713928
>>> 2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr
>>> 0.0/sec 0.000/sec 0.0000/sec total: 2
>>> 2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool
>>> 0.0/sec 0.000/sec 0.0000/sec total: 1073
>>> 2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool
>>> 0.0/sec 0.000/sec 0.0000/sec total: 225
>>> 2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed
>>> 0.0/sec 0.050/sec 0.0492/sec total: 819614
>>> 2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received
>>> 346.6/sec 472.133/sec 575.4597/sec total: 11288088779
>>> 2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo
>>> 18.4/sec 113.250/sec 175.1189/sec total: 3935643810
>>> 2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent
>>> 308.4/sec 378.133/sec 380.4267/sec total: 4753114969
>>> 2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush
>>> 0.0/sec 0.000/sec 0.0000/sec total: 1
>>> 2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow
>>> 0.0/sec 0.067/sec 0.3389/sec total: 1746449
>>> 2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port
>>> 0.0/sec 0.000/sec 0.0000/sec total: 814
>>> 2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow
>>> 0.0/sec 0.000/sec 0.0000/sec total: 282520
>>> 2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure
>>> 0.0/sec 0.017/sec 0.0325/sec total: 539904
>>> 2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled
>>> 0.0/sec 0.000/sec 0.0000/sec total: 27
>>> 2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table
>>> 0.0/sec 0.000/sec 0.0106/sec total: 10876
>>> 2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning
>>> 0.0/sec 0.000/sec 0.0006/sec total: 312
>>> 2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow
>>> 0.0/sec 0.000/sec 0.0000/sec total: 39204
>>> 2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow
>>> 0.0/sec 0.000/sec 0.0000/sec total: 7177
>>> 2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall
>>> 0.4/sec 1.833/sec 2.2664/sec total: 34762798
>>> 2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention
>>> 0.0/sec 0.000/sec 0.0000/sec total: 9
>>> 2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace
>>> 0.2/sec 0.050/sec 0.0447/sec total: 7927221
>>> 2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow
>>> 0.0/sec 0.000/sec 0.0000/sec total: 1118
>>> 2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions
>>> 123.6/sec 273.183/sec 233.8275/sec total: 2110727223
>>> 2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure
>>> 0.0/sec 0.017/sec 0.0325/sec total: 539902
>>> 2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit
>>> 2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in
>>> last 1058229 seconds (most recently, 1058228 seconds ago) due to excessive
>>> rate
>>> 2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU
>>> usage)
>>> 2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll
>>> interval (0ms user, 10643ms system)
>>> 2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major
>>> 2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary,
>>> 4 involuntary
>>> 2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate
>>> event coverage for hash=9105faf6
>>> 2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
>>> 2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll
>>> interval (0ms user, 10115ms system)
>>> 2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major
>>> 2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary,
>>> 27 involuntary
>>> 2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate
>>> event coverage for hash=9105faf6
>>> 2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd
>>> 235 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection
>>> dropped (Broken pipe)
>>> 2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting
>>> for main to quiesce
>>> 2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll
>>> interval (0ms user, 17720ms system)
>>> 2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major
>>> 2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary,
>>> 24 involuntary
>>> 2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
>>> 2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU
>>> usage)
>>> 2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589
>>> flow_mods 18 s ago (13509 adds, 80 modifications)
>>> 2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response
>>> to inactivity probe after 60 seconds, disconnecting
>>>
>>>
>>> In ovn-controller log:
>>>
>>> 2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt:
>>> connecting to switch
>>> 2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>>> connecting...
>>> 2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt:
>>> connecting to switch
>>> 2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>>> connecting...
>>> 2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>>> connected
>>> 2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>>> connected
>>> 2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force
>>> recompute.
>>> 2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force
>>> recompute next time.
>>> 2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642:
>>> connection closed by peer
>>> 2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting...
>>> 2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected
>>> 2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity
>>> (last active 30001 ms ago): trimming cache
>>> 2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport>
>>> ovn-installed in OVS
>>> 2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in
>>> Southbound
>>> <there are 80 port bindings...>
>>> 2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN>
>>> ovn-installed in OVS
>>> 2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in
>>> Southbound
>>> 2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity
>>> (last active 30004 ms ago): trimming cache
>>>
>>>
>>> I’ve got some questions:
>>> 1. What OVS is doing while `waiting for main to quiesce` ? I noticed this
>>> message appeares each time downtime occurs.
>>
>> 'Unreasonably long 17764ms poll interval (0ms user, 17720ms system)'
>> is very concerning. It means that the main thread spent almost 18
>> seconds inside the kernel executing some syscall, maybe several syscalls.
>> Main thread is not responsible for flow installation to the kernel, IIRC,
>> so flow updates should not be a case here. You need to figure out what
>> are these syscalls and why they are taking so long to execute inside
>> the kernel.
>>
>> Maybe you have some suspicious kernel logs at the same time?
>> You may also try 'perf' or some other tracing tools to figure out
>> what is going on.
>>
>>> 2. Is there any hint to minimize such downtime while reinstalling OF?
>>
>> Installation of rules is an unlikely root cause here.
>>
>>> 3. Could there be a situation where ovn-northd after upgrade still
>>> generating new logical flows, ovn-controller constantly gets updates from
>>> SB and triggers update to OF in a loop while doesn’t finish getting final
>>> lflows "view"? So, could it start installing new OF while having in memory
>>> lflows which are not consistent at that moment?
>>
>> northd generates all updates in a single transaction, so SB is updated
>> atomically. SB will send this update as a single update message, so this
>> is also atomic. ovn-controller will install all OF rules as an OF bundle,
>> so that is also atomic.
>>
>>> 4. What does `br-int<->unix#272675: connection dropped (Broken pipe)` error
>>> means? What type of connection is broken? Is it normal?
>>
>> This looks like a controller connection, i.e. connection between
>> ovn-controller with ovs-vswitchd. Doesn't look very good, but
>> disconnection seems to be caused by inactivity due to overload
>> on one of the sides of the connection.
>>
>>> 5. Do you see anything else abnormal in the logs?
>>
>> '0ms user, 17720ms system' is the most abnormal, AFAICT.
>>
>> Best regards, Ilya Maximets.
>>
>> _______________________________________________
>> dev mailing list
>> [email protected] <mailto:[email protected]>
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> <https://mail.openvswitch.org/mailman/listinfo/ovs-dev>
> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev