Hi Numan, Dumitru, Ilya,

I believe I've managed to deal with some of these problems and have some conclusions. Please see inline.
> On 14 Apr 2023, at 16:26, Vladislav Odintsov <odiv...@gmail.com> wrote:
>
> And as a follow-up, I sometimes see the following error message in the
> ovn-controller log, even though a much higher inactivity probe limit (in
> seconds) is configured for the chassis:

ovn-controller establishes three OpenFlow connections to the local ovs-vswitchd, for the following purposes:
1. OpenFlow rules management (the probe interval is configured through the ovn-openflow-probe-interval external_ids setting [1]; default: disabled [2]);
2. OpenFlow features collection (hardcoded probe interval of 5s [3]);
3. the pinctrl thread (hardcoded probe interval of 5s [4]).

> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
> inactivity probe after 8 seconds, disconnecting
>
> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to
> inactivity probe after 10 seconds, disconnecting

These messages are most likely related to the connection for features collection or to the pinctrl thread, so ovn-openflow-probe-interval does not matter here; example commands for adjusting it follow below the quoted output.

> server with first log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "30"
>
> server with second log message:
> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> "60"
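For reference, the probe interval for connection (1) can be changed or removed with standard ovs-vsctl commands (the value is in seconds; 60 here is just an illustrative value):

# ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=60
# ovs-vsctl remove open . external_ids ovn-openflow-probe-interval

The 5s intervals for connections (2) and (3) are hardcoded [3][4], so they cannot be changed through configuration.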
>> On 30 Mar 2023, at 23:19, Vladislav Odintsov <odiv...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I'm facing a false positive failover in the event of the OpenFlow connection
>> getting dropped after an inactivity probe timeout, due to the high load of
>> ovn-controller handling a huge amount of ovsdb updates.
>> I wonder whether it's a bug or whether I have to tune anything specific.
>> See ovn-controller logs:
>>
>> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: logical_flow_output,
>> recompute (failed handler for input port_groups) took 35774ms
>> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache inactivity
>> (last active 40006 ms ago): trimming cache
>> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms poll
>> interval (32947ms user, 6968ms system)
>> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
>> …..
>> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl:xxxx:6642: no response to
>> inactivity probe after 60.2 seconds, disconnecting
>> 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl:xxxx:6642: connection
>> dropped
>> 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force
>> recompute next time.
>> 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl:yyyy:6642: connecting...
>> 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl:yyyy:6642: connected
>> 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout at
>> lib/stream-ssl.c:838 (74% CPU usage)
>> 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout at
>> lib/stream-ssl.c:838 (74% CPU usage)
>> ….
>> 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout at
>> lib/stream-ssl.c:838 (101% CPU usage)
>> 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 54%
>> in last 682257.2 seconds, from 4435696 kB to 6824820 kB
>> 2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359
>> idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298
>> lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553
>> local_datapath_usage-KB:630
>> ofctrl_desired_flow_usage-KB:577052 ofctrl_installed_flow_usage-KB:409538
>> ofctrl_sb_flow_ref_usage-KB:273707
>> …
>> 2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt:
>> connection dropped (Broken pipe)
>> 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from this
>> chassis (sb_readonly=0)
>> 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in last
>> 1254725 seconds (most recently, 1254725 seconds ago) due to excessive rate
>> 2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown
>> interface cr-xxx
>> 2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from this
>> chassis (sb_readonly=0)
>> 2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing lport cr-zzz from this
>> chassis (sb_readonly=0)
>> …
>>
>> After some time these ports are claimed back:
>>
>> 2023-03-30T07:57:49.320Z|59137|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>> connecting...
>> 2023-03-30T07:57:49.320Z|59138|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>> connected
>> 2023-03-30T07:57:49.523Z|59139|poll_loop|INFO|wakeup due to 0-ms timeout at
>> controller/ovn-controller.c:4296 (99% CPU usage)
>> 2023-03-30T07:57:49.574Z|59140|poll_loop|INFO|wakeup due to 0-ms timeout at
>> controller/ovn-controller.c:4286 (99% CPU usage)
>> 2023-03-30T07:58:01.451Z|59141|poll_loop|INFO|wakeup due to [POLLIN] on fd
>> 22 (xxxx:34736<->yyyy:6642) at lib/stream-ssl.c:836 (99% CPU usage)
>> 2023-03-30T07:58:02.449Z|59142|binding|INFO|Claiming lport cr-xxx for this
>> chassis.
>> 2023-03-30T07:58:02.449Z|59143|binding|INFO|cr-xxx: Claiming xx:xx:xx:xx:xx
>> x.x.x.x/xx
>> 2023-03-30T07:58:02.453Z|59144|binding|INFO|Claiming lport cr-yyy for this
>> chassis.
>> 2023-03-30T07:58:02.453Z|59145|binding|INFO|cr-yyy: Claiming
>> yy:yy:yy:yy:yy:yy y.y.y.y/yy
>> …
>>
>> Previously I met a similar problem when the OpenFlow probe interval was
>> 30 seconds, and I set it to 60 and then to 120 seconds.
>> But it seems that OVS was configured with the default inactivity probe
>> timeout for OpenFlow (60 seconds), so the error above occurred. In OVS logs
>> it looks like this:
>>
>> 2023-03-30T07:57:28.632Z|61003|rconn|ERR|br-int<->unix#8: no response to
>> inactivity probe after 60 seconds, disconnecting

When ovn-controller loses the OpenFlow connection used for flow management, it considers OVS dead and releases its ha-chassis-group bindings. This logic was introduced with patch [5]. Since ovn-controller connects to the ovs-vswitchd OF socket over a unix socket, it seems useless to configure inactivity probing here, right? In commit [6] Han also disables OF inactivity probing by default. Moreover, it seems harmful to configure inactivity probing from OVS towards ovn-controller and vice versa.
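As a side note, the OVS-side probe value that produced the error above lives in the Controller table and can be inspected directly (inactivity_probe is in milliseconds; an empty value means the built-in default of 60 seconds [7]):

# ovs-vsctl --columns=target,inactivity_probe list controller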
In case ovn-controller is for some reason busy with some work and can't answer OVS within two minutes (OVS's default probe interval is 60s [7]: the probe is sent after 60s of idleness and the connection is dropped another 60s later), OVS destroys the OF connection with this ERR log:

ovs|18206|rconn|ERR|br-int<->unix#17574: no response to inactivity probe after 60 seconds, disconnecting

So, when ovn-controller resumes after a big chunk of work, it sees that the OF connection is closed and triggers the release of its chassisredirect port bindings. This leads to datapath downtime and unnecessary load on OVN_Southbound, ovn-controllers, etc.

To avoid such behaviour, a user may consider not configuring ovn-openflow-probe-interval AND disabling probing from the OVS side with the following command:

ovs-vsctl --inactivity-probe=0 set-controller br-int punix:/var/run/openvswitch/br-int.mgmt

For this setting to take effect, the OF connection must be re-established (for instance, by restarting ovn-controller); see the example sequence after the proposal below.

My proposal: if ovn-controller uses only a unix socket for the OF connection, it should be safe to disable OF probing from its code:

1. configure the controller for br-int to disable probing (ensure that the Controller record for br-int, or whatever bridge is configured, is created with inactivity probe = 0);
2. delete the ovn-openflow-probe-interval setting and note in the documentation that for OVN versions >= 20.09 it is recommended not to configure ovn-openflow-probe-interval. For versions < 20.09 the default interval is 5 seconds, so it is reasonable to set it to a higher value in the config on loaded hypervisors;
3. set the OpenFlow probe interval for the OVS features and pinctrl OF connections to 0 (disable probing).

What do you guys think about this? Please correct me if I have misunderstood something and the described changes can't be accepted.
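For completeness, here is how the workaround from above can be applied and verified end to end (a sketch; the systemd unit name used for the restart is an assumption and may differ between distributions):

# ovs-vsctl --inactivity-probe=0 set-controller br-int punix:/var/run/openvswitch/br-int.mgmt
# ovs-vsctl --columns=target,inactivity_probe list controller
(inactivity_probe should now report 0, i.e. disabled, for the punix target)
# systemctl restart ovn-controller
(the restart re-establishes the OF connections so the new setting is used)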
>> This node is a dedicated l3gateway which has the is-interconn=true setting
>> (interconnection case) and has a quite big OpenFlow table, around 2.7M flows.
>>
>> So, I've got some questions for OVN and OVS:
>> 1. I couldn't find in the ovn-controller code where the logic of handling an
>> OpenFlow connection drop lives. Is it a bug that the claimed cr-ports are
>> released after an OpenFlow connection drop? I guess, yes. Can you please
>> point me to the correct place in the code to start investigating?
>> 2. I see almost the same OpenFlow flow count on each l3gateway with the
>> interconnection role. Even l3gateways which have no allocated
>> cr-port-bindings have 2M+ OpenFlow flows. Is this normal operation, or is it
>> a place for possible optimisation? I mean, it seems that excess flows are
>> installed.

There are a lot of mac_binding records, but the question is why these flows still persist even if no datapaths are allocated to the node.

>> 3. How to handle long OVS OF updates, which block the inactivity probe for
>> the OF connection so that it gets dropped? From ovn-controller it's possible
>> to configure the inactivity probe; is it possible, and should I configure it,
>> from the ovs-vswitchd side?
>> 4. There is a warning on trying to release an unknown interface, which was
>> released right before. Is it a normal situation, or should it also be
>> investigated?

For points 2-4 I still haven't found the answer yet. If someone knows the answer, please share it.

>>
>> Thanks.
>>
>> Regards,
>> Vladislav Odintsov
>
>
> Regards,
> Vladislav Odintsov

1: https://github.com/ovn-org/ovn/blob/v23.03.0/controller/ofctrl.c#L421
2: https://github.com/ovn-org/ovn/blob/v23.03.0/controller/ovn-controller.c#L108
3: https://github.com/ovn-org/ovn/blob/v23.03.0/lib/features.c#L85
4: https://github.com/ovn-org/ovn/blob/v23.03.0/controller/pinctrl.c#L3479
5: https://patchwork.ozlabs.org/project/openvswitch/patch/20190204165031.31696-1-nusid...@redhat.com/
6: https://github.com/ovn-org/ovn/commit/b8af8549396e62d6523be18e104352e334825783
7: https://github.com/openvswitch/ovs/blob/v3.1.0/vswitchd/bridge.c#L3860

Regards,
Vladislav Odintsov