On Wed, May 17, 2023 at 4:44 AM Vladislav Odintsov <[email protected]> wrote:
>
>
>
> > On 16 May 2023, at 23:54, Numan Siddique <[email protected]> wrote:
> >
> > Hi Vladislav,
> >
> > Sorry for the late reply.
> >
> > PSB for few comments.
>
>
> Thanks for your reply!
> My answers are inline.
>
> >
> >
> >
> > On Tue, May 16, 2023 at 3:42 PM Vladislav Odintsov <[email protected]> wrote:
> >>
> >> Hi Numan, Dumitru, Ilya,
> >>
> >> I guess I’ve managed to deal with some of these problems and have some 
> >> conclusions.
> >> Please see inline.
> >>
> >>> On 14 Apr 2023, at 16:26, Vladislav Odintsov <[email protected]> wrote:
> >>>
> >>> And as a follow-up: I sometimes see the following error messages in the 
> >>> ovn-controller log, though a much higher inactivity probe value (in 
> >>> seconds) is configured for the chassis:
> >>>
> >>
> >> ovn-controller establishes 3 OpenFlow connections to the local 
> >> ovs-vswitchd for the following purposes:
> >> 1. OpenFlow rules management (the OpenFlow probe interval is configured 
> >> through the ovn-openflow-probe-interval external_ids setting [1]; default — 
> >> disabled [2]);
> >> 2. OpenFlow features collection (hardcoded probe interval of 5s [3]);
> >> 3. the pinctrl thread (hardcoded probe interval of 5s [4]).
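> >>
> >> For reference, a minimal sketch of tuning connection 1 (the interval is 
> >> in seconds; 0 disables probing):
> >>
> >> # set a 30s probe interval for the flow-management connection
> >> ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=30
> >> # read the configured value back
> >> ovs-vsctl get open . external_ids:ovn-openflow-probe-interval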
> >>
> >>> ovs|04426|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> >>> inactivity probe after 8 seconds, disconnecting
> >>>
> >>> ovs|26560|rconn|ERR|unix:/run/openvswitch/br-int.mgmt: no response to 
> >>> inactivity probe after 10 seconds, disconnecting
> >>>
> >>
> >> These messages are most likely related to the connection for feature 
> >> collection or to the pinctrl thread, so ovn-openflow-probe-interval does 
> >> not matter here.
> >>
> >>>
> >>> server with first log message:
> >>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> >>> "30"
> >>>
> >>> server with second log message:
> >>> # ovs-vsctl get open . external_ids:ovn-openflow-probe-interval
> >>> "60"
> >>>
> >>>
> >>>
> >>>> On 30 Mar 2023, at 23:19, Vladislav Odintsov <[email protected]> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I’m facing a false-positive failover when the OpenFlow connection gets 
> >>>> dropped after an inactivity probe timeout because ovn-controller is 
> >>>> under high load handling a huge amount of OVSDB updates.
> >>>> I wonder whether it’s a bug or whether I have to tune anything specific. 
> >>>> See the ovn-controller logs:
> >>>>
> >>>> 2023-03-30T07:57:24.469Z|58440|inc_proc_eng|INFO|node: 
> >>>> logical_flow_output, recompute (failed handler for input port_groups) 
> >>>> took 35774ms
> >>>> 2023-03-30T07:57:28.521Z|58441|lflow_cache|INFO|Detected cache 
> >>>> inactivity (last active 40006 ms ago): trimming cache
> >>>> 2023-03-30T07:57:28.528Z|58442|timeval|WARN|Unreasonably long 40084ms 
> >>>> poll interval (32947ms user, 6968ms system)
> >>>> 2023-03-30T07:57:28.528Z|58443|timeval|WARN|faults: 43468 minor, 0 major
> >>>> …..
> >>>> 2023-03-30T07:57:48.784Z|58496|reconnect|ERR|ssl:xxxx:6642: no response 
> >>>> to inactivity probe after 60.2 seconds, disconnecting
> >>>> 2023-03-30T07:57:48.784Z|58497|reconnect|INFO|ssl:xxxx:6642: connection 
> >>>> dropped
> >>>> 2023-03-30T07:57:48.905Z|58498|main|INFO|OVNSB commit failed, force 
> >>>> recompute next time.
> >>>> 2023-03-30T07:57:49.786Z|58499|reconnect|INFO|ssl:yyyy:6642: 
> >>>> connecting...
> >>>> 2023-03-30T07:57:49.924Z|58500|reconnect|INFO|ssl:yyyy:6642: connected
> >>>> 2023-03-30T07:57:29.781Z|58501|poll_loop|INFO|wakeup due to 0-ms timeout 
> >>>> at lib/stream-ssl.c:838 (74% CPU usage)
> >>>> 2023-03-30T07:57:29.836Z|58502|poll_loop|INFO|wakeup due to 0-ms timeout 
> >>>> at lib/stream-ssl.c:838 (74% CPU usage)
> >>>> ….
> >>>> 2023-03-30T07:57:05.976Z|58545|poll_loop|INFO|wakeup due to 0-ms timeout 
> >>>> at lib/stream-ssl.c:838 (101% CPU usage)
> >>>> 2023-03-30T07:57:07.002Z|58546|memory|INFO|peak resident set size grew 
> >>>> 54% in last 682257.2 seconds, from 4435696 kB to 6824820 kB
> >>>> 2023-03-30T07:57:07.002Z|58547|memory|INFO|idl-cells-OVN_Southbound:11491359
> >>>>  idl-cells-Open_vSwitch:14416 lflow-cache-entries-cache-expr:530298 
> >>>> lflow-cache-entries-cache-matches:31770 lflow-cache-size-KB:769553 
> >>>> local_datapath_usage-KB:630
> >>>> ofctrl_desired_flow_usage-KB:577052 
> >>>> ofctrl_installed_flow_usage-KB:409538 ofctrl_sb_flow_ref_usage-KB:273707
> >>>> …
> >>>> 2023-03-30T07:57:31.667Z|58552|rconn|WARN|unix:/run/openvswitch/br-int.mgmt:
> >>>>  connection dropped (Broken pipe)
> >>>> 2023-03-30T07:57:31.739Z|58553|binding|INFO|Releasing lport cr-xxx from 
> >>>> this chassis (sb_readonly=0)
> >>>> 2023-03-30T07:57:31.739Z|58554|if_status|WARN|Dropped 1 log messages in 
> >>>> last 1254725 seconds (most recently, 1254725 seconds ago) due to 
> >>>> excessive rate
> >>>> 2023-03-30T07:57:31.739Z|58555|if_status|WARN|Trying to release unknown 
> >>>> interface cr-xxx
> >>>> 2023-03-30T07:57:31.743Z|58556|binding|INFO|Releasing lport cr-yyy from 
> >>>> this chassis (sb_readonly=0)
> >>>> 2023-03-30T07:57:31.743Z|58557|binding|INFO|Releasing lport cr-zzz from 
> >>>> this chassis (sb_readonly=0)
> >>>> …
> >>>>
> >>>> After some time these ports are claimed back:
> >>>>
> >>>> 2023-03-30T07:57:49.320Z|59137|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
> >>>>  connecting...
> >>>> 2023-03-30T07:57:49.320Z|59138|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
> >>>>  connected
> >>>> 2023-03-30T07:57:49.523Z|59139|poll_loop|INFO|wakeup due to 0-ms timeout 
> >>>> at controller/ovn-controller.c:4296 (99% CPU usage)
> >>>> 2023-03-30T07:57:49.574Z|59140|poll_loop|INFO|wakeup due to 0-ms timeout 
> >>>> at controller/ovn-controller.c:4286 (99% CPU usage)
> >>>> 2023-03-30T07:58:01.451Z|59141|poll_loop|INFO|wakeup due to [POLLIN] on 
> >>>> fd 22 (xxxx:34736<->yyyy:6642) at lib/stream-ssl.c:836 (99% CPU usage)
> >>>> 2023-03-30T07:58:02.449Z|59142|binding|INFO|Claiming lport cr-xxx for 
> >>>> this chassis.
> >>>> 2023-03-30T07:58:02.449Z|59143|binding|INFO|cr-xxx: Claiming 
> >>>> xx:xx:xx:xx:xx x.x.x.x/xx
> >>>> 2023-03-30T07:58:02.453Z|59144|binding|INFO|Claiming lport cr-yyy for 
> >>>> this chassis.
> >>>> 2023-03-30T07:58:02.453Z|59145|binding|INFO|cr-yyy: Claiming 
> >>>> yy:yy:yy:yy:yy:yy y.y.y.y/yy
> >>>> …
> >>>>
> >>>> Previously I ran into a similar problem when the OpenFlow probe interval 
> >>>> was 30 seconds, and I raised it to 60 and then to 120 seconds.
> >>>> But it seems that OVS was still using its default OpenFlow inactivity 
> >>>> probe timeout of 60 seconds, so the error above occurred. In the OVS 
> >>>> logs it looks like this:
> >>>>
> >>>> 2023-03-30T07:57:28.632Z|61003|rconn|ERR|br-int<->unix#8: no response to 
> >>>> inactivity probe after 60 seconds, disconnecting
> >>
> >> When ovn-controller loses the OpenFlow connection used for flow 
> >> management, it considers OVS dead and releases its ha-chassis-group 
> >> bindings. This logic was introduced with the patch [5].
> >>
> >> ovn-controller uses a local unix socket to connect to the ovs-vswitchd OF 
> >> socket, so it seems pointless to configure inactivity probing, right?
> >> In this commit message [6] Han also disables OF inactivity probing by 
> >> default.
> >>
> >> Moreover, it seems harmful to configure inactivity probing from OVS 
> >> towards ovn-controller, and vice versa.
> >> If ovn-controller is for some reason busy with work and can’t answer OVS 
> >> in time (the default probe interval is 60s [7]), OVS destroys the OF 
> >> connection with this ERR log:
> >> ovs|18206|rconn|ERR|br-int<->unix#17574: no response to inactivity probe 
> >> after 60 seconds, disconnecting
> >>
> >> So, when ovn-controller resumes after a long stretch of work, it sees that 
> >> the OF connection is closed and triggers the release of its 
> >> chassisredirect port bindings.
> >> This leads to datapath downtime and unnecessary load on OVN_Southbound, 
> >> ovn-controllers, etc.
> >>
> >> To avoid such behaviour, a user may consider not configuring 
> >> ovn-openflow-probe-interval AND disabling probing from the OVS side with 
> >> the following command:
> >> ovs-vsctl --inactivity-probe=0 set-controller br-int 
> >> punix://var/run/openvswitch/br-int.mgmt
> >>
> >> For this setting to take effect, the OF connection must be re-established 
> >> (for instance, by restarting ovn-controller).
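> >>
> >> To verify the applied setting, something like this should work (a sketch; 
> >> per the OVS schema, the Controller table’s inactivity_probe column is in 
> >> milliseconds):
> >>
> >> # show the configured probe interval per controller target
> >> ovs-vsctl --columns=target,inactivity_probe list Controller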
> >>
> >> My proposal:
> >>
> >> If ovn-controller uses only a unix socket for the OF connection, it should 
> >> be safe to disable OF probing from its code:
> >> 1. configure the controller for br-int to disable probing (ensure br-int, 
> >> or whatever bridge is configured, has its controller created with 
> >> inactivity probe = 0);
> >> 2. delete the ovn-openflow-probe-interval setting and note in the 
> >> documentation that for OVN versions >= 20.09 it’s recommended not to 
> >> configure ovn-openflow-probe-interval. For versions < 20.09 the default 
> >> interval is 5 seconds, so it’s reasonable to set it to a higher value in 
> >> the config for loaded hypervisors;
> >
> > You mean 23.09, right?
>
> No.
> IIUC, the 20.03.0-2 and 20.06.0-2 versions have a default interval of 5 
> seconds; all other versions have OF probing disabled by default.
> So my idea was to mention this in the docs and recommend that users of those 
> versions (with the 5-second default) disable OF probing manually:
> ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=0
>
> >
> >
> >> 3. set the OpenFlow probe interval for the OVS feature-collection and 
> >> pinctrl OF connections to 0 (disable probing); a sketch is below;
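> >>
> >> A sketch of what point 3 could look like, based on the hardcoded 
> >> rconn_create() calls referenced in [3] and [4] (the exact call sites and 
> >> OpenFlow version mask may differ between versions):
> >>
> >> /* rconn_create()'s first argument is the inactivity probe
> >>  * interval in seconds; passing 0 disables probing entirely. */
> >> swconn = rconn_create(0 /* was 5 */, 0, DSCP_DEFAULT,
> >>                       1 << OFP15_VERSION);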
> >>
> >> What do you guys think about this?
> >> Please correct me if I’ve misunderstood something and the described 
> >> changes can’t be accepted.
> >
> > The proposal looks reasonable to me.
> >
> > There may be one concern.  Let's say the physical link is fine (where
> > vswitchd runs bfd) but vswitchd is down on chassis 'A' for some
> > reason.  With probing disabled, ovn-controller will never
> > come to know that vswitchd is down until it tries to communicate with
> > it.  If this chassis 'A' is acting as a gateway, external traffic
> > from other chassis' VMs will be tunnelled to it, and since vswitchd is
> > down the traffic may not go out if no datapath flows exist to output
> > the traffic to the physical interface of the provider bridge.  Is this
> > acceptable?
> > At least with probing enabled, ovn-controller will release the
> > chassis resident port and the next chassis in the priority order will
> > claim it.
> >
> > IMO,  this limitation is more acceptable than the issue you reported.
> > Let's see what others have to say about this.
> >
> >
>
> I’m not sure I fully understand your point about vswitchd being down.
> I tried sending SIGKILL to ovs-vswitchd with probing disabled, and 
> ovn-controller recognized the failure (connection closed by peer) and 
> released its chassisredirect port_binding (so another chassis, by priority, 
> claims it):
>
> 2023-05-17T08:37:10.134Z|00004|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt:
>  connection closed by peer
> 2023-05-17T08:37:10.134Z|00061|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
> connection closed by peer
> 2023-05-17T08:37:10.134Z|00062|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
> connection closed by peer
> 2023-05-17T08:37:10.135Z|00063|binding|INFO|Releasing lport cr-xxx from this 
> chassis (sb_readonly=0)
> 2023-05-17T08:37:10.135Z|00064|if_status|WARN|Trying to release unknown 
> interface cr-xxx
> 2023-05-17T08:37:10.135Z|00065|binding|INFO|Releasing lport cr-xxx from this 
> chassis (sb_readonly=0)
>
> Does such a test look reasonable and cover your concern? If not, could you 
> describe it in more detail, please?

I think I misunderstood earlier.  Please ignore my comment.  So when
ovs-vswitchd is killed, ovn-controller will detect it.


Thanks
Numan


>
> >
> >
> >
> >>
> >>
> >>>>
> >>>> This node is a dedicated l3gateway which has the is-interconn=true 
> >>>> setting (interconnection case) and has quite a big OpenFlow table, 
> >>>> around 2.7M flows.
> >>>>
> >>>> So, I’ve got some questions for OVN and OVS:
> >>>> 1. I couldn’t find where in the ovn-controller code the logic for 
> >>>> handling an OpenFlow connection drop lives. Is it a bug that claimed 
> >>>> cr-ports are released after an OpenFlow connection drop? I guess yes. 
> >>>> Can you please point me to the correct place in the code to start 
> >>>> investigating?
> >>>> 2. I see almost the same OpenFlow flow count on each l3gateway with the 
> >>>> interconnection role. Even l3gateways which have no allocated 
> >>>> cr-port-bindings have 2M+ OpenFlow flows. Is this normal operation, or 
> >>>> is it a place for possible optimisation? I mean, it seems that excess 
> >>>> flows are installed.
> >>
> >> There are a lot of mac_binding records, but the question is why these 
> >> flows still persist even when no datapaths are allocated to the node.
> >
> > Maybe the datapaths to which the mac binding records belong are in
> > ovn-controller's "local_datapath" map [1] -
> > https://github.com/ovn-org/ovn/blob/main/controller/local_data.h#L35
> >
>
> Thanks, I’ll check that.
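>
> A rough way to check this could be (a sketch: compare the number of SB 
> MAC_Binding rows with the installed flow count on a gateway that has no 
> bound datapaths):
>
> # count SB MAC_Binding rows
> ovn-sbctl --columns=_uuid list MAC_Binding | grep -c _uuid
> # count installed OpenFlow flows on the integration bridge
> ovs-ofctl dump-flows br-int | wc -l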
>
> >>
> >>>> 3. How should I handle long OVS OF updates, which block the inactivity 
> >>>> probe on the OF connection so that it then gets dropped? From 
> >>>> ovn-controller it’s possible to configure the inactivity probe; is it 
> >>>> possible, and should I configure it, from the ovs-vswitchd side as well?
> >>>> 4. There is a warning about trying to release an unknown interface, 
> >>>> which was released right before. Is this a normal situation, or should 
> >>>> it also be investigated?
> >
> > I think it can be ignored.  Maybe @Dumitru Ceara can correct me or comment 
> > here
> >
> > Thanks
> > Numan
> >
> >>>>
> >>
> >> For points 2-4 I haven’t found the answers yet. If someone knows them, 
> >> please share.
> >>
> >>>>
> >>>> Thanks.
> >>>>
> >>>> Regards,
> >>>> Vladislav Odintsov
> >>>>
> >>>
> >>>
> >>> Regards,
> >>> Vladislav Odintsov
> >>>
> >>
> >>
> >> 1: https://github.com/ovn-org/ovn/blob/v23.03.0/controller/ofctrl.c#L421
> >> 2: 
> >> https://github.com/ovn-org/ovn/blob/v23.03.0/controller/ovn-controller.c#L108
> >> 3: https://github.com/ovn-org/ovn/blob/v23.03.0/lib/features.c#L85
> >> 4: https://github.com/ovn-org/ovn/blob/v23.03.0/controller/pinctrl.c#L3479
> >> 5: 
> >> https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
> >> 6: 
> >> https://github.com/ovn-org/ovn/commit/b8af8549396e62d6523be18e104352e334825783
> >> 7: https://github.com/openvswitch/ovs/blob/v3.1.0/vswitchd/bridge.c#L3860
> >>
> >> Regards,
> >> Vladislav Odintsov
> >>
>
>
> Regards,
> Vladislav Odintsov
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
