On Fri, Sep 24, 2021 at 7:27 AM Numan Siddique <[email protected]> wrote: > > On Fri, Sep 24, 2021 at 7:22 AM Dumitru Ceara <[email protected]> wrote: > > > > On 9/24/21 12:06 PM, Xavier Simonart wrote: > > > Hi > > > > > > > Hi Xavier, > > > > > I have the following question on when ovn is reporting a port to be up if > > > conditional monitoring is enabled. > > > > > > In the current ovn master, if conditional monitoring is enabled, a port is > > > reported up too early i.e. before all related flows are properly installed > > > in ovs. > > > Packets sent through ovs immediately after a port has been reported up > > > might be lost. This is visible in some ovn tests failing intermittently. > > > > > > > Thanks for investigating this issue! I'm cc-ing Han explicitly too > > because he mentioned the same bug during yesterday's IRC meeting. > > Thanks for bringing this to my attention. Please see my comments below.
> > > This is the flow of interaction between ovs/ovn-controller and ovn-sb after > > > a Logical switch has been created, a port added to it and the related > > > interface has been added to ovs. > > > > > > [A] OVS=>OVN new interface notification > > > [B] OVN=>SB monitor_cond_change(Port_Binding) > > > [C] OVN<= SB notification Port_Binding (mac, uuid, ...) > > > [D] OVS<= OVN Adding flows for tables [1] (seqno 1) > > > [E] OVN=>SB monitor_cond_change(Logical_Flow, MAC_Binding, ...) > > > [F] OVS=>OVN Flows seqno 1 installed > > > [G] OVN=>SB Port up > > > [H] OVS<=OVN <= ovn_installed > > > [I] OVN<=SB <= notification (Logical_Flows) > > > [J] OVS<=OVN <= Adding flows for all tables (seqno 2) > > > [K] (flows installed, port only up now...) > > > > > > [1] 0, 37, 38, 39, 64, 65 > > > > > > Potential solution/workarounds/... in ovn controller > > > > > > - (1) check that there is no conditional monitoring in flight before > > > reporting a port to be up > > > > > > CON: if adding many ports, we might have conditional monitorings in flight > > > for a long time, resulting in a long delay in reporting port up > > > > I agree, this seems too risky. > > > > > > > > - (2) Disable conditional monitoring of Port_Binding. > > > > > > CON: as such, it does not help - only steps [B,C] above are skipped > > > > > > - (3) "Proper" fix: only report a port up when there are no monitoring > > > conditions related to this port, its datapath, ... in flight > > > > > > CON: Must track changes in monitor conditions; overkill for this small > > > issue? > > > > To me it seems like this would be quite complex, with too many chances > > of introducing bugs for a small issue that shouldn't have high impact in > > production deployments. > > > > > > > > - (4) Combine 1 & 2: i.e. disable conditional monitoring of Port_Binding > > > and check that there is no conditional monitoring in flight before > > > reporting a port to be UP > > > > > > CON: port up might be delayed if many local datapaths are added (adding a > > > port to an existing local datapath will not delay port up as in 1) > > > > > > > This sounds like a good compromise to me. It seems to me monitoring all Port_Binding doesn't solve the problem. If a node keeps claiming and releasing the first/last LSP of a LS on the node, the monitoring condition change to the Datapath table would never stop (in theory). Besides, enabling conditional monitoring is to avoid downloading unnecessary data by every ovn-controller. Port_Binding is one of the biggest part of the data (although less than Logical_Flow). When conditional monitoring is enabled (ovn-monitor-all = false), users would expect port bindings of other nodes are not downloaded. > > > > > - (5) Ignore the problem in ovn, only fix the unit test (e.g. checking > > > from unit tests that relevant flows are installed)... > > > > This "works" too but I think I'd prefer (4). > > > Given that the issue would be seen only for the first port claim of a > logical switch 'S' on a given > chassis, I'd say (5) should suffice. > I think the severity of the problem depends on how we use the "up" in the LSP and "ovn-installed" in the OVS interface:external_ids. For "up", if it is used in test cases to ensure port is claimed by a chassis, it is fine. If it is used to ensure are flows are installed, it is risky. In many cases an extra "ovn-nbctl --wait=hv sync" following the "up" == true check would help. For "ovn-installed", we are relying on it in ovn-k8s for port readiness check. When conditional monitoring is enabled, the required flows may not be downloaded for the pod to work. This is not acceptable even if it is the problem only for the first pod on the LS. Now ovn-k8s disables conditional monitoring, so it is ok for now. If ovn-k8s enables conditional monitoring (maybe through an option), we can fallback to the old way to check for certain OVS flows in those cases. Other CMS may handle it the same way. I have no better suggestion to improve the behavior of ovn-installed. But it seems the benefit of it is not strong enough to justify complex solutions to it. I think we can keep it as is with the above ways to avoid the problems. In addition, the ovn-installed mechanism introduced another update to the SB DB (for the "up" status). It could be a significant overhead to the SB and the whole control plane if the churn rate is high in a large-scale environment. The major write operation from ovn-controller to SB DB was port-binding's chassis column and mac-binding entries. Now we are doubling the amount of port-binding updates. Thanks, Han _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
