Hi Priyankar,

The description makes the issue crystal clear, and you appear to be solving the race condition that can happen between the OVS interface table and the southbound port_binding table.

Acked-by: Mark Michelson <[email protected]>

Just to let you know, the flapping problem you mention can be avoided altogether by using options:requested-chassis on the northbound logical switch port. When you migrate the port to a new chassis, set this option to the new chassis's name or hostname, and ovn-controller will only claim the logical switch port on that chassis. The old chassis will not try to claim the port even if the tap is still present.

I wouldn't be surprised if there were other ways to trigger this race condition as well. I suspect the port-flapping scenario is most likely to trigger it, though.

On 5/31/23 01:35, Priyankar Jain wrote:
Currently during port migration, two chassis (source and destination)
can try to claim the same logical switch port simultaneously for a
short period of time until the tap is deleted on the source hypervisor.
The ovn-controllers on these 2 hosts constantly receive port-binding
updates about the other chassis claiming the port, and as a result each
tries to claim the port again (because its chassis has a tap interface
referencing the LSP). This flapping ends once the CMS cleans up the tap
interface from the source chassis.

The following steps occur during a single inc-proc-eng iteration while
the port is flapping:

1. PB update received on OVN controller about other chassis owning the
    port.
2. ovn-controller tries to claim the port.
3. It installs the OVS flows for the port and updates the runtime_data
    to include this port in locally relevant ports.
4. If some change to runtime data happens as part of 3, port-groups
    containing the affected ports are recomputed. It uses related_lports
    runtime data to compute the port-groups.

Finally, ovn-controller sends a port-binding update to SB changing the
chassis to itself.
At a later point in time, SB notifies ovn-controller that (4) has
completed.

Once CMS deletes the tap interface, ovn-controller receives the
notification and updates the runtime data accordingly.

Issue: OVS flows are (sometimes) not cleaned up upon port migration.

If the notification of OVS interface deletion is received before SB
acks the PortBinding update, then ovn-controller does not clean up
related_lports, leading to incorrect port-groups computation.

i.e., if the order of events is as follows:

1. PB update received on OVN controller about other chassis owning the
    port.
2. ovn-controller claims the port, installs OVS flows and sends the
    PortBinding update to SB.
3. OVS interface deletion notification received by ovn-controller.
4. SB ack received for step-2 PB update.

This commit fixes the issue by removing the logical port from
related_lports even when no binding is available locally.
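
The event ordering above can be sketched as a toy model. This is not
OVN code: the struct and the toy_* functions are hypothetical names made
up for illustration, and the assumption that nothing is released when
the interface is deleted before the SB ack is a simplification of the
behaviour described in this message. The with_fix flag stands in for
the remove_related_lport() call added by the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types and functions are hypothetical, not OVN's. */
struct toy_chassis {
    bool has_local_iface;   /* tap still present in the local OVS DB */
    bool in_related_lports; /* port is tracked in runtime data */
    bool sb_says_ours;      /* SB Port_Binding names this chassis */
};

/* Steps 1-2: the tap is present, so the chassis claims the port and sends
 * a Port_Binding update that the SB has not acknowledged yet. */
static void toy_claim(struct toy_chassis *c)
{
    if (c->has_local_iface) {
        c->in_related_lports = true;
    }
}

/* Step 3: the CMS deletes the tap. Assumption of this model: since the SB
 * does not name this chassis yet, nothing is released at this point. */
static void toy_iface_deleted(struct toy_chassis *c)
{
    c->has_local_iface = false;
}

/* Step 4: the SB ack arrives. The binding names this chassis, but there is
 * no local binding, so the port is released. 'with_fix' mirrors the added
 * remove_related_lport() call. */
static void toy_sb_ack(struct toy_chassis *c, bool with_fix)
{
    c->sb_says_ours = true;
    if (!c->has_local_iface) {
        if (with_fix) {
            c->in_related_lports = false;
        }
        c->sb_says_ours = false;   /* stands in for release_lport() */
    }
}

/* Returns true if the port is still in related_lports after the race. */
bool toy_race_leaves_stale_lport(bool with_fix)
{
    struct toy_chassis c = { .has_local_iface = true };

    toy_claim(&c);
    assert(c.in_related_lports);
    toy_iface_deleted(&c);
    toy_sb_ack(&c, with_fix);
    return c.in_related_lports;
}
```

In this model, toy_race_leaves_stale_lport(false) returns true (the
stale entry survives the release), while toy_race_leaves_stale_lport(true)
returns false.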

Signed-off-by: Priyankar Jain <[email protected]>
---
  controller/binding.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/controller/binding.c b/controller/binding.c
index 9b0647b70..9889be5c7 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -1568,6 +1568,7 @@ consider_vif_lport_(const struct sbrec_port_binding *pb,
              || is_additional_chassis(pb, b_ctx_in->chassis_rec)) {
          /* Release the lport if there is no lbinding. */
          if (!lbinding_set || !can_bind) {
+            remove_related_lport(pb, b_ctx_out);
              return release_lport(pb, b_ctx_in->chassis_rec,
                                   !b_ctx_in->ovnsb_idl_txn,
                                   b_ctx_out->tracked_dp_bindings,

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
