Hi Priyankar,
The description makes the issue crystal clear, and you appear to be
solving the race condition that can happen between the OVS interface
table and the southbound port_binding table.
Acked-by: Mark Michelson <[email protected]>
Just to let you know, the flapping problem you mention can be avoided
altogether by using options:requested-chassis on the northbound logical
switch port. When you migrate the port to a new chassis, set this option
to the new chassis's name or hostname, and ovn-controller will only
claim the logical switch port on that chassis. The old chassis will not
try to claim the port even if the tap is still present.
I wouldn't be surprised if there were other ways to trigger this race
condition as well. I suspect the port-flapping scenario is most likely
to trigger it, though.
On 5/31/23 01:35, Priyankar Jain wrote:
Currently during port migration, two chassis (source and destination)
can try to claim the same logical switch port simultaneously for a
short period of time until the tap is deleted on the source hypervisor.
The ovn-controllers on these two hosts constantly receive port-binding
updates about the other chassis claiming the port, and as a result each
tries to claim the port again (because its chassis has a tap interface
referencing the LSP). This flapping ends once the CMS cleans up the tap
interface from the source chassis.
The following steps occur during a single inc-proc-eng iteration while
the port is flapping:
1. PB update received on OVN controller about other chassis owning the
port.
2. ovn-controller tries to claim the port.
3. It installs the OVS flows for the port and updates the runtime_data
to include this port in locally relevant ports.
4. If some change to runtime data happens as part of 3, port-groups
containing the affected ports are recomputed. It uses related_lports
runtime data to compute the port-groups.
Finally, ovn-controller sends a port-binding update to SB changing the
chassis to itself.
At a later point in time, SB sends the notification to ovn-controller
about (4) being completed.
Once CMS deletes the tap interface, ovn-controller receives the
notification and updates the runtime data accordingly.
Issue: OVS flows are (sometimes) not cleaned up upon port migration.
If the notification of OVS interface deletion is received before SB
acks the PortBinding update, then ovn-controller does not clean up
related_lports, leading to incorrect port-groups computation.
That is, if the order of events is as follows:
1. PB update received on OVN controller about other chassis owning the
port.
2. ovn-controller claims the port, installs OVS flows and sends the
PortBinding update to SB.
3. OVS interface deletion notification received by ovn-controller.
4. SB ack received for step-2 PB update.
This commit fixes the issue by removing the logical port from
related_lports even when no binding is available locally.
Signed-off-by: Priyankar Jain <[email protected]>
---
controller/binding.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/controller/binding.c b/controller/binding.c
index 9b0647b70..9889be5c7 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -1568,6 +1568,7 @@ consider_vif_lport_(const struct sbrec_port_binding *pb,
|| is_additional_chassis(pb, b_ctx_in->chassis_rec)) {
/* Release the lport if there is no lbinding. */
if (!lbinding_set || !can_bind) {
+ remove_related_lport(pb, b_ctx_out);
return release_lport(pb, b_ctx_in->chassis_rec,
!b_ctx_in->ovnsb_idl_txn,
b_ctx_out->tracked_dp_bindings,