Hi,

I'm trying to use the requested-chassis feature with two chassis in the
list and activation-strategy=rarp for VM migration, and found that BFD
sessions (to the configured ha-chassis-group) are configured only after
the port binding of the migrating VM is fully claimed by the destination
node as its "main" chassis. As a result, after the VM has migrated to
the destination host, we call a kind of "finalize_migration" to align
requested-chassis with reality (set the new node as the only requested
chassis) and see a network outage of ~1-2 seconds to external resources
reached through the ha-chassis-group chassis (l3gateways).

The whole process looks like this:

1. We have a running VM; its LSP is configured with
options:requested-chassis=<host>, where <host> is the VM's current
hypervisor, say, host1.
2. Initialize VM migration.
2.1. Set options:requested-chassis=host1,host2 and
options:activation-strategy=rarp. host2 here is the destination
hypervisor for the VM migration.
2.2. Run QEMU on the destination HV.
2.3. Configure it, insert the QEMU network interface into OVS and set
external_ids:iface-id. At this point all "to-lport" traffic destined to
the migrating VM is sent only to the source hypervisor. ovn-controller
writes the destination node into the port binding's additional_chassis
field and, once it has prepared all related OpenFlow flows, sets
"ovn-installed" in the external_ids of the local (destination host's)
OVS Interface record.
2.4. Start copying memory pages from the source to the destination
hypervisor.
2.5. After the migration has finished, the VM sends a RARP and
ovn-controller activates the VM's port on the destination hypervisor.
Here I see a ~0.1 s network outage, because the source QEMU is disabled
and the destination QEMU is already running, but ovn-controller hasn't
activated the port yet. This is okay.
3. Then we want to finalize the migration. We set the new
requested-chassis value (host2) and pop activation-strategy from the
options. This is where we see a problem. If we set the options
immediately after lport activation, we see ~1-2 seconds of network
outage through the l3gateway port (external connectivity) while the BFD
sessions to the ha-chassis-group chassis come up. If we wait a bit (5
seconds in my tests, but less will probably also work) before calling
finalize (to set requested-chassis and pop activation-strategy), there
is only a 0.1 s network outage to the outside, even though the BFD
sessions still have to come up.
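
To make the sequence concrete, here is a minimal sketch of the LSP
option changes in steps 2.1 and 3. It is not our actual orchestration
code: the function names and the port name "vm-port" are made up for
illustration, and the helper only builds an argument list for
"ovn-nbctl lsp-set-options" without running it. Note that
lsp-set-options replaces the whole options set of the port, so any
other options would have to be passed along too.

```python
def begin_migration(options, src, dst):
    """Step 2.1: request both chassis and gate activation on RARP."""
    new = dict(options)
    new["requested-chassis"] = f"{src},{dst}"
    new["activation-strategy"] = "rarp"
    return new


def finalize_migration(options, dst):
    """Step 3: keep only the destination chassis, drop activation-strategy."""
    new = dict(options)
    new["requested-chassis"] = dst
    new.pop("activation-strategy", None)
    return new


def lsp_set_options_argv(port, options):
    """Build (but do not run) an 'ovn-nbctl lsp-set-options' command line.

    Illustrative helper; the keys are sorted only to make the output stable.
    """
    pairs = [f"{key}={value}" for key, value in sorted(options.items())]
    return ["ovn-nbctl", "lsp-set-options", port] + pairs


if __name__ == "__main__":
    opts = {"requested-chassis": "host1"}           # step 1
    opts = begin_migration(opts, "host1", "host2")  # step 2.1
    print(" ".join(lsp_set_options_argv("vm-port", opts)))
    opts = finalize_migration(opts, "host2")        # step 3
    print(" ".join(lsp_set_options_argv("vm-port", opts)))
```

The only thing that differs between the two cases described in step 3
is how long we wait between RARP activation and the finalize call.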

So, here I've got two questions:

1. Why are BFD sessions not configured as part of the interface
preparation for activating a port? It seems desirable for BFD to be in
sync before we activate a port after the VM is migrated.
2. I can't explain why a network outage of 1-2 seconds occurs when we
call finalize immediately, while there is no such outage if we call it
5 seconds after RARP activated the port. In both cases the BFD sessions
come up only after finalize is called, and that takes 1-2 seconds.
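
In case it helps to reproduce the numbers, this is roughly how an
outage figure like the ones above can be derived from a periodic probe.
It is only a sketch under assumptions: the probe (for example a ping to
an external address at a fixed interval) and the function name are
hypothetical, and the timestamps in the example are invented, not
measured data.

```python
def longest_outage(reply_times, interval):
    """Return the longest gap between consecutive replies, in seconds,
    beyond the expected probe interval (0.0 if no reply was missed)."""
    worst = 0.0
    for prev, cur in zip(reply_times, reply_times[1:]):
        worst = max(worst, (cur - prev) - interval)
    return worst


if __name__ == "__main__":
    # Invented reply timestamps (seconds) for a probe sent every 0.5 s.
    replies = [0.0, 0.5, 1.0, 3.0, 3.5]
    print(longest_outage(replies, 0.5))
```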

Thanks in advance for your replies.

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
