On Thu, Sep 19, 2024 at 8:18 AM Odintsov Vladislav <[email protected]> wrote:
>
> Forgot to mention.
>
> ovn-controller logs when I call "finalize" after wait of 15 seconds.
>
> Here the VM port was added.
>
> 2024-09-19T12:11:17.488Z|00083|binding|INFO|Claiming lport eni-C104F9E1 for this additional chassis.
> 2024-09-19T12:11:17.488Z|00084|binding|INFO|eni-C104F9E1: Claiming 0a:01:c1:04:f9:e1 172.31.0.4
> 2024-09-19T12:11:17.531Z|00085|binding|INFO|Setting lport eni-C104F9E1 ovn-installed in OVS
>
> Then after migration completed and requested-chassis is set to new dest 
> hostname:
>
> 2024-09-19T12:11:38.520Z|00005|pinctrl(ovn_pinctrl0)|INFO|DHCPACK 0a:01:c1:04:f9:e1 172.31.0.4
> 2024-09-19T12:11:47.527Z|00086|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
> 2024-09-19T12:11:50.914Z|00087|binding|INFO|Claiming lport eni-C104F9E1 for this chassis.
> 2024-09-19T12:11:50.914Z|00088|binding|INFO|eni-C104F9E1: Claiming 0a:01:c1:04:f9:e1 172.31.0.4
> 2024-09-19T12:11:50.918Z|00089|binding|INFO|Setting lport eni-C104F9E1 up in Southbound
> 2024-09-19T12:11:50.949Z|00090|ovn_bfd|INFO|Enabled BFD on interface ovn-vad09-0
> 2024-09-19T12:11:50.949Z|00091|ovn_bfd|INFO|Enabled BFD on interface ovn-vad08-0
>
> In this case I see only 0.1s downtime.
>
> Then ovn-controller logs when "finalize" is called immediately after VM 
> migration:
>
> Migration started. dest QEMU got created and vif inserted into OVS.
>
> 2024-09-19T12:15:27.953Z|00101|binding|INFO|Claiming lport eni-C104F9E1 for this additional chassis.
> 2024-09-19T12:15:27.953Z|00102|binding|INFO|eni-C104F9E1: Claiming 0a:01:c1:04:f9:e1 172.31.0.4
> 2024-09-19T12:15:27.997Z|00103|binding|INFO|Setting lport eni-C104F9E1 ovn-installed in OVS
>
> Migration finished, now set requested-chassis to "host2".
>
> 2024-09-19T12:15:45.875Z|00104|binding|INFO|Changing chassis for lport eni-C104F9E1 from ad03 to ad04.
> 2024-09-19T12:15:45.875Z|00105|binding|INFO|eni-C104F9E1: Claiming 0a:01:c1:04:f9:e1 172.31.0.4
> 2024-09-19T12:15:45.890Z|00106|binding|INFO|Claiming lport eni-C104F9E1 for this chassis.
> 2024-09-19T12:15:45.890Z|00107|binding|INFO|eni-C104F9E1: Claiming 0a:01:c1:04:f9:e1 172.31.0.4
> 2024-09-19T12:15:45.892Z|00108|binding|INFO|Removing iface vif0-6F426F81 ovn-installed in OVS
> 2024-09-19T12:15:45.893Z|00109|binding|INFO|Setting lport eni-C104F9E1 ovn-installed in OVS
> 2024-09-19T12:15:45.895Z|00110|ovn_bfd|INFO|Enabled BFD on interface ovn-vad09-0
> 2024-09-19T12:15:45.895Z|00111|ovn_bfd|INFO|Enabled BFD on interface ovn-vad08-0
> 2024-09-19T12:15:45.911Z|00112|binding|INFO|Setting lport eni-C104F9E1 up in Southbound
> 2024-09-19T12:15:45.912Z|00113|ovn_bfd|INFO|Disabled BFD on interface ovn-vad09-0
> 2024-09-19T12:15:45.912Z|00114|ovn_bfd|INFO|Disabled BFD on interface ovn-vad08-0
> 2024-09-19T12:15:45.913Z|00115|ovn_bfd|INFO|Enabled BFD on interface ovn-vad09-0
> 2024-09-19T12:15:45.913Z|00116|ovn_bfd|INFO|Enabled BFD on interface ovn-vad08-0
>
> Also, wanted to note that these tests are run on the ovn-22.09.4 
> (branch-22.09) code.
>
> On 19.09.2024 14:13, Odintsov Vladislav wrote:
>
> Hi,
>
> I'm trying to use the requested-chassis feature with two chassis in the
> list and activation-strategy=rarp for VM migration, and found that BFD
> sessions (to the configured ha-chassis-group) get configured only after
> the port binding of the migrating VM is fully claimed by the destination
> node as the "main" chassis. As a result, after the VM is migrated to the
> destination host, we call a kind of "finalize_migration" to align
> requested-chassis with reality (set the new node as the only requested
> chassis) and see a network outage of ~1-2 seconds to external resources
> reached through the ha-chassis-group chassis (l3gateways).
>
> The whole process looks like this:
>
> 1. We have a running VM, its LSP is configured with
> options:requested-chassis=<host>, where <host> is current VM's
> hypervisor, say, host1.
> 2. Initialize VM migration.
> 2.1. Set options:requested-chassis=host1,host2 and
> options:activation-strategy=rarp. host2 here is the destination
> hypervisor for VM migration.
> 2.2. Run QEMU on the destination HV
> 2.3. configure it, insert QEMU network interface in OVS and set
> external_ids:iface-id. At this place all "to-lport" traffic destined to
> a migrating VM is sent to only source hypervisor. ovn-controller writes
> the destination node into port-binding additional_chassis field and sets
> "ovn-installed" in the local (destination host's) OVS Interface record's
> external_ids after it prepared all related openflows.
> 2.4. Start memory pages copying from source to dest hypervisor.
> 2.5. After migration is finished, the VM sends RARP and ovn-controller
> activates the VM's port on the destination hypervisor. Here I see a ~0.1s
> network outage, because source QEMU is disabled, dest QEMU is already
> working, but ovn-controller hasn't activated the port yet. This is okay.
> 3. Then we want to finalize migration. We set the new requested-chassis
> value (host2) and pop activation-strategy from options. This is where we
> see a problem. If we set the options immediately after lport activation,
> we see ~1-2 seconds of network outage through the l3gateway port
> (external connectivity) while BFD sessions to the ha-chassis-group
> chassis come up. If we wait a bit (in my tests 5 seconds, but less will
> probably also work) before calling finalize (to set requested-chassis and
> pop activation-strategy), there is only a 0.1s network outage to the
> outside, even though the BFD sessions are only coming up at that point.
>
> So, here I've got two questions:
>
> 1. Why are BFD sessions not configured as part of interface
> preparation for activating a port? It seems desirable for BFD to be in
> sync before we activate a port after the VM is migrated.
> 2. I can't explain why a network outage of 1-2 seconds occurs when we
> call finalize immediately, yet there is no such outage when we call it
> 5 seconds after RARP activated the port. In both cases BFD sessions
> come up only after finalize is called, which takes 1-2 seconds.
>
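
For reference, the option changes in steps 2.1 and 3 of the quoted procedure can be sketched roughly as below. This is only an illustration, not the reporter's tooling: the helper names are hypothetical, and the Python code just builds ovn-nbctl argument lists without running anything. It assumes lsp-set-options replaces the whole options column, so any other options already set on the port have to be passed in as well.

```python
import shlex


def start_migration_argv(lsp, src, dst, other_options=None):
    """Step 2.1: request both chassis and enable RARP activation.

    other_options (hypothetical parameter) carries any options that must
    survive, since lsp-set-options rewrites the full options column.
    """
    options = dict(other_options or {})
    options["requested-chassis"] = f"{src},{dst}"
    options["activation-strategy"] = "rarp"
    return ["ovn-nbctl", "lsp-set-options", lsp] + [
        f"{key}={value}" for key, value in options.items()
    ]


def finalize_migration_argv(lsp, dst):
    """Step 3: pin the port to dst and drop activation-strategy."""
    return [
        "ovn-nbctl",
        "set", "Logical_Switch_Port", lsp, f"options:requested-chassis={dst}",
        "--",
        "remove", "Logical_Switch_Port", lsp, "options", "activation-strategy",
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    print(shlex.join(start_migration_argv("eni-C104F9E1", "host1", "host2")))
    print(shlex.join(finalize_migration_argv("eni-C104F9E1", "host2")))
```

Doing the finalize step as a single ovn-nbctl invocation (set plus remove joined by "--") keeps both changes in one NB transaction, so ovn-northd does not see an intermediate state.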

I think in order to fix this issue, we need to enhance ovn-northd to
populate the requested additional chassis (in your case host2)
in the ref_chassis column of the SB HA_Chassis_Group table.

It looks like, right now, host2 is added to the ref_chassis column only
after Step 3. I think this explains the initial delay, as it takes some
time for ovn-northd to populate this and for the ovn-controllers to
receive the update and establish BFD sessions.
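
One way to check this would be to look at ref_chassis before and after Step 3, for example with "ovn-sbctl --columns=name,ref_chassis list HA_Chassis_Group". Below is a minimal sketch of parsing that output, assuming the usual "column : value" list layout with a blank line between records. The helper names and the sample text are made up for illustration; the sample is not captured from a real deployment.

```python
def parse_list_output(text):
    """Parse 'column : value' records separated by blank lines."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def ref_chassis_of(record):
    """Return the ref_chassis entries of one record as a list of strings."""
    raw = record.get("ref_chassis", "[]").strip("[]")
    return [item.strip() for item in raw.split(",") if item.strip()]


# Illustrative sample only; real output lists Chassis row UUIDs.
SAMPLE = """\
name                : "hagrp1"
ref_chassis         : [uuid-of-host1, uuid-of-host2]

name                : "hagrp2"
ref_chassis         : []
"""

if __name__ == "__main__":
    for rec in parse_list_output(SAMPLE):
        print(rec["name"], ref_chassis_of(rec))
```

If host2's Chassis row only shows up in ref_chassis after the finalize call, that would match the delay described above.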

Thanks
Numan


> Thanks in advance for your replies.
>
> _______________________________________________
> dev mailing list
> [email protected]<mailto:[email protected]>
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
