I've been chasing a bug that is causing unintended impact on our bonded hypervisors. Each server has a bond with two slave interfaces, and each slave goes to a different upstream switch. The two switches are configured with a Virtual PortChannel (VPC). To OVS, the VPC configuration makes the switches appear to be a single device with a single PortChannel. The configuration works great, but we have noticed unexpected data plane outages when interfaces come back up, not when they go down.
For instance, if my server has eth0 and eth1 in a bond and I down the link on eth1, everything is fine. When I re-enable eth1 and it starts to negotiate LACP again, eth0's LACP status goes unsynchronized and eth0 stops passing traffic. I have packet captures for this scenario here: https://gist.github.com/beardymcbeards/7bd9feca87c0574e996a397d90d5ff98. If you look at lines 49-57 of the 2nd file, you can see that when the 2nd interface is brought back online, a rogue LACPDU is sent out the working slave interface carrying an LACP state that doesn't match that slave's current state. The state mismatch then causes the switch to stop forwarding and restart the LACP negotiation.

Does anyone have an idea why this might be happening?

Thanks in advance,
Chad
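
P.S. For anyone comparing the LACPDUs in the captures, here is a minimal Python sketch for decoding the one-byte LACP state field and listing which flags differ between two PDUs. The flag names and bit positions follow IEEE 802.1AX; the function names and the two sample values are hypothetical, chosen only for illustration, and are not taken from the captures above.

```python
# LACP state flag names by bit position (bit 0 = least significant),
# as defined in IEEE 802.1AX.
LACP_STATE_FLAGS = (
    "Activity",
    "Timeout",
    "Aggregation",
    "Synchronization",
    "Collecting",
    "Distributing",
    "Defaulted",
    "Expired",
)


def decode_lacp_state(state):
    """Return the names of the flags set in a one-byte LACP state field."""
    if not 0 <= state <= 0xFF:
        raise ValueError("LACP state must fit in one byte")
    return [name for bit, name in enumerate(LACP_STATE_FLAGS) if state & (1 << bit)]


def diff_lacp_state(a, b):
    """Return the names of the flags that differ between two state bytes."""
    return decode_lacp_state(a ^ b)


if __name__ == "__main__":
    # Hypothetical values for illustration only, not read from the gist.
    in_sync = 0x3D
    suspect = 0x05
    print("in sync :", decode_lacp_state(in_sync))
    print("suspect :", decode_lacp_state(suspect))
    print("differs :", diff_lacp_state(in_sync, suspect))
```

Feeding it the actor state bytes from two consecutive LACPDUs on the same slave should make it easier to see which flags the unexpected PDU clears.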
