Hello,
I have been wrestling with a problem that looks like a very weird bug, and I'm hoping for some troubleshooting guidance. I have three boxes which host various Xen domUs, and I use Open vSwitch to bridge my physical interfaces (and their various VLAN trunks) with the virtual ethernet interfaces of my guests.

Very occasionally a guest becomes inaccessible for a period of time - sometimes just a few seconds, sometimes 30 seconds or more. Remote monitoring of certain guest VMs, using ping as well as direct TCP connections for application-level sanity checking, suddenly fails during these times. If I log into the VM on its console, it is healthy, and if I send any packet to the default gateway, such as a ping, it suddenly wakes up as if nothing happened and is accessible on the network again.

The symptom seems to relate to the MAC address table of the Open vSwitch bridge losing the MAC address of the default gateway (!). This could (or should) only be possible if the entry has aged out, but that doesn't seem legitimately possible since the host is communicating on the network all the time.

I have observed strange results using ovs-appctl fdb/show mybridge - the MAC address of the default gateway router frequently has an age of several seconds or more. And - more weird - if I am actively pinging the gateway from inside a VM, the age can still continue to climb. I can also ping across subnets, so that the default router has to be used again, and the age still climbs. I've seen it go as high as 30 seconds before being reset to zero. This should simply not be possible with the VM directly bridged to the physical ethernet facing the gateway.

The ARP table on the guest certainly has the right MAC address for the gateway, so it's as if there is some wormhole that outbound packets from the guest are taking, bypassing Open vSwitch, and still making it to the gateway.
Or, for unknown reasons, Open vSwitch is forwarding these frames but failing to 'learn' or 'age' them consistently.

I've put a lot of effort into sniffing the network, taking packet traces and trying to understand the disconnect here. There are no duplicate MAC addresses, no loops, no topology changes. Open vSwitch on the dom0 hosts simply fails to bridge, with no obvious cause.

I've also had cases of my Cisco switches complaining of %ETHCNTR-3-LOOP_BACK_DETECTED on ports that face only an Open vSwitch bridge, and the only explanation I can see, again, is a defect in Open vSwitch's bridging code where it gets confused and sends frames back out the same port they entered on.

I am wondering how to debug further.

Thank you.

Mike

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
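
For anyone who wants to watch the same thing, here is a minimal sketch of how the age of the gateway entry could be logged over time. It assumes the usual four-column layout (port, VLAN, MAC, Age) printed by ovs-appctl fdb/show; the gateway MAC, the VLAN and the one-second polling interval are placeholders, not values from my setup, and the helper names are made up for illustration.

```python
import re
import subprocess
import time

# One data row of `ovs-appctl fdb/show`: port, VLAN, MAC, age in seconds.
# The header row does not match because its VLAN column is not numeric.
ROW = re.compile(
    r"^\s*(\S+)\s+(\d+)\s+((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\s+(\d+)\s*$"
)

def parse_fdb(text):
    """Return {(vlan, mac): (port, age)} parsed from fdb/show output."""
    table = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            port, vlan, mac, age = m.groups()
            table[(int(vlan), mac.lower())] = (port, int(age))
    return table

def gateway_age(text, mac, vlan=0):
    """Return the age of the entry for mac/vlan, or None if it is absent."""
    entry = parse_fdb(text).get((vlan, mac.lower()))
    return None if entry is None else entry[1]

def read_fdb(bridge):
    """Run `ovs-appctl fdb/show <bridge>` and return its stdout."""
    result = subprocess.run(
        ["ovs-appctl", "fdb/show", bridge],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    BRIDGE = "mybridge"                # bridge name as used above
    GATEWAY_MAC = "00:00:5e:00:01:01"  # placeholder, substitute the real one
    VLAN = 0                           # placeholder, 0 means untagged
    while True:
        age = gateway_age(read_fdb(BRIDGE), GATEWAY_MAC, VLAN)
        stamp = time.strftime("%H:%M:%S")
        print(stamp, "MISSING" if age is None else "age=%ds" % age)
        time.sleep(1)
```

Running this on the dom0 alongside a ping from a guest to the gateway should show whether the age keeps climbing while traffic is flowing, and whether the entry disappears at the moment the guest becomes unreachable.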
