Hello,

    I have been wrestling with a problem that looks like a very weird
bug and I'm hoping for some troubleshooting guidance.

    I have three boxes which host various Xen domUs, and I use
openvswitch in order to bridge my physical interfaces (and their various
vlan trunks) with the virtual ethernet interfaces of my guests. Very
occasionally a situation arises where a guest appears to become
inaccessible for a period of time - sometimes just a few seconds,
sometimes for 30 or more seconds. Remote monitoring of certain guest
VMs, using ping as well as direct TCP connections for application-level
sanity checking, suddenly fails during these times. And I have noted
that if I log into the VM on its console, it's healthy, and if I send
any packet to the default gateway, such as a ping, it suddenly wakes up
as if nothing happened and is again accessible on the network.

    The symptom seems to relate to the mac address table of the
openvswitch bridge losing the mac address of the default gateway (!).
This could (or should) only be possible if it's aged out, but that
doesn't seem legitimately possible since the host is communicating on
the network all the time.

    I have observed strange results using ovs-appctl fdb/show mybridge -
the mac address of the default gateway router frequently has an 'age
time' of several seconds or more. And - more weird - if from inside a vm
I am actively pinging the gateway, the age time can still continue to
climb. I can continue to ping it from a different subnet, forcing the
default router again to be used, and it still climbs. I've seen it go as
high as 30 seconds before being reset to zero. This should simply not be
possible with the vm directly bridged to the physical ethernet facing
the gateway. The arp table on the guest certainly has the right mac
address for the gateway, so it's like there is some wormhole that
outbound packets are taking from the guest, bypassing openvswitch, and
still making it to the gateway. Or, for unknown reasons, openvswitch is
forwarding these frames but failing to 'learn' or 'age' them consistently.
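
    To put timestamps on this behaviour, the fdb/show output can be
polled and the gateway's entry logged once per second. The following is
only a rough sketch, not a tested tool: it assumes the usual four-column
fdb/show layout (port, VLAN, MAC, Age), and the function names
(parse_fdb, lookup, watch) and the MAC address in the usage note are
made up for illustration.

```python
import re
import subprocess
import time

# One data row of `ovs-appctl fdb/show`: port, VLAN, MAC, age in seconds.
# Assumption: the port column may be a number or a name such as LOCAL.
ROW = re.compile(r"^\s*(\S+)\s+(\d+)\s+([0-9a-fA-F:]{17})\s+(\d+)\s*$")


def parse_fdb(text):
    """Return {(mac, vlan): (port, age)} for every data row in the output."""
    entries = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # the header line does not match and is skipped
            port, vlan, mac, age = m.groups()
            entries[(mac.lower(), int(vlan))] = (port, int(age))
    return entries


def lookup(text, mac):
    """Return a sorted list of (vlan, port, age) rows for one MAC address."""
    mac = mac.lower()
    return sorted(
        (vlan, port, age)
        for (m, vlan), (port, age) in parse_fdb(text).items()
        if m == mac
    )


def watch(bridge, mac, interval=1.0, samples=60):
    """Poll the bridge's MAC table and print the entry (or its absence)."""
    for _ in range(samples):
        out = subprocess.run(
            ["ovs-appctl", "fdb/show", bridge],
            capture_output=True, text=True, check=True,
        ).stdout
        stamp = time.strftime("%H:%M:%S")
        hits = lookup(out, mac)
        if hits:
            for vlan, port, age in hits:
                print(f"{stamp} vlan={vlan} port={port} age={age}s")
        else:
            print(f"{stamp} {mac} NOT in MAC table")
        time.sleep(interval)
```

    Running something like watch("mybridge", "00:11:22:33:44:55") on
the dom0 while pinging the gateway from a guest would show whether the
age keeps climbing, and on which port and VLAN the entry was learned,
which can then be lined up against the packet traces.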

    I've dumped a lot of resources into sniffing the network, taking
packet traces and trying to understand the disconnect here. There are
no duplicate mac addresses, no loops, no topology changes. Openvswitch
on the dom0 hosts simply fails to bridge without cause or obvious
reason. I've also had cases of my cisco switches complaining of
%ETHCNTR-3-LOOP_BACK_DETECTED on ports (ones only facing an openvswitch
bridge) and the only possible explanation again is a defect in
openvswitch's bridging code where it's getting confused and sending
stuff back out the same port it entered on.

    I am wondering how to debug further.


Thank you.

Mike-
