Hi John,

> On 10 Jun 2020, at 23:22, John Bartelme <[email protected]> wrote:
>
> Hello,
>
> I’m trying to run down an issue with a couple of my servers
> but I’m having a really hard time pinpointing the root cause. I have around
> 250 servers up an running and after about a year one of the servers is no
> longer able to communicate over OVN. About two months later another server
> fell into this same state. For a given ovn switch any two VMs connected to
> that switch can talk to each other unless one of the endpoints resides on one
> of these failed servers. If both VMs are on the same server they have no
> problem communicating through the ovs bridge. Turning up various different
> debug I can’t determine why these servers are having issues. Ovn-trace shows
> that it should work. I see their chassis in the southbound database. Doing
> tcpdump on the different servers I can see a geneve encapsulated arp going
> out of the server and coming back in. It never seems to get the vm interface
> though. Tcpdump on the vm interface only shows the arp going out and never
> coming back. Turning up openvswitch debug I see debug statements saying the
> flow is sent but I never see flow received like I do on working boxes. What
> other tools/debug can I bring to bear to try and figure out what is wrong?
> It feels like perhaps something isn’t getting cleaned up somewhere. Again I
> have many servers working with the same configuration as these two servers
> and these two servers used to work without issue. I’ve tried completely
> re-installing the OS and reconfiguring the bad servers and the problem still
> persists. I have a lot of users using this setup but I may try and upgrade
> to a newer version of ovs(2.12) vs. 2.7-2 that I’m on now if I can get some
> system downtime. I’m also currently using RHEL 7.8 as the OS.
>

What version of OVN are you using? The one shipped with RHEL? Can you share the exact version of it?
If it is ovn2.11 I remember some issues with conjunctive flows but I don’t think this could be the case as you say that VMs within that one server can talk to each other.
Also, you mention that communication between VMs on different servers doesn’t work if one of them lives on that server, yet you see ARP traffic going out the tunnel. This is not expected if the two VMs belong to OVN, as ovn-controller will reply to the ARP request. Did I understand the scenario right?

This is a total blind guess from my end, but if you reinstalled everything and it still doesn’t work, could it be some wrong MAC_Binding entry in the SB database? I don’t know your topology, so I’m totally guessing here. You could delete all MAC binding entries for that particular logical switch and see if it makes a difference.

Also, it looks like you have inspected the local OVS logs, but what about the local ovn-controller logs on the faulty hypervisor?

Daniel

> Thanks, john
>
> _______________________________________________
> discuss mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
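
The MAC_Binding check suggested above could be scripted roughly as follows. This is only a sketch under a few assumptions: `ovn-sbctl` is on the PATH of a host that can reach the southbound database, the datapath UUID passed on the command line is a placeholder you look up yourself, and the helper names (`find_mac_bindings_cmd`, `destroy_mac_binding_cmd`, `parse_uuids`) are made up for illustration. It relies only on the generic `find` and `destroy` database commands. Without `--apply` it runs the read-only `find` and prints the `destroy` commands instead of executing them.

```python
#!/usr/bin/env python3
"""Sketch: list (and optionally delete) MAC_Binding rows for one datapath."""
import subprocess
import sys
from typing import List


def find_mac_bindings_cmd(datapath_uuid: str) -> List[str]:
    # Generic database "find"; with --bare and a single column this prints
    # one _uuid per matching row, separated by blank lines.
    return ["ovn-sbctl", "--bare", "--columns=_uuid",
            "find", "MAC_Binding", f"datapath={datapath_uuid}"]


def destroy_mac_binding_cmd(binding_uuid: str) -> List[str]:
    # Generic database "destroy" for a single row, addressed by UUID.
    return ["ovn-sbctl", "destroy", "MAC_Binding", binding_uuid]


def parse_uuids(output: str) -> List[str]:
    # Keep non-empty lines only; each one is a row UUID.
    return [line.strip() for line in output.splitlines() if line.strip()]


def main(argv: List[str]) -> int:
    if len(argv) < 2:
        print(f"usage: {argv[0]} DATAPATH_UUID [--apply]")
        return 2
    datapath_uuid = argv[1]  # placeholder: UUID of the Datapath_Binding row
    apply = "--apply" in argv[2:]

    found = subprocess.run(find_mac_bindings_cmd(datapath_uuid),
                           check=True, capture_output=True, text=True)
    for uuid in parse_uuids(found.stdout):
        cmd = destroy_mac_binding_cmd(uuid)
        if apply:
            subprocess.run(cmd, check=True)
        else:
            print("would run:", " ".join(cmd))
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

The datapath UUID can be read from `ovn-sbctl list Datapath_Binding`. Running the script first without `--apply` shows which rows would be removed before anything is changed.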
