Hey Przemek!

On 11.01.24 16:07, Przemek Kitszel wrote:
> I plan (and my manager agrees :)) to work on this ~now
>
> So far I have found a few bad smells to fix in related area, will work
> with Ubuntu as main test setup for that too to increase a chance of
> repro.
>
> Also, just from the code there is no obvious bug (even if there is about
> one patch around stats in 6.1 ... 6.2 range).
>
> I would also check exact Ubuntu kernel sources (not just "upstream").

Thanks very much for diving into this issue now!
Please don't hesitate to ask if there is any more info or debugging I could provide.


One observation that I can contribute to maybe narrow down the issue:

Looking at the traffic graphs of three different machines (attached to this email; I can provide them in better resolution, but the ML only allows 90 kB), there seems to be a correlation with the number (or mere existence) of KVM virtual machines:

 * comp-20 has 29 VMs
 * comp-21 has 96 VMs
 * comp-24 has 0 VMs (<< !)

To give a little bit of background:

We are using OpenStack Neutron with the Linuxbridge ML2 driver for virtual networking. All the machines run the neutron agent and, independently of whether they run VMs, provide the L2 and L3 services via bridges and network namespaces (with routing tables and iptables). Overlay networking is done via VXLAN over multicast. There are two aspects though:

1) L2 is happening via linux bridges coupled via VXLAN and tap interfaces + veth pair in case there is a VM requiring an interface into that network
2) L3 is a linux namespace with a dedicated routing table, iptables and some daemons (keepalived, dnsmasq, haproxy, ...)

More details can be found at https://docs.openstack.org/neutron/latest/admin/deploy-lb-selfservice.html#architecture


These are the current network related resources on the machines:

* comp-20 has 202 bridges, 302 tap interfaces and 201 vxlan interfaces
* comp-21 has 238 bridges, 496 tap interfaces and 237 vxlan interfaces
* comp-24 has 0 bridges, 0 tap interfaces and 0 vxlan interfaces
(since there are no VMs running and the node does not host an L3 agent for this network)
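In case it helps to reproduce these numbers on a test setup, here is a minimal sketch of how such counts could be gathered from sysfs. It is not the exact method I used; the function names are made up for illustration, and it assumes that bridges and vxlan devices report their type via DEVTYPE in the uevent file and that tun/tap devices expose a tun_flags attribute:

```python
import os
from collections import Counter

IFF_TAP = 0x0002  # tap bit in tun_flags, as defined in linux/if_tun.h


def classify(ifdir):
    """Return 'bridge', 'vxlan', 'tap' or None for one /sys/class/net/<if> dir."""
    devtype = None
    try:
        with open(os.path.join(ifdir, "uevent")) as f:
            for line in f:
                if line.startswith("DEVTYPE="):
                    devtype = line.strip().split("=", 1)[1]
    except OSError:
        pass  # not a device directory (e.g. bonding_masters) or unreadable
    if devtype in ("bridge", "vxlan"):
        return devtype
    try:
        # Assumption: tun/tap devices expose their flags as a hex string here.
        with open(os.path.join(ifdir, "tun_flags")) as f:
            flags = int(f.read().strip(), 16)
    except (OSError, ValueError):
        return None
    return "tap" if flags & IFF_TAP else None


def count_interfaces(root="/sys/class/net"):
    """Count bridge, tap and vxlan interfaces found below root."""
    counts = Counter(bridge=0, tap=0, vxlan=0)
    for name in os.listdir(root):
        kind = classify(os.path.join(root, name))
        if kind:
            counts[kind] += 1
    return dict(counts)


if __name__ == "__main__":
    print(count_interfaces())
```

Run on each compute node, this prints one dictionary per host, which makes it easy to compare the nodes over time while VMs are being moved.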

I have now moved a few VMs to comp-24 to see if the issue then starts occurring on that machine as well. This should only cause some of the mentioned L2 components to now exist on this machine. The issue did not appear immediately though, but I will keep observing this and maybe start increasing the VM count and networking load.

Maybe the spiking counters are caused by some offloading feature, such as VXLAN offloading?
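To compare the nodes in that respect, something like the following sketch could list the UDP tunnel related offload features of an interface. It only parses the text printed by "ethtool -k"; the helper names are hypothetical, and matching on "udp_tnl" / "udp_tunnel" in the feature name is my assumption about which features are relevant here:

```python
import subprocess


def parse_features(text):
    """Parse 'ethtool -k' output into a {feature_name: enabled} dict."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        words = value.split()
        if sep and words:  # skips the 'Features for <if>:' header line
            features[name.strip()] = words[0] == "on"
    return features


def tunnel_offloads(text):
    """Keep only features whose name suggests UDP tunnel (e.g. VXLAN) offload."""
    return {
        name: enabled
        for name, enabled in parse_features(text).items()
        if "udp_tnl" in name or "udp_tunnel" in name
    }


def read_features(ifname):
    """Run 'ethtool -k <ifname>' and return its output as text."""
    result = subprocess.run(
        ["ethtool", "-k", ifname], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling tunnel_offloads(read_features("<interface>")) on each node would show whether the affected and unaffected machines differ in these settings.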



Regards

Christian

