Hey Przemek!
On 11.01.24 16:07, Przemek Kitszel wrote:
> I plan (and my manager agrees :)) to work on this ~now
> So far I have found a few bad smells to fix in related area, will work
> with Ubuntu as main test setup for that too to increase a chance of
> repro.
> Also, just from the code there is no obvious bug (even if there is about
> one patch around stats in 6.1 ... 6.2 range).
> I would also check exact Ubuntu kernel sources (not just "upstream").
Thanks very much for diving into this issue now!
Please don't hesitate to ask if there is any more info or debugging I
could provide.
One observation that I can contribute to maybe narrow down the issue:
Looking at traffic graphs of three different machines (attached to this
email; I can provide them in better resolution, but the ML only allows 90kB),
there seems to be a correlation with the number / existence of KVM virtual
machines:
* comp-20 has 29 VMs
* comp-21 has 96 VMs
* comp-24 has 0 VMs (<< !)
To give a little bit of background:
We are using OpenStack Neutron with the Linuxbridge ML2 driver for virtual
networking. All the machines run the neutron agent and, independently
of any running VMs, run the L2 and L3 services via bridges and network
namespaces (with routing tables, iptables). Overlay networking is done
via VXLAN over multicast. There are two aspects though:
1) L2 is happening via linux bridges coupled via VXLAN and tap
interfaces + veth pair in case there is a VM requiring an interface into
that network
2) L3 is a linux namespace with a dedicated routing table, iptables and
some daemons (keepalived, dnsmasq, haproxy, ...)
More details can be found at
https://docs.openstack.org/neutron/latest/admin/deploy-lb-selfservice.html#architecture
These are the current network related resources on the machines:
* comp-20 has 202 bridges, 302 tap interfaces and 201 vxlan interfaces
* comp-21 has 238 bridges, 496 tap interfaces and 237 vxlan interfaces
* comp-24 has 0 bridges, 0 tap interfaces and 0 vxlan interfaces
(since there are no VMs running and the node does not serve an L3 agent
for this network)
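In case it helps to compare machines, here is a minimal sketch of how such
counts could be gathered. It assumes iproute2 with JSON support and that
`ip -d -j link show` reports the device kind under linkinfo.info_kind (with
tap devices showing up as kind "tun" and type "tap" in info_data). The
function names are made up for illustration; this is not the exact way the
numbers above were obtained.

```python
import json
import subprocess
from collections import Counter


def count_link_kinds(links):
    """Count interfaces per kind from parsed `ip -d -j link show` output."""
    counts = Counter()
    for link in links:
        info = link.get("linkinfo", {})
        kind = info.get("info_kind")
        if kind == "tun":
            # Assumption: tun and tap devices share info_kind "tun" and the
            # actual mode is reported as info_data.type.
            kind = info.get("info_data", {}).get("type", "tun")
        if kind:
            counts[kind] += 1
    return counts


def read_links():
    """Run iproute2 and return the parsed list of link objects."""
    result = subprocess.run(
        ["ip", "-d", "-j", "link", "show"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Example use on a compute node:
#   counts = count_link_kinds(read_links())
#   print(counts["bridge"], counts["tap"], counts["vxlan"])
```

Physical NICs and the loopback device carry no linkinfo and are skipped, so
only virtual devices are counted.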
I have now moved a few VMs to comp-24 to see if the issue starts
occurring on that machine as well.
This should only cause some of the mentioned L2 components to now exist
on this machine. The issue did not appear immediately, but I will keep
observing this and may start increasing the VM count and networking load.
Maybe the counter spikes are due to some offloading feature, such as VXLAN
offload?
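To check that guess, the UDP tunnel offload state of the NIC could be
compared across the machines. Below is a rough sketch that filters the output
of `ethtool -k <dev>` for the tunnel-related features. It assumes the usual
"name: on|off [fixed]" line format and that the relevant feature names
contain "udp_tnl" or "udp_tunnel"; the helper names are hypothetical and the
device name has to be filled in.

```python
import subprocess


def tunnel_offloads(ethtool_output):
    """Map UDP tunnel offload feature names to True (on) / False (off)."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        # Assumption: tunnel offload features are named like
        # tx-udp_tnl-segmentation or rx-udp_tunnel-port-offload.
        if not sep or ("udp_tnl" not in name and "udp_tunnel" not in name):
            continue
        words = value.split()  # e.g. ["off", "[fixed]"]
        features[name] = bool(words) and words[0] == "on"
    return features


def read_features(dev):
    """Return the raw `ethtool -k` output for the given device."""
    result = subprocess.run(
        ["ethtool", "-k", dev],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Example use, with the device name of the uplink NIC filled in:
#   print(tunnel_offloads(read_features("<dev>")))
```

If any of these turn out to be enabled, switching them off with `ethtool -K`
on one machine for a while might show whether the spikes depend on them.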
Regards
Christian