Hello again Przemek,
On 11.01.24 16:07, Przemek Kitszel wrote:
I plan (and my manager agrees :)) to work on this ~now
So far I have found a few bad smells to fix in related area, will work
with Ubuntu as main test setup for that too to increase a chance of
repro.
Also, just from the code there is no obvious bug (even if there is about
one patch around stats in 6.1 ... 6.2 range).
I would also check exact Ubuntu kernel sources (not just "upstream").
Were you able to find anything in this regard yet? See my further
findings below.
On 16.01.24 15:40, Christian Rohmann wrote:
One observation that I can contribute to maybe narrow down the issue:
Looking at traffic graphs of three different machines (attached to
this email; I can provide them in better resolution, but ML only
allows 90kB),
there seems to be a correlation to the number / existence of KVM
virtual machines:
* comp-20 has 29 VMs
* comp-21 has 96 VMs
* comp-24 has 0 VMs (<< !)
[...]
[...]
I have now moved a few VMs to comp-24 to see if the issue starts
occurring on that machine as well.
This should only cause some of the mentioned L2 components to now
exist on this machine. The issue did not appear immediately though,
but I will keep observing this and maybe start increasing the VM count
and networking load.
Maybe the counter spikes are due to some offloading feature such as
VXLAN?
1) With only a few VMs and no churn there were no spikes in the counters
over a long period of time.
2) I then moved some more VMs to this machine yesterday and soon
spikes to multiple TBit/s happened. See the attached screenshot.
Some observations:
* the spikes happened during or right after live migration of instances
* the spikes then did not appear for > 12 hours
* I believe this relates to either
** the number of Linux bridges, tap interfaces or vxlan interfaces
** their churn (creation / deletion) when VMs are spawned / deleted or
migrated away
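
To pin down when exactly the spikes occur relative to a live migration,
the raw counters could be sampled directly instead of relying on the
graphs. Below is a minimal sketch, not a tested tool: it assumes the
standard Linux sysfs counters under /sys/class/net/<iface>/statistics/,
and the function names, the interface name in the usage comment and the
100 Gbit/s plausibility limit are all hypothetical placeholders.

```python
import time
from pathlib import Path


def read_counter(iface, name="rx_bytes"):
    """Read one byte counter for a network interface from sysfs."""
    path = Path(f"/sys/class/net/{iface}/statistics/{name}")
    return int(path.read_text())


def find_spikes(samples, max_bps):
    """Return (timestamp, bits_per_second) for implausible intervals.

    samples is a list of (timestamp_seconds, byte_counter) tuples in
    chronological order. An interval is flagged when the counter went
    backwards or when the derived rate exceeds max_bps.
    """
    spikes = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        delta = c1 - c0
        bps = delta * 8 / dt
        if delta < 0 or bps > max_bps:
            spikes.append((t1, bps))
    return spikes


def watch(iface, interval=1.0, max_bps=100e9, name="rx_bytes"):
    """Poll a counter forever and print every implausible jump.

    max_bps defaults to 100 Gbit/s, which is only an assumed upper
    bound; set it to the real link speed of the port.
    """
    prev = (time.time(), read_counter(iface, name))
    while True:
        time.sleep(interval)
        cur = (time.time(), read_counter(iface, name))
        for ts, bps in find_spikes([prev, cur], max_bps):
            print(f"{time.ctime(ts)} {iface} {name}: {bps / 1e12:.3f} Tbit/s")
        prev = cur


# Example (interface name is a placeholder):
# watch("eth0", interval=1.0, max_bps=100e9)
```

Logging the timestamps of flagged intervals next to the times at which
bridges, tap or vxlan interfaces are created or deleted would show
whether the two actually coincide.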
Please let me know if there is any more input I could provide to help
resolve this issue.
Regards
Christian