Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

Thomas Rosenstein via Bloat Thu, 05 Nov 2020 04:41:40 -0800


On 5 Nov 2020, at 13:38, Toke Høiland-Jørgensen wrote:

"Thomas Rosenstein" <[email protected]> writes:

On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:

"Thomas Rosenstein" <[email protected]> writes:
If so, this sounds more like a driver issue, or maybe something to do
with scheduling. Does it only happen with ICMP? You could try this tool
for a userspace UDP measurement:
It happens with all packets, so the transfer to Backblaze with 40
threads drops to ~8 MB/s instead of >60 MB/s.
Huh, right, definitely sounds like a kernel bug; or maybe the new kernel
is getting the hardware into a state where it bugs out when there are
lots of flows or something.
You could try looking at the ethtool stats (ethtool -S) while running
the test and see if any error counters go up. Here's a handy script to
monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
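
For readers who want the idea without the linked Perl script, here is a minimal Python sketch of the same technique: take two snapshots of `ethtool -S` output and print only the counters that changed. It is not a port of ethtool_stats.pl; the function names and the default interface `eth0` are assumptions made for illustration.

```python
import subprocess
import sys
import time


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a {counter_name: value} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            # Header lines such as "NIC statistics:" carry no number.
            continue
    return stats


def diff_counters(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def snapshot(iface):
    """Run `ethtool -S <iface>` and parse its output."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_stats(result.stdout)


if __name__ == "__main__":
    # "eth0" is a placeholder; pass the real interface as the first argument.
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    previous = snapshot(iface)
    while True:
        time.sleep(1)
        current = snapshot(iface)
        for name, delta in sorted(diff_counters(previous, current).items()):
            print(f"{name}: {delta:+d}")
        previous = current
```

Running it during the transfer and watching for error or drop counters that keep increasing is the point of the exercise; which counters exist depends on the NIC driver.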
I'll try what that reports!
Also, what happens if you ping a host on the internet (*through* the
router instead of *to* it)?
Same issue, but twice as pronounced, since it seems all interfaces are
affected.
So if I ping on one interface, the second one shows the issue as well.
Also, all traffic across the host has the issue, but on both sides, so
ping to the internet increased by 2x.
Right, so even an unloaded interface suffers? But this is the same NIC,
right? So it could still be a hardware issue...
Yep, the default that CentOS ships. I just tested 4.12.5; the issue
does not happen there either. So I guess I can bisect it then...
(really don't want to 😃)
Well that at least narrows it down :)

I just tested 5.9.4, which also seems to fix it partly: I have long
stretches where it looks good, and then some increases again. (Stock
3.10 has them too, but not as high, rather 1-3 ms.)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
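
A small Python sketch for picking the spikes out of ping output like the above, instead of eyeballing it. The 1.0 ms threshold and the function names are assumptions chosen for illustration; the sample lines are the first block quoted above.

```python
import re

# Matches lines such as:
# 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
PING_LINE = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")


def parse_rtts(text):
    """Return a list of (icmp_seq, rtt_ms) tuples found in ping output."""
    return [(int(seq), float(rtt)) for seq, rtt in PING_LINE.findall(text)]


def find_spikes(samples, threshold_ms=1.0):
    """Return the samples whose round-trip time exceeds threshold_ms."""
    return [(seq, rtt) for seq, rtt in samples if rtt > threshold_ms]


SAMPLE = """\
64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
"""

if __name__ == "__main__":
    for seq, rtt in find_spikes(parse_rtts(SAMPLE)):
        print(f"icmp_seq={seq}: {rtt} ms")
```

With the sample above this flags icmp_seq 11, 12 and 14, the three replies that took several milliseconds while their neighbours stayed below 0.2 ms.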

To me it now looks like there was some fix between 5.4.60 and 5.9.4...

Can anyone pinpoint it?


$ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
72932

Only 73k commits; should be easy, right? :)

(In other words no, I have no idea; I'd suggest either (a) asking on
netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
knowing).
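
As a rough sense of scale for option (b): a bisection halves the candidate range at every step, so the number of kernels to build and test grows only with the base-2 logarithm of the commit count. The sketch below just does that arithmetic for the 72932 commits counted above. The helper name is hypothetical, and the result is an estimate only, since the count excludes merges and real history is not a straight line.

```python
import math


def bisect_steps(commit_count):
    """Approximate number of build-and-test rounds a bisection needs."""
    if commit_count < 1:
        raise ValueError("commit_count must be at least 1")
    return math.ceil(math.log2(commit_count))


if __name__ == "__main__":
    commits = 72932  # from the `git log ... | wc -l` output above
    print(f"{commits} commits -> about {bisect_steps(commits)} bisection steps")
```

So it is on the order of 17 kernel builds and test runs rather than 73k, which is tedious but not hopeless.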


Guess I'll go the easy route and let it be ...

I'll update all routers to 5.9.4 and see if it fixes the traffic flow;
I'll report back once more after that.

How did you configure the new kernel? Did you start from scratch, or is
it based on the old CentOS config?
First oldconfig, and from there I added additional options for IB,
NVMe, etc. (which I don't really need on the routers).
OK, so you're probably building with roughly the same options in terms
of scheduling granularity etc. That's good. Did you enable spectre
mitigations etc on the new kernel? What's the output of
`tail /sys/devices/system/cpu/vulnerabilities/*` ?
mitigations are off


Right, I just figured maybe you were hitting some threshold that
involved a lot of indirect calls which slowed things down due to
mitigations. Guess not, then...
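
For comparing several routers, the `tail` check above can also be scripted. This is a minimal Python sketch that reads the same sysfs files and lists the entries whose status line begins with "Mitigation"; the function names are assumptions made for illustration, and the directory is a parameter so the code can be pointed at a copy of the files.

```python
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_vulnerabilities(directory=VULN_DIR):
    """Return {vulnerability_name: status_line} for each file in directory."""
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(Path(directory).iterdir())
        if entry.is_file()
    }


def actively_mitigated(report):
    """Return the names whose status line starts with 'Mitigation'."""
    return [
        name for name, status in report.items()
        if status.startswith("Mitigation")
    ]


if __name__ == "__main__":
    report = read_vulnerabilities()
    for name, status in report.items():
        print(f"{name}: {status}")
    print("mitigated:", ", ".join(actively_mitigated(report)) or "none")
```

On a machine booted with mitigations off, the second list would be expected to come out empty or nearly so, matching the answer given above.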


Thanks for the support :)

-Toke

_______________________________________________
Bloat mailing list
[email protected]
https://lists.bufferbloat.net/listinfo/bloat
