On Thu, 05 Nov 2020 13:22:10 +0100 Thomas Rosenstein via Bloat <[email protected]> wrote:
> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote: > > > "Thomas Rosenstein" <[email protected]> writes: > > > >>> If so, this sounds more like a driver issue, or maybe something to > >>> do with scheduling. Does it only happen with ICMP? You could try this > >>> tool for a userspace UDP measurement: > >> > >> It happens with all packets, therefore the transfer to backblaze with > >> 40 threads goes down to ~8MB/s instead of >60MB/s > > > > Huh, right, definitely sounds like a kernel bug; or maybe the new > > kernel is getting the hardware into a state where it bugs out when > > there are lots of flows or something. > > > > You could try looking at the ethtool stats (ethtool -S) while > > running the test and see if any error counters go up. Here's a > > handy script to monitor changes in the counters: > > > > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl > > > >> I'll try what that reports! > >> > >>> Also, what happens if you ping a host on the internet (*through* > >>> the router instead of *to* it)? > >> > >> Same issue, but twice pronounced, as it seems all interfaces are > >> affected. > >> So, ping on one interface and the second has the issue. > >> Also all traffic across the host has the issue, but on both sides, > >> so ping to the internet increased by 2x > > > > Right, so even an unloaded interface suffers? But this is the same > > NIC, right? So it could still be a hardware issue... > > > >> Yep default that CentOS ships, I just tested 4.12.5 there the > >> issue also does not happen. So I guess I can bisect it > >> then...(really don't want to 😃) > > > > Well that at least narrows it down :) > > I just tested 5.9.4 seems to also fix it partly, I have long > stretches where it looks good, and then some increases again. (3.10 > Stock has them too, but not so high, rather 1-3 ms) > > for example: > > 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms > 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms > 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms > 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms > 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms > > and then again: > > 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms > 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms > 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms > 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms > 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms > 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms > 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms > 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms > 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms > 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms > 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms > 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms > 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms > 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms > 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms > 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms > 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms > 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms > 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms > 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms > > > For me it looks now that there was some fix between 5.4.60 and 5.9.4 > ... anyone can pinpoint it? I have some bpftrace tools to measure these kind of latency spikes here: [1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/ The tool you want is: softirq_net_latency.bt [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt Example output see[3]: [3] https://bugzilla.redhat.com/show_bug.cgi?id=1795049#c8 Based on the kernel versions, I don't expect this to be same latency issue as described in the bugzilla[3] case (as IIRC it was fixed in 4.19). It can still be similar issue, where some userspace process is reading information from the kernel (/sys/fs/cgroup/memory/memory.stat in BZ case) that blocks softirq from running, and result in these latency spikes. Install guide to bpftrace[4]: [4] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/bpftrace/INSTALL.org -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer _______________________________________________ Bloat mailing list [email protected] https://lists.bufferbloat.net/listinfo/bloat
