"Thomas Rosenstein" <thomas.rosenst...@creamfinance.com> writes:
> On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:
>
>> Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> writes:
>>
>>> Hi all,
>>>
>>> I'm coming from the lartc mailing list, here's the original text:
>>>
>>> =====
>>>
>>> I have multiple routers which connect to multiple upstream providers, I
>>> have noticed a high latency shift in icmp (and generally all connection)
>>> if I run b2 upload-file --threads 40 (and I can reproduce this)
>>>
>>> What options do I have to analyze why this happens?
>>>
>>> General Info:
>>>
>>> Routers are connected between each other with 10G Mellanox Connect-X
>>> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
>>> Latency generally is around 0.18 ms between all routers (4).
>>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
>>> 2 of the 4 routers are connected upstream with a 1G connection (separate
>>> port, same network card)
>>> All routers have the full internet routing tables, i.e. 80k entries for
>>> IPv6 and 830k entries for IPv4
>>> Conntrack is disabled (-j NOTRACK)
>>> Kernel 5.4.60 (custom)
>>> 2x Xeon X5670 @ 2.93 Ghz
>>> 96 GB RAM
>>> No Swap
>>> CentOs 7
>>>
>>> During high latency:
>>>
>>> Latency on routers which have the traffic flow increases to 12 - 20 ms,
>>> for all interfaces, moving of the stream (via bgp disable session) moves
>>> also the high latency
>>> iperf3 performance plumets to 300 - 400 MBits
>>> CPU load (user / system) are around 0.1%
>>> Ram Usage is around 3 - 4 GB
>>> if_packets count is stable (around 8000 pkt/s more)
>>
>> I'm not sure I get you topology. Packets are going from where to where,
>> and what link is the bottleneck for the transfer you're doing? Are you
>> measuring the latency along the same path?
>>
>> Have you tried running 'mtr' to figure out which hop the latency is at?
>
> I tried to draw the topology, I hope this is okay and explains betters
> what's happening:
>
> https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing

Ohh, right, you're pinging between two of the routers across a 10 Gbps
link with plenty of capacity to spare, and *that* goes up by two orders
of magnitude when you start the transfer, even though the transfer
itself is <1Gbps? Am I understanding you correctly now?

If so, this sounds more like a driver issue, or maybe something to do
with scheduling. Does it only happen with ICMP? You could try this tool
for a userspace UDP measurement: https://github.com/heistp/irtt/

Also, what happens if you ping a host on the internet (*through* the
router instead of *to* it)?

And which version of the Connect-X cards are you using (or rather, which
driver? mlx4?)

> So it must be something in the kernel tacking on a delay, I could try to
> do a bisect and build like 10 kernels :)

That may ultimately end up being necessary. However, when you say 'stock
kernel' you mean what CentOS ships, right? If so, that's not really a
3.10 kernel - the RHEL kernels (that CentOS is based on) are... somewhat
creative... about their versioning. So if you've switched to a vanilla
upstream kernel you may find bisecting difficult :/

How did you configure the new kernel? Did you start from scratch, or is
it based on the old CentOS config?

-Toke
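
For a quick userspace UDP round-trip check along the lines of the
measurement suggested in this message, something like the Python sketch
below might do as a first pass. It is not irtt and does not speak irtt's
protocol; the function names (serve_echo, measure_rtt), the 4-byte
sequence-number payload and the timing parameters are all illustrative
assumptions. The idea would be to run the echo side on one router and
the probe side on the other, once while idle and once during the
transfer, and compare the numbers with what ICMP ping reports.

```python
import socket
import struct
import threading
import time


def serve_echo(sock: socket.socket, stop: threading.Event) -> None:
    """Echo every datagram back to its sender until `stop` is set."""
    sock.settimeout(0.2)  # wake up regularly to check the stop flag
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def measure_rtt(host: str, port: int, count: int = 10,
                interval: float = 0.01, timeout: float = 1.0) -> list:
    """Send `count` probes; return RTTs in ms (None for a lost probe)."""
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            sent = time.monotonic()
            s.sendto(struct.pack("!I", seq), (host, port))
            rtt = None
            try:
                while True:
                    data, _ = s.recvfrom(2048)
                    # Skip late replies that belong to earlier probes.
                    if len(data) == 4 and struct.unpack("!I", data)[0] == seq:
                        rtt = (time.monotonic() - sent) * 1000.0
                        break
            except socket.timeout:
                pass  # probe counted as lost
            rtts.append(rtt)
            time.sleep(interval)
    return rtts


if __name__ == "__main__":
    # Loopback demo only; on real routers, bind the echo side to the
    # router's address and point measure_rtt at it from the other router.
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    stop = threading.Event()
    worker = threading.Thread(target=serve_echo, args=(srv, stop), daemon=True)
    worker.start()
    samples = measure_rtt("127.0.0.1", srv.getsockname()[1])
    stop.set()
    worker.join()
    srv.close()
    print("RTT samples (ms):", samples)
```

Because both ends run in userspace, the numbers include scheduling and
socket overhead on each host, so they are best compared against each
other (idle vs. loaded) rather than against ICMP latency directly.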