On Mon, 09 Nov 2020 11:09:33 +0100 "Thomas Rosenstein" <[email protected]> wrote:

> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>
> > On Sat, 07 Nov 2020 14:00:04 +0100
> > Thomas Rosenstein via Bloat <[email protected]> wrote:
> >
> >> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just
> >> in case there's something hidden.
> >
> > Yes, there is something hiding in the data from ethtool_stats.pl[1]:
> > (10G Mellanox Connect-X cards via 10G SFP+ DAC)
> >
> >  stat:            1 (          1) <= outbound_pci_stalled_wr_events /sec
> >  stat:    339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
> >
> > I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking
> > in the kernel driver code, it is related to "rx_buffer_almost_full".
> > The number per second is excessive (but it may be related to a driver
> > bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
> > extended counters).

Note that this is a strong red flag that something is wrong.

> >  stat:     29583661 ( 29,583,661) <= rx_bytes /sec
> >  stat:     30343677 ( 30,343,677) <= rx_bytes_phy /sec
> >
> > You are receiving at 236 Mbit/s on a 10Gbit/s link.  There is a
> > difference between what the OS sees (rx_bytes) and what the NIC
> > hardware sees (rx_bytes_phy) (diff approx 6Mbit/s).
> >
> >  stat:        19552 (     19,552) <= rx_packets /sec
> >  stat:        19950 (     19,950) <= rx_packets_phy /sec
>
> Could these packets be from VLAN interfaces that are not used in the OS?
>
> >
> > The RX packet counters above also indicate that the HW is seeing more
> > packets than the OS is receiving.
> >
> > The next counters are likely your problem:
> >
> >  stat:          718 (        718) <= tx_global_pause /sec
> >  stat:       954035 (    954,035) <= tx_global_pause_duration /sec
> >  stat:          714 (        714) <= tx_pause_ctrl_phy /sec
>
> As far as I can see that's only the TX, and we are only doing RX on this
> interface - so maybe that's irrelevant?
>
> >
> > It looks like you have enabled Ethernet Flow-Control, and something is
> > causing pause frames to be generated.  It seems strange that this
> > happens on a 10Gbit/s link with only 236 Mbit/s.
> >
> > The TX byte counters are also very strange:
> >
> >  stat:        26063 (     26,063) <= tx_bytes /sec
> >  stat:        71950 (     71,950) <= tx_bytes_phy /sec
>
> Also, it's TX, and we are only doing RX, as I said already somewhere,
> it's async routing, so the TX data comes via another router back.

Okay, but as this is a router you also need to transmit this
(asymmetric) traffic out another interface, right?

Could you also provide ethtool_stats for the TX interface?

Notice that the tool[1] ethtool_stats.pl supports monitoring several
interfaces at the same time, e.g. run:

 ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX

And provide the output as a pastebin.

> > [1]
> > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >
> > Strange size distribution:
> >  stat:        19922 (     19,922) <= rx_1519_to_2047_bytes_phy /sec
> >  stat:           14 (         14) <= rx_65_to_127_bytes_phy /sec
>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
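
For reference, the arithmetic behind the figures quoted in this mail (236 Mbit/s, the approx 6 Mbit/s gap between OS and PHY counters, and the extra packets seen by the hardware) can be sketched as below. This is only an illustrative calculation on the per-second values copied from the quoted ethtool_stats.pl output; it is not part of that tool. The `pause_fraction` line additionally assumes that `tx_global_pause_duration` is reported in microseconds, which is not stated anywhere in this thread and should be checked against the driver documentation.

```python
# Per-second counter values copied from the ethtool_stats.pl output quoted above.
stats = {
    "rx_bytes": 29_583_661,
    "rx_bytes_phy": 30_343_677,
    "rx_packets": 19_552,
    "rx_packets_phy": 19_950,
    "tx_global_pause_duration": 954_035,
}


def mbit_per_sec(bytes_per_sec):
    """Convert a bytes/sec counter rate to Mbit/s (decimal megabits)."""
    return bytes_per_sec * 8 / 1_000_000


# Rate seen by the OS versus rate seen by the NIC hardware (PHY).
os_rate = mbit_per_sec(stats["rx_bytes"])
phy_rate = mbit_per_sec(stats["rx_bytes_phy"])
byte_gap = phy_rate - os_rate

# Packets per second that the hardware saw but the OS did not.
pkt_gap = stats["rx_packets_phy"] - stats["rx_packets"]

# ASSUMPTION: the duration counter is in microseconds; if so, this is the
# fraction of each second during which pause was being signalled.
pause_fraction = stats["tx_global_pause_duration"] / 1_000_000

print(f"OS RX rate : {os_rate:.1f} Mbit/s")
print(f"PHY RX rate: {phy_rate:.1f} Mbit/s")
print(f"Gap        : {byte_gap:.1f} Mbit/s, {pkt_gap} packets/sec")
print(f"Pause share: {pause_fraction:.1%} (only if the unit is microseconds)")
```

If the microsecond assumption holds, the pause share would be very high for a link loaded at roughly 2% of line rate, which fits the suspicion above that flow control is involved.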
