Hi,

Thanks for the detailed bug report.

Note that I have also written some scapy scripts to test path MTU
discovery:  /usr/src/regress/sys/netinet/pmtu/tcp_connect.py
and tcp_connect6.py.
Sometimes these tests fail, so PMTU may have bugs.  Or my tests are
just unreliable.

What does the route look like where the path MTU is saved?
netstat -rn has a Mtu column.
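
If the routing table is large (full tables), a small filter may help to
find the routes that carry a lowered MTU.  This is only a sketch: the
function names are made up for illustration, and it assumes that the
output has a header line containing "Destination" and "Mtu" and that
every column is populated in each row, which should be checked against
the real output first.

```python
import subprocess


def route_mtus(netstat_text):
    """Return (destination, mtu) pairs from `netstat -rn` style text.

    The Mtu column is located by its header name, so the exact column
    order does not matter as long as every column is populated.
    """
    mtu_col = None
    routes = []
    for line in netstat_text.splitlines():
        fields = line.split()
        if not fields:
            mtu_col = None          # a blank line ends the current table
        elif "Destination" in fields and "Mtu" in fields:
            mtu_col = fields.index("Mtu")
        elif mtu_col is not None and len(fields) > mtu_col:
            routes.append((fields[0], fields[mtu_col]))
    return routes


def lowered_mtus(netstat_text, link_mtu=1500):
    """Keep only routes whose Mtu is numeric and below link_mtu."""
    return [(dst, int(mtu)) for dst, mtu in route_mtus(netstat_text)
            if mtu.isdigit() and int(mtu) < link_mtu]


def current_ipv6_routes():
    """Run netstat on the local machine (not called automatically)."""
    return subprocess.run(["netstat", "-rn", "-f", "inet6"],
                          capture_output=True, text=True, check=True).stdout
```

Calling lowered_mtus(current_ipv6_routes()) on the affected box should
list the destinations for which a path MTU below 1500 has been stored,
if any.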

I see lines like this in the dump.
12:52:50.064789 2a06:d1c0::2.179 > 2a06:d1c0::b.54616: P 69:2147(2078) ack 83 
win 1023 <nop,nop,timestamp 3903202654 2152766359>: BGP (UPDATE: (Path 
attributes: (ORIGIN[T] IGP)

Packet size 2078 seems large.  Do you use jumbo frames?
On which machine did you make the tcpdump?  OpenBSD?
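
To show why 2078 looks suspicious, here is a rough calculation of the
largest TCP payload that fits into one IPv6 packet for a given MTU.
It is a sketch that assumes no IPv6 extension headers and only the
12 byte timestamp option seen in the dump; the function name is
hypothetical.

```python
IPV6_HEADER = 40         # fixed IPv6 header, no extension headers assumed
TCP_HEADER = 20          # TCP header without options
TCP_TIMESTAMP_OPT = 12   # nop,nop,timestamp as seen in the dump


def max_tcp_payload(mtu, timestamps=True):
    """Largest TCP payload that fits in one IPv6 packet of size mtu."""
    payload = mtu - IPV6_HEADER - TCP_HEADER
    if timestamps:
        payload -= TCP_TIMESTAMP_OPT
    return payload


def fits_on_wire(payload_len, mtu, timestamps=True):
    """True if a segment with payload_len bytes fits into one packet."""
    return payload_len <= max_tcp_payload(mtu, timestamps)
```

For MTU 1500 this gives 1428 bytes with timestamps and 1440 without,
the latter matching the MSS in the report.  A 2078 byte segment does
not fit either way, so it is either a jumbo frame or was captured
before TSO split it.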

You should disable TCP Segmentation Offload.  Otherwise you never
know the packet sizes on the wire.
    sysctl net.inet.tcp.tso=0
Note that OpenBSD supports Large Receive Offload only on ix(4).
Other hardware interfaces don't do it.
    ifconfig ix0 -tcplro
On Linux you can use ethtool to disable offloading.

Does the packet size change in tcpdump when you turn off TSO?

bluhm


On Thu, Mar 07, 2024 at 01:58:05PM +0100, Tobias Fiebig wrote:
> Moin,
> 
> I have run into some issues with v6 PMTUD on OpenBSD 7.4, and am
> somewhat at a loss on how to proceed finding a proper reproducer.
> 
> I first brushed into MTU issues when some of my mailers suddenly
> started to put out ~50mbit of traffic with no apparent reason. Back
> then further debugging led to the following observations:
> 
> - I received connections from a host behind a HE IPv6 tunnel; This 
>   communicated an MSS of 1440 (MTU 1500)
> - Sending return packets, I received Packet-Too-Big ICMP messages from
>   the HE Tunnel Host, indicating an MTU of 1480.
> - OpenBSD reset the MTU to 1480 and resent
> - I would receive another Packet-Too-Big ICMP message from the HE
>   Tunnel Host, indicating an MTU of 1480; OpenBSD would set the MTU to
>   1480 and resend the packet
> 
> The root cause back then was some form of (legacy?) misconfiguration on
> the HE side, as the link actually had an MTU of 1472, which was
> incorrectly reported in the packet-too-big messages by the HE router.
> 
> However, the additional issue seems to be that OpenBSD retransmits
> endlessly on packet-too-big if the reported MTU is the same as the
> already discovered PMTU.
> 
> I had initially benched that issue, putting on my todo to do a proper
> write-up and build a tool to remotely trigger this. There might be some
> amplification potential here by abusing, e.g., high-BW HE tunnel
> endpoints to make some dst. send a large amount of outbound traffic;
> but I could not get this working reliably with scapy. Very scrappy,
> cobbled-together code for Linux, based on an example snippet to do an
> HTTP request 'by foot', can be found here; it might need some fixing
> before it works:
> 
> https://rincewind.home.aperture-labs.org/~tfiebig/pmtud_code/http_request.py
> https://rincewind.home.aperture-labs.org/~tfiebig/pmtud_code/http_request_v6.py
> 
> Note that this needs additional firewalling on the client so the linux
> kernel does not interfere with the TCP sessions, i.e., preventing the
> client from sending RST.
> 
> Also, this is specific to the IPv6 implementation; for IPv4, OpenBSD
> runs down to a minimal MTU (below min. MTU for v4 btw) when repeatedly
> receiving PTB ICMP messages. For v6 it does not do this, likely due to
> the logic being different in relation to the higher (1280) min MTU.
> 
> Recently, this then hit me again, when gw02.dus01.as59645.net put
> ~1gbit of traffic on the path to gw01.ams01.as59645.net. This occurred
> after I had set up a test setup in a third location; This location is
> connected to gw01.ams01 via a MTU 1400 link (vxlan tunnel over IPv6 due
> to lack of fragmentation for v6).
> 
> When I installed a test device (gw02.dlft), I connected it via an MTU
> 1500 link to gw01.dlft01, and--to test something unrelated--via an MTU
> 1500 link (tunnel over v4 without fragmentation, handled transparently
> by an additional device that just pushes around VLANs).
> 
> All hosts have a BGP underlay using private ASNs (one per host) to
> distribute the global unicast addresses on the direct links. In
> addition, there is an iBGP setup between the hosts, exchanging
> fulltables and the non-router networks in use. These are handled via
> loopback addresses, which are also distributed via the BGP underlay.
> 
> See the diagram below:
> 
> +-----------------------+              +-----------------------+
> |gw01.dus01.as59645.net |              |gw01.ams01.as59645.net |
> |         JunOS         +--------------+      VyOS (Linux)     +---+
> |   lo: 2a06:d1c0::1    |              |   lo: 2a06:d1c0::a    |   |
> +-----------+-----------+              +-----------+-----------+   |
>             |                                      /               |
>             |                                      \               |
>             |                         MTU: 1400 -> /               |
>             |                                      \               |
>             |                                      /               |
>             |                                      \               |
> +-----------+-----------+             +------------+----------+    |
> |gw02.dus01.as59645.net |             |gw01.dlft01.as59645.net|    |
> |       OpenBSD 7.4     |             |      VyOS (Linux)     |    |
> |   lo: 2a06:d1c0::2    |             |   lo: 2a06:d1c0::9    |    |
> +-----------------------+             +------------+----------+    |
>                                                    |               |
>                                                    |               |
>              +-------------------------------------+               |
>              |                                                     |
> +------------+----------+                                          |
> |gw02.dlft01.as59645.net+------------------------------------------+
> |         JunOS         |
> |   lo: 2a06:d1c0::9    |
> +-----------------------+
> 
> What now happened is that gw02.dlft01 opened a connection to gw02.dus01
> to start an iBGP session. These packets flowed gw02.dlft01 -> gw01.ams01
> -> gw01.dus01 -> gw02.dus01. Return packets, however, flowed gw02.dus01
> -> gw01.dus01 -> gw01.ams01 -> gw01.dlft01. Hence, on the link
> gw01.ams01 -> gw01.dlft01, they exceeded the path MTU, and gw01.ams01
> started to send PTB ICMP messages notifying a correct MTU of 1400.
> 
> However, gw02.dus01 only retransmits the previous packet, and does not
> decrease the MSS/MTU, leading to another PTB ICMP message etc. up until
> the link being saturated (or rather: The virtio NIC of gw02.dus01
> capping transmission).
> 
> Pcap here: https://rincewind.home.aperture-labs.org/~tfiebig/mtu.pcap
> 
> Further digging around a couple of OpenBSD 7.4 oob-ish boxes that also
> learn their routes to these hosts via BGP and face the same async
> routing, I found that they show the same behavior, even if there is no
> loopback bound address involved. Furthermore, I could also see this
> behavior when manually starting a TCP connection that would create
> more-than-MTU-sized packets.
> 
> However, for OpenBSD hosts just holding a default route, even when hard
> setting the MTU in a more specific route, this does not occur.
> Similarly, all other routers (running mostly linux/vyos and omitted in
> the diagram) do not exhibit this MTU behavior.
> 
> I also setup a test-network similar to the above, but could not
> reproduce the issue there so far; This leads me to suspect that--for
> the BGP issue--there is also an inter-op component going wrong.
> Finally, the issue also occurred when gw01.dus01 was a VyOS.
> 
> At the moment I do see two direct issues:
> - Until-timeout retransmission when receiving same-MTU sized PTB ICMP6 
>   messages
> - Going below minimum MTU for IPv4 when continuously facing
>   packet-too-big messages asking for an MTU >= the size of the sent packet
> 
> Please note, btw, that RFC1191 describing PMTUD in general leaves the
> question of 'what to do when the requested MTU is >= the size of the
> sent packet' undefined. However, RFC4443 notes that a host must limit
> the number of ICMP6 error messages, which is obviously ignored by
> linux, as it seems, and a bug over there.
> 
> The issue that still needs to be found is why gw02.dus01 ignores the
> pmtud packets when the route is learned via BGP. However, I am
> currently at a loss re: finding a reproducing configuration and can
> only find this issue in the live boxes; There also is a very real
> chance of me just 'holding things wrong', though.
> 
> Looking forward to further input.
> 
> With best regards,
> Tobias
> 
