The fun continues, this time using an OpenStack deployment on physical hardware that supports jumbo frames (9000 MTU) with both IPv4 and IPv6. This experiment still uses Linux bridge for consistency. I'm planning to run similar experiments with Open vSwitch and Open Virtual Network (OVN) in the next week.
I highly recommend reading further, but here's the TL;DR: using physical network interfaces with MTUs larger than 1500 reveals an additional problem with the veth pair for the neutron router interface on the public network. Additionally, IP protocol version does not impact MTU calculation for Linux bridge.

First, review the OpenStack bits and resulting network components in the environment [1]. In the first experiment, public cloud network limitations prevented us from fully seeing how Linux bridge (actually the kernel) handles physical network interfaces with MTUs larger than 1500. In this experiment, we see that the kernel automatically calculates the proper MTU for bridges and VXLAN interfaces using the MTU of their parent devices. Also, see that a regular 'ping' works between the host outside of the deployment and the VM [2].

[1] https://gist.github.com/ionosphere80/a3725066386d8ca4c6d7
[2] https://gist.github.com/ionosphere80/a8d601a356ac6c6274cb

Note: The tcpdump output in each case references up to six points: the neutron router gateway on the public network (qg), the namespace end of the veth pair for the neutron router interface on the private network (qr), the bridge end of the veth pair for the router interface on the private network (tap), the controller node end of the VXLAN network (underlying interface), the compute node end of the VXLAN network (underlying interface), and the bridge end of the tap for the VM (tap).

In the first experiment, SSH "stuck" because of an MTU mismatch on the veth pair between the router namespace and the private network bridge. In this experiment, SSH works because the VM network interface uses a 1500 MTU and all devices along the path between the host and the VM use a 1500 or larger MTU. So, let's configure the VM network interface to use the proper MTU of 9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH again.
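As an aside, the 50 bytes of VXLAN overhead break down into the outer IPv4, UDP, and VXLAN headers plus the inner Ethernet header. A back-of-the-envelope sketch (the constant names are mine; an IPv6 underlay would add another 20 bytes of outer header):

```python
# Breakdown of VXLAN encapsulation overhead on an IPv4 underlay.
# Constant names are illustrative; header sizes are the standard ones.
OUTER_IPV4 = 20  # outer IPv4 header (no options)
OUTER_UDP = 8    # outer UDP header
VXLAN_HDR = 8    # VXLAN header (flags + VNI + reserved)
INNER_ETH = 14   # inner Ethernet header (untagged)

overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH
overlay_mtu = 9000 - overhead
print(overhead, overlay_mtu)  # 50 8950
```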
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen 1000
    link/ether fa:16:3e:46:ac:d3 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
    inet6 fd00:100:52:1:f816:3eff:fe46:acd3/64 scope global dynamic
       valid_lft 86395sec preferred_lft 14395sec
    inet6 fe80::f816:3eff:fe46:acd3/64 scope link
       valid_lft forever preferred_lft forever

SSH doesn't work with IPv4 or IPv6. Adding a slight twist to the first experiment, I don't even see the large packet traversing the neutron router gateway on the public network. So, I began a tcpdump closer to the source, on the bridge end of the veth pair for the neutron router interface on the public network. Looking at [3], the veth pair between the router namespace and the public network bridge drops the packet. The MTU changes over a layer-2 connection without a router, similar to connecting two switches with different MTUs. Even if it could participate in PMTUD, the veth pair lacks an IP address and therefore cannot originate ICMP messages.

[3] https://gist.github.com/ionosphere80/ec83d0955c79b05ea381

Using observations from the first experiment, let's configure the MTU of the interfaces in the qrouter namespace to match the other end of their respective veth pairs. The public network (gateway) interface MTU becomes 9000 and the private network router interfaces (IPv4 and IPv6) become 8950.
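The ping payload sizes that follow come from simple header arithmetic on the 8950 MTU. A quick sketch (constant names are mine, assuming no IPv4 options or IPv6 extension headers):

```python
# Maximum ICMP echo payload that fits in an 8950-byte MTU without
# fragmentation; constant names are illustrative.
MTU = 8950
IPV4_HDR = 20   # IPv4 header (no options)
IPV6_HDR = 40   # fixed IPv6 header
ICMP_ECHO = 8   # ICMP/ICMPv6 echo header

max_payload_v4 = MTU - IPV4_HDR - ICMP_ECHO
max_payload_v6 = MTU - IPV6_HDR - ICMP_ECHO
print(max_payload_v4, max_payload_v6)  # 8922 8902
```

One byte more than these values forces the 8951-byte IPv4 packet seen in the ping output below.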
2: qr-49b27408-04: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:e5:43:1c brd ff:ff:ff:ff:ff:ff
3: qr-b7e0ef22-32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:16:01:92 brd ff:ff:ff:ff:ff:ff
4: qg-7bbe8e38-cc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:2b:c1:fd brd ff:ff:ff:ff:ff:ff

Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the maximum for a VXLAN segment with an 8950 MTU, and look at the tcpdump output [4]. For brevity, I'm only showing tcpdump output from the VM tap interface. Ping operates normally.

# ping -c 1 -s 8922 -M do 10.100.52.104
# ping -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:fe46:acd3

[4] https://gist.github.com/ionosphere80/85339b587bb9b2693b07

Now let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one byte larger than the maximum for a VXLAN segment with an 8950 MTU. The router namespace, operating at layer 3, sees the MTU discrepancy between the two interfaces in the namespace and returns an ICMP "fragmentation needed" or "packet too big" message to the sender. The sender uses the MTU value in the ICMP message to recalculate the length of the first packet and caches it for future packets.

# ping -c 1 -s 8923 -M do 10.100.52.104
PING 10.100.52.104 (10.100.52.104) 8923(8951) bytes of data.
From 10.100.52.104 icmp_seq=1 Frag needed and DF set (mtu = 8950)

--- 10.100.52.104 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8903 -M do fd00:100:52:1:f816:3eff:fe46:acd3
PING fd00:100:52:1:f816:3eff:fe46:acd3(fd00:100:52:1:f816:3eff:fe46:acd3) 8903 data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=8950

--- fd00:100:52:1:f816:3eff:fe46:acd3 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ip route get to 10.100.52.104
10.100.52.104 dev eth1 src 10.100.52.45
    cache expires 596sec mtu 8950
# ip route get to fd00:100:52:1:f816:3eff:fe46:acd3
fd00:100:52:1:f816:3eff:fe46:acd3 from :: via fd00:100:52::101 dev eth1 src fd00:100:52::45 metric 0
    cache expires 556sec mtu 8950

Finally, let's try SSH.

# ssh [email protected]
[email protected]'s password:
$
# ssh cirros@fd00:100:52:1:f816:3eff:fe46:acd3
cirros@fd00:100:52:1:f816:3eff:fe46:acd3's password:
$

SSH works for both IPv4 and IPv6. This experiment reaches the same conclusion as the first experiment. However, using physical hardware that supports jumbo frames reveals an additional problem with the veth pair for the neutron router interface on the public network. For any MTU, we can address the egress MTU disparity (from the VM) by advertising the MTU of the overlay network to the VM via DHCP/RA or using manual interface configuration. Additionally, IP protocol version does not impact MTU calculation for Linux bridge. Hopefully moving to physical hardware makes this experiment easier to understand and the conclusion more useful for realistic networks.

Matt

On Wed, Jan 20, 2016 at 11:18 AM, Rick Jones <[email protected]> wrote:
> On 01/20/2016 08:56 AM, Sean M. Collins wrote:
>> On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote:
>>> No. However, we ought to determine what happens when both DHCP and RA
>>> advertise it.
>>
>> We'd have to look at the RFCs for how hosts are supposed to behave since
>> IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum MTU is 576
>> (what is this, an MTU for ants?).
>
> Quibble - 576 is the IPv4 minimum, maximum MTU. That is to say, a
> compliant IPv4 implementation must be able to reassemble datagrams of at
> least 576 bytes.
>
> If memory serves, the actual minimum MTU for IPv4 is 68 bytes.
>
> rick jones
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
