The fun continues, this time using an OpenStack deployment on physical hardware that supports jumbo frames (9000 MTU) with both IPv4 and IPv6. This experiment still uses Linux bridge for consistency. I'm planning to run similar experiments with Open vSwitch and Open Virtual Network (OVN) in the next week.
I highly recommend reading further, but here's the TL;DR: using physical network interfaces with MTUs larger than 1500 reveals an additional problem with the veth pair for the neutron router interface on the public network. Additionally, IP protocol version does not impact MTU calculation for Linux bridge.

First, review the OpenStack bits and resulting network components in the environment [1]. In the first experiment, public cloud network limitations prevented us from fully seeing how Linux bridge (actually the kernel) handles physical network interfaces with MTUs larger than 1500. In this experiment, we see that the kernel automatically calculates the proper MTU for bridges and VXLAN interfaces using the MTU of their parent devices. Also, see that a regular 'ping' works between the host outside of the deployment and the VM [2].

[1] https://gist.github.com/ionosphere80/a3725066386d8ca4c6d7
[2] https://gist.github.com/ionosphere80/a8d601a356ac6c6274cb

Note: The tcpdump output in each case references up to six points: the neutron router gateway on the public network (qg), the namespace end of the veth pair for the neutron router interface on the private network (qr), the bridge end of the veth pair for the router interface on the private network (tap), the controller node end of the VXLAN network (underlying interface), the compute node end of the VXLAN network (underlying interface), and the bridge end of the tap for the VM (tap).

In the first experiment, SSH "stuck" because of an MTU mismatch on the veth pair between the router namespace and the private network bridge. In this experiment, SSH works because the VM network interface uses a 1500 MTU and all devices along the path between the host and the VM use a 1500 or larger MTU. So, let's configure the VM network interface to use the proper MTU of 9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH again.
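As an aside, the 50 bytes of VXLAN overhead break down into the outer IPv4, UDP, and VXLAN headers plus the inner Ethernet header. A back-of-the-envelope sketch (the constant names are mine; an IPv6 underlay would add another 20 bytes of outer header):

```python
# Breakdown of VXLAN encapsulation overhead on an IPv4 underlay.
# Constant names are illustrative; header sizes are the standard ones.
OUTER_IPV4 = 20  # outer IPv4 header (no options)
OUTER_UDP = 8    # outer UDP header
VXLAN_HDR = 8    # VXLAN header (flags + VNI + reserved)
INNER_ETH = 14   # inner Ethernet header (untagged)

overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH
overlay_mtu = 9000 - overhead
print(overhead, overlay_mtu)  # 50 8950
```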
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen 1000
    link/ether fa:16:3e:46:ac:d3 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
    inet6 fd00:100:52:1:f816:3eff:fe46:acd3/64 scope global dynamic
       valid_lft 86395sec preferred_lft 14395sec
    inet6 fe80::f816:3eff:fe46:acd3/64 scope link
       valid_lft forever preferred_lft forever

SSH doesn't work with IPv4 or IPv6. Adding a slight twist to the first experiment, I don't even see the large packet traversing the neutron router gateway on the public network. So, I began a tcpdump closer to the source, on the bridge end of the veth pair for the neutron router interface on the public network. Looking at [3], the veth pair between the router namespace and the public network bridge drops the packet. The MTU changes over a layer-2 connection without a router, similar to connecting two switches with different MTUs. Even if it could participate in PMTUD, the veth pair lacks an IP address and therefore cannot originate ICMP messages.

[3] https://gist.github.com/ionosphere80/ec83d0955c79b05ea381

Using observations from the first experiment, let's configure the MTU of the interfaces in the qrouter namespace to match the other end of their respective veth pairs. The public network (gateway) interface MTU becomes 9000 and the private network router interfaces (IPv4 and IPv6) become 8950.
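The ping payload sizes that follow come from simple header arithmetic on the 8950 MTU. A quick sketch (constant names are mine, assuming no IPv4 options or IPv6 extension headers):

```python
# Maximum ICMP echo payload that fits in an 8950-byte MTU without
# fragmentation; constant names are illustrative.
MTU = 8950
IPV4_HDR = 20   # IPv4 header (no options)
IPV6_HDR = 40   # fixed IPv6 header
ICMP_ECHO = 8   # ICMP/ICMPv6 echo header

max_payload_v4 = MTU - IPV4_HDR - ICMP_ECHO
max_payload_v6 = MTU - IPV6_HDR - ICMP_ECHO
print(max_payload_v4, max_payload_v6)  # 8922 8902
```

One byte more than these values forces the 8951-byte IPv4 packet seen in the ping output below.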
2: qr-49b27408-04: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:e5:43:1c brd ff:ff:ff:ff:ff:ff
3: qr-b7e0ef22-32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:16:01:92 brd ff:ff:ff:ff:ff:ff
4: qg-7bbe8e38-cc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:2b:c1:fd brd ff:ff:ff:ff:ff:ff

Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the maximum for a VXLAN segment with an 8950 MTU, and look at the tcpdump output [4]. For brevity, I'm only showing tcpdump output from the VM tap interface. Ping operates normally.

# ping -c 1 -s 8922 -M do 10.100.52.104
# ping -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:fe46:acd3

[4] https://gist.github.com/ionosphere80/85339b587bb9b2693b07

Now let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one byte larger than the maximum for a VXLAN segment with an 8950 MTU. The router namespace, operating at layer 3, sees the MTU discrepancy between the two interfaces in the namespace and returns an ICMP "fragmentation needed" or "packet too big" message to the sender. The sender uses the MTU value in the ICMP message to recalculate the length of the first packet and caches it for future packets.

# ping -c 1 -s 8923 -M do 10.100.52.104
PING 10.100.52.104 (10.100.52.104) 8923(8951) bytes of data.
From 10.100.52.104 icmp_seq=1 Frag needed and DF set (mtu = 8950)

--- 10.100.52.104 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8903 -M do fd00:100:52:1:f816:3eff:fe46:acd3
PING fd00:100:52:1:f816:3eff:fe46:acd3(fd00:100:52:1:f816:3eff:fe46:acd3) 8903 data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=8950

--- fd00:100:52:1:f816:3eff:fe46:acd3 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ip route get to 10.100.52.104
10.100.52.104 dev eth1 src 10.100.52.45
    cache expires 596sec mtu 8950
# ip route get to fd00:100:52:1:f816:3eff:fe46:acd3
fd00:100:52:1:f816:3eff:fe46:acd3 from :: via fd00:100:52::101 dev eth1 src fd00:100:52::45 metric 0
    cache expires 556sec mtu 8950

Finally, let's try SSH.

# ssh [email protected]
[email protected]'s password:
$
# ssh cirros@fd00:100:52:1:f816:3eff:fe46:acd3
cirros@fd00:100:52:1:f816:3eff:fe46:acd3's password:
$

SSH works for both IPv4 and IPv6. This experiment reaches the same conclusion as the first experiment. However, using physical hardware that supports jumbo frames reveals an additional problem with the veth pair for the neutron router interface on the public network. For any MTU, we can address the egress MTU disparity (from the VM) by advertising the MTU of the overlay network to the VM via DHCP/RA or using manual interface configuration. Additionally, IP protocol version does not impact MTU calculation for Linux bridge. Hopefully moving to physical hardware makes this experiment easier to understand and the conclusion more useful for realistic networks.

Matt

On Wed, Jan 20, 2016 at 11:18 AM, Rick Jones <[email protected]> wrote:
> On 01/20/2016 08:56 AM, Sean M. Collins wrote:
>> On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote:
>>> No. However, we ought to determine what happens when both DHCP and RA
>>> advertise it.
>>
>> We'd have to look at the RFCs for how hosts are supposed to behave since
>> IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum MTU is 576
>> (what is this, an MTU for ants?).
>
> Quibble - 576 is the IPv4 minimum, maximum MTU. That is to say, a
> compliant IPv4 implementation must be able to reassemble datagrams of at
> least 576 bytes.
>
> If memory serves, the actual minimum MTU for IPv4 is 68 bytes.
>
> rick jones
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
