Another place to look...
I've had to use network_device_mtu=9000 in nova's config as well to get MTUs 
working smoothly.
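For reference, that's:

```ini
[DEFAULT]
network_device_mtu = 9000
```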

Thanks,
Kevin
________________________________
From: Matt Kassawara [[email protected]]
Sent: Monday, January 25, 2016 5:00 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Neutron] MTU configuration pain

Results from the Open vSwitch agent...

I highly recommend reading further, but here's the TL;DR: using physical 
network interfaces with MTUs larger than 1500 reveals problems in several 
places on both the controller and compute nodes, but only in Linux components 
rather than Open vSwitch components (such as br-int). Most of the problems 
involve MTU disparities in the security group bridge components on the compute 
node.

First, review the OpenStack bits and resulting network components in the 
environment [1] and see that a typical 'ping' works using IPv4 and IPv6 [2].

[1] https://gist.github.com/ionosphere80/23655bedd24730d22c89
[2] https://gist.github.com/ionosphere80/5f309e7021a830246b66

Note: The tcpdump output in each case references up to seven points: neutron 
router gateway on the public network (qg), namespace end of the neutron router 
interface on the private network (qr), controller node end of the VXLAN network 
(underlying interface), compute node end of the VXLAN network (underlying 
interface), Open vSwitch end of the veth pair for the security group bridge 
(qvo), Linux bridge end of the veth pair for the security group bridge (qvb), 
and the bridge end of the tap for the VM (tap).

I can use SSH to access the VM because every component between my host and the 
VM supports at least a 1500 MTU. So, let's configure the VM network interface 
to use the proper MTU of 9000 minus the VXLAN protocol overhead of 50 bytes... 
8950... and try SSH again.
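Inside the VM, that amounts to something like:

```shell
# Inside the VM: set the interface MTU to the physical MTU (9000)
# minus the 50-byte VXLAN overhead
ip link set dev eth0 mtu 8950
```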

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen 1000
    link/ether fa:16:3e:ea:22:3a brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
    inet6 fd00:100:52:1:f816:3eff:feea:223a/64 scope global dynamic
       valid_lft 86396sec preferred_lft 14396sec
    inet6 fe80::f816:3eff:feea:223a/64 scope link
       valid_lft forever preferred_lft forever

Contrary to the Linux bridge experiment, I can still use SSH to access the VM. 
Why?

Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the maximum 
for a VXLAN segment with 8950 MTU.
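Quick arithmetic behind those payload sizes (ICMP echo payload = interface MTU 
minus the IP and ICMP headers):

```shell
# ICMP echo payload = interface MTU - IP header - ICMP header
echo $((8950 - 20 - 8))   # 8922 for IPv4 (20-byte IPv4 header + 8-byte ICMP)
echo $((8950 - 40 - 8))   # 8902 for IPv6 (40-byte IPv6 header + 8-byte ICMPv6)
```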

# ping -c 1 -s 8922 -M do 10.100.52.102
PING 10.100.52.102 (10.100.52.102) 8922(8950) bytes of data.
From 10.100.52.102 icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- 10.100.52.102 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:feea:223a
PING fd00:100:52:1:f816:3eff:feea:223a(fd00:100:52:1:f816:3eff:feea:223a) 8902 
data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=1500

--- fd00:100:52:1:f816:3eff:feea:223a ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Look at the tcpdump output [3]. The router namespace, operating at layer-3, 
sees the MTU discrepancy between inbound packet and the neutron router gateway 
on the public network and returns an ICMP "fragmentation needed" or "packet too 
big" message to the sender. The sender uses the MTU value in the ICMP packet to 
recalculate the length of the first packet and caches it for future packets.

[3] https://gist.github.com/ionosphere80/4e1389a34fd3a628b294

Although PMTUD enables communication between my host and the VM, it limits MTU 
to 1500 regardless of the MTU between the router namespace and VM and therefore 
could impact performance on 10 Gbps or faster networks. Also, it does not 
address the MTU disparity between a VM and network components on the compute 
node. If a VM uses a 1500 or smaller MTU, it cannot send packets that exceed 
the MTU of the tap interface, veth pairs, and bridge on the compute node. In 
this situation, which seems fairly typical for operators trying to work around 
MTU problems, communication between a host (outside of OpenStack) and a VM 
always works. However, what if a VM uses an MTU larger than 1500 and attempts 
to send a large packet? The bridge or veth pairs would drop it because of the 
MTU disparity.

Using observations from the Linux bridge experiment, let's configure the MTU of 
the interfaces in the router namespace to match the interfaces outside of the 
namespace. The public network (gateway) interface MTU becomes 9000 and the 
private network router interfaces (IPv4 and IPv6) become 8950.

31: qr-d744191c-9d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue 
state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:34:67:40 brd ff:ff:ff:ff:ff:ff
32: qr-ae54b450-b4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue 
state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:d4:f1:63 brd ff:ff:ff:ff:ff:ff
33: qg-e3303f07-e7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue 
state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:70:09:54 brd ff:ff:ff:ff:ff:ff
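The namespace-side changes amount to something like this (router namespace 
UUID elided; interface names from this environment):

```shell
# On the controller node, inside the neutron router namespace:
# gateway interface matches the physical MTU, private-network
# interfaces subtract the 50-byte VXLAN overhead
ip netns exec qrouter-<uuid> ip link set dev qg-e3303f07-e7 mtu 9000
ip netns exec qrouter-<uuid> ip link set dev qr-d744191c-9d mtu 8950
ip netns exec qrouter-<uuid> ip link set dev qr-ae54b450-b4 mtu 8950
```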

Let's ping again with a payload size of 8922 for IPv4, the maximum for a VXLAN 
segment with 8950 MTU, and look at the tcpdump output [4]. For brevity, I'm 
only showing IPv4 because IPv6 provides similar results.

# ping -c 1 -s 8922 -M do 10.100.52.102

[4] https://gist.github.com/ionosphere80/703925fbe4ae53e78445

The packet traverses the Open vSwitch infrastructure including the overlay. 
However, looking at the compute node, the integration bridge drops the packet 
because the MTU changes from 8950 to 1500 over a layer-2 connection without a 
router.

Let's increase the MTU on the OVS end of the veth pair to 8950, and ping again 
using the same payload. For brevity, I'm only showing tcpdump output for 
interfaces on the compute node [5].

# ping -c 1 -s 8922 -M do 10.100.52.102

[5] https://gist.github.com/ionosphere80/0f0d4cf346ee81e43cbb

The packet gets one step further. The veth pair between the Open vSwitch 
integration bridge and security group bridge drops the packet because the MTU 
changes from 8950 to 1500 over a layer-2 connection without a router.

Let's increase the MTU on the Linux bridge end of the veth pair to 8950 and 
ping again using the same payload. For brevity, I'm only showing tcpdump output 
for interfaces on the compute node [6].

[6] https://gist.github.com/ionosphere80/dd9270aae23ad286d9cd

The packet gets one step further. The VM tap interface drops the packet because 
the MTU changes from 8950 to 1500 over a layer-2 connection without a router.

Let's perform the final MTU increase on the VM tap interface and ping again 
using the same payload. For brevity, I'm only showing tcpdump output for 
interfaces on the compute node [7].

[7] https://gist.github.com/ionosphere80/05e02c7a753fad4b2964
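To recap, the compute node fixes were along these lines (port ID suffixes 
elided):

```shell
# On the compute node: security group bridge components for the VM's port
ip link set dev qvo<port-id> mtu 8950   # OVS end of the veth pair
ip link set dev qvb<port-id> mtu 8950   # bridge end of the veth pair
ip link set dev tap<port-id> mtu 8950   # tap for the VM
```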

Ping works.

Let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one byte 
larger than the maximum for a VXLAN segment with 8950 MTU. The router 
namespace, operating at layer-3, sees the MTU discrepancy between the two 
interfaces in the namespace and returns an ICMP "fragmentation needed" or 
"packet too big" message to the sender. The sender uses the MTU value in the 
ICMP packet to recalculate the length of the first packet and caches it for 
future packets.

# ping -c 1 -s 8923 -M do 10.100.52.102
PING 10.100.52.102 (10.100.52.102) 8923(8951) bytes of data.
From 10.100.52.102 icmp_seq=1 Frag needed and DF set (mtu = 8950)

--- 10.100.52.102 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8903 -M do fd00:100:52:1:f816:3eff:feea:223a
PING fd00:100:52:1:f816:3eff:feea:223a(fd00:100:52:1:f816:3eff:feea:223a) 8903 
data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=8950

--- fd00:100:52:1:f816:3eff:feea:223a ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ip route get to 10.100.52.102
10.100.52.102 dev eth1  src 10.100.52.45
    cache  expires 499sec mtu 8950

# ip route get to fd00:100:52:1:f816:3eff:feea:223a
fd00:100:52:1:f816:3eff:feea:223a from :: via fd00:100:52::101 dev eth1  src 
fd00:100:52::45  metric 0
    cache  expires 544sec mtu 8950
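Those cached entries come from PMTUD; you can re-check or clear them with 
standard iproute2 commands:

```shell
ip route get 10.100.52.102   # shows 'cache ... mtu 8950' while the entry is live
ip route flush cache         # drop cached path MTUs before retesting
```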

This experiment reveals a number of problems with the Open vSwitch agent, none 
of which seem to involve Open vSwitch itself.

1) Like the Linux bridge agent, interfaces in namespaces assume a 1500 MTU 
which prevents communication with VMs using larger packets. However, the method 
OVS uses to manage interfaces in namespaces permits them to generate ICMP 
messages for PMTUD that notify senders of the correct MTU.
2) Although interfaces in namespaces generate ICMP messages for PMTUD, they 
assume a 1500 MTU and therefore limit performance on 10 Gbps or faster networks 
regardless of the MTU between the router namespace and a VM.
3) The Open vSwitch agent creates Linux bridges on compute nodes to implement 
security groups. These bridges do not contain ports on physical network 
interfaces (using a larger MTU) and therefore assume a 1500 MTU. The veth pairs 
and tap interfaces also assume a 1500 MTU. Unlike with the Linux bridge agent, 
increasing the MTU of only the namespace end of the veth pair for the neutron 
router interface on the private network simply moves the problem to the 
security group bridge components. The latter components (qvo, qvb, and tap) should all use the 
MTU of the physical network minus the overlay protocol overhead, or 8950 for 
VXLAN in this particular experiment.
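For completeness, the 50-byte VXLAN overhead in point 3 breaks down as the 
outer IPv4 header, outer UDP header, VXLAN header, and the encapsulated inner 
Ethernet header:

```shell
# VXLAN overhead against the physical MTU (IPv4 underlay):
# outer IPv4 (20) + outer UDP (8) + VXLAN header (8) + inner Ethernet (14)
echo $((20 + 8 + 8 + 14))   # 50
echo $((9000 - 50))         # 8950: tenant MTU for a 9000-byte physical network
```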

Matt

On Mon, Jan 25, 2016 at 12:10 PM, Rick Jones 
<[email protected]> wrote:
On 01/24/2016 07:43 PM, Ian Wells wrote:
Also, I say 9000, but why is 9000 even the right number?

While that may have been a rhetorical question...

Because that is the value Alteon picked in the late 1990s when they created the 
de facto standard for "Jumbo Frames" by including it in their Gigabit Ethernet 
kit as a way to enable the systems of the day to have a hope of getting 
link-rate :)

Perhaps they picked 9000 because it was twice the 4500 of FDDI, which itself 
was selected to allow space for 4096 bytes of data and then a good bit of 
headers.



rick jones

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
