On Sun, Jan 17, 2016 at 8:30 PM, Matt Kassawara <mkassaw...@gmail.com> wrote:
> Prior attempts to solve the MTU problem in neutron simply band-aid it
> or become too complex from feature creep or edge cases that mask the
> primary goal of a simple implementation that works for most
> deployments. So, I ran some experiments to empirically determine the
> root cause of MTU problems in common neutron deployments using the
> Linux bridge agent. I plan to perform these experiments again using
> the Open vSwitch agent... after sufficient mental recovery.
>
> I highly recommend reading further, but here's the TL;DR:
>
> Observations...
>
> 1) During creation of a VXLAN interface, Linux automatically subtracts
> the VXLAN protocol overhead from the MTU of the parent interface.
> 2) A veth pair or tap with a different MTU on each end drops packets
> larger than the smaller MTU.
> 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU
> of all the ports. Therefore, Linux reduces the typical bridge MTU from
> 1500 to 1450 when neutron adds a VXLAN interface to it.
> 4) A bridge with different MTUs on each port drops packets larger than
> the MTU of the bridge.
> 5) A bridge or veth pair with an IP address can participate in path
> MTU discovery (PMTUD). However, these devices do not appear to
> understand namespaces and originate the ICMP message from the host
> instead of a namespace. Therefore, the message never reaches the
> destination... typically a host outside of the deployment.
>
> Conclusion...
>
> The MTU disparity between native and overlay networks must reside in
> a device capable of layer-3 operations that can participate in PMTUD,
> such as the neutron router between a private/project overlay network
> and a public/external native network.
>
> Some background...
>
> In a typical datacenter network, MTU must remain consistent within a
> layer-2 network because fragmentation and the mechanism indicating the
> need for it occur at layer-3. In other words, all host interfaces and
> switch ports on the same layer-2 network must use the same MTU. If the
> layer-2 network connects to a router, the router port must also use
> the same MTU. A router can contain ports on multiple layer-2 networks
> with different MTUs because it operates on those networks at layer-3.
> If the MTU changes between ports on a router and devices on those
> layer-2 networks attempt to communicate at layer-3, the router can
> perform a couple of actions. For IPv4, the router can fragment the
> packet. However, if the packet contains the "don't fragment" (DF)
> flag, the router can either silently drop the packet or return an ICMP
> "fragmentation needed" message to the sender. This ICMP message
> contains the MTU of the next layer-2 network in the route between the
> sender and receiver. Each router in the path can return these ICMP
> messages to the sender until it learns the maximum MTU for the entire
> path, also known as path MTU discovery (PMTUD). IPv6 routers never
> fragment packets in transit, so IPv6 relies entirely on PMTUD.
>
> The cloud provides a virtual extension of a physical network. In the
> simplest sense, patch cables become veth pairs, switches become
> bridges, and routers become namespaces. Therefore, MTU implementation
> for virtual networks should mimic physical networks where MTU changes
> must occur within a router at layer-3.
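> Observations 1 and 3 are easy to verify on any Linux host with a
> 1500-MTU interface. A minimal sketch, assuming a parent interface
> named eth0 (the interface and bridge names here are hypothetical, not
> from my environment):
>
> # ip link add vxlan-demo type vxlan id 42 dev eth0 dstport 4789
> # ip link show vxlan-demo | grep mtu  # 1450... parent MTU minus 50
> # ip link add br-demo type bridge
> # ip link set dev vxlan-demo master br-demo
> # ip link show br-demo | grep mtu  # bridge drops from 1500 to 1450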
> For these experiments, my deployment contains one controller and one
> compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
> configuration does not contain any MTU options (e.g., path_mtu). One
> VM with a floating IP address attaches to a VXLAN private network that
> routes to a flat public network. The DHCP agent does not advertise MTU
> to the VM. My lab resides on public cloud infrastructure with networks
> that filter unknown MAC addresses such as those that neutron generates
> for virtual network components. Let's talk about the implications and
> workarounds.
>
> The VXLAN protocol contains 50 bytes of overhead. Linux automatically
> calculates the MTU of VXLAN devices by subtracting 50 bytes from the
> parent device, in this case a standard Ethernet interface with a 1500
> MTU. However, due to the limitations of public cloud networks, I must
> create a VXLAN tunnel between the controller node and a host outside
> of the deployment to simulate traffic from a datacenter network. This
> tunnel effectively reduces the "native" MTU from 1500 to 1450.
> Therefore, I need to subtract an additional 50 bytes from neutron
> VXLAN network components, essentially emulating the 50-byte difference
> between conventional neutron VXLAN networks and native networks. The
> host outside of the deployment assumes it can send packets using a
> 1450 MTU. The VM also assumes it can send packets using a 1450 MTU
> because the DHCP agent does not advertise a 1400 MTU to it.
>
> Let's get to it!
>
> Note: The commands in these experiments often generate lengthy output,
> so please refer to the gists when necessary.
>
> First, review the OpenStack bits and resulting network components in
> the environment [1]. Also, see that a regular 'ping' works between the
> host outside of the deployment and the VM [2].
>
> [1] https://gist.github.com/ionosphere80/b78bedfc5e8300b8113e
> [2] https://gist.github.com/ionosphere80/b44358b43af13de74f1f
>
> Note: The tcpdump output in each case references up to six points:
> neutron router gateway on the public network (qg), namespace end of
> the veth pair for the neutron router interface on the private network
> (qr), bridge end of the veth pair for the router interface on the
> private network (tap), controller node end of the VXLAN network
> (vxlan-434), compute node end of the VXLAN network (vxlan-434), and
> the bridge end of the tap for the VM (tap).
>
> Accessing a VM typically requires SSH. An MTU mismatch usually
> manifests itself as a "stuck" SSH connection. Without further
> investigation, the symptoms mistakenly lead people toward security
> groups. However, increasing the SSH client verbosity shows it
> connecting to the server and hanging somewhere during key exchange
> [3].
>
> [3] https://gist.github.com/ionosphere80/8ccd736bf3dda05a01a0
>
> Does the key exchange contain a packet that exceeds the MTU between
> the client and server? Yes! Looking at [4], the veth pair between the
> router namespace and the private network bridge drops the packet. The
> MTU changes over a layer-2 connection without a router, similar to
> connecting two switches with different MTUs. Even if it could
> participate in PMTUD, the veth pair lacks an IP address and therefore
> cannot originate ICMP messages.
>
> [4] https://gist.github.com/ionosphere80/9eb0e2c0b3e780de9afc
>
> Note: If I try "conventional" MTUs (instead of assuming a maximum of
> 1450 due to limitations of cloud networks) and use SSH from the
> controller node to access the VM, the VM tap interface drops key
> exchange packets due to an MTU mismatch... 1500 on the VM end and 1450
> on the bridge end. Exact results probably vary among environments.
>
> Now we know why SSH doesn't work.
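> As an aside, observation 2 also reproduces in isolation on any Linux
> host. A minimal sketch, assuming nothing from my environment (the
> names and the 192.0.2.0/24 addresses are hypothetical):
>
> # ip netns add ns-demo
> # ip link add veth-a type veth peer name veth-b
> # ip link set dev veth-b netns ns-demo
> # ip addr add 192.0.2.1/24 dev veth-a
> # ip link set dev veth-a up
> # ip netns exec ns-demo ip addr add 192.0.2.2/24 dev veth-b
> # ip netns exec ns-demo ip link set dev veth-b mtu 1400 up
> # ping -c 1 -s 1372 -M do 192.0.2.2  # 1400-byte packet... succeeds
> # ping -c 1 -s 1472 -M do 192.0.2.2  # 1500-byte packet... times out
>
> The second ping dies silently because the 1400-MTU end of the veth
> pair drops the 1500-byte packet and nothing in the path operates at
> layer-3 to return an ICMP error.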
> The following experiments use the 'ping' utility because it generates
> much less traffic and provides a way to control the DF flag (-M).
>
> Note: Although the namespace end of the veth pair for the neutron
> router interface on the private network contains a 1450 MTU, it
> actually doesn't pass VXLAN traffic, which means it should support a
> slightly larger ICMP payload... 4 extra bytes.
>
> Let's ping with a payload size of 1372, the maximum for a VXLAN
> segment with 1400 MTU (1400 minus 20 bytes of IP header and 8 bytes of
> ICMP header), and look at the tcpdump output [5]. Ping operates
> normally.
>
> # ping -c 1 -s 1372 -M do 10.4.31.102
>
> [5] https://gist.github.com/ionosphere80/89cc8e21060e8988e46c
>
> Let's ping with a payload size of 1373, one byte larger than the
> maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump
> output [6]. The VM does not receive the packet. The private network
> bridge on the controller, only operating at layer-2, drops the packet
> because it exceeds the MTU of the vxlan-434 interface on it. Even if
> it could participate in PMTUD, the bridge lacks an IP address and
> therefore cannot originate ICMP messages.
>
> # ping -c 1 -s 1373 -M do 10.4.31.102
>
> [6] https://gist.github.com/ionosphere80/8a7aa01db29679fbad22
>
> Let's ping with a payload size of 1377, one byte larger than the
> maximum for a "bare" segment with 1400 MTU, and look at the tcpdump
> output [7]. The veth pair for the router interface on the private
> network, only operating at layer-2, drops the packet because it
> exceeds the MTU of the bridge end of it. Even if it could participate
> in PMTUD, the veth pair lacks an IP address and therefore cannot
> originate ICMP messages.
>
> # ping -c 1 -s 1377 -M do 10.4.31.102
>
> [7] https://gist.github.com/ionosphere80/dd2e3e24f3e94c4801a8
>
> What if we allow fragmentation?
>
> Let's ping again with a payload size of 1373, one byte larger than the
> maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump
> output [8]. The vxlan-434 interface on the controller node, operating
> at layer-3, fragments the request. The vxlan-434 interface on the
> compute node, operating at layer-3, fragments the reply. Ping operates
> normally.
>
> # ping -c 1 -s 1373 -M dont 10.4.31.102
>
> [8] https://gist.github.com/ionosphere80/13ebcf1b67c1286012f7
>
> Let's ping again with a payload size of 1377, one byte larger than the
> maximum for a "bare" segment with 1400 MTU, and look at the tcpdump
> output [9]. The veth pair for the router interface on the private
> network, only operating at layer-2, drops the packet because it
> exceeds the MTU of the bridge end of it. Even if it could participate
> in PMTUD, the veth pair lacks an IP address and therefore cannot
> originate ICMP messages.
>
> # ping -c 1 -s 1377 -M dont 10.4.31.102
>
> [9] https://gist.github.com/ionosphere80/53b14343cd23a620b0ef
>
> In all of these cases, the first MTU disparity appears on a veth pair
> that only operates at layer-2 and therefore cannot participate in
> PMTUD, effectively breaking communication. What happens if we move the
> first MTU disparity to the namespace end of the veth pair for the
> neutron router interface on the private network, effectively making
> both ends equal with a 1400 MTU?
>
> # ip link set dev qr-d9e6ec95-f5 mtu 1400
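> To confirm the change before re-testing, check the qr end of the veth
> pair (prefix the command with 'ip netns exec qrouter-<router-uuid>' if
> the device lives in a router namespace... UUID hypothetical):
>
> # ip link show dev qr-d9e6ec95-f5 | grep mtu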
> Let's ping again with a payload size of 1372, the maximum for a VXLAN
> segment with 1400 MTU, and look at the tcpdump output [10]. Ping
> operates normally.
>
> # ping -c 1 -s 1372 -M do 10.4.31.102
>
> [10] https://gist.github.com/ionosphere80/fd5e29d387d009611704
>
> Let's ping again with a payload size of 1373, one byte larger than the
> maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump
> output [11]. The router namespace, operating at layer-3, sees the MTU
> discrepancy between the two interfaces in the namespace and returns an
> ICMP "fragmentation needed" message to the sender. The sender uses the
> MTU value in the ICMP packet to recalculate the length of the first
> packet and caches it for future packets.
>
> # ping -c 1 -s 1373 -M do 10.4.31.102
>
> [11] https://gist.github.com/ionosphere80/43ff558e077acfa92cfc
>
> The 'ip' command reveals the cached MTU value:
>
> # ip route get to 10.4.31.102
> 10.4.31.102 dev vxlan1040 src 10.4.31.1
>     cache expires 590sec mtu 1400
>
> At least for the Linux bridge agent, I think we can address ingress
> MTU disparity (to the VM) by moving it to the first device in the
> chain capable of layer-3 operations, namely the neutron router
> namespace. We can address the egress MTU disparity (from the VM) by
> advertising the MTU of the overlay network to the VM via DHCP/RA or
> using manual interface configuration. From a policy standpoint, only
> the operator should configure the native MTU for neutron using a
> global option in a configuration file, leaving a combination of Linux
> and neutron to automatically calculate the MTU for virtual network
> components and VMs. For VMs using manual interface configuration, the
> user should have read-only access to the MTU for a particular network
> that accounts for overlay protocol overhead. For example, the user
> would see 1450 for a VXLAN network.
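> For the DHCP approach, here's a minimal sketch of one known workaround
> using neutron's dnsmasq-based DHCP agent. The file paths and the 1450
> value are assumptions for a conventional VXLAN deployment (see also
> the ask.openstack dnsmasq thread in Sean's links below). Advertise the
> overlay MTU to VMs via DHCP option 26:
>
> # cat /etc/neutron/dnsmasq-neutron.conf
> dhcp-option-force=26,1450
>
> # grep dnsmasq_config_file /etc/neutron/dhcp_agent.ini
> dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf
>
> After restarting the DHCP agent and renewing its lease, the VM
> configures a 1450 MTU on its interface instead of assuming 1500.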
> Matt
>
> On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins <s...@coreitpro.com>
> wrote:
>
>> MTU has been an ongoing issue in Neutron for _years_.
>>
>> It's such a hassle that most people just throw up their hands and set
>> their physical infrastructure to jumbo frames. We even document it.
>>
>> http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html
>>
>> > Ideally, you can prevent these problems by enabling jumbo frames on
>> > the physical network that contains your tenant virtual networks.
>> > Jumbo frames support MTUs up to approximately 9000 bytes which
>> > negates the impact of GRE overhead on virtual networks.
>>
>> We've pushed this onto operators and deployers. There's a lot of code
>> in provisioning projects to handle MTUs.
>>
>> http://codesearch.openstack.org/?q=MTU&i=nope&files=&repos=
>>
>> We also mention it in our architecture design guide:
>>
>> http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150
>>
>> I want to get Neutron to the point where it starts discovering this
>> information and automatically configuring itself, in the optimistic
>> cases. I understand that it can be complex and have corner cases, but
>> the issue we have today is that it is broken in some multinode jobs,
>> and even Neutron developers aren't configuring it correctly.
>>
>> I also had this discussion on the DevStack side in
>> https://review.openstack.org/#/c/112523/ where basically, sure, we
>> can fix it in DevStack and at the gate, but that doesn't fix the
>> problem for anyone who isn't using DevStack to deploy their cloud.
>>
>> Today we have a ton of MTU configuration options sprinkled throughout
>> the L3 agent, DHCP agent, L2 agents, and at least one extension to
>> the REST API for handling MTUs.
>>
>> So yeah, a lot of knobs and not a lot of documentation on how to make
>> this thing work correctly. I'd like to try and simplify.
>>
>> Further reading:
>>
>> http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html
>>
>> http://lists.openstack.org/pipermail/openstack/2013-October/001778.html
>>
>> https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/
>>
>> https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/
>>
>> http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/
>>
>> https://twitter.com/search?q=openstack%20neutron%20MTU
>>
>> --
>> Sean M. Collins

So, as a non-networking expert who tried (AGAIN) to convert/upgrade my
cloud deployment from nova-net to Neutron at Liberty: I worked through
the DNS issues only to hit these MTU issues, which I eventually solved,
after much frustration, by manually adjusting MTU values through trial
and error. I'm trying to figure out from the above posts:

1. Does anybody actually have a solution that's being worked on or
merged?
2. Are we just saying "networking is hard, deal with it"?

I've tried making the jump every release, and I have to say that things
seem to have come a long way in Liberty (at least from my perspective),
but it still seems super finicky and even a bit "magic". Even after
getting past the MTU issues, I was so discouraged by the floating IP
assignment process that I aborted once again and went back to
nova-network. :(
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev