Signed-off-by: Stephen Finucane <step...@that.guru>
---
 lib/mac-learning.c     |   3 +-
 lib/mac-learning.h     |   2 +-
 vswitchd/INTERNALS     | 239 ------------------------------------------------
 vswitchd/INTERNALS.rst | 244 +++++++++++++++++++++++++++++++++++++++++++++++++
 vswitchd/automake.mk   |   2 +-
 5 files changed, 248 insertions(+), 242 deletions(-)
 delete mode 100644 vswitchd/INTERNALS
 create mode 100644 vswitchd/INTERNALS.rst

diff --git a/lib/mac-learning.c b/lib/mac-learning.c
index 5509f22..57b81f4 100644
--- a/lib/mac-learning.c
+++ b/lib/mac-learning.c
@@ -411,7 +411,8 @@ update_learning_table__(struct mac_learning *ml, struct eth_addr src,
          * packet was received over a non-bond interface and refrain from
          * learning from gratuitous ARP packets that arrive over bond
          * interfaces for this entry while the lock is in effect. See
-         * vswitchd/INTERNALS for more in-depth discussion on this topic. */
+         * vswitchd/INTERNALS.rst for more in-depth discussion on this
+         * topic. */
         if (!is_bond) {
             mac_entry_set_grat_arp_lock(mac);
         } else if (mac_entry_is_grat_arp_locked(mac)) {
diff --git a/lib/mac-learning.h b/lib/mac-learning.h
index d380690..e427815 100644
--- a/lib/mac-learning.h
+++ b/lib/mac-learning.h
@@ -47,7 +47,7 @@
  * Second, the implementation has the ability to "lock" a MAC table entry
  * updated by a gratuitous ARP. This is a simple feature but the rationale for
  * it is complicated. Please refer to the description of SLB bonding in
- * vswitchd/INTERNALS for an explanation.
+ * vswitchd/INTERNALS.rst for an explanation.
  *
  * Third, the implementation expires entries that are idle for longer than a
  * configurable amount of time. This is implemented by keeping all of the
diff --git a/vswitchd/INTERNALS b/vswitchd/INTERNALS
deleted file mode 100644
index 994353d..0000000
--- a/vswitchd/INTERNALS
+++ /dev/null
@@ -1,239 +0,0 @@
-                  ========================
-                   ovs-vswitchd Internals
-                  ========================
-
-This document describes some of the internals of the ovs-vswitchd
-process. It is not complete. It tends to be updated on demand, so if
-you have questions about the vswitchd implementation, ask them and
-perhaps we'll add some appropriate documentation here.
-
-Most of the ovs-vswitchd implementation is in vswitchd/bridge.c, so
-code references below should be assumed to refer to that file except
-as otherwise specified.
-
-Bonding
-=======
-
-Bonding allows two or more interfaces (the "slaves") to share network
-traffic. From a high-level point of view, bonded interfaces act like
-a single port, but they have the bandwidth of multiple network
-devices, e.g. two 1 GB physical interfaces act like a single 2 GB
-interface. Bonds also increase robustness: the bonded port does not
-go down as long as at least one of its slaves is up.
-
-In vswitchd, a bond always has at least two slaves (and may have
-more). If a configuration error, etc. would cause a bond to have only
-one slave, the port becomes an ordinary port, not a bonded port, and
-none of the special features of bonded ports described in this section
-apply.
-
-There are many forms of bonding of which ovs-vswitchd implements only
-a few. The most complex bond ovs-vswitchd implements is called
-"source load balancing" or SLB bonding. SLB bonding divides traffic
-among the slaves based on the Ethernet source address. This is useful
-only if the traffic over the bond has multiple Ethernet source
-addresses, for example if network traffic from multiple VMs are
-multiplexed over the bond.
-
-Enabling and Disabling Slaves
------------------------------
-
-When a bond is created, a slave is initially enabled or disabled based
-on whether carrier is detected on the NIC (see iface_create()). After
-that, a slave is disabled if its carrier goes down for a period of
-time longer than the downdelay, and it is enabled if carrier comes up
-for longer than the updelay (see bond_link_status_update()). There is
-one exception where the updelay is skipped: if no slaves at all are
-currently enabled, then the first slave on which carrier comes up is
-enabled immediately.
-
-The updelay should be set to a time longer than the STP forwarding
-delay of the physical switch to which the bond port is connected (if
-STP is enabled on that switch). Otherwise, the slave will be enabled,
-and load may be shifted to it, before the physical switch starts
-forwarding packets on that port, which can cause some data to be
-"blackholed" for a time. The exception for a single enabled slave
-does not cause any problem in this regard because when no slaves are
-enabled all output packets are blackholed anyway.
-
-When a slave becomes disabled, the vswitch immediately chooses a new
-output port for traffic that was destined for that slave (see
-bond_enable_slave()). It also sends a "gratuitous learning packet",
-specifically a RARP, on the bond port (on the newly chosen slave) for
-each MAC address that the vswitch has learned on a port other than the
-bond (see bond_send_learning_packets()), to teach the physical switch
-that the new slave should be used in place of the one that is now
-disabled. (This behavior probably makes sense only for a vswitch that
-has only one port (the bond) connected to a physical switch; vswitchd
-should probably provide a way to disable or configure it in other
-scenarios.)
-
-Bond Packet Input
------------------
-
-Bonding accepts unicast packets on any bond slave. This can
-occasionally cause packet duplication for the first few packets sent
-to a given MAC, if the physical switch attached to the bond is
-flooding packets to that MAC because it has not yet learned the
-correct slave for that MAC.
-
-Bonding only accepts multicast (and broadcast) packets on a single
-bond slave (the "active slave") at any given time. Multicast packets
-received on other slaves are dropped. Otherwise, every multicast
-packet would be duplicated, once for every bond slave, because the
-physical switch attached to the bond will flood those packets.
-
-Bonding also drops received packets when the vswitch has learned that
-the packet's MAC is on a port other than the bond port itself. This is
-because it is likely that the vswitch itself sent the packet out the
-bond port on a different slave and is now receiving the packet back.
-This occurs when the packet is multicast or the physical switch has not
-yet learned the MAC and is flooding it. However, the vswitch makes an
-exception to this rule for broadcast ARP replies, which indicate that
-the MAC has moved to another switch, probably due to VM migration.
-(ARP replies are normally unicast, so this exception does not match
-normal ARP replies. It will match the learning packets sent on bond
-fail-over.)
-
-The active slave is simply the first slave to be enabled after the
-bond is created (see bond_choose_active_iface()). If the active slave
-is disabled, then a new active slave is chosen among the slaves that
-remain active. Currently due to the way that configuration works,
-this tends to be the remaining slave whose interface name is first
-alphabetically, but this is by no means guaranteed.
-
-Bond Packet Output
-------------------
-
-When a packet is sent out a bond port, the bond slave actually used is
-selected based on the packet's source MAC and VLAN tag (see
-choose_output_iface()). In particular, the source MAC and VLAN tag
-are hashed into one of 256 values, and that value is looked up in a
-hash table (the "bond hash") kept in the "bond_hash" member of struct
-port. The hash table entry identifies a bond slave. If no bond slave
-has yet been chosen for that hash table entry, vswitchd chooses one
-arbitrarily.
-
-Every 10 seconds, vswitchd rebalances the bond slaves (see
-bond_rebalance_port()). To rebalance, vswitchd examines the
-statistics for the number of bytes transmitted by each slave over
-approximately the past minute, with data sent more recently weighted
-more heavily than data sent less recently. It considers each of the
-slaves in order from most-loaded to least-loaded. If highly loaded
-slave H is significantly more heavily loaded than the least-loaded
-slave L, and slave H carries at least two hashes, then vswitchd shifts
-one of H's hashes to L. However, vswitchd will only shift a hash from
-H to L if it will decrease the ratio of the load between H and L by at
-least 0.1.
-
-Currently, "significantly more loaded" means that H must carry at
-least 1 Mbps more traffic, and that traffic must be at least 3%
-greater than L's.
-
-Bond Balance Modes
-------------------
-
-Each bond balancing mode has different considerations, described
-below.
-
-LACP Bonding
-------------
-
-LACP bonding requires the remote switch to implement LACP, but it is
-otherwise very simple in that, after LACP negotiation is complete,
-there is no need for special handling of received packets.
-
-Several of the physical switches that support LACP block all traffic
-for ports that are configured to use LACP, until LACP is negotiated with
-the host. When configuring a LACP bond on a OVS host (eg: XenServer),
-this means that there will be an interruption of the network connectivity
-between the time the ports on the physical switch and the bond on the OVS
-host are configured. The interruption may be relatively long, if different
-people are responsible for managing the switches and the OVS host.
-
-Such network connectivity failure can be avoided if LACP can be configured
-on the OVS host before configuring the physical switch, and having
-the OVS host fall back to a bond mode (active-backup) till the physical
-switch LACP configuration is complete. An option "lacp-fallback-ab" exists to
-provide such behavior on openvswitch.
-
-Active Backup Bonding
----------------------
-
-Active Backup bonds send all traffic out one "active" slave until that
-slave becomes unavailable. Since they are significantly less
-complicated than SLB bonds, they are preferred when LACP is not an
-option. Additionally, they are the only bond mode which supports
-attaching each slave to a different upstream switch.
-
-SLB Bonding
------------
-
-SLB bonding allows a limited form of load balancing without the remote
-switch's knowledge or cooperation. The basics of SLB are simple. SLB
-assigns each source MAC+VLAN pair to a link and transmits all packets
-from that MAC+VLAN through that link. Learning in the remote switch
-causes it to send packets to that MAC+VLAN through the same link.
-
-SLB bonding has the following complications:
-
-  0. When the remote switch has not learned the MAC for the
-     destination of a unicast packet and hence floods the packet to
-     all of the links on the SLB bond, Open vSwitch will forward
-     duplicate packets, one per link, to each other switch port.
-
-     Open vSwitch does not solve this problem.
-
-  1. When the remote switch receives a multicast or broadcast packet
-     from a port not on the SLB bond, it will forward it to all of
-     the links in the SLB bond. This would cause packet duplication
-     if not handled specially.
-
-     Open vSwitch avoids packet duplication by accepting multicast
-     and broadcast packets on only the active slave, and dropping
-     multicast and broadcast packets on all other slaves.
-
-  2. When Open vSwitch forwards a multicast or broadcast packet to a
-     link in the SLB bond other than the active slave, the remote
-     switch will forward it to all of the other links in the SLB
-     bond, including the active slave. Without special handling,
-     this would mean that Open vSwitch would forward a second copy of
-     the packet to each switch port (other than the bond), including
-     the port that originated the packet.
-
-     Open vSwitch deals with this case by dropping packets received
-     on any SLB bonded link that have a source MAC+VLAN that has been
-     learned on any other port. (This means that SLB as implemented
-     in Open vSwitch relies critically on MAC learning. Notably, SLB
-     is incompatible with the "flood_vlans" feature.)
-
-  3. Suppose that a MAC+VLAN moves to an SLB bond from another port
-     (e.g. when a VM is migrated from this hypervisor to a different
-     one). Without additional special handling, Open vSwitch will
-     not notice until the MAC learning entry expires, up to 60
-     seconds later as a consequence of rule #2.
-
-     Open vSwitch avoids a 60-second delay by listening for
-     gratuitous ARPs, which VMs commonly emit upon migration. As an
-     exception to rule #2, a gratuitous ARP received on an SLB bond
-     is not dropped and updates the MAC learning table in the usual
-     way. (If a move does not trigger a gratuitous ARP, or if the
-     gratuitous ARP is lost in the network, then a 60-second delay
-     still occurs.)
-
-  4. Suppose that a MAC+VLAN moves from an SLB bond to another port
-     (e.g. when a VM is migrated from a different hypervisor to this
-     one), that the MAC+VLAN emits a gratuitous ARP, and that Open
-     vSwitch forwards that gratuitous ARP to a link in the SLB bond
-     other than the active slave. The remote switch will forward the
-     gratuitous ARP to all of the other links in the SLB bond,
-     including the active slave. Without additional special
-     handling, this would mean that Open vSwitch would learn that the
-     MAC+VLAN was located on the SLB bond, as a consequence of rule
-     #3.
-
-     Open vSwitch avoids this problem by "locking" the MAC learning
-     table entry for a MAC+VLAN from which a gratuitous ARP was
-     received from a non-SLB bond port. For 5 seconds, a locked MAC
-     learning table entry will not be updated based on a gratuitous
-     ARP received on a SLB bond.
diff --git a/vswitchd/INTERNALS.rst b/vswitchd/INTERNALS.rst
new file mode 100644
index 0000000..95c00f2
--- /dev/null
+++ b/vswitchd/INTERNALS.rst
@@ -0,0 +1,244 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+======================
+ovs-vswitchd Internals
+======================
+
+This document describes some of the internals of the ovs-vswitchd process. It
+is not complete. It tends to be updated on demand, so if you have questions
+about the vswitchd implementation, ask them and perhaps we'll add some
+appropriate documentation here.
+
+Most of the ovs-vswitchd implementation is in ``vswitchd/bridge.c``, so code
+references below should be assumed to refer to that file except as otherwise
+specified.
+
+Bonding
+-------
+
+Bonding allows two or more interfaces (the "slaves") to share network traffic.
+From a high-level point of view, bonded interfaces act like a single port, but
+they have the bandwidth of multiple network devices, e.g. two 1 GB physical
+interfaces act like a single 2 GB interface. Bonds also increase robustness:
+the bonded port does not go down as long as at least one of its slaves is up.
+
+In vswitchd, a bond always has at least two slaves (and may have more). If a
+configuration error, etc. would cause a bond to have only one slave, the port
+becomes an ordinary port, not a bonded port, and none of the special features
+of bonded ports described in this section apply.
+
+There are many forms of bonding of which ovs-vswitchd implements only a few.
+The most complex bond ovs-vswitchd implements is called "source load balancing"
+or SLB bonding. SLB bonding divides traffic among the slaves based on the
+Ethernet source address. This is useful only if the traffic over the bond has
+multiple Ethernet source addresses, for example if network traffic from
+multiple VMs are multiplexed over the bond.
+
+Enabling and Disabling Slaves
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a bond is created, a slave is initially enabled or disabled based on
+whether carrier is detected on the NIC (see ``iface_create()``). After that, a
+slave is disabled if its carrier goes down for a period of time longer than the
+downdelay, and it is enabled if carrier comes up for longer than the updelay
+(see ``bond_link_status_update()``). There is one exception where the updelay
+is skipped: if no slaves at all are currently enabled, then the first slave on
+which carrier comes up is enabled immediately.
+
+The updelay should be set to a time longer than the STP forwarding delay of the
+physical switch to which the bond port is connected (if STP is enabled on that
+switch). Otherwise, the slave will be enabled, and load may be shifted to it,
+before the physical switch starts forwarding packets on that port, which can
+cause some data to be "blackholed" for a time. The exception for a single
+enabled slave does not cause any problem in this regard because when no slaves
+are enabled all output packets are blackholed anyway.
+
+When a slave becomes disabled, the vswitch immediately chooses a new output
+port for traffic that was destined for that slave (see
+``bond_enable_slave()``). It also sends a "gratuitous learning packet",
+specifically a RARP, on the bond port (on the newly chosen slave) for each MAC
+address that the vswitch has learned on a port other than the bond (see
+``bond_send_learning_packets()``), to teach the physical switch that the new
+slave should be used in place of the one that is now disabled. (This behavior
+probably makes sense only for a vswitch that has only one port (the bond)
+connected to a physical switch; vswitchd should probably provide a way to
+disable or configure it in other scenarios.)
+
+Bond Packet Input
+~~~~~~~~~~~~~~~~~
+
+Bonding accepts unicast packets on any bond slave. This can occasionally cause
+packet duplication for the first few packets sent to a given MAC, if the
+physical switch attached to the bond is flooding packets to that MAC because it
+has not yet learned the correct slave for that MAC.
+
+Bonding only accepts multicast (and broadcast) packets on a single bond slave
+(the "active slave") at any given time. Multicast packets received on other
+slaves are dropped. Otherwise, every multicast packet would be duplicated,
+once for every bond slave, because the physical switch attached to the bond
+will flood those packets.
+
+Bonding also drops received packets when the vswitch has learned that the
+packet's MAC is on a port other than the bond port itself. This is because it
+is likely that the vswitch itself sent the packet out the bond port on a
+different slave and is now receiving the packet back. This occurs when the
+packet is multicast or the physical switch has not yet learned the MAC and is
+flooding it. However, the vswitch makes an exception to this rule for
+broadcast ARP replies, which indicate that the MAC has moved to another switch,
+probably due to VM migration. (ARP replies are normally unicast, so this
+exception does not match normal ARP replies. It will match the learning
+packets sent on bond fail-over.)
+
+The active slave is simply the first slave to be enabled after the bond is
+created (see ``bond_choose_active_iface()``). If the active slave is disabled,
+then a new active slave is chosen among the slaves that remain active.
+Currently due to the way that configuration works, this tends to be the
+remaining slave whose interface name is first alphabetically, but this is by no
+means guaranteed.
+
+Bond Packet Output
+~~~~~~~~~~~~~~~~~~
+
+When a packet is sent out a bond port, the bond slave actually used is selected
+based on the packet's source MAC and VLAN tag (see ``choose_output_iface()``).
+In particular, the source MAC and VLAN tag are hashed into one of 256 values,
+and that value is looked up in a hash table (the "bond hash") kept in the
+``bond_hash`` member of struct port. The hash table entry identifies a bond
+slave. If no bond slave has yet been chosen for that hash table entry,
+vswitchd chooses one arbitrarily.
+
+Every 10 seconds, vswitchd rebalances the bond slaves (see
+``bond_rebalance_port()``). To rebalance, vswitchd examines the statistics for
+the number of bytes transmitted by each slave over approximately the past
+minute, with data sent more recently weighted more heavily than data sent less
+recently. It considers each of the slaves in order from most-loaded to
+least-loaded. If highly loaded slave H is significantly more heavily loaded
+than the least-loaded slave L, and slave H carries at least two hashes, then
+vswitchd shifts one of H's hashes to L. However, vswitchd will only shift a
+hash from H to L if it will decrease the ratio of the load between H and L by
+at least 0.1.
+
+Currently, "significantly more loaded" means that H must carry at least 1 Mbps
+more traffic, and that traffic must be at least 3% greater than L's.
+
+Bond Balance Modes
+~~~~~~~~~~~~~~~~~~
+
+Each bond balancing mode has different considerations, described below.
+
+LACP Bonding
+++++++++++++
+
+LACP bonding requires the remote switch to implement LACP, but it is otherwise
+very simple in that, after LACP negotiation is complete, there is no need for
+special handling of received packets.
+
+Several of the physical switches that support LACP block all traffic for ports
+that are configured to use LACP, until LACP is negotiated with the host. When
+configuring a LACP bond on a OVS host (eg: XenServer), this means that there
+will be an interruption of the network connectivity between the time the ports
+on the physical switch and the bond on the OVS host are configured. The
+interruption may be relatively long, if different people are responsible for
+managing the switches and the OVS host.
+
+Such network connectivity failure can be avoided if LACP can be configured on
+the OVS host before configuring the physical switch, and having the OVS host
+fall back to a bond mode (active-backup) till the physical switch LACP
+configuration is complete. An option "lacp-fallback-ab" exists to provide such
+behavior on openvswitch.
+
+Active Backup Bonding
++++++++++++++++++++++
+
+Active Backup bonds send all traffic out one "active" slave until that slave
+becomes unavailable. Since they are significantly less complicated than SLB
+bonds, they are preferred when LACP is not an option. Additionally, they are
+the only bond mode which supports attaching each slave to a different upstream
+switch.
+
+SLB Bonding
++++++++++++
+
+SLB bonding allows a limited form of load balancing without the remote switch's
+knowledge or cooperation. The basics of SLB are simple. SLB assigns each
+source MAC+VLAN pair to a link and transmits all packets from that MAC+VLAN
+through that link. Learning in the remote switch causes it to send packets to
+that MAC+VLAN through the same link.
+
+SLB bonding has the following complications:
+
+0. When the remote switch has not learned the MAC for the destination of a
+   unicast packet and hence floods the packet to all of the links on the SLB
+   bond, Open vSwitch will forward duplicate packets, one per link, to each
+   other switch port.
+
+   Open vSwitch does not solve this problem.
+
+1. When the remote switch receives a multicast or broadcast packet from a port
+   not on the SLB bond, it will forward it to all of the links in the SLB bond.
+   This would cause packet duplication if not handled specially.
+
+   Open vSwitch avoids packet duplication by accepting multicast and broadcast
+   packets on only the active slave, and dropping multicast and broadcast
+   packets on all other slaves.
+
+2. When Open vSwitch forwards a multicast or broadcast packet to a link in the
+   SLB bond other than the active slave, the remote switch will forward it to
+   all of the other links in the SLB bond, including the active slave. Without
+   special handling, this would mean that Open vSwitch would forward a second
+   copy of the packet to each switch port (other than the bond), including the
+   port that originated the packet.
+
+   Open vSwitch deals with this case by dropping packets received on any SLB
+   bonded link that have a source MAC+VLAN that has been learned on any other
+   port. (This means that SLB as implemented in Open vSwitch relies critically
+   on MAC learning. Notably, SLB is incompatible with the "flood_vlans"
+   feature.)
+
+3. Suppose that a MAC+VLAN moves to an SLB bond from another port (e.g. when a
+   VM is migrated from this hypervisor to a different one). Without additional
+   special handling, Open vSwitch will not notice until the MAC learning entry
+   expires, up to 60 seconds later as a consequence of rule #2.
+
+   Open vSwitch avoids a 60-second delay by listening for gratuitous ARPs,
+   which VMs commonly emit upon migration. As an exception to rule #2, a
+   gratuitous ARP received on an SLB bond is not dropped and updates the MAC
+   learning table in the usual way. (If a move does not trigger a gratuitous
+   ARP, or if the gratuitous ARP is lost in the network, then a 60-second delay
+   still occurs.)
+
+4. Suppose that a MAC+VLAN moves from an SLB bond to another port (e.g. when a
+   VM is migrated from a different hypervisor to this one), that the MAC+VLAN
+   emits a gratuitous ARP, and that Open vSwitch forwards that gratuitous ARP
+   to a link in the SLB bond other than the active slave. The remote switch
+   will forward the gratuitous ARP to all of the other links in the SLB bond,
+   including the active slave. Without additional special handling, this would
+   mean that Open vSwitch would learn that the MAC+VLAN was located on the SLB
+   bond, as a consequence of rule #3.
+
+   Open vSwitch avoids this problem by "locking" the MAC learning table entry
+   for a MAC+VLAN from which a gratuitous ARP was received from a non-SLB bond
+   port. For 5 seconds, a locked MAC learning table entry will not be updated
+   based on a gratuitous ARP received on a SLB bond.
+
diff --git a/vswitchd/automake.mk b/vswitchd/automake.mk
index 8d7f3ea..94a0272 100644
--- a/vswitchd/automake.mk
+++ b/vswitchd/automake.mk
@@ -16,7 +16,7 @@ vswitchd_ovs_vswitchd_LDADD = \
 	lib/libsflow.la \
 	lib/libopenvswitch.la
 vswitchd_ovs_vswitchd_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
-EXTRA_DIST += vswitchd/INTERNALS
+EXTRA_DIST += vswitchd/INTERNALS.rst
 MAN_ROOTS += vswitchd/ovs-vswitchd.8.in
 
 # vswitch schema and IDL
-- 
2.7.4