This is a top-level document, so plain old rST is preferred. Signed-off-by: Stephen Finucane <step...@that.guru> --- DESIGN.md | 1093 ------------------------------------- DESIGN.rst | 1151 +++++++++++++++++++++++++++++++++++++++ Makefile.am | 2 +- include/openvswitch/ofp-util.h | 2 +- lib/ofp-util.c | 2 +- ovn/controller/pinctrl.c | 2 +- ovn/ovn-architecture.7.xml | 2 +- rhel/openvswitch-fedora.spec.in | 2 +- rhel/openvswitch.spec.in | 2 +- 9 files changed, 1158 insertions(+), 1100 deletions(-) delete mode 100644 DESIGN.md create mode 100644 DESIGN.rst
diff --git a/DESIGN.md b/DESIGN.md deleted file mode 100644 index a330312..0000000 --- a/DESIGN.md +++ /dev/null @@ -1,1093 +0,0 @@ -Design Decisions In Open vSwitch -================================ - -This document describes design decisions that went into implementing -Open vSwitch. While we believe these to be reasonable decisions, it is -impossible to predict how Open vSwitch will be used in all environments. -Understanding assumptions made by Open vSwitch is critical to a -successful deployment. The end of this document contains contact -information that can be used to let us know how we can make Open vSwitch -more generally useful. - -Asynchronous Messages -===================== - -Over time, Open vSwitch has added many knobs that control whether a -given controller receives OpenFlow asynchronous messages. This -section describes how all of these features interact. - -First, a service controller never receives any asynchronous messages -unless it changes its miss_send_len from the service controller -default of zero in one of the following ways: - - - Sending an OFPT_SET_CONFIG message with nonzero miss_send_len. - - - Sending any NXT_SET_ASYNC_CONFIG message: as a side effect, this - message changes the miss_send_len to - OFP_DEFAULT_MISS_SEND_LEN (128) for service controllers. - -Second, OFPT_FLOW_REMOVED and NXT_FLOW_REMOVED messages are generated -only if the flow that was removed had the OFPFF_SEND_FLOW_REM flag -set. - -Third, OFPT_PACKET_IN and NXT_PACKET_IN messages are sent only to -OpenFlow controller connections that have the correct connection ID -(see "struct nx_controller_id" and "struct nx_action_controller"): - - - For packet-in messages generated by a NXAST_CONTROLLER action, - the controller ID specified in the action. - - - For other packet-in messages, controller ID zero. (This is the - default ID when an OpenFlow controller does not configure one.) 
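The first three rules above can be sketched in C as a single predicate. This is only an illustration under stated assumptions: `struct conn`, `struct async_event`, `enum async_kind`, and `passes_basic_checks()` are hypothetical names invented for this sketch, not Open vSwitch data structures, and the per-connection table that is consulted last is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message kinds; only the distinctions the rules need. */
enum async_kind { ASYNC_PACKET_IN, ASYNC_FLOW_REMOVED, ASYNC_PORT_STATUS };

/* Hypothetical per-connection state (not an OVS structure). */
struct conn {
    bool is_service;        /* Is this a service controller connection? */
    uint16_t miss_send_len; /* Zero by default for service controllers. */
    uint16_t controller_id; /* Zero unless the controller configures one. */
};

/* Hypothetical description of one asynchronous event. */
struct async_event {
    enum async_kind kind;
    bool flow_had_send_flow_rem;   /* OFPFF_SEND_FLOW_REM on the removed flow. */
    uint16_t target_controller_id; /* From NXAST_CONTROLLER, otherwise zero. */
};

/* Returns true if the event survives the first three rules. */
static bool
passes_basic_checks(const struct conn *c, const struct async_event *ev)
{
    /* Rule 1: a service controller receives nothing while its
     * miss_send_len is still zero. */
    if (c->is_service && c->miss_send_len == 0) {
        return false;
    }
    /* Rule 2: flow-removed messages require OFPFF_SEND_FLOW_REM. */
    if (ev->kind == ASYNC_FLOW_REMOVED && !ev->flow_had_send_flow_rem) {
        return false;
    }
    /* Rule 3: packet-ins go only to connections with a matching ID. */
    if (ev->kind == ASYNC_PACKET_IN
        && c->controller_id != ev->target_controller_id) {
        return false;
    }
    return true;
}
```

An event that passes this predicate would still be subject to the per-connection table lookup by message type, reason code, and role.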
- -Finally, Open vSwitch consults a per-connection table indexed by the -message type, reason code, and current role. The following table -shows how this table is initialized by default when an OpenFlow -connection is made. An entry labeled "yes" means that the message is -sent, an entry labeled "---" means that the message is suppressed. - -``` - master/ - message and reason code other slave - ---------------------------------------- ------- ----- - OFPT_PACKET_IN / NXT_PACKET_IN - OFPR_NO_MATCH yes --- - OFPR_ACTION yes --- - OFPR_INVALID_TTL --- --- - OFPR_ACTION_SET (OF1.4+) yes --- - OFPR_GROUP (OF1.4+) yes --- - - OFPT_FLOW_REMOVED / NXT_FLOW_REMOVED - OFPRR_IDLE_TIMEOUT yes --- - OFPRR_HARD_TIMEOUT yes --- - OFPRR_DELETE yes --- - OFPRR_GROUP_DELETE (OF1.4+) yes --- - OFPRR_METER_DELETE (OF1.4+) yes --- - OFPRR_EVICTION (OF1.4+) yes --- - - OFPT_PORT_STATUS - OFPPR_ADD yes yes - OFPPR_DELETE yes yes - OFPPR_MODIFY yes yes - - OFPT_ROLE_REQUEST / OFPT_ROLE_REPLY (OF1.4+) - OFPCRR_MASTER_REQUEST --- --- - OFPCRR_CONFIG --- --- - OFPCRR_EXPERIMENTER --- --- - - OFPT_TABLE_STATUS (OF1.4+) - OFPTR_VACANCY_DOWN --- --- - OFPTR_VACANCY_UP --- --- - - OFPT_REQUESTFORWARD (OF1.4+) - OFPRFR_GROUP_MOD --- --- - OFPRFR_METER_MOD --- --- -``` - -The NXT_SET_ASYNC_CONFIG message directly sets all of the values in -this table for the current connection. The -OFPC_INVALID_TTL_TO_CONTROLLER bit in the OFPT_SET_CONFIG message -controls the setting for OFPR_INVALID_TTL for the "master" role. - - -OFPAT_ENQUEUE -============= - -The OpenFlow 1.0 specification requires the output port of the OFPAT_ENQUEUE -action to "refer to a valid physical port (i.e. < OFPP_MAX) or OFPP_IN_PORT". -Although OFPP_LOCAL is not less than OFPP_MAX, it is an 'internal' port which -can have QoS applied to it in Linux. 
Since we allow the OFPAT_ENQUEUE to apply -to 'internal' ports whose port numbers are less than OFPP_MAX, we interpret -OFPP_LOCAL as a physical port and support OFPAT_ENQUEUE on it as well. - - -OFPT_FLOW_MOD -============= - -The OpenFlow specification for the behavior of OFPT_FLOW_MOD is -confusing. The following tables summarize the Open vSwitch -implementation of its behavior in the following categories: - - - "match on priority": Whether the flow_mod acts only on flows - whose priority matches that included in the flow_mod message. - - - "match on out_port": Whether the flow_mod acts only on flows - that output to the out_port included in the flow_mod message (if - out_port is not OFPP_NONE). OpenFlow 1.1 and later have a - similar feature (not listed separately here) for out_group. - - - "match on flow_cookie": Whether the flow_mod acts only on flows - whose flow_cookie matches an optional controller-specified value - and mask. - - - "updates flow_cookie": Whether the flow_mod changes the - flow_cookie of the flow or flows that it matches to the - flow_cookie included in the flow_mod message. - - - "updates OFPFF_ flags": Whether the flow_mod changes the - OFPFF_SEND_FLOW_REM flag of the flow or flows that it matches to - the setting included in the flags of the flow_mod message. - - - "honors OFPFF_CHECK_OVERLAP": Whether the OFPFF_CHECK_OVERLAP - flag in the flow_mod is significant. - - - "updates idle_timeout" and "updates hard_timeout": Whether the - idle_timeout and hard_timeout in the flow_mod, respectively, - have an effect on the flow or flows matched by the flow_mod. - - - "updates idle timer": Whether the flow_mod resets the per-flow - timer that measures how long a flow has been idle. - - - "updates hard timer": Whether the flow_mod resets the per-flow - timer that measures how long it has been since a flow was - modified. - - - "zeros counters": Whether the flow_mod resets per-flow packet - and byte counters to zero. 
- - - "may add a new flow": Whether the flow_mod may add a new flow to - the flow table. (Obviously this is always true for "add" - commands but in some OpenFlow versions "modify" and - "modify-strict" can also add new flows.) - - - "sends flow_removed message": Whether the flow_mod generates a - flow_removed message for the flow or flows that it affects. - -An entry labeled "yes" means that the flow mod type does have the -indicated behavior, "---" means that it does not, an empty cell means -that the property is not applicable, and other values are explained -below the table. - -OpenFlow 1.0 ------------- - -``` - MODIFY DELETE - ADD MODIFY STRICT DELETE STRICT - === ====== ====== ====== ====== -match on priority yes --- yes --- yes -match on out_port --- --- --- yes yes -match on flow_cookie --- --- --- --- --- -match on table_id --- --- --- --- --- -controller chooses table_id --- --- --- -updates flow_cookie yes yes yes -updates OFPFF_SEND_FLOW_REM yes + + -honors OFPFF_CHECK_OVERLAP yes + + -updates idle_timeout yes + + -updates hard_timeout yes + + -resets idle timer yes + + -resets hard timer yes yes yes -zeros counters yes + + -may add a new flow yes yes yes -sends flow_removed message --- --- --- % % - -(+) "modify" and "modify-strict" only take these actions when they - create a new flow, not when they update an existing flow. - -(%) "delete" and "delete_strict" generates a flow_removed message if - the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. - (Each controller can separately control whether it wants to - receive the generated messages.) -``` - -OpenFlow 1.1 ------------- - -OpenFlow 1.1 makes these changes: - - - The controller now must specify the table_id of the flow match - searched and into which a flow may be inserted. Behavior for a - table_id of 255 is undefined. - - - A flow_mod, except an "add", can now match on the flow_cookie. 
- - - When a flow_mod matches on the flow_cookie, "modify" and - "modify-strict" never insert a new flow. - -``` - MODIFY DELETE - ADD MODIFY STRICT DELETE STRICT - === ====== ====== ====== ====== -match on priority yes --- yes --- yes -match on out_port --- --- --- yes yes -match on flow_cookie --- yes yes yes yes -match on table_id yes yes yes yes yes -controller chooses table_id yes yes yes -updates flow_cookie yes --- --- -updates OFPFF_SEND_FLOW_REM yes + + -honors OFPFF_CHECK_OVERLAP yes + + -updates idle_timeout yes + + -updates hard_timeout yes + + -resets idle timer yes + + -resets hard timer yes yes yes -zeros counters yes + + -may add a new flow yes # # -sends flow_removed message --- --- --- % % - -(+) "modify" and "modify-strict" only take these actions when they - create a new flow, not when they update an existing flow. - -(%) "delete" and "delete_strict" generates a flow_removed message if - the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. - (Each controller can separately control whether it wants to - receive the generated messages.) - -(#) "modify" and "modify-strict" only add a new flow if the flow_mod - does not match on any bits of the flow cookie -``` - -OpenFlow 1.2 ------------- - -OpenFlow 1.2 makes these changes: - - - Only "add" commands ever add flows, "modify" and "modify-strict" - never do. - - - A new flag OFPFF_RESET_COUNTS now controls whether "modify" and - "modify-strict" reset counters, whereas previously they never - reset counters (except when they inserted a new flow). 
- -``` - MODIFY DELETE - ADD MODIFY STRICT DELETE STRICT - === ====== ====== ====== ====== -match on priority yes --- yes --- yes -match on out_port --- --- --- yes yes -match on flow_cookie --- yes yes yes yes -match on table_id yes yes yes yes yes -controller chooses table_id yes yes yes -updates flow_cookie yes --- --- -updates OFPFF_SEND_FLOW_REM yes --- --- -honors OFPFF_CHECK_OVERLAP yes --- --- -updates idle_timeout yes --- --- -updates hard_timeout yes --- --- -resets idle timer yes --- --- -resets hard timer yes yes yes -zeros counters yes & & -may add a new flow yes --- --- -sends flow_removed message --- --- --- % % - -(%) "delete" and "delete_strict" generates a flow_removed message if - the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. - (Each controller can separately control whether it wants to - receive the generated messages.) - -(&) "modify" and "modify-strict" reset counters if the - OFPFF_RESET_COUNTS flag is specified. -``` - -OpenFlow 1.3 ------------- - -OpenFlow 1.3 makes these changes: - - - Behavior for a table_id of 255 is now defined, for "delete" and - "delete-strict" commands, as meaning to delete from all tables. - A table_id of 255 is now explicitly invalid for other commands. - - - New flags OFPFF_NO_PKT_COUNTS and OFPFF_NO_BYT_COUNTS for "add" - operations. - -The table for 1.3 is the same as the one shown above for 1.2. - - -OpenFlow 1.4 ------------ - -OpenFlow 1.4 makes these changes: - - - Adds the "importance" field to flow_mods, but it does not - explicitly specify which kinds of flow_mods set the importance. - For consistency, Open vSwitch uses the same rule for importance - as for idle_timeout and hard_timeout, that is, only an "ADD" - flow_mod sets the importance. (This issue has been filed with - the ONF as EXT-496.) - - - Eviction Mechanism to automatically delete entries of lower - importance to make space for newer entries. 
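Two of the version-dependent rows in the OFPT_FLOW_MOD tables above, "may add a new flow" and "zeros counters", can be sketched in C for the "modify" and "modify-strict" commands. This is a minimal sketch, not Open vSwitch code: the enum and both function names are hypothetical, and OpenFlow 1.2 and later are folded into a single case on the assumption that the 1.2 table also applies to them, as the text states for 1.3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical version tag; OF1.2 and later share one behavior here. */
enum ofp_version_sketch { OF10, OF11, OF12_PLUS };

/* May a "modify" or "modify-strict" that matched nothing add a flow?
 * 'cookie_mask' is the flow_cookie mask in the flow_mod (zero means
 * the flow_mod does not match on the cookie). */
static bool
modify_may_add_flow(enum ofp_version_sketch v, uint64_t cookie_mask)
{
    switch (v) {
    case OF10:
        return true;              /* Always adds on a miss. */
    case OF11:
        return cookie_mask == 0;  /* Only if no cookie bits are matched. */
    case OF12_PLUS:
        return false;             /* Only "add" ever adds flows. */
    }
    return false;
}

/* Does a "modify" or "modify-strict" reset packet and byte counters? */
static bool
modify_zeros_counters(enum ofp_version_sketch v, bool created_new_flow,
                      bool reset_counts_flag)
{
    switch (v) {
    case OF10:
    case OF11:
        return created_new_flow;   /* "+" in the tables. */
    case OF12_PLUS:
        return reset_counts_flag;  /* "&": OFPFF_RESET_COUNTS. */
    }
    return false;
}
```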
- - -OpenFlow 1.4 Bundles -==================== - -Open vSwitch makes all flow table modifications atomically, i.e., any -datapath packet only sees flow table configurations either before or -after any change made by any flow_mod. For example, if a controller -removes all flows with a single OpenFlow "flow_mod", no packet sees an -intermediate version of the OpenFlow pipeline where only some of the -flows have been deleted. - -It should be noted that Open vSwitch caches datapath flows, and that -the cached flows are NOT flushed immediately when a flow table -changes. Instead, the datapath flows are revalidated against the new -flow table as soon as possible, and usually within one second of the -modification. This design amortizes the cost of datapath cache -flushing across multiple flow table changes, and has a significant -performance effect during simultaneous heavy flow table churn and high -traffic load. This means that different cached datapath flows may -have been computed based on different flow table configurations, but -each of the datapath flows is guaranteed to have been computed over a -coherent view of the flow tables, as described above. - -With OpenFlow 1.4 bundles this atomicity can be extended across an -arbitrary set of flow_mods. Bundles are supported for flow_mod and -port_mod messages only. For flow_mods, both 'atomic' and 'ordered' -bundle flags are trivially supported, as all bundled messages are -executed in the order they were added and all flow table modifications -are now atomic to the datapath. Port mods may not appear in atomic -bundles, as port status modifications are not atomic. - -To support bundles, ovs-ofctl has a '--bundle' option that makes the -flow mod commands ('add-flow', 'add-flows', 'mod-flows', 'del-flows', -and 'replace-flows') use an OpenFlow 1.4 bundle to apply the -modifications as a single atomic transaction. If any of the flow mods -in a transaction fail, none of them are executed. 
All flow mods in a -bundle appear to datapath lookups simultaneously. - -Furthermore, ovs-ofctl 'add-flow' and 'add-flows' commands now accept -arbitrary flow mods as an input by allowing the flow specification to -start with an explicit 'add', 'modify', 'modify_strict', 'delete', or -'delete_strict' keyword. A missing keyword is treated as 'add', so -this is fully backwards compatible. With the new '--bundle' option -all the flow mods are executed as a single atomic transaction using an -OpenFlow 1.4 bundle. Without the '--bundle' option the flow mods are -executed in order up to the first failing flow_mod, and in case of an -error the earlier successful flow_mods are not rolled back. - - -OFPT_PACKET_IN -============== - -The OpenFlow 1.1 specification for OFPT_PACKET_IN is confusing. The -definition in OF1.1 openflow.h is[*]: - -``` - /* Packet received on port (datapath -> controller). */ - struct ofp_packet_in { - struct ofp_header header; - uint32_t buffer_id; /* ID assigned by datapath. */ - uint32_t in_port; /* Port on which frame was received. */ - uint32_t in_phy_port; /* Physical Port on which frame was received. */ - uint16_t total_len; /* Full length of frame. */ - uint8_t reason; /* Reason packet is being sent (one of OFPR_*) */ - uint8_t table_id; /* ID of the table that was looked up */ - uint8_t data[0]; /* Ethernet frame, halfway through 32-bit word, - so the IP header is 32-bit aligned. The - amount of data is inferred from the length - field in the header. Because of padding, - offsetof(struct ofp_packet_in, data) == - sizeof(struct ofp_packet_in) - 2. */ - }; - OFP_ASSERT(sizeof(struct ofp_packet_in) == 24); -``` - -The confusing part is the comment on the data[] member. This comment -is a leftover from OF1.0 openflow.h, in which the comment was correct: -sizeof(struct ofp_packet_in) is 20 in OF1.0 and offsetof(struct -ofp_packet_in, data) is 18. 
When OF1.1 was written, the structure -members were changed but the comment was carelessly not updated, and -the comment became wrong: sizeof(struct ofp_packet_in) and -offsetof(struct ofp_packet_in, data) are both 24 in OF1.1. - -That leaves the question of how to implement ofp_packet_in in OF1.1. -The OpenFlow reference implementation for OF1.1 does not include any -padding, that is, the first byte of the encapsulated frame immediately -follows the 'table_id' member without a gap. Open vSwitch therefore -implements it the same way for compatibility. - -For an earlier discussion, please see the thread archived at: -https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html - -[*] The quoted definition is directly from OF1.1. Definitions used - inside OVS omit the 8-byte ofp_header members, so the sizes in - this discussion are 8 bytes larger than those declared in OVS - header files. - - -VLAN Matching -============= - -The 802.1Q VLAN header causes more trouble than any other 4 bytes in -networking. More specifically, three versions of OpenFlow and Open -vSwitch have among them four different ways to match the contents and -presence of the VLAN header. The following table describes how each -version works. - - Match NXM OF1.0 OF1.1 OF1.2 - ----- --------- ----------- ----------- ------------ - [1] 0000/0000 ????/1,??/? ????/1,??/? 0000/0000,-- - [2] 0000/ffff ffff/0,??/? ffff/0,??/? 0000/ffff,-- - [3] 1xxx/1fff 0xxx/0,??/1 0xxx/0,??/1 1xxx/ffff,-- - [4] z000/f000 ????/1,0y/0 fffe/0,0y/0 1000/1000,0y - [5] zxxx/ffff 0xxx/0,0y/0 0xxx/0,0y/0 1xxx/ffff,0y - [6] 0000/0fff <none> <none> <none> - [7] 0000/f000 <none> <none> <none> - [8] 0000/efff <none> <none> <none> - [9] 1001/1001 <none> <none> 1001/1001,-- - [10] 3000/3000 <none> <none> <none> - [11] 1000/1000 <none> fffe/0,??/1 1000/1000,-- - -Each column is interpreted as follows. - - - Match: See the list below. - - - NXM: xxxx/yyyy means NXM_OF_VLAN_TCI_W with value xxxx and mask - yyyy. 
A mask of 0000 is equivalent to omitting - NXM_OF_VLAN_TCI(_W), a mask of ffff is equivalent to - NXM_OF_VLAN_TCI. - - - OF1.0 and OF1.1: wwww/x,yy/z means dl_vlan wwww, OFPFW_DL_VLAN x, - dl_vlan_pcp yy, and OFPFW_DL_VLAN_PCP z. If OFPFW_DL_VLAN or - OFPFW_DL_VLAN_PCP is 1, the corresponding field value is - wildcarded, otherwise it is matched. ? means that the given bits - are ignored (their conventional values are 0000/x,00/0 in OF1.0, - 0000/x,00/1 in OF1.1; x is never ignored). <none> means that the - given match is not supported. - - - OF1.2: xxxx/yyyy,zz means OXM_OF_VLAN_VID_W with value xxxx and - mask yyyy, and OXM_OF_VLAN_PCP (which is not maskable) with - value zz. A mask of 0000 is equivalent to omitting - OXM_OF_VLAN_VID(_W), a mask of ffff is equivalent to - OXM_OF_VLAN_VID. -- means that OXM_OF_VLAN_PCP is omitted. - <none> means that the given match is not supported. - -The matches are: - - [1] Matches any packet, that is, one without an 802.1Q header or with - an 802.1Q header with any TCI value. - - [2] Matches only packets without an 802.1Q header. - - NXM: Any match with (vlan_tci == 0) and (vlan_tci_mask & 0x1000) - != 0 is equivalent to the one listed in the table. - - OF1.0: The spec doesn't define behavior if dl_vlan is set to - 0xffff and OFPFW_DL_VLAN_PCP is not set. - - OF1.1: The spec says explicitly to ignore dl_vlan_pcp when - dl_vlan is set to 0xffff. - - OF1.2: The spec doesn't say what should happen if (vlan_vid == 0) - and (vlan_vid_mask & 0x1000) != 0 but (vlan_vid_mask != 0x1000), - but it would be straightforward to also interpret as [2]. - - [3] Matches only packets that have an 802.1Q header with VID xxx (and - any PCP). - - [4] Matches only packets that have an 802.1Q header with PCP y (and - any VID). - - NXM: z is ((y << 1) | 1). - - OF1.0: The spec isn't very clear, but OVS implements it this way. 
- - OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff) - == 0x1000 would also work, but the spec doesn't define their - behavior. - - [5] Matches only packets that have an 802.1Q header with VID xxx and - PCP y. - - NXM: z is ((y << 1) | 1). - - OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff) - == 0x1fff would also work. - - [6] Matches packets with no 802.1Q header or with an 802.1Q header - with a VID of 0. Only possible with NXM. - - [7] Matches packets with no 802.1Q header or with an 802.1Q header - with a PCP of 0. Only possible with NXM. - - [8] Matches packets with no 802.1Q header or with an 802.1Q header - with both VID and PCP of 0. Only possible with NXM. - - [9] Matches only packets that have an 802.1Q header with an - odd-numbered VID (and any PCP). Only possible with NXM and - OF1.2. (This is just an example; one can match on any desired - VID bit pattern.) - -[10] Matches only packets that have an 802.1Q header with an - odd-numbered PCP (and any VID). Only possible with NXM. (This - is just an example; one can match on any desired PCP bit - pattern.) - -[11] Matches any packet with an 802.1Q header, regardless of VID or - PCP. - -Additional notes: - - - OF1.2: The top three bits of OXM_OF_VLAN_VID are fixed to zero, - so bits 13, 14, and 15 in the masks listed in the table may be - set to arbitrary values, as long as the corresponding value bits - are also zero. The suggested ffff mask for [2], [3], and [5] - allows a shorter OXM representation (the mask is omitted) than - the minimal 1fff mask. - - -Flow Cookies -============ - -OpenFlow 1.0 and later versions have the concept of a "flow cookie", -which is a 64-bit integer value attached to each flow. The treatment -of the flow cookie has varied greatly across OpenFlow versions, -however. - -In OpenFlow 1.0: - - - OFPFC_ADD set the cookie in the flow that it added. - - - OFPFC_MODIFY and OFPFC_MODIFY_STRICT updated the cookie for - the flow or flows that they modified. 
- - - OFPST_FLOW messages included the flow cookie. - - - OFPT_FLOW_REMOVED messages reported the cookie of the flow - that was removed. - -OpenFlow 1.1 made the following changes: - - - Flow mod operations OFPFC_MODIFY, OFPFC_MODIFY_STRICT, - OFPFC_DELETE, and OFPFC_DELETE_STRICT, plus flow stats - requests and aggregate stats requests, gained the ability to - match on flow cookies with an arbitrary mask. - - - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to add a - new flow, in the case of no match, only if the flow table - modification operation did not match on the cookie field. - (In OpenFlow 1.0, modify operations always added a new flow - when there was no match.) - - - OFPFC_MODIFY and OFPFC_MODIFY_STRICT no longer updated flow - cookies. - -OpenFlow 1.2 made the following changes: - - - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to never - add a new flow, regardless of whether the flow cookie was - used for matching. - -Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 -behavior with the following extensions: - - - An NXM extension field NXM_NX_COOKIE(_W) allows the NXM - versions of OFPFC_MODIFY, OFPFC_MODIFY_STRICT, OFPFC_DELETE, - and OFPFC_DELETE_STRICT flow_mods, plus flow stats requests - and aggregate stats requests, to match on flow cookies with - arbitrary masks. This is much like the equivalent OpenFlow - 1.1 feature. - - - Like OpenFlow 1.1, OFPFC_MODIFY and OFPFC_MODIFY_STRICT add a - new flow if there is no match and the mask is zero (or not - given). - - - The "cookie" field in OFPT_FLOW_MOD and NXT_FLOW_MOD messages - is used as the cookie value for OFPFC_ADD commands, as - described in OpenFlow 1.0. For OFPFC_MODIFY and - OFPFC_MODIFY_STRICT commands, the "cookie" field is used as a - new cookie for flows that match unless it is UINT64_MAX, in - which case the flow's cookie is not updated. 
- - - NXT_PACKET_IN (the Nicira extended version of - OFPT_PACKET_IN) reports the cookie of the rule that - generated the packet, or all-1-bits if no rule generated the - packet. (Older versions of OVS used all-0-bits instead of - all-1-bits.) - -The following table shows the handling of different protocols when -receiving OFPFC_MODIFY and OFPFC_MODIFY_STRICT messages. A mask of 0 -indicates either an explicit mask of zero or an implicit one by not -specifying the NXM_NX_COOKIE(_W) field. - -``` - Match Update Add on miss Add on miss - cookie cookie mask!=0 mask==0 - ====== ====== =========== =========== -OpenFlow 1.0 no yes <always add on miss> -OpenFlow 1.1 yes no no yes -OpenFlow 1.2 yes no no no -NXM yes yes* no yes - -* Updates the flow's cookie unless the "cookie" field is UINT64_MAX. -``` - -Multiple Table Support -====================== - -OpenFlow 1.0 has only rudimentary support for multiple flow tables. -Notably, OpenFlow 1.0 does not allow the controller to specify the -flow table to which a flow is to be added. Open vSwitch adds an -extension for this purpose, which is enabled on a per-OpenFlow -connection basis using the NXT_FLOW_MOD_TABLE_ID message. When the -extension is enabled, the upper 8 bits of the 'command' member in an -OFPT_FLOW_MOD or NXT_FLOW_MOD message designates the table to which a -flow is to be added. - -The Open vSwitch software switch implementation offers 255 flow -tables. On packet ingress, only the first flow table (table 0) is -searched, and the contents of the remaining tables are not considered -in any way. Tables other than table 0 only come into play when an -NXAST_RESUBMIT_TABLE action specifies another table to search. - -Tables 128 and above are reserved for use by the switch itself. -Controllers should use only tables 0 through 127. - - -OFPTC_* Table Configuration -=========================== - -This section covers the history of the OFPTC_* table configuration -bits across OpenFlow versions. 
- -OpenFlow 1.0 flow tables had fixed configurations. - -OpenFlow 1.1 enabled controllers to configure behavior upon flow table -miss and added the OFPTC_MISS_* constants for that purpose. OFPTC_* -did not control anything else but it was nevertheless conceptualized -as a set of bit-fields instead of an enum. OF1.1 added the -OFPT_TABLE_MOD message to set OFPTC_MISS_* for a flow table and added -the 'config' field to the OFPST_TABLE reply to report the current -setting. - -OpenFlow 1.2 did not change anything in this regard. - -OpenFlow 1.3 switched to another means of changing flow table miss -behavior and deprecated OFPTC_MISS_* without adding any more OFPTC_* -constants. This meant that OFPT_TABLE_MOD now had no purpose at all, -but OF1.3 kept it around "for backward compatibility with older and -newer versions of the specification." At the same time, OF1.3 -introduced a new message OFPMP_TABLE_FEATURES that included a field -'config' documented as reporting the OFPTC_* values set with -OFPT_TABLE_MOD; of course this served no real purpose because no -OFPTC_* values are defined. OF1.3 did remove the OFPTC_* field from -OFPMP_TABLE (previously named OFPST_TABLE). - -OpenFlow 1.4 defined two new OFPTC_* constants, OFPTC_EVICTION and -OFPTC_VACANCY_EVENTS, using bits that did not overlap with -OFPTC_MISS_* even though those bits had not been defined since OF1.2. -OFPT_TABLE_MOD still controlled these settings. The field for OFPTC_* -values in OFPMP_TABLE_FEATURES was renamed from 'config' to -'capabilities' and documented as reporting the flags that are -supported in an OFPT_TABLE_MOD message. The OFPMP_TABLE_DESC message -newly added in OF1.4 reported the OFPTC_* setting. - -OpenFlow 1.5 did not change anything in this regard. - -The following table summarizes. The columns say: - - - OpenFlow version(s). - - - The OFPTC_* flags defined in those versions. - - - Whether OFPT_TABLE_MOD can modify OFPTC_* flags. 
- - - Whether OFPST_TABLE/OFPMP_TABLE reports the OFPTC_* flags. - - - What OFPMP_TABLE_FEATURES reports (if it exists): either the - current configuration or the switch's capabilities. - - - Whether OFPMP_TABLE_DESC reports the current configuration. - -OpenFlow OFPTC_* flags TABLE_MOD stats? TABLE_FEATURES TABLE_DESC ---------- ----------------------- --------- ------ -------------- ---------- -OF1.0 none no[*][+] no[*] nothing[*][+] no[*][+] -OF1.1/1.2 MISS_* yes yes nothing[+] no[+] -OF1.3 none yes[*] no[*] config[*] no[*][+] -OF1.4/1.5 EVICTION/VACANCY_EVENTS yes no capabilities yes - - [*] Nothing to report/change anyway. - - [+] No such message. - - -IPv6 -==== - -Open vSwitch supports stateless handling of IPv6 packets. Flows can be -written to support matching TCP, UDP, and ICMPv6 headers within an IPv6 -packet. Deeper matching of some Neighbor Discovery messages is also -supported. - -IPv6 was not designed to interact well with middle-boxes. This, -combined with Open vSwitch's stateless nature, has affected the -processing of IPv6 traffic, which is detailed below. - -Extension Headers ------------------ - -The base IPv6 header is incredibly simple with the intention of only -containing information relevant for routing packets between two -endpoints. IPv6 relies heavily on the use of extension headers to -provide any other functionality. Unfortunately, the extension headers -were designed in such a way that it is impossible to move to the next -header (including the layer-4 payload) unless the current header is -understood. - -Open vSwitch will process the following extension headers and continue -to the next header: - - * Fragment (see the next section) - * AH (Authentication Header) - * Hop-by-Hop Options - * Routing - * Destination Options - -When a header is encountered that is not in that list, it is considered -"terminal". A terminal header's IPv6 protocol value is stored in -"nw_proto" for matching purposes. 
If a terminal header is TCP, UDP, or -ICMPv6, the packet will be further processed in an attempt to extract -layer-4 information. - -Fragments ---------- - -IPv6 requires that every link in the internet have an MTU of 1280 octets -or greater (RFC 2460). As such, a terminal header (as described above in -"Extension Headers") in the first fragment should generally be -reachable. In this case, the terminal header's IPv6 protocol type is -stored in the "nw_proto" field for matching purposes. If a terminal -header cannot be found in the first fragment (one with a fragment offset -of zero), the "nw_proto" field is set to 0. Subsequent fragments (those -with a non-zero fragment offset) have the "nw_proto" field set to the -IPv6 protocol type for fragments (44). - -Jumbograms ----------- - -An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer -than 65,535 octets. Jumbograms are only relevant in subnets with a link -MTU greater than 65,575 octets, and are not required to be supported on -nodes that do not connect to links with such large MTUs. Currently, Open -vSwitch doesn't process jumbograms. - - -In-Band Control -=============== - -Motivation ----------- - -An OpenFlow switch must establish and maintain a TCP network -connection to its controller. There are two basic ways to categorize -the network that this connection traverses: either it is completely -separate from the one that the switch is otherwise controlling, or its -path may overlap the network that the switch controls. We call the -former case "out-of-band control", the latter case "in-band control". - -Out-of-band control has the following benefits: - - - Simplicity: Out-of-band control slightly simplifies the switch - implementation. - - - Reliability: Excessive switch traffic volume cannot interfere - with control traffic. - - - Integrity: Machines not on the control network cannot - impersonate a switch or a controller. 
- - - Confidentiality: Machines not on the control network cannot - snoop on control traffic. - -In-band control, on the other hand, has the following advantages: - - - No dedicated port: There is no need to dedicate a physical - switch port to control, which is important on switches that have - few ports (e.g. wireless routers, low-end embedded platforms). - - - No dedicated network: There is no need to build and maintain a - separate control network. This is important in many - environments because it reduces proliferation of switches and - wiring. - -Open vSwitch supports both out-of-band and in-band control. This -section describes the principles behind in-band control. See the -description of the Controller table in ovs-vswitchd.conf.db(5) to -configure OVS for in-band control. - -Principles ----------- - -The fundamental principle of in-band control is that an OpenFlow -switch must recognize and switch control traffic without involving the -OpenFlow controller. All the details of implementing in-band control -are special cases of this principle. - -The rationale for this principle is simple. If the switch does not -handle in-band control traffic itself, then it will be caught in a -contradiction: it must contact the controller, but it cannot, because -only the controller can set up the flows that are needed to contact -the controller. - -The following points describe important special cases of this -principle. - - - In-band control must be implemented regardless of whether the - switch is connected. - - It is tempting to implement the in-band control rules only when - the switch is not connected to the controller, using the - reasoning that the controller should have complete control once - it has established a connection with the switch. - - This does not work in practice. Consider the case where the - switch is connected to the controller. Occasionally it can - happen that the controller forgets or otherwise needs to obtain - the MAC address of the switch. 
To do so, the controller sends a - broadcast ARP request. A switch that implements the in-band - control rules only when it is disconnected will then send an - OFPT_PACKET_IN message up to the controller. The controller will - be unable to respond, because it does not know the MAC address of - the switch. This is a deadlock situation that can only be - resolved by the switch noticing that its connection to the - controller has hung and reconnecting. - - - In-band control must override flows set up by the controller. - - It is reasonable to assume that flows set up by the OpenFlow - controller should take precedence over in-band control, on the - basis that the controller should be in charge of the switch. - - Again, this does not work in practice. Reasonable controller - implementations may set up a "last resort" fallback rule that - wildcards every field and, e.g., sends it up to the controller or - discards it. If a controller does that, then it will isolate - itself from the switch. - - - The switch must recognize all control traffic. - - The fundamental principle of in-band control states, in part, - that a switch must recognize control traffic without involving - the OpenFlow controller. More specifically, the switch must - recognize *all* control traffic. "False negatives", that is, - packets that constitute control traffic but that the switch does - not recognize as control traffic, lead to control traffic storms. - - Consider an OpenFlow switch that only recognizes control packets - sent to or from that switch. Now suppose that two switches of - this type, named A and B, are connected to ports on an Ethernet - hub (not a switch) and that an OpenFlow controller is connected - to a third hub port. In this setup, control traffic sent by - switch A will be seen by switch B, which will send it to the - controller as part of an OFPT_PACKET_IN message. 
Switch A will - then see the OFPT_PACKET_IN message's packet, re-encapsulate it - in another OFPT_PACKET_IN, and send it to the controller. Switch - B will then see that OFPT_PACKET_IN, and so on in an infinite - loop. - - Incidentally, the consequences of "false positives", where - packets that are not control traffic are nevertheless recognized - as control traffic, are much less severe. The controller will - not be able to control their behavior, but the network will - remain in working order. False positives do constitute a - security problem. - - - The switch should use echo-requests to detect disconnection. - - TCP will notice that a connection has hung, but this can take a - considerable amount of time. For example, with default settings - the Linux kernel TCP implementation will retransmit for between - 13 and 30 minutes, depending on the connection's retransmission - timeout, according to kernel documentation. This is far too long - for a switch to be disconnected, so an OpenFlow switch should - implement its own connection timeout. OpenFlow OFPT_ECHO_REQUEST - messages are the best way to do this, since they test the - OpenFlow connection itself. - -Implementation --------------- - -This section describes how Open vSwitch implements in-band control. -Correctly implementing in-band control has proven difficult due to its -many subtleties, and has thus gone through many iterations. Please -read through and understand the reasoning behind the chosen rules -before making modifications. - -Open vSwitch implements in-band control as "hidden" flows, that is, -flows that are not visible through OpenFlow, and at a higher priority -than wildcarded flows can be set up through OpenFlow. This is done so -that the OpenFlow controller cannot interfere with them and possibly -break connectivity with its switches. It is possible to see all -flows, including in-band ones, with the ovs-appctl "bridge/dump-flows" -command. 
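The precedence described above can be pictured with a minimal sketch. It is not the Open vSwitch classifier: the `Rule` type, the `HIDDEN_PRIORITY` constant, the `lookup` and `visible_flows` helpers, and the example address and port are all hypothetical names and values chosen for illustration. The sketch assumes only what the text states, namely that hidden flows outrank anything that can be installed through OpenFlow and that they are not visible through OpenFlow.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical constant: any value above the 16-bit OpenFlow priority range.
HIDDEN_PRIORITY = 0x10000


@dataclass
class Rule:
    name: str
    priority: int
    matches: Callable[[dict], bool]
    hidden: bool = False


def lookup(table: List[Rule], packet: dict) -> Optional[Rule]:
    """Return the highest-priority rule that matches the packet, if any."""
    candidates = [r for r in table if r.matches(packet)]
    return max(candidates, key=lambda r: r.priority, default=None)


def visible_flows(table: List[Rule]) -> List[Rule]:
    """Flows a controller would see through OpenFlow: hidden ones omitted."""
    return [r for r in table if not r.hidden]


table = [
    # Controller "last resort" rule: wildcards everything at the highest
    # priority that OpenFlow allows.
    Rule("controller-catch-all", 0xFFFF, lambda p: True),
    # Hidden in-band rule for TCP traffic to one remote (example values).
    Rule("in-band-tcp-to-remote", HIDDEN_PRIORITY,
         lambda p: p.get("ip_dst") == "192.0.2.1" and p.get("tcp_dst") == 6653,
         hidden=True),
]
```

Even with the controller's catch-all rule installed, `lookup` still picks the hidden rule for traffic to the remote, while `visible_flows` returns only the controller's rule.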
- -The Open vSwitch implementation of in-band control can hide traffic to -arbitrary "remotes", where each remote is one TCP port on one IP address. -Currently the remotes are automatically configured as the in-band OpenFlow -controllers plus the OVSDB managers, if any. (The latter is a requirement -because OVSDB managers are responsible for configuring OpenFlow controllers, -so if the manager cannot be reached then OpenFlow cannot be reconfigured.) - -The following rules (with the OFPP_NORMAL action) are set up on any bridge -that has any remotes: - - (a) DHCP requests sent from the local port. - (b) ARP replies to the local port's MAC address. - (c) ARP requests from the local port's MAC address. - -In-band also sets up the following rules for each unique next-hop MAC -address for the remotes' IPs (the "next hop" is either the remote -itself, if it is on a local subnet, or the gateway to reach the remote): - - (d) ARP replies to the next hop's MAC address. - (e) ARP requests from the next hop's MAC address. - -In-band also sets up the following rules for each unique remote IP address: - - (f) ARP replies containing the remote's IP address as a target. - (g) ARP requests containing the remote's IP address as a source. - -In-band also sets up the following rules for each unique remote (IP,port) -pair: - - (h) TCP traffic to the remote's IP and port. - (i) TCP traffic from the remote's IP and port. - -The goal of these rules is to be as narrow as possible to allow a -switch to join a network and be able to communicate with the -remotes. As mentioned earlier, these rules have higher priority -than the controller's rules, so if they are too broad, they may -prevent the controller from implementing its policy. As such, -in-band actively monitors some aspects of flow and packet processing -so that the rules can be made more precise. - -In-band control monitors attempts to add flows into the datapath that -could interfere with its duties. 
The datapath only allows exact -match entries, so in-band control is able to be very precise about -the flows it prevents. Flows that miss in the datapath are sent to -userspace to be processed, so preventing these flows from being -cached in the "fast path" does not affect correctness. The only type -of flow that is currently prevented is one that would prevent DHCP -replies from being seen by the local port. For example, a rule that -forwarded all DHCP traffic to the controller would not be allowed, -but one that forwarded to all ports (including the local port) would. - -As mentioned earlier, packets that miss in the datapath are sent to -the userspace for processing. The userspace has its own flow table, -the "classifier", so in-band checks whether any special processing -is needed before the classifier is consulted. If a packet is a DHCP -response to a request from the local port, the packet is forwarded to -the local port, regardless of the flow table. Note that this requires -L7 processing of DHCP replies to determine whether the 'chaddr' field -matches the MAC address of the local port. - -It is interesting to note that for an L3-based in-band control -mechanism, the majority of rules are devoted to ARP traffic. At first -glance, some of these rules appear redundant. However, each serves an -important role. First, in order to determine the MAC address of the -remote side (controller or gateway) for other ARP rules, we must allow -ARP traffic for our local port with rules (b) and (c). If we are -between a switch and its connection to the remote, we have to -allow the other switch's ARP traffic to go through. This is done with -rules (d) and (e), since we do not know the addresses of the other -switches a priori, but do know the remote's or gateway's. 
Finally, -if the remote is running in a local guest VM that is not reached -through the local port, the switch that is connected to the VM must -allow ARP traffic based on the remote's IP address, since it will -not know the MAC address of the local port that is sending the traffic -or the MAC address of the remote in the guest VM. - -With a few notable exceptions below, in-band should work in most -network setups. The following are considered "supported" in the -current implementation: - - - Locally Connected. The switch and remote are on the same - subnet. This uses rules (a), (b), (c), (h), and (i). - - - Reached through Gateway. The switch and remote are on - different subnets and must go through a gateway. This uses - rules (a), (b), (c), (h), and (i). - - - Between Switch and Remote. This switch is between another - switch and the remote, and we want to allow the other - switch's traffic through. This uses rules (d), (e), (h), and - (i). It uses (b) and (c) indirectly in order to know the MAC - address for rules (d) and (e). Note that DHCP for the other - switch will not work unless an OpenFlow controller explicitly lets this - switch pass the traffic. - - - Between Switch and Gateway. This switch is between another - switch and the gateway, and we want to allow the other switch's - traffic through. This uses the same rules and logic as the - "Between Switch and Remote" configuration described earlier. - - - Remote on Local VM. The remote is a guest VM on the - system running in-band control. This uses rules (a), (b), (c), - (h), and (i). - - - Remote on Local VM with Different Networks. The remote - is a guest VM on the system running in-band control, but the - local port is not used to connect to the remote. For - example, an IP address is configured on eth0 of the switch. The - remote's VM is connected through eth1 of the switch, but an - IP address has not been configured for that port on the switch. 
- As such, the switch will use eth0 to connect to the remote, - and eth1's rules about the local port will not work. In the - example, the switch attached to eth0 would use rules (a), (b), - (c), (h), and (i) on eth0. The switch attached to eth1 would use - rules (f), (g), (h), and (i). - -The following are explicitly *not* supported by in-band control: - - - Specify Remote by Name. Currently, the remote must be - identified by IP address. A naive approach would be to permit - all DNS traffic. Unfortunately, this would prevent the - controller from defining any policy over DNS. Since switches - that are located behind us need to connect to the remote, - in-band cannot simply add a rule that allows DNS traffic from - the local port. The "correct" way to support this is to parse - DNS requests to allow all traffic related to a request for the - remote's name through. Due to the potential security - problems and amount of processing, we decided to hold off for - the time-being. - - - Differing Remotes for Switches. All switches must know - the L3 addresses for all the remotes that other switches - may use, since rules need to be set up to allow traffic related - to those remotes through. See rules (f), (g), (h), and (i). - - - Differing Routes for Switches. In order for the switch to - allow other switches to connect to a remote through a - gateway, it allows the gateway's traffic through with rules (d) - and (e). If the routes to the remote differ for the two - switches, we will not know the MAC address of the alternate - gateway. - - -Action Reproduction -=================== - -It seems likely that many controllers, at least at startup, use the -OpenFlow "flow statistics" request to obtain existing flows, then -compare the flows' actions against the actions that they expect to -find. Before version 1.8.0, Open vSwitch always returned exact, -byte-for-byte copies of the actions that had been added to the flow -table. 
The current version of Open vSwitch does not always do this in -some exceptional cases. This section lists the exceptions that -controller authors must keep in mind if they compare actual actions -against desired actions in a bytewise fashion: - - - Open vSwitch zeros padding bytes in action structures, - regardless of their values when the flows were added. - - - Open vSwitch "normalizes" the instructions in OpenFlow 1.1 - (and later) in the following way: - - * OVS sorts the instructions into the following order: - Apply-Actions, Clear-Actions, Write-Actions, - Write-Metadata, Goto-Table. - - * OVS drops Apply-Actions instructions that have empty - action lists. - - * OVS drops Write-Actions instructions that have empty - action sets. - -Please report other discrepancies, if you notice any, so that we can -fix or document them. - - -Suggestions -=========== - -Suggestions to improve Open vSwitch are welcome at disc...@openvswitch.org. diff --git a/DESIGN.rst b/DESIGN.rst new file mode 100644 index 0000000..9893399 --- /dev/null +++ b/DESIGN.rst @@ -0,0 +1,1151 @@ +.. + Licensed under the Apache License, Version 2.0 (the "License"); you may + not use this file except in compliance with the License. You may obtain + a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + License for the specific language governing permissions and limitations + under the License. + + Convention for heading levels in Open vSwitch documentation: + + ======= Heading 0 (reserved for the title in a document) + ------- Heading 1 + ~~~~~~~ Heading 2 + +++++++ Heading 3 + ''''''' Heading 4 + + Avoid deeper levels because they do not render well. 
+ +================================ +Design Decisions In Open vSwitch +================================ + +This document describes design decisions that went into implementing Open +vSwitch. While we believe these to be reasonable decisions, it is impossible +to predict how Open vSwitch will be used in all environments. Understanding +assumptions made by Open vSwitch is critical to a successful deployment. The +end of this document contains contact information that can be used to let us +know how we can make Open vSwitch more generally useful. + +Asynchronous Messages +--------------------- + +Over time, Open vSwitch has added many knobs that control whether a given +controller receives OpenFlow asynchronous messages. This section describes how +all of these features interact. + +First, a service controller never receives any asynchronous messages unless it +changes its miss_send_len from the service controller default of zero in one of +the following ways: + +- Sending an ``OFPT_SET_CONFIG`` message with nonzero ``miss_send_len``. + +- Sending any ``NXT_SET_ASYNC_CONFIG`` message: as a side effect, this message + changes the ``miss_send_len`` to ``OFP_DEFAULT_MISS_SEND_LEN`` (128) for + service controllers. + +Second, ``OFPT_FLOW_REMOVED`` and ``NXT_FLOW_REMOVED`` messages are generated +only if the flow that was removed had the ``OFPFF_SEND_FLOW_REM`` flag set. + +Third, ``OFPT_PACKET_IN`` and ``NXT_PACKET_IN`` messages are sent only to +OpenFlow controller connections that have the correct connection ID (see +``struct nx_controller_id`` and ``struct nx_action_controller``): + +- For packet-in messages generated by a ``NXAST_CONTROLLER`` action, the + controller ID specified in the action. + +- For other packet-in messages, controller ID zero. (This is the default ID + when an OpenFlow controller does not configure one.) + +Finally, Open vSwitch consults a per-connection table indexed by the message +type, reason code, and current role. 
The following table shows how this table +is initialized by default when an OpenFlow connection is made. An entry +labeled ``yes`` means that the message is sent, an entry labeled ``---`` means +that the message is suppressed. + ++-------------------------------------------+---------+-------+ +| | master/ | | +| message and reason code | other | slave | ++===========================================+=========+=======+ +| ``OFPT_PACKET_IN`` / ``NXT_PACKET_IN`` | ++-------------------------------------------+---------+-------+ +| ``OFPR_NO_MATCH`` | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPR_ACTION`` | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPR_INVALID_TTL`` | --- | --- | ++-------------------------------------------+---------+-------+ +| ``OFPR_ACTION_SET`` (OF1.4+) | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPR_GROUP`` (OF1.4+) | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPT_FLOW_REMOVED`` / ``NXT_FLOW_REMOVED`` | ++-------------------------------------------+---------+-------+ +| ``OFPRR_IDLE_TIMEOUT`` | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPRR_HARD_TIMEOUT`` | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPRR_DELETE`` | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPRR_GROUP_DELETE`` (OF1.4+) | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPRR_METER_DELETE`` (OF1.4+) | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPRR_EVICTION`` (OF1.4+) | yes | --- | ++-------------------------------------------+---------+-------+ +| ``OFPT_PORT_STATUS`` | ++-------------------------------------------+---------+-------+ +| ``OFPPR_ADD`` | yes | yes | ++-------------------------------------------+---------+-------+ +| 
``OFPPR_DELETE`` | yes | yes | ++-------------------------------------------+---------+-------+ +| ``OFPPR_MODIFY`` | yes | yes | ++-------------------------------------------+---------+-------+ +| ``OFPT_ROLE_REQUEST`` / ``OFPT_ROLE_REPLY`` (OF1.4+) | ++-------------------------------------------+---------+-------+ +| ``OFPCRR_MASTER_REQUEST`` | --- | --- | ++-------------------------------------------+---------+-------+ +| ``OFPCRR_CONFIG`` | --- | --- | ++-------------------------------------------+---------+-------+ +| ``OFPCRR_EXPERIMENTER`` | --- | --- | ++-------------------------------------------+---------+-------+ +| ``OFPT_TABLE_STATUS`` (OF1.4+) | ++-------------------------------------------+---------+-------+ +| ``OFPTR_VACANCY_DOWN`` | --- | --- | ++-------------------------------------------+---------+-------+ +| ``OFPTR_VACANCY_UP`` | --- | --- | ++-------------------------------------------+---------+-------+ +| ``OFPT_REQUESTFORWARD`` (OF1.4+) | ++-------------------------------------------+---------+-------+ +| ``OFPRFR_GROUP_MOD`` | --- | --- | ++-------------------------------------------+---------+-------+ +| ``OFPRFR_METER_MOD`` | --- | --- | ++-------------------------------------------+---------+-------+ + +The ``NXT_SET_ASYNC_CONFIG`` message directly sets all of the values in this +table for the current connection. The ``OFPC_INVALID_TTL_TO_CONTROLLER`` bit +in the ``OFPT_SET_CONFIG`` message controls the setting for +``OFPR_INVALID_TTL`` for the "master" role. + +``OFPAT_ENQUEUE`` +----------------- + +The OpenFlow 1.0 specification requires the output port of the +``OFPAT_ENQUEUE`` action to "refer to a valid physical port (i.e. < +``OFPP_MAX``) or ``OFPP_IN_PORT``". Although ``OFPP_LOCAL`` is not less than +``OFPP_MAX``, it is an 'internal' port which can have QoS applied to it in +Linux. 
Since we allow the ``OFPAT_ENQUEUE`` to apply to 'internal' ports whose +port numbers are less than ``OFPP_MAX``, we interpret ``OFPP_LOCAL`` as a +physical port and support ``OFPAT_ENQUEUE`` on it as well. + +``OFPT_FLOW_MOD`` +----------------- + +The OpenFlow specification for the behavior of ``OFPT_FLOW_MOD`` is confusing. +The following tables summarize the Open vSwitch implementation of its behavior +in the following categories: + +"match on priority" + Whether the ``flow_mod`` acts only on flows whose priority matches that + included in the ``flow_mod`` message. + +"match on out_port" + Whether the ``flow_mod`` acts only on flows that output to the out_port + included in the flow_mod message (if out_port is not ``OFPP_NONE``). + OpenFlow 1.1 and later have a similar feature (not listed separately here) + for ``out_group``. + +"match on flow_cookie": + Whether the ``flow_mod`` acts only on flows whose ``flow_cookie`` matches an + optional controller-specified value and mask. + +"updates flow_cookie": + Whether the ``flow_mod`` changes the ``flow_cookie`` of the flow or flows + that it matches to the ``flow_cookie`` included in the flow_mod message. + +"updates ``OFPFF_`` flags": + Whether the flow_mod changes the ``OFPFF_SEND_FLOW_REM`` flag of the flow or + flows that it matches to the setting included in the flags of the flow_mod + message. + +"honors ``OFPFF_CHECK_OVERLAP``": + Whether the ``OFPFF_CHECK_OVERLAP`` flag in the flow_mod is significant. + +"updates ``idle_timeout``" and "updates ``hard_timeout``": + Whether the ``idle_timeout`` and hard_timeout in the ``flow_mod``, + respectively, have an effect on the flow or flows matched by the + ``flow_mod``. + +"updates idle timer": + Whether the ``flow_mod`` resets the per-flow timer that measures how long a + flow has been idle. + +"updates hard timer": + Whether the ``flow_mod`` resets the per-flow timer that measures how long it + has been since a flow was modified. 
+ +"zeros counters": + Whether the ``flow_mod`` resets per-flow packet and byte counters to zero. + +"may add a new flow": + Whether the ``flow_mod`` may add a new flow to the flow table. (Obviously + this is always true for "add" commands but in some OpenFlow versions "modify" + and "modify-strict" can also add new flows.) + +"sends ``flow_removed`` message": + Whether the flow_mod generates a flow_removed message for the flow or flows + that it affects. + +An entry labeled ``yes`` means that the flow mod type does have the indicated +behavior, ``---`` means that it does not, an empty cell means that the property +is not applicable, and other values are explained below the table. + +OpenFlow 1.0 +~~~~~~~~~~~~ + +================================ === ====== ====== ====== ====== + MODIFY DELETE + ADD MODIFY STRICT DELETE STRICT +================================ === ====== ====== ====== ====== +match on ``priority`` yes --- yes --- yes +match on ``out_port`` --- --- --- yes yes +match on ``flow_cookie`` --- --- --- --- --- +match on ``table_id`` --- --- --- --- --- +controller chooses ``table_id`` --- --- --- +updates ``flow_cookie`` yes yes yes +updates ``OFPFF_SEND_FLOW_REM`` yes + + +honors ``OFPFF_CHECK_OVERLAP`` yes + + +updates ``idle_timeout`` yes + + +updates ``hard_timeout`` yes + + +resets idle timer yes + + +resets hard timer yes yes yes +zeros counters yes + + +may add a new flow yes yes yes +sends ``flow_removed`` message --- --- --- % % +================================ === ====== ====== ====== ====== + +Where: + +``+`` + "modify" and "modify-strict" only take these actions when they create a new + flow, not when they update an existing flow. + +``%`` + "delete" and "delete_strict" generates a flow_removed message if the deleted + flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each controller + can separately control whether it wants to receive the generated messages.) 
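As a rough illustration of the OpenFlow 1.0 table above, the sketch below models only the "modify-strict" column. It is an assumption-laden toy, not Open vSwitch code: the `Flow` type, the `modify_strict` helper, and the use of a plain string as a stand-in for a flow match are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Key = Tuple[str, int]  # (match, priority); a string stands in for a real match


@dataclass
class Flow:
    actions: str
    cookie: int
    idle_timeout: int
    hard_timeout: int
    packets: int = 0    # per-flow packet counter
    hard_age: int = 0   # seconds since the flow was added or modified


def modify_strict(table: Dict[Key, Flow], match: str, priority: int,
                  actions: str, cookie: int,
                  idle_timeout: int = 0, hard_timeout: int = 0) -> str:
    """Apply an OpenFlow 1.0 "modify-strict" as summarized in the table."""
    flow = table.get((match, priority))  # strict: the priority must match too
    if flow is None:
        # "may add a new flow": timeouts and counters are set as for "add".
        table[(match, priority)] = Flow(actions, cookie,
                                        idle_timeout, hard_timeout)
        return "added"
    # Existing flow: actions and flow_cookie are updated and the hard timer
    # is reset. The "+" rows (timeouts, idle timer, counters) are untouched.
    flow.actions = actions
    flow.cookie = cookie
    flow.hard_age = 0
    return "modified"
```

Calling `modify_strict` twice with the same match and priority first adds the flow and then modifies it in place; the second call leaves `idle_timeout` and `packets` as they were, which is the behavior the ``+`` footnote describes.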
+ +OpenFlow 1.1 +~~~~~~~~~~~~ + +OpenFlow 1.1 makes these changes: + +- The controller now must specify the ``table_id`` of the flow match searched + and into which a flow may be inserted. Behavior for a ``table_id`` of 255 is + undefined. + +- A ``flow_mod``, except an "add", can now match on the ``flow_cookie``. + +- When a ``flow_mod`` matches on the ``flow_cookie``, "modify" and + "modify-strict" never insert a new flow. + +================================ === ====== ====== ====== ====== + MODIFY DELETE + ADD MODIFY STRICT DELETE STRICT +================================ === ====== ====== ====== ====== +match on ``priority`` yes --- yes --- yes +match on ``out_port`` --- --- --- yes yes +match on ``flow_cookie`` --- yes yes yes yes +match on ``table_id`` yes yes yes yes yes +controller chooses ``table_id`` yes yes yes +updates ``flow_cookie`` yes --- --- +updates ``OFPFF_SEND_FLOW_REM`` yes + + +honors ``OFPFF_CHECK_OVERLAP`` yes + + +updates ``idle_timeout`` yes + + +updates ``hard_timeout`` yes + + +resets idle timer yes + + +resets hard timer yes yes yes +zeros counters yes + + +may add a new flow yes # # +sends ``flow_removed`` message --- --- --- % % +================================ === ====== ====== ====== ====== + +Where: + +``+`` + "modify" and "modify-strict" only take these actions when they create a new + flow, not when they update an existing flow. + +``%`` + "delete" and "delete_strict" generates a flow_removed message if the deleted + flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each controller + can separately control whether it wants to receive the generated messages.) + +``#`` + "modify" and "modify-strict" only add a new flow if the flow_mod does not + match on any bits of the flow cookie + +OpenFlow 1.2 +~~~~~~~~~~~~ + +OpenFlow 1.2 makes these changes: + +- Only "add" commands ever add flows, "modify" and "modify-strict" never do. 
+ +- A new flag ``OFPFF_RESET_COUNTS`` now controls whether "modify" and + "modify-strict" reset counters, whereas previously they never reset counters + (except when they inserted a new flow). + +================================ === ====== ====== ====== ====== + MODIFY DELETE + ADD MODIFY STRICT DELETE STRICT +================================ === ====== ====== ====== ====== +match on ``priority`` yes --- yes --- yes +match on ``out_port`` --- --- --- yes yes +match on ``flow_cookie`` --- yes yes yes yes +match on ``table_id`` yes yes yes yes yes +controller chooses ``table_id`` yes yes yes +updates ``flow_cookie`` yes --- --- +updates ``OFPFF_SEND_FLOW_REM`` yes --- --- +honors ``OFPFF_CHECK_OVERLAP`` yes --- --- +updates ``idle_timeout`` yes --- --- +updates ``hard_timeout`` yes --- --- +resets idle timer yes --- --- +resets hard timer yes yes yes +zeros counters yes & & +may add a new flow yes --- --- +sends ``flow_removed`` message --- --- --- % % +================================ === ====== ====== ====== ====== + +``%`` + "delete" and "delete_strict" generates a flow_removed message if the deleted + flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each controller + can separately control whether it wants to receive the generated messages.) + +``&`` + "modify" and "modify-strict" reset counters if the ``OFPFF_RESET_COUNTS`` + flag is specified. + +OpenFlow 1.3 +~~~~~~~~~~~~ + +OpenFlow 1.3 makes these changes: + +- Behavior for a table_id of 255 is now defined, for "delete" and + "delete-strict" commands, as meaning to delete from all tables. A table_id + of 255 is now explicitly invalid for other commands. + +- New flags ``OFPFF_NO_PKT_COUNTS`` and ``OFPFF_NO_BYT_COUNTS`` for "add" + operations. + +The table for 1.3 is the same as the one shown above for 1.2. 
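The OpenFlow 1.2 and 1.3 changes listed above can be sketched in the same toy style. Everything here is illustrative rather than taken from Open vSwitch: `modify` and `delete` are hypothetical helpers, flows are plain dictionaries, and the numeric values of `OFPFF_RESET_COUNTS` and `OFPTT_ALL` are assumed to be the ones given in the OpenFlow specifications.

```python
OFPFF_RESET_COUNTS = 1 << 2  # assumed flag value from OpenFlow 1.2 and later
OFPTT_ALL = 255              # assumed "all tables" table_id from OpenFlow 1.3


def modify(tables, table_id, match, actions, flags=0):
    """OpenFlow 1.2+ "modify": never adds a flow; counters reset on request."""
    if table_id == OFPTT_ALL:
        raise ValueError("table_id 255 is invalid except for delete commands")
    flow = tables[table_id].get(match)
    if flow is None:
        return False  # nothing matched, and nothing is added
    flow["actions"] = actions
    if flags & OFPFF_RESET_COUNTS:
        flow["packets"] = flow["bytes"] = 0
    return True


def delete(tables, table_id, match):
    """OpenFlow 1.3 "delete": table_id 255 means delete from every table."""
    targets = tables.values() if table_id == OFPTT_ALL else [tables[table_id]]
    # Count how many tables actually held a matching flow.
    return sum(1 for t in targets if t.pop(match, None) is not None)
```

Here `tables` maps a table ID to a dictionary of flows keyed by match. A `modify` for an unknown match returns `False` and leaves the table unchanged, in contrast to the OpenFlow 1.0 and 1.1 behavior shown earlier.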
+ +OpenFlow 1.4 +~~~~~~~~~~~~ + +OpenFlow 1.4 makes these changes: + +- Adds the "importance" field to ``flow_mods``, but it does not explicitly + specify which kinds of ``flow_mods`` set the importance. For consistency, + Open vSwitch uses the same rule for importance as for ``idle_timeout`` and + ``hard_timeout``, that is, only an "ADD" flow_mod sets the importance. (This + issue has been filed with the ONF as EXT-496.) + +.. TODO(stephenfin) Link to EXT-496 + +- Eviction Mechanism to automatically delete entries of lower importance to + make space for newer entries. + +OpenFlow 1.4 Bundles +-------------------- + +Open vSwitch makes all flow table modifications atomically, i.e., any datapath +packet only sees flow table configurations either before or after any change +made by any ``flow_mod``. For example, if a controller removes all flows with +a single OpenFlow ``flow_mod``, no packet sees an intermediate version of the +OpenFlow pipeline where only some of the flows have been deleted. + +It should be noted that Open vSwitch caches datapath flows, and that the cached +flows are *NOT* flushed immediately when a flow table changes. Instead, the +datapath flows are revalidated against the new flow table as soon as possible, +and usually within one second of the modification. This design amortizes the +cost of datapath cache flushing across multiple flow table changes, and has a +significant performance effect during simultaneous heavy flow table churn and +high traffic load. This means that different cached datapath flows may have +been computed based on different flow table configurations, but each of the +datapath flows is guaranteed to have been computed over a coherent view of the +flow tables, as described above. + +With OpenFlow 1.4 bundles this atomicity can be extended across an arbitrary +set of ``flow_mod`` messages. Bundles are supported for ``flow_mod`` and ``port_mod`` +messages only. 
For ``flow_mod``, both ``atomic`` and ``ordered`` bundle flags +are trivially supported, as all bundled messages are executed in the order they +were added and all flow table modifications are now atomic to the datapath. +Port mods may not appear in atomic bundles, as port status modifications are +not atomic. + +To support bundles, ovs-ofctl has a ``--bundle`` option that makes the +flow mod commands (``add-flow``, ``add-flows``, ``mod-flows``, ``del-flows``, +and ``replace-flows``) use an OpenFlow 1.4 bundle to operate the +modifications as a single atomic transaction. If any of the flow mods +in a transaction fail, none of them are executed. All flow mods in a +bundle appear to datapath lookups simultaneously. + +Furthermore, ovs-ofctl ``add-flow`` and ``add-flows`` commands now accept +arbitrary flow mods as an input by allowing the flow specification to +start with an explicit ``add``, ``modify``, ``modify_strict``, ``delete``, or +``delete_strict`` keyword. A missing keyword is treated as ``add``, so +this is fully backwards compatible. With the new ``--bundle`` option +all the flow mods are executed as a single atomic transaction using an +OpenFlow 1.4 bundle. Without the ``--bundle`` option the flow mods are +executed in order up to the first failing ``flow_mod``, and in case of an +error the earlier successful ``flow_mod`` calls are not rolled back. + +``OFPT_PACKET_IN`` +------------------ + +The OpenFlow 1.1 specification for ``OFPT_PACKET_IN`` is confusing. The +definition in OF1.1 ``openflow.h`` is[*]: + +:: + + /* Packet received on port (datapath -> controller). */ + struct ofp_packet_in { + struct ofp_header header; + uint32_t buffer_id; /* ID assigned by datapath. */ + uint32_t in_port; /* Port on which frame was received. */ + uint32_t in_phy_port; /* Physical Port on which frame was received. */ + uint16_t total_len; /* Full length of frame. 
*/ + uint8_t reason; /* Reason packet is being sent (one of OFPR_*) */ + uint8_t table_id; /* ID of the table that was looked up */ + uint8_t data[0]; /* Ethernet frame, halfway through 32-bit word, + so the IP header is 32-bit aligned. The + amount of data is inferred from the length + field in the header. Because of padding, + offsetof(struct ofp_packet_in, data) == + sizeof(struct ofp_packet_in) - 2. */ + }; + OFP_ASSERT(sizeof(struct ofp_packet_in) == 24); + +The confusing part is the comment on the ``data[]`` member. This comment is a +leftover from OF1.0 ``openflow.h``, in which the comment was correct: +``sizeof(struct ofp_packet_in)`` is 20 in OF1.0 and ``offsetof(struct +ofp_packet_in, data)`` is 18. When OF1.1 was written, the structure members +were changed but the comment was carelessly not updated, and the comment became +wrong: ``sizeof(struct ofp_packet_in)`` and ``offsetof(struct ofp_packet_in, +data)`` are both 24 in OF1.1. + +That leaves the question of how to implement ``ofp_packet_in`` in OF1.1. The +OpenFlow reference implementation for OF1.1 does not include any padding, that +is, the first byte of the encapsulated frame immediately follows the +``table_id`` member without a gap. Open vSwitch therefore implements it the +same way for compatibility. + +For an earlier discussion, please see the thread archived at: +https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html + +[*] The quoted definition is directly from OF1.1. Definitions used inside OVS + omit the 8-byte ``ofp_header`` members, so the sizes in this discussion are + 8 bytes larger than those declared in OVS header files. + +VLAN Matching +------------- + +The 802.1Q VLAN header causes more trouble than any other 4 bytes in +networking. More specifically, three versions of OpenFlow and Open vSwitch +have among them four different ways to match the contents and presence of the +VLAN header. The following table describes how each version works. 
+ +======== ============= =============== =============== ================ + Match NXM OF1.0 OF1.1 OF1.2 +======== ============= =============== =============== ================ + ``[1]`` ``0000/0000`` ``????/1,??/?`` ``????/1,??/?`` ``0000/0000,--`` + ``[2]`` ``0000/ffff`` ``ffff/0,??/?`` ``ffff/0,??/?`` ``0000/ffff,--`` + ``[3]`` ``1xxx/1fff`` ``0xxx/0,??/1`` ``0xxx/0,??/1`` ``1xxx/ffff,--`` + ``[4]`` ``z000/f000`` ``????/1,0y/0`` ``fffe/0,0y/0`` ``1000/1000,0y`` + ``[5]`` ``zxxx/ffff`` ``0xxx/0,0y/0`` ``0xxx/0,0y/0`` ``1xxx/ffff,0y`` + ``[6]`` ``0000/0fff`` `` <none> `` ``<none>`` ``<none>`` + ``[7]`` ``0000/f000`` `` <none> `` ``<none>`` ``<none>`` + ``[8]`` ``0000/efff`` `` <none> `` ``<none>`` ``<none>`` + ``[9]`` ``1001/1001`` `` <none> `` ``<none>`` ``1001/1001,--`` +``[10]`` ``3000/3000`` `` <none> `` ``<none>`` ``<none>`` +``[11]`` ``1000/1000`` `` <none> `` ``fffe/0,??/1`` ``1000/1000,--`` +======== ============= =============== =============== ================ + +Where: + +Match: + See the list below. + +NXM: + ``xxxx/yyyy`` means ``NXM_OF_VLAN_TCI_W`` with value ``xxxx`` and mask + ``yyyy``. A mask of ``0000`` is equivalent to omitting + ``NXM_OF_VLAN_TCI(_W)``, a mask of ``ffff`` is equivalent to + ``NXM_OF_VLAN_TCI``. + +OF1.0, OF1.1: + ``wwww/x,yy/z`` means ``dl_vlan`` ``wwww``, ``OFPFW_DL_VLAN`` ``x``, + ``dl_vlan_pcp`` ``yy``, and ``OFPFW_DL_VLAN_PCP`` ``z``. If + ``OFPFW_DL_VLAN`` or ``OFPFW_DL_VLAN_PCP`` is 1, the corresponding field + value is wildcarded, otherwise it is matched. ``?`` means that the given + bits are ignored (their conventional values are ``0000/x,00/0`` in OF1.0, + ``0000/x,00/1`` in OF1.1; ``x`` is never ignored). ``<none>`` means that the + given match is not supported. + +OF1.2: + ``xxxx/yyyy,zz`` means ``OXM_OF_VLAN_VID_W`` with value ``xxxx`` and mask + ``yyyy``, and ``OXM_OF_VLAN_PCP`` (which is not maskable) with value ``zz``. 
+ A mask of ``0000`` is equivalent to omitting ``OXM_OF_VLAN_VID(_W)``, a mask + of ``ffff`` is equivalent to ``OXM_OF_VLAN_VID``. ``--`` means that + ``OXM_OF_VLAN_PCP`` is omitted. ``<none>`` means that the given match is not + supported. + +The matches are: + +``[1]``: + Matches any packet, that is, one without an 802.1Q header or with an 802.1Q + header with any TCI value. + +``[2]`` + Matches only packets without an 802.1Q header. + + NXM: + Any match with ``vlan_tci == 0`` and ``(vlan_tci_mask & 0x1000) != 0`` is + equivalent to the one listed in the table. + + OF1.0: + The spec doesn't define behavior if ``dl_vlan`` is set to ``0xffff`` and + ``OFPFW_DL_VLAN_PCP`` is not set. + + OF1.1: + The spec says explicitly to ignore ``dl_vlan_pcp`` when ``dl_vlan`` is set + to ``0xffff``. + + OF1.2: + The spec doesn't say what should happen if ``vlan_vid == 0`` and + ``(vlan_vid_mask & 0x1000) != 0`` but ``vlan_vid_mask != 0x1000``, but it + would be straightforward to also interpret as ``[2]``. + +``[3]`` + Matches only packets that have an 802.1Q header with VID ``xxx`` (and any + PCP). + +``[4]`` + Matches only packets that have an 802.1Q header with PCP ``y`` (and any VID). + + NXM: + ``z`` is ``(y << 1) | 1``. + + OF1.0: + The spec isn't very clear, but OVS implements it this way. + + OF1.2: + Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1000`` + would also work, but the spec doesn't define their behavior. + +``[5]`` + Matches only packets that have an 802.1Q header with VID ``xxx`` and PCP + ``y``. + + NXM: + ``z`` is ``((y << 1) | 1)``. + + OF1.2: + Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1fff`` + would also work. + +``[6]`` + Matches packets with no 802.1Q header or with an 802.1Q header with a VID of + 0. Only possible with NXM. + +``[7]`` + Matches packets with no 802.1Q header or with an 802.1Q header with a PCP of + 0. Only possible with NXM. 
+ +``[8]`` + Matches packets with no 802.1Q header or with an 802.1Q header with both VID + and PCP of 0. Only possible with NXM. + +``[9]`` + Matches only packets that have an 802.1Q header with an odd-numbered VID (and + any PCP). Only possible with NXM and OF1.2. (This is just an example; one + can match on any desired VID bit pattern.) + +``[10]`` + Matches only packets that have an 802.1Q header with an odd-numbered PCP (and + any VID). Only possible with NXM. (This is just an example; one can match + on any desired PCP bit pattern.) + +``[11]`` + Matches any packet with an 802.1Q header, regardless of VID or PCP. + +Additional notes: + +OF1.2: + The top three bits of ``OXM_OF_VLAN_VID`` are fixed to zero, so bits 13, 14, + and 15 in the masks listed in the table may be set to arbitrary values, as + long as the corresponding value bits are also zero. The suggested ``ffff`` + mask for [2], [3], and [5] allows a shorter OXM representation (the mask is + omitted) than the minimal ``1fff`` mask. + +Flow Cookies +------------ + +OpenFlow 1.0 and later versions have the concept of a "flow cookie", which is a +64-bit integer value attached to each flow. The treatment of the flow cookie +has varied greatly across OpenFlow versions, however. + +In OpenFlow 1.0: + +- ``OFPFC_ADD`` set the cookie in the flow that it added. + +- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` updated the cookie for the flow + or flows that it modified. + +- ``OFPST_FLOW`` messages included the flow cookie. + +- ``OFPT_FLOW_REMOVED`` messages reported the cookie of the flow that was + removed. + +OpenFlow 1.1 made the following changes: + +- Flow mod operations ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``, + ``OFPFC_DELETE``, and ``OFPFC_DELETE_STRICT``, plus flow stats requests and + aggregate stats requests, gained the ability to match on flow cookies with an + arbitrary mask. 
+ +- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to add a new flow, + in the case of no match, only if the flow table modification operation did + not match on the cookie field. (In OpenFlow 1.0, modify operations always + added a new flow when there was no match.) + +- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` no longer updated flow cookies. + +OpenFlow 1.2 made the following changes: + +- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to never add a new + flow, regardless of whether the flow cookie was used for matching. + +Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 behavior with +the following extensions: + +- An NXM extension field ``NXM_NX_COOKIE(_W)`` allows the NXM versions of + ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``, ``OFPFC_DELETE``, and + ``OFPFC_DELETE_STRICT`` ``flow_mod`` calls, plus flow stats requests and + aggregate stats requests, to match on flow cookies with arbitrary masks. + This is much like the equivalent OpenFlow 1.1 feature. + +- Like OpenFlow 1.1, ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` add a new flow + if there is no match and the mask is zero (or not given). + +- The ``cookie`` field in ``OFPT_FLOW_MOD`` and ``NXT_FLOW_MOD`` messages is + used as the cookie value for ``OFPFC_ADD`` commands, as described in OpenFlow + 1.0. For ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` commands, the + ``cookie`` field is used as a new cookie for flows that match unless it is + ``UINT64_MAX``, in which case the flow's cookie is not updated. + +- ``NXT_PACKET_IN`` (the Nicira extended version of ``OFPT_PACKET_IN``) reports + the cookie of the rule that generated the packet, or all-1-bits if no rule + generated the packet. (Older versions of OVS used all-0-bits instead of + all-1-bits.) + +The following table shows the handling of different protocols when receiving +``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` messages. 
A mask of 0 indicates +either an explicit mask of zero or an implicit one by not specifying the +``NXM_NX_COOKIE(_W)`` field. + +============== ====== ====== ============= ============= + Match Update Add on miss Add on miss + cookie cookie mask!=0 mask==0 +============== ====== ====== ============= ============= +OpenFlow 1.0 no yes (add on miss) (add on miss) +OpenFlow 1.1 yes no no yes +OpenFlow 1.2 yes no no no +NXM yes yes\* no yes +============== ====== ====== ============= ============= + +\* Updates the flow's cookie unless the ``cookie`` field is ``UINT64_MAX``. + +Multiple Table Support +---------------------- + +OpenFlow 1.0 has only rudimentary support for multiple flow tables. Notably, +OpenFlow 1.0 does not allow the controller to specify the flow table to which a +flow is to be added. Open vSwitch adds an extension for this purpose, which is +enabled on a per-OpenFlow connection basis using the ``NXT_FLOW_MOD_TABLE_ID`` +message. When the extension is enabled, the upper 8 bits of the ``command`` +member in an ``OFPT_FLOW_MOD`` or ``NXT_FLOW_MOD`` message designates the table +to which a flow is to be added. + +The Open vSwitch software switch implementation offers 255 flow tables. On +packet ingress, only the first flow table (table 0) is searched, and the +contents of the remaining tables are not considered in any way. Tables other +than table 0 only come into play when an ``NXAST_RESUBMIT_TABLE`` action +specifies another table to search. + +Tables 128 and above are reserved for use by the switch itself. Controllers +should use only tables 0 through 127. + +``OFPTC_*`` Table Configuration +------------------------------- + +This section covers the history of the ``OFPTC_*`` table configuration bits +across OpenFlow versions. + +OpenFlow 1.0 flow tables had fixed configurations. + +OpenFlow 1.1 enabled controllers to configure behavior upon flow table miss and +added the ``OFPTC_MISS_*`` constants for that purpose. 
``OFPTC_*`` did not +control anything else but it was nevertheless conceptualized as a set of +bit-fields instead of an enum. OF1.1 added the ``OFPT_TABLE_MOD`` message to +set ``OFPTC_MISS_*`` for a flow table and added the ``config`` field to the +``OFPST_TABLE`` reply to report the current setting. + +OpenFlow 1.2 did not change anything in this regard. + +OpenFlow 1.3 switched to another means of changing flow table miss behavior and +deprecated ``OFPTC_MISS_*`` without adding any more ``OFPTC_*`` constants. +This meant that ``OFPT_TABLE_MOD`` now had no purpose at all, but OF1.3 kept it +around "for backward compatibility with older and newer versions of the +specification." At the same time, OF1.3 introduced a new message +``OFPMP_TABLE_FEATURES`` that included a field ``config`` documented as reporting +the ``OFPTC_*`` values set with ``OFPT_TABLE_MOD``; of course this served no +real purpose because no ``OFPTC_*`` values are defined. OF1.3 did remove the +``OFPTC_*`` field from ``OFPMP_TABLE`` (previously named ``OFPST_TABLE``). + +OpenFlow 1.4 defined two new ``OFPTC_*`` constants, ``OFPTC_EVICTION`` and +``OFPTC_VACANCY_EVENTS``, using bits that did not overlap with ``OFPTC_MISS_*`` +even though those bits had not been defined since OF1.2. ``OFPT_TABLE_MOD`` +still controlled these settings. The field for ``OFPTC_*`` values in +``OFPMP_TABLE_FEATURES`` was renamed from ``config`` to ``capabilities`` and +documented as reporting the flags that are supported in an ``OFPT_TABLE_MOD`` +message. The ``OFPMP_TABLE_DESC`` message newly added in OF1.4 reported the +``OFPTC_*`` setting. + +OpenFlow 1.5 did not change anything in this regard. + +.. 
list-table:: Revisions + :header-rows: 1 + + * - OpenFlow + - ``OFPTC_*`` flags + - ``TABLE_MOD`` + - Statistics + - ``TABLE_FEATURES`` + - ``TABLE_DESC`` + * - OF1.0 + - none + - no (\*)(+) + - no (\*) + - nothing (\*)(+) + - no (\*)(+) + * - OF1.1/1.2 + - ``MISS_*`` + - yes + - yes + - nothing (+) + - no (+) + * - OF1.3 + - none + - yes (\*) + - no (\*) + - config (\*) + - no (\*)(+) + * - OF1.4/1.5 + - ``EVICTION``/``VACANCY_EVENTS`` + - yes + - no + - capabilities + - yes + +where: + +OpenFlow: + The OpenFlow version(s). + +``OFPTC_*`` flags: + The ``OFPTC_*`` flags defined in those versions. + +``TABLE_MOD``: + Whether ``OFPT_TABLE_MOD`` can modify ``OFPTC_*`` flags. + +Statistics: + Whether ``OFPST_TABLE/OFPMP_TABLE`` reports the ``OFPTC_*`` flags. + +``TABLE_FEATURES``: + What ``OFPMP_TABLE_FEATURES`` reports (if it exists): either the current + configuration or the switch's capabilities. + +``TABLE_DESC``: + Whether ``OFPMP_TABLE_DESC`` reports the current configuration. + +(\*): Nothing to report/change anyway. +(+): No such message. + +IPv6 +---- + +Open vSwitch supports stateless handling of IPv6 packets. Flows can be written +to support matching TCP, UDP, and ICMPv6 headers within an IPv6 packet. Deeper +matching of some Neighbor Discovery messages is also supported. + +IPv6 was not designed to interact well with middle-boxes. This, combined with +Open vSwitch's stateless nature, has affected the processing of IPv6 traffic, +which is detailed below. + +Extension Headers +~~~~~~~~~~~~~~~~~ + +The base IPv6 header is incredibly simple with the intention of only containing +information relevant for routing packets between two endpoints. IPv6 relies +heavily on the use of extension headers to provide any other functionality. +Unfortunately, the extension headers were designed in such a way that it is +impossible to move to the next header (including the layer-4 payload) unless +the current header is understood. 
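As a rough illustration of that constraint, the following C sketch walks a chain of extension headers until it reaches one it does not know how to skip. It is not Open vSwitch code: ``find_terminal_header()``, its parameters, and the enum names are hypothetical, and the length arithmetic follows the standard IPv6 and AH length encodings. The sketch skips every Fragment header and does not model the handling of later fragments described under "Fragments".

```c
#include <stddef.h>
#include <stdint.h>

/* IANA protocol numbers of the extension headers that can be skipped.
 * The enum names are illustrative only. */
enum {
    EXT_HOPOPTS = 0,    /* Hop-by-Hop Options */
    EXT_ROUTING = 43,   /* Routing */
    EXT_FRAGMENT = 44,  /* Fragment */
    EXT_AH = 51,        /* Authentication Header */
    EXT_DSTOPTS = 60    /* Destination Options */
};

/* Hypothetical helper, not an OVS function.
 *
 * 'first' is the Next Header value from the fixed IPv6 header; 'data' and
 * 'len' cover the bytes that follow that 40-byte header.  Returns the
 * protocol number of the first header that is not in the list above and
 * stores its offset in '*ofs', or returns -1 if the buffer ends first. */
static int
find_terminal_header(uint8_t first, const uint8_t *data, size_t len,
                     size_t *ofs)
{
    uint8_t proto = first;
    size_t pos = 0;

    for (;;) {
        size_t hdr_len;

        switch (proto) {
        case EXT_HOPOPTS:
        case EXT_ROUTING:
        case EXT_DSTOPTS:
            if (len - pos < 2) {
                return -1;
            }
            /* Length is in 8-octet units, not counting the first 8. */
            hdr_len = ((size_t) data[pos + 1] + 1) * 8;
            break;
        case EXT_AH:
            if (len - pos < 2) {
                return -1;
            }
            /* AH length is in 4-octet units, minus 2. */
            hdr_len = ((size_t) data[pos + 1] + 2) * 4;
            break;
        case EXT_FRAGMENT:
            hdr_len = 8;        /* Fixed size. */
            break;
        default:
            /* Unknown to this parser, so the walk has to stop here. */
            *ofs = pos;
            return proto;
        }

        if (len - pos < hdr_len) {
            return -1;
        }
        proto = data[pos];      /* Next Header is always the first byte. */
        pos += hdr_len;
    }
}
```

Given an 8-byte Hop-by-Hop Options header followed by an 8-byte Destination Options header and then TCP, a call with ``first`` set to 0 returns 6 and stores 16 in ``*ofs``. Returning -1 for a truncated chain is a choice made for this sketch only.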
+ +Open vSwitch will process the following extension headers and continue to the +next header: + +- Fragment (see the next section) +- AH (Authentication Header) +- Hop-by-Hop Options +- Routing +- Destination Options + +When a header is encountered that is not in that list, it is considered +"terminal". A terminal header's IPv6 protocol value is stored in ``nw_proto`` +for matching purposes. If a terminal header is TCP, UDP, or ICMPv6, the packet +will be further processed in an attempt to extract layer-4 information. + +Fragments +~~~~~~~~~ + +IPv6 requires that every link in the internet have an MTU of 1280 octets or +greater (RFC 2460). As such, a terminal header (as described above in +"Extension Headers") in the first fragment should generally be reachable. In +this case, the terminal header's IPv6 protocol type is stored in the +``nw_proto`` field for matching purposes. If a terminal header cannot be found +in the first fragment (one with a fragment offset of zero), the ``nw_proto`` +field is set to 0. Subsequent fragments (those with a non-zero fragment +offset) have the ``nw_proto`` field set to the IPv6 protocol type for fragments +(44). + +Jumbograms +~~~~~~~~~~ + +An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer than +65,535 octets. Jumbograms are only relevant in subnets with a link MTU greater +than 65,575 octets, and nodes that do not connect to links with such large MTUs +are not required to support them. Currently, Open vSwitch doesn't process +jumbograms. + +In-Band Control +--------------- + +Motivation +~~~~~~~~~~ + +An OpenFlow switch must establish and maintain a TCP network connection to its +controller. There are two basic ways to categorize the network that this +connection traverses: either it is completely separate from the one that the +switch is otherwise controlling, or its path may overlap the network that the +switch controls. We call the former case "out-of-band control", the latter +case "in-band control". 
+ +Out-of-band control has the following benefits: + +- Simplicity: Out-of-band control slightly simplifies the switch + implementation. + +- Reliability: Excessive switch traffic volume cannot interfere with control + traffic. + +- Integrity: Machines not on the control network cannot impersonate a switch or + a controller. + +- Confidentiality: Machines not on the control network cannot snoop on control + traffic. + +In-band control, on the other hand, has the following advantages: + +- No dedicated port: There is no need to dedicate a physical switch port to + control, which is important on switches that have few ports (e.g. wireless + routers, low-end embedded platforms). + +- No dedicated network: There is no need to build and maintain a separate + control network. This is important in many environments because it reduces + proliferation of switches and wiring. + +Open vSwitch supports both out-of-band and in-band control. This section +describes the principles behind in-band control. See the description of the +Controller table in ovs-vswitchd.conf.db(5) to configure OVS for in-band +control. + +Principles +~~~~~~~~~~ + +The fundamental principle of in-band control is that an OpenFlow switch must +recognize and switch control traffic without involving the OpenFlow controller. +All the details of implementing in-band control are special cases of this +principle. + +The rationale for this principle is simple. If the switch does not handle +in-band control traffic itself, then it will be caught in a contradiction: it +must contact the controller, but it cannot, because only the controller can set +up the flows that are needed to contact the controller. + +The following points describe important special cases of this principle. + +- In-band control must be implemented regardless of whether the switch is + connected. 
+ + It is tempting to implement the in-band control rules only when the switch is + not connected to the controller, using the reasoning that the controller + should have complete control once it has established a connection with the + switch. + + This does not work in practice. Consider the case where the switch is + connected to the controller. Occasionally it can happen that the controller + forgets or otherwise needs to obtain the MAC address of the switch. To do + so, the controller sends a broadcast ARP request. A switch that implements + the in-band control rules only when it is disconnected will then send an + ``OFPT_PACKET_IN`` message up to the controller. The controller will be + unable to respond, because it does not know the MAC address of the switch. + This is a deadlock situation that can only be resolved by the switch noticing + that its connection to the controller has hung and reconnecting. + +- In-band control must override flows set up by the controller. + + It is reasonable to assume that flows set up by the OpenFlow controller + should take precedence over in-band control, on the basis that the controller + should be in charge of the switch. + + Again, this does not work in practice. Reasonable controller implementations + may set up a "last resort" fallback rule that wildcards every field and, + e.g., sends it up to the controller or discards it. If a controller does + that, then it will isolate itself from the switch. + +- The switch must recognize all control traffic. + + The fundamental principle of in-band control states, in part, that a switch + must recognize control traffic without involving the OpenFlow controller. + More specifically, the switch must recognize *all* control traffic. "False + negatives", that is, packets that constitute control traffic but that the + switch does not recognize as control traffic, lead to control traffic storms. 
+ + Consider an OpenFlow switch that only recognizes control packets sent to or + from that switch. Now suppose that two switches of this type, named A and B, + are connected to ports on an Ethernet hub (not a switch) and that an OpenFlow + controller is connected to a third hub port. In this setup, control traffic + sent by switch A will be seen by switch B, which will send it to the + controller as part of an OFPT_PACKET_IN message. Switch A will then see the + OFPT_PACKET_IN message's packet, re-encapsulate it in another OFPT_PACKET_IN, + and send it to the controller. Switch B will then see that OFPT_PACKET_IN, + and so on in an infinite loop. + + Incidentally, the consequences of "false positives", where packets that are + not control traffic are nevertheless recognized as control traffic, are much + less severe. The controller will not be able to control their behavior, but + the network will remain in working order. False positives do constitute a + security problem. + +- The switch should use echo-requests to detect disconnection. + + TCP will notice that a connection has hung, but this can take a considerable + amount of time. For example, with default settings the Linux kernel TCP + implementation will retransmit for between 13 and 30 minutes, depending on + the connection's retransmission timeout, according to kernel documentation. + This is far too long for a switch to be disconnected, so an OpenFlow switch + should implement its own connection timeout. OpenFlow ``OFPT_ECHO_REQUEST`` + messages are the best way to do this, since they test the OpenFlow connection + itself. + +Implementation +~~~~~~~~~~~~~~ + +This section describes how Open vSwitch implements in-band control. Correctly +implementing in-band control has proven difficult due to its many subtleties, +and has thus gone through many iterations. Please read through and understand +the reasoning behind the chosen rules before making modifications. 
+ +Open vSwitch implements in-band control as "hidden" flows, that is, flows that +are not visible through OpenFlow, and at a higher priority than wildcarded +flows can be set up through OpenFlow. This is done so that the OpenFlow +controller cannot interfere with them and possibly break connectivity with its +switches. It is possible to see all flows, including in-band ones, with the +ovs-appctl "bridge/dump-flows" command. + +The Open vSwitch implementation of in-band control can hide traffic to +arbitrary "remotes", where each remote is one TCP port on one IP address. +Currently the remotes are automatically configured as the in-band OpenFlow +controllers plus the OVSDB managers, if any. (The latter is a requirement +because OVSDB managers are responsible for configuring OpenFlow controllers, so +if the manager cannot be reached then OpenFlow cannot be reconfigured.) + +The following rules (with the OFPP_NORMAL action) are set up on any bridge that +has any remotes: + +(a) + DHCP requests sent from the local port. +(b) + ARP replies to the local port's MAC address. +(c) + ARP requests from the local port's MAC address. + +In-band also sets up the following rules for each unique next-hop MAC address +for the remotes' IPs (the "next hop" is either the remote itself, if it is on a +local subnet, or the gateway to reach the remote): + +(d) + ARP replies to the next hop's MAC address. +(e) + ARP requests from the next hop's MAC address. + +In-band also sets up the following rules for each unique remote IP address: + +(f) + ARP replies containing the remote's IP address as a target. +(g) + ARP requests containing the remote's IP address as a source. + +In-band also sets up the following rules for each unique remote (IP,port) pair: + +(h) + TCP traffic to the remote's IP and port. +(i) + TCP traffic from the remote's IP and port. + +The goal of these rules is to be as narrow as possible to allow a switch to +join a network and be able to communicate with the remotes. 
As mentioned +earlier, these rules have higher priority than the controller's rules, so if +they are too broad, they may prevent the controller from implementing its +policy. As such, in-band actively monitors some aspects of flow and packet +processing so that the rules can be made more precise. + +In-band control monitors attempts to add flows into the datapath that could +interfere with its duties. The datapath only allows exact match entries, so +in-band control is able to be very precise about the flows it prevents. Flows +that miss in the datapath are sent to userspace to be processed, so preventing +these flows from being cached in the "fast path" does not affect correctness. +The only type of flow that is currently prevented is one that would prevent +DHCP replies from being seen by the local port. For example, a rule that +forwarded all DHCP traffic to the controller would not be allowed, but one that +forwarded to all ports (including the local port) would. + +As mentioned earlier, packets that miss in the datapath are sent to the +userspace for processing. The userspace has its own flow table, the +"classifier", so in-band checks whether any special processing is needed before +the classifier is consulted. If a packet is a DHCP response to a request from +the local port, the packet is forwarded to the local port, regardless of the +flow table. Note that this requires L7 processing of DHCP replies to determine +whether the 'chaddr' field matches the MAC address of the local port. + +It is interesting to note that for an L3-based in-band control mechanism, the +majority of rules are devoted to ARP traffic. At first glance, some of these +rules appear redundant. However, each serves an important role. First, in +order to determine the MAC address of the remote side (controller or gateway) +for other ARP rules, we must allow ARP traffic for our local port with rules +(b) and (c). 
If we are between a switch and its connection to the remote, we +have to allow the other switch's ARP traffic through. This is done with +rules (d) and (e), since we do not know the addresses of the other switches a +priori, but do know the remote's or gateway's. Finally, if the remote is +running in a local guest VM that is not reached through the local port, the +switch that is connected to the VM must allow ARP traffic based on the remote's +IP address, since it will not know the MAC address of the local port that is +sending the traffic or the MAC address of the remote in the guest VM. + +With a few notable exceptions below, in-band should work in most network +setups. The following are considered "supported" in the current +implementation: + +- Locally Connected. The switch and remote are on the same subnet. This uses + rules (a), (b), (c), (h), and (i). + +- Reached through Gateway. The switch and remote are on different subnets and + must go through a gateway. This uses rules (a), (b), (c), (h), and (i). + +- Between Switch and Remote. This switch is between another switch and the + remote, and we want to allow the other switch's traffic through. This uses + rules (d), (e), (h), and (i). It uses (b) and (c) indirectly in order to + know the MAC address for rules (d) and (e). Note that DHCP for the other + switch will not work unless an OpenFlow controller explicitly lets this + switch pass the traffic. + +- Between Switch and Gateway. This switch is between another switch and the + gateway, and we want to allow the other switch's traffic through. This uses + the same rules and logic as the "Between Switch and Remote" configuration + described earlier. + +- Remote on Local VM. The remote is a guest VM on the system running in-band + control. This uses rules (a), (b), (c), (h), and (i). + +- Remote on Local VM with Different Networks. The remote is a guest VM on the + system running in-band control, but the local port is not used to connect to + the remote. 
For example, an IP address is configured on eth0 of the switch. + The remote's VM is connected through eth1 of the switch, but an IP address + has not been configured for that port on the switch. As such, the switch + will use eth0 to connect to the remote, and eth1's rules about the local port + will not work. In the example, the switch attached to eth0 would use rules + (a), (b), (c), (h), and (i) on eth0. The switch attached to eth1 would use + rules (f), (g), (h), and (i). + +The following are explicitly *not* supported by in-band control: + +- Specify Remote by Name. Currently, the remote must be identified by IP + address. A naive approach would be to permit all DNS traffic. + Unfortunately, this would prevent the controller from defining any policy + over DNS. Since switches that are located behind us need to connect to the + remote, in-band cannot simply add a rule that allows DNS traffic from the + local port. The "correct" way to support this is to parse DNS requests to + allow all traffic related to a request for the remote's name through. Due to + the potential security problems and amount of processing, we decided to hold + off for the time-being. + +- Differing Remotes for Switches. All switches must know the L3 addresses for + all the remotes that other switches may use, since rules need to be set up to + allow traffic related to those remotes through. See rules (f), (g), (h), and + (i). + +- Differing Routes for Switches. In order for the switch to allow other + switches to connect to a remote through a gateway, it allows the gateway's + traffic through with rules (d) and (e). If the routes to the remote differ + for the two switches, we will not know the MAC address of the alternate + gateway. + +Action Reproduction +------------------- + +It seems likely that many controllers, at least at startup, use the OpenFlow +"flow statistics" request to obtain existing flows, then compare the flows' +actions against the actions that they expect to find. 
Before version 1.8.0, +Open vSwitch always returned exact, byte-for-byte copies of the actions that +had been added to the flow table. The current version of Open vSwitch does not +always do this in some exceptional cases. This section lists the exceptions +that controller authors must keep in mind if they compare actual actions +against desired actions in a bytewise fashion: + +- Open vSwitch zeros padding bytes in action structures, regardless of their + values when the flows were added. + +- Open vSwitch "normalizes" the instructions in OpenFlow 1.1 (and later) in the + following way: + + * OVS sorts the instructions into the following order: Apply-Actions, + Clear-Actions, Write-Actions, Write-Metadata, Goto-Table. + + * OVS drops Apply-Actions instructions that have empty action lists. + + * OVS drops Write-Actions instructions that have empty action sets. + +Please report other discrepancies, if you notice any, so that we can fix or +document them. + +Suggestions +----------- + +Suggestions to improve Open vSwitch are welcome at disc...@openvswitch.org. diff --git a/Makefile.am b/Makefile.am index f674b7e..ca19ec5 100644 --- a/Makefile.am +++ b/Makefile.am @@ -68,7 +68,7 @@ PYCOV_CLEAN_FILES = build-aux/check-structs,cover docs = \ CONTRIBUTING.rst \ CodingStyle.rst \ - DESIGN.md \ + DESIGN.rst \ FAQ.md \ INSTALL.rst \ INSTALL.Debian.rst \ diff --git a/include/openvswitch/ofp-util.h b/include/openvswitch/ofp-util.h index f3cb624..e4dacbf 100644 --- a/include/openvswitch/ofp-util.h +++ b/include/openvswitch/ofp-util.h @@ -813,7 +813,7 @@ struct ofputil_table_features { * supported, otherwise 0. For other versions, they are decoded as -1 and * ignored for encoding. * - * See the section "OFPTC_* Table Configuration" in DESIGN.md for more + * See the section "OFPTC_* Table Configuration" in DESIGN.rst for more * details of how OpenFlow has changed in this area. */ enum ofputil_table_miss miss_config; /* OF1.1 and 1.2 only. 
*/ diff --git a/lib/ofp-util.c b/lib/ofp-util.c index 0445968..c0a402f 100644 --- a/lib/ofp-util.c +++ b/lib/ofp-util.c @@ -5675,7 +5675,7 @@ ofputil_encode_table_config(enum ofputil_table_miss miss, enum ofp_version version) { uint32_t config = 0; - /* See the section "OFPTC_* Table Configuration" in DESIGN.md for more + /* See the section "OFPTC_* Table Configuration" in DESIGN.rst for more * information on the crazy evolution of this field. */ switch (version) { case OFP10_VERSION: diff --git a/ovn/controller/pinctrl.c b/ovn/controller/pinctrl.c index 590fe11..db9e441 100644 --- a/ovn/controller/pinctrl.c +++ b/ovn/controller/pinctrl.c @@ -732,7 +732,7 @@ pinctrl_recv(const struct ofp_header *oh, enum ofptype type) queue_msg(make_echo_reply(oh)); } else if (type == OFPTYPE_GET_CONFIG_REPLY) { /* Enable asynchronous messages (see "Asynchronous Messages" in - * DESIGN.md for more information). */ + * DESIGN.rst for more information). */ struct ofputil_switch_config config; ofputil_decode_get_config_reply(oh, &config); diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml index bc5dfb7..95cba98 100644 --- a/ovn/ovn-architecture.7.xml +++ b/ovn/ovn-architecture.7.xml @@ -342,7 +342,7 @@ It's possible, however, for some other bridge in the same system to have an in-band remote controller, and in that case this suppresses the flows that in-band control would ordinarily set up. See <code>In-Band - Control</code> in <code>DESIGN.md</code> for more information. + Control</code> in <code>DESIGN.rst</code> for more information. 
</dd> </dl> diff --git a/rhel/openvswitch-fedora.spec.in b/rhel/openvswitch-fedora.spec.in index 3213864..00b40ea 100644 --- a/rhel/openvswitch-fedora.spec.in +++ b/rhel/openvswitch-fedora.spec.in @@ -478,7 +478,7 @@ fi %{_mandir}/man8/ovs-vswitchd.8* %{_mandir}/man8/ovs-parse-backtrace.8* %{_mandir}/man8/ovs-testcontroller.8* -%doc COPYING DESIGN.md INSTALL.SSL.md NOTICE README.rst WHY-OVS.rst +%doc COPYING DESIGN.rst INSTALL.SSL.md NOTICE README.rst WHY-OVS.rst %doc FAQ.md NEWS INSTALL.DPDK.rst rhel/README.RHEL /var/lib/openvswitch /var/log/openvswitch diff --git a/rhel/openvswitch.spec.in b/rhel/openvswitch.spec.in index d473e76..015043b 100644 --- a/rhel/openvswitch.spec.in +++ b/rhel/openvswitch.spec.in @@ -247,7 +247,7 @@ exit 0 /usr/share/openvswitch/scripts/sysconfig.template /usr/share/openvswitch/vswitch.ovsschema /usr/share/openvswitch/vtep.ovsschema -%doc COPYING DESIGN.md INSTALL.SSL.md NOTICE README.rst WHY-OVS.rst FAQ.md NEWS +%doc COPYING DESIGN.rst INSTALL.SSL.md NOTICE README.rst WHY-OVS.rst FAQ.md NEWS %doc INSTALL.DPDK.rst rhel/README.RHEL README-native-tunneling.rst /var/lib/openvswitch /var/log/openvswitch -- 2.7.4