Hi Yevgeny,

On 7/21/07, Yevgeny Kliteynik <[EMAIL PROTECTED]> wrote:
> Hi All
>
> Please find the attached RFC describing how QoS policy support could be
> implemented in the OpenFabrics stack.
> Your comments are welcome.
A couple of quick questions: how does this differ from the original RFC
posted 5/30/06? What I can see is the following:

1. Updated for the not yet released IBTA QoS Annex
2. Use of a plain text rather than XML based policy file for OpenSM

Anything else?

Below, IPoIB is discussed in terms of UD. What about IPoIB-CM? It uses
the CM and has a service ID.

Also, have my specific comments on the patches originally submitted been
addressed? (Do I need to dig them out again?) Just wondering... Thanks.

-- Hal

> -- Yevgeny
>
> RFC: OpenFabrics Enhancements for QoS Support
> ===============================================
>
> Authors:  Eitan Zahavi <eitan at mellanox.co.il>
>           Yevgeny Kliteynik <kliteyn at mellanox.co.il>
> Date:     Jul 2007
> Revision: 0.2
>
> Table of contents:
> 1. Overview
> 2. Architecture
> 3. Supported Policy
> 4. IPoIB functionality
> 5. CMA functionality
> 6. SDP functionality
> 7. SRP functionality
> 8. iSER functionality
> 9. OpenSM functionality
>
> 1. Overview
> ------------
> Quality of Service requirements stem from the realization of I/O
> consolidation over an IB network: as multiple applications and ULPs
> share the same fabric, means to control their use of the network
> resources become a must. The basic need is to differentiate the service
> levels provided to different traffic flows, such that a policy can be
> enforced to control each flow's utilization of the fabric resources.
>
> The IBTA specification defines several hardware features and management
> interfaces to support QoS:
> * Up to 15 Virtual Lanes (VLs) carry traffic in a non-blocking manner
> * Arbitration between traffic of different VLs is performed by a
>   two-priority-level weighted round-robin arbiter. The arbiter is
>   programmable with a sequence of (VL, weight) pairs and the maximum
>   number of high-priority credits to be processed before low priority
>   is served
> * Packets carry a class-of-service marking in the range 0 to 15 in the
>   SL field of their header
> * Each switch can map an incoming packet by its SL to a particular
>   output VL, based on a programmable table
>   VL = SL-to-VL-MAP(in-port, out-port, SL)
>   (see the sketch at the end of this section)
> * The Subnet Administrator controls the parameters of each communication
>   flow by providing them in the response to Path Record (PR) or
>   MultiPathRecord (MPR) queries
>
> The IB QoS features provide the means to implement a DiffServ-like
> architecture. The DiffServ architecture (IETF RFC 2474, 2475) is widely
> used today in highly dynamic fabrics.
>
> This proposal provides the detailed functional definition for the
> various software elements that are required to enable a DiffServ-like
> architecture over the OpenFabrics software stack.
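> For illustration only, the SL-to-VL mapping named in the bullet list
> above can be pictured as a plain lookup table. This is a hypothetical
> sketch: the array size and names are not taken from OpenSM or any
> driver.
>
>     #include <stdint.h>
>
>     #define NUM_PORTS 36  /* hypothetical switch port count */
>
>     /* one entry per (input port, output port, SL) triplet */
>     static uint8_t sl2vl_map[NUM_PORTS + 1][NUM_PORTS + 1][16];
>
>     static inline uint8_t output_vl(uint8_t in_port, uint8_t out_port,
>                                     uint8_t sl)
>     {
>         return sl2vl_map[in_port][out_port][sl & 0xf];
>     }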
>
>
> 2. Architecture
> ----------------
> This proposal splits the QoS functionality between the SM/SA, the CMA
> and the various ULPs. We take the "chronology approach" to describe how
> the overall system works:
>
> 2.1. The network manager (human) provides a set of rules (a policy) that
> defines how the network is configured and how its resources are split
> into different QoS-Levels. The policy also defines how to decide which
> QoS-Level each application, ULP or service uses.
>
> 2.2. The SM analyzes the provided policy to see if it is realizable and
> performs the necessary fabric setup. The SM may continuously monitor the
> policy and adapt to changes in it. Part of this policy defines the
> default QoS-Level of each partition. The SA is enhanced to match the
> requested Source, Destination, QoS-Class and Service-ID (and optionally
> SL and priority) against the policy, so that clients (ULPs, programs)
> can obtain a policy-enforced QoS. The SM is also enhanced to support
> setting up partitions with an appropriate IPoIB broadcast group. This
> broadcast group carries its QoS attributes: SL, MTU and RATE.
>
> 2.3. IPoIB is set up. IPoIB uses the SL, MTU and RATE available on the
> multicast group which forms the broadcast group of this partition.
>
> 2.4. MPI, which provides non-IB-based connection management, should be
> configured to run using hard-coded SLs. It uses these SLs for every QP
> being opened.
>
> 2.5. ULPs that use the CM interface (like SRP) should have their own
> pre-assigned Service-ID and use it while obtaining the PR/MPR for
> establishing connections. The SA receiving the PR/MPR should match it
> against the policy and return the appropriate PR/MPR including SL, MTU
> and RATE.
>
> 2.6. ULPs and programs using the CMA to establish an RC connection
> should provide the CMA with the target IP address and a Service-ID. Some
> ULPs might also provide a QoS-Class (e.g. SDP sockets that were given
> the TOS socket option). The CMA should then use the provided Service-ID
> and optional QoS-Class and pass them in the PR/MPR request. The
> resulting PR/MPR should be used for configuring the connection QP.
>
> PathRecord and MultiPathRecord enhancement for QoS:
> As mentioned above, the PathRecord and MultiPathRecord attributes should
> be enhanced to carry the Service-ID, which is a 64-bit value; this has
> been standardized by the IBTA. A new QoS-Class field is also provided.
> A new capability bit in the SA ClassPortInfo should describe the SM QoS
> support. This approach provides an easy migration path for the existing
> access layer and ULPs by not introducing a new set of PR/MPR attributes.
>
>
> 3. Supported Policy
> --------------------
>
> The QoS policy supported by this proposal is divided into 4 subsections:
>
> I) Port Group: a set of CAs, routers or switches that share the same
> settings. A port group might be a partition defined by the partition
> manager policy in terms of GUIDs. Future implementations might provide
> support for NodeDescription-based definition of port groups.
>
> II) Fabric Setup:
> Defines how the SL2VL and VLArb tables should be set up. This policy
> definition assumes that the computation of the overall end-to-end
> network behavior is performed outside of OpenSM.
>
> III) QoS-Levels Definition:
> This section defines the possible sets of QoS parameters that a client
> might be mapped to. Each set holds an SL and, optionally: Max MTU, Max
> Rate, Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
>
> IV) Matching Rules:
> A list of rules that match an incoming PR/MPR request to a QoS-Level.
> The rules are processed in order and the first match is applied. Each
> rule is built out of a set of match expressions, all of which should
> match for the rule to apply. The matching expressions are defined for
> the following fields:
> ** SRC and DST against lists of port groups
> ** Service-ID against a list of Service-ID values or ranges
> ** QoS-Class against a list of QoS-Class values or ranges
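> As a rough illustration of the first-fit semantics just described, a
> minimal sketch in C follows. The structures, field names and helpers are
> hypothetical simplifications, not OpenSM code, and only the Service-ID
> expression is spelled out; real rules also match source/destination port
> groups and QoS-Class.
>
>     #include <stdint.h>
>
>     struct sid_range { uint64_t first, last; };   /* inclusive range */
>
>     struct qos_match_rule {
>         const struct sid_range *sids;   /* Service-ID expressions      */
>         unsigned num_sids;              /* 0 = field not defined       */
>         unsigned qos_level_sn;          /* QoS-Level to use on a match */
>     };
>
>     struct pr_request {
>         uint64_t service_id;
>         int service_id_set;             /* component mask bit          */
>     };
>
>     /* A rule matches only if every field it defines matches the request. */
>     static int rule_matches(const struct qos_match_rule *r,
>                             const struct pr_request *q)
>     {
>         unsigned i;
>
>         if (r->num_sids) {
>             if (!q->service_id_set)     /* field absent in the request */
>                 return 0;
>             for (i = 0; i < r->num_sids; i++)
>                 if (q->service_id >= r->sids[i].first &&
>                     q->service_id <= r->sids[i].last)
>                     break;
>             if (i == r->num_sids)
>                 return 0;
>         }
>         return 1;
>     }
>
>     /* First-fit scan: the first matching rule selects the QoS-Level. */
>     static unsigned find_qos_level(const struct qos_match_rule *rules,
>                                    unsigned num_rules,
>                                    const struct pr_request *q,
>                                    unsigned default_sn)
>     {
>         unsigned i;
>
>         for (i = 0; i < num_rules; i++)
>             if (rule_matches(&rules[i], q))
>                 return rules[i].qos_level_sn;
>         return default_sn;              /* no rule matched */
>     }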
>
> QoS Policy file syntax:
>
> * Leading and trailing blanks, as well as empty lines, are ignored, so
>   the indentation in the example below is just for readability
> * Comments start with the pound sign (#) and run to the end of the line
> * Comments may appear only on a separate line
> * Keywords that denote a section/subsection start have matching closing
>   keywords
> * Any keyword should be the first non-blank token on its line
>
> QoS Policy file example:
>
>     # Port Groups define sets of ports to be used later in the settings
>     port-groups
>
>         # using port GUIDs
>         port-group
>             name: Storage
>             # "use" is just a description that is used for logging.
>             # Other than that, it is just commentary
>             use: our SRP storage targets
>             port-guid: 0x1000000000000001
>             port-guid: 0x1000000000000002
>         end-port-group
>
>         port-group
>             name: Virtual Servers
>             use: node desc and IB port num
>             # The syntax of the port name is as follows:
>             #    "hostname/CA-num/Pnum".
>             # "hostname" and "CA-num" are compared to the first 2 words
>             # of the NodeDescription, and "Pnum" is a port number on
>             # that node.
>             port-name: vs1/HCA-1/P1
>             port-name: vs3/HCA-1/P1
>             port-name: vs3/HCA-2/P2
>         end-port-group
>
>         # using partitions defined in the partition policy
>         port-group
>             name: Group for Partition 1
>             use: default settings
>             partition: Part1
>         end-port-group
>
>         # using node types: CA|ROUTER|SWITCH
>         port-group
>             name: Routers
>             use: all routers
>             node-type: ROUTER
>         end-port-group
>
>     end-port-groups
>
>     qos-setup
>
>         # define all types of VLArb tables. The lengths of the tables
>         # should match the lengths physically supported by their target
>         # ports
>         vlarb-tables
>             # a scope defines the exact ports the VLArb tables apply to
>             vlarb-scope
>                 # defining VLArb tables on all the ports that belong to
>                 # port group 'Storage', and on all the ports connected
>                 # to ports of port group 'Storage'
>                 group: Storage
>                 # "across" means all the ports that are connected to
>                 # ports that belong to the specified port group
>                 across: Storage
>                 # a VLArb table holds VL:weight pairs
>                 vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>                 vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>                 vl-high-limit: 10
>             end-vlarb-scope
>             # There can be several scopes
>         end-vlarb-tables
>
>         sl2vl-tables
>             # A scope defines the exact devices and in/out ports the
>             # tables apply to.
>             # Note: if the same port matches several rules, the *FIRST*
>             # one applies.
>             sl2vl-scope
>                 # SL2VL tables are organized as SL2VL(in-port, out-port)
>                 # "from: n,m" means we define SL2VL(n,*) and SL2VL(m,*)
>                 # "to: n,m" means we define SL2VL(*,n) and SL2VL(*,m)
>                 #
>                 # The following example specifies that all the SL2VL
>                 # table entries should be defined for all the ports of
>                 # group Part1:
>                 group: Part1
>                 from: *
>                 to: *
>                 # An SL2VL table has at most 16 values - one for each
>                 # SL. If the user specifies fewer than 16 values, all
>                 # the missing VL values are implicitly set to 0
>                 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>             end-sl2vl-scope
>
>             sl2vl-scope
>                 # "across-to" is a combination of the "across" keyword
>                 # (defined in the VLArb tables section above) and the
>                 # "to" keyword.
>                 # "across: PortGroupName" refers to all the ports that
>                 # are connected to ports that belong to PortGroupName.
>                 #
>                 # Example of "across-to" usage:
>                 # A user has a set of 'special' nodes (e.g. storage
>                 # nodes), and all the traffic to these nodes has to get
>                 # a specific VL. The solution is to define a port group
>                 # (e.g. "Storage") that includes all the ports of these
>                 # nodes, and then to configure SL2VL tables on all the
>                 # switch ports that are connected to the Storage port
>                 # group by specifying "across-to: Storage".
>                 #
>                 across-to: Storage2
>                 # Similar to "across-to", "across-from" is a combination
>                 # of the "across" and "from" keywords
>                 across-from: Storage1
>                 sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>             end-sl2vl-scope
>         end-sl2vl-tables
>
>     end-qos-setup
>
>
>     qos-levels
>
>         # the first level sets the SL and the packet lifetime
>         qos-level
>             use: for the lowest priority communication
>             sl: 15
>             packet-life: 16
>         end-qos-level
>
>         # the second level sets only the SL
>         qos-level
>             use: low latency, best bandwidth
>             sl: 0
>         end-qos-level
>
>         # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime,
>         # Path Bits
>         qos-level
>             use: just an example
>             sl: 0
>             mtu-limit: 1
>             rate-limit: 1
>             packet-life: 12
>             # Path Bits can be used e.g. to provide different routes
>             # through the subnet to a particular port
>             path-bits: 2,4,8-32
>         end-qos-level
>
>     end-qos-levels
>
>
>     # Match rules are scanned in a first-fit manner (like a firewall
>     # rules table)
>     qos-match-rules
>
>         # matching by a single criterion: class (list of values and
>         # ranges)
>         qos-match-rule
>             # just a description
>             use: low latency by class 7-9 or 11
>             qos-class: 7-9,11
>             # serial number of the qos-level to apply to the matching
>             # PR/MPR
>             qos-level-sn: 1
>         end-qos-match-rule
>
>         # matching by destination group AND service-ids
>         qos-match-rule
>             use: Storage targets connection
>             destination: Storage
>             service-id: 22,4719-5000
>             qos-level-sn: 2
>         end-qos-match-rule
>
>         # matching by source group only
>         qos-match-rule
>             use: bla bla
>             source: Storage
>             qos-level-sn: 3
>         end-qos-match-rule
>
>     end-qos-match-rules
>
>
> 4. IPoIB
> ---------
>
> IPoIB already queries the SA for its broadcast group information. The
> additional functionality required is for IPoIB to provide the broadcast
> group SL, MTU and RATE in every subsequent PathRecord query performed
> when a new UDAV is needed by IPoIB.
> We could assign a special Service-ID for IPoIB use, but since all
> communication on the same IPoIB interface shares the same QoS-Level,
> with no ability to differentiate it by target service, we can ignore it
> for simplicity.
>
> 5. CMA features
> ----------------
>
> The CMA interface supports the Service-ID through the notion of a port
> space used as a prefix to the port_num that is part of the sockaddr
> provided to rdma_resolve_addr(). What is missing is an explicit request
> for a QoS-Class that would allow a ULP (like SDP) to propagate a
> specific request for a class of service. A mechanism for providing the
> QoS-Class is available in the IPv6 address, so we could use that address
> field. Another option is to implement a special connection options API
> for the CMA.
>
> The functionality missing from the CMA is the use of the provided
> QoS-Class and Service-ID in the PR/MPR it sends. When a response is
> obtained, it is an existing requirement that the CMA use the PR/MPR from
> the response when setting up the QP address vector.
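> For illustration, a minimal librdmacm sketch of the client-side flow
> described above. The function is hypothetical, error handling and the
> event loop are abbreviated, and no QoS-Class knob is set since this RFC
> leaves that part of the CMA API open.
>
>     #include <stdint.h>
>     #include <netinet/in.h>
>     #include <arpa/inet.h>
>     #include <rdma/rdma_cma.h>
>
>     static int resolve_with_qos(const char *dst_ip, uint16_t dst_port)
>     {
>         struct rdma_event_channel *ch = rdma_create_event_channel();
>         struct rdma_cm_id *id;
>         struct rdma_cm_event *ev;
>         struct sockaddr_in dst = { 0 };
>
>         if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
>             return -1;
>
>         dst.sin_family = AF_INET;
>         dst.sin_port = htons(dst_port);   /* port space + port number   */
>         inet_pton(AF_INET, dst_ip, &dst.sin_addr);  /* -> Service-ID    */
>
>         if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000))
>             return -1;
>         rdma_get_cm_event(ch, &ev);       /* RDMA_CM_EVENT_ADDR_RESOLVED */
>         rdma_ack_cm_event(ev);
>
>         /* The CMA issues the PR query during route resolution; the
>          * SL/MTU/RATE it gets back end up in the QP address vector.   */
>         if (rdma_resolve_route(id, 2000))
>             return -1;
>         rdma_get_cm_event(ch, &ev);       /* RDMA_CM_EVENT_ROUTE_RESOLVED */
>         rdma_ack_cm_event(ev);
>         return 0;
>     }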
>
> 6. SDP
> -------
>
> SDP uses the CMA for building its connections.
> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex
> digits holding the remote TCP/IP port number to connect to.
> SDP might be provided with the SO_PRIORITY socket option. In that case
> the value provided should be passed to the CMA as the TClass option of
> that connection.
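> For illustration, a one-line helper composing the SDP Service-ID layout
> quoted above (the helper name is hypothetical):
>
>     #include <stdint.h>
>
>     static inline uint64_t sdp_service_id(uint16_t tcp_port)
>     {
>         /* 0x000000000001PPPP: the 0x1 nibble sits in bits 16..19 and
>          * the remote TCP port fills the low 16 bits */
>         return 0x0000000000010000ULL | tcp_port;
>     }
>
>     /* e.g. TCP port 5000 (0x1388) -> Service-ID 0x0000000000011388 */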
>
> 7. SRP
> -------
>
> The current SRP implementation uses its own CM callbacks (not the CMA),
> so SRP should fill in the Service-ID in the PR/MPR by itself and use
> that information when setting up the QP. The T10 SRP standard leaves the
> SRP Service-ID to be defined by the SRP target I/O Controller (it should
> also comply with the IBTA Service-ID rules). In any case, the Service-ID
> is reported by the I/O Controller in the ServiceEntries DM attribute and
> should be used in the PR/MPR if the SA reports its ability to handle QoS
> PR/MPRs.
>
> 8. iSER
> --------
>
> iSER uses the CMA and thus should be very close to SDP. The Service-ID
> for iSER is TBD.
>
>
> 9. OpenSM features
> -------------------
> The QoS-related functionality to be provided by OpenSM can be split into
> two main parts:
>
> 9.1. Fabric Setup
> During fabric initialization the SM should parse the policy and apply
> its settings to the discovered fabric elements. The following actions
> should be performed:
> * Parsing of the policy
> * Node group identification. A warning should be provided for each node
>   that is found but not specified.
> * SL2VL settings should be validated:
>   + A warning will be provided if there are no matching targets for an
>     SL2VL setting statement.
>   + An error message will be printed to the log file if an invalid
>     setting is found. A setting is invalid if it refers to:
>     - Non-existing port numbers of the target devices
>     - VLs unsupported by the target device. In the latter case the map
>       to non-existing VLs should be replaced with VL15, i.e. packets
>       will be dropped.
> * SL2VL settings will then be applied
> * VL Arbitration table settings should be validated according to the
>   following rules:
>   + A warning will be provided if there are no matching targets for the
>     setting statement
>   + An error will be provided if the port number exceeds the number of
>     ports of the target device
>   + An error will be generated if the table length exceeds the device
>     capabilities
>   + A warning will be generated if the table quotes a VL that is not
>     supported by the target device
> * VL Arbitration tables will be set on the appropriate targets
>
> 9.2. PR/MPR query handling:
> OpenSM should be able to enforce the provided policy on client requests.
> The overall flow for such requests is: first the request is matched
> against the defined match rules so that the target QoS-Level definition
> is found. Given the QoS-Level, a path (or paths) search is performed
> with the restrictions imposed by that level. The following two sections
> describe these steps.
>
> How the Service-ID is carried in the PathRecord and MultiPathRecord
> attributes is now standardized by the IBTA.
>
> 9.2.1. Matching rule search:
> A rule "matches" a PR/MPR request using the following criteria:
> * Matching rules provide values as a list of single values or ranges of
>   values. A PR/MPR field "matches" the rule field if it is explicitly
>   noted in the list of values or is covered by one of the ranges
>   included in the field's value list.
> * Only PR/MPR fields that have their component mask bit set should be
>   compared.
> * For a rule to "match" a PR/MPR request, all the rule's fields should
>   "match" their PR/MPR fields. Thus a PR/MPR request that does not have
>   the component mask bit set for one of the fields defined by the rule
>   cannot match that rule.
> * A PR/MPR request that has a component mask bit set for a field that is
>   not defined by the rule can still match the rule.
>
> The algorithm used for searching for a rule match might be as simple as
> a sequential search through all rules, or enhanced for better
> performance. The semantics of every rule field and its matching PR/MPR
> field are described below:
> * Source: the SGID or SLID should be part of this group
> * Destination: the DGID or DLID should be part of this group
> * Service-ID: check whether the requested Service-ID (available in the
>   PR/MPR old SM-Key field) matches any of this rule's Service-IDs
> * TClass: check whether the PR/MPR TClass field matches
>
> 9.2.2. PR/MPR response generation:
> The QoS-Level pointed to by the first rule that matches the PR/MPR
> request should be used for obtaining the response SL, MTU-Limit,
> RATE-Limit, Path-Bits and QoS-Class. A default QoS-Level should be used
> if no rule matches the query.
>
> An efficient algorithm for finding paths that meet the QoS-Level
> criteria is beyond the scope of this RFC and is left for the implementer
> to provide. However, the criteria by which paths match the QoS-Level are
> described below:
>
> * SL: The paths found should all use the given SL. To that end, the
>   PR/MPR algorithm should traverse the path from source to destination
>   only through ports that carry a valid VL (not VL15) according to the
>   SL2VL map (it should consider the input and output ports and the SL).
> * MTU-Limit: The resulting path's MTU should not exceed the given
>   MTU-Limit
> * Rate-Limit: The resulting path's RATE should not exceed the given
>   RATE-Limit (the rate limit is given in units of link BW = Width*Speed
>   according to IBTA Specification Vol-1 table-205 p-901 l-24).
> * Path-Bits: define the lowest bits of the target LID (the number of
>   bits is defined by the target port's PortInfo.LMC field). The path
>   should traverse the LFT using the target port LID with the path-bits
>   set (see the sketch after this list).
> * QoS-Class: should be returned in the resulting PR/MPR. When routing is
>   supported by OpenSM, this field might also be used for selecting the
>   target router, in a TBD way.
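> As a small illustration of the Path-Bits point above (hypothetical
> helper; standard IBTA LMC semantics assumed: a port answers to 2^LMC
> consecutive LIDs whose low LMC bits vary):
>
>     #include <stdint.h>
>
>     /* DLID to look up in the LFT for a destination with base LID
>      * 'base_lid', LMC 'lmc' and the QoS-Level's 'path_bits'. */
>     static inline uint16_t qos_dlid(uint16_t base_lid, uint8_t lmc,
>                                     uint16_t path_bits)
>     {
>         uint16_t mask = (uint16_t)((1u << lmc) - 1);
>
>         return (uint16_t)((base_lid & ~mask) | (path_bits & mask));
>     }
>
>     /* e.g. base LID 0x40, LMC 3, path-bits 2  ->  DLID 0x42 */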
