Hi Yevgeny,

On 8/2/07, Yevgeny Kliteynik <[EMAIL PROTECTED]> wrote:
> Hi Hal,
>
> Hal Rosenstock wrote:
> > Hi Yevgeny,
> >
> > On 7/21/07, Yevgeny Kliteynik <[EMAIL PROTECTED]> wrote:
> >> Hi All
> >>
> >> Please find the attached RFC describing how QoS policy support could be
> >> implemented in the OpenFabrics stack.
> >> Your comments are welcome.
> >
> > A couple of quick questions:
> >
> > How does this differ from the original RFC posted 5/30/06 ?
> >
> > What I can see is the following:
> > 1. Updated for not yet released IBTA QoS Annex
> > 2. Use of plain text rather than XML based policy file for OpenSM
> > Anything else ?
>
> You're absolutely right - these are the only changes (plus cosmetics here and
> there).
>
> > Below, IPoIB is discussed in terms of UD. What about IPoIB-CM ? It
> > uses CM and has a service ID.
> > Will IPoIB CM be added to the RFC document or is it the same as UD ?
> >
> > Also, have my specific comments to the patches originally submitted
> > been addressed ? (Do I need to dig them out again ?) Just wondering...
>
> Yes. The submitted patches were only the QoS policy file parser.
> In the new parser I took care of all the issues we've discussed
> a couple of months ago. Here is the summary of these issues
> taken from our discussion:
>
> [snip]
> XML syntax or not -> Plain text, human readable and easily editable
>
> QoS syntax explanation/discussion
>
>   changes to some keywords
>     Port r.t. Node Groups -> DONE: fixed keywords
>     CA r.t. HCA -> DONE: fixed keywords
>     QoSClass r.t. TClass -> DONE: fixed keywords
>   syntax discussion points
>     larger ones:
>       dynamic service IDs -> supported through list and range support
>       service ID range support -> DONE: added to the matching rules
>         examples
>       port groups shared with partition configuration (future) ->
>         agree, it would be a good idea to share port groups with
>         partition configuration, but it won't be for OFED 1.3
>       multicast -> not planned for OFED 1.3, but we'll discuss it later
>     smaller ones:
>       across syntax explanation -> DONE: see the explanation and an
>         example in the policy file
>       What is sn in the syntax short for ? -> it was for "serial
>         number", replaced by "qos-level-sn". It means the serial
>         (sequential) number of the qos-level that should be applied
>         to PathRecords that match this qos-match-rule. I probably
>         will change the "qos-level-sn" to "qos-level-name" to refer
>         to the QoS level by name rather than by sn.
>       path bits explanation -> Path bits are part of QoS level.
>         They can be used to "differentiate" paths through the subnet
>         to a port when LMC>0. It won't be implemented yet for
>         OFED 1.3, and OpenSM should issue a warning if it finds
>         PathBits in QoS level definition in the policy file
>       packet lifetime ?
>         -> DONE: added packet-life keyword
>
> viewer/editor -> Since we've switched to plain text, this one
> becomes irrelevant

Thanks.

-- Hal

> [/snip]
>
> -- Yevgeny
>
> > Thanks.
> >
> > -- Hal
> >
> >> -- Yevgeny
> >>
> >> RFC: OpenFabrics Enhancements for QoS Support
> >> ===============================================
> >>
> >> Authors:  Eitan Zahavi <eitan at mellanox.co.il>
> >> Authors:  Yevgeny Kliteynik <kliteyn at mellanox.co.il>
> >> Date:     Jul 2007
> >> Revision: 0.2
> >>
> >> Table of contents:
> >> 1. Overview
> >> 2. Architecture
> >> 3. Supported Policy
> >> 4. CMA functionality
> >> 5. IPoIB functionality
> >> 6. SDP functionality
> >> 7. SRP functionality
> >> 8. iSER functionality
> >> 9. OpenSM functionality
> >>
> >> 1. Overview
> >> ------------
> >> Quality of Service requirements stem from the realization of I/O
> >> consolidation over IB network: As multiple applications and ULPs share the
> >> same fabric, means to control their use of the network resources are
> >> becoming a must. The basic need is to differentiate the service levels
> >> provided to different traffic flows, such that a policy could be enforced
> >> and control each flow utilization of the fabric resources.
> >>
> >> IBTA specification defined several hardware features and management
> >> interfaces to support QoS:
> >> * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
> >> * Arbitration between traffic of different VLs is performed by a 2 priority
> >>   levels weighted round robin arbiter.
> >>   The arbiter is programmable with a sequence of (VL, weight) pairs and
> >>   maximal number of high priority credits to be processed before low
> >>   priority is served
> >> * Packets carry class of service marking in the range 0 to 15 in their
> >>   header SL field
> >> * Each switch can map the incoming packet by its SL to a particular output
> >>   VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
> >> * The Subnet Administrator controls each communication flow parameters
> >>   by providing them as a response to Path Record (PR) or MultiPathRecord
> >>   (MPR) queries
> >>
> >> The IB QoS features provide the means to implement a DiffServ like
> >> architecture. DiffServ architecture (IETF RFC 2474 and RFC 2475) is widely
> >> used today in highly dynamic fabrics.
> >>
> >> This proposal provides the detailed functional definition for the various
> >> software elements that are required to enable a DiffServ like architecture
> >> over the OpenFabrics software stack.
> >>
> >>
> >> 2. Architecture
> >> ----------------
> >> This proposal splits the QoS functionality between the SM/SA, CMA and the
> >> various ULPs. We take the "chronology approach" to describe how the overall
> >> system works:
> >>
> >> 2.1. The network manager (human) provides a set of rules (policy) that
> >> defines how the network is being configured and how its resources are split
> >> to different QoS-Levels. The policy also defines how to decide which
> >> QoS-Level each application or ULP or service use.
> >>
> >> 2.2. The SM analyzes the provided policy to see if it is realizable and
> >> performs the necessary fabric setup. The SM may continuously monitor the
> >> policy and adapt to changes in it. Part of this policy defines the default
> >> QoS-Level of each partition. The SA is being enhanced to match the requested
> >> Source, Destination, QoS-Class, Service-ID (and optionally SL and priority)
> >> against the policy.
> >> So clients (ULPs, programs) can obtain a policy enforced QoS. The SM is
> >> also enhanced to support setting up partitions with appropriate IPoIB
> >> broadcast group. This broadcast group carries its QoS attributes: SL, MTU
> >> and RATE.
> >>
> >> 2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available on the
> >> multicast group which forms the broadcast group of this partition.
> >>
> >> 2.4. MPI which provides non IB based connection management should be
> >> configured to run using hard coded SLs. It uses these SLs for every QP
> >> being opened.
> >>
> >> 2.5. ULPs that use CM interface (like SRP) should have their own
> >> pre-assigned Service-ID and use it while obtaining PR/MPR for establishing
> >> connections. The SA receiving the PR/MPR should match it against the policy
> >> and return the appropriate PR/MPR including SL, MTU and RATE.
> >>
> >> 2.6. ULPs and programs using CMA to establish RC connection should provide
> >> the CMA the target IP and Service-ID. Some of the ULPs might also provide
> >> QoS-Class (e.g. for SDP sockets that are provided the TOS socket option).
> >> The CMA should then use the provided Service-ID and optional QoS-Class and
> >> pass them in the PR/MPR request. The resulting PR/MPR should be used for
> >> configuring the connection QP.
> >>
> >> PathRecord and MultiPathRecord enhancement for QoS:
> >> As mentioned above the PathRecord and MultiPathRecord attributes should be
> >> enhanced to carry the Service-ID which is a 64bit value, which has been
> >> standardized by the IBTA. A new field QoS-Class is also provided.
> >> A new capability bit should describe the SM QoS support in the SA class
> >> port info. This approach provides an easy migration path for existing
> >> access layer and ULPs by not introducing a new set of PR/MPR attributes.
> >>
> >>
> >> 3. Supported Policy
> >> --------------------
> >>
> >> The QoS policy supported by this proposal is divided into 4 sub sections:
> >>
> >> I) Port Group: a set of CAs, Routers or Switches that share the same
> >> settings. A port group might be a partition defined by the partition
> >> manager policy in terms of GUIDs. Future implementations might provide
> >> support for NodeDescription based definition of port groups.
> >>
> >> II) Fabric Setup:
> >> Defines how the SL2VL and VLArb tables should be setup. This policy
> >> definition assumes the computation of overall end to end network behavior
> >> should be performed outside of OpenSM.
> >>
> >> III) QoS-Levels Definition:
> >> This section defines the possible sets of parameters for QoS that a client
> >> might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
> >> Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
> >>
> >> IV) Matching Rules:
> >> A list of rules that match an incoming PR/MPR request to a QoS-Level. The
> >> rules are processed in order such that the first match is applied. Each
> >> rule is built out of a set of match expressions which should all match for
> >> the rule to apply.
> >> The matching expressions are defined for the following fields
> >> ** SRC and DST to lists of port groups
> >> ** Service-ID to a list of Service-ID or Service-ID ranges
> >> ** QoS-Class to a list of QoS-Class values or ranges
> >>
> >> QoS Policy file syntax
> >>
> >> * Empty lines are ignored
> >> * Leading and trailing blanks, as well as empty lines, are ignored, so the
> >>   indentation in the example is just for better readability
> >> * Comments are started with the pound sign (#) and terminated by EOL
> >> * Comments may appear only in a separate line
> >> * Keywords that denote section/subsection start have matching closing
> >>   keywords
> >> * Any keyword should be the first non-blank in the line
> >>
> >> QoS Policy file example
> >>
> >> # Port Groups define sets of ports to be used later in the settings
> >> port-groups
> >>     # using port GUIDs
> >>     port-group
> >>         name: Storage
> >>         # "use" is just a description that is used for logging.
> >>         # Other than that, it is just a commentary
> >>         use: our SRP storage targets
> >>         port-guid: 0x1000000000000001
> >>         port-guid: 0x1000000000000002
> >>     end-port-group
> >>
> >>     port-group
> >>         name: Virtual Servers
> >>         use: node desc and IB port num
> >>         # The syntax of the port name is as follows: "hostname/CA-num/Pnum".
> >>         # "hostname" and "CA-num" are compared to the first 2 words of
> >>         # NodeDescription, and "Pnum" is a port number on that node.
> >>         port-name: vs1/HCA-1/P1
> >>         port-name: vs3/HCA-1/P1
> >>         port-name: vs3/HCA-2/P2
> >>     end-port-group
> >>
> >>     # using partitions defined in the partition policy
> >>     port-group
> >>         name: Group for Partition 1
> >>         use: default settings
> >>         partition: Part1
> >>     end-port-group
> >>
> >>     # using node types CA|ROUTER|SWITCH
> >>     port-group
> >>         name: Routers
> >>         use: all routers
> >>         node-type: ROUTER
> >>     end-port-group
> >>
> >> end-port-groups
> >>
> >> qos-setup
> >>
> >>     # define all types of VLArb tables.
> >>     # The length of the tables should match the physically supported
> >>     # tables by their target ports
> >>     vlarb-tables
> >>         # scope defines the exact ports the VLArb tables apply to
> >>         vlarb-scope
> >>             # defining VLArb tables on all the ports that belong to
> >>             # port group 'Storage', and on all the ports connected
> >>             # to ports of port group 'Storage'
> >>             group: Storage
> >>             # "across" means all the ports that are connected to ports
> >>             # that belong to the specified port group
> >>             across: Storage
> >>             # VLArb table holds VL and weight pairs
> >>             vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
> >>             vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
> >>             vl-high-limit: 10
> >>         end-vlarb-scope
> >>         # There can be several scopes
> >>     end-vlarb-tables
> >>
> >>     sl2vl-tables
> >>         # Scope defines the exact devices and in/out ports tables apply to.
> >>         # Note: if the same port is matching several rules the *FIRST* one
> >>         # applies.
> >>         sl2vl-scope
> >>             # SL2VL tables are organized as SL2VL(in-port,out-port)
> >>             # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
> >>             # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
> >>             #
> >>             # The following example specifies that all the SL2VL tables
> >>             # entries should be defined for all the ports of group Part1:
> >>             group: Part1
> >>             from: *
> >>             to: *
> >>             # SL2VL table has to have 16 values at max - one for each SL.
> >>             # If the user specifies less than 16 values, all the missing
> >>             # VL values will be implicitly set to 0
> >>             sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> >>         end-sl2vl-scope
> >>
> >>         sl2vl-scope
> >>             # "across-to" is a combination of "across" keyword (definition
> >>             # can be found in VLArb tables section) and "to" keyword.
> >>             # "across: PortGroupName" refers to all the ports that are
> >>             # connected to ports that belong to PortGroupName.
> >>             #
> >>             # Example of "across-to" usage:
> >>             #    A user has a set of 'special' nodes (e.g.
> >>             #    storage nodes), and all the traffic to these nodes has to
> >>             #    get specific VL.
> >>             #    The solution is to define port group (e.g. "Storage") that
> >>             #    will include all the ports of these nodes, and then to
> >>             #    configure SL2VL tables on all the switch ports that are
> >>             #    connected to the Storage port group by specifying
> >>             #    "across-to: Storage".
> >>             #
> >>             across-to: Storage2
> >>             # Similar to "across-to", "across-from" is a combination of
> >>             # "across" and "from" keywords
> >>             across-from: Storage1
> >>             sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
> >>         end-sl2vl-scope
> >>     end-sl2vl-tables
> >>
> >> end-qos-setup
> >>
> >>
> >> qos-levels
> >>
> >>     # the first one is just setting SL
> >>     qos-level
> >>         use: for the lowest priority communication
> >>         sl: 15
> >>         packet-life: 16
> >>     end-qos-level
> >>     # the second sets SL and QoS Class
> >>     qos-level
> >>         use: low latency best bandwidth
> >>         sl: 0
> >>     end-qos-level
> >>     # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
> >>     qos-level
> >>         use: just an example
> >>         sl: 0
> >>         mtu-limit: 1
> >>         rate-limit: 1
> >>         packet-life: 12
> >>         # Path Bits can be used e.g.
> >>         # to provide different routes through the subnet to a particular
> >>         # port
> >>         path-bits: 2,4,8-32
> >>     end-qos-level
> >>
> >> end-qos-levels
> >>
> >>
> >> # Match rules are scanned in a first-fit manner (like firewall rules table)
> >> qos-match-rules
> >>
> >>     # matching by single criteria: class (list of values and ranges)
> >>     qos-match-rule
> >>         # just a description
> >>         use: low latency by class 7-9 or 11
> >>         qos-class: 7-9,11
> >>         # number of qos-level to apply to the matching PR/MPR
> >>         qos-level-sn: 1
> >>     end-qos-match-rule
> >>     # show matching by destination group AND service-ids
> >>     qos-match-rule
> >>         use: Storage targets connection
> >>         destination: Storage
> >>         service-id: 22,4719-5000
> >>         qos-level-sn: 2
> >>     end-qos-match-rule
> >>     # show matching by source group only
> >>     qos-match-rule
> >>         use: bla bla
> >>         source: Storage
> >>         qos-level-sn: 3
> >>     end-qos-match-rule
> >>
> >> end-qos-match-rules
> >>
> >>
> >> 4. IPoIB
> >> ---------
> >>
> >> IPoIB already queries the SA for its broadcast group information. The
> >> additional functionality required is for IPoIB to provide the broadcast
> >> group SL, MTU, and RATE in every following PathRecord query performed when
> >> a new UDAV is needed by IPoIB.
> >> We could assign a special Service-ID for IPoIB use but since all
> >> communication on the same IPoIB interface shares the same QoS-Level without
> >> the ability to differentiate it by target service we can ignore it for
> >> simplicity.
> >>
> >> 5. CMA features
> >> ----------------
> >>
> >> The CMA interface supports Service-ID through the notion of port space as a
> >> prefix to the port_num which is part of the sockaddr provided to
> >> rdma_resolve_addr(). What is missing is the explicit request for a
> >> QoS-Class that should allow the ULP (like SDP) to propagate a specific
> >> request for a class of service.
> >> A mechanism for providing the QoS-Class is available in the IPv6 address,
> >> so we could use that address field. Another option is to implement a
> >> special connection options API for CMA.
> >>
> >> Missing functionality by CMA is the usage of the provided QoS-Class and
> >> Service-ID in the sent PR/MPR. When a response is obtained it is an
> >> existing requirement for the CMA to use the PR/MPR from the response in
> >> setting up the QP address vector.
> >>
> >>
> >> 6. SDP
> >> -------
> >>
> >> SDP uses CMA for building its connections.
> >> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
> >> holding the remote TCP/IP Port Number to connect to.
> >> SDP might be provided with SO_PRIORITY socket option. In that case the
> >> value provided should be sent to the CMA as the TClass option of that
> >> connection.
> >>
> >> 7. SRP
> >> -------
> >>
> >> Current SRP implementation uses its own CM callbacks (not CMA). So SRP
> >> should fill in the Service-ID in the PR/MPR by itself and use that
> >> information in setting up the QP. The T10 SRP standard defines the SRP
> >> Service-ID to be defined by the SRP target I/O Controller (but they should
> >> also comply with IBTA Service-ID rules). Anyway, the Service-ID is reported
> >> by the I/O Controller in the ServiceEntries DM attribute and should be used
> >> in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
> >>
> >> 8. iSER
> >> --------
> >> iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
> >> is TBD.
> >>
> >>
> >> 9. OpenSM features
> >> -------------------
> >> The QoS related functionality to be provided by OpenSM can be split into
> >> two main parts:
> >>
> >> 3.1. Fabric Setup
> >> During fabric initialization the SM should parse the policy and apply its
> >> settings to the discovered fabric elements.
> >> The following actions should be performed:
> >> * Parsing of policy
> >> * Node Group identification. Warning should be provided for each node not
> >>   specified but found.
> >> * SL2VL settings validation should be checked:
> >>   + A warning will be provided if there are no matching targets for the
> >>     SL2VL setting statement.
> >>   + An error message will be printed to the log file if an invalid setting
> >>     is found. A setting is invalid if it refers to:
> >>     - Non existing port numbers of the target devices
> >>     - Unsupported VLs for the target device. In the latter case the map to
> >>       non existing VLs should be replaced with VL15 i.e. packets will be
> >>       dropped.
> >> * SL2VL setting is to be performed
> >> * VL Arbitration table settings should be validated according to the
> >>   following rules:
> >>   + A warning will be provided if there are no matching targets for the
> >>     setting statement
> >>   + An error will be provided if the port number exceeds the target ports
> >>   + An error will be generated if the table length exceeds device
> >>     capabilities
> >>   + A warning will be generated if the table quotes a VL that is not
> >>     supported by the target device
> >> * VL Arbitration tables will be set on the appropriate targets
> >>
> >> 3.2. PR/MPR query handling:
> >> OpenSM should be able to enforce the provided policy on client request.
> >> The overall flow for such requests is: first the request is matched
> >> against the defined match rules such that the target QoS-Level definition
> >> is found. Given the QoS-Level a path(s) search is performed with the given
> >> restrictions imposed by that level. The following two sections describe
> >> these steps.
> >>
> >> How Service-ID is carried in the PathRecord and MultiPathRecord attributes
> >> is now standardized by the IBTA.
> >>
> >>
> >> 3.2.1. Matching rule search:
> >> A rule is "matching" a PR/MPR request using the following criteria:
> >> * Matching rules provide values in a list of either single value, or range
> >>   of values. A PR/MPR field is "matching" the rule field if it is explicitly
> >>   noted in the list of values or is one of the values covered by a range
> >>   included in the field values list.
> >> * Only PR/MPR fields that have their component mask bit set should be
> >>   compared.
> >> * For a rule to be "matching" a PR/MPR request all the rule fields should
> >>   be "matching" their PR/MPR fields. Such that a PR/MPR request that does
> >>   not have a component mask field set for one of the rule defined fields
> >>   can not match that rule.
> >> * A PR/MPR request that has a component mask bit set for one of the fields
> >>   that is not defined by the rule can match the rule.
> >>
> >> The algorithm to be used for searching for a rule match might be as simple
> >> as a sequential search through all rules or enhanced for better
> >> performance. The semantics of every rule field and its matching PR/MPR
> >> field are described below:
> >> * Source: the SGID or SLID should be part of this group
> >> * Destination: the DGID or DLID should be part of this group
> >> * Service-ID: check if the requested Service-ID (available in the PR/MPR
> >>   old SM-Key field) is matching any of this rule's Service-IDs
> >> * TClass: check if the PR/MPR TClass field is matching
> >>
> >> 3.2.2 PR/MPR response generation:
> >> The QoS-Level pointed by the first rule that matches the PR/MPR request
> >> should be used for obtaining the response SL, MTU-Limit, RATE-Limit,
> >> Path-Bits and QoS-Class. A default QoS-Level should be used if no rule is
> >> matching the query.
> >>
> >> The efficient algorithm for finding paths that meet the QoS-Level criteria
> >> is beyond the scope of this RFC and left for the implementer to provide.
> >> However the criteria by which the paths match the QoS-Level are described
> >> below:
> >>
> >> * SL: The paths found should all use the given SL. To that end, the PR/MPR
> >>   algorithm should traverse the path from source to destination only
> >>   through ports that carry a valid VL (not VL15) by the SL2VL map (should
> >>   consider input and output ports and SL).
> >> * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit
> >> * Rate-Limit: The resulting paths RATE should not exceed the given
> >>   RATE-Limit (rate limit is given in units of link BW = Width*Speed
> >>   according to IBTA Specification Vol-1 table-205 p-901 l-24).
> >> * Path-Bits: define the target LID lowest bits (number of bits defined by
> >>   the target port PortInfo.LMC field). The path should traverse the LFT
> >>   using the target port LID with the path-bits set.
> >> * QoS-Class: should be returned in the result PR/MPR. When routing is
> >>   going to be supported by OpenSM we might use this field in selecting the
> >>   target router too in a TBD way.
> >>
> >> _______________________________________________
> >> general mailing list
> >> [email protected]
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit
> >> http://openib.org/mailman/listinfo/openib-general
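
For reference, the matching semantics of section 3.2.1 (first-fit search,
component mask handling, lists of values and ranges) can be sketched roughly
as below. This is only an illustration in Python and not OpenSM code. The
names parse_ranges, in_ranges, in_group, MatchRule and find_qos_level are
hypothetical, a request is modelled as a dict that holds only the fields whose
component mask bit is set, source and destination are assumed to be already
resolved to port GUIDs, and the default level 0 is an arbitrary placeholder.
The rules and GUIDs are the ones from the policy file example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

Predicate = Callable[[int], bool]


def parse_ranges(text: str) -> List[Tuple[int, int]]:
    """Parse a value list such as "22,4719-5000" into (low, high) pairs."""
    ranges = []
    for item in text.split(","):
        low, sep, high = item.strip().partition("-")
        # A single value N is stored as the one-element range (N, N).
        ranges.append((int(low, 0), int(high, 0) if sep else int(low, 0)))
    return ranges


def in_ranges(text: str) -> Predicate:
    """Predicate: value is listed explicitly or covered by one of the ranges."""
    ranges = parse_ranges(text)
    return lambda value: any(low <= value <= high for low, high in ranges)


def in_group(port_guids: Iterable[int]) -> Predicate:
    """Predicate: port GUID is a member of the given port group."""
    members = frozenset(port_guids)
    return lambda value: value in members


@dataclass
class MatchRule:
    level: int                                  # qos-level-sn of the rule
    fields: Dict[str, Predicate] = field(default_factory=dict)


def rule_matches(rule: MatchRule, request: Dict[str, int]) -> bool:
    """Every field defined by the rule must be set in the request and match.

    Fields set in the request but not defined by the rule are ignored.
    """
    for name, predicate in rule.fields.items():
        if name not in request:                 # component mask bit not set
            return False
        if not predicate(request[name]):
            return False
    return True


def find_qos_level(rules: List[MatchRule], request: Dict[str, int],
                   default_level: int = 0) -> int:
    """Sequential first-fit search; fall back to the default QoS level."""
    for rule in rules:
        if rule_matches(rule, request):
            return rule.level
    return default_level


# Port group and rules taken from the policy file example above.
STORAGE = [0x1000000000000001, 0x1000000000000002]
RULES = [
    MatchRule(1, {"qos-class": in_ranges("7-9,11")}),
    MatchRule(2, {"destination": in_group(STORAGE),
                  "service-id": in_ranges("22,4719-5000")}),
    MatchRule(3, {"source": in_group(STORAGE)}),
]

if __name__ == "__main__":
    print(find_qos_level(RULES, {"qos-class": 8}))
    print(find_qos_level(RULES, {"destination": STORAGE[0], "service-id": 4800}))
```

A request that sets only the destination does not match the second rule in
this sketch, because that rule also defines service-id and the request has no
component mask bit set for it, so such a request falls through to the default
level.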
