Template Version: @(#)sac_nextcase %I% %G% SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         More HCA Capabilities
    1.2. Name of Document Author/Supplier:
         Author: Bill Taylor
    1.3  Date of This Document:
         29 January, 2009

4. Technical Description

More HCA Capabilities
---------------------

4.1 Background

This case adds InfiniBand (IB) Host Channel Adapter (HCA) options for new
features to the interfaces of the InfiniBand Transport Framework (IBTF,
PSARC/2002/132 and follow-on cases).

The items in this proposal are: Work Request Registration, Receive Side
Scaling, Multiple Completion Handlers and SRQ with UD flag.

It is expected that all ULPs doing memory registration for the purpose of
RDMA, e.g. NFS-RDMA (PSARC/2007/347), SRP (SHARC/2002/458) and iSER
(PSARC/2008/395), will eventually switch to using Work Request
Registration. Receive Side Scaling together with Multiple Completion
Handlers is intended to boost IPonIB performance. SRQ with UD flag would
be used by IPonIB and possibly other UD OS bypass or management clients.

4.2 Proposal

The proposal is to make additions to the IBTF Channel Interface (CI) for
HCA drivers and the Transport Interface (TI) for IB ULPs. All interface
additions and changes in this proposal have a micro/patch binding.

Transport Interface (ON Consolidation Private)

    ibt_alloc_lkey():         new Allocate L_Key operation
    ibt_alloc_qp_range():     new Allocate range of QPs operation
    ibt_map_mem_area():       revised to support Registration WR
    ibt_modify_cq():          revise for multiple completion handlers
    ibt_query_cq():           revise for multiple completion handlers
    ibt_attr_flags_t:         WR Registration flag, RSS flags
    ibt_cep_modify_flags_t:   RSS flag
    ibt_chan_alloc_flags_t:   RSS flag
    ibt_hca_attr_t:           max PBL length, max RSS QP set size,
                              max CQ handlers
    ibt_hca_flags(2)_t:       WR Registration, RSS Algorithms,
                              UD with SRQ flags
    ibt_rss_attr_t:           RSS context state
    ibt_wc_t:                 new R_Key field, new RSS fields & flags
    ibt_wr_flags_t:           remote R_Key invalidation flag
    ibt_wr_li_t:              Local Invalidation WR
    ibt_wr_rc_t:              new WR choices for RC
    ibt_wr_reg_pmr_t:         Registration WR
    ibt_wrc_opcode_t:         Registration, Local Invalidate opcodes

Channel Interface (ON Consolidation Private)

    ibc_operations_t:         ibc_alloc_lkey, ibc_alloc_qp_range,
                              ibc_map_mem_area (revised) entry points
    ibt_attr_flags_t:         WR Registration flag, RSS flags
    ibt_cep_modify_flags_t:   RSS flag
    ibt_hca_attr_t:           max PBL length, max RSS QP set size,
                              max CQ handlers
    ibt_hca_flags(2)_t:       WR Registration, RSS Algorithms,
                              UD with SRQ flags
    ibt_qp_alloc_flags_t:     RSS flag
    ibt_rss_attr_t:           RSS context state
    ibt_wc_t:                 new R_Key field, new RSS fields & flags
    ibt_wr_flags_t:           remote R_Key invalidation flag
    ibt_wr_li_t:              Local Invalidation WR
    ibt_wr_rc_t:              new WR choices for RC
    ibt_wr_reg_pmr_t:         Registration WR
    ibt_wrc_opcode_t:         Registration, Local Invalidate opcodes

All these changes are part of the v3 IBTF ABI first introduced in
PSARC/2008/630.

Copies of all modified/added man pages are in the materials directory
(see section 4.3 below).

A. Work Request Registration

In IB, "memory registration" adds information about memory buffers to the
HCA state.
In the 1.2 IB spec, a new form of registration was added, where the
operation could be carried out through work requests (WRs) on Queue Pairs
(QPs), instead of using synchronous function calls.

The main idea is that performance of registration could be improved by
changing it into a WR, allowing asynchronous pipelined processing through
the QP. In this style of memory registration, the process is effectively
(a) reserve space in the HCA memory tables, (b) map memory and put the
mapping information into WRs and (c) post the WRs on the QP. Subsequent
WRs can safely refer to the registered memory because of IB rules on QP
processing order.

Further, the 1.2 spec adds WRs to "invalidate" memory regions, which
effectively disables access to those memory regions; so there is now also
a way to shut off access locally through QP operations. This operation can
also be done from the remote side through a new option to the existing
"Send" operation (if permission checks are passed).

So the full interface addition includes not only flags and attributes
describing the function, but also new WRs and supporting functions for
what was described in (a) and (b) above. The latter is accomplished
through revisions to ibc_map_mem_area(9E) and ibt_map_mem_area(9F)
(originally introduced in PSARC/2005/546) to better fit with this style of
operation.

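The ordering argument in steps (a) through (c) can be illustrated with a toy model. The sketch below is not IBTF code: the names (sketch_op, sketch_wr, sketch_run_queue) and the fixed table of key slots are hypothetical, and the only property modelled is that a QP processes its WRs in posting order, so a Send posted after a Registration WR sees a valid key, while a Send posted after a Local Invalidate does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical WR kinds for this toy model; not the IBTF opcodes. */
enum sketch_op { SK_REGISTER, SK_SEND, SK_LOCAL_INVALIDATE };

struct sketch_wr {
    enum sketch_op op;
    unsigned key_slot;      /* stand-in for an L_Key/R_Key table entry */
};

#define SK_SLOTS 8          /* arbitrary table size for the sketch */

/*
 * Process WRs strictly in posting order, as a QP would.
 * Returns the index of the first Send that referenced a slot that was
 * not registered at that point, or -1 if every Send was safe.
 */
static int
sketch_run_queue(const struct sketch_wr *wrs, size_t n)
{
    bool valid[SK_SLOTS] = { false };

    for (size_t i = 0; i < n; i++) {
        assert(wrs[i].key_slot < SK_SLOTS);
        switch (wrs[i].op) {
        case SK_REGISTER:
            valid[wrs[i].key_slot] = true;
            break;
        case SK_LOCAL_INVALIDATE:
            valid[wrs[i].key_slot] = false;
            break;
        case SK_SEND:
            if (!valid[wrs[i].key_slot])
                return ((int)i);
            break;
        }
    }
    return (-1);
}
```

Posting Register, Send, Local Invalidate on the same slot runs cleanly in this model; swapping the last two makes the Send fail, which is the behavior the invalidate WR is meant to provide.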
New flag for Work Request Registration support in ibt_hca_flags2_t:

    IBT_HCA2_MEM_MGT_EXT = 1 << 10,  /* FMR-WR, send-inv, local-inv */

New field for the maximum Physical Buffer List (PBL) length in
ibt_hca_attr_t:

    uint_t  hca_max_phys_buf_list_sz;

New Allocate L_Key operation to reserve space in the HCA memory tables:

    ibt_status_t ibt_alloc_lkey(ibt_hca_hdl_t hca_hdl,          /* TI */
        ibt_pd_hdl_t pd, ibt_lkey_flags_t flags,
        uint_t phys_buf_list_sz, ibt_mr_hdl_t *mr_p,
        ibt_pmr_desc_t *mem_desc_p);

    ibt_status_t prefix_ibc_alloc_lkey(ibc_hca_hdl_t hca_hdl,   /* CI */
        ibc_pd_hdl_t pd, ibt_lkey_flags_t flags,
        uint_t phys_buf_list_sz, ibc_mr_hdl_t *mr_p,
        ibt_pmr_desc_t *mem_desc_p);

Revised operations for mapping memory (to better fit this style of
operation):

    ibt_status_t ibt_map_mem_area(ibt_hca_hdl_t hca_hdl,        /* TI */
        ibt_va_attr_t *va_attrs, uint_t paddr_list_len,
        ibt_reg_req_t *reg_req, ibt_ma_hdl_t *ma_hdl_p);

    ibc_status_t prefix_ibc_map_mem_area(ibc_hca_hdl_t hca_hdl,
        ibt_va_attr_t *va_attrs, void *ibtl_reserved,           /* CI */
        uint_t paddr_list_len, ibt_reg_req_t *reg_req,
        ibc_ma_hdl_t *ma_hdl_p);

NOTE: the matching "unmap" operations are not altered and remain as
originally defined in PSARC/2005/546.

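Before choosing this style of registration, a ULP would presumably check the capability flag and the PBL limit introduced earlier in this section. The following is a minimal sketch of such a check. It does not include the IBTF headers: sketch_hca_attr_t, its hca_flags2 field name, and both helper functions are stand-ins, and it assumes hca_max_phys_buf_list_sz counts list entries (one per page).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the value given for IBT_HCA2_MEM_MGT_EXT in this proposal. */
#define SKETCH_HCA2_MEM_MGT_EXT (1u << 10)

/* Stand-in for the two ibt_hca_attr_t members this check needs. */
typedef struct sketch_hca_attr {
    uint32_t hca_flags2;                /* assumed name for the flags2 word */
    unsigned hca_max_phys_buf_list_sz;  /* max PBL length, in entries */
} sketch_hca_attr_t;

/* Number of pages touched by a buffer that starts 'offset' bytes into a page. */
static unsigned
sketch_pages_needed(uint64_t offset, uint64_t len, uint64_t page_sz)
{
    assert(page_sz != 0 && offset < page_sz && len > 0);
    return ((unsigned)((offset + len + page_sz - 1) / page_sz));
}

/* True if the HCA advertises WR registration and the buffer fits in one PBL. */
static bool
sketch_can_use_wr_reg(const sketch_hca_attr_t *attr, uint64_t offset,
    uint64_t len, uint64_t page_sz)
{
    if ((attr->hca_flags2 & SKETCH_HCA2_MEM_MGT_EXT) == 0)
        return (false);
    return (sketch_pages_needed(offset, len, page_sz) <=
        attr->hca_max_phys_buf_list_sz);
}
```

A buffer that does not fit would either be split across several registrations or fall back to the synchronous registration calls; that policy is left to the ULP.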
New choices for Reliable Connected transport WRs (union for ibt_wr_rc_t):

    ibt_wr_reg_pmr_t  *reg_pmr;     /* WR Registration */
    ibt_wr_li_t       *li;          /* Local Invalidate */
    ibt_rkey_t        send_inval;   /* R_Key for Send w/invalidate */

New opcodes for new WR types (for ibt_wrc_opcode_t):

    #define IBT_WRC_FAST_REG_PMR      9   /* Fast Register */
    #define IBT_WRC_LOCAL_INVALIDATE  10  /* Invalidate Memory Region */

New flag for Remote Invalidate option for Send operation (in
ibt_wr_flags_t):

    #define IBT_SEND_REMOTE_INVAL  (1 << 4)

New Work Request type to register memory:

    typedef struct ibt_wr_reg_pmr_s {
        ib_vaddr_t      pmr_iova;     /* memory region virtual address */
        ib_memlen_t     pmr_len;      /* memory region length in bytes */
        ib_memlen_t     pmr_offset;   /* first byte offset in first page */
        ibt_mr_hdl_t    pmr_mr_hdl;   /* memory region handle */
        size_t          pmr_buf_sz;   /* page size */
        uint_t          pmr_num_buf;  /* list length */
        ibt_lkey_t      pmr_lkey;     /* L_Key for memory region */
        ibt_rkey_t      pmr_rkey;     /* R_Key for memory region */
        ibt_mr_flags_t  pmr_flags;    /* flags */
        uint8_t         pmr_key;      /* new low bits for L/R_Key */
    } ibt_wr_reg_pmr_t;

New Work Request type for local invalidation:

    typedef struct ibt_wr_li_s {
        ibt_mr_hdl_t  li_mr_hdl;  /* memory region handle */
        ibt_mw_hdl_t  li_mw_hdl;  /* for future mem window invalidates */
        ibt_lkey_t    li_lkey;    /* L_Key for memory region */
        ibt_rkey_t    li_rkey;    /* R_Key for memory region */
    } ibt_wr_li_t;

New R_Key field added to ibt_wc_t (completion) struct for R_Key
invalidated:

    ibt_rkey_t  wc_rkey;

Man page changes:
    ibci.9, ibti.9, ibc_alloc_lkey.9e, ibc_map_mem_area.9e,
    ibt_alloc_lkey.9f, ibt_map_mem_area.9f, ibc_operations_t.9s,
    ibt_hca_attr_t.9s, ibt_send_wr_t.9s, ibt_wc_t.9s, ibt_wr_rc_t.9s

B. Receive Side Scaling

Another stateless offload for IPonIB usage is Receive Side Scaling (RSS).
The idea here is to change the regular IPonIB QP into an RSS "context".
Messages received at the RSS context have their headers hashed and then,
based on the hash values, the messages are distributed to a set of QPs.
(If there is no hash match, the message is passed to the "default" QP.)
The set of QPs are in a consecutive range of QP numbers, starting from a
"base" QPN. Performance can be increased because messages on each of the
QPs can be processed in parallel (see also section C below).

New flags for support of RSS hash algorithms in ibt_hca_flags2_t:

    IBT_HCA2_RSS_TPL_ALG = 1 << 6,  /* RSS: Toeplitz algorithm */
    IBT_HCA2_RSS_XOR_ALG = 1 << 7,  /* RSS: XOR algorithm */

New field for max size (in log2) of an RSS QP set in ibt_hca_attr_t:

    uint8_t  hca_rss_max_log2_table;  /* max RSS log2 table size */

New flags to allocate and modify an RSS context:

    IBT_QP_USES_RSS    = (1 << 3)   /* CI: in ibt_qp_alloc_flags_t */
    IBT_ACHAN_USES_RSS = (1 << 4)   /* TI: in ibt_chan_alloc_flags_t */
    IBT_CEP_SET_RSS    = (1 << 24)  /* CI & TI in ibt_cep_modify_flags_t */

New structs to set/query the values of an RSS context:

    typedef enum ibt_rss_flags_e {
        IBT_RSS_ALG_TPL       = (1 << 0),  /* RSS: Toeplitz hash */
        IBT_RSS_ALG_XOR       = (1 << 1),  /* RSS: XOR hash */
        IBT_RSS_HASH_IPV4     = (1 << 2),  /* RSS: hash IPv4 headers */
        IBT_RSS_HASH_IPV6     = (1 << 3),  /* RSS: hash IPv6 headers */
        IBT_RSS_HASH_TCP_IPV4 = (1 << 4),  /* RSS: hash TCP/IPv4 hdrs */
        IBT_RSS_HASH_TCP_IPV6 = (1 << 5)   /* RSS: hash TCP/IPv6 hdrs */
    } ibt_rss_flags_t;

    typedef struct ibt_rss_attr_s {
        ibt_rss_flags_t  rss_flags;        /* RSS: flags */
        uint_t           rss_log2_table;   /* RSS: log2 table size */
        ib_qpn_t         rss_base_qpn;     /* RSS: base QPN for range */
        ib_qpn_t         rss_def_qpn;      /* RSS: default QPN */
        uint8_t          rss_toe_key[40];  /* RSS: Toeplitz hash key */
    } ibt_rss_attr_t;

The ibt_rss_attr_t struct appears in ibt_qp_ud_addr_t for CI QP query and
modify operations. For the TI, the struct appears in
ibt_ud_chan_alloc_args_t (for UD channel alloc), ibt_ud_chan_query_attr_t
(for query) and in ibt_ud_chan_modify_attr_t (for modify).

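To make the distribution step concrete, the sketch below shows one way a 32-bit hash could select a QP from the range described by an RSS context, together with a generic Toeplitz hash over a 40-byte key. This is an illustration only, not the HCA's algorithm: the sketch_ names are hypothetical stand-ins, the use of the low hash bits as the table index is an assumption, and so is the reading of "aligned" as "base QPN is a multiple of the table size". Which header bytes are fed to the hash, and in what order, is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t sketch_qpn_t;      /* stand-in for ib_qpn_t */

/* Stand-in for the ibt_rss_attr_t members used below. */
typedef struct sketch_rss_attr {
    unsigned     rss_log2_table;    /* log2 of the QP set size */
    sketch_qpn_t rss_base_qpn;      /* first QPN of the range */
    sketch_qpn_t rss_def_qpn;       /* default QPN when nothing matches */
} sketch_rss_attr_t;

/* Pick the target QPN; the low hash bits index the table (assumption). */
static sketch_qpn_t
sketch_rss_select_qpn(const sketch_rss_attr_t *rss, int matched, uint32_t hash)
{
    assert(rss->rss_log2_table < 32);
    uint32_t table_size = 1u << rss->rss_log2_table;

    /* Assumed meaning of "aligned": base is a multiple of the set size. */
    assert((rss->rss_base_qpn & (table_size - 1)) == 0);

    if (!matched)
        return (rss->rss_def_qpn);
    return (rss->rss_base_qpn + (hash & (table_size - 1)));
}

/*
 * Generic Toeplitz hash: for every set input bit (MSB first), XOR in the
 * 32-bit window of the key that starts at that bit position.
 */
static uint32_t
sketch_toeplitz(const uint8_t key[40], const uint8_t *data, size_t len)
{
    assert(len <= 36);  /* keeps every key bit lookup inside 40 bytes */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
        ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    uint32_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window;
            /* Slide the window one bit to the right along the key. */
            size_t next_bit = i * 8 + (size_t)(7 - b) + 32;
            uint32_t kbit = (key[next_bit / 8] >> (7 - next_bit % 8)) & 1u;
            window = (window << 1) | kbit;
        }
    }
    return (hash);
}
```

With a table of four QPs based at QPN 0x100, a matched hash of 0xdeadbeef would land on QPN 0x103 under these assumptions, while an unmatched message goes to the default QPN.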
New operation to allocate a range of UD QPs w/ consecutive aligned QP
numbers:

    ibt_status_t ibt_alloc_ud_channel(ibt_hca_hdl_t hca_hdl,    /* TI */
        ibt_chan_alloc_flags_t flags, ibt_ud_chan_alloc_args_t *args,
        ibt_channel_hdl_t *ud_chan_p, ibt_chan_sizes_t *sizes)

    ibt_status_t prefix_alloc_qp_range(ibc_hca_hdl_t hca, uint_t log2,
        ibtl_qp_hdl_t *ibtl_qp_p, ibt_qp_type_t type,           /* CI */
        ibt_qp_alloc_attr_t *attr_p, ibt_chan_sizes_t *queue_sizes_p,
        ibc_cq_hdl_t *send_cq_p, ibc_cq_hdl_t *recv_cq_p,
        ib_qpn_t *qpn_p, ibc_qp_hdl_t *qp_p);

New field in Work Completion for the RSS hash value and flags in ibt_wc_t:

    uint32_t  wc_res_hash;  /* RSS 32-bit hash value */

    #define IBT_WC_DETAIL_RSS_MATCH_MASK  (0x003F0000)  /* wc_detail flags */
    #define IBT_WC_DETAIL_RSS_TCP_IPV6    (1 << 18)
    #define IBT_WC_DETAIL_RSS_IPV6        (1 << 19)
    #define IBT_WC_DETAIL_RSS_TCP_IPV4    (1 << 20)
    #define IBT_WC_DETAIL_RSS_IPV4        (1 << 21)

Changed man pages:
    ibci.9, ibti.9, ibc_alloc_qp.9e, ibc_alloc_qp_range.9e,
    ibc_modify_qp.9e, ibt_alloc_ud_channel.9f,
    ibt_alloc_ud_channel_range.9f, ibt_modify_ud_channel.9f,
    ibt_query_ud_channel.9f, ibc_operations_t.9s, ibc_qp_info_t.9s,
    ibt_hca_attr_t.9s, ibt_rss_attr_t.9s, ibt_wc_t.9s

C. Multiple Completion Handlers

Load spreading schemes like RSS are usually coupled with a method of
binding interrupts to multiple CPUs. In the IB 1.2 spec, this concept is
represented by the "multiple completion handlers" feature. In IB terms,
each Completion Queue (CQ) can then be bound to a specified completion
handler to be used for completion notification. While the spec does not
require handlers to be distinct MSI-X vectors spread out over the
available CPUs, this is the most obvious mapping for most platforms.

In the particular case of IB-RSS, each QP in the RSS set would be bound to
a different CQ (or perhaps to two different CQs, one each for send and
receive). Each CQ in turn could be bound to any of the available
completion handlers.

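One simple binding policy for the RSS case is to spread the CQs round-robin over the handlers the HCA offers. The sketch below assumes, purely for illustration, that handler IDs are the integers 1 through n, where n is the handler count reported by the HCA, and that an ID of 0 can be used to mean "leave the assignment alone". The function name is hypothetical and the real ID space may differ.

```c
#include <assert.h>

/*
 * Round-robin choice of a completion handler for the CQ at position
 * cq_index in an RSS set. Returns 0 when the HCA reports no multiple
 * handler support, which the caller would treat as "do not change".
 * Handler IDs 1..max_cq_handlers are an assumption of this sketch.
 */
static unsigned
sketch_pick_handler(unsigned cq_index, unsigned max_cq_handlers)
{
    if (max_cq_handlers == 0)
        return (0);
    assert(max_cq_handlers < ~0u);  /* the +1 below must not wrap */
    return ((cq_index % max_cq_handlers) + 1);
}
```

A static spread like this is only a starting point; a load balancer could later move individual CQs to other handlers as traffic patterns change.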
The version of "multiple completion handlers" implemented here goes
slightly beyond the IB spec in that the assignment of CQs to completion
handlers can be adjusted dynamically for load balancing purposes.

New field in ibt_hca_attr_t to show how many completion handlers are on an
HCA:

    uint_t  hca_max_cq_handlers;  /* zero = no multiple handlers */

New field (last arg) in query CQ to show the current handler:

    ibt_status_t ibt_query_cq(ibt_cq_hdl_t ibt_cq, uint_t *entries, /* TI */
        uint_t *count_p, uint_t *usec_p, ibt_cq_handler_id_t *hid_p)

    ibt_status_t prefix_ibc_query_cq(ibc_hca_hdl_t hca, ibc_cq_hdl_t cq,
        uint_t *entries, uint_t *count_p, uint_t *usec_p,           /* CI */
        ibt_cq_handler_id_t *hid_p);

Note that this last parameter was previously reserved. A value of zero
means that multiple completion handlers are not supported. The value of
IBT_CQ_HID_DEFAULT is the default handler given to CQs when initially
allocated.

New field (last arg) to modify the CQ to completion handler assignment:

    ibt_status_t ibt_modify_cq(ibt_cq_hdl_t ibt_cq, uint_t count,   /* TI */
        uint_t usec, ibt_cq_handler_id_t hid);

    ibt_status_t prefix_ibc_modify_cq(ibc_hca_hdl_t hca, uint_t count,
        uint_t usec, ibt_cq_handler_id_t hid);                      /* CI */

Note that this last parameter was previously reserved. A value of zero
means no change. The value of IBT_CQ_HID_DEFAULT means the default
handler.

Changed man pages:
    ibc_modify_cq.9e, ibc_query_cq.9e, ibt_modify_cq.9f, ibt_query_cq.9f,
    ibt_hca_attr_t.9s.

D. SRQ with UD flag

Shared Receive Queues (SRQs) were originally introduced in PSARC/2004/611
and then in uDAPL (PSARC/2004/737). While the IB spec [1] says that when
SRQ is available it should be supported on both the Reliable Connected
(RC) and Unreliable Datagram (UD) service, we have found an adapter which
has SRQ but omits support for UD. (Note that the usage in uDAPL is with
RC.)

So we are adding a flag that lets UD applications which may want to use
SRQ determine whether SRQ is actually supported with UD. The original SRQ
flag now means only that SRQ is supported with RC.

New flags added to ibt_hca_flags_t:

    IBT_HCA_RC_SRQ = IBT_HCA_SRQ,  /* RC with SRQ */
    IBT_HCA_UD_SRQ = 1 << 19       /* UD with SRQ */

Changed man pages: ibt_hca_attr_t.9s

4.3 Summary of changes by man page

Man Page                        Disposition   Reasons for change
------------------------------------------------------------------
ibci.9                          changed       A, B
ibti.9                          changed       A, B
ibc_alloc_lkey.9e               new           A
ibc_alloc_qp.9e                 changed       B
ibc_alloc_qp_range.9e           new           B
ibc_modify_cq.9e                changed       C
ibc_modify_qp.9e                changed       B
ibc_map_mem_area.9e             changed       A
ibc_query_cq.9e                 changed       C
ibt_alloc_lkey.9f               new           A
ibt_alloc_ud_channel.9f         changed       B
ibt_alloc_ud_channel_range.9f   new           B
ibt_map_mem_area.9f             changed       A
ibt_modify_cq.9f                changed       C
ibt_modify_ud_channel.9f        changed       B
ibt_query_cq.9f                 changed       C
ibt_query_ud_channel.9f         changed       B
ibc_operations_t.9s             changed       A, B
ibc_qp_info_t.9s                changed       B
ibt_hca_attr_t.9s               changed       A, B, C, D
ibt_rss_attr_t.9s               new           B
ibt_send_wr_t.9s                changed       A
ibt_wc_t.9s                     changed       A, B
ibt_wr_rc_t.9s                  changed       A

4.4 References

[1] InfiniBand Architecture Specification Volume 1, Release 1.2.1.
    InfiniBand Trade Association, 2007.
    http://www.infinibandta.org/members/spec/V1r1_2_1.Release_12062007.zip
    (requires IBTA member login)

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open